\documentclass[12pt]{article}
\usepackage{tikz}
\usepackage[utf8]{inputenc}
\usepackage{amssymb,amsmath,amsthm,stmaryrd}
\usepackage[margin=1.25in]{geometry}
\usepackage{graphicx,ctable,booktabs}
\usepackage{tikz-cd}
\usepackage{mathpartir}
\usepackage[sort&compress,square,comma,authoryear]{natbib}
\bibliographystyle{plainnat}
\newcommand{\leftmultimap}{\mathop{\rotatebox[origin=c]{180}{$\multimap$}}}
\newcommand{\Set}{\textrm{Set}}
\newcommand{\Lang}{\textrm{Lang}}
\newcommand{\Grammar}{\textrm{Grammar}}
\newcommand{\sem}[1]{\llbracket{#1}\rrbracket}
\newcommand\rsepimp{\mathbin{-\mkern-6mu*}}
\newcommand\lsepimp{\mathbin{\rotatebox[origin=c]{180}{$-\mkern-6mu*$}}}
\newcommand\sepconj{\mathbin{*}}
\newcommand\bunch{\mathcal W}
\begin{document}
\title{Parsing with Presheaves}
\author{Max S. New}
\maketitle
\begin{abstract}
We show that formal grammars can naturally be given semantics in the
category of presheaves on strings. The monoid structure of strings
induces Day convolution structure on grammars that models the
sequencing operation of regular expressions and other forms of
formal grammar. The closed structure on this category can be used to
define the \emph{derivative} of a language, among other operations.
\end{abstract}
A \emph{language} over an alphabet $\Sigma$ is a subset of strings
$\Sigma^*$. A \emph{grammar} $g$ is a syntactic object that presents a
language. There are many formalisms for grammars: regular expressions,
context-free grammars, context-sensitive grammars, etc. The problem
of \emph{language recognition} is to take a string $w \in \Sigma^*$
and a grammar $g$ and determine if $w$ is a member of the language
defined by $g$.
When writing a compiler, or communicating over a network, we are not
just interested in the problem of language \emph{recognition},
determining whether or not a word is in a language, but in the problem
of \emph{parsing}, the attempt to produce a \emph{parse tree} from the
string. We can think of parsing as a ``constructive'' version of
the language recognition problem: the parse tree is constructive
evidence that the word is a member of the language. Another way to say
this is that parsing is a \emph{categorification} of language
recognition. A language is a subset of $\Sigma^*$, which is equivalent
to a \emph{predicate} on $\Sigma^*$:
\[ L : \Sigma^* \to \Omega \]
where $\Omega$ is the set of \emph{truth values}\footnote{Classically,
  this is equivalent to any two-element set, but constructively,
  assuming this set is boolean would require that every language be
  decidable.}\footnote{In Coq this is called \texttt{Prop}.} We can categorify
this situation by replacing the partially ordered set of truth values
$\Omega$ with the category of sets $\Set$:
\[ L : \Sigma^* \to \Set \]
That is, a language is a \emph{functor} from $\Sigma^*$ (viewed as a
discrete category) to the category of sets and functions.
Functors from $\Sigma^*$ to $\Set$ are called \emph{presheaves} over
$\Sigma^*$ and the category of such functors is extremely
well-behaved. We show in the remainder of this note that this provides
us with a rich language for constructing languages algebraically.
\begin{quote}
A formal language is a presheaf over the set of strings.
\end{quote}
Different classes of formal grammars can then be described by
expression languages that internalize parts of the rich algebraic
structure of this category.
\section{Algebraic Structure of the Category of Formal Grammars}
First, since $\Lang$ is a presheaf category, we can define the
\emph{representable} presheaves $Y w$\footnote{$Y$ stands for the
  \emph{Yoneda} embedding} where $w \in \Sigma^*$. Since $\Sigma^*$ is a
set,
\[ (Y w)w' = \{ * \mid w = w' \} \]
Most useful will be the representables $Y c$ where $c \in \Sigma$.
Next, since $\Sigma^*$ is a \emph{monoid} with respect to concatenation
of strings\footnote{in fact, the free monoid}, this induces a monoidal
structure on the category $\Lang$ called the \emph{Day convolution},
which models the \emph{sequencing} of two languages:
\[ (L \sepconj L')w = \Sigma_{w_1w_2=w} Lw_1 \times L' w_2 \]
That is, a parse of $w$ in the language $L \sepconj L'$ consists of a
choice of splitting $w$ into $w_1w_2$ together with a parse of $w_1$
by $L$ and a parse of $w_2$ by $L'$. Note that this is associative
(up to isomorphism) but not symmetric if $|\Sigma| \ge 2$.
This has a unit as well, which is the language of just the empty string:
\[ I w = \{ * \mid w = \epsilon \} \]
Since $\Sigma^*$ is a free monoid, every word $w$ is a finite sequence
of characters $c \in \Sigma$. Given $w = c_1c_2\cdots$ we can now
choose between two different representations:
\[ Y(c_1c_2\cdots) \qquad \textrm{or} \qquad Yc_1 \sepconj Yc_2 \sepconj \cdots\]
Fortunately, it is not difficult to prove that the Yoneda embedding
preserves the monoidal structure up to isomorphism, so the two
representations agree.
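As a minimal check for a two-character word: a parse of $w$ in $Yc_1
\sepconj Yc_2$ is a splitting $w = w_1w_2$ with $w_1 = c_1$ and
$w_2 = c_2$, which exists (uniquely) exactly when $w = c_1c_2$, so
\[ (Yc_1 \sepconj Yc_2)w \cong \{ * \mid w = c_1c_2 \} = Y(c_1c_2)w \]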
Next, $\Lang$ has all limits and colimits. Coproducts give us the
``union'' of languages, with initial object the empty language, and
products give us the intersection, with the terminal object being the
language that accepts all strings, but with no information in the
parse.
\[ (L_1 + L_2) w = (L_1 w) + (L_2 w) \]
\[ 0 w = \emptyset \]
\[ (L_1 \times L_2) w = L_1 w \times L_2 w\]
\[ 1 w = 1 \]
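Note that the coproduct can introduce \emph{ambiguity}: for example,
the grammar $Ya + Ya$ assigns the string $a$ two distinct parses, the
left and right injections,
\[ (Ya + Ya)a \cong 1 + 1 \]
even though the underlying language is the same as that of $Ya$.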
Finally, given an endofunctor $F : \Lang \to \Lang$ satisfying certain
assumptions, we can construct the \emph{initial algebra}
$F(\mu F) \to \mu F$, the constructive variant of the least fixed
point. The simplest condition is that $F$ preserve $\omega$-colimits,
i.e., colimits of diagrams of the form
\[ A_0 \to A_1 \to A_2 \to \cdots \]
Then the ``least fixed point'' $\mu F$ can be constructed ``iteratively'' as the colimit of
\[ F^0 0 \to F^1 0 \to F^2 0 \to \cdots \]
(TODO: Ad\'amek--Pohlov\'a citation), generalizing Kleene's
fixed-point theorem for $\omega$-cpos.
These constructions alone are enough to give a semantics to regular
and context-free grammars.
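For instance, the Kleene star arises as an initial algebra. The
endofunctor $F X = I + L \sepconj X$ preserves $\omega$-colimits
(sequencing and coproducts are each cocontinuous in every argument), so
\[ L^* = \mu X.\, I + L \sepconj X \]
exists, with $F^n 0$ parsing concatenations of fewer than $n$
$L$-parses. Similarly, a context-free grammar such as that of balanced
parentheses can be written as
$\mu X.\, I + Y\texttt{(} \sepconj X \sepconj Y\texttt{)} \sepconj X$.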
\subsection{Closed Structure}
By general properties of presheaf categories, the category of
languages has \emph{closed} structure with respect to the monoidal
structures given by sequencing $\sepconj$ and product $\times$.
First, the product has a right adjoint in each variable:
\[ (L_1 \times L_2 \to L_3) \cong (L_1 \to (L_2 \Rightarrow L_3)) \cong (L_2 \to (L_1 \Rightarrow L_3)) \]
defined by
\[ (L \Rightarrow L')w = \textrm{Hom}(Y w \times L, L') \cong Lw \to L'w \]
Informally, $(L \Rightarrow L')w$ is constructive \emph{implication}
of languages: a parse of $w$ is a function that takes $L$-parses of
$w$ to $L'$-parses of $w$.
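For example, taking $L' = 0$ gives a constructive \emph{complement}: a
parse of $w$ in $L \Rightarrow 0$ is a function $Lw \to \emptyset$,
i.e., a refutation of the claim that $w$ has an $L$-parse.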
Next, the sequencing operator $\sepconj$, since it is not symmetric,
has two distinct right multi-variable adjoints:
\[ (L_1 \sepconj L_2 \to L_3) \cong (L_1 \to L_3 \leftmultimap L_2) \cong (L_2 \to L_1 \multimap L_3) \]
defined by
\[ (L \multimap L')w = \Pi_{w_l} L w_l \to L'(w_lw) \]
\[ (L' \leftmultimap L)w = \Pi_{w_r} L w_r \to L'(ww_r) \]
A parse of $(L \multimap L')w$ is a kind of ``partial'' parse of $L'$:
given any parse $L w_l$, we can get a complete parse $L' (w_lw)$. So
if $(L \multimap L') w$ successfully parses (and $L$ is non-empty),
then we have successfully parsed a suffix of $L'$ which, when
concatenated with a prefix that can be parsed as $L$, will yield a
full parse of $L'$.
Dually, the left hom $(L' \leftmultimap L)$ is a partial parse from
the right of the string.
These left and right hom operations can be used to define the
\emph{derivative} of a language with respect to a character or
word. If we focus on left-to-right parsing this would be:
\[ d^l_c(L) = Yc \multimap L \]
Why?
\[ (Yc \multimap L)w = \Pi_{w_l} \{ * \mid c = w_l \} \to L(w_lw) \cong L(cw) \]
If we reduce to language recognition, this is a de Morgan dual of the
\emph{quotient} of a formal language:
\[ (L_1 / L_2) w \iff \exists w' \in L_2.\, L_1(ww') \iff \neg (\forall w' \in L_2.\, \neg L_1(ww')) \iff \neg ((\neg L_1 \leftmultimap L_2) w) \]
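As a sanity check, the classical Brzozowski rule for the derivative of
a sequencing holds constructively: a parse of $(L \sepconj L')(cw)$
splits $cw$ either as $\epsilon \cdot cw$ or as $cu \cdot v$ with
$uv = w$, so
\[ d^l_c(L \sepconj L') \cong (d^l_c L \sepconj L') + (\Delta(L\epsilon) \times d^l_c L') \]
where $\Delta(L\epsilon)$ denotes the constant grammar at the set
$L\epsilon$ of empty-string parses of $L$.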
Finally, for any string $w$, there is a further right adjoint:
\[ ((Yw \multimap L) \to L') \cong (L \to L'_{w}) \]
defined by
\[ (L'_{w'})w = \textrm{Hom}(Yw' \multimap Yw, L') \cong \begin{cases} L'(u) & \text{if } w = w'u \\ 1 & \text{otherwise} \end{cases} \]
\subsection{Morphisms}
In the language recognition setting, we can define the
\emph{inclusion} of languages $L \subseteq L'$. In the constructive
setting this becomes a \emph{derivation transformer} $\phi : L \to
L'$, which is simply a natural transformation. Since $\Sigma^*$ is a
set, this amounts to giving for each $w \in \Sigma^*$ a function
\[ \phi_w : Lw \to L' w \]
with no non-trivial side-conditions.
However, it is natural to define a \emph{morphism of multiple
  variables} $\phi : L_1,\ldots,L_n \to L'$ as giving for any sequence
of words $w_1,\ldots,w_n$ a function
\[ L_1w_1 \times \cdots \times L_nw_n \to L'(w_1\cdots w_n) \]
Specializing to the case $n = 0$, a morphism of zero inputs is
precisely a parse of the empty string:
\[ \phi \in (1 \to L'(\epsilon)) \cong L'(\epsilon) \]
This gives the structure of a (planar) multicategory, in which the
operations $\sepconj,I,\multimap,\leftmultimap$ can be given appropriate
universal properties.
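For example, $\sepconj$ represents binary morphisms: there is a
$2$-ary morphism $L, L' \to L \sepconj L'$, given at $(w_1, w_2)$ by
the evident inclusion
\[ Lw_1 \times L'w_2 \to \Sigma_{u_1u_2 = w_1w_2} Lu_1 \times L'u_2 \]
and every $2$-ary morphism $L, L' \to L''$ factors uniquely through it.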
A grammar $g$ is
\begin{enumerate}
\item \emph{total} if for every $w$ there is at least one parse
\item \emph{deterministic} if for every $w$ there is at most one parse
\item \emph{empty} if for every $w$ there are no parses
\end{enumerate}
\subsection{Endofunctors}
TODO
\section{Associativity, Precedence, and Ordering using Category-valued Presheaves}
The presheaf of sets model of grammars allows us to reason about not
just whether a string parses, but also about \emph{ambiguity} of a
grammar: whether there exists \emph{more than one} parse. This is
modeled by the grammar $g$ having more than one parse of the same
string $w$. We can use this as above to reason about when a grammar is
ambiguous or not, and if the grammar is unambiguous, we have a clear,
deterministic specification for the parser.
But what about parsing \emph{ambiguous} grammars? We can simply say
that a parser should produce any parse, or perhaps all of them, but in
practice many parser generators and programming languages supporting
infix/mixfix notation allow for \emph{associativity} and
\emph{precedence} declarations. These are described by providing an
ambiguous grammar and disambiguating not by rewriting the grammar, but
instead by expressing a \emph{preference} between the multiple
possible parses. For instance, one of the most common sources of
ambiguity in parsing is in associativity and precedence of binary
operators. For an example of associativity, a grammar for subtraction
might initially be written as:
\begin{mathpar}
\begin{array}{rcl}
E &::= & E - E \,|\,(E)\,|\,Number\\
\end{array}
\end{mathpar}
But then we note that there are multiple, semantically different
parses of $3 - 4 - 5$: trees semantically equivalent to $(3 - 4) - 5$
or $3 - (4 - 5)$. We can rule out the right-nested version by
\emph{factoring} the grammar:
\begin{mathpar}
\begin{array}{rcl}
E &::= & E - L \,|\,L\\
L &::= & ( E ) \,|\, Number
\end{array}
\end{mathpar}
This grammar has fewer parses than the previous: now $3 - 4 - 5$ will
only parse as a left-nested term since $4 - 5$ does not match the
``leaf'' production $L$. Rather than factoring the grammar, however,
most languages declare that $-$ is a left-associative operator. We can
think of this as expressing a
\emph{preference} for certain parses over others, specifically the tree
\[\begin{tikzcd}
&& {-} \\
& {-} && {E_3} \\
{E_1} && {E_2}
\arrow[from=1-3, to=2-4]
\arrow[from=2-2, to=3-1]
\arrow[from=2-2, to=3-3]
\arrow[from=1-3, to=2-2]
\end{tikzcd}\]
is preferred over
\[\begin{tikzcd}
& {-} \\
{E_1} && {-} \\
& {E_2} && {E_3}
\arrow[from=1-2, to=2-1]
\arrow[from=1-2, to=2-3]
\arrow[from=2-3, to=3-2]
\arrow[from=2-3, to=3-4]
\end{tikzcd}\]
We can express this as an \emph{ordering}, where the preferred tree is
``bigger''.
Fortunately, we can express this with a small change. Rather than
taking functors from $\Sigma^*$ to \emph{sets} we can take functors to
\emph{partially ordered sets}, or even more generally into
\emph{categories}. So a grammar $g$ defines for each string $w$ a
category $g w$ whose objects are parses and for any two parses
$p_1,p_2 \in g w$, we have a set $(g w)(p_1,p_2)$ of ``reasons to
prefer $p_2$ over $p_1$''. Furthermore, preference is reflexive and
transitive, with composition of reasons associative and unital, as in
any category.
Then the parsing problem can be rephrased as, given a syntactic
description of a grammar $g$ and a string $w$, what is the \emph{best}
parse in $g w$, if any? In categorical language, can we construct a
terminal object of $g w$?
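To illustrate, the two trees above are objects of the category of
parses of $3 - 4 - 5$; declaring $-$ left-associative amounts to
positing a preference morphism from the right-nested parse $p_r$ to
the left-nested parse $p_l$ and none in the other direction, so that,
writing $w$ for this string,
\[ g(w)(p_r, p_l) = \{ * \} \qquad g(w)(p_l, p_r) = \emptyset \]
and $p_l$ is terminal among these parses, i.e., the best parse.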
\section{Semantic Actions and Grammar/Parser Combinators}
Finally, we arrive at the last main component of a parser, the
so-called ``semantic actions''. While the constructive parsing
approach does produce a parse \emph{tree}, it is not usually the case
that this parse tree is \emph{the same} as the desired type of
ASTs. Instead, we need only that there is a \emph{function} from parse
trees to ASTs. In general this might fail to be injective (such as
forgetting whether parentheses were used) or surjective (if there are
internal compiler forms that do not directly correspond to source
programs).
We can formalize this by noting that there is a constant-presheaf
functor $\Delta$ from $\Set$ to $\Set^{\Sigma^*}$, sending a set $A$
to the constant presheaf $\Delta(A)_w = A$. Then we can take the comma
category $\Set^{\Sigma^*}/\Delta$. The objects of this category are
pairs of a grammar $g$ and a set $A$ with a function
\[ \prod_{w \in \Sigma^*} (g_w \to A) \]
i.e., a natural transformation $g \to \Delta(A)$.
Then the projection functor $\mathrm{cod} : \Set^{\Sigma^*}/\Delta \to
\Set$ is an opfibration, and the pushforward is the ``map'' operation
on grammars. Via the Grothendieck construction, we can equivalently
view this opfibration as a (pseudo)functor from $\Set$ to
$\mathrm{Cat}$, sending $A$ to the fiber over it.
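To make the connection to parser combinators concrete, the following
is a minimal Haskell sketch (an illustration under simplifying
assumptions, not a formal semantics): a grammar assigns to each string
its set of parses, which we approximate by a list, and the semantic
action, the pushforward along $\mathrm{cod}$, is exactly \texttt{fmap}.
\begin{verbatim}
-- A grammar sends each string to (a list approximating) its set of parses.
newtype Grammar a = Grammar { parses :: String -> [a] }

-- Semantic actions: the pushforward/"map" operation is fmap.
instance Functor Grammar where
  fmap f (Grammar g) = Grammar (map f . g)

-- The unit I: only the empty string parses, trivially.
eps :: Grammar ()
eps = Grammar (\w -> [() | null w])

-- The representable Y c.
char :: Char -> Grammar Char
char c = Grammar (\w -> [c | w == [c]])

-- Day convolution: sum over all splittings w = w1 ++ w2.
(<.>) :: Grammar a -> Grammar b -> Grammar (a, b)
Grammar g <.> Grammar g' = Grammar (\w ->
  [ (p, q) | n <- [0 .. length w]
           , let (w1, w2) = splitAt n w
           , p <- g w1, q <- g' w2 ])

-- Coproduct: union of parses.
(<+>) :: Grammar a -> Grammar a -> Grammar a
Grammar g <+> Grammar g' = Grammar (\w -> g w ++ g' w)

-- For example: parses (char 'a' <.> char 'b') "ab" == [('a','b')].
\end{verbatim}
Here \texttt{char}, \texttt{eps}, \texttt{(<.>)}, and \texttt{(<+>)}
are hypothetical names for $Yc$, $I$, $\sepconj$, and $+$; a faithful
treatment would use genuine sets of parses rather than lists.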
\section{From Grammars to Parsers}
A grammar $g$ defines for each string $w$ a category of parses and
preferences. The task of a \emph{parser} is, given a syntactic
description of a grammar and a string, to produce a \emph{best} parse,
i.e., a terminal object of the category $g w$ of parses.
\section{Future Work: Tree Languages and Grammars}
\end{document}