\documentclass[12pt]{article}
\usepackage{tikz}
\usepackage[utf8]{inputenc}
\usepackage{amssymb,amsmath,amsthm,stmaryrd}
\usepackage[margin=1.25in]{geometry}
\usepackage{graphicx,ctable,booktabs}
\usepackage{tikz-cd}
\usepackage{mathpartir}
\usepackage[sort&compress,square,comma,authoryear]{natbib}
\bibliographystyle{plainnat}
\newcommand{\leftmultimap}{\mathop{\rotatebox[origin=c]{180}{$\multimap$}}}
\newcommand{\Set}{\textrm{Set}}
\newcommand{\Lang}{\textrm{Lang}}
\newcommand{\Grammar}{\textrm{Grammar}}
\newcommand{\sem}[1]{\llbracket{#1}\rrbracket}
\newcommand\rsepimp{\mathbin{-\mkern-6mu*}}
\newcommand\lsepimp{\mathbin{\rotatebox[origin=c]{180}{$-\mkern-6mu*$}}}
\newcommand\sepconj{\mathbin{*}}
\newcommand\bunch{\mathcal W}
\begin{document}
\title{Parsing with Presheaves}
\author{Max S. New}
\maketitle
\begin{abstract}
We show that formal grammars can naturally be given semantics in the
category of presheaves on strings. The monoid structure of strings
induces Day convolution structure on grammars that models the
sequencing operation of regular expressions and other forms of
formal grammar. The closed structure on this category can be used to
define the \emph{derivative} of a language, among other operations.
\end{abstract}
A \emph{language} over an alphabet $\Sigma$ is a subset of strings
$\Sigma^*$. A \emph{grammar} $g$ is a syntactic object that presents a
language. There are many formalisms for grammars: regular expressions,
context-free grammars, context-sensitive grammars, etc. The problem
of \emph{language recognition} is to take a string $w \in \Sigma^*$
and a grammar $g$ and determine if $w$ is a member of the language
defined by $g$.
When writing a compiler, or communicating over a network, we are not
just interested in the problem of language \emph{recognition},
determining whether or not a word is in a language, but in the problem
of \emph{parsing}, the attempt to produce a \emph{parse tree} from the
string. We can think of parsing as a ``constructive'' version of
the language recognition problem: the parse tree is constructive
evidence that the word is a member of the language. Another way to say
this is that parsing is a \emph{categorification} of language
recognition. A language is a subset of $\Sigma^*$, which is equivalent
to a \emph{predicate} on $\Sigma^*$:
\[ L : \Sigma^* \to \Omega \]
where $\Omega$ is the set of \emph{truth values}\footnote{Classically,
  this is equivalent to any two-element set, but constructively,
  assuming this set is boolean would require that every language be
  decidable.}\footnote{In Coq this is called \texttt{Prop}.} We can categorify
this situation by replacing the partially ordered set of truth values
$\Omega$ with the category of sets $\Set$:
\[ L : \Sigma^* \to \Set \]
That is, a language is a \emph{functor} from $\Sigma^*$ (viewed as a
discrete category) to the category of sets and functions.
Functors from $\Sigma^*$ to $\Set$ are called \emph{presheaves} over
$\Sigma^*$ and the category of such functors is extremely
well-behaved. We show in the remainder of this note that this provides
us with a rich language for constructing languages algebraically.
\begin{quote}
A formal language is a presheaf over the set of strings.
\end{quote}
Different classes of formal grammars can then be described by
expression languages that internalize parts of the rich algebraic
structure of this category.
\section{Algebraic Structure of the Category of Formal Grammars}
First, since $\Lang$ is a presheaf category, we can define the
\emph{representable} presheaves $Y w$\footnote{$Y$ stands for the
  \emph{Yoneda} embedding} where $w \in \Sigma^*$. Since $\Sigma^*$ is a
set,
\[ (Y w)w' = \{ * \mid w = w' \} \]
Most useful will be the representables $Y c$ where $c \in \Sigma$.
Next, since $\Sigma^*$ is a \emph{monoid} with respect to concatenation
of strings\footnote{in fact, the free monoid}, this induces a monoidal
structure on the category $\Lang$ called the \emph{Day convolution},
which models the \emph{sequencing} of two languages:
\[ (L \sepconj L')w = \Sigma_{w_1w_2=w} Lw_1 \times L' w_2 \]
That is, a parse of $w$ in the language $L \sepconj L'$ consists of a
choice of splitting $w$ into $w_1w_2$ together with a parse of $w_1$
by $L$ and a parse of $w_2$ by $L'$. Note that this is associative
(up to isomorphism) but not symmetric if $|\Sigma| \ge 2$.
This has a unit as well, which is the language of just the empty string:
\[ I w = \{ * \mid w = \epsilon \} \]
Since $\Sigma^*$ is a free monoid, every word $w$ is a finite sequence
of characters $c \in \Sigma$. Given $w = c_1c_2\cdots$ we can now
choose between two different representations:
\[ Y(c_1c_2\cdots) \qquad \textrm{or} \qquad Yc_1 \sepconj Yc_2 \sepconj \cdots\]
Fortunately, it is not difficult to prove that the Yoneda embedding
preserves the monoidal structure up to isomorphism, so the two
representations agree.
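As a minimal check for a two-character word: a parse of $w$ in $Yc_1
\sepconj Yc_2$ is a splitting $w = w_1w_2$ with $w_1 = c_1$ and
$w_2 = c_2$, which exists (uniquely) exactly when $w = c_1c_2$, so
\[ (Yc_1 \sepconj Yc_2)w \cong \{ * \mid w = c_1c_2 \} = Y(c_1c_2)w \]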
Next, $\Lang$ has all limits and colimits. Coproducts give us the
``union'' of languages, with initial object the empty language, and
products give us the intersection, with the terminal object being the
language that accepts all strings, but with no information in the
parse.
\[ (L_1 + L_2) w = (L_1 w) + (L_2 w) \]
\[ 0 w = \emptyset \]
\[ (L_1 \times L_2) w = L_1 w \times L_2 w\]
\[ 1 w = 1 \]
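Note that the coproduct can introduce \emph{ambiguity}: for example,
the grammar $Ya + Ya$ assigns the string $a$ two distinct parses, the
left and right injections,
\[ (Ya + Ya)a \cong 1 + 1 \]
even though the underlying language is the same as that of $Ya$.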
Finally, given an endofunctor $F : \Lang \to \Lang$ satisfying certain
assumptions, we can construct the \emph{initial algebra}
$F(\mu F) \to \mu F$, the constructive variant of the least fixed
point. The simplest condition is that $F$ preserve $\omega$-colimits,
i.e., colimits of diagrams of the form
\[ A_0 \to A_1 \to A_2 \to \cdots \]
Then the ``least fixed point'' $\mu F$ can be constructed ``iteratively'' as the colimit of
\[ F^0 0 \to F^1 0 \to F^2 0 \to \cdots \]
(TODO: Ad\'amek--Pohlov\'a citation), generalizing Kleene's
fixed-point theorem for $\omega$-cpos.
These constructions alone are enough to give a semantics to regular
and context-free grammars.
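For instance, the Kleene star arises as an initial algebra. The
endofunctor $F X = I + L \sepconj X$ preserves $\omega$-colimits
(sequencing and coproducts are each cocontinuous in every argument), so
\[ L^* = \mu X.\, I + L \sepconj X \]
exists, with $F^n 0$ parsing concatenations of fewer than $n$
$L$-parses. Similarly, a context-free grammar such as that of balanced
parentheses can be written as
$\mu X.\, I + Y\texttt{(} \sepconj X \sepconj Y\texttt{)} \sepconj X$.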
\subsection{Closed Structure}
By general properties of presheaf categories, the category of
languages has \emph{closed} structure with respect to the monoidal
structures given by sequencing $\sepconj$ and product $\times$.
First, the product has a right adjoint in each variable:
\[ (L_1 \times L_2 \to L_3) \cong (L_1 \to (L_2 \Rightarrow L_3)) \cong (L_2 \to (L_1 \Rightarrow L_3)) \]
defined by
\[ (L \Rightarrow L')w = \textrm{Hom}(Y w \times L, L') \cong Lw \to L'w \]
Informally, $(L \Rightarrow L')w$ is constructive \emph{implication}
of languages: a parse of $w$ is a function that takes $L$-parses of
$w$ to $L'$-parses of $w$.
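For example, taking $L' = 0$ gives a constructive \emph{complement}: a
parse of $w$ in $L \Rightarrow 0$ is a function $Lw \to \emptyset$,
i.e., a refutation of the claim that $w$ has an $L$-parse.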
Next, the sequencing operator $\sepconj$, since it is not symmetric,
has two distinct right multi-variable adjoints:
\[ (L_1 \sepconj L_2 \to L_3) \cong (L_1 \to L_3 \leftmultimap L_2) \cong (L_2 \to L_1 \multimap L_3) \]
defined by
\[ (L \multimap L')w = \Pi_{w_l} L w_l \to L'(w_lw) \]
\[ (L' \leftmultimap L)w = \Pi_{w_r} L w_r \to L'(ww_r) \]
A parse of $(L \multimap L')w$ is a kind of ``partial'' parse of $L'$:
given any parse $L w_l$, we can get a complete parse $L' (w_lw)$. So
if $(L \multimap L') w$ successfully parses (and $L$ is non-empty),
then we have successfully parsed a suffix of $L'$ which, when
concatenated with a prefix that can be parsed as $L$, will yield a
full parse of $L'$.
Dually, the left hom $(L' \leftmultimap L)$ is a partial parse from
the right of the string.
These left and right hom operations can be used to define the
\emph{derivative} of a language with respect to a character or
word. If we focus on left-to-right parsing this would be:
\[ d^l_c(L) = Yc \multimap L \]
Why?
\[ (Yc \multimap L)w = \Pi_{w_l} \{ * \mid c = w_l \} \to L(w_lw) \cong L(cw) \]
If we reduce to language recognition, this is a de Morgan dual of the
\emph{quotient} of a formal language:
\[ (L_1 / L_2) w \iff \exists w' \in L_2.\, L_1(ww') \iff \neg (\forall w' \in L_2.\, \neg L_1(ww')) \iff \neg ((\neg L_1 \leftmultimap L_2) w) \]
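As a sanity check, the classical Brzozowski rule for the derivative of
a sequencing holds constructively: a parse of $(L \sepconj L')(cw)$
splits $cw$ either as $\epsilon \cdot cw$ or as $cu \cdot v$ with
$uv = w$, so
\[ d^l_c(L \sepconj L') \cong (d^l_c L \sepconj L') + (\Delta(L\epsilon) \times d^l_c L') \]
where $\Delta(L\epsilon)$ denotes the constant grammar at the set
$L\epsilon$ of empty-string parses of $L$.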
Finally, for any string $w$, there is a further right adjoint:
\[ ((Yw \multimap L) \to L') \cong (L \to L'_{w}) \]
defined by
\[ (L'_{w'})w = \textrm{Hom}(Yw' \multimap Yw, L') \cong \begin{cases} L'(u) & \text{if } w = w'u \\ 1 & \text{otherwise} \end{cases} \]
\subsection{Morphisms}
In the language recognition setting, we can define the
\emph{inclusion} of languages $L \subseteq L'$. In the constructive
setting this becomes a \emph{derivation transformer} $\phi : L \to
L'$, which is simply a natural transformation. Since $\Sigma^*$ is a
set, this amounts to giving for each $w \in \Sigma^*$ a function
\[ \phi_w : Lw \to L' w \]
with no non-trivial side-conditions.
However, it is natural to define a \emph{morphism of multiple
  variables} $\phi : L_1,\ldots,L_n \to L'$ as giving for any sequence
of words $w_1,\ldots,w_n$ a function
\[ L_1w_1 \times \cdots \times L_nw_n \to L'(w_1\cdots w_n) \]
Specializing to the case $n = 0$, a morphism of zero inputs is
precisely a parse of the empty string:
\[ \phi \in (1 \to L'(\epsilon)) \cong L'(\epsilon) \]
This gives the structure of a (planar) multicategory, in which the
operations $\sepconj,I,\multimap,\leftmultimap$ can be given appropriate
universal properties.
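For example, $\sepconj$ represents binary morphisms: there is a
$2$-ary morphism $L, L' \to L \sepconj L'$, given at $(w_1, w_2)$ by
the evident inclusion
\[ Lw_1 \times L'w_2 \to \Sigma_{u_1u_2 = w_1w_2} Lu_1 \times L'u_2 \]
and every $2$-ary morphism $L, L' \to L''$ factors uniquely through it.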
A grammar $g$ is
\begin{enumerate}
\item \emph{total} if for every $w$ there is at least one parse
\item \emph{deterministic} if for every $w$ there is at most one parse
\item \emph{empty} if for every $w$ there are no parses
\end{enumerate}
\subsection{Endofunctors}
TODO
\section{Associativity, Precedence, and Ordering using Category-valued Presheaves}
The presheaf of sets model of grammars allows us to reason about not
just whether a string parses, but also about \emph{ambiguity} of a
grammar: whether there exists \emph{more than one} parse. This is
modeled by the grammar $g$ having more than one parse of the same
string $w$. We can use this as above to reason about when a grammar is
ambiguous or not, and if the grammar is unambiguous, we have a clear,
deterministic specification for the parser.
But what about parsing \emph{ambiguous} grammars? We can simply say
that a parser should produce any parse, or perhaps all of them, but in
practice many parser generators and programming languages supporting
infix/mixfix notation allow for \emph{associativity} and
\emph{precedence} declarations. These are described by providing an
ambiguous grammar and disambiguating not by rewriting the grammar, but
instead by expressing a \emph{preference} between the multiple
possible parses. For instance, one of the most common sources of
ambiguity in parsing is in associativity and precedence of binary
operators. For an example of associativity, a grammar for subtraction
might initially be written as:
\begin{mathpar}
\begin{array}{rcl}
E &::= & E - E \,|\,(E)\,|\,Number\\
\end{array}
\end{mathpar}
But then we note that there are multiple, semantically different
parses of $3 - 4 - 5$: trees semantically equivalent to $(3 - 4) - 5$
or $3 - (4 - 5)$. We can rule out the right-nested version by
\emph{factoring} the grammar:
\begin{mathpar}
\begin{array}{rcl}
E &::= & E - L \,|\,L\\
L &::= & ( E ) \,|\, Number
\end{array}
\end{mathpar}
This grammar has fewer parses than the previous: now $3 - 4 - 5$ will
only parse as a left-nested term since $4 - 5$ does not match the
``leaf'' production $L$. Rather than factoring the grammar, however,
most languages declare that $-$ is a left-associative operator. We can
think of this as expressing a
\emph{preference} for certain parses over others, specifically the tree
\[\begin{tikzcd}
&& {-} \\
& {-} && {E_3} \\
{E_1} && {E_2}
\arrow[from=1-3, to=2-4]
\arrow[from=2-2, to=3-1]
\arrow[from=2-2, to=3-3]
\arrow[from=1-3, to=2-2]
\end{tikzcd}\]
is preferred over
\[\begin{tikzcd}
& {-} \\
{E_1} && {-} \\
& {E_2} && {E_3}
\arrow[from=1-2, to=2-1]
\arrow[from=1-2, to=2-3]
\arrow[from=2-3, to=3-2]
\arrow[from=2-3, to=3-4]
\end{tikzcd}\]
We can express this as an \emph{ordering}, where the preferred tree is
``bigger''.
Fortunately, we can express this with a small change. Rather than
taking functors from $\Sigma^*$ to \emph{sets} we can take functors to
\emph{partially ordered sets}, or even more generally into
\emph{categories}. So a grammar $g$ defines for each string $w$ a
category $g w$ whose objects are parses and for any two parses
$p_1,p_2 \in g w$, we have a set $(g w)(p_1,p_2)$ of ``reasons to
prefer $p_2$ over $p_1$''. Furthermore, preference is reflexive and
transitive, with composition of reasons associative and unital, as in
any category.
Then the parsing problem can be rephrased as, given a syntactic
description of a grammar $g$ and a string $w$, what is the \emph{best}
parse in $g w$, if any? In categorical language, can we construct a
terminal object of $g w$?
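To illustrate, the two trees above are objects of the category of
parses of $3 - 4 - 5$; declaring $-$ left-associative amounts to
positing a preference morphism from the right-nested parse $p_r$ to
the left-nested parse $p_l$ and none in the other direction, so that,
writing $w$ for this string,
\[ g(w)(p_r, p_l) = \{ * \} \qquad g(w)(p_l, p_r) = \emptyset \]
and $p_l$ is terminal among these parses, i.e., the best parse.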
\section{Semantic Actions and Grammar/Parser Combinators}
Finally, we arrive at the last main component of a parser, the
so-called ``semantic actions''. While the constructive parsing
approach does produce a parse \emph{tree}, it is not usually the case
that this parse tree is \emph{the same} as the desired type of
ASTs. Instead, we need only that there is a \emph{function} from parse
trees to ASTs. In general this might fail to be injective (such as
forgetting whether parentheses were used) or surjective (if there are
internal compiler forms that do not directly correspond to source
programs).
We can formalize this by noting that there is a constant-presheaf
functor $\Delta$ from $\Set$ to $\Set^{\Sigma^*}$, sending a set $A$
to the constant presheaf $\Delta(A)_w = A$. Then we can take the comma
category $\Set^{\Sigma^*}/\Delta$. The objects of this category are
pairs of a grammar $g$ and a set $A$ with a function
\[ \prod_{w \in \Sigma^*} (g_w \to A) \]
i.e., a natural transformation $g \to \Delta(A)$.
Then the projection functor $\mathrm{cod} : \Set^{\Sigma^*}/\Delta \to
\Set$ is an opfibration, and the pushforward is the ``map'' operation
on grammars. Via the Grothendieck construction, we can equivalently
view this opfibration as a (pseudo)functor from $\Set$ to
$\mathrm{Cat}$, sending $A$ to the fiber over it.
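To make the connection to parser combinators concrete, the following
is a minimal Haskell sketch (an illustration under simplifying
assumptions, not a formal semantics): a grammar assigns to each string
its set of parses, which we approximate by a list, and the semantic
action, the pushforward along $\mathrm{cod}$, is exactly \texttt{fmap}.
\begin{verbatim}
-- A grammar sends each string to (a list approximating) its set of parses.
newtype Grammar a = Grammar { parses :: String -> [a] }

-- Semantic actions: the pushforward/"map" operation is fmap.
instance Functor Grammar where
  fmap f (Grammar g) = Grammar (map f . g)

-- The unit I: only the empty string parses, trivially.
eps :: Grammar ()
eps = Grammar (\w -> [() | null w])

-- The representable Y c.
char :: Char -> Grammar Char
char c = Grammar (\w -> [c | w == [c]])

-- Day convolution: sum over all splittings w = w1 ++ w2.
(<.>) :: Grammar a -> Grammar b -> Grammar (a, b)
Grammar g <.> Grammar g' = Grammar (\w ->
  [ (p, q) | n <- [0 .. length w]
           , let (w1, w2) = splitAt n w
           , p <- g w1, q <- g' w2 ])

-- Coproduct: union of parses.
(<+>) :: Grammar a -> Grammar a -> Grammar a
Grammar g <+> Grammar g' = Grammar (\w -> g w ++ g' w)

-- For example: parses (char 'a' <.> char 'b') "ab" == [('a','b')].
\end{verbatim}
Here \texttt{char}, \texttt{eps}, \texttt{(<.>)}, and \texttt{(<+>)}
are hypothetical names for $Yc$, $I$, $\sepconj$, and $+$; a faithful
treatment would use genuine sets of parses rather than lists.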
\section{From Grammars to Parsers}
A grammar $g$ defines for each string $w$ a category of parses and
preferences. The task of a \emph{parser} is, given a syntactic
description of a grammar and a string, to produce a \emph{best} parse,
i.e., a terminal object of the category $g w$ of parses.
\section{Future Work: Tree Languages and Grammars}
\end{document}