\begin{abstract}
We show that formal grammars can naturally be given semantics in the
category of presheaves on strings. The monoid structure of strings
induces Day convolution structure on grammars that models the
sequencing operation of regular expressions and other forms of
formal grammar. The closed structure on this category can be used to
define the \emph{derivative} of a language, among other constructions.
%% Presheaves over a monoidal category is a common way to construct
%% models of separation logic. We show that we can use (ordered)
%% separation logic propositions as grammars and proofs in ordered
%% separation logic as parses and more generally parse-transformers, so
%% that parsing is equivalent to a proof search in separation logic.
\end{abstract}
A grammar $g$ is
\begin{enumerate}
\item \emph{total} if for every $w$ there is at least one parse
\item \emph{deterministic} if for every $w$ there is at most one parse
\item \emph{empty} if for every $w$ there are no parses
\end{enumerate}
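Concretely, these three properties can be sketched by modeling a grammar as a function from strings to its set (here, a list) of parses. This is a minimal illustrative sketch; the names \texttt{Grammar}, \texttt{totalOn}, etc.\ are ours, not part of the formal development:
\begin{verbatim}
-- A grammar assigns to each string its collection of parses.
type Grammar p = String -> [p]

-- Check the three properties over a finite set of test strings.
totalOn :: [String] -> Grammar p -> Bool
totalOn ws g = all (not . null . g) ws

deterministicOn :: [String] -> Grammar p -> Bool
deterministicOn ws g = all ((<= 1) . length . g) ws

emptyOn :: [String] -> Grammar p -> Bool
emptyOn ws g = all (null . g) ws

-- Example: the grammar of strings of 'a's, each with a unique parse,
-- so it is deterministic but not total.
aStar :: Grammar ()
aStar w = [() | all (== 'a') w]
\end{verbatim}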
%% \begin{definition}
%% A grammar $g$ is \emph{uniquely decidable} if there exists a
%% ``complement'' $g'$ such that $g \vee g'$ is total and
%% deterministic.
%% \end{definition}
\section{Formal Grammars and their Semantics}
The presheaf-of-sets model of grammars allows us to reason not just
about whether a string parses, but also about the \emph{ambiguity} of
a grammar: whether there exists \emph{more than one} parse. This is
modeled by the grammar $g$ assigning more than one parse to the same
string $w$. We can use the model to reason about whether a grammar is
ambiguous, and if it is unambiguous, we have a clear, deterministic
specification for the parser.
But what about parsing \emph{ambiguous} grammars? We can simply say
that a parser should produce any parse, or perhaps all of them, but in
practice many parser generators and programming languages supporting
infix/mixfix notation allow for \emph{associativity} and
\emph{precedence} declarations. These are described by providing an
ambiguous grammar and disambiguating not by rewriting the grammar, but
instead by expressing a \emph{preference} between the multiple
possible parses. For instance, one of the most common sources of
ambiguity in parsing is in associativity and precedence of binary
operators. For an example of associativity, a grammar for subtraction
might initially be written as:
\begin{mathpar}
\begin{array}{rcl}
E &::= & E - E \,|\,(E)\,|\,Number\\
\end{array}
\end{mathpar}
But then we note that there are multiple, semantically different
parses of $3 - 4 - 5$: trees semantically equivalent to $(3 - 4) - 5$
or $3 - (4 - 5)$. We can rule out the right-nested version by
\emph{factoring} the grammar:
\begin{mathpar}
\begin{array}{rcl}
E &::= & E - L \,|\,L\\
L &::= & ( E ) \,|\, Number
\end{array}
\end{mathpar}
This grammar has fewer parses than the previous one: $3 - 4 - 5$ now
parses only as a left-nested term, since $4 - 5$ does not match the
``leaf'' production $L$. Rather than factoring, however, most
languages declare that $-$ is a \emph{left-associative} operator. We
can think of such a declaration as expressing a \emph{preference} for
certain parses over others, specifically that the tree
% https://q.uiver.app/?q=WzAsNSxbMiwyLCJFXzIiXSxbMywxLCJFXzMiXSxbMiwwLCItIl0sWzEsMSwiLSJdLFswLDIsIkVfMSJdLFsyLDFdLFszLDRdLFszLDBdLFsyLDNdXQ==
\[\begin{tikzcd}
&& {-} \\
& {-} && {E_3} \\
{E_1} && {E_2}
\arrow[from=1-3, to=2-4]
\arrow[from=2-2, to=3-1]
\arrow[from=2-2, to=3-3]
\arrow[from=1-3, to=2-2]
\end{tikzcd}\]
is preferred over
% https://q.uiver.app/?q=WzAsNSxbMSwwLCItIl0sWzAsMSwiRV8xIl0sWzIsMSwiLSJdLFsxLDIsIkVfMiJdLFszLDIsIkVfMyJdLFswLDFdLFswLDJdLFsyLDNdLFsyLDRdXQ==
\[\begin{tikzcd}
& {-} \\
{E_1} && {-} \\
& {E_2} && {E_3}
\arrow[from=1-2, to=2-1]
\arrow[from=1-2, to=2-3]
\arrow[from=2-3, to=3-2]
\arrow[from=2-3, to=3-4]
\end{tikzcd}\]
We can express this as an \emph{ordering}, where the preferred tree is
``bigger''.
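As a Haskell-style sketch (illustrative, not the paper's formalism), the factored grammar above can be implemented by recursive descent, with the left recursion in $E$ turned into iteration, which is exactly what makes the resulting tree left-nested:
\begin{verbatim}
-- E ::= E - L | L ;  L ::= ( E ) | Number
data Expr = Sub Expr Expr | Num Int deriving (Eq, Show)

-- A parser returns a result and the remaining input, if any.
type P a = String -> Maybe (a, String)

pNum :: P Expr
pNum s = case span (`elem` ['0'..'9']) s of
  ("", _)  -> Nothing
  (ds, s') -> Just (Num (read ds), s')

pL :: P Expr
pL ('(':s) = case pE s of
  Just (e, ')':s') -> Just (e, s')
  _                -> Nothing
pL s = pNum s

-- The left recursion E ::= E - L becomes a loop folding "- L" tails
-- onto the accumulator, producing a left-nested tree.
pE :: P Expr
pE s = case pL s of
  Nothing      -> Nothing
  Just (e, s') -> go e s'
  where
    go acc ('-':rest) = case pL rest of
      Just (e2, rest') -> go (Sub acc e2) rest'
      Nothing          -> Nothing
    go acc rest = Just (acc, rest)
\end{verbatim}
Here \texttt{pE "3-4-5"} yields the left-nested \texttt{Sub (Sub (Num 3) (Num 4)) (Num 5)}, the preferred tree.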
Fortunately, we can express this with a small change. Rather than
taking functors from $\Sigma^*$ to \emph{sets} we can take functors to
\emph{partially ordered sets}, or even more generally into
\emph{categories}. So a grammar $g$ defines for each string $w$ a
category $g w$ whose objects are parses and for any two parses
$p_1,p_2 \in g w$, we have a set $(g w)(p_1,p_2)$ of ``reasons to
prefer $p_2$ over $p_1$''. Furthermore, preference is reflexive and
transitive, and the composition of reasons is associative and unital.
Then the parsing problem can be rephrased as, given a syntactic
description of a grammar $g$ and a string $w$, what is the \emph{best}
parse in $g w$, if any? In categorical language, can we construct a
terminal object of $g w$?
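In the special case where the category $g w$ is a total preorder, a terminal object is just a maximal parse, and ``return the best parse'' can be sketched as choosing a maximum for an illustrative scoring function (the names and the particular weighting below are our own, chosen only so that left-nested trees score highest):
\begin{verbatim}
import Data.List (maximumBy)
import Data.Ord (comparing)

data Tree = Sub Tree Tree | Leaf Int deriving (Eq, Show)

-- An illustrative preference: a subtraction node counts its left
-- subtree twice, so more left-nesting means a higher score.
leftWeight :: Tree -> Int
leftWeight (Leaf _)  = 0
leftWeight (Sub l r) = 1 + 2 * leftWeight l + leftWeight r

-- The "best" parse is a maximum for the preference ordering; in the
-- total-preorder case it plays the role of a terminal object.
bestParse :: [Tree] -> Maybe Tree
bestParse [] = Nothing
bestParse ts = Just (maximumBy (comparing leftWeight) ts)
\end{verbatim}
Applied to the two parses of $3 - 4 - 5$, this selects the left-nested tree.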
%% \section{Ordered Separation Logic for Formal Grammars}
%% Next, based on our algebraic constructions in the previous section, we
%% describe a \emph{polymorphic ordered separation logic} for
%% grammars. While this may seem a surprising application of separation
%% logic, it is quite natural when considering the semantics: Day
%% convolution of presheaves over a \emph{commutative} monoid is one of
%% the oldest semantic interpretations of separation logic proofs. For
%% parsing, we are taking presheaves over a non-commutative monoid (if
%% $|\Sigma| \ge 2$). This non-commutativity is obvious for instance in
%% regular expressions: $r_1r_2$ almost never presents the same language
%% as $r_2r_1$.
%% This consists of three judgments.
%% \begin{enumerate}
%% \item An \emph{open} grammar $g$ parameterized by a context of grammar
%% variables $\Gamma = \gamma_1,\ldots$:
%% \[ \Gamma \vdash g \]
%% \item For each $\Gamma$ a judgment of \emph{parse terms} of a grammar
%% $g$ parameterized by a ``bunch'' of parsed words $\bunch$, where $g$
%% and $\bunch$ are both parameterized by $\Gamma$:
%% \[ \Gamma | \bunch \vdash p : g \]
%% \end{enumerate}
%% The intended denotational semantics is
%% \begin{enumerate}
%% \item A grammar $\Gamma \vdash g$ denotes a (mixed-variance?) functor
%% $\sem{g} : \Lang^{|\Gamma|} \to \Lang$
%% \item A parse term $\Gamma | \mathcal W \vdash p : g$ denotes a
%% natural transformation $\mathcal W \Rightarrow g$ where a bunch $\mathcal W$
%% is interpreted appropriately using the two monoidal structures on
%% $\Lang$:
%% \begin{align*}
%% \sem{\cdot_\epsilon} &= \epsilon\\
%% \sem{\mathcal W_m,w : g} &= \sem{\mathcal W_m}\sepconj \sem{g}\\
%% \sem{\mathcal W_m,\mathcal W_a} &= \sem{\mathcal W_m}\sepconj \sem{\mathcal W_a}\\
%% \sem{\cdot_\top} &= \top\\
%% \sem{\mathcal W_a;\mathcal W_m} &= \sem{\mathcal W_a} \times \sem{\mathcal W_m}
%% \end{align*}
%% \end{enumerate}
%% To get a feel for the differences between the multiplicative and
%% additive structure here are some examples.
%% A parse term
%% \[ \cdot_\epsilon \vdash p : g\]
%% denotes a $g$-parse of the empty string, whereas a term
%% \[ \cdot_\top \vdash p : g\]
%% denotes a function from strings $w \in \Sigma^*$ to $g$-parses of $w$.
%% The grammar $g \sepconj g'$ presents the language that parses words
%% $w$ that can be split as $w_1w_2$ with a pair of parses $gw_1$ and
%% $g'w_2$. On the other hand, $g \times g'$ is the intersection of the
%% two languages: a parse of a word $w$ is a pair of a $g$ parse and a
%% $g'$ parse of the entire word. $g \times 1$ is equivalent to $g$,
%% whereas a parse of $w$ using $g \sepconj 1$ is a $g$-parse of a prefix
%% $w'$ of $w$, and correspondingly $1 \sepconj g$ parses words with a
%% $g$ suffix. On the other hand $\epsilon \times g$ is the restriction
%% of $g$ to only the parses of the empty string.
%% \begin{figure}
%% \begin{mathpar}
%% \inferrule
%% {\Delta, \gamma \vdash g \and \textrm{where $\gamma$ strictly positive}}
%% {\Delta \vdash \mu \gamma. g}
%% \end{mathpar}
%% \caption{Ordered Separation Logic}
%% \end{figure}
%% Given any string $w$ over our alphabet, there is a corresponding
%% grammar $\lceil w \rceil$ that accepts precisely $w$ with a single
%% parse. Then we can formulate the \emph{parsing problem} as a
%% \emph{program synthesis} problem:
%% \begin{quote}
%% Given a closed grammar $\cdot \vdash g$ and a word $w$, attempt to
%% construct a parse $x : \lceil w \rceil \vdash p : g$
%% \end{quote}
%% Or, even more naturally, we can break w up into its constituent
%% characters:
%% \begin{quote}
%% Given a closed grammar $\cdot \vdash g$ and a word $c_1\ldots_n$,
%% construct a parse $x_1 : \lceil c_1 \rceil,\ldots_n \vdash p
%% : g$
%% \end{quote}
%% \begin{figure}
%% \begin{mathpar}
%% \inferrule{g \in \mathcal G}{\mathcal G \vdash g}
%% \end{mathpar}
%% \caption{Context-free Expressions}
%% \end{figure}
\section{Semantic Actions and Grammar/Parser Combinators}
%% While the presheaf presentation directly captures the notion of a
%% parse tree, in a program the parse tree described by a grammar is
%% often massaged to express precedence data and so the tree structure of
%% a parse is more complex than the actual data being parsed. For
%% instance, an AST data type for a single binary arithmetic operation
%% can be represented as a binary tree with numbers at the leaves, such
%% as the following Haskell/Agda-like data definition:
%% \begin{verbatim}
%% data Tree where
%% op : Tree -> Tree -> Tree
%% leaf : Num -> Tree
%% \end{verbatim}
%% If the single operator op is defined to be left-associative (such as
%% subtraction/division), a typical grammar for parsing these expressions
%% would be defined by \emph{factoring} the grammar into two mutually
%% recursive non-terminals, \emph{E} for expression and \emph{L} for
%% leaf:
%% \begin{mathpar}
%% \begin{array}{rcl}
%% E &::= & E - L \,|\,L\\
%% L &::= & ( E ) \,|\, Number
%% \end{array}
%% \end{mathpar}
%% This roughly corresponds to the following mutually recursive datatypes:
%% \begin{verbatim}
%% data Expr where
%% op : Expr -> Leaf -> Expr
%% leaf : Leaf -> Expr
%% data Leaf where
%% parens : Expr -> Leaf
%% num : Number -> Leaf
%% \end{verbatim}
%% I.e., an \texttt{Expr} is a non-empty list of \texttt{Expr + Number}.
%% This mismatch is usually handled by \emph{semantic actions}, a kind of
%% fold over the parse tree that is compositionally expressed alongside
%% the definition of the parser itself. This can be seen as a manually
%% performed optimization to make up for the compiler's inability to
%% \emph{fuse} the two functions of parsing (an unfold) with constructing
%% the intended AST (a fold). However, semantic actions become more
%% useful when we have a \emph{compositional} language of grammars with
%% corresponding semantic actions, such as in parser combinator
%% libraries.
%% Parser combinator libraries are typically based on some notion of
%% backtracking error monad: a term of type \texttt{Parser Sigma A} is a
%% parser of strings from the alphabet Sigma that on correct parses produces
%% an associated value of type \texttt{A}.
%% %
%% Intuitively, this should be semantically (though not performantly)
%% isomorphic to a grammar $g$ paired with a function from $g$-parses to
%% $A$.
%% We can formalize this by noting that there is a functor, the
%% ``Konstant'' functor $K$ from $\Set$ to $\Set^{\Sigma^*}$ that sends a
%% set $A$ to the constant presheaf $K(A)_w = A$. Then we can take the
%% comma category $\Set^{\Sigma^*}/K$. The objects of this category are
%% pairs of a grammar $g$ and a set $A$ with a function
%% \[ \prod_{w \in \Sigma^*} g_w \to A \]
%% Then the projection functor $cod : \Set^{\Sigma^*}/K \to \Set$ is an
%% opfibration. The pushforward is the ``map'' operation on grammars.
\section{From Grammars to Parsers}
A grammar $g$ defines for each string $w$ a category of parses and
preferences. The task of a \emph{parser} is, given a syntactic
description of a grammar and a string, to produce a \emph{best} parse,
i.e., a terminal object of the category $g w$ of parses.