jaredj/glotawk: doc/tour.org

#+TITLE: A tour of the code and design of glotawk, by file

* Makefile, GNUmakefile

I wanted glotawk to build the same whether you're using FreeBSD's
bmake or GNU Make. I found that bmake pulls in .depends automatically,
but to get .depends built and include it took a couple more
incantations in GNU Make. Happily, it has its own GNUmakefile name,
which it looks for first.

There's a bit about the build process in the [[file+emacs:build-process.org][build-process.org]] file.
I'm afraid there are some GNUisms in the benchmark target.

* data.awk

This is where the data structures live, and the AWK functions that
deal with them.

N is a global variable containing the next number to use, when
creating a non-literal: a cons, a symbol, or a string. Note that AWK
arrays are /associative/: even though we are using numbers to "index"
them, these are keys not direct indices; so the numeric keys in one of
the heap arrays will not be contiguous, especially as garbage
collections happen.

_TYPE, _CAR, _CDR, _STRING, and _SYM_NUMBERS are the associative
arrays whose keys are these ascending numbers. (_SYM_NAMES has strings
as keys, and numbers as values.) When something is stored in one of
these arrays, N is incremented and its new value is the identity of
the new thing.

Values stored in _CAR and _CDR, or passed around between functions,
are AWK numbers for pointers, or short AWK strings of certain format
for literal values. The short strings to use are written here; for
speed, they are also written in [[printer.awk]].

Conceptually, a cons cell has an address, a car, and a cdr.
Concretely, the address of a cons cell here is the N value with which
it was created (for example, 1234); its car lives in _CAR[1234] and
its cdr in _CDR[1234].

Strings can be stored in AWK arrays, so they are stored in _STRING.
Continuing our example above, if 1234 is a cons cell, then _TYPE[1234]
will contain "(". Perhaps we have a list with one element, the string
"foo". Then _TYPE[1233] might be "s" for string; _STRING[1233] would
be "foo"; _CAR[1234] would be 1233; and _CDR[1234] would be the AWK
string "nil".

It's important to use the functions provided here (such as _is_null),
rather than assuming the AWK strings used, so that this file remains
as the place where these strings are written down. Profiling with GNU
awk indicates that changes made here in the name of speed aren't worth
their complexity.

_GLOBALS and _MACROS are pointers to the beginning of the global
variables alist and the macros alist respectively. When you ~(setq foo
"bar")~, ~(foo "bar")~ is consed onto the beginning of _GLOBALS. (This
is a simplification. See the code.)

* first-symbols.awk

So the "address" a symbol gets is just a matter of when it's used
first -- and any mention of a symbol will cause it to be interned.
This file exists to mention the first symbols in a well-defined order,
before any code is evaluated, including the code in [[lib-eval.awk]]. This
way, the actual numbers assigned to these symbols will always be the
same, and will be small, contiguous numbers. This way, [[gc.awk]] can
preemptively mark them, and avoid accidentally collecting them.

If you define a new special form, you need to mention it here. Keep
things in the same order as they're mentioned in the big if-else's in
[[eval.awk]], [[math.awk]] and [[osf.awk]].

* reader.awk

Code to tokenize input and use data.awk's functions to read it into
the appropriate data structures. Faint echoes of the ~mal~ ([[https://github.com/kanaka/mal][Make a
Lisp]]) directions may be detected here.

* printer.awk

Code to represent data from the internal structures in a format
readable by the reader.

* eval.awk

The big one. Here I followed /Lisp from Nothing/ by Nils Holm - mostly
the chapter about LISINT. Here live progn, cond, macros, and yes,
eval3. It started out as eval from the early chapters of LFN, but
gained the tail-calling goodness of LISINT. I just kind of wrote down
the AWK code that manipulates the [[data.awk]] structures to do the same
thing as the Lisp code in the book. (That code is in fact in the
public domain, but the text of the book was pivotal for me.)

After too many else-if stanzas slowed eval3 down, some were split off;
because we can depend on the addresses of the symbols used to be small
positive numbers, we can compare them and dispatch to different
functions full of more else-ifs, like those in [[math.awk]] and [[osf.awk]].

* math.awk

All the math operators and functions AWK makes available are provided
here to glotawk programs. I decided to make "only2+" and similar
special forms so that I could have a fixed number of arguments, and
then deal with variable numbers of arguments inside the Lisp. (Later
in [[osf.awk]] I did some variadic functions, so maybe this "only2"
pattern is not needed.)

For quotient, modulo, and power operators, I borrowed punctuation from
Python rather than spelling them out like Common Lisp does.

* osf.awk

This stands for "other special forms." It's like a miscellaneous bin.
It's string functions and file/pipe input and output, as well as GC
and dumping special forms, whose implementations get their own source
files [[gc.awk]] and [[dump.awk]].

* gc.awk

Garbage collection! It's a simple mark-and-sweep collector. This is
the first one I've ever coded and I had problems with it, so I
undertook to make its behavior somewhat better visualizable; see under
[[Garbage collection debugging]] below.

* dump.awk

Every function defined is just a tree of cars and cdrs, pointing every
which way, including occasionally to strings or symbols, and sometimes
some literals. All of those things are stored in the AWK global
variables _TYPE, _CAR, _CDR, _STRING, et cetera. Those arrays can be
iterated over, using AWK code. Their contents (and a few other things)
can be written down as assignment statements in the AWK language. And
if that AWK code is executed, along with all the other AWK function
definitions we've discussed, it will reconstitute the glotawk
environment that existed at dump time.

That's what this file does. That has two uses:

1. The forms in [[lib-eval.awk]] are slow to read using the reader at
   every startup, but a bunch of AWK code is much faster to run. So,
   as explained in the [[file+emacs:build-process.org][build-process.org]] file, we execute the slow
   glotawk definition code once at build time, then dump the image,
   and replace the AWK code that says "run all this glotawk code" with
   the dumped image for fast startup.

2. If /you/ define functions in glotawk, and you want your glotawk to
   know them every time it starts up, you can dump an image and
   replace the image.awk section of your ~glotawk~ file with your
   dump. Quick startup, and all your functions at your disposal.

* Garbage collection debugging

The ~dump-dot~ and ~gc-dot~ functions write files in the Graphviz
language.

When you call ~dump-dot~ with an output filename, it dumps an image of
the heap, not in AWK code, but as a directed graph to be drawn by
Graphviz. You can ~dot -Tx11 that-file.dot~ (on a Unix-like system
with Graphviz installed) to see the graph. Caution: If your heap is
large, Graphviz might have to think a long time about how to draw it.
If you see nothing, try zooming in.

When, in addition, you call ~gc-dot~ with two output filenames, one
for marks and the other for sweeps, additional Graphviz code will be
written that colors the nodes in the graph according to whether they
are about to be garbage-collected.

See the ~heapviz~ target in the ~Makefile~.

* glotawk-build.tmpl.awk, glotawk-run.tmpl.awk, glotawk-common.awk

These are the mechanism by which, first, a glotawk binary is built
that is slow but contains the definitions of the library, and then a
glotawk binary is built that instead contains a dump of that library
that can be loaded by executing it. By the agency of [[tmpl.awk]] they
bring together all the other source files here mentioned.

* image.awk

This file is a build product: it is the image dumped from the
build-glotawk, used to create the glotawk for running.

* lib-eval.awk

Definitions written in glotawk. Mostly, these are functions that
invoke the special forms, so that you can ~(reduce append my-lists)~,
or such like. The special forms only work when they are the first
member of a list, so without these setq's, you can't talk about them
inside glotawk as values.

Speaking of ~reduce~, it's written here, along with other important
things like ~mapcar~, ~let~, and ~quasiquote~. Variadic arithmetic
functions, such as you would expect in a Lisp (e.g. ~(+ 1 2 3 4)~) are
here.

* logging.awk

The logg_* AWK functions and the LOG_LEVEL global AWK variable are
here. Note that if you call a logging function from some AWK code, and
its message is not emitted because the LOG_LEVEL is too low, the
contents of the message are /still computed/. That's why most of the
logg_dbg calls are commented out: removing all those calls to _repr
made glotawk /eight times faster/.

It's here that /depth/ is dealt with, by indenting messages. That's
what the ~depth~ or ~d~ parameter to a lot of functions in [[eval.awk]]
is.

* repl.awk

Calls ~print _eval(_read(...))~ repeatedly.

* sane_lisp

The test suite, written using the Test Anything Protocol (TAP).

* tmpl.awk

Deals with @directives written in ~*.tmpl.awk~ files. Produces
bigger awk files.

* tmpl-depends.awk

Deals with mostly the same @directives written in ~*.tmpl.awk~ files,
but produces a .depends file for ~make~ to read, which lists files
that certain targets depend on.