#+TITLE: A tour of the code and design of glotawk, by file * Makefile, GNUmakefile I wanted glotawk to build the same whether you're using FreeBSD's bmake or GNU Make. I found that bmake pulls in .depends automatically, but to get .depends built and include it took a couple more incantations in GNU Make. Happily, it has its own GNUmakefile name, which it looks for first. There's a bit about the build process in the [[file+emacs:build-process.org][build-process.org]] file. I'm afraid there are some GNUisms in the benchmark target. * data.awk This is where the data structures live, and the AWK functions that deal with them. N is a global variable containing the next number to use, when creating a non-literal: a cons, a symbol, or a string. Note that AWK arrays are /associative/: even though we are using numbers to "index" them, these are keys not direct indices; so the numeric keys in one of the heap arrays will not be contiguous, especially as garbage collections happen. _TYPE, _CAR, _CDR, _STRING, and _SYM_NUMBERS are the associative arrays whose keys are these ascending numbers. (_SYM_NAMES has strings as keys, and numbers as values.) When something is stored in one of these arrays, N is incremented and its new value is the identity of the new thing. Values stored in _CAR and _CDR, or passed around between functions, are AWK numbers for pointers, or short AWK strings of certain format for literal values. The short strings to use are written here; for speed, they are also written in [[printer.awk]]. Conceptually, a cons cell has an address, a car, and a cdr. Concretely, the address of a cons cell here is the N value with which it was created (for example, 1234); its car lives in _CAR[1234] and its cdr in _CDR[1234]. Strings can be stored in AWK arrays, so they are stored in _STRING. Continuing our example above, if 1234 is a cons cell, then _TYPE[1234] will contain "(". Perhaps we have a list with one element, the string "foo". Then _TYPE[1233] might be "s" for string; _STRING[1233] would be "foo"; _CAR[1234] would be 1233; and _CDR[1234] would be the AWK string "nil". It's important to use the functions provided here (such as _is_null), rather than assuming the AWK strings used, so that this file remains as the place where these strings are written down. Profiling with GNU awk indicates that changes made here in the name of speed aren't worth their complexity. _GLOBALS and _MACROS are pointers to the beginning of the global variables alist and the macros alist respectively. When you ~(setq foo "bar")~, ~(foo "bar")~ is consed onto the beginning of _GLOBALS. (This is a simplification. See the code.) * first-symbols.awk So the "address" a symbol gets is just a matter of when it's used first -- and any mention of a symbol will cause it to be interned. This file exists to mention the first symbols in a well-defined order, before any code is evaluated, including the code in [[lib-eval.awk]]. This way, the actual numbers assigned to these symbols will always be the same, and will be small, contiguous numbers. This way, [[gc.awk]] can preemptively mark them, and avoid accidentally collecting them. If you define a new special form, you need to mention it here. Keep things in the same order as they're mentioned in the big if-else's in [[eval.awk]], [[math.awk]] and [[osf.awk]]. * reader.awk Code to tokenize input and use data.awk's functions to read it into the appropriate data structures. Faint echoes of the ~mal~ ([[https://github.com/kanaka/mal][Make a Lisp]]) directions may be detected here. * printer.awk Code to represent data from the internal structures in a format readable by the reader. * eval.awk The big one. Here I followed /Lisp from Nothing/ by Nils Holm - mostly the chapter about LISINT. Here live progn, cond, macros, and yes, eval3. It started out as eval from the early chapters of LFN, but gained the tail-calling goodness of LISINT. I just kind of wrote down the AWK code that manipulates the [[data.awk]] structures to do the same thing as the Lisp code in the book. (That code is in fact in the public domain, but the text of the book was pivotal for me.) After too many else-if stanzas slowed eval3 down, some were split off; because we can depend on the addresses of the symbols used to be small positive numbers, we can compare them and dispatch to different functions full of more else-ifs, like those in [[math.awk]] and [[osf.awk]]. * math.awk All the math operators and functions AWK makes available are provided here to glotawk programs. I decided to make "only2+" and similar special forms so that I could have a fixed number of arguments, and then deal with variable numbers of arguments inside the Lisp. (Later in [[osf.awk]] I did some variadic functions, so maybe this "only2" pattern is not needed.) For quotient, modulo, and power operators, I borrowed punctuation from Python rather than spelling them out like Common Lisp does. * osf.awk This stands for "other special forms." It's like a miscellaneous bin. It's string functions and file/pipe input and output, as well as GC and dumping special forms, whose implementations get their own source files [[gc.awk]] and [[dump.awk]]. * gc.awk Garbage collection! It's a simple mark-and-sweep collector. This is the first one I've ever coded and I had problems with it, so I undertook to make its behavior somewhat better visualizable; see under [[Garbage collection debugging]] below. * dump.awk Every function defined is just a tree of cars and cdrs, pointing every which way, including occasionally to strings or symbols, and sometimes some literals. All of those things are stored in the AWK global variables _TYPE, _CAR, _CDR, _STRING, et cetera. Those arrays can be iterated over, using AWK code. Their contents (and a few other things) can be written down as assignment statements in the AWK language. And if that AWK code is executed, along with all the other AWK function definitions we've discussed, it will reconstitute the glotawk environment that existed at dump time. That's what this file does. That has two uses: 1. The forms in [[lib-eval.awk]] are slow to read using the reader at every startup, but a bunch of AWK code is much faster to run. So, as explained in the [[file+emacs:build-process.org][build-process.org]] file, we execute the slow glotawk definition code once at build time, then dump the image, and replace the AWK code that says "run all this glotawk code" with the dumped image for fast startup. 2. If /you/ define functions in glotawk, and you want your glotawk to know them every time it starts up, you can dump an image and replace the image.awk section of your ~glotawk~ file with your dump. Quick startup, and all your functions at your disposal. * Garbage collection debugging The ~dump-dot~ and ~gc-dot~ functions write files in the Graphviz language. When you call ~dump-dot~ with an output filename, it dumps an image of the heap, not in AWK code, but as a directed graph to be drawn by Graphviz. You can ~dot -Tx11 that-file.dot~ (on a Unix-like system with Graphviz installed) to see the graph. Caution: If your heap is large, Graphviz might have to think a long time about how to draw it. If you see nothing, try zooming in. When, in addition, you call ~gc-dot~ with two output filenames, one for marks and the other for sweeps, additional Graphviz code will be written that colors the nodes in the graph according to whether they are about to be garbage-collected. See the ~heapviz~ target in the ~Makefile~. * glotawk-build.tmpl.awk, glotawk-run.tmpl.awk, glotawk-common.awk These are the mechanism by which, first, a glotawk binary is built that is slow but contains the definitions of the library, and then a glotawk binary is built that instead contains a dump of that library that can be loaded by executing it. By the agency of [[tmpl.awk]] they bring together all the other source files here mentioned. * image.awk This file is a build product: it is the image dumped from the build-glotawk, used to create the glotawk for running. * lib-eval.awk Definitions written in glotawk. Mostly, these are functions that invoke the special forms, so that you can ~(reduce append my-lists)~, or such like. The special forms only work when they are the first member of a list, so without these setq's, you can't talk about them inside glotawk as values. Speaking of ~reduce~, it's written here, along with other important things like ~mapcar~, ~let~, and ~quasiquote~. Variadic arithmetic functions, such as you would expect in a Lisp (e.g. ~(+ 1 2 3 4)~) are here. * logging.awk The logg_* AWK functions and the LOG_LEVEL global AWK variable are here. Note that if you call a logging function from some AWK code, and its message is not emitted because the LOG_LEVEL is too low, the contents of the message are /still computed/. That's why most of the logg_dbg calls are commented out: removing all those calls to _repr made glotawk /eight times faster/. It's here that /depth/ is dealt with, by indenting messages. That's what the ~depth~ or ~d~ parameter to a lot of functions in [[eval.awk]] is. * repl.awk Calls ~print _eval(_read(...))~ repeatedly. * sane_lisp The test suite, written using the Test Anything Protocol (TAP). * tmpl.awk Deals with @directives written in ~*.tmpl.awk~ files. Produces bigger awk files. * tmpl-depends.awk Deals with mostly the same @directives written in ~*.tmpl.awk~ files, but produces a .depends file for ~make~ to read, which lists files that certain targets depend on.