Thursday, August 25, 2011

A Survey of Syntactic Extension Techniques in the Lisp Family of Languages (Part 3)

Last time we talked about the Lisps which use something we might call defmacro-style syntactic extensions (Common Lisp, Emacs Lisp, Clojure, Arc, and things we didn't specifically talk about, like LuSH). The time before that, we talked about dynamically scoped interpreted Lisps with fexprs in which the fact that a piece of code's value is always dynamically defined was exploited to let various run-time approaches to staging execution stand in for macros.

Hopefully, throughout the previous posts, I've impressed upon the reader the importance of scope in all of these discussions. The key reason that picoLisp or newLisp can get away with fexrp-style macros is that code literally means what it says whenever it is you evaluate it. This has some things to recommend it, but these days lexical scope is the biggest rooster in the yard. Lexical scope makes reading and writing code easier and prevents certain kinds of bugs, but it introduces problems with defmacro-style macros, which manipulate code without knowing anything about the lexical contexts associated with variables. What this principally means is that a defmacro-style macro:

  1. Will behave incorrectly if it depends on other macros/special forms whose definitions are changed between its declaration and interpretation.
  2. Can introduce symbol-bindings which unexpectedly shadow symbol bindings in expanded sub-forms.

In Lisps with defmacro style macros, 1 is kind of dealt with by having separate namespaces - generally, one just hopes that people won't often redefine macros and, if they do, they will do so in their own namespace. Macros implicitly invoked by a macro expansion will hopefully or by design expand into package-qualified symbols, which we trust the user not to clobber without knowing what she is doing.

Point 2 is dealt with by generating symbols during macro expansion which are guaranteed (under most circumstances) to be unique from any other symbols which are in use. This gives macros a means of expanding into code which references its own private variables without any concern of clobbering some bindings unexpectedly.

If we restrict our attention to regular old Lisp code, without macros, in a lexical Lisp, we might begin to think that syntactic extensions are kind of lacking when compared to other kinds of kinds of things with names. Of course, we can't pass syntactic extension tokens around at run time or anything like that but we might want them to behave in a way a little more like variables. That is, we want them to obey reasonable scoping relationships. In a lexically scoped language, for instance:

(let ((x 10)) (lambda (y) (+ x y)))

The x inside the lambda will always refer to the x bound in the enclosing let expression. If you want to check the value of an expression that uses that lambda, you don't have to understand the whole history of the symbol x throughout the program. You just need to know the value passed in (y), and then go look at the lexical context of the lambda to see what x is.

What if we could invent a macro system where this was true of macros? Where the meaning of a macro could be determined by looking at the inputs at the location of macro expansion, but then only look at the macro definition, and its lexical context, to understand what the expansion means?

Extending Our Code Representation

Most Lisps represent code as a list of atoms and lists. This is fine for most things, but we've discussed at length how this representation makes macro expansion problematic. Scheme extend their code representation to include extra information, most interestingly about the lexical context of a piece of code. Code, then, is represented as syntax objects.

In Racket, the Scheme(+) implementation I recommend for fiddling with Scheme, you can see this explicitly. The quotation operator still produces naked code:

> 'x ;-> x

But the sharp quote operator, short hand for syntax, produces syntax, which is a data object which contains (abstractly) lists of atoms and lists, but decorated:

> #'x -> #<syntax:11:10 x>

(Note that #' used in this way is specific to Racket).

Syntax
Syntax, unlike the raw code representation of other Lisps, contains information about the lexical context and binding of symbols.

The Racket printer shows a little of the information associated with the syntax (line number, it appears), but lots of other information is stored. Racket macros transform and produce syntax, not raw lists. Except for this fact, they are the same as defmacro style macros. You may have heard people say that you can't write Scheme macros in Scheme (as you write Lisp macros in Lisp), but this isn't really true. Syntax transformations have to produce and consume syntax, but otherwise are unrestricted. It is true that Schemes vary in the degree to which you are provided tools to arbitrarily manipulate syntax, but that is at least partly a separate issue.

Introducing New Syntax

Scheme provides an operation, define-syntax, which informs the system that a symbol should be associated with a new syntax transformer. This is the equivalent of defmacro in other Lisps. It has the form:

(define-syntax <symbol> <syntax-transformer-expression>)

The first symbol is equivalent to the name of the macro in defmacro lisps. The second expression must be a piece of Scheme code that generates a syntax-transformer which is a regular function (as we shall see) that must take and return syntax1. After executing a define-syntax, if the <symbol> is detected during macro-expansion, its contents are passed to the syntax transformer, and the results are stitched back into the code.

Simple Tools Help Maintain Hygiene

So how do we write new syntax-transformers? In Lisp, we just use the list manipulation functions, quasiquote and unquote, to take apart and build up our new source code. Schemes provide equivalent procedures for syntax objects, but most macros are simple enough to be written in a high-level form which helps maintain hygiene automatically. There are many ways to do this, the simplest of which is syntax-rules.

First in words: syntax-rules is a form which takes a series of patterns and expansions and produces a syntax-transformer which examines syntax until a pattern matches and returns the associated expansions. Critically, when a pattern is matched, the parts of the pattern are bound to pieces of syntax which can be used in the expansion. Let's do let*, like we did last time:

(define-syntax my-let*
 (syntax-rules 
  [(my-let* () body ...)
   (begin body ...)]
  [(my-let* ((id expr)) body ...)
   ((lambda (id) body ...) expr)]
  [(my-let* ((id0 expr0) (id expr) ...) body ...)
   ((lambda (id0) (my-let* ((id expr) ...) body ...)) 
    expr0)]))

I think people find syntax-rules confusing, when coming from defmacro, because it handles a lot of things for you. For one thing, the fact that this expression is a syntax transformer isn't as obvious as it could be, but think of syntax-rules as an expression which returns a function, which is exactly what it is. That isn't so odd!

But syntax-rules also handles pattern matching for you and binding the results of that matching to pieces of syntax in the expansion expression. On top of that, neither the pattern nor the expansion constitutes regular Scheme code. Each is, in fact, its own unique, but related, domain specific pattern language, a bit like a regular expression match and replacement language for syntax.

Let's look at each specific pattern and its expansion. The first pattern handles the case where someone invokes a my-let* binding expression without any bindings. It matches a piece of code which looks like:

(my-let* () body ...)

There is a piece of that domain specific pattern language in this code. The expression body ... tells syntax-rules to match zero-or-more forms of arbitrary structure and bind them to the pattern body ... in the expansion part of the clause. Since we aren't doing anything with the body we don't care if it is there or not, since we handle it the same way. We expand this clause into:

(begin body ...)

Here body ... now refers to a piece of bound syntax produced by the pattern part of the expression. When the expansion is expanded, it will contain whatever was matched to body ... in the pattern, which might be nothing.

The next clause's pattern is:

(my-let* ((id expr)) body ...)

Here we instruct the pattern language to only match when the binding part of the let contains one two-element list which should be an identifier and an expression (we will have to think about error checking later). The identifier is bound to id, the expressions is bound to expr. body ... is bound as above. The expansion is given as:

((lambda (id) body ...) expr)

Because id, body ..., and expr were bound in the pattern, they have special meaning in the expansion. The symbol lambda, however, is a piece of syntax which is corresponds to the currently operating lambda syntax. Note that syntax-rules is basically handling automatically what we would have done manually with gensym and quasiquotation/unquotation.

At the risk of being tedious, lets look at the final pattern. It doesn't contain anything new, but it does highlight some of the strategies often encountered in writing this kind of syntax transformer. The pattern is:

(my-let* ((id0 expr0) (id expr) ...)
         body ...)

Here, var0 and expr0 are bound to the first id/expr pair. We write out the first explicitely for two reasons. The first is that (id expr) ... matches zero or more expressions, and we want at least one. The second is that we need to split up the id and the expr part of our match and insert the ... rest of them somewhere else. You can't do this with the pattern ... match, which must appear in the expansion as pattern .... We could write:

(my-let* ((id0 expr0) binder ...)
    body ...)

of course, but by specifying a more concrete match we make our macro a little bit better in terms of robustness. We really want at least a pair in each binder position, so we say so directly.

We expand this pattern as:

((lambda (id0) (my-let* ((id expr) ...) body ...)) expr0)

Which should be understandable enough. The main thing to note is that we are recursively expanding this syntax-transformation with calls to itself. Because syntax-transformations often represent purely functional computations on input syntax, recursion is a common (and elegant) strategy.

The Scheme universe offers other forms that allow more complex macro expansions, but they all owe a lot to the basic shape of syntax-rules.

Shortcomings of Syntax Rules

The standard way of demonstrating the shortcomings of syntax-rules is by trying to write an anaphoric if macro. Anaphoric if is a macro that binds the result of its predicate part to a variable, which is available for use inside either branch expression (often the symbol it is used for this purpose). Since you've read Part 1 and Part 2, I can just explain this by writing the appropriate macro in Common Lisp:

(defmacro anaphoric-if (pred true &optional false)
 `(let ((it ,pred))
    (if ,it ,true ,false)))

Simple! The utility of the macro is related to the fact that if treats lots of things other than just booleans as true. Imagine you have a function which processes a command entered by the user. The command is returned by the function get-command. If the user just pressed enter, nil is returned.

(anaphoric-if (get-command) (handle-command it) nil)

it contains the result of get-command. Our macro has captured the symbol it and is using it for its own purposes. If we already had an it in scope, this might be inconvenient (and, in my opinion, there are lots of things wrong with this kind of macro, but some people are into it). Misgivings about variable capture aside, it is easy to write this variable capturing macro in Common Lisp, and it behaves just as we expect it to with our filthy, untutored brains.

We might try to write this macro using our new knowledge of syntax-rules like so:

(define-syntax anaphoric-if
 (syntax-rules ()
  [(anaphoric-if pred true false)
   (let ((it pred))
     (if it true false))]))

But if we attempt to use this syntax-transformer, we'll quickly run into problems:

(anaphoric-if 10 (format "it is ~a~n" it) #f)
  . . reference to an identifier before its definition: it
  >

We expected the string "It is 10\n", but we got an error! What is happening is that syntax-rules is trying to protect us from ourselves. Inside the expansion we introduce a piece of syntax, it. It isn't, however, simply a symbol - it is syntax, and its metadata tells the Scheme system that its scope is limited to the macro expansion context. We can use it as the predicate spot in our if statement, but any it symbols appearing in the expressions matched to true or false will only be superficially similar. The Scheme system will know that the its in those expressions are different than the one we create in our macro expansion. Hence, when the evaluator encounters it in one of our branches, it checks that symbol's original lexical context (in our example, the top-level binding environment) for a binding, which it doesn't find. Hence the error.

We can approach the solution in two ways (at least). The first is to force the user to provide the symbol name herself. Scheme will then know that the symbol, whatever it is, is meant to be the same for all appearances in the expansion:

(define-syntax named-anaphoric-if 
 (syntax-rules () 
  [(named-anaphoric-if id pred true false)
   (let ((id pred))
     (if id true false))]))

Or we can try to tell scheme we want to capture the symbol it. I don't know of any way to do this with syntax-rules, although I suspect that if you really wanted to bend over backwards, there might be some way. We need something else! Plus, there are lots of other idioms, not just the ugly duckling variable capture, which are hard to express in syntax-rules. Hence, Schemers cooked up...

Syntax-case

Before we talk about syntax-case I want to emphasize that both syntax-rules and syntax-case are just tools that produce syntax-transformers. Their unusual, domain specific languages are meant to make writing transformers correctly easier, but they are not, themselves, the Scheme macro system. They are just tools built on top of that system. The underlying system is much like the one in defmacro Lisps, except it transforms syntax, not lists.

syntax-case looks a lot like syntax-rules. Here is how you'd write the named-anaphoric-if in syntax-case:

(define-syntax named-anaphoric-if* 
  (lambda (stx) 
    (syntax-case stx ()
      [(named-anaphoric-if* id pred true false)
       (syntax (let ((id pred))
                 (if id true false)))])))

Almost identical. Sharp eyed readers will not that the expansion has been wrapped with a syntax expression, which tells the environment to evaluate the contents contained therein as syntax, which means automatically handling any syntax bound by the pattern match. We also have to wrap syntax-case in a lambda, which is the syntax-transformer proper. Unlike syntax-rules, which is a form resulting in a syntax-transformer, syntax-case merely transforms syntax. It does not produce a transformer itself.

However, since the expansion part of syntax-case clauses is not automatically interpreted as syntax, it allows us to finagle the matched syntax manually. Turns out this is just what we need to "capture" a variable. Here is how we do it:

(define-syntax anaphoric-if** 
 (lambda (stx)
  (syntax-case stx ()
   [(anaphoric-if** pred true false)
    (with-syntax ([it-id (datum->syntax (syntax true) 'it)])
     (syntax 
      (let ((it-id pred))
       (if it-id true false))))])))

So this is a syntax-case expression with one clause, which matches things like (anaphoric-if** a b c). The next part of the clause must return a piece of syntax, but it can do any arbitrary scheme activity to produce that syntax, unlike syntax-rules. The expression with-syntax allows us to introduce an environment with new syntax-bindings in addition to those produced by the match expression. We use it to introduce a binding to it-id. This means we can use the expression it-id inside our expansion (within the with-syntax body) and it will be interpreted correctly by syntax expressions. How do we actually create a new piece of syntax, though? The function datum->syntax takes a piece of data (in our case, the symbol it) and transforms it into a piece of syntax. But syntax contains much more information than just its value, and some of that information determines how it is interpreted during macroexpansion and subsequent compilation. datum->syntax allows us to pass in, as the first argument, a piece of syntax from which we will clone that additional information. Since we want to let it operate in the same environment as the true and false clauses, we clone the syntax-metadata from the true branch. We then return a piece of syntax corresponding to our entire expression. Again, syntax automatically handles the equivalent of quasi/un-quotation as we might do it in Lisp. It looks at the syntax bindings which are active and automatically interpolates them. We can now say, with a straight face:

(anaphoric-if** 10 it it)

And get 10, as we expect to. We now know how to write selectively-non-hygienic macros. Go us.

Recapitulation

We started, back in Part 1 by talking about dynamically scoped Lisps where code was always written and executed in the most obvious way possible. These Lisps maybe have been difficult to write programs in, or at least unfamiliar to modern Lispers and programmers at large, but they did have one thing going for them: their treatment of code/data was so simple that special forms could be treated identically with functions. These so-called fexprs could choose when to evaluate their inputs because they put the onus of maintaining various bindings to a single symbol on the user.

In these languages, macros were often considered to be less useful than fexprs, and why not? We could see when we looked at the defmacro Lisps shared some problems which, if not really a problem in practical use cases, at least left us philosophically unsettled. Macros could easily be written which unintentionally captured variables, they were not first class, and didn't follow regular lexical binding rules. In exchange for these shortcomings, we did get macros which were almost as nice as fexprs and worked better in lexically scoped languages, because they forced us to emphasize the fact that macro-expansion occurs before execution. Tools like gensym and packages/namespaces helped us avoid the worst hygiene problems.

And so we've finally arrived at Scheme, were macros, now called syntax-transformers work not on an undecorated representation of Lisp code, as they do in defmacro lisps, but on syntax, a code representation where information about context and binding is represented as well. Adding this additional representation to the code allows forms like syntax-rules to automatically handle syntax transformation for us while maintaining hygiene.

Finally, we grudgingly admitted that people somewhere might want to intentionally violate hygiene, and introduced the syntax-case expression, which allows the user to manipulate syntax-objects to the end of capturing variables and the like.

Further Reading

There are tons of things still to be said about the subject, especially regarding error checking and reporting. And the small, but enthusiastic communities around newLisp and picoLisp as well as the thriving Emacs Lisp community demonstrates that despite academic discussion, even issues of lexical or dynamic scope is far from resolved.

Further reading might entail getting one's head around syntax-parse a special form implemented in Racket and (partially) in Clojure, which is meant to make writing hygienic (and selectively unhygienic) macros both easy to write and easy to make provide useful, informative error messages. You could also check out Kernel, which is an experimental Lisp which tries to address all of these issues with tons of dollar signs and abstraction.

As always, I hope this has been useful to someone other than myself!

Back Matter

If all you care about is Lisp, then we've reached the end of this series on syntax. However, discussions with a reader, Dru Nelson, compel me to mention The Factor Language as something of a conceptual digestif. Factor is a concatenative language which has many similarities with Lisp, which puts it in the odd category of being more Lispy than Lisp in many respects.

Factor, for programmers unfamiliar with concatenative languages, in its ordinary operation, does not have variables: values are passed between subroutines (called "words") on a data stack. I strongly encourage readers to download Factor and give it a spin - I find writing Factor programs to be delightful mental calisthenics. Here is how you calculate (3 + 13) * 6 in Factor:

( scratchpad ) 3 13 + 6 *

--- Data stack:
96

A Factor program is, apparently, a flat list of "words" with data flow handled implicitly. What is the equivalent of a "lambda" expression in such an exotic language? Well, its just a list! Since there are no named things except for words, a list of the words (and literals, which can be thought of as words which "push" themselves onto the stack) which make up a subroutine is all you need or can have. In Factor, code really is data and execution of that code is even simpler than it is in dynamically scoped languages. You can't even make the common mistakes of dynamical languages respecting variable binding because, well, you don't have variables. You don't need to write Macros, such as we have them in Lisp, in Factor, because your functions are free to shuffle around, decompose, build and partially evaluate quotations to their heart's content2.

Factor goes one step further towards simplicity than even picoLisp - it completely eliminates variables, and as such, data becomes even closer to code than it is in that language.

More Lispy than Lisp indeed!


1 I'm pretty sure that syntax, as (essentially) a wrapper for data, is a monad, but that is definitely another story.

2 Factor actually requires that quotations have infer-able stack-effects, which places limitations on them in practice (surprisingly few, though!)

2 comments:

R.M. Loveland said...

Thanks for writing these essays, and this one in particular. It's the most comprehensible explanation of Scheme's pattern-matching I've found. Very nice that you provided examples for the reader, some of us learn best that way.

J.V. Toups said...

Thanks for the comment - it is always good to hear people are reading this stuff. If you are the same RM Loveland who authors science fiction, you might be interested in another project of mine: Subluminal, which is a science fiction epic in rhyming verse that I am writing.