Thursday, April 5, 2012

Shadchen-el introduction and defpattern tutorial

Lately I've been working on an Emacs Lisp library I'm pretty proud of, Shadchen. It implements extensible pattern matching, somewhat like Racket's Match facility for Emacs Lisp. In this tutorial/introduction I'll explain how to use Shadchen's various facilities and, most importantly, how to extend Shadchen itself with new patterns, which is an interesting subject in and of itself, combining compile time and run-time execution in interesting ways. If you already know about pattern matching, feel free to jump to the end, where I talk about writing non-trivial patterns using Shadchen's defpattern.

The Problem Shadchen Solves

Shadchen solves several problems you may not know you may have. In one sentence, Shadchen lets you concisely express both destructuring and type checking for complex data structures. It can be thought of as combining the features of cond, case and assert into one nice package. My experience is that this collection of features helps me write better code, since it encourages me to dilineate exactly the kind of data a function or form expects before doing anything with it.

Shadchen is also conservative - unless you provide a pattern that matches the input data, it will fail with a match error, so you know something is wrong before something strange happens.

For instance, suppose we were writing an interpreter for Lisp. It might look like:

(defun eval (form env)
 (cond form 
  ((symbolp form) (eval-symbol form env))
  ((listp form)
   (case (car form)
    (if (handle-if (cadr form)
                   (caddr form)
                   (caddr form) env)) 
    (let (handle-let 
          (cadr form)
          (cdddr form)))
    ...))))

Note that we have both a cond and a case here, and that after we test our data, we destructure it. Here is a similar piece of code using Shadchen:

(defun eval (form env)
 (match form 
  ((p #'symbolp s) (eval-symbol s env))
  ((list) nil)
  ((list 'if pred true-branch false-branch)
   (handle-if pred true-branch false-branch env))
  ((list-rest 'let (list-rest pairs) body)
   (handle-let pairs body env))
  ...))

(Notes: the pattern p passes when the predicate as its second argument is true on the match value, and then matches against the third argument. So (p #'symbolp s) matches only when form is a symbol and then binds s to that symbol. list matches when the input is a list and each pattern in the the list pattern matches each corresponding element in the list. list-rest is similar but any leftover parts of the list are matched against the final pattern.)

This code is more concise, and yet it is also much more explicit, both in that it provides better naming for values and it provides more explicit error checking. For instance, this version will only match if with three expressions, where as the previous evaluator would have been fine with the expression (if a b c d e f). This version also asserts explicitly that the binding part of the let needs to be a list. With a custom pattern we could also ensure it was a list of symbol/expression pairs in almost the same space.

It takes all kinds, but I found that once I got used to programming with pattern matching, it was hard to go back.

Other Rad Features of Shadchen

match-let

Shadchen wants to let you program in a functional style. To that end, in addition to the regular pattern matching form match, it also provides some other, nice features. Many algorithms involve examining some intermediate data, checking its structure somehow, and then recursively processing the next step. Shadchen allows this kind of thing with the Scheme-flavored match-let form.

The form match-let can be used exactly like let:

(match-let 
  ((x 10)
   (y 11))
 (+ x y))

But in each "binding" pair, the symbol may be replaced with any Shadchen pattern. Eg:

(match-let 
 ((x 10)
  (y 11)
  ((list q r s) (list 1 2 3)))
 (+ x y q r s))

Will give you 27. If any pattern fails, the form produces a match fail error, which means you can use match-let as a let form with tidy type checking.

Finally, a match-let form allows tail recursion. Invoking recur in a tail position inside the form causes the match-let to be re-entered without growing the stack. For instance:

(match-let 
 (((list x y) (list 0 0)))
  (if (< (+ x y) 10000)
      (recur (list (+ x 1) (+ x y)))
      (list x y)))

Results in (141 9870) and can't blow the stack. It is an error to invoke recur in a non-tail position, but because of limitations in Emacs Lisp, it is difficult to enforce this statically.

defun-match

The form defun-match lets you write functions which pattern match on their arguments and split their calculations across multiple bodies, in a bit like the style of Shen or Haskell.

For instance, suppose we have an animal simulator, where each animal is represented by a list, the first element of which is a symbol representing the animal name. We can say:

(defun-match- vocalize ((list-rest 'cat properties))
  "Cat vocalization."
  (message "Meow"))

(defun-match vocalize ((list-rest 'dog properties))
  "Dog vocalization."
  (message "Woof"))

Then:

(vocalize '(cat :name tess))
(vocalize '(dog :name bowzer))

Functions defined with defun-match can also use recur to re-enter themselves without growing the stack. Consider a function which causes a list of animals to vocalize:

(defun-match- vocalize-list (nil) nil)
(defun-match vocalize-list ((cons animal animals))
  (vocalize animal)
  (recur animals))

recur can dispatch to any of the bodies defined for the function and it doesn't grow the stack. It must be invoked from tail position, though non-tail calls can be affected by simply calling the function.

(N.B. defun-match- with that dangling minus sign causes previous bodies to be expunged before defining the indicated body.)

Extending Shadchen with defpattern

Shadchen is an extensible pattern matching facility. We can define new patterns much in the way we define new functions, although patterns are more like macros than functions. Let's look at a simple example, and then I'll guide you through a more complex example I just added to the library using the defpattern.

A quirk of Common and Emacs Lisp is that (car nil) is nil even though nil is not a cons cell, and so does not have a car or a cdr. I hate this behavior, because its quite evident that (cons nil some-list) is different from nil, but car can't tell that - the user has to do more inspection to find this out. Bugs waiting to happen, let me tell you.

However, I'm nothing if not accommodating, and so the cons pattern in Shadchen will, in fact, match against nil. So:

(match nil 
 ((cons a b) (list a b)))

Will be '(nil nil). Let's define a pattern which is like cons, but only matches against actual cons cells, into which category nil fails to fall.

(defpattern strict-cons (car cdr)
 `(p #'consp (cons ,car ,cdr)))

A defpattern body must evaluate to a legal shadchen pattern. Each argument to the defpattern is also a shadchen pattern. So this pattern reads "define a new pattern strict-cons, which first checks that the match value is a cons cell using the p pattern, and then matches the car and cdr of that cons cell against the patterns car and cdr.".

During the expansion of a shadchen pattern matching form, user defined patterns are looked up and their expansions are inserted into the macro expansion. In short, defpattern allows you to define new patterns in terms of old patterns.

This might seem very restrictive, but Shadchen provides primitive patterns that allow you to write arbitrarily complex pattern matchers that can perform rich computations on their way to rejecting or accepting a match.

Implementing concat, a non-trivial pattern

I just used defpattern to implement a pretty complex pattern, concat and it was something of a learning experience. Writing complex patterns definitely takes some thought and practice, but hopefully this tutorial will bootstrap users to a point where their own patterns can be implemented without too much pain.

What is so complicated about a concat pattern? Well, we want concat to match the concatenation of patterns which match strings. Eg:

(concat "dog" "cat")

Should match "dogcat". Writing a pattern that has this behavior is easy:

(defpattern concat (&rest strings)
 (reduce #'concat strings))

This pattern can't match subpatterns that are anything other than strings, however. We'd really like to be able to match, for instance:

(concat (and (or "dog" "cat") which) "dog")

against either "dogdog" or "catdog", binding which to whatever the initial string contents actually are. How can we do this?

Nailing down concat's semantics.

We want concat to function this way:

If the initial pattern is not a string, then try matching that pattern against larger and larger substrings until either you run out of string to match against, or you match. If you match, then match, again using concat with the unused patterns, against whatever is left of the string after you've removed the part that matched. Repeat until all patterns are exhausted and then make sure the string has been completely consumed too.

If the initial pattern is a string, then just cleave off the same length of characters from the input, and if they match, recursively match the rest. Here is the entry point:

(defpattern concat (&rest patterns)
  (cond 
   ((length=0 patterns)
    "")
   ((length=1 patterns)
    `(? #'stringp ,(car patterns)))
   (:otherwise
    (cond 
     ((stringp (car patterns))
      `(simple-concat ,@patterns))
     (:otherwise 
      `(full-concat 0 ,@patterns))))))

The :otherwise has all the meat, but we defer it to to other helper-patterns; simple-concat and full-concat. Simple concat looks like this:

(defpattern simple-concat (&rest patterns)
  (cond 
   ((length=0 patterns)
    "")
   ((length=1 patterns)
    `(? #'stringp ,(car patterns)))
   (:otherwise
    (let* ((the-string (car patterns))
           (static-len (length the-string)))
      `(and 
        (p #'stringp)
        (p (lambda (s)
             (>= (length s) ,static-len)))
        (p 
         (lambda (s)
           (string= (substring s 0 ,static-len) ,the-string)))
        (funcall (lambda (s)
                   (substring s ,static-len))
                 (concat ,@(cdr patterns))))))))

Look at the backquoted expression. It is an and pattern, which only succeeds if all the patterns beneath it also succeed. These patterns are (p #'stringp), which asserts that the input is a string, (p (lambda (s) (string= (substring s 0 ,static-len) ,the-string))) which asserts that the input is at least long enough to contain the string we want to match against. The next form asserts that the substring of the input equal to the pattern string in length is equal to the pattern. If this is true, then the pattern matches, and we use the funcall pattern to match against the rest of the string with the leftover patterns.

The funcall pattern takes the input to the match, applies a function to it, and then matches the output of that function application to the pattern provided as its third slot.

full-concat is more complex. Note that when we invoke full-concat we provide it an numerical first argument. This number tells the pattern how far into the string to match we've looked, so it starts at zero. After all the first pattern could match the empty string. full-concat looks like this2:

(defpattern full-concat (pivot &rest patterns)
  (assert (numberp pivot)
          ()
          "Pivot should be a number.")
  (cond 
   ((length=0 patterns)
    "")
   ((length=1 patterns)
    `(? #'stringp ,(car patterns)))
   (:otherwise
    `(and 
      (p (lambda (s)
           (>= (length s) ,pivot)))
      (or 
       (and (funcall
             (lambda (s)
               (substring s 0 ,pivot))
             ,(car patterns))
            (funcall 
             (lambda (s)
               (substring s ,pivot))
             (concat ,@(cdr patterns))))
       (full-concat ,(+ pivot 1) ,@patterns))))))

Here we use and again. We first check that the input string is long enough to grab the substring indicated by pivot. If this isn't true, the match fails. We then use the or pattern to indicate a branch. Either of the or patterns might succeed, but the first to do so is the only one that will happen. The first pattern to or uses funcall to peel off the substring of the input from 0 to pivot. If the initial pattern matches, then we use funcall again to get the rest of the string, and invoke concat again.

If this match fails, then we invoke full-concat again, but increment the pivot by one, indicating that we want to check against a larger substring.

If this is confusing, and it is understandible if it is, remember the following: when writing defpatterns, or is used for flow control, and is used to assert multiple things about the input, p is used to assert individual arbitrary conditions on the input, and funcall is used to transform the input for further matching. Recursive pattern expansion is used for iteration 1.

And feel free to contact me with questions, if they come up.


1 It is a lot like writing prolog, actually. Pattern matching is a significant distance from lisp to prolog.

2 After writing this I realized we can do better. If we get a match for the initial pattern, and then check the rest of the patterns, its possible they will fail because the initial match didn't consume enough of the string. It is simple to say, "if the subsequent match fails, keep increasing the pivot and trying again." I leave it as an exercise to the reader to figure out how to represent this trivial backtrackingish thing - but you can always check the source for the solution.

No comments: