Tuesday, October 18, 2011

Better-monad-parse Now Available

I've pushed better-monad-parse.el to my github repository. This is a refactoring of monad-parse, principally to make its interface more regular and grokable. It also is more intellegently polymorphic over different kinds of parseable objects, principally by making use of my implementation of multimethods instead of EIEIO, which, as far as I'm able to ascertain, doesn't allow method dispatch on built-in classes.

Unfortunately, its a bit slow, even compiled. I'd love to have advice on how it might be sped up, although I'm thinking its a lost cause - this kind of programming might be too taxing for Emacs Lisp.

Notes

This library follows the following conventions.

  1. Parsers are functions which accept a generic input state and return a list of cons cells, where the car is the monadic return value and the cdr is the unparsed input.
  2. All parsers defined by the library start with the symbol =, eg =nil, =item, etc.
  3. Parsers are often parameterized, that is, a parser will be created based on some value passed to a function. Functions which produce parsers begin with the symbol =>, eg: =>string, =>items, =>or, =>and and so forth.

New parsers should be most easily created with the parser special form, which introduces a monadic expression context for parser construction:

 (parser 
   (n <- =number)
   (letters <- (=>zero-plus-more =alpha))
   (m-return (list n letters)))

Produces a parser which binds n to a number parsed from the input, and then binds letters to a list of zero or more letters parsed from the input. All the parses defined by the library are generic, so they can parse buffers, strings or lists. There is a rich vocabulary of parsers to combine in the library, but if you wish to write your own generic parsers, you'll need to create a multimethod to cover all input types.

Understanding mulitmethods is also critical if you want to extend the parser library to cover additional input types.

This library is all about defining parsers, so there are special forms for that purpose:

(defparser =number-then-letters
   (n <- =number)
   (letters <- (=>zero-plus-more =alpha))
   (m-return (list n letters)))

Sets =number-then-letters symbol and function value to the parser produced by the body expression evaluated in the parser monad. This allows you to call the parser as a function and also refer to it as a variable, since both actions are common.

defparser allows the creation of parametric parsers too:

(defparser (=>specific-number-then-letters n)
   (n-parsed <- =number)
   (if (= n n-parsed)
       (parser 
          (letters <- (=>zero-plus-more =alpha))
          (m-return n letters))
       =nil))

defun's a function which produces the indicated parser. Arguments are automatically lexically closed over. This version doesn't set the symbol-value of the name.

As stated above, the library performance is a bit slow still - I haven't determined to what that slowness owes, but I recommend byte-compiling everything.

I'm almost certainly going to port this whole thing to Common Lisp at some point, so I can call out to it via slime. That should help with performance, as I have a nagging feeling that the bottleneck is my ad-hoc multimethod dispatch.


Friday, October 7, 2011

Update to my Monadic Parser Combinator Elisp Library (+ examples/tutorial)

I've updated my emacs lisp monadic parser combinator library. The underlying mechanisms for building parsers remain unchanged, but I've introduced a few new syntactic forms to make writing parses a little easier than it used to be.

If you want to understand how the library works, under the hood, check out my screencast about monadic parser combinators in Elisp located here.

This post will introduce those features and give a quick tour of how I'm actually using them in my day to day life as a researcher.

To use the library, download my code from github, add the directory to your emacs path, and then:

(require 'monad-parse)

Which will load the parser monad and its associated functions. It really helps to byte-compile this library, because it relies on some serious macro magic behind the scenes. Performances is tolerable for the compiled version, but can be a real drag without compilation.

If you get an error about filename being undefined during compilation, evaluate:

(setq filename nil)

in your scratch buffer. This is some emacs bug that has to do with CEDIT, which I don't even use. Beats me!

We'll be using an example from my day job today. I work in an electrochemistry lab. My background is in data analysis and modeling, and so I get lots of files containing data from my colleagues. The files are given human-readable names describing the kinds of data they contain:

firstElectrode/1000uM_DA_.4-1-3bg_CV.txt
firstElectrode/500uM_DA_.4-1-3bg_CV.txt
firstElectrode/250uM_DA_.4-1-3bg_CV.txt
firstElectrode/100uM_DA_.4-1-3bg_CV.txt
firstElectrode/1000uM_DA_.4-1-3cv_CV.txt
firstElectrode/500uM_DA_.4-1-3cv_CV.txt
firstElectrode/250uM_DA_.4-1-3cv_CV.txt
firstElectrode/100uM_DA_.4-1-3cv_CV.txt
firstElectrode/1000uM_DA_.4-1-3bg_CV1.txt
firstElectrode/500uM_DA_.4-1-3bg_CV1.txt
firstElectrode/250uM_DA_.4-1-3bg_CV1.txt
firstElectrode/100uM_DA_.4-1-3bg_CV1.txt
firstElectrode/1000uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/500uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/250uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt

I do analysis which requires that I grab all these files and process them together, and I need to know things like what dopamine concentration each file has and other information which is in the file name, but not easy for a computer to get at. My strategy is to write a parser which converts the human readable filenames into a data structure, and then generate a new, very easy to parse, filename to copy the file to.

This is because I'm doing the data analysis in Matlab, where its hard to write a general purpose parser. Better to let emacs do the work of parsing the complex names and generate simple names for Matlab.

Building Up a Parser

We'll start with preliminaries, little parsers we need to break up the file names:

(defparser slash (=string "/"))
(defparser underscore (=string "_"))
(defparser dash (=string "-"))

Next we need to write parsers for larger bits of text. These are still relatively simple.

(defparser experiment-label
     (characters <- (=zero-or-more (=not (=string "/"))))
     (m-return
      (alist>> :label (coerce characters 'string))))

When parsing strings and buffers, =zero-or-more creates a parser which returns a list of character codes. We bind that list to a variable, characters and coerce it into a string, which is returned to the caller as the monadic return value.

defparser's body allows you to write code more or less like you would in Haskell's do notation. It expands to a defvar expression, but the body is placed inside a parser special form. parser lets you create anonymous parsers.

We can test this parser like so:

(parse-string experiment-label "abcdef/gea") ;-> (:label "abcdef")

parse-string returns just the parsed part of the input, and throws away the leftovers. It also converts its second argument to a parser input stream. You can use a parser directly like this:

(funcall experiment-label (->in "abcdef/gea"))

Which returns the entire parser state:

(((:label "abcdef") . [object <parser-input-string> "<parser-input-string>" "abcdef/gea" 6]))

The library supports non-deterministic parsing, so parsers return a list of all the parser states produced, each of which is a pair containing the parsed value and an object represening the rest of the parse state. In most cases, you'll want to use parse-string. But the library supports buffer parsing and list parsing.

Now lets build our first multi-part parser. This will parse a quantity and its units.

(defparser amount
  (magnitude <- (=number->number))
  (units <- (=string "uM" "nM"))
  (m-return (alist>> :amount magnitude :units units)))

This parser makes use of the =number->number function, which returns a parser which converts a string representation of a base 10 decimal number into an actual number. We then use =string to capture units of either uM or nM, the only units in my data set.

These are packed into an association list for later use.

Data files come in two flavors, "background" and "cv". This parser parses the string representing one of them:

(defparser cv-type
  (type <- (=string "bg" "cv"))
  (m-return (alist>> :cvType type)))

As above, the =string combinator produces a parser which matches any of the strings it receives.

A parser for substances looks like, then:

(defparser substance
  (sub <- (=stringi "DA" "AA" "pH" "H2o2"))
  (m-return (alist>> :substance sub)))

Here we use a variant of =string, =stringi which matches any of the input strings without considering case. It returns the actual match.

Now we need a parser to deal with that soup of numbers and dashes which is supposed to represent a voltage range:

michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt  
                          ------
                          this stuff

Here is the parser:

(defparser voltage-range
  (low <- (=number->number))
  dash
  (pre-decimal <- (=number->number))
  (post-decimal <- (=number->number))
  (m-return
   (alist>> :voltageLow low
            :voltageHigh (+ pre-decimal (/ post-decimal 10.0)))))

Finally, some experiments contain multiple electrodes. If this is the case, the electrode number appears. We need a parser that tries to find a number and returns 0 if it can't find one:

(defparser electrode-number
  (n <- (=maybe (=number->number)))
  (m-return (alist>> :electrode (if n n 0))))

Now we put them all together:

(defparser file
  (label <- experiment-label)
  slash
  (amt <- amount)
  underscore
  (sub <- substance)
  underscore
  (rng <- voltage-range)
  (type <- cv-type)
  underscore
  (=string "CV")
  (el <- electrode-number)
  (m-return
   (append label amt sub rng type el)))

And we make it work:

 (parse-string file "michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt")

Produces:

 ((:label "michaelElectrode") 
  (:amount 100) 
  (:units "uM") 
  (:substance "DA") 
  (:voltageLow 0.4) 
  (:voltageHigh 1.3) 
  (:cvType "cv") 
  (:electrode 1))

I have a utility function in my chemistry library that will convert such an alist into a much easier to parse form:

(require 'chemistry)
(condition->filename '((:label "michaelElectrode") 
  (:amount 100) 
  (:units "uM") 
  (:substance "DA") 
  (:voltageLow 0.4) 
  (:voltageHigh 1) 
  (:cvType "cv") 
  (:electrode 1)))

"label=michaelElectrode=amount=000100=units=uM=substance=DA=voltageLow=0.400000=voltageHigh=000001=cvType=cv=electrode=000001"

That is it!

Extra Features

You can use defparser to create parameterized parsers too. That is, parsers which depend on some input. Suppose you want a parser which parses anything numerically equal to a particular value:

(defparser (parse-this-number n)
  (parsed-number <- (=number->number))
  (if (= parsed-number n) (m-return parsed-number)
      parser-fail))

(parse-string (parse-this-number 10) "10") ;-> 10
(parse-string (parse-this-number 10) "100") ;-> nil

Here is the github link again, if you want to play with the library. Documentation is coming, eventually!

I'm going to do a full rewrite of this library eventually, but it will be under a new name. This guy should stabilize eventually.