Friday, October 7, 2011

Update to my Monadic Parser Combinator Elisp Library (+ examples/tutorial)

I've updated my emacs lisp monadic parser combinator library. The underlying mechanisms for building parsers remain unchanged, but I've introduced a few new syntactic forms to make writing parses a little easier than it used to be.

If you want to understand how the library works, under the hood, check out my screencast about monadic parser combinators in Elisp located here.

This post will introduce those features and give a quick tour of how I'm actually using them in my day to day life as a researcher.

To use the library, download my code from github, add the directory to your emacs path, and then:

(require 'monad-parse)

Which will load the parser monad and its associated functions. It really helps to byte-compile this library, because it relies on some serious macro magic behind the scenes. Performances is tolerable for the compiled version, but can be a real drag without compilation.

If you get an error about filename being undefined during compilation, evaluate:

(setq filename nil)

in your scratch buffer. This is some emacs bug that has to do with CEDIT, which I don't even use. Beats me!

We'll be using an example from my day job today. I work in an electrochemistry lab. My background is in data analysis and modeling, and so I get lots of files containing data from my colleagues. The files are given human-readable names describing the kinds of data they contain:

firstElectrode/1000uM_DA_.4-1-3bg_CV.txt
firstElectrode/500uM_DA_.4-1-3bg_CV.txt
firstElectrode/250uM_DA_.4-1-3bg_CV.txt
firstElectrode/100uM_DA_.4-1-3bg_CV.txt
firstElectrode/1000uM_DA_.4-1-3cv_CV.txt
firstElectrode/500uM_DA_.4-1-3cv_CV.txt
firstElectrode/250uM_DA_.4-1-3cv_CV.txt
firstElectrode/100uM_DA_.4-1-3cv_CV.txt
firstElectrode/1000uM_DA_.4-1-3bg_CV1.txt
firstElectrode/500uM_DA_.4-1-3bg_CV1.txt
firstElectrode/250uM_DA_.4-1-3bg_CV1.txt
firstElectrode/100uM_DA_.4-1-3bg_CV1.txt
firstElectrode/1000uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/500uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/250uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt

I do analysis which requires that I grab all these files and process them together, and I need to know things like what dopamine concentration each file has and other information which is in the file name, but not easy for a computer to get at. My strategy is to write a parser which converts the human readable filenames into a data structure, and then generate a new, very easy to parse, filename to copy the file to.

This is because I'm doing the data analysis in Matlab, where its hard to write a general purpose parser. Better to let emacs do the work of parsing the complex names and generate simple names for Matlab.

Building Up a Parser

We'll start with preliminaries, little parsers we need to break up the file names:

(defparser slash (=string "/"))
(defparser underscore (=string "_"))
(defparser dash (=string "-"))

Next we need to write parsers for larger bits of text. These are still relatively simple.

(defparser experiment-label
     (characters <- (=zero-or-more (=not (=string "/"))))
     (m-return
      (alist>> :label (coerce characters 'string))))

When parsing strings and buffers, =zero-or-more creates a parser which returns a list of character codes. We bind that list to a variable, characters and coerce it into a string, which is returned to the caller as the monadic return value.

defparser's body allows you to write code more or less like you would in Haskell's do notation. It expands to a defvar expression, but the body is placed inside a parser special form. parser lets you create anonymous parsers.

We can test this parser like so:

(parse-string experiment-label "abcdef/gea") ;-> (:label "abcdef")

parse-string returns just the parsed part of the input, and throws away the leftovers. It also converts its second argument to a parser input stream. You can use a parser directly like this:

(funcall experiment-label (->in "abcdef/gea"))

Which returns the entire parser state:

(((:label "abcdef") . [object <parser-input-string> "<parser-input-string>" "abcdef/gea" 6]))

The library supports non-deterministic parsing, so parsers return a list of all the parser states produced, each of which is a pair containing the parsed value and an object represening the rest of the parse state. In most cases, you'll want to use parse-string. But the library supports buffer parsing and list parsing.

Now lets build our first multi-part parser. This will parse a quantity and its units.

(defparser amount
  (magnitude <- (=number->number))
  (units <- (=string "uM" "nM"))
  (m-return (alist>> :amount magnitude :units units)))

This parser makes use of the =number->number function, which returns a parser which converts a string representation of a base 10 decimal number into an actual number. We then use =string to capture units of either uM or nM, the only units in my data set.

These are packed into an association list for later use.

Data files come in two flavors, "background" and "cv". This parser parses the string representing one of them:

(defparser cv-type
  (type <- (=string "bg" "cv"))
  (m-return (alist>> :cvType type)))

As above, the =string combinator produces a parser which matches any of the strings it receives.

A parser for substances looks like, then:

(defparser substance
  (sub <- (=stringi "DA" "AA" "pH" "H2o2"))
  (m-return (alist>> :substance sub)))

Here we use a variant of =string, =stringi which matches any of the input strings without considering case. It returns the actual match.

Now we need a parser to deal with that soup of numbers and dashes which is supposed to represent a voltage range:

michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt  
                          ------
                          this stuff

Here is the parser:

(defparser voltage-range
  (low <- (=number->number))
  dash
  (pre-decimal <- (=number->number))
  (post-decimal <- (=number->number))
  (m-return
   (alist>> :voltageLow low
            :voltageHigh (+ pre-decimal (/ post-decimal 10.0)))))

Finally, some experiments contain multiple electrodes. If this is the case, the electrode number appears. We need a parser that tries to find a number and returns 0 if it can't find one:

(defparser electrode-number
  (n <- (=maybe (=number->number)))
  (m-return (alist>> :electrode (if n n 0))))

Now we put them all together:

(defparser file
  (label <- experiment-label)
  slash
  (amt <- amount)
  underscore
  (sub <- substance)
  underscore
  (rng <- voltage-range)
  (type <- cv-type)
  underscore
  (=string "CV")
  (el <- electrode-number)
  (m-return
   (append label amt sub rng type el)))

And we make it work:

 (parse-string file "michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt")

Produces:

 ((:label "michaelElectrode") 
  (:amount 100) 
  (:units "uM") 
  (:substance "DA") 
  (:voltageLow 0.4) 
  (:voltageHigh 1.3) 
  (:cvType "cv") 
  (:electrode 1))

I have a utility function in my chemistry library that will convert such an alist into a much easier to parse form:

(require 'chemistry)
(condition->filename '((:label "michaelElectrode") 
  (:amount 100) 
  (:units "uM") 
  (:substance "DA") 
  (:voltageLow 0.4) 
  (:voltageHigh 1) 
  (:cvType "cv") 
  (:electrode 1)))

"label=michaelElectrode=amount=000100=units=uM=substance=DA=voltageLow=0.400000=voltageHigh=000001=cvType=cv=electrode=000001"

That is it!

Extra Features

You can use defparser to create parameterized parsers too. That is, parsers which depend on some input. Suppose you want a parser which parses anything numerically equal to a particular value:

(defparser (parse-this-number n)
  (parsed-number <- (=number->number))
  (if (= parsed-number n) (m-return parsed-number)
      parser-fail))

(parse-string (parse-this-number 10) "10") ;-> 10
(parse-string (parse-this-number 10) "100") ;-> nil

Here is the github link again, if you want to play with the library. Documentation is coming, eventually!

I'm going to do a full rewrite of this library eventually, but it will be under a new name. This guy should stabilize eventually.


No comments: