I've updated my emacs lisp monadic parser combinator library. The
underlying mechanisms for building parsers remain unchanged, but I've
introduced a few new syntactic forms to make writing parses a little
easier than it used to be.
If you want to understand how the library works, under the hood, check
out my screencast about monadic parser combinators in Elisp located
here.
This post will introduce those features and give a quick tour of how
I'm actually using them in my day to day life as a researcher.
To use the library, download my code from github, add the
directory to your emacs path, and then:
(require 'monad-parse)
Which will load the parser monad and its associated functions. It
really helps to byte-compile this library, because it relies on some
serious macro magic behind the scenes. Performances is tolerable for
the compiled version, but can be a real drag without compilation.
If you get an error about filename being undefined during
compilation, evaluate:
(setq filename nil)
in your scratch buffer. This is some emacs bug that has to do with
CEDIT, which I don't even use. Beats me!
We'll be using an example from my day job today. I work in an
electrochemistry lab. My background is in data analysis and modeling,
and so I get lots of files containing data from my colleagues. The
files are given human-readable names describing the kinds of data they
contain:
firstElectrode/1000uM_DA_.4-1-3bg_CV.txt
firstElectrode/500uM_DA_.4-1-3bg_CV.txt
firstElectrode/250uM_DA_.4-1-3bg_CV.txt
firstElectrode/100uM_DA_.4-1-3bg_CV.txt
firstElectrode/1000uM_DA_.4-1-3cv_CV.txt
firstElectrode/500uM_DA_.4-1-3cv_CV.txt
firstElectrode/250uM_DA_.4-1-3cv_CV.txt
firstElectrode/100uM_DA_.4-1-3cv_CV.txt
firstElectrode/1000uM_DA_.4-1-3bg_CV1.txt
firstElectrode/500uM_DA_.4-1-3bg_CV1.txt
firstElectrode/250uM_DA_.4-1-3bg_CV1.txt
firstElectrode/100uM_DA_.4-1-3bg_CV1.txt
firstElectrode/1000uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/500uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/250uM_DA_.4-1-3cv_CV1.txt
michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt
I do analysis which requires that I grab all these files and process
them together, and I need to know things like what dopamine
concentration each file has and other information which is in the file
name, but not easy for a computer to get at. My strategy is to write
a parser which converts the human readable filenames into a data
structure, and then generate a new, very easy to parse, filename to
copy the file to.
This is because I'm doing the data analysis in Matlab, where its hard
to write a general purpose parser. Better to let emacs do the work of
parsing the complex names and generate simple names for Matlab.
Building Up a Parser
We'll start with preliminaries, little parsers we need to break up the
file names:
(defparser slash (=string "/"))
(defparser underscore (=string "_"))
(defparser dash (=string "-"))
Next we need to write parsers for larger bits of text. These are
still relatively simple.
(defparser experiment-label
(characters <- (=zero-or-more (=not (=string "/"))))
(m-return
(alist>> :label (coerce characters 'string))))
When parsing strings and buffers, =zero-or-more creates a parser
which returns a list of character codes. We bind that list to a
variable, characters and coerce it into a string, which is returned
to the caller as the monadic return value.
defparser's body allows you to write code more or less like you
would in Haskell's do notation. It expands to a defvar
expression, but the body is placed inside a parser special form.
parser lets you create anonymous parsers.
We can test this parser like so:
(parse-string experiment-label "abcdef/gea") ;-> (:label "abcdef")
parse-string returns just the parsed part of the input, and throws
away the leftovers. It also converts its second argument to a parser
input stream. You can use a parser directly like this:
(funcall experiment-label (->in "abcdef/gea"))
Which returns the entire parser state:
(((:label "abcdef") . [object <parser-input-string> "<parser-input-string>" "abcdef/gea" 6]))
The library supports non-deterministic parsing, so parsers return a
list of all the parser states produced, each of which is a pair
containing the parsed value and an object represening the rest of the
parse state. In most cases, you'll want to use parse-string. But
the library supports buffer parsing and list parsing.
Now lets build our first multi-part parser. This will parse a
quantity and its units.
(defparser amount
(magnitude <- (=number->number))
(units <- (=string "uM" "nM"))
(m-return (alist>> :amount magnitude :units units)))
This parser makes use of the =number->number function, which returns
a parser which converts a string representation of a base 10 decimal
number into an actual number. We then use =string to capture units
of either uM or nM, the only units in my data set.
These are packed into an association list for later use.
Data files come in two flavors, "background" and "cv". This parser
parses the string representing one of them:
(defparser cv-type
(type <- (=string "bg" "cv"))
(m-return (alist>> :cvType type)))
As above, the =string combinator produces a parser which matches any
of the strings it receives.
A parser for substances looks like, then:
(defparser substance
(sub <- (=stringi "DA" "AA" "pH" "H2o2"))
(m-return (alist>> :substance sub)))
Here we use a variant of =string, =stringi which matches any of
the input strings without considering case. It returns the actual
match.
Now we need a parser to deal with that soup of numbers and dashes
which is supposed to represent a voltage range:
michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt
------
this stuff
Here is the parser:
(defparser voltage-range
(low <- (=number->number))
dash
(pre-decimal <- (=number->number))
(post-decimal <- (=number->number))
(m-return
(alist>> :voltageLow low
:voltageHigh (+ pre-decimal (/ post-decimal 10.0)))))
Finally, some experiments contain multiple electrodes. If this is the
case, the electrode number appears. We need a parser that tries to
find a number and returns 0 if it can't find one:
(defparser electrode-number
(n <- (=maybe (=number->number)))
(m-return (alist>> :electrode (if n n 0))))
Now we put them all together:
(defparser file
(label <- experiment-label)
slash
(amt <- amount)
underscore
(sub <- substance)
underscore
(rng <- voltage-range)
(type <- cv-type)
underscore
(=string "CV")
(el <- electrode-number)
(m-return
(append label amt sub rng type el)))
And we make it work:
(parse-string file "michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt")
Produces:
((:label "michaelElectrode")
(:amount 100)
(:units "uM")
(:substance "DA")
(:voltageLow 0.4)
(:voltageHigh 1.3)
(:cvType "cv")
(:electrode 1))
I have a utility function in my chemistry library that will convert
such an alist into a much easier to parse form:
(require 'chemistry)
(condition->filename '((:label "michaelElectrode")
(:amount 100)
(:units "uM")
(:substance "DA")
(:voltageLow 0.4)
(:voltageHigh 1)
(:cvType "cv")
(:electrode 1)))
"label=michaelElectrode=amount=000100=units=uM=substance=DA=voltageLow=0.400000=voltageHigh=000001=cvType=cv=electrode=000001"
That is it!
Extra Features
You can use defparser to create parameterized parsers too. That
is, parsers which depend on some input. Suppose you want a parser
which parses anything numerically equal to a particular value:
(defparser (parse-this-number n)
(parsed-number <- (=number->number))
(if (= parsed-number n) (m-return parsed-number)
parser-fail))
(parse-string (parse-this-number 10) "10") ;-> 10
(parse-string (parse-this-number 10) "100") ;-> nil
Here is the github link again, if you want to play
with the library. Documentation is coming, eventually!
I'm going to do a full rewrite of this library eventually, but it will
be under a new name. This guy should stabilize eventually.