I've updated my emacs lisp monadic parser combinator library. The underlying mechanisms for building parsers remain unchanged, but I've introduced a few new syntactic forms to make writing parses a little easier than it used to be.
If you want to understand how the library works, under the hood, check out my screencast about monadic parser combinators in Elisp located here.
This post will introduce those features and give a quick tour of how I'm actually using them in my day to day life as a researcher.
To use the library, download my code from github, add the directory to your emacs path, and then:
Which will load the parser monad and its associated functions. It really helps to byte-compile this library, because it relies on some serious macro magic behind the scenes. Performances is tolerable for the compiled version, but can be a real drag without compilation.
If you get an error about
filename being undefined during
(setq filename nil)
in your scratch buffer. This is some emacs bug that has to do with CEDIT, which I don't even use. Beats me!
We'll be using an example from my day job today. I work in an electrochemistry lab. My background is in data analysis and modeling, and so I get lots of files containing data from my colleagues. The files are given human-readable names describing the kinds of data they contain:
firstElectrode/1000uM_DA_.4-1-3bg_CV.txt firstElectrode/500uM_DA_.4-1-3bg_CV.txt firstElectrode/250uM_DA_.4-1-3bg_CV.txt firstElectrode/100uM_DA_.4-1-3bg_CV.txt firstElectrode/1000uM_DA_.4-1-3cv_CV.txt firstElectrode/500uM_DA_.4-1-3cv_CV.txt firstElectrode/250uM_DA_.4-1-3cv_CV.txt firstElectrode/100uM_DA_.4-1-3cv_CV.txt firstElectrode/1000uM_DA_.4-1-3bg_CV1.txt firstElectrode/500uM_DA_.4-1-3bg_CV1.txt firstElectrode/250uM_DA_.4-1-3bg_CV1.txt firstElectrode/100uM_DA_.4-1-3bg_CV1.txt firstElectrode/1000uM_DA_.4-1-3cv_CV1.txt michaelElectrode/500uM_DA_.4-1-3cv_CV1.txt michaelElectrode/250uM_DA_.4-1-3cv_CV1.txt michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt
I do analysis which requires that I grab all these files and process them together, and I need to know things like what dopamine concentration each file has and other information which is in the file name, but not easy for a computer to get at. My strategy is to write a parser which converts the human readable filenames into a data structure, and then generate a new, very easy to parse, filename to copy the file to.
This is because I'm doing the data analysis in Matlab, where its hard to write a general purpose parser. Better to let emacs do the work of parsing the complex names and generate simple names for Matlab.
Building Up a Parser
We'll start with preliminaries, little parsers we need to break up the file names:
(defparser slash (=string "/")) (defparser underscore (=string "_")) (defparser dash (=string "-"))
Next we need to write parsers for larger bits of text. These are still relatively simple.
(defparser experiment-label (characters <- (=zero-or-more (=not (=string "/")))) (m-return (alist>> :label (coerce characters 'string))))
When parsing strings and buffers,
=zero-or-more creates a parser
which returns a list of character codes. We bind that list to a
characters and coerce it into a string, which is returned
to the caller as the monadic return value.
defparser's body allows you to write code more or less like you
would in Haskell's
do notation. It expands to a
expression, but the body is placed inside a
parser special form.
parser lets you create anonymous parsers.
We can test this parser like so:
(parse-string experiment-label "abcdef/gea") ;-> (:label "abcdef")
parse-string returns just the parsed part of the input, and throws
away the leftovers. It also converts its second argument to a parser
input stream. You can use a parser directly like this:
(funcall experiment-label (->in "abcdef/gea"))
Which returns the entire parser state:
(((:label "abcdef") . [object <parser-input-string> "<parser-input-string>" "abcdef/gea" 6]))
The library supports non-deterministic parsing, so parsers return a
list of all the parser states produced, each of which is a pair
containing the parsed value and an object represening the rest of the
parse state. In most cases, you'll want to use
the library supports buffer parsing and list parsing.
Now lets build our first multi-part parser. This will parse a quantity and its units.
(defparser amount (magnitude <- (=number->number)) (units <- (=string "uM" "nM")) (m-return (alist>> :amount magnitude :units units)))
This parser makes use of the
=number->number function, which returns
a parser which converts a string representation of a base 10 decimal
number into an actual number. We then use
=string to capture units
nM, the only units in my data set.
These are packed into an association list for later use.
Data files come in two flavors, "background" and "cv". This parser parses the string representing one of them:
(defparser cv-type (type <- (=string "bg" "cv")) (m-return (alist>> :cvType type)))
As above, the
=string combinator produces a parser which matches any
of the strings it receives.
A parser for substances looks like, then:
(defparser substance (sub <- (=stringi "DA" "AA" "pH" "H2o2")) (m-return (alist>> :substance sub)))
Here we use a variant of
=stringi which matches any of
the input strings without considering case. It returns the actual
Now we need a parser to deal with that soup of numbers and dashes which is supposed to represent a voltage range:
michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt ------ this stuff
Here is the parser:
(defparser voltage-range (low <- (=number->number)) dash (pre-decimal <- (=number->number)) (post-decimal <- (=number->number)) (m-return (alist>> :voltageLow low :voltageHigh (+ pre-decimal (/ post-decimal 10.0)))))
Finally, some experiments contain multiple electrodes. If this is the case, the electrode number appears. We need a parser that tries to find a number and returns 0 if it can't find one:
(defparser electrode-number (n <- (=maybe (=number->number))) (m-return (alist>> :electrode (if n n 0))))
Now we put them all together:
(defparser file (label <- experiment-label) slash (amt <- amount) underscore (sub <- substance) underscore (rng <- voltage-range) (type <- cv-type) underscore (=string "CV") (el <- electrode-number) (m-return (append label amt sub rng type el)))
And we make it work:
(parse-string file "michaelElectrode/100uM_DA_.4-1-3cv_CV1.txt")
((:label "michaelElectrode") (:amount 100) (:units "uM") (:substance "DA") (:voltageLow 0.4) (:voltageHigh 1.3) (:cvType "cv") (:electrode 1))
I have a utility function in my chemistry library that will convert such an alist into a much easier to parse form:
(require 'chemistry) (condition->filename '((:label "michaelElectrode") (:amount 100) (:units "uM") (:substance "DA") (:voltageLow 0.4) (:voltageHigh 1) (:cvType "cv") (:electrode 1))) "label=michaelElectrode=amount=000100=units=uM=substance=DA=voltageLow=0.400000=voltageHigh=000001=cvType=cv=electrode=000001"
That is it!
You can use
defparser to create parameterized parsers too. That
is, parsers which depend on some input. Suppose you want a parser
which parses anything numerically equal to a particular value:
(defparser (parse-this-number n) (parsed-number <- (=number->number)) (if (= parsed-number n) (m-return parsed-number) parser-fail)) (parse-string (parse-this-number 10) "10") ;-> 10 (parse-string (parse-this-number 10) "100") ;-> nil
Here is the github link again, if you want to play with the library. Documentation is coming, eventually!
I'm going to do a full rewrite of this library eventually, but it will be under a new name. This guy should stabilize eventually.