Building a Parser-Generator in Erlang, Part 1

May 27, 2009 by Sean Cribbs

First, a confession: I’m a bit of a nerd when it comes to computer languages. I love reading about them, studying how they work, and implementing them. My favorite class in my undergrad was Compiler Construction — when everyone else was seeking incompletes to finish their projects, I turned mine in half-a-week early. Since I graduated, there have been advancements in language parsing research and applications. Of particular interest to me are Parsing Expression Grammars and their accompanying “packrat” parsers — so named for their memoizing features. Being a Rubyist, I was exposed to treetop, which is a nice toolkit for building packrat parsers from PEGs.

Now, to come to the point. At the end of April, I was flying out to Palo Alto, CA for Erlang Factory. My conference habit is to pick a side project, work on it during downtimes at the conference, and ask questions of knowledgeable presenters and attendees. I had just come off a month of doing some heavy Story-Driven Development with Cucumber, a Ruby integration/acceptance testing library that uses plain-text “stories” as test-cases. I wanted a tool in native Erlang for supporting this paradigm. eunit, etap, and the rest are great, but they are pale in comparison to the maturity of libraries like RSpec and Cucumber for Ruby.

Having built several parsers in Erlang before, most notably herml, I thought I’d take my usual path of using leex and yecc to build the parser, based on Cucumber’s PEG. Insanity ensued, and for these reasons:

LALR parser-generators like yecc are designed for context-free grammars (CFGs). PEGs are generally not context-free.
PEGs imply combining the lexical and syntactic analysis steps, so literal strings are common in the midst of nonterminals. This means that “tokenization” of a language defined in a PEG can be more difficult.
PEGs can convey certain patterns — specifically repetition, optional sub-expressions, and non-consuming lookahead — in much simpler forms than can CFGs. Factoring the corresponding PEG rules into a yecc-friendly format was confusing and error-prone.

So there I was at Erlang Factory (by the way, the BEST tech conference I’ve been to in a long time – go to the London one if you can!). I started asking around if people knew of any PEG/packrat libraries in Erlang. The new guys didn’t know of anything, and the most helpful response I got from an old-hand was from Robert Virding, was that Joe Armstrong had written a parser for a PEG. Richard Carlsson from Kreditor (incidentally, the creator of eunit) tried to convince me packrat parsers were a bad idea, although I was already aware of the tradeoffs.

So now my mission was to create a PEG/packrat library in Erlang, and then, once complete, create my Cucumber clone. My progress thus far:

I have a proposed syntax for the grammar definitions, although writing them directly in Erlang is not bad.
I have a parser-combinator library with the memoization features necessary for packrat-parsing.
I have a prototype of the “metagrammar” in both plaintext and Erlang.
A really cool parse-transform to make the parser prettier.

Things still to do:

Decide on a syntax for doing semantic transformations. I’m tempted to go simplistic here.
Verify my metagrammar with more unit tests.
Generate parsers from the output of the metagrammar.
Track line/column numbers for failures (and successes?).

In the next few posts, I’ll talk about the internals and some of the design decisions of the library.