Building a Parser-Generator in Erlang, Part 2

One of the beautiful aspects of programming in Erlang is its ability to express certain types of concepts very succinctly and powerfully. The primary feature I’m referring to in this case is higher-order functions, which is a common feature of many functional and multi-paradigm languages.

Higher-order functions come in two types:

  1. functions that produce functions as their return values (currying)
  2. functions that take functions as arguments (composition)

Higher order functions depend on two features in most languages that support them:

  1. closure, the binding of scope at runtime – which supports currying
  2. a “function” datatype (Lisp’s lambda, Ruby’s proc, Erlang’s fun) – which supports composition

So what does this have to do with parsing? Parsing is a natural fit for functional languages — Lisp is famous for this, actually — because the nested data structures parsers must produce fit well with recursive algorithms. Furthermore, a lot of the work a parser has to do can be generalized using higher-order functions — things like consuming strings, choosing between alternatives, variable repetition, and predicates. In this post, I’m going to give you an overview of the way that my parsing library uses both types of higher-order functions to build up a parser that looks only slightly different from the original grammar.

The first thing to do was to decide on a protocol for the HOFs, or as they should be properly called, parser combinators, that they all can implement. Essentially, you want to be able to operate on all funs in the same manner so that they are interchangeable, so that they can be composed or combined. The interface should opaque and consistent at runtime. Deciding on this is actually pretty easy, because I cribbed a lot of my code from an Erlang translation of Haskell’s “parsec” library that I found linked on Robert Virding’s blog. Every parser combinator returns a function that:

  1. Takes the current input buffer as its sole argument
  2. Returns a two-tuple of the parsed result and the remaining input buffer when it passes
  3. Returns the atom fail when it fails

So let’s take a look at one the simplest combinators, consuming a string.

string(S) ->
  fun(Input) ->
      case lists:prefix(S, Input) of
        true -> {S,  Input -- S};
        _ -> fail
      end
  end.

The returned function from the combinator will succeed if the string you pass, S, is on the head of the Input buffer (list). (For those of you not familiar, strings in Erlang are lists of bytes.) Notice how the combinator itself doesn’t actually do any work, it simply curries a function with the string you want to match. For those from a Gang-of-Four OO background, this is similar the Command pattern, except our return value is a function, not an object.

A slightly more interesting combinator is the assert combinator which uses both composition and currying to implement non-consuming positive lookahead. That is, it will verify that the given parser will succeed, but will not advance the input buffer under any condition. Here’s the (surprisingly short) code:

assert(P) ->
  fun(Input) ->
      case P(Input) of
        fail -> fail;
        _ -> {[], Input}
      end
  end.

The combinator takes the parser function P, and wraps it another function which will succeed if P succeeds, but will keep the input buffer at its current place.

Now, these are interesting and all, but what use are they? As it turns out, this pattern lets me have a one-to-one correspondence between features of the types of PEG I want to support and functions that are used to parse them, as shown in this table:

PEG Erlang Meaning
. anything() match any character (only fails on end of buffer)
"some string" string("some string") match a string
[A-Z0-9] charclass("[A-Z0-9]") match one character from a class, like regexps
&a assert(fun a/1) positive lookahead, as described above
!a not_(fun a/1) negative lookahead, make sure the parse fails, but consume no input
a? optional(fun a/1) match an optional subexpression
a / b choose([fun a/1, fun b/1]) ordered choice between alternatives
a b seq([fun a/1, fun b/1]) match a sequence of terminals and nonterminals
a+ one_or_more(fun a/1) greedy repetition with at least one match
a* zero_or_more(fun a/1) optional greedy repetition (any number of matches, including none)

With this one-to-one correspondence, look how nicely the code for Brian Ford’s “arithmetic” grammar comes out:

-module(arithmetic).
-include_lib("peg/include/peg.hrl").

?root(additive).

rule(additive) ->
  peg:choose([peg:seq([fun multitive/1,
                       peg:string("+"),
                       fun additive/1]),
              fun multitive/1]);

rule(multitive) ->
  peg:choose([peg:seq([fun primary/1,
                       peg:string("*"),
                       fun multitive/1]),
              fun primary/1]);

rule(primary) ->
  peg:choose([peg:seq([peg:string("("),
                       fun additive/1,
                       peg:string(")")]),
              fun decimal/1]);

rule(decimal) ->
  peg:charclass("[0-9]").

Now, I must admit that there is some secret awesome-sauce going on here, namely an Erlang parse transform that is invoked via the included header file, but I’ll cover that in a later post. Nevertheless, the point stands, that this code was extremely easy to write given the original PEG:

additive <- multitive "+" additive / multitive
multitive <- primary "*" multitive / primary
primary <- "(" additive ")" / decimal
decimal <- [0-9]

The PEG is still more expressive than the Erlang version, but that is one of the shortest parsers I have ever written, and much more comprehensible than something generated by an LALR generator for sure.

Coming up in future posts: adding packrat-style memoization, using parse_transform to keep the parser module looking clean, adding semantic transformation, and tracking line/column numbers of the input.

Building a Parser-Generator in Erlang, Part 1

First, a confession: I’m a bit of a nerd when it comes to computer languages. I love reading about them, studying how they work, and implementing them. My favorite class in my undergrad was Compiler Construction — when everyone else was seeking incompletes to finish their projects, I turned mine in half-a-week early. Since I graduated, there have been advancements in language parsing research and applications. Of particular interest to me are Parsing Expression Grammars and their accompanying “packrat” parsers — so named for their memoizing features. Being a Rubyist, I was exposed to treetop, which is a nice toolkit for building packrat parsers from PEGs.

Now, to come to the point. At the end of April, I was flying out to Palo Alto, CA for Erlang Factory. My conference habit is to pick a side project, work on it during downtimes at the conference, and ask questions of knowledgeable presenters and attendees. I had just come off a month of doing some heavy Story-Driven Development with Cucumber, a Ruby integration/acceptance testing library that uses plain-text “stories” as test-cases. I wanted a tool in native Erlang for supporting this paradigm. eunit, etap, and the rest are great, but they are pale in comparison to the maturity of libraries like RSpec and Cucumber for Ruby.

Having built several parsers in Erlang before, most notably herml, I thought I’d take my usual path of using leex and yecc to build the parser, based on Cucumber’s PEG. Insanity ensued, and for these reasons:

  1. LALR parser-generators like yecc are designed for context-free grammars (CFGs). PEGs are generally not context-free.
  2. PEGs imply combining the lexical and syntactic analysis steps, so literal strings are common in the midst of nonterminals. This means that “tokenization” of a language defined in a PEG can be more difficult.
  3. PEGs can convey certain patterns — specifically repetition, optional sub-expressions, and non-consuming lookahead — in much simpler forms than can CFGs. Factoring the corresponding PEG rules into a yecc-friendly format was confusing and error-prone.

So there I was at Erlang Factory (by the way, the BEST tech conference I’ve been to in a long time – go to the London one if you can!). I started asking around if people knew of any PEG/packrat libraries in Erlang. The new guys didn’t know of anything, and the most helpful response I got from an old-hand was from Robert Virding, was that Joe Armstrong had written a parser for a PEG. Richard Carlsson from Kreditor (incidentally, the creator of eunit) tried to convince me packrat parsers were a bad idea, although I was already aware of the tradeoffs.

So now my mission was to create a PEG/packrat library in Erlang, and then, once complete, create my Cucumber clone. My progress thus far:

  • I have a proposed syntax for the grammar definitions, although writing them directly in Erlang is not bad.
  • I have a parser-combinator library with the memoization features necessary for packrat-parsing.
  • I have a prototype of the “metagrammar” in both plaintext and Erlang.
  • A really cool parse-transform to make the parser prettier.

Things still to do:

  • Decide on a syntax for doing semantic transformations. I’m tempted to go simplistic here.
  • Verify my metagrammar with more unit tests.
  • Generate parsers from the output of the metagrammar.
  • Track line/column numbers for failures (and successes?).

In the next few posts, I’ll talk about the internals and some of the design decisions of the library.