Building a Parser-Generator in Erlang, Part 2

May 29, 2009 by Sean Cribbs

One of the beautiful aspects of programming in Erlang is its ability to express certain types of concepts very succinctly and powerfully. The primary feature I’m referring to in this case is higher-order functions, which is a common feature of many functional and multi-paradigm languages.

Higher-order functions come in two types:

functions that produce functions as their return values (currying)
functions that take functions as arguments (composition)

Higher order functions depend on two features in most languages that support them:

closure, the binding of scope at runtime – which supports currying
a “function” datatype (Lisp’s lambda, Ruby’s proc, Erlang’s fun) – which supports composition

So what does this have to do with parsing? Parsing is a natural fit for functional languages — Lisp is famous for this, actually — because the nested data structures parsers must produce fit well with recursive algorithms. Furthermore, a lot of the work a parser has to do can be generalized using higher-order functions — things like consuming strings, choosing between alternatives, variable repetition, and predicates. In this post, I’m going to give you an overview of the way that my parsing library uses both types of higher-order functions to build up a parser that looks only slightly different from the original grammar.

The first thing to do was to decide on a protocol for the HOFs, or as they should be properly called, parser combinators, that they all can implement. Essentially, you want to be able to operate on all funs in the same manner so that they are interchangeable, so that they can be composed or combined. The interface should opaque and consistent at runtime. Deciding on this is actually pretty easy, because I cribbed a lot of my code from an Erlang translation of Haskell’s “parsec” library that I found linked on Robert Virding’s blog. Every parser combinator returns a function that:

Takes the current input buffer as its sole argument
Returns a two-tuple of the parsed result and the remaining input buffer when it passes
Returns the atom fail when it fails

So let’s take a look at one the simplest combinators, consuming a string.

string(S) ->
  fun(Input) ->
      case lists:prefix(S, Input) of
        true -> {S,  Input -- S};
        _ -> fail
      end
  end.

The returned function from the combinator will succeed if the string you pass, S, is on the head of the Input buffer (list). (For those of you not familiar, strings in Erlang are lists of bytes.) Notice how the combinator itself doesn’t actually do any work, it simply curries a function with the string you want to match. For those from a Gang-of-Four OO background, this is similar the Command pattern, except our return value is a function, not an object.

A slightly more interesting combinator is the assert combinator which uses both composition and currying to implement non-consuming positive lookahead. That is, it will verify that the given parser will succeed, but will not advance the input buffer under any condition. Here’s the (surprisingly short) code:

assert(P) ->
  fun(Input) ->
      case P(Input) of
        fail -> fail;
        _ -> {[], Input}
      end
  end.

The combinator takes the parser function P, and wraps it another function which will succeed if P succeeds, but will keep the input buffer at its current place.

Now, these are interesting and all, but what use are they? As it turns out, this pattern lets me have a one-to-one correspondence between features of the types of PEG I want to support and functions that are used to parse them, as shown in this table:

PEG	Erlang	Meaning
`.`	`anything()`	match any character (only fails on end of buffer)
`"some string"`	`string("some string")`	match a string
`[A-Z0-9]`	`charclass("[A-Z0-9]")`	match one character from a class, like regexps
`&a`	`assert(fun a/1)`	positive lookahead, as described above
`!a`	`not_(fun a/1)`	negative lookahead, make sure the parse fails, but consume no input
`a?`	`optional(fun a/1)`	match an optional subexpression
`a / b`	`choose([fun a/1, fun b/1])`	ordered choice between alternatives
`a b`	`seq([fun a/1, fun b/1])`	match a sequence of terminals and nonterminals
`a+`	`one_or_more(fun a/1)`	greedy repetition with at least one match
`a*`	`zero_or_more(fun a/1)`	optional greedy repetition (any number of matches, including none)

With this one-to-one correspondence, look how nicely the code for Brian Ford’s “arithmetic” grammar comes out:

-module(arithmetic).
-include_lib("peg/include/peg.hrl").

?root(additive).

rule(additive) ->
  peg:choose([peg:seq([fun multitive/1,
                       peg:string("+"),
                       fun additive/1]),
              fun multitive/1]);

rule(multitive) ->
  peg:choose([peg:seq([fun primary/1,
                       peg:string("*"),
                       fun multitive/1]),
              fun primary/1]);

rule(primary) ->
  peg:choose([peg:seq([peg:string("("),
                       fun additive/1,
                       peg:string(")")]),
              fun decimal/1]);

rule(decimal) ->
  peg:charclass("[0-9]").

Now, I must admit that there is some secret awesome-sauce going on here, namely an Erlang parse transform that is invoked via the included header file, but I’ll cover that in a later post. Nevertheless, the point stands, that this code was extremely easy to write given the original PEG:

additive <- multitive "+" additive / multitive
multitive <- primary "*" multitive / primary
primary <- "(" additive ")" / decimal
decimal <- [0-9]

The PEG is still more expressive than the Erlang version, but that is one of the shortest parsers I have ever written, and much more comprehensible than something generated by an LALR generator for sure.

Coming up in future posts: adding packrat-style memoization, using parse_transform to keep the parser module looking clean, adding semantic transformation, and tracking line/column numbers of the input.