Building a Parser-Generator in Erlang, Part 5

I’ve realized I gave short-shrift to the parse_transform code that I posted yesterday. I wandered around the topic without much direction. So I’m going to revisit that code and show you the real meat of it: how I convert

rule(nonterminal) ->
  parser_fun();

into:

nonterminal(Input, Index) ->
  peg:p(Input, Index, nonterminal, fun(I,D) ->
                                      (parser_fun())(I,D)
                                      end,
                                   fun(Node) ->
                                      transform(nonterminal, Node)
                                    end).

First steps

The AST of an Erlang module is a list of tuples. Those tuples represent attributes (things like the module name, the exports, etc), and function definitions. The general strategy will be this:

  1. Fold over the AST, looking for definitions for the rule function.
  2. Map all of the function clauses for the rule function into their top-level equivalents.

This is the essence of the transform_rules function:

transform_rules(AST) ->
  transform_rules(AST, []).

transform_rules([], Accum) ->
  lists:reverse(Accum);
transform_rules([{function,_Line,rule,1,Clauses}|Rest], Accum) ->
  Rules = lists:reverse(build_rules(Clauses)),
  transform_rules(Rest, Rules++Accum);
transform_rules([H|Rest], Accum) ->
  transform_rules(Rest, [H|Accum]).

Like a typical iteration routine in Erlang, it pushes things on the head of a list and then reverses at the end. When it finds a matching function, it takes the clauses of that function, transforms them, and then pushes them on the head. Note that the second clause of transform_rules will only match a single-arity function (that’s what the number 1 is doing in there). Since I’m use lists:map() in the build_rules function as you will see below, I have to reverse them before I prepend them to the accumulator.

Mapping the rules to new functions

Now let’s look at build_rules.

build_rules(Clauses) ->
  Rules = lists:map(fun build_rule/1, Clauses),
  [generate_file_function(),generate_parse_function()|Rules].

This is pretty simple. I take the clauses, and map them into their transformation, then prepend the file and parse functions to the front. If you were evil and wanted to screw up this transform, you could provide multiple definitions (not simply clauses) of the rules function. But then it wouldn’t be very useful, would it?

Now we get to an individual rule. Since the logic is a little complicated, I broke it out into several functions. First, is build_rule.

build_rule({clause,Line,[{atom,_,Name}],_,Stmt}) ->
  ets:insert_new(peg_transform, {root, Name, Line}),
  Wrapped = wrap_reductions(Stmt, Name),
  {function,Line,Name,2,
   [{clause,Line,[{var,Line,'Input'},{var,Line,'Index'}],[],Wrapped}]}.

The very first line lets us record the first rule in the grammar/parser as the root rule. This is useful later in generate_parse_function. If the row with the key root is already in the ets table, it ignores the insert. Second, we wrap the “reductions” (body of the rule) in our peg:p function as we will see below. Lastly, we return a function definition with the rule-name as the function name, an arity of 2, and a single clause that contains our wrapped reductions.

Wrap it up

Remember the whole point of this exercise was to avoid typing peg:p() over and over. Here’s where we get to that.

wrap_reductions(Stmt,Name) ->
  Inner = hd(Stmt),
  Line = element(2, Inner), % [{call,Line,_,_}]
  Fun = wrap_fun(Inner, Line),
  Transform = semantic_fun(Name, Line),
  [{call,Line,
    {remote,Line,{atom,Line,peg},{atom,Line,p}},
    [
     {var,Line,'Input'},
     {var,Line,'Index'},
     {atom,Line,Name},
     Fun,
     Transform
    ]}].

wrap_reductions only allows a single statement (one top-level combinator). We extract the line number out of that statement, pass the statement to another wrapper, which will return an inline-function/lambda containing the parser statement. Then we generate a similar function that will take the parsed node and transform it. Last, we pop these two funs into our peg:p() call.

Before we look at wrap_fun, let’s look at semantic_fun, which is a little simpler:

semantic_fun(Rule, Line) ->
  {'fun',Line,
   {clauses,
    [{clause,Line,
      [{var,Line,'Node'}],
      [],
      [{call,Line,
        {atom,Line,transform},
        [{atom,Line,Rule},{var,Line,'Node'}]}]}]}}.

Interestingly, unlike function definitions, 'fun' definitions always use clauses, even if you only have one clause! This one’s really straightforward, just putting the rulename and the line number in the right places.

Now, our final wrapper, wrap_fun.

wrap_fun(Stmts, Line) when is_tuple(Stmts) ->
  {'fun',Line,
   {clauses,
    [{clause,Line,
      [{var,Line,'I'},{var,Line,'D'}],
      [],
      [{call,Line,
        Stmts,
        [{var,Line,'I'},{var,Line,'D'}]}]}]}};
wrap_fun(Stmts, Line) ->
  io:format("peg rules must be double-arity functions or statements and cannot be sequences of statements!~n"),
  throw({parse_error, not_tuple, Line, Stmts}).

I engaged in a little paranoia here — I wanted to make sure someone hand-writing a parser wouldn’t try to do something that would break the convention of a single top-level combinator in each rule. Maybe a little overboard.

For the sake of keeping things short and not colliding with variables in the same scope, I used I and D as local variable instead of Input and Index. What is really curious is how the call statement comes out. In regular Erlang, in order to call a fun that is the result of some other operation, you have to wrap it in parentheses, like so:

(some_fun())(Var)

However, in AST, it looks like any old call, except that instead of being an atom, variable, or remote, it is the substatement.

Back to the top

Now that we’ve dug into the depths of the transform, let’s look again at the top-level.

parse_transform(AST, _Options) ->
  ets:new(peg_transform, [named_table, set]),
  [{attribute, _, module, ModName}] = lists:filter(fun(Form)->
                                                       case Form of
                                                         {attribute,_,module,_} -> true;
                                                         _ -> false
                                                       end
                                                   end, AST),
  ets:insert(peg_transform, {module, ModName}),
  Result = transform_rules(AST),
  ets:delete(peg_transform),
  Result.

You’ll notice the call to transform_rules near the end of the function – where the action happens. The rest of the parse_transform function is devoted to:

  1. Setting up an ets table that we saw used in previous snippets.
  2. Capturing the module name in the ets table, which we use in generating the parse function.
  3. Cleaning up the ets table when we’re done.

Conclusion

I hope that is a bit more clear and gives you a better idea of how I’m accomplishing the transformation.

Building a Parser-Generator in Erlang, Part 4

Notes on my progress

Great news! I’ve crossed several milestones with the parser-generator that I’m anxious to tell you about. But first, two pieces of business.

  1. In case you didn’t notice, I’ve renamed the library from “peggy” to “neotoma”. Neotoma is the genus under Rodentia encompassing packrats. Fitting, don’t you think? If you cloned the repository from github, please update your remotes appropriately.
  2. In the last installment, I showed the code for the peg:p/3,4 memoization function. Well, I discovered it was fatally flawed — specifically related to tracking the “current index”. Certain strings were mysteriously getting rearranged! I bit the bullet and decided to be a good functional programmer, threading the current index through all of the parser combinators. This gives me the benefit of referential transparency, that is, no side-effects, and it’s easier to figure out where the index is advancing (only 3 parser-combinators are responsible). Additionally, it should be simple to extend the parser-combinators to track line and column numbers now, passing around a data structure instead of just an integer.

Milestone 1 – Metagrammar

I had mentioned in the first post that I had a prototype of a “metagrammar”, that is, a grammar for PEG grammars. On Monday night, after resolving the memoization issues, I was able to complete the semantic analysis/code-generation piece of the metagrammar, and now it can regenerate itself (at least the ‘rules’ portion).

Although it would be ideal to add semantic transformation details to the grammar file, I’m not quite ready to do that. I’d appreciate suggestions along those lines.

Milestone 2 – Code generation utilities

Once I had the metagrammar functioning properly, the next step was to build some tools around that portion of the code-generation for creating valid Erlang files that can be compiled into a parser. This was mostly busy-work, generating strings, writing files.

parse_transform

I’ve mentioned this parse_transform thing before, even almost to the point of teasing you with it. Now, my knowledge of how to build Erlang statements and terms programmatically is limited, but I will try to impart the knowledge I’ve gained while working on neotoma’s peg_transform.

Introduction

The gist of the parse-transform is this: Erlang will optionally pass the Abstract Syntax Tree (AST), structured as nested tuples and lists, to a function you define, following the parsing phase of compilation. The only restrictions are that your function return a valid AST and that the original code is valid Erlang.

This is both quite dangerous and quite powerful. Just like the open-classes feature of Ruby, one must use the power for good, and not for awesome. Sagely, the Erlang documentation advises against using it and warns that its usage is not supported.

Why

My purpose for delving into the dark art of the parse_transform was probably a little bit of selfishness. If I had come to Erlang from something like Java, I likely would have slogged through writing many repetitive lines of code without batting an eye. However, I came from Ruby and love the DRY principle. I was not about to write peg:p(...) 20 times in my hand-written parser. I thought, “These statements are so similar, there must be a way to at least hide the repetition. I should be able to support both generated and hand-written parsers with the same amount of elegance and simplicity.”

Getting started

The first step was to figure out what the AST looked like. This part was really simple – just pretty-print the terms without transforming them.

%% include/peg_i.hrl
-compile([{parse_transform, peg_itransform}]).
%% src/peg_itransform.erl
-module(peg_itransform).
-export([parse_transform/2]).

parse_transform(AST, _Options) ->
  io:format("~n~p~n", [AST]),
  AST

In order to get a handle on both what I wanted to generate and what I wanted the original to look like, I created a pair of files. You’ve seen one of them before — the “arithmetic” parser.

By using the identity parse_transform shown above, I was able to figure out how to transform the rule(nonterminal) -> parsers form into the more verbose function form including peg:p().

Diving in the deep end

To complete the transform, I did lots of copying, pasting and modifying, especially with the two functions that are nearly module-independent, parse/1 and file/1. As Kevin warned me when I went down this path, copying and pasting and analyzing the AST by hand is necessary initially – you probably won’t figure it out otherwise. Here are some things I learned:

  • The simple types are easy to understand:
    • {var, Line, 'VarName'}
    • {atom, Line, some_atom}
    • {integer, Line, 0}
    • {float, Line, 3.14159}
    • {tuple, Line, ListOfElements}
    • The empty list: {nil, Line}
  • Lists are strange, but it makes sense when you think about head/tail logic in the language: {cons, Line, Head, Tail}. Single-item lists have a Tail of the empty list. Multi-item lists nest additional cons tuples in the Tail, eventually ending with the empty list.
  • match (representing a “match” statement with =) is a 4-tuple: {match, Line, LHS, RHS}.
  • call (call a function) is also a 4-tuple: {call, Line, FunSpec, ArgumentList}. FunSpec is one of these:
    • An atom (as above), representing a function in the same module.
    • A variable (as above), representing a “fun” (you hope)
    • Any single substatement, hopefully resulting in a “fun”, or containing one directly. (This is used a lot when I turn the parser specs into “funs”.)
    • A call to a function in another module, represented thus: {remote, Line, ModuleAtom, FunctionAtom}.
  • case statements, function declarations and “funs” all have an idea of multiple clauses, something which I’m loathe to go into here. The difference between a single clause and multiple clauses, as represented in the AST is slightly unintuitive. Have a look at my parse-transform module for some examples.
  • You might have noticed the Line in all the statements I described above. Yes, it’s everywhere.

So to bring this back around to something more practical, here’s how I’m generating the parse/1 function that goes into every parser.

generate_parse_function() ->
  ModName = ets:lookup_element(peg_transform, module, 2),
  [{_, Rule, Line}] = ets:lookup(peg_transform, root),
  {function,Line,parse,1,
   [{clause,Line,
     [{var,Line,'Input'}],
     [],
     [{call,Line,
       {remote,Line,{atom,Line,peg},{atom,Line,setup_memo}},
       [{atom,Line,ModName}]},
      {match,Line,
       {var,Line,'Result'},
       {'case',Line,
        {call,Line,{atom,Line,Rule},[{var,Line,'Input'},{integer,Line,0}]},
        [{clause,Line,[{tuple,Line,[{var,Line,'AST'},{nil,Line},{var,Line, '_Index'}]}],[],[{var,Line,'AST'}]},
         {clause,Line,[{var,Line,'Any'}],[],[{var,Line,'Any'}]}]}},
      {call,Line,{remote,Line,{atom,Line,peg},{atom,Line,release_memo}},[]},
      {var,Line,'Result'}]}]}.

The two lines at the top load some information I’d previously stored in an ETS table about the module being compiled and the root rule (with line number) of the parser. The rest interpolates that information into the near-verbatim copy of the function generated from the identity transform.

Conclusion

There are a lot of interesting uses for parse_transform, mostly for the sake of metaprogramming as I’ve done here. If you are interested in doing things at runtime, you might check out smerl. I invite your comments below.

Building a Parser-Generator in Erlang, Part 3

In this installment, I’m going to talk about adding packrat-style memoization to the parser library. If you’re curious for more detail, you can get the code now! I’ve released it on GitHub as neotoma (UPDATE: renamed). As always, it is released under an MIT License, but I would appreciate contributions.

So, last time we talked about the combinators that the peg module provides. Now, if we built a parser with just these combinators, we’d be fine for short inputs. However, given a sufficiently-large input, we would see an explosion of the amount of time it takes to parse! This is because a pure recursive-descent parser without prediction executes in O(2^N) time, that is, exponential. The parser will backtrack over and over when rules fail, without tracking any intermediate results that might have succeeded.

Packrat parsers, as described in Bryan Ford’s thesis, execute in linear-time ( O(N) ) by memoizing intermediate results. Theoretically, this means that you construct a table of the parsing results at each input position for each rule in the grammar (known as tabular parsing). In practice, you have some kind of “global” two-dimensional data-structure that stores the results of attempted parses for later use, but you don’t bother to calculate the result of every rule. The primary limitation of this is the classic trade-off in Computer Science; that is, you sacrifice memory usage for speed. Some empirical studies of packrat parsers show a usage of 400 bytes of memory per byte of input. While that’s a large number, it is generally a constant proportion to the amount of input.

One of the biggest challenges for me was teasing apart Ford’s theory from his implementation. His packrat parser is written in Haskell and makes use of the inherent laziness and algebraic data-structures in the language. Since I’m using Erlang, which lacks inherent laziness and discourages global information, I had to take a different approach.

The strategy I decided on goes like this:

  1. At the start of the parse, create an ets table for memoizing results. Put the handle to this table in the process dictionary.
  2. Track the intermediate results, keyed on the index into the input, in the ets table.
  3. Track the current index in the ets table as well (although this is not strictly necessary, it simplifies portions of the code).
  4. Wrap each rule/reduction in a function that handles the memoization automatically.
  5. Destroy the ets table when the parse ends and remove the handle from the process dictionary.

ETS tables were a great fit for this problem because they are cheap to create and use, and essentially global. Although Erlang is great at urging you to remove side-effects from your programs, this was a case where a side-effect was necessary and expedient, and it plays nicely by cleaning up its mess when done.

Steps 1 and 5 happen in the outermost parse function, which I use an Erlang parse-transform to generate via the included header file we saw in the last installment. Let’s take a look at that function and then we’ll walk through the steps. We’ll use the example from our arithmetic grammar, which is available in tests/examples/arithmetic.erl

parse(Input) ->
  peg:setup_memo(arithmetic),
  Result = case additive(Input) of
             {AST, []} ->
                AST;
             fail -> fail
           end,
  peg:release_memo(),
  Result.

It creates the ets table, parses by calling the root rule, releases the table, and returns the result. Pretty straight-forward.

Step 1 – Create the ets table

Note: All of the functions below are in the peg module.

setup_memo(Name) ->
  TID = ets:new(Name, [set]),
  put(ets_table, TID),
  set_index(0).

This function is so short it almost needs no explanation. We create a new ets table, using the set type since we will have at most one entry per index into the input. We put that table ID into the process dictionary so we can grab it later, and then we set the stored input index to zero. Here’s what set_index/1 and its pair index/0 do, just so you know when we see them again later:

set_index(Value) ->
  ets:insert(get(ets_table), {current_index, Value}).

index() -> 
  ets:lookup_element(get(ets_table), current_index, 2).

Note how we get the ETS table ID out of the process dictionary from the key where we put it before. Again, this is probably not The Right Way™ to do things, but it works.

Step 2 & 3 – Track intermediate results and the input position

Here’s the meat of the process. This function is heavily commented, so I’ll just give some commentary at the end.

p(Inp, Name, ParseFun) ->
  p(Inp, Name, ParseFun, fun(N) -> N end).
p(Inp, Name, ParseFun, TransformFun) ->
  % Record the starting index
  StartIndex = index(),
  % Grab the memo table from ets
  Memo = get_memo(StartIndex),
  % See if the current reduction is memoized
  case dict:find(Name, Memo) of
    % If it is, set the new index, and return the result
    {ok, {Result, NewIndex}} ->
      set_index(NewIndex),
      {Result, lists:nthtail(NewIndex - StartIndex, Inp)};
    % If not, attempt to parse
    _ ->
      case ParseFun(Inp) of
        % If it fails, reset the index, memoize the failure
        fail ->
          memoize(StartIndex, dict:store(Name, fail, Memo)),
          set_index(StartIndex),
          fail;
        % If it passes, advance the index, memoize the result.
        {Result, InpRem} ->
          NewIndex = StartIndex + (length(Inp) - length(InpRem)),
          Transformed = TransformFun(Result),
          memoize(StartIndex, dict:store(Name, {Transformed, NewIndex}, Memo)),
          set_index(NewIndex),
          {Transformed, InpRem}
      end
  end.

In our previous style of creating higher-order functions, we pass the peg:p/4, two functions, one that performs the syntactic analysis (the parser function), and one that performs the semantic analysis, or transforms the AST into its final form. By default (using peg:p/3), we don’t transform anything.

Now let’s look at the functions we haven’t seen yet, memoize/2 and get_memo/1. These manage the memoization table for us.

memoize(Position, Struct) ->
  ets:insert(get(ets_table), {Position, Struct}).

get_memo(Position) ->
  case ets:lookup(get(ets_table), Position) of
    [] -> dict:new();
    [{Position, Dict}] -> Dict
  end.

When memoizing, it essentially overwrites any record that was already in the table. When retrieving a memoized result, it creates a new dict if the input position hasn’t been memoized before.

Step 4 – Wrap rules in our memoization function

Now we just have to wrap our existing parser functions in the memoization routine. In the actual library, this is accomplished with a parse-transform, but I’ll pull an example from the expanded “arithmetic” parser that I use for testing.

decimal(Input) ->
  peg:p(Input, decimal, fun(I) ->
                            (peg:charclass("[0-9]"))(I)
                        end,
       fun(Node) -> transform(decimal, Node) end).

I also introduce a semantic transformation function, but as we saw above, that is optional.

Step 5 – Clean up the memo table

Just as straightforward as our setup function is the teardown one:

release_memo() ->
  ets:delete(get(ets_table)),
  erase(ets_table).

Conclusion

So there you have it, we’ve added packrat-style memoization to our already capable parser library. I’d appreciate any feedback you can give regarding the style and execution of the library, and as always, patches are greatly appreciated.

Building a Parser-Generator in Erlang, Part 2

One of the beautiful aspects of programming in Erlang is its ability to express certain types of concepts very succinctly and powerfully. The primary feature I’m referring to in this case is higher-order functions, which is a common feature of many functional and multi-paradigm languages.

Higher-order functions come in two types:

  1. functions that produce functions as their return values (currying)
  2. functions that take functions as arguments (composition)

Higher order functions depend on two features in most languages that support them:

  1. closure, the binding of scope at runtime – which supports currying
  2. a “function” datatype (Lisp’s lambda, Ruby’s proc, Erlang’s fun) – which supports composition

So what does this have to do with parsing? Parsing is a natural fit for functional languages — Lisp is famous for this, actually — because the nested data structures parsers must produce fit well with recursive algorithms. Furthermore, a lot of the work a parser has to do can be generalized using higher-order functions — things like consuming strings, choosing between alternatives, variable repetition, and predicates. In this post, I’m going to give you an overview of the way that my parsing library uses both types of higher-order functions to build up a parser that looks only slightly different from the original grammar.

The first thing to do was to decide on a protocol for the HOFs, or as they should be properly called, parser combinators, that they all can implement. Essentially, you want to be able to operate on all funs in the same manner so that they are interchangeable, so that they can be composed or combined. The interface should opaque and consistent at runtime. Deciding on this is actually pretty easy, because I cribbed a lot of my code from an Erlang translation of Haskell’s “parsec” library that I found linked on Robert Virding’s blog. Every parser combinator returns a function that:

  1. Takes the current input buffer as its sole argument
  2. Returns a two-tuple of the parsed result and the remaining input buffer when it passes
  3. Returns the atom fail when it fails

So let’s take a look at one the simplest combinators, consuming a string.

string(S) ->
  fun(Input) ->
      case lists:prefix(S, Input) of
        true -> {S,  Input -- S};
        _ -> fail
      end
  end.

The returned function from the combinator will succeed if the string you pass, S, is on the head of the Input buffer (list). (For those of you not familiar, strings in Erlang are lists of bytes.) Notice how the combinator itself doesn’t actually do any work, it simply curries a function with the string you want to match. For those from a Gang-of-Four OO background, this is similar the Command pattern, except our return value is a function, not an object.

A slightly more interesting combinator is the assert combinator which uses both composition and currying to implement non-consuming positive lookahead. That is, it will verify that the given parser will succeed, but will not advance the input buffer under any condition. Here’s the (surprisingly short) code:

assert(P) ->
  fun(Input) ->
      case P(Input) of
        fail -> fail;
        _ -> {[], Input}
      end
  end.

The combinator takes the parser function P, and wraps it another function which will succeed if P succeeds, but will keep the input buffer at its current place.

Now, these are interesting and all, but what use are they? As it turns out, this pattern lets me have a one-to-one correspondence between features of the types of PEG I want to support and functions that are used to parse them, as shown in this table:

PEG Erlang Meaning
. anything() match any character (only fails on end of buffer)
"some string" string("some string") match a string
[A-Z0-9] charclass("[A-Z0-9]") match one character from a class, like regexps
&a assert(fun a/1) positive lookahead, as described above
!a not_(fun a/1) negative lookahead, make sure the parse fails, but consume no input
a? optional(fun a/1) match an optional subexpression
a / b choose([fun a/1, fun b/1]) ordered choice between alternatives
a b seq([fun a/1, fun b/1]) match a sequence of terminals and nonterminals
a+ one_or_more(fun a/1) greedy repetition with at least one match
a* zero_or_more(fun a/1) optional greedy repetition (any number of matches, including none)

With this one-to-one correspondence, look how nicely the code for Brian Ford’s “arithmetic” grammar comes out:

-module(arithmetic).
-include_lib("peg/include/peg.hrl").

?root(additive).

rule(additive) ->
  peg:choose([peg:seq([fun multitive/1,
                       peg:string("+"),
                       fun additive/1]),
              fun multitive/1]);

rule(multitive) ->
  peg:choose([peg:seq([fun primary/1,
                       peg:string("*"),
                       fun multitive/1]),
              fun primary/1]);

rule(primary) ->
  peg:choose([peg:seq([peg:string("("),
                       fun additive/1,
                       peg:string(")")]),
              fun decimal/1]);

rule(decimal) ->
  peg:charclass("[0-9]").

Now, I must admit that there is some secret awesome-sauce going on here, namely an Erlang parse transform that is invoked via the included header file, but I’ll cover that in a later post. Nevertheless, the point stands, that this code was extremely easy to write given the original PEG:

additive <- multitive "+" additive / multitive
multitive <- primary "*" multitive / primary
primary <- "(" additive ")" / decimal
decimal <- [0-9]

The PEG is still more expressive than the Erlang version, but that is one of the shortest parsers I have ever written, and much more comprehensible than something generated by an LALR generator for sure.

Coming up in future posts: adding packrat-style memoization, using parse_transform to keep the parser module looking clean, adding semantic transformation, and tracking line/column numbers of the input.

Building a Parser-Generator in Erlang, Part 1

First, a confession: I’m a bit of a nerd when it comes to computer languages. I love reading about them, studying how they work, and implementing them. My favorite class in my undergrad was Compiler Construction — when everyone else was seeking incompletes to finish their projects, I turned mine in half-a-week early. Since I graduated, there have been advancements in language parsing research and applications. Of particular interest to me are Parsing Expression Grammars and their accompanying “packrat” parsers — so named for their memoizing features. Being a Rubyist, I was exposed to treetop, which is a nice toolkit for building packrat parsers from PEGs.

Now, to come to the point. At the end of April, I was flying out to Palo Alto, CA for Erlang Factory. My conference habit is to pick a side project, work on it during downtimes at the conference, and ask questions of knowledgeable presenters and attendees. I had just come off a month of doing some heavy Story-Driven Development with Cucumber, a Ruby integration/acceptance testing library that uses plain-text “stories” as test-cases. I wanted a tool in native Erlang for supporting this paradigm. eunit, etap, and the rest are great, but they are pale in comparison to the maturity of libraries like RSpec and Cucumber for Ruby.

Having built several parsers in Erlang before, most notably herml, I thought I’d take my usual path of using leex and yecc to build the parser, based on Cucumber’s PEG. Insanity ensued, and for these reasons:

  1. LALR parser-generators like yecc are designed for context-free grammars (CFGs). PEGs are generally not context-free.
  2. PEGs imply combining the lexical and syntactic analysis steps, so literal strings are common in the midst of nonterminals. This means that “tokenization” of a language defined in a PEG can be more difficult.
  3. PEGs can convey certain patterns — specifically repetition, optional sub-expressions, and non-consuming lookahead — in much simpler forms than can CFGs. Factoring the corresponding PEG rules into a yecc-friendly format was confusing and error-prone.

So there I was at Erlang Factory (by the way, the BEST tech conference I’ve been to in a long time – go to the London one if you can!). I started asking around if people knew of any PEG/packrat libraries in Erlang. The new guys didn’t know of anything, and the most helpful response I got from an old-hand was from Robert Virding, was that Joe Armstrong had written a parser for a PEG. Richard Carlsson from Kreditor (incidentally, the creator of eunit) tried to convince me packrat parsers were a bad idea, although I was already aware of the tradeoffs.

So now my mission was to create a PEG/packrat library in Erlang, and then, once complete, create my Cucumber clone. My progress thus far:

  • I have a proposed syntax for the grammar definitions, although writing them directly in Erlang is not bad.
  • I have a parser-combinator library with the memoization features necessary for packrat-parsing.
  • I have a prototype of the “metagrammar” in both plaintext and Erlang.
  • A really cool parse-transform to make the parser prettier.

Things still to do:

  • Decide on a syntax for doing semantic transformations. I’m tempted to go simplistic here.
  • Verify my metagrammar with more unit tests.
  • Generate parsers from the output of the metagrammar.
  • Track line/column numbers for failures (and successes?).

In the next few posts, I’ll talk about the internals and some of the design decisions of the library.

Daily log - April 23, 2009

Application accepted. I learned how to use Javascript’s apply method today (properly). For some context, I was using LowPro awesome DOM.Builder functions to create some @option@s programmatically and insert them into a select. The only thing I thought that is exceptionally quirky about the implementation is that the last argument to a “tagFunc” appears to be an array of children. However on closer inspection, the tagFunc actually takes any number of arguments after the attributes as children, but not an array. Unfortunately, since Javascript doesn’t have Ruby’s awesome “splat” operator (*) for arrays, I had to find another way. Mozilla’s Developer Connection really helped out here. At first I was trying this:

var options = [ ]; // build some options
$select({"name": "the_select_box"}, options);

Instead, this is what worked:

var options = [ ]; // build some options
options.unshift({"name": "the_select_box"}); // put the attributes first in the array
$select.apply(this, options); // "this" is not strictly necessary in this case, "null" would work

Yay for higher-order functions! UPDATE It seems Dan has corrected this issue in the LowPro available on GitHub. I’ll be upgrading!