Building a Parser-Generator in Erlang, Part 5

I’ve realized I gave short-shrift to the parse_transform code that I posted yesterday. I wandered around the topic without much direction. So I’m going to revisit that code and show you the real meat of it: how I convert

rule(nonterminal) ->
  parser_fun();

into:

nonterminal(Input, Index) ->
  peg:p(Input, Index, nonterminal, fun(I,D) ->
                                      (parser_fun())(I,D)
                                      end,
                                   fun(Node) ->
                                      transform(nonterminal, Node)
                                    end).

First steps

The AST of an Erlang module is a list of tuples. Those tuples represent attributes (things like the module name, the exports, etc), and function definitions. The general strategy will be this:

  1. Fold over the AST, looking for definitions for the rule function.
  2. Map all of the function clauses for the rule function into their top-level equivalents.

This is the essence of the transform_rules function:

transform_rules(AST) ->
  transform_rules(AST, []).

transform_rules([], Accum) ->
  lists:reverse(Accum);
transform_rules([{function,_Line,rule,1,Clauses}|Rest], Accum) ->
  Rules = lists:reverse(build_rules(Clauses)),
  transform_rules(Rest, Rules++Accum);
transform_rules([H|Rest], Accum) ->
  transform_rules(Rest, [H|Accum]).

Like a typical iteration routine in Erlang, it pushes things on the head of a list and then reverses at the end. When it finds a matching function, it takes the clauses of that function, transforms them, and then pushes them on the head. Note that the second clause of transform_rules will only match a single-arity function (that’s what the number 1 is doing in there). Since I’m use lists:map() in the build_rules function as you will see below, I have to reverse them before I prepend them to the accumulator.

Mapping the rules to new functions

Now let’s look at build_rules.

build_rules(Clauses) ->
  Rules = lists:map(fun build_rule/1, Clauses),
  [generate_file_function(),generate_parse_function()|Rules].

This is pretty simple. I take the clauses, and map them into their transformation, then prepend the file and parse functions to the front. If you were evil and wanted to screw up this transform, you could provide multiple definitions (not simply clauses) of the rules function. But then it wouldn’t be very useful, would it?

Now we get to an individual rule. Since the logic is a little complicated, I broke it out into several functions. First, is build_rule.

build_rule({clause,Line,[{atom,_,Name}],_,Stmt}) ->
  ets:insert_new(peg_transform, {root, Name, Line}),
  Wrapped = wrap_reductions(Stmt, Name),
  {function,Line,Name,2,
   [{clause,Line,[{var,Line,'Input'},{var,Line,'Index'}],[],Wrapped}]}.

The very first line lets us record the first rule in the grammar/parser as the root rule. This is useful later in generate_parse_function. If the row with the key root is already in the ets table, it ignores the insert. Second, we wrap the “reductions” (body of the rule) in our peg:p function as we will see below. Lastly, we return a function definition with the rule-name as the function name, an arity of 2, and a single clause that contains our wrapped reductions.

Wrap it up

Remember the whole point of this exercise was to avoid typing peg:p() over and over. Here’s where we get to that.

wrap_reductions(Stmt,Name) ->
  Inner = hd(Stmt),
  Line = element(2, Inner), % [{call,Line,_,_}]
  Fun = wrap_fun(Inner, Line),
  Transform = semantic_fun(Name, Line),
  [{call,Line,
    {remote,Line,{atom,Line,peg},{atom,Line,p}},
    [
     {var,Line,'Input'},
     {var,Line,'Index'},
     {atom,Line,Name},
     Fun,
     Transform
    ]}].

wrap_reductions only allows a single statement (one top-level combinator). We extract the line number out of that statement, pass the statement to another wrapper, which will return an inline-function/lambda containing the parser statement. Then we generate a similar function that will take the parsed node and transform it. Last, we pop these two funs into our peg:p() call.

Before we look at wrap_fun, let’s look at semantic_fun, which is a little simpler:

semantic_fun(Rule, Line) ->
  {'fun',Line,
   {clauses,
    [{clause,Line,
      [{var,Line,'Node'}],
      [],
      [{call,Line,
        {atom,Line,transform},
        [{atom,Line,Rule},{var,Line,'Node'}]}]}]}}.

Interestingly, unlike function definitions, 'fun' definitions always use clauses, even if you only have one clause! This one’s really straightforward, just putting the rulename and the line number in the right places.

Now, our final wrapper, wrap_fun.

wrap_fun(Stmts, Line) when is_tuple(Stmts) ->
  {'fun',Line,
   {clauses,
    [{clause,Line,
      [{var,Line,'I'},{var,Line,'D'}],
      [],
      [{call,Line,
        Stmts,
        [{var,Line,'I'},{var,Line,'D'}]}]}]}};
wrap_fun(Stmts, Line) ->
  io:format("peg rules must be double-arity functions or statements and cannot be sequences of statements!~n"),
  throw({parse_error, not_tuple, Line, Stmts}).

I engaged in a little paranoia here — I wanted to make sure someone hand-writing a parser wouldn’t try to do something that would break the convention of a single top-level combinator in each rule. Maybe a little overboard.

For the sake of keeping things short and not colliding with variables in the same scope, I used I and D as local variable instead of Input and Index. What is really curious is how the call statement comes out. In regular Erlang, in order to call a fun that is the result of some other operation, you have to wrap it in parentheses, like so:

(some_fun())(Var)

However, in AST, it looks like any old call, except that instead of being an atom, variable, or remote, it is the substatement.

Back to the top

Now that we’ve dug into the depths of the transform, let’s look again at the top-level.

parse_transform(AST, _Options) ->
  ets:new(peg_transform, [named_table, set]),
  [{attribute, _, module, ModName}] = lists:filter(fun(Form)->
                                                       case Form of
                                                         {attribute,_,module,_} -> true;
                                                         _ -> false
                                                       end
                                                   end, AST),
  ets:insert(peg_transform, {module, ModName}),
  Result = transform_rules(AST),
  ets:delete(peg_transform),
  Result.

You’ll notice the call to transform_rules near the end of the function – where the action happens. The rest of the parse_transform function is devoted to:

  1. Setting up an ets table that we saw used in previous snippets.
  2. Capturing the module name in the ets table, which we use in generating the parse function.
  3. Cleaning up the ets table when we’re done.

Conclusion

I hope that is a bit more clear and gives you a better idea of how I’m accomplishing the transformation.

Building a Parser-Generator in Erlang, Part 4

Notes on my progress

Great news! I’ve crossed several milestones with the parser-generator that I’m anxious to tell you about. But first, two pieces of business.

  1. In case you didn’t notice, I’ve renamed the library from “peggy” to “neotoma”. Neotoma is the genus under Rodentia encompassing packrats. Fitting, don’t you think? If you cloned the repository from github, please update your remotes appropriately.
  2. In the last installment, I showed the code for the peg:p/3,4 memoization function. Well, I discovered it was fatally flawed — specifically related to tracking the “current index”. Certain strings were mysteriously getting rearranged! I bit the bullet and decided to be a good functional programmer, threading the current index through all of the parser combinators. This gives me the benefit of referential transparency, that is, no side-effects, and it’s easier to figure out where the index is advancing (only 3 parser-combinators are responsible). Additionally, it should be simple to extend the parser-combinators to track line and column numbers now, passing around a data structure instead of just an integer.

Milestone 1 – Metagrammar

I had mentioned in the first post that I had a prototype of a “metagrammar”, that is, a grammar for PEG grammars. On Monday night, after resolving the memoization issues, I was able to complete the semantic analysis/code-generation piece of the metagrammar, and now it can regenerate itself (at least the ‘rules’ portion).

Although it would be ideal to add semantic transformation details to the grammar file, I’m not quite ready to do that. I’d appreciate suggestions along those lines.

Milestone 2 – Code generation utilities

Once I had the metagrammar functioning properly, the next step was to build some tools around that portion of the code-generation for creating valid Erlang files that can be compiled into a parser. This was mostly busy-work, generating strings, writing files.

parse_transform

I’ve mentioned this parse_transform thing before, even almost to the point of teasing you with it. Now, my knowledge of how to build Erlang statements and terms programmatically is limited, but I will try to impart the knowledge I’ve gained while working on neotoma’s peg_transform.

Introduction

The gist of the parse-transform is this: Erlang will optionally pass the Abstract Syntax Tree (AST), structured as nested tuples and lists, to a function you define, following the parsing phase of compilation. The only restrictions are that your function return a valid AST and that the original code is valid Erlang.

This is both quite dangerous and quite powerful. Just like the open-classes feature of Ruby, one must use the power for good, and not for awesome. Sagely, the Erlang documentation advises against using it and warns that its usage is not supported.

Why

My purpose for delving into the dark art of the parse_transform was probably a little bit of selfishness. If I had come to Erlang from something like Java, I likely would have slogged through writing many repetitive lines of code without batting an eye. However, I came from Ruby and love the DRY principle. I was not about to write peg:p(...) 20 times in my hand-written parser. I thought, “These statements are so similar, there must be a way to at least hide the repetition. I should be able to support both generated and hand-written parsers with the same amount of elegance and simplicity.”

Getting started

The first step was to figure out what the AST looked like. This part was really simple – just pretty-print the terms without transforming them.

%% include/peg_i.hrl
-compile([{parse_transform, peg_itransform}]).
%% src/peg_itransform.erl
-module(peg_itransform).
-export([parse_transform/2]).

parse_transform(AST, _Options) ->
  io:format("~n~p~n", [AST]),
  AST

In order to get a handle on both what I wanted to generate and what I wanted the original to look like, I created a pair of files. You’ve seen one of them before — the “arithmetic” parser.

By using the identity parse_transform shown above, I was able to figure out how to transform the rule(nonterminal) -> parsers form into the more verbose function form including peg:p().

Diving in the deep end

To complete the transform, I did lots of copying, pasting and modifying, especially with the two functions that are nearly module-independent, parse/1 and file/1. As Kevin warned me when I went down this path, copying and pasting and analyzing the AST by hand is necessary initially – you probably won’t figure it out otherwise. Here are some things I learned:

  • The simple types are easy to understand:
    • {var, Line, 'VarName'}
    • {atom, Line, some_atom}
    • {integer, Line, 0}
    • {float, Line, 3.14159}
    • {tuple, Line, ListOfElements}
    • The empty list: {nil, Line}
  • Lists are strange, but it makes sense when you think about head/tail logic in the language: {cons, Line, Head, Tail}. Single-item lists have a Tail of the empty list. Multi-item lists nest additional cons tuples in the Tail, eventually ending with the empty list.
  • match (representing a “match” statement with =) is a 4-tuple: {match, Line, LHS, RHS}.
  • call (call a function) is also a 4-tuple: {call, Line, FunSpec, ArgumentList}. FunSpec is one of these:
    • An atom (as above), representing a function in the same module.
    • A variable (as above), representing a “fun” (you hope)
    • Any single substatement, hopefully resulting in a “fun”, or containing one directly. (This is used a lot when I turn the parser specs into “funs”.)
    • A call to a function in another module, represented thus: {remote, Line, ModuleAtom, FunctionAtom}.
  • case statements, function declarations and “funs” all have an idea of multiple clauses, something which I’m loathe to go into here. The difference between a single clause and multiple clauses, as represented in the AST is slightly unintuitive. Have a look at my parse-transform module for some examples.
  • You might have noticed the Line in all the statements I described above. Yes, it’s everywhere.

So to bring this back around to something more practical, here’s how I’m generating the parse/1 function that goes into every parser.

generate_parse_function() ->
  ModName = ets:lookup_element(peg_transform, module, 2),
  [{_, Rule, Line}] = ets:lookup(peg_transform, root),
  {function,Line,parse,1,
   [{clause,Line,
     [{var,Line,'Input'}],
     [],
     [{call,Line,
       {remote,Line,{atom,Line,peg},{atom,Line,setup_memo}},
       [{atom,Line,ModName}]},
      {match,Line,
       {var,Line,'Result'},
       {'case',Line,
        {call,Line,{atom,Line,Rule},[{var,Line,'Input'},{integer,Line,0}]},
        [{clause,Line,[{tuple,Line,[{var,Line,'AST'},{nil,Line},{var,Line, '_Index'}]}],[],[{var,Line,'AST'}]},
         {clause,Line,[{var,Line,'Any'}],[],[{var,Line,'Any'}]}]}},
      {call,Line,{remote,Line,{atom,Line,peg},{atom,Line,release_memo}},[]},
      {var,Line,'Result'}]}]}.

The two lines at the top load some information I’d previously stored in an ETS table about the module being compiled and the root rule (with line number) of the parser. The rest interpolates that information into the near-verbatim copy of the function generated from the identity transform.

Conclusion

There are a lot of interesting uses for parse_transform, mostly for the sake of metaprogramming as I’ve done here. If you are interested in doing things at runtime, you might check out smerl. I invite your comments below.

Building a Parser-Generator in Erlang, Part 3

In this installment, I’m going to talk about adding packrat-style memoization to the parser library. If you’re curious for more detail, you can get the code now! I’ve released it on GitHub as neotoma (UPDATE: renamed). As always, it is released under an MIT License, but I would appreciate contributions.

So, last time we talked about the combinators that the peg module provides. Now, if we built a parser with just these combinators, we’d be fine for short inputs. However, given a sufficiently-large input, we would see an explosion of the amount of time it takes to parse! This is because a pure recursive-descent parser without prediction executes in O(2^N) time, that is, exponential. The parser will backtrack over and over when rules fail, without tracking any intermediate results that might have succeeded.

Packrat parsers, as described in Bryan Ford’s thesis, execute in linear-time ( O(N) ) by memoizing intermediate results. Theoretically, this means that you construct a table of the parsing results at each input position for each rule in the grammar (known as tabular parsing). In practice, you have some kind of “global” two-dimensional data-structure that stores the results of attempted parses for later use, but you don’t bother to calculate the result of every rule. The primary limitation of this is the classic trade-off in Computer Science; that is, you sacrifice memory usage for speed. Some empirical studies of packrat parsers show a usage of 400 bytes of memory per byte of input. While that’s a large number, it is generally a constant proportion to the amount of input.

One of the biggest challenges for me was teasing apart Ford’s theory from his implementation. His packrat parser is written in Haskell and makes use of the inherent laziness and algebraic data-structures in the language. Since I’m using Erlang, which lacks inherent laziness and discourages global information, I had to take a different approach.

The strategy I decided on goes like this:

  1. At the start of the parse, create an ets table for memoizing results. Put the handle to this table in the process dictionary.
  2. Track the intermediate results, keyed on the index into the input, in the ets table.
  3. Track the current index in the ets table as well (although this is not strictly necessary, it simplifies portions of the code).
  4. Wrap each rule/reduction in a function that handles the memoization automatically.
  5. Destroy the ets table when the parse ends and remove the handle from the process dictionary.

ETS tables were a great fit for this problem because they are cheap to create and use, and essentially global. Although Erlang is great at urging you to remove side-effects from your programs, this was a case where a side-effect was necessary and expedient, and it plays nicely by cleaning up its mess when done.

Steps 1 and 5 happen in the outermost parse function, which I use an Erlang parse-transform to generate via the included header file we saw in the last installment. Let’s take a look at that function and then we’ll walk through the steps. We’ll use the example from our arithmetic grammar, which is available in tests/examples/arithmetic.erl

parse(Input) ->
  peg:setup_memo(arithmetic),
  Result = case additive(Input) of
             {AST, []} ->
                AST;
             fail -> fail
           end,
  peg:release_memo(),
  Result.

It creates the ets table, parses by calling the root rule, releases the table, and returns the result. Pretty straight-forward.

Step 1 – Create the ets table

Note: All of the functions below are in the peg module.

setup_memo(Name) ->
  TID = ets:new(Name, [set]),
  put(ets_table, TID),
  set_index(0).

This function is so short it almost needs no explanation. We create a new ets table, using the set type since we will have at most one entry per index into the input. We put that table ID into the process dictionary so we can grab it later, and then we set the stored input index to zero. Here’s what set_index/1 and its pair index/0 do, just so you know when we see them again later:

set_index(Value) ->
  ets:insert(get(ets_table), {current_index, Value}).

index() -> 
  ets:lookup_element(get(ets_table), current_index, 2).

Note how we get the ETS table ID out of the process dictionary from the key where we put it before. Again, this is probably not The Right Way™ to do things, but it works.

Step 2 & 3 – Track intermediate results and the input position

Here’s the meat of the process. This function is heavily commented, so I’ll just give some commentary at the end.

p(Inp, Name, ParseFun) ->
  p(Inp, Name, ParseFun, fun(N) -> N end).
p(Inp, Name, ParseFun, TransformFun) ->
  % Record the starting index
  StartIndex = index(),
  % Grab the memo table from ets
  Memo = get_memo(StartIndex),
  % See if the current reduction is memoized
  case dict:find(Name, Memo) of
    % If it is, set the new index, and return the result
    {ok, {Result, NewIndex}} ->
      set_index(NewIndex),
      {Result, lists:nthtail(NewIndex - StartIndex, Inp)};
    % If not, attempt to parse
    _ ->
      case ParseFun(Inp) of
        % If it fails, reset the index, memoize the failure
        fail ->
          memoize(StartIndex, dict:store(Name, fail, Memo)),
          set_index(StartIndex),
          fail;
        % If it passes, advance the index, memoize the result.
        {Result, InpRem} ->
          NewIndex = StartIndex + (length(Inp) - length(InpRem)),
          Transformed = TransformFun(Result),
          memoize(StartIndex, dict:store(Name, {Transformed, NewIndex}, Memo)),
          set_index(NewIndex),
          {Transformed, InpRem}
      end
  end.

In our previous style of creating higher-order functions, we pass the peg:p/4, two functions, one that performs the syntactic analysis (the parser function), and one that performs the semantic analysis, or transforms the AST into its final form. By default (using peg:p/3), we don’t transform anything.

Now let’s look at the functions we haven’t seen yet, memoize/2 and get_memo/1. These manage the memoization table for us.

memoize(Position, Struct) ->
  ets:insert(get(ets_table), {Position, Struct}).

get_memo(Position) ->
  case ets:lookup(get(ets_table), Position) of
    [] -> dict:new();
    [{Position, Dict}] -> Dict
  end.

When memoizing, it essentially overwrites any record that was already in the table. When retrieving a memoized result, it creates a new dict if the input position hasn’t been memoized before.

Step 4 – Wrap rules in our memoization function

Now we just have to wrap our existing parser functions in the memoization routine. In the actual library, this is accomplished with a parse-transform, but I’ll pull an example from the expanded “arithmetic” parser that I use for testing.

decimal(Input) ->
  peg:p(Input, decimal, fun(I) ->
                            (peg:charclass("[0-9]"))(I)
                        end,
       fun(Node) -> transform(decimal, Node) end).

I also introduce a semantic transformation function, but as we saw above, that is optional.

Step 5 – Clean up the memo table

Just as straightforward as our setup function is the teardown one:

release_memo() ->
  ets:delete(get(ets_table)),
  erase(ets_table).

Conclusion

So there you have it, we’ve added packrat-style memoization to our already capable parser library. I’d appreciate any feedback you can give regarding the style and execution of the library, and as always, patches are greatly appreciated.