Building a Parser-Generator in Erlang, Part 3

Jun 11, 2009 by Sean Cribbs

In this installment, I’m going to talk about adding packrat-style memoization to the parser library. If you’re curious for more detail, you can get the code now! I’ve released it on GitHub as neotoma (UPDATE: renamed). As always, it is released under an MIT License, but I would appreciate contributions.

So, last time we talked about the combinators that the peg module provides. Now, if we built a parser with just these combinators, we’d be fine for short inputs. However, given a sufficiently-large input, we would see an explosion of the amount of time it takes to parse! This is because a pure recursive-descent parser without prediction executes in O(2^N) time, that is, exponential. The parser will backtrack over and over when rules fail, without tracking any intermediate results that might have succeeded.

Packrat parsers, as described in Bryan Ford’s thesis, execute in linear-time ( O(N) ) by memoizing intermediate results. Theoretically, this means that you construct a table of the parsing results at each input position for each rule in the grammar (known as tabular parsing). In practice, you have some kind of “global” two-dimensional data-structure that stores the results of attempted parses for later use, but you don’t bother to calculate the result of every rule. The primary limitation of this is the classic trade-off in Computer Science; that is, you sacrifice memory usage for speed. Some empirical studies of packrat parsers show a usage of 400 bytes of memory per byte of input. While that’s a large number, it is generally a constant proportion to the amount of input.

One of the biggest challenges for me was teasing apart Ford’s theory from his implementation. His packrat parser is written in Haskell and makes use of the inherent laziness and algebraic data-structures in the language. Since I’m using Erlang, which lacks inherent laziness and discourages global information, I had to take a different approach.

The strategy I decided on goes like this:

At the start of the parse, create an ets table for memoizing results. Put the handle to this table in the process dictionary.
Track the intermediate results, keyed on the index into the input, in the ets table.
Track the current index in the ets table as well (although this is not strictly necessary, it simplifies portions of the code).
Wrap each rule/reduction in a function that handles the memoization automatically.
Destroy the ets table when the parse ends and remove the handle from the process dictionary.

ETS tables were a great fit for this problem because they are cheap to create and use, and essentially global. Although Erlang is great at urging you to remove side-effects from your programs, this was a case where a side-effect was necessary and expedient, and it plays nicely by cleaning up its mess when done.

Steps 1 and 5 happen in the outermost parse function, which I use an Erlang parse-transform to generate via the included header file we saw in the last installment. Let’s take a look at that function and then we’ll walk through the steps. We’ll use the example from our arithmetic grammar, which is available in tests/examples/arithmetic.erl

parse(Input) ->
  peg:setup_memo(arithmetic),
  Result = case additive(Input) of
             {AST, []} ->
                AST;
             fail -> fail
           end,
  peg:release_memo(),
  Result.

It creates the ets table, parses by calling the root rule, releases the table, and returns the result. Pretty straight-forward.

Step 1 – Create the `ets` table

Note: All of the functions below are in the peg module.

setup_memo(Name) ->
  TID = ets:new(Name, [set]),
  put(ets_table, TID),
  set_index(0).

This function is so short it almost needs no explanation. We create a new ets table, using the set type since we will have at most one entry per index into the input. We put that table ID into the process dictionary so we can grab it later, and then we set the stored input index to zero. Here’s what set_index/1 and its pair index/0 do, just so you know when we see them again later:

set_index(Value) ->
  ets:insert(get(ets_table), {current_index, Value}).

index() -> 
  ets:lookup_element(get(ets_table), current_index, 2).

Note how we get the ETS table ID out of the process dictionary from the key where we put it before. Again, this is probably not The Right Way™ to do things, but it works.

Step 2 & 3 – Track intermediate results and the input position

Here’s the meat of the process. This function is heavily commented, so I’ll just give some commentary at the end.

p(Inp, Name, ParseFun) ->
  p(Inp, Name, ParseFun, fun(N) -> N end).
p(Inp, Name, ParseFun, TransformFun) ->
  % Record the starting index
  StartIndex = index(),
  % Grab the memo table from ets
  Memo = get_memo(StartIndex),
  % See if the current reduction is memoized
  case dict:find(Name, Memo) of
    % If it is, set the new index, and return the result
    {ok, {Result, NewIndex}} ->
      set_index(NewIndex),
      {Result, lists:nthtail(NewIndex - StartIndex, Inp)};
    % If not, attempt to parse
    _ ->
      case ParseFun(Inp) of
        % If it fails, reset the index, memoize the failure
        fail ->
          memoize(StartIndex, dict:store(Name, fail, Memo)),
          set_index(StartIndex),
          fail;
        % If it passes, advance the index, memoize the result.
        {Result, InpRem} ->
          NewIndex = StartIndex + (length(Inp) - length(InpRem)),
          Transformed = TransformFun(Result),
          memoize(StartIndex, dict:store(Name, {Transformed, NewIndex}, Memo)),
          set_index(NewIndex),
          {Transformed, InpRem}
      end
  end.

In our previous style of creating higher-order functions, we pass the peg:p/4, two functions, one that performs the syntactic analysis (the parser function), and one that performs the semantic analysis, or transforms the AST into its final form. By default (using peg:p/3), we don’t transform anything.

Now let’s look at the functions we haven’t seen yet, memoize/2 and get_memo/1. These manage the memoization table for us.

memoize(Position, Struct) ->
  ets:insert(get(ets_table), {Position, Struct}).

get_memo(Position) ->
  case ets:lookup(get(ets_table), Position) of
    [] -> dict:new();
    [{Position, Dict}] -> Dict
  end.

When memoizing, it essentially overwrites any record that was already in the table. When retrieving a memoized result, it creates a new dict if the input position hasn’t been memoized before.

Step 4 – Wrap rules in our memoization function

Now we just have to wrap our existing parser functions in the memoization routine. In the actual library, this is accomplished with a parse-transform, but I’ll pull an example from the expanded “arithmetic” parser that I use for testing.

decimal(Input) ->
  peg:p(Input, decimal, fun(I) ->
                            (peg:charclass("[0-9]"))(I)
                        end,
       fun(Node) -> transform(decimal, Node) end).

I also introduce a semantic transformation function, but as we saw above, that is optional.

Step 5 – Clean up the memo table

Just as straightforward as our setup function is the teardown one:

release_memo() ->
  ets:delete(get(ets_table)),
  erase(ets_table).

Conclusion

So there you have it, we’ve added packrat-style memoization to our already capable parser library. I’d appreciate any feedback you can give regarding the style and execution of the library, and as always, patches are greatly appreciated.