Building a Parser-Generator in Erlang, Part 3
by Sean Cribbs
In this installment, I’m going to talk about adding packrat-style memoization to the parser library. If you’re curious about the details, you can get the code now! I’ve released it on GitHub as neotoma (UPDATE: renamed). As always, it is released under an MIT License, and I would appreciate contributions.
So, last time we talked about the combinators that the peg module provides. Now, if we built a parser with just these combinators, we’d be fine for short inputs. Given a sufficiently large input, however, we would see an explosion in the time it takes to parse! This is because a pure recursive-descent parser without prediction executes in O(2^N), that is, exponential, time. The parser will backtrack over and over when rules fail, without keeping any intermediate results that might have succeeded.
Packrat parsers, as described in Bryan Ford’s thesis, execute in linear time (O(N)) by memoizing intermediate results. In theory, this means that you construct a table of the parsing results at each input position for each rule in the grammar (known as tabular parsing). In practice, you keep some kind of “global” two-dimensional data structure that stores the results of attempted parses for later use, but you don’t bother to compute the result of every rule at every position. The primary limitation is the classic trade-off in computer science: you sacrifice memory for speed. Some empirical studies of packrat parsers show usage of around 400 bytes of memory per byte of input. While that’s a large number, it generally stays a constant proportion of the input size.
One of the biggest challenges for me was teasing apart Ford’s theory from his implementation. His packrat parser is written in Haskell and makes use of that language’s inherent laziness and algebraic data types. Since I’m using Erlang, which lacks inherent laziness and discourages global state, I had to take a different approach.
The strategy I decided on goes like this:
1. At the start of the parse, create an ets table for memoizing results. Put the handle to this table in the process dictionary.
2. Track the intermediate results, keyed on the index into the input, in the ets table.
3. Track the current index in the ets table as well (although this is not strictly necessary, it simplifies portions of the code).
4. Wrap each rule/reduction in a function that handles the memoization automatically.
5. Destroy the ets table when the parse ends and remove the handle from the process dictionary.
ETS tables were a great fit for this problem because they are cheap to create and use, and essentially global. Although Erlang is great at urging you to remove side-effects from your programs, this was a case where a side-effect was necessary and expedient, and it plays nicely by cleaning up its mess when done.
Steps 1 and 5 happen in the outermost parse function, which I generate with an Erlang parse-transform via the included header file we saw in the last installment. Let’s take a look at that function and then we’ll walk through the steps. We’ll use the example from our arithmetic grammar, which is available in tests/examples/arithmetic.erl.
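Here’s roughly what the generated function looks like. This is a minimal sketch: the helper names setup_memo/0 and release_memo/0, the use of additive as the root rule, and the exact shapes of the return values are placeholders rather than the library’s literal output.

```erlang
%% Sketch of the generated entry point (names and return shapes are
%% placeholders). It performs Steps 1 and 5 around the actual parse.
parse(Input) ->
  peg:setup_memo(),                      % Step 1: create the memo table
  Result = case additive(Input) of      % call the root rule of the grammar
             {AST, []}    -> AST;       % success: all input consumed
             {_AST, Rest} -> {error, {unparsed, Rest}};
             fail         -> {error, fail}
           end,
  peg:release_memo(),                    % Step 5: destroy the memo table
  Result.
```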
It creates the ets table, parses by calling the root rule, releases the table, and returns the result. Pretty straightforward.
Step 1 – Create the ets table
Note: All of the functions below are in the peg module.
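A minimal sketch of the setup function (the name setup_memo/0 and the memo_table process-dictionary key are placeholders):

```erlang
%% Create the memo table and stash its handle in the process dictionary.
setup_memo() ->
  TableID = ets:new(memo_table, [set]),  % set: at most one entry per index
  put(memo_table, TableID),              % handle goes in the process dictionary
  set_index(0).                          % start parsing at position zero
```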
This function is so short it almost needs no explanation. We create a new ets table, using the set type since we will have at most one entry per index into the input. We put that table ID into the process dictionary so we can grab it later, and then we set the stored input index to zero. Here’s what set_index/1 and its pair index/0 do, just so you know when we see them again later:
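Again a sketch; the current_index key is a stand-in:

```erlang
%% Store the current input position in the memo table...
set_index(Index) ->
  ets:insert(get(memo_table), {current_index, Index}).

%% ...and read it back out.
index() ->
  [{current_index, Index}] = ets:lookup(get(memo_table), current_index),
  Index.
```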
Note how we get the ETS table ID out of the process dictionary from the key where we put it before. Again, this is probably not The Right Way™ to do things, but it works.
Step 2 & 3 – Track intermediate results and the input position
Here’s the meat of the process. This function is heavily commented, so I’ll just give some commentary at the end.
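In outline, it looks something like this. The sketch assumes each memo-table entry is a {Position, Dict} pair mapping rule names to results, as described below; the argument order and return shapes are my placeholders:

```erlang
%% peg:p/3 defaults to the identity transformation.
p(Input, Name, ParseFun) ->
  p(Input, Name, ParseFun, fun(AST) -> AST end).

%% Wrap a rule: consult the memo table before parsing, record after.
p(Input, Name, ParseFun, TransformFun) ->
  StartIndex = index(),                 % where we are in the input
  Memo = get_memo(StartIndex),          % results already computed here
  case dict:find(Name, Memo) of
    {ok, Result} ->
      Result;                           % this rule was already tried here
    error ->
      Result = case ParseFun(Input) of
                 fail ->
                   fail;                % failures are memoized, too
                 {AST, Remainder} ->
                   {TransformFun(AST), Remainder}
               end,
      memoize(StartIndex, dict:store(Name, Result, Memo)),
      Result
  end.
```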
In our previous style of creating higher-order functions, we pass peg:p/4 two functions: one that performs the syntactic analysis (the parser function), and one that performs the semantic analysis, transforming the AST into its final form. By default (using peg:p/3), we don’t transform anything.
Now let’s look at the functions we haven’t seen yet, memoize/2 and get_memo/1. These manage the memoization table for us.
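Sketches of both, under the same assumption that entries are {Position, Dict} pairs:

```erlang
%% Record the results for this input position; ets:insert/2 on a
%% set-type table overwrites any existing entry for the key.
memoize(Position, Struct) ->
  ets:insert(get(memo_table), {Position, Struct}).

%% Fetch the results for this input position, or a fresh dict if
%% nothing has been memoized there yet.
get_memo(Position) ->
  case ets:lookup(get(memo_table), Position) of
    []                 -> dict:new();
    [{Position, Dict}] -> Dict
  end.
```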
When memoizing, it essentially overwrites any record that was already in the table. When retrieving a memoized result, it creates a new dict if the input position hasn’t been memoized before.
Step 4 – Wrap rules in our memoization function
Now we just have to wrap our existing parser functions in the memoization routine. In the actual library, this is accomplished with a parse-transform, but I’ll pull an example from the expanded “arithmetic” parser that I use for testing.
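Something along these lines; the combinator names (peg:choose/1, peg:seq/1, peg:string/1) and the transform/2 helper stand in for the ones from the last installment:

```erlang
%% The additive rule from the arithmetic grammar, wrapped in peg:p/4
%% so its results are memoized automatically.
additive(Input) ->
  peg:p(Input, additive,
        fun(I) ->
            (peg:choose([peg:seq([fun multitive/1,
                                  peg:string("+"),
                                  fun additive/1]),
                         fun multitive/1]))(I)
        end,
        fun(Node) -> transform(additive, Node) end).
```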
I also introduce a semantic transformation function, but as we saw above, that is optional.
Step 5 – Clean up the memo table
Just as straightforward as our setup function is the teardown one:
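A sketch, mirroring the setup above:

```erlang
%% Destroy the memo table and remove the handle from the
%% process dictionary.
release_memo() ->
  ets:delete(get(memo_table)),
  erase(memo_table).
```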
Conclusion
So there you have it: we’ve added packrat-style memoization to our already capable parser library. I’d appreciate any feedback you can give regarding the style and execution of the library, and, as always, patches are greatly appreciated.