Building a Parser-Generator in Erlang, Part 4

Jun 21, 2009 by Sean Cribbs

Notes on my progress

Great news! I’ve crossed several milestones with the parser-generator that I’m anxious to tell you about. But first, two pieces of business.

In case you didn’t notice, I’ve renamed the library from “peggy” to “neotoma”. Neotoma is the genus under Rodentia encompassing packrats. Fitting, don’t you think? If you cloned the repository from github, please update your remotes appropriately.
In the last installment, I showed the code for the peg:p/3,4 memoization function. Well, I discovered it was fatally flawed — specifically related to tracking the “current index”. Certain strings were mysteriously getting rearranged! I bit the bullet and decided to be a good functional programmer, threading the current index through all of the parser combinators. This gives me the benefit of referential transparency, that is, no side-effects, and it’s easier to figure out where the index is advancing (only 3 parser-combinators are responsible). Additionally, it should be simple to extend the parser-combinators to track line and column numbers now, passing around a data structure instead of just an integer.

Milestone 1 – Metagrammar

I had mentioned in the first post that I had a prototype of a “metagrammar”, that is, a grammar for PEG grammars. On Monday night, after resolving the memoization issues, I was able to complete the semantic analysis/code-generation piece of the metagrammar, and now it can regenerate itself (at least the ‘rules’ portion).

Although it would be ideal to add semantic transformation details to the grammar file, I’m not quite ready to do that. I’d appreciate suggestions along those lines.

Milestone 2 – Code generation utilities

Once I had the metagrammar functioning properly, the next step was to build some tools around that portion of the code-generation for creating valid Erlang files that can be compiled into a parser. This was mostly busy-work, generating strings, writing files.

parse_transform

I’ve mentioned this parse_transform thing before, even almost to the point of teasing you with it. Now, my knowledge of how to build Erlang statements and terms programmatically is limited, but I will try to impart the knowledge I’ve gained while working on neotoma’s peg_transform.

Introduction

The gist of the parse-transform is this: Erlang will optionally pass the Abstract Syntax Tree (AST), structured as nested tuples and lists, to a function you define, following the parsing phase of compilation. The only restrictions are that your function return a valid AST and that the original code is valid Erlang.

This is both quite dangerous and quite powerful. Just like the open-classes feature of Ruby, one must use the power for good, and not for awesome. Sagely, the Erlang documentation advises against using it and warns that its usage is not supported.

Why

My purpose for delving into the dark art of the parse_transform was probably a little bit of selfishness. If I had come to Erlang from something like Java, I likely would have slogged through writing many repetitive lines of code without batting an eye. However, I came from Ruby and love the DRY principle. I was not about to write peg:p(...) 20 times in my hand-written parser. I thought, “These statements are so similar, there must be a way to at least hide the repetition. I should be able to support both generated and hand-written parsers with the same amount of elegance and simplicity.”

Getting started

The first step was to figure out what the AST looked like. This part was really simple – just pretty-print the terms without transforming them.

%% <a href="http://github.com/seancribbs/neotoma/blob/master/include/peg_i.hrl">include/peg_i.hrl</a>
-compile([{parse_transform, peg_itransform}]).

%% <a href="http://github.com/seancribbs/neotoma/blob/master/src/peg_itransform.erl">src/peg_itransform.erl</a>
-module(peg_itransform).
-export([parse_transform/2]).

parse_transform(AST, _Options) ->
  io:format("~n~p~n", [AST]),
  AST

In order to get a handle on both what I wanted to generate and what I wanted the original to look like, I created a pair of files. You’ve seen one of them before — the “arithmetic” parser.

By using the identity parse_transform shown above, I was able to figure out how to transform the rule(nonterminal) -> parsers form into the more verbose function form including peg:p().

Diving in the deep end

To complete the transform, I did lots of copying, pasting and modifying, especially with the two functions that are nearly module-independent, parse/1 and file/1. As Kevin warned me when I went down this path, copying and pasting and analyzing the AST by hand is necessary initially – you probably won’t figure it out otherwise. Here are some things I learned:

The simple types are easy to understand:
- {var, Line, 'VarName'}
- {atom, Line, some_atom}
- {integer, Line, 0}
- {float, Line, 3.14159}
- {tuple, Line, ListOfElements}
- The empty list: {nil, Line}
Lists are strange, but it makes sense when you think about head/tail logic in the language: {cons, Line, Head, Tail}. Single-item lists have a Tail of the empty list. Multi-item lists nest additional cons tuples in the Tail, eventually ending with the empty list.
match (representing a “match” statement with =) is a 4-tuple: {match, Line, LHS, RHS}.
call (call a function) is also a 4-tuple: {call, Line, FunSpec, ArgumentList}. FunSpec is one of these:
- An atom (as above), representing a function in the same module.
- A variable (as above), representing a “fun” (you hope)
- Any single substatement, hopefully resulting in a “fun”, or containing one directly. (This is used a lot when I turn the parser specs into “funs”.)
- A call to a function in another module, represented thus: {remote, Line, ModuleAtom, FunctionAtom}.
case statements, function declarations and “funs” all have an idea of multiple clauses, something which I’m loathe to go into here. The difference between a single clause and multiple clauses, as represented in the AST is slightly unintuitive. Have a look at my parse-transform module for some examples.
You might have noticed the Line in all the statements I described above. Yes, it’s everywhere.

So to bring this back around to something more practical, here’s how I’m generating the parse/1 function that goes into every parser.

generate_parse_function() ->
  ModName = ets:lookup_element(peg_transform, module, 2),
  [{_, Rule, Line}] = ets:lookup(peg_transform, root),
  {function,Line,parse,1,
   [{clause,Line,
     [{var,Line,'Input'}],
     [],
     [{call,Line,
       {remote,Line,{atom,Line,peg},{atom,Line,setup_memo}},
       [{atom,Line,ModName}]},
      {match,Line,
       {var,Line,'Result'},
       {'case',Line,
        {call,Line,{atom,Line,Rule},[{var,Line,'Input'},{integer,Line,0}]},
        [{clause,Line,[{tuple,Line,[{var,Line,'AST'},{nil,Line},{var,Line, '_Index'}]}],[],[{var,Line,'AST'}]},
         {clause,Line,[{var,Line,'Any'}],[],[{var,Line,'Any'}]}]}},
      {call,Line,{remote,Line,{atom,Line,peg},{atom,Line,release_memo}},[]},
      {var,Line,'Result'}]}]}.

The two lines at the top load some information I’d previously stored in an ETS table about the module being compiled and the root rule (with line number) of the parser. The rest interpolates that information into the near-verbatim copy of the function generated from the identity transform.

Conclusion

There are a lot of interesting uses for parse_transform, mostly for the sake of metaprogramming as I’ve done here. If you are interested in doing things at runtime, you might check out smerl. I invite your comments below.