Xerl: empty modules

2013 30 Jan

Let's build a programming language. I call it Xerl: eXtended ERLang. It'll be an occasion for us to learn a few things, especially me.

Unlike in Erlang, everything in this language is an expression. This means that modules and functions are expressions, and indeed that you can have more than one module per file.

We are just starting, so let's not get ahead of ourselves here. We'll begin by writing the code that allows us to compile an empty module.

We will compile to Core Erlang: this is one of the many intermediate steps your Erlang code goes through before it becomes BEAM machine code. Core Erlang is a very neat language for machine-generated code, and we will learn many things about it.

Today we will only focus on compiling the following code:

mod my_module
begin
end

Compilation will be done in a few steps. First, the source file will be transformed into a list of tokens by the lexer. Then, the parser will take that list of tokens and convert it to an AST, giving semantic meaning to our representation. Finally, the code generator will transform this AST into a Core Erlang AST, which will then be compiled.

We will use leex for the lexer. It takes .xrl files, which are compiled to .erl files that you can then compile to BEAM. The file is divided into three parts: definitions, rules and Erlang code. Definitions and Erlang code are self-explanatory; rules are what concern us.

We only need two things: atoms and whitespace. An atom is a lowercase letter followed by any number of letters, digits, _ or @. Whitespace is either a space, a horizontal tab, \r or \n. Other kinds of whitespace exist, but we simply do not allow them in the Xerl language.
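As a rough sketch, the Definitions section for these two things could look like the fragment below. The macro names L, A and WS are the ones used by the rules in this article; the character classes are only my reading of the description above and may differ from the actual file.

```
Definitions.

% L: first character of an atom, a lowercase letter.
L = [a-z]
% A: any character allowed after the first one.
A = [a-zA-Z0-9_@]
% WS: space (\s), horizontal tab, carriage return or newline.
WS = [\s\t\r\n]
```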

Rules consist of a regular expression followed by Erlang code. The latter must return a token representation or the atom skip_token.

{L}{A}* :
    Atom = list_to_atom(TokenChars),
    {token, case reserved_word(Atom) of
        true -> {Atom, TokenLine};
        false -> {atom, TokenLine, Atom}
    end}.

{WS}+ : skip_token.

The first rule matches an atom, which is converted to either a special representation for reserved words, or an atom tuple. The TokenChars variable contains the matched text as a string, and the TokenLine variable contains the line number. The second rule simply discards whitespace. View the complete file.
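The first rule relies on a reserved_word/1 helper that lives in the Erlang code section of the .xrl file. A minimal sketch, assuming the only reserved words so far are the three that our empty module needs, might be:

```
Erlang code.

%% Reserved words become {Word, Line} tokens;
%% everything else becomes {atom, Line, Word}.
reserved_word('mod') -> true;
reserved_word('begin') -> true;
reserved_word('end') -> true;
reserved_word(_) -> false.
```

The list of reserved words here is an assumption based on the tokens used in this article, not a copy of the real file.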

We obtain the following result from the lexer:

[{mod,1},{atom,1,my_module},{'begin',2},{'end',3}]

The second step is to parse this list of tokens to add semantic meaning and generate what is called an abstract syntax tree. We will be using the yecc parser generator for this. This time it will take .yrl files, but the process is the same as before. The file is a little more complex than for the lexer: at the very least we need to define the terminals, the nonterminals, the root symbol and the grammar itself, and optionally some Erlang code.

To compile our module, we need a few things. First, everything is an expression. We thus need a list of expressions and individual expressions. We will support a single expression for now, the mod expression, which defines a module. And that's it! We end up with the following grammar:

exprs -> expr : ['$1'].
exprs -> expr exprs : ['$1' | '$2'].

expr -> module : '$1'.

module -> 'mod' atom 'begin' 'end' :
	{'mod', ?line('$1'), '$2', []}.

View the complete file.
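For the grammar above to be accepted by yecc, it has to be surrounded by the declarations mentioned earlier, and the ?line macro has to be defined somewhere. A plausible sketch of those missing pieces, inferred from the grammar and the tokens rather than copied from the real file, is:

```
Nonterminals exprs expr module.
Terminals atom 'mod' 'begin' 'end'.
Rootsymbol exprs.

%% ... the grammar rules shown above go here ...

Erlang code.

%% The line number is the second element of every token tuple.
-define(line(Tup), element(2, Tup)).
```

The definition of ?line is an assumption: it only needs to return the 1 from the {mod,1} token for the parser output to match what we see in this article.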

We obtain the following result from the parser:

[{mod,1,{atom,1,my_module},[]}]

We obtain a list containing a single mod expression, just like we wanted. The last step is generating the Core Erlang code from it.

Code generation generally comprises several steps. We will discuss these in more detail later on. For now, we will focus on the minimum needed for successful compilation.

We can use the cerl module to generate the Core Erlang AST. We will only be using its functions, which allows us to avoid learning and keeping up to date with the internal representation.

There's one important thing to do when generating the Core Erlang AST for a module: create the module_info/{0,1} functions. Indeed, the Erlang compiler adds these to every module before the code becomes Core Erlang, and so we need to replicate this ourselves. Do not be concerned, however, as this only takes a few lines of extra code.
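To give an idea of what this looks like with cerl, here is a minimal sketch that builds an empty module with both module_info functions. The module name empty_mod_sketch and the function core/1 are made up for this example; this is not the actual Xerl code generator, only an illustration of the cerl calls involved.

```erlang
-module(empty_mod_sketch).
-export([core/1]).

%% Build the Core Erlang AST of an empty module called Name (an atom).
%% Hypothetical helper for illustration, not the real Xerl code generator.
core(Name) ->
    Info0 = cerl:c_fname(module_info, 0),
    Info1 = cerl:c_fname(module_info, 1),
    Key = cerl:c_var('Key'),
    %% call 'erlang':'get_module_info'(Args...)
    Call = fun(Args) ->
        cerl:c_call(cerl:c_atom(erlang),
                    cerl:c_atom(get_module_info),
                    Args)
    end,
    Defs = [
        {Info0, cerl:c_fun([], Call([cerl:c_atom(Name)]))},
        {Info1, cerl:c_fun([Key], Call([cerl:c_atom(Name), Key]))}
    ],
    %% Module name, exported functions, function definitions.
    cerl:c_module(cerl:c_atom(Name), [Info0, Info1], Defs).
```

The resulting tree can be handed to compile:forms/2 with the from_core and binary options to get a loadable module, or to core_pp:format/1 to get a pretty-printed version.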

As you can see by looking at the complete file, the code generator echoes the grammar we defined in the parser, and simply applies the appropriate Core Erlang functions for each expression.

We obtain the following pretty-printed Core Erlang generated code:

module 'my_module' ['module_info'/0,
                       'module_info'/1]
    attributes []
'module_info'/0 =
    fun () ->
        call 'erlang':'get_module_info'
            ('empty_module')
'module_info'/1 =
    fun (Key) ->
        call 'erlang':'get_module_info'
            ('empty_module', Key)
end

For convenience I put all the steps together in a xerl:compile/1 function that you can use against your own .xerl files.
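If you want to picture how the steps chain together, a function like this could be sketched as follows. The module names xerl_lexer and xerl_parser, and the xerl_codegen:exprs/1 function, are assumptions made for the sake of the example; only string/1 and parse/1 are the standard entry points that leex and yecc generate.

```erlang
-module(xerl_sketch).
-export([compile/1]).

%% Hypothetical pipeline: lex, parse, generate Core Erlang, compile, load.
%% Returns the names of the modules that were loaded.
compile(Filename) ->
    {ok, Src} = file:read_file(Filename),
    %% leex-generated scanners export string/1.
    {ok, Tokens, _EndLine} = xerl_lexer:string(binary_to_list(Src)),
    %% yecc-generated parsers export parse/1.
    {ok, Exprs} = xerl_parser:parse(Tokens),
    %% Assumed to return one Core Erlang module per mod expression.
    Modules = xerl_codegen:exprs(Exprs),
    [load(Filename, Core) || Core <- Modules].

%% Compile a Core Erlang AST to a BEAM binary and load it.
load(Filename, Core) ->
    {ok, Name, Binary} = compile:forms(Core, [from_core, binary]),
    {module, Name} = code:load_binary(Name, Filename, Binary),
    Name.
```

Everything crashes on the first error, which is good enough while we are still experimenting.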

That's it for today! We will go into more detail on each step in the next few articles.