Recently I got quite interested in the lambda calculus, more specifically its untyped variant. The lambda calculus is a model of computation that forms the basis of many functional programming languages such as Haskell (which is typed) or Scheme. For quite some time I had also wanted to build an interpreter for a programming language, so I decided to give it a try and build one for the lambda calculus.
I will call this interpreter lambda.
Overview
Before we get started, let’s get a quick overview of how the interpreter for lambda (and most interpreters for other languages, really) will be built. It consists of three basic parts: a lexer, a parser and an evaluator, each of them serving a different purpose:
- The lexer takes a program in the form of an input string and turns it into a list of tokens or lexemes.
- The parser takes these tokens and composes them into an abstract syntax tree (or “AST”) according to the grammar of the language.
- The evaluator finally takes the AST and evaluates its statements and expressions according to evaluation rules and yields some behaviour.
This project is somewhat large, so my writing about it will focus on the grammar and lexer of lambda.
The Grammar
The grammar for the untyped lambda calculus is fairly simple. Here it is in EBNF:
variable = "v" | variable, "'";
application = term, " ", term;
abstraction = "λ", letter, ".", term;
term = variable | "(", application, ")" | "(", abstraction, ")";
Examples of valid terms (“lambda-terms”) based on this grammar are:
v
(v v')
(λv.v)
(λv.(v v'))
((λv.((λv'.(v' v'')) v)) v''')
This actually looks quite complicated. There are many parentheses and the variables are only distinguished by the number of ' suffixes.
So let’s make some small changes to the lambda calculus' grammar for our convenience:
First, we replace the variable rule by the following one:
variable = "a" | "b" | ... | "z";
While the original rule technically allowed for an infinite number of variables, which is important for some proofs, such as Barendregt’s Substitution Lemma, we very likely won’t run into issues by limiting ourselves to just 26 variable names.
Secondly, we add a rule to allow omitting the outermost parentheses:
lambdaterm = variable | application | abstraction | term;
Now, we make a small change to the abstraction rule:
abstraction = "\", letter, "." , ( abstraction | term );
This allows us to write nested abstractions more easily, without having to use many parentheses, for example λx.λy.(x y).
Finally, we also replace the λ symbol with \ to make it easier to type.
As a result, our valid lambda-terms from before would now look like this:
v
x y
\x.x
\x.(x y)
(x y) (x y)
\x.\y.(x y)
Of course we can still use the parentheses if we want to.
The final grammar now looks like this:
variable = "a" | "b" | ... | "z";
application = term, " ", term;
abstraction = "\", letter, "." , ( abstraction | term );
term = variable | "(", application, ")" | "(", abstraction, ")";
lambdaterm = variable | application | abstraction | term;
Implementing the lexer
Now that we have the grammar settled, we can start implementing the lexer. Its main purpose is to split the incoming stream of characters into so-called “tokens” or “lexemes”. These tokens will later be consumed by the parser to compose the Abstract Syntax Tree according to our grammar.
Let’s start by defining a Token structure and all the tokens we need:
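A minimal sketch of this in Go could look like the following; the exact Kind names (IDENT, LAMBDA, DOT, LPAREN, RPAREN, SPACE) and the field name Position are just one possible choice:

package token

// Kind identifies the type of a token.
type Kind string

const (
    IDENT   Kind = "IDENT"   // a variable name (a single letter)
    LAMBDA  Kind = "LAMBDA"  // "\"
    DOT     Kind = "DOT"     // "."
    LPAREN  Kind = "LPAREN"  // "("
    RPAREN  Kind = "RPAREN"  // ")"
    SPACE   Kind = "SPACE"   // " " separating the two terms of an application
    EOF     Kind = "EOF"     // end of the input stream
    ILLEGAL Kind = "ILLEGAL" // anything the grammar does not allow
)

// Token holds the kind of token, its literal value as it appeared in
// the input, and the position of its first character.
type Token struct {
    Kind     Kind
    Literal  string
    Position int
}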
The Token struct will hold three things we need later: the Kind of token we are dealing with, the Literal, that is, the “value” of the token as it appears in the input stream, and the position of the token’s first character.
We have a Kind for each type of token that can appear according to our grammar. Notice how we do not differentiate between upper- and lower-case letters for identifiers yet. EOF and ILLEGAL are special. They signify the end of the input stream as well as illegal characters (such as numeric digits).
The lexer itself is pretty straightforward. It exposes a single method we call Next(), which emits whatever is the next token in the input. Internally, it will read byte-by-byte as much input as needed.
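A sketch of this, with a small constructor added so that the first byte is loaded right away (the field names and the import path are just placeholders):

package lexer

import "lambda/token" // import path depends on the repository layout

// Lexer turns an input string into a stream of tokens.
type Lexer struct {
    input        string // the program being tokenized
    position     int    // index of the byte currently under inspection
    readPosition int    // index of the next byte to be read
    ch           byte   // the byte currently under inspection
}

// New creates a Lexer and loads the first byte of the input.
func New(input string) *Lexer {
    l := &Lexer{input: input}
    l.readChar()
    return l
}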
Here we define our Lexer, which holds a copy of the input string, some information about the current and next read position, and the current byte we are inspecting.
readChar() is defined like this:
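A version of it could look roughly like this:

// readChar reads exactly one byte of input. Once the input is
// exhausted, ch is set to the zero byte to mark the end of the input.
func (l *Lexer) readChar() {
    if l.readPosition >= len(l.input) {
        l.ch = 0
    } else {
        l.ch = l.input[l.readPosition]
    }
    l.position = l.readPosition
    l.readPosition++
}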
It reads exactly one byte of input and updates the internal state of the lexer accordingly.
Finally, let’s have a look at Next():
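Sticking with the token kinds sketched above, it could look like this:

// Next emits the next token found in the input.
func (l *Lexer) Next() token.Token {
    var tok token.Token

    switch l.ch {
    case '\\':
        tok = newToken(token.LAMBDA, l.ch, l.position)
    case '.':
        tok = newToken(token.DOT, l.ch, l.position)
    case '(':
        tok = newToken(token.LPAREN, l.ch, l.position)
    case ')':
        tok = newToken(token.RPAREN, l.ch, l.position)
    case ' ':
        tok = newToken(token.SPACE, l.ch, l.position)
    case 0:
        // readChar() sets ch to the zero byte at the end of the input.
        tok = token.Token{Kind: token.EOF, Literal: "", Position: l.position}
    default:
        if isLetter(l.ch) {
            // Variables are single characters, so the byte itself is the literal.
            tok = newToken(token.IDENT, l.ch, l.position)
        } else {
            tok = newToken(token.ILLEGAL, l.ch, l.position)
        }
    }

    l.readChar()
    return tok
}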
It is basically just a switch statement that returns the correct kind of token based on the current char under inspection. There are two special cases here:
First, we check for the zero byte 0. It is set by readChar() once we have reached the end of the input. Accordingly, the token to return is token.EOF.
Secondly, in the default case we check whether l.ch (the current byte) is actually a letter we want to allow. If that is the case, we can plug the byte right into the token Literal. This is possible because, according to our grammar, all identifiers for variables have to be a single character. Other lexers usually have a function called consumeBytes() or similar which reads a sequence of bytes that qualify as a single identifier and returns them for the token literal.
In case our byte is not a valid (lowercase) character, we return token.ILLEGAL, as we have checked all other valid cases already.
For completeness, here are the definitions of newToken and isLetter:
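They could be as simple as this:

// newToken builds a Token from a single byte.
func newToken(kind token.Kind, ch byte, pos int) token.Token {
    return token.Token{Kind: kind, Literal: string(ch), Position: pos}
}

// isLetter reports whether ch is a lowercase ASCII letter, the only
// characters our grammar allows as variable names.
func isLetter(ch byte) bool {
    return 'a' <= ch && ch <= 'z'
}

With all of this in place, tokenizing a term such as \x.(x y) is just a matter of calling Next() until token.EOF comes back. A small example program (the import paths are again placeholders):

package main

import (
    "fmt"

    "lambda/lexer"
    "lambda/token"
)

func main() {
    l := lexer.New(`\x.(x y)`)
    for tok := l.Next(); tok.Kind != token.EOF; tok = l.Next() {
        fmt.Printf("%-8s %q\n", tok.Kind, tok.Literal)
    }
}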
Conclusion
And that is already it for this part of the interpreter. In the future we will explore the parser, which takes the tokens emitted by the lexer and composes them into the Abstract Syntax Tree. You can find all the code shown here in the lambda GitHub repository.