Building lexer, AST and interpreter - The Cely Programming Language

Making a Lexer

The lexer was the easiest part, it basically iterate over the raw source code to annotate characters/words when it makes sense, we call them tokens. Let’s take the following lines as an example:

let name = "John";
let age = 23;

The lexer splits it in a more structured way to get the following tokens:

[
	{value = "let", type = Keyword},
	{value = "name", type = Identifier},
	{value = "=", type = Assign},
	{value = "John", type = Literal(String)},
	{value = ";", type = Separator},
	{value = "let", type = Keyword},
	{value = "age", type = Identifier},
	{value = "=", type = Assign},
	{value = "23", type = Literal(Number)},
	{value = ";", type = Separator}
]

NOTE

The above snippet is just a representation of an array of object, not real code.

It’s as simple as that, then the tokens are passed to the parser to generate an AST.

Generating the Abstract Syntax Tree (AST)

I restarted the AST from scratch twice.

The first time, I went with the most basic approach: allocating each node on the heap and using pointer to other nodes, it would have worked fine but I know I wanted to make an interpreter directly for the AST later and I also want to be able to execute Ciel code as plugin in C++ programs. Even if AST (tree-walker) interpreters probably have the worst performance, I can still make some choices to improve it, so to reduce the number of allocations and potential cache misses I did rewrite it from scratch using only one array and indexes.

The nodes are not aware of the AST itself, each node contain the indexes of the needed nodes to interpret the node or to go forward into the AST depending of the type of node. Most of the time the required/next nodes are placed right after the one that needs it.

Nodes are basically tagged unions, they can be a statement or an expression which are themself a tagged union of different kind of statements or expressions.

Let’s take an example of an “assign statement” that assigns the result of a addition to a variable, this is how the array looks like in memory:

Ciel AST assign statement

The array of nodes is wrapped into a class that make it easier to walk through the tree and the instance of this class is passed to the parser so it can add nodes.

Checking types and making symbol table

When the AST is complete, I make a first pass to check top level function declarations only and store them in a symbol table so I don’t need forward declarations.

Then I make a second pass to check all types for assignments, functions arguments, etc. During the second pass I also add more info into nodes to help the future interpreter, for example a function call initialy contains the identifier only, but during the second pass I add the node index of the function body (first node of the function) so the interpreter doesn’t need to check the symbol table and can directly jump to the function node and execute it

NOTE

At this point, I already started the interpreter and I’m developing both at the same time because when adding features I need to modify both at the same time, because I don’t know what I’m gonna need (it’s the first time I do it).

Just before the first pass I also did pass a list of builtin functions to populate the symbol table, it contains the function names, args, return types and callbacks to execute the functions for the interpreter.

Executing the program by interpreting the AST

Each statement node has a next_node_index property that contains the index of the next statement to be executed, however statements related to control flow (if, while) have a slight variation, for example the “if” statement has additional properties that contains the indexes of the nodes to execute if the condition is true or optionally if it’s not true.

There is also a “MemoryStack” structure that is essentialy an “array of hashmap” that contains the variable names and their values, each time we enter a block (could be a function body, an if statement, etc.) I push a new hashmap into the stack and pop it when the block ends.

Builtin function declarations were also added directly as nodes into the AST so there is no need to check the symbol table until the very last moment to get the callback and execute the builtin function.
For example when executing a “function” node I just need to jump to the declaration node (that might be really close and be in cache), iterate over the argument nodes to add the variables to the stack before either jumping to the function body or getting the callback of a builtin function.

Result

I can currently run this kind of script successfully and it really feel like an accomplishment :)

let name = "John";
let age = add(23, 1);

func add(a, b) {
	let res = a + b;
	return res + 1;
}

std.println(name + " has " + std.to_string(age) + " y.o");

return 42;

NOTE

Methods from std. are builtin functions, but there is no module or classes at the moment, the dot notation means nothing, it’s just that they’re currently allowed in variable/function names.

What’s next ?

Condition statement “if/else” and loop statement “while”
I need better error handling and messages, for now there is minimal indication (line, column and type of error)
Modules, or anything that allows me to use multiple files, I’ve not decided yet.