Rift – Lexical Analysis

In which we ramble about Flex, scanners and regular expressions.

To put it bluntly, a compiler basically reads plain text files and generates something from them. That something may be x86 code, C code, LLVM code, even things like Python code, etc. For example, gcc reads a few files containing C code and generates a binary file ready to be executed. The first step in this complex process is called lexical analysis. To develop this part of the Rift compiler a program called flex is going to be used.

So what is lexical analysis? Lexical analysis deals with recognizing certain words and symbols in the plain text file and assigning them meaning. A plain text file is just a sequence of characters, nothing more. Lexical analysis goes through this sequence and is able to recognize which part of that sequence is a reserved word like if, while, int, which part is a string, an integer, etc. The sequence of characters then gets turned into a sequence of tokens, which is then used by the other parts of the compiler.

To illustrate this idea take the following Rift code snippet:

void say(char str[]) {
    write(str);
}

async say("Not a Go");
say("Rip Off");

For a programmer the content of the code is obvious. There is the definition of a function along with two calls for that function. One of those calls actually starts a new thread running the say function. For the compiler however, those are just characters. They are a v, an o, a space, a newline, etc. After lexical analysis this code will be a sequence of tokens. The first four characters will be recognized as the token T_VOID. The characters of the string “Rip Off” will be recognized as a T_STRING token. The string token will actually have the value of the string associated with it in some form, but more on that later.

In the end of the lexical analysis the compiler will “understand” the first line of the code like this:

# Original Line
void say(char str[]) {

# How the compiler looks at it during this phase
T_VOID T_ID T_PAR_O T_CHAR T_ID T_BRACK_O T_BRACK_C T_PAR_C T_BRACE_O

So there is a token for the void word, the tokens for the parenthesis ( and ), which are T_PAR_O and T_PAR_O respectively and so on. We no longer see loose characters.

This process is usually combined with the parsing process, which will be discuss in a later post. This process is also known as scanner. So how do we implement a scanner for this project? Using Flex.

Flex is lexical analyser generator, which means that it is a program that generates a specific scanner for our needs. To generate this scanner we have to define a file called scanner.l, for this project the initial version of this file is in this commit at Github. To fully understand the contents of this file you should read the flex’s manual available online in this link. There really isn’t any tutorial that can explain how to make a scanner better then the manual, which is quit good.

To understand the very basics note that one of the first things the file defines is the following:

ID [a-zA-Z_][a-zA-Z0-9_]*

INT_B2_LIT 0b[0-1]+
INT_B8_LIT 0[0-7]+
INT_B10_LIT [0-9]+
INT_B16_LIT 0x[0-9a-fA-F]+
FLOAT_LIT ([0-9]*\.[0-9]+)|([0-9]+\.[0-9]*)

STRING_LIT \"(\\.|[^\\"])*\"
CHAR_LIT \'(\\.|[^\\'])\'

COMMENT "#".*"\n"
IGNORE [ \t\n]

These are regular expressions that define how a few tokens look like. The first one defines how an identifier looks like. So basically any sequence of characters that starts with a letter or underscore and is followed by either a letter, number or underscore is recognized as an identifier. Any sequence that starts with a 0, followed by a b, followed by a few 0s and 1s is the literal binary representation of an integer. Similar for the for the other bases and float numbers. Also similar for the other regular expressions.

The rest of the file has instructions that look like the following:

"int" { return RET_YY_T(T_INT); }
{ID} { return T_ID; }

The first line tell the scanner what to do when it recognizes the word int. What happens is that the scanner will execute the code in front of it. The RET_YY_T is a macro that gets expanded to yylval.token = T_INT. The parser function, which is yet to be defined, will receive the T_INT token from the scanner indicating that the reserved word int was scanned.

The second line is similar, with the difference that it just identifies an identifier token and it uses a regular expression ID defined before instead of a literal value. This rule is actually incomplete, but the missing part will be added later when another part of the parser is done;

Tags: , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: