Archive | Rift RSS for this section

Rift – Syntactic Analysis

In which we define a big grammar to parse a program.

The parser for the Rift programming language compiler is done. It was developed using GNU Bison. The grammar file for the parser is in this file.

The rules of the grammar have two kinds of symbols. The terminals and the non-terminals. The non-terminals all have their own rules within the grammar and follow the conventions that all of it’s characters are lower case. The terminal symbols follow the convention of having uppercase characters and all start with T_. A rule from the grammar looks like the following:

var_decl: data_types T_ID T_SEMICOLON
| data_types T_ID T_EQUAL expr T_SEMICOLON
;

The terminal symbols in the grammar are called tokens and they came directly from the lexical analysis part of this compiler. This part, developed using Flex, defined a lot of rules that identified certain sequences of characters has a specific token. For example, the T_INT_B10_LIT token is used to expressed that a base 10 integer was found. Similarly, the T_EQUAL token is used when the = symbol is found.

Each grammar symbol has a type which is defined using the C, or in this case C++ types. Those type can be an integer, float, structure, class, etc. A special union is used in the grammar file to define the various types. Has of this version it looks like the following:

%union {
int token;
int ph;
}

It’s still very poor. It only defines two types, the token type which is an integer and a ph type, also an integer. The token type will be the type for every token returned by the scanner. Later that will also be changed, but for now it will be like this. The ph type stands for place holder and it will be the type for every non-terminal symbol. The non-terminal symbols will have a specific type later which will be defined using a class of the abstract tree. For now the placeholder is used because the tree was not developed yet.

The different rules should have an action associated with them that defines what the program should do when it recognizes that rule. For example, at same stage for parsing a rift file there would be a variable assign that would be recognized by this rule:

var_assign: postfix_expr T_EQUAL expr T_SEMICOLON { $$ = 10; };

The action of this rule is defined inside the curly braces. Inside the scope of the action there are variables that can be used and start with the dollar ($) symbol. This variables refer to the different symbols in the rule. In the example, $$ is the *var_assign* symbol, $1 is the postfix_expr symbol, $2 the T_EQUAL symbol, and so on. In the action the only thing that happens is that var_assign get the value 10 assign to it. Why?

Well, that happens because I just needed a placeholder action. Many of the rules don’t have any actions yet but some needed to have them to avoid some conflict. So I just left that in there. I’m sure that there might be a better wait for this temporary fix. But it works like this.

Rift – Lexical Analysis

In which we ramble about Flex, scanners and regular expressions.

To put it bluntly, a compiler basically reads plain text files and generates something from them. That something may be x86 code, C code, LLVM code, even things like Python code, etc. For example, gcc reads a few files containing C code and generates a binary file ready to be executed. The first step in this complex process is called lexical analysis. To develop this part of the Rift compiler a program called flex is going to be used.

So what is lexical analysis? Lexical analysis deals with recognizing certain words and symbols in the plain text file and assigning them meaning. A plain text file is just a sequence of characters, nothing more. Lexical analysis goes through this sequence and is able to recognize which part of that sequence is a reserved word like if, while, int, which part is a string, an integer, etc. The sequence of characters then gets turned into a sequence of tokens, which is then used by the other parts of the compiler.

To illustrate this idea take the following Rift code snippet:

void say(char str[]) {
    write(str);
}

async say("Not a Go");
say("Rip Off");

For a programmer the content of the code is obvious. There is the definition of a function along with two calls for that function. One of those calls actually starts a new thread running the say function. For the compiler however, those are just characters. They are a v, an o, a space, a newline, etc. After lexical analysis this code will be a sequence of tokens. The first four characters will be recognized as the token T_VOID. The characters of the string “Rip Off” will be recognized as a T_STRING token. The string token will actually have the value of the string associated with it in some form, but more on that later.

In the end of the lexical analysis the compiler will “understand” the first line of the code like this:

# Original Line
void say(char str[]) {

# How the compiler looks at it during this phase
T_VOID T_ID T_PAR_O T_CHAR T_ID T_BRACK_O T_BRACK_C T_PAR_C T_BRACE_O

So there is a token for the void word, the tokens for the parenthesis ( and ), which are T_PAR_O and T_PAR_O respectively and so on. We no longer see loose characters.

This process is usually combined with the parsing process, which will be discuss in a later post. This process is also known as scanner. So how do we implement a scanner for this project? Using Flex.

Flex is lexical analyser generator, which means that it is a program that generates a specific scanner for our needs. To generate this scanner we have to define a file called scanner.l, for this project the initial version of this file is in this commit at Github. To fully understand the contents of this file you should read the flex’s manual available online in this link. There really isn’t any tutorial that can explain how to make a scanner better then the manual, which is quit good.

To understand the very basics note that one of the first things the file defines is the following:

ID [a-zA-Z_][a-zA-Z0-9_]*

INT_B2_LIT 0b[0-1]+
INT_B8_LIT 0[0-7]+
INT_B10_LIT [0-9]+
INT_B16_LIT 0x[0-9a-fA-F]+
FLOAT_LIT ([0-9]*\.[0-9]+)|([0-9]+\.[0-9]*)

STRING_LIT \"(\\.|[^\\"])*\"
CHAR_LIT \'(\\.|[^\\'])\'

COMMENT "#".*"\n"
IGNORE [ \t\n]

These are regular expressions that define how a few tokens look like. The first one defines how an identifier looks like. So basically any sequence of characters that starts with a letter or underscore and is followed by either a letter, number or underscore is recognized as an identifier. Any sequence that starts with a 0, followed by a b, followed by a few 0s and 1s is the literal binary representation of an integer. Similar for the for the other bases and float numbers. Also similar for the other regular expressions.

The rest of the file has instructions that look like the following:

"int" { return RET_YY_T(T_INT); }
{ID} { return T_ID; }

The first line tell the scanner what to do when it recognizes the word int. What happens is that the scanner will execute the code in front of it. The RET_YY_T is a macro that gets expanded to yylval.token = T_INT. The parser function, which is yet to be defined, will receive the T_INT token from the scanner indicating that the reserved word int was scanned.

The second line is similar, with the difference that it just identifies an identifier token and it uses a regular expression ID defined before instead of a literal value. This rule is actually incomplete, but the missing part will be added later when another part of the parser is done;

Starting the Rift Programming Language

During my first semester of my Masters Degree I had two particular courses that I found quite interesting. One was called Advance Topics of Compilers and the other was Advance Topics of Distributed systems. The objective of the first course was the study of a more complex compiler design than the one given in a previous course of *Compilers*, while the second one was the study of distributed systems, programming for distributed systems, among other things.

After these two courses started I decided to develop a project that would be used for both courses. That project would be a programming language which would have a concurrency paradigm.

Languages

At first I started exploring other languages that had concurrency paradigms. Among others, I explored the Go Programming Language, which ended up being one of the main sources of inspiration for this project.

Go is a language that makes it simple to use threads, known by goroutines. To start a new thread the go command is used. To see it working visit this link.

go say('World')
say('Hello')

What the code above does is to execute the say() function twice, only the first function runs in a different goroutine than the second one. When executing this function you may see Hello followed by World or vice versa.

This is such a simple way to start a new thread. Just calling a function with go before the name of the function. This was something I had to had on my language.

One other thing Go had that was very interesting was Channels. Channels are a very simple way to send some value from a goroutine to another. See this example. So to send and receive, respectively, something from a Channel you do the following:

c <- 10
x := <- x

In the code above the c variable would be a channel and the x variable would be some integer variable. The first line is an example of sending the number 10 to channel c. In another thread you can simple call the second line (assuming c would be the same channel for both threads) to receive something from the channel.

Multithreading

Meanwhile back at the course about distributed systems I was exploring different ways to program algorithms that used threads. Among a few technologies I used the POSIX Threads standard, or pthreads. This standard defined an API that can be used with the C programming language to create multithreading programs.

I made some experiments with pthreads for a while by implementing a few algorithms.

At this moment I started thinking how I would implement the compiler in terms out outputted code. In this project I just couldn’t output any low level code like x86 assembly because I had never worked with x86 before. And to implement a compiler that turns a high level language with multithreading features into x86 in such limited time would have been just impossible. Well, not impossible, just very hard with a high probability of failure involved, all with limited time.

So I decided to just output C code. The outputted code would be C99 using pthreads.

Rift

Having those things I went to develop a concept for the language. That concept would be what the language would look like, what would it allow, what data types did it had, how the multithreading feature would look like, if it had classes, among many other things.

I came up with this end result.

The language also needed a name. Because this would have the notion of parallelism I started thinking about division. So I went to a synonymous dictionary and looked for synonymous of Division. The name Rift showed up. It sounded cool. And like this the development of the Rift Programming Language began!