A basic Rust compiler built with C
Rust is a multi-paradigm system programming language focused on safety, especially safe concurrency. This compiler takes handles the following constructs:
- If-Else constructs
- While constructs
- For constructs
- All arithmetic operators
- Single Line Comments (//)
- Multi Line Comments (/* … */)
Semantically we have checked the following:
- Whether any variable used on the RHS is defined and in the current scope or any Enclosing Scope of the current scope.
- Missing semicolons
- Assignment before Declaration
Symbol table has been created in the Lexical analyser phase. It is
implemented using nested structures which hold the lexeme, datatype, value, scope and line number.
It is assumed that each level of scope has 20 variables, however this can be modified easily. The Node structure is used to handle scope.
For the Abstract Syntax Tree, We have 2 Types of Nodes, Leaf nodes and Internal nodes. The nodes can have a variable number of children (0-3) depending upon the construct it represents.
To display the AST, We take the AST and store it as a matrix of levels. Leaf nodes in the AST representing identifiers, constants, Lists, packages point to a record in the symbol table.
The Intermediate code is generated by recursively stepping through the AST. 3 Address code format is used.
- Dead Code Elimination:
We Eliminate dead code, specifically unused variables in the whole program. For example, if we have the following lines of code:
a = 10
b = 10
c = a + b
And these 3 variables are not used on any other RHS, then these 3 lines of code are Eliminated during optimization. All the dead variables in the code are removed. - Elimination of common subexpressions
- Constant folding
Panic mode recovery method of error handling has been employed. This means that the compiler ignores the input until the next delimiter (semicolon here) is obtained.
The scope checking is handled by recursively stepping through enclosing scopes and finding the most recent definition of the variable. If no definition is found, we print the error.
We are using ARM 7 architecture. In this simple compiler, the variable is read from its slot in the memory to a register, performing operation and immediately writing it back into the memory. This way, we never really run out of registers because at most only 3 registers will be used. Liveness of variables is already taken care of at the code optimization phase.
ARM7 has 31 GPRs and 6 SPRs. The input to the assembly code
generator is the optimized ICG from the previous phase of building the compiler.
Token generation
lex lexical_analysis.l
Symbol table generation
yacc -d lexical_analysis.y
lex lexical_analysis.l
gcc lex.yy.c y.tab.c -ll
./a.out
AST generation
lex ast.l
yacc -d ast.y
lex 2.l
gcc lex.yy.c y.tab.c -ll
./a.out
Intermediate Code Generation
yacc -d icg.y
gcc lex.yy.c y.tab.c -ll
./a.out<test1.rs
Code Optimization
python3 code_optimize.py icg_ip.txt
Assembly code generation (ARM)
python3 acg.py icg_ip.txt
The Compiler can be further enhanced by adding
- Error recovery
- Better memory management
- Semantic analysis for Parameter matching
- More efficient optimization techniques
- Next use the algorithm for register allocation by making it a two pass assembler.
This project is licensed under the MIT License - see the LICENSE file for details.