Create your own Programming Language and its Compiler.

Compiler

A compiler is a software program that transforms source code written in a high-level programming language into an executable machine code (or another language as in our example, Java) that can be run on a computer.

A typical compiler consists of several components or phases that work together to translate the source code of a program into executable code. Here are the main components of a compiler:

  1. Lexical Analyzer: The lexical analyzer, also known as the scanner, reads the source code and breaks it up into a stream of tokens. Tokens are the smallest meaningful units of the programming language, such as keywords, identifiers, operators, and constants.
  2. Syntax Analyzer: The syntax analyzer, also known as the parser, takes the stream of tokens generated by the lexical analyzer and builds a parse tree or an abstract syntax tree (AST). The parse tree represents the structure of the program according to the rules of the programming language’s grammar.
  3. Semantic Analyzer: The semantic analyzer checks the parse tree for semantic errors and inconsistencies. It verifies that the types of expressions and variables are consistent and that all variables are declared before they are used. It also performs type inference, where the compiler infers the type of a variable based on its usage.
  4. Code Generator: Simply takes the AST as a model of the program, anf generates the target code.
  5. Symbol Table: The symbol table is a data structure used by the compiler to store information about variables, functions, and other program elements.
  6. Error Handler: The error handler is responsible for detecting and reporting errors in the source code.

The Example:

We will create a language called L and compile it to a Java target.

The components:

  1. Grammer File

We will use ANTLR here,
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator that can be used to create parsers, lexers, and compilers for programming languages, configuration files, data formats, and other structured text formats. It uses a combination of lexer and parser grammars to parse input text and generate abstract syntax trees (ASTs) that can be used by compilers or interpreters to execute the program.

A detailed ANTLR setup and use guide here: Java with ANTLR | Baeldung

An example code for our language is the following:

def mul[x: Int, y: Int]: Int <- x * y;
var x: Int <- 2;
var y: Int;
y <- x * 2 + x;
mul[x, y+2];

We can define the rules for this language in ANTLR g4 file:

grammar L;

l: statement*
;

statement: function_def
| variable_def
| variable_assign
| function_call
;

function_def : 'def' ID '[' parameters? ']' ':' type assign SC
;

parameters : parameter (',' parameter)*
;

parameter : ID ':' type
;

function_call : ID '[' arguments? ']' SC
;

arguments : argument (',' argument)*
;

argument : expression
;

variable_def : 'var' ID ':' type assign? SC
;

variable_assign : ID assign SC
;

assign: '<-' expression
;

type : 'Int' | 'String' | 'Boolean' | ID
;

expression : value
| ID
| '(' expression ')'
| expression '+' expression
| expression '-' expression
| expression '*' expression
| expression '/' expression
| expression '>' expression
| expression '<' expression
| expression '>=' expression
| expression '<=' expression
| expression '==' expression
| expression '!=' expression
| '!' expression
| expression '&&' expression
| expression '||' expression
;

value : INT
| STRING
| TRUE
| FALSE
;

INT : ('0'..'9')+
;

STRING : '"' (ESC | ~["\\])* '"'
;

fragment ESC : '\\' ["\\/bfnrt]
;

TRUE : 'true'
;

FALSE : 'false'
;

ID : [a-zA-Z][a-zA-Z0-9]*
;

SC : ';'
;

WS : [ \t\r\n]+ -> skip
;
  • l rule: This is the start rule, which matches zero or more statements.
  • statement rule: This matches one of four types of statements: function definitions, variable definitions, variable assignments, or function calls.
  • function_def rule: This matches a function definition, which consists of a name, optional parameters, a return type, and a body. The body is an expression that gets assigned to the function name.
  • parameters rule: This matches a comma-separated list of parameters for a function.
  • parameter rule: This matches a single parameter, which consists of a name, a colon, and a type.
  • function_call rule: This matches a function call, which consists of a function name and zero or more arguments.
  • arguments rule: This matches a comma-separated list of arguments for a function call.
  • argument rule: This matches a single argument, which is just an expression.
  • variable_def rule: This matches a variable definition, which consists of a name, a colon, a type, and an optional assignment.
  • variable_assign rule: This matches a variable assignment, which consists of a variable name and an expression to assign to it.
  • assign rule: This matches an assignment operator and an expression.
  • type rule: This matches a type, which can be Int, String, Boolean, or any other identifier.
  • expression rule: This matches a basic expression, which can be a value, a variable, a parenthesized expression, or a combination of expressions using arithmetic, logic, or comparison operators.
  • value rule: This matches a literal value, which can be an integer, a string, true, or false.

Tokens:

  • INT rule: This matches an integer literal.
  • STRING rule: This matches a string literal, which can include escape sequences.
  • ESC rule: This matches escape sequences inside a string literal.
  • TRUE and FALSE rules: These match the Boolean literals true and false.
  • ID rule: This matches an identifier, which is used for variable names, function names, and type names.
  • SC rule: This matches a semicolon, which is used to terminate statements.
  • WS rule: This matches whitespace and skips it.

2. The AST

Our AST has the interface

public interface Statement {
public String getType();
}

And the following implements it

public class VariableAssign implements Statement {
private Identifier identifier;
private Assign assign;

@Override
public String getType() {
return "variable_assign";
}
}

public class VariableDef implements Statement {
private Identifier identifier;
private Boolean initialized;
private Assign assign;

@Override
public String getType() {
return "variable_def";
}
}

public class FunctionCall implements Statement {
String name;
List<Argument> arguments;

@Override
public String getType() {
return "function_call";
}
}

public class FunctionDef implements Statement {
String name;
List<Parameter> parameters;
Assign assign;

@Override
public String getType() {
return "function_def";
}
}

These are Java classes that implement the Statement interface and represent the different types of statements that can be parsed from the L language.

VariableAssign represents a statement that assigns a value to a variable. It contains an Identifier object for the variable being assigned and an Assign object for the assignment expression.

VariableDef represents a statement that defines a variable. It contains an Identifier object for the variable being defined, a boolean flag for whether the variable is initialized, and an Assign object for the assignment expression (if the variable is initialized).

FunctionCall represents a statement that calls a function. It contains the name of the function and a list of Argument objects for the function arguments.

FunctionDef represents a statement that defines a function. It contains the name of the function, a list of Parameter objects for the function parameters, and an Assign object for the function body.

The full code of the AST: l-language-compiler/src/main/java/samsaydali/l/ast at master · samsaydali7/l-language-compiler · GitHub

3. Lexical and Syntax Analyser:

This code is an implementation of an ANTLR4 visitor for a programming language. The visitor parses the input code and build a list of statements, which can then be used for further analysis or execution.

public class LVisitor extends LBaseVisitor {

public List<Statement> statements = new LinkedList<>();
ScopeService scope = new ScopeService();

@Override
public Object visitStatement(LParser.StatementContext ctx) {
Object result = super.visitStatement(ctx);
statements.add((Statement) result);
return result;
}

@Override
public Statement visitVariable_def(LParser.Variable_defContext ctx) {
VariableDef.VariableDefBuilder builder = VariableDef.builder();

// Add the identifier
Identifier identifier = new Identifier(ctx.ID().getText(), new Type(ctx.type().getText()));
scope.addIdentifier(identifier);
builder.identifier(identifier);

// Add the assign if exists
if (ctx.assign() != null) {
Assign assign = buildAssign(ctx.assign(), identifier.getType());
builder.initialized(true);
builder.assign(assign);
}

return builder.build();
}

@Override
public Statement visitVariable_assign(LParser.Variable_assignContext ctx) {
Identifier identifier = scope.getIdentifier(ctx.ID().getText());
Assign assign = buildAssign(ctx.assign(), identifier.getType());
return new VariableAssign(identifier, assign);
}

@Override
public Statement visitFunction_def(LParser.Function_defContext ctx) {
String name = ctx.ID().getText();
List<Parameter> parameters = new LinkedList<>();
scope.openScope();
for (int i = 0; i < ctx.parameters().parameter().size(); i++) {
LParser.ParameterContext pc = ctx.parameters().parameter().get(i);
Identifier identifier = new Identifier(pc.ID().getText(), new Type(pc.type().getText()));
scope.addIdentifier(identifier);
parameters.add(new Parameter(identifier, i));
}
Assign assign = buildAssign(ctx.assign(), new Type(ctx.type().getText()));
scope.closeScope();
FunctionDef def = new FunctionDef(name, parameters, assign);
scope.addFunction(def);
return def;
}

@Override
public Statement visitFunction_call(LParser.Function_callContext ctx) {
String name = ctx.ID().getText();
List<Argument> arguments = new LinkedList<>();
for (int i = 0; i < ctx.arguments().argument().size(); i++) {
LParser.ArgumentContext arg = ctx.arguments().argument().get(i);
Expression expression = buildExpression(arg.expression());
arguments.add(new Argument(i, expression));
}
// Validate function exists and it's arguments
scope.getFunctionDef(name, arguments);
return new FunctionCall(name, arguments);
}

private Assign buildAssign(LParser.AssignContext ctx, Type type) {
return new Assign(type, buildExpression(ctx.expression()));
}

private Expression buildExpression(LParser.ExpressionContext ctx) {

if (ctx.value() != null) {
return getValueExpression(ctx.value());
} else if (ctx.ID() != null) {
return getIDExpression(ctx.ID().getText());
} else if (ctx.getChildCount() == 3) {
Expression left = buildExpression(ctx.expression(0));
Expression right = buildExpression(ctx.expression(1));
String operator = ctx.getChild(1).getText();
return getBinaryExpression(operator, left, right);
} else if (ctx.getChildCount() == 2) {
String operator = ctx.getChild(0).getText();
Expression right = buildExpression(ctx.expression(0));
return getUnaryExpression(operator, right);
} else {
return null;
}
}

private Expression getUnaryExpression(String op, Expression rhs) {
Expression.ExpressionBuilder builder = Expression.builder();
builder
.expressionType(Expression.ExpressionType.UNARY)
.op(op)
.rhs(rhs);
return builder.build();
}

private Expression getBinaryExpression(String op, Expression lhs, Expression rhs) {
Expression.ExpressionBuilder builder = Expression.builder();
builder
.expressionType(Expression.ExpressionType.BINARY)
.op(op)
.lhs(lhs)
.rhs(rhs);
return builder.build();
}

private Expression getIDExpression(String name) {
Expression.ExpressionBuilder builder = Expression.builder();
builder
.expressionType(Expression.ExpressionType.ID)
.ID(scope.getIdentifier(name));
return builder.build();
}

private Expression getValueExpression(LParser.ValueContext context) {
Expression.ExpressionBuilder builder = Expression.builder();
builder.expressionType(Expression.ExpressionType.VALUE)
.value(new Value(getValueExpressionType(context), context.getText()));

return builder.build();
}

private Type getValueExpressionType(LParser.ValueContext context) {
if (context.STRING() != null) return new Type("String");
else if (context.INT() != null) return new Type("Int");
else if (context.TRUE() != null || context.FALSE() != null) return new Type("Bool");
else return new Type("String");
}
}

The visitor defines methods for each rule in the ANTLR4 grammar, and each method builds a corresponding statement or expression object, which is then added to the list of statements.

The visitor also manages a scope (Symbol Table), which is used to keep track of variables and functions declared in the program. The scope is updated as new variables and functions are declared, and it is used to look up identifiers when they are referenced in the code (Semantic Analyzer and Error Handler).

4. Symbol Table

public class Scope {

List<Identifier> identifiers = new LinkedList<>();
List<FunctionDef> functions = new LinkedList<>();
Scope parent = null;

public void addIdentifier(Identifier identifier) {
Optional<Identifier> check = identifiers.stream().filter(i -> i.getName().equals(identifier.getName())).findFirst();
if (check.isPresent()) {
throw new IdentifierAlreadyInScope(identifier.getName());
}
identifiers.add(identifier);
}

public void addFunction(FunctionDef functionDef) {
Optional<FunctionDef> check = functions.stream().filter(i -> i.getName().equals(functionDef.getName())).findFirst();
if (check.isPresent()) {
throw new IdentifierAlreadyInScope(functionDef.getName());
}
functions.add(functionDef);
}

public Identifier getIdentifier(String id) throws IdentifierNotFoundInScope {
Optional<Identifier> identifier = identifiers.stream().filter(i -> i.getName().equals(id)).findFirst();
if (identifier.isPresent()) {
return identifier.get();
}
if (parent != null) {
return parent.getIdentifier(id);
}
throw new IdentifierNotFoundInScope(id);
}

public FunctionDef getFunction(String id, List<Argument> arguments) throws IdentifierNotFoundInScope {
Optional<FunctionDef> function = functions.stream().filter(i -> i.getName().equals(id)).findFirst();
if (function.isPresent()) {
List<Parameter> parameters = function.get().getParameters();
if (!(arguments.size() == parameters.size())) {
// Validate argument count matching parameter count.
throw new ArgumentsNotCompliantWithParametersList(id);
}
for (int i = 0; i < parameters.size(); i++) {
// Validate argument types matching parameter types.
if (!arguments.get(i).getExpression().resultType().equals(parameters.get(i).getIdentifier().getType())) {
throw new ArgumentsNotCompliantWithParametersList(id);
}
}
return function.get();
}
if (parent != null) {
return parent.getFunction(id, arguments);
}
throw new IdentifierNotFoundInScope(id);
}

public void clear() {
identifiers.clear();
}
}

This is a Java class called Scope that is used to keep track of variables and functions within a specific scope of a program.

The class has three fields:

  • identifiers: A List of Identifier objects representing variables in the current scope.
  • functions: A List of FunctionDef objects representing functions in the current scope.
  • parent: A reference to the parent Scope object. This is used to look up variables and functions in outer scopes.

The class has several methods:

  • addIdentifier(): Adds an Identifier object to the identifiers list, but checks for duplicates first.
  • addFunction(): Adds a FunctionDef object to the functions list, but checks for duplicates first.
  • getIdentifier(): Returns an Identifier object given its name. If the identifier is not found in the current scope, it looks in the parent scope (if it exists). If the identifier is still not found, it throws an IdentifierNotFoundInScope exception.
  • getFunction(): Returns a FunctionDef object given its name and a list of arguments. It checks that the number of arguments matches the number of parameters, and that the types of the arguments match the types of the parameters. If the function is not found in the current scope, it looks in the parent scope (if it exists). If the function is still not found, it throws an IdentifierNotFoundInScope exception.
  • clear(): Clears the identifiers list. This is used to reset the scope between function calls.
public class ScopeService {

Scope currentScope = new Scope();

public void openScope() {
Scope newScope = new Scope();
newScope.setParent(currentScope);
currentScope = newScope;
}

public void closeScope() {
currentScope.clear();
currentScope = currentScope.getParent();
}

public void addIdentifier(Identifier identifier) throws IdentifierAlreadyInScope {
currentScope.addIdentifier(identifier);
}

public Identifier getIdentifier(String id) throws IdentifierNotFoundInScope {
return currentScope.getIdentifier(id);
}

public void addFunction(FunctionDef def) throws IdentifierAlreadyInScope {
currentScope.addFunction(def);
}

public FunctionDef getFunctionDef(String id, List<Argument> arguments) throws IdentifierNotFoundInScope {
return currentScope.getFunction(id, arguments);
}

}

ScopeService that provides methods for managing scopes and identifiers within them. The class has an instance variable called currentScope, which is initially set to a new Scope object.

The openScope method creates a new Scope object, sets its parent to the current scope, and makes the new scope the current scope.

The closeScope method clears the current scope and sets the current scope to its parent.

The addIdentifier method adds an Identifier object to the current scope using the addIdentifier method of the current scope.

The getIdentifier method retrieves an Identifier object by name from the current scope using the getIdentifier method of the current scope.

The addFunction method adds a FunctionDef object to the current scope using the addFunction method of the current scope.

The getFunctionDef method retrieves a FunctionDef object by name and arguments from the current scope using the getFunction method of the current scope.

5. Java Code Template and Code Generation

FreeMarker is a Java-based template engine used for generating dynamic content, such as HTML web pages, emails, and documents. It allows developers to separate the presentation layer from the business logic layer by using templates. FreeMarker templates are essentially text files with placeholders that are replaced with dynamic content when the template is processed.

Details at: FreeMarker Java Template Engine (apache.org)

Java target template file:

package samsaydali.l.codegen;

public class Main {

public static void main(String[] args) {
<#list statements as statement>
<#if statement.getType() == "variable_def">
${statement.identifier.type.type} ${statement.identifier.name}<#if statement.initialized??> = ${statement.assign.expression.getText()}</#if>;
</#if>
<#if statement.getType() == "variable_assign">
${statement.identifier.name} = ${statement.assign.expression.getText()};
</#if>
<#if statement.getType() == "function_call">
${statement.name}(${statement.arguments?map(a -> a.expression.getText())?join(', ')?no_esc});
</#if>
</#list>

}

<#list statements?filter(s -> s.getType() == 'function_def') as function>
private static int ${function.name}(${function.parameters?map(p -> p.identifier.type.type + " " + p.identifier.name)?join(', ')?no_esc}) {
return ${function.assign.expression.getText()};
}
</#list>
}

FreeMarker template used to generate code based on a list of statements. It contains several conditional blocks that check the type of each statement and generate the appropriate code based on that type.

For example, if the statement is a variable definition, the template generates a line of code that declares a variable with the specified type and name, optionally initialized to the value of the expression on the right-hand side of the equals sign. If the statement is a function call, the template generates a line of code that calls the specified function with the given arguments.

Additionally, the template contains a loop that iterates over the list of statements and generates code for each function definition found in the list. Each function definition is turned into a private static method with the same name as the function and the appropriate parameter types and names, and the method body is set to the value of the expression assigned to the function.

It’s worth noting that the <#if> and <#list> statements are part of the FreeMarker syntax, and are used here to conditionally generate code based on the type of each statement and to loop over the list of statements to generate code for each function definition.

The Final App

public class LCompiler {

public LVisitor visit(String lInput) {
LLexer lexer = new LLexer(CharStreams.fromString(lInput));
CommonTokenStream tokens = new CommonTokenStream(lexer);
LParser parser = new LParser(tokens);

LVisitor visitor = new LVisitor();
parser.l().accept(visitor);

return visitor;
}

public List<Statement> getAST(LVisitor visitor) {
return visitor.statements;
}

public JavaGenerator getGenerator() {
return new JavaGenerator();
}

public String generateCode(List<Statement> ast) throws TemplateException, IOException {
return getGenerator().generate(ast);
}

public String compileToJava(String lInput) throws TemplateException, IOException {
LVisitor visitor = visit(lInput);
List<Statement> ast = getAST(visitor);
String javaCode = generateCode(ast);
return javaCode;
}

}

The LCompiler class is the main class responsible for compiling L code to Java code.

The visit method takes the L input, initializes a lexer and a parser using ANTLR, and passes the parsed AST to an instance of LVisitor, which is responsible for visiting the nodes of the AST and building a list of Statement objects.

The getAST method simply returns the list of Statement objects built by the visitor.

The getGenerator method creates and returns a new instance of JavaGenerator, which is a class responsible for generating Java code from a list of Statement objects.

The generateCode method takes a list of Statement objects and generates Java code using the JavaGenerator class.

Finally, the compileToJava method takes the L input, compiles it to a list of Statement objects using visit, generates Java code from the list of Statement objects using generateCode, and returns the resulting Java code.

The Full code at: GitHub — samsaydali7/l-language-compiler

Leave a Comment

Your email address will not be published. Required fields are marked *