Expression Programs and x86 Assembly

2. Expression Programs and x86 Assembly#

2.1. Assignment overview#

This assignment covers the language of expression programs, their semantic analysis, and compilation to a subset of x86. There are 5 tasks and 12 questions in this assignment. Questions marked for glory are optional. Do them only if you feel ambitious and have the time.

2.1.1. What you need to get started#

Everything from Assignment 1, including x86.ml file that is the same as before.
Programming in OCaml. We recommend you familiarize yourself with everything up to and including Section 5, plus Section 7.1, of the OCaml book. This includes the following concepts.
1. Programming with environments (aka associative maps) in OCaml. See the following resources in the OCaml book:
  - Chapter 3.8 has a basic example of creating environments using associative lists.
  - Chapter 5.9 explains how to use the Map functionality from the OCaml standard library.
2. Higher-order programming in OCaml, e.g., using functions such as List.fold_left. See Chapter 4 of the OCaml book.
3. Basics of modular programming in OCaml, in particular the use of functors when working with the standard library. See Chapter 5 of the OCaml book.
4. Programming with references in OCaml. See Chapter 7.1 of the OCaml book.

2.1.2. What you need to hand in#

Please hand in a .zip file containing the following

A brief report documenting your solution. Acceptable report formats are .pdf, .rtf, and .md. For each question and for each task, briefly (1 – 4 sentences) describe your implementation or answer. Write concisely.
All the source files needed to reproduce your solution. Three pieces of code from the assignment description are to be copied into your solution. These are
1. The OCaml code for the eprog type declaration below in Section 2.2.
2. The OCaml code for the semant_result type declaration in Section 2.3.
3. x86.ml OCaml file. Do not modify this file.
Your project must compile.

We recommend that you organize your codebase into several files (remember also that each file is a module in OCaml): one for the definition of eprog and its pretty printing, one for example programs, one for the semantic analysis, one for the evaluation, and one for the translation. Please take a look at the Dune build management for OCaml. A relatively simple dune configuration should be sufficient for this project. Your solution must work with OCaml 5.2.0 on a modern x86-64 Linux system, such as Ubuntu 22.04 LTS.

Important

Make sure to understand all the code you hand in, including what is copied from here.

2.2. Expression programs#

This assignment revolves around a simple programming language that we call the language of expression programs. This language is a direct extension of the language of arithmetic expressions from the first assignment. It has the following AST.

(* -- Use this in your solution without modifications *)
(* Defining the type for binary operations *)
type binop =
  | Add | Sub | Mul | Div

type varname = string                (* variable names are strings *)

(* Defining the type for arithmetic expressions *)
type expr =
  | Int of int                       (* Integer constant *)
  | BinOp of binop * expr * expr     (* Binary operation *)
  | Var of varname                   (* Variable lookup  *)


(* Defining the type of statements *)  
type estmt = 
  | Val of varname * expr            (* Binding variable to a value *)
  | Input of varname                 (* Input statement *)

(* Expression program is a list of statements 
   followed by an expression *) 
type eprog = estmt list * expr

The core change from the arithmetic expressions is the introduction of immutable variables. There are two ways to bind a variable to a value.

The input statement reads an integer value from the console and binds it to the variable.
The val statement evaluates an arithmetic expression and binds the result to the variable.

We refer to the above operations as expression statements, and represent them using the type estmt.

Because variable bindings are immutable, they can only be bound once. To refer to variables, the declaration for expressions now includes a variable lookup constructor, Var of varname. An expression program is a list of statements (that is, either input or val bindings) followed by one return expression. The type eprog captures this in its definition as a tuple consisting of two items

A list of expression statements estmt, and
An expression

For representing variables, we use the built-in OCaml string type.

2.2.1. Concrete syntax#

For concrete syntax, we adopt the following notation

Input statements are written using the syntax input x
Value binding statements are written using the syntax val x = e, where e is the binding expression.
Return expression is written as return e.

For example, the following concrete syntax

input x
val y = x + 1
return x + y

represents the program given by this AST

let eprog_01: eprog = (
      [ Input "x" ; Val ("y", BinOp (Add, Var "x", Int 1)) ],
      BinOp (Add, Var "x", Var "y"))

Task 1: Pretty printer for expression programs

Write a function string_of_eprog that has type eprog -> string for pretty printing expression programs using the above notation. You probably want to reuse and extend the pretty printer for expressions you have written in the first assignment.

2.3. Semantic analysis#

Semantic analysis is a compilation phase that reports type and other semantic errors in the program. In our case, we have two kinds of errors to report

Undeclared variables. An important consideration in language design is what to do with undeclared variables. For example, the undeclared variable x in the program
```
val y = x + 1
return y
```
In this assignment, we want to prevent such programs and therefore report undeclared variables as errors.
Duplicate bindings. Because our language does not have any notion of nested scopes and mutation, there is little value in duplicate bindings, e.g., the following program will be rejected because of the duplicate binding of y
```
input y
val y = x
return y + 2
```

We use the following Ocaml declarations for error reporting

(* -- Use this in your solution without modifications --  *)
type semant_error 
  = Undeclared of varname 
  | Duplicate of varname


type semant_result
  = Ok
  | Error of semant_error list 

Note

If your semantic analysis and expression program declarations live in different modules (as we suggested earlier) remember to open the expression program module.

Task 2: Semantic analysis

Write a function semant that has type eprog -> semant_result that returns whether the program has any type errors, in which case the errors are collected in a list of type semant_error.

Question 1: Does semant need to be recursively inspect its argument? Why or why not? What auxiliary functions do you define? Are any of them recursive or not and why?
Question 2: Describe how you keep track of variable declarations in your implementation. What data structure(s) do you use?
Question 3 (glory *): Implement the warning of unused variables. Extend the definition of semant_result to incorporate a warning possibility and extend the implementation of semant appropriately.

2.4. Evaluator#

Our evaluator takes programs that are accepted by the semantic analysis and runs them.

Task 3: Evaluation of expression programs
Write a function eval that has the type eprog -> int for evaluating expression programs. For evaluating the input statements, you can use something like the following OCaml code

let eprog_input() 
   = Printf.printf "Please enter an integer: " ; read_line () |> int_of_string

Question 4: In the above code, ; is the OCaml sequencing and |> is the OCaml pipeline operator. How would you write the implementation of eprog_input using just the let expressions without these operators?

As you further work on this task, answer the following questions.

Question 5 (glory): The function int_of_string that we use above raises an exception if its argument is not an integer. How would you handle this situation?
Question 6: What data structure(s) do you use in the implementation of the interpreter?
Question 7: What runtime errors are possible during the execution of your interpreter? Are they preventable, and how?

2.5. Compiling to x86 assembly#

Task 4: Compiling expression programl to x86

Write a function eprog_to_x86 that has the type eprog -> X86.prog that compiles expression programs to x86. For reading input from the console, use the following C function

#include <stdio.h>         /* make sure these two includes are    */
#include <inttypes.h>      /* present in the start of your C file */

int64_t read_integer () {
    int64_t value;
    printf("Please enter an integer: ");
    scanf("%" PRId64 "" , &value);
    return value;
}

For reporting the result of the program, use the function print_int, as in Assignment 1.

The following two sections describe one potential approach to this task. Note that this is only one of the ways of implementing this task. Other approaches are possible, and you are welcome to pursue them in your assignment; in that case it is still a good idea to understand the approach outlined below.

2.5.1. Stack organization via spilling#

A simple (admittedly quite inefficient) way of x86 code generation is via spilling, where each local variable has a reserved slot on the stack. Additionally, the results of the intermediate binary operations are likewise stored on the stack in a reserved location. The following image illustrates such a stack organization.

2.5.2. Implementation via spilling#

To implement code generation via spilling, we need a number of building blocks.

Introduce a layout type that is an associative map for mapping variables to their respective slot positions on the stack.
Introduce a counter for keeping track of the temporaries for the intermediate results of the binary operations.
Break down your implementation into auxiliary functions, such as
- cg_stmt of type layout -> estmt -> X86.ins list that takes the layout, a single statement, and returns a list of instructions corresponding to the assembly code for that statement. The layout is needed to store the result of the statement in the right stack slot.
- cg_expr of type layout -> expr -> X86.ins list * X86.operand that takes the layout, a single expression, and returns a pair of the instructions corresponding to the assembly code for that expression and an X86 operand that has the result of that expression. The layout is needed to read variables from the stack.
  
  This programming pattern of returning a list of instructions (or something akin to that) and an operand, indicating where to find the result after those instructions are executed, is commonplace for “flattening” phases, as is the case here because we translate from a tree data structure (expressions) to a list data structure (assembly instructions).
  
  For example, for integer expression (Int n), function cg_expr can return a pair [], (~$ n). For binops, the instructions list will include moving data into the right registers, doing the binop, and storing the result on the stack in a temporary slot. The operand will be the offset from the base pointer to the temporary. For example, if the temporary has index j the operand is the displacement X86.Ind3 (Lit(-8*(N+j+1)), Rbp), where N is the number of the regular locals. Remember to increase the temp counter, when needed.

One more concern is what information we need to generate the correct function prologue and epilogues. For example, the function prologue must include an instruction for extending the stack – such as subtracting from the %rsp register. But how much to subtract? The answer depends on two things.

The number of the variables in the source program. This is easy to compute by, e.g., taking the length of the list of expression statements.
The number of temporaries needed for the code generation of all of the expressions. This number is not as readily available, because it depends on the expressions in the rest of the program. This means that the code generation of the prologue might need to be postponed until the code generation of the rest of the program is complete. This is an interesting, but not unusual in compilation, phenomenon: the prologue – an early section of the x86 program – turns out to be the one generated at the very end of the translation process!

As you further work on this assignment, answer the following questions.

Question 8 (glory *)

The C function scanf that we use here provides no means of error reporting or error detection when the input is malformed. How would you implement a robust input of integers from a console in C?

Question 9

What data structure(s) do you use in the implementation of the compiler?

Question 10

Describe inefficiencies that are present in your solution, if any.

Question 11 (glory ** )

Pick one or several optimizations to implement in your compilation, and generate optimized code. Explain which optimizations you are implementing and how they affect the produced code. What data structures and intermediate representations do you use in the implementation of these optimizations?

Extra-glory: can you benchmark the quantitative improvements from your optimizations? Explain your benchmarking approach.

Task 5: Putting it all together, examples, and cross-testing
In this task, we put together all the ingredients developed in the earlier tasks.

Write a function interpret of the type eprog -> unit that takes an expression program as its input, runs the semantic analysis, and depending on the result of the semantic analysis either prints out the errors or proceeds to evaluate the program using the eval function you have implemented.
Write a function compile of the type eprog -> unit that takes an expression program as its input, runs the semantic analysis, and depending on the result of the semantic analysis either prints out the errors or proceeds to compile the program into assembly.
Write 5 example programs that cover all the features of the language.
Cross-test that both evaluation and compilation agree on the results.

Question 12.: Why are your example programs good programs for testing your implementations? Did you discover any bugs during the cross-testing?