Arithmetic Expressions in OCaml and x86 Assembly

1. Arithmetic Expressions in OCaml and x86 Assembly#

1.1. Assignment overview#

This assignment covers the topics of program representation via Abstract Syntax Trees (ASTs), implementation of an evaluator for arithmetic expressions in OCaml, and pretty printing. There are 3 tasks and 9 questions in this assignment. One of the questions is marked for glory. This means it is optional and you should do it if you feel ambitious.

1.1.1. What you need to get started#

Make sure you have a working installation of OCaml and clang. The course development container should have the necessary pre-requisites already installed.
We assume that by now you are familiar with OCaml basics.
We assume that you are familiar with basic assembly programming.
Download the auxiliary file x86.ml from Brightspace. Do not edit it.

1.1.2. What you need to hand in#

Please hand in a .zip file containing the following

A brief report documenting your solution. Acceptable report formats are .pdf, .rtf, and .md. For each question and for each task, briefly (1 – 4 sentences) describe your implementation or answer. Write concisely.
All the source files needed to reproduce your solution. Two pieces of code from the assignment description are to be copied into your solution. These are
1. The OCaml code for the expr type declaration below.
2. x86.ml OCaml file. Do not modify this file.
We do not prescribe how to organize your solution in terms of modules or folder structure, but note that a relatively simple organization should be sufficient for this assignment. Your solution must work with OCaml 5.2.0 on a modern x86-64 Linux system, such as Ubuntu 22.04 LTS (for example in the course’s Docker container).

Your project must compile and your report should contain concrete brief instructions about how to do it, e.g.,
by running make in the root folder (in that case make sure to include the Makefile), or a call to dune.

Important

Make sure to understand all the code you hand in, including what is copied from here.

1.2. Arithmetic expressions#

We start with the study representing arithmetic expressions. For this, we introduce the following OCaml declarations.

(* -- Use this in your solution without modifications -- *)
(* Defining the type for binary operations *)
type binop = Add | Sub | Mul | Div

(* Defining the type for arithmetic expressions *)
type expr 
  = Int of int                       (* Integer constant *)
  | BinOp of binop * expr * expr     (* Binary operation *)

Here are a few examples of values of type expr.

let expression_01 = Int 5                                         (* 5       *)

let expression_02 = BinOp (Add, Int 1, expression_01)             (* 1+5     *)

let expression_03 = BinOp (Mul, BinOp (Add, Int 2, Int 2), Int 2) (* (2+2)*2 *)

let expression_04 = BinOp (Add, Int 2, BinOp (Mul, Int 2, Int 2)) (* 2+2*2   *)

Given just the above, there is not much we can do yet. Our first task is to write an evaluator.

1.2.1. Evaluator#

Task 1: Evaluator for arithmetic expressions

Write a function eval that has type expr -> int for evaluating arithmetic expressions.

As we can see from the type, the function eval should take one argument of type expr and return an integer. For example, eval expression_03 should return 8. As you work on this task, answer the following questions:

Question 1: Does this function need to recursively explore its argument, and why (or why not)?
Question 2: Why does this function have the return type int? What other return types may be suitable?
Question 3: How does your evaluation handle the case of division by zero? Note that it may be just fine to not special-treat division by zero, but it is important you understand what actually happens at runtime.

1.2.2. Pretty printer#

A pretty printer is a function that takes an internal representation of a data structure, e.g., an AST such as expr and produces its textual representation, e.g., as a string.

Task 2: Pretty printer for arithmetic expressions

Write a function string_of_expr that has type expr -> string for pretty printing arithmetic expressions.

The output of the pretty printer should be in the format that is accepted by OCaml’s REPL, such as utop, using infix notation. For example, the following are all acceptable results for string_of_expr expression_04

2+2*2
2+(2*2)
(2+(2*2))
((2)+((2)*(2)))

As you work on this task, feel free to define additional functions, if needed. You may use the built-in function string_of_int for converting integers to strings, and string concatenation operation ^, e.g., "Hello " ^ "world". As you work on this task, answer the following questions:

Question 4: Are the functions you have defined recursive, and why (or why not)?
Question 5 (glory): Make sure you avoid unnecessary parentheses, assuming standard precedence. Explain how you implement this. If you are not confident about your solution to this question, make sure you submit a proper separate answer to question 4.

1.2.3. Using both the evaluator and the pretty printer#

With both the evaluator and the pretty printer ready, they can be put together, for example using the following expression.

let expressions 
	= [ expression_01 ; expression_02 ; expression_03 ; expression_04] in
let print_expr e = Printf.printf ("%s = %d\n") (string_of_expr e) (eval e) in 
List.map print_expr expressions

The exact output will depend on your implementation of the printer. Here is one example output.

= 5
+ 5 = 6
(2 + 2) * 2 = 8
+ 2 * 2 = 6

1.3. x86 representation#

We now switch our attention to a low-level program representation. Our main tool here is an AST for x86 programs, and its associated pretty printer, provided in the file x86.ml. We use this AST representation to construct assembly programs in OCaml. There are several advantages to using an AST instead of directly representing assembly programs as strings.

Separation of concerns. A dedicated representation lets us focus on the semantics of the programs we produce, while the pretty printer handles the syntactic details.
Well-formedness. A well-designed AST helps in avoiding nonsensical programs, e.g., referring to a non-existing register. Additional well-formedness checks can be programmed as separate passes over the AST.
Opportunities for further analysis. AST data structures are essential for implementing other analyses or optimization, e.g., dead-code elimination.

Checkpoint

Before proceeding further, please download and study the provided x86.ml file.

1.3.1. Using x86 AST#

Next, we see how to use the x86 AST. We proceed with the following steps.

Construct an OCaml value corresponding to an assembly program, and print it.
Compile the resulting assembly program.
Run the compiled binary and output the return value.

1.3.1.1. Producing assembly program programmatically#

The following OCaml declaration constructs an AST for an assembly program that returns value 23.

let asm_example =
  let open Asm in                  (* Open Asm module locally; this brings
                                      the helper functions from that module 
                                      into the local scope *)
   [ gtext "example"               (* Global label *)
      [ (Movq, [~$ 23; ~% Rax])    (* Store 23 in register %rax. 
                                      This is the register we must use 
                                      for returning values from a function
                                      according to System V ABI *)
      ; (Retq, [])                 (* Return instruction *)
   ]]

As you study the above example, answer the following questions:

Question 6: What is the OCaml type of asm_example?
Question 7: How would you rewrite this to use the Asm module without the local module open directive?
Question 8: Could we have used text instead of gtext? Why?

Note

Some platforms, e.g., Intel Macs, require the exported labels to be prefixed with an underscore, e.g., _example. This is known as name mangling, and has been historically used to prevent collision with reserved names [Lev00]. Linux does not require this.

We can pretty print the above program as follows

let _ = Printf.printf "%s\n" (string_of_prog asm_example)

This will produce the output

        .text
        .globl  example
example:
        movq    $23, %rax
        retq

1.3.1.2. Compiling the assembly program#

To run this assembly program, we follow the approach where we link it against a C program that calls the function that we have declared, in this case: example. This step depends on the following.

The text of the assembly program is saved in a .s file so that it can be processed by clang, e.g., example.s.

There is a C wrapper that calls our example function. We can use a simple program like this

/* main.c: C wrapper for our assembly program */
#include <stdio.h>
extern int example (); /*  the name of our function */

int main() {
    int result = example();
    return result;
}

Save this file as main.c

Call clang to compile and link both files
```
clang main.c example.s
```

Note

Arm Mac users (M1/M2), prepend the call to clang with arch -x86_64, e.g., arch -x86_64 clang main.c example.s. You need Rosetta 2 installed for this. See also the note on mangling above.

This will create a binary with the default name a.out. If you want a different name for the output, use the compiler -o flag.

1.3.1.3. Executing the produced binary#

We run the produced binary as follows

./a.out

We can inspect the output by prompting the exit status shell variable $?. For example, if we run

echo $?

immediately after the previous command, we should get the value 23.

1.3.1.4. Handling return values greater than 255#

If you modify the program to return a value greater than 255, e.g., replace the literal 23 with 2023, you will run into an issue: POSIX reserves only one byte for process return values. This restriction is justified in the context of operating systems because process return values are typically used for distinguishing normal vs. erroneous exits.

Question 9: What output do you get from echo $? when returning the value 2023?

To handle results greater than 255, we need a workaround. Instead of returning the value to the OS, we will print it on the console. We extend our C program with a function for printing integers, print_int, that can be called from the assembly. The resulting OCaml and C programs look as follows.

(* Ocaml expression that constructs the assembly program. *)
let dolphin_main =
  let open Asm in  
   [ gtext "example" [
      (Pushq, [~% Rbp])               (* Stack alignment before call *)
   ;  (Movq, [~$ 2023;  ~% Rdi ] )    (* System V ABI requires passing the 
                                         first argument via register %rdi *)
   ;  (Callq, [ ~$$ "print_int"])     (* Calling the C function to print *)
   ;  (Movq, [~$ 0; ~% Rax])          (* Return 0 to indicate normal exit *)
   ;  (Popq, [~% Rbp])                (* Stack re-alignment *)
   ;  (Retq, [])                       
   ]] 

/* main.c: C wrapper for assembly programs */
   
#include <stdio.h>
extern int example ();

int main() {
    int result = example();
    return result;
}

void print_int (int x) {
    printf ("%d\n", x);
}

If you repeat the steps earlier with pretty printing of the assembly text, saving it as example.s, and compiling and running the resulting binary ./a.out, we will get the output

It is no longer necessary to echo $? other than to ensure that the program exits normally with 0.

Finally, now that we have introduced both the high-level and the low-level representations, we can work with both of them.

Task 3: Translate the following 5 OCaml expressions to assembly

Consider the following OCaml declarations

let task3_exp1 = BinOp (Add, Int 20, BinOp (Mul, Int 26, Int 58))
let task3_exp2 = BinOp (Mul, Int 5, BinOp (Div, Int 1, Int 10))
let task3_exp3 = BinOp (Sub, Int 31, Int 870)
let task3_exp4 = BinOp (Mul, BinOp (Add, BinOp (Div, Int 6, BinOp (Add, Int 10, Int 49)), Int 10), 
          BinOp (Add, BinOp (Sub, BinOp (Mul, Int 70, Int 77), BinOp (Div, Int 12, Int 9)), Int 5))
let task3_exp5 = BinOp (Sub, BinOp (Div, Int 34, Int 72), BinOp (Div, Int 17, Int 46))

Create a list of OCaml values, named task3_asm1, task3_asm2, … that are x86 assembly translations of these arithmetic expressions. These translations are to be done manually, without introducing any automation. Your translation should not implement any shortcuts or optimizations, e.g., returning pre-computed constant outputs (we study optimizations later in the course). Neither should it be unnecessarily clunky.

For each expression, cross-reference the output from your evaluator with the output from running the translated assembly. Describe in your report how you implement cross-referencing, and whether you found any inconsistencies during this process.