Chapter 06 "Intermediate Code Generation"

CMPS 450 Lecture Notes - Week 9

resources:
llvm compiler infrastructure
Solaris elfdump

Ch 6.1 "Variants of Syntax Trees"

DAGs (directed acyclic graphs) can represent the order of operations for expressions in a language. For example, the expression
     A + B 
is this DAG:
     A → + ← B  
The edges leaving operand A and B are outgoing. This is always the case since operands must "happen-before" the operator. Operators, on the other hand, can have both incoming and outgoing edges. In this example, the + operator's incoming edges are from its left and right operand. Operators in an expression with more than one operator will have both incoming and outgoing edges; e.g., one expression becomes an operand to another expression. For example, assuming + is left associative, the expression
           A + B + C 
is this DAG, where the leftmost + has both incoming and outgoing edges:
          B 
          ↓ 
      A → + → + ← C 
Since a DAG is acyclic it can be drawn as a tree rooted in the last operation performed, where the arrows are implicitly directed up from the leaves towards the root of the tree. The vertices of the graph are tree nodes and the graph edges are tree branches directed from the leaves up to the root. Since the root of the tree denotes the last operation performed, an LVR (postorder) traversal of the tree is a valid topological sort of the DAG and correctly represents the semantics of the expression in terms of order of operations. The above DAG is the tree shown below. A postorder traversal results in (A + B ) + C:
              +
             / \
            +   C
           / \
          A   B 
Since operands can have outgoing edges only, operands will always be the leaves of the tree and operators the internal nodes. The order of operations corresponds to the order in which synthesized attributes of a syntax tree are decorated in bottom-up LALR (Look-Ahead Left-to-right scan Rightmost derivation) parsing.

The big advantage to using DAGs for expressions is that common subexpressions can be identified and combined. Take, for example, this expression

          a + a * (b - c) + (b - c) * d
To demonstrate the technique of combining, first combine the common sub-expression (b - c) into a single sub-expression (e.g., a sub-DAG). Do the same for common expression 'a'. When finished, there should be 9 vertices: 6 operators + 5 operands - 2 common expressions. The DAG below is the above expression shown in tree form, where the root + is the last operation performed. Note that the two * operators point to the same sub-DAG. The + and * in the left subtree also point to the terminal 'a' since both have 'a' as a left operand.
            
                      +
                    /    \ 
                   +      *
                  / \     /\ 
                  \  *   /  d   
                   \ / \/
                    a  -
                      /  \
                     b    c    
A postorder traversal of the tree will give you the same topological sort of the DAG as you had in the original tree, thus verifying the validify of this approach:
         a + a * (b - c) + (b - c) * d 
The advantage to combining sub-DAGs is that you will generate fewer computer instructions since you can 'reuse' previous computations.

Ch 6.2 "Three-Address Code"

Three-address code is a linear representation of abstract syntax trees and, as such, can be generated while traversing the tree. Expressions are expanded to at most one operator on the right side of an instruction; e.g. x + y * z is
          t_1 = y * z
          t_2 = x + t_1  
The use of names for intermediate values (t_1 and t_2) make three-address code easy to rearrange. The above DAG for this expression
     a + a * (b - c) + (b - c) * d   
results in the 3-address code shown below, generated by a topological sort which is a depth-first left to right traversal of the abstract syntax tree:
                       +
                    /    \ 
                   +      *
                  / \     ) \ 
                 (   *   /   d   
                  \  / \/
                   a   -
                      /  \
                     b    c    
The first sub-expression is b - c:
 SORT                              Three-Address Code
 d c b - * a * a + +                t_1 = b - c
   -----
                        +
                      /   \ 
                    +      *
                  /  \    / \ 
                 /   *   /   d   
                 \  /  \/
                  a    t_1

 d t_1 * a * a + +                  t_2 = t_1 * d 
 -------
                        +
                      /   \ 
                    +     t_2 
                  /  \     
                 /   *      
                 \  / \ 
                  a   t_2 


 t_2 a * a + +                       t_3 = t_2 * a 
 -------
                        +
                      /   \ 
                    +     t_2 
                  /  \     
                 a   t_3      

 t_3 a + +                          t_4 = t_3 + a

                         +
                        / \
                      t_4  t_2
 t_4 +
(right opand for + is t_2)          t_5 = t_4 + t_2

The final result is:
       1. t1 = b - c 
       2. t2 = t1 * d
       3. t3 = t2 * a 
       4. t4 = t3 + a 
       5. t5 = t4 + t2

Addresses and Instructions

A three-address code is comprised of an address and an instruction, where an address is identified by a name, a constant or a compiler-generated temporary. For imperative languages, these codes are about all you need:

  1. x = y op z           // binary ops
  2. x = op y             // unary ops
  3. x = y                // assignments
  4. goto L               // unconditional jumps
  5. if x goto L          // conditional jumps
  6. if x relop y goto L  // conditional jumps by relational ops
  7. param x, call p,n
     y = call p           // procedure calls and returns 

  8. x = y[i]             // support for non-scalars
  9. x = &y, x = *y *x=y  // support for pointers
 
Labels can be assigned symbolically (e.g. goto label) or by a position (e.g. like original Fortran or Basic):
     100: t_1 = i + 1
     104: if t_3 < v goto 100
A 3-address code is stored in a structure (i.e., an object or a record) that is attached to the node in the abstract syntax tree (AST) pertaining to that code. For example, for the assignment statement
           a = b * -c + b * c 
the AST and the 3-address code are shown here:
            
             =
            /  \                 t_1 = minus c
           a    +                t_2 = b * t_1
               /  \              t_3 = b * c
              *    *             t_4 = t_2 + t_3
             / \  / \              a = t_4 
            b  -  b  c 
               |     
               c      
There are three ways to represent the 3-address code:
// Method 1.  Quadruple Method
     op     arg_1  arg_2  result
  0  -      c             t_1     // unary minus
  1  *      b      t_1    t_2 
  2  *      b      c      t_3
  3  +      t_2    t_3    t_4
  4  =      t_4           a
The disadvantage to quadruples is space - every operation requires temporary storage space for a result. This problem is solved by the triple method.
 
// Method 2. Triple Method
     op     arg_1  arg_2 
   0 -      c            
   1 *      b      (0)    
   2 *      b       c     
   3 +     (1)     (2) 
   4 =      a      (3)                
The disadvantage to triples is that an optimizing compiler cannot easily rearrange instructions.

Method 3, the Indirect Method, provides the advantages of method #1 and method #2. In this method you add a list of pointers to triples. An optimizing compiler can move instructions by reordering the instruction list, without affecting the triples.

Ch 6.3 "Types and Declarations"

Type checking is a semantic issue that can be handled either before or during code generation. Assuming the code has passed the semantic contraints enforced by type rules, type (i.e., a variable declaration) controls how much memory will be needed for the variable. This information is required during code generation at which time variables are bound to an address. Assume this BNF for a sequence of declarations:
   decl -> type ID; decl | e
   type -> INT | FLOAT 

Storage Layout for Local Names.
Every static object (e.g., identifiers that are declared at compile time) must be bound to an address by the compiler, which means that the compiler must know the size of the object at compile time. The storage layout for data objects is heavily influenced by the hardware of the target machine. The size of scalars (int, char, float, double, etc.) and how data are aligned is controlled by the instruction set architecture of target machine. But the compiler front-end is designed to be platform independent. So how does this work? The compiler front-end will make decisions based on defaults that are stored in compiler header files. See /usr/include/limits.h" and /usr/include/sys/types.h" for GNU C defaults. Gnu C also supports a number of command line options for changing the size of pre-defined data types, setting alignment, and other machine dependent options (see gcc manpage). The sizeof() subroutine in glib.c uses these defaults for user applications (See gcc/c-common.c").

Ch 6.4 "Translation of Expressions"

Skip references to non-scalar types.

Ch 6.5 "Type Checking"

Type checking is a semantic issue that directly relates to code generation since you must know the size of an object to generate an address. To know the size of an object you must know its type. The type system of a language includes the rules governing mixing types or not in expressions. In a strongly typed language (such as Ada) the compiler is not allowed to silently coerce one type to another. In a moderately typed language (such as Java) the compiler is allowed to perform widening coercions (no data is lost). In a weakly typed language (such as C) the compiler can silently coerce a float to an int (as lose information in the process). The rules governing type in a language is the type system. The 'type' of an expression can be assigned by the compiler by inference or by synthesis.

Type synthesis builds up the type of an expression from the types in subexpressions (an attribute grammer). C uses this. A type system that support widening coercions allows the compiler to implicit cast one type to another as long as information is not lost; e.g.

     int i = 5;
     char c = 'a';
     c = i;  // NOT OK
     i = c;  // OK 
 
Java Design
C coercions

Type inference determines type by context. This is the design of JavaScript for the most part.