cse 5317/4305: design and construction of compilerslambda.uta.edu/cse5317/spring07/notes.pdf ·...

CSE 5317/4305: Design and Construction of Compilers

Leonidas Fegaras

University of Texas at Arlington, CSE

January 14, 2005

Copyright c©1999-2005 by Leonidas Fegarasemail: [email protected]

web: http://lambda.uta.edu/

Contents

1 Introduction 31.1 What is a Compiler? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 What is the Challenge? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Compiler Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Lexical Analysis 62.1 Regular Expressions (REs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 Deterministic Finite Automata (DFAs) . . . . . . . . . . . . . . . . . . . . . 82.3 Converting a Regular Expression into a Deterministic Finite Automaton . . 112.4 Case Study: The Calculator Scanner . . . . . . . . . . . . . . . . . . . . . . 13

3 Parsing 143.1 Context-free Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2 Predictive Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.1 Recursive Descent Parsing . . . . . . . . . . . . . . . . . . . . . . . . 203.2.2 Predictive Parsing Using Tables . . . . . . . . . . . . . . . . . . . . . 21

3.3 Bottom-up Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.3.1 Bottom-up Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.3.2 Shift-Reduce Parsing Using the ACTION/GOTO Tables . . . . . . . 263.3.3 Table Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3.4 SLR(1), LR(1), and LALR(1) Grammars . . . . . . . . . . . . . . . . 333.3.5 Practical Considerations for LALR(1) Grammars . . . . . . . . . . . 36

3.4 Case Study: The Calculator Parser . . . . . . . . . . . . . . . . . . . . . . . 38

1

4 Abstract Syntax 394.1 Building Abstract Syntax Trees in Java . . . . . . . . . . . . . . . . . . . . . 404.2 Building Abstract Syntax Trees in C . . . . . . . . . . . . . . . . . . . . . . 414.3 Gen: A Java Package for Constructing and Manipulating Abstract Syntax Trees 42

5 Semantic Actions 46

6 Semantic Analysis 496.1 Symbol Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496.2 Types and Type Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526.3 Case Study: The Calculator Interpreter . . . . . . . . . . . . . . . . . . . . . 55

7 Activation Records 567.1 Run-Time Storage Organization . . . . . . . . . . . . . . . . . . . . . . . . . 567.2 Case Study: Activation Records for the MIPS Architecture . . . . . . . . . . 60

8 Intermediate Code 638.1 Translating Variables, Records, Arrays, and Strings . . . . . . . . . . . . . . 658.2 Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

9 Basic Blocks and Traces 68

10 Instruction Selection 69

11 Liveness Analysis 72

12 Register Allocation 7412.1 Register Allocation Using the Interference Graph . . . . . . . . . . . . . . . 7412.2 Code Generation for Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

13 Garbage Collection 7813.1 Storage Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7813.2 Automatic Garbage Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 82

13.2.1 Mark-and-Sweep Garbage Collection . . . . . . . . . . . . . . . . . . 8313.2.2 Copying Garbage Collection . . . . . . . . . . . . . . . . . . . . . . . 85

13.3 Other Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

2

1 Introduction

1.1 What is a Compiler?

A compiler is a program that translates a source program written in some high-level pro-gramming language (such as Java) into machine code for some computer architecture (suchas the Intel Pentium architecture). The generated machine code can be later executed manytimes against different data each time.

An interpreter reads an executable source program written in a high-level programminglanguage as well as data for this program, and it runs the program against the data toproduce some results. One example is the Unix shell interpreter, which runs operatingsystem commands interactively.

Note that both interpreters and compilers (like any other program) are written in somehigh-level programming language (which may be different from the language they accept) andthey are translated into machine code. For a example, a Java interpreter can be completelywritten in Pascal, or even Java. The interpreter source program is machine independent sinceit does not generate machine code. (Note the difference between generate and translated into

machine code.) An interpreter is generally slower than a compiler because it processes andinterprets each statement in a program as many times as the number of the evaluationsof this statement. For example, when a for-loop is interpreted, the statements inside thefor-loop body will be analyzed and evaluated on every loop step. Some languages, such asJava and Lisp, come with both an interpreter and a compiler. Java source programs (Javaclasses with .java extension) are translated by the javac compiler into byte-code files (with.class extension). The Java interpreter, java, called the Java Virtual Machine (JVM), mayactually interpret byte codes directly or may internally compile them to macine code andthen execute that code.

Compilers and interpreters are not the only examples of translators. Here are few more:Source Language Translator Target LanguageLaTeX Text Formater PostScriptSQL database query optimizer Query Evaluation PlanJava javac compiler Java byte codeJava cross-compiler C++ codeEnglish text Natural Language Understanding semantics (meaning)Regular Expressions JLex scanner generator a scanner in JavaBNF of a language CUP parser generator a parser in Java

This course deals mainly with compilers for high-level programming languages, but thesame techniques apply to interpreters or to any other compilation scheme.

1.2 What is the Challenge?

Many variations:

• many programming languages (eg, FORTRAN, C++, Java)

• many programming paradigms (eg, object-oriented, functional, logic)

3

• many computer architectures (eg, MIPS, SPARC, Intel, alpha)

• many operating systems (eg, Linux, Solaris, Windows)

Qualities of a compiler (in order of importance):

1. the compiler itself must be bug-free

2. it must generate correct machine code

3. the generated machine code must run fast

4. the compiler itself must run fast (compilation time must be proportional to programsize)

5. the compiler must be portable (ie, modular, supporting separate compilation)

6. it must print good diagnostics and error messages

7. the generated code must work well with existing debuggers

8. must have consistent and predictable optimization.

Building a compiler requires knowledge of

• programming languages (parameter passing, variable scoping, memory allocation, etc)

• theory (automata, context-free languages, etc)

• algorithms and data structures (hash tables, graph algorithms, dynamic programming,etc)

• computer architecture (assembly programming)

• software engineering.

The course project (building a non-trivial compiler for a Pascal-like language) will give youa hands-on experience on system implementation that combines all this knowledge.

1.3 Compiler Architecture

A compiler can be viewed as a program that accepts a source code (such as a Java program)and generates machine code for some computer architecture. Suppose that you want tobuild compilers for n programming languages (eg, FORTRAN, C, C++, Java, BASIC, etc)and you want these compilers to run on m different architectures (eg, MIPS, SPARC, Intel,alpha, etc). If you do that naively, you need to write n∗m compilers, one for each language-architecture combination.

The holly grail of portability in compilers is to do the same thing by writing n + mprograms only. How? You use a universal Intermediate Representation (IR) and you makethe compiler a two-phase compiler. An IR is typically a tree-like data structure that capturesthe basic features of most computer architectures. One example of an IR tree node is a

4

representation of a 3-address instruction, such as d← s1 +s2, that gets two source addresses,s1 and s2, (ie. two IR trees) and produces one destination address, d. The first phase of thiscompilation scheme, called the front-end, maps the source code into IR, and the second phase,called the back-end, maps IR into machine code. That way, for each programming languageyou want to compile, you write one front-end only, and for each computer architecture, youwrite one back-end. So, totally you have n + m components.

But the above ideal separation of compilation into two phases does not work very well forreal programming languages and architectures. Ideally, you must encode all knowledge aboutthe source programming language in the front end, you must handle all machine architecturefeatures in the back end, and you must design your IRs in such a way that all language andmachine features are captured properly.

A typical real-world compiler usually has multiple phases. This increases the compiler’sportability and simplifies retargeting. The front end consists of the following phases:

1. scanning: a scanner groups input characters into tokens;

2. parsing: a parser recognizes sequences of tokens according to some grammar and gen-erates Abstract Syntax Trees (ASTs);

3. Semantic analysis: performs type checking (ie, checking whether the variables, functionsetc in the source program are used consistently with their definitions and with thelanguage semantics) and translates ASTs into IRs;

4. optimization: optimizes IRs.

The back end consists of the following phases:

1. instruction selection: maps IRs into assembly code;

2. code optimization: optimizes the assembly code using control-flow and data-flow anal-yses, register allocation, etc;

3. code emission: generates machine code from assembly code.

The generated machine code is written in an object file. This file is not executable sinceit may refer to external symbols (such as system calls). The operating system provides thefollowing utilities to execute the code:

1. linking: A linker takes several object files and libraries as input and produces oneexecutable object file. It retrieves from the input files (and puts them together inthe executable object file) the code of all the referenced functions/procedures and itresolves all external references to real addresses. The libraries include the operatingsytem libraries, the language-specific libraries, and, maybe, user-created libraries.

2. loading: A loader loads an executable object file into memory, initializes the registers,heap, data, etc and starts the execution of the program.

Relocatable shared libraries allow effective memory use when many different applicationsshare the same code.

5

2 Lexical Analysis

A scanner groups input characters into tokens. For example, if the input is

x = x*(b+1);

then the scanner generates the following sequence of tokens

id(x)

=

id(x)

*

(

id(b)

+

num(1)

)

;

where id(x) indicates the identifier with name x (a program variable in this case) andnum(1) indicates the integer 1. Each time the parser needs a token, it sends a request to thescanner. Then, the scanner reads as many characters from the input stream as it is necessaryto construct a single token. The scanner may report an error during scanning (eg, when itfinds an end-of-file in the middle of a string). Otherwise, when a single token is formed, thescanner is suspended and returns the token to the parser. The parser will repeatedly callthe scanner to read all the tokens from the input stream or until an error is detected (suchas a syntax error).

Tokens are typically represented by numbers. For example, the token * may be assignednumber 35. Some tokens require some extra information. For example, an identifier is atoken (so it is represented by some number) but it is also associated with a string thatholds the identifier name. For example, the token id(x) is associated with the string, "x".Similarly, the token num(1) is associated with the number, 1.

Tokens are specified by patterns, called regular expressions. For example, the regularexpression [a-z][a-zA-Z0-9]* recognizes all identifiers with at least one alphanumeric letterwhose first letter is lower-case alphabetic.

A typical scanner:

1. recognizes the keywords of the language (these are the reserved words that have aspecial meaning in the language, such as the word class in Java);

2. recognizes special characters, such as ( and ), or groups of special characters, such as:= and ==;

3. recognizes identifiers, integers, reals, decimals, strings, etc;

4. ignores whitespaces (tabs and blanks) and comments;

5. recognizes and processes special directives (such as the #include "file" directive inC) and macros.

6

A key issue is speed. One can always write a naive scanner that groups the input char-acters into lexical words (a lexical word can be either a sequence of alphanumeric characterswithout whitespaces or special characters, or just one special character), and then tries toassociate a token (ie. number, keyword, identifier, etc) to this lexical word by performing anumber of string comparisons. This becomes very expensive when there are many keywordsand/or many special lexical patterns in the language. In this section you will learn how tobuild efficient scanners using regular expressions and finite automata. There are automatedtools called scanner generators, such as flex for C and JLex for Java, which construct a fastscanner automatically according to specifications (regular expressions). You will first learnhow to specify a scanner using regular expressions, then the underlying theory that scannergenerators use to compile regular expressions into efficient programs (which are basically fi-nite state machines), and then you will learn how to use a scanner generator for Java, calledJLex.

2.1 Regular Expressions (REs)

Regular expressions are a very convenient form of representing (possibly infinite) sets ofstrings, called regular sets. For example, the RE (a|b) ∗ aa represents the infinite set{“aa”, “aaa”, “baa”, “abaa”, · · ·}, which is the set of all strings with characters a and b thatend in aa. Formally, a RE is one of the following (along with the set of strings it designates):

name RE designationepsilon ε {“”}symbol a {“a”} for some character a

concatenation AB the set { rs| r ∈ A, s ∈ B }, where rs is stringconcatenation, and A and B designate the REs A and B

alternation A|B the set A ∪ B, where A and B designate the REs A and Brepetition A∗ the set ε|A|(AA)|(AAA)| · · · (an infinite set)

Repetition is also called Kleen closure. For example, (a|b)c designates the set { rs| r ∈({“a”} ∪ {“b”}), s ∈ {“c”} }, which is equal to {“ac”, “bc”}.

We will use the following notational conveniences:

P+ = PP ∗

P ? = P | ε[a− z] = a|b|c| · · · |z

We can freely put parentheses around REs to denote the order of evaluation. For example,(a|b)c. To avoid using many parentheses, we use the following rules: concatenation andalternation are associative (ie, ABC means (AB)C and is equivalent to A(BC)), alternationis commutative (ie, A|B = B|A), repetition is idempotent (ie, A∗∗ = A∗), and concatenationdistributes over alternation (eg, (a|b)c = ac|bc).

For convenience, we can give names to REs so we can refer to them by their name. Forexample:

7

for − keyword = forletter = [a− zA− Z]digit = [0− 9]identifier = letter (letter | digit)∗

sign = +| − | εinteger = sign (0 | [1− 9]digit∗)decimal = integer . digit∗

real = (integer | decimal) E sign digit+

There is some ambiguity though: If the input includes the characters for8, then the firstrule (for for-keyword) matches 3 characters (for), the fourth rule (for identifier) can match 1,2, 3, or 4 characters, the longest being for8. To resolve this type of ambiguities, when thereis a choice of rules, scanner generators choose the one that matches the maximum numberof characters. In this case, the chosen rule is the one for identifier that matches 4 characters(for8). This disambiguation rule is called the longest match rule. If there are more thanone rules that match the same maximum number of characters, the rule listed first is chosen.This is the rule priority disambiguation rule. For example, the lexical word for is taken asa for-keyword even though it uses the same number of characters as an identifier.

2.2 Deterministic Finite Automata (DFAs)

A DFA represents a finite state machine that recognizes a RE. For example, the followingDFA:

recognizes (abc+)+. A finite automaton consists of a finite set of states, a set of transitions(moves), one start state, and a set of final states (accepting states). In addition, a DFA hasa unique transition for every state-character combination. For example, the previous figurehas 4 states, state 1 is the start state, and state 4 is the only final state.

A DFA accepts a string if starting from the start state and moving from state to state,each time following the arrow that corresponds the current input character, it reaches a finalstate when the entire input string is consumed. Otherwise, it rejects the string.

The previous figure represents a DFA even though it is not complete (ie, not all state-character transitions have been drawn). The complete DFA is:

8

but it is very common to ignore state 0 (called the error state) since it is implied. (Thearrows with two or more characters indicate transitions in case of any of these characters.)The error state serves as a black hole, which doesn’t let you escape.

A DFA is represented by a transition table T , which gives the next state T [s, c] for a states and a character c. For example, the T for the DFA above is:

a b c0 0 0 01 2 0 02 0 3 03 0 0 44 2 0 4

Suppose that we want to build a scanner for the REs:

for − keyword = foridentifier = [a− z][a − z0− 9]∗

The corresponding DFA has 4 final states: one to accept the for-keyword and 3 to acceptan identifier:

(the error state is omitted again). Notice that for each state and for each character, there isa single transition.

A scanner based on a DFA uses the DFA’s transition table as follows:

state = initial_state;

current_character = get_next_character();

while ( true )

{ next_state = T[state,current_character];

9

if (next_state == ERROR)

break;

state = next_state;


if ( current_character == EOF )

break;

};

if ( is_final_state(state) )

‘we have a valid token’

else ‘report an error’

This program does not explicitly take into account the longest match disambiguation rulesince it ends at EOF. The following program is more general since it does not expect EOFat the end of token but still uses the longest match disambiguation rule.

state = initial_state;

final_state = ERROR;


while ( true )

{ next_state = T[state,current_character];

if (next_state == ERROR)

break;

state = next_state;

if ( is_final_state(state) )

final_state = state;


if (current_character == EOF)

break;

};

if ( final_state == ERROR )

‘report an error’

else if ( state != final_state )

‘we have a valid token but we need to backtrack

(to put characters back into the input stream)’

else ‘we have a valid token’

Is there any better (more efficient) way to build a scanner out of a DFA? Yes! We canhardwire the state transition table into a program (with lots of gotos). You’ve learned inyour programming language course never to use gotos. But here we are talking about aprogram generated automatically, which no one needs to look at. The idea is the following.Suppose that you have a transition from state s1 to s2 when the current character is c. Thenyou generate the program:

s1: current_character = get_next_character();

...

10

if ( current_character == ’c’ )

goto s2;

...

s2: current_character = get_next_character();

...

2.3 Converting a Regular Expression into a Deterministic Finite

Automaton

The task of a scanner generator, such as JLex, is to generate the transition tables or tosynthesize the scanner program given a scanner specification (in the form of a set of REs).So it needs to convert REs into a single DFA. This is accomplished in two steps: first itconverts REs into a non-deterministic finite automaton (NFA) and then it converts the NFAinto a DFA.

An NFA is similar to a DFA but it also permits multiple transitions over the samecharacter and transitions over ε. In the case of multiple transitions from a state over thesame character, when we are at this state and we read this character, we have more than onechoice; the NFA succeeds if at least one of these choices succeeds. The ε transition doesn’tconsume any input characters, so you may jump to another state for free.

Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have thesame expressive power. The problem is that when converting a NFA to a DFA we may getan exponential blowup in the number of states.

We will first learn how to convert a RE into a NFA. This is the easy part. There are only5 rules, one for each type of RE:

11

As it can been shown inductively, the above rules construct NFAs with only one final state.For example, the third rule indicates that, to construct the NFA for the RE AB, we constructthe NFAs for A and B, which are represented as two boxes with one start state and one finalstate for each box. Then the NFA for AB is constructed by connecting the final state of Ato the start state of B using an empty transition.

For example, the RE (a|b)c is mapped to the following NFA:

The next step is to convert a NFA to a DFA (called subset construction). Suppose thatyou assign a number to each NFA state. The DFA states generated by subset constructionhave sets of numbers, instead of just one number. For example, a DFA state may have beenassigned the set {5, 6, 8}. This indicates that arriving to the state labeled {5, 6, 8} in theDFA is the same as arriving to the state 5, the state 6, or the state 8 in the NFA whenparsing the same input. (Recall that a particular input sequence when parsed by a DFA,leads to a unique state, while when parsed by a NFA it may lead to multiple states.)

First we need to handle transitions that lead to other states for free (without consumingany input). These are the ε transitions. We define the closure of a NFA node as the set ofall the nodes reachable by this node using zero, one, or more ε transitions. For example,The closure of node 1 in the left figure below

12

is the set {1, 2}. The start state of the constructed DFA is labeled by the closure of the NFAstart state. For every DFA state labeled by some set {s1, . . . , sn} and for every character c inthe language alphabet, you find all the states reachable by s1, s2, ..., or sn using c arrows andyou union together the closures of these nodes. If this set is not the label of any other nodein the DFA constructed so far, you create a new DFA node with this label. For example,node {1, 2} in the DFA above has an arrow to a {3, 4, 5} for the character a since the NFAnode 3 can be reached by 1 on a and nodes 4 and 5 can be reached by 2. The b arrow fornode {1, 2} goes to the error node which is associated with an empty set of NFA nodes.

The following NFA recognizes (a|b)∗(abb | a+b), even though it wasn’t constructed withthe above RE-to-NFA rules. It has the following DFA:

2.4 Case Study: The Calculator Scanner

As a running example for this and the subsequent sections, we will use a simple interpreterfor a calculator written in Java. If you don’t have Java on your PC, you need to downloadSun’s J2SE SDK from http://java.sun.com/j2se/downloads.html. You also need to download

1. the jar archive that contains the CUP, JLex, and Gen classes from http://lambda.uta.edu/System.jar

2. the file http://lambda.uta.edu/cse5317/calc.tar.gz

On a Unix system, you do:

1. execute: tar xfz calc.tar.gz to extract the files in the directory calc

2. change the CLASSPATH line in the file calc/Makefile to point to your System.jar

3. execute: cd calc; build

4. run the calculator using run

On Windows, you do:

13

1. uncompress and extract the files from tar xfz calc.tar.gz

2. change the make.bat and run.bat files to point to your jar files

3. on a DOS prompt go to the calc directory and do build to compile the programs

4. do a run to run the calculator

Using the calculator, you can calculate arithmetic expressions, such as 2*(3+8);, you can as-sign values to variables, such as x:=3+4;, you can reference variables by name, such as x+3;,you can define recursive functions interactively, such as define f(n) = if n=0 then 1 else n*f(n-1);,and you can call them using f(5);, etc. You exit using quit;.

The source files of the calculator example are given in http://lambda.uta.edu/cse5317/calc/.Some of these files will be explained in detail later, but now we are ready to look at thescanner only, given in calc.lex. The tokens, such as sym.INT, are imported from the parserand will be explained in the next section. For now, just assume that these are different con-stants (integers). The order of lines is important in some cases: if we put the rule for {ID}before the keywords, then the keywords will never be recognized (this is the consequence ofthe rule priority law). The lexical constructs that need to be skipped, such as comments andwhitespaces, should never return (we must return only when we find a complete token).

3 Parsing

3.1 Context-free Grammars

Consider the following input string:

x+2*y

When scanned by a scanner, it produces the following stream of tokens:

id(x)

+

num(2)

*

id(y)

The goal is to parse this expression and construct a data structure (a parse tree) to representit. One possible syntax for expressions is given by the following grammar G1:

E ::= E + T

| E - T

| T

T ::= T * F

| T / F

| F

F ::= num

| id

14

where E, T and F stand for expression, term, and factor respectively. For example, the rulefor E indicates that an expression E can take one of the following 3 forms: an expressionfollowed by the token + followed by a term, or an expression followed by the token - followedby a term, or simply a term. The first rule for E is actually a shorthand of 3 productions:

E ::= E + T

E ::= E - T

E ::= T

G1 is an example of a context-free grammar (defined below); the symbols E, T and F arenonterminals and should be defined using production rules, while +, -, *, /, num, and id

are terminals (ie, tokens) produced by the scanner. The nonterminal E is the start symbol ofthe grammar.

In general, a context-free grammar (CFG) has a finite set of terminals (tokens), a finiteset of nonterminals from which one is the start symbol, and a finite set of productions of theform:

A ::= X1 X2 ... Xn

where A is a nonterminal and each Xi is either a terminal or nonterminal symbol.Given two sequences of symbols a and b (can be any combination of terminals and

nonterminals) and a production A ::= X1X2...Xn, the form aAb => aX1X2...Xnb is calleda derivation. That is, the nonterminal symbol A is replaced by the rhs (right-hand-side) ofthe production for A. For example,

T / F + 1 - x => T * F / F + 1 - x

is a derivation since we used the production T := T * F.Top-down parsing starts from the start symbol of the grammar S and applies derivations

until the entire input string is derived (ie, a sequence of terminals that matches the inputtokens). For example,

E => E + T

=> E + T * F

=> T + T * F

=> T + F * F

=> T + num * F

=> F + num * F

=> id + num * F

=> id + num * id

which matches the input sequence id(x) + num(2) * id(y). Top down parsing occasion-ally requires backtracking. For example, suppose the we used the derivation E => E - T

instead of the first derivation. Then, later we would have to backtrack because the derivedsymbols will not match the input tokens. This is true for all nonterminals that have morethan one production since it indicates that there is a choice of which production to use. Wewill learn how to construct parsers for many types of CFGs that never backtrack. These

15

parsers are based on a method called predictive parsing. One issue to consider is which non-terminal to replace when there is a choice. For example, in T + F * F we have 3 choices:we can use a derivation for T, for the first F, or for the second F. When we always replacethe leftmost nonterminal, it is called leftmost derivation.

In contrast to top-down parsing, bottom-up parsing starts from the input string anduses derivations in the opposite directions (ie, by replacing the rhs sequence X1X2...Xn ofa production A ::= X1X2...Xn with the nonterminal A. It stops when it derives the startsymbol. For example,

id(x) + num(2) * id(y)

=> id(x) + num(2) * F

=> id(x) + F * F

=> id(x) + T * F

=> id(x) + T

=> F + T

=> T + T

=> E + T

=> E

The parse tree of an input sequence according to a CFG is the tree of derivations. Forexample if we used a production A ::= X1X2...Xn (in either top-down or bottom-up parsing)then we construct a tree with node A and children X1X2...Xn. For example, the parse treeof id(x) + num(2) * id(y) is:

E

/ | \

E + T

| / | \

T T * F

| | |

F F id

| |

id num

So a parse tree has non-terminals for internal nodes and terminals for leaves. As anotherexample, consider the following grammar:

S ::= ( L )

| a

L ::= L , S

| S

Under this grammar, the parse tree of the sentence (a, ((a, a), a)) is:

S

/ | \

( L )

16

/ | \

L , S

| / | \

S ( L )

| / | \

a L , S

| |

S a

/ | \

( L )

/ | \

L , S

| |

S a

|

a

Consider now the following grammar G2:

E ::= T + E

| T - E

| T

T ::= F * T

| F / T

| F

F ::= num

| id

This is similar to our original grammar, but it is right associative when the leftmost derivationrules is used. That is, x-y-z is equivalent to x-(y-z) under G2, as we can see from its parsetree.

Consider now the following grammar G3:

E ::= E + E

| E - E

| E * E

| E / E

| num

| id

Is this grammar equivalent to our original grammar G1? Well, it recognizes the same lan-guage, but it constructs the wrong parse trees. For example, the x+y*z is interpreted as(x+y)*z by this grammar (if we use leftmost derivations) and as x+(y*z) by G1 or G2.That is, both G1 and G2 grammar handle the operator precedence correctly (since ∗ hashigher precedence than +), while the G3 grammar does not.

In general, to write a grammar that handles precedence properly, we can start with agrammar that does not handle precedence at all, such as our last grammar G3, and then we

17

can refine it by creating more nonterminals, one for each group of operators that have thesame precedence. For example, suppose we want to parse an expression E and we have 4groups of operators: {not}, {∗, /}, {+,−}, and {and, or}, in this order of precedence. Thenwe create 4 new nonterminals: N , T , F , and B and we split the derivations for E into 5groups of derivations (the same way we split the rules for E in the last grammar into 3groups in the first grammar).

A grammar is ambiguous if it has more than one parse tree for the same input sequencedepending which derivations are applied each time. For example, the grammar G3 is am-biguous since it has two parse trees for x-y-z (one parses x-y first, while the other parsesy-z first). Of course, the first one is the right interpretation since − is left associative.Fortunately, if we always apply the leftmost derivation rule, we will never derive the secondparse tree. So in this case the leftmost derivation rule removes the ambiguity.

3.2 Predictive Parsing

The goal of predictive parsing is to construct a top-down parser that never backtracks. Todo so, we must transform a grammar in two ways:

1. eliminate left recursion, and

2. perform left factoring.

These rules eliminate most common causes for backtracking although they do not guaranteea completely backtrack-free parsing (called LL(1) as we will see later).

Consider this grammar:

A ::= A a

| b

It recognizes the regular expression ba∗. The problem is that if we use the first productionfor top-down derivation, we will fall into an infinite derivation chain. This is called left

recursion. But how else can you express ba∗? Here is an alternative way:

A ::= b A’

A’ ::= a A’

|

where the third production is an empty production (ie, it is A’ ::= ). That is, A′ parsesthe RE a∗. Even though this CFG is recursive, it is not left recursive. In general, for eachnonterminal X, we partition the productions for X into two groups: one that contains theleft recursive productions, and the other with the rest. Suppose that the first group is:

X ::= X a1

...

X ::= X an

while the second group is:

18

X ::= b1

...

X ::= bm

where a, b are symbol sequences. Then we eliminate the left recursion by rewriting theserules into:

X ::= b1 X’

...

X ::= bm X’

X’ ::= a1 X’

...

X’ ::= an X’

X’ ::=

For example, the CFG G1 is transformed into:

E ::= T E’

E’ ::= + T E’

| - T E’

|

T ::= F T’

T’ ::= * F T’

| / F T’

|

F ::= num

| id

Suppose now that we have a number of productions for X that have a common prefix intheir rhs (but without any left recursion):

X ::= a b1

...

X ::= a bn

We factor out the common prefix as follows:

X ::= a X’

X’ ::= b1

...

X’ ::= bn

This is called left factoring and it helps predict which rule to use without backtracking. Forexample, the rule from our right associative grammar G2:

E ::= T + E

| T - E

| T

19

is translated into:

E ::= T E’

E’ ::= + E

| - E

|

As another example, let L be the language of all regular expressions over the alphabetΣ = {a, b}. That is, L = {“ε”, “a”, “b”, “a ∗ ”, “b ∗ ”, “a|b”, “(a|b)”, ...}. For example, thestring “a(a|b) ∗ |b ∗ ” is a member of L. There is no RE that captures the syntax of all REs.Consider for example the RE ((· · · (a) · · ·)), which is equivalent to the language (na)n for alln. This represents a valid RE but there is no RE that can capture its syntax. A context-freegrammar that recognizes L is:

R ::= R R| R “|” R| R ∗| ( R )| a| b| “ε”

after elimination of left recursion, this grammar becomes:

R ::= ( R ) R′

| a R′

| b R′

| “ε” R′

R′ ::= R R′

| “|” R R′

| ∗R′

|

3.2.1 Recursive Descent Parsing

Consider the transformed CFG G1 again:

E ::= T E’

E’ ::= + T E’

| - T E’

|

T ::= F T’

T’ ::= * F T’

| / F T’

|

F ::= num

| id

Here is a Java program that parses this grammar:

20

static void E () { T(); Eprime(); }

static void Eprime () {

if (current_token == PLUS)

{ read_next_token(); T(); Eprime(); }

else if (current_token == MINUS)

{ read_next_token(); T(); Eprime(); };

}

static void T () { F(); Tprime(); }

static void Tprime() {

if (current_token == TIMES)

{ read_next_token(); F(); Tprime(); }

else if (current_token == DIV)

{ read_next_token(); F(); Tprime(); };

}

static void F () {

if (current_token == NUM || current_token == ID)

read_next_token();

else error();

}

In general, for each nonterminal we write one procedure; For each nonterminal in the rhsof a rule, we call the nonterminal’s procedure; For each terminal, we compare the currenttoken with the expected terminal. If there are multiple productions for a nonterminal, weuse an if-then-else statement to choose which rule to apply. If there was a left recursion ina production, we would have had an infinite recursion.

Please look at the following web page for a demo of the recursive descent algorithm usingJava aplets: http://www.upb.de/cs/ag-kastens/uebi/parsdemo/

3.2.2 Predictive Parsing Using Tables

Predictive parsing can also be accomplished using a predictive parsing table and a stack. Itis sometimes called non-recursive predictive parsing. The idea is that we construct a tableM [X, token] which indicates which production to use if the top of the stack is a nonterminalX and the current token is equal to token; in that case we pop X from the stack and wepush all the rhs symbols of the production M [X, token] in reverse order. We use a specialsymbol $ to denote the end of file. Let S be the start symbol. Here is the parsing algorithm:

push(S);

read_next_token();

repeat

X = pop();

if (X is a terminal or ’$’)

if (X == current_token)

read_next_token();

else error();

else if (M[X,current_token] == "X ::= Y1 Y2 ... Yk")

21

{ push(Yk);

...

push(Y1);

}

else error();

until X == ’$’;

Now the question is how to construct the parsing table. We need first to derive theFIRST [a] for some symbol sequence a and the FOLLOW [X] for some nonterminal X. Infew words, FIRST [a] is the set of terminals t that result after a number of derivations ona (ie, a => ... => tb for some b). For example, FIRST [3 + E] = {3} since 3 is the firstterminal. To find FIRST [E + T ], we need to apply one or more derivations to E untilwe get a terminal at the beginning. If E can be reduced to the empty sequence, then theFIRST [E + T ] must also contain the FIRST [+T ] = {+}. The FOLLOW [X] is the setof all terminals that follow X in any legal derivation. To find the FOLLOW [X], we needfind all productions Z ::= aXb in which X appears at the rhs. Then the FIRST [b] mustbe part of the FOLLOW [X]. If b is empty, then the FOLLOW [Z] must be part of theFOLLOW [X].

Consider our CFG G1:

1) E ::= T E’ $

2) E’ ::= + T E’

3) | - T E’

4) |

5) T ::= F T’

6) T’ ::= * F T’

7) | / F T’

8) |

9) F ::= num

10) | id

FIRST[F] is of course {num,id}. This means that FIRST[T]=FIRST[F]={num,id}. In ad-dition, FIRST[E]=FIRST[T]={num,id}. Similarly, FIRST[T’] is {*,/} and FIRST[E’] is{+,-}.

The FOLLOW of E is {$} since there is no production that has E at the rhs. For E’,rules 1, 2, and 3 have E’ at the rhs. This means that the FOLLOW[E’] must contain both theFOLLOW[E] and the FOLLOW[E’]. The first one is {$} while the latter is ignored since we aretrying to find E’. Similarly, to find FOLLOW[T], we find the rules that have T at the rhs: rules 1,2, and 3. Then FOLLOW[T] must include FIRST[E’] and, since E ′ can be reduced to the emptysequence, it must include FOLLOW[E’] too (ie, {$}). That is, FOLLOW[T]={+,-,$}. Similarly,FOLLOW[T’]=FOLLOW[T]={+,-,$} from rule 5. The FOLLOW[F] is equal to FIRST[T’]={*,/}

plus FOLLOW[T’] and plus FOLLOW[T] from rules 5, 6, and 7, since T’ can be reduced to theempty sequence.

To summarize, we have:

22

FIRST FOLLOWE {num,id} {$}E’ {+,-} {$}T {num,id} {+,-,$}T’ {*,/} {+,-,$}F {num,id} {+,-,*,/,$}

Now, given the above table, we can easily construct the parsing table. For each t ∈FIRST [a], add X ::= a to M [X, t]. If a can be reduced to the empty sequence, then foreach t ∈ FOLLOW [X], add X ::= a to M [X, t].

For example, the parsing table of the grammar G1 is:num id + - * / $

E 1 1E’ 2 3 4T 5 5T’ 8 8 6 7 8F 9 10

where the numbers are production numbers. For example, consider the eighth productionT ′ ::= . Since FOLLOW[T’]={+,-,$}, we add 8 to M[T’,+], M[T’,-], M[T’,$].

A grammar is called LL(1) if each element of the parsing table of the grammar has atmost one production element. (The first L in LL(1) means that we read the input from leftto right, the second L means that it uses left-most derivations only, and the number 1 meansthat we need to look one token only ahead from the input.) Thus G1 is LL(1). If we havemultiple entries in M , the grammar is not LL(1).

We will parse now the string x-2*y$ using the above parse table:

Stack current_token Rule

---------------------------------------------------------------

E x M[E,id] = 1 (using E ::= T E’ $)

$ E’ T x M[T,id] = 5 (using T ::= F T’)

$ E’ T’ F x M[F,id] = 10 (using F ::= id)

$ E’ T’ id x read_next_token

$ E’ T’ - M[T’,-] = 8 (using T’ ::= )

$ E’ - M[E’,-] = 3 (using E’ ::= - T E’)

$ E’ T - - read_next_token

$ E’ T 2 M[T,num] = 5 (using T ::= F T’)

$ E’ T’ F 2 M[F,num] = 9 (using F ::= num)

$ E’ T’ num 2 read_next_token

$ E’ T’ * M[T’,*] = 6 (using T’ ::= * F T’)

$ E’ T’ F * * read_next_token

$ E’ T’ F y M[F,id] = 10 (using F ::= id)

$ E’ T’ id y read_next_token

$ E’ T’ $ M[T’,$] = 8 (using T’ ::= )

$ E’ $ M[E’,$] = 4 (using E’ ::= )

$ $ stop (accept)

As another example, consider the following grammar:

23

G ::= S $

S ::= ( L )

| a

L ::= L , S

| S

After left recursion elimination, it becomes

0) G := S $

1) S ::= ( L )

2) S ::= a

3) L ::= S L’

4) L’ ::= , S L’

5) L’ ::=

The first/follow tables are:FIRST FOLLOW

G ( aS ( a , ) $L ( a )L’ , )

which are used to create the parsing table:( ) a , $

G 0 0S 1 2L 3 3L’ 5 4

3.3 Bottom-up Parsing

The basic idea of a bottom-up parser is that we use grammar productions in the oppositeway (from right to left). Like for predictive parsing with tables, here too we use a stackto push symbols. If the first few symbols at the top of the stack match the rhs of somerule, then we pop out these symbols from the stack and we push the lhs (left-hand-side) ofthe rule. This is called a reduction. For example, if the stack is x * E + E (where x is thebottom of stack) and there is a rule E ::= E + E, then we pop out E + E from the stackand we push E; ie, the stack becomes x * E. The sequence E + E in the stack is called ahandle. But suppose that there is another rule S ::= E, then E is also a handle in the stack.Which one to choose? Also what happens if there is no handle? The latter question is easyto answer: we push one more terminal in the stack from the input stream and check againfor a handle. This is called shifting. So another name for bottom-up parsers is shift-reduceparsers. There two actions only:

1. shift the current input token in the stack and read the next token, and

2. reduce by some production rule.

24

Consequently the problem is to recognize when to shift and when to reduce each time, and,if we reduce, by which rule. Thus we need a recognizer for handles so that by scanningthe stack we can decide the proper action. The recognizer is actually a finite state machineexactly the same we used for REs. But here the language symbols include both terminalsand nonterminal (so state transitions can be for any symbol) and the final states indicateeither reduction by some rule or a final acceptance (success).

A DFA though can only be used if we always have one choice for each symbol. But thisis not the case here, as it was apparent from the previous example: there is an ambiguity inrecognizing handles in the stack. In the previous example, the handle can either be E + E orE. This ambiguity will hopefully be resolved later when we read more tokens. This impliesthat we have multiple choices and each choice denotes a valid potential for reduction. Soinstead of a DFA we must use a NFA, which in turn can be mapped into a DFA as we learnedin Section 2.3. These two steps (extracting the NFA and map it to DFA) are done in onestep using item sets (described below).

3.3.1 Bottom-up Parsing

Recall the original CFG G1:

0) S :: = E $

1) E ::= E + T

2) | E - T

3) | T

4) T ::= T * F

5) | T / F

6) | F

7) F ::= num

8) | id

Here is a trace of the bottom-up parsing of the input x-2*y$:

Stack rest-of-the-input Action

---------------------------------------------------------------

1) id - num * id $ shift

2) id - num * id $ reduce by rule 8

3) F - num * id $ reduce by rule 6

4) T - num * id $ reduce by rule 3

5) E - num * id $ shift

6) E - num * id $ shift

7) E - num * id $ reduce by rule 7

8) E - F * id $ reduce by rule 6

9) E - T * id $ shift

10) E - T * id $ shift

11) E - T * id $ reduce by rule 8

12) E - T * F $ reduce by rule 4

13) E - T $ reduce by rule 2

25

14) E $ shift

15) S accept (reduce by rule 0)

We start with an empty stack and we finish when the stack contains the non-terminal S only.Every time we shift, we push the current token into the stack and read the next token fromthe input sequence. When we reduce by a rule, the top symbols of the stack (the handle)must match the rhs of the rule exactly. Then we replace these top symbols with the lhs of therule (a non-terminal). For example, in the line 12 above, we reduce by rule 4 (T ::= T * F).Thus we pop the symbols F, *, and T (in that order) and we push T. At each instance, thestack (from bottom to top) concatenated with the rest of the input (the unread input) formsa valid bottom-up derivation from the input sequence. That is, the derivation sequence hereis:

1) id - num * id $

3) => F - num * id $

4) => T - num * id $

5) => E - num * id $

8) => E - F * id $

9) => E - T * id $

12) => E - T * F $

13) => E - T $

14) => E $

15) => S

The purpose of the above example is to understand how bottom-up parsing works. It involveslots of guessing. We will see that we don’t actually push symbols in the stack. Rather, wepush states that represent possibly more than one potentials for rule reductions.

As we learned in earlier sections, a DFA can be represented by transition tables. In thiscase we use two tables: ACTION and GOTO. ACTION is used for recognizing which actionto perform and, if it is shifting, which state to go next. GOTO indicates which state togo after a reduction by a rule for a nonterminal symbol. We will first learn how to use thetables for parsing and then how to construct these tables.

3.3.2 Shift-Reduce Parsing Using the ACTION/GOTO Tables

Consider the following grammar:

0) S ::= R $

1) R ::= R b

2) R ::= a

which parses the R.E. ab∗$.The DFA that recognizes the handles for this grammar is: (it will be explained later how

it is constructed)

26

where ’r2’ for example means reduce by rule 2 (ie, by R :: = a) and ’a’ means accept. Thetransition from 0 to 3 is done when the current state is 0 and the current input token is ’a’.If we are in state 0 and have completed a reduction by some rule for R (either rule 1 or 2),then we jump to state 1.

The ACTION and GOTO tables that correspond to this DFA are:state action goto

a b $ S R0 s3 11 s4 s22 a a a3 r2 r2 r24 r3 r3 r3

where for example s3 means shift a token into the stack and go to state 3. That is,transitions over terminals become shifts in the ACTION table while transitions over non-terminals are used in the GOTO table.

Here is the shift-reduce parser:

push(0);

read_next_token();

for(;;)

{ s = top(); /* current state is taken from top of stack */

if (ACTION[s,current_token] == ’si’) /* shift and go to state i */

{ push(i);

read_next_token();

}

else if (ACTION[s,current_token] == ’ri’)

/* reduce by rule i: X ::= A1...An */

{ perform pop() n times;

s = top(); /* restore state before reduction from top of stack */

push(GOTO[s,X]); /* state after reduction */

}

else if (ACTION[s,current_token] == ’a’)

success!!

else error();

}

Note that the stack contains state numbers only, not symbols.For example, for the input abb$, we have:

27

Stack rest-of-input Action

----------------------------------------------------------------------------

0 abb$ s3

0 3 bb$ r2 (pop(), push GOTO[0,R] since R ::= a)

0 1 bb$ s4

0 1 4 b$ r1 (pop(), pop(), push GOTO[0,R] since R ::= R b)

0 1 b$ s4

0 1 4 $ r1 (pop(), pop(), push GOTO[0,R] since R ::= R b)

0 1 $ s2

0 1 2 accept

3.3.3 Table Construction

Now the only thing that remains to do is, given a CFG, to construct the finite automaton(DFA) that recognizes handles. After that, constructing the ACTION and GOTO tableswill be straightforward.

The states of the finite state machine correspond to item sets. An item (or configuration)is a production with a dot (.) at the rhs that indicates how far we have progressed using thisrule to parse the input. For example, the item E ::= E + . E indicates that we are usingthe rule E ::= E + E and that, using this rule, we have parsed E, we have seen a token +,and we are ready to parse another E. Now, why do we need a set (an item set) for each statein the state machine? Because many production rules may be applicable at this point; laterwhen we will scan more input tokens we will be able tell exactly which production to use.This is the time when we are ready to reduce by the chosen production.

For example, say that we are in a state that corresponds to the item set with the followingitems:

S ::= id . := E

S ::= id . : S

This state indicates that we have already parsed an id from the input but we have 2 pos-sibilities: if the next token is := we will use the first rule and if it is : we will use thesecond.

Now we are ready to construct our automaton. Since we do not want to work withNFAs, we build a DFA directly. So it is important to consider closures (like we did when wetransformed NFAs to DFAs). The closure of an item X ::= a . t b (ie, the dot appearsbefore a terminal t) is the singleton set that contains the item X ::= a . t b only. Theclosure of an item X ::= a . Y b (ie, the dot appears before a nonterminal Y) is the setconsisting of the item itself, plus all productions for Y with the dot at the left of the rhs,plus the closures of these items. For example, the closure of the item E ::= E + . T is theset:

E ::= E + . T

T ::= . T * F

T ::= . T / F

T ::= . F

28

F ::= . num

F ::= . id

The initial state of the DFA (state 0) is the closure of the item S ::= . a $, whereS ::= a $ is the first rule. In simple words, if there is an item X ::= a . s b in an item

set, where s is a symbol (terminal or nonterminal), we have a transition labelled by s to anitem set that contains X ::= a s . b. But it’s a little bit more complex than that:

1. If we have more than one item with a dot before the same symbol s, say X ::= a . s b

and Y ::= c . s d, then the new item set contains both X ::= a s . b and Y ::= c s . d.

2. We need to get the closure of the new item set.

3. We have to check if this item set has been appeared before so that we don’t create itagain.

For example, our previous grammar which parses the R.E. ab∗$:

0) S ::= R $

1) R ::= R b

2) R ::= a

has the following item sets:

which forms the following DFA:

We can understand now the R transition: If we read an entire R term (that is, after wereduce by a rule that parses R), and the previous state before the reduction was 0, we jumpto state 1.

As another example, the CFG:

29

1) S ::= E $

2) E ::= E + T

3) | T

4) T ::= id

5) | ( E )

has the following item sets:

I0: S ::= . E $ I4: E ::= E + T .

E ::= . E + T

E ::= . T I5: T ::= id .

T ::= . id

T ::= . ( E ) I6: T ::= ( . E )

E ::= . E + T

I1: S ::= E . $ E ::= . T

E ::= E . + T T ::= . id

T ::= . ( E )

I2: S ::= E $ .

I7: T ::= ( E . )

I3: E ::= E + . T E ::= E . + T

T ::= . id

T ::= . ( E ) I8: T ::= ( E ) .

I9: E ::= T .

that correspond to the DFA:

Explanation: The initial state I0 is the closure of S ::= . E $. The two items S ::= . E $

and E ::= . E + T in I0 must have a transition by E since the dot appears before E. So wehave a new item set I1 and a transition from I0 to I1 by E. The I1 item set must contain theclosure of the items S ::= . E $ and E ::= . E + T with the dot moved one place right.

30

Since the dot in I1 appears before terminals ($ and +), the closure is the items themselves.Similarly, we have a transition from I0 to I9 by T, to I5 by id, to I6 by ’(’, etc, and we dothe same thing for all item sets.

Now if we have an item set with only one item S ::= E ., where S is the start symbol,then this state is the accepting state (state I2 in the example). If we have an item set withonly one item X ::= a . (where the dot appears at the end of the rhs), then this statecorresponds to a reduction by the production X ::= a. If we have a transition by a terminalsymbol (such as from I0 to I5 by id), then this corresponds to a shifting.

The ACTION and GOTO tables that correspond to this DFA are:state action goto

id ( ) + $ S E T0 s5 s6 1 91 s3 s22 a a a a a3 s5 s6 44 r2 r2 r2 r2 r25 r4 r4 r4 r4 r46 s5 s6 7 97 s8 s38 r5 r5 r5 r5 r59 r3 r3 r3 r3 r3

As another example, consider the following augmented grammar:

0) S’ ::= S $

1) S ::= B B

2) B ::= a B

3) B ::= c

The state diagram is:

31

The ACTION and GOTO parsing tables are:state action goto

a c $ S’ S B0 s5 s7 1 31 s22 a a a3 s5 s7 44 r1 r1 r15 s5 s7 66 r2 r2 r27 r3 r3 r3

As yet another example, the grammar:

S ::= E $

E ::= ( L )

E ::= ( )

E ::= id

L ::= L , E

L ::= E

has the following state diagram:

32

If we have an item set with more than one items with a dot at the end of their rhs, thenthis is an error, called a reduce-reduce conflict, since we can not choose which rule to use toreduce. This is a very important error and it should never happen. Another conflict is ashift-reduce conflict where we have one reduction and one or more shifts at the same itemset. For example, when we find b in a+b+c, do we reduce a+b or do we shift the plus tokenand reduce b+c later? This type of ambiguities is usually resolved by assigning the properprecedences to operators (described later). The shift-reduce conflict is not as severe as thereduce-reduce conflict: a parser generator selects reduction against shifting but we may geta wrong behavior. If the grammar has no reduce-reduce and no shift-reduce conflicts, it isLR(0) (ie. we read left-to-right, we use right-most derivations, and we don’t look ahead anytokens).

3.3.4 SLR(1), LR(1), and LALR(1) Grammars

Here is an example of a grammar that is not LR(0):

1) S ::= E $

2) E ::= E + T

3) | T

4) T ::= T * F

5) | F

6) F ::= id

7) | ( E )

Let’s focus on two item sets only:

I0: S ::= . E $ I1: E ::= T .

E ::= . E + T T ::= T . * F

E ::= . T

T ::= . T * F

T ::= . F

F ::= . id

33

F ::= . ( E )

State I1 has a shift-reduce conflict. Another example is:

S ::= X

X ::= Y

| id

Y ::= id

which includes the following two item sets:

I0: S ::= . X $

X ::= . Y I1: X ::= id .

X ::= . id Y ::= id .

Y ::= . id

State I1 has a reduce-reduce conflict.Can we find an easy fix for the reduce-reduce and the shift-reduce conflicts? Consider

the state I1 of the first example above. The FOLLOW of E is {$,+,)}. We can see that * isnot in the FOLLOW[E], which means that if we could see the next token (called the lookahead

token) and this token is *, then we can use the item T ::= T . * F to do a shift; otherwise,if the lookahead is one of {$,+,)}, we reduce by E ::= T. The same trick can be used in thecase of a reduce-reduce conflict. For example, if we have a state with the items A ::= a .

and B ::= b ., then if FOLLOW[A] doesn’t overlap with FOLLOW[B], then we can deducewhich production to reduce by inspecting the lookahead token. This grammar (where wecan look one token ahead) is called SLR(1) and is more powerful than LR(0).

The previous trick is not always possible. Consider the grammar:

S ::= E $

E ::= L = R

| R

L ::= * R

| id

R ::= L

for C-style variable assignments. Consider the two states:

I0: S ::= . E $

E ::= . L = R I1: E ::= L . = R

E ::= . R R ::= L .

L ::= . * R

L ::= . id

R ::= . L

34

we have a shift-reduce conflict in I1. The FOLLOW of R is the union of the FOLLOW of Eand the FOLLOW of L, ie, it is {$,=}. In this case, = is a member of the FOLLOW of R, sowe can’t decide shift or reduce using just one lookahead.

An even more powerful grammar is LR(1), described below. This grammar is not usedin practice because of the large number of states it generates. A simplified version of thisgrammar, called LALR(1), has the same number of states as LR(0) but it is far more powerfulthan LR(0) (but less powerful than LR(1)). It is the most important grammar from allgrammars we learned so far. CUP, Bison, and Yacc recognize LALR(1) grammars.

Both LR(1) and LALR(1) check one lookahead token (they read one token ahead fromthe input stream – in addition to the current token). An item used in LR(1) and LALR(1)is like an LR(0) item but with the addition of a set of expected lookaheads which indicatewhat lookahead tokens would make us perform a reduction when we are ready to reduceusing this production rule. The expected lookahead symbols for a rule X ::= a are always asubset or equal to FOLLOW[X]. The idea is that an item in an itemset represents a potentialfor a reduction using the rule associated with the item. But each itemset (ie. state) can bereached after a number of transitions in the state diagram, which means that each itemsethas an implicit context, which, in turn, implies that there are only few terminals permitted toappear in the input stream after reducing by this rule. In SLR(1), we made the assumptionthat the followup tokens after the reduction by X ::= a are exactly equal to FOLLOW[X].But this is too conservative and may not help us resolve the conflicts. So the idea is torestrict this set of followup tokens by making a more careful analysis by taking into accountthe context in which the item appears. This will reduce the possibility of overlappings insift/reduce or reduce/reduce conflict states. For example,

L ::= * . R, =$

indicates that we have the item L ::= * . R and that we have parsed * and will reduce byL ::= * R only when we parse R and the next lookahead token is either = or $ (end of file).It is very important to note that the FOLLOW of L (equal to {$,=}) is always a supersetor equal to the expected lookaheads. The point of propagating expected lookaheads to theitems is that we want to restrict the FOLLOW symbols so that only the relevant FOLLOWsymbols would affect our decision when to shift or reduce in the case of a shift-reduce orreduce-reduce conflict. For example, if we have a state with two items A ::= a ., s1 andB ::= b ., s2, where s1 and s2 are sets of terminals, then if s1 and s2 are not overlapping,this is not a reduce-reduce error any more, since we can decide by inspecting the lookaheadtoken.

So, when we construct the item sets, we also propagate the expected lookaheads. Whenwe have a transition from A ::= a . s b by a symbol s, we just propagate the expectedlookaheads. The tricky part is when we construct the closures of the items. Recall that whenwe have an item A ::= a . B b, where B is a nonterminal, then we add all rules B ::= . c

in the item set. If A ::= a . B b has an expected lookahead t, then B ::= . c has all theelements of FIRST[bt] as expected lookaheads.

For example, consider the previous grammar:

S ::= E $

E ::= L = R

35

| R

L ::= * R

| id

R ::= L

Two of the LR(1) item sets are:

I0: S ::= . E $ ?

E ::= . L = R $ I1: E ::= L . = R $

E ::= . R $ R ::= L . $

L ::= . * R =$

L ::= . id =$

R ::= . L $

We always start with expected lookahead ? for the rule S ::= E $, which basicallyindicates that we don’t care what is after end-of-file. The closure of L will contain both= and $ for expected lookaheads because in E ::= . L = R the first symbol after L is =,and in R ::= . L (the closure of E ::= . R) we use the $ terminal for expected lookaheadpropagated from E ::= . R since there is no symbol after L. We can see that there is noshift reduce error in I1: if the lookahead token is $ we reduce, otherwise we shift (for =).

In LR(1) parsing, an item A ::= a, s1 is different from A ::= a, s2 if s1 is differ-ent from s2. This results to a large number of states since the combinations of expectedlookahead symbols can be very large. To reduce the number of states, when we have twoitems like those two, instead of creating two states (one for each item), we combine the twostates into one by creating an item A := a, s3, where s3 is the union of s1 and s2. Sincewe make the expected lookahead sets larger, there is a danger that some conflicts will haveworse chances to be resolved. But the number of states we get is the same as that for LR(0).This grammar is called LALR(1).

There is an easier way to construct the LALR(1) item sets. Simply start by constructingthe LR(0) item sets. Then we add the expected lookaheads as follows: whenever we finda new lookahead using the closure law given above we add it in the proper item; when wepropagate lookaheads we propagate all of them. Sometimes when we insert a new lookaheadwe need to do all the lookahead propagations again for the new lookahead even in the casewe have already constructed the new items. So we may have to loop through the items manytimes until no new lookaheads are generated. Sounds hard? Well that’s why there are parsergenerators to do that automatically for you. This is how CUP work.

3.3.5 Practical Considerations for LALR(1) Grammars

Most parsers nowdays are specified using LALR(1) grammars fed to a parser generator, suchas CUP. There are few tricks that we need to know to avoid reduce-reduce and shift-reduceconflicts.

When we learned about top-down predictive parsing, we were told that left recursionis bad. For LALR(1) the opposite is true: left recursion is good, right recursion is bad.Consider this:

36

L ::= id , L

| id

which captures a list of ids separated by commas. If the FOLLOW of L (or to be moreprecise, the expected lookaheads for L ::= id) contains the comma token, we are in bigtrouble. We can use instead:

L ::= L , id

| id

Left recursion uses less stack than right recursion and it also produces left associativetrees (which is what we usually want).

Some shift-reduce conflicts are very difficult to eliminate. Consider the infamous if-then-else conflict:

S ::= if E then S else S

| if E then S

| ...

Read Section 3.3 in the textbook to see how this can be eliminated.Most shift-reduce errors though are easy to remove by assigning precedence and associa-

tivity to operators. Consider the grammar:

S ::= E $

E ::= E + E

| E * E

| ( E )

| id

| num

and four of its LALR(1) states:

I0: S ::= . E $ ?

E ::= . E + E +*$ I1: S ::= E . $ ? I2: E ::= E * . E +*$

E ::= . E * E +*$ E ::= E . + E +*$ E ::= . E + E +*$

E ::= . ( E ) +*$ E ::= E . * E +*$ E ::= . E * E +*$

E ::= . id +*$ E ::= . ( E ) +*$

E ::= . num +*$ I3: E ::= E * E . +*$ E ::= . id +*$

E ::= E . + E +*$ E ::= . num +*$

E ::= E . * E +*$

Here we have a shift-reduce error. Consider the first two items in I3. If we have a*b+c

and we parsed a*b, do we reduce using E ::= E * E or do we shift more symbols? In theformer case we get a parse tree (a*b)+c; in the latter case we get a*(b+c). To resolve thisconflict, we can specify that * has higher precedence than +. The precedence of a grammarproduction is equal to the precedence of the rightmost token at the rhs of the production.

37

For example, the precedence of the production E ::= E * E is equal to the precedence ofthe operator *, the precedence of the production E ::= ( E ) is equal to the precedence ofthe token ), and the precedence of the production E ::= if E then E else E is equal tothe precedence of the token else. The idea is that if the lookahead has higher precedencethan the production currently used, we shift. For example, if we are parsing E + E usingthe production rule E ::= E + E and the lookahead is *, we shift *. If the lookahead hasthe same precedence as that of the current production and is left associative, we reduce,otherwise we shift. The above grammar is valid if we define the precedence and associativityof all the operators. Thus, it is very important when you write a parser using CUP or anyother LALR(1) parser generator to specify associativities and precedences for most tokens(especially for those used as operators). Note: you can explicitly define the precedence of arule in CUP using the %prec directive:

E ::= MINUS E %prec UMINUS

where UMINUS is a pseudo-token that has higher precedence than TIMES, MINUS etc, so that-1*2 is equal to (-1)*2, not to -(1*2).

Another thing we can do when specifying an LALR(1) grammar for a parser generatoris error recovery. All the entries in the ACTION and GOTO tables that have no contentcorrespond to syntax errors. The simplest thing to do in case of error is to report it andstop the parsing. But we would like to continue parsing finding more errors. This is callederror recovery. Consider the grammar:

S ::= L = E ;

| { SL } ;

| error ;

SL ::= S ;

| SL S ;

The special token error indicates to the parser what to do in case of invalid syntax forS (an invalid statement). In this case, it reads all the tokens from the input stream until itfinds the first semicolon. The way the parser handles this is to first push an error state inthe stack. In case of an error, the parser pops out elements from the stack until it finds anerror state where it can proceed. Then it discards tokens from the input until a restart ispossible. Inserting error handling productions in the proper places in a grammar to do gooderror recovery is considered very hard.

3.4 Case Study: The Calculator Parser

The file simple calc.cup contains the CUP grammar for the calculator parser (without se-mantics actions). Notice the section that specifies the precedence and associativity of theterminals symbols:

precedence nonassoc ELSE;

precedence right OR;

precedence right AND;

38

precedence nonassoc NOT;

precedence left EQ, LT, GT, LE, GE, NE;

precedence left PLUS, MINUS;

precedence left TIMES, DIV;

It lists the terminals in order of precedence, from lower to higher. Thus, TIMES and DIV havethe highest precedence. For example, 1+3*4 is equivalent to 1+(3*4) since TIMES has higherprecedence than PLUS. In addition, the keywords left, right, and nonassoc indicate that theoperators have left, right, or no associativity at all, respectively. This means that 1+2+3 isequivalent to (1+2)+3, while (x and y and z) is equivalent to (x and (y and z)). Thislist is very important for helping CUP resolve shift-reduce conflicts. The reason that we setELSE to the lowest precendence is to resolve the infamous if-then-else conflict: It basicallymeans that we always shift in case of nested if-then-elses.

Another thing to notice is that, when there is a choice, left recursion is preferred fromright recursion, such as in:

expl ::= expl COMMA exp

| exp

4 Abstract Syntax

A grammar for a language specifies a recognizer for this language: if the input satisfies thegrammar rules, it is accepted, otherwise it is rejected. But we usually want to performsemantic actions against some pieces of input recognized by the grammar. For example, ifwe are building a compiler, we want to generate machine code. The semantic actions arespecified as pieces of code attached to production rules. For example, the CUP productions:

E ::= E:e1 + E:e2 {: RESULT = e1 + e2; :}

| E:e1 - E:e2 {: RESULT = e1 - e2; :}

| num:n {: RESULT = n; :}

;

contain pieces of Java code (enclosed by {: :}) to be executed at a particular point of time.Here, each expression E is associated with an integer. The goal is to calculate the value of Estored in the variable RESULT. For example, the first rule indicates that we are reading oneexpression E and call its result e1, then we read a +, then we read another E and call itsresult e2, and then we are executing the Java code {: RESULT = e1 + e2; :}. The codecan appear at any point of the rhs of a rule (even at the beginning of the rhs of the rule) andwe may have more than one (or maybe zero) pieces of code in a rule. The code is executedonly when all the symbols at the left of the code in the rhs of the rule have been processed.

It is very uncommon to build a full compiler by adding the appropriate actions to pro-ductions. It is highly preferable to build the compiler in stages. So a parser usually buildsan Abstract Syntax Tree (AST) and then, at a later stage of the compiler, these ASTs arecompiled into the target code. There is a fine distinction between parse trees and ASTs.A parse tree has a leaf for each terminal and an internal node for each nonterminal. For

39

each production rule used, we create a node whose name is the lhs of the rule and whosechildren are the symbols of the rhs of the rule. An AST on the other hand is a compact datastructure to represent the same syntax regardless of the form of the production rules usedin building this AST.

4.1 Building Abstract Syntax Trees in Java

When building ASTs, it’s a good idea to define multiple classes to capture various familiesof constructs. For example, we can have an Exp class to represent expressions, Stmt class torepresent statements, and Type class to represent types. Here is an example of Exp in Java:

abstract class Exp {

}

class IntegerExp extends Exp {

public int value;

public IntegerExp ( int n ) { value=n; }

}

class TrueExp extends Exp {

public TrueExp () {}

}

class FalseExp extends Exp {

public FalseExp () {}

}

class VariableExp extends Exp {

public String value;

public VariableExp ( String n ) { value=n; }

}

class BinaryExp extends Exp {

public String operator;

public Exp left;

public Exp right;

public BinaryExp ( String o, Exp l, Exp r ) { operator=o; left=l; right=r; }

}

class UnaryExp extends Exp {

public String operator;

public Exp operand;

public UnaryExp ( String o, Exp e ) { operator=o; operand=e; }

}

class ExpList {

public Exp head;

public ExpList next;

public ExpList ( Exp h, ExpList n ) { head=h; next=n; }

}

class CallExp extends Exp {

public String name;

40

public ExpList arguments;

public CallExp ( String nm, ExpList s ) { name=nm; arguments=s; }

}

class ProjectionExp extends Exp {

public Exp value;

public String attribute;

public ProjectionExp ( Exp v, String a ) { value=v; attribute=a; }

}

class RecordElements {


public Exp value;

public RecordElements next;

public RecordElements ( String a, Exp v, RecordElements el )

{ attribute=a; value=v; next=el; }

}

class RecordExp extends Exp {

public RecordElements elements;

public RecordExp ( RecordElements el ) { elements=el; }

}

For example,

new BinaryExp("+",new BinaryExp("-",new VariableExp("x"),new IntegerExp(2)),

new IntegerExp(3))

constructs the AST for the input (x-2)+3.

4.2 Building Abstract Syntax Trees in C

The previous example of expression ASTs can be written as follows in C:

typedef struct Exp {

enum { int_exp, true_exp, false_exp, variable_exp,

binary_op, unary_op, function_call,

record_construction, projection } tag;

union { int integer;

string variable;

struct { string oper;

struct Exp* left;

struct Exp* right; } binary;

struct { string oper;

struct Exp* uexp; } unary;

struct { string name;

struct Exp_list* arguments; } call;

struct rec { string attribute;

struct Exp* value;

41

struct rec* next; } record;

struct { struct Exp* value;

string attribute; } project;

} op;

} ast;

where Exp list is a list of ASTs:

typedef struct Exp_list {

ast* elem;

struct Exp_list* next;

} ast_list;

It’s a good idea to define a constructor for every kind of expression to simplify the task ofconstructing ASTs:

ast* make_binary_op ( string oper, ast* left, ast* right ) {

ast* e = (ast*) malloc(sizeof(ast));

e->tag = binary_op;

e->op.binary.oper = make_string(oper);

e->op.binary.left = left;

e->op.binary.right = right;

return e;

};

For example,

make_binary_op("+",make_binary_op("-",make_variable("x"),make_integer(2)),

make_integer(3))

constructs the AST for the input (x-2)+3.Unfortunately, when constructing a compiler, we need to define many tree-like data

structures to capture ASTs for many different constructs, such as expressions, statements,declarations, programs etc, as well as type structures, intermediate representation (IR) trees,etc. This would require hundreds of recursive structs in C or classes in Java. An alternativemethod is to use just one generic tree structure to capture all possible tree structures. Thisis exactly what we did for our calculator example and is described in detail below.

4.3 Gen: A Java Package for Constructing and Manipulating Ab-

stract Syntax Trees

The code for the calculator example is written in Gen. Gen is a Java preprocessor that addssyntactic constructs to the Java language to make the task of handling Abstract SyntaxTrees (ASTs) easier. The class project will be developed using Gen.

Two classes are used by Gen, which are defined in Ast.java: the class Ast that cap-tures an AST (with subclasses Variable, Number, Real, Astring, and Node), and the classArguments that captures a list of ASTs. An Ast is a tree-like data structure, which is usedfor representing various tree-like data structures used in compiler construction, includingASTs and Intermediate Representation (IR) code.

42

abstract class Ast {

}

class Number extends Ast {

public long value;

public Number ( long n ) { value = n; }

}

class Real extends Ast {

public double value;

public Real ( double n ) { value = n; }

}

class Variable extends Ast {


public Variable ( String s ) { value = s; }

}

class Astring extends Ast {


public Astring ( String s ) { value = s; }

}

class Node extends Ast {

public String name;

public Arguments args;

public Node ( String n, Arguments a ) { tag = n; args = a; }

}

where the Arguments class represents a list of ASTs:

class Arguments {

public Ast head;

public Arguments tail;

public Arguments ( Ast h, Arguments t );

public final static Arguments nil;

public Arguments append ( Ast e );

}

See the file Ast.java for a complete definition.For example, the Java expression

new Node("Binop",

Arguments.nil.append(new Variable("Plus"))

.append(new Variable("x"))

.append(new Node("Binop",

Arguments.nil.append(new Variable("Minus"))

.append(new Variable("y"))

.append(new Variable("z")))))

constructs the AST Binop(Plus,x,Binop(Minus,y,z)).

43

The nice thing about this approach is that we do not need to add a new class to defineanother tree-like data structure. The disadvantage is that it’s now easier to make mistakeswhen writing programs to manipulate these tree structures. For example, we may pass astatement tree in a procedure that handles expression trees and this will not be detected bythe Java compiler.

To make the task of writing these tree constructions less tedious, Gen extends Java withthe syntactic form #< >. For example, #<Binop(Plus,x,Binop(Minus,y,z))> is equivalentto the above Java expression. That is, the text within the brackets #< > is used by Gento generate Java code, which creates the tree-like form (an instance of the class Ast) thatrepresents this text. Values of the class Ast can be included into the form generated bythe #< > brackets by “escaping” them with a backquote character (‘). The operand of theescape operator (the backquote operator) is expected to be an expression of type Ast thatprovides the value to “fill in” the hole in the bracketed text at that point (actually, anescaped string/int/double is also lifted to an Ast). For example, in

Ast x = #<join(a,b,p)>;

Ast y = #<select(‘x,q)>;

Ast z = #<project(‘y,A)>;

y is set to #<select(join(a,b,p),q)> and z to #<project(select(join(a,b,p),q),A)>.There is also bracketed syntax, #[ ], for constructing instances of Arguments.

The bracketed syntax has the following BNF:

bracketed ::= "#<" expr ">" an AST construction

| "#[" arg "," ... "," arg "]" an Arguments construction

expr ::= name the representation of a variable name

| integer the repr. of an integer

| real the repr. of a real number

| string the repr. of a string

| "‘" name escaping to the value of \verb@name@

| "‘(" code ")" escaping to the value of \verb@code@

| name "(" arg "," ... "," arg ")" the repr. of an AST node with zero or more children

| "‘" name "(" arg "," ... "," arg ")" the repr. of an AST node with escaped name

| expr opr expr an AST node that represents a binary infix operation

| "‘" name "[" expr "]" variable substitution (explained later)

arg ::= expr the repr. of an expression

| "..." name escaping to a list of ASTs bound to \verb@name@

| "...(" code ")" escaping to a list of ASTs returned by \verb@code@

where code is any Java code. The #< ‘(code) > embeds the value returned by the Javacode code of type Ast to the term representation inside the brackets.

For example, #<‘f(6,...r,g("ab",‘(k(x))),‘y)> is equivalent to the following Javacode:

new Node(f,

Arguments.nil.append(new Number(6))

.append(r)

.append(new Node("g",Arguments.nil.append(new Astring("ab"))

.append(k(x))))

.append(y)

44

If f="h", y=#<m(1,"a")>, and k(x) returns the value #<8>, then the above term is equivalentto #<h(6,g("ab",8),m(1,"a"))>. The three dots (...) construct is used to indicate a listof children in an AST node. Since this list is an instance of the class Arguments, the typeof name in ...name is also Arguments. For example, in

Arguments r = #[join(a,b,p),select(c,q)];

Ast z = #<project(...r)>;

z will be bound to #<project(join(a,b,p),select(c,q))>.Gen provides a case statement syntax with patterns. Patterns match the Ast represen-

tations with similar shape. Escape operators applied to variables inside these patterns rep-resent variable patterns, which “bind” to corresponding subterms upon a successful match.This capability makes it particularly easy to write functions that perform source-to-sourcetransformations. A function that simplifies arithmetic expressions can be expressed easilyas:

Ast simplify ( Ast e ) {

#case e

| plus(‘x,0) => return x;

| times(‘x,1) => return x;

| times(‘x,0) => return #<0>;

| _ => return e;

#end;

}

where the _ pattern matches any value. For example, simplify(#<times(z,1)>) returns#<z>, since times(z,1) matches the second case pattern. The BNF of the case statementis:

case_stmt ::= "#case" code case ... case "#end"

case ::= "|" expr guard "=>" code

guard ::= ":" code an optional condition

|

expr ::= name exact match with a variable name

| integer exact match with an integer

| real exact match with a real number

| string exact match with a string

| "‘" name match with the value of \verb@name@

| "‘(" code ")" match with the value of \verb@code@

| name "(" arg "," ... "," arg ")" match with an AST node with zero or more children

| "‘" name "(" arg "," ... "," arg ")" match with an AST node with escaped name

| expr opr expr an AST node that represents a binary infix operation

| "‘" name "[" expr "]" second-order matching (explained later)

| "_" match any Ast

arg ::= expr match with an Ast

| "..." name match with a list of ASTs bound to \verb@name@

| "...(" code ")" match with a list of ASTs returned by \verb@code@

| "..." match the rest of the arguments

For example, the pattern ‘f(...r) matches any Ast Node: when it is matched with#<join(a,b,c)>, it binds f to the string "join" and r to the Arguments #[a,b,c]. An-other example is the following function that adds the terms #<8> and #<9> as children toany Node e:

45

Ast add_arg ( Ast e ) {

#case e

| ‘f(...r) => return #<‘f(8,9,...r)>;

| ‘x => return x;

#end;

}

As another example of a case statement in Gen, consider the following function, whichswitches the inputs of a binary join found as a parameter to a Node e:

Ast switch_join_args ( Ast e ) {

#case e

| ‘f(...r,join(‘x,‘y),...s) => return #<‘f(...r,join(‘y,‘x),...s)>;

| ‘x => return x;

#end;

}

The most powerful construct in Gen is second-order pattern matching, denoted by the specialsyntax ‘f[expr]. When ‘f[expr] is matched against an Ast, e, it traverses the entire treerepresentation of e (in preorder) until it finds a tree node that matches the pattern expr.It fails when it does not find a match. When it finds a match, it succeeds and binds thevariables in the pattern expr. Furthermore, it binds the variable f to a list of Ast (of classArguments) that represents the path from the root Ast to the Ast node that matched thepattern. This is best used in conjuction with the bracketed expression ‘f[e], which usesthe path bound in f to construct a new Ast with expr replaced with e. For example, theGen program

Ast n = new_name();

#case e

| ‘f[join(‘x,‘y)] => return #<let(‘n,join(‘x,‘y),‘f[‘n])>;

#end;

extracts the first term that matches a join from the term e and it pulls the term out in alet-binding.

Another syntactic construct in Gen is a for-loop that iterates over Arguments:

"#for" name "in" code "do" code "#end"

For example,

#for v in #[a,b,c] do

System.out.println(v);

#end;

5 Semantic Actions

Let’s consider now how actions are evaluated by different parsers. In recursive descentparsers, actions are pieces of code embedded in the recursive procedures. For the followinggrammar:

46

E ::= T E’

E’ ::= + T E’

| - T E’

|

T ::= num

we have the following recursive descent parser:

int E () { return Eprime(T()); };

int Eprime ( int left ) {

if (current_token==’+’) {

read_next_token();

return Eprime(left + T());

} else if (current_token==’-’) {

read_next_token();

return Eprime(left - T());

} else return left;

};

int T () {

if (current_token==’num’) {

read_next_token();

return num_value;

} else error();

};

By passing T() as input to Eprime, we pass the left operand to Eprime.Table-driven predictive parsers use the parse stack to push/pop actions (along with sym-

bols) but they use a separate semantic stack to execute the actions. In that case, the parsingalgorithm becomes:

push(S);

read_next_token();

repeat

X = pop();

if (X is a terminal or ’$’)

if (X == current_token)

read_next_token();

else error();

else if (X is an action)

perform the action;

else if (M[X,current_token] == "X ::= Y1 Y2 ... Yk")

{ push(Yk);

...

push(Y1);

}

else error();

until X == ’$’;

47

For example, suppose that pushV and popV are the functions to manipulate the semanticstack. The following is the grammar of an interpreter that uses the semantic stack to performadditions and subtractions:

E ::= T E’ $ { print(popV()); }

E’ ::= + T { pushV(popV() + popV()); } E’

| - T { pushV(-popV() + popV()); } E’

|

T ::= num { pushV(num); }

For example, for 1+5-2, we have the following sequence of actions:

pushV(1); pushV(5); pushV(popV()+popV()); pushV(3);

pushV(-popV()+popV()); print(popV());

Question: what would happen if we put the action of the second rule at the end of rhs?In contrast to top-down parsers, bottom-up parsers can only perform an action after a

reduction (ie, after the entire rhs of a rule has been processed). Why? because at a particularinstance of time we may have a potential for multiple rules for reduction (this is the ideabehind itemsets), which means that we may be in the middle of many rules at a time, butlater only one rule will actually be used; so, we can’t execute an action in the middle of arule because we may have to undo it later if the rule is not used for reduction. This meansthat we can only have rules of the form

X ::= Y1 ... Yn { action }

where the action is always at the end of the rule. This action is evaluated after the ruleX ::= Y1 ... Yn is reduced. To evaluate actions, in addition to state numbers, the parserpushes values into the parse stack: Both terminals and non-terminals are associated withtyped values, which in the CUP parser generator are instances of the Object class (or ofsome subclass of the Object class). Since the Java Object class is a superclass of all classes,it doesn’t carry any additional information, but the subclasses of Object, such as the classInteger, the class String, and the class Ast for ASTs, do. The value associated with a terminalis in most cases an Object, except for an identifier which is a String, for an integer which isan Integer, etc. The typical values associated with non-terminals in a compiler are ASTs,lists of ASTs, etc. In CUP, you can retrieve the value of a symbol s at the lhs of a rule byusing the notation s:x, where @x@ is a variable name that hasn’t appeared elsewhere in thisrule. The value of the non-terminal defined by a rule is called RESULT and should alwaysbe assigned a value in the action. For example, if the non-terminal E is associated with aninteger value (of type Integer), then the following rule:

E ::= E:n PLUS E:m {: RESULT = n+m; :}

retrieves the value, n, of the left operand from the parse stack, retrieves the value, m, ofthe right operand from the parse stack, and pushes the value of RESULT on the parse stack,which has been set to n+m after the reduction of the rule. That is, the elements of the parserstack in CUP are pairs of a state-number (integer) and an Object. So when the above ruleis reduced, the three top elements of the stack, which form the handle of the reduction, will

48

contain three elements: the state reached when we reduced the rule for E to get the leftoperand, the state for shifting over PLUS, and the state reached when we reduced the rule forE to get the right operand (top of stack). Along with these states, there are three Objects:one bound to n, one ignored (since the terminal PLUS is associated with an empty Object,which is ignored), and one bound to m (top of stack). When we reduce by the above rule, weuse the GOTO table to find which state to push, we pop the handle (three elements), andwe push the pair of this state and the RESULT value on the parse stack.

If we want build an AST in CUP, we need to associate each non-terminal symbol withan AST type. For example, if we use the non-terminals exp and expl for expressions andlist of expressions respectively, we can define their types as follows:

non terminal Ast exp;

non terminal Arguments expl;

Then the production rules should have actions to build ASTs:

exp ::= exp:e1 PLUS exp:e2 {: RESULT = new Node(plus_exp,e1,e2); :}

| exp:e1 MINUS exp:e2 {: RESULT = new Node(minus_exp,e1,e2); :}

| id:nm LP expl:el RP {: RESULT = new Node(call_exp,el.reverse()

.cons(new Variable(nm))); :}

| INT:n {: RESULT = new Number(n.intValue()); :}

;

expl ::= expl:el COMMA exp:e {: RESULT = el.cons(e); :}

| exp:e {: RESULT = nil.cons(e); :}

;

That is, for integer addition, we build an AST node that corresponds to a binary operation(see the AST in Section 4).

What if we want to put an action in the middle of the rhs of a rule in a bottom-up parser?In that case we use a dummy nonterminal, called a marker. For example,

X ::= a { action } b

is equivalent to

X ::= M b

M ::= a { action }

This is done automatically by the CUP parser generator (ie, we can actually put actionsin the middle of a rhs of a rule and CUP will use the above trick to put it at the end of arule). There is a danger though that the resulting rules may introduce new shift/reduce orreduce/reduce conflicts.

6 Semantic Analysis

6.1 Symbol Tables

After ASTs have been constructed, the compiler must check whether the input program istype-correct. This is called type checking and is part of the semantic analysis. During type

49

checking, a compiler checks whether the use of names (such as variables, functions, typenames) is consistent with their definition in the program. For example, if a variable x hasbeen defined to be of type int, then x+1 is correct since it adds two integers while x[1] iswrong. When the type checker detects an inconsistency, it reports an error (such as “Error:an integer was expected”). Another example of an inconsistency is calling a function withfewer or more parameters or passing parameters of the wrong type.

Consequently, it is necessary to remember declarations so that we can detect inconsis-tencies and misuses during type checking. This is the task of a symbol table. Note that asymbol table is a compile-time data structure. It’s not used during run time by staticallytyped languages. Formally, a symbol table maps names into declarations (called attributes),such as mapping the variable name x to its type int. More specifically, a symbol tablestores:

• for each type name, its type definition (eg. for the C type declaration typedef int* mytype,it maps the name mytype to a data structure that represents the type int*).

• for each variable name, its type. If the variable is an array, it also stores dimensioninformation. It may also store storage class, offset in activation record etc.

• for each constant name, its type and value.

• for each function and procedure, its formal parameter list and its output type. Eachformal parameter must have name, type, type of passing (by-reference or by-value),etc.

One convenient data structure for symbol tables is a hash table. One organization of ahash table that resolves conflicts is chaining. The idea is that we have list elements of type:

class Symbol {

public String key;

public Object binding;

public Symbol next;

public Symbol ( String k, Object v, Symbol r ) { key=k; binding=v; next=r; }

}

to link key-binding mappings. Here a binding is any Object, but it can be more specific (eg,a Type class to represent a type or a Declaration class, as we will see below). The hashtable is a vector Symbol[] of size SIZE, where SIZE is a prime number large enough to havegood performance for medium size programs (eg, SIZE=109). The hash function must mapany key (ie. any string) into a bucket (ie. into a number between 0 and SIZE-1). A typicalhash function is, h(key) = num(key) mod SIZE, where num converts a key to an integer andmod is the modulo operator. In the beginning, every bucket is null. When we insert a newmapping (which is a pair of key-binding), we calculate the bucket location by applying thehash function over the key, we insert the key-binding pair into a Symbol object, and we insertthe object at the beginning of the bucket list. When we want to find the binding of a name,we calculate the bucket location using the hash function, and we search the element list ofthe bucket from the beginning to its end until we find the first element with the requestedname.

50

Consider the following Java program:

1) {

2) int a;

3) {

4) int a;

5) a = 1;

6) };

7) a = 2;

8) };

The statement a = 1 refers to the second integer a, while the statement a = 2 refers to thefirst. This is called a nested scope in programming languages. We need to modify the symboltable to handle structures and we need to implement the following operations for a symboltable:

• insert ( String key, Object binding )

• Object lookup ( String key )

• begin_scope ()

• end_scope ()

We have already seen insert and lookup. When we have a new block (ie, when weencounter the token {), we begin a new scope. When we exit a block (ie. when we encounterthe token }) we remove the scope (this is the end_scope). When we remove a scope, weremove all declarations inside this scope. So basically, scopes behave like stacks. One wayto implement these functions is to use a stack of numbers (from 0 to SIZE) that refer tobucket numbers. When we begin a new scope we push a special marker to the stack (eg, -1).When we insert a new declaration in the hash table using insert, we also push the bucketnumber to the stack. When we end a scope, we pop the stack until and including the first-1 marker. For each bucket number we pop out from the stack, we remove the head of thebinding list of the indicated bucket number. For example, for the previous C program, wehave the following sequence of commands for each line in the source program (we assumethat the hash key for a is 12):

1) push(-1)

2) insert the binding from a to int into the beginning of the list table[12]

push(12)

3) push(-1)

4) insert the binding from a to int into the beginning of the list table[12]

push(12)

6) pop()

remove the head of table[12]

pop()

7) pop()

remove the head of table[12]

pop()

51

Recall that when we search for a declaration using lookup, we search the bucket list fromthe beginning to the end, so that if we have multiple declarations with the same name, thedeclaration in the innermost scope overrides the declaration in the outer scope.

The textbook makes an improvement to the above data structure for symbol tables bystoring all keys (strings) into another data structure and by using pointers instead of stringsfor keys in the hash table. This new data structure implements a set of strings (ie. no stringappears more than once). This data structure too can be implemented using a hash table.The symbol table itself uses the physical address of a string in memory as the hash key tolocate the bucket of the binding. Also the key component of element is a pointer to thestring. The idea is that not only we save space, since we store a name once regardless of thenumber of its occurrences in a program, but we can also calculate the bucket location veryfast.

6.2 Types and Type Checking

A typechecker is a function that maps an AST that represents an expression into its type.For example, if variable x is an integer, it will map the AST that represents the expressionx+1 into the data structure that represents the type int. If there is a type error in theexpression, such as in x>"a", then it displays an error message (a type error). So beforewe describe the typechecker, we need to define the data structures for types. Suppose thatwe have five kinds of types in the language: integers, booleans, records, arrays, and namedtypes (ie. type names that have been defined earlier by some typedef). Then one possibledata structure for types is:

abstract class Type {

}

class IntegerType extends Type {

public IntegerType () {}

}

class BooleanType extends Type {

public BooleanType () {}

}

class NamedType extends Type {

public String name;

public NamedType ( String n ) { value=n; }

}

class ArrayType extends Type {

public Type element;

public ArrayType ( Type et ) { element=et; }

}

class RecordComponents {


public Type type;

public RecordComponents next;

public RecordComponents ( String a, Type t, RecordComponents el )

52

{ attribute=a; type=t; next=el; }

}

class RecordType extends Type {

public RecordComponents elements;

public RecordType ( RecordComponents el ) { elements=el; }

}

that is, if the type is an integer or a boolean, there are no extra components. If it is a namedtype, we have the name of the type. If it is an array, we have the type of the array elements(assuming that the size of an array is unbound, otherwise we must include the array bounds).If it is a record, we have a list of attribute/types (the RecordComponents class) to capturethe record components.

The symbol table must contain type declarations (ie. typedefs), variable declarations,constant declarations, and function signatures. That is, it should map strings (names) intoDeclaration objects:

abstract class Declaration {

}

class TypeDeclaration extends Declaration {

public Type declaration;

public TypeDeclaration ( Type t ) { declaration=t; }

}

class VariableDeclaration extends Declaration {


public VariableDeclaration ( Type t ) { declaration=t; }

}

class ConstantDeclaration extends Declaration {


public Exp value;

public ConstantDeclaration ( Type t, Exp v ) { declaration=t; value=v; }

}

class TypeList {

public Type head;

public TypeList next;

public TypeList ( Type h, TypeList n ) { head=h; next=n; }

}

class FunctionDeclaration extends Declaration {

public Type result;

public TypeList parameters;

public FunctionDeclaration ( Type t, TypeList tl ) { result=t; parameters=tl; }

}

If we use the hash table with chaining implementation, the symbol table symbol_table

would look like this:

class Symbol {

53

public String key;

public Declaration binding;

public Symbol next;

public Symbol ( String k, Declaration v, Symbol r )

{ key=k; binding=v; next=r; }

}

Symbol[] symbol_table = new Symbol[SIZE];

Recall that the symbol table should support the following operations:

insert ( String key, Declaration binding )

Declaration lookup ( String key )

begin_scope ()

end_scope ()

The typechecking function may have the following signature:

static Type typecheck ( Exp e );

The function typecheck must be recursive since the AST structure is recursive. In fact, thisfunction is a tree traversals that checks each node of the AST tree recursively. The body ofthe typechecker may look like this:

static Type typecheck ( Exp e ) {

if (e instanceof IntegerExp)

return new IntegerType();

else if (e instanceof TrueExp)

return new BooleanType();

else if (e instanceof FalseExp)

return new BooleanType();

else if (e instanceof VariableExp) {

VariableExp v = (VariableExp) e;

Declaration decl = lookup(v.value);

if (decl == null)

error("undefined variable");

else if (decl instanceof VariableDeclaration)

return ((VariableDeclaration) decl).declaration;

else error("this name is not a variable name");

} else if (e instanceof BinaryExp) {

BinaryExp b = (BinaryExp) e;

Type left = typecheck(b.left);

Type right = typecheck(b.right);

switch ( b.operator ) {

case "+": if (left instanceof IntegerType

&& right instanceof IntegerType)

return new IntegerType();

54

else error("expected integers in addition");

...

}

} else if (e instanceof CallExp) {

CallExp c = (CallExp) e;

Declaration decl = lookup(c.name);

if (decl == null)

error("undefined function");

else if (!(decl instanceof FunctionDeclaration))

error("this name is not a function name");

FunctionDeclaration f = (FunctionDeclaration) decl;

TypeList s = f.parameters;

for (ExpList r=c.arguments; r!=null && s!=null; r=r.next, s=s.next)

if (!equal_types(s.head,typecheck(r.head)))

error("wrong type of the argument in function call")

if (r != null || s != null)

error("wrong number of parameters");

return f.result;

}

else ...

}

where equal_types(x,y) checks the types x and y for equality. We have two types of typeequality: type equality based on type name equivalence, or based on structural equivalence.For example, if we have defined T to be a synonym for the type int and have declared thevariable x to be of type T, then using the first type of equality, x+1 will cause a type error(since T and int are different names), while using the second equality, it will be correct.

Note also that since most realistic languages support many binary and unary operators, itwill be very tedious to hardwire their typechecking into the typechecker using code. Instead,we can use another symbol table to hold all operators (as well as all the system functions)along with their signatures. This also makes the typechecker easy to change.

6.3 Case Study: The Calculator Interpreter

In Section 4.3, we described Gen used in constructing ASTs for the calculator example. Weare now ready to go back to the calculator parser calc.gen to see how ASTs are constructed.Most nonterminals and some terminals have a type:

terminal String ID;

terminal Integer INT;

terminal Float REALN;

terminal String STRINGT;

non terminal Ast exp, string, name;

non terminal Arguments expl, names;

non terminal item, prog;

55

For example, each production for the nonterminal exp constructs a value of type Ast whenis reduced. For example, the following line in the calculator parser:

| exp:e1 PLUS exp:e2 {: RESULT = #<plus_exp(‘e1,‘e2)>; :}

constructs a new Ast for exp when the rule is reduced. This Ast, which is assigned to thevariable RESULT, is a tree node with label plus_exp and two children, e1 and e2, whichcorrespond to the Asts of the plus operands.

Recall that our calculator is an interpreter, rather than a compiler. The symbol table ofan interpreter must bind each program variable to both its type and its value. For example,when the interpret encounters the variable x in a program, it must assert the type of x (eg, aninteger) so it can tell if x is used correctly in the program. It also needs the value of x becauseit needs to evaluate x. A compiler would only need the type of x. Fortunately, our calculatorhas only one type: double floating points. So there is no need for type information sinceeverything is expected to be (or be able to be converted to) a floating point. The calculatorsymbol table is an instance of the class SymbolTable given in SymbolTable.java. It is a hashtable with items of type SymbolCell:

class SymbolCell {

String name;

Ast binding;

SymbolCell next;

SymbolCell ( String n, Ast v, SymbolCell r ) { name=n; binding=v; next=r; }

}

The binding is the actual value of the symbol, not its type. If we had multiple types, weshould have had another attribute type of type Ast in Symbol. The binding plays two roles:if the name is a variable name, then the binding is the actual value (a double floating pointnumber) represented by a Real AST node. If the name is a function definition, then thebinding is the AST of the actual definition. To implement variable scoping, a scope stackis used as it is described in the previous section. The special marker -1 in the scope stackindicates a new scope. All the other entries are locations in the symbol table.

The interpreter code is given in Eval.gen. Function eval evaluates the AST e using thesymbol table st and returns a double floating point number. When a variable is encountered,its value is retrieved from the symbol table. A function call can be either a primitiveoperation, such as an integer addition, or a call to a defined function. In either case, thefunction arguments must be evaluated before the operation/call is performed. The functioncall is performed by creating a new scope, binding the functions’ formal parameters to theevaluated call arguments, and evaluating the function body using the new scope. After thecall, the function scope is popped out.

7 Activation Records

7.1 Run-Time Storage Organization

An executable program generated by a compiler will have the following organization inmemory on a typical architecture (such as on MIPS):

56

This is the layout in memory of an executable program. Note that in a virtual memoryarchitecture (which is the case for any modern operating system), some parts of the memorylayout may in fact be located on disk blocks and they are retrieved in memory by demand(lazily).

The machine code of the program is typically located at the lowest part of the layout.Then, after the code, there is a section to keep all the fixed size static data in the program.The dynamically allocated data (ie. the data created using malloc in C) as well as thestatic data without a fixed size (such as arrays of variable size) are created and kept in theheap. The heap grows from low to high addresses. When you call malloc in C to create adynamically allocated structure, the program tries to find an empty place in the heap withsufficient space to insert the new data; if it can’t do that, it puts the data at the end of theheap and increases the heap size.

The focus of this section is the stack in the memory layout. It is called the run-time

stack. The stack, in contrast to the heap, grows in the opposite direction (upside-down):from high to low addresses, which is a bit counterintuitive. The stack is not only used topush the return address when a function is called, but it is also used for allocating some ofthe local variables of a function during the function call, as well as for some bookkeeping.

Lets consider the lifetime of a function call. When you call a function you not only wantto access its parameters, but you may also want to access the variables local to the function.Worse, in a nested scoped system where nested function definitions are allowed, you maywant to access the local variables of an enclosing function. In addition, when a function callsanother function, we must forget about the variables of the caller function and work withthe variables of the callee function and when we return from the callee, we want to switchback to the caller variables. That is, function calls behave in a stack-like manner. Considerfor example the following program:

procedure P ( c: integer )

x: integer;

57

procedure Q ( a, b: integer )

i, j: integer;

begin

x := x+a+j;

end;

begin

Q(x,c);

end;

At run-time, P will execute the statement x := x+a+j in Q. The variable a in this state-ment comes as a parameter to Q, while j is a local variable in Q and x is a local variable to P.How do we organize the runtime layout so that we will be able to access all these variablesat run-time? The answer is to use the run-time stack.

When we call a function f, we push a new activation record (also called a frame) onthe run-time stack, which is particular to the function f. Each frame can occupy manyconsecutive bytes in the stack and may not be of a fixed size. When the callee functionreturns to the caller, the activation record of the callee is popped out. For example, if themain program calls function P, P calls E, and E calls P, then at the time of the second call toP, there will be 4 frames in the stack in the following order: main, P, E, P

Note that a callee should not make any assumptions about who is the caller becausethere may be many different functions that call the same function. The frame layout in thestack should reflect this. When we execute a function, its frame is located on top of thestack. The function does not have any knowledge about what the previous frames contain.There two things that we need to do though when executing a function: the first is that weshould be able to pop-out the current frame of the callee and restore the caller frame. Thiscan be done with the help of a pointer in the current frame, called the dynamic link, thatpoints to the previous frame (the caller’s frame). Thus all frames are linked together in thestack using dynamic links. This is called the dynamic chain. When we pop the callee fromthe stack, we simply set the stack pointer to be the value of the dynamic link of the currentframe. The second thing that we need to do, is if we allow nested functions, we need to beable to access the variables stored in previous activation records in the stack. This is donewith the help of the static link. Frames linked together with static links form a static chain.The static link of a function f points to the latest frame in the stack of the function thatstatically contains f. If f is not lexically contained in any other function, its static link isnull. For example, in the previous program, if P called Q then the static link of Q will point tothe latest frame of P in the stack (which happened to be the previous frame in our example).Note that we may have multiple frames of P in the stack; Q will point to the latest. Alsonotice that there is no way to call Q if there is no P frame in the stack, since Q is hiddenoutside P in the program.

A typical organization of a frame is the following:

58

Before we create the frame for the callee, we need to allocate space for the callee’sarguments. These arguments belong to the caller’s frame, not to the callee’s frame. Thereis a frame pointer (called FP) that points to the beginning of the frame. The stack pointerpoints to the first available byte in the stack immediately after the end of the current frame(the most recent frame). There are many variations to this theme. Some compilers usedisplays to store the static chain instead of using static links in the stack. A display is aregister block where each pointer points to a consecutive frame in the static chain of thecurrent frame. This allows a very fast access to the variables of deeply nested functions.Another important variation is to pass arguments to functions using registers.

When a function (the caller) calls another function (the callee), it executes the followingcode before (the pre-call) and after (the post-call) the function call:

• pre-call: allocate the callee frame on top of the stack; evaluate and store functionparameters in registers or in the stack; store the return address to the caller in aregister or in the stack.

• post-call: copy the return value; deallocate (pop-out) the callee frame; restore param-eters if they passed by reference.

In addition, each function has the following code in the beginning (prologue) and at the end(epilogue) of its execution code:

• prologue: store frame pointer in the stack or in a display; set the frame pointer tobe the top of the stack; store static link in the stack or in the display; initialize localvariables.

• epilogue: store the return value in the stack; restore frame pointer; return to the caller.

59

We can classify the variables in a program into four categories:

• statically allocated data that reside in the static data part of the program; these arethe global variables.

• dynamically allocated data that reside in the heap; these are the data created by mallocin C.

• register allocated variables that reside in the CPU registers; these can be functionarguments, function return values, or local variables.

• frame-resident variables that reside in the run-time stack; these can be function argu-ments, function return values, or local variables.

Every frame-resident variable (ie. a local variable) can be viewed as a pair of (level,offset).The variable level indicates the lexical level in which this variable is defined. For example,the variables inside the top-level functions (which is the case for all functions in C) havelevel=1. If a function is nested inside a top-level function, then its variables have level=2,etc. The offset indicates the relative distance from the beginning of the frame that we canfind the value of a variable local to this frame. For example, to access the nth argument ofthe frame in the above figure, we retrieve the stack value at the address FP+1, and to accessthe first argument, we retrieve the stack value at the address FP+n (assuming that eachargument occupies one word). When we want to access a variable whose level is differentfrom the level of the current frame (ie. a non-local variable), we subtract the level of thevariable from the current level to find out how many times we need to follow the static link(ie. how deep we need to go in the static chain) to find the frame for this variable. Then,after we locate the frame of this variable, we use its offset to retrieve its value from the stack.For example, to retrieve to value of x in the Q frame, we follow the static link once (sincethe level of x is 1 and the level of Q is 2) and we retrieve x from the frame of P.

Another thing to consider is what exactly do we pass as arguments to the callee. Thereare two common ways of passing arguments: by value and by reference. When we pass anargument by reference, we actually pass the address of this argument. This address may bean address in the stack, if the argument is a frame-resident variable. A third type of passingarguments, which is not used anymore but it was used in Algol, is call by name. In thatcase we create a thunk inside the caller’s code that behaves like a function and contains thecode that evaluates the argument passed by name. Every time the callee wants to access thecall-by-name argument, it calls the thunk to compute the value for the argument.

7.2 Case Study: Activation Records for the MIPS Architecture

The following describes a function call abstraction for the MIPS architecture. This may beslightly different from the one you will use for the project.

MIPS uses the register $sp as the stack pointer and the register $fp as the frame pointer.In the following MIPS code we use both a dynamic link and a static link embedded in theactivation records.

Consider the previous program:

60

procedure P ( c: integer )

x: integer;

procedure Q ( a, b: integer )

i, j: integer;

begin

x := x+a+j;

end;

begin

Q(x,c);

end;

The activation record for P (as P sees it) is shown in the first figure below:

The activation record for Q (as Q sees it) is shown in the second figure above. The thirdfigure shows the structure of the run-time stack at the point where x := x+a+j is executed.This statement uses x, which is defined in P. We can’t assume that Q called P, so we shouldnot use the dynamic link to retrieve x; instead, we need to use the static link, which pointsto the most recent activation record of P. Thus, the value of variable x is computed by:

lw $t0, -8($fp) # follow the static link of Q

lw $t1, -12($t0) # x has offset=-12 inside P

Function/procedure arguments are pushed in the stack before the function call. If this isa function, then an empty placeholder (4 bytes) should be pushed in the stack before thefunction call; this will hold the result of the function.

61

Each procedure/function should begin with the following code (prologue):

sw $fp, ($sp) # push old frame pointer (dynamic link)

move $fp, $sp # frame pointer now points to the top of stack

subu $sp, $sp, 500 # allocate say 500 bytes in the stack

# (for frame size = 500)

sw $ra, -4($fp) # save return address in frame

sw $v0, -8($fp) # save static link in frame

(where $v0 is set by the caller – see below) and should end with the following code (epilogue):

lw $ra, -4($fp) # restore return address

move $sp, $fp # pop frame

lw $fp, ($fp) # restore old frame pointer (follow dynamic link)

jr $ra # return

For each procedure call, you need to push the arguments into the stack and set $v0 to be theright static link (very often it is equal to the static link of the current procedure; otherwise,you need to follow the static link a number of times). For example, the call Q(x,c) in P istranslated into:

lw $t0, -12($fp)

sw $t0, ($sp) # push x

subu $sp, $sp, 4

lw $t0, 4($fp)

sw $t0, ($sp) # push c

subu $sp, $sp, 4

move $v0, $fp # load static link in $v0

jal Q # call procedure Q

addu $sp, $sp, 8 # pop stack

Note that there are two different cases for setting the static link before a procedure call.Lets say that caller level and callee level are the nesting levels of the caller and the calleeprocedures (recall that the nesting level of a top-level procedure is 0, while the nesting levelof a nested procedure embedded inside another procedure with nesting level l, is l+1). Whenthe callee is lexically inside the caller’s body, that is, when callee level=caller level+1, wehave:

move $v0, $fp

The call Q(x,c) in P is such a case because the nesting levels of P and Q are 0 and 1, respec-tively. Otherwise, we follow the static link of the caller d + 1 times, where d=caller level-callee level (the difference between the nesting level of the caller from that of the callee).For d=0, that is, when both caller and callee are at the same level, we have

lw $v0, -8($fp)

For d=2 we have

62

lw $t1, -8($fp)

lw $t1, -8($t1)

lw $v0, -8($t1)

These cases are shown in the following figure:

Note also that, if you have a call to a function, you need to allocate 4 more bytes in thestack to hold the result.

See the factorial example for a concrete example of a function expressed in MIPS code.

8 Intermediate Code

The semantic phase of a compiler first translates parse trees into an intermediate representa-tion (IR), which is independent of the underlying computer architecture, and then generatesmachine code from the IRs. This makes the task of retargeting the compiler to anothercomputer architecture easier to handle.

We will consider Tiger for a case study. For simplicity, all data are one word long (fourbytes) in Tiger. For example, strings, arrays, and records, are stored in the heap and they arerepresented by one pointer (one word) that points to their heap location. We also assumethat there is an infinite number of temporary registers (we will see later how to spill-outsome temporary registers into the stack frames). The IR trees for Tiger are used for buildingexpressions and statements. They are described in detail in Section 7.1 in the textbook.Here is a brief description of the meaning of the expression IRs:

• CONST(i): the integer constant i.

• MEM(e): if e is an expression that calculates a memory address, then this is thecontents of the memory at address e (one word).

• NAME(n): the address that corresponds to the label n, eg. MEM(NAME(x)) returnsthe value stored at the location X.

63

• TEMP(t): if t is a temporary register, return the value of the register, eg.

MEM(BINOP(PLUS,TEMP(fp),CONST(24)))

fetches a word from the stack located 24 bytes above the frame pointer.

• BINOP(op,e1,e2): evaluate e1, then evaluate e2, and finally perform the binary oper-ation op over the results of the evaluations of e1 and e2. op can be PLUS, AND, etc.We abbreviate BINOP(PLUS,e1,e2) as +(e1,e2).

• CALL(f,[e1,e2,...,en]): evaluate the expressions e1, e2, etc (in that order), and at theend call the function f over these n parameters. eg.

CALL(NAME(g),ExpList(MEM(NAME(a)),ExpList(CONST(1),NULL)))

represents the function call g(a,1).

• ESEQ(s,e): execute statement s and then evaluate and return the value of the expres-sion e

And here is the description of the statement IRs:

• MOVE(TEMP(t),e): store the value of the expression e into the register t.

• MOVE(MEM(e1),e2): evaluate e1 to get an address, then evaluate e2, and then storethe value of e2 in the address calculated from e1. eg,

MOVE(MEM(+(NAME(x),CONST(16))),CONST(1))

computes x[4] := 1 (since 4*4bytes = 16 bytes).

• EXP(e): evaluates e and discards the result.

• JUMP(L): Jump to the address L (which must be defined in the program by someLABEL(L)).

• CJUMP(o,e1,e2,t,f): evaluate e1, then e2. The binary relational operator o must beEQ, NE, LT etc. If the values of e1 and e2 are related by o, then jump to the addresscalculated by t, else jump the one for f.

• SEQ(s1,s2,...,sn): perform statement s1, s2, ... sn is sequence.

• LABEL(n): define the name n to be the address of this statement. You can retrievethis address using NAME(n).

In addition, when JUMP and CJUMP IRs are first constructed, the labels are not knownin advance but they will be known later when they are defined. So the JUMP and CJUMPlabels are first set to NULL and then later, when the labels are defined, the NULL valuesare changed to real addresses.

64

8.1 Translating Variables, Records, Arrays, and Strings

Local variables located in the stack are retrieved using an expression represented by theIR: MEM(+(TEMP(fp),CONST(offset))). If a variable is located in an outer static scope klevels lower than the current scope, the we retrieve the variable using the IR:

MEM(+(MEM(+(...MEM(+(MEM(+(TEMP(fp),CONST(static))),CONST(static))),...)),

CONST(offset)))

where static is the offset of the static link. That is, we follow the static chain k times, andthen we retrieve the variable using the offset of the variable.

An l-value is the result of an expression that can occur on the left of an assignmentstatement. eg, x[f(a,6)].y is an l-value. It denotes a location where we can store a value. It isbasically constructed by deriving the IR of the value and then dropping the outermost MEMcall. For example, if the value is MEM(+(TEMP(fp),CONST(offset))), then the l-value is+(TEMP(fp),CONST(offset)).

In Tiger (the language described in the textbook), vectors start from index 0 and eachvector element is 4 bytes long (one word), which may represent an integer or a pointer tosome value. To retrieve the ith element of an array a, we use MEM(+(A,*(I,4))), whereA is the address of a (eg. A is +(TEMP(fp),CONST(34))) and I is the value of i (eg.MEM(+(TEMP(fp),CONST(26)))). But this is not sufficient. The IR should check whethera < size(a): CJUMP(gt,I,CONST(size of A),MEM(+(A,*(I,4))),NAME(error label)), thatis, if i is out of bounds, we should raise an exception.

For records, we need to know the byte offset of each field (record attribute) in the baserecord. Since every value is 4 bytes long, the ith field of a structure a can be retrieved usingMEM(+(A,CONST(i*4))), where A is the address of a. Here i is always a constant since weknow the field name. For example, suppose that i is located in the local frame with offset24 and a is located in the immediate outer scope and has offset 40. Then the statementa[i+1].first := a[i].second+2 is translated into the IR:

MOVE(MEM(+(+(TEMP(fp),CONST(40)),

*(+(MEM(+(TEMP(fp),CONST(24))),

CONST(1)),

CONST(4)))),

+(MEM(+(+(+(TEMP(fp),CONST(40)),

*(MEM(+(TEMP(fp),CONST(24))),

CONST(4))),

CONST(4))),

CONST(2)))

since the offset of first is 0 and the offset of second is 4.In Tiger, strings of size n are allocated in the heap in n + 4 consecutive bytes, where the

first 4 bytes contain the size of the string. The string is simply a pointer to the first byte.String literals are statically allocated. For example, the MIPS code:

ls: .word 14

.ascii "example string"

65

binds the static variable ls into a string with 14 bytes. Other languages, such as C, storea string of size n into a dynamically allocated storage in the heap of size n + 1 bytes. Thelast byte has a null value to indicate the end of string. Then, you can allocate a string withaddress A of size n in the heap by adding n + 1 to the global pointer ($gp in MIPS):

MOVE(A,ESEQ(MOVE(TEMP(gp),+(TEMP(gp),CONST(n+1))),TEMP(gp)))

8.2 Control Statements

Statements, such as for-loop and while-loop, are translated into IR statements that containjumps and conditional jumps.

The while loop while c do body; is evaluated in the following way:

loop: if c goto cont else goto done

cont: body

goto loop

done:

which corresponds to the following IR:

SEQ(LABEL(loop),

CJUMP(EQ,c,1,NAME(done),NAME(cont)),

LABEL(cont),

s,

JUMP(NAME(loop)),

LABEL(done))

The for statement for i:=lo to hi do body is evaluated in the following way:

i := lo

j := hi

if i>j goto done

loop: body

i := i+1

if i<=j goto loop

done:

The break statement is translated into a JUMP. The compiler keeps track which label toJUMP to on a “break” statement by maintaining a stack of labels that holds the “done:”labels of the for- or while-loop. When it compiles a loop, it pushes the label in the stack,and when it exits a loop, it pops the stack. The break statement is thus translated into aJUMP to the label at the top of the stack.

A function call f(a1,...,an) is translated into the IR CALL(NAME(l_f),[sl,e1,...,en]),where l_r is the label of the first statement of the f code, sl is the static link, and ei is theIR for ai. For example, if the difference between the static levels of the caller and callee isone, then sl is MEM(+(TEMP(fp),CONST(static))), where static is the offset of the staticlink in a frame. That is, we follow the static link once. The code generated for the body of

66

the function f is a sequence which contains the prolog, the code of f, and the epilogue, as itwas discussed in Section 7.

For example, suppose that records and vectors are implemented as pointers (i.e. memoryaddresses) to dynamically allocated data in the heap. Consider the following declarations:

struct { X: int, Y: int, Z: int } S; /* a record */

int i;

int V[10][10]; /* a vector of vectors */

where the variables S, i, and V are stored in the current frame with offsets -16, -20, and -24respectively. By using the following abbreviations:

S = MEM(+(TEMP(fp),CONST(-16)))

I = MEM(+(TEMP(fp),CONST(-20)))

V = MEM(+(TEMP(fp),CONST(-24)))

we have the following IR translations:

S.Z+S.X

--> +(MEM(+(S,CONST(8))),MEM(S))

if (i<10) then S.Y := i else i := i-1

--> SEQ(CJUMP(LT,I,CONST(10),trueL,falseL),

LABEL(trueL),

MOVE(MEM(+(S,CONST(4))),I),

JUMP(exit),

LABEL(falseL),

MOVE(I,-(I,CONST(1))),

JUMP(exit),

LABEL(exit))

V[i][i+1] := V[i][i]+1

--> MOVE(MEM(+(MEM(+(V,*(I,CONST(4)))),*(+(I,CONST(1)),CONST(4)))),

MEM(+(MEM(+(V,*(I,CONST(4)))),*(I,CONST(4)))))

for i:=0 to 9 do V[0][i] := i

--> SEQ(MOVE(I,CONST(0)),

MOVE(TEMP(t1),CONST(9)),

CJUMP(GT,I,TEMP(t1),done,loop),

LABEL(loop),

MOVE(MEM(+(MEM(V),*(I,CONST(4)))),I),

MOVE(I,+(I,CONST(1))),

CJUMP(LEQ,I,TEMP(t1),loop,done),

LABEL(done))

67

9 Basic Blocks and Traces

Many computer architectures have instructions that do not exactly match the IR represen-tations given in the previous sections. For example, they do not support two-way branchingas in CJUMP(op,e1,e2,l1,l2). In addition, nested calls, such as CALL(f,[CALL(g,[1])]),will cause interference between register arguments and returned results. The nested SEQs,such as SEQ(SEQ(s1,s2),s3), impose an order of a evaluation, which restricts optimiza-tion (eg, if s1 and s2 do not interfere with each other, we would like to be able to switchSEQ(s1,s2) with SEQ(s2,s1) because it may result to a more efficient program.

The problems mentioned above can be eliminated in two phases:

1. transforming IR trees into a list of canonical trees, and

2. transforming unrestricted CJUMPs into CJUMPs that are followed by their false targetlabel.

An IR is a canonical tree if it does not contain SEQ or ESEQ and the parent node of eachCALL node is either an EXP or a MOVE(TEMP(t),...) node. The idea is that we transforman IR in such a way that all ESEQs are pulled up in the IR and become SEQs at the topof the tree. At the end, we are left with nested SEQs at the top of the tree, which areeliminated to form a list of statements. For example, the IR:

SEQ(MOVE(NAME(x),ESEQ(MOVE(TEMP(t),CONST(1)),TEMP(t))),

JUMP(ESEQ(MOVE(NAME(z),NAME(L)),NAME(z))))

is translated into:

SEQ(SEQ(MOVE(TEMP(t),CONST(1)),

MOVE(NAME(x),TEMP(t)))

SEQ(MOVE(NAME(z),NAME(L)),

JUMP(NAME(z))))

which corresponds to a list of statements (if we remove the top SEQs):

[ MOVE(TEMP(t),CONST(1)), MOVE(NAME(x),TEMP(t)),

MOVE(NAME(z),NAME(L)), JUMP(NAME(z)) ]

Here are some rules to make canonical trees:

• ESEQ(s1,ESEQ(s2,e)) = ESEQ(SEQ(s1,s2),e)

• BINOP(op,ESEQ(s,e1),e2) = ESEQ(s,BINOP(op,e1,e2))

• MEM(ESEQ(s,e)) = ESEQ(s,MEM(e))

• JUMP(ESEQ(s,e)) = SEQ(s,JUMP(e))

• CJUMP(op,ESEQ(s,e1),e2,l1,l2) = SEQ(s,CJUMP(op.e1,e2,l1,l2))

• BINOP(op,e1,ESEQ(s,e2))= ESEQ(MOVE(temp(t),e1),ESEQ(s,BINOP(op,TEMP(t),e2)))

68

• CJUMP(op,e1,ESEQ(s,e2),l1,l2)= SEQ(MOVE(temp(t),e1),SEQ(s,CJUMP(op,TEMP(t),e2,l1,l2)))

• MOVE(ESEQ(s,e1),e2) = SEQ(s,MOVE(e1,e2))

To handle function calls, we store the function results into a new register:CALL(f,a) = ESEQ(MOVE(TEMP(t),CALL(f,a)),TEMP(t))That way expressions, such as +(CALL(f,a),CALL(g,b)), would not rewrite each others

result register.Now the next thing to do is to transform any CJUMP into a CJUMP whose false target

label is the next instruction after CJUMP. This reflects the conditional JUMP found in mostarchitectures (ie. “if condition then JUMP label”). To do so, we first form basic blocks fromthe IR tree. A basic block is a sequence of statements whose first statement is a LABEL,the last statement is a JUMP or CJUMP, and does not contain any other LABELs, JUMPs,or CJUMPs. That is, we can only enter at the beginning of a basic block and exit at theend. To solve the CJUMP problem, we first create the basic blocks for an IR tree and thenwe reorganize the basic blocks in such a way that every CJUMP at the end of a basic blockis followed by a block the contains the CJUMP false target label. A secondary goal is toput the target of a JUMP immediately after the JUMP. That way, we can eliminate theJUMP (and maybe merge the two blocks). The algorithm is based on traces: You start atrace with an unmark block and you consider the target of the JUMP of this block or thefalse target block of its CJUMP. Then, if the new block is unmarked, you append the newblock to the trace, you use it as your new start, and you apply the algorithm recursively;otherwise, you close this trace and you start a new trace by going back to a point wherethere was a CJUMP and you choose the true target this time. You continue until all blocksare marked. This is a greedy algorithm. At the end, there may be still some CJUMPs thathave a false target that does not follow the CJUMP (this is the case where this false targetlabel was the target of another JUMP or CJUMP found earlier in a trace). In that case:

• if we have a CJUMP followed by a true target, we negate the condition and switch thetrue and false targets;

• otherwise, we create a new block LABEL(L) followed by JUMP(F) and we replaceCJUMP(op,a,b,T,F) with CJUMP(op,a,b,T,L).

Also, if there is a JUMP(L) followed by a LABEL(L), we remove the JUMP.

10 Instruction Selection

After IR trees have been put into a canonical form (described in the previous section), theyare used in generating assembly code. The obvious way to do this is to macroexpand each IRtree node. For example, MOVE(MEM(+(TEMP(fp),CONST(10))),CONST(3)) is macroexpandedinto the pseudo-assembly code at the right column:

TEMP(fp) t1 := fp

CONST(10) t2 := 10

69

+(TEMP(fp),CONST(10)) t3 := t1+t2

CONST(3) t4 := 3

MOVE(MEM(...),CONST(3)) M[t3] := t4

where ti stands for a temporary variable. This method generates very poor quality code.For example, the IR above can be done using only one instruction in most architectures:M[fp+10] := 3.

A method, called maximum munch, generates better code, especially for RISC machines.The idea is to use tree pattern matching to map a tree pattern (a fragment of an IR tree) intoa list of assembly instructions. These tree patterns are called tiles. For RISC we always haveone-to-one mapping (one tile to one assembly instruction). Note that for RISC machines thetiles are small (very few number of IR nodes) but for CISC machines the tiles are usuallylarge since the CISC instructions are very complex.

The following is the mapping of some tiles (left) into MIPS code (right):

CONST(c) li ’d0, c

+(e0,e1) add ’d0, ’s0, ’s1

+(e0,CONST(c)) add ’d0, ’s0, c

*(e0,e1) mult ’d0, ’s0, ’s1

*(e0,CONST(2^k)) sll ’d0, ’s0, k

MEM(e0) lw ’d0, (’s0)

MEM(+(e0,CONST(c))) lw ’d0, c(’s0)

MOVE(MEM(e0),e1) sw ’s1, (’s0)

MOVE(MEM(+(e0,CONST(c))),e1) sw ’s1, c(’s0)

JUMP(NAME(X)) b X

JUMP(e0) jr ’s0

LABEL(X) X: nop

Here e0 and e1 are subtrees of these IR tree nodes. Most of the assembly instructions abovehave one destination register d0 and a number of source registers, s0 and s1. So if a treepattern, such as MOVE(MEM(e0),e1), has two input subtrees e0 and e1, then their values aretaken from the source registers s0 and s1 (which are the destinations of e0 and e1). Thequote in the assembly instructions is used to indicate that temporary registers should beselected so that there is this match of source-destination names.

To translate an IR tree into assembly code, we perform a tiling: we cover the IR treewith non-overlapping tiles. We can see that there are many different tilings. Consider forexample the tiger statement a[i] := x, where i is a temporary register, the a address isstored at the offset 20, and x at the offset 10. The IR form is:

MOVE(MEM(+(MEM(+(fp,CONST(20))),*(TEMP(i),CONST(4)))),MEM(+(fp,CONST(10))))

The following are two possible tilings of the IR:

lw r1, 20($fp) add r1, $fp, 20

sll r2, r1, 2 lw r1, (r1)

add r1, r1, r2 sll r2, r1, 2

lw r2, 10($fp) add r1, r1, r2

70

sw r2, (r1) add r2, $fp, x

lw r2, (r2)

sw r2, (r1)

The first tiling is obviously better since it can be executed faster.It’s highly desirable to do optimum tiling, ie, to generate the shortest instruction sequence

(or alternatively the sequence with the fewest machine cycles). But this is not always easy toachieve. So most compilers do an optimal tiling where no two adjacent tiles can be combinedinto a tile of lower cost. Optimum tiling is not always optimal but it’s close for RISCmachines.

There are two main ways of performing optimum tiling: using maximal munch (a greedyalgorithm) or using dynamic programming. In maximal munch you start from the IR rootand from all matching tiles you select the one with the maximum number of IR nodes (largesttile). Then you go to the children (subtrees) of this tile and apply the algorithm recursivelyuntil you reach the tree leaves. The dynamic programming works from the leaves to theroot. It assigns a cost to every tree node by considering every tile that matches the nodeand calculating the minimum value of:

cost of a node = (number of nodes in the tile) + (total costs of all the tile children)

(the costs of the tile children are known since the algorithm is bottom-up).For example, consider the following IR represented as a tree and its tiling using maximal

munch:

The order of tiling is IEHBDGACF (ie, top-down). After we set the tiles, we use adifferent register ri for each connection between the tiles. Then we expand the tiles intoMIPS code by following the tile mapping table. The order of tile expansion is ABCDEFGHI(ie, bottom-up). The MIPS code is:

A lw r1, fp

B lw r2, 8(r1)

C lw r3, i

71

D sll r4, r3, 2

E add r5, r2, r4

F lw r6, fp

G lw r7, 16(r6)

H add r8, r7, 1

I sw r8, (r5)

11 Liveness Analysis

When we generated IR trees, we assumed that we have a very large number of temporaryvariables stored in registers. Of course this is not true for real machines. In particularCISC machines have very few registers (Pentium has 6 general registers only). So it’s highlydesirable to use one machine register for multiple temporary variables (instead of using oneregister for one temporary variable). Consider this program:

1. a := 0

2. L1: b := a+1

3. c := c+b

4. a := b*2

5. if a<10 goto L1

6. return c

Obviously we need a maximum of three registers, one for each variable a, b, and c. But wecan do better if we use two registers: one for c and one for both a and b. This is possiblebecause after a+1 is computed in statement 2 and during statements 3 and 4, the value ofa is not used, and these are the only places where b is used. We say that variables a and b

do not interfere if they are not live during the same periods in the program. In that case,they can occupy the same register. A variable x is live at a particular point (statement)in a program, if it holds a value that may be needed in the future. That is, x is live at aparticular point if there is a path (possibly following gotos) from this point to a statementthat uses x and there is no assignment to x in any statement in the path (because if therewas an assignment to x, the old value of x is discarded before it is used). For example, a islive in 2 (because it’s the place where it is used), not live in 3 and 4 (because it is assigneda value in 4 before it is used in 5), and live again in 5. Variable b is live in 3 and 4 only,and variable c is live in 2, 3, 4, 5, and 6 (it’s live in 2 because there is a path to 3 where itis used). In general, let’s say that you have a use of a variable v in line n:

k. v := ...

...

n. x := ... v ..

(Here v is used in the source of an assignment but it may have also been used in a functioncall, return statement, etc.) We try to find an assignment v := ... such that there is pathfrom this statement to line n and there is no other assignment to v along this path. Inthat case, we say that v is live along this path (immediately after the asignment until andincluding line n). We do that for every use of v and the union of all these regions in which

72

v is live gives us the life of v. In our example, the life of variables are:a b c

1 X2 X X3 X X4 X X5 X X6 X

Let’s formalize the liveness of a variable. The first thing to do is to construct the control

flow graph (CFG) of a program. The CFG nodes are individual statements (or maybe basicblocks) and the graph edges represent potential flow of control. The outgoing edges of anode n are succ[n] and the ingoing edges are pred[n]. For example, succ[5] = [6, 2] andpred[5] = [4]. For each CFG node n (a statement) we define use[n] to be all the variablesused (read) in this statement and def [n] all the variables assigned a value (written) in thisstatement. For example, use[3] = [b, c] and def [3] = [c].

We say that variable v is live at a statement n if there is a path in the CFG from thisstatement to a statement m such that v ∈ use[m] and for each n ≤ k < m : v 6∈ def [k].That is, there is no assignment to v in the path from n to m. For example, c is alive in 4since it is used in 6 and there is no assignment to c in the path from 4 to 6.

The liveness analysis analyzes a CFG to determine which places variables are live ornot. It is a data flow analysis since it flows around the edges of the CFG information aboutvariables (in this case information about the liveness of variables). For each CFG node n wederive two sets: in[n] (live-in) and out[n] (live-out). The set in[n] gives all variables that arelive before the execution of statement n and out[n] gives all variables that are live after theexecution of statement n. So the goal here is to compute these sets from the sets succ, useand def . To do this, we consider the properties of in and out:

1. v ∈ use[n]⇒ v ∈ in[n]. That is, if v is used in n then is live-in in n (regardless whetherit is defined in n).

2. v ∈ (out[n]− def [n]) ⇒ v ∈ in[n]. That is, if v is live after the execution of n and isnot defined in n, then v is live before the execution of n.

3. for each s ∈ succ[n] : v ∈ in[s]⇒ v ∈ out[n]. This reflects the formal definition of theliveness of variable v.

The algorithm to compute in and out is the following:

foreach n : in[n]← ∅; out[n]← ∅repeat

foreach n :in′[n]← in[n]out′[n]← out[n]in[n]← use[n] ∪ (out[n]− def [n])out[n]←

⋃s∈succ[n] in[s]

until in′ = in ∧ out′ = out

73

That is, we store the old values of in and out into in′ and out′ and we repeatedly execute theloop until we can’t add any more elements. The algorithm converges very fast if we considerthe CFG nodes in the reverse order (when is possible), ie, in the order 6, 5, 4, 3, 2, 1. SeeTable 10.6 in the textbook for an example.

The life of a variable can be directly derived from vector in: if v ∈ in[n] then v is live online n. Then, from the variable lifes we can compute the interference graph. The nodes ofthis graph are the program variables and for each node v and w there is an interference edgeif the lives of the variables v and w overlap in at least one program point (statement). Foreach CFG node n that assigns the value to the variable a (ie, a ∈ def [n]) we add the edges(a, b1), (a, b2), . . . , (a, bm), where out[n] = {b1, b2, . . . , bm}. There is a special case when n isa move command: a← c; in that case we do not add the edge (a, bk) if bk = c. For example,the previous program has an interference graph with three nodes: a, b, and c, and two edges:a-c and b-c.

The following is a larger example:x y z w u v

1. v := 12. z := v+1 X3. x := z * v X X4. y := x * 2 X X5. w := x+z * y X X X6. u := z+2 X X X7. v := u+w+y X X X8. return v * u X X

For example, v is used in line 3 and the last assignment to v before this line was in line 1.So v is live on lines 2 and 3. Also v is used in line 8 and the last assignment to v before thisline was in line 7. So v is also live on line 8. The interference graph is the following:

12 Register Allocation

12.1 Register Allocation Using the Interference Graph

The interference graph is used for assigning registers to temporary variables. If two variablesdo not interfere (ie, there is no edge between these two variables in the interference graph)then we can use the same register for both of them, thus reducing the number of registersneeded. On the other hand, if there is a graph edge between two variables, then we shouldnot assign the same register to them since this register needs to hold the values of both

74

variable at one point of time (because the lives of these variables overlap at one point oftime – this is what interference means).

Suppose that we have n available registers: r1, r2, . . . , rn. If we view each register as adifferent color, then the register allocation problem for the interference graph is equivalentto the graph coloring problem where we try to assign one of the n different colors to graphnodes so that no two adjacent nodes have the same color. This algorithm is used in mapdrawing where we have countries or states on the map and we want to color them using asmall fixed number of colors so that states that have a common border are not painted withthe same color. The graph in this case has one node for each state and an edge between twostates if they have common borders.

The register allocation algorithm uses a stack of graph nodes to insert all nodes of theinterference graph one at a time. Each time it selects a node that has fewer than n neighbors(adjacent nodes). The idea is that if we can color the graph after we remove this node, thenof course we can color the original graph (with the node included) because the neighbors ofthis node can have n−1 different colors in the worst case; so we can just assign the availablenth color to the node. This is called simplification of the graph. Each selected node ispushed in the stack. Sometimes though we cannot simplify any further because all nodeshave n or more neighbors. In that case we select one node (ie. variable) to be spilled intomemory instead of assigning a register to it. This is called spilling and the spilled victim canbe selected based on priorities (eg. which variable is used less frequently, is it outside a loop,etc). The spilled node is also pushed on the stack. When the graph is completely reduced(and all nodes have been pushed in the stack), we pop the stack one node at a time, werebuild the interference graph and at the same time we assign a color to the popped-out nodeso that its color is different from the colors of its neighbors. This is called the selection phase.If we can’t assign a color to a node, we spill out the node into memory (a node selected tobe spilled out during the spill phase does not necessarily mean that it will actually spilledinto memory at the end). If there are spilled nodes, we use a memory access for each spilledvariable (eg. we use the frame location $fp-24 to store the spilled temporary variable and wereplace all occurrences of this variable in the program with M[$fp-24]). Then we restart thewhole process (the building of the interference graph, simplification, etc) from the beginning.Figures 11.1-11.3 in Section 11.1 in the textbook give an example of this method.

Consider the following interference graph from the previous section:

The nodes are pushed in the stack in the order of xvuzyw. Then, at the selection phase,xyzwuv variables are assigned the registers R0R2R1R0R1R0.

If there is a move instruction in a program (ie. an assignment of the form a:=b) andthere is no conflict between a and b, then it is a good idea to use the same register for

75

both a and b so that you can remove the move instruction entirely from the program. Infact you can just merge the graph nodes for a and b in the graph into one node. Thatway nodes are now labeled by sets of variables, instead of just one variable. This is calledcoalescing. This is a good thing to do since it reduces the number of registers needed and itremoves the move instructions, but it may be bad since it increases the number of neighborsof the merged nodes, which may lead to an irreducible graph and a potential spilling. Todo this, we add another phase to the register allocation algorithm, called coalescing, thatcoalesces move related nodes. If we derive an irreducible graph at some point of time, thereis another phase, called freezing, that de-coalesces one node (into two nodes) and continuesthe simplification.

A common criterion for coalescing is that we coalesce two nodes if the merged nodehas fewer than n neighbors of degree greater than or equal to n (where n is the numberof available registers). This is called the Briggs criterion. A slight variation of this is theGeorge criterion where we coalesce nodes if all the neighbors of one of the nodes with degreegreater than or equal to n already interfere with the other node.

Coalescing is very useful when handling callee-save registers in a program. Recall thatcallee-save registers need to be saved by the callee procedure to maintain their values uponthe exit of the procedure. Suppose that r3 is a callee-save register. The procedure needs tosave this register into a temporary variable at the beginning of the procedure (eg. a := r3)and to restore it at the end of the procedure (ie. r3 := a). That way, if r3 is not used atall during the procedure body, it will be coalesced with a and the move instructions will beremoved. But coalescing can happen in many other different situations as long as there isno interference, which makes this technique very general. Note that registers in a programare handled as temporary variables with a preassigned color (precolored nodes). This meansthat precolored nodes can only be coalesced with other nodes (they cannot be simplifiedor spilled). See the example in Section 11.3 in the textbook for a program with precolorednodes.

12.2 Code Generation for Trees

Suppose that you want generate assembly code for complex expression trees using the fewestnumber of registers to store intermediate results. Suppose also that we have two-addressinstructions of the form

op Ri, T

where op is an operation (add, sub, mult, etc), Ri is a register (R1, R2, R3, etc), and T isan address mode such as a memory reference, a register, indirect access, indexing etc. Wealso have a move instruction of the form:

load Ri, T

For example, for the expression (A-B)+((C+D)+(E*F)), which corresponds to the AST:

+

/ \

/ \

76

- +

/ \ / \

A B + *

/ \ / \

C D E F

we want to generate the following assembly code:

load R2, C

add R2, D

load R1, E

mult R1, F

add R2, R1

load R1, A

sub R1, B

add R1, R2

That is, we used only two register.There is an algorithm that generates code with the least number of registers. It is called

the Sethi-Ullman algorithm. It consists of two phases: numbering and code generation. Thenumbering phase assigns a number to each tree node that indicates how many registers areneeded to evaluate the subtree of this node. Suppose that for a tree node T , we need lregisters to evaluate its left subtree and r registers to evaluate its right subtree. Then if oneof these numbers is larger, say l > r, then we can evaluate the left subtree first and storeits result into one of the registers, say Rk. Now we can evaluate the right subtree using thesame registers we used for the left subtree, except of course Rk since we want to rememberthe result of the left subtree. But this is always possible since l > r. This means that weneed l registers to evaluate T too. The same happens if r > l but now we need to evaluatethe right subtree first and store the result to a register. If l = r we need an extra registerRl+1 to remember the result of the left subtree. If T is a tree leaf, then the number ofregisters to evaluate T is either 1 or 0 depending whether T is a left or a right subtree (sincean operation such as add R1, A can handle the right component A directly without storingit into a register). So the numbering algorithm starts from the tree leaves and assigns 1 or 0as explained. Then for each node whose children are labeled l and r, if l = r then the nodenumber is l + 1 otherwise it is max(l, r). For example, we have the following numbering forthe previous example:

2

/ \

/ \

1 2

/ \ / \

1 0 1 1

/ \ / \

1 0 1 0

77

The code generation phase is as follows. We assume that for each node T has numberingregs(T ). We use a stack of available registers which initially contains all the registers inorder (with the lower register at the top).

generate(T ) =if T is a leaf write “load top(), T”if T is an internal node with children l and r thenif regs(r) = 0 then { generate(l); write “op top(), r” }if regs(l) >= regs(r)then { generate(l)

R := pop()generate(r)write “op R, top()”push(R) }

if regs(l) < regs(r)then { swap the top two elements of the stack

generate(r)R := pop()generate(l)write “op top(), R”push(R)swap the top two elements of the stack }

This assumes that the number of registers needed is no greater than the number of avail-able registers. Otherwise, we need to spill the intermediate result of a node to memory. Thisalgorithm also does not take advantage of the commutativity and associativity properties ofoperators to rearrange the expression tree.

13 Garbage Collection

13.1 Storage Allocation

Consider a typical storage organization of a program:

78

All dynamically allocated data are stored in the heap. These are the data created bymalloc in C or new in C++. You can imagine the heap as a vector of bytes (characters) andend_of_heap a pointer to the first available byte in the heap:

char heap[heap_size];

int end_of_heap = 0;

A trivial implementation of malloc in C is:

void* malloc ( int size ) {

void* loc = (void*) &heap[end_of_heap];

end_of_heap += size;

return loc;

};

If you always create dynamic structures but never delete them, you will eventually run outof heap space. In that case, malloc will request the operating system for more heap (thiscode is not shown). This is very expensive, because it may require to move the stack datato higher memory locations. Alternatively, you can recycle the dynamically allocated datathat you don’t use. This is done using free in C or delete in C++. To do so, you need tolink all deallocated data in the heap into a list:

typedef struct Header { struct Header *next; int size; } Header;

Header* free_list;

Initially, the free list contains only one element that covers the entire heap (ie, it’s heap_sizebytes long):

void initialize_heap () {

free_list = (Header*) heap;

79

free_list->next = 0;

free_list->size = heap_size;

};

(This is terrible C code, but this can only be done by coercing bytes to pointers.) Note thateach cell in the heap has a Header component (ie, a next pointer and a size) followed bysize-sizeof(Header) bytes (so that is size bytes long). malloc first searches the free listto find a cell large enough to fit the given number of bytes. If it finds one, it gets a chunkout of the cell leaving the rest untouched:


Header* prev = free_list;

for(Header* r=free_list; r!=0; prev=r, r=r->next)

if (r->size > size+sizeof(Header))

{ Header* new_r = (Header*) (((char*) r)+size);

new_r->next = r->next;

new_r->size = r->size;

if (prev==free_list)

free_list = new_r;

else prev->next = new_r;

return (void*) r;

};



return loc;

};

free simply puts the recycled cell at the beginning of the free list:

void free ( void* x ) {

if ("size of *x" <= sizeof(Header))

return;

((Header*) x)->next = free_list;

((Header*) x)->size = "size of *x";

free_list = (Header*) x;

};

We can see that there is a lot of overhead in malloc since the free list may be very long. Themost important problem though is fragmentation of the heap into tiny cells, so that eventhough the total free space in the free list is plenty, it is useless for large object allocation.The reason for fragmentation is that we unrestrictedly choose a cell from the free list to doallocation. It makes more sense to try to find a free cell that provides the requested numberof bytes exactly. This can be done by maintaining a vector of free lists, so that the nthelement of the vector is a free list that links all the free cells of size n.

The above malloc/free way of allocating dynamic cells is called a manual memory al-

location since the programmer is responsible for allocating and deleting objects explicitly.

80

This is to be contrasted to the automatic memory management, in which the system man-ages the dynamic allocation automatically. Manual memory management is the source ofthe worst and hardest to find bugs. It’s also the source of most of the mysterious programcrashes. It takes the fun out of programming. It causes horrible memory problems due to“overflow”, “fence past errors”, “memory corruption”, “step-on-others-toe” (hurting othervariable’s memory locations) or “memory leaks”. The memory problems are extremely hardto debug and are very time consuming to fix and troubleshoot. Memory problems bring downthe productivity of programmers. Memory related bugs are very tough to crack, and evenexperienced programmers take several days or weeks to debug memory related problems.Memory bugs may be hidden inside the code for several months and can cause unexpectedprogram crashes. It is estimated that the memory bugs due to usage of char* and pointersin C/C++ is costing $2 billions every year in time lost due to debugging and downtimeof programs. Now of course all modern languages (Java, SML, Haskell, etc) use automaticgarbage collection. Despite that, C and C++ are still popular and programmers spend mostof their time trying to do manual memory allocation (and then, after they remove mostcore dumps from their program, they port it to a different architecture and their programcrashes!).

Why manual memory management is so hard to do correctly? Basically, you need tohave a global view of how dynamic instances of a type are created and passed around. Thisdestroys the good software engineering principle that programs should be developed in smallindependent components. The problem is to keep track of pointer assignments. If more thanone objects point to a dynamically allocated object, then the latter object should be deletedonly if all objects that point to it do not need it anymore. So basically, you need to keeptrack of how many pointers are pointing to each object. This is called reference counting

and used to be popular for OO languages like C++. We do not call this method automaticgarbage collection because the programmer again is responsible of putting counters to everyobject to count references. This method is easy to implement for languages like C++ whereyou can redefine the assignment operator dest=source, when both dest and source arepointers to data. Instead of using C* for a pointer to an object C, we use Ref<C>, where thetemplate Ref provides reference counting to C:

template< class T >

class Ref {

private:

int count;

T* pointer;

void MayDelete () { if (count==0) delete pointer; };

void Copy ( const Ref &sp ) {

++sp.count;

count = sp.count;

pointer = sp.pointer;

};

public:

Ref ( T* ptr = 0 ) : count(1), pointer(ptr) {};

Ref ( const Ref &sp ) { Copy(sp); };

81

~Ref () { MayDelete(); };

T* operator-> () { return pointer; };

Ref& operator= ( const Ref &sp ) {

if (this != &sp) {

count--;

MayDelete();

Copy(sp);

};

return *this;

};

};

When an instance of Ref<C> is created, its counter is set to 1. When we delete the object,we just decrement the counter because there may be other objects pointing to it. But whenthe counter becomes zero, we delete it. The only way to move pointers around, is through anassignment from a C* object to a C* object. In that case, the object pointed by the sourcepointer will have one pointer more, while the object pointed by the destination pointer willhave one pointer less. Reference counting avoids some misuses of the heap but it comeswith a high cost: every assignment takes many cycles to be completed and some of the workmay be unnecessary since you may pass a pointer around causing many unnecessary counterincrements/decrements. In addition, we cannot get rid of cyclic objects (eg. when A pointsto B and B points to A) using reference counting. All objects that participate in a cyclicchain of references will always have their counters greater than zero, so they will never bedeleted, even if they are garbage.

13.2 Automatic Garbage Collection

Heap-allocated records that are not reachable by any chain of pointers from program variablesare considered garbage. All other data are ’live’. This is a very conservative approach butis always safe since you can never access unreachable data. On the other hand, it maykeep objects that, even though reachable, they will never be accessed by the program in thefuture.

With automatic storage management, user programs do not reclaim memory manually.When the heap is full, the run-time system suspends the program and starts garbage collec-tion. When the garbage collection is done, the user program resumes execution:

char heap[heap_size];

int end_of_heap = 0;


if (size+end_of_heap > heap_size)

GC(); // garbage collection



return loc;

};

82

13.2.1 Mark-and-Sweep Garbage Collection

Here too we use the conservative approximation that if we can reach an object by followingpointers from program variables, then the object is live (not garbage). These programvariables are called roots and can be either frame-allocated or static. Therefore, the garbagecollector needs to check the entire static section as well as all frames in the run-time stackfor pointers to heap. It is common to use the following conservative approach: if a wordhas a value between the minimum and maximum address of the heap, then it is consideredto be a pointer to the heap. An object is live if it is pointed by either a root or by a liveobject (this is a recursive definition). The garbage collector needs to start from each rootand follow pointers recursively to mark all live objects.

A Mark-and-Sweep garbage collector has two phases:

1. Mark: starting from roots, mark all reachable objects by using a depth-first-searchpointer traversal

2. Sweep: scan the heap from the beginning to the end and reclaim the unmarked objects(and unmark the marked objects).

The depth-first-search is done using the function

DFS ( p ) =

{ if *p record is unmarked

then mark *p

for each pointer p->fi of the record *p do DFS(p->fi)

}

which is called from every root:

for each p in roots

DFS(p)

The mark phase is:

p = ’first object in the heap’

while p is in the heap do

{ if *p is marked

then unmark *p

else insert *p into the free list

p = p+(size of record *p)

}

For example, consider the following objects in memory at the time the garbage collectortakes over:

83

1 5

2 4

7

1211

9

108

6

3

d ae

i g

kb

cf

j

h e

roots

The roots can be static or frame-allocated variables. Let’s assume that these objects arerepresented as follows in memory (as consecutive bytes):

1 6 e 32 0 i 43 0 d 54 0 g 05 0 a 06 8 b 77 10 k 08 0 c 09 0 j 10

10 0 f 011 0 h 1212 11 l 0

Here, 0 are null pointers while the other pointers are memory addresses. That is, the freelist is 0 (empty) and the roots are 1, 6, and 9.

After the mark-and-sweep garbage collection, the memory layout becomes:

84

1 6 e 32 03 0 d 54 25 0 a 06 8 b 77 10 k 08 0 c 09 0 j 10

10 0 f 011 412 11

where the free list points to the cell 12 now.

13.2.2 Copying Garbage Collection

A copying garbage collector uses two heaps:

1. from-space: the current working heap

2. to-space: needs to be in memory during garbage collection only

At garbage collection time, the garbage collector creates an empty to-space heap in memory(of the same size as the from-space), copies the live objects from the from-space to the to-space (making sure that pointers are referring to the to-space), disposes the from-space, andfinally uses the to-space as the new from-space.

Converting a pointer p that points to a from-space into a pointer that points to theto-space space, is called forwarding a pointer:

forward (p) {

if p points t from-space

then if p.f1 points to to-space

then return p.f1

else for each field fi of p

next.fi := p.fi

p.f1 := next

next := next + (size of *p)

return p.f1

else return p

Forwarding pointers can be done in breadth-first-search, using Cheney’s Algorithm. Com-pared to depth-first-search, breadth-first-search has better locality of reference:

scan := begin-of-to-space

next := scan

for each root r

r := forward(r)

while scan < next

85

for each field fi of *scan

scan.fi := forward(scan.fi)

scan := scan + (size of *scan)

Cheney’s Algorithm uses a very simple allocation algorithm and has no need for stack(since is not recursive). More importantly, its run time depends on the number of live objects,not on the heap size. Furthermore, it does not suffer from data fragmentation, resulting intoa more compact memory. In addition, it allows incremental (concurrent) garbage collection.On the other hand, it needs double the amount of memory and needs to recognize all pointersto heap.

For example,

roots

169

1 1 6 e 32 2 0 i 43 3 0 d 54 4 0 g 05 5 0 a 06 6 8 b 77 7 10 k 08 8 0 c 09 9 0 j 10

10 10 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 ← scan, next5253545556575859606162

to-spaceAfter we forward the roots from the from-space to the to-space, the to-space will containthe forwarded roots, and the roots and the forward pointers of the root elements in thefrom-space will point to the to-space, as shown below:

roots

515253

1 51 6 e 32 2 0 i 43 3 0 d 54 4 0 g 05 5 0 a 06 52 8 b 77 7 10 k 08 8 0 c 09 53 0 j 10

10 10 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 6 e 3 ← scan52 52 8 b 753 53 0 j 1054 ← next5556575859606162

to-spaceThen, we forward the pointers of the first element of the to-space pointed by the scan pointer(element 51). The first pointer 6 has already been forwarded to 52, as we can see from theelement 6 in the from-space. The second pointer 3 is forwarded at the end of the to-space

86

to 54:

roots

515253

1 51 6 e 32 2 0 i 43 54 0 d 54 4 0 g 05 5 0 a 06 52 8 b 77 7 10 k 08 8 0 c 09 53 0 j 10

10 10 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 52 e 5452 52 8 b 7 ← scan53 53 0 j 1054 54 0 d 555 ← next56575859606162

to-spaceNow we forward the pointer 8 of element 52:

roots

515253

1 51 6 e 32 2 0 i 43 54 0 d 54 4 0 g 05 5 0 a 06 52 8 b 77 7 10 k 08 55 0 c 09 53 0 j 10

10 10 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 52 e 5452 52 55 b 7 ← scan53 53 0 j 1054 54 0 d 555 55 0 c 056 ← next575859606162

to-spaceand the pointer 7 of element 52:

roots

515253

1 51 6 e 32 2 0 i 43 54 0 d 54 4 0 g 05 5 0 a 06 52 8 b 77 56 10 k 08 55 0 c 09 53 0 j 10

10 10 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 52 e 5452 52 55 b 5653 53 0 j 10 ← scan54 54 0 d 555 55 0 c 056 56 10 k 057 ← next5859606162

to-space

87

Now we forward the pointer 10 of element 53:

roots

515253

1 51 6 e 32 2 0 i 43 54 0 d 54 4 0 g 05 5 0 a 06 52 8 b 77 56 10 k 08 55 0 c 09 53 0 j 10

10 57 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 52 e 5452 52 55 b 5653 53 0 j 5754 54 0 d 5 ← scan55 55 0 c 056 56 10 k 057 57 0 f 058 ← next59606162

to-spaceThen we forward the pointer 5 of element 54:

roots

515253

1 51 6 e 32 2 0 i 43 54 0 d 54 4 0 g 05 58 0 a 06 52 8 b 77 56 10 k 08 55 0 c 09 53 0 j 10

10 57 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 52 e 5452 52 55 b 5653 53 0 j 5754 54 0 d 5855 55 0 c 0 ← scan56 56 10 k 057 57 0 f 058 58 0 a 059 ← next606162

to-spaceThe pointer 10 of element 56 has already been forwarded. Therefore, we get:

roots

515253

1 51 6 e 32 2 0 i 43 54 0 d 54 4 0 g 05 58 0 a 06 52 8 b 77 56 10 k 08 55 0 c 09 53 0 j 10

10 57 0 f 011 11 0 h 1212 12 11 l 0

from-space

51 51 52 e 5452 52 55 b 5653 53 0 j 5754 54 0 d 5855 55 0 c 056 56 57 k 057 57 0 f 058 58 0 a 059 ← scan, next606162

to-space

88

Finally, the to-space becomes the new heap and the from-space is deallocated from memory.

13.3 Other Issues

For further reading, look at:

• GC FAQ http://www.iecc.com/gclist/GC-faq.html

• Allocation and GC Myths http://www.hpl.hp.com/personal/Hans Boehm/gc/myths.ps

There is an excellent mark and sweep garbage collector for C and C++ programs. It iscalled gc (http://www.hpl.hp.com/personal/Hans Boehm/gc/).

89

cse 5317/4305: design and construction of compilerslambda.uta.edu/cse5317/spring07/notes.pdf ·...

Documents