intermediate code generation

Post on 03-Jan-2016

34 Views

Category:

Documents

4 Downloads

Preview:

Click to see full reader

DESCRIPTION

Intermediate Code Generation. Chapter 6. Bac k -end and Fron t -end of A Com piler. Bac k -end and Fron t -end of A Com piler. Cl ose t o sou r c e lan g uage (e . g. s y n t a x t r e e ). Cl ose t o machine l an g uage. - PowerPoint PPT Presentation

TRANSCRIPT

1

Intermediate Code Generation

Chapter 6

Back-end and Front-end

of A Compiler

Back-end and Front-endof A Compiler

Close to source language(e.g. syntax tree)

Close to machine language

m x n compilers can be built by writing just m front ends and n back ends

Back-end and Front-endof A Compiler

Includes:• Type checking• Any syntactic checks that remain after parsing(e.g. ensure break statement is enclosed

within while-, for-, or switch statements).

Component-Based Approach to Building Compilers

5

Source program in Language-1

Source program in Language-2

Non-optimized Intermediate Code

Optimized Intermediate Code

Target-1 machine code Target-2 machine code

Target-1 Code Generator Target-2 Code Generator

Intermediate-code Optimizer

Language-1 Front End Language-2 Front End

Intermediate Representation (IR)

6

A kind of abstract machine language that can express the target machine operations without committing to too much machine details.

:MWhy IR ?

Without IR

7

C

Pascal

FORTRAN

C++

SPARC

HP PA

x86

IBM PPC

With IR

8

C

Pascal

FORTRAN

C++

SPARC

HP PA

x86

IBM PPC

IR

With IR

C

Pascal

FORTRAN

C++

IR Common Backend?

9

Advantages of Using an Intermediate Language

10

1. Retargeting - Build a compiler for a new machine by

attaching a new code generator to an existing front-

end.

2. Optimization - reuse intermediate code optimizers

in compilers for different languages and different

machines.

Note: the terms “intermediate code”, “intermediate

language”, and “intermediate representation” are all

used interchangeably.

Issues in Designing an IR

11

• Whether to use an existing IR• if target machine architecture is

similar• if the new language is similar

• Whether the IR is appropriate for the kind of optimizations to be performed• e.g. speculation and predication• some transformations may take

much longer than they would on a different IR

Issues in Designing an IR

12

• Designing a new IR needs to consider

• Level (how machine dependent it is)

• Structure• Expressiveness• Appropriateness for general and

special optimizations• Appropriateness for code

generation• Whether multiple IRs should be

used

Multiple-Level IR

13

Source Program

High-level IR

Low-level IR

Target code

Semantic Check

High-level Optimization

Low-level Optimization

Using Multiple-level IR

14

Translating from one level to another in the compilation process

• Preserving an existing technology investment

• Some representations may be more appropriate for a particular task.

Commonly Used IR

15

:MPossible IR forms• Graphical representations: such as

syntax trees, AST (Abstract Syntax Trees), DAG

• Postfix notation• Three address code• SSA (Static Single Assignment) form

:MIR should have individual components that describe simple things

Syntax Tree

Syntax Tree

+

+

*

a -

-

b c

a*

d

b c

Syntax Tree

+

+

*

a -

-

b c

a*

d

b c

Syntax Tree+

+

*

a -

-

b c

a

b c

*

d

Directed Acyclic Graph (DAG):• More compact representation• Gives clues regarding generation of efficient

code

Node can have more thanone parent

Example

Construct the DAG for:

How to Generate DAG fromSyntax-Directed Definition?

All what is needed is that functions such as Node and Leaf above check whether a node already exists. If such a node exists, a pointer is returned to that node.

How to Generate DAG from

Syntax-Directed Definition?

Data Structure: Array

Data Structure: Array

Left and right children of an intermediate node

Operation code

Leaves

Scanning the array each time a new node is needed, is not an efficient thing to do.

Data Structure: Hash Table

Hash function = h(op, L, R)

Three-Address Code• Another option for intermediate

presentation• Built from two concepts:

– addresses and instructions• At most one operator

AddressCan be one of the following:• A name: source program

name• A constant• Compiler-generated

temporary

Instructions

Procedure call such as p(x1, x2, …, xn) is implemented as:

Example

OR

Choice of Operator Set• Rich enough to implement the

operations of the source language• Close enough to machine

instructions tosimplify code generation

Data StructureHow to present these instructions in a

data structure?– Quadruples– Triples– Indirect triples

Data Structure:Quadruples

• Has four fields: op, arg1, arg2, result

• Exceptions:– Unary operators: no arg2– Operators like param: no arg2, no

result– (Un)conditional jumps: target label is

theresult

tl =

tz = b * tt

t3 = minus

c t4 = b

* t3 t6 = tz + t4

a = t5

minus c 0

12

3

4

5

op arg1 arg2 result

=I ts I

I

minus : c I• tlI

• b tl I

t2

* I I -,I

Tm1nus 1 c I I t3

* I b • t3 I t4I

+ I t2 • t4 'I tsI

I

I

a...

Data Structure:Triples

• Only three fields: no result field• Results referred to by its

position

Data Structure:Indirect Triples

• When instructions are moving around during optimizations: quadruples are better than triples.

• Indirect triples solve this problem

List of pointers to triplesOptimizing complier can reorder instructionlist, instead of affecting the triples themselves

Single-Static-Assignment (SSA)

• Is an intermediate presentation• Facilitates certain code

optimizations• All assignments are to variables

with distinct names

Single-Static-Assignment (SSA)

Example:

If we use different names for X in true part and false part, then which nameshall we use in the assignment of y = x * a ?

The answer is: Ø-function

Returns the value of its argument that corresponds to the control-flow path that was taken to get to the assignment statement containing the Ø-function

Example

Types and Declarations• Type checking: to ensure that types

of operands match the type expected by operator

• Determine the storage needed• Calculate the address of an

array reference• Insert explicit type conversion• Choose the write version of an

operator• …

Storage Layout• From the type, we can determine

amount of storage at run time.• At compile time, we will use this

amountto assign its name a relative address.

• Type and relative address are saved inthe symbol table entry of the name.

• Data with length determined only at run time saves a pointer in the symbol table.

Storage Layout• Multibyte objects are stored in

consecutive bytes and given the address of the first byte

• Storage for aggregates (e.g. arrays and classes) is allocated in one contiguous block of bytes.

B

offset = offset + T.width; }

D €

P { offset = 0; }

D D

T id ;{ top.put( id .lexeme, T.type, offset);

To keep track of the next available relative address

Create a symbol table entry

Translations of Statementsand Expressions

Syntax-Directed Definition (SDD)

Syntax-Directed Translation(SDT)

PRODUCTION SEMANTIC RULES

S id = E ;

E E1 + E2

I - E1

( E1 )

I id

S.code = E.code IIgen(top .g et( id.lexeme) '=' E.addr)

E.addr- = new Temp()E.cod e = E 1.code II E2.code II

gen(E.addr'=' E 1 .addr'+' E2 .addr)

E.addr = new Temp()E.cod e = E 1.code II

gen(E.addr '=''minus' E1.addr)

E.addr· = E1.addr E.cod e = Et.code

E .addr = top.get( id .lexeme)E.cod e ="

Three-address code of E

Build an instruction

Address holding value of E(e.g. tmp variable, name, constant)

Get a temporaryvariable

Current symbol table

a = b + - c

•t 1 = minus ct2 ; b + tla = t2

PRODUCTION SEMANTIC RULES

S id = E ;

E E1 + Ez

I - E1

( E1 )

I id

S.code = E.cod e IIgen( lOJI.g et( id .lexeme) '=' E .addr)

E.addr =new Temp()E.code = E1.code II E2.code II

gen(E.addr '=' E1.addr '+' E2.addr)

E .addr = new Temp()E.code == E1.cod e II

gen(E.addr '=' 'minus' E1.addr)

E.addr = E1.addrE.code = E 1 .cod e

E.addr = top.get( id.lexeme)E .cod e =="

Generating three-address code incrementallyto avoid long strings manipulations

gen() does two things:• generate three address

instruction•append it to the sequence ofinstructions generated so far

Arrays• Elements of the same time• Stored consecutively in memory• In languages like C or Java

elements are: 0, 1, …, n-1• In some other languages:

low, low+1, …, high

ArraysIf elements start with 0, and element width is w, then a[i] address is: base is address of A[0]

Generalizing to two-dimensions a[i1][i2]:w1 is width of a row andw2 the width of an element

Generalizing to k-dimensions:

or

orw is width of an element, n2 is numberof elements

temporary used while computing the offset

pointer to symbol table entry

A is a 2x3 array of integers

54

Type Checking

Type Checking includes several aspects. • The language comes with a type system, i.e., a set of rules

saying what types can appear where. • The compiler assigns a type expression to parts of the

source program. • The compiler checks that the type usage in the program

conforms to the type system for the language.

55

Type Checking

• All type checking could be done at run time: • The compiler generates code to do the checks. • Some languages have very weak typing; for example, variables can change

their type during execution. • Often these languages need run-time checks. • Examples include lisp, snobol, apl.

• A sound type system guarantees that all checks can be performed prior to execution. This does not mean that a given compiler will make all the necessary checks.

• An implementation is strongly typed if compiled programs are guaranteed to run without type errors

56

Rules for Type Checking

• There are two forms of type checking. 1. We will learn type synthesis where the types of parts are used to infer the

type of the whole. For example, integer+real =real. 2. Type inference is very slick. The type of a construct is determined from

usage. This permits languages like ML to check types even though names need not be declared.

• We consider type checking for expressions. • Checking statements is very similar. • View the statement as a function having its components as

arguments and returning void.

57

Type Conversions

• A very strict type system would do no automatic conversion. • Instead it would offer functions for the programmer to explicitly

convert between selected types. Then either the program has compatible types or is in error.

• we will consider a approach in which the language permits certain implicit conversions that the compiler is to supply.

• This is called type coercion. Explicit conversions supplied by the programmer are called casts.

58

Type Conversions

• We will see integer and real types, and postulate a unary function denoted (real) that converts an integer into the real having the same value.

• We consider the more general case where there are multiple types some of which have coercions (often called widening).

• For example in C/Java, int can be widened to long, which in turn can be widened to float.

• Mathematically the hierarchy is a partially order set (poset) in which each pair of elements has a least upper bound (LUB).

• For many binary operators the two operands are converted to the LUB. So adding a short to a char, requires both to be converted to an int. Adding a byte to a float, requires the byte to be converted to a float (the float remains a float and is not converted).

59

Checking and Coercing Types for Basic Arithmetic

• The steps for addition, subtraction, multiplication, and division are all essentially the same:

• Convert each types if necessary to the LUB and then perform the arithmetic on the (converted or original) values. Note that conversion often requires the generation of code.

• Two functions are convenient. 1. LUB(t1,t2) returns the type that is the LUB of the two given types. It signals

an error if there is no LUB, for example if one of the types is an array. 2. widen(a,t,w,newcode,newaddr). Given an address a of type t, and a

(hopefully) wider address w, produce the instructions newcode needed so that the address newaddr is the conversion of address a to type w.

• LUB is simple, just look at the address latice. If one of the type arguments is not in the lattice, signal an error; otherwise find the lowest common ancestor.

60

Checking and Coercing Types for Basic Arithmetic

• widen is more interesting. It involves n2 cases for n types. Many of these are error cases (e.g., if t wider than w). Below is the code for our situation with two possible types integer and real.

• The four cases consist of 2 nops (when t=w), one error (t=real; w=integer) and one conversion (t=integer; w=real).

widen (a:addr, t:type, w:type, newcode:string, newaddr:addr)

if t=w newcode = "" newaddr = a

else if t=integer and w=real newaddr = new Temp() newcode = gen(newaddr = (real) a)

else signal error

61

Checking and Coercing Types for Basic Arithmetic

• With these two functions it is not hard to modify the rules to catch type errors and perform coercions for arithmetic expressions.

• Maintain the type of each operand by defining type attributes for e, t, and f.

• Coerce each operand to the LUB. • This requires that we have type information for the base entities,

identifiers and numbers. The lexer can supply the type of the numbers. We retrieve it via get(NUM.type).

• It is more interesting for the identifiers. We insert that information when we process declarations. So we now have another semantic check: Is the identifier declared before it is used?

62

Checking and Coercing Types for Basic Arithmetic

• let's examine a particularly interesting entry. Consider the assignment statement

A[3/X+4] := X*5+Y;

• Consider the ra node, i.e., the node corresponding to the production. ra → [ e ] := e1 ;

• When the tree traversal gets to this node, its parent has passed in the value of the inherited attribute ra.id=id.entry.

• Thus the ra node has access to the identifier table entry for ID, which in our example is the variable A.

63

Checking and Coercing Types for Basic Arithmetic

• Prior to doing its calculations, the ra node invokes its children and gets back all the synthesized attributes. To summarize, when the ra node performs its calculations, it has available.

• ra.id: the identifier entry for A. • e.addr/e.code: executing e.code results in e.addr containing the value of the

expression 3/X+4. • e1.addr/e1.code: executing e1.code results in e1.addr containing the value of

the expression X*5+Y. • e.type/e1.type: The types of the expressions.

• What must the ra node do? • Ensure execution of e.code and e1.code. • Check that e.type is int (I don't do this). • Multiply e by the base width of the array A. (We need a temporary, ra.t1, to

hold the computed value). • Widen e1 to the base type of A. (We need, and widen generates, a temporary

ra.addr to hold the widened value). • Do the actual assignment of X*5+Y to A[3/X+4].

64

Control Flow

• Control flow includes the study of Boolean expressions, which have two roles.

1. They can be computed and treated similar to integers or real. There are also relational operators that produce Boolean values from arithmetic operands.

2. They are used in certain statements that alter the normal flow of control. In this regard.

• One question that comes up with Boolean expressions is whether both operands need be evaluated. If we are evaluating A or B and find that A is true, must we evaluate B? For example, consider evaluating

A=0 OR 3/A < 1.2 • when A is zero. • Consider A*F(x). If the compiler knows that for this run A is zero, must it

generate code to evaluate F(x)? Don't forget that functions can have side effects,

65

Short-Circuit Code

• This is also called jumping code. Here the Boolean operators AND, OR, and NOT do not appear in the generated instruction stream. Instead we just generate jumps to either the true branch or the false branch.

• This statement cab be translated as follows

66

Flow-of-Control Statements

• We now consider the translation of boolean expressions into three-address code

• in the context of statements such as those generated by the following grammar:

S if(B)S1S if ( B ) S1 else S2S while ( B ) S1

• both B and S have a synthesized attribute code, which gives the translation into three-address instructions.

• For simplicity, we build up the translations B. code and S. code as strings, using syntax-directed definitions.

67

• The translation of if (B) S1 consists of B. code followed by Sl. Code

68

69

Control-Flow Translation of Boolean Expressions

70

Control-Flow Translation of Boolean Expressions

• Consider again the following statement

• Translating into 3 address code will fetch the following results

71

Avoiding Redundant Gotos

• For x > 200 we will have the following code fragment

72

Avoiding Redundant Gotos

73

Avoiding Redundant Gotos

74

Boolean Values and Jumping Code

• A boolean expression may also be evaluated for its value, as in assignment statements such as x = true; or x = a < b;

• A clean way of handling both roles of boolean expressions is to first build a syntax tree for expressions, using either of the following approaches:

1. Use two passes. Construct a complete syntax tree for the input, and then walk the tree in depth-first order, computing the translations specified by the semantic rules.

2. Use one pass for statements, but two passes for expressions. With this approach, we would translate E in while (E) S1 before S1 is examined. The translation of E, however, would be done by building its syntax tree and then walking the tree.

top related