
CH2.1

CSE244

Chapter 2: A Simple One Pass Compiler

Aggelos Kiayias
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Road, Box U-1155
Storrs, CT 06269

www.cse.uconn.edu/~akiayias

CH2.2

The Entire Compilation Process

• Grammars for Syntax Definition
• Syntax-Directed Translation
• Parsing - Top Down & Predictive
• Pulling Together the Pieces
• The Lexical Analysis Process
• Symbol Table Considerations
• A Brief Look at Code Generation
• Concluding Remarks/Looking Ahead

CH2.3

Grammars for Syntax Definition

A Context-free Grammar (CFG) Is Utilized to Describe the Syntactic Structure of a Language

A CFG Is Characterized By:
1. A Set of Tokens or Terminal Symbols
2. A Set of Non-terminals
3. A Set of Production Rules. Each Rule Has the Form
       NT → {T, NT}*
4. A Non-terminal Designated As the Start Symbol

CH2.4

Grammars for Syntax Definition: Example CFG

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

(the “|” means OR)

(So we could have written
list → list + digit | list - digit | digit )

CH2.5

Grammars are Used to Derive Strings:

Using the CFG defined on the previous slide, we can derive the string 9 - 5 + 2 as follows:

list ⇒ list + digit              P1 : list → list + digit
     ⇒ list - digit + digit      P2 : list → list - digit
     ⇒ digit - digit + digit     P3 : list → digit
     ⇒ 9 - digit + digit         P4 : digit → 9
     ⇒ 9 - 5 + digit             P4 : digit → 5
     ⇒ 9 - 5 + 2                 P4 : digit → 2
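The derivation can be replayed mechanically by rewriting one non-terminal per step. The following is a minimal sketch in Python; the helper name `derive` and the list-of-symbols representation are assumptions made for illustration, not part of the slides.

```python
# Replay the derivation of "9 - 5 + 2" one production at a time.
# A sentential form is a list of symbols; `derive` is a hypothetical helper.

def derive(form, nonterminal, body):
    """Replace the leftmost occurrence of `nonterminal` in `form` with `body`."""
    i = form.index(nonterminal)
    return form[:i] + body + form[i + 1:]

steps = [
    ("list", ["list", "+", "digit"]),   # P1
    ("list", ["list", "-", "digit"]),   # P2
    ("list", ["digit"]),                # P3
    ("digit", ["9"]),                   # P4
    ("digit", ["5"]),                   # P4
    ("digit", ["2"]),                   # P4
]

form = ["list"]
trace = [" ".join(form)]
for nonterminal, body in steps:
    form = derive(form, nonterminal, body)
    trace.append(" ".join(form))

print("\n=> ".join(trace))   # last line printed is "=> 9 - 5 + 2"
```

Each entry of `trace` is one sentential form, so the list reads top to bottom exactly like the derivation.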

CH2.6

Grammars are Used to Derive Strings:

This derivation could also be represented via a Parse Tree

(parents on left, children on right)

list
    list
        list
            digit
                9
        -
        digit
            5
    +
    digit
        2

list ⇒ list + digit
     ⇒ list - digit + digit
     ⇒ digit - digit + digit
     ⇒ 9 - digit + digit
     ⇒ 9 - 5 + digit
     ⇒ 9 - 5 + 2

CH2.7

A More Complex Grammar

block → begin opt_stmts end
opt_stmts → stmt_list | ε
stmt_list → stmt_list ; stmt | stmt

What is this grammar for?
What does “ε” represent?
What kind of production rule is this?

CH2.8

Defining a Parse Tree

More Formally, a Parse Tree for a CFG Has the Following Properties:
• Root Is Labeled With the Start Symbol
• Leaf Node Is a Token or ε
• Interior Node (Non-Leaf) Is a Non-Terminal
• If A → x1x2…xn, Then A Is an Interior Node; x1x2…xn Are Children of A and May Be Non-Terminals or Tokens

CH2.9

Other Important Concepts: Ambiguity

Grammar:

string → string + string | string - string | 0 | 1 | … | 9

Two derivations (Parse Trees) for the same token string 9 - 5 + 2.

Tree 1, grouping (9 - 5) + 2:

string
    string
        string
            9
        -
        string
            5
    +
    string
        2

Tree 2, grouping 9 - (5 + 2):

string
    string
        9
    -
    string
        string
            5
        +
        string
            2

Why is this a Problem?
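One way to see why two parse trees for one token string matter is to evaluate both. The sketch below assumes the usual arithmetic meaning of + and - and represents each tree as a nested tuple; the names `evaluate`, `tree_a`, and `tree_b` are illustrative only.

```python
# Each parse tree is a nested tuple (left, op, right); leaves are integers.

def evaluate(tree):
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l - r

tree_a = ((9, "-", 5), "+", 2)   # groups as (9 - 5) + 2
tree_b = (9, "-", (5, "+", 2))   # groups as 9 - (5 + 2)

print(evaluate(tree_a))  # 6
print(evaluate(tree_b))  # 2
```

The same input yields two different values, so a translator driven by an ambiguous grammar has no single well-defined result.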

CH2.10

Other Important Concepts: Associativity of Operators

Left vs. Right

Right associative: a = b = c groups as a = (b = c)

right → letter = right | letter
letter → a | b | c | … | z

right
    letter
        a
    =
    right
        letter
            b
        =
        right
            letter
                c

Left associative: 9 - 5 + 2 groups as (9 - 5) + 2

list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | … | 9

list
    list
        list
            digit
                9
        -
        digit
            5
    +
    digit
        2

CH2.11

Embedding Associativity

The language of arithmetic expressions with + -

(ambiguous) grammar that does not enforce associativity

string → string + string | string - string | 0 | 1 | … | 9

non-ambiguous grammar enforcing left associativity (parse tree will grow to the left)

string → string + digit | string - digit | digit
digit → 0 | 1 | 2 | … | 9

non-ambiguous grammar enforcing right associativity (parse tree will grow to the right)

string → digit + string | digit - string | digit
digit → 0 | 1 | 2 | … | 9

CH2.13

Syntax-Directed Translation

Associate Attributes With Grammar Rules & Constructs and Translate As Parsing Occurs

The translation will follow the parse tree structure (and as a result the structure and form of the parse tree will affect the translation).

First example: Inductive Translation.

Infix to Postfix Notation Translation for Expressions

Translation defined inductively as Postfix(E), where E is an Expression.

Rules

1. If E is a variable or constant then Postfix(E) = E

2. If E is E1 op E2 then Postfix(E) = Postfix(E1 op E2) = Postfix(E1) Postfix(E2) op

3. If E is (E1) then Postfix(E) = Postfix(E1)
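The three rules map directly onto a recursive function. The following Python sketch assumes a hypothetical tree encoding chosen only for this example: a string for a variable or constant, a 1-tuple for a parenthesized expression, and a 3-tuple `(E1, op, E2)` for a binary expression.

```python
def postfix(e):
    """Translate an expression tree to postfix notation using the three rules."""
    if isinstance(e, str):        # rule 1: variable or constant
        return e
    if len(e) == 1:               # rule 3: parenthesized expression (E1)
        return postfix(e[0])
    e1, op, e2 = e                # rule 2: E1 op E2
    return f"{postfix(e1)} {postfix(e2)} {op}"

# ( 9 - 5 ) + 2
print(postfix(((("9", "-", "5"),), "+", "2")))   # 9 5 - 2 +
# 9 - ( 5 + 2 )
print(postfix(("9", "-", (("5", "+", "2"),))))   # 9 5 2 + -
```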

CH2.14

Examples

Postfix( ( 9 - 5 ) + 2 )
    = Postfix( ( 9 - 5 ) ) Postfix( 2 ) +
    = Postfix( 9 - 5 ) Postfix( 2 ) +
    = Postfix( 9 ) Postfix( 5 ) - Postfix( 2 ) +
    = 9 5 - 2 +

Postfix( 9 - ( 5 + 2 ) )
    = Postfix( 9 ) Postfix( ( 5 + 2 ) ) -
    = Postfix( 9 ) Postfix( 5 + 2 ) -
    = Postfix( 9 ) Postfix( 5 ) Postfix( 2 ) + -
    = 9 5 2 + -

CH2.15

Syntax-Directed Definition

Each Production Has a Set of Semantic Rules

Each Grammar Symbol Has a Set of Attributes

For the Following Example, String Attribute “t” is Associated With Each Grammar Symbol

expr → expr - term | expr + term | term
term → 0 | 1 | 2 | 3 | … | 9

recall: What is a Derivation for 9 + 5 - 2?

expr ⇒ expr - term
     ⇒ expr + term - term
     ⇒ term + term - term
     ⇒ 9 + term - term
     ⇒ 9 + 5 - term
     ⇒ 9 + 5 - 2

CH2.16

Syntax-Directed Definition (2)

Each Production Rule of the CFG Has a Semantic Rule

Production              Semantic Rule
expr → expr1 + term     expr.t := expr1.t || term.t || ‘+’
expr → expr1 - term     expr.t := expr1.t || term.t || ‘-’
expr → term             expr.t := term.t
term → 0                term.t := ‘0’
term → 1                term.t := ‘1’
…                       …
term → 9                term.t := ‘9’

Note: Semantic Rules for expr define t as a “synthesized attribute”, i.e., the various copies of t obtain their values from “children t’s”. Here expr1 denotes the expr on the right side of the production, and || is string concatenation.

CH2.17

Semantic Rules are Embedded in Parse Tree

Annotated parse tree for 9 - 5 + 2:

expr.t = 95-2+
    expr.t = 95-
        expr.t = 9
            term.t = 9
                9
        -
        term.t = 5
            5
    +
    term.t = 2
        2

• How Do Semantic Rules Work?
• What Type of Tree Traversal is Being Performed?
• How Can We More Closely Associate Semantic Rules With Production Rules?

CH2.18

Translation Schemes

Embed Semantic Actions into the right sides of the productions.

expr → expr + term {print(‘+’)}
     | expr - term {print(‘-’)}
     | term
term → 0 {print(‘0’)}
     …

Parse tree for 9 - 5 + 2 with the actions attached as extra children:

expr
    expr
        expr
            term
                9
                {print(‘9’)}
        -
        term
            5
            {print(‘5’)}
        {print(‘-’)}
    +
    term
        2
        {print(‘2’)}
    {print(‘+’)}

CH2.19

Parsing - Top-Down & Predictive

Top-Down Parsing: Parse tree / derivation of a token string occurs in a top down fashion.

For Example, Consider:

type → simple
     | ↑ id
     | array [ simple ] of type
simple → integer
     | char
     | num dotdot num

Suppose input is:

array [ num dotdot num ] of integer

Parsing would begin with the start symbol:

type → ???

CH2.20

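Top-Down Parse (type = start symbol)

Input: array [ num dotdot num ] of integer

Lookahead symbol: array. The tree is only the start symbol.

type
    ?

Lookahead symbol: array. Expand with type → array [ simple ] of type.

type
    array
    [
    simple
    ]
    of
    type

Lookahead symbol: num. Expand with simple → num dotdot num.

type
    array
    [
    simple
        num
        dotdot
        num
    ]
    of
    type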

CH2.21

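Top-Down Parse (type = start symbol)

Lookahead symbol: integer. Expand the remaining type with type → simple.

type
    array
    [
    simple
        num
        dotdot
        num
    ]
    of
    type
        simple

Lookahead symbol: integer. Expand with simple → integer.

type
    array
    [
    simple
        num
        dotdot
        num
    ]
    of
    type
        simple
            integer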

CH2.22

Top-Down Process: Recursive Descent or Predictive Parsing

• Parser Operates by Attempting to Match Tokens in the Input Stream
• Utilize both Grammar and Input Below to Motivate Code for Algorithm

array [ num dotdot num ] of integer

type → simple
     | ↑ id
     | array [ simple ] of type
simple → integer
     | char
     | num dotdot num

procedure match ( t : token ) ;
begin
    if lookahead = t then
        lookahead := nexttoken
    else error
end ;

CH2.23

Top-Down Algorithm (Continued)

procedure type ;
begin
    if lookahead is in { integer, char, num } then
        simple
    else if lookahead = ‘↑’ then begin
        match ( ‘↑’ ) ; match ( id )
    end
    else if lookahead = array then begin
        match ( array ) ; match ( ‘[’ ) ; simple ; match ( ‘]’ ) ; match ( of ) ; type
    end
    else error
end ;

procedure simple ;
begin
    if lookahead = integer then
        match ( integer )
    else if lookahead = char then
        match ( char )
    else if lookahead = num then begin
        match ( num ) ; match ( dotdot ) ; match ( num )
    end
    else error
end ;
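The procedures match, type, and simple can be tried out directly. The following is a minimal runnable sketch in Python, not the course's reference implementation. Several details are assumptions made for the example: tokens are whitespace-separated strings, `"$"` is an invented end-of-input marker, `"^"` stands in for the symbol that introduces the id alternative, and the names `Parser`, `ParseError`, `type_`, and `parse` are hypothetical.

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" is an assumed end marker
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        # Advance only when the lookahead is the expected token.
        if self.lookahead != t:
            raise ParseError(f"expected {t!r}, found {self.lookahead!r}")
        self.pos += 1

    def type_(self):
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":          # assumed spelling of the pointer symbol
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected {self.lookahead!r} in type")

    def simple(self):
        if self.lookahead == "integer":
            self.match("integer")
        elif self.lookahead == "char":
            self.match("char")
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected {self.lookahead!r} in simple")

def parse(text):
    """Return True if `text` is derivable from type; raise ParseError otherwise."""
    parser = Parser(text.split())
    parser.type_()
    parser.match("$")
    return True

print(parse("array [ num dotdot num ] of integer"))  # True
```

Each non-terminal becomes one method and each terminal becomes one `match` call, so the code has the same shape as the grammar.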

CH2.24

Tracing

Input: array [ num dotdot num ] of integer

To initialize the parser:
    set global variable: lookahead = array
    call procedure: type

Procedure call to type with lookahead = array results in the actions:
    match( array ); match(‘[’); simple; match(‘]’); match(of); type

Procedure call to simple with lookahead = num results in the actions:
    match (num); match (dotdot); match (num)

Procedure call to type with lookahead = integer results in the actions:
    simple

Procedure call to simple with lookahead = integer results in the actions:
    match ( integer )

CH2.25

Limitations

Can we apply the previous technique to every grammar?

NO:

type → simple

simple → integer
     | array digit
digit → 0|1|2|3|4|5|6|7|8|9

consider the string “array 6”

the predictive parser starts with type and lookahead = array

apply production type → simple OR type → array digit ????

CH2.26

Designing a Predictive Parser

Consider A → α

FIRST(α) = set of leftmost tokens that appear in α or in strings generated by α.
E.g. FIRST(type) = { ↑, array, integer, char, num }

Consider productions of the form A → α, A → β: the sets FIRST(α) and FIRST(β) should be disjoint

Then we can implement predictive parsing (initially: start NT + lookahead = leftmost token)
• Starting with A → ? we find which FIRST(α) set the lookahead symbol belongs to, and we use this production.
• Any non-terminal results in the corresponding procedure call
• Terminals are matched.
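The FIRST sets and the disjointness condition can be computed by a small fixed-point iteration. The Python sketch below is for the type/simple grammar only and assumes that no production derives the empty string, so only the leftmost symbol of each right side matters. The `"^"` token is an assumed spelling of the pointer symbol, and `GRAMMAR`, `first_sets`, and `is_predictive` are names invented for this example.

```python
GRAMMAR = {
    "type": [["simple"], ["^", "id"],
             ["array", "[", "simple", "]", "of", "type"]],
    "simple": [["integer"], ["char"], ["num", "dotdot", "num"]],
}

def first_of_body(body, first):
    """FIRST of a right side, assuming no production derives the empty string."""
    head = body[0]
    return set(first[head]) if head in GRAMMAR else {head}

def first_sets():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:                      # iterate until no set grows
        changed = False
        for nt, bodies in GRAMMAR.items():
            for body in bodies:
                new = first_of_body(body, first) - first[nt]
                if new:
                    first[nt] |= new
                    changed = True
    return first

def is_predictive(nt, first):
    """True when the FIRST sets of the alternatives of `nt` are pairwise disjoint."""
    seen = set()
    for body in GRAMMAR[nt]:
        f = first_of_body(body, first)
        if seen & f:
            return False
        seen |= f
    return True

first = first_sets()
print(sorted(first["type"]))   # ['^', 'array', 'char', 'integer', 'num']
print(is_predictive("type", first), is_predictive("simple", first))  # True True
```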

CH2.27

Problems with Top Down Parsing

Left Recursion in CFG May Cause Parser to Loop Forever. Indeed:

In the production A → Aα we write the program

procedure A
{
    if lookahead belongs to First(Aα) then
        call the procedure A
}

No token is matched before the recursive call, so the lookahead never changes and the procedure calls itself without end.

Solution: Remove Left Recursion... without changing the Language defined by the Grammar.

CH2.28

Dealing with Left Recursion

Solution: Algorithm to Remove Left Recursion:

expr → expr + term | expr - term | term
term → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

becomes

expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

BASIC IDEA: A → Aα | β becomes

A → βR
R → αR | ε
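The basic idea can be written as a short function that rewrites the productions of one non-terminal. This Python sketch handles immediate left recursion only; the function name `remove_left_recursion`, the list-of-lists encoding of right sides, and the use of an empty list for the empty string are assumptions for illustration.

```python
def remove_left_recursion(a, bodies, r):
    """Rewrite A -> A alpha | beta as A -> beta R, R -> alpha R | epsilon.

    `a` is the non-terminal, `bodies` its right sides (lists of symbols),
    and `r` the name of the new non-terminal. An empty list denotes epsilon.
    """
    alphas = [b[1:] for b in bodies if b and b[0] == a]   # left-recursive tails
    betas = [b for b in bodies if not b or b[0] != a]     # other alternatives
    if not alphas:
        return {a: bodies}                                # nothing to do
    return {
        a: [beta + [r] for beta in betas],
        r: [alpha + [r] for alpha in alphas] + [[]],
    }

result = remove_left_recursion(
    "expr",
    [["expr", "+", "term"], ["expr", "-", "term"], ["term"]],
    "rest",
)
for head, bodies in result.items():
    alternatives = [" ".join(body) or "ε" for body in bodies]
    print(head, "→", " | ".join(alternatives))
# expr → term rest
# rest → + term rest | - term rest | ε
```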

CH2.29

What happens to semantic actions?

expr → expr + term {print(‘+’)}
     | expr - term {print(‘-’)}
     | term
…

becomes

expr → term rest
rest → + term {print(‘+’)} rest
     | - term {print(‘-’)} rest
     | ε
…
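With the left recursion removed, the translation scheme can be run as a predictive parser whose actions fire between the recursive calls. The Python sketch below assumes single-digit operands, collects output in a list instead of printing, and uses the invented name `infix_to_postfix`; each `out.append` plays the role of one print action.

```python
def infix_to_postfix(text):
    tokens = [c for c in text if not c.isspace()]
    pos = 0
    out = []

    def lookahead():
        return tokens[pos] if pos < len(tokens) else None

    def match(t):
        nonlocal pos
        if lookahead() != t:
            raise SyntaxError(f"expected {t!r}, found {lookahead()!r}")
        pos += 1

    def term():
        t = lookahead()
        if t is None or t not in "0123456789":
            raise SyntaxError(f"digit expected, found {t!r}")
        match(t)
        out.append(t)            # action attached to term -> digit

    def rest():
        op = lookahead()
        if op in ("+", "-"):
            match(op)
            term()
            out.append(op)       # action sits between term and the inner rest
            rest()
        # otherwise rest -> epsilon: consume nothing

    term()                       # expr -> term rest
    rest()
    if lookahead() is not None:
        raise SyntaxError(f"unexpected {lookahead()!r}")
    return "".join(out)

print(infix_to_postfix("9 - 5 + 2"))   # 95-2+
```

Because the operator is emitted before the nested call to `rest`, the output order matches the left-recursive scheme even though the tree now grows to the right.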


CH2.30

Comparing Grammars with Left Recursion

Notice Location of Semantic Actions in Tree

What is Order of Processing?

expr
    expr
        expr
            term
                9
                {print(‘9’)}
        -
        term
            5
            {print(‘5’)}
        {print(‘-’)}
    +
    term
        2
        {print(‘2’)}
    {print(‘+’)}

CH2.31

Comparing Grammars without Left Recursion

Now, Notice Location of Semantic Actions in Tree for Revised Grammar

What is Order of Processing in this Case?

expr
    term
        9
        {print(‘9’)}
    rest
        -
        term
            5
            {print(‘5’)}
        {print(‘-’)}
        rest
            +
            term
                2
                {print(‘2’)}
            {print(‘+’)}
            rest
                ε

CH2.32

The Lexical Analysis Process: A Graphical Depiction

input → lexan ( ), the lexical analyzer → token returned to caller

• uses getchar ( ) to read character
• pushes back c using ungetc (c , stdin)
• returns token to caller
• sets global variable tokenval to attribute value

CH2.33

The Lexical Analysis Process: Functional Responsibilities

Input Token String Is Broken Down

White Space and Comments Are Filtered Out

Individual Tokens With Associated Values Are Identified

Symbol Table Is Initialized and Entries Are Constructed for Each “Appropriate” Token

Under What Conditions will a Character be Pushed Back?

CH2.34

Example of a Lexical Analyzer

function lexan : integer ;
var lexbuf : array [ 0 .. 100 ] of char ;
    c : char ;
begin
    loop begin
        read a character into c ;
        if c is a blank or a tab then
            do nothing
        else if c is a newline then
            lineno := lineno + 1
        else if c is a digit then begin
            set tokenval to the value of this and following digits ;
            return NUM
        end

CH2.35

Algorithm for Lexical Analyzer

        else if c is a letter then begin
            place c and successive letters and digits into lexbuf ;
            p := lookup ( lexbuf ) ;
            if p = 0 then
                p := insert ( lexbuf, ID ) ;
            tokenval := p ;
            return the token field of table entry p
        end
        else begin
            set tokenval to NONE ; /* there is no attribute */
            return integer encoding of character c
        end
    end
end

Note: Insert / Lookup operations occur against the Symbol Table!
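The lexan algorithm can be sketched as runnable Python. This is an illustration under stated assumptions, not the course's code: the token codes `NUM`, `ID`, `DONE` and the marker `NONE` are invented values, the class name `Lexer` is hypothetical, reserved words are returned as strings, and an index into the input string replaces the getchar/ungetc pair (stopping the scan before a character has the same effect as pushing it back).

```python
NUM, ID, DONE = 256, 257, 258    # assumed integer token codes
NONE = -1                        # assumed "no attribute" marker
DIGITS = "0123456789"

class Lexer:
    def __init__(self, text, keywords=("div", "mod")):
        self.text, self.pos = text, 0
        self.lineno = 1
        self.tokenval = NONE
        # Entry 0 is unused so that lookup() can return 0 for "not found".
        self.symtable = [None] + [(kw, kw) for kw in keywords]

    def lookup(self, lexeme):
        for p in range(1, len(self.symtable)):
            if self.symtable[p][0] == lexeme:
                return p
        return 0

    def insert(self, lexeme, token):
        self.symtable.append((lexeme, token))
        return len(self.symtable) - 1

    def lexan(self):
        while self.pos < len(self.text):
            c = self.text[self.pos]
            self.pos += 1
            if c in " \t":
                continue                          # skip blanks and tabs
            if c == "\n":
                self.lineno += 1
            elif c in DIGITS:
                start = self.pos - 1
                while self.pos < len(self.text) and self.text[self.pos] in DIGITS:
                    self.pos += 1                 # stop before the first non-digit
                self.tokenval = int(self.text[start:self.pos])
                return NUM
            elif c.isalpha():
                start = self.pos - 1
                while self.pos < len(self.text) and self.text[self.pos].isalnum():
                    self.pos += 1
                lexeme = self.text[start:self.pos]
                p = self.lookup(lexeme) or self.insert(lexeme, ID)
                self.tokenval = p
                return self.symtable[p][1]        # token field of entry p
            else:
                self.tokenval = NONE              # there is no attribute
                return ord(c)
        return DONE

lx = Lexer("count div 12")
print(lx.lexan(), lx.tokenval)   # 257 3
```

A character is "pushed back" exactly when the analyzer has to read one character past the end of a number or identifier to know that the lexeme is complete.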

CH2.36

Symbol Table Considerations

ARRAY symtable

index   lexptr       token   attributes
0
1       → div        div
2       → mod        mod
3       → count      id
4       → i          id

ARRAY lexemes

d i v EOS m o d EOS c o u n t EOS i EOS

OPERATIONS:
    Insert (string, token_ID)
    Lookup (string)

NOTICE: Reserved words are placed into symbol table for easy lookup

Attributes may be associated with each entry, i.e.,
    Semantic Actions
    Typing Info: id → integer
    etc.
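The two-array layout can be sketched in a few lines of Python. The class name `SymbolTable`, the NUL character used for EOS, and the `(lexptr, token)` row format are assumptions for illustration; only Insert and Lookup come from the slide.

```python
EOS = "\0"   # assumed end-of-string marker

class SymbolTable:
    def __init__(self):
        self.lexemes = ""        # all lexemes back to back, each ended by EOS
        self.entries = [None]    # entry 0 unused; rows are (lexptr, token)

    def lexeme_at(self, lexptr):
        end = self.lexemes.index(EOS, lexptr)
        return self.lexemes[lexptr:end]

    def lookup(self, s):
        """Return the index of the entry for `s`, or 0 if it is absent."""
        for p in range(1, len(self.entries)):
            if self.lexeme_at(self.entries[p][0]) == s:
                return p
        return 0

    def insert(self, s, token):
        """Append `s` to the lexemes array and return the new entry's index."""
        lexptr = len(self.lexemes)
        self.lexemes += s + EOS
        self.entries.append((lexptr, token))
        return len(self.entries) - 1

table = SymbolTable()
for word, token in [("div", "div"), ("mod", "mod"), ("count", "id"), ("i", "id")]:
    table.insert(word, token)

print(table.lookup("count"))   # 3
print(table.entries[3])        # (8, 'id')
```

Storing lexemes in one shared array avoids reserving a fixed-width field per entry, which matters when identifier lengths vary widely.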

CH2.37

A Brief Look at Code Generation

Back-end of Compilation Process - Which Will Not Be Our Emphasis

We’ll Focus on Front-end

Important Concepts to Re-emphasize

• Abstract Stack Machine for Intermediate Code Generation: (i) basic arithmetic, (ii) stack, (iii) flow control

• L-value Vs. R-value of an identifier
      I := 5 ;        L - Location
      I := I + 1 ;    R - Contents

CH2.38


A Brief Look at Code Generation

Employ Statement Templates for Code Generation. Each Template Characterizes the Translation

Different Templates for Each Major Programming Language Construct, if, while, procedure, etc.

IF

code for expr

gofalse out

code for stmt

label out

WHILE

label test

code for expr

gofalse out

code for stmt

goto test

label out
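The IF and WHILE templates can be expressed as functions that splice already-generated code between fresh labels. In this Python sketch the class name `Emitter`, the label format `L1, L2, …`, and the representation of code as lists of strings are assumptions; the instruction names gofalse, goto, and label are taken from the templates.

```python
class Emitter:
    """Builds stack-machine code for if and while from the two templates."""

    def __init__(self):
        self.next_label = 0

    def new_label(self):
        self.next_label += 1
        return f"L{self.next_label}"

    def gen_if(self, expr_code, stmt_code):
        out = self.new_label()
        return expr_code + [f"gofalse {out}"] + stmt_code + [f"label {out}"]

    def gen_while(self, expr_code, stmt_code):
        test, out = self.new_label(), self.new_label()
        return ([f"label {test}"] + expr_code + [f"gofalse {out}"]
                + stmt_code + [f"goto {test}", f"label {out}"])

emitter = Emitter()
code = emitter.gen_while(["<code for expr>"], ["<code for stmt>"])
print("\n".join(code))
# label L1
# <code for expr>
# gofalse L2
# <code for stmt>
# goto L1
# label L2
```

Generating a fresh label per construct keeps nested statements from jumping to each other's targets.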

CH2.39

Concluding Remarks / Looking Ahead

We’ve Reviewed / Highlighted Entire Compilation Process

Introduced Context-free Grammars (CFG) and Indicated / Illustrated Relationship to Compiler Theory

Reviewed Many Different Versions of Parse Trees That Assist in Both Recognition and Translation

We’ll Return to Beginning - Lexical Analysis

We’ll Explore Close Relationship of Lexical Analysis to Regular Expressions, Grammars, and Finite Automata
