Transparency No. 1
Lecture 3
Introduction to Parsing and
Top-Down Parsing
Cheng-Chia Chen
Transparency No. 2
Outlines
- Parsing overview
- Limitation of regular expressions
- Context-free grammars
- Derivation and parse trees
- Ambiguity of CFGs
- Concrete parse tree vs. abstract syntax tree
- Recursive descent parsing
- LL(1) parsing
Transparency No. 3
Parsing Overview
What is syntax ? The way in which words are put together to form phrases,
clauses, or sentences. --- Webster’s Dictionary
The function of a parser : Input: sequence of tokens from lexer
Output: parse tree of the program
Transparency No. 4
Example
Java expression:  x == y ? 1 : 2
Parser input:     ID == ID ? INT : INT
Parser output (parse tree):

         ?:
       /  |  \
     ==  INT  INT
    /  \
  ID    ID
Transparency No. 5
Comparison with Lexical Analysis
  Phase   | Input                   | Output
  Lexer   | sequence of characters  | sequence of tokens
  Parser  | sequence of tokens      | parse tree
Transparency No. 6
The Role of the Parser
Not all sequences of tokens are programs; the parser must distinguish between valid and invalid sequences of tokens.
We need:
- a language for describing valid sequences of tokens
- a method for distinguishing valid from invalid sequences of tokens
- a way to construct the parse tree from the parsing process
Transparency No. 7
Limitation of regular expression
Recall how we define lexical structures using regular expressions.
Ex (1):
  digits = [0-9]+
  sum = (digits "+")* digits
matches sums of the form 28 + 301 + 9.
Can we use the same approach to define the syntax of a language?
Transparency No. 8
Inadequacy of regular expressions
Consider how to use regular expressions to express sums with parentheses: (109+235), 61, (1+(250+34)).
A solution (2):
  digits = [0-9]+
  sum = expr "+" expr
  expr = "(" sum ")" | digits
Note the difference between (1) and (2): in (1), digits and sum can be (and actually are) treated as abbreviations of their right-hand sides (RHS); in (2), the LHS names are used recursively in their RHS, and cannot be treated as abbreviations.
Transparency No. 9
Inadequacy of regular expressions
To translate (1)
  digits = [0-9]+
  sum = (digits "+")* digits
into an FA, we first turn each definition into a plain regular expression by replacing every reference to a definition by its RHS, recursively:
  sum = ([0-9]+ "+")* [0-9]+
and then apply the normal procedure to translate regular expressions into DFAs. But for definitions with recursion, like (2), this approach does not work.
Transparency No. 10
(2):
  1. digits = [0-9]+
  2. sum = expr "+" expr
  3. expr = "(" sum ")" | digits
Replace sum in 3 by its RHS from 2:
  expr = "(" expr "+" expr ")" | digits   ---- (4)
Replace expr in (4) by itself:
  expr = "(" ("(" expr "+" expr ")" | digits) "+" ("(" expr "+" expr ")" | digits) | digits   --- (5)
Conclusion: it is hopeless to eliminate names in the RHS by finitely many substitutions if there are direct or indirect recursions.
Transparency No. 11
It can be shown that regular expressions with non-recursive abbreviations do not increase the expressive power of regular expressions, but regular expressions allowing recursive abbreviations do increase it.
The resulting formalism is called the context-free grammar, and it is just what we need for parsing.
Transparency No. 12
Redundancy induced by recursion
Inner alternation (|) is not needed:
  expr = a b (c|d) e   ==>   aux = c | d
                             expr = a b aux e
                       ==>   aux = c
                             aux = d
                             expr = a b aux e
Repetition (*) is not needed:
  expr = (a b c)*      ==>   expr = a b c expr | ε   --- right recursion
                       (or   expr = expr a b c | ε   --- left recursion)
Transparency No. 13
Context-Free Grammars
Programming language constructs have recursive structure
An EXPR is
  if EXPR then EXPR else EXPR fi, or
  while ( EXPR ) EXPR, or
  …
Context-free grammars are a natural notation for this recursive structure.
Transparency No. 14
CFGs (Cont.)
A CFG consists of:
- a set of terminals T
- a set of non-terminals N
- a start symbol S (a non-terminal)
- a set P of productions, where each rule r has the form (assuming X ∈ N):
    X → ε, or
    X → Y1 Y2 … Yn, where each Yi ∈ N ∪ T
Transparency No. 15
Notational Conventions:
In these lecture notes:
- Non-terminals are written upper-case (S, A, B, C, …)
- Terminals are written lower-case (a, b, c, …)
- X, Y, Z, … range over T ∪ N; α, β, γ, … range over strings in (T ∪ N)*
- The start symbol (S) is the left-hand side of the first production
- Each terminal symbol corresponds to a token type from the lexer
- Each non-terminal symbol corresponds to a symbol occurring on the LHS of a production rule
Transparency No. 16
Examples of CFGs
Expr → if Expr then Expr else Expr
     | while Expr do Expr
     | id
Transparency No. 17
Examples of CFGs
Simple arithmetic expressions:
E → E * E
  | E + E
  | ( E )
  | id
Transparency No. 18
The Language of a CFG
Read productions as replacement rules:
  X → Y1 … Yn
means X can be replaced by Y1 … Yn;
  X → ε
means X can be erased (replaced with the empty string).
Transparency No. 19
Key Idea
1. Begin with a string consisting of the start symbol S.
2. Replace any non-terminal X in the string by the right-hand side of some production X → Y1 … Yn.
3. Repeat (2) until there are no non-terminals in the string.
Transparency No. 20
The Language of a CFG (Cont.)
Write
  X1 X2 … Xi … Xn  →  X1 … Xi-1 Y1 … Ym Xi+1 … Xn
if there is a production
  Xi → Y1 … Ym
More formally: α X β → α γ β if X → γ ∈ P; i.e., define → to be the binary relation { (αXβ, αγβ) | X → γ ∈ P } on (T ∪ N)*.
Transparency No. 21
The Language of a CFG (Cont.)
Write X1 … Xn →* Y1 … Ym
if X1 … Xn → … → Y1 … Ym in 0 or more steps.
Formally, define →* to be the reflexive and transitive closure of the relation →; i.e., α →* β iff α = β or there exist α1, …, αn (n ≥ 1) such that α → α1 → … → αn = β.
Transparency No. 22
The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language L(G) of G is the set
  { α | S →* α, where α is a terminal string }.
Transparency No. 23
Terminals
Terminals are so called because there are no rules for replacing them
Once generated, terminals are permanent
Terminals ought to be token types of the language
Transparency No. 24
Examples
Strings of balanced parentheses. Two representations of the same grammar G:
  S → ( S )
  S → ε
or
  S → ( S ) | ε
The language generated is { (^i )^i | i ≥ 0 }.
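The membership test for this language can be coded directly from the grammar. A minimal Python sketch (function name is mine, not from the slides): a string is in the language iff it is empty, or it is "(" + an inner member + ")".

```python
def in_language(s: str) -> bool:
    """Recognize { '('*i + ')'*i : i >= 0 }, following S -> ( S ) | eps."""
    if s == "":                     # the rule S -> eps
        return True
    # the rule S -> ( S ): strip one matching outer pair and recurse
    return len(s) >= 2 and s[0] == "(" and s[-1] == ")" and in_language(s[1:-1])
```

Note that "()()" is rejected: this grammar generates only fully nested pairs, not sequences of them.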
Transparency No. 25
Arithmetic Example
Simple arithmetic expressions:
  E → E + E | E * E | ( E ) | id
Some elements of the language:
  id        id + id
  ( id )    id * id
  ( id ) * id * ( id )
Transparency No. 26
Notes
The idea of a CFG is a big step. But:
- Membership in a language is just "yes" or "no"; we also need the parse tree of the input
- Must handle errors gracefully
- Need (tools for) an implementation of CFGs
Transparency No. 27
Derivations and Parse Trees
A derivation is a sequence
  S0, S1, S2, …, Sn
of strings over terminals and non-terminals such that
1. Si → Si+1 for 0 ≤ i < n, and
2. S0 = S is the start symbol.
A derivation can be shown as the drawing of a tree:
- the start symbol is the tree's root;
- for a production X → Y1 … Yn, add children Y1 … Yn to node X.
Transparency No. 28
Derivation Example
Grammar:
  E → E + E | E * E | ( E ) | id
String:
  id * id + id
Transparency No. 29
Derivation Example (Cont.)
E
→ E + E
→ E * E + E
→ id * E + E
→ id * id + E
→ id * id + id

The resulting parse tree:

           E
         / | \
        E  +  E
      / | \    \
     E  *  E    id
     |     |
     id    id
Transparency Nos. 30-35
Derivation in Detail (1)-(6)
These six slides grow the derivation and its parse tree one step at a time:
  E → E + E → E * E + E → id * E + E → id * id + E → id * id + id
At each step, the replaced non-terminal gains the right-hand side of the applied production as children.
Transparency No. 36
Properties of parse trees.
A parse tree has:
- the start symbol at the root
- terminals or ε at the leaves
- non-terminals at the internal nodes
- if internal node X has children Y1, …, Yn, then X → Y1 Y2 … Yn is a production rule
An in-order traversal of the leaves is the original input.
The parse tree makes explicit the structure of the input string.
Transparency No. 37
Left-most and Right-most Derivations
The example below is a right-most derivation: at each step, the right-most non-terminal is replaced. There is an equivalent notion of a left-most derivation (used in the previous example).
E
→ E + E
→ E + id
→ E * E + id
→ E * id + id
→ id * id + id
Transparency Nos. 38-43
Right-most Derivation in Detail (1)-(6)
These six slides grow the right-most derivation and its parse tree step by step:
  E → E + E → E + id → E * E + id → E * id + id → id * id + id
Transparency No. 44
Derivations and Parse Trees
Note that right-most and left-most derivations have the same parse tree
The difference is the order in which branches are added
Transparency No. 45
Summary of Derivations
We are not just interested in whether s ∈ L(G); we need a parse tree for s.
A derivation defines a parse tree, but one parse tree may have many derivations.
Left-most and right-most derivations are important in parser implementation in that each can serve as the canonical derivation of a parse tree: every parse tree has a unique left-most (and a unique right-most) derivation.
Transparency No. 46
Issues
A parser consumes a sequence of tokens s and produces a parse tree.
Issues:
- How to recognize that s ∈ L(G)?
- How to generate a parse tree of s once s ∈ L(G)?
- Ambiguity: is there more than one parse tree (interpretation) for some string s?
- Error handling: what should we do if no parse tree exists for an input string?
Transparency No. 47
Ambiguity
Grammar:
  E → E + E | E * E | ( E ) | int
String:
  int * int + int
Transparency No. 48
Ambiguity (Cont.)
This string has two parse trees:

       E                      E
     / | \                  / | \
    E  +  E                E  *  E
  / | \    \               |   / | \
 E  *  E   int            int E  +  E
 |     |                      |     |
int   int                    int   int

i.e., (int * int) + int versus int * (int + int).
Transparency No. 49
Ambiguity (Cont.)
A grammar is ambiguous if it has more than one parse tree for some string; equivalently, if there is more than one right-most (or left-most) derivation for some string.
Ambiguity is bad: it leaves the meaning of some programs ill-defined.
Ambiguity is common in programming languages: arithmetic expressions, IF-THEN-ELSE.
Transparency No. 50
Dealing with Ambiguity
There are several ways to handle ambiguity
Most direct method is to rewrite the grammar unambiguously
E → T + E | T
T → int * T | int | ( E )
This enforces precedence of * over +.
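The precedence enforced by the rewritten grammar can be seen by coding it as a recursive-descent sketch that builds nested tuples. This is illustrative Python; the token spelling and function names are mine, not from the slides:

```python
# Recursive descent for the unambiguous grammar
#   E -> T + E | T
#   T -> int * T | int | ( E )
# Each function returns (tree, next_index); trees are nested tuples.
def parse_E(toks, i):
    left, i = parse_T(toks, i)
    if i < len(toks) and toks[i] == "+":       # E -> T + E
        right, i = parse_E(toks, i + 1)
        return ("+", left, right), i
    return left, i                             # E -> T

def parse_T(toks, i):
    if toks[i] == "(":                         # T -> ( E )
        e, i = parse_E(toks, i + 1)
        assert toks[i] == ")", "missing )"
        return e, i + 1
    assert toks[i] == "int", "expected int"
    if i + 1 < len(toks) and toks[i + 1] == "*":   # T -> int * T
        right, i2 = parse_T(toks, i + 2)
        return ("*", "int", right), i2
    return "int", i + 1                        # T -> int
```

Because * is parsed inside T, it nests below +, matching the precedence the grammar enforces.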
Transparency No. 51
Ambiguity: The Dangling Else
Consider the grammar
  E → if E then E
    | if E then E else E
    | OTHER
This grammar is also ambiguous.
Transparency No. 52
The Dangling Else: Example
The expression
  if E1 then if E2 then E3 else E4
has two parse trees:
  if E1 then (if E2 then E3 else E4)    --- else matches the closest then
  if E1 then (if E2 then E3) else E4    --- else matches the outer then
• Typically we want the form in which else matches the closest then.
Transparency No. 53
The Dangling Else: A Fix
else matches the closest unmatched then. We can describe this in the grammar:
  E → MIF            /* all then are matched */
    | UIF            /* some then are unmatched */
  MIF → if E then MIF else MIF
      | OTHER
  UIF → if E then E
      | if E then MIF else UIF
This describes the same set of strings.
Transparency No. 54
The Dangling Else: Example Revisited
The expression if E1 then if E2 then E3 else E4 now has a unique parse tree:
  if E1 then (if E2 then E3 else E4)
  --- a valid parse tree: the outer if is a UIF whose body is a MIF
  if E1 then (if E2 then E3) else E4
  --- not valid: the then-expression would have to be a MIF, but if E2 then E3 is a UIF
Transparency No. 55
Ambiguity
There are no general techniques for handling ambiguity, and it is impossible to convert an ambiguous grammar to an unambiguous one automatically. However, used with care, ambiguity can simplify the grammar and sometimes allows more natural definitions; we then need disambiguation mechanisms.
Transparency No. 56
Precedence and Associativity Declarations
Instead of rewriting the grammar:
- use the more natural (ambiguous) grammar
- along with disambiguating declarations
Most tools allow precedence and associativity declarations to disambiguate grammars. Examples follow.
Transparency No. 57
Associativity Declarations
Consider the grammar
  E → E - E | int
It is ambiguous: int - int - int has two parse trees,
  (int - int) - int    and    int - (int - int).
• Left-associativity declaration: %left + -
This selects the left-associative tree, (int - int) - int.
Transparency No. 58
Precedence Declarations
Consider the grammar
  E → E + E | E * E | int
and the string int + int * int. It has two parse trees,
  (int + int) * int    and    int + (int * int).
• Precedence declarations:
    %left + -
    %left * /
Declaring * and / after + and - gives them higher precedence, selecting int + (int * int).
Transparency No. 59
Abstract Syntax Trees
So far, the parser traces the derivation of a sequence of tokens to generate a parse tree. Such a tree is called a concrete parse tree, since it reflects the full syntactic structure of the input. The rest of the compiler needs a structure more suitable for representing the program: the abstract syntax tree, abbreviated AST, which is like a parse tree but ignores some details.
Transparency No. 60
Abstract Syntax Tree. (Cont.)
Consider the grammar
  E → int | ( E ) | E + E
and the string 5 + (2 + 3).
After lexical analysis (a list of tokens):
  int5 '+' '(' int2 '+' int3 ')'
During parsing we build a parse tree …
Transparency No. 61
Example of Parse Tree
The parse tree of 5 + (2 + 3):

          E
        / | \
       E  +  E
       |   / | \
     int5 (  E  )
           / | \
          E  +  E
          |     |
        int2  int3

Characteristics of the parse tree:
- traces the operation of the parser
- does capture the nesting structure
- but contains too much information: parentheses, single-successor nodes
Transparency No. 62
Example of Abstract Syntax Tree
The AST also captures the nesting structure, but abstracts from the concrete syntax, so it is more compact and easier to use. It is an important data structure in a compiler.

    PLUS
   /    \
  5     PLUS
       /    \
      2      3
Transparency No. 63
Constructing An AST
We first define the AST class hierarchy: ASTNode, with subclasses IntNode and PlusNode.
Consider an abstract tree type with two constructors:
  new IntNode(n)       =  the leaf n
  new PlusNode(T1, T2) =  the tree
                              PLUS
                             /    \
                            T1    T2
Transparency No. 64
Semantic Actions (Syntax-directed definition )
This is what we’ll use to construct ASTs
Each grammar symbol may have attributes; for terminal symbols (lexical tokens), attributes can be calculated by the lexer.
Each production may have an action, written as
  X → Y1 … Yn   { action }
that can refer to or compute symbol attributes.
Transparency No. 65
Constructing an AST
We define an attribute ast for non-terminals; the values of ast attributes are ASTs. We assume that int.lexval is the value of the integer lexeme. The attribute is computed using semantic actions:
  E → int        E.ast = new IntNode(int.lexval)
    | E1 + E2    E.ast = new PlusNode(E1.ast, E2.ast)
    | ( E1 )     E.ast = E1.ast
Transparency No. 66
Parse Tree Example
Consider the string int5 '+' '(' int2 '+' int3 ')'. A bottom-up evaluation of the ast attribute gives:
  E.ast = new PlusNode(new IntNode(5),
                       new PlusNode(new IntNode(2), new IntNode(3)))
which is the AST

    PLUS
   /    \
  5     PLUS
       /    \
      2      3
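A Python sketch of these AST classes; the eval method is my addition for illustration (the slides only build the tree):

```python
# The AST class hierarchy from the slides: IntNode and PlusNode.
class IntNode:
    def __init__(self, n):
        self.n = n
    def eval(self):            # illustrative: interpret the leaf
        return self.n

class PlusNode:
    def __init__(self, t1, t2):
        self.t1, self.t2 = t1, t2
    def eval(self):            # illustrative: interpret the addition
        return self.t1.eval() + self.t2.eval()

# E.ast for "5 + (2 + 3)" as computed bottom-up by the semantic actions:
ast = PlusNode(IntNode(5), PlusNode(IntNode(2), IntNode(3)))
```

Note that the parentheses of the input leave no trace in the AST; only the nesting survives.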
Transparency No. 67
Review
We can specify language syntax using a CFG. A parser answers whether s ∈ L(G) and builds a parse tree, which we convert to an AST and pass on to the rest of the compiler.
Next lectures: how do we answer s ∈ L(G) and build a parse tree?
Transparency No. 68
Introduction to Top-Down Parsing
Terminals are seen in order of appearance in the token stream:
  t2 t5 t6 t8 t9
The parse tree is constructed from the top, and from left to right (internal nodes are numbered in order of creation):

        1
       / \
     t2   3
         / \
        4   7
       / \  / \
     t5  t6 t8 t9
Transparency No. 69
Recursive Descent Parsing
Intuitions: for a set of productions with the same LHS X,
  X → Y1 Y2 … Yn | …
- non-terminal X        => procedure definition: X() {…}
- non-terminal Y in RHS => procedure call: Y()
- terminal b in RHS     => match(b) : boolean
- concatenation Y Z     => sequence: Y(); Z()
- choice Y1 Y2 | Z1 Z2  => if (?) { Y1(); Y2(); } else if (?) { Z1(); Z2(); }
  using one or more lookahead symbols to help decide the branch.
Ex: E → T + E | T
  E() { if (?) { T(); match(+); E(); } else T(); }
Transparency No. 70
Recursive Descent Parsing. Example (Cont.)
Consider the grammar
  E → T + E | T
  T → int | int * T | ( E )
Start with the top-level non-terminal E. The token stream is: int5 * int2.
Try the rules for E in order. Try E0 → T1 + E2:
- Try a rule for T1: ( E3 ). But ( does not match the input token int5.
- Try T1 → int. The token matches, but + after T1 does not match the input token *.
- Try T1 → int * T2. This will match, but + after T1 will be unmatched.
We have exhausted the choices for T1; backtrack to the choice for E0.
Transparency No. 71
Recursive Descent Parsing. Example (Cont.)
Try E0 → T1 and follow the same steps as before for T1.
This time we succeed with T1 → int * T2 and T2 → int, resulting in the parse tree

       E0
       |
       T1
     / | \
  int5 *  T2
          |
        int2
Transparency No. 72
Recursive Descent Parsing. Notes.
Easy to implement by hand
But does not always work …
Implementation of a Recursive Descent Parser
Transparency No. 74
A Recursive Descent Parser. Preliminaries
Let Token be the type of all token objects.
  public class Token {
    int type;                       // token type
    public final static int INT = 1,
      OPEN = 2, CLOSE = 3, PLUS = 4, TIMES = 5,
      … ;                           // constants for all token types
    …                               // other fields omitted
    // the global tok points to the next token to be matched
    public static Token tok;
    // next() returns the token following 'this' from the lexer
    public Token next() {…}
    public static void advance() { tok = tok.next(); }
    public static void eat(int type) {
      if (tok.type == type) advance(); else error();
    }
    …
  }
Transparency No. 75
A Recursive Descent Parser (2)
Define boolean functions that check the input token stream for a match of:
- a given token type (terminal):
    bool term(int type) {
      if (tok.type == type) { advance(); return true; }
      return false;
    }
- a given production of S (the nth rule): bool Sn() { … }   // 'and' the tests inside the body
- any production of a non-terminal S: bool S() { … }        // 'or' the tests inside the body
These functions eat tokens.
Transparency No. 76
A Recursive Descent Parser (3)
For production E → T + E:
  bool E1() { return T() && term(PLUS) && E(); }
For production E → T:
  bool E2() { return T(); }
For all productions of E (with backtracking):
  bool E() {
    Token save = Token.tok;
    if (E1()) return true;
    Token.tok = save;    // E1() failed: restore the saved token and try the next rule
    if (E2()) return true;
    …
    Token.tok = save;    // En-1() failed: restore and try the last rule
    return En();
  }
Transparency No. 77
A Recursive Descent Parser (4)
Functions for non-terminal T:
  bool T1() { return term(OPEN) && E() && term(CLOSE); }
  bool T2() { return term(INT) && term(TIMES) && T(); }
  bool T3() { return term(INT); }
  bool T() {
    Token save = Token.tok;
    if (T1()) return true;
    Token.tok = save;    // T1() failed: backtrack and try the next rule
    if (T2()) return true;
    Token.tok = save;    // T2() failed: backtrack and try the last rule
    return T3();
  }
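The same backtracking parser can be sketched in Python, replacing the Token class's global pointer with a global index into the token list (structure otherwise follows the slides):

```python
# Backtracking recursive descent for
#   E -> T + E | T        T -> ( E ) | int * T | int
toks, pos = [], 0

def term(t):
    """Match one terminal; advance on success."""
    global pos
    if pos < len(toks) and toks[pos] == t:
        pos += 1
        return True
    return False

def E1(): return T() and term("+") and E()    # E -> T + E
def E2(): return T()                          # E -> T

def E():
    global pos
    save = pos
    if E1(): return True
    pos = save                                # backtrack, try the next rule
    return E2()

def T1(): return term("(") and E() and term(")")    # T -> ( E )
def T2(): return term("int") and term("*") and T()  # T -> int * T
def T3(): return term("int")                        # T -> int

def T():
    global pos
    save = pos
    if T1(): return True
    pos = save
    if T2(): return True
    pos = save
    return T3()

def parse(tokens):
    global toks, pos
    toks, pos = tokens, 0
    return E() and pos == len(toks)
```

As the slides note, this style does not always work; the sketch only mirrors the example's local backtracking.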
Transparency No. 78
Recursive Descent Parsing. Notes.
To start the parser:
1. invoke Token.init(lexer), a static init(Lexer lexer) {…} inside Token that sets tok to the first token;
2. invoke E().
Notice how this simulates the previous backtracking example.
It is easy to implement by hand, but it does not always work and is not efficient.
Predictive parsing (without backtracking) is better.
Transparency No. 79
Recursive descent parser with lookahead
Recursive descent parsers are unpopular because of inefficient backtracking. In practice:
1. Use lookahead symbols to predict which alternative rule to match, and thus avoid some backtracking (though this alone cannot avoid all backtracking!).
2. Restrict the grammar to a specific form so that backtracking is not needed.
Transparency No. 80
Example
Grammar 3.11:
  S → if E then S else S
  S → begin S L
  S → print E
  L → end
  L → ; S L
  E → num = num
The normal procedure for S() with backtracking:
  bool S() {
    Token save = Token.tok;
    if (S1()) return true;  Token.tok = save;
    if (S2()) return true;  Token.tok = save;
    return S3();
  }
Observations:
- We can avoid unnecessary tries of S1() and S2() if we know Token.tok is a 'print' token.
- S1() and S2() can also be expanded in-line in S().
- A switch(Token.tok.type) {…} construct can be used to determine which Si() to try.
Transparency No. 81
The Improvement (no backtracking)
void S() {
  switch (Token.tok.type) {
    case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
    case BEGIN: eat(BEGIN); S(); L(); break;
    case PRINT: eat(PRINT); E(); break;
    default:    error();
  }
}
void L() {
  switch (Token.tok.type) {
    case END:  eat(END); break;
    case SEMI: eat(SEMI); S(); L(); break;
    default:   error();
  }
}
void E() { eat(NUM); eat(EQ); eat(NUM); }
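A Python transliteration of this no-backtracking parser for Grammar 3.11 (using string tokens instead of the slides' token-type constants):

```python
# Predictive parser for Grammar 3.11:
#   S -> if E then S else S | begin S L | print E
#   L -> end | ; S L
#   E -> num = num
toks, pos = [], 0

def eat(t):
    global pos
    if pos >= len(toks) or toks[pos] != t:
        raise SyntaxError(f"expected {t}")
    pos += 1

def S():
    if toks[pos] == "if":
        eat("if"); E(); eat("then"); S(); eat("else"); S()
    elif toks[pos] == "begin":
        eat("begin"); S(); L()
    elif toks[pos] == "print":
        eat("print"); E()
    else:
        raise SyntaxError(toks[pos])

def L():
    if toks[pos] == "end":
        eat("end")
    elif toks[pos] == ";":
        eat(";"); S(); L()
    else:
        raise SyntaxError(toks[pos])

def E():
    eat("num"); eat("="); eat("num")

def parse(tokens):
    global toks, pos
    toks, pos = tokens, 0
    S()
    return pos == len(toks)
```

One lookahead token selects the rule, so no token position is ever saved or restored.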
Transparency No. 82
Try with another grammar, Grammar 3.10:
  S → E $        // $ is EOF (the end-of-file marker)
  E → E + T | E - T | T
  T → T * F | T / F | F
  F → id | num | ( E )
The translated program:
  void S() { E(); eat(EOF); }
  void E() {
    switch (Token.tok.type) {
      case ?: E(); eat(PLUS); T(); break;
      case ?: E(); eat(MINUS); T(); break;
      case ?: T();
    }
  }
Problem: there is no apparent terminal which E() can use to decide which clause to proceed with. In fact, all clauses are possible: e.g., (2) + 3, (2) - 3, 2 * 3.
Transparency No. 83
void T() {
  switch (tok.type) {
    case ?: T(); eat(TIMES); F(); break;
    case ?: T(); eat(DIV); F(); break;
    case ?: F();
  }
}
void F() {
  switch (tok.type) {
    case ID:     eat(ID); break;
    case NUM:    eat(NUM); break;
    case LPAREN: eat(LPAREN); E(); eat(RPAREN);
  }
}
When will a predictive parser not work? When there are multiple production rules for a non-terminal starting with the same first terminal symbol. This can sometimes be resolved by factoring out common left parts.
Transparency No. 84
When Will Recursive Descent Not Work
Consider a production S → S a. In the process of parsing S we try the above rule:
  bool S() { return S() && term(a); }    or    void S() { S(); eat(a); }
What goes wrong? S() calls itself immediately without consuming any input, so it recurses forever.
A left-recursive grammar has a non-terminal S with S →+ S α for some α. Recursive descent does not work in such cases.
Transparency No. 85
Elimination of Left Recursion
Consider the left-recursive grammar
  S → S α | β
S generates all strings starting with a β and followed by any number of α's, i.e., β α*.
We can rewrite this using right recursion:
  S → β S'
  S' → α S' | ε
Transparency No. 86
More Elimination of Left-Recursion
In general,
  S → S α1 | … | S αn | β1 | … | βm
All strings derived from S start with one of β1, …, βm and continue with several instances of α1, …, αn, i.e., (β1 | … | βm)(α1 | … | αn)*.
Rewrite as
  S → β1 S' | … | βm S'
  S' → α1 S' | … | αn S' | ε
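This rewriting is mechanical. A Python sketch for immediate left recursion (the rule representation, symbol tuples with () standing for ε, is my choice):

```python
# Eliminate immediate left recursion for one non-terminal.
# Rules  S -> S a1 | ... | S an | b1 | ... | bm   become
#   S  -> b1 S' | ... | bm S'
#   S' -> a1 S' | ... | an S' | eps   (eps is the empty tuple)
def eliminate_left_recursion(nt, rules):
    rec  = [r[1:] for r in rules if r and r[0] == nt]   # the alpha_i
    base = [r for r in rules if not r or r[0] != nt]    # the beta_j
    nt2 = nt + "'"                                      # fresh non-terminal S'
    return {
        nt:  [b + (nt2,) for b in rec and base or base],
        nt2: [a + (nt2,) for a in rec] + [()],
    } if rec else {nt: rules}                           # nothing to do if not left-recursive
```

For the slides' S → S α | β (written with concrete symbols "a" and "b" here), the result is S → b S' and S' → a S' | ε.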
Transparency No. 87
General Left Recursion
The grammar
  S → A α | δ
  A → S β
is also left-recursive, because S →+ S β α.
This (indirect) left recursion can also be eliminated; see [Dragon book, Section 4.3] for the general algorithm.
Transparency No. 88
Summary of Recursive Descent
- Simple and general parsing strategy
- Left recursion must be eliminated first … but that can be done automatically
- Unpopular because backtracking is thought to be too inefficient
- In practice, backtracking is eliminated by restricting the grammar
Transparency No. 89
Predictive Parsers
Table-based: encode the grammar rules in a parsing table instead of program code.
Like recursive descent, but the parser can "predict" which production to use by looking at the next few tokens, with no backtracking.
Predictive parsers accept LL(k) grammars:
- the 1st L means "left-to-right" scan of the input
- the 2nd L means "leftmost derivation"
- k means "predict based on at most k tokens of lookahead"
In practice, LL(1) is used.
Transparency No. 90
LL(1) Languages
In recursive descent, for each non-terminal and input token there may be a choice of productions.
LL(1) means that for each non-terminal and token there is at most one production.
This can be specified as a 2D table:
- one dimension for the current non-terminal to expand
- one dimension for the next token
- a table entry contains zero or one production
Ex:
  S → if E then S else S   --- (1)
  S → begin S L            --- (2)
  S → print E              --- (3)
Then table[S][if] = (1), table[S][begin] = (2), etc.
Transparency No. 91
Predictive Parsing and Left Factoring
Recall the grammar
  E → T + E | T
  T → int | int * T | ( E )
It is hard to predict because:
- for T, two productions start with int
- for E, it is not clear how to predict which rule to use
A grammar must be left-factored before it can be used for predictive parsing.
Transparency No. 92
Left-Factoring Example
Recall the grammar
  E → T + E | T
  T → int | int * T | ( E )
• Factor out common prefixes of productions:
  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε
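One step of left factoring can likewise be mechanized. An illustrative Python sketch (the fresh-name scheme and rule representation are mine):

```python
from collections import defaultdict

# One step of left factoring: group a non-terminal's productions by their
# first symbol; any group sharing a prefix symbol gets a fresh non-terminal.
def left_factor(nt, rules):
    groups = defaultdict(list)
    for r in rules:                          # rules are tuples; () is eps
        groups[r[0] if r else ()].append(r)
    out, fresh = {nt: []}, 0
    for head, rs in groups.items():
        if len(rs) == 1:
            out[nt].append(rs[0])            # unique prefix: keep as is
        else:
            fresh += 1
            nt2 = f"{nt}{fresh}"             # fresh non-terminal, e.g. T1
            out[nt].append((head, nt2))      # nt -> head nt2
            out[nt2] = [r[1:] for r in rs]   # nt2 -> the factored tails
    return out
```

Applied to T → int | int * T | ( E ), this yields T → int T1 | ( E ) with T1 → ε | * T, matching the slide's int Y factoring up to renaming.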
Transparency No. 93
LL(1) Parsing Table Example
The left-factored grammar:
  E → T X
  X → + E | ε
  T → ( E ) | int Y
  Y → * T | ε
The LL(1) parsing table:

        int      *      +      (      )      $
  E     T X                    T X
  X                    + E            ε      ε
  T     int Y                  ( E )
  Y              * T   ε              ε      ε
Transparency No. 94
LL(1) Parsing Table Example (Cont.)
Consider the [E, int] entry: "when the current non-terminal is E and the next input is int, use production E → T X". This production can generate an int in the first position: T X →* int …
Consider the [Y, +] entry: "when the current non-terminal is Y and the current token is +, get rid of Y" (i.e., use Y → ε). Y can be followed by + only in a derivation in which … X Y + … →* … X + …
Transparency No. 95
LL(1) Parsing Tables. Errors
Blank entries indicate error situations. Consider the [E, *] entry: "there is no way to derive a string starting with * from non-terminal E".
Transparency No. 96
Using Parsing Tables
The method is similar to recursive descent, except that for each non-terminal S we look at the next token a and choose the production shown at [S, a].
We use a stack to keep track of pending non-terminals.
We reject when we encounter an error entry, and accept when the stack is empty.
Transparency No. 97
LL(1) Parsing Algorithm
initialize: stack = <S $>, tok = (pointer to the next token)
repeat
  case stack of
    <X, rest> : if T[X, tok.type] == Y1…Yn
                then stack ← <Y1 … Yn rest>
                else error();
    <t, rest> : if t == tok.type
                then { stack ← <rest>; advance(); }
                else error();
until stack == < >
Transparency No. 98
LL(1) Parsing Example
  Stack           Input          Action
  E $             int * int $    E → T X
  T X $           int * int $    T → int Y
  int Y X $       int * int $    match int
  Y X $           * int $        Y → * T
  * T X $         * int $        match *
  T X $           int $          T → int Y
  int Y X $       int $          match int
  Y X $           $              Y → ε
  X $             $              X → ε
  $               $              ACCEPT
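The trace above can be reproduced with a small table-driven LL(1) driver in Python; the table is transcribed from the slide, with ε entries as empty lists:

```python
# LL(1) table for:  E -> T X ; X -> + E | eps ; T -> ( E ) | int Y ; Y -> * T | eps
TABLE = {
    ("E", "int"): ["T", "X"],   ("E", "("): ["T", "X"],
    ("X", "+"):   ["+", "E"],   ("X", ")"): [],  ("X", "$"): [],
    ("T", "int"): ["int", "Y"], ("T", "("): ["(", "E", ")"],
    ("Y", "*"):   ["*", "T"],   ("Y", "+"): [],  ("Y", ")"): [], ("Y", "$"): [],
}
NONTERMS = {"E", "X", "T", "Y"}

def ll1_parse(tokens):
    stack, toks, i = ["$", "E"], tokens + ["$"], 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            rhs = TABLE.get((top, toks[i]))
            if rhs is None:                 # blank entry: error
                return False
            stack.extend(reversed(rhs))     # push RHS, leftmost symbol on top
        else:
            if top != toks[i]:              # terminal must match the input
                return False
            i += 1
    return i == len(toks)
```

Running it on ["int", "*", "int"] performs exactly the stack moves of the slide's trace.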
Transparency No. 99
Constructing Parsing Tables
LL(1) languages are those with an LL(1) parsing table: no table entry may contain more than one production.
How do we generate the parsing table from a CFG?
Transparency No. 100
Constructing Parsing Tables (Cont.)
If A → α, in which columns of the row of A do we place α? I.e., when is table[A][b] = α?
1. In the column of b if b can start a string derived from α: α →* b β. We say that b ∈ First(α).
2. In the column of b if α can reduce to ε and b can follow an A: S →* β A b δ. We say that b ∈ Follow(A).
Transparency No. 101
Computing Nullable Symbols
Definition:
A non-terminal X is nullable if it can derive the empty string: X →* ε.
Given a grammar G, the set Nullable(G) of nullable symbols can be defined recursively as follows:
1. Basis: X is nullable if X → ε is a rule of G.
2. Recursion: if Y1, Y2, …, Yk (k > 0) are nullable and X → Y1 Y2 … Yk is a rule, then X is nullable.
Transparency No. 102
Example
Let G be:
  S → A C A
  A → a A a | B | C
  B → b B | b
  C → c C | ε
By (1), C is nullable                --- (3)
By (2) and (3), A is nullable        --- (4)
By (2), (3), (4), S is nullable      --- (5)
By (1)-(5), no further nullable symbol can be found.
Hence Nullable(G) = {A, C, S}.
Transparency No. 103
For each non-terminal X define a number order(X) as follows:
- order(X) = 0 if X → ε is a rule of G;
- if X → Y1 … Yk is a rule with max{ order(Yi) | i = 1, …, k } = t, then order(X) = t + 1, provided there is no other rule X → Z1 … Zm with max{ order(Zi) | i = 1, …, m } < t.
Ex: in the previous example, order(C) = 1, order(A) = 2, order(S) = 3.
Theorem:
1. X is nullable iff order(X) is defined.
2. order(X) ≤ the number of non-terminals.
Transparency No. 104
Algorithm for computing Nullable(G)
Input: a CFG G. Output: NG, the set of all nullable symbols.
  NG = {};
  1. for every rule r: if r = X → ε then NG = NG ∪ {X};
  2. repeat
       NG' = NG;
       for each rule r:
         if r = X → Y1 Y2 … Yk and {Y1, …, Yk} ⊆ NG', then NG = NG ∪ {X};
     until NG = NG'   // no change of NG in the iteration
Correctness:
1. Step 1 computes all nullables of order 0.
2. The kth iteration of step 2 computes the nullables of order k.
3. Since all nullables are of finite order, the program must terminate.
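A Python sketch of this fixed-point computation, checked against the example grammar G from the previous slide (a grammar is a dict mapping each non-terminal to a list of right-hand sides, written as symbol tuples with () for ε):

```python
# Fixed-point computation of Nullable(G).
def nullable(grammar):
    ng = set()
    changed = True
    while changed:                       # one pass per "order" level
        changed = False
        for x, rhss in grammar.items():
            if x in ng:
                continue
            for rhs in rhss:
                if all(y in ng for y in rhs):   # () vacuously satisfies this
                    ng.add(x)
                    changed = True
                    break
    return ng

G = {
    "S": [("A", "C", "A")],
    "A": [("a", "A", "a"), ("B",), ("C",)],
    "B": [("b", "B"), ("b",)],
    "C": [("c", "C"), ()],
}
```

Each pass of the while loop adds exactly the nullables of the next order, mirroring the correctness argument on the slide.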
Transparency No. 105
Computing First Sets
Definition: First(X) = { b | X →* b α }. First(X) can be defined recursively:
1. Basis: First(b) = {b} for a terminal b.
2. Recursion: if X → A1 A2 … An Y α is a rule and A1, …, An are all nullable, then First(Y) ⊆ First(X).
We can extend First(-) to strings: First(α) = { b | α →* b β }, where α is a sequence of symbols. Then
  First(Y1 … Yk) = ∪ { First(Yj) | Y1, …, Yj-1 are all nullable }.
Notes: A, B, C are non-terminals; a, b, c are terminals; X, Y, Z, … are terminals or non-terminals; α, β are strings of terminals or non-terminals.
Transparency No. 106
First Sets. Example
Recall the grammar
  E → T X            X → + E | ε
  T → ( E ) | int Y  Y → * T | ε
Nullable set: {X, Y}.
First sets:
  First( ( )   = { ( }     First( T ) = { int, ( }
  First( ) )   = { ) }     First( E ) = First(T) = { int, ( }
  First( int ) = { int }   First( X ) = { + }
  First( + )   = { + }     First( Y ) = { * }
  First( * )   = { * }
Transparency No. 107
Algorithm for computing First(X)
Input: a CFG G. Output: First(X) for all symbols X.
  1. for each terminal b: First(b) = {b};
  2. repeat
       for each rule X → Y1 Y2 … Yk (k > 0):
         2.1 let t be the index of the first non-nullable symbol Yt on the RHS,
             or k+1 if all symbols are nullable;
         2.2 for j = 1 to min(t, k):
               First(X) = First(X) ∪ First(Yj);
     until no First(-) set changes in the iteration.
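A Python sketch of the First fixed point, checked against the left-factored grammar's First sets from the earlier slide (ε is handled through the nullable set rather than stored in First):

```python
# Fixed-point computation of First for all symbols.
def first_sets(grammar, terminals, null):
    first = {t: {t} for t in terminals}          # basis: First(b) = {b}
    first.update({x: set() for x in grammar})
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                for y in rhs:                    # scan RHS left to right
                    before = len(first[x])
                    first[x] |= first[y]
                    changed |= len(first[x]) != before
                    if y not in null:            # stop at first non-nullable
                        break
    return first

# Left-factored grammar: E -> T X ; X -> + E | eps ; T -> ( E ) | int Y ; Y -> * T | eps
G = {
    "E": [("T", "X")],
    "X": [("+", "E"), ()],
    "T": [("(", "E", ")"), ("int", "Y")],
    "Y": [("*", "T"), ()],
}
FIRST = first_sets(G, {"+", "*", "(", ")", "int"}, {"X", "Y"})
```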
Transparency No. 108
Computing Follow Sets
Definition: Follow(X) = { b | S →* β X b δ }.
Intuition, for a rule X → A B:
- First(B) ⊆ Follow(A), and Follow(X) ⊆ Follow(B)
- also, if B →* ε, then Follow(X) ⊆ Follow(A)
- if S is the start symbol, then $ ∈ Follow(S)
Recursive definition:
Basis: 1. $ ∈ Follow(S).  2. if X → … Y α, then First(α) ⊆ Follow(Y).
Recursion: 3. if X → … A α and α is nullable, then Follow(X) ⊆ Follow(A).
Note: rules 1 and 2 account for following siblings, while rule 3 accounts for following "cousins".
Transparency No. 109
Computing Follow Sets (Cont.)
Algorithm:
  0. Follow(S) = {$}; Follow(X) = {} for X ≠ S;
  1. for each rule X → Y1 Y2 … Yk (k > 0):
       for i = 1 .. k-1:
         for j = i+1 .. k:
           Follow(Yi) = Follow(Yi) ∪ First(Yj);
           if Yj is not nullable then break;
  2. repeat
       for each production A → X1 X2 … Xn:
         for k = n .. 1:
           Follow(Xk) = Follow(Xk) ∪ Follow(A);
           if Xk is not nullable then break;
     until no Follow(-) set changes in the iteration.
Transparency No. 110
Follow Sets. Example
Recall the grammar
  E → T X            X → + E | ε
  T → ( E ) | int Y  Y → * T | ε
Follow sets:
  Follow( + )   = { int, ( }   Follow( * ) = { int, ( }
  Follow( ( )   = { int, ( }   Follow( ) ) = { +, ), $ }
  Follow( int ) = { *, +, ), $ }
  Follow( X )   = { ), $ }     Follow( T ) = { +, ), $ }
  Follow( E )   = { ), $ }     Follow( Y ) = { +, ), $ }
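A Python sketch of the Follow fixed point, computing the non-terminal Follow sets (which are what the parsing table needs); the First sets are inlined as literals from the earlier slide:

```python
# Fixed-point computation of Follow for all non-terminals.
def follow_sets(grammar, start, first, null):
    follow = {x: set() for x in grammar}
    follow[start].add("$")                    # $ is in Follow(S)
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                for i, y in enumerate(rhs):
                    if y not in follow:       # terminal: skip here
                        continue
                    before = len(follow[y])
                    for z in rhs[i + 1:]:     # First of what can follow y
                        follow[y] |= first[z]
                        if z not in null:
                            break
                    else:                     # rest of the RHS is nullable
                        follow[y] |= follow[x]
                    changed |= len(follow[y]) != before
    return follow

G = {
    "E": [("T", "X")],
    "X": [("+", "E"), ()],
    "T": [("(", "E", ")"), ("int", "Y")],
    "Y": [("*", "T"), ()],
}
FIRST = {"E": {"int", "("}, "T": {"int", "("}, "X": {"+"}, "Y": {"*"},
         "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "int": {"int"}}
FOLLOW = follow_sets(G, "E", FIRST, {"X", "Y"})
```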
Transparency No. 111
Constructing LL(1) Parsing Tables
To construct a parsing table T for a CFG G:
For each production A → α in G:
- for each terminal b ∈ First(α): T[A, b] = α
- if α is nullable, then for each b ∈ Follow(A): T[A, b] = α
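The table-construction rule can be sketched directly in Python; the First and Follow sets are inlined from the previous slides:

```python
# First of a sequence of symbols, plus whether the whole sequence is nullable.
def first_of_seq(rhs, first, null):
    out = set()
    for z in rhs:
        out |= first[z]
        if z not in null:
            return out, False
    return out, True                    # every symbol was nullable

# Build T[A, b] per the slide: b in First(alpha), and Follow(A) if nullable.
def build_table(grammar, first, follow, null):
    table = {}
    for a, rhss in grammar.items():
        for rhs in rhss:
            fs, eps = first_of_seq(rhs, first, null)
            for b in fs:
                table[(a, b)] = rhs
            if eps:
                for b in follow[a]:
                    table[(a, b)] = rhs
    return table

G = {"E": [("T", "X")], "X": [("+", "E"), ()],
     "T": [("(", "E", ")"), ("int", "Y")], "Y": [("*", "T"), ()]}
FIRST = {"E": {"int", "("}, "X": {"+"}, "T": {"int", "("}, "Y": {"*"},
         "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "int": {"int"}}
FOLLOW = {"E": {")", "$"}, "X": {")", "$"},
          "T": {"+", ")", "$"}, "Y": {"+", ")", "$"}}
LL1 = build_table(G, FIRST, FOLLOW, {"X", "Y"})
```

The result reproduces the LL(1) table shown earlier, with the ε entries (empty tuples) placed under Follow of X and Y.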
Transparency No. 112
Notes on LL(1) Parsing Tables
If any entry contains multiple rules, then G is not LL(1). This happens if G is ambiguous, if G is left-recursive, if G is not left-factored, and in other cases as well.
Most programming language grammars are not LL(1). There are tools that build LL(1) tables.
Transparency No. 113
Summary
For some grammars there is a simple parsing strategy Predictive parsing
Next time: a more powerful parsing strategy