syntax analysis (cont.) manas thakur

CS502: Compiler Design

Syntax Analysis (Cont.)

Manas Thakur

Fall 2020


What next?

● Bottom-up parsing

● Why?

– BU parsers are more powerful than TD parsers

● Cover more kinds of grammars (e.g., no need to eliminate left recursion)

– More efficient as well

● Bad news: Slightly more complicated

● Good news: Well known parser generators exist

Seems like parsing would never end!


Bottom-Up Parsing

● Given a string, construct a parse tree by starting at the leaves and walking up to the root.

● The process is called reduction.

– Reduce a string w to the start symbol of the grammar.

– Recall derivation from top-down parsing?


Reduction

● At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of the production.

● Basically we are constructing a rightmost derivation in reverse!

● How to decide which substring to reduce?

Grammar:
E → E+T | T
T → T*F | F
F → id

Reduction steps for id * id:
id * id ⇒ F * id ⇒ T * id ⇒ T * F ⇒ T ⇒ E

Productions used, in order: F → id, T → F, F → id, T → T*F, E → T


Handle pruning

● A handle is a substring that matches the body of a production, and reducing this handle represents one step of reduction.

● Theorem: If G is unambiguous, then every right-sentential form has a unique handle.

● Notice that we said “a handle” instead of “the handle”. Why?

● BU parsing is essentially the problem of handle pruning.

Right Sentential Form | Handle | Reducing Production
id1 * id2             | id1    | F → id
F * id2               | F      | T → F
T * id2               | id2    | F → id
T * F                 | T * F  | T → T * F
T                     | T      | E → T


Shift-Reduce Parsing

● Uses a stack to perform bottom-up parsing

● Four actions:

– Shift: shift the next input symbol on top of stack

– Reduce: pop handle off the stack and push the corresponding non-terminal

– Accept: parsing successful

– Error: parsing failed

● The standard scheme used by parsers for LR grammars.

– L: left-to-right scanning of the input

– R: rightmost derivation (constructed in reverse)


LR Parsing Example

● A table guides the actions, based on the top of the stack and the next input symbol.

Stack     | Input       | Action
$         | id1 * id2 $ | shift
$ id1     | * id2 $     | reduce by F → id
$ F       | * id2 $     | reduce by T → F
$ T       | * id2 $     | shift
$ T *     | id2 $       | shift
$ T * id2 | $           | reduce by F → id
$ T * F   | $           | reduce by T → T * F
$ T       | $           | reduce by E → T
$ E       | $           | accept

The job of every LR parser construction algorithm is to construct the “action” table.


LR Parsing Algorithms

● Simple LR or SLR

– Smallest class of grammars

– Smallest tables

– Simple, fast construction

● Canonical LR or CLR

– Largest set of grammars

– Largest tables

– Slow construction

● LookAhead LR or LALR

– Intermediate set of grammars

– Same number of states as SLR

– Faster construction than CLR


LR(k) Items

● An LR(k) item is a pair [α, β], where

– α is a production with a • at some position in the RHS, marking how much of the RHS has been seen

– β is a lookahead string containing k symbols (terminals or $)

● Two cases of interest:

– LR(0) items for SLR table construction

– LR(1) items for CLR and LALR table construction


Example of LR(0) items

● A → XYZ generates four LR(0) items:

– [A → •XYZ]

– [A → X•YZ]

– [A → XY•Z]

– [A → XYZ•]

● [A → •XYZ] indicates that the parser is looking for a string that can be derived from XYZ

● [A → XY•Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z


CLOSURE

● Given an item [A → α • Bβ ], its closure contains the item and any other items that can generate legal substrings to follow α.

function CLOSURE(I)
  repeat
    if [A → α • Bβ] ∈ I
      add [B → •γ] to I, for each production B → γ
  until no more items can be added to I
  return I

Grammar G’ with an augmented production:
E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

I = {[E’ → •E]}

I0 = CLOSURE(I):
E’ → •E
E → •E+T
E → •T
T → •T*F
T → •F
F → •(E)
F → •id


GOTO

● Let I be a set of LR(0) items and X be a grammar symbol. Then, GOTO(I, X) is the closure of the set of all items

– [A → αX•β] such that [A → α•Xβ] ∈ I

Grammar:
E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

I0:
E’ → •E
E → •E+T
E → •T
T → •T*F
T → •F
F → •(E)
F → •id

GOTO(I0, E) = I1:
E’ → E•
E → E•+T

Classwork: Construct GOTO(I1, +).
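
The CLOSURE and GOTO definitions above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the item encoding `(head, body, dot)` and the names `closure`, `goto` and `show` are assumptions made for this sketch, not part of the slides.

```python
# Grammar G' from the slides; each production is (head, body).
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}

# Assumed encoding: an LR(0) item is (head, body, dot), dot being an index into body.


def closure(items):
    """Add [B -> .gamma] for every nonterminal B that appears right after a dot."""
    result = set(items)
    worklist = list(items)
    while worklist:
        _, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            for head, gamma in GRAMMAR:
                new_item = (head, gamma, 0)
                if head == body[dot] and new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` wherever possible, then take the closure."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)


def show(item):
    """Render an item with '.' marking the dot position."""
    head, body, dot = item
    return head + " -> " + " ".join(list(body[:dot]) + ["."] + list(body[dot:]))


I0 = closure({("E'", ("E",), 0)})
I1 = goto(I0, "E")
for item in sorted(I1):
    print(show(item))
```

Calling `goto(I1, "+")` with the same functions is one way to check the classwork answer.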


LR(0) Automaton

Grammar:
E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

States:
I0: E’ → •E ; E → •E+T ; E → •T ; T → •T*F ; T → •F ; F → •(E) ; F → •id
I1: E’ → E• ; E → E•+T
I2: E → T• ; T → T•*F
I3: T → F•
I4: F → (•E) ; E → •E+T ; E → •T ; T → •T*F ; T → •F ; F → •(E) ; F → •id
I5: F → id•
I6: E → E+•T ; T → •T*F ; T → •F ; F → •(E) ; F → •id
I7: T → T*•F ; F → •(E) ; F → •id
I8: E → E•+T ; F → (E•)
I9: E → E+T• ; T → T•*F
I10: T → T*F•
I11: F → (E)•

Transitions:
From I0: on E to I1, on T to I2, on F to I3, on ( to I4, on id to I5
From I1: on + to I6, on $ accept
From I2: on * to I7
From I4: on E to I8, on T to I2, on F to I3, on ( to I4, on id to I5
From I6: on T to I9, on F to I3, on ( to I4, on id to I5
From I7: on F to I10, on ( to I4, on id to I5
From I8: on + to I6, on ) to I11
From I9: on * to I7





Before we reach SLR

● We can build a parser simpler than SLR, using only LR(0) item sets, for the following grammar:

E’ → E
E → E+T | T
T → (E) | id

States:
I0: E’ → •E ; E → •E+T ; E → •T ; T → •(E) ; T → •id
I1: E’ → E• ; E → E•+T
I2: E → T•
I3: T → id•
I4: T → (•E) ; E → •E+T ; E → •T ; T → •(E) ; T → •id
I5: E → E+•T ; T → •(E) ; T → •id
I6: E → E•+T ; T → (E•)
I7: E → E+T•
I8: T → (E)•

Transitions:
From I0: on E to I1, on T to I2, on id to I3, on ( to I4
From I1: on + to I5, on $ accept
From I4: on E to I6, on T to I2, on id to I3, on ( to I4
From I5: on T to I7, on id to I3, on ( to I4
From I6: on + to I5, on ) to I8


Constructing LR(0) parsing table

● Construct the LR(0) item sets for G’

– G’ is G with an augmented start production S’ → S

● State i is constructed using set Ii

– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $

– [A → α•] ∈ Ii, A ≠ S’ ⇒ ACTION[i, a] ← “reduce A → α”, ∀a

– [S’ → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”

● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j

● Set undefined entries in ACTION and GOTO to “error”

● Initial state of parser is CLOSURE([S’ → •S])


LR(0) Parsing Table

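State | id       | +        | (        | )        | $        | E | T
0     | s3       |          | s4       |          |          | 1 | 2
1     |          | s5       |          |          | accept   |   |
2     | r(E→T)   | r(E→T)   | r(E→T)   | r(E→T)   | r(E→T)   |   |
3     | r(T→id)  | r(T→id)  | r(T→id)  | r(T→id)  | r(T→id)  |   |
4     | s3       |          | s4       |          |          | 6 | 2
5     | s3       |          | s4       |          |          |   | 7
6     |          | s5       |          | s8       |          |   |
7     | r(E→E+T) | r(E→E+T) | r(E→E+T) | r(E→E+T) | r(E→E+T) |   |
8     | r(T→(E)) | r(T→(E)) | r(T→(E)) | r(T→(E)) | r(T→(E)) |   |

Grammar:
E’ → E
E → E + T | T
T → (E) | id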


Need for more powerful LR parsers

● LR(0) is too simple to cover many grammars.

● Doesn’t cover even our expression grammar:

E’ → E
E → E+T | T
T → T*F | F
F → (E) | id

● Recall the giant automaton:

– e.g.: s7 or r(E→T) on (I2, *)

– Called a shift-reduce conflict

– Similarly we can have reduce-reduce conflicts

– Multiply defined entries imply the grammar is not LR(0)

● Further reading: Section 4.5.4 (DB)

● Reason: An LR(0) table ignores the next input symbol when deciding to reduce, so it conservatively adds the reduce action under every terminal, which is too many.
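
As a rough check of the claim above, the following sketch builds the LR(0) item sets for the expression grammar and lists the shift-reduce conflicts an LR(0) table would contain. The function names and the item encoding `(head, body, dot)` are assumptions of this sketch, and states are discovered in an arbitrary order, so they are not numbered as on the slides.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}
SYMBOLS = {symbol for _, body in GRAMMAR for symbol in body}


def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for _, body, dot in list(result):
            if dot < len(body) and body[dot] in NONTERMINALS:
                for head, gamma in GRAMMAR:
                    if head == body[dot] and (head, gamma, 0) not in result:
                        result.add((head, gamma, 0))
                        changed = True
    return frozenset(result)


def goto(items, symbol):
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)


def canonical_collection():
    """All LR(0) item sets reachable from CLOSURE([E' -> .E])."""
    states = [closure({(GRAMMAR[0][0], GRAMMAR[0][1], 0)})]
    for state in states:  # the list grows while we iterate over it
        for symbol in sorted(SYMBOLS):
            target = goto(state, symbol)
            if target and target not in states:
                states.append(target)
    return states


def shift_reduce_conflicts(states):
    """(terminal, head, body) triples where LR(0) would both shift and reduce."""
    conflicts = []
    for state in states:
        reduces = [(h, b) for h, b, d in state if d == len(b) and h != "E'"]
        shifts = {b[d] for _, b, d in state if d < len(b) and b[d] not in NONTERMINALS}
        for terminal in sorted(shifts):
            for head, body in reduces:
                conflicts.append((terminal, head, body))
    return conflicts


STATES = canonical_collection()
CONFLICTS = shift_reduce_conflicts(STATES)
print(len(STATES), "states")
for terminal, head, body in CONFLICTS:
    print("on", terminal, "shift or reduce", head, "->", " ".join(body))
```

The two conflicts reported correspond to the item sets containing E → T• and E → E+T•, each of which also contains T → T•*F.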


Constructing SLR parsing table

● Construct the LR(0) item sets for G’

– G’ is G with an augmented start production S’ → S

● State i is constructed using set Ii

– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $

– [A → α•] ∈ Ii, A ≠ S’ ⇒ ACTION[i, a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)

– The restriction to FOLLOW(A) is the only addition w.r.t. the LR(0) algorithm!

– [S’ → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”

● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j

● Set undefined entries in ACTION and GOTO to “error”

● Initial state of parser s0 is CLOSURE([S’ → •S])


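SLR Parsing Table

State | id | +        | *        | (  | )        | $        | E | T | F
0     | s5 |          |          | s4 |          |          | 1 | 2 | 3
1     |    | s6       |          |    |          | accept   |   |   |
2     |    | r(E→T)   | s7       |    | r(E→T)   | r(E→T)   |   |   |
3     |    | r(T→F)   | r(T→F)   |    | r(T→F)   | r(T→F)   |   |   |
4     | s5 |          |          | s4 |          |          | 8 | 2 | 3
5     |    | r(F→id)  | r(F→id)  |    | r(F→id)  | r(F→id)  |   |   |
6     | s5 |          |          | s4 |          |          |   | 9 | 3
7     | s5 |          |          | s4 |          |          |   |   | 10
8     |    | s6       |          |    | s11      |          |   |   |
9     |    | r(E→E+T) | s7       |    | r(E→E+T) | r(E→E+T) |   |   |
10    |    | r(T→T*F) | r(T→T*F) |    | r(T→T*F) | r(T→T*F) |   |   |
11    |    | r(F→(E)) | r(F→(E)) |    | r(F→(E)) | r(F→(E)) |   |   |

FOLLOW(E) = {+, ), $}
FOLLOW(T) = {+, *, ), $}
FOLLOW(F) = {+, *, ), $}

Grammar:
E’ → E
E → E + T | T
T → T * F | F
F → (E) | id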


SLR Parsing Example

Grammar:
E’ → E
E → E + T | T
T → T * F | F
F → (E) | id

● Parse for id*id:

Stack    | Symbols  | Input     | Action
0        | $        | id * id $ | Shift to 5
0 5      | $ id     | * id $    | Reduce by F → id
0 3      | $ F      | * id $    | Reduce by T → F
0 2      | $ T      | * id $    | Shift to 7
0 2 7    | $ T *    | id $      | Shift to 5
0 2 7 5  | $ T * id | $         | Reduce by F → id
0 2 7 10 | $ T * F  | $         | Reduce by T → T * F
0 2      | $ T      | $         | Reduce by E → T
0 1      | $ E      | $         | Accept

Shift si: Push the current symbol and state si, and move the input pointer.

Reduce A → α: Pop |α| symbols and states, then push A and the state GOTO[s, A], where s is the state now on top of the stack.
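
The shift and reduce steps can be sketched as a small table-driven driver in Python. The ACTION and GOTO entries below are transcribed from the SLR parsing table for the expression grammar; the names `parse`, `reduce_row` and the tuple encoding of actions are assumptions of this sketch. Only the state stack is kept, since the symbol column is implied by the states.

```python
# (head, body length) for each production, keyed by a display name.
PRODUCTIONS = {
    "E -> E + T": ("E", 3),
    "E -> T": ("E", 1),
    "T -> T * F": ("T", 3),
    "T -> F": ("T", 1),
    "F -> ( E )": ("F", 3),
    "F -> id": ("F", 1),
}


def reduce_row(production, terminals):
    """Helper: the same reduce action under each listed terminal."""
    return {t: ("r", production) for t in terminals.split()}


ACTION = {
    0: {"id": ("s", 5), "(": ("s", 4)},
    1: {"+": ("s", 6), "$": ("acc",)},
    2: {**reduce_row("E -> T", "+ ) $"), "*": ("s", 7)},
    3: reduce_row("T -> F", "+ * ) $"),
    4: {"id": ("s", 5), "(": ("s", 4)},
    5: reduce_row("F -> id", "+ * ) $"),
    6: {"id": ("s", 5), "(": ("s", 4)},
    7: {"id": ("s", 5), "(": ("s", 4)},
    8: {"+": ("s", 6), ")": ("s", 11)},
    9: {**reduce_row("E -> E + T", "+ ) $"), "*": ("s", 7)},
    10: reduce_row("T -> T * F", "+ * ) $"),
    11: reduce_row("F -> ( E )", "+ * ) $"),
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def parse(tokens):
    """Return the list of actions taken; raise SyntaxError on an error entry."""
    stack = [0]
    tokens = list(tokens) + ["$"]
    pos = 0
    trace = []
    while True:
        state, lookahead = stack[-1], tokens[pos]
        action = ACTION[state].get(lookahead)
        if action is None:  # undefined entry means "error"
            raise SyntaxError(f"unexpected {lookahead!r} in state {state}")
        if action[0] == "s":
            stack.append(action[1])
            pos += 1
            trace.append(f"shift {action[1]}")
        elif action[0] == "r":
            head, length = PRODUCTIONS[action[1]]
            del stack[-length:]  # this grammar has no empty bodies, so length >= 1
            stack.append(GOTO[stack[-1]][head])
            trace.append(f"reduce {action[1]}")
        else:
            trace.append("accept")
            return trace


for step in parse(["id", "*", "id"]):
    print(step)
```

The printed actions follow the same order as the Action column in the parse for id*id.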


A grammar that is not SLR

Grammar:
S’ → S
S → L = R | R
L → *R | id
R → L

States:
I0: S’ → •S ; S → •L = R ; S → •R ; L → •*R ; L → •id ; R → •L
I1: S’ → S•
I2: S → L• = R ; R → L•
I3: S → R•
I4: L → id•
I5: L → *•R ; R → •L ; L → •*R ; L → •id
I6: S → L = •R ; R → •L ; L → •*R ; L → •id
I7: L → *R•
I8: R → L•
I9: S → L = R•

● Consider I2 on ‘=’:

– Shift to I6

– Reduce using R → L (as = is in FOLLOW(R); how? Hint: S ⇒ L = R ⇒ *R = R)

– Conflict in the parsing table implies the grammar is not SLR(1)
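
To see why = ends up in FOLLOW(R), here is a minimal fixed-point computation of the FOLLOW sets for this grammar. It is a sketch that assumes the grammar has no ε-productions (true here), so FIRST of a sequence is just FIRST of its first symbol; the function names are chosen for this illustration.

```python
GRAMMAR = [
    ("S'", ("S",)),
    ("S", ("L", "=", "R")),
    ("S", ("R",)),
    ("L", ("*", "R")),
    ("L", ("id",)),
    ("R", ("L",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def first_sets():
    """FIRST for each nonterminal, assuming no epsilon-productions."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            x = body[0]
            new = first[x] if x in NONTERMINALS else {x}
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


def follow_sets():
    """FOLLOW for each nonterminal, with $ following the start symbol."""
    first = first_sets()
    follow = {n: set() for n in NONTERMINALS}
    follow["S'"].add("$")
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            for i, x in enumerate(body):
                if x not in NONTERMINALS:
                    continue
                if i + 1 < len(body):
                    nxt = body[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    # x ends the body, so whatever follows head also follows x
                    new = follow[head]
                if not new <= follow[x]:
                    follow[x] |= new
                    changed = True
    return follow


FOLLOW = follow_sets()
for name in sorted(FOLLOW):
    print(name, sorted(FOLLOW[name]))
```

In this computation, = enters FOLLOW(L) from S → L = R and then reaches FOLLOW(R) through L → *R, which is what puts the reduce by R → L under = in state I2.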





LR(1) Items

● Recall LR(k) items definition?

– An LR(k) item is a pair [α, β], where

● α is a production with a • at some position in the RHS, marking how much of the RHS has been seen

● β is a lookahead string containing k symbols (terminals or $)

● LR(1) items look like [A → X • YZ, a]


CLOSURE1 and GOTO1

function CLOSURE1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, for each production B → γ and each b ∈ FIRST(βa)
  until no more items can be added to I
  return I

function GOTO1(I, X)
  Let J be the set of items [A → αX•β, a] such that [A → α•Xβ, a] ∈ I
  return CLOSURE1(J)
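
A possible Python rendering of CLOSURE1 and GOTO1, tried on the grammar S’ → S, S → CC, C → cC | d. This is a sketch under the assumption that the grammar has no ε-productions, so FIRST(βa) depends only on the first symbol of βa; the item encoding `(head, body, dot, lookahead)` and the function names are assumptions for illustration.

```python
GRAMMAR = [
    ("S'", ("S",)),
    ("S", ("C", "C")),
    ("C", ("c", "C")),
    ("C", ("d",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def first_sets():
    """FIRST for each nonterminal, assuming no epsilon-productions."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            x = body[0]
            new = first[x] if x in NONTERMINALS else {x}
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


FIRST = first_sets()


def first_of(sequence):
    """FIRST of a non-empty symbol sequence (no epsilon-productions assumed)."""
    x = sequence[0]
    return FIRST[x] if x in NONTERMINALS else {x}


def closure1(items):
    result = set(items)
    worklist = list(items)
    while worklist:
        _, body, dot, a = worklist.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            beta_a = body[dot + 1:] + (a,)
            for b in first_of(beta_a):
                for head, gamma in GRAMMAR:
                    item = (head, gamma, 0, b)
                    if head == body[dot] and item not in result:
                        result.add(item)
                        worklist.append(item)
    return frozenset(result)


def goto1(items, symbol):
    moved = {(h, b, d + 1, a) for h, b, d, a in items if d < len(b) and b[d] == symbol}
    return closure1(moved)


def core(items):
    """Drop the lookaheads, leaving the underlying LR(0) items."""
    return {(h, b, d) for h, b, d, _ in items}


I0 = closure1({("S'", ("S",), 0, "$")})
after_c_from_start = goto1(I0, "c")           # lookaheads c/d
after_c_from_C = goto1(goto1(I0, "C"), "c")   # lookahead $
print(core(after_c_from_start) == core(after_c_from_C))
print(after_c_from_start == after_c_from_C)
```

The two item sets reached on c have the same LR(0) items but different lookaheads, so LR(1) keeps them as separate states.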


LR(1) Automaton

Grammar:
S’ → S
S → C C
C → c C | d

States:
I0: S’ → •S, $ ; S → •CC, $ ; C → •cC, c/d ; C → •d, c/d
I1: S’ → S•, $
I2: S → C•C, $ ; C → •cC, $ ; C → •d, $
I3: C → c•C, c/d ; C → •cC, c/d ; C → •d, c/d
I4: C → d•, c/d
I5: S → CC•, $
I6: C → c•C, $ ; C → •cC, $ ; C → •d, $
I7: C → d•, $
I8: C → cC•, c/d
I9: C → cC•, $

Transitions:
From I0: on S to I1, on C to I2, on c to I3, on d to I4
From I1: on $ accept
From I2: on C to I5, on c to I6, on d to I7
From I3: on C to I8, on c to I3, on d to I4
From I6: on C to I9, on c to I6, on d to I7

Same LR(0) item, but different LR(1) items (e.g., I4 and I7).


LR(1) or Canonical LR (CLR) Parsing Table

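State | c  | d  | $      | S | C
0     | s3 | s4 |        | 1 | 2
1     |    |    | accept |   |
2     | s6 | s7 |        |   | 5
3     | s3 | s4 |        |   | 8
4     | r3 | r3 |        |   |
5     |    |    | r1     |   |
6     | s6 | s7 |        |   | 9
7     |    |    | r3     |   |
8     | r2 | r2 |        |   |
9     |    |    | r2     |   |

Productions:
0: S’ → S
1: S → C C
2: C → c C
3: C → d

Homework: Construct the LR(1) parser for our non-SLR grammar and verify that there is no shift-reduce conflict.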


LookAhead LR (LALR) Parsing

● LR(1) parsers have too many states compared to SLR parsers.

– For the C language, SLR would have a few hundred states

– For the C language, LR(1) would have a few thousand states

● How about merging states with the same LR(0) items (aka core)?

– Result: We get LALR parsers!

● A bit of history:

– Knuth invented LR in 1965, but it was considered impractical due to memory requirements.

– Frank DeRemer invented SLR and LALR in 1969 (LALR as part of his PhD thesis).


LALR(1) Automaton

Grammar:
S’ → S
S → C C
C → c C | d

Original LR(1) states:
I0: S’ → •S, $ ; S → •CC, $ ; C → •cC, c/d ; C → •d, c/d
I1: S’ → S•, $
I2: S → C•C, $ ; C → •cC, $ ; C → •d, $
I3: C → c•C, c/d ; C → •cC, c/d ; C → •d, c/d
I4: C → d•, c/d
I5: S → CC•, $
I6: C → c•C, $ ; C → •cC, $ ; C → •d, $
I7: C → d•, $
I8: C → cC•, c/d
I9: C → cC•, $

Merged states for LALR(1):
I36: C → c•C, c/d/$ ; C → •cC, c/d/$ ; C → •d, c/d/$
I47: C → d•, c/d/$
I89: C → cC•, c/d/$


LALR(1) Parsing Table

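State | c   | d   | $      | S | C
0     | s36 | s47 |        | 1 | 2
1     |     |     | accept |   |
2     | s36 | s47 |        |   | 5
36    | s36 | s47 |        |   | 89
47    | r3  | r3  | r3     |   |
5     |     |     | r1     |   |
89    | r2  | r2  | r2     |   |

Productions:
0: S’ → S
1: S → C C
2: C → c C
3: C → d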
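
The merging step behind this table can be sketched as grouping LR(1) states by their core. The state contents below are transcribed from the LR(1) automaton for this grammar; representing an item as an `(LR(0) item text, lookahead)` pair and the name `merge_by_core` are assumptions of this sketch. Naming a merged state by concatenating the original numbers mirrors the I36/I47/I89 convention and would be ambiguous for multi-digit state numbers.

```python
def expand(item, lookaheads):
    """Helper: one (item, lookahead) pair per lookahead in e.g. 'c/d'."""
    return {(item, a) for a in lookaheads.split("/")}


LR1_STATES = {
    0: expand("S' -> . S", "$") | expand("S -> . C C", "$")
       | expand("C -> . c C", "c/d") | expand("C -> . d", "c/d"),
    1: expand("S' -> S .", "$"),
    2: expand("S -> C . C", "$") | expand("C -> . c C", "$") | expand("C -> . d", "$"),
    3: expand("C -> c . C", "c/d") | expand("C -> . c C", "c/d") | expand("C -> . d", "c/d"),
    4: expand("C -> d .", "c/d"),
    5: expand("S -> C C .", "$"),
    6: expand("C -> c . C", "$") | expand("C -> . c C", "$") | expand("C -> . d", "$"),
    7: expand("C -> d .", "$"),
    8: expand("C -> c C .", "c/d"),
    9: expand("C -> c C .", "$"),
}


def merge_by_core(states):
    """Union the item sets of all states that share the same LR(0) core."""
    groups = {}
    for number in sorted(states):
        core = frozenset(item for item, _ in states[number])
        groups.setdefault(core, []).append(number)
    merged = {}
    for numbers in groups.values():
        name = "".join(str(n) for n in numbers)
        merged[name] = set().union(*(states[n] for n in numbers))
    return merged


LALR_STATES = merge_by_core(LR1_STATES)
for name, items in LALR_STATES.items():
    print("I" + name, sorted(items))
```

After merging, the ten LR(1) states collapse to seven, and each merged state carries the union of the lookaheads of the states it came from.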


A few notes in passing

● LALR parsing tables are smaller than the corresponding LR(1) parsing tables.

● LALR parsers mimic LR parsers on correct inputs.

● On erroneous inputs, LALR may proceed with reductions while LR might have declared an error.

– However, eventually, LALR is guaranteed to report the error.

● Merging sets for LALR never introduces new SR conflicts, but can introduce RR conflicts.

– Further reading: Section 4.7.4.

● Difference between SLR and LALR?

– Both have same LR(0) item sets!

– Difference lies in the lookahead.

● The lookaheads in LALR can be proved to be a subset of the FOLLOW sets in SLR.


Using ambiguous grammars

● Ambiguous grammars should be used sparingly.

● However, they can sometimes feel more natural to write; e.g.:

E → E + E | E * E | id

versus

E → E+T | T
T → T*F | F
F → id

● It is sometimes easier to resolve the resulting conflicts by hard-coding:

– Higher priority to shift or reduce

– Higher priority to a certain reduce

● However, it is an ad-hoc way and is better avoided.


Error handling in parsers

● Ignore till a synchronizing token (such as } or ;):

– Pop the stack

– Discard input symbols

– Resume parsing

● Attach semantic error actions to grammar rules

– Add tokens based on what is missing (e.g., closing parenthesis)

● Programmer-specified substitutions

– %change directive in some parser specifications

● Global error recovery

– Again more of theoretical interest


The Big Grammatical Picture

Clicked from “Modern Compiler Implementation in Java” by Andrew W. Appel.
