Parsing COMP 3002 School of Computer Science


  • Parsing

    COMP 3002, School of Computer Science

  • 2

    The Structure of a Compiler

    [Diagram: program text → tokenizer → token stream → parser → intermediate representation → code generator → machine code. The tokenizer and parser together form the syntactic analyzer.]

  • 3

    Role of the Parser
    • Converts a token stream into an intermediate representation
      – Captures the meaning (rather than the text) of the program
      – Usually, the intermediate representation is a parse tree

    id, "x"

    =

    id, "x"

    +

    number, "1"

    x = x + 1
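    A minimal sketch (not from the slides) of how such a parse tree might be represented in Java; the class and field names are invented here for illustration only.

      // Illustrative parse-tree node: a label, an optional lexeme, and children.
      import java.util.ArrayList;
      import java.util.List;

      class ParseNode {
          final String label;                      // e.g. "=", "+", "id", "number"
          final String lexeme;                     // e.g. "x", "1" (null for operators)
          final List<ParseNode> children = new ArrayList<>();

          ParseNode(String label, String lexeme) {
              this.label = label;
              this.lexeme = lexeme;
          }
      }

      // The tree for x = x + 1 could then be built as:
      //   ParseNode plus = new ParseNode("+", null);
      //   plus.children.add(new ParseNode("id", "x"));
      //   plus.children.add(new ParseNode("number", "1"));
      //   ParseNode assign = new ParseNode("=", null);
      //   assign.children.add(new ParseNode("id", "x"));
      //   assign.children.add(plus);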

  • 4

    Kinds of Parsers
    • Universal
      – Can parse any grammar
      – Cocke-Younger-Kasami and Earley's algorithms
      – Not efficient enough to be used in compilers
    • Top-down
      – Builds parse trees from the top (root) down
    • Bottom-up
      – Builds parse trees from the bottom (leaves) up

  • 5

    Errors in Parsing
    • Lexical errors
      – Misspelled identifiers, keywords, or operators
    • Syntactic errors
      – Misplaced or mismatched parentheses, a case statement outside of any switch statement, ...
    • Semantic errors
      – Type mismatches between operators and operands
    • Logical errors
      – Bugs: the programmer said one thing but meant something else
        • if (x = y) { ... }
        • if (x == y) { ... }

  • 6

    Error Reporting
    • A parser should
      – report the presence of errors clearly and correctly, and
      – recover from errors quickly enough to detect further errors

  • 7

    Error Recovery Modes
    • Panic-Mode
      – Discard input symbols until a "synchronizing token" is found
      – Examples (in Java): semicolon, '}'
    • Phrase-Level
      – Replace a prefix of the remaining input to correct it
      – Example: insert ';' or '{'
      – Must be careful to avoid infinite loops
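    A minimal Java sketch (not from the slides) of the panic-mode idea: on an error, tokens are discarded until a synchronizing token appears. The token-stream interface here is invented for illustration.

      import java.util.Iterator;
      import java.util.Set;

      // Illustrative panic-mode recovery: skip tokens until a synchronizing
      // token such as ";" or "}" is seen, then let the parser resume.
      class PanicMode {
          static void recover(Iterator<String> tokens, Set<String> syncTokens) {
              while (tokens.hasNext()) {
                  String t = tokens.next();
                  if (syncTokens.contains(t)) {
                      return;              // resume parsing after the synchronizing token
                  }
                  // otherwise the token is simply discarded
              }
          }
      }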

  • 8

    Error Recovery Modes (Cont'd)
    • Error Productions
      – Specify common errors as part of the language specification
    • Global Correction
      – Compute the smallest set of changes that will make the program syntactically correct (impractical and usually not useful)

  • 9

    Context-Free Grammars
    • CF grammars are used to define languages
    • Specified using BNF notation
      – A set of non-terminals N
      – A set of terminals T
      – A list of rewrite rules (productions)
      – The LHS of each rule contains one non-terminal symbol
      – The RHS of each rule contains a regular expression over the alphabet N ∪ T
      – A special non-terminal is usually designated as the start symbol
        • Usually, the start symbol is the LHS of the first production

  • 10

    Context-Free Grammars and Compilers
    • In a compiler
      – N consists of language constructs (function, block, if-statement, expression, ...)
      – T consists of tokens

  • 11

    Grammar Example
    • Non-terminals: E, T, F
      – E = expression
      – T = term
      – F = factor
    • Terminals: id, +, *, (, )
    • Start symbol: E
    • "Mathematical formulae using the + and * operators"

      E → E + T | T
      T → T * F | F
      F → ( E ) | id

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

      E → E + E | E * E | ( E ) | id

  • 12

    Derivations
    • From a grammar specification, we can derive any string in the language
      – Start with the start symbol
      – While the current string contains some non-terminal N
        • expand N using a rewrite rule with N on the LHS

      E → E + E | E * E | ( E ) | id

      E ⇒ E + E
        ⇒ E + id
        ⇒ E * E + id
        ⇒ E * id + id
        ⇒ ( E ) * id + id
        ⇒ ( E + E ) * id + id
        ⇒ ( id + E ) * id + id
        ⇒ ( id + id ) * id + id

  • 13

    Derivation Example
    • Derive id + id * id with each of these grammars:

      E → E + T | T
      T → T * F | F
      F → ( E ) | id

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

      E → E + E | E * E | ( E ) | id

  • 14

    Derivation Example
    • Derive:
      – id * id + id
      – id + id * id

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

  • 15

    Terminology
    • The strings of terminals that we derive from the start symbol are called sentences
    • The strings of terminals and non-terminals are called sentential forms

  • 16

    Leftmost and Rightmost Derivations
    • A derivation is leftmost if at each stage we always expand the leftmost non-terminal
    • A derivation is rightmost if at each stage we always expand the rightmost non-terminal
    • Give a leftmost and a rightmost derivation of
      – id * id + id

      E → E + E | E * E | ( E ) | id

  • 17

    Derivations and Parse Trees
    • A parse tree is a graphical representation of a derivation
    • Internal nodes are labelled with non-terminals
      – The root is the start symbol
    • Leaves are labelled with terminals
      – The derived string is read off by a left-to-right traversal of the leaves
    • When applying an expansion E → ABC...Z
      – The node labelled E gets children labelled A, B, C, ..., Z

  • 18

    Derivations and Parse Trees - Example

      E → E + E | E * E | ( E ) | id

      E ⇒ E + E
        ⇒ E + id
        ⇒ E * E + id
        ⇒ E * id + id
        ⇒ ( E ) * id + id
        ⇒ ( E + E ) * id + id
        ⇒ ( id + E ) * id + id
        ⇒ ( id + id ) * id + id

      [Parse tree for ( id + id ) * id + id: root E with children E, +, E; the right E derives id; the left E has children E, *, E, where the right E derives id and the left E derives ( E ), with the inner E deriving id + id.]

  • 19

    Ambiguity
    • Different parse trees for the same sentence result in ambiguity

      E → E + E | E * E | ( E ) | id

      Derivation 1:  E ⇒ E + E ⇒ E + E * E ⇒ a + b * c
      Derivation 2:  E ⇒ E * E ⇒ E + E * E ⇒ a + b * c

      [Two parse trees for a + b * c: the first derivation gives a tree whose root + has children a and the subtree b * c; the second gives a tree whose root * has children the subtree a + b and c.]

  • 20

    Ambiguity – Cont'd
    • Ambiguity is usually bad
    • The same program means two different things
    • We try to write grammars that avoid ambiguity

  • 21

    Context-Free Grammars and Regular Expressions
    • CFGs are more powerful than regular expressions
      – Converting a regular expression to a CFG is trivial
      – The CFG S → aSb | ε generates a language that is not regular
    • But not all-powerful
      – The language { a^m b^n c^m d^n : n, m > 0 } cannot be expressed by a CFG

  • 22

    Enforcing Order of Operations
    • We can write a CFG to enforce a specific order of operations
      – Example: + and *
      – Exercises:
        • Add a comparison operator with lower precedence than +
        • Add an exponentiation operator with higher precedence than *

      E → PE
      PE → TE + PE | TE
      TE → id * TE | id | ( E )

  • 23

    Picking Up - Context-Free Grammars
    • CFGs can specify programming languages
    • It's not enough to write a correct CFG
      – An ambiguous CFG can give two different parse trees for the same string
        • The same program has two different meanings!
      – Not all CFGs are easy to parse efficiently
    • We look at restricted classes of CFGs
      – Sufficiently restricted grammars can be parsed easily
      – The parser can be generated automatically

  • 24

    Parser Generators
    • Benefits of parser generators
      – No need to write code (just the grammar)
      – The parser always corresponds exactly to the grammar specification
      – Can check for errors or ambiguities in grammars
      – No surprise programs
    • Drawbacks
      – Need to write a grammar in a restricted class [LL(1), LR(1), LR(k), ...]
      – Must be able to understand when and why a grammar is not LL(1) or LR(1) or LR(k)
        – Means learning a bit of parsing theory
        – Means learning how to make your grammar LL(1), LR(1), or LR(k)

  • 25

    Ambiguity
    • This grammar is ambiguous
      – consider the input
        • a - b - c
    • Rewrite this grammar to be unambiguous
    • Rewrite this grammar so that - becomes left-associative:
      • a - b - c ~ ((a - b) - c)

      E → E - E | id

  • 26

    Solutions

      E → id M
      M → - E | ε

      E → M | id
      M → E - id

  • 27

    A Common Ambiguity – The Dangling Else
    • Show that this grammar is ambiguous
    • Remove the ambiguity
      – Implement the "else matches innermost if" rule

      stmt → if expr then stmt
           | if expr then stmt else stmt
           | other

  • 28

    Solution

      stmt → matched_stmt | open_stmt
      matched_stmt → if expr then matched_stmt else matched_stmt | other
      open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt

  • 29

    Left Recursion
    • A top-down parser expands the leftmost non-terminal based on the next token
    • Left recursion is difficult for top-down parsing
    • Immediate left recursion:
      – A → Aα | β
      – Rewrite as: A → βA' and A' → αA' | ε
    • More complicated left recursion occurs when A can derive a string starting with A
      – A ⇒+ Aα

  • 30

    Removing Left Recursion
    • Removing immediate left recursion is easy
    • Simple case:
      – A → Aα | β
      – Rewrite as: A → βA' and A' → αA' | ε
    • More complicated (see the worked instance below):
      – A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
      – Rewrite as:
        • A → β1A' | β2A' | ... | βnA'
        • A' → α1A' | α2A' | ... | αmA' | ε
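    A concrete instance (not spelled out on the slide, but it matches the grammars on slide 11): applying the simple case to E → E + T | T, with α = + T and β = T, gives E → T E' and E' → + T E' | ε; doing the same for T → T * F | F gives T → F T' and T' → * F T' | ε, which is exactly how the second grammar on slide 11 is obtained from the first.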

  • 31

    Algorithm for Removing All Left Recursion
    • Textbook, page 213

  • 32

    Left Factoring
    • Left factoring is a technique for making a grammar suitable for top-down parsing
    • For each non-terminal A, find the longest prefix α common to two or more alternatives
      – Replace A → αβ1 | αβ2 | ... | αβn with
      – A → αA' and A' → β1 | β2 | ... | βn
    • Repeat until no two alternatives have a common prefix
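    A small example (not from the slides, using the dangling-else grammar of slide 27): in stmt → if expr then stmt else stmt | if expr then stmt, the two alternatives share the prefix α = if expr then stmt, so left factoring gives stmt → if expr then stmt rest and rest → else stmt | ε, where rest is a new non-terminal introduced here for illustration.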

  • 33

    Left Factoring Example
    • Left-factor the following grammar

      E → PE
      PE → TE + PE | TE
      TE → id * TE | id | ( E )

  • 34

    Summary of Grammar-Manipulation Tricks
    • Eliminating ambiguity
      – Different parse trees for the same program
    • Enforcing order of operations
      – Left-associative
      – Right-associative
    • Eliminating left recursion
      – Gets rid of potential "infinite recursions"
    • Left factoring
      – Allows choosing between alternative productions based on the current input symbol

  • 35

    Exercise

    • Remove left recursion
    • Left-factor

      rexpr → rexpr + rterm | rterm
      rterm → rterm rfactor | rfactor
      rfactor → rfactor * | rprimary
      rprimary → a | b

  • 36

    Top-Down Parsing
    • Top-down parsing is the problem of constructing a pre-order traversal of the parse tree
    • This results in a leftmost derivation
    • The expansion of the leftmost non-terminal is determined by looking at a prefix of the input
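    A minimal recursive-descent sketch (not from the slides) of this idea in Java, using the non-left-recursive expression grammar from slide 11 (E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id). Tokens are simplified to plain strings and the end of input is marked by "$"; the class and method names are invented for illustration.

      import java.util.List;

      class RecursiveDescent {
          private final List<String> tokens;      // e.g. ["id", "+", "id", "*", "id", "$"]
          private int pos = 0;

          RecursiveDescent(List<String> tokens) { this.tokens = tokens; }

          private String lookahead() { return tokens.get(pos); }

          private void match(String expected) {
              if (!lookahead().equals(expected))
                  throw new RuntimeException("expected " + expected + " but saw " + lookahead());
              pos++;
          }

          void parseE()      { parseT(); parseEPrime(); }            // E  -> T E'

          void parseEPrime() {                                        // E' -> + T E' | eps
              if (lookahead().equals("+")) { match("+"); parseT(); parseEPrime(); }
              // otherwise choose E' -> eps (lookahead is then in FOLLOW(E') = { ), $ })
          }

          void parseT()      { parseF(); parseTPrime(); }             // T  -> F T'

          void parseTPrime() {                                        // T' -> * F T' | eps
              if (lookahead().equals("*")) { match("*"); parseF(); parseTPrime(); }
          }

          void parseF() {                                             // F  -> ( E ) | id
              if (lookahead().equals("(")) { match("("); parseE(); match(")"); }
              else match("id");
          }

          public static void main(String[] args) {
              new RecursiveDescent(List.of("id", "+", "id", "*", "id", "$")).parseE();
              System.out.println("parsed OK");
          }
      }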

  • 37

    LL(1) and LL(k)
    • If the correct expansion can always be determined by looking ahead at most k symbols, then the grammar is an LL(k) grammar
    • LL(1) grammars are the most common

  • 38

    FIRST(α)
    • Let α be any string of grammar symbols
    • FIRST(α) is the set of terminals that begin strings that can be derived from α
      – If α can derive ε then ε is also in FIRST(α)
    • Why is FIRST useful?
      – Suppose A → α | β and FIRST(α) and FIRST(β) are disjoint
      – Then, by looking at the next input symbol we know which production to use next

  • 39

    Computing FIRST(X)
    • If X is a terminal then FIRST(X) = {X}
    • If X is a non-terminal and X → Y1 Y2 ... Yk
      – i = 0; define FIRST(Y0) = {ε}
      – while ε is in FIRST(Yi) (and i < k)
        • Add FIRST(Yi+1) \ {ε} to FIRST(X)
        • i = i + 1
      – if (i = k or X → ε)
        • Add ε to FIRST(X)
    • Repeat the above step for all non-terminals until nothing is added to any FIRST set
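    A Java sketch (not the slides' pseudocode verbatim) of the usual fixpoint computation of FIRST for every non-terminal. A production's right-hand side is a list of symbols, an empty list stands for an ε-production, and ε itself is represented by the string "eps"; all names are illustrative.

      import java.util.*;

      class FirstSets {
          static final String EPS = "eps";

          // prods maps each non-terminal to the list of its right-hand sides
          static Map<String, Set<String>> compute(Map<String, List<List<String>>> prods) {
              Map<String, Set<String>> first = new HashMap<>();
              for (String a : prods.keySet()) first.put(a, new HashSet<>());

              boolean changed = true;
              while (changed) {                          // repeat until no FIRST set grows
                  changed = false;
                  for (var entry : prods.entrySet()) {
                      String a = entry.getKey();
                      for (List<String> rhs : entry.getValue()) {
                          Set<String> add = firstOfString(rhs, first, prods.keySet());
                          if (first.get(a).addAll(add)) changed = true;
                      }
                  }
              }
              return first;
          }

          // FIRST of a string X1 X2 ... Xk, given the FIRST sets computed so far
          static Set<String> firstOfString(List<String> symbols,
                                           Map<String, Set<String>> first,
                                           Set<String> nonTerminals) {
              Set<String> result = new HashSet<>();
              for (String x : symbols) {
                  if (!nonTerminals.contains(x)) {       // terminal: FIRST(X) = {X}
                      result.add(x);
                      return result;
                  }
                  Set<String> fx = first.get(x);
                  for (String s : fx) if (!s.equals(EPS)) result.add(s);
                  if (!fx.contains(EPS)) return result;  // X cannot vanish, so stop here
              }
              result.add(EPS);                           // every Xi can derive epsilon
              return result;
          }
      }

    The firstOfString helper is exactly the "FIRST of a string X1 X2 ... Xk" computation described on slide 41.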

  • 40

    Example
    • Compute FIRST(E), FIRST(PE), FIRST(TE), FIRST(TE')

      E → PE
      PE → TE + E | TE
      TE → id TE' | ( E )
      TE' → * E | ε

  • 41

    Computing FIRST(X1 X2 ... Xk)
    • Given FIRST(X) for every symbol X, we can compute FIRST(X1 X2 ... Xk) for any string of symbols X1 X2 ... Xk:
      – i = 0; define FIRST(X0) = {ε}
      – while ε is in FIRST(Xi) (and i < k)
        • Add FIRST(Xi+1) \ {ε} to FIRST(X1 X2 ... Xk)
        • i = i + 1
      – if (i = k)
        • Add ε to FIRST(X1 X2 ... Xk)

  • 42

    FOLLOW(A)
    • Let A be any non-terminal
    • FOLLOW(A) is the set of terminals a that can appear immediately to the right of A in some sentential form
      – I.e. S ⇒* αAaβ for some α and β, where S is the start symbol
      – Also, if A can be the rightmost symbol in some sentential form, then $ (the end-of-input marker) is in FOLLOW(A)

  • 43

    Computing FOLLOW(A)
    • Place $ into FOLLOW(S)
    • Repeat until nothing changes:
      – if A → αBβ then add FIRST(β) \ {ε} to FOLLOW(B)
      – if A → αB then add FOLLOW(A) to FOLLOW(B)
      – if A → αBβ and ε is in FIRST(β) then add FOLLOW(A) to FOLLOW(B)
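    A Java sketch of the same fixpoint computation of FOLLOW, reusing the illustrative FirstSets class from the earlier sketch; the start symbol and the "$" end marker are passed in explicitly.

      import java.util.*;

      class FollowSets {
          static Map<String, Set<String>> compute(Map<String, List<List<String>>> prods,
                                                  Map<String, Set<String>> first,
                                                  String startSymbol) {
              Map<String, Set<String>> follow = new HashMap<>();
              for (String a : prods.keySet()) follow.put(a, new HashSet<>());
              follow.get(startSymbol).add("$");          // $ goes into FOLLOW(S)

              boolean changed = true;
              while (changed) {
                  changed = false;
                  for (var entry : prods.entrySet()) {
                      String a = entry.getKey();
                      for (List<String> rhs : entry.getValue()) {
                          for (int i = 0; i < rhs.size(); i++) {
                              String b = rhs.get(i);
                              if (!prods.containsKey(b)) continue;   // only non-terminals have FOLLOW
                              List<String> beta = rhs.subList(i + 1, rhs.size());
                              Set<String> firstBeta =
                                  FirstSets.firstOfString(beta, first, prods.keySet());
                              // add FIRST(beta) \ {eps} to FOLLOW(B)
                              for (String s : firstBeta)
                                  if (!s.equals(FirstSets.EPS) && follow.get(b).add(s)) changed = true;
                              // if beta is empty or can derive eps, FOLLOW(A) flows into FOLLOW(B)
                              if (firstBeta.contains(FirstSets.EPS)
                                      && follow.get(b).addAll(follow.get(a))) changed = true;
                          }
                      }
                  }
              }
              return follow;
          }
      }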

  • 44

    Example
    • Compute FOLLOW(E), FOLLOW(PE), FOLLOW(TE), FOLLOW(TE')

      E → PE
      PE → TE + E | TE
      TE → id TE' | ( E )
      TE' → * E | ε

  • 45

    FIRST and FOLLOW Example

    • FIRST(F) = FIRST(T) = FIRST(E) = { (, id }
    • FIRST(E') = { +, ε }
    • FIRST(T') = { *, ε }
    • FOLLOW(E) = FOLLOW(E') = { ), $ }
    • FOLLOW(T) = FOLLOW(T') = { +, ), $ }
    • FOLLOW(F) = { +, *, ), $ }

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

  • 46

    LL(1) Grammars
    • Left-to-right parsers producing a leftmost derivation, looking ahead by at most 1 symbol
    • Grammar G is LL(1) iff for every two productions of the form A → α | β
      – FIRST(α) and FIRST(β) are disjoint
      – If ε is in FIRST(β) then FIRST(α) and FOLLOW(A) are disjoint (and vice versa)
    • Most programming language constructs are LL(1), but careful grammar writing is required
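    As a quick check against the example on slide 45 (the check itself is not spelled out on the slides): for E' → + T E' | ε we have FIRST(+ T E') = { + } and FIRST(ε) = { ε }, which are disjoint; and since ε is in FIRST(ε), we also need FIRST(+ T E') = { + } to be disjoint from FOLLOW(E') = { ), $ }, which it is. T' → * F T' | ε passes the same test, so that grammar is LL(1).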

  • 47

    LL(1) Prediction Tables
    • LL(1) languages can be parsed efficiently through the use of a prediction table
      – Rows are non-terminals
      – Columns are input symbols (terminals)
      – Entries are productions

  • 48

    Constructing the LL(1) Prediction Table
    • The following algorithm constructs the LL(1) prediction table
    • For each production A → α in the grammar
      – For each terminal a in FIRST(α), set M[A, a] = A → α
      – If ε is in FIRST(α), then for each terminal b in FOLLOW(A), set M[A, b] = A → α
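    A Java sketch of this construction, reusing the illustrative FirstSets class from the earlier sketches; a table cell simply stores the right-hand side to expand to, and an already-occupied cell signals that the grammar is not LL(1).

      import java.util.*;

      class PredictionTable {
          static Map<String, Map<String, List<String>>> build(
                  Map<String, List<List<String>>> prods,
                  Map<String, Set<String>> first,
                  Map<String, Set<String>> follow) {
              Map<String, Map<String, List<String>>> table = new HashMap<>();
              for (var entry : prods.entrySet()) {
                  String a = entry.getKey();
                  table.put(a, new HashMap<>());
                  for (List<String> rhs : entry.getValue()) {
                      Set<String> firstRhs = FirstSets.firstOfString(rhs, first, prods.keySet());
                      for (String t : firstRhs)
                          if (!t.equals(FirstSets.EPS))
                              put(table, a, t, rhs);           // M[A, t] = A -> rhs for t in FIRST(rhs)
                      if (firstRhs.contains(FirstSets.EPS))
                          for (String b : follow.get(a))
                              put(table, a, b, rhs);           // M[A, b] = A -> rhs for b in FOLLOW(A)
                  }
              }
              return table;
          }

          private static void put(Map<String, Map<String, List<String>>> table,
                                  String a, String t, List<String> rhs) {
              // a conflict here means two productions compete for the same cell: not LL(1)
              if (table.get(a).put(t, rhs) != null)
                  throw new IllegalStateException("grammar is not LL(1): conflict at M[" + a + ", " + t + "]");
          }
      }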

  • LL(1) Prediction Table Example

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

    Prediction table M (rows: non-terminals; columns: id, +, *, (, ), $; empty entries are errors):

      M[E, id] = E → T E'        M[E, (] = E → T E'
      M[E', +] = E' → + T E'     M[E', )] = E' → ε       M[E', $] = E' → ε
      M[T, id] = T → F T'        M[T, (] = T → F T'
      M[T', *] = T' → * F T'     M[T', +] = T' → ε       M[T', )] = T' → ε       M[T', $] = T' → ε
      M[F, id] = F → id          M[F, (] = F → ( E )

  • LL(1) Prediction Table Example
    • Use the prediction table (the same table as on the previous slide) to find the derivation of
      – id + id * id + id
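    A sketch of the standard stack-based loop that drives such a table (not from the slides; it reuses the table shape from the PredictionTable sketch above). The input is a list of tokens ending with "$"; printing each expansion as it happens yields the leftmost derivation.

      import java.util.*;

      class TableDrivenParser {
          static void parse(List<String> input, String startSymbol,
                            Map<String, Map<String, List<String>>> table) {
              Deque<String> stack = new ArrayDeque<>();
              stack.push("$");
              stack.push(startSymbol);
              int pos = 0;

              while (!stack.isEmpty()) {
                  String top = stack.pop();
                  String lookahead = input.get(pos);
                  if (top.equals("$") && lookahead.equals("$")) return;       // accept
                  if (!table.containsKey(top)) {                              // top is a terminal
                      if (!top.equals(lookahead))
                          throw new RuntimeException("expected " + top + " but saw " + lookahead);
                      pos++;                                                  // match and advance
                  } else {                                                    // top is a non-terminal
                      List<String> rhs = table.get(top).get(lookahead);
                      if (rhs == null)
                          throw new RuntimeException("no prediction for " + top + " on " + lookahead);
                      // push the production's right-hand side in reverse order
                      for (int i = rhs.size() - 1; i >= 0; i--) stack.push(rhs.get(i));
                      // printing top + " -> " + rhs here would print the leftmost derivation
                  }
              }
          }
      }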

  • 51

    LL(1) Parser Generators
    • Given a grammar G, an LL(1) parser generator can
      – Compute FIRST(A) and FOLLOW(A) for every non-terminal A in G
      – Determine whether G is LL(1)
      – Construct the prediction table for G
      – Create code that parses any string in the language of G and produces the parse tree
    • In Assignment 2 we will use such a parser generator (javacc)

  • 52

    Summary
    • Programming languages can be specified with context-free grammars
    • Some of these grammars are easy to parse and generate a unique parse tree for any program
    • An LL(1) grammar is one for which a leftmost derivation can be constructed with only one symbol of lookahead
    • LL(1) parser generators exist and can produce efficient parsers given only the grammar
