cs1622 - university of pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · cs 1622...

14
1 CS 1622 Lecture 6 1 CS1622 Lecture 6 Introduction to Parsing CS 1622 Lecture 6 2 2 The Front End Parser Checks the stream of words and their parts of speech (produced by the scanner) for grammatical correctness Determines if the input is syntactically well formed Guides checking at deeper levels than syntax Builds an IR representation of the code Think of this as the mathematics of diagramming sentences Source code Scanner IR Parser Errors tokens CS 1622 Lecture 6 3 The Functionality of the Parser Input: sequence of tokens from lexer Output: parse tree of the program parse tree is generated if the input is a legal program Instead of parse tree, some parsers produce directly: abstract syntax tree (AST) + symbol table, or intermediate code, or object code

Upload: others

Post on 01-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

1

CS 1622 Lecture 6 1

CS1622

Lecture 6Introduction to Parsing

CS 1622 Lecture 6 2

2

The Front End

Parser

• Checks the stream of words and their parts of speech (producedby the scanner) for grammatical correctness

• Determines if the input is syntactically well formed

• Guides checking at deeper levels than syntax

• Builds an IR representation of the code

Think of this as the mathematics of diagramming sentences

S o u r c e

c o d eS c a n n e r

I RP a r s e r

E r r o r s

tokens

CS 1622 Lecture 6 3

The Functionality of theParser Input: sequence of tokens from lexer Output: parse tree of the program

parse tree is generated if the input is alegal program

Instead of parse tree, some parsersproduce directly:

abstract syntax tree (AST) + symbol table, or intermediate code, or object code

Page 2: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

2

CS 1622 Lecture 6 4

Comparison with LexicalAnalysis

Parse treeString oftokens

Parser

String oftokens

String ofcharacters

Scanner

OutputInputPhase

CS 1622 Lecture 6 5

Example

E

E

E E

E+

id*

idid

• The program:x * y + zInput to parser:

ID TIMES ID PLUS IDwe’ll write tokens as follows:

id * id + id

• Output of parser:the parse tree

CS 1622 Lecture 6 6

What must parser do? Not all strings of tokens are valid programs

parser must distinguish between valid andinvalid strings of tokens

Parser must expose program structure(associativity and precedence)

• parser must return the parse treeWe need:

A language for describing valid strings of tokens A method for distinguishing valid from invalid

strings of tokens (and for building the parse tree)

Page 3: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

3

CS 1622 Lecture 6 7

Parser

lexicalanalyzer

symboltable

parsersourcetoken

get nexttoken

parsetree

rest offrontend

IR

Parsing = determining whether a string of tokenscan be generated by a grammar

CS 1622 Lecture 6 8

Grammars Precise, easy-to understand description of

syntax Context-free grammars -> efficient parsers

(automatically!) Help in translation and error detection

Eg. Attribute grammars Easier language evolution

Can add new constructs systematically

CS 1622 Lecture 6 9

Syntax Errors Many errors are syntactic or exposed

by parsing eg. Unbalanced ()

Error handling goals: Report errors quickly & accurately Recover quickly (continue parsing after

error) Little overhead on parse time

Page 4: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

4

CS 1622 Lecture 6 10

Error Recovery Panic mode

Discard tokens until synchronization token found (often ‘;’)

Phrase level Local correction: replace a token by another and continue

Error productions Encode commonly expected errors in grammar

Global correction Find closest input string that is in L(G)

Too costly in practice

CS 1622 Lecture 6 11

Context-free Grammars Precise and easy way to specify the

syntactical structure of a programminglanguage

Efficient recognition methods exist Natural specification of many

“recursive” constructs: expr -> expr + expr | term

CS 1622 Lecture 6 12

Context-free GrammarDefinition Terminals T

Symbols which form strings of L(G), G a CFG (= tokens in thescanner), e.g. if, else, id

Nonterminals N Syntactic variables denoting sets of strings of L(G) Impose hierarchical structure (e.g., precedence rules)

Start symbol S (∈ N) Denotes the set of strings of L(G)

Productions P Rules that determine how strings are formed N -> (N|T)*

Page 5: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

5

CS 1622 Lecture 6 13

Why are regular expressionsnot enough?

What programs are generated by?digit+ ( ( “+” | “-” | “*” | “/” ) digit+ )*

no structure! Generates a list rather than a tree

What important properties this regularexpression fails to express?

CS 1622 Lecture 6 14

Why are regular expressionsnot enough?

Write an automaton that accepts strings “a”, “(a)”, “((a))”, and “(((a)))”

cannot do - regular expressions cannot count.

“a”, “(a)”, “((a))”, “(((a)))”, … “(ka)k”

CS 1622 Lecture 6 15

Example: ExpressionGrammarexpr -> expr op exprexpr -> (expr)expr -> - exprexpr -> idop -> +op -> -op -> *op -> /op -> ^

Terminals: {id, +, -, *, /, ^}

Nonterminals {expr, op,}

Start symbol Expr

Page 6: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

6

CS 1622 Lecture 6 16

Notational Conventions Terminals

a,b,c.. +,-,.. ‘,’.’;’ etc 0..9 expr or <expr>

Nonterminals A, B, C .. S start symbol (if present)

or first nonterminal inproduction list

Terminal strings u,v,..

Grammar symbol strings α,β

Productions A -> α

CS 1622 Lecture 6 17

Shorthands & DerivationsE -> E + E | E * E |

(E) | - E | <id> E => - E “E derives -

E” => derives in 1 step =>* derive in n (0..)

steps

CS 1622 Lecture 6 18

More Definitions L(G) language generated by G = set of strings

derived from S S =>+ w : w sentence of G (w string of terminals) S =>+ α : α sentential form of G (string can contain nonterminals) G and G’ are equivalent :⇔ L(G) = L(G’) A language generated by a grammar (of the form

shown) is called a context-free language

Page 7: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

7

CS 1622 Lecture 6 19

ExampleG = ({-,*,(,),<id>}, {E},

E, {E -> E + E, E->E * E , E -> (E) , E-> - E, E -> <id>})

Sentence: -(<id> + <id>)Derivation:E => -E => -(E) =>

-(E+E)=>-(<id>+E) => -(<id> + <id>)• Leftmost derivation i.e.

always replace leftmostnonterminal

• Rightmost derivationanalogously

• Left /right sentential form

CS 1622 Lecture 6 20

Parse Trees

E

- E

( E )

E + E

<id> <id>

E => -E =>-(E) =>-(E+E)=>-(<id>+E) => -(<id> +<id>)

Parse tree =graphical representation

of a derivationignoring replacement

order

CS 1622 Lecture 6 21

Expressive Power CFGs are more powerful than REs

Can express matching () with CFGs Can express most properties desired for programming

languages CFGs cannot express:

Identifiers declared before used L = {wcw|w is in (a|b)*} Parameter checking (#formals = #actuals)

L ={anbmcndm|n ≥ 1, m ≥ 1}

Page 8: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

8

CS 1622 Lecture 6 22

Parsing = determining whether a string of

tokens can be generated by a grammar Two classes based on order in which

parse tree is constructed: Top-down parsing

Start construction at root of parse tree

Bottom-up parsing Start at leaves and proceed to root

CS 1622 Lecture 6 23

Derivations and Parse TreesA derivation is a sequence of productionsA derivation can be drawn as a tree

Start symbol is the tree’s rootFor a production N -> X1…Xn add

X1 … Xn as children to node N

S!!!LLL

1 nXYY!L

X

1 nYYL

CS 1622 Lecture 6 24

Derivation Example S -> E + E | E * E E -> id | (E) Derivation for: id * id + id Derivation for: id + id + id

See board

Page 9: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

9

CS 1622 Lecture 6 25

Notes on Derivations A parse tree has

Terminals at the leaves Non-terminals at the interior nodes

An in-order traversal of the leaves is theoriginal input

The parse tree shows the association ofoperations, the input string does not

CS 1622 Lecture 6 26

Terminals Terminals are called because there are no

rules for replacing them

Once generated, terminals are permanent

Terminals are the tokens of the languagerepresented by the grammar

CS 1622 Lecture 6 27

Left-most and Right-mostDerivations The example is a left-

most derivation At each step, replace the

left-most non-terminal

There is an equivalentnotion of a right-mostderivation

EE*Eid *Eid *E+E id *id+Eid*id+ id

®

®

®

®

®

Page 10: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

10

CS 1622 Lecture 6 28

Right-most Derivation inDetail (1)

EEE+EE+ idE * E + idE * id + idid * id + id

®

®*®*®*®*

CS 1622 Lecture 6 29

Derivations and Parse Trees Note that right-most and left-most

derivations have the same parse tree

The difference is the order in whichbranches are added

CS 1622 Lecture 6 30

Summary of Derivations We are not just interested in whether

s e L(G) We need a parse tree for s

A derivation defines a parse tree But one parse tree may have many derivations

Left-most and right-most derivations areimportant in parser implementation

Page 11: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

11

CS 1622 Lecture 6 31

Ambiguity (Cont.)This string has two parse trees

E

E

E E

E*

id +

idid

E

E

E E

E+

id*

idid

CS 1622 Lecture 6 32

Parse treesQuestion 1:

for each of the two parse trees, find thecorresponding left-most derivation

Question 2: for each of the two parse trees, find the

corresponding right-most derivation

CS 1622 Lecture 6 33

Ambiguity (Cont.) A grammar is ambiguous if for some string

(the following three conditions are equivalent)

it has more than one parse tree if there is more than one right-most derivation if there is more than one left-most derivation

Ambiguity is BAD Leaves meaning of some programs ill-defined

Page 12: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

12

CS 1622 Lecture 6 34

Dealing with Ambiguity There are several ways to handle ambiguity

Most direct method is to rewrite grammarunambiguousl

Enforces precedence of * over + Separate (external of grammar) conflict-resolution

rules E.g., precedence or associativity rules (e.g. via parser

tool declarations) Same idea we saw with scanning: instead of

complicated RE’s use symbol table to recognizekeywords

'''EEE | EEid E' | id | (E)!+!"

CS 1622 Lecture 6 35

Expression Grammars(precedence) Rewrite the grammar

use a different nonterminal for each precedence level start with the lowest precedence (MINUS)

E E - E | E / E | ( E ) | id

rewrite to

E E - E | TT T / T | FF id | ( E )

CS 1622 Lecture 6 36

Exampleparse tree for id – id / id

E E - E | TT T / T | FF id | ( E )

E

E

F F

T

-

id

/

idid

T

FT T

E

Page 13: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

13

CS 1622 Lecture 6 37

More than one parse tree? Attempt to construct a parse tree for id-

id/id that shows the wrong precedence.

Question: Why do you fail to construct this parse

tree?

CS 1622 Lecture 6 38

Associativity The grammar captures operator precedence,

but it is still ambiguous! fails to express that both subtraction and division

are left associative;

e.g., 5-3-2 is equivalent to: ((5-3)-2) and not to:(5-(3-2)).

CS 1622 Lecture 6 39

Recursion A grammar is recursive in nonterminal X if:

X + … X … recall that + means “in one or more steps, X derives a

sequence of symbols that includes an X”

A grammar is left recursive in X if: X + X …

in one or more steps, X derives a sequence of symbolsthat starts with an X

A grammar is right recursive in X if: X + … X

in one or more steps, X derives a sequence of symbolsthat ends with an X

Page 14: CS1622 - University of Pittsburghpeople.cs.pitt.edu/~mock/cs1622/lectures/lecture06.pdf · CS 1622 Lecture 6 3 ... intermediate code, or object code. 2 CS 1622 Lecture 6 4 Comparison

14

CS 1622 Lecture 6 40

How to fix associativity The grammar given above is both left and

right recursive in nonterminals exp and term try at home: write the derivation steps that show

this. To correctly expresses operator associativity:

For left associativity, use left recursion. For right associativity, use right recursion.

Here's the correct grammar:E E – T | TT T / F | FF id | ( E )