Introduction to Parsing

Upload: spencer-potter

Post on 18-Jan-2018

TRANSCRIPT

Page 1: 1 Introduction to Parsing. 2 Outline l Regular languages revisited l Parser overview Context-free grammars (CFG ’ s) l Derivations

1

Introduction to Parsing

2

Outline

Regular languages revisited

Parser overview

Context-free grammars (CFG’s)

Derivations

3

Languages and Automata

Formal languages are very important in CS

» Especially in programming languages

Regular languages

» The weakest formal languages widely used

» Many applications

We will also study context-free languages

4

Limitations of Regular Languages

Intuition: A finite automaton that runs long enough must repeat states

Finite automaton can’t remember # of times it has visited a particular state

Finite automaton has finite memory

» Only enough to store in which state it is

» Cannot count, except up to a finite limit

E.g., language of balanced parentheses is not regular: $\{\, (^i\, )^i \mid i \ge 0 \,\}$
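A minimal sketch of why this language needs counting: the recognizer below keeps two counters whose values can grow without bound, which is exactly the memory a finite automaton lacks. The function name `is_paren_power` is made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true iff s consists of i opening parentheses followed by
   exactly i closing parentheses, for some i >= 0. */
bool is_paren_power(const char *s) {
    assert(s != NULL);
    size_t opens = 0;   /* can grow without bound: not storable in finitely many states */
    size_t closes = 0;

    while (*s == '(') { opens++; s++; }
    while (*s == ')') { closes++; s++; }

    /* Any leftover character (for example a late '(') breaks the required shape. */
    return *s == '\0' && opens == closes;
}
```

For example, `is_paren_power("((()))")` holds, while `is_paren_power("(()")` and `is_paren_power("()()")` do not.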

5

The Functionality of the Parser

Input: sequence of tokens from lexer

Output: parse tree of the program

6

Example

Java expr

x == y ? 1 : 2

Parser input

ID == ID ? INT : INT

Parser output

        ?:
      /  |  \
    ==  INT  INT
   /  \
 ID    ID

7

Comparison with Lexical Analysis

Phase    Input                     Output
Lexer    Sequence of characters    Sequence of tokens
Parser   Sequence of tokens        Parse tree
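To make the lexer's side of this comparison concrete, here is a small sketch that turns the characters of `x == y ? 1 : 2` into the token sequence `ID == ID ? INT : INT`. The token names (`TK_ID`, `TK_EQ`, and so on) and the function `lex` are assumptions made for this example, not part of the slides, and only the handful of tokens the example needs are recognized.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds, just enough for the example expression. */
typedef enum { TK_ID, TK_INT, TK_EQ, TK_QUESTION, TK_COLON, TK_ERROR } TokenKind;

/* Scans src and writes at most cap token kinds into out.
   Returns the number of tokens written. On an unrecognized character,
   TK_ERROR is written and scanning stops. */
size_t lex(const char *src, TokenKind *out, size_t cap) {
    assert(src != NULL && out != NULL);
    size_t n = 0;
    while (*src != '\0' && n < cap) {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) {
            src++;                                   /* whitespace separates tokens */
        } else if (isalpha(c)) {
            while (isalnum((unsigned char)*src)) src++;
            out[n++] = TK_ID;                        /* identifier */
        } else if (isdigit(c)) {
            while (isdigit((unsigned char)*src)) src++;
            out[n++] = TK_INT;                       /* integer literal */
        } else if (c == '=' && src[1] == '=') {
            src += 2;
            out[n++] = TK_EQ;
        } else if (c == '?') {
            src++;
            out[n++] = TK_QUESTION;
        } else if (c == ':') {
            src++;
            out[n++] = TK_COLON;
        } else {
            out[n++] = TK_ERROR;
            break;
        }
    }
    return n;
}
```

The parser would then work only on the resulting token kinds and never look at individual characters.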

8

The Role of the Parser

Not all sequences of tokens are programs . . .

. . . Parser must distinguish between valid and invalid sequences of tokens

We need

» A language for describing valid sequences of tokens

» A method for distinguishing valid from invalid sequences of tokens

9

Context-Free Grammars

Programming language constructs have recursive structure

An EXPR is

if EXPR then EXPR else EXPR fi , or

while EXPR loop EXPR pool , or

…

Context-free grammars are a natural notation for this recursive structure

10

CFGs (Cont.)

A CFG consists of

» A set of terminals T

» A set of non-terminals N

» A start symbol S (a non-terminal)

» A set of productions

Assuming $X \in N$, a production has the form $X \to \varepsilon$, or $X \to Y_1 Y_2 \dots Y_n$ where $Y_i \in N \cup T$

11

Notational Conventions

In these lecture notes

» Non-terminals are written upper-case

» Terminals are written lower-case

» The start symbol is the left-hand side of the first production

12

Examples of CFGs

Expr → if Expr then Expr else Expr
     | while Expr do Expr
     | id

13

Examples of CFGs

Simple arithmetic expressions:

E → E * E
  | E + E
  | ( E )
  | id

14

The Language of a CFG

Read productions as replacement rules:

$X \to Y_1 \dots Y_n$

Means X can be replaced by $Y_1 \dots Y_n$

$X \to \varepsilon$

Means X can be erased (replaced with empty string)

15

Key Idea

1. Begin with a string consisting of the start symbol “S”

2. Replace any non-terminal X in the string by a right-hand side of some production $X \to Y_1 \dots Y_n$

3. Repeat (2) until there are no non-terminals in the string

16

The Language of a CFG (Cont.)

More formally, write

$X_1 \dots X_i \dots X_n \to X_1 \dots X_{i-1}\, Y_1 \dots Y_m\, X_{i+1} \dots X_n$

if there is a production

$X_i \to Y_1 \dots Y_m$

17

The Language of a CFG (Cont.)

Write

$X_1 \dots X_n \to^{*} Y_1 \dots Y_m$

if

$X_1 \dots X_n \to \dots \to \dots \to Y_1 \dots Y_m$

in 0 or more steps

18

The Language of a CFG

Let G be a context-free grammar with start symbol S. Then the language of G is:

$\{\, a_1 \dots a_n \mid S \to^{*} a_1 \dots a_n \text{ and every } a_i \text{ is a terminal} \,\}$

19

Terminals

Terminals are so called because there are no rules for replacing them

Once generated, terminals are permanent

Terminals ought to be tokens of the language

20

Examples

L(G) is the language of CFG G

Strings of balanced parentheses

$\{\, (^i\, )^i \mid i \ge 0 \,\}$

Two grammars:

S → ( S )
S → ε

OR

S → ( S ) | ε
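As a rough illustration, the grammar S → ( S ) | ε can be mirrored by one recursive function, where the call stack supplies the unbounded memory that a finite automaton does not have. The names `match_S` and `in_language_S` are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Matches one S starting at *pos and advances *pos past it.
   S -> ( S )    is chosen when the next character is '(';
   S -> epsilon  is chosen otherwise and consumes nothing. */
static bool match_S(const char **pos) {
    if (**pos == '(') {
        (*pos)++;                        /* consume '(' */
        if (!match_S(pos)) return false; /* nested S */
        if (**pos != ')') return false;  /* the matching ')' is required */
        (*pos)++;                        /* consume ')' */
    }
    return true;                         /* epsilon, or ( S ) completed */
}

/* True iff the whole string s is derivable from S. */
bool in_language_S(const char *s) {
    assert(s != NULL);
    const char *pos = s;
    return match_S(&pos) && *pos == '\0';
}
```

Each recursive call corresponds to one use of the production S → ( S ), so the recursion depth equals i for the string with i nested pairs.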

21

Arithmetic Example

Simple arithmetic expressions:

E → E + E | E * E | ( E ) | id

Some elements of the language:

id
id + id
( id )
id * id
( id ) * id
id * ( id )

22

Notes

The idea of a CFG is a big step. But:

Membership in a language is “yes” or “no”; also need parse tree of the input

Must handle errors gracefully

Need an implementation of CFG’s (e.g., Cup)

23

More Notes

Form of the grammar is important

» Many grammars generate the same language

» Tools are sensitive to the grammar

» Note: Tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

24

Derivations and Parse Trees

A derivation is a sequence of productions

S → … → …

A derivation can be drawn as a tree

» Start symbol is the tree’s root

» For a production $X \to Y_1 \dots Y_n$ add children $Y_1 \dots Y_n$ to node X

25

Derivation Example

Grammar

E → E + E | E * E | ( E ) | id

String

id * id + id

26

Derivation Example (Cont.)

E
→ E + E
→ E * E + E
→ id * E + E
→ id * id + E
→ id * id + id

        E
      / | \
     E  +  E
   / | \   |
  E  *  E  id
  |     |
  id    id

27

Derivation in Detail (1)

E

Parse tree so far (a node is followed by its children in brackets): E

28

Derivation in Detail (2)

E → E + E

Parse tree so far: E[ E + E ]

29

Derivation in Detail (3)

E → E + E → E * E + E

Parse tree so far: E[ E[ E * E ] + E ]

30

Derivation in Detail (4)

E → E + E → E * E + E → id * E + E

Parse tree so far: E[ E[ E[ id ] * E ] + E ]

31

Derivation in Detail (5)

E → E + E → E * E + E → id * E + E → id * id + E

Parse tree so far: E[ E[ E[ id ] * E[ id ] ] + E ]

32

Derivation in Detail (6)

E → E + E → E * E + E → id * E + E → id * id + E → id * id + id

Parse tree: E[ E[ E[ id ] * E[ id ] ] + E[ id ] ]

33

Notes on Derivations

A parse tree has

» Terminals at the leaves

» Non-terminals at the interior nodes

An in-order traversal of the leaves is the original input

The parse tree shows the association of operations, the input string does not

34

Left-most and Right-most Derivations

The example is a left-most derivation

» At each step, replace the left-most non-terminal

There is an equivalent notion of a right-most derivation

E
→ E + E
→ E + id
→ E * E + id
→ E * id + id
→ id * id + id
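The two derivation orders can be replayed mechanically. The sketch below treats a sentential form as a C string in which the upper-case letter `E` is the only non-terminal (following the notational convention of these notes) and replaces either the left-most or the right-most `E` by a chosen right-hand side. The helper names `replace_at`, `derive_leftmost`, and `derive_rightmost` are assumptions made for this example, and spaces are omitted from the forms for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Replaces the single character at pos (a non-terminal) with rhs, shifting
   the rest of the string. Fails if the result would not fit in cap bytes. */
static bool replace_at(char *form, size_t cap, char *pos, const char *rhs) {
    size_t prefix_len = (size_t)(pos - form);
    size_t rhs_len = strlen(rhs);
    size_t tail_len = strlen(pos + 1);
    if (prefix_len + rhs_len + tail_len + 1 > cap) return false;
    memmove(pos + rhs_len, pos + 1, tail_len + 1);  /* +1 also moves the '\0' */
    memcpy(pos, rhs, rhs_len);
    return true;
}

/* One derivation step on the left-most non-terminal E. */
bool derive_leftmost(char *form, size_t cap, const char *rhs) {
    assert(form != NULL && rhs != NULL);
    char *pos = strchr(form, 'E');
    return pos != NULL && replace_at(form, cap, pos, rhs);
}

/* One derivation step on the right-most non-terminal E. */
bool derive_rightmost(char *form, size_t cap, const char *rhs) {
    assert(form != NULL && rhs != NULL);
    char *pos = strrchr(form, 'E');
    return pos != NULL && replace_at(form, cap, pos, rhs);
}
```

Starting from `"E"`, applying `derive_leftmost` with `"E+E"`, `"E*E"`, `"id"`, `"id"`, `"id"` walks through the left-most derivation; applying `derive_rightmost` with `"E+E"`, `"id"`, `"E*E"`, `"id"`, `"id"` walks through the right-most one. Both end at `id*id+id`.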

35

Right-most Derivation in Detail (1)

E

Parse tree so far (a node is followed by its children in brackets): E

36

Right-most Derivation in Detail (2)

E → E + E

Parse tree so far: E[ E + E ]

37

Right-most Derivation in Detail (3)

E → E + E → E + id

Parse tree so far: E[ E + E[ id ] ]

38

Right-most Derivation in Detail (4)

E → E + E → E + id → E * E + id

Parse tree so far: E[ E[ E * E ] + E[ id ] ]

39

Right-most Derivation in Detail (5)

E → E + E → E + id → E * E + id → E * id + id

Parse tree so far: E[ E[ E * E[ id ] ] + E[ id ] ]

40

Right-most Derivation in Detail (6)

E → E + E → E + id → E * E + id → E * id + id → id * id + id

Parse tree: E[ E[ E[ id ] * E[ id ] ] + E[ id ] ]

41

Derivations and Parse Trees

Note that right-most and left-most derivations have the same parse tree

The difference is the order in which branches are added

42

Summary of Derivations

We are not just interested in whether $s \in L(G)$

» We need a parse tree for s

A derivation defines a parse tree

» But one parse tree may have many derivations

Left-most and right-most derivations are important in parser implementation

43

Issues

A parser consumes a sequence of tokens s and produces a parse tree

Issues:

» How do we recognize that $s \in L(G)$?

» A parse tree of s describes how $s \in L(G)$

» Ambiguity: more than one parse tree (interpretation) for some string s

» Error: no parse tree for some string s

» How do we construct the parse tree?

44

Ambiguity

Grammar

E → E + E | E * E | ( E ) | int

String

int * int + int

45

Ambiguity (Cont.)

This string has two parse trees

        E
      / | \
     E  +  E
   / | \   |
  E  *  E  int
  |     |
 int   int

     E
   / | \
  E  *  E
  |   / | \
 int E  +  E
     |     |
    int   int

46

Ambiguity (Cont.)

A grammar is ambiguous if it has more than one parse tree for some string

» Equivalently, there is more than one right-most or left-most derivation for some string

Ambiguity is bad

» Leaves meaning of some programs ill-defined

Ambiguity is common in programming languages

» Arithmetic expressions

» IF-THEN-ELSE

47

Dealing with Ambiguity

There are several ways to handle ambiguity

Most direct method is to rewrite the grammar unambiguously

E → T + E | T
T → int * T | int | ( E )

Enforces precedence of * over +
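One way to see the precedence at work is a small evaluator written directly from the two rules E → T + E | T and T → int * T | int | ( E ). This is only a sketch under simplifying assumptions that are not in the slides: the input is a character string, every int is a single decimal digit, and there is no whitespace. The names `parse_E`, `parse_T`, and `eval_expr` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static const char *cur;   /* next unread character */
static bool ok;           /* cleared on the first syntax error */

static int parse_E(void);

/* T -> int * T | int | ( E ) */
static int parse_T(void) {
    if (*cur == '(') {
        cur++;
        int inner = parse_E();
        if (*cur == ')') cur++; else ok = false;
        return inner;
    }
    if (!isdigit((unsigned char)*cur)) { ok = false; return 0; }
    int value = *cur++ - '0';
    if (*cur == '*') {            /* T -> int * T */
        cur++;
        return value * parse_T();
    }
    return value;                 /* T -> int */
}

/* E -> T + E | T */
static int parse_E(void) {
    int value = parse_T();
    if (ok && *cur == '+') {      /* E -> T + E */
        cur++;
        return value + parse_E();
    }
    return value;                 /* E -> T */
}

/* Returns true and stores the value if s is in the language of the grammar. */
bool eval_expr(const char *s, int *result) {
    assert(s != NULL && result != NULL);
    cur = s;
    ok = true;
    *result = parse_E();
    return ok && *cur == '\0';
}
```

Because a product is always consumed completely inside `parse_T` before `parse_E` looks for a `+`, the string `2*3+4` evaluates to 10 and `2+3*4` to 14.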

48

Ambiguity: The Dangling Else

Consider the grammar

E → if E then E
  | if E then E else E
  | OTHER

This grammar is also ambiguous

49

The Dangling Else: Example

The expression

if E1 then if E2 then E3 else E4

has two parse trees (a node is followed by its children in brackets)

if[ E1, if[ E2, E3 ], E4 ]   (else belongs to the outer if)

if[ E1, if[ E2, E3, E4 ] ]   (else belongs to the inner if)

• Typically we want the second form

50

The Dangling Else: A Fix

else matches the closest unmatched then

We can describe this in the grammar

E → MIF   /* all then are matched */
  | UIF   /* some then are unmatched */

MIF → if E then MIF else MIF
    | OTHER

UIF → if E then E
    | if E then MIF else UIF

Describes the same set of strings

51

The Dangling Else: Example Revisited

The expression if E1 then if E2 then E3 else E4

Candidate parse trees (a node is followed by its children in brackets):

if(MIF)[ E1, if(UIF)[ E2, E3 ], E4 ]

• Not valid because the then expression is not a MIF

if(UIF)[ E1, if(MIF)[ E2, E3, E4 ] ]

• A valid parse tree (for a UIF)

52

Ambiguity

No general techniques for handling ambiguity

Impossible to convert automatically an ambiguous grammar to an unambiguous one

Used with care, ambiguity can simplify the grammar

» Sometimes allows more natural definitions

» We need disambiguation mechanisms

53

Precedence and Associativity Declarations

Instead of rewriting the grammar

» Use the more natural (ambiguous) grammar

» Along with disambiguating declarations

Most tools allow precedence and associativity declarations to disambiguate grammars

Examples …

54

Associativity Declarations

Consider the grammar E → E + E | int

Ambiguous: two parse trees of int + int + int

        E
      / | \
     E  +  E
   / | \   |
  E  +  E  int
  |     |
 int   int

     E
   / | \
  E  +  E
  |   / | \
 int E  +  E
     |     |
    int   int

• Left-associativity declaration: %left +

55

Precedence Declarations

Consider the grammar E → E + E | E * E | int

» And the string int + int * int

        E
      / | \
     E  *  E
   / | \   |
  E  +  E  int
  |     |
 int   int

     E
   / | \
  E  +  E
  |   / | \
 int E  *  E
     |     |
    int   int

• Precedence declarations:

%left +
%left *

56

Abstract Syntax Trees

So far a parser traces the derivation of a sequence of tokens

The rest of the compiler needs a structural representation of the program

Abstract syntax trees

» Like parse trees but ignore some details

» Abbreviated as AST

57

Abstract Syntax Tree. (Cont.)

Consider the grammar

E → int | ( E ) | E + E

And the string

5 + (2 + 3)

After lexical analysis (a list of tokens)

int5 ‘+’ ‘(’ int2 ‘+’ int3 ‘)’

During parsing we build a parse tree …

58

Example of Parse Tree

        E
      / | \
     E  +  E
     |   / | \
   int5 (  E  )
         / | \
        E  +  E
        |     |
      int2   int3

Traces the operation of the parser

Does capture the nesting structure

But too much info

» Parentheses

» Single-successor nodes

59

Example of Abstract Syntax Tree

   PLUS
  /    \
 5     PLUS
      /    \
     2      3

Also captures the nesting structure

But abstracts from the concrete syntax => more compact and easier to use

An important data structure in a compiler

60

Constructing An AST

We first define the AST class hierarchy

» ASTNode, with subclasses IntNode and PlusNode

Consider an abstract tree type with two constructors:

new IntNode(n) = a leaf node n

new PlusNode(T1, T2) = a PLUS node with children T1 and T2

61

Semantic Actions

This is what we’ll use to construct ASTs

Each grammar symbol may have attributes

» For terminal symbols (lexical tokens) attributes can be calculated by the lexer

Each production may have an action

» Written as: X → Y1 … Yn { action }

» That can refer to or compute symbol attributes

62

Constructing an AST

We define an attribute ast for non-terminals

» Values of ast attributes are ASTs

» We assume that int.lexval is the value of the integer lexeme

» Computed using semantic actions

E → int        { E.ast = new IntNode(int.lexval) }
  | E1 + E2    { E.ast = new PlusNode(E1.ast, E2.ast) }
  | ( E1 )     { E.ast = E1.ast }

63

Parse Tree Example

Consider the string int5 ‘+’ ‘(’ int2 ‘+’ int3 ‘)’

A bottom-up evaluation of the ast attribute:

E.ast = new PlusNode(new IntNode(5), new PlusNode(new IntNode(2), new IntNode(3)))

   PLUS
  /    \
 5     PLUS
      /    \
     2      3
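The slides write the constructors in an object-oriented style (`new IntNode(n)`, `new PlusNode(T1, T2)`). As a rough C counterpart, and only as an assumption about how one might encode it, the sketch below uses a tagged struct in place of the class hierarchy. The names `new_int_node`, `new_plus_node`, `ast_eval`, and `ast_free` are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { INT_NODE, PLUS_NODE } NodeKind;

/* One struct stands in for the ASTNode hierarchy; kind selects the variant. */
typedef struct ASTNode {
    NodeKind kind;
    int value;                     /* used by INT_NODE */
    struct ASTNode *left, *right;  /* used by PLUS_NODE */
} ASTNode;

/* Counterpart of new IntNode(n): a leaf holding n. */
ASTNode *new_int_node(int n) {
    ASTNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->kind = INT_NODE;
    node->value = n;
    node->left = node->right = NULL;
    return node;
}

/* Counterpart of new PlusNode(T1, T2): a PLUS node with two subtrees. */
ASTNode *new_plus_node(ASTNode *t1, ASTNode *t2) {
    ASTNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->kind = PLUS_NODE;
    node->value = 0;
    node->left = t1;
    node->right = t2;
    return node;
}

/* Walks the tree bottom-up and returns the value of the expression. */
int ast_eval(const ASTNode *node) {
    if (node->kind == INT_NODE) return node->value;
    return ast_eval(node->left) + ast_eval(node->right);
}

/* Releases a whole tree. */
void ast_free(ASTNode *node) {
    if (node == NULL) return;
    ast_free(node->left);
    ast_free(node->right);
    free(node);
}
```

The AST for 5 + (2 + 3) is then `new_plus_node(new_int_node(5), new_plus_node(new_int_node(2), new_int_node(3)))`; the parentheses of the input leave no node behind, only the nesting.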

64

Review

We can specify language syntax using CFG

A parser will answer whether $s \in L(G)$

» and will build a parse tree

» which we convert to an AST

» and pass on to the rest of the compiler

Next lectures:

» How do we answer $s \in L(G)$ and build a parse tree?

65

Introduction to Top-Down Parsing

Terminals are seen in order of appearance in the token stream:

t2 t5 t6 t8 t9

The parse tree is constructed

» From the top

» From left to right

[Figure: parse tree with interior nodes 1, 3, 4, 7 and leaves t2, t5, t6, t8, t9]

66

Recursive Descent Parsing

Consider the grammar

E → T + E | T
T → int | int * T | ( E )

Token stream is: int5 * int2

Start with top-level non-terminal E

Try the rules for E in order

67

Recursive Descent Parsing. Example (Cont.)

Try E0 → T1 + E2

Then try a rule for T1 → ( E3 )

» But ( does not match input token int5

Try T1 → int. Token matches.

» But + after T1 does not match input token *

Try T1 → int * T2

» This will match but + after T1 will be unmatched

Have exhausted the choices for T1

» Backtrack to choice for E0

68

Recursive Descent Parsing. Example (Cont.)

Try E0 → T1

Follow same steps as before for T1

» And succeed with T1 → int * T2 and T2 → int

» With the following parse tree

     E0
     |
     T1
   / | \
int5 *  T2
        |
       int2

69

Recursive Descent Parsing. Notes.

Easy to implement by hand

But does not always work …

70

Implementation of a Recursive Descent Parser

71

A Recursive Descent Parser. Preliminaries

Let TOKEN be the type of tokens

» Special tokens INT, OPEN, CLOSE, PLUS, TIMES

Let the global next point to the next token

72

A Recursive Descent Parser (2)

Define boolean functions that check the token string for a match of

» A given token terminal

bool term(TOKEN tok) { return *next++ == tok; }

» A given production of S (the nth)

bool Sn() { … } // and test

» Any production of S:

bool S() { … } // or test

These functions advance next

73

A Recursive Descent Parser (3)

For production E → T + E

bool E1() { return T() && term(PLUS) && E(); }

For production E → T

bool E2() { return T(); }

For all productions of E (with backtracking)

bool E() {
  TOKEN *save = next;
  return (next = save, E1())
      || (next = save, E2());
}

74

A Recursive Descent Parser (4)

Functions for non-terminal T

bool T1() { return term(OPEN) && E() && term(CLOSE); }

bool T2() { return term(INT) && term(TIMES) && T(); }

bool T3() { return term(INT); }

bool T() {
  TOKEN *save = next;
  return (next = save, T1())
      || (next = save, T2())
      || (next = save, T3());
}

75

Recursive Descent Parsing. Notes.

To start the parser

» Initialize next to point to first token

» Invoke E()

Notice how this simulates our backtracking example from lecture

Easy to implement by hand

But does not always work …

» Predictive parsing is better
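For reference, the fragments from the preceding slides can be assembled into one compilable unit. The functions `term`, `E1`, `E2`, `E`, `T1`, `T2`, `T3`, and `T` follow the slides; the `END` sentinel token, the `const` qualifiers, and the wrapper `parse` are assumptions added here so that the sketch never reads past the token array and can check that the whole input was consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* END is an assumed sentinel; the slides list only the first five tokens. */
typedef enum { INT, OPEN, CLOSE, PLUS, TIMES, END } TOKEN;

static const TOKEN *next;   /* the global pointer to the next token */

static bool E(void);
static bool T(void);

/* Matches one terminal and always advances next, as on the slides. */
static bool term(TOKEN tok) { return *next++ == tok; }

static bool E1(void) { return T() && term(PLUS) && E(); }  /* E -> T + E */
static bool E2(void) { return T(); }                       /* E -> T     */

static bool E(void) {
    const TOKEN *save = next;   /* restore point for backtracking */
    return (next = save, E1())
        || (next = save, E2());
}

static bool T1(void) { return term(OPEN) && E() && term(CLOSE); } /* T -> ( E )   */
static bool T2(void) { return term(INT) && term(TIMES) && T(); }  /* T -> int * T */
static bool T3(void) { return term(INT); }                        /* T -> int     */

static bool T(void) {
    const TOKEN *save = next;
    return (next = save, T1())
        || (next = save, T2())
        || (next = save, T3());
}

/* tokens must be terminated by END. Accepts only if E matches all of it. */
bool parse(const TOKEN *tokens) {
    assert(tokens != NULL);
    next = tokens;
    return E() && *next == END;
}
```

With the token stream `INT TIMES INT END` (the example int5 * int2), `parse` returns true. The order of alternatives matters: once an alternative succeeds, `T()` and `E()` never reconsider it, so trying T → int before T → int * T would make the same input fail. This is one concrete way in which the scheme "does not always work".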