Compiler Design, Spring 2018
3.0 Frontend

Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland


Admin issues

§ Recitation sessions take place only when announced
  § In the lecture / on course website / on the mailing list
§ No recitation session this week
§ Next recitation session
  § March 15, 2018 @ 15:00
  § ETF E1 (tentative)

Compiler model

[Figure: Source program → “Front-end” → IR → “Back-end” → ASM file; the figure also shows an Optimizer]

Question: How to build IR (tree)?

Overview

§ 3.1 Introduction
§ 3.2 Lexical analysis
§ 3.3 “Top down” parsing
§ 3.4 “Bottom up” parsing

3.1 Introduction

§ Frontend is responsible for turning the input program into IR
  § Input: Usually a string of ASCII or Unicode characters
  § IR: As required by later stages of the compiler
§ Frontend divided into
  § Lexical analysis: deals with reading the input program
    § Also known as scanning
    § Scanner, Lexer
  § Syntactic analysis: understands the structure of the input program
    § Also known as parsing
    § Parser

3.1 Introduction (cont’d)

§ Good news: Syntactic and lexical analysis well understood
  § Good theory and books, e.g., Aho et al., Chapters 2 (in part), 3, and 4
  § Good tools
§ Bad news: Even good tools may be painful to use
  § Good == powerful
  § Many options
  § Still can’t handle all possible languages
  § May give cryptic error messages

3.1 Introduction (cont’d)

§ Need to understand theory to use tool
  § Same theory that allows building tool
§ Tools made hand-crafted frontends obsolete
§ Frontend tools used for other domains

Languages

§ Frontend processes input program
  § Need a way to describe what input is allowed
§ Formal languages
  § Well-researched area
  § First part of compilers supported by tools
§ In this lecture: brief review
  § Aho et al. covers topic in more depth
  § Focus on essentials
    § (Speed an issue in real life)
  § Theory behind tools

Languages: Grammar

§ Grammars provide a set of rules to generate “strings”
§ A grammar consists of
  § Terminals: a, b, c, …
  § Non-terminals: X, Y, Z, …
  § Set of productions
  § Start symbol: S
§ Some terminology
  § Terminal symbols: Sometimes called characters or tokens
  § Non-terminal symbols: Also called syntactic variables
  § String: Sequence of symbols from some alphabet
    § Other terms: Word, sentence

Productions

§ General form
  § Left-hand side → Right-hand side
  § LHS → RHS (for short)
  § LHS, RHS: Strings over alphabets of terminal and non-terminal symbols
§ Example: Grammar G1
    S → aBa
    S → aXa
    Xb → Xbc | c
    Ba → aBa | b
§ How does a grammar generate a language (known as L(G))?
  § Using the grammar G1 as an example

L(G)

§ From production to derivation
  Given
  § w: a word over (T ∪ NT),
  § α, β, γ: words over (T ∪ NT)
    § (α, β, γ may be empty)
  such that w = α β γ and P is a production β → δ.
  We say that w’ = α δ γ is derived from w, i.e., w ⇒ w’.
§ Example derivation (with G1)
  § S ⇒ aBa ⇒ aaBa ⇒ aab
§ L(G1) = { a^n b | n ≥ 1 }

Grammar G1 (repeated for reference):
    S → aBa
    S → aXa
    Xb → Xbc | c
    Ba → aBa | b
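The derivation step w ⇒ w’ can be tried out mechanically. The following is a minimal sketch, not part of the original slides: the names `derive_one_step` and `language_up_to` are made up for illustration, and the length bound used for pruning is an assumption that happens to be safe for G1 (no production of G1 shortens a sentential form by more than one symbol).

```python
from collections import deque

# Productions of G1 as (LHS, RHS) pairs; upper-case letters are non-terminals.
G1 = [("S", "aBa"), ("S", "aXa"), ("Xb", "Xbc"), ("Xb", "c"),
      ("Ba", "aBa"), ("Ba", "b")]

def derive_one_step(w, productions):
    """Return every w' with w => w' (one occurrence of some LHS replaced by its RHS)."""
    results = set()
    for lhs, rhs in productions:
        start = w.find(lhs)
        while start != -1:
            results.add(w[:start] + rhs + w[start + len(lhs):])
            start = w.find(lhs, start + 1)
    return results

def language_up_to(productions, start_symbol, max_len):
    """Collect all terminal-only words of length <= max_len derivable from start_symbol."""
    seen, words = {start_symbol}, set()
    queue = deque([start_symbol])
    while queue:
        w = queue.popleft()
        for nxt in derive_one_step(w, productions):
            # Assumed pruning bound: forms much longer than max_len are dropped.
            if nxt in seen or len(nxt) > max_len + 2:
                continue
            seen.add(nxt)
            if nxt.islower():            # only terminals left: a word
                if len(nxt) <= max_len:
                    words.add(nxt)
            else:                        # still contains a non-terminal
                queue.append(nxt)
    return words

print(sorted(language_up_to(G1, "S", 4), key=len))  # ['ab', 'aab', 'aaab']
```

The form aXa never leads anywhere because no left-hand side of G1 matches inside it, which is why only words of the shape a^n b show up.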

L(G)

§ L(G) = set of strings w such that
  § w consists only of symbols from the set of terminals
  § There exists a sequence of productions P1, P2, …, Pn such that S ⇒ RHS1 (by P1) ⇒ … (by Pi) ⇒ … ⇒ w (by Pn)
  § In other words: there exists a derivation S ⇒ … ⇒ w using P1, …, Pn (or S ⇒* w)

Productions, 2nd look

§ No constraints on LHS, RHS
§ Some RHS could be a dead-end street
    S → aXa
    Xb → …
§ Remove dead-end streets
§ Updated grammar G1’
    S → aBa
    Ba → aBa | b

Productions, 3rd look

§ We care about L(G): prune productions that do not contribute
§ Restrictions on LHS
  § Only a single non-terminal is allowed on the left-hand side
  § For example: A → α
  § “Context free” grammar or Type-2 grammar
§ Context-free grammars important
  § Efficient analysis techniques known
§ From now on only context-free grammars unless noted

Regular and linear grammars

§ Linear grammar: Context-free, at most 1 NT in the RHS
§ Left-linear grammar: Linear, NT appears at the left end of RHS
§ Right-linear grammar: Linear, NT appears at the right end of RHS
§ Regular grammar: Either right-linear or left-linear
§ Regular grammars generate regular languages
  § Could also be described by regular expression
  § Can be recognized by Deterministic Finite Automaton
  § Type-3 grammar

Special cases

§ ∅: a language (but not an interesting one)
§ ε: the empty string
  § Must use a symbol so that we can see it
  § Can be the RHS
    § A → ε

3.1 Introduction

§ So far: Brief summary of grammars
§ Using multiple grammars to save work
§ Properties of derivations
§ Parse trees
§ Properties of grammars
  § Detect ambiguity
  § Avoid ambiguity

3.1.1 Example grammar G2

§ Start symbol: S
§ Terminals: { a, b, …, z, 0, 1, …, 9, +, -, *, /, (, ) }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions
    S → E
    E → E Op E | - E | ( E ) | Id
    Op → + | - | * | /
    Id → L M
    L → a | b | ... | z
    M → L M | N M | ε
    N → 0 | 1 | ... | 9

Note: The ε-production allows us to make M “disappear”

    S ⇒ E ⇒ Id ⇒ L M ⇒ L L M ⇒ a L M ⇒ ap M ⇒ ap

Parsing

§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
  § Analysis problem
  § Answer is either YES or NO
    § ap ∈ L(G2)
    § ap + bp ∈ L(G2)
    § ap++ ∉ L(G2)
§ For YES we need to find a sequence of productions so that S ⇒ … ⇒ w
  § (or S ⇒* w for short)

w = a3 + b

§ Derivation
    S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id ⇒ Id + L M ⇒ Id + L ⇒ Id + b
      ⇒ L M + b ⇒ a M + b ⇒ a N M + b ⇒ a3 M + b ⇒ a3 + b

Comments

§ If a string w contains multiple non-terminals we have a choice when expanding w ⇒ w’
  § Grammars that are context-free and without useless non-terminals: must have a production for each non-terminal in w
§ Assume A, B ∈ NT, and A → α, B → β are productions P1, P2
  § w = δ A τ B γ
  § Choice #1: w1 = δ α τ B γ
  § Choice #2: w2 = δ A τ β γ
  § (Both w ⇒ w1 and w ⇒ w2 are possible)

More comments

§ Question: Does the choice influence L(G)?
  § Or, is (w1 ⇒* x ∈ L(G)) ⇔ (w2 ⇒* x ∈ L(G))?
§ Answer: choice does not matter for context-free grammars
§ How to decide which production to pick?
  § Everything worked out in the example
    § We’ve always picked the right production
    § Found w = a3 + b
  § Later more…


More comments

§ Part of the derivation is pretty boring
  § Do we care about exact steps to generate identifier “a3”?
  § Details (not always) important
§ Can we find a better way to deal with this aspect?
  § Better: Simpler
  § Better: Maybe also more efficient


Regular expressions

§ Idea: Use regular expression to capture “uninteresting” part of a grammar
  § Here: Exact rules for identifier names
  § Replace part of grammar G2
      …
      Id → L M
      L → a | b | ... | z
      M → L M | N M | ε
      N → 0 | 1 | ... | 9
§ Regular expressions recognized by Finite State Machines
  § Either a Deterministic Finite Automaton (DFA)
  § Or a Nondeterministic Finite Automaton (NFA)
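To make the DFA idea concrete, here is a small hand-written sketch of an automaton for the identifier rules of G2 (a letter followed by any number of letters or digits). The function name and the state names are illustrative assumptions and do not come from the lecture.

```python
def accepts_identifier(s):
    """Simulate a two-state DFA for identifiers: one letter, then letters or digits."""
    state = "start"
    for ch in s:
        is_letter = "a" <= ch <= "z"
        is_digit = "0" <= ch <= "9"
        if state == "start" and is_letter:
            state = "in_id"                  # first symbol must be a letter
        elif state == "in_id" and (is_letter or is_digit):
            state = "in_id"                  # stay in the accepting state
        else:
            return False                     # no transition defined: reject
    return state == "in_id"                  # "in_id" is the only accepting state

print(accepts_identifier("a3"), accepts_identifier("3a"))  # True False
```

The empty string is rejected because the automaton never leaves its start state, which matches Id → L M requiring at least one letter.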

Token

§ Idea: Introduce grammar symbol that represents string described by regular expression
  § Terminal for the grammar
  § Rules/production to generate regular expression string
§ When looking for a derivation identify strings that can be described by regular expression
  § “Token”
  § Example: a3 + b    Tokens: Id (“a3”) + Id (“b”)
    (each Id is described by a regexp)
  § Chunks of the input stream
§ More in 3.2 Lexical analysis

Examples

§ a3 + b … really … Id(“a3”) + Id(“b”)
§ z * u + x … really … Id(“z”) * Id(“u”) + Id(“x”)
§ Id * Id + Id ∈ L(G2)
§ Treat terminals the same way
  § Id(“z”) Term(“*”) Id(“u”) Term(“+”) Id(“x”)
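The token view of the examples above could be produced by a scanner along the following lines. This is only a sketch: the token kinds `Id` and `Term` follow the slide, while the function name `tokenize`, the whitespace handling, and the error message are assumptions made for illustration.

```python
import re

# One alternative per token kind; leading whitespace is skipped.
TOKEN_RE = re.compile(r"\s*(?:(?P<Id>[a-z][a-z0-9]*)|(?P<Term>[-+*/()]))")

def tokenize(text):
    """Split the input into (kind, text) pairs, e.g. ('Id', 'a3') or ('Term', '+')."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no token can start near position {pos}")
        kind = m.lastgroup               # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

print(tokenize("a3 + b"))
# [('Id', 'a3'), ('Term', '+'), ('Id', 'b')]
```

A parser working on this output only needs to look at the token kind (and, for `Term`, the symbol); the exact spelling of an identifier is kept alongside for later stages.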

3.1.2 Simplified grammar G3

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, (, ), Id }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions and regular definitions
    S → E
    E → E Op E | - E | ( E ) | Id
    Op → + | - | * | /
    Id: L { L | N }*        (regexp)
    L = { a | b | c | … | z }
    N = { 0 | 1 | 2 | … | 9 }

More simplifications?

§ Can grammar G3 be simplified even further?
§ Are there other productions we can replace with a regular expression?
§ Productions
    S → E
    E → E Op E | - E | ( E ) | Id
    Id → L { L | N }*
    L = { a | b | c | … | z }
    N = { 0 | 1 | 2 | … | 9 }
    Op → + | - | * | /
§ Could treat Op the same way
    Op: { + | - | * | / }

Simplified grammar G4

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, (, ), Id }
§ Non-Terminals: { S, E, Op }
§ Productions and regular definitions
    S → E                  (1)
    E → E Op E             (2)
      | - E                (3)
      | ( E )              (4)
      | Id                 (5)
    Op → + | - | * | /     (6)
    Id: L { L | N }*
    L = { a | b | c | … | z }, N = { 0 | 1 | 2 | … | 9 }

w = a3 + b

Please take a piece of paper and find a derivation for “a3 + b”

(Raise your hand when you are done.)

Compare your solution with your neighbor’s solution.
§ Do you start with the same production?
§ Do you use the same production in the 2nd step?

w = a3 + b

§ Some example derivations
  § Derivation #1
      S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id
  § Derivation #2
      S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id
  § More?

Looking at derivations

§ a3 + b ∈ L(G4), i.e., a3 + b is a legal program
  § At least according to grammar G4
§ Are we done?
§ More analysis is needed
  § Looking at derivation helps start analysis
  § Derivations may provide information on structure

Questions for derivations

§ Does the order of applying productions matter?
§ Are derivations unique?
§ How do we compare derivations?

Choice of non-terminal in derivation step

§ Given w = δ A τ B γ (with A, B ∈ NT, and productions A → α, B → β)
§ Two choices
  § w ⇒ δ α τ B γ
  § w ⇒ δ A τ β γ

Many options

§ In Derivation #1: Always the left-most non-terminal is picked for replacement
  § “Left-most” derivation
§ In Derivation #2: Always the right-most non-terminal is picked for replacement
  § “Right-most” derivation
§ No influence on L(G)
  § But useful to distinguish “different” derivations
  § Intuitively: Different derivations might convey different “meaning”
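The difference between the two strategies can be replayed with a few lines of code. The sketch below assumes G4 with Id treated as a terminal; the helper name `derive` and the list-of-symbols representation are illustrative choices, not something defined in the lecture.

```python
NONTERMINALS = {"S", "E", "Op"}

def derive(choices, leftmost=True):
    """Apply each (LHS, RHS) in `choices` to the left-most (or right-most) non-terminal."""
    form = ["S"]
    steps = [" ".join(form)]
    for lhs, rhs in choices:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"expected a production for {form[i]}, got one for {lhs}")
        form[i:i + 1] = rhs              # replace the chosen non-terminal by the RHS
        steps.append(" ".join(form))
    return " ⇒ ".join(steps)

# The same production sequence works for both strategies in this example.
choices = [("S", ["E"]), ("E", ["E", "Op", "E"]), ("E", ["Id"]),
           ("Op", ["+"]), ("E", ["Id"])]

print(derive(choices, leftmost=True))
# S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id
print(derive(choices, leftmost=False))
# S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id
```

Both runs end in the same word Id + Id; only the order in which the non-terminals are replaced differs, which is exactly the distinction between Derivation #1 and Derivation #2.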

Derivations

§ Given a grammar G with productions Pi.
§ Consider two derivations D1 = Pa Pb Pc … Pn and D2 = P’a P’b P’c … P’n
  § Pj, P’k productions, applied as intended
§ Are D1 and D2 the same?
  § Again (intuitively): Do they “mean” the same?

Derivations

§ Question: Are D1 and D2 both right-most derivations (or both left-most derivations)?
  § YES: D1 and D2 are the same if Pm = P’m for all 1 ≤ m ≤ n
  § NO: We can’t easily compare
    § Later more (parse trees)
§ Looking at right-most (or left-most) derivations allows us to compare derivations
  § Different derivations don’t always matter
  § … but sometimes they do (more later)

Parse tree

§ Want to identify structure expressed by derivation
  § Compare two derivations that are not both right-most (or both left-most) derivations
§ Summary of derivation
  § Ignore the order of applying productions
  § Leaves: Terminals
  § Interior nodes: Application of a production

Parse tree construction

§ How to construct parse tree?
§ Induction
  § Given derivation α1 ⇒ α2 ⇒ … ⇒ αi ⇒ αi+1 ⇒ ... ⇒ αn
  § Step 1: Construct tree for α1
    § Really tree for A = α1
    § Single node labeled A
  § Step i: Assume tree for α1 ⇒ α2 ⇒ … ⇒ αi already constructed
    § αi = X1 X2 … Xj … Xk
    § Assume Xj → β = Y1 … Ym leads to αi ⇒ αi+1
    § Take tree built for α1 ⇒ α2 ⇒ … ⇒ αi
    § Find j-th leaf in this tree: this is labeled Xj.
    § Add m new children (all leaves), labeled Y1 … Ym
    § Special case: m = 0, i.e. β = ε
      § Add one child with label ε
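The inductive construction can be written down almost literally. The following sketch is one possible rendering under stated assumptions: a derivation is given as a list of (j, RHS) pairs with j counted from 0, ε-leaves are skipped when counting leaves, and the names `Node`, `leaves`, `build_parse_tree`, and `show` are invented for this example.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

def leaves(node):
    """Leaves from left to right; ε-leaves are not part of the sentential form."""
    if node.label == "ε":
        return []
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def build_parse_tree(start, steps):
    """steps: list of (j, rhs), meaning the j-th leaf (0-based) is expanded to rhs."""
    root = Node(start)                           # Step 1: single node for the start symbol
    for j, rhs in steps:                         # Step i: expand the j-th leaf
        leaf = leaves(root)[j]
        labels = rhs if rhs else ["ε"]           # special case m = 0: one child labeled ε
        leaf.children = [Node(y) for y in labels]
    return root

def show(node):
    """Render the tree as a nested string, e.g. S(E(Id))."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(show(c) for c in node.children) + ")"

# Derivation #1 (left-most) and Derivation #2 (right-most) for Id + Id in G4.
d1 = [(0, ["E"]), (0, ["E", "Op", "E"]), (0, ["Id"]), (1, ["+"]), (2, ["Id"])]
d2 = [(0, ["E"]), (0, ["E", "Op", "E"]), (2, ["Id"]), (1, ["+"]), (0, ["Id"])]

print(show(build_parse_tree("S", d1)))  # S(E(E(Id) Op(+) E(Id)))
print(show(build_parse_tree("S", d2)))  # S(E(E(Id) Op(+) E(Id)))
```

Both derivations yield the same nested string, which illustrates that the tree records which productions were applied where, but not the order in which they were applied.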

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)
    S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

§ The tree grows by one production per step
  § Start: single node S
  § S → E: S gets child E
  § E → E Op E: E gets children E, Op, E
  § E → Id: the left E gets child Id
  § Op → +: Op gets child +
  § E → Id: the right E gets child Id

§ Final tree

    S
    └─ E
       ├─ E ─ Id
       ├─ Op ─ +
       └─ E ─ Id

Example: Constructing a parse tree

§ Derivation #2 (right-most derivation)
    S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id

    S
    └─ E
       ├─ E ─ Id
       ├─ Op ─ +
       └─ E ─ Id

§ Same tree!
  § Parse tree summarizes derivation (you can find production used)
  § No statement regarding the right-most or left-most derivation

a + b * c

Talk to your neighbor and find a derivation for “a + b * c”
(Hint: right-most or left-most)

Construct the parse tree for your derivation

Compare your tree with the result obtained by your neighbor team

Derivations for a + b * c

§ Tree #1

    S
    └─ E
       ├─ E ─ Id
       ├─ Op ─ +
       └─ E
          ├─ E ─ Id
          ├─ Op ─ *
          └─ E ─ Id

§ Tree #2

    S
    └─ E
       ├─ E
       │  ├─ E ─ Id
       │  ├─ Op ─ +
       │  └─ E ─ Id
       ├─ Op ─ *
       └─ E ─ Id

Note: Each tree can be obtained using both a left-most and a right-most derivation.


Derivations and parse trees

§ Derivations with different parse trees
  § For the same string w
§ What was intended by the programmer?
  § Tree #1 means: a + (b * c)
  § Tree #2 means: (a + b) * c
§ Should we allow grammars with different parse trees for w?
  § Probably not for programming languages (if derivations capture structure)

Different parse trees

§ There are grammars that allow more than one right-most derivation for w ∈ L(G)
  § (Or more than one left-most derivation)
§ Different right-most (left-most) derivations result in different parse trees
  § Capture different structure

Different parse trees

§ There are grammars that allow more than one right-most derivation for w ∈ L(G)
  § (Or more than one left-most derivation)
§ Example (right-most)
  § Derivation #1: S ⇒ E ⇒ E Op E ⇒ E Op E Op E ⇒ E Op E Op Id ⇒ E Op E * Id ⇒ E Op Id * Id ⇒ E + Id * Id ⇒ Id + Id * Id
  § Derivation #2: S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E * Id ⇒ E Op E * Id ⇒ E Op Id * Id ⇒ E + Id * Id ⇒ Id + Id * Id

Derivations and parse trees

§ Tree #1

    S
    └─ E
       ├─ E ─ Id
       ├─ Op ─ +
       └─ E
          ├─ E ─ Id
          ├─ Op ─ *
          └─ E ─ Id

§ Tree #2

    S
    └─ E
       ├─ E
       │  ├─ E ─ Id
       │  ├─ Op ─ +
       │  └─ E ─ Id
       ├─ Op ─ *
       └─ E ─ Id

3.1.3 Ambiguity

§ A grammar that allows more than one parse tree for at least one w ∈ L(G) is called ambiguous
  § Note: Ambiguity is a property of the grammar
  § Later we give a non-ambiguous grammar for expressions
§ We need to compare parse trees (and derivations)
  § Comparing derivations is easy if only left-most (right-most) derivations are used
§ Alternative definition: A grammar that allows more than one (right | left)-most derivation for at least one w ∈ L(G) is called ambiguous
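For a concrete word, ambiguity can be checked by counting parse trees. The sketch below does this for the E productions of G4 over a token sequence; since S → E is the only production for S, the count for S equals the count for E. The function name and the span-based recursion are illustrative assumptions, not the method used in the lecture.

```python
from functools import lru_cache

OPS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Count the parse trees G4 allows for `tokens` (Id treated as a terminal)."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_e(i, j):
        """Number of parse trees that derive tokens[i:j] from E."""
        if i >= j:
            return 0
        total = 0
        if j - i == 1 and tokens[i] == "Id":            # E -> Id
            total += 1
        if tokens[i] == "-":                            # E -> - E
            total += count_e(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":   # E -> ( E )
            total += count_e(i + 1, j - 1)
        for k in range(i + 1, j - 1):                   # E -> E Op E, with Op at position k
            if tokens[k] in OPS:
                total += count_e(i, k) * count_e(k + 1, j)
        return total

    return count_e(0, len(tokens))

print(count_parse_trees(["Id", "+", "Id"]))             # 1
print(count_parse_trees(["Id", "+", "Id", "*", "Id"]))  # 2
```

A count above 1 for a single word is enough to show that the grammar is ambiguous; a count of 1 for one word says nothing about other words.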

Problems w/ ambiguity

§ Compiler does not know how to interpret “a + b * c”
  § Is it Tree #1? I.e., a + (b * c)
  § Or is it Tree #2? I.e., (a + b) * c
§ What can we do?

Addressing ambiguity

§ Change the grammar
  § See later for better grammar
  § May not always be possible
§ Change language
  § Add rules that “*” binds more strongly than “+”
    § Precedence
    § Resolves conflicts
§ Bad idea: Let the compiler (writer) decide
  § Or let the user worry
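One way to see how a precedence rule settles the structure is a small precedence-climbing parser. This technique is not taken from the slides; it is shown here only as an illustration, with an assumed precedence table and a simplified input (identifiers and binary operators only, no parentheses or unary minus).

```python
# Assumed precedence levels: a larger number binds more strongly.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expression(tokens, min_prec=1):
    """Consume `tokens` (a list, modified in place) and return a nested tuple tree."""
    left = tokens.pop(0)                          # operand, assumed to be an identifier
    while tokens and PRECEDENCE.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        # The right operand may only contain operators that bind more strongly.
        right = parse_expression(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left

print(parse_expression(["a", "+", "b", "*", "c"]))
# ('+', 'a', ('*', 'b', 'c'))
```

With “*” above “+” in the table, only the reading a + (b * c) is produced, so the conflict between the two parse trees no longer arises.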

Another example

§ “If” statement
§ Two forms
  § if (Condition) then (Body)
  § if (Condition) then (Body) else (Body)

Another example: G5

§ Start symbol: S
§ Productions
    S → stmt-list S | stmt-list
    stmt-list → …. | if-stmt
    if-stmt → if cond-expr then S |
              if cond-expr then S else S
    cond-expr → …
§ Other statements (assign, function call, …) and expression details omitted (abbreviate Body, Cond)

Please construct with your neighbor a program fragment that shows that this grammar is ambiguous.

(Find an example with two parse trees)

Another example

    if (Cond) then if (Cond) then (Body) else (Body)

Body: some other stmts
Cond: some condition expression

What did the programmer mean?

Reading #1 (else belongs to the inner if):

    if (Cond) then (Body)
        if (Cond) then (Body) else (Body)

Reading #2 (else belongs to the outer if):

    if (Cond) then (Body)
        if (Cond) then (Body)
    else (Body)


Grammar G6

§ S → SS | (S) | () | ε
§ Is G6 ambiguous?
§ What is L(G6)? Find two right-most (left-most) derivations for some w.
§ Find a grammar G6’ such that L(G6) == L(G6’) and G6’ is not ambiguous.


Ambiguous languages

§ Ambiguity is a property of the grammar
  § One word is enough to show ambiguity
  § How do you show that a grammar is not ambiguous?
    § Proof (for one grammar)
    § Some kinds of grammars are certified unambiguous
      § We will look at those in compiler design
§ Unfortunately there are languages that are inherently ambiguous
  § All grammars that generate such a language are ambiguous
  § Even for Type-2 (context free) grammars


Transition from parse tree to IR

§ Parse tree
  § Sometimes called concrete syntax tree
  § Interior nodes represent non-terminals
§ Our tree-based IR: Abstract-syntax tree
  § Interior nodes represent programming constructs
  § Non-terminals not (directly) preserved
  § Structure close to that of the parse tree
§ Building IR: Via derivations or separate transformation step

Parse tree vs IR

Concrete syntax tree:

    S
    └─ E
       ├─ E ─ Id (“a7”)
       ├─ Op ─ +
       └─ E ─ Id (“b”)

Abstract syntax tree (IR):

    +
    ├─ VAR a7
    └─ VAR b
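The step from the concrete syntax tree to the abstract syntax tree can be sketched as a separate transformation pass. The representation below (a node is a pair of label and child list, AST nodes are tuples, variables are tagged `VAR`) is an assumption chosen for this example and only covers the productions needed for a7 + b.

```python
def to_ast(node):
    """Turn a concrete syntax tree (label, children) into a nested-tuple AST."""
    label, children = node
    if label == "S":                                   # S -> E: drop the S node
        return to_ast(children[0])
    if label == "E" and len(children) == 1:            # E -> Id: drop the E node
        return to_ast(children[0])
    if label == "E" and len(children) == 3 and children[1][0] == "Op":
        left, op, right = children                     # E -> E Op E: operator becomes the node
        return (to_ast(op), to_ast(left), to_ast(right))
    if label == "Op":                                  # Op -> +, -, *, /: keep only the symbol
        return children[0][0]
    if label == "Id":                                  # Id token: keep its text as a variable
        return ("VAR", children[0][0])
    raise ValueError(f"unsupported node: {label}")

def leaf(text):
    return (text, [])

# Concrete syntax tree for a7 + b, as in the figure.
cst = ("S", [("E", [("E", [("Id", [leaf("a7")])]),
                    ("Op", [leaf("+")]),
                    ("E", [("Id", [leaf("b")])])])])

print(to_ast(cst))  # ('+', ('VAR', 'a7'), ('VAR', 'b'))
```

The non-terminals S, E, and Op disappear in the result, while the shape of the tree (operator above its two operands) stays close to that of the parse tree.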

Parsing

§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
  § Analysis problem
  § Answer is either YES or NO
§ If (and only if) we find a sequence of productions so that S ⇒* w then w ∈ L(G)

Summary

§ Frontend performs two tasks
  § Break input into tokens
  § Check that the sequence of tokens is legal input
    § Find derivation S ⇒* w
§ Goal: produce IR
§ Parse trees capture derivations
  § Information about structure: needed for IR
§ Our IR is tree-based, so the step from parse tree to IR tree is not that large