
Compiler Design, Spring 2018

3.0 Frontend

Thomas R. Gross

Computer Science Department, ETH Zurich, Switzerland

Admin issues

§ Recitation sessions take place only when announced
§ In the lecture / on the course website / on the mailing list
§ No recitation session this week
§ Next recitation session
§ March 15, 2018 @ 15:00
§ ETF E1 (tentative)

Compiler model

[Diagram: Source program → “Front-end” → IR → Optimizer → “Back-end” → ASM file]

Question: How to build IR (tree)?

Overview

§ 3.1 Introduction
§ 3.2 Lexical analysis
§ 3.3 “Top down” parsing
§ 3.4 “Bottom up” parsing

3.1 Introduction

§ Frontend responsible for turning the input program into IR
§ Input: Usually a string of ASCII or Unicode characters
§ IR: As required by later stages of the compiler
§ Frontend divided into
§ Lexical analysis – deals with reading the input program
§ Also known as scanning
§ Scanner, Lexer
§ Syntactic analysis – understands the structure of the input program
§ Also known as parsing
§ Parser

3.1 Introduction (cont’d)
§ Good news: Syntactic and lexical analysis well understood
§ Good theory and books, e.g., Aho et al., Chapters 2 (in part), 3, and 4
§ Good tools
§ Bad news: Even good tools may be painful to use
§ Good == powerful
§ Many options
§ Still can’t handle all possible languages
§ May give cryptic error messages

3.1 Introduction (cont’d)
§ Need to understand theory to use tool
§ Same theory that allows building tool
§ Tools made hand-crafted frontends obsolete
§ Frontend tools used for other domains

Languages

§ Frontend processes input program
§ Need a way to describe what input is allowed
§ Formal languages
§ Well-researched area
§ First part of compilers supported by tools
§ In this lecture: brief review
§ Aho et al. covers topic in more depth
§ Focus on essentials
§ (Speed an issue in real life)
§ Theory behind tools

Languages: Grammar
§ Grammars provide a set of rules to generate “strings”
§ A grammar consists of
§ Terminals: a, b, c, …
§ Non-terminals: X, Y, Z, …
§ Set of productions
§ Start symbol: S
§ Some terminology
§ Terminal symbols: Sometimes called characters or tokens
§ Non-terminal symbols: Also called syntactic variables
§ String: Sequence of symbols from some alphabet
§ Other terms: Word, sentence

Productions
§ General form
§ Left-hand side → Right-hand side
§ LHS → RHS (for short)
§ LHS, RHS: Strings over alphabets of terminal and non-terminal symbols
§ Example: Grammar G1
S → aBa
S → aXa
Xb → Xbc | c
Ba → aBa | b
§ How does a grammar generate a language (known as L(G))?
§ Using the grammar G1 as an example

L(G)
§ From production to derivation

Given
§ w – a word over (T ∪ NT)
§ α, β, γ – words over (T ∪ NT)
§ (α, β, γ may be empty)
such that w = α β γ and P a production β → δ.

We say that w’ = α δ γ is derived from w, i.e., w ⇒ w’.

§ Example derivation (with G1)
§ S ⇒ aBa ⇒ aaBa ⇒ aab
§ L(G1) = a^n b, n ≥ 1

(G1: S → aBa, S → aXa, Xb → Xbc | c, Ba → aBa | b)
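The derivation relation above can be explored mechanically. Below is a minimal sketch (our own illustration, not from the lecture): a bounded breadth-first search over string rewrites. Note that G1 is not context-free (Xb and Ba have two symbols on the left-hand side), so plain substring rewriting is the right model here.

```python
from collections import deque

# Grammar G1 as plain string-rewriting rules (LHS, RHS).
G1 = [("S", "aBa"), ("S", "aXa"), ("Xb", "Xbc"), ("Xb", "c"),
     ("Ba", "aBa"), ("Ba", "b")]

def derivable(w, rules, start="S", max_len=10):
    """Return True if `start` derives `w` (search bounded by max_len)."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        if s == w:
            return True
        for lhs, rhs in rules:
            i = s.find(lhs)
            while i != -1:  # rewrite every occurrence of LHS
                t = s[:i] + rhs + s[i + len(lhs):]
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    frontier.append(t)
                i = s.find(lhs, i + 1)
    return False

# L(G1) = { a^n b : n >= 1 }
print(derivable("aab", G1))  # True: S => aBa => aaBa => aab
print(derivable("ba", G1))   # False
```

The search also confirms the “dead-end street” observation below: aXa is reachable but no rule ever applies to it again.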

L(G)

§ L(G) = set of strings w such that
§ w consists only of symbols from the set of terminals
§ There exists a sequence of productions P1, P2, … Pn such that S ⇒ RHS1 (by P1) ⇒ … (by Pi) … ⇒ w (by Pn)
§ In other words: there exists a derivation S ⇒ … ⇒ w using P1 … Pn (or S ⇒* w for short)

Productions, 2nd look

§ No constraints on LHS, RHS
§ Some RHS could be a dead-end street

S → aXa
Xb → …

§ Remove dead-end streets
§ Updated grammar G1’

S → aBa
Ba → aBa | b

Productions, 3rd look

§ We care about L(G) – prune productions that do not contribute
§ Restrictions on LHS
§ Only a single non-terminal is allowed on the left-hand side
§ For example: A → α
§ “Context-free” grammar or Type-2 grammar
§ Context-free grammars important
§ Efficient analysis techniques known
§ From now on only context-free grammars unless noted

Regular and linear grammars

§ Linear grammar: Context-free, at most 1 NT in the RHS
§ Left-linear grammar: Linear, NT appears at the left end of the RHS
§ Right-linear grammar: Linear, NT appears at the right end of the RHS
§ Regular grammar: Either right-linear or left-linear
§ Regular grammars generate regular languages
§ Could also be described by regular expressions
§ Can be recognized by a Deterministic Finite Automaton
§ Type-3 grammar
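As an illustration (our own example, not from the slides): the right-linear grammar S → aS | b generates { a^n b : n ≥ 0 }, and the same language is recognized by a small DFA. The state encoding below is our own choice.

```python
def dfa_accepts(w):
    # DFA for { a^n b : n >= 0 }, generated by S -> a S | b.
    # States: 0 = still reading a's, 1 = saw the final b (accepting),
    # 2 = dead state (any further input is an error).
    state = 0
    for ch in w:
        if state == 0 and ch == "a":
            state = 0
        elif state == 0 and ch == "b":
            state = 1
        else:
            state = 2
    return state == 1

print(dfa_accepts("aaab"))  # True
print(dfa_accepts("aba"))   # False
```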

Special cases

§ ∅ – a language (but not an interesting one)
§ ε – the empty string
§ Must use a symbol so that we can see it
§ Can be the RHS
§ A → ε

3.1 Introduction

§ So far: Brief summary of grammars
§ Using multiple grammars to save work
§ Properties of derivations
§ Parse trees
§ Properties of grammars
§ Detect ambiguity
§ Avoid ambiguity

3.1.1 Example grammar G2

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, (, ) }
§ Non-terminals: { S, E, Op, Id, L, M, N }
§ Productions

S → E
E → E Op E | - E | ( E ) | Id
Op → + | - | * | /
Id → L M
L → a | b | … | z
M → L M | N M | ε
N → 0 | 1 | … | 9

Note: ε-production allows us to make M “disappear”

S ⇒ E ⇒ Id ⇒ L M ⇒ L L M ⇒ a L M ⇒ ap M ⇒ ap

Parsing

§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
§ Analysis problem
§ Answer is either YES or NO
§ ap ∈ L(G2)
§ ap + bp ∈ L(G2)
§ ap++ ∉ L(G2)
§ For YES we need to find a sequence of productions so that S ⇒ … ⇒ w
§ (or S ⇒* w for short)

w = a3 + b
§ Derivation

S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id ⇒ Id + L M ⇒ Id + L ⇒ Id + b ⇒ L M + b ⇒ a M + b ⇒ a N M + b ⇒ a3 M + b ⇒ a3 + b

Comments
§ If a string w contains multiple non-terminals we have a choice when expanding w ⇒ w’
§ Grammars that are context-free and without useless non-terminals: must have a production for each non-terminal in w
§ Assume A, B ∈ NT, and A → α, B → β are productions P1, P2
§ w = δ A τ B γ
§ Choice #1: w1 = δ α τ B γ
§ Choice #2: w2 = δ A τ β γ
§ (Both w ⇒ w1 and w ⇒ w2 possible)

More comments

§ Question: Does the choice influence L(G)?
§ Or, is (w1 ⇒* x ∈ L(G)) ⇔ (w2 ⇒* x ∈ L(G))?
§ Answer: the choice does not matter for context-free grammars
§ How to decide which production to pick?
§ Everything worked out in the example
§ We’ve always picked the right production
§ Found w = a3 + b
§ Later more…

More comments

§ Part of the derivation is pretty boring
§ Do we care about the exact steps to generate identifier “a3”?
§ Details (not always) important


More comments

§ Part of the derivation is pretty boring
§ Do we care about the exact steps to generate identifier “a3”?
§ Details (not always) important
§ Can we find a better way to deal with this aspect?
§ Better: Simpler
§ Better: Maybe also more efficient

Regular expressions
§ Idea: Use regular expressions to capture the “uninteresting” part of a grammar
§ Here: Exact rules for identifier names
§ Replace part of grammar G2

…
Id → L M
L → a | b | … | z
M → L M | N M | ε
N → 0 | 1 | … | 9

§ Regular expressions recognized by Finite State Machines
§ Either a Deterministic Finite Automaton (DFA)
§ Or a Nondeterministic Finite Automaton (NFA)

Token
§ Idea: Introduce a grammar symbol that represents the string described by a regular expression
§ Terminal for the grammar
§ Rules/productions to generate the regular expression string
§ When looking for a derivation, identify strings that can be described by a regular expression
§ “Token”
§ Example: a3 + b, Tokens: Id (“a3”) + Id (“b”)
§ Chunks of the input stream
§ More in 3.2 Lexical analysis

Examples

§ a3 + b … really … Id(“a3”) + Id(“b”)
§ z * u + x … really … Id(“z”) * Id(“u”) + Id(“x”)
§ Id * Id + Id ∈ L(G2)
§ Treat terminals the same way
§ Id(“z”) Term(“*”) Id(“u”) Term(“+”) Id(“x”)
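A scanner producing exactly these tokens can be sketched in a few lines. This is our own illustration using Python’s re module; the Id class follows G2’s definition (a lowercase letter followed by letters or digits), and error handling is deliberately minimal.

```python
import re

# Id matches L { L | N }*; Op covers the operators and parentheses of G2.
TOKEN_RE = re.compile(r"\s*(?:(?P<Id>[a-z][a-z0-9]*)|(?P<Op>[+\-*/()]))")

def tokenize(src):
    """Split the input string into (kind, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at position {pos}")
        kind = m.lastgroup            # which named group matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

print(tokenize("a3 + b"))
# [('Id', 'a3'), ('Op', '+'), ('Id', 'b')]
```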

3.1.2 Simplified grammar G3

§ Start symbol: S

§ Terminals: { a, b, …, z, +, -, *, /, ( , ), Id }

§ Non-Terminals: { S, E, Op, Id, L, M, N }

§ Productions and regular definitions

S → E
E → E Op E | - E | ( E ) | Id
Op → + | - | * | /
Id: L { L | N }*   (regexp)
L = { a | b | c | … | z }
N = { 0 | 1 | 2 | … | 9 }

More simplifications?
§ Can grammar G3 be simplified even further?
§ Are there other productions we can replace with a regular expression?
§ Productions

S → E
E → E Op E | - E | ( E ) | Id
Id: L { L | N }*
L = { a | b | c | … | z }
N = { 0 | 1 | 2 | … | 9 }
Op → + | - | * | /

§ Could treat Op the same way
Op: { + | - | * | / }

Simplified grammar G4

§ Start symbol: S

§ Terminals: { a, b, …, z, +, -, *, /, ( , ), Id }

§ Non-Terminals: { S, E, Op}

§ Productions and regular definitions

S → E (1)
E → E Op E (2)
  | - E (3)
  | ( E ) (4)
  | Id (5)
Op → + | - | * | / (6)
Id: L { L | N }*
L = { a | b | c | … | z }, N = { 0 | 1 | 2 | … | 9 }

w = a3 + b

Please take a piece of paper and find a derivation for “a3 + b”.

(Raise your hand when you are done.)

Compare your solution with your neighbor’s solution.
§ Do you start with the same production?
§ Do you use the same production in the 2nd step?

w = a3 + b

§ Some example derivations
§ Derivation #1

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

§ Derivation #2

S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id

§ More?

Looking at derivations

§ a3 + b ∈ L(G4), i.e., a3 + b is a legal program
§ At least according to grammar G4
§ Are we done?
§ More analysis is needed
§ Looking at derivation helps start analysis
§ Derivations may provide information on structure

Questions for derivations

§ Does the order of applying productions matter?
§ Are derivations unique?
§ How do we compare derivations?

Choice of non-terminal in derivation step

§ Given w = δ A τ B γ (with A, B ∈ NT, and A → α, B → β productions)
§ Two choices
§ w ⇒ δ α τ B γ
§ w ⇒ δ A τ β γ

Many options

§ In Derivation #1: Always the left-most non-terminal is picked for replacement
§ “Left-most” derivation
§ In Derivation #2: Always the right-most non-terminal is picked for replacement
§ “Right-most” derivation
§ No influence on L(G)
§ But useful to distinguish “different” derivations
§ Intuitively: Different derivations might convey different “meaning”

Derivations

§ Given a grammar G with productions Pi
§ Consider two derivations D1 = Pa Pb Pc … Pn and D2 = P’a P’b P’c … P’n
§ Pj, P’k productions, applied as intended
§ Are D1 and D2 the same?
§ Again (intuitively): Do they “mean” the same?

Derivations

§ Question: Are D1 and D2 both right-most derivations (or both left-most derivations)?
§ YES: if Pm = P’m for all 1 ≤ m ≤ n
§ NO: We can’t easily compare
§ Later more (parse trees)
§ Looking at right-most (or left-most) derivations allows us to compare derivations
§ Different derivations don’t always matter
§ … but sometimes they do (more later)

Parse tree

§ Want to identify structure expressed by derivation
§ Compare two derivations that are not both right-most (or both left-most) derivations
§ Summary of derivation
§ Ignore the order of applying productions
§ Leaves: Terminals
§ Interior nodes: Application of a production

Parse tree construction
§ How to construct a parse tree?
§ Induction
§ Given derivation α1 ⇒ α2 ⇒ … ⇒ αi ⇒ αi+1 ⇒ … ⇒ αn
§ Step 1: Construct tree for α1
§ Really tree for A = α1
§ Single node labeled A
§ Step i: Assume tree for α1 ⇒ α2 ⇒ … ⇒ αi already constructed
§ αi = X1 X2 … Xj … Xk
§ Assume Xj → β = Y1 … Ym leads to αi ⇒ αi+1
§ Take the tree built for α1 ⇒ α2 ⇒ … ⇒ αi
§ Find the j-th leaf in this tree – this is labeled Xj
§ Add m new children (all leaves), labeled Y1 … Ym
§ Special case: m = 0, i.e. β = ε
§ Add one child with label ε
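The induction above translates directly into code. In this sketch (our own encoding, not from the lecture) each derivation step names the index of the leaf being expanded and the production applied to it; an ε-production (empty right-hand side) adds a single ε child.

```python
class Node:
    def __init__(self, label):
        self.label, self.children = label, []

def build_parse_tree(start, steps):
    """steps: list of (leaf_index, lhs, rhs_symbols), one per derivation step."""
    root = Node(start)
    for j, lhs, rhs in steps:
        # Collect the current leaves (the fringe) left to right.
        leaves = []
        def collect(n):
            if not n.children:
                leaves.append(n)
            else:
                for c in n.children:
                    collect(c)
        collect(root)
        leaf = leaves[j]
        assert leaf.label == lhs          # the leaf must match the production's LHS
        leaf.children = [Node(s) for s in (rhs or ["ε"])]
    return root

def fringe(n):
    """Read off the leaves, i.e., the derived string."""
    return [n.label] if not n.children else [s for c in n.children for s in fringe(c)]

# Derivation #1 for "Id + Id" with grammar G4:
steps = [(0, "S", ["E"]),
         (0, "E", ["E", "Op", "E"]),
         (0, "E", ["Id"]),
         (1, "Op", ["+"]),
         (2, "E", ["Id"])]
print(fringe(build_parse_tree("S", steps)))  # ['Id', '+', 'Id']
```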

Example: Constructing a parse tree
§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

§ The slides build the tree one derivation step at a time:
§ Start: single node labeled S
§ S → E: add child E under S
§ E → E Op E: add children E, Op, E under E
§ E → Id: the first leaf E gets child Id
§ Op → +: the leaf Op gets child +
§ E → Id: the remaining leaf E gets child Id

[Final tree: S – E – (E Op E), with leaves Id, +, Id]

Example: Constructing a parse tree
§ Derivation #2 (right-most derivation)

S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id

§ Same tree!
§ Parse tree summarizes derivation (you can find the productions used)
§ No statement regarding the right-most or left-most derivation

[Tree: S – E – (E Op E), with leaves Id, +, Id]

a + b * c

Talk to your neighbor and find a derivation for “a + b * c”. (Hint: right-most or left-most)

Construct the parse tree for your derivation.

Compare your tree with the result obtained by your neighbor team.

Derivations for a + b * c
§ Tree #1: E → E Op E with Op = +; the right E expands to E Op E with Op = *
[Leaves: Id + (Id * Id)]
§ Tree #2: E → E Op E with Op = *; the left E expands to E Op E with Op = +
[Leaves: (Id + Id) * Id]

Note: Each tree can be obtained using both a left-most and a right-most derivation.

Derivations and parse trees

§ Derivations with different parse trees
§ For the same string w
§ What was intended by the programmer?
§ Tree #1 means: a + (b * c)
§ Tree #2 means: (a + b) * c
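To see why the distinction matters, interpret the two trees as arithmetic. The tree encoding and the values a = 2, b = 3, c = 4 are our own illustration; the two structures yield different results.

```python
def ev(t, env):
    """Evaluate a tree given as ('op', left, right) or a variable name."""
    if isinstance(t, str):
        return env[t]
    op, l, r = t
    lv, rv = ev(l, env), ev(r, env)
    return lv + rv if op == "+" else lv * rv

env = {"a": 2, "b": 3, "c": 4}        # example values (our own choice)
tree1 = ("+", "a", ("*", "b", "c"))   # Tree #1: a + (b * c)
tree2 = ("*", ("+", "a", "b"), "c")   # Tree #2: (a + b) * c
print(ev(tree1, env), ev(tree2, env)) # 14 20
```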

Derivations and parse trees

§ Derivations with different parse trees
§ For the same string w
§ What was intended by the programmer?
§ Tree #1 means: a + (b * c)
§ Tree #2 means: (a + b) * c
§ Should we allow grammars with different parse trees for w?
§ Probably not for programming languages (if derivations capture structure)

Different parse trees

§ There are grammars that allow more than one right-most derivation for w ∈ L(G)
§ (Or more than one left-most derivation)
§ Different right-most (left-most) derivations result in different parse trees
§ Capture different structure

Different parse trees

§ There are grammars that allow more than one right-most derivation for w ∈ L(G)
§ (Or more than one left-most derivation)
§ Example (right-most)
§ Derivation #1: S ⇒ E ⇒ E Op E ⇒ E Op E Op E ⇒ E Op E Op Id ⇒ E Op E * Id ⇒ E Op Id * Id ⇒ E + Id * Id ⇒ Id + Id * Id
§ Derivation #2: S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E * Id ⇒ E Op E * Id ⇒ E Op Id * Id ⇒ E + Id * Id ⇒ Id + Id * Id

Derivations and parse trees
§ Tree #1: E → E Op E with Op = *; the left E expands to E Op E with Op = +
[Leaves: (Id + Id) * Id]
§ Tree #2: E → E Op E with Op = +; the right E expands to E Op E with Op = *
[Leaves: Id + (Id * Id)]

3.1.3 Ambiguity

§ A grammar that allows more than one parse tree for at least one w ∈ L(G) is called ambiguous
§ Note: Ambiguity is a property of the grammar
§ We give later a non-ambiguous grammar for expressions
§ We need to compare parse trees (and derivations)
§ Comparing derivations is easy if only left-most (right-most) derivations are used
§ Alternative definition: A grammar that allows more than one (right | left)-most derivation for at least one w ∈ L(G) is called ambiguous

Problems w/ ambiguity

§ Compiler does not know how to interpret “a + b * c”
§ Is it Tree #1? I.e., (a + b) * c
§ Or is it Tree #2? I.e., a + (b * c)
§ What can we do?

Addressing ambiguity

§ Change the grammar
§ See later for better grammar
§ May not always be possible
§ Change language
§ Add rules that “*” binds more strongly than “+”
§ Precedence
§ Resolves conflicts
§ Bad idea: Let the compiler (writer) decide
§ Or let the user worry
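One standard way to encode “* binds tighter than +” directly in the grammar (the better grammar the slides allude to) is to stratify it: E → E + T | T, T → T * F | F, F → ( E ) | Id. The sketch below is our own illustration: a minimal recursive-descent evaluator over a token list, with the left recursion replaced by loops, showing that “a + b * c” then parses as a + (b * c).

```python
def parse_expr(toks, env):
    """Evaluate toks under env using the stratified (unambiguous) grammar."""
    def term(i):                 # T -> F ('*' F)*
        v, i = factor(i)
        while i < len(toks) and toks[i] == "*":
            w, i = factor(i + 1)
            v *= w
        return v, i
    def factor(i):               # F -> '(' E ')' | Id
        if toks[i] == "(":
            v, i = expr(i + 1)
            return v, i + 1      # skip the closing ")"
        return env[toks[i]], i + 1
    def expr(i):                 # E -> T ('+' T)*
        v, i = term(i)
        while i < len(toks) and toks[i] == "+":
            w, i = term(i + 1)
            v += w
        return v, i
    return expr(0)[0]

# "*" now binds tighter than "+": a + (b * c) = 2 + 12 = 14
print(parse_expr(["a", "+", "b", "*", "c"], {"a": 2, "b": 3, "c": 4}))  # 14
```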

Another example

§ “If” statement
§ Two forms
§ if (Condition) then (Body)
§ if (Condition) then (Body) else (Body)

Another example – G5

§ Start symbol: S
§ Productions

S → stmt-list S | stmt-list
stmt-list → … | if-stmt
if-stmt → if cond-expr then S |
          if cond-expr then S else S
cond-expr → …

§ Other statements (assign, function call, …) and expression details omitted (abbreviate Body, Cond)

Please construct with your neighbor a program fragment that shows that this grammar is ambiguous.

(Find an example with two parse trees)

Another example

if (Cond) then if (Cond) then (Body) else (Body)
Body: some other stmts
Cond: some condition expression

What did the programmer mean?

§ Reading #1: the else belongs to the inner if
if (Cond) then
    if (Cond) then (Body) else (Body)

§ Reading #2: the else belongs to the outer if
if (Cond) then
    if (Cond) then (Body)
else (Body)

Grammar G6

§ S → SS | (S) | () | ε
§ Is G6 ambiguous?
§ What is L(G6)? Find two right-most (left-most) derivations for some w.
§ Find a grammar G6’ such that L(G6) == L(G6’) and G6’ is not ambiguous.
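As a starting point for the exercise (our own sketch, not from the lecture): L(G6) is the set of balanced parenthesis strings, so membership can be checked with a simple depth counter, without building any parse tree. The ambiguity question is separate: it is about how many trees G6 assigns to a word, not about membership.

```python
def balanced(w):
    """Membership test for L(G6): balanced strings over '(' and ')'."""
    depth = 0
    for ch in w:
        depth += 1 if ch == "(" else -1
        if depth < 0:            # a ')' closed more than was opened
            return False
    return depth == 0

print(balanced("(())()"))  # True
print(balanced("())("))    # False
```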


Ambiguous languages

§ Ambiguity is a property of the grammar
§ One word is enough to show ambiguity
§ How do you show that a grammar is not ambiguous?
§ Proof (for one grammar)
§ Some kinds of grammars are certified unambiguous
§ We will look at those in compiler design
§ Unfortunately there are languages that are inherently ambiguous
§ All grammars that generate such a language are ambiguous
§ Even for Type-2 (context-free) grammars

Transition from parse tree to IR

§ Parse tree
§ Sometimes called concrete syntax tree
§ Interior nodes represent non-terminals
§ Our tree-based IR: Abstract syntax tree
§ Interior nodes represent programming constructs
§ Non-terminals not (directly) preserved
§ Structure close to that of the parse tree
§ Building IR: Via derivations or a separate transformation step

Parse tree vs IR

Concrete syntax tree: [S – E – (E Op E), with leaves Id(“a7”), +, Id(“b”)]

Abstract syntax tree (IR): [+ with children VAR a7 and VAR b]
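As a data structure, the abstract syntax tree from the slide might look like the sketch below. The node names Var and BinOp are our own, not from the lecture; the point is that S, E, and Op disappear and only the construct (+) and its operands remain.

```python
from dataclasses import dataclass

@dataclass
class Var:
    name: str            # a variable reference, e.g. VAR a7

@dataclass
class BinOp:
    op: str              # the programming construct, e.g. "+"
    left: object
    right: object

# AST for "a7 + b": no S, E, or Op nodes left.
ast = BinOp("+", Var("a7"), Var("b"))
print(ast.op, ast.left.name, ast.right.name)  # + a7 b
```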

Parsing

§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
§ Analysis problem
§ Answer is either YES or NO
§ If (and only if) we find a sequence of productions so that S ⇒* w, then w ∈ L(G)

Summary

§ Frontend performs two tasks
§ Break input into tokens
§ Analyze that the sequence of tokens is legal input
§ Find derivation S ⇒* w
§ Goal: produce IR
§ Parse trees capture derivations
§ Information about structure – needed for IR
§ Our IR is tree-based, so the step from parse tree to IR tree is not that large
