Winter 2012-2013
Compiler Principles
Lexical Analysis (Scanning)
Mayer Goldberg and Roman Manevich
Ben-Gurion University

General stuff
Topics taught by me: lexical analysis (scanning), syntax analysis (parsing), …, dataflow analysis, register allocation
Slides will be available from the web site after the lecture
Request: please mute mobiles, tablets, and super-cool squeaking devices

Today
Understand the role of lexical analysis
Lexical analysis theory
Implementing a modern scanner

Role of lexical analysis
First part of the compiler front-end
Convert a stream of characters into a stream of tokens
  Split the text into its most basic meaningful strings
Simplify the input for syntax analysis
[Figure: compiler pipeline. High-level language (Scheme) → lexical analysis → syntax analysis (parsing) → AST, symbol table, etc. → intermediate representation (IR) → code generation → executable code]

From scanning to parsing
Program text: 5 + (7 * x)
The lexical analyzer turns the program text into a token stream: num + ( num * id )
The parser checks the token stream against the grammar:
  E → id
  E → num
  E → E + E
  E → E * E
  E → ( E )
Parser output: either "valid", together with an abstract syntax tree (root +, with children num and *, where * has children num and x), or "syntax error"

JavaScript example
Identify basic units in this code:

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        }
        else {
          elt.style.display = "none";
        }
      }
    }

Basic units: keyword, numeric literal, operator, string literal, punctuation, identifier, whitespace

Scanner output
Stream of tokens for the JavaScript example, each written as LINE: ID(value):
  1: VAR
  1: ID(currOption)
  1: EQ
  1: INT_LITERAL(0)
  1: SEMI
  3: FUNCTION
  3: ID(choose)
  3: LP
  3: ID(id)
  3: RP
  3: LCB
  ...

What is a token?
Lexeme: a substring of the original text constituting an identifiable unit
  Identifiers, values, reserved words, …
Token: a record type storing
  Kind
  Value (when applicable)
  Start position / end position
  Any information that is useful for the parser
Different for different languages

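The record described above can be sketched as a small Python data class. The field names (`kind`, `value`, `line`, `start`, `end`) and the offsets are assumptions chosen for illustration; the slides only list what such a record stores, not how it is declared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One scanner output record: kind, optional value, and source span."""
    kind: str              # e.g. "ID", "INT_LITERAL", "VAR"
    value: Optional[str]   # lexeme text when it matters, else None
    line: int              # 1-based source line
    start: int             # 0-based offset of the first character
    end: int               # 0-based offset one past the last character

    def __str__(self) -> str:
        # Mirrors the "LINE: KIND(value)" notation used for scanner output.
        suffix = f"({self.value})" if self.value is not None else ""
        return f"{self.line}: {self.kind}{suffix}"


# Hypothetical tokens for the text "var currOption = 0".
tokens = [
    Token("VAR", None, 1, 0, 3),
    Token("ID", "currOption", 1, 4, 14),
    Token("EQ", None, 1, 15, 16),
    Token("INT_LITERAL", "0", 1, 17, 18),
]
```

Printing each element of `tokens` gives lines in the same shape as the scanner output, for example `1: ID(currOption)`.
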
C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?

    vector<vector<int>> myVector

As one >> operator, or as two tokens >, >?

C++ example 2
Splitting text into tokens can be tricky
How should the code below be split?

    vector<vector<int> > myVector

Two tokens: >, >

Example tokens

| Type | Examples |
|---|---|
| Identifier | x, y, z, foo, bar |
| NUM | 42 |
| FLOATNUM | -3.141592654 |
| STRING | "so long, and thanks for all the fish" |
| LPAREN | ( |
| RPAREN | ) |
| IF | if |
| … | |

Separating tokens

| Type | Examples |
|---|---|
| Comments | /* ignore code */ |
| | // ignore until end of line |
| White spaces | \t \n |

Lexemes are recognized but get consumed rather than transmitted to the parser
Separators still split the text: compare if with i f and i/*comment*/f

Preprocessor directives in C

| Type | Examples |
|---|---|
| Include directives | #include<foo.h> |
| Macros | #define THE_ANSWER 42 |

Designing a scanner
Define each type of lexeme
  Reserved words: var, if, for, while
  Operators: < = ++
  Identifiers: myFunction
  Literals: 123 "hello"
  Annotations: @SuppressWarnings
But how do we define lexemes of unbounded length?
  Regular expressions

Regular languages refresher
Formal languages
  Alphabet = finite set of letters
  Word = finite sequence of letters
  Language = set of words
Regular languages are defined equivalently by
  Regular expressions
  Finite-state automata

Regular expressions
Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene star: R*
Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*)
  What is this language?

Exercise 1 - Question
Language of Java identifiers
  Identifiers start with either an underscore '_' or a letter
  Continue with either underscore, letter, or digit

Exercise 1 - Answer
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros:
  First = _|a|b|…|z|A|…|Z
  Next = First|0|…|9
  R = First Next*

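As a quick check, the answer R = First Next* can be transcribed into Python's `re` syntax, with character classes standing in for the long unions. The names `FIRST`, `NEXT`, and `is_identifier` are made up for this sketch, and it covers only the simplified identifier language of the exercise.

```python
import re

# Shorthand macros from the answer, written as Python character classes.
FIRST = r"[_a-zA-Z]"
NEXT = r"[_a-zA-Z0-9]"
IDENTIFIER = re.compile(FIRST + NEXT + r"*")


def is_identifier(word: str) -> bool:
    """True when the whole word matches First Next*."""
    return IDENTIFIER.fullmatch(word) is not None
```

For instance, `is_identifier("_x1")` holds while `is_identifier("1abc")` does not, because a digit is in Next but not in First.
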
Exercise 2 - Question
Language of rational numbers in decimal representation (no leading or trailing zeros)
  0
  123.757
  .933333
  Not 007
  Not 0.30

Exercise 2 - Answer
  Digit = 1|2|…|9
  Digit0 = 0|Digit
  Num = Digit Digit0*
  Frac = Digit0* Digit
  Pos = Num | .Frac | 0.Frac | Num.Frac
  PosOrNeg = (Є|-)Pos
  R = 0 | PosOrNeg

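The macros of the answer can be checked mechanically by building the same expression in Python's `re` syntax. This is only a sketch: the upper-case names mirror the macros, `is_rational` is a hypothetical helper, and the dot is escaped because `.` is an operator in Python regular expressions.

```python
import re

DIGIT = r"[1-9]"
DIGIT0 = r"[0-9]"
NUM = rf"{DIGIT}{DIGIT0}*"
FRAC = rf"{DIGIT0}*{DIGIT}"
# Pos = Num | .Frac | 0.Frac | Num.Frac
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
# PosOrNeg = (Є|-)Pos, i.e. an optional minus sign
POS_OR_NEG = rf"-?{POS}"
R = re.compile(rf"(?:0|{POS_OR_NEG})")


def is_rational(word: str) -> bool:
    """True when the whole word is in the language R."""
    return R.fullmatch(word) is not None
```

The examples from the question behave as expected: `0`, `123.757`, and `.933333` are accepted, while `007` and `0.30` are rejected.
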
Exercise 3 - Question
Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …

Exercise 3 - Answer
Not regular
Context-free
Grammar: S ::= [] | [S]

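A finite automaton cannot count unboundedly many brackets, but a recursive function can follow the grammar directly. The sketch below is one possible recognizer for S ::= [] | [S]; the name `matches_s` is an assumption of this example.

```python
def matches_s(word: str) -> bool:
    """Recognize S ::= [] | [S]: n opening brackets followed by n closing ones, n >= 1."""
    if word == "[]":
        return True
    # Otherwise the word must have the shape "[" + S + "]".
    return (
        len(word) > 2
        and word[0] == "["
        and word[-1] == "]"
        and matches_s(word[1:-1])
    )
```

The recursion depth equals the nesting depth, which is exactly the unbounded memory that a finite set of states cannot provide.
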
Finite automata
An automaton is defined by states and transitions
[Figure: an automaton with a start state, transitions labeled a, b, b, c, and an accepting state]

Automaton running example
Words are read left-to-right
Input word: a b c
[Figure: the automaton follows one transition per letter and ends in the accepting state]
Word accepted

Word outside of language
Input word: c b b
Missing transition means non-acceptance
[Figure: the same automaton, with transitions labeled a, b, b, c]

Exercise - Question
What is the language defined by the automaton below?
[Figure: automaton with transitions labeled a, b, b, c]

Exercise - Answer
a b* c
Generally: all paths leading to accepting states

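Running a deterministic automaton is a simple table lookup per letter. The sketch below encodes one DFA for a b* c; the state numbers 0, 1, 2 and the table layout are assumptions of this example, since the slide's figure is not recoverable in detail.

```python
# Transition table for a DFA accepting a b* c.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
START = 0
ACCEPTING = {2}


def accepts(word: str) -> bool:
    """Run the DFA left to right; a missing transition rejects the word."""
    state = START
    for letter in word:
        if (state, letter) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, letter)]
    return state in ACCEPTING
```

With this table `accepts("abc")` holds, while `accepts("cbb")` fails on the very first letter because the start state has no transition on c.
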
Non-deterministic automata
Allow multiple transitions from a given state labeled by the same letter
[Figure: NFA with transitions labeled a, a, b, c, b, c]

NFA run example
Input word: a b c
Maintain the set of states the automaton may be in
Accept the word if any of the states in the set is accepting
[Figure: the NFA from the previous slide, stepping through the input one letter at a time]

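The "maintain a set of states" idea can be sketched in a few lines. The NFA below is hypothetical (the slide's exact automaton could not be recovered): its start state has two transitions on a, and it accepts the words abc and ac.

```python
from typing import Dict, Set, Tuple

# A small hypothetical NFA: state 0 has two transitions on "a".
NFA: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {1, 2},
    (1, "b"): {3},
    (3, "c"): {4},
    (2, "c"): {4},
}
START = 0
ACCEPTING = {4}


def nfa_accepts(word: str) -> bool:
    """Track every state the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        current = {t for s in current for t in NFA.get((s, letter), set())}
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)
```

After reading a the set is {1, 2}; the branch that cannot continue simply drops out of the set, so no backtracking is needed.
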
NFA+Є automata
Є transitions can "fire" without reading the input
[Figure: NFA with transitions labeled a, b, c and one Є transition]

NFA+Є run example
Input word: a b c
After a letter is read, the Є transition can non-deterministically take place
[Figure: the NFA+Є from the previous slide, stepping through the input one letter at a time]
Word accepted

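Simulating Є transitions only needs one extra step: after every move, close the current set of states under Є. The automaton below is a hypothetical NFA+Є for a b* c, and using the empty string as the Є label is a convention of this sketch.

```python
EPS = ""  # label used here for Є transitions

DELTA = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, EPS): {2},
    (2, "c"): {3},
}
START, ACCEPTING = 0, {3}


def eps_closure(states):
    """All states reachable from `states` using only Є transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in DELTA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(word):
    """Read the word letter by letter, closing under Є after every step."""
    current = eps_closure({START})
    for letter in word:
        moved = {t for s in current for t in DELTA.get((s, letter), set())}
        current = eps_closure(moved)
    return bool(current & ACCEPTING)
```

Here `eps_closure({1})` is `{1, 2}`: being in state 1 means the automaton may already have fired the Є transition to state 2.
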
Reg-exp vs. automata
Regular expressions are declarative
  Offer a compact way for humans to define a regular language
  Don’t offer a direct way to check whether a given word is in the language
Automata are operational
  Define an algorithm for deciding whether a given word is in a regular language
  Not a natural notation for humans

From reg. exp. to automata
Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
Proof: by induction on the structure of the regular expression
  For each sub-expression R we build an automaton with exactly one start state and one accepting state
  Start state has no incoming transitions
  Accepting state has no outgoing transitions

Base cases
R = Є
[Figure: automaton for Є]
R = a
[Figure: automaton with a single transition labeled a]

Construction for R1 | R2
[Figure: automaton combining the automata for R1 and R2]

Construction for R1 R2
[Figure: automaton chaining the automaton for R1 into the automaton for R2]

Construction for R*
[Figure: automaton built around the automaton for R]

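Since the figures for these constructions did not survive, here is a sketch of one standard way to realize them (a Thompson-style construction). It is not necessarily the exact wiring drawn on the slides. The class `Frag` and the helpers `letter`, `union`, `concat`, `star`, and `accepts` are names invented for this example; each fragment has one start state without incoming edges and one accepting state without outgoing edges.

```python
import itertools

EPS = ""                    # label for Є transitions (convention of this sketch)
_ids = itertools.count()    # source of fresh state numbers


class Frag:
    """NFA+Є fragment with exactly one start and one accepting state."""
    def __init__(self):
        self.start, self.accept = next(_ids), next(_ids)
        self.edges = []     # list of (source, label, target)


def letter(a):              # base case R = a (pass EPS for R = Є)
    f = Frag()
    f.edges.append((f.start, a, f.accept))
    return f


def union(r1, r2):          # R1 | R2
    f = Frag()
    f.edges = r1.edges + r2.edges + [
        (f.start, EPS, r1.start), (f.start, EPS, r2.start),
        (r1.accept, EPS, f.accept), (r2.accept, EPS, f.accept)]
    return f


def concat(r1, r2):         # R1 R2
    f = Frag()
    f.edges = r1.edges + r2.edges + [
        (f.start, EPS, r1.start), (r1.accept, EPS, r2.start),
        (r2.accept, EPS, f.accept)]
    return f


def star(r):                # R*
    f = Frag()
    f.edges = r.edges + [
        (f.start, EPS, r.start), (r.accept, EPS, f.accept),
        (f.start, EPS, f.accept), (r.accept, EPS, r.start)]
    return f


def accepts(frag, word):
    """Simulate the fragment on `word` by tracking Є-closed state sets."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, label, dst in frag.edges:
                if src == s and label == EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({frag.start})
    for ch in word:
        step = {dst for src, label, dst in frag.edges
                if src in current and label == ch}
        current = closure(step)
    return frag.accept in current
```

For example, `concat(letter("a"), concat(star(letter("b")), letter("c")))` builds an automaton for a b* c. Each call creates fresh states, so a fragment should be built once per occurrence in the expression rather than reused.
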
From NFA+Є to DFA
Construction requires O(n) states for a reg-exp of length n
Running an NFA+Є with n states on a string of length m takes O(m·n²) time
Solution: determinization via subset construction
  Number of states is worst-case exponential in n
  Running time O(m)

Subset construction
For an NFA+Є with states M = {s1, …, sk}
Construct a DFA with one state per set of states of the corresponding NFA
  M' = { [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], … }
Simulate transitions between individual states for every letter
  NFA+Є: s1 -a-> s2 and s4 -a-> s7
  DFA: [s1,s4] -a-> [s2,s7]
Extend macro states by states reachable via Є transitions
  NFA+Є: s1 -Є-> s4
  DFA: [s1,s2] becomes [s1,s2,s4]

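Both rules above (simulate every letter, then extend by Є-reachable states) fit into a short worklist algorithm. The sketch below builds only the macro states reachable from the start, so the empty macro state [] is left implicit as a missing transition. The function names and the dictionary encoding of the automaton are assumptions of this example.

```python
from collections import deque

EPS = ""  # label for Є transitions (convention of this sketch)


def eps_closure(delta, states):
    """Extend a macro state by everything reachable via Є transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(delta, start, accepting, alphabet):
    """Return (dfa_delta, dfa_start, dfa_accepting) built from an NFA+Є."""
    dfa_start = eps_closure(delta, {start})
    dfa_delta, seen, work = {}, {dfa_start}, deque([dfa_start])
    while work:
        macro = work.popleft()
        for letter in alphabet:
            moved = {t for s in macro for t in delta.get((s, letter), ())}
            if not moved:
                continue  # no NFA state can move: the DFA has no edge here
            target = eps_closure(delta, moved)
            dfa_delta[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_accepting = {m for m in seen if m & set(accepting)}
    return dfa_delta, dfa_start, dfa_accepting


# Hypothetical NFA+Є combining the two rules from the slides:
# s1 -a-> s2, s4 -a-> s7, and s1 -Є-> s4.
delta = {("s1", "a"): {"s2"}, ("s4", "a"): {"s7"}, ("s1", EPS): {"s4"}}
dfa_delta, dfa_start, dfa_accepting = subset_construction(delta, "s1", {"s7"}, "a")
```

In this run the DFA start state is [s1,s4] (s1 extended by its Є successor), and its a-transition leads to [s2,s7].
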
Scanning challenges
Regular expressions allow us to define the language of all sequences of tokens
Automata theory provides an algorithm for checking membership of words
  But we are interested in splitting the text, not just deciding on membership
How do we determine lexemes?
How do we handle ambiguities: lexemes matching more than one token?

Separating lexemes
ID = (a+b+…+z) (a+b+…+z)*   (here + denotes union)
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Figure: automaton. From the start state, a letter a-z leads to an accepting state labeled ID with a loop on a-z; the digit 1 leads to an accepting state labeled ONE]

Maximal munch
Solution: find the longest matching lexeme
  Keep reading text until the automaton leaves the accepting state
  Return the token corresponding to the accepting state
  Reset: go back to the start state and continue reading input from there

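The maximal munch loop can be sketched as follows for the two rules ID and ONE. The state names and the `step` function are a hypothetical encoding of the automaton. The sketch also remembers the last accepting position, a common generalization that is useful when the automaton passes through non-accepting states before getting stuck.

```python
import string

# Hypothetical DFA for ID = letter letter* and ONE = 1.
START, IN_ID, IN_ONE = "start", "id", "one"
ACCEPTING = {IN_ID: "ID", IN_ONE: "ONE"}


def step(state, ch):
    """Return the next DFA state, or None when there is no transition."""
    if ch in string.ascii_lowercase and state in (START, IN_ID):
        return IN_ID
    if ch == "1" and state == START:
        return IN_ONE
    return None


def scan(text):
    """Split text into (kind, lexeme) pairs using maximal munch."""
    tokens, pos = [], 0
    while pos < len(text):
        state, i = START, pos
        last_kind, last_end = None, pos
        # Keep reading while a transition exists, remembering the last accepting state.
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPTING:
                last_kind, last_end = ACCEPTING[state], i
        if last_kind is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((last_kind, text[pos:last_end]))
        pos = last_end  # reset: restart from the start state after the lexeme
    return tokens
```

On the input `abb1` the scanner reads a, b, b, gets stuck on 1, emits ID(abb), and then restarts to emit ONE.
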
Handling ambiguities
ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Figure: NFA with an ID branch (a-z, then a loop on a-z) and an IF branch (i, then f)]
[Figure: DFA. From the start state, i leads to a state accepting ID, and f from there leads to a state accepting both IF and ID; every other letter (a-z\i from the start, a-z\f after i, a-z elsewhere) leads to a state accepting only ID]
Solution: break the tie using the order of definitions
  With ID defined before IF, the output is ID(if)
  With IF defined before ID, the output is IF
Conclusion: list keyword token definitions before the identifier definition

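The tie-breaking rule can be demonstrated with a small scanner built on Python's `re` module instead of a hand-made DFA. `make_scanner` is a hypothetical helper: at each position it tries every rule, keeps the longest match, and on equal length keeps the rule listed first.

```python
import re


def make_scanner(rules):
    """rules: list of (kind, regex) in priority order; the earliest rule wins ties."""
    compiled = [(kind, re.compile(rx)) for kind, rx in rules]

    def scan(text):
        tokens, pos = [], 0
        while pos < len(text):
            best_kind, best_len = None, 0
            for kind, rx in compiled:
                m = rx.match(text, pos)
                # Strictly longer matches win; on equal length the earlier rule is kept.
                if m and len(m.group()) > best_len:
                    best_kind, best_len = kind, len(m.group())
            if best_kind is None:
                raise ValueError(f"no token matches at position {pos}")
            tokens.append((best_kind, text[pos:pos + best_len]))
            pos += best_len
        return tokens

    return scan


keywords_first = make_scanner([("IF", r"if"), ("ID", r"[a-z]+")])
identifiers_first = make_scanner([("ID", r"[a-z]+"), ("IF", r"if")])
```

With these two scanners, `keywords_first("if")` yields an IF token and `identifiers_first("if")` yields ID(if). Maximal munch still takes precedence over rule order, so `keywords_first("iffy")` is a single ID.
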
Implementing scanners in practice

Implementing scanners
Manual construction of automata + determinization is
  Very tedious
  Error-prone
  Non-incremental
Fortunately there are tools that automatically generate code from a specification for most languages
  C: Lex, Flex
  Java: JLex, JFlex

Using JFlex
Define tokens (and states)
Run JFlex to generate a Java implementation
Usually MyScanner.nextToken() will be called in a loop by the parser
[Figure: regular expressions (MyScanner.lex) → JFlex → MyScanner.java; MyScanner.java turns a stream of characters into tokens]

Common format for reg-exps

| Pattern | Matching |
|---|---|
| Basic patterns | |
| x | The character x |
| . | Any character, usually except a new line |
| [xyz] | Any of the characters x, y, z |
| Repetition operators | |
| R? | An R or nothing (= optionally an R) |
| R* | Zero or more occurrences of R |
| R+ | One or more occurrences of R |
| Composition operators | |
| R1R2 | An R1 followed by R2 |
| R1\|R2 | Either an R1 or R2 |
| Grouping | |
| (R) | R itself |

Escape characters
What is the expression for one or more + symbols?
  (+)+ won’t work
  (\+)+ will
A backslash \ before an operator turns it into a standard character
  \*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t

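The same escaping rule can be tried out in Python's `re` module, whose operator syntax is close to the common format above. This illustrates the idea only and says nothing about JFlex's exact behavior; `compiles` is a hypothetical helper.

```python
import re


def compiles(pattern: str) -> bool:
    """True when Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# One or more literal + symbols: the inner + is escaped, the outer + is the operator.
one_or_more_plus = re.compile(r"(\+)+")

# re.escape inserts the backslash for us.
escaped_plus = re.escape("+")
```

Here `compiles("(+)+")` is false because the inner + has nothing to repeat, while `one_or_more_plus` matches strings such as `+++`.
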
Shorthands
Use names for expressions
  letter = a | b | … | z | A | B | … | Z
  letter_ = letter | _
  digit = 0 | 1 | 2 | … | 9
  id = letter_ (letter_ | digit)*
Use hyphen to denote a range
  letter = a-z | A-Z
  digit = 0-9

Catching errors
What if the input doesn’t match any token definition?
Trick: add a "catch-all" rule that matches any character and reports an error
  Add it after all other rules

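A catch-all rule can be sketched with Python's `re` module by placing a single `.` alternative after all real rules. The token kinds (`NUM`, `ID`, `WS`, `ERROR`) and the error message format are assumptions of this example. Python's alternation picks the first alternative that matches, which is why the catch-all must come last.

```python
import re

# Rules in priority order; the catch-all ERROR rule comes last.
RULES = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[a-z_][a-z_0-9]*"),
    ("WS", r"[ \t\n]+"),
    ("ERROR", r"."),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{rx})" for kind, rx in RULES))


def scan(text):
    """Return (tokens, errors); whitespace is consumed, bad characters are reported."""
    tokens, errors = [], []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "ERROR":
            errors.append(f"unexpected character {m.group()!r} at offset {m.start()}")
        elif kind != "WS":
            tokens.append((kind, m.group()))
    return tokens, errors
```

Because the error rule consumes exactly one character, scanning continues after the offending character and can report several errors in one pass.
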
Next lecture: parsing