02. chapter 3 - lexical analysis
TRANSCRIPT
![Page 1: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/1.jpg)
Chapter 3Lexical Analysis
![Page 2: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/2.jpg)
OutlineRole of lexical analyzerSpecification of tokensRecognition of tokensLexical analyzer generatorFinite automataDesign of lexical analyzer generator
![Page 3: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/3.jpg)
The role of lexical analyzer
Lexical Analyzer ParserSource
program
token
getNextToken
Symboltable
To semanticanalysis
![Page 4: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/4.jpg)
Why to separate Lexical analysis and parsing1. Simplicity of design 2. Improving compiler efficiency3. Enhancing compiler portability
![Page 5: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/5.jpg)
Tokens, Patterns and LexemesA token is a pair a token name and an
optional token valueA pattern is a description of the form that the
lexemes of a token may takeA lexeme is a sequence of characters in the
source program that matches the pattern for a token
![Page 6: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/6.jpg)
ExampleToken Informal description Sample lexemes
ifelse
comparison
idnumberliteral
Characters i, fCharacters e, l, s, e
< or > or <= or >= or == or !=
Letter followed by letter and digits
Any numeric constant
Anything but “ sorrounded by “
if
else<=, !=
pi, score, D2
3.14159, 0, 6.02e23
“core dumped”
printf(“total = %d\n”, score);
![Page 7: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/7.jpg)
Attributes for tokensE = M * C ** 2
<id, pointer to symbol table entry for E><assign-op><id, pointer to symbol table entry for M><mult-op><id, pointer to symbol table entry for C><exp-op><number, integer value 2>
![Page 8: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/8.jpg)
Lexical errorsSome errors are out of power of lexical
analyzer to recognize:fi (a == f(x)) …
However it may be able to recognize errors like:d = 2r
Such errors are recognized when no pattern for tokens matches a character sequence
![Page 9: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/9.jpg)
Error recoveryPanic mode: successive characters are
ignored until we reach to a well formed tokenDelete one character from the remaining
inputInsert a missing character into the remaining
inputReplace a character by another characterTranspose two adjacent characters
![Page 10: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/10.jpg)
Input bufferingSometimes lexical analyzer needs to look
ahead some symbols to decide about the token to returnIn C language: we need to look after -, = or <
to decide what token to returnIn Fortran: DO 5 I = 1.25
We need to introduce a two buffer scheme to handle large look-aheads safely
E = M * C * * 2 eof
![Page 11: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/11.jpg)
SentinelsSwitch (*forward++) {
case eof:if (forward is at end of first buffer) {reload second buffer;forward = beginning of second buffer;}else if {forward is at end of second buffer) {reload first buffer;\forward = beginning of first buffer;}else /* eof within a buffer marks the end of input */terminate lexical analysis;break;cases for the other characters;
}
E = M eof * C * * 2 eof eof
![Page 12: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/12.jpg)
Specification of tokensIn theory of compilation regular expressions
are used to formalize the specification of tokens
Regular expressions are means for specifying regular languages
Example: Letter_(letter_ | digit)*
Each regular expression is a pattern specifying the form of strings
![Page 13: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/13.jpg)
Regular expressionsƐ is a regular expression, L(Ɛ) = {Ɛ}If a is a symbol in ∑then a is a regular
expression, L(a) = {a}(r) | (s) is a regular expression denoting the
language L(r) ∪ L(s) (r)(s) is a regular expression denoting the
language L(r)L(s)(r)* is a regular expression denoting (L9r))*(r) is a regular expression denting L(r)
![Page 14: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/14.jpg)
Regular definitionsd1 -> r1d2 -> r2…dn -> rn
Example:letter_ -> A | B | … | Z | a | b | … | Z | _digit -> 0 | 1 | … | 9id -> letter_ (letter_ | digit)*
![Page 15: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/15.jpg)
ExtensionsOne or more instances: (r)+Zero of one instances: r?Character classes: [abc]
Example:letter_ -> [A-Za-z_]digit -> [0-9]id -> letter_(letter|digit)*
![Page 16: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/16.jpg)
Recognition of tokensStarting point is the language grammar to
understand the tokens:stmt -> if expr then stmt | if expr then stmt else stmt | Ɛexpr -> term relop term | termterm -> id | number
![Page 17: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/17.jpg)
Recognition of tokens (cont.)The next step is to formalize the patterns:
digit -> [0-9]Digits -> digit+number -> digit(.digits)? (E[+-]? Digit)?letter -> [A-Za-z_]id -> letter (letter|digit)*If -> ifThen -> thenElse -> elseRelop -> < | > | <= | >= | = | <>
We also need to handle whitespaces:ws -> (blank | tab | newline)+
![Page 18: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/18.jpg)
Transition diagramsTransition diagram for relop
![Page 19: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/19.jpg)
Transition diagrams (cont.)Transition diagram for reserved words and
identifiers
![Page 20: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/20.jpg)
Transition diagrams (cont.)Transition diagram for unsigned numbers
![Page 21: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/21.jpg)
Transition diagrams (cont.)Transition diagram for whitespace
![Page 22: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/22.jpg)
Architecture of a transition-diagram-based lexical analyzerTOKEN getRelop(){
TOKEN retToken = new (RELOP)while (1) { /* repeat character processing until areturn or failure occurs */switch(state) {case 0: c= nextchar(); if (c == ‘<‘) state = 1; else if (c == ‘=‘) state = 5; else if (c == ‘>’) state = 6; else fail(); /* lexeme is not a relop */ break;case 1: ……case 8: retract(); retToken.attribute = GT; return(retToken);}
![Page 23: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/23.jpg)
Lexical Analyzer Generator - Lex
Lexical Compiler
Lex Source programlex.l
lex.yy.c
Ccompiler
lex.yy.c a.out
a.outInput stream Sequence of tokens
![Page 24: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/24.jpg)
Structure of Lex programs
declarations%%translation rules%%auxiliary functions
Pattern {Action}
![Page 25: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/25.jpg)
Example%{
/* definitions of manifest constantsLT, LE, EQ, NE, GT, GE,IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitionsdelim [ \t\n]ws {delim}+letter [A-Za-z]digit [0-9]id {letter}({letter}|{digit})*number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%{ws} {/* no action and no return */}if {return(IF);}then {return(THEN);}else {return(ELSE);}{id} {yylval = (int) installID(); return(ID); }{number} {yylval = (int) installNum();
return(NUMBER);}…
Int installID() {/* funtion to install the lexeme, whose first character is pointed to by yytext, and whose length is yyleng, into the symbol table and return a pointer thereto */
}
Int installNum() { /* similar to installID, but puts numerical constants into a separate table */
}
![Page 26: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/26.jpg)
26
Finite AutomataRegular expressions = specificationFinite automata = implementation
A finite automaton consists ofAn input alphabet A set of states SA start state nA set of accepting states F SA set of transitions state input state
![Page 27: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/27.jpg)
27
Finite AutomataTransition
s1 a s2
Is readIn state s1 on input “a” go to state s2
If end of inputIf in accepting state => accept, othewise =>
rejectIf no transition possible => reject
![Page 28: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/28.jpg)
28
Finite Automata State GraphsA state
• The start state
• An accepting state
• A transitiona
![Page 29: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/29.jpg)
29
A Simple ExampleA finite automaton that accepts only “1”
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state
1
![Page 30: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/30.jpg)
30
Another Simple ExampleA finite automaton accepting any number of 1’s followed by a single 0Alphabet: {0,1}
Check that “1110” is accepted but “110…” is not
0
1
![Page 31: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/31.jpg)
31
And Another ExampleAlphabet {0,1}What language does this recognize?
0
1
0
1
0
1
![Page 32: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/32.jpg)
32
And Another ExampleAlphabet still { 0, 1 }
The operation of the automaton is not completely defined by the inputOn input “11” the automaton could be in either
state
1
1
![Page 33: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/33.jpg)
33
Epsilon MovesAnother kind of transition: -moves
• Machine can move from state A to state B without reading input
A B
![Page 34: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/34.jpg)
34
Deterministic and Nondeterministic AutomataDeterministic Finite Automata (DFA)
One transition per input per state No -moves
Nondeterministic Finite Automata (NFA)Can have multiple transitions for one input in a
given stateCan have -moves
Finite automata have finite memoryNeed only to encode the current state
![Page 35: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/35.jpg)
35
Execution of Finite AutomataA DFA can take only one path through the
state graphCompletely determined by input
NFAs can chooseWhether to make -movesWhich of multiple transitions for a single input
to take
![Page 36: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/36.jpg)
36
Acceptance of NFAsAn NFA can get into multiple states
• Input:
0
1
1
0
1 0 1
• Rule: NFA accepts if it can get in a final state
![Page 37: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/37.jpg)
37
NFA vs. DFA (1)NFAs and DFAs recognize the same set of
languages (regular languages)
DFAs are easier to implementThere are no choices to consider
![Page 38: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/38.jpg)
38
NFA vs. DFA (2)For a given language the NFA can be simpler
than the DFA0
10
0
01
0
1
0
1
NFA
DFA
• DFA can be exponentially larger than NFA
![Page 39: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/39.jpg)
39
Regular Expressions to Finite AutomataHigh-level sketch
Regularexpressions
NFA
DFA
LexicalSpecification
Table-driven Implementation of DFA
![Page 40: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/40.jpg)
40
Regular Expressions to NFA (1)For each kind of rexp, define an NFA
Notation: NFA for rexp A
A
• For
• For input aa
![Page 41: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/41.jpg)
41
Regular Expressions to NFA (2)For AB
A B
• For A | B
A
B
![Page 42: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/42.jpg)
42
Regular Expressions to NFA (3)For A*
A
![Page 43: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/43.jpg)
43
Example of RegExp -> NFA conversionConsider the regular expression
(1 | 0)*1The NFA is
1C E0D F
B
G
A H 1I J
![Page 44: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/44.jpg)
44
Next
Regularexpressions
NFA
DFA
LexicalSpecification
Table-driven Implementation of DFA
![Page 45: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/45.jpg)
45
NFA to DFA. The TrickSimulate the NFAEach state of resulting DFA
= a non-empty subset of states of the NFAStart state
= the set of NFA states reachable through -moves from NFA start state
Add a transition S a S’ to DFA iffS’ is the set of NFA states reachable from the
states in S after seeing the input a considering -moves as well
![Page 46: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/46.jpg)
46
NFA -> DFA Example10 1
A BC
D
E
FG H I J
ABCDHI
FGABCDHI
EJGABCDHI
0
1
0
10 1
![Page 47: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/47.jpg)
47
NFA to DFA. RemarkAn NFA may be in many states at any time
How many different states ?
If there are N states, the NFA must be in some subset of those N states
How many non-empty subsets are there?2N - 1 = finitely many, but exponentially many
![Page 48: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/48.jpg)
48
ImplementationA DFA can be implemented by a 2D table T
One dimension is “states”Other dimension is “input symbols”For every transition Si a Sk define T[i,a] = k
DFA “execution”If in state Si and input a, read T[i,a] = k and
skip to state Sk
Very efficient
![Page 49: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/49.jpg)
49
Table Implementation of a DFA
S
T
U
0
1
0
10 1
0 1S T UT T UU T U
![Page 50: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/50.jpg)
50
Implementation (Cont.)NFA -> DFA conversion is at the heart of
tools such as flex or jflex
But, DFAs can be huge
In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
![Page 51: 02. Chapter 3 - Lexical Analysis](https://reader033.vdocuments.us/reader033/viewer/2022061121/546da55cb4af9fe51b8b4f26/html5/thumbnails/51.jpg)
ReadingsChapter 3 of the book