compiler-lexical analysis
DESCRIPTION
lexical analysisiTRANSCRIPT
![Page 1: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/1.jpg)
Lexical Analysis
![Page 2: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/2.jpg)
OutlineRole of lexical analyzerSpecification of tokensRecognition of tokensLexical analyzer generatorFinite automataDesign of lexical analyzer generator
![Page 3: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/3.jpg)
Compiler Construction
Lexicalanalyzer
ParserSource program
read char
put back char
pass tokenand attribute value
get next
Symbol TableRead entireprogram into
memory
id
![Page 4: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/4.jpg)
The role of lexical analyzer
Lexical Analyzer ParserSource
program
token
getNextToken
Symboltable
To semanticanalysis
![Page 5: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/5.jpg)
Other tasks…Stripping out from the source program
comment and white space in the form of blank, tab and newline characters .
Correlate error messages with source program (e.g., line number of error).
![Page 6: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/6.jpg)
Lexical analyzer scanning (simple
operations)
lexical analysis(complex)
![Page 7: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/7.jpg)
Why to separate Lexical analysis and parsing1. Simplicity of design
1. Improving compiler efficiency
1. Enhancing compiler portability
Eliminate the white space and comment lines before parsing
A large amount of time is being spent on reading the source program and generating the tokens
The representation of special or non standard symbols can be isolated in the lexical analyzer
![Page 8: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/8.jpg)
Tokens, Patterns and LexemesPattern: A rule that describes a set of stringsToken: A set of strings in the same patternLexeme: The sequence of characters of a token
![Page 9: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/9.jpg)
ExampleToken Informal description Sample lexemes
ifelse
comparison
idnumberliteral
Characters i, fCharacters e, l, s, e
< or > or <= or >= or == or !=
Letter followed by letter and digits
Any numeric constant
Anything but “ sorrounded by “
if
else<=, !=
pi, score, D2
3.14159, 0, 6.02e23
“core dumped”
![Page 10: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/10.jpg)
Compiler Construction
E = C1 * 10
Token Attribute
ID Index to symbol table entry E
=
ID Index to symbol table entry C1
*
NUM 10
![Page 11: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/11.jpg)
Lexical Error and RecoveryA character sequence that can’t be scanned into any valid
token is a lexical error.Lexical errors are uncommon, but they still must be
handled by a scanner. We won’t stop compilation because of minor error.
fi(a==f(x))... //mistyped if lex will give valid token Delete one character from the remaining input. Insert a missing character into the remaining input. Replace a character by another character. Transpose two adjacent characters from the
remaining input
![Page 12: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/12.jpg)
Input bufferingSometimes lexical analyzer needs to look
ahead some symbols to decide about the token to returnIn C language: we need to look after -, = or <
to decide what token to returnWe need to introduce a two buffer scheme to
handle large look-aheads safely
E = M * C * * 2 eof
![Page 13: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/13.jpg)
Compiler Construction
Specification of TokensRegular expressions are an important notation for
specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying
those types of patterns that we actually need for tokens.
![Page 14: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/14.jpg)
Specification of tokensIn theory of compilation regular expressions
are used to formalize the specification of tokens
Regular expressions are means for specifying regular languages
Example: Letter(letter| digit)*
Each regular expression is a pattern specifying the form of strings
![Page 15: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/15.jpg)
Regular expressionsƐ is a regular expression, L(Ɛ) = {Ɛ}If a is a symbol in ∑then a is a regular
expression, L(a) = {a}(r) | (s) is a regular expression denoting the
language L(r) ∪ L(s) (r)(s) is a regular expression denoting the
language L(r)L(s)(r)* is a regular expression denoting (L(r))*
R* = R concatenated with itself 0 or more times
= {ε} ∪ R ∪ RR ∪ RRR ∪(r) is a regular expression denoting L(r)
![Page 16: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/16.jpg)
ExtensionsOne or more instances: (r)+Zero of one instances: r?Character classes: [ abc ]
Example:letter_ -> [A-Z , a-z_]digit -> [0-9]id -> letter(letter | digit)*
![Page 17: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/17.jpg)
Regular definitionsd1 -> r1
d2 -> r2
…d n -> r n
Example:letter -> A | B | … | Z | a | b | … | z | _digit -> 0 | 1 | … | 9id -> letter(letter| digit)*
![Page 18: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/18.jpg)
Operations on Languages
Example: Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z ) and let D
be the set of digits {0,1,.. .9). L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. other languages can be constructed from L and D, using the operators illustrated above
![Page 19: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/19.jpg)
Compiler Construction
Terms for Parts of Strings
![Page 20: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/20.jpg)
Specification of TokensAlgebraic laws of regular expressions 1) |= |
2) |(|)=(|)| () =( ) 3) (| )= | (|)= | 4) = = 5)(*)*=*
6) *= + | + = *7) (|)*= (* | *)*= (* *)*
![Page 21: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/21.jpg)
Recognition of TokensTask of recognition of token in a lexical
analyzer Isolate the lexeme for the next token in the
input buffer Produce as output a pair consisting of the
appropriate token and attribute-value, such as <id,pointer to table entry> , using the translation table given in the Fig in next page
![Page 22: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/22.jpg)
Recognition of TokensTask of recognition of token in a lexical
analyzer
Regular expression
Token Attribute-value
if if -id id Pointer to
table entry< relop LT
![Page 23: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/23.jpg)
Recognition of TokensMethods to recognition of token
Use Transition Diagram
![Page 24: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/24.jpg)
Recognition of tokensStarting point is the language grammar to
understand the tokens:Grammar for branching statement
stmt -> if expr then stmt | if expr then stmt else stmt | Ɛexpr -> term relop term | termterm -> id | number
![Page 25: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/25.jpg)
Recognition of tokens (cont.)The next step is to formalize the patterns:
digit -> [0-9]digits -> digit+number -> digits(.digits)? (E[+|-]? digits)?letter -> [A-Z a-z_]id -> letter (letter | digit)*If -> ifThen -> thenElse -> elseRelop -> < | > | <= | >= | = | <>
We also need to handle whitespaces:ws -> (blank | tab | newline)+
![Page 26: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/26.jpg)
Compiler Construction
Ex :RELOP = < | <= | = | <> | > | >=
0
1
5
6
2
3
4
7
8
start<
=
=
=
>
>
other
other
return(relop,LE)
return(relop,NE)
return(relop,LT)
return(relop,GE)
return(relop,GT)
return(relop,EQ)
#
# # indicates input retraction
![Page 27: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/27.jpg)
Compiler Construction
Ex2: ID = letter(letter | digit) *
9 10 11start letter
return(id)
# indicates input retraction
other #
letter or digitTransition Diagram:
![Page 28: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/28.jpg)
Transition diagrams (cont.)Transition diagram for whitespace
![Page 29: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/29.jpg)
9 10 11start letter
return(id)other
letter or digit
switch (state) {case 9:
if (isletter( c) ) state = 10; else state = failure();break;
case 10: c = nextchar(); if (isletter( c) || isdigit( c) ) state = 10; else state 11case 11: retract(1); insert(id); return;
![Page 30: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/30.jpg)
Architecture of a transition-diagram-based lexical analyzerTOKEN getRelop(){
TOKEN retToken = new (RELOP)while (1) { /* repeat character processing until a
return or failure occurs */switch(state) {
case 0: c= nextchar(); if (c == ‘<‘) state = 1; else if (c == ‘=‘) state = 5; else if (c == ‘>’) state = 6; else fail(); /* lexeme is not a relop */ break;
case 1: ……case 8: retract();
retToken.attribute = GT; return(retToken);
}
![Page 31: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/31.jpg)
Lexical Analyzer Generator - Lex
Lexical Compiler
Lex Source programlex.l
lex.yy.c
Ccompiler
lex.yy.c a.out
a.outInput stream Sequence of tokens
![Page 32: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/32.jpg)
Structure of Lex programs
declarations%%translation rules%%auxiliary functions
Pattern {Action}
![Page 33: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/33.jpg)
Example%{
/* definitions of manifest constantsLT, LE, EQ, NE, GT, GE,IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitionsdelim [ \t\n]ws {delim}+letter [A-Za-z]digit [0-9]id {letter}({letter}|{digit})*number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%{ws} {/* no action and no return */}if {return(IF);}then {return(THEN);}else {return(ELSE);}{id} {yylval = (int) installID(); return(ID); }{number} {yylval = (int) installNum();
return(NUMBER);}…
Int installID() {/* funtion to install the lexeme, whose first character is pointed to by yytext, and whose length is yyleng, into the symbol table and return a pointer thereto */
}
Int installNum() { /* similar to installID, but puts numerical constants into a separate table */
}
![Page 34: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/34.jpg)
35
Finite AutomataRegular expressions = specificationFinite automata = implementation
A finite automaton consists ofAn input alphabet A set of states SA start state nA set of accepting states F SA set of transitions state input state
![Page 35: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/35.jpg)
36
Finite AutomataTransition
s1 a s2
Is readIn state s1 on input “a” go to state s2
If end of inputIf in accepting state => accept, othewise =>
rejectIf no transition possible => reject
![Page 36: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/36.jpg)
37
Finite Automata State GraphsA state
• The start state
• An accepting state
• A transitiona
![Page 37: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/37.jpg)
38
A Simple ExampleA finite automaton that accepts only “1”
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state
1
![Page 38: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/38.jpg)
39
Another Simple ExampleA finite automaton accepting any number of
1’s followed by a single 0Alphabet: {0,1}
Check that “1110” is accepted but “110…” is not
0
1
![Page 39: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/39.jpg)
40
And Another ExampleAlphabet {0,1}What language does this recognize?
0
1
0
1
0
1
![Page 40: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/40.jpg)
41
And Another ExampleAlphabet still { 0, 1 }
The operation of the automaton is not completely defined by the inputOn input “11” the automaton could be in either
state
1
1
![Page 41: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/41.jpg)
42
Epsilon MovesAnother kind of transition: -moves
• Machine can move from state A to state B without reading input
A B
![Page 42: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/42.jpg)
43
Deterministic and Nondeterministic AutomataDeterministic Finite Automata (DFA)
One transition per input per state No -moves
Nondeterministic Finite Automata (NFA)Can have multiple transitions for one input in a
given stateCan have -moves
Finite automata have finite memoryNeed only to encode the current state
![Page 43: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/43.jpg)
44
Execution of Finite AutomataA DFA can take only one path through the
state graphCompletely determined by input
NFAs can chooseWhether to make -movesWhich of multiple transitions for a single input
to take
![Page 44: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/44.jpg)
45
Acceptance of NFAsAn NFA can get into multiple states
• Input:
0
1
1
0
1 0 1
• Rule: NFA accepts if it can get in a final state
![Page 45: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/45.jpg)
46
NFA vs. DFA (1)NFAs and DFAs recognize the same set of
languages (regular languages)
DFAs are easier to implementThere are no choices to consider
![Page 46: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/46.jpg)
47
NFA vs. DFA (2)For a given language the NFA can be simpler
than the DFA0
10
0
01
0
1
0
1
NFA
DFA
• DFA can be exponentially larger than NFA
![Page 47: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/47.jpg)
48
Regular Expressions to Finite AutomataHigh-level sketch
Regularexpressions
NFA
DFA
LexicalSpecification
Table-driven Implementation of DFA
![Page 48: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/48.jpg)
49
Regular Expressions to NFA (1)For each kind of rexp, define an NFA
Notation: NFA for rexp A
A
• For
• For input aa
![Page 49: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/49.jpg)
50
Regular Expressions to NFA (2)For AB
A B
• For A | B
A
B
![Page 50: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/50.jpg)
51
Regular Expressions to NFA (3)For A*
A
![Page 51: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/51.jpg)
52
Example of RegExp -> NFA conversionConsider the regular expression
(1 | 0)*1The NFA is
1C E0D F
B
G
A H 1I J
![Page 52: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/52.jpg)
53
Next
Regularexpressions
NFA
DFA
LexicalSpecification
Table-driven Implementation of DFA
![Page 53: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/53.jpg)
54
NFA to DFA. The TrickSimulate the NFAEach state of resulting DFA
= a non-empty subset of states of the NFAStart state
= the set of NFA states reachable through -moves from NFA start state
Add a transition S a S’ to DFA iffS’ is the set of NFA states reachable from the
states in S after seeing the input a considering -moves as well
![Page 54: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/54.jpg)
55
NFA -> DFA Example10 1
A BC
D
E
FG H I J
ABCDHI
FGABCDHI
EJGABCDHI
0
1
0
10 1
![Page 55: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/55.jpg)
56
NFA to DFA. RemarkAn NFA may be in many states at any time
How many different states ?
If there are N states, the NFA must be in some subset of those N states
How many non-empty subsets are there?2N - 1 = finitely many, but exponentially many
![Page 56: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/56.jpg)
57
ImplementationA DFA can be implemented by a 2D table T
One dimension is “states”Other dimension is “input symbols”For every transition Si a Sk define T[i,a] = k
DFA “execution”If in state Si and input a, read T[i,a] = k and
skip to state Sk
Very efficient
![Page 57: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/57.jpg)
58
Table Implementation of a DFA
S
T
U
0
1
0
10 1
0 1S T UT T UU T U
![Page 58: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/58.jpg)
59
Implementation (Cont.)NFA -> DFA conversion is at the heart of
tools such as flex or jflex
But, DFAs can be huge
In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
![Page 59: compiler-lexical analysis](https://reader036.vdocuments.us/reader036/viewer/2022062223/577cc6fa1a28aba7119fb56b/html5/thumbnails/59.jpg)
ReadingsChapter 3 of the book