Lexical Analysis
Natawut Nupairoj, Ph.D.
Department of Computer Engineering
Chulalongkorn University
Outline
- Overview
- Token, Lexeme, and Pattern
- Lexical Analysis Specification
- Lexical Analysis Engine
Front-End Components
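[Figure: the front-end pipeline. The source program (text stream) enters the Scanner, which groups characters into tokens. The Parser sends next-token requests to the Scanner, receives a token each time, and constructs the parse tree. The Semantic Analyzer takes the parse tree, performs semantic/contextual checks, and produces the Intermediate Representation (file or in memory). A Symbol Table is shown alongside these components. Example: for the input "m a i n ( ) {", the Scanner returns identifier main, then symbol (.]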
Tasks for Scanner
- Read the input and group characters into tokens for the parser.
- Strip comments and white space.
- Count line numbers.
- Create entries in the symbol table.
- Perform preprocessing functions.
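The stripping and line-counting tasks can be sketched as follows. This is only an illustration under stated assumptions: comments are taken to be `//` line comments (the slides do not fix a comment syntax), and the function name `strip_and_count` is hypothetical. It yields whitespace-separated chunks rather than real tokens.

```python
def strip_and_count(source):
    """Yield (line_number, chunk) pairs, skipping white space and comments.

    Assumption: a comment starts with // and runs to the end of the line.
    A chunk is a maximal run of visible characters, not yet a real token.
    """
    line = 1
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":                      # count line numbers
            line += 1
            i += 1
        elif ch in " \t\r":                 # strip white space
            i += 1
        elif source.startswith("//", i):    # strip a comment up to the newline
            while i < n and source[i] != "\n":
                i += 1
        else:                               # collect one chunk
            start = i
            while (i < n and source[i] not in " \t\r\n"
                   and not source.startswith("//", i)):
                i += 1
            yield line, source[start:i]


program = "main() {  // entry point\n  x = 1\n}\n"
for line_number, chunk in strip_and_count(program):
    print(line_number, chunk)
```

The parser never sees the comment or the blanks, which is the "simpler design" benefit: only the chunks and their line numbers are passed on.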
Benefits
- Simpler design: the parser doesn't need to worry about comments and white space.
- More efficient scanner: the scanning process can be optimized on its own, using specialized buffering techniques.
- Portability: the scanner handles standard symbols on different platforms.
Basic Terminology
Token: a set of strings. Ex: token = identifier
Lexeme: a sequence of characters in the source program matched by the pattern for a token. Ex: lexeme = counter
Pattern: a description of the strings that can belong to a particular token set. Ex: pattern = a letter followed by letters or digits:
{A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*
| Token | Lexeme | Pattern |
| --- | --- | --- |
| const | const | const |
| if | if | if |
| relation | <, <=, …, >= | comparison symbols |
| id | counter, x, y | letter (letter \| digit)* |
| num | 12.53, 1.42E-10 | any numeric constant |
| literal | “Hello World” | characters between “ and ” |
Language and Lexical Analysis
- Fixed-format input, e.g. FORTRAN: the scanner must consider the alignment of a lexeme, which makes the input difficult to scan.
- No reserved words, e.g. PL/I: is a lexeme a keyword or an id? The rules are complex.
  if if = then then then := else; else else := then;
Regular Expression Revisited
- ε is a regular expression that denotes {ε}.
- If a is a symbol in the alphabet Σ, then a is a regular expression that denotes {a}.
- Suppose r and s are regular expressions. Then:
  - (r)|(s) denotes L(r) ∪ L(s).
  - (r)(s) denotes L(r)L(s).
  - (r)* denotes (L(r))*.
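The three language operations can be tried out on small finite languages. The sketch below is illustrative only: the helper names `union`, `concat`, and `star` are hypothetical, and since L* is infinite, `star` approximates it by capping the number of repetitions.

```python
from itertools import product


def union(l1, l2):
    """L(r) U L(s): strings that belong to either language."""
    return l1 | l2


def concat(l1, l2):
    """L(r)L(s): every string of l1 followed by every string of l2."""
    return {x + y for x, y in product(l1, l2)}


def star(l, max_reps):
    """Approximate (L(r))* with at most max_reps repetitions."""
    result = {""}       # zero repetitions gives the empty string (epsilon)
    current = {""}
    for _ in range(max_reps):
        current = concat(current, l)
        result |= current
    return result


L_a = {"a"}
L_b = {"b"}
print(sorted(union(L_a, L_b)))    # ['a', 'b']
print(sorted(concat(L_a, L_b)))   # ['ab']
print(sorted(star(L_a, 3)))       # ['', 'a', 'aa', 'aaa']
```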
Precedence of Operators
Levels of precedence, from highest to lowest:
- Kleene closure (*)
- concatenation
- union (|)
All operators are left associative. Ex: a*b | cd* = ((a*)b) | (c(d*))
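As a quick check of the precedence rule, the sketch below compares the unparenthesized expression with its fully parenthesized form using Python's `re` module, whose operators follow the same precedence. The sample strings are arbitrary choices for illustration.

```python
import re

implicit = re.compile(r"a*b|cd*")            # relies on operator precedence
explicit = re.compile(r"((a*)b)|(c(d*))")    # grouping written out in full

samples = ["b", "aab", "c", "cdd", "ab", "abcd", "", "a", "d"]

# Both forms must accept exactly the same sample strings.
for s in samples:
    assert bool(implicit.fullmatch(s)) == bool(explicit.fullmatch(s))

accepted = [s for s in samples if implicit.fullmatch(s)]
print(accepted)   # ['b', 'aab', 'c', 'cdd', 'ab']
```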
Regular Definition
A sequence of definitions:
d1 → r1
d2 → r2
...
dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in Σ ∪ {d1, …, di-1}.
Examples
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
digits → digit digit*
opt_fraction → . digits | ε
opt_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits opt_fraction opt_exponent
Notational Shorthands
- One or more instances: r+ = rr*
- Zero or one instance: r? = r | ε, so (rs)? = rs | ε
- Character class: [A-Za-z] = A | B | … | Z | a | b | … | z
Examples
digit → [0-9]
digits → digit+
opt_fraction → ( . digits )?
opt_exponent → ( E ( + | - )? digits )?
num → digits opt_fraction opt_exponent
id → [A-Za-z][A-Za-z0-9]*
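These regular definitions map almost directly onto Python's `re` syntax. The sketch below builds num and id from the smaller definitions; the constant names and the helpers `is_num` and `is_id` are hypothetical names chosen for illustration.

```python
import re

# Each definition is built from the ones before it, as in a regular definition.
DIGITS = r"[0-9]+"                          # digits -> digit+
OPT_FRACTION = rf"(\.{DIGITS})?"            # opt_fraction -> ( . digits )?
OPT_EXPONENT = rf"(E[+-]?{DIGITS})?"        # opt_exponent -> ( E (+|-)? digits )?
NUM = re.compile(DIGITS + OPT_FRACTION + OPT_EXPONENT)
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")    # id -> [A-Za-z][A-Za-z0-9]*


def is_num(lexeme):
    """True if the whole lexeme matches the num pattern."""
    return NUM.fullmatch(lexeme) is not None


def is_id(lexeme):
    """True if the whole lexeme matches the id pattern."""
    return ID.fullmatch(lexeme) is not None


for lexeme in ["12.53", "1.42E-10", "31.", "counter", "1x"]:
    print(lexeme, is_num(lexeme), is_id(lexeme))
```

Note that "31." is rejected as a num: once the dot is present, the pattern requires at least one digit after it.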
Recognition of Tokens
- Consider the tokens from the grammar. For each one, identify its token, pattern, and attribute.
- Draw NFAs with retracting options.
Example : Grammar
stmt ::= if expr then stmt
| if expr then stmt else stmt
| expr
expr ::= term relop term
| term
term ::= id | num
Example : Regular Definition
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
delim → blank | tab | newline
ws → delim+
Example: Pattern-Token-Attribute
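| Regular Expression | Token | Attribute-Value |
| --- | --- | --- |
| ws | - | - |
| if | if | - |
| then | then | - |
| else | else | - |
| id | id | Index in table |
| num | num | Index in table |
| < | relop | LT |
| <= | relop | LE |
| = | relop | EQ |
| <> | relop | NE |
| ... | ... | .. |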
Attributes for Tokens
if count >= 0 then ...
<if, >
<id, index for count in symbol table>
<relop, GE>
<num, integer value 0>
<then, >
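The token stream above can be reproduced with a small regex-driven tokenizer. This is a sketch under assumptions, not the scanner from the slides: the symbol table is a plain Python list with 0-based indices, keywords are checked against a fixed set, and the names `tokenize`, `TOKEN_SPEC`, and `RELOPS` are hypothetical. The num attribute is the numeric value, as in this example.

```python
import re

RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}
KEYWORDS = {"if", "then", "else"}

# Two-character operators are listed first so that "<=" is not split into "<" and "=".
TOKEN_SPEC = [
    ("ws",    r"[ \t\n]+"),
    ("num",   r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
    ("relop", r"<=|<>|>=|<|=|>"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text, symbols):
    """Return a list of (token, attribute) pairs; symbols is updated in place."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                                  # white space yields no token
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))             # keyword: no attribute
        elif kind == "id":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append(("id", symbols.index(lexeme)))
        elif kind == "num":
            is_real = "." in lexeme or "E" in lexeme
            tokens.append(("num", float(lexeme) if is_real else int(lexeme)))
        else:
            tokens.append(("relop", RELOPS[lexeme]))
    return tokens


symbols = []
print(tokenize("if count >= 0 then", symbols))
# [('if', None), ('id', 0), ('relop', 'GE'), ('num', 0), ('then', None)]
```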
NFA – Lexical Analysis Engine
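[Figure: transition diagram for relop, with states 0 to 8. Edges are labeled <, =, >, and other. The accepting states return (relop, LE), (relop, EQ), (relop, NE), (relop, LT), (relop, GE), and (relop, GT). The two states reached on "other" are marked *, meaning the input is retracted by one character.]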
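A transition diagram for relop can be coded by hand as a small state machine. The sketch below is an assumption-laden illustration: the state numbering (0 as the start state, 1 after `<`, 6 after `>`, with 4 and 8 as the retracting states) follows the usual textbook layout rather than anything confirmed here, and the function name `scan_relop` is hypothetical.

```python
def scan_relop(text, pos=0):
    """Return ((token, attribute), new_pos), or (None, pos) if no relop starts at pos."""
    state = 0
    i = pos
    while True:
        ch = text[i] if i < len(text) else ""    # "" stands for end of input
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("relop", "EQ"), i + 1    # accepting state after "="
            elif ch == ">":
                state = 6
            else:
                return None, pos                 # not a relational operator
            i += 1
        elif state == 1:                         # "<" has been read
            if ch == "=":
                return ("relop", "LE"), i + 1
            if ch == ">":
                return ("relop", "NE"), i + 1
            return ("relop", "LT"), i            # "other": retract, do not consume ch
        else:                                    # state 6: ">" has been read
            if ch == "=":
                return ("relop", "GE"), i + 1
            return ("relop", "GT"), i            # "other": retract, do not consume ch


print(scan_relop("<= 0"))   # (('relop', 'LE'), 2)
print(scan_relop("<5"))     # (('relop', 'LT'), 1)
```

Retraction shows up in the returned position: for "<5" the scanner looked at "5" to decide on LT, but the next scan still starts at index 1.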
Handle Numbers
The pattern for numbers contains optional parts:
num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
Ex: 31, 31.02, 31.02E-15
Always get the longest possible match: try the longest pattern first; if it does not match, try the next possible pattern.
Handle Numbers
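[Figure: transition diagrams for num, tried from the longest pattern to the shortest. States 12 to 19 recognize a number with an exponent (edges labeled digit, ".", E, and "+ or -"); states 20 to 24 recognize a number with a fraction; states 25 to 27 recognize an integer. Each diagram ends on "other" in a state marked * (retract one character) and returns (num, getnum()).]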
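The "longest pattern first" rule can be sketched with three patterns tried in order. The split into exactly these three patterns is an assumption about how the diagrams are divided, and `scan_num` is a hypothetical name; it returns the lexeme rather than a converted value.

```python
import re

# Ordered from the longest form to the shortest; the first match wins.
NUM_PATTERNS = [
    re.compile(r"[0-9]+(?:\.[0-9]+)?E[+-]?[0-9]+"),   # number with an exponent
    re.compile(r"[0-9]+\.[0-9]+"),                    # number with a fraction
    re.compile(r"[0-9]+"),                            # integer
]


def scan_num(text, pos=0):
    """Return (lexeme, new_pos) for the longest number at pos, or (None, pos)."""
    for pattern in NUM_PATTERNS:
        m = pattern.match(text, pos)
        if m:
            return m.group(), m.end()
    return None, pos


for source in ["31", "31.02", "31.02E-15", "31.02E", "31.x"]:
    print(source, scan_num(source))
```

For "31.02E" the exponent pattern fails because no digits follow E, so the scanner falls back to the fraction pattern and returns "31.02", leaving "E" for the next token.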
Handle Keywords
Two approaches:
- Encode the keywords into an NFA (if, then, etc.): the NFA becomes complex (too many states).
- Use the symbol table: simple, but requires some tricks.
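[Figure: transition diagram for identifiers and keywords, states 9 to 11. State 9 moves to state 10 on a letter; state 10 loops on a letter or digit and moves to state 11 on "other". State 11 is marked * and returns (gettoken(), install_id()).]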
Handle Keywords
Symbol table contains both lexeme and token type.
Initialize symbol table with all keywords and corresponding token types.
lexeme: if token type: if
lexeme: then token type: then
lexeme: else token type: else
Handle Keywords
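[Figure: the Scanner and the Parser share the Symbol Table, whose columns are Lexeme, Token Type, and so on. Initial contents: entries 1 to 3 hold the keywords if, then, and else, each with a token type equal to its lexeme; entries 4 and 5 are empty.]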
Handle Keywords
gettoken():
- If the id is not found in the table, return token type ID.
- Otherwise, return the token type from the table.
Handle Keywords
[Figure: the Scanner reads the lexeme "if" from the source program (text stream) "i f c o u n t < =" and calls gettoken. The lexeme is found in the Symbol Table, so gettoken returns token type if.]
Handle Keywords
install_id():
- If the id is not found in the table, it's a new id. Insert the new id into the table and return a pointer to the new entry.
- If the id is found and its type is ID, return a pointer to that entry.
- Otherwise, it's a keyword. Return 0.
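The rules for gettoken() and install_id() can be sketched as a small class. This is a minimal illustration under assumptions: the class name `SymbolTable`, the use of a list index as the "pointer", and the unused entry 0 (so that 0 can signal a keyword) are choices made here, not details taken from the slides.

```python
class SymbolTable:
    """Symbol table preloaded with keywords; entries are (lexeme, token_type)."""

    def __init__(self, keywords):
        self.entries = [None]     # entry 0 is unused: install_id returns 0 for keywords
        self.index = {}           # lexeme -> entry number
        for kw in keywords:
            self._insert(kw, kw)  # a keyword's token type is the keyword itself

    def _insert(self, lexeme, token_type):
        self.entries.append((lexeme, token_type))
        self.index[lexeme] = len(self.entries) - 1
        return self.index[lexeme]

    def gettoken(self, lexeme):
        """Token type from the table, or id if the lexeme is not in the table."""
        i = self.index.get(lexeme)
        return "id" if i is None else self.entries[i][1]

    def install_id(self, lexeme):
        """Entry number for an id (inserting it if new), or 0 for a keyword."""
        i = self.index.get(lexeme)
        if i is None:
            return self._insert(lexeme, "id")
        if self.entries[i][1] == "id":
            return i
        return 0


table = SymbolTable(["if", "then", "else"])
for lexeme in ["if", "count", "count"]:
    token_type = table.gettoken(lexeme)    # look up the token type first
    attribute = table.install_id(lexeme)   # then install (or find) the entry
    print(lexeme, token_type, attribute)
# if if 0
# count id 4
# count id 4
```

With the three keywords occupying entries 1 to 3, the first new identifier lands in entry 4, and a repeated identifier gets the same entry back.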
Handle Keywords
[Figure: the Scanner calls install_id for the lexeme "if". The matching entry is a keyword, so install_id returns 0, and the Scanner answers the Parser's next-token request with the token <if, 0>.]
Handle Keywords
[Figure: the Scanner reads the next lexeme "count" from the source program and calls gettoken. The lexeme is not found in the Symbol Table, so gettoken returns token type id.]
Handle Keywords
[Figure: the Scanner calls install_id for the lexeme "count". A new entry with lexeme count and token type id is inserted as entry 4 of the Symbol Table. install_id returns 4, and the Scanner answers the Parser's next-token request with the token <id, 4>.]