lexical analysis (tokenizing)morin/teaching/3002/notes/lex.pdf · lexical analysis (tokenizing)...
TRANSCRIPT
![Page 1: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/1.jpg)
Lexical Analysis (Tokenizing)
COMP 3002
School of Computer Science
![Page 2: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/2.jpg)
2
List of Acronyms• RE - regular expression
• FSM - finite state machine
• NFA - non-deterministic finite automata
• DFA - deterministic finite automata
![Page 3: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/3.jpg)
3
The Structure of a Compiler
syntacticanalyzer
codegenerator
programtext
interm.rep.
machinecode
tokenizer parsertokenstream
![Page 4: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/4.jpg)
4
Purpose of Lexical Analysis• Converts a character stream into a token
stream
tokenizerint main(void) { for (int i = 0; i < 10; i++) { ...
![Page 5: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/5.jpg)
5
How the Tokenizer is Used• Usually the tokenizer is used by the parser,
which calls the getNextToken() function when it wants another token
• Often the tokenizer also includes a pushBack() function for putting the token back (so it can be read again)
tokenizer parsertoken
getNextToken()program
text
![Page 6: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/6.jpg)
6
Other Tokenizing Jobs• Input reading and buffering
• Macro expansion (C's #define)
• File inclusion (C's #include)
• Stripping out comments
![Page 7: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/7.jpg)
7
Tokens, Patterns, and Lexemes• A token is a pair
– token name (e.g., VARIABLE)– token value (e.g., "myCounter")
• A lexeme is a sequence of program characters that form a token – (e.g., "myCounter")
• A pattern is a description of the form that the lexemes of a token may take– e.g., character strings including A-Z, a-z, 0-9, and _
![Page 8: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/8.jpg)
8
A History Lesson• Usually tokens are easy to recognize even
without any context, but not always
• A tough example from Fortran 90:
DO 5 I = 1.25<variable, "DO5I"> <assign> <number,"1.25">
DO 5 I = 1,25<do> <number, "5"> <variable, "I"> <assign> <number, "1"> <comma> <number, "25">
![Page 9: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/9.jpg)
9
Lexical Errors• Sometimes the current prefix of the input
stream does not match any pattern– This is an error and should be logged
• The lexical analyzer may try to continue by– deleting characters until the input matches a pattern– deleting the first input character– adding an input character – replacing the first input character– transposing the first two input characters
![Page 10: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/10.jpg)
10
Exercise• Circle the lexemes in the following
programs
public static void main(String args[]) { System.println("Hello World!");}
float max(float a, float b) { return a > b ? a : b;}
![Page 11: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/11.jpg)
11
Input Buffering• Lexemes can be long and the pushBack
function requires a mechanism for pushing them back
• One possible mechanism (suggested in the textbook) is a double buffer
• When we run off the end of one buffer we load the next buffer
return (23); \n }\n public static void
star
t
curr
ent
![Page 12: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/12.jpg)
12
Tokenizing (so far)• What a tokenizer does
– reads character input and turns it into tokens
• What a token is– a token name and a value (usually the lexeme)
• How to read input– use a double buffer if some lookahead is necessary
• How does the tokenizer recognize tokens?
• How do we specify patterns?
![Page 13: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/13.jpg)
13
Where to Next?• We need a formal mechanism for defining
the patterns that define tokens
• This mechanism is formal language theory
• Using formal language theory we can make tokenizers without writing any actual code
![Page 14: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/14.jpg)
14
Strings and Languages
• An alphabet is a set of symbols
• A string S over an alphabet is a finite
sequence of symbols in
• The empty string, denoted , is a string of
length 0
• A language L over is a countable set of
strings over
![Page 15: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/15.jpg)
15
Examples of Languages
• The empty language L =
• The language L = containing only the
empty string
• The set L of all syntactically correct C programs
• The set L of all valid variable names in Java
• The set L of all grammatically correct english sentences
![Page 16: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/16.jpg)
16
String Concatenation• If x and y are strings then the
concatenation of x and y, denoted xy, is the string formed by appending y to x
• Example– x = "dog"– y = "house"– xy = "doghouse"
• If we treat concatenation as a "product" then we get exponentiation:– x2 = "dogdog"– x3 = "dogdogdog"
![Page 17: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/17.jpg)
17
Operations on Languages• We can form complex languages from
simple ones using various operations
• Union: L M (also denoted L | M)– L M = { s : s L or s M }
• Concatenation– LM = { st : s L and t M }
• Kleene Closure L* – L* = { Li : i = 0, 1, 2, ... }
• Positive Closure L+ – L* = { Li : i = 1, 2, 3, ... }
![Page 18: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/18.jpg)
18
Some Example• L = { A,B,C,...Z,a,b,c,...z }
• D = { 0,1,2,3,4,5,6,7,8,9 }
• L D
• LD
• L4
• L*
• L(L D)*
• D+
![Page 19: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/19.jpg)
19
Regular Expressions• Regular expressions provide a notation for
defining languages
• A regular expression r denotes a language
L(r) over a finite alphabet
• Basics:– is a RE and L– For each symbol a in a is a RE and L(a) = { a }
![Page 20: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/20.jpg)
20
Regular Expression Operators• Suppose r and s are regular expressions
• Union (choice)– (r)|(s) denotes L(r) L(s)
• Concatenation– (r)(s) denotes L(r) L(s)
• Kleene Closure– r* denotes (L(r))*
• Parenthesization– (r) denote L(r)– Used to enforce specific order of operations
![Page 21: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/21.jpg)
21
Order of Operations in REs• To avoid too many parentheses, we adopt
the following conventions– The * operator has the highest level of precedence
and is left associative– Concatenation has second highest precedence and is
left associative– The | operator has lowest precedence and is left
associative
![Page 22: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/22.jpg)
22
Binary Examples
• For the alphabet = { a,b }– a|b denotes the language { a, b }– (a|b)(a|b) denotes the langage { aa, ab, ba, bb }
– a* denotes { , a, aa, aaa, aaaa, .... }
– (a|b)* denotes all possible strings over – a|a*b denotes the language { a, b, ab, aab,
aaab, ... }
![Page 23: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/23.jpg)
23
Regular Definitions• REs can quickly become complicated
• Regular definitions are multiline regular expressions
• Each line can refer to any of the preceding lines but not to itself or to subsequent lines
letter_ = A|B|...|Z|a|b|...|z|_digit = 0|1|2|3|4|5|6|7|8|9id = letter_(letter_|digit)*
![Page 24: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/24.jpg)
24
Regular Definition Example• Floating point number example
– Accepts 42, 42.314159, 42.314159E+23, 42E+23, 42E23, ...
digit = 0|1|2|3|4|5|6|7|8|9digits = digit digit*optionalFraction = . digits | optionalExponent = (E (+|-|) digits) | number = digits optionalFraction optionalExponent
![Page 25: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/25.jpg)
25
Exercises• Write regular definitions for
– All strings of lowercase letters that contain the five vowels in order
– All strings of lowercase letters in which the letters are in ascending lexicographic order
– Comments, consisting of a string surrounded by /* and */ without any intervening */
![Page 26: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/26.jpg)
26
Extension of Regular Expressions• There are also several time-saving
extensions of REs
• One or more instances– r+ = rr*
• Zero or one instance– r? = r|
• Character classes– [abcdef] = (a|b|c|d|e|f)– [A-Za-z] = (A|B|C|...|Y|Z|a|b|c|...|y|z)
• Others– See page 127 of the text for more common RE
shorthands
![Page 27: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/27.jpg)
27
Some Examples
digit = [0-9]digits = digit+number = digits (. digits)? (E[+-]? digits)?
letter_ = [A-Za-z_]digit = [0-9]variable= letter_ (letter|digit)*
![Page 28: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/28.jpg)
28
Recognizing Tokens• We now have a notation for patterns that
define tokens
• We want to make these into a tokenizer
• For this, we use the formalism of finite state machines
![Page 29: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/29.jpg)
29
An FSM for Relational Operators• relational operators <, >, <=, >=, ==, <>
0 1 2
3
6
4 5
start <
>
=
>
=
7
LE
NE
GE
EQ= =
LT
GT
![Page 30: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/30.jpg)
30
FSM for Variable Names
letter_ = [A-Za-z_]digit = [0-9]variable= letter_ (letter|digit)*
0 1letter
letter or digit
![Page 31: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/31.jpg)
31
FSM for Numbers• Build the FSM for the following:
digit = [0-9]digits = digit+number = digits (. digits)? ((E|e) digits)?
![Page 32: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/32.jpg)
32
NumReader.java• Look at NumReader.java example
– Implements a token recognizer using a switch statement
![Page 33: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/33.jpg)
33
The Story So Far• We can write tokens types as regular
expressions
• We want to convert these REs into (deterministic) finite automata (DFAs)
• From the DFA we can generate code– A single while loop containing a large switch
statement• Each state in S becomes a case
– A table mapping S×→S • (current state,next symbol)→(new state)
– A hash table mapping S×→S
– Elements of may be grouped into character classes
![Page 34: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/34.jpg)
34
NumReader2.java• Look at NumReader2.java example
– Implements a tokenizer using a hashtable
![Page 35: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/35.jpg)
35
Automatic Tokenizer Generators• Generating FSMs by hand from regular
expressions is tedious and error-prone
• Ditto for generating code from FSMs
• Luckily, it can be done automatically
lex
Regularexpressions lex NFA NFA2DFA tokenizer
![Page 36: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/36.jpg)
36
Non-Deterministic Finite Automata• An NFA is a finite state machine whose
edges are labelled with subsets of
• Some edges may be labelled with
• The same labels may appear on two or more outgoing edges at a vertex
• An NFA accepts a string s if s defines any path to any of its accepting states
![Page 37: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/37.jpg)
37
NFA Example• NFA that accepts apple or ape
1
p
p
p
e
l e
a
a
start
![Page 38: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/38.jpg)
38
NFA Example• NFA that accepts any binary string whose 4
last value is 1
1 01 01 01start01
![Page 39: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/39.jpg)
39
From Regular Expression to NFA• Going from a RE to a NFA with one
accepting state is easy
•
• a
start
astart
![Page 40: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/40.jpg)
40
FSM for s
Union• r|s
FSM for r
start
![Page 41: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/41.jpg)
41
Concatenation• rs
FSM for s
FSM for rstart
![Page 42: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/42.jpg)
42
Kleene Closure• r*
FSM for rstart
ɛ
ɛ
![Page 43: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/43.jpg)
43
NFA to DFA• So far
– We can express token patterns as RE– We can convert REs to NFA
• NFAs are hard to use– Given an NFA F and a string s, it is difficult to test if F
accepts s
• Instead, we first convert the NFA into a deterministic finite automaton– No transitions– No repeated labels on outgoing edges
![Page 44: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/44.jpg)
44
Converting an NFA into a DFA• Converting an NFA into a DFA is easy but
sometimes expensive
• Suppose the NFA has n states 1,...,n
• Each state of the DFA is labelled with one of the 2n subsets of {1,...,n}
• The DFA will be in a state whose label contains i if the NFA could be in state i
• Any DFA state that contains an accepting state of the NFA is also an accepting state
![Page 45: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/45.jpg)
45
NFA 2 DFA – Sketch of Algorithm• Step 1 - Remove duplicate edge labels by
using transitions
a
a a
a
![Page 46: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/46.jpg)
46
NFA 2 DFA• Step 2: Starting at state 0, start expanding
states– State i expands into every state reachable from i
using only -transitions– Create new states, as necessary for the neighbours
of already-expanded states– Use a lookup table to make sure that each possible
state (subset of {1,...,n}) is created only once
![Page 47: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/47.jpg)
47
Example
11
2
8
3 4 5 6
9 10
p
p
p
e
l e
0start
• Convert this NFA into a DFA
aa
![Page 48: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/48.jpg)
48
Example• Convert this NFA into a DFA
1a|b
3b
32 4a|b a|b
![Page 49: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/49.jpg)
49
From REs to a Tokenizer• We can convert from RE to NFA to DFA
• DFAs are easy to implement– Using a switch statement or a (hash)table
• For each token type we write a RE
• The lexical analysis generator then creates a NFA (or DFA) for each token type and combines them into one big NFA
![Page 50: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/50.jpg)
50
From REs to a Tokenizer• One giant NFA captures all token types
• Convert this to a DFA– If any state of the DFA contains an accepting state
for more than 1 token then something is wrong with the language specification
NFA for token A
NFA for token B
NFA for token C
![Page 51: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/51.jpg)
51
Summary• The Tokenizer converts the input character
stream into a token stream
• Tokens can be specified using REs
• A software tool can be used to convert the list of REs into a tokenizer– Convert each RE to an NFA– Combine all NFAs into one big NFA– Convert this NFA into a DFA and the code that
implements this DFA
![Page 52: Lexical Analysis (Tokenizing)morin/teaching/3002/notes/lex.pdf · Lexical Analysis (Tokenizing) COMP 3002 School of Computer Science. 2 List of Acronyms ... • DFA - deterministic](https://reader036.vdocuments.us/reader036/viewer/2022062507/5fbe5647bfd1035189784c06/html5/thumbnails/52.jpg)
52
Other Notes• REs, NFAs, and DFAs are equivalent in
terms of the languages they can define
• Converting from NFA to DFA can be expensive– An n-state NFA can result in a 2n state DFA
• None of these are powerful enough to parse programming languages but are usually good enough for tokens– Example: the language { anbn : n = 1,2,3,...} is not
recognizable by a DFA (why?)