
1

Lexical Analysis and Lexical Analyzer Generators

Chapter 3

COP5621 Compiler Construction
Copyright Robert van Engelen, Florida State University, 2005

2

The Reason Why Lexical Analysis is a Separate Phase

• Simplifies the design of the compiler
  – LL(1) or LR(1) parsing with 1 token of lookahead would not otherwise be possible
• Provides efficient implementation
  – Systematic techniques to implement lexical analyzers by hand or automatically
  – Stream buffering methods to scan input
• Improves portability
  – Non-standard symbols and alternate character encodings can be more easily translated


3

Interaction of the Lexical Analyzer with the Parser

[Diagram: the source program is read by the lexical analyzer; the parser repeatedly issues a get-next-token request and receives the pair (token, tokenval); both phases report errors and both consult the symbol table]

4

Attributes of Tokens

[Diagram: the lexical analyzer converts the input

y := 31 + 28*x

into the token stream

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

which is passed to the parser; the first component of each pair is the token, the second is tokenval (the token attribute)]
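A minimal sketch (not part of the slides) of a hand-written routine in this style: lexan() returns the next token and sets the global tokenval; the names lexan, tokenval, lexbuf and the token codes are illustrative assumptions, in the style of a simple hand-written front end.

#include <stdio.h>
#include <ctype.h>

enum { NUM = 256, ID, ASSIGN };   /* token classes (illustrative names)   */

int  tokenval;                    /* attribute of the most recent token   */
char lexbuf[128];                 /* lexeme buffer for identifiers        */

int lexan(FILE *in)               /* return next token, set tokenval      */
{
    int c;
    while ((c = fgetc(in)) == ' ' || c == '\t')
        ;                         /* strip blanks and tabs                */
    if (isdigit(c)) {             /* num: tokenval is its numeric value   */
        int v = 0;
        do { v = 10 * v + (c - '0'); c = fgetc(in); } while (isdigit(c));
        ungetc(c, in);
        tokenval = v;
        return NUM;
    }
    if (isalpha(c)) {             /* id: the lexeme is kept in lexbuf     */
        int i = 0;
        do { if (i < 127) lexbuf[i++] = (char)c; c = fgetc(in); } while (isalnum(c));
        ungetc(c, in);
        lexbuf[i] = '\0';
        return ID;
    }
    if (c == ':') {               /* ':=' is returned as one assign token */
        int d = fgetc(in);
        if (d == '=') return ASSIGN;
        ungetc(d, in);
    }
    return c;                     /* '+', '*', newline, EOF, ... as themselves */
}

int main(void)                    /* echo the token stream of one input line  */
{
    int t;
    while ((t = lexan(stdin)) != EOF && t != '\n') {
        if (t == NUM)         printf("<num,%d> ", tokenval);
        else if (t == ID)     printf("<id,\"%s\"> ", lexbuf);
        else if (t == ASSIGN) printf("<assign> ");
        else                  printf("<%c> ", t);
    }
    putchar('\n');
    return 0;
}

For the input above it prints: <id,"y"> <assign> <num,31> <+> <num,28> <*> <id,"x">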


5

Tokens, Patterns, and Lexemes

• A token is a classification of lexical units
  – For example: id and num
• Lexemes are the specific character strings that make up a token
  – For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token
  – For example: “letter followed by letters and digits” and “non-empty sequence of digits”

6

Specification of Patterns for Tokens: Terminology

• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
  – |s| denotes the length of string s
  – ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ


7

Specification of Patterns for Tokens: String Operations

• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
  s^0 = ε
  s^i = s^(i-1) s for i > 0
  (note that sε = εs = s)
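A small worked example (not on the original slide): if s = ab then s^0 = ε, s^1 = ab, s^2 = abab, s^3 = ababab, and |s^3| = 6; concatenating x = ab with y = ba gives xy = abba.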

8

Specification of Patterns for Tokens: Language Operations

• Union: L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation: LM = {xy | x ∈ L and y ∈ M}
• Exponentiation: L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure: L* = ∪_{i=0,…,∞} L^i
• Positive closure: L+ = ∪_{i=1,…,∞} L^i
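A small worked example (not on the original slide): let L = {a, b} and M = {c, ε}. Then L ∪ M = {a, b, c, ε}, LM = {ac, a, bc, b}, L^2 = {aa, ab, ba, bb}, L* is the set of all strings over {a, b} including ε, and here L+ = L* − {ε} because ε ∉ L.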


9

Specification of Patterns for Tokens: Regular Expressions

• Basis symbols:
  – ε is a regular expression denoting language {ε}
  – a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  – r | s is a regular expression denoting L(r) ∪ M(s)
  – rs is a regular expression denoting L(r)M(s)
  – r* is a regular expression denoting L(r)*
  – (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set

10

Specification of Patterns for Tokens: Regular Definitions

• Naming convention for regular expressions:
  d1 → r1
  d2 → r2
  …
  dn → rn
  where ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
• Each dj occurring in ri is textually substituted by its defining regular expression rj


11

Specification of Patterns for Tokens: Regular Definitions

• Example:

letter → A | B | … | Z | a | b | … | z
digit  → 0 | 1 | … | 9
id     → letter ( letter | digit )*

• Recursion cannot be used; this definition is illegal:

digits → digit digits | digit

12

Specification of Patterns for Tokens: Notational Shorthands

• We frequently use the following shorthands:
  r+ = rr*
  r? = r | ε
  [a-z] = a | b | c | … | z
• For example:
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
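For illustration (not on the original slide): the lexemes 0, 12, 3.14, 6.02E23 and 1E-5 all match num, while .5 and 3. do not, because the integer part and the fraction each require at least one digit.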


13

Regular Definitions and Grammars

Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term | term
  term → id | num

Regular definitions:
  if    → if
  then  → then
  else  → else
  relop → < | <= | <> | > | >= | =
  id    → letter ( letter | digit )*
  num   → digit+ (. digit+)? ( E (+|-)? digit+ )?

14

Implementing a Scanner Using Transition Diagrams

[Transition diagrams for

relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*

The relop diagram uses states 0 to 8: from start state 0, input < leads to a state from which = gives return(relop, LE), > gives return(relop, NE), and any other character gives return(relop, LT) with the lookahead retracted (marked *); input = gives return(relop, EQ); input > leads to a state from which = gives return(relop, GE) and any other character gives return(relop, GT) with retraction (*).
The id diagram uses states 9 to 11: from start state 9 a letter leads to state 10, which loops on letter or digit; any other character leads to the accepting state (marked *), which executes return(gettoken(), install_id()).]


15
Implementing a Scanner Using Transition Diagrams (Code)

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1: …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …

int fail()
{ forward = token_beginning;
  switch (start) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */
  }
  return start;
}

(fail() decides what other start state is applicable)

16

The Lex and Flex Scanner Generators

• Lex and its newer cousin flex are scanner generators
• Systematically translate regular definitions into C source code for efficient scanning
• Generated code is easy to integrate in C applications

Page 9: Lexical Analysis and Lexical Analyzer Generatorsengelen/courses/COP562105/Ch3.pdf · 2005. 2. 7. · 3 5 Tokens, Patterns, and Lexemes •A token is a classification of lexical units

9

17

Creating a Lexical Analyzer with Lex and Flex

[Diagram: the lex source program lex.l is processed by the lex or flex compiler into lex.yy.c; lex.yy.c is compiled by the C compiler into a.out; a.out reads an input stream and produces a sequence of tokens]
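A typical command sequence for this pipeline (an illustration, not part of the slide; slide 20 shows the lex variant, and the exact library flag depends on the installation: flex's support library is usually linked with -lfl, lex's with -ll):

flex lex.l          # produces lex.yy.c
cc lex.yy.c -lfl    # produces a.out
./a.out < input     # writes the sequence of tokens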

18

Lex Specification
• A lex specification consists of three parts:
  regular definitions, C declarations in %{ %}
  %%
  translation rules
  %%
  user-defined auxiliary procedures
• The translation rules are of the form:
  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }


19

Regular Expressions in Lex
x         match the character x
\.        match the character .
"string"  match the contents of the string of characters
.         match any character except newline
^         match the beginning of a line
$         match the end of a line
[xyz]     match one character x, y, or z (use \ to escape -)
[^xyz]    match any character except x, y, and z
[a-z]     match one of a to z
r*        closure (match zero or more occurrences)
r+        positive closure (match one or more occurrences)
r?        optional (match zero or one occurrence)
r1r2      match r1 then r2 (concatenation)
r1|r2     match r1 or r2 (union)
( r )     grouping
r1/r2     match r1 when followed by r2
{d}       match the regular expression defined by d

20

Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+   { printf("%s\n", yytext); }
.|\n     { }
%%
main()
{ yylex(); }

(yytext contains the matching lexeme; yylex() invokes the lexical analyzer; the middle section holds the translation rules)

To build and run:
  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l


21

Example Lex Specification 2
%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim    [ \t]+
%%
\n       { ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim}  { ch+=yyleng; wd++; }
.        { ch++; }
%%
main()
{ yylex(); printf("%8d%8d%8d\n", nl, wd, ch); }

(the delim line is a regular definition; the middle section contains the translation rules)

22

Example Lex Specification 3
%{
#include <stdio.h>
%}
digit    [0-9]
letter   [A-Za-z]
id       {letter}({letter}|{digit})*
%%
{digit}+ { printf("number: %s\n", yytext); }
{id}     { printf("ident: %s\n", yytext); }
.        { printf("other: %s\n", yytext); }
%%
main()
{ yylex(); }

(the digit, letter, and id lines are regular definitions; the middle section contains the translation rules)


23

Example Lex Specification 4
%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}     { }
if       {return IF;}
then     {return THEN;}
else     {return ELSE;}
{id}     {yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
"<"      {yylval = LT; return RELOP;}
"<="     {yylval = LE; return RELOP;}
"="      {yylval = EQ; return RELOP;}
"<>"     {yylval = NE; return RELOP;}
">"      {yylval = GT; return RELOP;}
">="     {yylval = GE; return RELOP;}
%%
int install_id()
…

(each rule returns a token to the parser; yylval carries the token attribute; install_id() installs yytext as an identifier in the symbol table)

24

Design of a Lexical Analyzer Generator

• Translate regular expressions to NFA
• Translate NFA to an efficient DFA
[Diagram: regular expressions → NFA → DFA; the NFA-to-DFA conversion is optional, and either the NFA or the DFA is simulated to recognize tokens]


25

Nondeterministic Finite Automata

• Definition: an NFA is a 5-tuple (S, Σ, δ, s0, F) where
  S is a finite set of states
  Σ is the finite alphabet of input symbols
  δ is a mapping from S × (Σ ∪ {ε}) to sets of states (subsets of S)
  s0 ∈ S is the start state
  F ⊆ S is the set of accepting (or final) states

26

Transition Graph

• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph
[Diagram: an NFA for (a|b)*abb; start state 0 has self-loops on a and b and the path 0 --a--> 1 --b--> 2 --b--> 3]
  S = {0,1,2,3}
  Σ = {a,b}
  s0 = 0
  F = {3}


27

Transition Table

• The mapping δ of an NFA can be represented in a transition table

State   Input a   Input b
  0     {0, 1}    {0}
  1     ∅         {2}
  2     ∅         {3}

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
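A sketch (not from the slides) of one way to store this transition table in C, encoding each set of states as a bit mask so that δ(s, a) is a single array lookup:

#include <stdio.h>

/* δ of the example NFA for (a|b)*abb: 4 states, 2 input symbols
   (column 0 = a, column 1 = b); each entry is a set of states as a bit mask. */
static const unsigned delta[4][2] = {
    { 1u<<0 | 1u<<1, 1u<<0 },   /* δ(0,a) = {0,1}  δ(0,b) = {0} */
    { 0,             1u<<2 },   /* δ(1,a) = ∅      δ(1,b) = {2} */
    { 0,             1u<<3 },   /* δ(2,a) = ∅      δ(2,b) = {3} */
    { 0,             0     },   /* δ(3,a) = ∅      δ(3,b) = ∅   */
};

int main(void)
{
    unsigned next = delta[0][0];            /* look up δ(0,a)               */
    printf("delta(0,a) = {");
    for (int s = 0; s < 4; s++)             /* decode the bit mask          */
        if (next & (1u << s)) printf(" %d", s);
    printf(" }\n");                         /* prints: delta(0,a) = { 0 1 } */
    return 0;
}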

28

The Language Defined by an NFA

• An NFA accepts an input string x iff there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph
• A state transition from one state to another on the path is called a move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA


29

Design of a Lexical Analyzer Generator: RE to NFA to DFA

[Diagram: a Lex specification with regular expressions
  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }
is combined into one NFA: a new start state s0 has ε-transitions to the sub-NFAs N(p1), N(p2), …, N(pn), and each sub-NFA keeps its own action (action1, action2, …, actionn); the NFA can then be converted to a DFA by the subset construction (optional)]

30

From Regular Expression to NFA (Thompson's Construction)
[Diagram: each case builds an NFA fragment N(r) with one start state i and one final state f:
  ε        i --ε--> f
  a        i --a--> f
  r1 | r2  a new start state i with ε-edges into N(r1) and N(r2), whose final states have ε-edges to a new final state f
  r1 r2    N(r1) followed by N(r2), the final state of N(r1) becoming the start state of N(r2)
  r*       a new start state i with ε-edges to N(r) and to the new final state f, plus ε-edges from the final state of N(r) back to the start of N(r) and to f]


31

Combining the NFAs of a Set of Regular Expressions

a    { action1 }
abb  { action2 }
a*b+ { action3 }
[Diagram: the separate NFAs (1 --a--> 2 for a; 3 --a--> 4 --b--> 5 --b--> 6 for abb; 7 with an a self-loop, 7 --b--> 8, and a b self-loop on 8 for a*b+) are combined by a new start state 0 with ε-transitions to states 1, 3, and 7]

32

Simulating the Combined NFA: Example 1

[Diagram: the combined NFA of the previous slide, simulated on input aab:
  {0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --(none)-->
  state 8 is accepting for a*b+, so action3 is executed]
Must find the longest match:
  Continue until no further moves are possible
  When the last state is accepting: execute its action


33

Simulating the Combined NFA: Example 2

[Diagram: the combined NFA simulated on input abb:
  {0,1,3,7} --a--> {2,4,7} --b--> {5,8} --b--> {6,8} --(none)-->
  both accepting states 6 (abb, action2) and 8 (a*b+, action3) are reached]
When two or more accepting states are reached, the first action given in the Lex specification is executed (here action2)

34

Deterministic Finite Automata

• A deterministic finite automaton is a special case of an NFA
  – No state has an ε-transition
  – For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
  – At most one path exists to accept a string
  – Simulation algorithm is simple


35

Example DFA

[Diagram: a DFA with states 0 (start), 1, 2, 3 (accepting):
  0 --a--> 1, 0 --b--> 0
  1 --a--> 1, 1 --b--> 2
  2 --a--> 1, 2 --b--> 3
  3 --a--> 1, 3 --b--> 0]
A DFA that accepts (a|b)*abb
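A sketch in C (not part of the slides) of the simple simulation algorithm mentioned on the previous slide, applied to this DFA; dtran and accepts are illustrative names:

#include <stdio.h>

/* Transition table of the DFA above: state 0 is the start state, state 3 accepts;
   column 0 = input a, column 1 = input b.                                          */
static const int dtran[4][2] = {
    { 1, 0 },   /* state 0 */
    { 1, 2 },   /* state 1 */
    { 1, 3 },   /* state 2 */
    { 1, 0 },   /* state 3 */
};

static int accepts(const char *x)            /* does x match (a|b)*abb?      */
{
    int s = 0;                               /* start in state 0             */
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b') return 0;/* symbol outside the alphabet  */
        s = dtran[s][*x - 'a'];              /* exactly one move per symbol  */
    }
    return s == 3;                           /* accept iff we end in state 3 */
}

int main(void)
{
    const char *tests[] = { "abb", "aabb", "babb", "ab", "abbb" };
    for (int i = 0; i < 5; i++)
        printf("%-5s -> %s\n", tests[i], accepts(tests[i]) ? "accept" : "reject");
    return 0;
}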

36

Conversion of an NFA into a DFA

• The subset construction algorithm converts an NFA into a DFA using:
  ε-closure(s) = {s} ∪ {t | s →ε … →ε t}
  ε-closure(T) = ∪_{s∈T} ε-closure(s)
  move(T,a) = {t | s →a t and s ∈ T}
• The algorithm produces:
  Dstates, the set of states of the new DFA, consisting of sets of states of the NFA
  Dtran, the transition table of the new DFA


37

ε-closure and move Examples
[Diagram: the combined NFA for a, abb, and a*b+ from the earlier slides]
ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅
(this is the trace {0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} of simulation example 1)
Also used to simulate NFAs

38

Simulating an NFA using ε-closure and move

S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
  Sprev := S
  S := ε-closure(move(S,a))
  a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
  execute action in Sprev
  return “yes”
else return “no”
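A bit-mask sketch of this loop in C (an illustration, not code from the slides), using the combined NFA for a, abb, and a*b+ from the earlier examples; eps_closure and move implement the ε-closure(T) and move(T,a) of slide 36, and the action lookup is omitted, so the sketch only answers yes or no:

#include <stdio.h>

/* Combined NFA for the patterns a (accepting state 2), abb (state 6) and
   a*b+ (state 8); states 0..8, with sets of states encoded as bit masks.   */
static const unsigned eps[9] = { 1u<<1 | 1u<<3 | 1u<<7, 0,0,0,0,0,0,0,0 };
static const unsigned delta[9][2] = {      /* column 0 = a, column 1 = b    */
    { 0,     0     },   /* 0 */
    { 1u<<2, 0     },   /* 1 --a--> 2             */
    { 0,     0     },   /* 2 */
    { 1u<<4, 0     },   /* 3 --a--> 4             */
    { 0,     1u<<5 },   /* 4 --b--> 5             */
    { 0,     1u<<6 },   /* 5 --b--> 6             */
    { 0,     0     },   /* 6 */
    { 1u<<7, 1u<<8 },   /* 7 --a--> 7, 7 --b--> 8 */
    { 0,     1u<<8 },   /* 8 --b--> 8             */
};
static const unsigned accepting = 1u<<2 | 1u<<6 | 1u<<8;   /* F */

static unsigned eps_closure(unsigned T)    /* ε-closure(T) */
{
    unsigned old;
    do {
        old = T;
        for (int s = 0; s < 9; s++)
            if (T & (1u << s)) T |= eps[s];
    } while (T != old);
    return T;
}

static unsigned move(unsigned T, int a)    /* move(T,a) */
{
    unsigned U = 0;
    for (int s = 0; s < 9; s++)
        if (T & (1u << s)) U |= delta[s][a];
    return U;
}

int main(void)
{
    const char *x = "aba";                       /* input string over {a,b}    */
    unsigned S = eps_closure(1u << 0);           /* S := ε-closure({s0})       */
    unsigned Sprev = 0;                          /* Sprev := ∅                 */
    for (const char *p = x; *p != '\0' && S != 0; p++) {
        Sprev = S;                               /* Sprev := S                 */
        S = eps_closure(move(S, *p - 'a'));      /* S := ε-closure(move(S,a))  */
    }
    if (S != 0) Sprev = S;                       /* input ran out before S did */
    printf("%s\n", (Sprev & accepting) ? "yes" : "no");   /* Sprev ∩ F ≠ ∅ ?   */
    return 0;
}

On the input "aba" it prints yes: after consuming ab the set is {5,8}, which contains the accepting state 8 of a*b+, matching the longest-match behavior described on the earlier simulation slides.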


39

The Subset Construction Algorithm

Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    U := ε-closure(move(T,a))
    if U is not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do

40
Subset Construction Example 1

[Diagram: the NFA for (a|b)*abb obtained by Thompson's construction (states 0 to 10, start state 0, accepting state 10) and the DFA produced from it by the subset construction, with start state A and accepting state E; on input a every DFA state goes to B, while on input b, A and C go to C, B goes to D, D goes to E, and E goes to C]
Dstates
  A = {0,1,2,4,7}
  B = {1,2,3,4,6,7,8}
  C = {1,2,4,5,6,7}
  D = {1,2,4,5,6,7,9}
  E = {1,2,4,5,6,7,10}


41
Subset Construction Example 2

a1, a2, a3 denote the actions of the patterns a, abb, and a*b+.
[Diagram: the combined NFA for these three patterns and the DFA obtained from it by the subset construction; A is the start state, and the accepting DFA states are labeled with the action they trigger: B with a1, C and E with a3, and F with both a2 and a3 (a2, listed first in the specification, is taken)]
Dstates
  A = {0,1,3,7}
  B = {2,4,7}
  C = {8}
  D = {7}
  E = {5,8}
  F = {6,8}

42

Minimizing the Number of States of a DFA

[Diagram: the five-state DFA for (a|b)*abb from subset construction example 1 (states A, B, C, D, E) is minimized to an equivalent four-state DFA in which A and C are merged: the result has states A (start), B, D, and E (accepting); on input a every state goes to B, while on input b, A goes to A, B goes to D, D goes to E, and E goes to A]


43

From Regular Expression to DFA Directly

• The important states of an NFA are those with an out-transition on an input symbol (not ε); that is, if move({s},a) ≠ ∅ for some a, then s is an important state
• The subset construction algorithm uses only the important states when it determines ε-closure(move(T,a))

44

From Regular Expression to DFA Directly (Algorithm)

• Augment the regular expression r with a special end symbol # to make the accepting state important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct the functions nullable, firstpos, lastpos, and followpos


45

From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

[Diagram: syntax tree of (a|b)*abb#. The root is a concatenation node; its left spine is a chain of concatenations whose leftmost child is the closure node (a|b)*, built from an alternation node over the leaves a (position 1) and b (position 2); the remaining leaves are a (position 3), b (position 4), b (position 5), and # (position 6). Position numbers are assigned to leaves ≠ ε]

46

From Regular Expression to DFA Directly: Annotating the Tree

• nullable(n): the subtree at node n generates languages including the empty string
• firstpos(n): the set of positions that can match the first symbol of a string generated by the subtree at node n
• lastpos(n): the set of positions that can match the last symbol of a string generated by the subtree at node n
• followpos(i): the set of positions that can follow position i in the tree


47
From Regular Expression to DFA Directly: Annotating the Tree

For each node n, nullable(n), firstpos(n), and lastpos(n) are computed as follows:
• Leaf ε: nullable = true; firstpos = ∅; lastpos = ∅
• Leaf with position i: nullable = false; firstpos = {i}; lastpos = {i}
• Alternation node n = c1 | c2: nullable = nullable(c1) or nullable(c2); firstpos = firstpos(c1) ∪ firstpos(c2); lastpos = lastpos(c1) ∪ lastpos(c2)
• Concatenation node n = c1 • c2: nullable = nullable(c1) and nullable(c2); firstpos = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1); lastpos = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
• Star node n = c1*: nullable = true; firstpos = firstpos(c1); lastpos = lastpos(c1)

48

From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

[Diagram: the syntax tree of (a|b)*abb# annotated with firstpos to the left and lastpos to the right of each node:
  leaves a(1) {1}/{1}, b(2) {2}/{2}, a(3) {3}/{3}, b(4) {4}/{4}, b(5) {5}/{5}, #(6) {6}/{6}
  alternation a | b and closure (a|b)*: {1,2}/{1,2}
  concatenation nodes, from the lowest up to the root: {1,2,3}/{3}, {1,2,3}/{4}, {1,2,3}/{5}, {1,2,3}/{6}
  (the closure node is the only nullable node in this tree)]


49

From Regular Expression to DFA Directly: followpos

for each node n in the tree do
  if n is a cat-node with left child c1 and right child c2 then
    for each i in lastpos(c1) do
      followpos(i) := followpos(i) ∪ firstpos(c2)
    end do
  else if n is a star-node
    for each i in lastpos(n) do
      followpos(i) := followpos(i) ∪ firstpos(n)
    end do
  end if
end do

50
From Regular Expression to DFA Directly: Algorithm

s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0} and is unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    let U be the set of positions that are in followpos(p) for some position p in T, such that the symbol at position p is a
    if U is not empty and not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do


51

From Regular Expression to DFA Directly: Example

Node   followpos
  1    {1, 2, 3}
  2    {1, 2, 3}
  3    {4}
  4    {5}
  5    {6}
  6    -
[Diagram: the resulting DFA has the states {1,2,3} (start), {1,2,3,4}, {1,2,3,5}, and {1,2,3,6} (accepting, since it contains position 6 of #); on input a every state goes to {1,2,3,4}; on input b, {1,2,3} and {1,2,3,6} go to {1,2,3}, {1,2,3,4} goes to {1,2,3,5}, and {1,2,3,5} goes to {1,2,3,6}]

52

Time-Space Tradeoffs

Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)