lexical analysis - stanford university we are lexical analysis syntax analysis semantic analysis ir...

301
Lexical Analysis

Upload: vuxuyen

Post on 11-May-2018

239 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexical Analysis

Page 2: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Announcements

● Programming Assignment 1 Out● Due Monday, July 9 at 11:59 PM.

● Four handouts (all available online):● Decaf Specification● Lexical Analysis● Intro to flex● Programming Assignment 1

Page 3: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Where We Are

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

SourceCode

MachineCode

Page 4: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

while (ip < z) ++ip;

Page 5: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

Page 6: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

Page 7: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

While

++

Ident

<

Ident Ident

ip z ip

Page 8: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

do[for] = new 0;

Page 9: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

do[for] = new 0;

d o [ f ] n e w 0 ;=o r

Page 10: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

do[for] = new 0;

T_Do [ T_For T_New T_IntConst

0

d o [ f ] n e w 0 ;=o r

] =

Page 11: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

do[for] = new 0;

T_Do [ T_For T_New T_IntConst

0

d o [ f ] n e w 0 ;=o r

] =

Page 12: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 13: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 14: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 15: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 16: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 17: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 18: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( i < z ) \n \t + i p ;p + +( 1 < i ) \n \t + i ;3 + +7

Page 19: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 20: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

This is called a token. You can think of it as an enumerated type representing what logical entity we read out of the source code.

This is called a token. You can think of it as an enumerated type representing what logical entity we read out of the source code.

The piece of the original program from which we made the token is

called a lexeme.

The piece of the original program from which we made the token is

called a lexeme.

Page 21: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 22: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 23: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 24: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Sometimes we will discard a lexeme rather than storing it for later use. Here, we ignore whitespace, since it has no bearing on the meaning of

the program.

Sometimes we will discard a lexeme rather than storing it for later use. Here, we ignore whitespace, since it has no bearing on the meaning of

the program.

Page 25: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 26: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 27: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

Page 28: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

(

Page 29: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

(

Page 30: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

(

Page 31: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

(

Page 32: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

(

Page 33: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

(

Page 34: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

( T_IntConst

137

Page 35: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning a Source Filew h i l e ( 1 < i ) \n \t + i ;3 + +

T_While

7

( T_IntConst

137

Some tokens can have attributes that store extra information about the token. Here we store which integer is

represented.

Some tokens can have attributes that store extra information about the token. Here we store which integer is

represented.

Page 36: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Goals of Lexical Analysis

● Convert from physical description of a program into sequence of of tokens.● Each token represents one logical piece of the source

file – a keyword, the name of a variable, etc.

● Each token is associated with a lexeme.● The actual text of the token: “137,” “int,” etc.

● Each token may have optional attributes.● Extra information derived from the text – perhaps a

numeric value.

● The token sequence will be used in the parser to recover the program structure.

Page 37: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Choosing Tokens

Page 38: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) { cout << k << endl;}

Page 39: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) { cout << k << endl;}

for {int }<< ;= <( [) ]++

Page 40: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k) { cout << k << endl;}

for {int }<< ;= <( [) ]++

IdentifierIntegerConstant

Page 41: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Choosing Good Tokens

● Very much dependent on the language.● Typically:

● Give keywords their own tokens.● Give different punctuation symbols their own

tokens.● Group lexemes representing identifiers,

numeric constants, strings, etc. into their own groups.

● Discard irrelevant information (whitespace, comments)

Page 42: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO 5 I = 1.25

Thanks to Prof. Alex Aiken

Page 43: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

Thanks to Prof. Alex Aiken

Page 44: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25

DO5I = 1.25

● Can be difficult to tell when to partition

input.

Thanks to Prof. Alex Aiken

Page 45: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● C++: Nested template declarations

vector<vector<int>> myVector

Thanks to Prof. Alex Aiken

Page 46: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● C++: Nested template declarations

vector < vector < int >> myVector

Thanks to Prof. Alex Aiken

Page 47: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● C++: Nested template declarations

(vector < (vector < (int >> myVector)))

Thanks to Prof. Alex Aiken

Page 48: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● C++: Nested template declarations

(vector < (vector < (int >> myVector)))

● Again, can be difficult to determine where to split.

Thanks to Prof. Alex Aiken

Page 49: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

Thanks to Prof. Alex Aiken

Page 50: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

Thanks to Prof. Alex Aiken

Page 51: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

Thanks to Prof. Alex Aiken

Page 52: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Scanning is Hard

● PL/1: Keywords can be used as identifiers.

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

● Can be difficult to determine how to label lexemes.

Thanks to Prof. Alex Aiken

Page 53: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Challenges in Scanning

● How do we determine which lexemes are associated with each token?

● When there are multiple ways we could scan the input, how do we know which one to pick?

● How do we address these concerns efficiently?

Page 54: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Associating Lexemes with Tokens

Page 55: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexemes and Tokens

● Tokens give a way to categorize lexemes by what information they provide.

● Some tokens might be associated with only a single lexeme:● Tokens for keywords like if and while probably

only match those lexemes exactly.● Some tokens might be associated with lots of

different lexemes:● All variable names, all possible numbers, all

possible strings, etc.

Page 56: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Sets of Lexemes

● Idea: Associate a set of lexemes with each token.

● We might associate the “number” token with the set { 0, 1, 2, …, 10, 11, 12, … }

● We might associate the “string” token with the set { "", "a", "b", "c", … }

● We might associate the token for the keyword while with the set { while }.

Page 57: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

How do we describe which (potentially infinite) set of lexemes is associated with

each token type?

Page 58: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Formal Languages

● A formal language is a set of strings.● Many infinite languages have finite descriptions:

● Define the language using an automaton.● Define the language using a grammar.● Define the language using a regular expression.

● We can use these compact descriptions of the language to define sets of strings.

● Over the course of this class, we will use all of these approaches.

Page 59: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Regular Expressions

● Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).

● Often provide a compact and human-readable description of the language.

● Used as the basis for numerous software systems, including the flex tool we will use in this course.

Page 60: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Atomic Regular Expressions

● The regular expressions we will use in this course begin with two simple building blocks.

● The symbol ε is a regular expression matches the empty string.

● For any symbol a, the symbol a is a regular expression that just matches a.

Page 61: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Compound Regular Expressions

● If R1 and R2 are regular expressions, R1R2 is a regular expression represents the concatenation of the languages of R1 and R2.

● If R1 and R2 are regular expressions, R1 | R2 is a regular expression representing the union of R1 and R2.

● If R is a regular expression, R* is a regular expression for the Kleene closure of R.

● If R is a regular expression, (R) is a regular expression with the same meaning as R.

Page 62: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Operator Precedence

● Regular expression operator precedence is

(R)

R*

R1R2

R1 | R2

● So ab*c|d is parsed as ((a(b*))c)|d

Page 63: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

Page 64: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

Page 65: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

110111001010000

11111011110011111

Page 66: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

110111001010000

11111011110011111

Page 67: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

Page 68: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

Page 69: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

Page 70: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

0000101011111000

Page 71: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

(0|1)(0|1)(0|1)(0|1)

0000101011111000

Page 72: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

(0|1){4}

0000101011111000

Page 73: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings of length exactly four:

(0|1){4}

0000101011111000

Page 74: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings that contain at most one zero:

Page 75: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

Page 76: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

Page 77: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

111101111111110111

0

Page 78: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings that contain at most one zero:

1*(0 | ε)1*

111101111111110111

0

Page 79: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simple Regular Expressions

● Suppose the only characters are 0 and 1.

● Here is a regular expression for strings that contain at most one zero:

1*0?1*

111101111111110111

0

Page 80: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

aa* (.aa*)* aa*.aa*@ (.aa*)*

Page 81: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

aa* (.aa*)* aa*.aa*@ (.aa*)*

[email protected]@mail.site.org

[email protected]

Page 82: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

aa* (.aa*)* aa*.aa*@ (.aa*)*

[email protected]@mail.site.org

[email protected]

Page 83: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

aa* (.aa*)* aa*.aa*@ (.aa*)*

[email protected]@mail.site.org

[email protected]

Page 84: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

aa* (.aa*)* aa*.aa*@ (.aa*)*

[email protected]@mail.site.org

[email protected]

Page 85: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

a+ (.aa*)* aa*.aa*@ (.aa*)*

[email protected]@mail.site.org

[email protected]

Page 86: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

a+ (.a+)* a+.a+@ (.a+)*

[email protected]@mail.site.org

[email protected]

Page 87: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

a+ (.a+)* a+.a+@ (.a+)*

[email protected]@mail.site.org

[email protected]

Page 88: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

a+ (.a+)* a+@ (.a+)+

[email protected]@mail.site.org

[email protected]

Page 89: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose our alphabet is a, @, and ., where a represents “some letter.”

● A regular expression for email addresses is

[email protected]@mail.site.org

[email protected]

a+(.a+)*@a+(.a+)+

Page 90: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

Page 91: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

42+1370-3248

-9999912

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

Page 92: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

42+1370-3248

-9999912

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

Page 93: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

42+1370-3248

-9999912

(+|-)?[0123456789]*[02468]

Page 94: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Applied Regular Expressions

● Suppose that our alphabet is all ASCII characters.

● A regular expression for even numbers is

42+1370-3248

-9999912

(+|-)?[0-9]*[02468]

Page 95: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Matching Regular Expressions

Page 96: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Implementing Regular Expressions

● Regular expressions can be implemented using finite automata.

● There are two main kinds of finite automata:● NFAs (nondeterministic finite automata),

which we'll see in a second, and● DFAs (deterministic finite automata), which

we'll see later.

● Automata are best explained by example...

Page 97: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

Page 98: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

Each circle is a state of the automaton. The automaton's configuration is determined by what state(s) it is in.

Each circle is a state of the automaton. The automaton's configuration is determined by what state(s) it is in.

A Simple Automaton

Page 99: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

These arrows are called transitions. The automaton changes which state(s) it is in

by following transitions.

These arrows are called transitions. The automaton changes which state(s) it is in

by following transitions.

A Simple Automaton

Page 100: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 101: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

The automaton takes a string as input and decides whether

to accept or reject the string.

The automaton takes a string as input and decides whether

to accept or reject the string.

Page 102: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 103: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 104: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 105: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 106: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 107: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 108: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 109: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 110: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 111: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "

Page 112: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" H E Y A "The double circle indicates that this state is an accepting state. The automaton accepts the string if it

ends in an accepting state.

The double circle indicates that this state is an accepting state. The automaton accepts the string if it

ends in an accepting state.

Page 113: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 114: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 115: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 116: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 117: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 118: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

Page 119: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

There is no transition on " here, so the automaton

dies and rejects.

There is no transition on " here, so the automaton

dies and rejects.

Page 120: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" " " " " "

There is no transition on " here, so the automaton

dies and rejects.

There is no transition on " here, so the automaton

dies and rejects.

Page 121: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 122: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 123: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 124: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 125: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 126: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 127: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

Page 128: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

" "start

A,B,C,...,Z

A Simple Automaton

" A B C

This is not an accepting state, so the automaton

rejects.

This is not an accepting state, so the automaton

rejects.

Page 129: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

Page 130: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

Notice that there are multiple transitions defined here on 0 and 1. If we read a 0 or 1 here, we follow both transitions

and enter multiple states.

Notice that there are multiple transitions defined here on 0 and 1. If we read a 0 or 1 here, we follow both transitions

and enter multiple states.

Page 131: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

Page 132: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 133: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 134: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 135: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 136: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 137: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 138: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 139: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 140: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 141: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 142: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 143: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 144: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 145: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1

Page 146: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A More Complex Automaton

0 1 0

1 0 1

start 0, 1

0 1 1 1 0 1Since we are in at least one accepting state, the

automaton accepts.

Since we are in at least one accepting state, the

automaton accepts.

Page 147: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

Page 148: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

These are called -transitionsε . These transitions are followed automatically and

without consuming any input.

These are called -transitionsε . These transitions are followed automatically and

without consuming any input.

Page 149: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 150: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 151: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 152: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 153: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 154: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 155: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 156: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 157: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 158: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 159: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

An Even More Complex Automatona, b

a, c

b, c

start

ε

ε

ε

c

b

a

b c b a

Page 160: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Simulating an NFA

● Keep track of a set of states, initially the start state and everything reachable by ε-moves.

● For each character in the input:● Maintain a set of next states, initially empty.● For each current state:

– Follow all transitions labeled with the current letter.– Add these states to the set of new states.

● Add every state reachable by an ε-move to the set of next states.

● Complexity: O(mn2) for strings of length m and automata with n states.

Page 161: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From Regular Expressions to NFAs

● There is a (beautiful!) procedure from converting a regular expression to an NFA.

● Associate each regular expression with an NFA with the following properties:

● There is exactly one accepting state.

● There are no transitions out of the accepting state.

● There are no transitions into the starting state.

● These restrictions are stronger than necessary, but make the construction easier.

start

Page 162: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Base Cases

εstart

Automaton for ε

astart

Automaton for single character a

Page 163: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1R2

Page 164: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

R1

Construction for R1R2

R2

start start

Page 165: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

R1

Construction for R1R2

R2

start

Page 166: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

R1

Construction for R1R2

R2

start ε

Page 167: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

R1

Construction for R1R2

R2

start ε

Page 168: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

Page 169: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

R2

start

R1

start

Page 170: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

R2

start

R1

start

start

Page 171: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

R2

R1start

ε

ε

Page 172: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

R2

R1start

ε

ε

Page 173: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

R2

R1start

ε

ε

ε

ε

Page 174: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R1 | R2

R2

R1start

ε

ε

ε

ε

Page 175: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

Page 176: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

R

start

Page 177: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

R

start start

Page 178: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

R

start start

ε

Page 179: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

R

start ε ε

ε

Page 180: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

R

start ε ε

ε

ε

Page 181: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Construction for R*

R

start

ε

ε ε

ε

Page 182: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Overall Result

● Any regular expression of length n can be converted into an NFA with O(n) states.

● Can determine whether a string of length m matches a regular expression of length n in time O(mn2).

● We'll see how to make this O(m) later (this is independent of the complexity of the regular expression!)

Page 183: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Quick Diversion...

Page 184: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

I am having some difficulty compiling a C++ program that I've written.

This program is very simple and, to the best of my knowledge, conforms to all the rules set forth in the C++ Standard. [...]

The program is as follows:

Source: http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers

Page 185: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

I am having some difficulty compiling a C++ program that I've written.

This program is very simple and, to the best of my knowledge, conforms to all the rules set forth in the C++ Standard. [...]

The program is as follows:

Source: http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers

Page 186: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

I am having some difficulty compiling a C++ program that I've written.

This program is very simple and, to the best of my knowledge, conforms to all the rules set forth in the C++ Standard. [...]

The program is as follows:

Source: http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers

> g++ helloworld.pnghelloworld.png: file not recognized: File format not recognizedcollect2: ld returned 1 exit status

Page 187: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Challenges in Scanning

● How do we determine which lexemes are associated with each token?

● When there are multiple ways we could scan the input, how do we know which one to pick?

● How do we address these concerns efficiently?

Page 188: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Challenges in Scanning

● How do we determine which lexemes are associated with each token?

● When there are multiple ways we could scan the input, how do we know which one to pick?

● How do we address these concerns efficiently?

Page 189: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

Page 190: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

Page 191: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o trf o trf o trf o trf o tr

f o trf o trf o trf o tr

Page 192: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Conflict Resolution

● Assume all tokens are specified as regular expressions.

● Algorithm: Left-to-right scan.● Tiebreaking rule one: Maximal munch.

● Always match the longest possible prefix of the remaining text.

Page 193: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o trf o trf o trf o trf o tr

f o trf o trf o trf o tr

Page 194: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Lexing Ambiguities

T_For forT_Identifier [A-Za-z_][A-Za-z0-9_]*

f o tr

f o tr

Page 195: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Implementing Maximal Munch

● Given a set of regular expressions, how can we use them to implement maximum munch?

● Idea:● Convert expressions to NFAs.● Run all NFAs in parallel, keeping track of the

last match.● When all automata get stuck, report the last

match and restart the search at that point.

Page 196: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

Page 197: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

Page 198: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 199: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 200: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 201: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 202: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 203: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 204: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 205: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 206: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 207: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 208: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 209: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 210: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 211: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 212: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 213: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 214: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 215: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 216: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 217: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 218: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 219: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 220: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 221: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 222: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 223: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 224: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 225: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 226: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 227: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 228: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 229: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 230: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 231: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 232: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 233: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 234: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 235: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 236: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 237: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 238: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 239: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 240: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 241: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 242: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

T_Do doT_Double doubleT_Mystery [A-Za-z]

Implementing Maximal Munch

start d o

start d o u b l e

start Σ

D O U B L ED O U B

Page 243: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

Page 244: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

start d o

start d o u b l e

start Σ

Page 245: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

start d o

start d o u b l e

start Σ

Page 246: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Page 247: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Build a single automaton that runs all the matching

automata in parallel.

Build a single automaton that runs all the matching

automata in parallel.

Page 248: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Page 249: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Minor Simplification

d o

d o u b l e

Σ

ε

ε

εstart

Annotate each accepting state with which automaton

it came from.

Annotate each accepting state with which automaton

it came from.

Page 250: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

Page 251: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

Page 252: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu eld o bu el

Page 253: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

More Tiebreaking

● When two regular expressions apply, choose the one with the greater “priority.”

● Simple priority system: pick the rule that was defined first.

Page 254: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu eld o bu el

Page 255: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu el

Page 256: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Other Conflicts

T_Do doT_Double doubleT_Identifier [A-Za-z_][A-Za-z0-9_]*

d o bu el

d o bu elWhy isn't this a

problem?

Page 257: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

One Last Detail...

● We know what to do if multiple rules match.

● What if nothing matches?● Trick: Add a “catch-all” rule that matches

any character and reports an error.

Page 258: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Summary of Conflict Resolution

● Construct an automaton for each regular expression.

● Merge them into one automaton by adding a new start state.

● Scan the input, keeping track of the last known match.

● Break ties by choosing higher-precedence matches.

● Have a catch-all rule to handle errors.

Page 259: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Challenges in Scanning

● How do we determine which lexemes are associated with each token?

● When there are multiple ways we could scan the input, how do we know which one to pick?

● How do we address these concerns efficiently?

Page 260: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Challenges in Scanning

● How do we determine which lexemes are associated with each token?

● When there are multiple ways we could scan the input, how do we know which one to pick?

● How do we address these concerns efficiently?

Page 261: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

DFAs

● The automata we've seen so far have all been NFAs.

● A DFA is like an NFA, but with tighter restrictions:● Every state must have exactly one

transition defined for every letter.● ε-moves are not allowed.

Page 262: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Sample DFA

Page 263: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Sample DFA

start

0 0

1

0

1

0

1

1

Page 264: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A Sample DFA

D

start

0 0

1

0

1

0

1

A

C

B1

Page 265: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

A

A

B

B

C

C

D

D

A Sample DFA

D

start

0 0

1

0

1

0

1

A

C

B1

A

B

C

D

0 1

Page 266: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Code for DFAsint kTransitionTable[kNumStates][kNumSymbols] = { {0, 0, 1, 3, 7, 1, …}, …};bool kAcceptTable[kNumStates] = { false, true, true, …};bool simulateDFA(string input) { int state = 0; for (char ch: input) state = kTransitionTable[state][ch]; return kAcceptTable[state];}

Page 267: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Code for DFAsint kTransitionTable[kNumStates][kNumSymbols] = { {0, 0, 1, 3, 7, 1, …}, …};bool kAcceptTable[kNumStates] = { false, true, true, …};bool simulateDFA(string input) { int state = 0; for (char ch: input) state = kTransitionTable[state][ch]; return kAcceptTable[state];}

Runs in time O(m) on a string of

length m.

Runs in time O(m) on a string of

length m.

Page 268: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Speeding up Matching

● In the worst-case, an NFA with n states takes time O(mn2) to match a string of length m.

● DFAs, on the other hand, take only O(m).● There is another (beautiful!) algorithm to

convert NFAs to DFAs.

Lexical Specification

RegularExpressions NFA DFA

Table-DrivenDFA

Page 269: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Subset Construction

● NFAs can be in many states at once, while DFAs can only be in a single state at a time.

● Key idea: Make the DFA simulate the NFA.

● Have the states of the DFA correspond to the sets of states of the NFA.

● Transitions between states of DFA correspond to transitions between sets of states in the NFA.

Page 270: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

Page 271: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

Page 272: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

Page 273: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

Page 274: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

Page 275: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 276: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 277: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 278: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

Page 279: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 280: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 281: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 282: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

12

Page 283: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6

12

Page 284: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6

12

Page 285: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6 7 8 9 1010u b l e

12

Page 286: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6 7 8 9 1010u b l e

Σ

Σ – o Σ – u Σ – b Σ – l Σ – e

Σ

Σ

12

Page 287: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

From NFA to DFA

1 2 3d o

4 5 6d o

7u

8b

9l

10e

11 12Σ

0

ε

ε

εstart

0, 1, 4, 11start

2, 5, 122, 5, 12d

12

Σ – d

3, 6o

3, 6 7 8 9 1010u b l e

Σ

Σ – o Σ – u Σ – b Σ – l Σ – e

Σ

Σ

12

Page 288: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Modified Subset Construction

● Instead of marking whether a state is accepting, remember which token type it matches.

● Break ties with priorities.● When using DFA as a scanner, consider

the DFA “stuck” if it enters the state corresponding to the empty set.

Page 289: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Performance Concerns

● The NFA-to-DFA construction can introduce exponentially many states.

● Time/memory tradeoff:● Low-memory NFA has higher scan time.● High-memory DFA has lower scan time.

● Could use a hybrid approach by simplifying NFA before generating code.

Page 290: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Real-World Scanning: Python

Page 291: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

w h i l e ( i < z ) \n \t + i p ;

while (ip < z) ++ip;

p + +

T_While ( T_Ident < T_Ident ) ++ T_Ident

ip z ip

While

++

Ident

<

Ident Ident

ip z ip

Page 292: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Python Blocks

● Scoping handled by whitespace:if w == z:

a = b

c = d

else:

e = f

g = h

● What does that mean for the scanner?

Page 293: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Whitespace Tokens

● Special tokens inserted to indicate changes in levels of indentation.

● NEWLINE marks the end of a line.● INDENT indicates an increase in

indentation.● DEDENT indicates a decrease in indentation.● Note that INDENT and DEDENT encode

change in indentation, not the total amount of indentation.

Page 294: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

if w == z: a = b c = delse: e = fg = h

Scanning Python

Page 295: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

if w == z: a = b c = delse: e = fg = h

if ident

w

== ident

z

: NEWLINE

INDENT ident

a

= ident

b

NEWLINE

ident

c

= ident

d

NEWLINE

DEDENT else :

INDENT ident

e

= ident

f

NEWLINE

DEDENT ident

g

= ident

h

NEWLINE

Scanning Python

NEWLINE

Page 296: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

if w == z: { a = b; c = d;} else { e = f;}g = h;

if ident

w

== ident

z

: NEWLINE

INDENT ident

a

= ident

b

NEWLINE

ident

c

= ident

d

NEWLINE

DEDENT else :

INDENT ident

e

= ident

f

NEWLINE

DEDENT ident

g

= ident

h

NEWLINE

Scanning Python

NEWLINE

Page 297: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

if w == z: { a = b; c = d;} else { e = f;}g = h;

if ident

w

== ident

z

:

{ ident

a

= ident

b

;

ident

c

= ident

d

;

} else :

ident

e

= ident

f

;

} ident

g

= ident

h

;

Scanning Python

{

Page 298: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Where to INDENT/DEDENT?

● Scanner maintains a stack of line indentations keeping track of all indented contexts so far.

● Initially, this stack contains 0, since initially the contents of the file aren't indented.

● On a newline:● See how much whitespace is at the start of the line.● If this value exceeds the top of the stack:

– Push the value onto the stack.– Emit an INDENT token.

● Otherwise, while the value is less than the top of the stack:– Pop the stack.– Emit a DEDENT token.

Source: http://docs.python.org/reference/lexical_analysis.html

Page 299: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Interesting Observation

● Normally, more text on a line translates into more tokens.

● With DEDENT, less text on a line often means more tokens:

if cond1: if cond2: if cond3: if cond4: if cond5: statement1statement2

Page 300: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Summary

● Lexical analysis splits input text into tokens holding a lexeme and an attribute.

● Lexemes are sets of strings often defined with regular expressions.

● Regular expressions can be converted to NFAs and from there to DFAs.

● Maximal-munch using an automaton allows for fast scanning.

● Not all tokens come directly from the source code.

Page 301: Lexical Analysis - Stanford University We Are Lexical Analysis Syntax Analysis Semantic Analysis IR Generation IR Optimization Code Generation Optimization Source Code Machine Code

Next Time

Lexical Analysis

Syntax Analysis

Semantic Analysis

IR Generation

IR Optimization

Code Generation

Optimization

SourceCode

MachineCode

(Plus a little bit here)