1 lexical analysis cheng-chia chen. 2 outline 1. the goal and niche of lexical analysis in a...

Lexical Analysis

Cheng-Chia Chen

Outline

1. The goal and niche of lexical analysis in a compiler 2. Lexical tokens3. Regular expressions (RE)4. Use regular expressions in lexical specification5. Finite automata (FA)

» DFA and NFA» from RE to NFA» from NFA to DFA» from DFA to optimized DFA

6. Lexical-analyzer generators

1. The goal and niche of lexical analysis

Source Tokens

Interm.Language

Lexicalanalysis

Parsing

CodeGen.

MachineCode

Optimization

(token stream)

(char stream)

Goal of lexical analysis: breaking the input into individual words or “tokens”

Lexical Analysis What do we want to do? Example:

if (i == j)Z = 0;

elseZ = 1;

The input is just a sequence of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Goal: Partition input string into substrings» And determine the categories (token types) to which

the substrings belong

2. Lexical Tokens

What’s a token ? Token attributes Normal token and special tokens Example of tokens and special tokens.

What’s a token? a sequence of characters that can be treated as a unit in

the grammar of a PL. Output of lexical analysis is a stream of tokens

Tokens are partitioned into categories called token types. ex:» In English:

– book, students, like, help, strong,… : token- noun, verb, adjective, … : token type

» In a programming language:– student, var34, 345, if, class, “abc” … : token– ID, Integer, IF, WHILE, Whitespace, … : token type

Parser relies on the token type instead of token distinctions to analyze:» var32 and var1 are treated the same,» var32(ID), 32(Integer) and if(IF) are treated differently.

Token attributes token type :

» category of the token; used by syntax analysis.» ex: identifier, integer, string, if, plus, …

token value : » semantic value used in semantic analysis.» ex: [integer, 26], [string, “26”]

token lexeme (member, text): » textual content of a token» [while, “while”], [identifier, “var23”], [plus, “+”],…

positional information: » start/end line/position of the textual content in the source

program.

Notes on Token attributes

Token types affect syntax analysis Token values affect semantic analysis lexeme and positional information affect error

handling Only token type information must be supplied by the

lexical analyzer. Any program performing lexical analysis is called a

scanner (lexer, lexical analyzer).

Aspects of Token types Language view: A token type is the set of all

lexemes of all its token instances. » ID = {a, ab, … } – {if, do,…}.» Integer = { 123, 456, …}» IF = {if}, WHILE={while}; » STRING={“abc”, “if”, “WHILE”,…}

Pattern (regular expression): a rule defining the language of all instances of a token type.» WHILE: w h i l e» ID: letter (letters | digits )*» ArithOp: + | - | * | /

Lexical Analyzer: Implementation

An implementation must do two things:

1. Recognize substrings corresponding to lexemes of tokens

2. Determine token attributes1. type is necessary

2. value depends on the type/application,

3. lexeme/positional information depends on applications (eg: debug or not).

Example input lines:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Token-lexeme pairs returned by the lexer:» [Whitespace, “\t”]» [if, - ]» [OpenPar, “(“] » [Identifier, “i”]» [Relation, “==“]» [Identifier, “j”]» …

Normal Tokens and special Tokens

Kinds of tokens» normal tokens: needed for later syntax

analysis and must be passed to parser. » special tokens

– skipped tokens (or nontoken): – do not contribute to parsing,– discarded by the scanner.

Examples: Whitespace, Comments» why need them ?

Question: What happens if we remove all whitespace and all comments prior to scanning?

Lexical Analysis in FORTRAN

FORTRAN rule: Whitespace is insignificant

E.g., VAR1 is the same as VA R1

Footnote: FORTRAN whitespace rule motivated by inaccuracy of punch card operators

A terrible design! Example

Consider» DO 5 I = 1,25» DO 5 I = 1.25

The first is DO 5 I = 1 , 25 The second is DO5I = 1.25

Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after “,” is reached

Lexical Analysis in FORTRAN. Lookahead.

Two important points:

1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time

2. “Lookahead” may be required to decide where one token ends and the next token begins

» Even our simple example has lookahead issues

i vs. if = vs. ==

Some token types of a typical PL

Type Examples

ID foo n14 last

NUM 73 0 00 515 082

REAL 66.1 .5 10. 1e67 1.5e-10

COMMA ,

NOTEQ !=

LPAREN (

RPAREN )

Some Special Tokens

1,5 are skipped. 2,3 need preprocess, 4 need to be expanded.

1. comment /* … */

// …

2. preprocessor directive

#include <stdio.h>

3. #define NUMS 5,6

4. macro NUMS

5.blank,tabs,newlines \t \n

3. Regular expressions and Regular Languages

The geography of lexical tokens

ID: var1, last5,…

REAL12.35

2.4 e –10…

NUM23 56 0 000

IF:ifLPAREN

RPAREN)

special tokens : \t \n /* … */

the set of all strings

Issues

Definition problem:» how to define (formally specify) the set of

strings(tokens) belonging to a token type ?» => regular expressions

(Recognition problem) » How to determine which set (token type) a

input string belongs to?» => DFA!

Languages

Def. Let be a set of symbols (or characters). A language over is a set of strings of characters

drawn from ( is called the alphabet )

Examples of Languages Alphabet = English

characters Language = English

Not every string on English characters is an English word» likes, school,…» beee,yykk,…

Alphabet = ASCII Language = C programs

Note: ASCII character set is different from English character set

Regular Expressions A language (metaLanguage) for representing (or defining)

languages(sets of words) Definition: If is an alphabet. The set of regular

expression(RegExpr) over is defined recursively as follows:» (Atomic RegExpr) : 1. any symbol c is a RegExpr.» 2. (empty string) is a RegExpr.» (Compound RegExpr): if A and B are RegExpr, then so

3. (A | B) (alternation)

4. (A B) (concatenation)

5. A* (repetition)

Semantics (Meaning) of regular expressions

For each regular expression A, we use L(A) to express the language defined by A.

I.e. L is the function:

L: RegExpr() the set of Languages over

L(A) = the language denoted by RegExpr A

The meaning of RegExpr can be made clear by explicitly defining L.

Atomic Regular Expressions

1. Single symbol: c

L(c) = { c } (for any c ) 2. Epsilon (empty string):

L() = {}

Compound Regular Expressions

3. alternation ( or union or choice)

L( (A | B) ) = { s | s L(A) or s L(B) } 4. Concatenation: AB (where A and B are reg. exp.)

L((A B)) =L(A) L(B)

=def { | L(A) and L(B) } Note:

» Parentheses enclosing (A|B) and (AB) can be omitted if there is no worries of confusion.

» MN (set concatenation) and (string concatenation) will be abbreviated to AB and , respectively.

» AA and L(A) L(A) are abbreviated as A2 and L(A)2, respectively.

Examples

if | then | else { if, then, else} 0 | 1 | … | 9 { 0, 1, …, 9 } (0 | 1) (0 | 1) { 00, 01, 10, 11 }

More Compound Regular Expressions

5. repetition ( or Iteration): A*

L(A*) = { } L(A) L(A)L(A) L(A)3 …

Examples:» 0* : {, 0, 00, 000, …}» 10* : strings starting with 1 and followed by 0’s.» (0|1)* 0 : Binary even numbers.» (a|b)*aa(a|b)*: strings of a’s and b’s containing

consecutive a’s.» b*(abb*)*(a|) : strings of a’s and b’s with no

consecutive a’s.

Example: Keyword

» Keyword: else or if or begin …

else | if | begin | …

Example: Integers

Integer: a non-empty string of digits

( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )*

problem: reuse complicated expression improvement: define intermediate reg. expr.

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

number = digit digit*

Abbreviation: A+ = A A*

Regular Definitions

Names for regular expressions

» d1 =r1

» d2 =r2

» ...

» dn =rn

where ri over alphabet {d1, d2, ..., d i-1}

note: Recursion is not allowed.

Example

» Identifier: strings of letters or digits, starting with a letter

digit = 0 | 1 | ... | 9

letter = A | … | Z | a | … | z

identifier = letter (letter | digit) *

» Is (letter* | digit*) the same ?

Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, CRNL and tabs

WS = (\ | \t | \n | \r\n )+

Example: Email Addresses

Consider chencc@cs.nccu.edu.tw

= letters [ { ., @ }

name = letter+

address = name ‘@’ name (‘.’ name)*

Notational Shorthands

One or more instances» r+ = r r*» r* = (r+ |

Zero or one instance» r? = (r |

Character classes» [abc] = a | b | c» [a-z] = a | b | ... | z» [ac-f] = a | c | d | e | f» [^ac-f] = – [ac-f]

Summary

Regular expressions describe many useful languages

Regular languages are a language specification» We still need an implementation

problem: Given a string s and a rexp R, is

( )?s L R

4. Use Regular expressions in lexical

specification

Specifying lexical structure using regular expressions

Regular Expressions in Lexical Specification

Last lecture: the specification of all lexemes in a token type using regular expression.

But we want a specification of all lexemes of all token types in a programming language.» Which may enable us to partition the input into

lexemes

We will adapt regular expressions to this goal

Regular Expressions => Lexical Spec. (1)

1. Select a set of token types• Number, Keyword, Identifier, ...

2. Write a rexp for the lexemes of each token type• Number = digit+

• Keyword = if | else | …• Identifier = letter (letter | digit)*• LParen = ‘(‘• …

3. Construct R, matching all lexemes for all tokens

R = Keyword | Identifier | Number | …

= R1 | R2 | R3 + …

Facts: If s L(R) then s is a lexeme

» Furthermore s L(Ri) for some “i”

» This “i” determines the token type that is reported

4. Let the input be x1…xn

(x1 ... xn are symbols in the language alphabet)• For 1 i n check

x1…xi L(R) ?

5. It must be that

x1…xi L(Rj) for some j

6. Remove t = x1…xi from input if t is normal token, then pass it to the parser // else it is whitespace or comments, just skip it!7.go to (4)

Ambiguities (1)

There are ambiguities in the algorithm

How much input is used? What if

– x1…xi L(R) and also

– x1…xK L(R) for some i != k.

Rule: Pick the longest possible substring

» The longest match principle !!

Ambiguities (2) Which token is used? What if

– x1…xi L(Rj) and also– x1…xi L(Rk)

Rule: use rule listed first (j iff j < k)» Earlier rule first!

Example: » R1 = Keyword and R2 = Identifier» “if” matches both. » Treats “if” as a keyword not an identifier

Error Handling

What if

No rule matches a prefix of input ? Problem: Can’t just get stuck … Solution:

» Write a rule matching all “bad” strings» Put it last

Lexer tools allow the writing of:

R = R1 | ... | Rn | Error

» Token Error matches if nothing else matches

Summary Regular expressions provide a concise notation for

string patterns Use in lexical analysis requires small extensions

» To resolve ambiguities» To handle errors

Efficient algorithms exist (next)» Require only single pass over the input» Few operations per character (table lookup)

5. Finite Automata Regular expressions = specification Finite automata = implementation A finite automaton consists of

» An input alphabet » A finite set of states S» A start state n» A set of accepting states F S» A set of transitions state input state» If the automata is for recognizing a token type ,

then this type should be associated with the machine.

Finite Automata

Transition

s1 a s2

Is read

In state s1 on input “a” go to state s2

If end of input (or no transition possible)» If in accepting state => accept» Otherwise => reject

Finite Automata State Transition Graphs

A state

• The start state

• An accepting state

• A transitiona

[ T is the tokenType ]T

A Simple Example

A finite automaton that accepts only “1”

Another Simple Example

A finite automaton accepting any number of 1’s followed by a single 0

Alphabet: {0,1}

accepted input: 1*0

And Another Example

Alphabet {0,1} What language does this recognize?

accepted inputs: to be answered later!

And Another Example

Alphabet still { 0, 1 }

The operation of the automaton is not completely defined by the input» On input “11” the automaton could be in either

Epsilon Moves

Another kind of transition: -moves

• Machine can move from state A to state B without reading input

Deterministic and Nondeterministic Automata

Deterministic Finite Automata (DFA)» One transition per input per state » No -moves

Nondeterministic Finite Automata (NFA)» Can have multiple transitions for one input in a

given state» Can have -moves

Finite automata can have only a finite number of states.

Execution of Finite Automata

A DFA can take only one path through the state graph» Completely determined by input

NFAs can choose» Whether to make -moves» Which of multiple transitions for a single input

to take

Acceptance of NFAs

An NFA can get into multiple states

• Input:

• Rule: NFA accepts if it can get in a final state

Acceptance of a Finite Automata

A FA (DFA or NFA) accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s

NFA vs. DFA (1)

NFAs and DFAs recognize the same set of languages (regular languages)

DFAs are easier to implement» There are no choices to consider

NFA vs. DFA (2)

For a given language the NFA can be simpler than the DFA

• DFA can be exponentially larger than NFA

Operations on NFA states

-closure(s): set of NFA states reachable from NFA state s on -transitions alone

-closure(S): set of NFA states reachable from some NFA state s in S on -transitions alone

move(S, c): set of NFA states to which there is a transition on input symbol c from some NFA state s in S

notes: » -closure(S) = Us S∈ -closure(s);» -closure(s) = -closure({s});» -closure(S) = ?

Computing -closure Input. An NFA and a set of NFA states S. Output. E = -closure(S).begin

push all states in S onto stack; T := S;while stack is not empty do begin

pop t, the top element, off of stack;for each state u with an edge from t to u labeled do

if u is not in T do begin add u to T; push u onto stackend

end;return T

Simulating an NFA (for recognizing a token)

Input. An input string ended with eof and an NFA with start state s0 and final states F.

Output. The answer “yes” if accepts, “no” otherwise.begin

S := -closure({s0});c := next_symbol();while c != eof do beginS := -closure(move(S, c));c := next_symbol();end;if S F != then return “yes”else return “no”

Simulating an NFA (for recognizing a sequence of tokens)

Input. An input string ended with eof and an NFA with start state s0 and a set F of final states, each marked by a token type.

Output: a token sequence (possibly ended with error ).

L0: length = 0 ; buf = new ArrayList(); type = -1; S := -closure({s0}); c := next_symbol();

while( c != eof ) { S := -closure(move(S, c)); F1 = S F ; if (F1 != ) { // c goes to final states! buf.add(c ); length= buf.size(); typr = minTypeOf(F1); } if ( S == ) { // cannot make c-transition!! if (length == 0 ) { output( error-token) ; exit() } else { output token(type, buf[0:length-1] ); puch back buf[length:-] and c into the input; goto L0; } else { pos++; buf.add(c); c := next_symbol(); } } if(length = buf.length() ) { if (length > 0) output token(type, buf ) ; } else { output token(type, buf(0, length-1); output error-token ; }.

Regular Expressions to Finite Automata

High-level sketch

Regularexpressions

NFA DFA

LexicalSpecification

Table-driven Implementation of DFA

Optimized DFA

Regular Expressions to NFA (1)

For each kind of rexp, define an NFA» Notation: NFA for rexp A

• For

• For input aa

For AB

• For A | B

ymin(x,y)

For A*

Example of RegExp -> NFA conversion

Consider the regular expression

(1|0)*1 The NFA is

A H 1I J

NFA to DFA

Regular Expressions to Finite Automata

High-level sketch

Regularexpressions

NFA DFA

LexicalSpecification

Table-driven Implementation of DFA

Optimized DFA

RegExp -> NFA :an Examlpe

Consider the regular expression

(1+0)*1 The NFA is

A H 1I J

NFA to DFA. The Trick

Simulate the NFA Each state of DFA

= a non-empty subset of states of the NFA Start state

= the set of NFA states reachable through -moves from NFA start states

Add a transition S a S’ to DFA iff» S’ is the set of NFA states reachable from any

state in S after seeing the input a– considering -moves as well

NFA -> DFA Example

FG H I J

ABCDHI

FGABCDHI

EJGABCDHI

NFA to DFA. Remark

An NFA may be in many states at any time

How many different states ?

If there are N states, the NFA must be in some subset of those N states

How many non-empty subsets are there?» 2N - 1 = finitely many

From an NFA to a DFA Subset construction Algorithm. Input. An NFA N. Output. A DFA D with states S and transition table mv.begin

add -closure(s0) as an unmarked state to S;while there is an unmarked state T in S do begin

mark T; let TokenType(T) = min{ type( s) | s T ∩ F }.∈for each input symbol a do begin

U := -closure(move(T, a));if U is not in S then

add U as an unmarked state to S;mv[T, a] := U

end end end.

Implementation

A DFA can be implemented by a 2D table T» One dimension is “states”» Other dimension is “input symbols”

» For every transition Si a Sk define mv[i,a] = k DFA “execution”

» If in state Si and input a, read mv[i,a] (= k) and move to state Sk

» Very efficient

Table Implementation of a DFA

Simulation of a DFA Input. An input string ended with eof and a DFA with start state

s0 and final states F.Output. The answer “yes” if accepts, “no” otherwise.begin

s := s0;c := next_symbol();while c <> eof do begin

s := mv(s, c); c := next_symbol() end; if s is in F then return “yes” else return “no”end.

Simulation of a DFA(for recognizing token sequence )

Input. An input string ended with eof and a DFA with start state s0 and a set F of final states, each with a type.

Output. a sequence of tokens possibly ended with error token.begin length = 0; buf = new ArrayList(); type = -1; s := s0; c := next_symbol(); while( c <> eof ) {

s := mv(s, c); if( s is not error state ) { if(s is final) { length= buf.size() + 1; type = type(s); } buf.add(c) ; }

if( s is error-state ) { if(length == 0 ) { output error-token ; exit() } else { output token(type, buf[0:length-1]) ; push back buf[length:-] and c into input ; length = 0; type = -1; } } c := next_symbol() } if(s is not final ) { output error-token ;}end.

Implementation (Cont.) NFA -> DFA conversion is at the heart of tools

such as flex

But, DFAs can be huge» DFA => optimized DFA : try to decrease the

number of states. » not always helpful!

In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

Time-Space Tradeoffs

RE to NFA, simulate NFA» time: O(|r| |x|) , space: O(|r|)

RE to NFA, NFA to DFA, simulate DFA» time: O(|x|), space: O(2|r| )

Lazy transition evaluation» transitions are computed as needed at run

time;» computed transitions are stored in cache for

later use

DFA to optimized DFA

Motivations

Problems:1. Given a DFA M with k states, is it possible to find an

equivalent DFA M’ (I.e., L(M) = L(M’)) with state number fewer than k ?

2. Given a regular language A, how to find a machine with minimum number of states ?

Ex: A = L((a+b)*aba(a+b)*) can be accepted by the following NFA:

By applying the subset construction, we can constructa DFA M2 with 24=16 states, of which only 6 are accessible from the initial state {s}.

s t u v

a,b a,b

Inaccessible states

A state p Q is said to be inaccessible (or unreachable) [from the initial state] if there exists no path from from the initial state to it. If a state is not inaccessible, it is accessible.

Inaccessible states can be removed from the DFA without affecting the behavior of the machine.

Problem: Given a DFA (or NFA), how to find all inaccessible states ?

Finding all accessible states:

(like e-closure) Input. An FA (DFA or NFA) Output. the set of all accessible statesbegin

push all start states onto stack; Add all start states into A;

while stack is not empty do beginpop t, the top element, off of stack;for each state u with an edge from t to udo

if u is not in A do begin add u to A; push u onto stackend

end;return A end.

Minimization process Minimization process for a DFA:

» 1. Remove all inaccessible states» 2. Collapse all equivalent states

What does it mean that two states are equivalent?» both states have the same observable behaviors.i.e.,» there is no way to distinguish their difference, or» more formally, we say p and q are not equivalent(or

distinguishable) iff there is a string x * s.t. exactly one of (p,x) and (q,x) is a final state,

» where (p,x) is the ending state of the path from p with x as the input.

Equivalents sates can be merged to form a simpler machine.

1,2 3,4a,b a,b a,b

Example:

Quotient Construction M=(Q,, ,s,F): a DFA. : a relation on Q defined by:

p q <=>for all x * (p,x) F iff (q,x) FProperty: is an equivalence relation. Hence it partitions Q into equivalence classes [p] = {q Q | p q} for p Q. and the quotient set

Q/ = {[p] | p Q}.Every p Q belongs to exactly one class [p] and p q iff [p]=[q].

Define the quotient machine M/ = <Q’,, ’,s’,F’> where» Q’=Q/ ; s’=[s]; F’={[p] | p F}; and’([p],a)=[(p,a)] for all p Q and a .

Minimization algorithm input: a DFA output: a optimized DFA1. Write down a table of all pairs {p,q}, initially unmarked.2. mark {p,q} if p F and q ∈ F or vice versa.3. Repeat until no more change: 3.1 if unmarked pair {p,q} s.t. {move(p,a), move(q,a)} is ∃

marked for some a S, then mark {p,q}.∈4. When done, p q iff {p,q} is not marked.5. merge all equivalent states into one class and return the

resulting machine Note:For recognizing multiple token types, 2 need change to2’ mark {p,q} if type(p) ≠ type(q) [assume non final state has

the same type ]

An Example:

The DFA:

>0 1 2

1F 3 4

2F 4 3

5F 5 5

Initial Table

3 - - -

4 - - - -

5 - - - - -

0 1 2 3 4

After step 2

3 - M M

4 - M M -

5 M - - M M

0 1 2 3 4

After first pass of step 3

3 - M M

4 - M M -

5 M M M M M

0 1 2 3 4

2nd pass of step 3.

The result : 1 2 and 3 4.1 M

3 M2 M M

4 M2 M M -

5 M M1 M1 M M

0 1 2 3 4

1 lexical analysis cheng-chia chen. 2 outline 1. the goal and niche of lexical analysis in a...

Documents

lesson 2 lexical analysis · 2003-04-08 · lexical...

lexical analysis - part 3 - iit...

lexical analyzer (checker). 2 lexical analyzer lexical...

cs1352 principles of compiler design question and answers...

brainfck lexical analysis - github pages€¦ · brainfck...

chapter 3 lexical analysis. outline role of lexical analyzer...

compiler design -...

chapter 2 lexical analysis dr. frank lee. 2.1 file...

lexical and syntactic analysis - cse at...

fall 2003cs416 compiler design1 lexical analyzer lexical...

1 outline informal sketch of lexical analysis –identifies...

lexical analysis lexical analysis is the first phase of...

cs153: compilers lecture 3: lexical analysis...•lexical...

formal languages and compilers lecture vi: lexical...

aavhdvchdvchdvhdh diploma in java · installing java. java...

topic: lexical analysis€¦ · the lexical analysis...

1 structure of compilers lexical analyzer (scanner) modified...

lexical analysis iii recognizing tokens lecture 4 cs...

course revision.. contents lexical analysis our compiler...

chapter 2 chang chi-chung 2007.3.15. lexical analyzer the...