cse 5317/4305 l2: lexical analysis1 lexical analysis leonidas fegaras

28
CSE 5317/4305 L2: Lexical Analysis 1 Lexical Analysis Leonidas Fegaras

Upload: eustace-gilmore

Post on 13-Jan-2016

229 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 1

Lexical Analysis

Leonidas Fegaras

Page 2: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 2

Lexical Analysis

• A scanner groups input characters into tokens

input token value

identifier x

equal =

identifier x

star *

x = x * (acc+123) left-paren (

identifier acc

plus +

integer 123

right-paren )• Tokens are typically represented by numbers

Page 3: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 3

Communication with the Parser

• Each time the parser needs a token, it sends a request to the scanner

• the scanner reads as many characters from the input stream as necessary to construct a single token

• when a single token is formed, the scanner is suspended and returns the token to the parser

• the parser will repeatedly call the scanner to read all the tokens from the input stream

scanner parser

get token

token

source file

get next character AST

Page 4: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 4

Tasks of a Scanner

• A typical scanner:– recognizes the keywords of the language

• these are the reserved words that have a special meaning in the language, such as the word class in Java

– recognizes special characters, such as ( and ), or groups of special characters, such as := and ==

– recognizes identifiers, integers, reals, decimals, strings, etc– ignores whitespaces (tabs, blanks, etc) and comments– recognizes and processes special directives (such as the #include "file" directive in C) and macros

Page 5: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 5

Scanner Generators

• Input: a scanner specification– describes every token using Regular Expressions (REs)

eg, the RE [a-z][a-zA-Z0-9]*

recognizes all identifiers with at least one alphanumeric letter whose first letter is lower-case alphabetic– handles whitespaces and resolve ambiguities

• Output: the actual scanner• Scanner generators compile regular expressions into efficient

programs (finite state machines)• You will use a scanner generator for Java, called JLex, for the

project

Page 6: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 6

Regular Expressions

• are a very convenient form of representing (possibly infinite) sets of strings, called regular sets– eg, the RE (a | b)*aa represents the infinite set

{“aa”,“aaa”,“baa”,“abaa”, ... }– a RE is one of the following:

name RE designation

epsilon {“”}

symbol a {“a”} for some character a

concatenation AB the set { rs | rA, sB }, where rs is string concatenation,

and A and B designate the REs for A and B

alternation A | B the set A B, where A and B designate the REs for A and B

repetition A* the set | A | (AA) | (AAA) | ... (an infinite set)

– eg, the RE (a | b)c designates { rs | r{“a”}{“b”}, s{“c”} }, which is equal to {“ac”,“bc”}

– Shortcuts: P+ = PP*, P? = P | , [a-z] = (“a”|“b”|...|“z”)

Page 7: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 7

Properties

• concatenation and alternation are associative– eg, ABC means (AB)C and is equivalent to A(BC)

• alternation is commutative– eg, A | B = B | A

• repetition is idempotent– eg, A** = A*

• concatenation distributes over alternation– eg, (a | b)c = ac | bc

Page 8: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 8

Examples

for-keyword = for

letter = [a-zA-Z]

digit = [0-9]

identifier = letter (letter | digit)*

sign = + | - | integer = sign (0 | [1-9]digit*)

decimal = integer . digit*

real = (integer | decimal) E sign digit+

Page 9: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 9

Disambiguation Rules

1) longest match rule: from all tokens that match the input prefix, choose the one that matches the most characters

2) rule priority: if more than one token has the longest match, choose the one listed first

Examples:• for8 is it the for-keyword, the identifier “f”, the identifier

“fo”, the identifier “for”, or the identifier “for8”?

Use rule 1: “for8” matches the most characters.• for is it the for-keyword, the identifier “f”, the identifier

“fo”, or the identifier “for”?

Use rule 1 & 2: the for-keyword and the “for”

identifier have the longest match but the

for-keyword is listed first.

Page 10: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 10

How Scanner Generators Work

• Translate REs into a finite state machine• Done in three steps:

1) translate REs into a no-deterministic finite automaton (NFA)

2) translate the NFA into a deterministic finite automaton (DFA)

3) optimize the DFA (optional)

Page 11: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 11

Deterministic Finite Automata

• A DFA represents a finite state machine that recognizes a RE– eg, the RE (abc+)+ is represented by the DFA:

• A finite automaton consists of– a finite set of states

– a set of transitions (moves)

– one start state

– a set of final states (accepting states)

• a DFA has a unique transition for every state-character combination• A DFA accepts a string if starting from the start state and moving

from state to state, each time following the arrow that corresponds the current input character, it reaches a final state when the entire input string is consumed

Page 12: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 12

DFA (cont.)

• The error state 0 is implied:

• The transition table T gives the next state T[s,c] for a state s and a character c

a b c0 0 0 01 2 0 02 0 3 03 0 0 44 2 0 4

Page 13: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 13

The DFA of a Scanner

for-keyword = for

identifier = [a-z][a-z0-9]*

Page 14: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 14

Scanner Code

• The scanner code that uses the transition table T:state = initial_state;

current_character = get_next_character();

while ( true )

{ next_state = T[state,current_character];

if (next_state == ERROR)

break;

state = next_state;

current_character = get_next_character();

if ( current_character == EOF )

break;

};

if ( is_final_state(state) )

`we have a valid token'

else `report an error'

Page 15: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 15

With Longest Match

state = initial_state;

final_state = ERROR;

current_character = get_next_character();

while ( true )

{ next_state = T[state,current_character];

if (next_state == ERROR)

break;

state = next_state;

if ( is_final_state(state) )

final_state = state;

current_character = get_next_character();

if (current_character == EOF)

break;

};

if ( final_state == ERROR )

`report an error'

else if ( state != final_state )

`we have a valid token but need to backtrack (to put characters back into the input stream)'

else `we have a valid token'

Page 16: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 16

Alternative Scanner Code

• For each transition in a DFA

s1

• generate code:

s1: current_character = get_next_character();

...

if ( current_character == 'c' )

goto s2;

...

s2: current_character = get_next_character();

...

s2c

Page 17: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 17

Mapping a RE into an NFA

• An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over

• The following rules construct NFAs with only one final state:

Page 18: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 18

Example

• The RE (a | b)c is mapped into the NFA:

Page 19: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 19

Converting an NFA to a DFA

• Subset construction:– assign a number to each NFA state– each DFA state will be assigned a set of numbers

– the closure of a DFA state {n1,...,n

k} is the DFA state that contains all the

NFA states that can be reached by zero or more empty transitions (ie, transitions) from the NFA states n

1, ..., or n

k

• so the closure of {n1,...,n

k} is a superset of or equal to {n

1,...,n

k}

– the initial DFA state is the closure of the initial NFA state

– for every DFA state labelled by some set {n1,...,n

k} and for every character

c in the language alphabet, you find all the states reachable by n1, n

2, or n

k

using c arrows and you union together the closures of these nodes. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label

Page 20: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 20

Example

Page 21: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 21

Example

(a | b)*(abb | a+b)

Page 22: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 22

JLex

• Regular expressions (where e and f are regular expressions):– c any character c other than: ? * + | ( ) ^ $ . [ ] { } " \

– \c any character c, but \n is newline, \^c is control-c, etc

– . any character except \n

– “...” the concatenation of all the characters in the string

– ef concatenation

– e | f alternation

– e* Kleene closure

– e+ ee*

– e? optional e

– {name} macro expansion

– [...]any character enclosed in [ ] (but only one character), from:• c a character c (or use \c)

• ef any character from e or from f

• a-b any character from a to b

• “...”any character in the string

– [^...]any character except those enclosed by [ ]

Page 23: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 23

JLex Rules

• A JLex rule:RE { action }• where action is Java code

– typically, the action returns a token– but you want to skip whitespaces and comments– yytext() returns the part of the input that matches the RE

• JLex uses longest match and rule priority• States and state transitions can be used for better control

– the initial (default) state is YYINITIAL– any other state should be declared using the %state directive– now a rule can take the form:

<s> RE { action }

which can match if we are in state s only

– you jump to a state s using yybegin(s)

Page 24: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 24

Case Study: The Calculator Scanner

• The calculator example is available at:

http://lambda.uta.edu/cse5317/calc.tar.gz• After you download it on gamma, do:tar xfz calc.tar.gzcd calcbuildrun

• then try it with some input; eg,2*(3+8);x:=3+4;x+3;define f(n) = if n=0 then 1 else n*f(n-1);f(5);quit;

Page 25: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 25

Tokens are Defined in calc.cup

terminal LP, RP, COMMA, SEMI, ASSIGN, IF, THEN, ELSE, AND, OR, NOT, QUIT, PLUS, TIMES, MINUS, DIV, EQ, LT, GT, LE, NE, GE, FALSE, TRUE, DEFINE;

terminal String ID;

terminal Integer INT;

terminal Float REALN;

terminal String STRINGT;

• The class constructor Symbol pairs together a terminal token with an optional value (a Java Object)– if a terminal is specified with a class (a subtype of Object) then an object

of this class should be provided along with the token– eg, Symbol(sym.ID,“x”)– eg, Symbol(sym.INT,10)

Page 26: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 26

The Calculator Scanner

import java_cup.runtime.Symbol;

%%%class CalcLex%public%line%char%cup

DIGIT=[0-9]ID=[a-zA-Z][a-zA-Z0-9_]*

%%

Page 27: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 27

The Calculator Scanner (cont.)

{DIGIT}+ { return new Symbol(sym.INT,new Integer(yytext())); }

{DIGIT}+"."{DIGIT}+ { return new Symbol(sym.REALN,new Float(yytext())); }

"(" { return new Symbol(sym.LP); }

")" { return new Symbol(sym.RP); }

"," { return new Symbol(sym.COMMA); }

";" { return new Symbol(sym.SEMI); }

":=" { return new Symbol(sym.ASSIGN); }

"define" { return new Symbol(sym.DEFINE); }

"quit" { return new Symbol(sym.QUIT); }

"if" { return new Symbol(sym.IF); }

"then" { return new Symbol(sym.THEN); }

"else" { return new Symbol(sym.ELSE); }

"and" { return new Symbol(sym.AND); }

"or" { return new Symbol(sym.OR); }

"not" { return new Symbol(sym.NOT); }

"false" { return new Symbol(sym.FALSE); }

"true" { return new Symbol(sym.TRUE); }

Page 28: CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras

CSE 5317/4305 L2: Lexical Analysis 28

The Calculator Scanner (cont.)

"+" { return new Symbol(sym.PLUS); }

"*" { return new Symbol(sym.TIMES); }

"-" { return new Symbol(sym.MINUS); }

"/" { return new Symbol(sym.DIV); }

"=" { return new Symbol(sym.EQ); }

"<" { return new Symbol(sym.LT); }

">" { return new Symbol(sym.GT); }

"<=" { return new Symbol(sym.LE); }

"!=" { return new Symbol(sym.NE); }

">=" { return new Symbol(sym.GE); }

{ID} { return new Symbol(sym.ID,yytext()); }

\"[^\"]*\" { return new Symbol(sym.STRINGT, yytext().substring(1,yytext().length()-1)); }

[ \t\r\n\f] { /* ignore white spaces. */ }

. { System.err.println("Illegal character: "+yytext()); }