Compilers

Prof. Mohamed Hamada, Software Engineering Lab., The University of Aizu, Japan

Source: web-ext.u-aizu.ac.jp/~hamada/LP/L03-LP.pdf · 2018. 10. 9.


Lexical Analyzer (Scanner)

[Figure: the lexical analyzer sits between the source program and the syntax analyzer. It requests characters from the source program (get next char / next char) and supplies tokens to the syntax analyzer on request (get next token / next token). The symbol table contains a record for each identifier.]

Token: the smallest meaningful sequence of characters of interest in the source program.

The lexical analyzer:

1.  Uses regular expressions to define tokens

2.  Uses finite automata to recognize tokens

What are tokens?

Source program text → Tokens

•  Examples of tokens
–  Operators: = + - > ( { := == <>
–  Keywords: if while for int double
–  Identifiers, such as pi in the program fragment const pi=3.14;
–  Numeric literals: 43 6.035 -3.6e10 0x13F3A
–  Character literals: 'a' '~' '\''
–  String literals: "6.891" "Fall 98" "\"\" = empty"
–  Punctuation symbols, such as comma and semicolon

•  Examples of non-tokens
–  White space: space (' '), tab ('\t'), end-of-line ('\n')
–  Comments: /*this is not a token*/

How does the scanner recognize a token?

General approach:

1. Build a deterministic finite automaton (DFA) from a regular expression E

2. Execute the DFA to determine whether an input string belongs to L(E)

Note: The DFA construction is done automatically by a tool such as lex.
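As a minimal sketch of step 2, the following C function executes a hand-built DFA over a string. The regular expression chosen here ([0-9]+), the state numbering, and the name dfa_accepts are assumptions made for illustration; a tool such as lex generates a much larger table of the same kind automatically.

```c
#include <stdbool.h>

/* States of a hand-built DFA for [0-9]+ (hypothetical numbering). */
enum { S_START = 0, S_DIGITS = 1, S_DEAD = 2, N_STATES = 3 };

/* Input classes: every character is either a digit or something else. */
enum { C_DIGIT = 0, C_OTHER = 1, N_CLASSES = 2 };

/* Transition table: next_state[current state][input class]. */
static const int next_state[N_STATES][N_CLASSES] = {
    /* S_START  */ { S_DIGITS, S_DEAD },
    /* S_DIGITS */ { S_DIGITS, S_DEAD },
    /* S_DEAD   */ { S_DEAD,   S_DEAD },
};

/* Returns true when the whole string s belongs to L([0-9]+). */
bool dfa_accepts(const char *s)
{
    int state = S_START;
    for (; *s != '\0'; s++) {
        int cls = (*s >= '0' && *s <= '9') ? C_DIGIT : C_OTHER;
        state = next_state[state][cls];
    }
    return state == S_DIGITS;   /* S_DIGITS is the only accepting state */
}
```

Calling dfa_accepts("2018") follows the table one character at a time and ends in the accepting state, while dfa_accepts("12a") ends in the dead state and is rejected.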

Regular Definitions

Regular definitions are regular expressions associated with suitable names.

For example, a simplified form of the set of identifiers in Java can be expressed by the following regular definition:

letter → A | B | … | Z | a | b | … | z

digit → 0 | 1 | 2 | … | 9

id → letter (letter | digit)*
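The regular definition for id can be translated almost line by line into C. The sketch below is one possible hand-written check, not what lex generates; the function names is_letter, is_digit, and is_id are chosen here for illustration, and the range comparisons on letters assume an ASCII-compatible character set.

```c
#include <stdbool.h>

/* letter -> A | B | ... | Z | a | b | ... | z   (assumes ASCII ordering) */
static bool is_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* digit -> 0 | 1 | ... | 9 */
static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* id -> letter (letter | digit)* */
bool is_id(const char *s)
{
    if (!is_letter(*s))          /* must start with a letter (also rejects "") */
        return false;
    for (s++; *s != '\0'; s++)   /* then zero or more letters or digits */
        if (!is_letter(*s) && !is_digit(*s))
            return false;
    return true;
}
```

With this definition, pi and x25 are identifiers, while 2x (starts with a digit) and a_b (contains a character that is neither a letter nor a digit) are not.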

Regular Definitions Notations

1. The `+` symbol denotes one or more instances

2. The `?` symbol denotes zero or one instance

3. The `[ ]` symbol denotes character classes

Example: the following regular definitions represent unsigned numbers in C

digit → [0-9]

digits → digit+

fraction → (. digits)?

exponent → (E (+|-)? digits)?

number → digits fraction exponent
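To make the roles of `+` and `?` concrete, here is a small hand-written C recognizer that follows the definitions of digits, fraction, exponent, and number one by one. It is a sketch under the assumption that the whole string must be a number; the names match_digits and is_number are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

static bool is_digit(char c) { return c >= '0' && c <= '9'; }

/* digits -> digit+ : returns the number of leading digits in s (0 if none). */
static size_t match_digits(const char *s)
{
    size_t n = 0;
    while (is_digit(s[n]))
        n++;
    return n;
}

/* number -> digits fraction exponent : true when all of s is an unsigned number. */
bool is_number(const char *s)
{
    size_t n = match_digits(s);
    if (n == 0)
        return false;                 /* digits is mandatory */
    s += n;

    if (*s == '.') {                  /* fraction -> (. digits)? */
        n = match_digits(s + 1);
        if (n == 0)
            return false;             /* a '.' must be followed by digits */
        s += 1 + n;
    }

    if (*s == 'E') {                  /* exponent -> (E (+|-)? digits)? */
        const char *p = s + 1;
        if (*p == '+' || *p == '-')
            p++;
        n = match_digits(p);
        if (n == 0)
            return false;             /* an 'E' must be followed by digits */
        s = p + n;
    }
    return *s == '\0';                /* nothing may follow the number */
}
```

Each `?` in the definitions becomes an if that may be skipped, and each `+` becomes a loop that must run at least once.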

How to Parse a Regular Expression?

Given a regular expression, how do we generate an automaton to recognize tokens?

Create an NFA and convert it to a DFA.

Using the three simple rules presented, it is easy to generate an NFA to recognize a regular expression.

Given a DFA, we can generate an automaton that recognizes the longest substring of an input that is a valid token.

Regular expressions for some tokens

if                                      {return IF;}
[a-z][a-z0-9]*                          {return ID;}
[0-9]+                                  {return NUM;}
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)     {return REAL;}
("--"[a-z]*"\n")|(" "|"\n"|"\t")+       {/* do nothing */}
.                                       {error();}

Building Finite Automata for Lexical Tokens

if {return IF;}

The NFA for a symbol i is: start → (1) --i--> (2)

The NFA for a symbol f is: start → (1) --f--> (2)

The NFA for the regular expression if is: start → (1) --i--> (2) --f--> (3), where state 3 is the accepting state for IF

Building Finite Automata for Lexical Tokens

[a-z][a-z0-9]* {return ID;}

[Figure: automaton for ID. start → (1) --a-z--> (2); state 2 loops on a-z and on 0-9 and is the accepting state for ID.]

Building Finite Automata for Lexical Tokens

[0-9]+ {return NUM;}

[Figure: automaton for NUM. start → (1) --0-9--> (2); state 2 loops on 0-9 and is the accepting state for NUM.]

Building Finite Automata for Lexical Tokens

([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) {return REAL;}

[Figure: automaton for REAL with states 1 to 5. start → (1) --0-9--> (2) --.--> (3), with 0-9 loops on states 2 and 3; (1) --.--> (4) --0-9--> (5), with a 0-9 loop on state 5. States 3 and 5 are the accepting states for REAL.]

Building Finite Automata for Lexical Tokens

("--"[a-z]*"\n")|(" "|"\n"|"\t")+ {/* do nothing */}

[Figure: automaton for comments and white space with states 1 to 5. start → (1) --'-'--> (2) --'-'--> (3) --\n--> (4), with an a-z loop on state 3; (1) --blank, \n, \t--> (5), with a loop on blank, \n, \t on state 5. States 4 and 5 are accepting; their action is /* do nothing */.]

Building Finite Automata for Lexical Tokens

[Figure: the automata for IF, ID, NUM, REAL, white space, and error shown together. The error automaton has two states, with an edge from state 1 to state 2 labeled "any but \n".]

Conversion of NFA into DFA

[Figure: combined NFA with states 1 to 15. Edges are labeled i, f, a-z, 0-9, any character, and λ (nine λ moves); the accepting states are labeled IF, ID, NUM, and error.]

The DFA is built from the NFA one transition at a time:

1. The start state of the DFA is 1-4-9-14.

2. From 1-4-9-14, a-h leads to 5-6-8-15.

3. From 1-4-9-14, i leads to 2-5-6-8-15.

4. From 1-4-9-14, j-z also leads to 5-6-8-15.

5. From 1-4-9-14, 0-9 leads to 10-11-13-15.

6. From 1-4-9-14, any other character leads to 15.

The analysis for 1-4-9-14 is complete. We mark it and pick another state in the DFA to analyze.
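The transitions out of 1-4-9-14 can be recomputed mechanically with two helper operations: the λ-closure of a set of NFA states, and the set reached on one input character. The C sketch below represents a set of NFA states as a bit mask. The edge list is an assumption: the figure's edges did not survive in this transcript, so the list was chosen to be consistent with the DFA state names on the slides (including the nine λ moves). The names closure, dfa_edge, and BIT are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BIT(s) ((uint32_t)1 << (s))      /* the set containing only NFA state s */

/* One NFA edge; lo == hi == 0 marks a lambda (empty) move. */
struct edge { int from; unsigned char lo, hi; int to; };

/* Assumed edge list for NFA states 1..15 (see the note above). */
static const struct edge edges[] = {
    {1, 'i', 'i', 2}, {2, 'f', 'f', 3},                          /* IF    */
    {1, 0, 0, 4}, {4, 'a', 'z', 5}, {5, 0, 0, 8}, {8, 0, 0, 6},  /* ID    */
    {6, 'a', 'z', 7}, {6, '0', '9', 7}, {7, 0, 0, 8},
    {1, 0, 0, 9}, {9, '0', '9', 10}, {10, 0, 0, 13},             /* NUM   */
    {13, 0, 0, 11}, {11, '0', '9', 12}, {12, 0, 0, 13},
    {1, 0, 0, 14}, {14, 1, 255, 15},                             /* error */
};

/* Lambda-closure: add every state reachable through lambda moves alone. */
uint32_t closure(uint32_t set)
{
    uint32_t before;
    do {
        before = set;
        for (size_t k = 0; k < sizeof edges / sizeof edges[0]; k++)
            if (edges[k].lo == 0 && edges[k].hi == 0 && (set & BIT(edges[k].from)))
                set |= BIT(edges[k].to);
    } while (set != before);             /* repeat until nothing new is added */
    return set;
}

/* DFA transition: states reachable from `set` on character c, then closed. */
uint32_t dfa_edge(uint32_t set, unsigned char c)
{
    uint32_t out = 0;
    for (size_t k = 0; k < sizeof edges / sizeof edges[0]; k++)
        if (edges[k].hi != 0 && (set & BIT(edges[k].from))
            && edges[k].lo <= c && c <= edges[k].hi)
            out |= BIT(edges[k].to);
    return closure(out);
}
```

Under this edge list, closure(BIT(1)) is the set 1-4-9-14, and dfa_edge of that set on 'i' is the set 2-5-6-8-15, matching the steps listed above.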

The corresponding DFA

State          Accepts   Transitions
1-4-9-14       (start)   a-h, j-z → 5-6-8-15; i → 2-5-6-8-15; 0-9 → 10-11-13-15; other → 15
2-5-6-8-15     ID        f → 3-6-7-8; a-e, g-z, 0-9 → 6-7-8
3-6-7-8        IF        a-z, 0-9 → 6-7-8
5-6-8-15       ID        a-z, 0-9 → 6-7-8
6-7-8          ID        a-z, 0-9 → 6-7-8
10-11-13-15    NUM       0-9 → 11-12-13
11-12-13       NUM       0-9 → 11-12-13
15             error     (none)
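A DFA like this one can drive a scanner that returns the longest valid token: it keeps stepping while a transition exists and remembers the last accepting state it passed through. The following C sketch hard-codes the eight DFA states named above plus a dead state; the enum names, the lexeme struct, and next_token are hypothetical, and the transitions are written from the state names and edge labels as read from the slide.

```c
#include <stddef.h>

enum token { T_NONE, T_IF, T_ID, T_NUM, T_ERROR };

/* DFA states; comments give the NFA-state sets used as names on the slide. */
enum state {
    DEAD,        /* no transition possible */
    START,       /* 1-4-9-14    */
    I_SEEN,      /* 2-5-6-8-15  */
    IF_SEEN,     /* 3-6-7-8     */
    ID_FIRST,    /* 5-6-8-15    */
    ID_REST,     /* 6-7-8       */
    NUM_FIRST,   /* 10-11-13-15 */
    NUM_REST,    /* 11-12-13    */
    ERR          /* 15          */
};

/* Token reported when the DFA stops in each state (T_NONE = not accepting). */
static const enum token final_of[] = {
    [DEAD] = T_NONE,    [START] = T_NONE,   [I_SEEN] = T_ID,
    [IF_SEEN] = T_IF,   [ID_FIRST] = T_ID,  [ID_REST] = T_ID,
    [NUM_FIRST] = T_NUM, [NUM_REST] = T_NUM, [ERR] = T_ERROR
};

struct lexeme { enum token kind; size_t len; };

static int is_lower(char c) { return c >= 'a' && c <= 'z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

static enum state step(enum state s, char c)
{
    switch (s) {
    case START:
        if (c == 'i')    return I_SEEN;
        if (is_lower(c)) return ID_FIRST;      /* a-h, j-z */
        if (is_digit(c)) return NUM_FIRST;
        return ERR;                            /* other */
    case I_SEEN:
        if (c == 'f')    return IF_SEEN;
        return (is_lower(c) || is_digit(c)) ? ID_REST : DEAD;
    case IF_SEEN:
    case ID_FIRST:
    case ID_REST:
        return (is_lower(c) || is_digit(c)) ? ID_REST : DEAD;
    case NUM_FIRST:
    case NUM_REST:
        return is_digit(c) ? NUM_REST : DEAD;
    default:
        return DEAD;                           /* ERR and DEAD have no edges */
    }
}

/* Longest match: scan from s and remember the last accepting state seen. */
struct lexeme next_token(const char *s)
{
    struct lexeme best = { T_NONE, 0 };
    enum state st = START;
    for (size_t i = 0; s[i] != '\0'; i++) {
        st = step(st, s[i]);
        if (st == DEAD)
            break;
        if (final_of[st] != T_NONE) {
            best.kind = final_of[st];
            best.len = i + 1;
        }
    }
    return best;
}
```

For the input if the scanner reports IF with length 2, but for iffy it continues past the f and reports ID with length 4, which is the longest-match behavior described earlier.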


How to write a scanner?

General approach: The construction is done automatically by a tool such as the Unix program lex.

1. Using the grammar of the source language, write a simple lex program and save it in a file named lex.l.

2. Use the Unix program lex to compile lex.l, resulting in a C (scanner) program named lex.yy.c.

3. Compile and link the C program lex.yy.c in the normal way, resulting in the required scanner.

Using Lex

Lex source program mylex.l → Lex compiler → lex.yy.c

lex.yy.c → C compiler → a.out

Source program → Scanner (a.out) → sequence of tokens

Lex

In addition to compilers and interpreters, Lex can be used in many other software applications:

1. The desktop calculator bc

2. The tools eqn and pic (used for mathematical equations and complex pictures)

3. PCC (Portable C Compiler), used with many UNIX systems

4. GCC (GNU C Compiler), used with many UNIX systems

5. A menu compiler

6. An SQL database language syntax checker

7. The Lex program itself

And many more

Lex program specification

A Lex program consists of the following three parts:

declarations
%%
translation rules
%%
user subroutines (auxiliary procedures)

The first %% is required to mark the beginning of the translation rules and the second %% is required only if user subroutines follow.

Lex program specification

Declarations: include declarations of variables, constants, and regular definitions.

Translation rules: are statements of the form:

p1 {action1}
p2 {action2}
…
pn {actionn}

where each pi is a regular expression and each actioni is a program fragment describing what action the lexical analyzer should take when pattern pi matches a lexeme. For example, pi may be the keyword if and the corresponding actioni is {return(IF);}.

Lex program specification

How to compile and run the Lex program specification

First, use a text editor (for example mule) to create your Lex specification program and save it under any name, but it must have the extension .l (for example mylexprogram.l).

Next, compile the program using the UNIX lex command, which automatically generates the C program under the name lex.yy.c:

% lex mylexprogram.l

Finally, use the UNIX C compiler cc to compile the C program lex.yy.c and link it with the Lex library (-ll). The resulting executable is named first:

% cc lex.yy.c -o first -ll

Lex program specification

Example 1 Simple verb recognizer

verb → is | am | was | do | does | has | have

The following is a lex program for the tokens of the grammar

%{
/* This is a very simple Lex program for recognizing a few verbs */
#include <stdio.h>
%}

%%
[ \t]+      /* ignore white space */ ;
is |
am |
was |
do |
does |
has |
have        { printf("%s: is a verb\n", yytext); }
[a-zA-Z]+   { printf("%s: is other\n", yytext); }
%%

int main(void) { yylex(); return 0; }
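When two rules match, lex prefers the longest match and, among matches of equal length, the rule listed first. That is why island is reported as other (the [a-zA-Z]+ rule matches all six letters) while is is reported as a verb (both rules match two letters, and the verb rule comes first). For a word that has already been separated from its neighbours, the net effect can be sketched in plain C as an exact comparison against the verb list. The function name is_verb is hypothetical and this is not the code that lex generates.

```c
#include <stdbool.h>
#include <string.h>

/* The verbs listed in the rule: verb -> is | am | was | do | does | has | have */
static const char *const verbs[] = { "is", "am", "was", "do", "does", "has", "have" };

/* Returns true when word is exactly one of the listed verbs. */
bool is_verb(const char *word)
{
    for (size_t i = 0; i < sizeof verbs / sizeof verbs[0]; i++)
        if (strcmp(word, verbs[i]) == 0)
            return true;
    return false;
}
```

Because the comparison is exact, a longer word that merely starts with a verb is not classified as a verb, which mirrors the longest-match rule.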

Lex program specification

Example 2 Consider the following grammar

statement → if expression then statement
          | if expression then statement else statement
expression → term relop term
           | term
term → id | number

With the following regular definitions

letter → [A-Za-z]
digit → [0-9]
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
number → digit+ (. digit+)? (E(+|-)? digit+)?

Lex program specification

The following is a lex program for the tokens of the grammar in Example 2

%{
/* Here are the definitions of the constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

%%

{ws}      { /* no action and no return */ }
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = install_id(); return(ID);}
{number}  {yylval = install_num(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
"<="      {yylval = LE; return(RELOP);}
"="       {yylval = EQ; return(RELOP);}
"<>"      {yylval = NE; return(RELOP);}
">"       {yylval = GT; return(RELOP);}
">="      {yylval = GE; return(RELOP);}

%%

/* Placeholder: install the lexeme in yytext into the symbol table */
int install_id(void) { return 0; }

/* Placeholder: install the numeric lexeme in yytext into a table of numbers */
int install_num(void) { return 0; }
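The install_id routine is only a placeholder in the program above. One possible way to fill it in, assuming a small fixed-size symbol table with one record per distinct identifier, is sketched below. Everything here is an assumption for illustration: the table layout, the capacity limits, and the two-argument signature (the slide's version takes no arguments and would read the lexeme from yytext and yyleng instead).

```c
#include <string.h>

#define MAX_IDS   128   /* arbitrary capacity for this sketch */
#define MAX_NAME   32   /* longest identifier stored, including the final '\0' */

/* Symbol table: one record per distinct identifier. */
static char symtab[MAX_IDS][MAX_NAME];
static int  symcount = 0;

/* Returns the table index of the lexeme, inserting it on first sight.
   Returns -1 if the lexeme is too long or the table is full. */
int install_id(const char *lexeme, size_t length)
{
    if (length >= MAX_NAME)
        return -1;
    for (int i = 0; i < symcount; i++)
        if (strlen(symtab[i]) == length && memcmp(symtab[i], lexeme, length) == 0)
            return i;                       /* already recorded */
    if (symcount == MAX_IDS)
        return -1;
    memcpy(symtab[symcount], lexeme, length);
    symtab[symcount][length] = '\0';
    return symcount++;
}
```

Inside the {id} rule, such a routine could be called as yylval = install_id(yytext, yyleng); so that the parser receives the index of the identifier's record together with the ID token.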