chapter 3. lexical analysis (1)

Chapter 3.

Lexical Analysis (1)

2

Interaction of lexical analyzer with parser.

sourceprogram

lexicalanalyzer

parser

symboltable

token

get nexttoken

3

Lexical Analysis

Issues – Simpler design is preferred – Compiler efficiency is improved– Compiler portability is improved

Terms– Tokens terminal symbols in a grammar– Patterns rules to describing strings of a token

– Lexemes a set of strings matched by the pattern

4

TOKEN SAMPLE LEXEMESINFORMAL DESCRIPTION OF

PATTERN

const

if

relation

id

num

literal

const

if

<, <=, =, <>, >, >=

pi, count, D2

3.1416, 0, 6.02E23

"core dumped"

const

if

< or <= or = or < > or >= or >

letter followed by letters and digits

any numeric constant

any characters between " and " except "

Examples of tokens.

5

Difficulties in implementing lexical

analyzers FORTRAN

– No delimiter is used– DO 5 I=1.25 DO 5 I=1,25 DO 5 I= 1 25

PL/I– Keywords are not reserved– IF THEN THEN THEN = ELSE; ELSE ELSE=THEN;

6

Attributes for tokens

A lexical analyzer collects information about tokens into their associated attributes

Example – E = M * C ** 2

• <id, pointer to symbol-table entry for E>• <assign_op,>• <id, pointer to symbol-table entry for M>• <mult_op,_>• <id, pointer to symbol-table entry for C>• <exp_op,>• <num, integer value 2> generally stored in constant table

7

Lexical Errors

Rules for error recovery– Deleting an extraneous character– Inserting a missing character– Replacing an incorrect character by a correct character– Transposing two adjacent characters

Minimum-distance erroneous correction Example

– Detectable : 2as3, 2#31, …– Undetectable : fi(a == f(x)) …

8

Input Buffering

A single buffer could make a big difficulty– 두 버퍼 사이에 있는 word– Declare (arg1, …. , argn) array or function

Buffer pairs– A good solution– Sentinels 을 쓰면 매번 버퍼의 끝인지와

파일의 끝인지를 동시에 검사할 필요가 없음

9

Sentinels at end of each buffer half.

: : : E : : = : : M : * : eof C : * : * : 2 : eof : : : : : eof

lexeme_beginning

forward

10

Specification of Tokens

Strings and languages – Alphabet or character class finite set of symbols

– String sentence word

– |s| length of a string s

– ε : empty string, Ф ={ε} : empty set

– x, y are strings • xy : concatenation, εx = x ε = x

Operations on languages

11

Terms for parts of a string.

TERM DEFINTION

prefix of sA string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.

suffix of sA string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.

substring of s

A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and are prefixes, suffixes, and substrings of s.

proper prefix, suffix, or substring of s

Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s x.

subsequence of sAny string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.

12

Definitions of operations on languages.

OPERATION DEFINITION

union of L and M written

L M.L M = {s | s is in L or s is in M}

concatenation of L and M written LM

LM = { st | s is in L and t is in M }

Kleene closure of L

written L* L* denotes “zero or more concatenations of” L.

positive closure of L

written L+

L+ denotes “one or more concatenations of” L.

13

Regular Expressions

1. is a regular expression that denotes {}, that is, the set containing the empty string.

2. If a is symbol in , then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol.

3. Suppose r and s are regular expressions denoting the language L(r) and L(s). Then,a) (r)|(s) is a regular expression denoting L(r) L(s).

b) (r)(s) is a regular expression denoting L(r)L(s).

c) (r)* is a regular expression denoting (L(r))*.

d) (r) is a regular expression denoting L(r).

14

Examples on operations in regular expressions

Σ ={a,b} alphabets– a | b {a,b}– (a|b)(c|d) {ac, ad, bc, bd}– a* {ε, a, aa, aaa, …}

– (a|b)* (a*|b*)*

– aa* = a+, ε|a+ = a*

– (a|b) = (b|a)

15

Algebraic properties of regular expressions.

AXIOM DESCRIPTION

r|s = s|r | is commutative

r|(s|t) = (r|s)|t | is associative

(rs)t = r(st) concatenation is associative

r(s|t) = rs|rt

(s|t)r = sr|trconcatenation distributes over |

r = r

r = r is the identity element for concatenation

r* = (r|)* relation between * and

r** = r* * is idempotent

16

Regular Definitions

Regular definition– d1 r1 d2 r2 …. dn rn

• 예• letter A|B| … |Z|a|b| … |z• digit 0|1| … | 9• id letter (letter|digit)*

18

Notational Shorthands (1/2)

1. One or more instances. The unary postfix operator + means “one or more instances of.” If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a’s. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+| and r+ = rr* relate the Kleene and positive closure operators.

2. Zero or one instance. The unary postfix operator ? means “zero or one instance of.” The notation r? is a shorthand for r|. If r is a regular expression, then, (r)? is a regular expression that denotes the language L(r) {}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5 as

19

Notational Shorthands (2/2)

3. Character classes. The notation [abc] where a, b, and c are alphabet symbols denotes the regular expression a | b | c. An abbreviated character class such as [a – z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression

[A – Za – z][A – Za – z0 – 9]*

digit

digits

optional _fraction

optional_exponent

num

0 | 1 | ··· | 9

digit+

( . digits )?

( E ( + | - )? digits )?

Digits optional_fraction optional_exponent

20

Nonregular set

{wcw-1|w is a string of a’s and b’s}

context-free grammar is required to

represent the string

21

Regular-expression patterns for tokens.

REGULAR

EXPRESSIONTOKEN ATTRIBUTE-VALUE

wsif

thenelseid

num<

<==

< >>

>=

-if

thenelseid

numreloprelopreloprelopreloprelop

----

pointer to table entrypointer to table entry

LTLEEQNEGTGE

22

Transition diagram

Finite-state automata states and edges 몇 가지 예를 보여줌 … . 다음 페이지 , 그림 3.14 는 앞의 예를 바탕으로 그림

23

9 10 1011letter otherstart

return(gettoken(), install_id())

letter or digit

*

Transition diagram for identifiers and keywords.

24

Lex 에 의한 구현

Regular definition finite automata, transition diagram

C 프로그램으로 출력 Lexical analysis, pattern matching, …

25

Creating a lexical analyzer with Lex.

Lexcompiler

lex.yy.c

Lexsource

programlex.l

Ccompiler

a.outlex.yy.c

a.outsequence

oftokens

inputstream

26

Lex program for the tokens of Fig. 3. 10. (1/2)

%{

/*definitions of manifest constants

LT, LE, EQ, NE, GT, GE,

IF, THEN, ELSE, ID, NUMBER, RELOP */

%}

/*regular definitions */

delim [ \ t \ n ]

ws { delim }+

letter [ A-Za-z ]

digit [ 0 – 9 ]

id { letter } ( { letter } | { digit } )*

number { digit } + ( \ .{ digit } + ) ? ( E [ + \ - ] ? { digit } + ) ?

27 Lex program for the tokens of Fig. 3. 10. (2/2)

%%{ ws } { /* no action and no return */ }if { return(IF); }then { return(THEN); }else { return(ELSE); }{ id } { yylval = install_id(); return(ID); }{ number } { yylval = install_num(); return(NUMBER); }“<” { yylval = LT; return(RELOP); }“<=” { yylval = LE; return(RELOP); }“=” { yylval = EQ; return(RELOP); }“<>” { yylval = NE; return(RELOP); }“>” { yylval = GT; return(RELOP); }“>=” { yylval = GE; return(RELOP); }%%

install_id() {/* procedure to install the lexeme, whose first character is pointed to by yytext and whose length is yyleng, into the symbol table and return a pointer thereto */

}install_num() {

/* similar procedure to install a lexeme that is a number */}

28

Lookahead operator

DO 5 I = 1.25 DO 5 I=1,25– DO/({letter | digit})* = ({letter} | {digit})*,– DO/{id}* = {digit}*,

IF(I,J)=3 IF(condition) statement– IF/ \( .* \) {letter}

chapter 3. lexical analysis (1)

Documents

regular expressions1

concatenation r

set of strings

language lr

digits digit digit

digits optional

tokensa lexical analyzer

strings xy