Chapter 3. Lexical Analysis (1)
2
Interaction of lexical analyzer with parser.
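[Figure: source program → lexical analyzer → parser. The parser sends "get next token" to the lexical analyzer, which returns a token; both components consult the symbol table.]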
3
Lexical Analysis
Issues (reasons to separate lexical analysis from parsing)
– Simpler design is preferred
– Compiler efficiency is improved
– Compiler portability is improved
Terms
– Tokens: terminal symbols in a grammar
– Patterns: rules describing the set of strings that form a token
– Lexemes: the strings matched by the pattern of a token
4
TOKEN      SAMPLE LEXEMES         INFORMAL DESCRIPTION OF PATTERN
const      const                  const
if         if                     if
relation   <, <=, =, <>, >, >=    < or <= or = or <> or >= or >
id         pi, count, D2          letter followed by letters and digits
num        3.1416, 0, 6.02E23     any numeric constant
literal    "core dumped"          any characters between " and " except "
Examples of tokens.
5
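Difficulties in implementing lexical analyzers
FORTRAN
– No delimiter is used (blanks are not significant)
– DO 5 I=1.25 (an assignment to the variable DO5I) versus DO 5 I=1,25 (the start of a DO loop); DO 5 I= 1 25 reads the same as DO5I=125
PL/I
– Keywords are not reserved
– IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;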
6
Attributes for tokens
A lexical analyzer collects information about tokens into their associated attributes
Example – E = M * C ** 2
• <id, pointer to symbol-table entry for E>
• <assign_op, >
• <id, pointer to symbol-table entry for M>
• <mult_op, >
• <id, pointer to symbol-table entry for C>
• <exp_op, >
• <num, integer value 2> (the value is generally stored in a constant table)
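The token/attribute pairs for E = M * C ** 2 can be reproduced with a small hand-written scanner. The following is only a sketch under stated assumptions: the names `TokenKind`, `Token`, `next_token`, `DONE`, and `BAD` are illustrative and do not come from the slides, and the lexeme text is stored directly in the token where a real compiler would store a pointer to a symbol-table entry.

```c
#include <ctype.h>
#include <stdlib.h>

/* Token kinds from the slide; DONE and BAD are added for this sketch. */
enum TokenKind { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUM, DONE, BAD };

struct Token {
    enum TokenKind kind;
    char lexeme[32];   /* for ID: stands in for the symbol-table pointer */
    long value;        /* for NUM: the integer value */
};

/* Scan one token starting at *p and advance *p past it.
   Identifiers longer than 31 characters are split (sketch limitation). */
struct Token next_token(const char **p)
{
    struct Token t = { DONE, "", 0 };
    const char *s = *p;

    while (isspace((unsigned char)*s))
        s++;                                    /* skip blanks */

    if (*s == '\0') {
        t.kind = DONE;
    } else if (isalpha((unsigned char)*s)) {
        size_t n = 0;
        while (isalnum((unsigned char)*s) && n < sizeof t.lexeme - 1)
            t.lexeme[n++] = *s++;
        t.lexeme[n] = '\0';
        t.kind = ID;
    } else if (isdigit((unsigned char)*s)) {
        char *end;
        t.value = strtol(s, &end, 10);
        s = end;
        t.kind = NUM;
    } else if (*s == '=') {
        s++;
        t.kind = ASSIGN_OP;
    } else if (*s == '*') {
        s++;
        if (*s == '*') { s++; t.kind = EXP_OP; }  /* longest match: ** */
        else t.kind = MULT_OP;
    } else {
        s++;
        t.kind = BAD;                           /* unknown character */
    }
    *p = s;
    return t;
}
```

Calling `next_token` repeatedly on the string "E = M * C ** 2" yields the seven tokens listed above and then `DONE`; only the `id` and `num` tokens carry an attribute.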
7
Lexical Errors
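Rules for error recovery
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a correct character
– Transposing two adjacent characters
Minimum-distance error correction: apply the fewest such transformations that turn the erroneous input into a valid one
Example
– Detectable: 2as3, 2#31, …
– Undetectable: fi(a == f(x)) … (fi is a valid identifier, so the lexical analyzer cannot tell that if was intended)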
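The "distance" in minimum-distance correction can be made concrete by counting the four transformations listed above. The sketch below computes the optimal string alignment distance, one common way to count deletions, insertions, replacements, and adjacent transpositions; the slides do not name a specific algorithm, so this choice, the function name `edit_distance`, and the `MAXLEN` limit are assumptions made for illustration.

```c
#include <string.h>

#define MAXLEN 63   /* sketch limit on string length */

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Number of single-character deletions, insertions, replacements and
   adjacent transpositions needed to turn s into t (optimal string
   alignment distance). Returns -1 if either string exceeds MAXLEN. */
int edit_distance(const char *s, const char *t)
{
    static int d[MAXLEN + 1][MAXLEN + 1];
    size_t n = strlen(s), m = strlen(t);
    size_t i, j;

    if (n > MAXLEN || m > MAXLEN)
        return -1;

    for (i = 0; i <= n; i++) d[i][0] = (int)i;
    for (j = 0; j <= m; j++) d[0][j] = (int)j;

    for (i = 1; i <= n; i++) {
        for (j = 1; j <= m; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            d[i][j] = min3(d[i - 1][j] + 1,          /* delete */
                           d[i][j - 1] + 1,          /* insert */
                           d[i - 1][j - 1] + cost);  /* replace or keep */
            if (i > 1 && j > 1 &&
                s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1] &&
                d[i - 2][j - 2] + 1 < d[i][j])
                d[i][j] = d[i - 2][j - 2] + 1;       /* transpose */
        }
    }
    return d[n][m];
}
```

For example, `edit_distance("fi", "if")` is 1 (a single transposition), which is why a later phase with more context could plausibly repair fi to if even though the lexical analyzer itself cannot detect the error.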
8
Input Buffering
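A single buffer can cause serious difficulty
– A word may straddle two buffer loads
– DECLARE (arg1, …, argn): array or function? It cannot be classified until the characters after the closing parenthesis are seen
Buffer pairs
– A good solution
– With sentinels, there is no need to test for both the end of the buffer and the end of the file on every character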
9
Sentinels at end of each buffer half.
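[Figure: a buffer pair holding E = M * in the first half and C * * 2 in the second, with an eof sentinel at the end of each half and a further eof marking the end of the input; the pointers lexeme_beginning and forward mark the current lexeme.]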
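The benefit of sentinels is that the common case of advancing forward costs a single comparison. The sketch below illustrates this under several assumptions that are not in the slides: the input comes from an in-memory string rather than a file, the NUL character stands in for eof (so the input must not contain it), the half size `N` is tiny so that reloads are easy to trace, and the names `Scanner`, `scanner_init`, and `next_char` are hypothetical. The lexeme_beginning pointer is not modeled.

```c
#include <stddef.h>
#include <string.h>

#define N 4                 /* tiny half size so reloads are easy to see */
#define SENTINEL '\0'       /* stands in for eof; input must not contain it */

struct Scanner {
    char buf[2 * N + 2];    /* half 0: buf[0..N-1], sentinel at buf[N];
                               half 1: buf[N+1..2N], sentinel at buf[2N+1] */
    const char *src;        /* remaining unread input (hypothetical source) */
    char *forward;
};

/* Fill one half from src and place a sentinel right after the data. */
static void load_half(struct Scanner *sc, char *half)
{
    size_t n = strlen(sc->src);
    if (n > N) n = N;
    memcpy(half, sc->src, n);
    half[n] = SENTINEL;     /* at the half's end if full, earlier at end of input */
    sc->src += n;
}

void scanner_init(struct Scanner *sc, const char *text)
{
    sc->src = text;
    load_half(sc, sc->buf);
    sc->forward = sc->buf;
}

/* Return the next character, or SENTINEL at the real end of input. */
char next_char(struct Scanner *sc)
{
    char c = *sc->forward;
    if (c != SENTINEL) {                        /* common case: one test */
        sc->forward++;
        return c;
    }
    if (sc->forward == sc->buf + N) {           /* end of first half */
        load_half(sc, sc->buf + N + 1);
        sc->forward = sc->buf + N + 1;
        return next_char(sc);
    }
    if (sc->forward == sc->buf + 2 * N + 1) {   /* end of second half */
        load_half(sc, sc->buf);
        sc->forward = sc->buf;
        return next_char(sc);
    }
    return SENTINEL;                            /* sentinel inside a half */
}
```

Only when the sentinel is seen does the code ask which of the three cases applies (end of first half, end of second half, or true end of input), so the two extra tests are paid once per half rather than once per character.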
10
Specification of Tokens
Strings and languages
– Alphabet (or character class): a finite set of symbols
– String (also sentence, word): a finite sequence of symbols drawn from the alphabet
– |s|: length of a string s
– ε: the empty string; ∅: the empty set (note that ∅ ≠ {ε})
– If x and y are strings
  • xy: concatenation; εx = xε = x
Operations on languages
11
Terms for parts of a string.
TERM: DEFINITION
prefix of s: A string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
suffix of s: A string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
substring of s: A string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.
proper prefix, suffix, or substring of s: Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.
subsequence of s: Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.
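The definitions can be checked mechanically with the banana examples. The following C predicates are a minimal sketch; the function names are chosen here for illustration and each returns 1 when x stands in the named relation to s.

```c
#include <string.h>

/* x is a prefix of s: s with zero or more trailing symbols removed. */
int is_prefix(const char *x, const char *s)
{
    return strncmp(x, s, strlen(x)) == 0;
}

/* x is a suffix of s: s with zero or more leading symbols removed. */
int is_suffix(const char *x, const char *s)
{
    size_t nx = strlen(x), ns = strlen(s);
    return nx <= ns && strcmp(s + (ns - nx), x) == 0;
}

/* x is a substring of s: s with a prefix and a suffix removed. */
int is_substring(const char *x, const char *s)
{
    return strstr(s, x) != NULL;
}

/* x is a subsequence of s: s with zero or more symbols removed,
   not necessarily contiguous ones. */
int is_subsequence(const char *x, const char *s)
{
    while (*x != '\0' && *s != '\0') {
        if (*x == *s)
            x++;            /* matched one symbol of x */
        s++;
    }
    return *x == '\0';
}
```

With these, ban is a prefix, nana a suffix, nan a substring, and baaa a subsequence of banana, and the empty string passes all four tests, matching the remark that ε is a prefix, suffix, and substring of every string.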
12
Definitions of operations on languages.
OPERATION: DEFINITION
union of L and M, written L ∪ M: L ∪ M = { s | s is in L or s is in M }
concatenation of L and M, written LM: LM = { st | s is in L and t is in M }
Kleene closure of L, written L*: L* = L⁰ ∪ L¹ ∪ L² ∪ …; L* denotes “zero or more concatenations of” L.
positive closure of L, written L+: L+ = L¹ ∪ L² ∪ …; L+ denotes “one or more concatenations of” L.
13
Regular Expressions
1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
14
Examples on operations in regular expressions
Σ = {a, b} is the alphabet
– a | b denotes {a, b}
– (a|b)(c|d) denotes {ac, ad, bc, bd} (over an alphabet that also contains c and d)
– a* denotes {ε, a, aa, aaa, …}
– (a|b)* = (a*|b*)*
– aa* = a+, ε|a+ = a*
– (a|b) = (b|a)
15
Algebraic properties of regular expressions.
AXIOM                 DESCRIPTION
r|s = s|r             | is commutative
r|(s|t) = (r|s)|t     | is associative
(rs)t = r(st)         concatenation is associative
r(s|t) = rs|rt
(s|t)r = sr|tr        concatenation distributes over |
εr = r
rε = r                ε is the identity element for concatenation
r* = (r|ε)*           relation between * and ε
r** = r*              * is idempotent
16
Regular Definitions
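Regular definition: a sequence of definitions of the form
– d1 → r1
– d2 → r2
– …
– dn → rn
where each di is a distinct name and each ri is a regular expression over Σ ∪ {d1, …, di-1}
• Example
  • letter → A | B | … | Z | a | b | … | z
  • digit → 0 | 1 | … | 9
  • id → letter ( letter | digit )*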
17
Unsigned numbers
In Pascal:
digit → 0 | 1 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
18
Notational Shorthands (1/2)
1. One or more instances. The unary postfix operator + means “one or more instances of.” If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a’s. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators.
2. Zero or one instance. The unary postfix operator ? means “zero or one instance of.” The notation r? is a shorthand for r|ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5; the rewritten definition follows the character-class shorthand on the next slide.
19
Notational Shorthands (2/2)
3. Character classes. The notation [abc] where a, b, and c are alphabet symbols denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression
[A-Za-z][A-Za-z0-9]*
Regular definition for num, rewritten with the + and ? shorthands:
digit → 0 | 1 | ··· | 9
digits → digit+
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent
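The regular definition for num translates almost line by line into a hand-written recognizer. The following C sketch checks whether a whole string matches num; the function names `digits` and `is_num` are chosen here for illustration, and a real scanner would instead look for the longest matching prefix of its input.

```c
#include <ctype.h>

/* digits -> digit+ : consume one or more digits at *p.
   Returns 1 and advances *p on success, 0 (leaving *p alone) otherwise. */
static int digits(const char **p)
{
    const char *s = *p;
    if (!isdigit((unsigned char)*s))
        return 0;
    while (isdigit((unsigned char)*s))
        s++;
    *p = s;
    return 1;
}

/* Return 1 if the whole string s matches
   num -> digits ( . digits )? ( E ( + | - )? digits )? */
int is_num(const char *s)
{
    const char *q;

    if (!digits(&s))
        return 0;
    if (*s == '.') {                 /* optional_fraction */
        q = s + 1;
        if (!digits(&q))
            return 0;
        s = q;
    }
    if (*s == 'E') {                 /* optional_exponent */
        q = s + 1;
        if (*q == '+' || *q == '-')
            q++;
        if (!digits(&q))
            return 0;
        s = q;
    }
    return *s == '\0';
}
```

The sample lexemes 3.1416, 0, and 6.02E23 are accepted, while strings such as "1." or ".5" are rejected because the definition requires digits on both sides of the point.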
20
Nonregular set
{ w c w⁻¹ | w is a string of a’s and b’s }, where w⁻¹ is w reversed
A context-free grammar is required to describe this set; no regular expression can
21
Regular-expression patterns for tokens.
REGULAR EXPRESSION   TOKEN   ATTRIBUTE-VALUE
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
22
Transition diagram
Finite-state automata: states and edges
Several examples are shown; the diagram on the next slide (Fig. 3.14) is drawn from the preceding example
23
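[Figure: start → state 9; a letter edge leads from state 9 to state 10; state 10 loops on letter or digit; an edge labelled other leads from state 10 to the accepting state 11, which performs return(gettoken(), install_id()). The * on state 11 means the input is retracted by one character.]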
Transition diagram for identifiers and keywords.
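A transition diagram of this kind is usually coded as a loop over a state variable. The sketch below simulates states 9, 10, and 11 in C; the function name `scan_id` and its return convention (the lexeme length) are assumptions made for this example, and the accepting action return(gettoken(), install_id()) is not modeled.

```c
#include <ctype.h>
#include <stddef.h>

/* Simulate states 9, 10 and 11 of the diagram on the input s.
   Returns the length of the identifier or keyword lexeme at the start
   of s, or 0 if s does not begin with a letter. */
size_t scan_id(const char *s)
{
    int state = 9;
    size_t forward = 0;

    for (;;) {
        char c = s[forward++];              /* read the next character */
        switch (state) {
        case 9:
            if (isalpha((unsigned char)c))
                state = 10;
            else
                return 0;                   /* no edge leaves state 9: fail */
            break;
        case 10:
            if (!isalnum((unsigned char)c))
                state = 11;                 /* edge labelled "other" */
            break;
        }
        if (state == 11)
            return forward - 1;             /* the *: retract one character */
    }
}
```

The character that triggers the "other" edge (including the terminating NUL) has already been read when state 11 is reached, which is why the function subtracts one: that character belongs to the next token.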
24
Implementation with Lex
Regular definitions → finite automata, transition diagrams
Output as a C program: lexical analysis, pattern matching, …
25
Creating a lexical analyzer with Lex.
[Figure: Lex source program lex.l → Lex compiler → lex.yy.c; lex.yy.c → C compiler → a.out; input stream → a.out → sequence of tokens.]
26
Lex program for the tokens of Fig. 3.10. (1/2)
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
27
Lex program for the tokens of Fig. 3.10. (2/2)
%%
{ws}      { /* no action and no return */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = install_id(); return(ID); }
{number}  { yylval = install_num(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }
"<="      { yylval = LE; return(RELOP); }
"="       { yylval = EQ; return(RELOP); }
"<>"      { yylval = NE; return(RELOP); }
">"       { yylval = GT; return(RELOP); }
">="      { yylval = GE; return(RELOP); }
%%
install_id() {
    /* procedure to install the lexeme, whose first character is pointed to by yytext and whose length is yyleng, into the symbol table and return a pointer thereto */
}
install_num() {
    /* similar procedure to install a lexeme that is a number */
}
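The slides leave the bodies of install_id() and install_num() as comments. As one possible reading, the sketch below shows a tiny symbol table in plain C; everything in it is an assumption made for illustration: the name `install_lexeme`, the fixed-size array table, the linear search, and the use of a table index in place of a pointer.

```c
#include <string.h>

#define MAX_SYMBOLS 64   /* sketch limits, not from the slides */
#define MAX_LEXEME  32

static char symtab[MAX_SYMBOLS][MAX_LEXEME];
static int  symcount = 0;

/* Install the lexeme text[0..len-1] and return its table index,
   reusing the existing entry if the lexeme is already present.
   Returns -1 if the lexeme is too long or the table is full. */
int install_lexeme(const char *text, size_t len)
{
    int i;

    if (len >= MAX_LEXEME)
        return -1;
    for (i = 0; i < symcount; i++)
        if (strlen(symtab[i]) == len && memcmp(symtab[i], text, len) == 0)
            return i;                       /* already installed */
    if (symcount == MAX_SYMBOLS)
        return -1;
    memcpy(symtab[symcount], text, len);
    symtab[symcount][len] = '\0';
    return symcount++;
}
```

Inside the Lex program, install_id() could then be a one-line wrapper that returns `install_lexeme(yytext, (size_t)yyleng)`, since Lex makes the current lexeme available through yytext and yyleng.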
28
Lookahead operator
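r1/r2 matches r1 only when it is followed by r2; the text matched by r2 is not part of the lexeme
DO 5 I = 1.25 versus DO 5 I = 1,25
– DO/({letter}|{digit})* = ({letter}|{digit})*,
– DO/{id}* = {digit}*,
IF(I,J)=3 versus IF(condition) statement
– IF / \( .* \) {letter}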
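The effect of the DO lookahead pattern can be imitated by hand. The sketch below decides whether a FORTRAN statement begins with the keyword DO by checking for letters and digits, an equals sign, more letters and digits, and then a comma, which mirrors DO/({letter}|{digit})* = ({letter}|{digit})*, from the slide. The function name `starts_with_do_keyword` and the 127-character statement limit are assumptions for this example.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if stmt starts with the keyword DO, judged by looking ahead
   for "<letters/digits> = <letters/digits> ," after the two letters. */
int starts_with_do_keyword(const char *stmt)
{
    char buf[128];
    size_t n = 0;
    const char *p;

    /* Blanks are not significant in FORTRAN: drop them first. */
    for (p = stmt; *p != '\0' && n < sizeof buf - 1; p++)
        if (*p != ' ')
            buf[n++] = *p;
    buf[n] = '\0';

    if (strncmp(buf, "DO", 2) != 0)
        return 0;
    p = buf + 2;
    while (isalnum((unsigned char)*p))      /* ({letter}|{digit})* */
        p++;
    if (*p != '=')
        return 0;
    p++;
    while (isalnum((unsigned char)*p))      /* ({letter}|{digit})* */
        p++;
    return *p == ',';                       /* the comma marks a DO loop */
}
```

For DO 5 I = 1,25 the comma is found and DO is a keyword; for DO 5 I = 1.25 the scan stops at the period, so the statement is an assignment to DO5I.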