Lecture 2: Lexical Analysis
Lexical Analysis
INPUT: sequence of characters
OUTPUT: sequence of tokens
A lexical analyzer is generally a subroutine of the parser:
• Simpler design
• Efficient
• Portable
[Diagram: Input -> Scanner -> Parser, with a Symbol Table. Next_char() delivers a character to the scanner; Next_token() delivers a token to the parser.]
Definitions
token – a set of strings defining an atomic element with a defined meaning
pattern – a rule describing a set of strings
lexeme – a sequence of characters that matches some pattern
Examples
Token         Pattern                  Sample Lexeme
while         while                    while
relation_op   = | != | < | >           <
integer       [0-9]+                   42
string        characters between " "   "hello"
Input string: size := r * 32 + c
<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>
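The pairs above can be produced by a small hand-written scanner. The following is a minimal sketch, not the lecture's implementation: the names Token, TokenKind, and next_token are hypothetical, and the patterns assumed are letter (letter|digit)* for id, digit+ for integer, := for assign, and * or + for arith_symbol.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, named after the token classes used on the slide. */
typedef enum { TOK_ID, TOK_ASSIGN, TOK_ARITH, TOK_INTEGER, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* copy of the matched characters */
} Token;

/* Scan one token starting at src[*pos] and advance *pos past it. */
Token next_token(const char *src, size_t *pos)
{
    Token t = { TOK_EOF, "" };
    size_t i, start, len;

    assert(src != NULL && pos != NULL);
    i = *pos;
    while (isspace((unsigned char)src[i])) i++;        /* skip white space */
    start = i;

    if (src[i] == '\0') {
        t.kind = TOK_EOF;                              /* end of input */
    } else if (isalpha((unsigned char)src[i])) {       /* id: letter (letter|digit)* */
        while (isalnum((unsigned char)src[i])) i++;
        t.kind = TOK_ID;
    } else if (isdigit((unsigned char)src[i])) {       /* integer: digit+ */
        while (isdigit((unsigned char)src[i])) i++;
        t.kind = TOK_INTEGER;
    } else if (src[i] == ':' && src[i + 1] == '=') {   /* assign: := */
        i += 2;
        t.kind = TOK_ASSIGN;
    } else if (src[i] == '*' || src[i] == '+') {       /* arith_symbol: * or + */
        i += 1;
        t.kind = TOK_ARITH;
    } else {                                           /* unrecognized character */
        i += 1;
        t.kind = TOK_ERROR;
    }

    len = i - start;
    if (len >= sizeof t.lexeme) len = sizeof t.lexeme - 1;   /* truncate long lexemes */
    memcpy(t.lexeme, src + start, len);
    t.lexeme[len] = '\0';
    *pos = i;
    return t;
}
```

A caller such as a parser would start with a position of 0 and call next_token repeatedly until it returns TOK_EOF; each call yields one <token, lexeme> pair.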
Implementing a Lexical Analyzer
Practical Issues:
• Input buffering
• Translating RE into executable form
• Must be able to capture a large number of tokens with a single machine
• Interface to parser
• Tools
Capturing Multiple Tokens
Capturing keyword “begin”:
[State diagram: edges b, e, g, i, n, WS]
Capturing variable names:
[State diagram: edge A, then a loop on AN, then WS]
WS – white space, A – alphabetic, AN – alphanumeric
What if both need to happen at the same time?
Capturing Multiple Tokens
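[State diagram: the keyword path b, e, g, i, n, WS combined with the variable-name machine; the additional edges are labeled A-b (alphabetic other than b), AN, and WS]
WS – white space, A – alphabetic, AN – alphanumeric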
Machine is much more complicated – just for these two tokens!
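To make the combined machine concrete, here is a minimal hand-coded sketch in C. It is an illustration under assumptions, not the slide's exact diagram: classify_word and WordKind are hypothetical names, the word is assumed to be already delimited by white space, and identifiers are assumed to be an alphabetic character followed by alphanumerics.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { WORD_NONE, WORD_KEYWORD_BEGIN, WORD_IDENTIFIER } WordKind;

/* Classify one white-space-delimited word with a single DFA.
 * States 0..5 record how many characters of "begin" have matched so far,
 * ST_IDENT means the word left the keyword path but is still an identifier,
 * ST_DEAD means the word is neither a keyword nor an identifier. */
WordKind classify_word(const char *word)
{
    static const char keyword[] = "begin";
    enum { ST_IDENT = 6, ST_DEAD = 7 };
    int state = 0;

    assert(word != NULL);
    for (size_t i = 0; word[i] != '\0' && state != ST_DEAD; i++) {
        unsigned char c = (unsigned char)word[i];
        if (state < 5 && c == (unsigned char)keyword[state]) {
            state++;                                   /* still on the keyword path */
        } else if (state == 0) {
            state = isalpha(c) ? ST_IDENT : ST_DEAD;   /* the A-b edge: alphabetic, not b */
        } else if (isalnum(c)) {
            state = ST_IDENT;                          /* an AN edge into or within the identifier loop */
        } else {
            state = ST_DEAD;
        }
    }

    if (state == 5) return WORD_KEYWORD_BEGIN;          /* exactly "begin" */
    if (state >= 1 && state <= 4) return WORD_IDENTIFIER; /* proper prefix of "begin", e.g. "beg" */
    if (state == ST_IDENT) return WORD_IDENTIFIER;
    return WORD_NONE;                                   /* empty word or illegal character */
}
```

Every state on the keyword path needs an extra exit into the identifier loop, which is why the machine grows quickly as more keywords are added.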
Lex – Lexical Analyzer Generator
Lex specification -> Flex/Lex -> lex.yy.c
lex.yy.c -> C/C++ compiler -> a.out
input -> a.out -> tokens
Lex Specification
Definitions – Code, RE:
%{
#include <stdio.h>
int charCount=0, wordCount=0, lineCount=0;
%}
word [^ \t\n]*
Rules – RE/Action pairs:
%%
{word}  { wordCount++; charCount += yyleng; }
[\n]    { charCount++; lineCount++; }
.       { charCount++; }
%%
User Routines:
int main(void) {
    yylex();
    printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
    return 0;
}
Lex definitions section
C/C++ code:
• Surrounded by %{ … %} delimiters
• Declare any variables used in actions
RE definitions:
• Define shorthand for patterns:
digit [0-9]
letter [a-z]
ident {letter}({letter}|{digit})*
• Use shorthand in RE section: {ident}
%{
int charCount=0, wordCount=0, lineCount=0;
%}
word [^ \t\n]*
Lex Regular Expressions
• Match explicit character sequences: integer, "+++", \<\>
• Character classes: [abcd], [a-zA-Z], [^0-9] – matches non-numeric
{word}  { wordCount++; charCount += yyleng; }
[\n]    { charCount++; lineCount++; }
.       { charCount++; }
Lex Regular Expressions (cont.)
• Alternation: twelve | 12
• Closure:
  * – zero or more
  + – one or more
  ? – zero or one
  {number}, {number,number} – an exact number of repetitions, or a range of repetitions
Lex Regular Expressions (cont.)
Other operators:
• . – matches any character except newline
• ^ – matches beginning of line
• $ – matches end of line
• / – trailing context
• () – grouping
• {} – RE definitions
Lex Matching Rules
Lex always attempts to match the longest possible string.
If two rules are matched (and match strings are same length), the first rule in the specification is used.
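The two matching rules can be simulated in a few lines of C. This is only a sketch of the selection policy, not how Lex is implemented (Lex compiles all rules into one automaton); the matcher functions, the rule order, and select_rule are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical rule reports how many leading characters of s it matches (0 = no match). */
typedef size_t (*Matcher)(const char *s);

static size_t match_begin(const char *s)     /* rule 0: the keyword begin */
{
    return strncmp(s, "begin", 5) == 0 ? 5 : 0;
}

static size_t match_ident(const char *s)     /* rule 1: letter (letter|digit)* */
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

static size_t match_integer(const char *s)   /* rule 2: digit+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Rules listed in specification order. */
static const Matcher rules[] = { match_begin, match_ident, match_integer };
enum { RULE_NONE = -1, RULE_BEGIN = 0, RULE_IDENT = 1, RULE_INTEGER = 2 };

/* Return the index of the winning rule and store the match length in *match_len. */
int select_rule(const char *s, size_t *match_len)
{
    int best = RULE_NONE;
    size_t best_len = 0;

    assert(s != NULL && match_len != NULL);
    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        size_t len = rules[i](s);
        if (len > best_len) {     /* strictly longer wins; on a tie the earlier rule is kept */
            best = i;
            best_len = len;
        }
    }
    *match_len = best_len;
    return best;
}
```

With this rule order, the input "begin x" selects the keyword rule because both rules match five characters and the keyword comes first, while "beginning" selects the identifier rule because its match is longer.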
Lex Operators
Precedence, highest to lowest: closure, concatenation, alternation
Special lex characters: - \ / * + > " { } . $ ( ) | % [ ] ^
Special lex characters inside [ ]: - \ [ ] ^
Examples
a.*z
(ab)+
[0-9]{1,5}
(ab|cd)?ef = abef, cdef, ef
-?[0-9]\.[0-9]
Lex Actions
Lex actions are C (C++) code to implement some required functionality
• Default action is to echo to output
• Can ignore input (empty action)
• ECHO – macro that prints out matched string
• yytext – matched string
• yyleng – length of matched string
User Subroutines
• C/C++ code
• Copied directly into the lexer code
• User can supply 'main' or use default
int main(void) {
    yylex();
    printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
    return 0;
}
Uses for Lex
Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in specification
Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in specification.
Extracting Tokens – standard use with compiler (example 3). Uses return to give the next token to the caller.
Regular expression
• A regular expression is a kind of pattern that can be applied to text (Strings, in Java)
• A regular expression either matches the text (or part of the text), or it fails to match.
• Regular expressions are an extremely useful tool for manipulating text
  – Regular expressions are heavily used in the automatic generation of Web pages
Pattern matching applications:
• Scan for virus signatures
• Process natural languages
• Search for information using Google
• Search and replace in word processors
• Filter text (spam, malware)
• Validate data-entry fields (dates, email, URL)
Basic Operation
• Notation to specify a set of strings
Regular expression : examples
• Notation is surprisingly expressive.
Regular Expression: a* | (a*ba*ba*ba*)* (multiple of 3 b's)
  Yes: ε, bbb, aaa, abbbaaa, bbbaababbaa
  No: b, bb, abbaaaa, baabbbaa
Regular Expression: a | a(a|b)*a (begins and ends with a)
  Yes: a, aba, aa, abbaabba
  No: ε, ab, ba
Regular Expression: (a|b)*abba(a|b)* (contains the substring abba)
  Yes: abba, bbabbabb, abbaabba
  No: ε, abb, bbaaba
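The first regular expression above, a* | (a*ba*ba*ba*)*, describes the strings over {a, b} whose number of b's is a multiple of 3. An equivalent recognizer needs only three states. The sketch below is one way to write it in C; the function name multiple_of_three_bs is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Three-state DFA: the state is the number of b's read so far, modulo 3.
 * State 0 is both the start state and the only accepting state. */
bool multiple_of_three_bs(const char *s)
{
    int state = 0;

    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == 'b') {
            state = (state + 1) % 3;   /* a b advances the counter */
        } else if (s[i] != 'a') {
            return false;              /* the alphabet is {a, b} only */
        }                              /* an a leaves the state unchanged */
    }
    return state == 0;
}
```

The empty string is accepted because the machine starts in the accepting state, which corresponds to ε matching the expression.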
Using Regular expression
• Built in to Java, Perl, PHP, Unix, .NET, ….
• Additional operations typically added for convenience.
• Ex. [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*
Operation: Regular Expression
Concatenation: hello
  Yes: hello, Othello, say hello
  No: Hello
Any single character: ..oo..oo.
  Yes: bloodroot, spoonfood
  No: cookbook, choochoo
Using Regular expression
Operation: Regular Expression
Replication: a(bc)*de
  Yes: ade, abcde, abcbcde
  No: abc
One or more: a(bc)+de
  Yes: abcde, abcbcde
  No: ade, abc
Once or not at all: a(bc)?de
  Yes: ade, abcde
  No: abc, abcbcde
Character classes: [a-m]*
  Yes: blackmail, imbecile
  No: above, below
Negation of character classes: [^aeiou]
  Yes: b, c
  No: a, e
Exactly N times: [^aeiou]{6}
  Yes: rhythm, syzygy
  No: rhythms, allowed
Between M and N times: [a-z]{4,6}
  Yes: spider, tiger
  No: jellyfish, cow
Whitespace characters: [a-z\s]*hello
  Yes: hello, say hello
  No: Othello, 2hello
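Rows like these can be checked mechanically. The sketch below uses the POSIX regcomp/regexec interface from <regex.h> with extended syntax, and assumes whole-string matching by wrapping the pattern in ^( )$; the helper name full_match is hypothetical. The \s escape is not part of POSIX extended syntax, so the whitespace row would need the class [[:space:]] there.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if the whole of text matches the POSIX extended regular expression pattern.
 * Returns false if the pattern is too long for the buffer or does not compile. */
bool full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    bool ok;
    int n;

    assert(pattern != NULL && text != NULL);
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);   /* force a whole-string match */
    if (n < 0 || (size_t)n >= sizeof anchored) return false;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return false;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For example, full_match("a(bc)*de", "abcbcde") is true and full_match("a(bc)*de", "abc") is false. The same helper can confirm that [a-e]+ and (a|b|c|d|e)(a|b|c|d|e)* agree on a given input.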