Syntactical analysis 1: Lexing - itu.dk/people/mogel/splc2012/lectures/splc.2012.05.pdf
TRANSCRIPT
Rasmus Ejlers Møgelberg
Syntactical analysis 1: Lexing
Programming Language Concepts and Implementation, Fall 2012, Lecture 5
Mandatory Exercises
• Second mandatory assignment due Thursday this week
• Rehand-in deadline: one week after first deadline
• Re-hand-in is only for students who made a fair attempt the first time
Syntactical analysis
• Syntactical analysis is the first stage of a compiler
• Goal is to convert a program from a string representation to an abstract representation in the form of a syntax tree
• Checks programs for syntax errors
Program text → Lexer → list of tokens → Parser → syntax tree
Lexical tokens
• Are sequences of characters to be considered as units
• Tokens divided into types
Chapter 2: Lexical Analysis
lex-i-cal: of or relating to words or the vocabulary of a language as distinguished from its
grammar and construction
Webster's Dictionary
OVERVIEW
To translate a program from one language into another, a compiler must first pull it apart and
understand its structure and meaning, then put it together in a different way. The front end of
the compiler performs analysis; the back end does synthesis.
The analysis is usually broken up into
Lexical analysis: breaking the input into individual words or "tokens";
Syntax analysis: parsing the phrase structure of the program; and
Semantic analysis: calculating the program's meaning.
The lexical analyzer takes a stream of characters and produces a stream of names, keywords,
and punctuation marks; it discards white space and comments between the tokens. It would
unduly complicate the parser to have to account for possible white space and comments at
every possible point; this is the main reason for separating lexical analysis from parsing.
Lexical analysis is not very complicated, but we will attack it with high-powered formalisms
and tools, because similar formalisms will be useful in the study of parsing and similar tools
have many applications in areas other than compilation.
2.1 LEXICAL TOKENS
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a
programming language. A programming language classifies lexical tokens into a finite set of
token types. For example, some of the token types of a typical programming language are
Type Examples
ID foo n14 last
NUM 73 0 00 515 082
REAL 66.1 .5 10. 1e67 5.5e-10
IF if
COMMA ,
NOTEQ !=
LPAREN (
RPAREN )
Punctuation tokens such as IF, VOID, RETURN constructed from alphabetic characters are
called reserved words and, in most languages, cannot be used as identifiers.
Lexing, example
• Program text
• Should be lexed to
float match0(char *s) /* find a zero */
{ if (!strncmp(s, "0.0", 3))
    return 0.;
}
FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE IF LPAREN BANG ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF
Lecture plan
• Today: Lexing
• Next two weeks: Parsing
• Learning objectives
- Describe syntax of programming languages
• Regular expressions
• Context free grammars
- Use lexer- and parser-generators
- Understand how these generators work
Overview
• Regular expressions
• Lexing using an automaton (DFA)
• Constructing DFA from regular expressions
• An alternative algorithm for testing match of regular expressions and strings
• Context free grammars
Specifying tokens: Regular expressions
• A regular expression is a description of a set of strings
- ab* means {a, ab, abb, abbb, ...}
- (ab)* means {“”, ab, abab, ababab, ...}
- (a|b)* means {“”, a, b, aa, ab, ba, bb, aaa, aab, aba, ...}
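These example sets can be checked mechanically. A small sketch using Python's re module, whose syntax happens to coincide with the notation used here (fullmatch requires the entire string to be in L(r)):

```python
import re

# Whole-string matching mirrors membership in L(r).
assert re.fullmatch(r"ab*", "abbb")       # ab*  : {a, ab, abb, abbb, ...}
assert re.fullmatch(r"(ab)*", "")         # (ab)*: includes the empty string
assert re.fullmatch(r"(ab)*", "ababab")
assert re.fullmatch(r"(a|b)*", "aabba")
assert not re.fullmatch(r"ab*", "abab")   # abab is in (ab)*, not in ab*
```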
Reg. expr   Meaning            Language L(r)
a           symbol a           {"a"}
ε           empty string       {""}
r1 r2       r1 followed by r2  {s1 s2 | s1 ∈ L(r1), s2 ∈ L(r2)}
r*          zero or more r     {s1 ... sn | n ≥ 0, si ∈ L(r)}
r1 | r2     r1 or r2           L(r1) ∪ L(r2)
Regular expressions
• Abbreviations
Abbrev.     Meaning             Expansion
[acg5]      individual symbols  a|c|g|5
[0-9]       range               0|...|9
[a-zA-Z]    ranges              a|...|z|A|...|Z
r?          zero or one r       r|ε
r+          one or more r       r r*
Examples and exercises
• Non-negative integer constants
• Integer constants
• Java variable names
• Java floating point constants
• Internet domains
• Email addresses
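For the first three items, one possible set of answers as a sketch (the Java identifier pattern is simplified to ASCII; real Java also allows Unicode letters):

```python
import re

nonneg_int = r"[0-9]+"                    # non-negative integer constants
integer    = r"-?[0-9]+"                  # integer constants (optional sign)
java_ident = r"[a-zA-Z_$][a-zA-Z0-9_$]*"  # Java variable names, ASCII only

assert re.fullmatch(nonneg_int, "515")
assert re.fullmatch(integer, "-42")
assert re.fullmatch(java_ident, "n14")
assert not re.fullmatch(java_ident, "8ball")  # cannot start with a digit
```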
A small language (from book)
• Longest match
- if8 should match ID rather than token IF
• Rule priority
- If a string matches more than one token, the first listed token is preferred
if                                   IF
[a-z][a-z0-9]*                       ID
[0-9]+                               NUM
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  REAL
("--"[a-z]*"\n")|(" "|"\n"|"\t")+    no token (white space, comments)
. (any character)                    error
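The two disambiguation rules can be seen at work in a tiny lexer sketch. This is an illustration written for this transcript (using Python's re for the individual patterns), not the book's table-driven implementation:

```python
import re

# Token rules in priority order: on equally long matches, the rule
# listed first wins (rule priority).
RULES = [
    ("IF",   r"if"),
    ("ID",   r"[a-z][a-z0-9]*"),
    ("NUM",  r"[0-9]+"),
    ("REAL", r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+"),
    ("WS",   r"--[a-z]*\n|[ \n\t]+"),   # no token: white space and comments
]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_len, best_tok = 0, None
        for tok, pat in RULES:
            m = re.match(pat, text[pos:])
            # Strictly longer matches win (longest match); a later rule
            # with an equally long match never replaces an earlier one.
            if m and len(m.group()) > best_len:
                best_len, best_tok = len(m.group()), tok
        if best_len == 0:
            raise SyntaxError("error at position %d" % pos)
        if best_tok != "WS":
            tokens.append((best_tok, text[pos:pos + best_len]))
        pos += best_len
    return tokens

assert lex("if8") == [("ID", "if8")]   # longest match: ID, not IF then NUM
assert lex("if x")[0] == ("IF", "if")  # rule priority: IF listed before ID
```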
Finite automata
Rule priority: For a particular longest initial substring, the first regular expression that can
match determines its token-type. This means that the order of writing down the regular-
expression rules has significance.
Thus, if8 matches as an identifier by the longest-match rule, and if matches as a reserved
word by rule-priority.
2.3 FINITE AUTOMATA
Regular expressions are convenient for specifying lexical tokens, but we need a formalism
that can be implemented as a computer program. For this we can use finite automata (N.B. the
singular of automata is automaton). A finite automaton has a finite set of states; edges lead
from one state to another, and each edge is labeled with a symbol. One state is the start state,
and certain of the states are distinguished as final states.
Figure 2.3 shows some finite automata. We number the states just for convenience in
discussion. The start state is numbered 1 in each case. An edge labeled with several characters
is shorthand for many parallel edges; so in the ID machine there are really 26 edges each
leading from state 1 to 2, each labeled by a different letter.
Figure 2.3: Finite automata for lexical tokens. The states are indicated by circles; final states
are indicated by double circles. The start state has an arrow coming in from nowhere. An edge
labeled with several characters is shorthand for many parallel edges.
In a deterministic finite automaton (DFA), no two edges leaving from the same state are
labeled with the same symbol. A DFA accepts or rejects a string as follows. Starting in the
start state, for each character in the input string the automaton follows exactly one edge to get
to the next state. The edge must be labeled with the input character. After making n transitions
for an n-character string, if the automaton is in a final state, then it accepts the string. If it is
not in a final state, or if at some point there was no appropriately labeled edge to follow, it
rejects. The language recognized by an automaton is the set of strings that it accepts.
For example, it is clear that any string in the language recognized by automaton ID must
begin with a letter. Any single letter leads to state 2, which is final; so a single-letter string is
accepted. From state 2, any letter or digit leads back to state 2, so a letter followed by any
number of letters and digits is also accepted.
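The ID machine described above can be written directly as a step function. A sketch, with states numbered as in the text (1 is the start state, 2 is final, 0 is a dead state) and, as in the slides' ID rule, only lowercase letters handled:

```python
import string

LOWER = set(string.ascii_lowercase)
DIGIT = set(string.digits)

def step(state, ch):
    # Edges of the ID machine: 1 --letter--> 2, 2 --letter/digit--> 2.
    if state == 1 and ch in LOWER:
        return 2
    if state == 2 and (ch in LOWER or ch in DIGIT):
        return 2
    return 0  # no appropriately labeled edge to follow

def accepts(s):
    state = 1
    for ch in s:
        state = step(state, ch)
    return state == 2  # accept iff the run ends in the final state

assert accepts("n14")
assert not accepts("8x")   # must begin with a letter
assert not accepts("")     # the start state is not final
```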
Automata vs. regular expressions
• Theorem.
- For every regular expression r there is a deterministic finite automaton (DFA) that recognizes exactly L(r)
- For every DFA there is a regular expression describing the language it recognizes
• Automata are easy to implement
- Array describes transitions: In state n on input b go to state transitions[n][b]
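That transition array can be tried out on a small machine. A sketch for NUM = [0-9]+, where (an assumption made here for readability, not in the slides) input characters are collapsed into two classes before indexing the table:

```python
# transitions[state][cls]: cls 0 = digit, cls 1 = anything else.
# State 0 is a dead state; state 1 is the start; state 2 is accepting.
transitions = [
    [0, 0],  # state 0: dead, loops to itself
    [2, 0],  # state 1: a digit starts a number
    [2, 0],  # state 2: further digits stay here
]
FINAL = {2}

def accepts(s):
    state = 1
    for ch in s:
        state = transitions[state][0 if ch.isdigit() else 1]
    return state in FINAL

assert accepts("073")
assert not accepts("7a")
assert not accepts("")
```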
Putting the automata together
In fact, the machines shown in Figure 2.3 accept the same languages as the regular
expressions of Figure 2.2.
These are six separate automata; how can they be combined into a single machine that can
serve as a lexical analyzer? We will study formal ways of doing this in the next section, but
here we will just do it ad hoc: Figure 2.4 shows such a machine. Each final state must be
labeled with the token-type that it accepts. State 2 in this machine has aspects of state 2 of the
IF machine and state 2 of the ID machine; since the latter is final, then the combined state
must be final. State 3 is like state 3 of the IF machine and state 2 of the ID machine; because
these are both final we use rule priority to disambiguate - we label state 3 with IF because we
want this token to be recognized as a reserved word, not an identifier.
Figure 2.4: Combined finite automaton.
We can encode this machine as a transition matrix: a two-dimensional array (a vector of
vectors), subscripted by state number and input character. There will be a "dead" state (state
0) that loops to itself on all characters; we use this to encode the absence of an edge.
int edges[][] = { /* ...012...-...e f g h i j... */
/* state 0 */ {0,0,...0,0,0...0...0,0,0,0,0,0...},
/* state 1 */ {0,0,...7,7,7...9...4,4,4,4,2,4...},
/* state 2 */ {0,0,...4,4,4...0...4,3,4,4,4,4...},
/* state 3 */ {0,0,...4,4,4...0...4,4,4,4,4,4...},
/* state 4 */ {0,0,...4,4,4...0...4,4,4,4,4,4...},
/* state 5 */ {0,0,...6,6,6...0...0,0,0,0,0,0...},
/* state 6 */ {0,0,...6,6,6...0...0,0,0,0,0,0...},
/* state 7 */ {0,0,...7,7,7...0...0,0,0,0,0,0...},
/* state 8 */ {0,0,...8,8,8...0...0,0,0,0,0,0...},
et cetera
}
There must also be a "finality" array, mapping state numbers to actions - final state 2 maps to
action ID, and so on.
RECOGNIZING THE LONGEST MATCH
It is easy to see how to use this table to recognize whether to accept or reject a string, but the
job of a lexical analyzer is to find the longest match: the longest initial substring of the input
that matches any token's regular expression.
Automata for lexing
• Using the automata for lexing
- While possible, follow transitions corresponding to the next input character
- When stuck: if the current state is accepting, produce its token; otherwise roll back to the last accepting state visited and produce that token
- Start over with the remaining string
• Constructing automata:
- From regular expressions to NFA
- From NFA to DFA
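The roll-back loop above can be sketched concretely. Below, a hand-built combined machine for just IF and ID stands in for a full combined automaton (a simplification made here; white space is skipped directly instead of via a rule):

```python
import string

LETTER = set(string.ascii_lowercase)
DIGIT = set(string.digits)

def step(state, ch):
    # Combined machine for IF and ID: 1 start, 2 = ID ("i"), 3 = IF, 4 = ID.
    if state == 1:
        return 2 if ch == "i" else (4 if ch in LETTER else 0)
    if state == 2:
        return 3 if ch == "f" else (4 if ch in LETTER | DIGIT else 0)
    if state in (3, 4):
        return 4 if ch in LETTER | DIGIT else 0
    return 0  # dead

ACTION = {2: "ID", 3: "IF", 4: "ID"}  # token emitted by each final state

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":          # white space: no token
            pos += 1
            continue
        state, i = 1, pos
        last_tok, last_end = None, pos
        while i < len(text):
            state = step(state, text[i])
            if state == 0:            # stuck: no edge to follow
                break
            i += 1
            if state in ACTION:       # remember the last accepting position
                last_tok, last_end = ACTION[state], i
        if last_tok is None:
            raise SyntaxError("no token at position %d" % pos)
        tokens.append((last_tok, text[pos:last_end]))
        pos = last_end                # roll back to the last accepting state
    return tokens

assert lex("if if8") == [("IF", "if"), ("ID", "if8")]
```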
An alternative approach to lexing
A very different algorithm
• A direct algorithm for testing whether a regular expression matches a string
• Due to Bob Harper (paper: "Proof-directed debugging")
• Implemented in C# by Carsten Schürmann
• Matching is less efficient than with a DFA
Representing regular expr. as objects
• Class hierarchy: AbstractRegExp with subclasses Character, One, Zero, Times, Plus, Star
First (failed) attempt at matching
abstract public class RegExp {
    abstract public Boolean match(String s);
}

public class Character : RegExp {
    Char c;
    public Character(Char cc) { c = cc; }
    override public Boolean match(String s) {
        if (s.Length == 1 && s[0] == c) { return true; } else { return false; }
    }
}

public class One : RegExp {
    override public Boolean match(String s) {
        if (s.Length == 0) { return true; } else { return false; }
    }
}
First (failed) attempt at matching
public class Times : RegExp {
    RegExp r1; RegExp r2;
    public Times(RegExp s1, RegExp s2) { r1 = s1; r2 = s2; }
    override public Boolean match(String s) { ???? }
}

public class Star : RegExp {
    RegExp r;
    public Star(RegExp s) { r = s; }
    override public Boolean match(String s) { ???? }
}
Representing regular expressions in C#
public delegate Boolean K(String s);
abstract public class RegExp {
    abstract public Boolean match(String s, K k);
    public Boolean match(String s) {
        return match(s, delegate(String ss) { return (ss.Length == 0); });
    }
}

public class Character : RegExp {
    Char c;
    public Character(Char cc) { c = cc; }
    override public Boolean match(String s, K k) {
        if (s.Length == 0) { return false; }
        if (s[0] == c) { return (k(s.Substring(1))); } else { return false; }
    }
}

public class One : RegExp {
    override public Boolean match(String s, K k) { return (k(s)); }
}
Representing regular expressions in C#
public class Times : RegExp {
    RegExp r1; RegExp r2;
    public Times(RegExp s1, RegExp s2) { r1 = s1; r2 = s2; }
    override public Boolean match(String s, K k) {
        return (r1.match(s, delegate(String ss) { return r2.match(ss, k); }));
    }
}

public class Star : RegExp {
    RegExp r;
    public Star(RegExp s) { r = s; }
    override public Boolean match(String s, K k) {
        return (k(s) || r.match(s, delegate(String ss) { return match(ss, k); }));
    }
}
A generalised matching algorithm
• Idea: match(s,k) should return true if some initial segment of s matches the regular expression and the rest of s passes the test given by k
• Test k can always be described using a regular expression, but in actual code we use a delegate
• s matches r if match(s,k) is true for k the test that is only true on the empty string
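The same continuation-passing matcher can be sketched in Python (plain functions play the role of the K delegate; the class names here are shortened, and, exactly the issue Harper's paper analyses, this version loops forever if the body of Star can match the empty string):

```python
# Continuation-passing matcher, following the structure of the C# version.
class One:                        # matches only the empty string
    def match(self, s, k):
        return k(s)

class Char:                       # matches the single character c
    def __init__(self, c):
        self.c = c
    def match(self, s, k):
        return bool(s) and s[0] == self.c and k(s[1:])

class Times:                      # concatenation r1 r2
    def __init__(self, r1, r2):
        self.r1, self.r2 = r1, r2
    def match(self, s, k):
        return self.r1.match(s, lambda ss: self.r2.match(ss, k))

class Star:                       # r*: succeed now, or match r once and recurse
    def __init__(self, r):
        self.r = r
    def match(self, s, k):
        return k(s) or self.r.match(s, lambda ss: self.match(ss, k))

def matches(r, s):
    # Top-level continuation: succeed only on the empty remainder.
    return r.match(s, lambda ss: ss == "")

ab_star = Star(Times(Char("a"), Char("b")))
assert matches(ab_star, "abab")
assert not matches(ab_star, "abc")
```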
An example run
• Run of (ab)*.match(“abc”)
• Since “abc” does not match empty string we try to match initial segment of “abc” with ab and rest of string with (ab)*
• In the last line the matching algorithm returns false
Remaining input   Current reg. exp.   Remaining reg. exp. (k)
abc               (ab)*               ε
abc               ab                  (ab)*
abc               a                   b(ab)*
bc                b                   (ab)*
c                 ab                  (ab)*
c                 a                   b(ab)*
Context free grammars
Context free grammars
• An example and a derivation
• Think of it as regular expressions + recursion
• Terminology:
- 1 non-terminal Expr
- 5 terminals (tokens): +, *, (, ), num
- 4 productions (right hand sides)
- Terminals and nonterminals collectively are symbols
Expr = Expr + Expr | Expr * Expr | (Expr) | num
Expr => Expr + Expr => Expr + Expr * Expr => ... => 2 + 3*4
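A derivation like this can be carried out mechanically. A sketch (representing the grammar as Python data is an illustration for this transcript, not from the slides):

```python
# The Expr grammar as data: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "Expr": [["Expr", "+", "Expr"],   # production 0
             ["Expr", "*", "Expr"],   # production 1
             ["(", "Expr", ")"],      # production 2
             ["num"]],                # production 3
}

def step(sentential, rhs_index):
    """Replace the leftmost nonterminal by the chosen right-hand side."""
    for i, sym in enumerate(sentential):
        if sym in GRAMMAR:
            return sentential[:i] + GRAMMAR[sym][rhs_index] + sentential[i + 1:]
    return sentential  # no nonterminal left: a sentence of terminals

s = ["Expr"]
s = step(s, 0)   # Expr => Expr + Expr
s = step(s, 3)   #      => num + Expr
s = step(s, 1)   #      => num + Expr * Expr
assert s == ["num", "+", "Expr", "*", "Expr"]
```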
Another example
• Straight line programs (from book)
Grammar 3.1 is an example of a grammar for straight-line programs. The start symbol is S
(when the start symbol is not written explicitly it is conventional to assume that the left-hand
nonterminal in the first production is the start symbol). The terminal symbols are
id   print   num   ,   +   (   )   :=   ;
GRAMMAR 3.1: A syntax for straight-line programs.
1. S → S ; S
2. S → id := E
3. S → print ( L )
4. E → id
5. E → num
6. E → E + E
7. E → ( S , E )
8. L → E
9. L → L , E
and the nonterminals are S, E, and L. One sentence in the language of this grammar is
id := num; id := id + (id := num + num, id)
where the source text (before lexical analysis) might have been
a := 7;
b := c + (d := 5 + 6, d)
The token-types (terminal symbols) are id, num, :=, and so on; the names (a,b,c,d) and
numbers (7, 5, 6) are semantic values associated with some of the tokens.
DERIVATIONS
To show that this sentence is in the language of the grammar, we can perform a derivation:
Start with the start symbol, then repeatedly replace any nonterminal by one of its right-hand
sides, as shown in Derivation 3.2.
DERIVATION 3.2
S
→ S ; S
→ S ; id := E
→ id := E ; id := E
→ id := num ; id := E
→ id := num ; id := E + E
→ id := num ; id := E + ( S , E )
→ id := num ; id := id + ( S , E )
→ id := num ; id := id + ( id := E , E )
→ id := num ; id := id + ( id := E + E , E )
→ id := num ; id := id + ( id := E + E , id )
→ id := num ; id := id + ( id := num + E , id )
→ id := num ; id := id + ( id := num + num , id )
S = S ; S | id := E | print(L)
E = id | num | E + E | (S, E)
L = E | L, E
Official definition
• A context free grammar consists of
- A finite set of nonterminals
- A finite set of terminals
- A start symbol (one of the nonterminals)
- A finite set of productions
• A production consists of
- A nonterminal (called the left hand side)
- A string of symbols (terminals or nonterminals)
• This is called Backus-Naur Form (BNF)
Exercises
• Construct context free grammars that match
- All strings of a's and b's with exactly the same number of a's and b's
- Matching parentheses
- Matching parentheses and brackets
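For the matching-parentheses exercise, one possible grammar is S → ( S ) S | ε. A direct recursive membership check for that grammar (a sketch written for this transcript, deliberately not a general parser; it assumes the input contains only parentheses):

```python
def derives(s):
    # S -> ε
    if s == "":
        return True
    # S -> ( S ) S: find the ")" matching the leading "(",
    # then check the inside and the remainder independently.
    if s[0] != "(":
        return False
    depth = 0
    for i, ch in enumerate(s):
        depth += 1 if ch == "(" else -1
        if depth == 0:
            return derives(s[1:i]) and derives(s[i + 1:])
    return False  # the leading "(" is never closed

assert derives("(()())")
assert not derives("(()")
```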
Example: Mini Java
• From MCIJ (note mixed notation)
Summary
• Lexer converts program string into stream of tokens
• Parser builds syntax tree from token stream
• Intended learning outcomes:
- Describe tokens using regular expressions
- Describe programming language syntax using CFGs
- Eliminate ambiguities in grammars
- Use parser generators