Syntactical analysis 1: Lexing - itu.dk/people/mogel/splc2012/lectures/splc.2012.05.pdf
TRANSCRIPT
Rasmus Ejlers Møgelberg
Syntactical analysis 1: Lexing
Programming Language Concepts and Implementation, Fall 2012, Lecture 5
Mandatory Exercises
• Second mandatory assignment due Thursday this week
• Rehand-in deadline: one week after first deadline
• Re-hand-in is only for students who made a fair attempt the first time
Syntactical analysis
• Syntactical analysis is the first stage of a compiler
• Goal is to convert a program from a string representation to an abstract representation in the form of a syntax tree
• Checks programs for syntax errors
Program text → Lexer → list of tokens → Parser → syntax tree
Lexical tokens
• Are sequences of characters to be considered as units
• Tokens divided into types
Chapter 2: Lexical Analysis
lex-i-cal: of or relating to words or the vocabulary of a language as distinguished from its
grammar and construction
Webster's Dictionary
OVERVIEW
To translate a program from one language into another, a compiler must first pull it apart and
understand its structure and meaning, then put it together in a different way. The front end of
the compiler performs analysis; the back end does synthesis.
The analysis is usually broken up into
Lexical analysis: breaking the input into individual words or "tokens";
Syntax analysis: parsing the phrase structure of the program; and
Semantic analysis: calculating the program's meaning.
The lexical analyzer takes a stream of characters and produces a stream of names, keywords,
and punctuation marks; it discards white space and comments between the tokens. It would
unduly complicate the parser to have to account for possible white space and comments at
every possible point; this is the main reason for separating lexical analysis from parsing.
Lexical analysis is not very complicated, but we will attack it with high-powered formalisms
and tools, because similar formalisms will be useful in the study of parsing and similar tools
have many applications in areas other than compilation.
2.1 LEXICAL TOKENS
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a
programming language. A programming language classifies lexical tokens into a finite set of
token types. For example, some of the token types of a typical programming language are
Type Examples
ID foo n14 last
NUM 73 0 00 515 082
REAL 66.1 .5 10. 1e67 5.5e-10
IF if
COMMA ,
NOTEQ !=
LPAREN (
RPAREN )
Punctuation tokens such as IF, VOID, RETURN constructed from alphabetic characters are
called reserved words and, in most languages, cannot be used as identifiers.
Lexing, example
• Program text
• Should be lexed to
float match0(char *s) /* find a zero */
{ if (!strncmp(s, "0.0", 3))
    return 0.;
}
FLOAT ID(match0) LPAREN CHAR STAR ID(s) RPAREN LBRACE IF LPAREN BANG ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF
Lecture plan
• Today: Lexing
• Next two weeks: Parsing
• Learning objectives
- Describe syntax of programming languages
• Regular expressions
• Context free grammars
- Use lexer- and parser-generators
- Understand how these generators work
Overview
• Regular expressions
• Lexing using an automaton (DFA)
• Constructing DFA from regular expressions
• An alternative algorithm for testing match of regular expressions and strings
• Context free grammars
Specifying tokens: Regular expressions
• A regular expression is a description of a set of strings
- ab* means {a, ab, abb, abbb, ...}
- (ab)* means {“”, ab, abab, ababab, ...}
- (a|b)* means {“”, a, b, aa, ab, ba, bb, aaa, aab, aba, ...}
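These example sets can be checked mechanically. A small sketch using Python's re module, whose syntax happens to coincide with the notation used here (fullmatch requires the entire string to be in L(r)):

```python
import re

# Whole-string matching mirrors membership in L(r).
assert re.fullmatch(r"ab*", "abbb")       # ab*  : {a, ab, abb, abbb, ...}
assert re.fullmatch(r"(ab)*", "")         # (ab)*: includes the empty string
assert re.fullmatch(r"(ab)*", "ababab")
assert re.fullmatch(r"(a|b)*", "aabba")
assert not re.fullmatch(r"ab*", "abab")   # abab is in (ab)*, not in ab*
```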
Reg. expr   Meaning            Language L(r)
a           symbol a           {"a"}
ε           empty string       {""}
r1 r2       r1 followed by r2  {s1 s2 | s1 ∈ L(r1), s2 ∈ L(r2)}
r*          zero or more r     {s1 ... sn | n ≥ 0, si ∈ L(r)}
r1 | r2     r1 or r2           L(r1) ∪ L(r2)
Regular expressions
• Abbreviations
Abbrev.     Meaning             Expansion
[acg5]      individual symbols  a|c|g|5
[0-9]       range               0|...|9
[a-zA-Z]    ranges              a|...|z|A|...|Z
r?          zero or one r       r|ε
r+          one or more r       r r*
Examples and exercises
• Non-negative integer constants
• Integer constants
• Java variable names
• Java floating point constants
• Internet domains
• Email addresses
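For the first three items, one possible set of answers as a sketch (the Java identifier pattern is simplified to ASCII; real Java also allows Unicode letters):

```python
import re

nonneg_int = r"[0-9]+"                    # non-negative integer constants
integer    = r"-?[0-9]+"                  # integer constants (optional sign)
java_ident = r"[a-zA-Z_$][a-zA-Z0-9_$]*"  # Java variable names, ASCII only

assert re.fullmatch(nonneg_int, "515")
assert re.fullmatch(integer, "-42")
assert re.fullmatch(java_ident, "n14")
assert not re.fullmatch(java_ident, "8ball")  # cannot start with a digit
```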
A small language (from book)
• Longest match
- if8 should match ID rather than token IF
• Rule priority
- If a string matches more than one token, the first listed token is preferred
if                                   IF
[a-z][a-z0-9]*                       ID
[0-9]+                               NUM
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)  REAL
("--"[a-z]*"\n")|(" "|"\n"|"\t")+    no token (white space, comments)
. (any character)                    error
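The two disambiguation rules can be seen at work in a tiny lexer sketch. This is an illustration written for this transcript (using Python's re for the individual patterns), not the book's table-driven implementation:

```python
import re

# Token rules in priority order: on equally long matches, the rule
# listed first wins (rule priority).
RULES = [
    ("IF",   r"if"),
    ("ID",   r"[a-z][a-z0-9]*"),
    ("NUM",  r"[0-9]+"),
    ("REAL", r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+"),
    ("WS",   r"--[a-z]*\n|[ \n\t]+"),   # no token: white space and comments
]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_len, best_tok = 0, None
        for tok, pat in RULES:
            m = re.match(pat, text[pos:])
            # Strictly longer matches win (longest match); a later rule
            # with an equally long match never replaces an earlier one.
            if m and len(m.group()) > best_len:
                best_len, best_tok = len(m.group()), tok
        if best_len == 0:
            raise SyntaxError("error at position %d" % pos)
        if best_tok != "WS":
            tokens.append((best_tok, text[pos:pos + best_len]))
        pos += best_len
    return tokens

assert lex("if8") == [("ID", "if8")]   # longest match: ID, not IF then NUM
assert lex("if x")[0] == ("IF", "if")  # rule priority: IF listed before ID
```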
Finite automata
Rule priority: For a particular longest initial substring, the first regular expression that can
match determines its token-type. This means that the order of writing down the regular-
expression rules has significance.
Thus, if8 matches as an identifier by the longest-match rule, and if matches as a reserved
word by rule-priority.
2.3 FINITE AUTOMATA
Regular expressions are convenient for specifying lexical tokens, but we need a formalism
that can be implemented as a computer program. For this we can use finite automata (N.B. the
singular of automata is automaton). A finite automaton has a finite set of states; edges lead
from one state to another, and each edge is labeled with a symbol. One state is the start state,
and certain of the states are distinguished as final states.
Figure 2.3 shows some finite automata. We number the states just for convenience in
discussion. The start state is numbered 1 in each case. An edge labeled with several characters
is shorthand for many parallel edges; so in the ID machine there are really 26 edges each
leading from state 1 to 2, each labeled by a different letter.
Figure 2.3: Finite automata for lexical tokens. The states are indicated by circles; final states
are indicated by double circles. The start state has an arrow coming in from nowhere. An edge
labeled with several characters is shorthand for many parallel edges.
In a deterministic finite automaton (DFA), no two edges leaving from the same state are
labeled with the same symbol. A DFA accepts or rejects a string as follows. Starting in the
start state, for each character in the input string the automaton follows exactly one edge to get
to the next state. The edge must be labeled with the input character. After making n transitions
for an n-character string, if the automaton is in a final state, then it accepts the string. If it is
not in a final state, or if at some point there was no appropriately labeled edge to follow, it
rejects. The language recognized by an automaton is the set of strings that it accepts.
For example, it is clear that any string in the language recognized by automaton ID must
begin with a letter. Any single letter leads to state 2, which is final; so a single-letter string is
accepted. From state 2, any letter or digit leads back to state 2, so a letter followed by any
number of letters and digits is also accepted.
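The ID machine described above can be written directly as a step function. A sketch, with states numbered as in the text (1 is the start state, 2 is final, 0 is a dead state) and, as in the slides' ID rule, only lowercase letters handled:

```python
import string

LOWER = set(string.ascii_lowercase)
DIGIT = set(string.digits)

def step(state, ch):
    # Edges of the ID machine: 1 --letter--> 2, 2 --letter/digit--> 2.
    if state == 1 and ch in LOWER:
        return 2
    if state == 2 and (ch in LOWER or ch in DIGIT):
        return 2
    return 0  # no appropriately labeled edge to follow

def accepts(s):
    state = 1
    for ch in s:
        state = step(state, ch)
    return state == 2  # accept iff the run ends in the final state

assert accepts("n14")
assert not accepts("8x")   # must begin with a letter
assert not accepts("")     # the start state is not final
```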
Automata vs. regular expressions
• Theorem.
- For every regular expression r there is a deterministic finite automaton (DFA) that recognizes exactly L(r)
- For every DFA there is a regular expression describing the language it recognizes
• Automata are easy to implement
- Array describes transitions: In state n on input b go to state transitions[n][b]
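That transition array can be tried out on a small machine. A sketch for NUM = [0-9]+, where (an assumption made here for readability, not in the slides) input characters are collapsed into two classes before indexing the table:

```python
# transitions[state][cls]: cls 0 = digit, cls 1 = anything else.
# State 0 is a dead state; state 1 is the start; state 2 is accepting.
transitions = [
    [0, 0],  # state 0: dead, loops to itself
    [2, 0],  # state 1: a digit starts a number
    [2, 0],  # state 2: further digits stay here
]
FINAL = {2}

def accepts(s):
    state = 1
    for ch in s:
        state = transitions[state][0 if ch.isdigit() else 1]
    return state in FINAL

assert accepts("073")
assert not accepts("7a")
assert not accepts("")
```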
Putting the automata together
In fact, the machines shown in Figure 2.3 accept the same languages as the regular
expressions of Figure 2.2.
These are six separate automata; how can they be combined into a single machine that can
serve as a lexical analyzer? We will study formal ways of doing this in the next section, but
here we will just do it ad hoc: Figure 2.4 shows such a machine. Each final state must be
labeled with the token-type that it accepts. State 2 in this machine has aspects of state 2 of the
IF machine and state 2 of the ID machine; since the latter is final, then the combined state
must be final. State 3 is like state 3 of the IF machine and state 2 of the ID machine; because
these are both final we use rule priority to disambiguate - we label state 3 with IF because we
want this token to be recognized as a reserved word, not an identifier.
Figure 2.4: Combined finite automaton.
We can encode this machine as a transition matrix: a two-dimensional array (a vector of
vectors), subscripted by state number and input character. There will be a "dead" state (state
0) that loops to itself on all characters; we use this to encode the absence of an edge.
int edges[][] = { /* ...012...-...e f g h i j... */
/* state 0 */ {0,0,...0,0,0...0...0,0,0,0,0,0...},
/* state 1 */ {0,0,...7,7,7...9...4,4,4,4,2,4...},
/* state 2 */ {0,0,...4,4,4...0...4,3,4,4,4,4...},
/* state 3 */ {0,0,...4,4,4...0...4,4,4,4,4,4...},
/* state 4 */ {0,0,...4,4,4...0...4,4,4,4,4,4...},
/* state 5 */ {0,0,...6,6,6...0...0,0,0,0,0,0...},
/* state 6 */ {0,0,...6,6,6...0...0,0,0,0,0,0...},
/* state 7 */ {0,0,...7,7,7...0...0,0,0,0,0,0...},
/* state 8 */ {0,0,...8,8,8...0...0,0,0,0,0,0...},
et cetera
}
There must also be a "finality" array, mapping state numbers to actions - final state 2 maps to
action ID, and so on.
RECOGNIZING THE LONGEST MATCH
It is easy to see how to use this table to recognize whether to accept or reject a string, but the
job of a lexical analyzer is to find the longest match: the longest initial substring of the input
that matches any token's regular expression.
Automata for lexing
• Using the automata for lexing
- While possible, follow transitions corresponding to the next input character
- When stuck: if the current state is accepting, produce its token; otherwise roll back to the last accepting state visited and produce that token
- Start over with the remaining string
• Constructing automata:
- From regular expressions to NFA
- From NFA to DFA
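The roll-back loop above can be sketched concretely. Below, a hand-built combined machine for just IF and ID stands in for a full combined automaton (a simplification made here; white space is skipped directly instead of via a rule):

```python
import string

LETTER = set(string.ascii_lowercase)
DIGIT = set(string.digits)

def step(state, ch):
    # Combined machine for IF and ID: 1 start, 2 = ID ("i"), 3 = IF, 4 = ID.
    if state == 1:
        return 2 if ch == "i" else (4 if ch in LETTER else 0)
    if state == 2:
        return 3 if ch == "f" else (4 if ch in LETTER | DIGIT else 0)
    if state in (3, 4):
        return 4 if ch in LETTER | DIGIT else 0
    return 0  # dead

ACTION = {2: "ID", 3: "IF", 4: "ID"}  # token emitted by each final state

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":          # white space: no token
            pos += 1
            continue
        state, i = 1, pos
        last_tok, last_end = None, pos
        while i < len(text):
            state = step(state, text[i])
            if state == 0:            # stuck: no edge to follow
                break
            i += 1
            if state in ACTION:       # remember the last accepting position
                last_tok, last_end = ACTION[state], i
        if last_tok is None:
            raise SyntaxError("no token at position %d" % pos)
        tokens.append((last_tok, text[pos:last_end]))
        pos = last_end                # roll back to the last accepting state
    return tokens

assert lex("if if8") == [("IF", "if"), ("ID", "if8")]
```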
An alternative approach to lexing
A very different algorithm
• A direct algorithm for testing whether a regular expression matches a string
• Due to Bob Harper (paper: "Proof-directed debugging")
• Implemented in C# by Carsten Schürmann
• Matching is less efficient than with a DFA
Representing regular expr. as objects
• Class hierarchy: AbstractRegExp with subclasses Character, One, Zero, Times, Plus, Star
First (failed) attempt at matching
abstract public class RegExp {
    abstract public Boolean match(String s);
}

public class Character : RegExp {
    Char c;
    public Character(Char cc) { c = cc; }
    override public Boolean match(String s) {
        if (s.Length == 1 && s[0] == c) { return true; } else { return false; }
    }
}

public class One : RegExp {
    override public Boolean match(String s) {
        if (s.Length == 0) { return true; } else { return false; }
    }
}
First (failed) attempt at matching
public class Times : RegExp {
    RegExp r1; RegExp r2;
    public Times(RegExp s1, RegExp s2) { r1 = s1; r2 = s2; }
    override public Boolean match(String s) { ???? }
}

public class Star : RegExp {
    RegExp r;
    public Star(RegExp s) { r = s; }
    override public Boolean match(String s) { ???? }
}
Representing regular expressions in C#
public delegate Boolean K(String s);
abstract public class RegExp {
    abstract public Boolean match(String s, K k);
    public Boolean match(String s) {
        return match(s, delegate(String ss) { return (ss.Length == 0); });
    }
}

public class Character : RegExp {
    Char c;
    public Character(Char cc) { c = cc; }
    override public Boolean match(String s, K k) {
        if (s.Length == 0) { return false; }
        if (s[0] == c) { return (k(s.Substring(1))); } else { return false; }
    }
}

public class One : RegExp {
    override public Boolean match(String s, K k) { return (k(s)); }
}
Representing regular expressions in C#
public class Times : RegExp {
    RegExp r1; RegExp r2;
    public Times(RegExp s1, RegExp s2) { r1 = s1; r2 = s2; }
    override public Boolean match(String s, K k) {
        return (r1.match(s, delegate(String ss) { return r2.match(ss, k); }));
    }
}

public class Star : RegExp {
    RegExp r;
    public Star(RegExp s) { r = s; }
    override public Boolean match(String s, K k) {
        return (k(s) || r.match(s, delegate(String ss) { return match(ss, k); }));
    }
}
A generalised matching algorithm
• Idea: match(s,k) should return true if some initial segment of s matches the regular expression and the rest of s passes the test given by k
• Test k can always be described using a regular expression, but in actual code we use a delegate
• s matches r if match(s,k) is true for k the test that is only true on the empty string
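The same continuation-passing matcher can be sketched in Python (plain functions play the role of the K delegate; the class names here are shortened, and, exactly the issue Harper's paper analyses, this version loops forever if the body of Star can match the empty string):

```python
# Continuation-passing matcher, following the structure of the C# version.
class One:                        # matches only the empty string
    def match(self, s, k):
        return k(s)

class Char:                       # matches the single character c
    def __init__(self, c):
        self.c = c
    def match(self, s, k):
        return bool(s) and s[0] == self.c and k(s[1:])

class Times:                      # concatenation r1 r2
    def __init__(self, r1, r2):
        self.r1, self.r2 = r1, r2
    def match(self, s, k):
        return self.r1.match(s, lambda ss: self.r2.match(ss, k))

class Star:                       # r*: succeed now, or match r once and recurse
    def __init__(self, r):
        self.r = r
    def match(self, s, k):
        return k(s) or self.r.match(s, lambda ss: self.match(ss, k))

def matches(r, s):
    # Top-level continuation: succeed only on the empty remainder.
    return r.match(s, lambda ss: ss == "")

ab_star = Star(Times(Char("a"), Char("b")))
assert matches(ab_star, "abab")
assert not matches(ab_star, "abc")
```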
An example run
• Run of (ab)*.match(“abc”)
• Since “abc” does not match empty string we try to match initial segment of “abc” with ab and rest of string with (ab)*
• In the last line the matching algorithm returns false
Remaining input   Current reg. exp.   Remaining reg. exp. (k)
abc               (ab)*               ε
abc               ab                  (ab)*
abc               a                   b(ab)*
bc                b                   (ab)*
c                 ab                  (ab)*
c                 a                   b(ab)*
Context free grammars
Context free grammars
• An example and a derivation
• Think of it as regular expressions + recursion
• Terminology:
- 1 non-terminal Expr
- 5 terminals (tokens): +, *, (, ), num
- 4 productions (right hand sides)
- Terminals and nonterminals collectively are symbols
Expr = Expr + Expr | Expr * Expr | (Expr) | num
Expr => Expr + Expr => Expr + Expr * Expr => ... => 2 + 3*4
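A derivation like this can be carried out mechanically. A sketch (representing the grammar as Python data is an illustration for this transcript, not from the slides):

```python
# The Expr grammar as data: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "Expr": [["Expr", "+", "Expr"],   # production 0
             ["Expr", "*", "Expr"],   # production 1
             ["(", "Expr", ")"],      # production 2
             ["num"]],                # production 3
}

def step(sentential, rhs_index):
    """Replace the leftmost nonterminal by the chosen right-hand side."""
    for i, sym in enumerate(sentential):
        if sym in GRAMMAR:
            return sentential[:i] + GRAMMAR[sym][rhs_index] + sentential[i + 1:]
    return sentential  # no nonterminal left: a sentence of terminals

s = ["Expr"]
s = step(s, 0)   # Expr => Expr + Expr
s = step(s, 3)   #      => num + Expr
s = step(s, 1)   #      => num + Expr * Expr
assert s == ["num", "+", "Expr", "*", "Expr"]
```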
Another example
• Straight line programs (from book)
Grammar 3.1 is an example of a grammar for straight-line programs. The start symbol is S
(when the start symbol is not written explicitly it is conventional to assume that the left-hand
nonterminal in the first production is the start symbol). The terminal symbols are
id   print   num   ,   +   (   )   :=   ;
GRAMMAR 3.1: A syntax for straight-line programs.
1. S → S ; S
2. S → id := E
3. S → print ( L )
4. E → id
5. E → num
6. E → E + E
7. E → ( S , E )
8. L → E
9. L → L , E
and the nonterminals are S, E, and L. One sentence in the language of this grammar is
id := num; id := id + (id := num + num, id)
where the source text (before lexical analysis) might have been
a := 7;
b := c + (d := 5 + 6, d)
The token-types (terminal symbols) are id, num, :=, and so on; the names (a,b,c,d) and
numbers (7, 5, 6) are semantic values associated with some of the tokens.
DERIVATIONS
To show that this sentence is in the language of the grammar, we can perform a derivation:
Start with the start symbol, then repeatedly replace any nonterminal by one of its right-hand
sides, as shown in Derivation 3.2.
DERIVATION 3.2
S
→ S ; S
→ S ; id := E
→ id := E ; id := E
→ id := num ; id := E
→ id := num ; id := E + E
→ id := num ; id := E + ( S , E )
→ id := num ; id := id + ( S , E )
→ id := num ; id := id + ( id := E , E )
→ id := num ; id := id + ( id := E + E , E )
→ id := num ; id := id + ( id := E + E , id )
→ id := num ; id := id + ( id := num + E , id )
→ id := num ; id := id + ( id := num + num , id )
S = S ; S | id := E | print(L)
E = id | num | E + E | (S, E)
L = E | L, E
Official definition
• A context free grammar consists of
- A finite set of nonterminals
- A finite set of terminals
- A start symbol (one of the nonterminals)
- A finite set of productions
• A production consists of
- A nonterminal (called the left hand side)
- A string of symbols (terminals or nonterminals)
• This is called Backus-Naur Form (BNF)
Exercises
• Construct context free grammars that match
- All strings of a's and b's with exactly the same number of a's and b's
- Matching parentheses
- Matching parentheses and brackets
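For the matching-parentheses exercise, one possible grammar is S → ( S ) S | ε. A direct recursive membership check for that grammar (a sketch written for this transcript, deliberately not a general parser; it assumes the input contains only parentheses):

```python
def derives(s):
    # S -> ε
    if s == "":
        return True
    # S -> ( S ) S: find the ")" matching the leading "(",
    # then check the inside and the remainder independently.
    if s[0] != "(":
        return False
    depth = 0
    for i, ch in enumerate(s):
        depth += 1 if ch == "(" else -1
        if depth == 0:
            return derives(s[1:i]) and derives(s[i + 1:])
    return False  # the leading "(" is never closed

assert derives("(()())")
assert not derives("(()")
```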
Example: Mini Java
• From MCIJ (note mixed notation)
Summary
• Lexer converts program string into stream of tokens
• Parser builds syntax tree from token stream
• Intended learning outcomes:
- Describe tokens using regular expressions
- Describe programming language syntax using CFGs
- Eliminate ambiguities in grammars
- Use parser generators