
Syntax and Type Analysis
Lecture Compilers SS 2009

Dr.-Ing. Ina Schaefer

Software Technology Group
TU Kaiserslautern

Ina Schaefer Syntax and Type Analysis 1

Educational Objectives

• Tasks of Different Syntax Analysis Phases
• Interaction of Syntax Analysis Phases
• Specification Techniques for Syntax Analysis
• Generation Techniques
• Usage of Tools
• Lexical Analysis
• Context-free Analysis (Parsing)
• Context-sensitive Analysis


Introduction to Syntax and Type Analysis

Syntax Analysis

Tasks of Syntax Analysis
• Check whether the input is syntactically correct
• Depending on the result:
  - Error message
  - Generation of an appropriate data structure for subsequent processing

Syntax Analysis Phases

Lexical Analysis: String → Token Stream (or Symbol String)

Context-free Analysis: Token Stream → Tree

Context-sensitive Analysis: Tree → Tree with Cross References

[Figure: Source Code as String → Scanner → Token Stream → Parser → Syntax Tree → Name and Type Analysis → Attributed Syntax Tree]

Reasons for Separation of Phases

• Lexical and context-free analysis
  - Reduced load for context-free analysis, e.g. whitespace is not required for context-free analysis
• Context-free and context-sensitive analysis
  - Context-sensitive analysis uses the tree structure instead of the token stream
  - Advantages for the construction of the target data structure
• In both cases
  - Increased efficiency
  - Natural process (cf. natural language)
  - More appropriate tool support

Lexical Analysis

Tasks

• Break the input character string into a symbol stream (or token stream) according to the language definition
• Classify symbols into symbol classes
• Represent symbols
  - Hashing of identifiers
  - Conversion of constants
• Eliminate
  - whitespace (spaces, comments, ...)
  - external constructs (compiler directives, ...)

Lexical Analysis (2)

Terminology

• Symbol: a word over an alphabet of characters (often with additional information, e.g. token class, encoding, position, ...)
• Symbol class: a set of symbols (identifiers, constants, ...); symbol classes correspond to the terminal symbols of a context-free grammar

Lexical Analysis: Example

Input line 23 (␣ marks a blank):

    ␣␣if(␣A␣<=␣3.14)␣␣␣B␣=␣B---

Result of lexical analysis:

    Token class   String   Encoding   Line:Column
    IF            "if"                23:3
    OPAR          "("                 23:5
    ID            "A"      72         23:7
    RELOP         "<="     4          23:9
    FLOATCONST    "3.14"   3.14       23:12
    CPAR          ")"                 23:16
    ID            "B"      84         23:20
    ...

Token information in the Encoding column:
• for identifiers: the hash code of the identifier (72, 84)
• for operators: the encoding of the operator (4 for <=)
• for constants: the value of the constant (3.14)

(Figure: © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007)
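The token table above can be reproduced with a small regular-expression scanner. The following is only a sketch: the token class names beyond those in the table (`ASSIGN`, `DEC`, `MINUS`, `WS`) and the helper `tokenize` are hypothetical, and the input layout is the one implied by the column numbers of the table. Python's `re` alternation picks the first matching alternative rather than the longest one, so the rules are ordered by hand to get the longest match.

```python
import re

# Hypothetical rule list; order matters because "|" in Python's re module
# takes the first alternative that matches, not the longest one.
TOKEN_SPEC = [
    ("IF", r"if\b"),
    ("FLOATCONST", r"[0-9]+\.[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("RELOP", r"<=|>=|<|>"),
    ("DEC", r"--"),
    ("MINUS", r"-"),
    ("ASSIGN", r"="),
    ("OPAR", r"\("),
    ("CPAR", r"\)"),
    ("WS", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(line, line_no):
    """Return (token class, string, 'line:column') triples; columns start at 1."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = MASTER.match(line, pos)
        if match is None:
            raise ValueError(f"no symbol at {line_no}:{pos + 1}")
        if match.lastgroup != "WS":  # whitespace is eliminated, not reported
            tokens.append((match.lastgroup, match.group(), f"{line_no}:{pos + 1}"))
        pos = match.end()
    return tokens


tokens = tokenize("  if( A <= 3.14)   B = B---", 23)
for token in tokens:
    print(token)
```

The first seven triples carry the same classes and positions as the rows of the table; the hash codes and operator encodings are left out because the slide does not say how they are computed.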


Lexical Analysis Specification of Scanners

Specification

The specification of the lexical analysis is a part of the programming language specification.

The two parts of a lexical analysis specification:
• Scanning algorithm (often only implicit)
• Specification of symbols and symbol classes

Examples: Scanning

1. Statement in C

    B = B --- A;

Problem: Separation (-- and - are both symbols)
Solution: The longest symbol is chosen, i.e.

    B = B -- - A;

2. Java fragment

    class public { public m() {...} }

Problem: Ambiguity (keyword, identifier)
Solution: Precedence rules
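A precedence rule for the keyword/identifier ambiguity can be sketched as an ordered rule list in which the first matching rule wins. The rule names `KEYWORD` and `IDENT`, the two-keyword list, and the function `classify` are assumptions made for this illustration only.

```python
import re

# Hypothetical rule list: earlier rules take precedence over later ones.
RULES = [
    ("KEYWORD", r"class|public"),
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
]


def classify(lexeme):
    """Return the class of the first rule that matches the whole lexeme."""
    for token_class, pattern in RULES:
        if re.fullmatch(pattern, lexeme):
            return token_class
    return "UNDEF"


classes = [classify(word) for word in ["class", "public", "m", "publicity"]]
print(classes)
```

Both rules match `public`, and the precedence rule resolves the conflict in favour of the keyword. `publicity` is not affected, because only the identifier rule matches the whole word.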

Standard Scan-Algorithm (Concept)

Scanning is often implemented as a co-routine:
• The state is the remainder of the input
• The co-routine returns the next symbol
• In error cases, the co-routine returns the UNDEF symbol and updates the input

Standard Scan-Algorithm (Pseudo Code)

    String left_input := input;

    Symbol nextSymbol() {
        Symbol curSymbol := longestSymbolPrefix(left_input);
        left_input := cut(curSymbol, left_input);
        return curSymbol;
    }

where cut is defined as
• if curSymbol $\neq$ UNDEF, curSymbol is removed from left_input
• else left_input remains unchanged.

Standard Scan-Algorithm (2)

    longestSymbolPrefix(String egr) {
        // precondition: length(egr) > 0
        int curLength := 0;
        String curPrefix := prefix(curLength, egr);
        Symbol longestSymbol := UNDEF;

        while (curLength <= length(egr) && isSymbolPrefix(curPrefix)) {
            if (isSymbol(curPrefix)) {
                longestSymbol := curPrefix;
            }
            curLength++;
            if (curLength <= length(egr)) {
                // never take a prefix longer than the input
                curPrefix := prefix(curLength, egr);
            }
        }
        return longestSymbol;
    }

Standard Scan-Algorithm (3)

Only two predicates have to be defined:
• isSymbolPrefix: String → bool
• isSymbol: String → bool

Remarks:
• The standard scan algorithm is used in many modern languages, but not e.g. in FORTRAN, because blanks are not significant there (they do not separate symbols) except in literal symbols, e.g.
  - DO 7 I = 1.25 → DO 7 I is an identifier.
  - DO 7 I = 1,25 → DO is a keyword.
• Error cases are not handled.
• A complete realisation of longestSymbolPrefix is discussed later.
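The standard scan algorithm can be tried out with the two predicates defined over a small finite symbol set. This is a minimal sketch: the symbol set `SYMBOLS`, the driver `scan`, and the decision to skip blanks in the driver are assumptions for the example, not part of the algorithm above.

```python
SYMBOLS = {"-", "--", "=", ";", "A", "B"}  # hypothetical symbol set
UNDEF = None


def is_symbol(word):
    return word in SYMBOLS


def is_symbol_prefix(word):
    # The empty word is a prefix of every symbol.
    return any(symbol.startswith(word) for symbol in SYMBOLS)


def longest_symbol_prefix(rest):
    """Return the longest prefix of rest that is a symbol, or UNDEF."""
    longest = UNDEF
    cur_length = 0
    while cur_length <= len(rest) and is_symbol_prefix(rest[:cur_length]):
        if is_symbol(rest[:cur_length]):
            longest = rest[:cur_length]
        cur_length += 1
    return longest


def scan(text):
    """Repeatedly cut the longest symbol off the remaining input."""
    rest, symbols = text, []
    while rest:
        if rest[0] == " ":  # assumption: blanks only separate symbols
            rest = rest[1:]
            continue
        symbol = longest_symbol_prefix(rest)
        if symbol is UNDEF:
            raise ValueError(f"no symbol at: {rest!r}")
        symbols.append(symbol)
        rest = rest[len(symbol):]
    return symbols


result = scan("B = B --- A;")
print(result)
```

On the C statement from the scanning example, `---` is split into `--` followed by `-`, which is the longest-symbol solution.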

Specification of Symbols

• Symbols are specified by regular expressions.
• Symbol classes are described informally.

Regular Expressions

Let $\Sigma$ be an alphabet, i.e. a non-empty set of characters. $\Sigma^*$ is the set of all words over $\Sigma$; $\varepsilon$ is the empty word.

Definition (Regular Expressions, Regular Languages)

• $\varepsilon$ is a regular expression (r.e.) and denotes the language $L = \{\varepsilon\}$.
• Each $a \in \Sigma$ is an r.e. and denotes the language $L = \{a\}$.
• Let $r$ and $s$ be two r.e. defining the languages $R$ and $S$, respectively. Then the following are r.e. and define the corresponding language $L$:
  - $(r|s)$ with $L = R \cup S$ (union)
  - $rs$ with $L = \{vw \mid v \in R, w \in S\}$ (concatenation)
  - $r^*$ with $L = \{v_1 \ldots v_n \mid n \geq 0,\ v_i \in R \text{ for } 1 \leq i \leq n\}$ (Kleene star)

A language $L \subseteq \Sigma^*$ is called regular iff there exists an r.e. $r$ defining $L$.

Regular Expressions (2)

Remarks:
• $L = \emptyset$ is not regular according to the definition, but is often considered regular.
• Other operators, e.g. +, ?, ., [], can be defined using the basic operators, e.g.
  - $r^+ \equiv (rr^*) \equiv r^* \setminus \{\varepsilon\}$ (the second equivalence assumes $\varepsilon \notin R$)
  - [aBd] $\equiv$ a|B|d
  - [a-g] $\equiv$ a|b|c|d|e|f|g

Caution: Regular expressions only define valid symbols and do not specify the program or translation units of a programming language.
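The derived operators can be checked against their definitions on a handful of sample words. The sketch below uses Python's `re` module as a stand-in for the regular expressions of the slides; the word list is arbitrary, and agreement on a finite sample is of course not a proof.

```python
import re

WORDS = ["", "a", "aa", "ab", "b", "g", "h", "B", "d"]


def language(pattern):
    """Sample words that belong to the language of the pattern."""
    return [word for word in WORDS if re.fullmatch(pattern, word)]


plus = language(r"a+")
expanded = language(r"aa*")                 # r+ written as r r*
star_without_eps = [w for w in language(r"a*") if w != ""]

char_class = language(r"[aBd]")
alternatives = language(r"a|B|d")

char_range = language(r"[a-g]")
spelled_out = language(r"a|b|c|d|e|f|g")

print(plus, char_class, char_range)
```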


Lexical Analysis Implementation of Scanners

Implementation of Scanners

[Figure: Sequence of regular expressions and actions (input language of the scanner generator) → Scanner Generator → Scanner program (mostly in a programming language)]

Scanner Generator: JFlex

• Typical use of JFlex:

    java -jar JFlex.jar Example.jflex
    javac Yylex.java

• Actions are written in Java
• Examples:

1. Regular expression in JFlex

    [a-zA-Z_0-9] [a-zA-Z_0-9] *

2. JFlex input with abbreviations

    ZI = [0-9]
    BU = [a-zA-Z_]
    BUZI = [a-zA-Z_0-9]
    %%
    {BU}{BUZI}* { anAction(); }

A complete JFlex Example

    enum Token { DO, DOUBLE, IDENT, FLOATCONST, STRING; }
    %%

    %type Token    // declare token type

    ZI = [0-9]
    BU = [a-zA-Z_]
    BUZI = [a-zA-Z_0-9]
    ZE = [a-zA-Z_0-9!?\]\[\.\t...]

    %%
    [ \t]+              { /* skip whitespace */ }
    "do"                { return Token.DO; }
    "double"            { return Token.DOUBLE; }
    {BU}{BUZI}*         { return Token.IDENT; }
    {ZI}+\.{ZI}+        { return Token.FLOATCONST; }
    \"({ZE}|\\\")*\"    { return Token.STRING; }

Scanner Generators

• Scanner generation uses the equivalence between
  - Regular expressions
  - Non-deterministic finite automata (NFA)
  - Deterministic finite automata (DFA)
• The construction method is based on two steps:
  - Regular expressions → NFA
  - NFA → DFA

Definition of NFA

Definition (Non-deterministic Finite Automaton)
A non-deterministic finite automaton is defined as a 5-tuple

$M = (\Sigma, Q, \Delta, q_0, F)$

where
• $\Sigma$ is the input alphabet
• $Q$ is the set of states
• $q_0 \in Q$ is the initial state
• $F \subseteq Q$ is the set of final states
• $\Delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q$ is the transition relation.
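The 5-tuple can be written down directly as a data structure. The following sketch is one possible encoding, not a prescribed one: the class `NFA`, its field names, the method `is_well_formed`, and the choice of the empty string for $\varepsilon$ are all assumptions of this example.

```python
from dataclasses import dataclass

EPS = ""  # stands for the empty word; assumed not to be an input character


@dataclass(frozen=True)
class NFA:
    alphabet: frozenset   # input alphabet
    states: frozenset     # set of states Q
    delta: frozenset      # transition relation: triples (q, a, p)
    start: str            # initial state q0
    finals: frozenset     # set of final states F

    def is_well_formed(self):
        """Check the side conditions of the definition."""
        labels = self.alphabet | {EPS}
        return (
            len(self.alphabet) > 0
            and self.start in self.states
            and self.finals <= self.states
            and all(
                q in self.states and a in labels and p in self.states
                for (q, a, p) in self.delta
            )
        )


# An automaton for the words "a", "ab", "abb", ... with one epsilon-transition.
m = NFA(
    alphabet=frozenset("ab"),
    states=frozenset({"q0", "q1", "q2"}),
    delta=frozenset({("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", EPS, "q2")}),
    start="q0",
    finals=frozenset({"q2"}),
)
# The label "c" is not in the alphabet, so this tuple violates the definition.
bad = NFA(m.alphabet, m.states, frozenset({("q0", "c", "q1")}), m.start, m.finals)
print(m.is_well_formed(), bad.is_well_formed())
```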

Regular Expressions → NFA

Principle: For each regular sub-expression, construct an NFA with exactly one start state and one end state that accepts the same language.

[Figure: translation scheme with one NFA fragment for each of the cases $\varepsilon$, a, (r|s), (rs) and r*; the fragments for r (states s1, f1) and s (states s2, f2) are connected to each other and to a new start state s0 and end state f0 by $\varepsilon$-transitions. © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007]
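The translation scheme can be sketched as one small function per case, each returning a fragment with exactly one start and one end state. The code below assumes the usual Thompson-style wiring of the $\varepsilon$-transitions, which may differ in detail from the figure; the function names and the representation of a fragment as a triple `(start, end, transitions)` are hypothetical.

```python
from itertools import count

EPS = None        # label used for epsilon-transitions (assumption)
_fresh = count()  # source of fresh state numbers


def _new_pair():
    return next(_fresh), next(_fresh)


def eps():
    s0, f0 = _new_pair()
    return s0, f0, {(s0, EPS, f0)}


def sym(a):
    s0, f0 = _new_pair()
    return s0, f0, {(s0, a, f0)}


def alt(r, s):
    """(r|s): new start and end state around both fragments."""
    s1, f1, d1 = r
    s2, f2, d2 = s
    s0, f0 = _new_pair()
    return s0, f0, d1 | d2 | {(s0, EPS, s1), (s0, EPS, s2), (f1, EPS, f0), (f2, EPS, f0)}


def cat(r, s):
    """(rs): the end of r is linked to the start of s."""
    s1, f1, d1 = r
    s2, f2, d2 = s
    return s1, f2, d1 | d2 | {(f1, EPS, s2)}


def star(r):
    """r*: skip the fragment or repeat it any number of times."""
    s1, f1, d1 = r
    s0, f0 = _new_pair()
    return s0, f0, d1 | {(s0, EPS, s1), (f1, EPS, f0), (f1, EPS, s1), (s0, EPS, f0)}


# Fragment for the regular expression (a|b)*c
start, end, delta = cat(star(alt(sym("a"), sym("b"))), sym("c"))
states = {q for (q, _, _) in delta} | {p for (_, _, p) in delta}
print(len(states), len(delta))
```

Every operator adds at most two states, so the NFA grows linearly with the size of the regular expression.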


Example: Construction of NFA

[Figure: NFA obtained by translating the rules of the complete JFlex example, with states s0 to s26 and transitions labelled d, o, u, b, l, e, space/tab, BU, BUZI, ZI, ".", ZE, "\", the double quote and $\varepsilon$. © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007]

$\varepsilon$-closure

The function closure computes the $\varepsilon$-closure of a set of states $s_1, \ldots, s_n$.

Definition ($\varepsilon$-closure)
For an NFA $M = (\Sigma, Q, \Delta, q_0, F)$ and a state $q \in Q$, the $\varepsilon$-closure of $q$ is defined by

$\varepsilon\text{-closure}(q) = \{p \in Q \mid p \text{ reachable from } q \text{ via } \varepsilon\text{-transitions}\}$

For $S \subseteq Q$, the $\varepsilon$-closure of $S$ is defined by

$\varepsilon\text{-closure}(S) = \bigcup_{s \in S} \varepsilon\text{-closure}(s)$
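The $\varepsilon$-closure is a plain reachability computation and can be sketched with a worklist. The transition set `DELTA` below is a made-up example, `None` is assumed as the label for $\varepsilon$, and each state is taken to be reachable from itself via zero transitions.

```python
EPS = None  # label of epsilon-transitions (assumption)

# Hypothetical transition relation as triples (source, label, target).
DELTA = {(0, EPS, 1), (1, EPS, 2), (1, "a", 3), (3, EPS, 0), (2, "b", 4)}


def eps_closure(states, delta):
    """All states reachable from the given states via epsilon-transitions only."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        q = worklist.pop()
        for (source, label, target) in delta:
            if source == q and label is EPS and target not in closure:
                closure.add(target)
                worklist.append(target)
    return closure


print(eps_closure({0}, DELTA))
print(eps_closure({3}, DELTA))
```

The closure of a set is the union of the closures of its members, which the worklist obtains simply by starting from all members at once.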

Longest Symbol Prefix with NFA

    longestSymbolPrefix(char[] egr) {
        // precondition: length(egr) > 0
        // assumption: egr is indexed from 1
        StateSet curState := closure({s0});
        int curLength := 0;
        int symbolLength := undef;

        while (curLength <= length(egr) && !isEmptySet(curState)) {
            if (contains(curState, finalState)) {
                symbolLength := curLength;
            }
            curLength++;
            if (curLength <= length(egr)) {
                // never read past the end of the input
                curState := closure(successor(curState, egr[curLength]));
            }
        }
        return symbol(prefix(egr, symbolLength));
    }

Longest Symbol Prefix with NFA (2)

Remark:

The problem of ambiguity is not solved yet: if more than one token matches the longest input prefix, one of these tokens is returned by the function symbol.
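The NFA-based search for the longest symbol prefix, including the ambiguity mentioned in the remark, can be sketched as follows. The automaton is a hand-made example for a keyword `do` and identifiers over the letters d, o and x; the names `DELTA`, `FINAL`, `closure` and `successor` are assumptions of this sketch. Instead of picking one token, the function returns all token classes that match the longest prefix, so the ambiguity becomes visible.

```python
EPS = None
LETTERS = "dox"  # deliberately tiny alphabet for the example

# Branch 1 -> 2 -> 3 recognises "do"; branch 4 -> 5 recognises identifiers.
DELTA = {(0, EPS, 1), (0, EPS, 4), (1, "d", 2), (2, "o", 3)}
DELTA |= {(4, c, 5) for c in LETTERS} | {(5, c, 5) for c in LETTERS}
FINAL = {3: "DO", 5: "IDENT"}


def closure(states):
    result, worklist = set(states), list(states)
    while worklist:
        q = worklist.pop()
        for (source, label, target) in DELTA:
            if source == q and label is EPS and target not in result:
                result.add(target)
                worklist.append(target)
    return result


def successor(states, ch):
    return {p for (q, a, p) in DELTA if q in states and a == ch}


def longest_symbol_prefix(egr):
    """Return (prefix, matching token classes), or None if no prefix is a symbol."""
    cur_state = closure({0})
    cur_length, symbol_length, classes = 0, None, set()
    while cur_state:
        hits = {FINAL[q] for q in cur_state if q in FINAL}
        if hits:
            symbol_length, classes = cur_length, hits
        if cur_length == len(egr):
            break  # end of input reached
        cur_state = closure(successor(cur_state, egr[cur_length]))
        cur_length += 1
    if symbol_length is None:
        return None
    return egr[:symbol_length], classes


print(longest_symbol_prefix("do x"))
print(longest_symbol_prefix("dox"))
```

For the input `do x` both classes match the prefix `do`; a rule such as "the first rule wins" is still needed to choose between them.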

NFA → DFA

Principle:

For each NFA, a DFA can be constructed that accepts the same language. (In general, this does not hold for NFAs with output.)

Properties of a DFA:
• No $\varepsilon$-transitions.
• Transitions are determined by a function.

NFA → DFA (2)

Definition (Deterministic Finite Automaton)
A deterministic finite automaton is defined as a 5-tuple

$M = (\Sigma, Q, \Delta, q_0, F)$

where
• $\Sigma$ is the input alphabet
• $Q$ is the set of states
• $q_0 \in Q$ is the initial state
• $F \subseteq Q$ is the set of final states
• $\Delta : Q \times \Sigma \to Q$ is the transition function.

NFA → DFA (3)

Construction (according to John Myhill):
• The states of the DFA are subsets of the NFA states (powerset construction). Since the set of NFA states is finite, the set of its subsets is finite as well.
• The start state of the DFA is the $\varepsilon$-closure of the NFA start state.
• The final states of the DFA are the sets of states that contain an NFA final state.
• The successor state of a state $S$ in the DFA under input $a$ is obtained by
  - computing all successors $p$ of $q \in S$ under $a$ in the NFA
  - and adding the $\varepsilon$-closure of $p$

NFA → DFA (4)

• If working with character classes (e.g. [a-f]), the characters and character classes at the outgoing transitions of a state must be disjoint.
• Completion of the automaton for error handling:
  - Insert an additional (final) state nT (the nonToken state)
  - For each state, add a transition to the nonToken state for each character for which no outgoing transition exists.

NFA → DFA (5)

Definition (DFA for NFA)
Let $M = (\Sigma, Q, \Delta, q_0, F)$ be an NFA. Then the DFA $M'$ corresponding to the NFA $M$ is defined as $M' = (\Sigma, Q', \Delta', q_0', F')$ where
• the set of states is $Q' \subseteq \mathcal{P}(Q)$, the power set of $Q$
• the initial state $q_0'$ is the $\varepsilon$-closure of $q_0$
• the final states are $F' = \{S \subseteq Q \mid S \cap F \neq \emptyset\}$
• $\Delta'(S, a) = \varepsilon\text{-closure}(\{p \mid (q, a, p) \in \Delta, q \in S\})$ for all $a \in \Sigma$.
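The definition translates almost literally into a worklist algorithm that generates only the reachable subsets. This is a sketch under assumptions: the example NFA (for the language a b*), the function names, and the use of `frozenset` for DFA states are choices of this illustration. The empty set appears as an ordinary DFA state and plays the role of an error state.

```python
EPS = None


def eps_closure(states, delta):
    result, worklist = set(states), list(states)
    while worklist:
        q = worklist.pop()
        for (source, label, target) in delta:
            if source == q and label is EPS and target not in result:
                result.add(target)
                worklist.append(target)
    return frozenset(result)


def subset_construction(alphabet, delta, q0, finals):
    """Build the reachable part of the DFA corresponding to the NFA."""
    start = eps_closure({q0}, delta)
    dfa_states, dfa_delta, worklist = {start}, {}, [start]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            moved = {p for (q, label, p) in delta if q in S and label == a}
            T = eps_closure(moved, delta)
            dfa_delta[(S, a)] = T
            if T not in dfa_states:
                dfa_states.add(T)
                worklist.append(T)
    dfa_finals = {S for S in dfa_states if S & finals}
    return dfa_states, dfa_delta, start, dfa_finals


# Hypothetical NFA for a b*: two epsilon-branches from state 0.
NFA_DELTA = {(0, EPS, 1), (0, EPS, 3), (1, "a", 2), (3, "a", 4), (4, "b", 4)}
dfa_states, dfa_delta, start, dfa_finals = subset_construction("ab", NFA_DELTA, 0, {2, 4})


def dfa_accepts(word):
    state = start
    for ch in word:
        state = dfa_delta[(state, ch)]
    return state in dfa_finals


print(len(dfa_states), dfa_accepts("abb"), dfa_accepts("ba"))
```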

Example: DFA

[Figure: DFA for the example. Its states are sets of NFA states, e.g. {s0, s1, s2, s5, s12, s14, s18}, and its transitions are labelled with disjoint character classes such as BU\{d} and BUZI\{o}. For clarity, the transitions to the nonToken state nT are only sketched. © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007]

Longest Symbol Prefix with DFA

    longestSymbolPrefix(char[] egr) {
        // precondition: length(egr) > 0
        // assumption: egr is indexed from 1
        State curState := start_state;
        int curLength := 0;
        int symbolLength := undef;

        while (curLength <= length(egr) && curState != nT) {
            if (curState is FinalState) {
                symbolLength := curLength;
            }
            curLength++;
            if (curLength <= length(egr)) {
                // never read past the end of the input
                curState := successor(curState, egr[curLength]);
            }
        }
        return symbol(prefix(egr, symbolLength));
    }

Longest Symbol Prefix with DFA (2)

Remarks:
• The closure is computed at construction time, not at runtime. (Principle: do as much statically as you can!)
• The problem of ambiguity is still not solved.

Most scanner generators resolve such conflicts by the order of the rules: the rule that is listed first wins.
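The DFA-based search combined with rule ordering can be sketched for the two rules `do` and identifier. The automaton is written by hand here rather than generated; the state numbers, the tables `RULES` and `ACCEPTS`, and the function names are assumptions of this example.

```python
import string

NT = "nT"  # nonToken state
LETTERS = set(string.ascii_letters + "_")
ALNUM = LETTERS | set(string.digits)

RULES = ["DO", "IDENT"]  # order of the rules in the specification
# Final states and the rules whose symbols end in them.
ACCEPTS = {1: {"IDENT"}, 2: {"DO", "IDENT"}, 3: {"IDENT"}}


def successor(state, ch):
    """Hand-written transition function: 0 start, 1 after 'd', 2 after 'do', 3 identifier."""
    if state == 0:
        if ch == "d":
            return 1
        return 3 if ch in LETTERS else NT
    if state == 1:
        if ch == "o":
            return 2
        return 3 if ch in ALNUM else NT
    if state in (2, 3):
        return 3 if ch in ALNUM else NT
    return NT


def symbol(state):
    # Conflict resolution: the first rule in the specification wins.
    return next(rule for rule in RULES if rule in ACCEPTS[state])


def longest_symbol_prefix(egr):
    cur_state, cur_length = 0, 0
    symbol_length, symbol_state = None, None
    while cur_state != NT:
        if cur_state in ACCEPTS:
            symbol_length, symbol_state = cur_length, cur_state
        if cur_length == len(egr):
            break  # end of input reached
        cur_state = successor(cur_state, egr[cur_length])
        cur_length += 1
    if symbol_length is None:
        return None
    return symbol(symbol_state), egr[:symbol_length]


print(longest_symbol_prefix("do x"))
print(longest_symbol_prefix("double"))
```

No closure is computed while scanning; each input character costs exactly one transition.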

Longest Symbol Prefix with DFA (3)

Implementation aspects:
• The constructed DFA can be minimized.
• Input buffering is important: cyclic arrays are often used (caution with the maximal token length, e.g. in the case of comments).
• Encode the DFA in a table.
• Choose a suitable partitioning of the alphabet in order to reduce the number of transitions (i.e. the size of the table).
• Interface with the parser: usually the parser asks proactively for the next token (co-routines).
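Three of these aspects (table encoding, alphabet partitioning and the pull interface to the parser) can be sketched together. The partition into four character classes, the table for identifiers and float constants, and the generator `tokens` are assumptions of this example; a Python generator stands in for the co-routine.

```python
# Partition of the alphabet: one table column per class instead of per character.
LETTER, DIGIT, DOT, OTHER = range(4)
NT = -1  # nonToken state


def char_class(ch):
    if ch.isalpha() or ch == "_":
        return LETTER
    if ch.isdigit():
        return DIGIT
    if ch == ".":
        return DOT
    return OTHER


# Rows are states: 0 start, 1 identifier, 2 integer part, 3 after ".", 4 fraction.
TABLE = [
    # LETTER DIGIT DOT OTHER
    [1,      2,    NT, NT],  # 0
    [1,      1,    NT, NT],  # 1 (final: IDENT)
    [NT,     2,    3,  NT],  # 2
    [NT,     4,    NT, NT],  # 3
    [NT,     4,    NT, NT],  # 4 (final: FLOATCONST)
]
FINAL = {1: "IDENT", 4: "FLOATCONST"}


def tokens(text):
    """Yield (class, string) pairs on demand, one per request."""
    pos = 0
    while pos < len(text):
        if text[pos] in " \t":
            pos += 1
            continue
        state, length, best = 0, 0, None
        while pos + length < len(text):
            state = TABLE[state][char_class(text[pos + length])]
            if state == NT:
                break
            length += 1
            if state in FINAL:
                best = (FINAL[state], text[pos:pos + length])
        if best is None:
            raise ValueError(f"no token at position {pos}")
        yield best
        pos += len(best[1])


stream = tokens("x1 3.14 y")
first = next(stream)  # the parser asks for the next token
rest = list(stream)
print(first, rest)
```

With the partition the table has four columns per state, whereas a table indexed by raw characters would need one column for every character of the input alphabet.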

Recommended Reading

• Wilhelm, Maurer: Chap. 7, pp. 239-269 (more theoretical)
• Appel: Chap. 2, pp. 16-37 (more practical)

Additional Reading:

• Aho, Sethi, Ullman: Chap. 3 (very detailed)
