csc 3315 lexical and syntax analysis

32
CSC3315 (Spring 2009) 1 CSC 3315 CSC 3315 Lexical and Syntax Lexical and Syntax Analysis Analysis Hamid Harroud Hamid Harroud School of Science and Engineering, Akhawayn School of Science and Engineering, Akhawayn University University http://www.aui.ma/~H.Harroud/csc3315/

Upload: kirima

Post on 18-Jan-2016

96 views

Category:

Documents


0 download

DESCRIPTION

CSC 3315 Lexical and Syntax Analysis. Hamid Harroud School of Science and Engineering, Akhawayn University http://www.aui.ma/~H.Harroud/csc3315/. Syntax Description vs. Syntax Analysis. Syntax Description : the set of rules that can be used to generate any sentence in the language - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CSC 3315 Lexical and Syntax Analysis

CSC3315 (Spring 2009) 1

CSC 3315CSC 3315Lexical and Syntax Lexical and Syntax AnalysisAnalysis

Hamid HarroudHamid HarroudSchool of Science and Engineering, Akhawayn School of Science and Engineering, Akhawayn

UniversityUniversityhttp://www.aui.ma/~H.Harroud/csc3315/

Page 2: CSC 3315 Lexical and Syntax Analysis

Syntax Description vs. Syntax Analysis

Syntax Description: the set of rules that can be used to generate any sentence in the language We can construct derivations (or parse trees) to

generate (synthesize) arbitrary sentences

Syntax Analysis: the reverse process Given a sentence, obtain the derivation (or parse

tree) that would generate this sentence based on some grammar

Page 3: CSC 3315 Lexical and Syntax Analysis

The Parsing Problem Given the sequence of tokens of a source

program (a sentence), we ask: is it syntactically valid, i.e. could it be generated from that grammar? If it is not; find all syntax errors (or as many as

possible) » for each error, produce a diagnostic message and

recover quickly– i.e. do not “crash” If it is; construct the corresponding parse tree

(i.e. the derivation)

Page 4: CSC 3315 Lexical and Syntax Analysis

The Parsing Problem

A parser is a program that solves the parsing problem It is the second phase of a compiler Parse tree constructed by parser is used in the

subsequent compilation phase: semantic analysis and code generation

A recognizer is similar to a parser, except that it does not produce the parse tree- it only gives a Yes or No answer

Page 5: CSC 3315 Lexical and Syntax Analysis

Syntax Description vs. Syntax Analysis

Parser design is nearly always directly based on a grammar description (BNF) of language syntax

Two categories of parsers Top down - parse tree is constructed starting at root,

using an order of a leftmost derivation Bottom up - parse tree is constructed starting at

leaves, using an order of the reverse of a rightmost derivation

Page 6: CSC 3315 Lexical and Syntax Analysis

Top-down Parsing: LL Parsing Top-down parsers are also called LL-parsers

First L: Scan input sentence from left to right Second L: Produce a left-most derivation

Scans the token stream from left to right: For each input token, it decides which rule to use to expand the

leftmost non-terminal in the current sentential form This decision is based only on whether the leftmost terminal

generated by the current leftmost non-terminal matches the current input token

=> one token of lookahead=> LL parsers that use one token of lookahead are called LL(1) parsers

Page 7: CSC 3315 Lexical and Syntax Analysis

LL(1) Parsing

The LL Parser must expand <A> using some rule of the form: <A> RHS

Obviously, this decision must be based both on the grammar and the unparsed portion of the token stream

A LL(1) parser determines the correct rule to use to expand a non-terminal <A> based only on the first token generated by <A>

Definition: if X is the RHS of some grammar rule then FIRST(X) = {a | =>* aY}

where a : a terminal symbol =>* : zero or more derivation steps

Page 8: CSC 3315 Lexical and Syntax Analysis

LL(1) Parsing Example:

Grammar: <A> b<B> | c<B>b | a

Suppose the input token currently being processed is c If the current sentential form is x<A>a, then there are

three possible rules we can apply, and the possible next sentential forms are then: xb<B>a, xc<B>ba, and xaa

=> But which rule to use next?=> Choose the rule such that the leftmost terminal

obtained when <A> is expanded matches our current input token “c”

Page 9: CSC 3315 Lexical and Syntax Analysis

LL(1) Parsing

Example (cont.) Consider now a slightly modified version of

this grammar: <A> c<B> | c<B>b | a

=> this grammar actually cannot be used with a LL(1) parser, since it is sometimes impossible to decide on next rule based on the current token only

Page 10: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing This is a recursive implementation of LL(1) parsers Based directly on BNF description of the language:

There is a subprogram for each non-terminal in the grammar, which can parse sentences generated by that non-terminal

EBNF is ideally suited for being the basis for a recursive descent parser, because EBNF tends to minimize the number of non-terminals in grammar

In a way, this is analogous to how the lexical analyzer is implemented directly from the state diagram of the corresponding FA

Page 11: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing Suppose a non-terminal <A> has only one RHS The subprogram for <A> is implemented as

follows: Scan the symbols of RHS one by one from left to right For each terminal symbol in the RHS, compare it with

the next input token; if they match, then read the next input token and continue; else report an error

For each non-terminal symbol in the RHS, call the subprogram associated with that non-terminal

Page 12: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing

Example: <A> a<B>cb The subprogram for non-terminal <A>

does the following: Check if next token == ‘a’ else error() Call subprogram for non-terminal <B> Check if next token == ‘c’ else error() Check if next token == ‘b’ else error()

Page 13: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing

Suppose a non-terminal <A> has more than one RHS

The subprogram for <A> is implemented as follows: First find rule <A> X such that FIRST(X) == next

token» If none of the RHS’s of <A> satisfies this, then report an error» If more than one RHS of <A> satisfies this, then the grammar is not appropriate for recursive descent parsing to begin with!

Continue as in the previous case

Page 14: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing Assumptions:

We have a lexical analyzer subprogram Lex(), which whenever called, puts the next token code in a global variable nextToken

At the beginning of every parsing subprogram, the next input token is in nextToken

Obviously, the first subprogram called is the one associated with the start symbol of the grammar

Page 15: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing

Example: consider the following grammar<A> a<B>cb | c<B> | ba<C>

Let ASub be subprogram associated with nonterminal <A>

Let BSub be subprogram associated with nonterminal <B>

Let CSub be subprogram associated with nonterminal <C>

Page 16: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent ParsingParser() { //this is the main function of the parser of this grammar Lex(); //get the first token in the source program ASub(); //call the subprogram for the start symbol of the grammar}

ASub() { switch (nextToken) case ‘a’: Lex(); BSub(); if nextToken != ‘c’ then error(); Lex(); if nextToken != ‘b’ then error(); Lex(); break; case ‘c’: Lex(); BSub(); break; default: CSub();}

Page 17: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing

An example in EBNF:<expr> <term> {(+ | -) <term>}<term> <factor> {(* | /) <factor>}<factor> id | ( <expr> )

Let expr() be subprogram associated with non-terminal <expr>

Let term() be subprogram associated with non-terminal <term>

Let factor() be subprogram associated with non-terminal <factor>

Page 18: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing/* Parse strings in the language

generated by the rule:

<expr> <term> {(+ | -) <term>}

*/

void expr() {

term(); /* Parse the first term */

while (nextToken == PLUS_CODE || nextToken == MINUS_CODE)

{

lex();

term();

}

}

Page 19: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing/* Parse strings in the languagegenerated by the rule:<term> <factor> {(* | /) <factor>}*/

void term() { factor(); /* Parse first factor */ while (nextToken == MULT_CODE || nextToken == DIV_CODE) { lex(); factor(); }}

Page 20: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing/* Parse strings in the languagegenerated by the rule:<factor> id | (<expr>) */

void factor() { if (nextToken) == ID_CODE) lex(); else if (nextToken == LEFT_PAREN_CODE) { lex(); expr(); if (nextToken == RIGHT_PAREN_CODE) lex(); else error(); } /* End of else if (nextToken == ... */ else error(); /* Neither RHS matches */}

Page 21: CSC 3315 Lexical and Syntax Analysis

Recursive-Descent Parsing

Limitations:

Recursive descent cannot be used with a grammar that: contains a left recursive rule => an infinite loop!OR is such that we cannot always choose correct RHS based on a

single token of lookahead

These two features are in fact problematic for LL(1) parsers in general

Page 22: CSC 3315 Lexical and Syntax Analysis

Left Recursion Problem

Examples:

Direct recursion: <A> <A> + <B>

Indirect recursion:<A> <C>a<C> <A>b

=> In both cases, the subprogram for parsing <A> will call itself indefinitely (hence an infinite loop)

what about non-left recursion; does it always lead to an infinite loop?

Page 23: CSC 3315 Lexical and Syntax Analysis

Left Recursion Problem

A grammar can always be converted into one without left recursion EBNF is especially useful for this purpose

Example:

<expr> <term> | <expr> + <term>

<expr> <term> {+ <term>}

Page 24: CSC 3315 Lexical and Syntax Analysis

Pairwise Disjointness

Lack of pairwise disjointness: the inability to determine the correct RHS (to expand a non-terminal) on the basis of a single token of lookahead

To make sure a grammar does not have this problem, grammar must pass the pairwise disjointness test:

For each non-terminal A in the grammar, Let α1, α2, …, αm be the RHS’s for the m rules in which A appears

on the LHS For any pair (αi, αj) we must have

FIRST(αi) η FIRST(αj) = Ø

Page 25: CSC 3315 Lexical and Syntax Analysis

Pairwise Disjointness

The pair disjointness problem can sometimes be solved via a technique called left factoring

Example:<variable> identifier | identifier [<expression>]

<variable> identifier <new><new> [<expression>] | ε

<variable> identifier [ [<expression>] ]

Page 26: CSC 3315 Lexical and Syntax Analysis

Pairwise Disjointness

Exercise:

<S> <S>a | b

What is the language generated by this grammar? For which input token streams will the subprogram

for <S> run into an infinite loop? Does this grammar pass the pairwise disjointness

test?

Page 27: CSC 3315 Lexical and Syntax Analysis

Pairwise Disjointness

Exercise:

<A> a | <B>b | <C><B>

<B> c<C> | c

<C> b | a<B>

Does this grammar pass the pairwise disjointness test? If not, re-write grammar to fix the problem.

Page 28: CSC 3315 Lexical and Syntax Analysis

LR Parsing

LL(k) parsers predict which production to use, considering only the first k tokens.

What if we could postpone choosing a production until we see all input tokens corresponding to the entire right-hand side of a production?

LR(k): rightmost derivation, lookahead of k tokens How is rightmost derivation compatible with left-

to-right parsing of input?

Page 29: CSC 3315 Lexical and Syntax Analysis

LR Parsing

Top-down parsing Start at most abstract level (grammar sentence) and work

down to most concrete level (tokens) Leftmost derivation LL(k), recursive-descent or predictive parsing

Bottom-up parsing Work from tokens up to sentences Rightmost derivation LR(k)

Page 30: CSC 3315 Lexical and Syntax Analysis

LR Parsing

Page 31: CSC 3315 Lexical and Syntax Analysis

LR Parsing

An LR(k) parser uses the contents of its stack and the next k tokens of the input to decide which action to take. (In practice, k = 1.)

The parser knows when to shift and reduce by applying a DFA to the stack.

Edges are labeled with terminals and non-terminals. Transitions indicate actions such as “shift and go to state

n”, “reduce by rule k”, “accept”.

Page 32: CSC 3315 Lexical and Syntax Analysis

LR Parsing