cs335: syntax analysis

33
CS335: Syntax Analysis Swarnendu Biswas Semester 2019-2020-II CSE, IIT Kanpur Content influenced by many excellent references, see References slide for acknowledgements.

Upload: others

Post on 05-Apr-2022

16 views

Category:

Documents


0 download

TRANSCRIPT

CS335: Syntax AnalysisSwarnendu Biswas

Semester 2019-2020-II

CSE, IIT Kanpur

Content influenced by many excellent references, see References slide for acknowledgements.

An Overview of Compilation

CS 335 Swarnendu Biswas

lexical analyzer

semantic analyzer

source program

syntax analyzer code optimizer

code generator

intermediate code generator

target program

error handler

symbol table

Parser Interface

CS 335 Swarnendu Biswas

symbol table

token

get next token

parsetree

Lexical Analyzer

IRSyntax Analyzer

Rest of Front End

source program

Need for Checking Syntax

β€’ Given an input program, scanner generates a stream of tokens classified according to the syntactic category

β€’ The parser determines if the input program, represented by the token stream, is a valid sentence in the programming language

β€’ The parser attempts to build a derivation for the input program, using a grammar for the programming languageβ€’ If the input stream is a valid program, parser builds a valid model for later

phases

β€’ If the input stream is invalid, parser reports the problem and diagnostic information to the user

CS 335 Swarnendu Biswas

Syntax Analysis

β€’ Given a programming language grammar 𝐺 and a stream of tokens 𝑠, parsing tries to find a derivation in 𝐺 that produces 𝑠

β€’ In addition, a syntax analyserβ€’ Forward the information as IR to the next compilation phases

β€’ Handle errors if the input string is not in 𝐿(𝐺)

CS 335 Swarnendu Biswas

Context-Free Grammars

β€’ A context-free grammar (CFG) 𝐺 is a quadruple (𝑇, 𝑁𝑇, 𝑆, 𝑃)

CS 335 Swarnendu Biswas

𝑇 Set of terminal symbols (also called words) in the language 𝐿(𝐺)

𝑁𝑇 Set of nonterminal symbols that appear in the productions of 𝐺

𝑆 Goal or start symbol of the grammar 𝐺

𝑃 Set of productions (or rules) in 𝐺

Context-Free Grammars

β€’ Terminal symbols correspond to syntactic categories returned by the scannerβ€’ Terminal symbol is a word that can occur in a sentence

β€’ Nonterminals are syntactic variables introduced to provide abstraction and structure in the productions

β€’ 𝑆 represents the set of sentences in 𝐿(𝐺)

β€’ Each rule in 𝑃 is of the form 𝑡𝑻 β†’ (𝑻 βˆͺ 𝑡𝑻)βˆ—

CS 335 Swarnendu Biswas

Definitions

β€’ Derivation is a a sequence of rewriting steps that begins with the grammar 𝐺’s start symbol and ends with a sentence in the language

𝑆 ֜+𝑀 where 𝑀 ∈ 𝐿(𝐺)

β€’ At each point during derivation process, the string is a collection of terminal or nonterminal symbols

𝛼𝐴𝛽 β†’ 𝛼𝛾𝛽 if 𝐴 β†’ 𝛾

β€’ Such a string is called a sentential form if it occurs in some step of a valid derivation

β€’ A sentential form can be derived from the start symbol in zero or more steps

CS 335 Swarnendu Biswas

Example of a CFG

CFG

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ

| 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

| name

𝑂𝑝 β†’ + βˆ’ Γ— | Γ·

(𝒂 + 𝒃) Γ— 𝒄

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

β†’ 𝐸π‘₯π‘π‘Ÿ Γ— name

β†’ (𝐸π‘₯π‘π‘Ÿ) Γ— name

β†’ (𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name) Γ— name

β†’ (𝐸π‘₯π‘π‘Ÿ + name) Γ— name

β†’ (name + name) Γ— name

CS 335 Swarnendu Biswas

Sentential Form and Parse Tree

CS 335 Swarnendu Biswas

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

β†’ 𝐸π‘₯π‘π‘Ÿ Γ— name

β†’ (𝐸π‘₯π‘π‘Ÿ) Γ— name

β†’ (𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name) Γ— name

β†’ (𝐸π‘₯π‘π‘Ÿ + name) Γ— name

β†’ (name + name) Γ— name

𝐸π‘₯π‘π‘Ÿ

𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

𝐸π‘₯π‘π‘Ÿ( ) Γ—

𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

name +

Parse Tree

Parse Tree

β€’ A parse tree is a graphical representation of a derivation β€’ Root is labeled with by the start symbol 𝑆

β€’ Each internal node is a nonterminal, and represents the application of a production

β€’ Leaves are labeled by terminals and constitute a sentential form, read from left to right, called the yield or frontier of the tree

β€’ Parse tree filters out the order in which productions are applied to replace nonterminals β€’ It just represents the rules applied

CS 335 Swarnendu Biswas

Derivations

β€’ At each step during derivation, we have two choices to make1. Which nonterminal to rewrite?

2. Which production rule to pick?

β€’ Rightmost (or canonical) derivation rewrites the rightmost nonterminal at each step, denoted by 𝛼

π‘Ÿπ‘šπ›½

β€’ Similarly, leftmost derivation rewrites the leftmost nonterminal at each step, denoted by 𝛼

π‘™π‘šπ›½

β€’ Every leftmost derivation can be written as π‘€π΄π›Ύπ‘™π‘šπ‘€π›Ώπ›Ύ

CS 335 Swarnendu Biswas

Leftmost Derivation

CS 335 Swarnendu Biswas

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

β†’ (𝐸π‘₯π‘π‘Ÿ) 𝑂𝑝 name

β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name 𝑂𝑝 name

β†’ name 𝑂𝑝 name 𝑂𝑝 name

β†’ name + name 𝑂𝑝 name

β†’ name + name Γ— name

𝐸π‘₯π‘π‘Ÿ

𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

𝐸π‘₯π‘π‘Ÿ( ) Γ—

𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

name +

Parse Tree

Ambiguous Grammars

β€’ A grammar 𝐺 is ambiguous if some sentence in 𝐿(𝐺) has more than one rightmost (or leftmost) derivation

β€’ An ambiguous grammar can produce multiple derivations and parse trees

CS 335 Swarnendu Biswas

Example of Ambiguous Grammar

β€’ A grammar 𝐺 is ambiguous if some sentence in 𝐿(𝐺) has more than one rightmost (or leftmost) derivation

β€’ An ambiguous grammar can produce multiple derivations and parse trees

CS 335 Swarnendu Biswas

Sπ‘‘π‘šπ‘‘ β†’ if 𝐸π‘₯π‘π‘Ÿ then π‘†π‘‘π‘šπ‘‘

| if 𝐸π‘₯π‘π‘Ÿ then π‘†π‘‘π‘šπ‘‘ else π‘†π‘‘π‘šπ‘‘

| 𝐴𝑠𝑠𝑖𝑔𝑛

Ambiguous Dangling-Else Grammar

CS 335 Swarnendu Biswas

if 𝐸π‘₯π‘π‘Ÿ1 then if 𝐸π‘₯π‘π‘Ÿ2 then 𝐴𝑠𝑠𝑖𝑔𝑛1 else 𝐴𝑠𝑠𝑖𝑔𝑛2

π‘†π‘‘π‘šπ‘‘

if 𝐸π‘₯π‘π‘Ÿ1 then π‘†π‘‘π‘šπ‘‘

if 𝐸π‘₯π‘π‘Ÿ2 then π‘†π‘‘π‘šπ‘‘ else π‘†π‘‘π‘šπ‘‘

𝐴𝑠𝑠𝑖𝑔𝑛1 𝐴𝑠𝑠𝑖𝑔𝑛2

if 𝐸π‘₯π‘π‘Ÿ1 then π‘†π‘‘π‘šπ‘‘ else π‘†π‘‘π‘šπ‘‘

π‘†π‘‘π‘šπ‘‘

if 𝐸π‘₯π‘π‘Ÿ2 then π‘†π‘‘π‘šπ‘‘

𝐴𝑠𝑠𝑖𝑔𝑛1 𝐴𝑠𝑠𝑖𝑔𝑛2

Dealing with Ambiguous Grammars

β€’ Ambiguous grammars are problematic for compilersβ€’ Compilers use parse trees to interpret the meaning of the expressions during

later stages

β€’ Multiple parse trees can give rise to multiple interpretations

β€’ Fixing ambiguous grammarsβ€’ Transform the grammar to remove the ambiguity

β€’ Include rules to disambiguate during derivations β€’ For e.g., associativity and precedence

CS 335 Swarnendu Biswas

Fixing the Ambiguous Dangling-Else Grammar

β€’ In all programming languages, an else is matched with the closest then

CS 335 Swarnendu Biswas

Sπ‘‘π‘šπ‘‘ β†’ if 𝐸π‘₯π‘π‘Ÿ then π‘†π‘‘π‘šπ‘‘

| if 𝐸π‘₯π‘π‘Ÿ then π‘‡β„Žπ‘’π‘›π‘†π‘‘π‘šπ‘‘ else π‘†π‘‘π‘šπ‘‘

| 𝐴𝑠𝑠𝑖𝑔𝑛

π‘‡β„Žπ‘’π‘›π‘†π‘‘π‘šπ‘‘ β†’ if 𝐸π‘₯π‘π‘Ÿ then π‘‡β„Žπ‘’π‘›π‘†π‘‘π‘šπ‘‘ else π‘‡β„Žπ‘’π‘›π‘†π‘‘π‘šπ‘‘

| 𝐴𝑠𝑠𝑖𝑔𝑛

Fixed Dangling-Else Grammar

CS 335 Swarnendu Biswas

if 𝐸π‘₯π‘π‘Ÿ1 then if 𝐸π‘₯π‘π‘Ÿ2 then 𝐴𝑠𝑠𝑖𝑔𝑛1 else 𝐴𝑠𝑠𝑖𝑔𝑛2

Sπ‘‘π‘šπ‘‘ β†’ if 𝐸π‘₯π‘π‘Ÿ then π‘†π‘‘π‘šπ‘‘

β†’ if 𝐸π‘₯π‘π‘Ÿ then if 𝐸π‘₯π‘π‘Ÿ then π‘‡β„Žπ‘’π‘›π‘†π‘‘π‘šπ‘‘ else π‘†π‘‘π‘šπ‘‘

β†’ if 𝐸π‘₯π‘π‘Ÿ then if 𝐸π‘₯π‘π‘Ÿ then π‘‡β„Žπ‘’π‘›π‘†π‘‘π‘šπ‘‘ else 𝐴𝑠𝑠𝑖𝑔𝑛

β†’ if 𝐸π‘₯π‘π‘Ÿ then if 𝐸π‘₯π‘π‘Ÿ then 𝐴𝑠𝑠𝑖𝑔𝑛 else 𝐴𝑠𝑠𝑖𝑔𝑛

Interpreting the Meaning

CFG

𝐸π‘₯π‘π‘Ÿ β†’ (𝐸π‘₯π‘π‘Ÿ)

| 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

| name

𝑂𝑝 β†’ + βˆ’ Γ— | Γ·

𝒂 + 𝒃 Γ— 𝒄

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

β†’ 𝐸π‘₯π‘π‘Ÿ Γ— name

β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name Γ— name

β†’ 𝐸π‘₯π‘π‘Ÿ + name Γ— name

β†’ name + name Γ— name

CS 335 Swarnendu Biswas

rightmost derivation

Corresponding Parse Tree

𝒂 + 𝒃 Γ— 𝒄

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

β†’ 𝐸π‘₯π‘π‘Ÿ Γ— name

β†’ 𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name Γ— name

β†’ 𝐸π‘₯π‘π‘Ÿ + name Γ— name

β†’ name + name Γ— name

CS 335 Swarnendu Biswas

𝐸π‘₯π‘π‘Ÿ

name

𝐸π‘₯π‘π‘Ÿ

𝐸π‘₯π‘π‘Ÿ 𝑂𝑝 name

×𝑂𝑝 name

+

How do we evaluate the expression?

Associativity

CS 335 Swarnendu Biswas

π‘ π‘‘π‘Ÿπ‘–π‘›π‘” β†’ π‘ π‘‘π‘Ÿπ‘–π‘›π‘” + π‘ π‘‘π‘Ÿπ‘–π‘›π‘” π‘ π‘‘π‘Ÿπ‘–π‘›π‘” βˆ’ π‘ π‘‘π‘Ÿπ‘–π‘›π‘” 0 1 2|… |9

π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

9 5

π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

π‘ π‘‘π‘Ÿπ‘–π‘›π‘” + π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

βˆ’ π‘ π‘‘π‘Ÿπ‘–π‘›π‘” 2

π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

βˆ’ π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

+π‘ π‘‘π‘Ÿπ‘–π‘›π‘” π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

5 2

π‘ π‘‘π‘Ÿπ‘–π‘›π‘”

9

9 βˆ’ 5 + 2

Associativity

β€’ If an operand has operator on both the sides, the side on which operator takes this operand is the associativity of that operatorβ€’ +, -, *, / are left associative

β€’ ^, = are right associative

β€’ Grammar to generate strings with right associative operators

CS 335 Swarnendu Biswas

π‘Ÿπ‘–π‘”β„Žπ‘‘ β†’ π‘™π‘’π‘‘π‘‘π‘’π‘Ÿ = π‘Ÿπ‘–π‘”β„Žπ‘‘|π‘™π‘’π‘‘π‘‘π‘’π‘Ÿπ‘™π‘’π‘‘π‘‘π‘’π‘Ÿ β†’ π‘Ž 𝑏 … |𝑧

Parse Tree for Right Associative Grammars

CS 335 Swarnendu Biswas

a = b = cπ‘Ÿπ‘–π‘”β„Žπ‘‘

π‘™π‘’π‘‘π‘‘π‘’π‘Ÿ = π‘Ÿπ‘–π‘”β„Žπ‘‘

a π‘™π‘’π‘‘π‘‘π‘’π‘Ÿ π‘Ÿπ‘–π‘”β„Žπ‘‘

π‘™π‘’π‘‘π‘‘π‘’π‘Ÿ

c

=

b

Encode Precedence into the Grammar

π‘†π‘‘π‘Žπ‘Ÿπ‘‘ β†’ 𝐸π‘₯π‘π‘Ÿ

𝐸π‘₯π‘π‘Ÿ β†’ 𝐸π‘₯π‘π‘Ÿ + π‘‡π‘’π‘Ÿπ‘š|𝐸π‘₯π‘π‘Ÿ βˆ’ π‘‡π‘’π‘Ÿπ‘š|π‘‡π‘’π‘Ÿπ‘š

π‘‡π‘’π‘Ÿπ‘š β†’ π‘‡π‘’π‘Ÿπ‘š Γ— πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ|π‘‡π‘’π‘Ÿπ‘š Γ· πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ|πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ

πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ β†’ (𝐸π‘₯π‘π‘Ÿ)|num|name

CS 335 Swarnendu Biswas

pri

ori

ty

Corresponding Parse Tree

𝒂 βˆ’ 𝒃 + 𝒄

π‘†π‘‘π‘Žπ‘Ÿπ‘‘ β†’ 𝐸π‘₯π‘π‘Ÿ

β†’ 𝐸π‘₯π‘π‘Ÿ + π‘‡π‘’π‘Ÿπ‘š

β†’ 𝐸π‘₯π‘π‘Ÿ + πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ

β†’ 𝐸π‘₯π‘π‘Ÿ + name

β†’ 𝐸π‘₯π‘π‘Ÿ βˆ’ π‘‡π‘’π‘Ÿπ‘š + name

β†’ 𝐸π‘₯π‘π‘Ÿ βˆ’ πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ + name

β†’ 𝐸π‘₯π‘π‘Ÿ βˆ’ name + name

β†’ π‘‡π‘’π‘Ÿπ‘š βˆ’ name + name

β†’ πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ βˆ’ name + name

β†’ name βˆ’ name + name

CS 335 Swarnendu Biswas

𝐸π‘₯π‘π‘Ÿ

name

𝐸π‘₯π‘π‘Ÿ

𝐸π‘₯π‘π‘Ÿ + π‘‡π‘’π‘Ÿπ‘š

βˆ’ π‘‡π‘’π‘Ÿπ‘š πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ

nameπΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ

name

π‘‡π‘’π‘Ÿπ‘š

πΉπ‘Žπ‘π‘‘π‘œπ‘Ÿ

Types of Parsers

Top-down

β€’ Starts with the root and grows the tree toward the leaves

Bottom-up

β€’ Starts with the leaves and grow the tree toward the root

Universal

β€’ More general algorithms, but inefficient to use in production compilers

CS 335 Swarnendu Biswas

Error Handling

β€’ The scanner cannot deal with all errors

β€’ Common source of programming errorsβ€’ Lexical errors

β€’ For e.g., illegal characters, missing quotes around strings

β€’ Syntactic errorsβ€’ For e.g., misspelled keywords, misplaced semicolons or extra or missing braces

β€’ Semantic errorsβ€’ For e.g., type mismatches between operators and operands, undeclared variables

β€’ Logical errors

CS 335 Swarnendu Biswas

Handling Errors

Panic-mode recovery

β€’ Parser discards input symbols one at a time until a synchronizing token is found

β€’ Synchronizing tokens are usually delimiters (for e.g., ; or })

Phrase-level recovery

β€’ Perform local correction on the remaining input

β€’ Can go into an infinite loop because of wrong correction, or the error may have occurred before it is detected

CS 335 Swarnendu Biswas

Handling Errors

Error productions

β€’ Augment the grammar with productions that generate erroneous constructs

β€’ Works only for common mistakes, complicates the grammar

Global correction

β€’ Given an incorrect input string π‘₯ and grammar 𝐺, find a parse tree for a related string 𝑦 such that the number of modifications (insertions, deletions, and changes) of tokens required to transform π‘₯ into 𝑦 is as small as possible

CS 335 Swarnendu Biswas

Context-Free vs Regular Grammar

β€’ CFGs are more powerful than REsβ€’ Every regular language is context-free, but not vice versa

β€’ We can create a CFG for every NFA that simulates some RE

β€’ Language that can be described by a CFG but not by a RE

CS 335 Swarnendu Biswas

𝐿 = π‘Žπ‘›π‘π‘› 𝑛 β‰₯ 1}

Limitations of Syntax Analysis

β€’ Cannot determine whether β€’ A variable has been declared before use

β€’ A variable has been initialized

β€’ Variables are of types on which operations are allowed

β€’ Number of formal and actual arguments of a function match

β€’ These limitations are handled during semantic analysis

CS 335 Swarnendu Biswas

References

β€’ A. Aho et al. Compilers: Principles, Techniques, and Tools, 2nd edition, Chapters 2 and 4.

β€’ K. Cooper and L. Torczon. Engineering a Compiler, 2nd edition, Chapter 3.

CS 335 Swarnendu Biswas