Syntax Analysis (Chapter 4)
Course Overview
PART I: overview material
1 Introduction
2 Language processors (tombstone diagrams, bootstrapping)
3 Architecture of a compiler
PART II: inside a compiler
4 Syntax analysis
5 Contextual analysis
6 Runtime organization
7 Code generation
PART III: conclusion
8 Interpretation
9 Review
The “Phases” of a Compiler
[Diagram: Source Program → Syntax Analysis → Abstract Syntax Tree → Contextual Analysis → Decorated Abstract Syntax Tree → Code Generation → Object Code. Syntax Analysis and Contextual Analysis each produce Error Reports. This chapter covers Syntax Analysis.]
In Chapter 4
• Syntax Analysis
– Scanning: recognize “words” or “tokens” in the input
– Parsing: recognize structure of program
• Different parsing strategies
• How to construct a recursive descent parser
– AST Construction
• Use of theoretical “Tools”:
– Regular Expressions and Finite–State Machines
– Grammars
– Extended BNF notation
– First sets and Follow sets
Syntax Analysis
• The “job” of syntax analysis is to read the source program (text file) and determine its structure.
• Subphases
– Scanning
– Parsing
– Construct an internal representation of the source text that shows the structure (usually an AST)
Note: A single-pass compiler usually does not explicitly construct an AST.
Multi Pass Compiler
A multi pass compiler makes several passes over the program. The output of a preceding phase is stored in a data structure and used by subsequent phases.
Dependency diagram of a typical Multi Pass Compiler:
[Diagram: the Compiler Driver calls the Syntactic Analyzer, the Contextual Analyzer, and the Code Generator. Syntactic Analyzer: input Source Text, output AST. Contextual Analyzer: input AST, output Decorated AST. Code Generator: input Decorated AST, output Object Code. This chapter covers the Syntactic Analyzer.]
Syntax Analysis
Dataflow chart:
[Diagram: Source Program (Stream of Characters) → Scanner → Stream of “Tokens” → Parser → Abstract Syntax Tree. The Scanner and the Parser each produce Error Reports.]
(1) Scan: Divide Input into Tokens
An example Mini–Triangle source program:
let var y: Integer
in !new year
   y := y+1
The scanner turns it into this stream of tokens (kind, spelling):
let       let
var       var
ident.    y
colon     :
ident.    Integer
in        in
ident.    y
becomes   :=
ident.    y
op.       +
intlit    1
eot
Tokens are “words” in the input, for example keywords, operators, identifiers, literals, etc.
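The scanning step can be sketched in a few lines of Python. This is only an illustration: the token kinds follow the slide, but the regular expressions, the keyword set, and the function name `scan` are assumptions made for this example and do not cover the full Triangle lexicon.

```python
import re

# Token kinds follow the slide; the patterns are assumptions for this sketch.
TOKEN_SPEC = [
    ("comment", r"![^\n]*"),               # '!' starts a comment running to end of line
    ("space",   r"\s+"),
    ("intlit",  r"[0-9]+"),
    ("ident",   r"[A-Za-z][A-Za-z0-9]*"),
    ("becomes", r":="),                    # must be tried before 'colon'
    ("colon",   r":"),
    ("op",      r"[+\-*/<>=]"),
]
KEYWORDS = {"let", "var", "in"}            # only the keywords used in the example
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def scan(source):
    """Return a list of (kind, spelling) pairs, ending with an 'eot' token."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, spelling = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("comment", "space"):
            continue                       # separators are not passed to the parser
        if kind == "ident" and spelling in KEYWORDS:
            kind = spelling                # keywords get their own token kind
        tokens.append((kind, spelling))
    tokens.append(("eot", ""))
    return tokens


program = "let var y: Integer\nin !new year\n   y := y+1"
for kind, spelling in scan(program):
    print(kind, spelling)
```

Comments and white space are recognized but discarded, which is why `!new year` does not appear in the token stream.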
(2) Parse: Determine structure of program
The parser analyzes the structure of the token stream with respect to the grammar of the language.
Token stream: let  var  id.(y)  col.(:)  id.(Int)  in  id.(y)  bec.(:=)  id.(y)  op(+)  intlit(1)  eot
Parse tree:
Program
  single-Command
    let
    Declaration
      single-Declaration
        var
        Ident (y)
        :
        Type Denoter
          Ident (Int)
    in
    single-Command
      V-Name
        Ident (y)
      :=
      Expression
        primary-Exp
          V-Name
            Ident (y)
        Op. (+)
        primary-Exp
          Int.Lit (1)
(3) AST Construction
Program
  LetCommand
    VarDecl
      Ident (y)
      SimpleType
        Ident (Integer)
    AssignCommand
      SimpleVar
        Ident (y)
      BinaryExpr
        VNameExp
          SimpleVar
            Ident (y)
        Op (+)
        Int.Expr
          Int.Lit (1)
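One possible way to represent this AST in code is a small class per node kind. The sketch below uses Python dataclasses. The class names mirror the node labels on the slide, but the field names are assumptions, and the Ident, Op and Int.Lit leaves are folded into plain string and integer fields to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class SimpleType:
    name: str            # spelling of the type identifier

@dataclass
class VarDecl:
    name: str            # spelling of the declared identifier
    type: SimpleType

@dataclass
class SimpleVar:
    name: str

@dataclass
class VNameExp:
    vname: SimpleVar

@dataclass
class IntExpr:
    value: int

@dataclass
class BinaryExpr:
    left: object
    op: str
    right: object

@dataclass
class AssignCommand:
    target: SimpleVar
    expr: object

@dataclass
class LetCommand:
    decl: VarDecl
    body: object

@dataclass
class Program:
    command: object


# AST for:  let var y: Integer in y := y+1
ast = Program(
    LetCommand(
        VarDecl("y", SimpleType("Integer")),
        AssignCommand(
            SimpleVar("y"),
            BinaryExpr(VNameExp(SimpleVar("y")), "+", IntExpr(1)),
        ),
    )
)
print(ast)
```

Unlike the parse tree, the AST keeps no nodes for keywords or punctuation such as `let`, `in`, `:` and `:=`; their meaning is carried by the node kinds.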
Grammars
RECAP:
– The syntax of a language can be specified by means of a CFG (Context Free Grammar).
– A CFG can be expressed in BNF (Backus-Naur Form).
Example: Mini–Triangle grammar in BNF
Program ::= single-Command
Command ::= single-Command
          | Command ; single-Command
single-Command ::= V-name := Expression
                 | begin Command end
                 | ...
Grammars (continued)
For our convenience, we will use EBNF or “Extended BNF” rather than simple BNF.
EBNF = BNF + regular expressions
Example: Mini–Triangle in EBNF
Program ::= single-Command
Command ::= (single-Command ;)* single-Command
single-Command ::= V-name := Expression
                 | begin Command end
                 | ...
Here * means 0 or more occurrences of the preceding item, in this case (single-Command ;).
Regular Expressions
• RE are a notation for expressing a set of strings of terminal symbols.
Different kinds of RE:
ε        The empty string
t        Generates only the string t
X Y      Generates any string xy such that x is generated by X and y is generated by Y
X | Y    Generates any string which is generated either by X or by Y
X*       The concatenation of zero or more strings generated by X
(X)      Used for grouping
RE: Examples
What set of strings does each of the following RE generate?
1. ε
2. M(r|s)“.”
3. (foo|bar)*
4. (foo|bar)(foo|bar)*
5. (0|1|2|3|4|5|6|7|8|9)*
6. 0|(1|..|9)(0|1|..|9)*
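One way to explore these questions is to try the expressions out with Python's `re` module. The sketch below transcribes expressions 2 to 6 into Python syntax, where the literal dot is written `\.` and `[1-9]` abbreviates `(1|..|9)`. The helper `generates` and the constant names are invented for this example.

```python
import re


def generates(pattern, s):
    """True if the whole string s belongs to the language of the pattern."""
    return re.fullmatch(pattern, s) is not None


TITLE    = r"M(r|s)\."                   # 2. exactly "Mr." and "Ms."
FOOBARS0 = r"(foo|bar)*"                 # 3. zero or more of foo/bar, including ""
FOOBARS1 = r"(foo|bar)(foo|bar)*"        # 4. one or more of foo/bar
DIGITS   = r"(0|1|2|3|4|5|6|7|8|9)*"     # 5. any digit string, including ""
NUMERAL  = r"0|[1-9][0-9]*"              # 6. decimal numerals without leading zeros

print(generates(TITLE, "Mr."), generates(TITLE, "Mrs."))
print(generates(FOOBARS0, ""), generates(FOOBARS1, ""))
print(generates(DIGITS, "007"), generates(NUMERAL, "007"))
```

`re.fullmatch` is used rather than `re.match` because the question is whether the entire string is generated, not whether some prefix of it is.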
Regular Expressions
• The “languages” that can be defined by RE and CFG have been extensively studied by theoretical computer scientists. These are some important conclusions and terms:
– RE is a “weaker” formalism than CFG: any language expressible by a RE can also be expressed by a CFG, but not the other way around!
– The languages expressible as RE are called regular languages
– Generally: a language that exhibits “self–embedding” cannot be expressed by RE.
– Programming languages exhibit self–embedding. (Examples: an expression can contain another expression, and a command can contain another command).
Extended BNF
• Extended BNF combines BNF with RE
• A production in EBNF looks like
LHS ::= RHS
where LHS is a non-terminal symbol and RHS is an extended regular expression
• An extended RE is just like a regular expression except it is composed of terminals and non-terminals of the grammar.
• Simply put, EBNF adds to BNF these notations
– (...) for the purpose of grouping and
– * for denoting “0 or more repetitions of … ”
Extended BNF: an Example
Example: a simple expression language
Expression ::= PrimaryExp (Operator PrimaryExp)*
PrimaryExp ::= Literal | Identifier | ( Expression )
Identifier ::= Letter (Letter|Digit)*
Literal ::= Digit Digit*
Letter ::= a | b | c | ... | z
Digit ::= 0 | 1 | 2 | 3 | 4 | ... | 9
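As a preview of recursive descent parsing, here is a minimal recognizer for the expression grammar above, with one method per non-terminal and a `while` loop for each `*`. It is a sketch under stated assumptions: the slide does not define Operator, so `+ - * /` are assumed; the input contains no white space; and the names `Recognizer` and `recognizes` are made up for this example.

```python
OPERATORS = set("+-*/")      # assumption: Operator is not defined on the slide


def is_letter(c):
    return "a" <= c <= "z"   # Letter ::= a | b | ... | z


def is_digit(c):
    return "0" <= c <= "9"   # Digit ::= 0 | 1 | ... | 9


class Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        """Current character, or "" at the end of the input."""
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def accept(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r} at position {self.pos}")
        self.pos += 1

    def parse_expression(self):
        # Expression ::= PrimaryExp (Operator PrimaryExp)*
        self.parse_primary()
        while self.peek() in OPERATORS:
            self.pos += 1
            self.parse_primary()

    def parse_primary(self):
        # PrimaryExp ::= Literal | Identifier | ( Expression )
        c = self.peek()
        if is_digit(c):                      # Literal ::= Digit Digit*
            self.pos += 1
            while is_digit(self.peek()):
                self.pos += 1
        elif is_letter(c):                   # Identifier ::= Letter (Letter|Digit)*
            self.pos += 1
            while is_letter(self.peek()) or is_digit(self.peek()):
                self.pos += 1
        elif c == "(":
            self.accept("(")
            self.parse_expression()
            self.accept(")")
        else:
            raise SyntaxError(f"unexpected {c!r} at position {self.pos}")


def recognizes(text):
    """True if text is a complete Expression."""
    r = Recognizer(text)
    try:
        r.parse_expression()
    except SyntaxError:
        return False
    return r.pos == len(r.text)


print(recognizes("a+(b1*42)"), recognizes("a+"))
```

Each alternative of PrimaryExp can be chosen by looking at a single character, because the three alternatives start with different symbols. This is the role that starter sets play later in the chapter.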
A little bit of useful theory
• We will now look at a few useful bits of theory. These will be necessary later when we implement parsers.
– Grammar transformations
• A grammar can be transformed in a number of ways without changing its meaning (i.e. its language, or the set of strings that it generates)
– The definition and computation of starter sets (first sets), follow sets, and nullable symbols
Grammar Transformations
Left factorization
X Y | X Z   is transformed into   X ( Y | Z )
Example (here X = if Expression then single-Command, Y = ε, Z = else single-Command):
single-Command ::= V-name := Expression
                 | if Expression then single-Command
                 | if Expression then single-Command else single-Command
is transformed into
single-Command ::= V-name := Expression
                 | if Expression then single-Command ( ε | else single-Command )
Grammar Transformations (continued)
Elimination of Left Recursion
N ::= X | N Y   is transformed into   N ::= X Y*
Example:
Identifier ::= Letter
             | Identifier Letter
             | Identifier Digit
is first left-factorized into
Identifier ::= Letter
             | Identifier (Letter|Digit)
and then transformed into
Identifier ::= Letter (Letter|Digit)*
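The transformation can also be carried out mechanically. The sketch below handles the slightly more general shape N ::= X1 | ... | Xm | N Y1 | ... | N Yn and produces (X1|...|Xm) (Y1|...|Yn)*. The representation (each alternative as a list of symbol names) and the function name are assumptions made for this illustration.

```python
def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite N ::= X1 |...| Xm | N Y1 |...| N Yn as (X1|...|Xm) (Y1|...|Yn)*.

    alternatives is a list of alternatives, each a list of symbol names.
    Returns the new right-hand side as an EBNF string.
    """
    # Alternatives that do not start with N are the "bases" (the X part).
    bases = [alt for alt in alternatives if alt[:1] != [nonterminal]]
    # For alternatives N Y, keep only the Y part.
    tails = [alt[1:] for alt in alternatives if alt[:1] == [nonterminal]]
    if not bases or any(len(tail) == 0 for tail in tails):
        raise ValueError("grammar rule is not of the form N ::= X | N Y")

    def group(alts):
        text = " | ".join(" ".join(alt) for alt in alts)
        # Parentheses are needed unless the group is a single symbol.
        needs_parens = len(alts) > 1 or len(alts[0]) > 1
        return f"({text})" if needs_parens else text

    if not tails:                      # nothing to eliminate
        return " | ".join(" ".join(alt) for alt in bases)
    return f"{group(bases)} {group(tails)}*"


rule = [["Letter"], ["Identifier", "Letter"], ["Identifier", "Digit"]]
print("Identifier ::=", eliminate_left_recursion("Identifier", rule))
```

The transformation matters for recursive descent parsing: a parsing procedure for a left-recursive rule would call itself before consuming any input and so never terminate, whereas `Y*` becomes an ordinary loop.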
Grammar Transformations (continued)
Substitution of non-terminal symbols
N ::= X
M ::= α N β
is transformed into
N ::= X
M ::= α X β
Example:
single-Command ::= for controlVar := Expression direction Expression do single-Command
direction ::= to | downto
is transformed into
single-Command ::= for controlVar := Expression (to|downto) Expression do single-Command
Starter Sets (a.k.a. First Sets)
Informal Definition:
The starter set of a RE X is the set of terminal symbols that can occur as the start of any string generated by X
Example:
starters[ (“+” | - | ε) (0 | 1 | … | 9)+ ] = {+, -, 0, 1, …, 9}
Formal Definition:
starters[ε] = { }
starters[t] = {t} (where t is any terminal symbol)
starters[X Y] = starters[X] (if X doesn’t generate ε)
starters[X Y] = starters[X] ∪ starters[Y] (if X generates ε)
starters[X | Y] = starters[X] ∪ starters[Y]
starters[X*] = starters[X]
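The formal definition translates almost line by line into a recursive function. The sketch below represents a RE as nested tuples; that representation, and the helper `generates_empty`, are assumptions made for this example. The postfix + of the example is encoded as X X*.

```python
from functools import reduce

# A RE is a nested tuple: ("eps",), ("term", t), ("seq", X, Y),
# ("alt", X, Y) or ("star", X). This encoding is an assumption of the sketch.
EPS = ("eps",)

def term(t):    return ("term", t)
def seq(x, y):  return ("seq", x, y)
def alt(x, y):  return ("alt", x, y)
def star(x):    return ("star", x)


def generates_empty(r):
    """True if the RE r can generate the empty string."""
    kind = r[0]
    if kind == "eps" or kind == "star":
        return True
    if kind == "term":
        return False
    if kind == "seq":
        return generates_empty(r[1]) and generates_empty(r[2])
    if kind == "alt":
        return generates_empty(r[1]) or generates_empty(r[2])
    raise ValueError(f"unknown RE node {kind!r}")


def starters(r):
    """Set of terminals that can start a string generated by r."""
    kind = r[0]
    if kind == "eps":
        return set()
    if kind == "term":
        return {r[1]}
    if kind == "seq":
        if generates_empty(r[1]):
            return starters(r[1]) | starters(r[2])
        return starters(r[1])
    if kind == "alt":
        return starters(r[1]) | starters(r[2])
    if kind == "star":
        return starters(r[1])
    raise ValueError(f"unknown RE node {kind!r}")


# ("+" | - | eps) (0 | 1 | ... | 9)+   with X+ written as X X*
digit = reduce(alt, [term(str(d)) for d in range(10)])
sign = alt(term("+"), alt(term("-"), EPS))
number = seq(sign, seq(digit, star(digit)))
print(sorted(starters(number)))
```

Because the sign part can generate the empty string, the digits are starters of the whole expression as well, which matches the example set.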
Derivations
• A derivation proceeds by replacing one non-terminal at each step
S ::= E
E ::= T | E + T
T ::= i | ( E )
S => E => E + T => T + T => i + T => i + i
• This is a left-most derivation (it replaces the left-most non-terminal at each step).
• Can you find the corresponding right-most derivation?
• Can you find a derivation that is neither left-most nor right-most?
Sentential forms
• A sentential form is a sequence of grammar symbols that can be derived from the start symbol.
• A sentence is a sentential form that contains only terminal symbols, that is, a string that can be generated using the grammar.
S => E => E + T => T + T => i + T => i + i
Ambiguous grammars
A grammar is ambiguous if some sentence has more than one distinct parse tree.
Equivalently, a grammar is ambiguous if some sentence has more than one left-most derivation, or more than one right-most derivation.
S ::= E
E ::= i | ( E ) | E + E
Does i + i demonstrate the ambiguity?
E => E + E => i + E => i + i
Does i + i + i demonstrate the ambiguity?
E => E + E => i + E => i + E + E => i + i + E => i + i + i
E => E + E => E + E + E => i + E + E => i + i + E => i + i + i
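The two questions can be checked by counting parse trees. The sketch below counts the distinct parse trees that derive a token sequence from E in the grammar E ::= i | ( E ) | E + E. It is written specifically for this grammar, and the function name is invented for the example.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Number of distinct parse trees deriving tokens from E,
    where E ::= i | ( E ) | E + E."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of parse trees for the slice tokens[lo:hi].
        n = 0
        if hi - lo == 1 and tokens[lo] == "i":                          # E ::= i
            n += 1
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            n += count(lo + 1, hi - 1)                                  # E ::= ( E )
        for k in range(lo + 1, hi - 1):                                 # E ::= E + E
            if tokens[k] == "+":
                n += count(lo, k) * count(k + 1, hi)
        return n

    return count(0, len(tokens))


for sentence in ["i+i", "i+i+i", "(i+i)+i"]:
    print(sentence, count_parse_trees(sentence))
```

A count above 1 for any sentence is enough to show that the grammar is ambiguous. For `i+i+i` the two trees correspond to the two left-most derivations above: one groups the input as i + (i + i), the other as (i + i) + i.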
Augmented grammars
We augment grammars to ensure that we can recognize and handle the end of the input string
S ::= E
E ::= i | ( E ) | E + E
is augmented to
S’ ::= S $
S ::= E
E ::= i | ( E ) | E + E
Here $ denotes the end-of-file token
Nullable, First sets (starter sets), and Follow sets
• A non-terminal is nullable if it derives the empty string
• First(N) or starters(N) is the set of all terminals that can begin a sentence derived from N
• Follow(N) is the set of terminals that can follow N in some sentential form
Next we will see algorithms to compute each of these.
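As a rough preview, all three can be computed together by a fixed-point iteration: keep applying the defining rules until nothing changes. The sketch below is one common formulation and not necessarily the algorithm presented in the course. The grammar encoding (a dict from non-terminal to a list of alternatives, with the empty list standing for the empty string) and the function name `analyse` are assumptions of this example.

```python
def analyse(grammar):
    """Return (nullable, first, follow) for a grammar given as
    {nonterminal: [alternative, ...]}, each alternative a list of symbols.
    Any symbol that is not a key of the dict is treated as a terminal."""
    nonterminals = set(grammar)
    nullable = set()
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}

    def first_of(symbols):
        """First set of a symbol sequence, and whether it can derive empty."""
        result = set()
        for sym in symbols:
            if sym not in nonterminals:
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True

    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                f, empty = first_of(rhs)
                if empty and lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
                if not f <= first[lhs]:
                    first[lhs] |= f
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in nonterminals:
                        # What can follow sym: starters of the rest of the rule,
                        # plus Follow(lhs) if the rest can vanish.
                        f, empty = first_of(rhs[i + 1:])
                        new = f | (follow[lhs] if empty else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    return nullable, first, follow


# Augmented version of the grammar S ::= E, E ::= T | E + T, T ::= i | ( E )
grammar = {
    "S'": [["S", "$"]],
    "S":  [["E"]],
    "E":  [["T"], ["E", "+", "T"]],
    "T":  [["i"], ["(", "E", ")"]],
}
nullable, first, follow = analyse(grammar)
print("nullable:", nullable)
print("First(E):", sorted(first["E"]))
print("Follow(E):", sorted(follow["E"]))
```

The loop terminates because the sets only ever grow and are bounded by the finite set of grammar symbols.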