
Programming Languages, Third Edition

Chapter 6: Syntax


Objectives

• Understand the lexical structure of programming languages

• Understand context-free grammars and BNFs

• Become familiar with parse trees and abstract syntax trees

• Understand ambiguity, associativity, and precedence

• Learn to use EBNFs and syntax diagrams


Objectives (cont’d.)

• Become familiar with parsing techniques and tools

• Understand lexics vs. syntax vs. semantics

• Build a syntax analyzer for TinyAda


Introduction

• Syntax is the structure of a language

• 1950s: Noam Chomsky developed the idea of context-free grammars

• John Backus and Peter Naur developed a notational system for describing these grammars, now called Backus-Naur forms, or BNFs
  – First used to describe the syntax of Algol60

• Every modern computer scientist needs to know how to read, interpret, and apply BNF descriptions of language syntax


Introduction (cont’d.)

• Three variations of BNF:
  – Original BNF
  – Extended BNF (EBNF)
  – Syntax diagrams


Lexical Structure of Programming Languages

• Lexical structure: the structure of the tokens, or words, of a language
  – Related to, but different from, the syntactic structure

• Scanning phase: the phase in which a translator collects sequences of characters from the input program and forms them into tokens

• Parsing phase: the phase in which the translator processes the tokens, determining the program’s syntactic structure


Lexical Structure of Programming Languages (cont’d.)

• Tokens generally fall into several categories:
  – Reserved words (or keywords)
  – Literals or constants
  – Special symbols, such as ";", "<=", or "+"
  – Identifiers

• Predefined identifiers: identifiers that have been given an initial meaning for all programs in the language but are capable of redefinition

• Principle of longest substring: the scanner always collects the longest possible string of nonblank characters into a single token (e.g., "<=" is scanned as one token, not "<" followed by "=")


Reserved words versus predefined identifiers

• Reserved words cannot be used as the name of anything in a definition (i.e., as an identifier).

• Predefined identifiers have special meanings, but they can be redefined (although they probably shouldn't be).

• Examples of predefined identifiers in Java: anything in the java.lang package, such as String, Object, System, and Integer.


Java reserved words

• The keywords const and goto are reserved, even though they are not currently used. This may allow a Java compiler to produce better error messages if these C++ keywords incorrectly appear in programs.

• While true and false might appear to be keywords, they are technically boolean literals. Similarly, while null might appear to be a keyword, it is technically the null literal.


Lexical Structure of Programming Languages (cont’d.)

• Token delimiters (or white space): formatting that affects the way tokens are recognized

• Indentation can be used to determine structure

• Free-format language: one in which format has no effect on program structure other than satisfying the principle of longest substring

• Fixed format language: one in which all tokens must occur in prespecified locations on the page

• Tokens can be formally described by regular expressions


White space and comments

• “Internal” tokens of the scanner that are matched and discarded

• Typical white space: newlines, tabs, spaces

• Comments:
  – /* … */ and // … \n (C, C++, Java)
  – -- … \n (Ada, Haskell)
  – (* … *) (Pascal, ML)
  – ; … \n (Scheme)

• Comments are generally not nested.

• Comments and white space are ignored except that they function as delimiters (or separators).


Token delimiters (white space)

• The principle of longest substring requires the use of white space

• FORTRAN violates this token convention
  – White space is suppressed (blanks are not significant), so:
    • DO 99 I = 1.10   is read as the assignment   DO99I = 1.10
    • DO 99 I = 1, 10  is a loop, equivalent to    for (I = 1; I <= 10; I++)
  – There are no reserved words, e.g.:
      IF = 2
      IF (IF.LT.0) IF = IF + 1
      ELSE IF = IF + 2


Lexical Structure of Programming Languages (cont’d.)

• Three basic patterns of characters in regular expressions:
  – Concatenation: done by sequencing the items
  – Repetition: indicated by an asterisk after the item to be repeated
  – Choice, or selection: indicated by a vertical bar between items to be selected

• [ ] with a hyphen indicate a range of characters

• ? indicates an optional item

• Period indicates any character


Example

• Example: (a|b)*c
  – Matches: ababaac, aac, babbc, c
  – Does not match: aaaab, bca


Lexical Structure of Programming Languages (cont’d.)

• Examples:
  – Integer constants of one or more digits
  – Unsigned floating-point literals
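The regular expressions for these two token classes appear only as figures in the original slides; a plausible sketch in the notation described above (an assumption, the exact form may differ) is:

  integer         =  [0-9][0-9]*                     (a digit followed by zero or more digits)
  unsigned-float  =  [0-9][0-9]* "." [0-9][0-9]*     (the quoted "." stands for a literal decimal point)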

• Most modern text editors use regular expressions in text searches

• Utilities such as lex can automatically turn a regular expression description of a language’s tokens into a scanner


Lexical Structure of Programming Languages (cont’d.)

• [Figure: a simple scanner input specification and the output it produces]


How to write a scanner

• The main() function uses a loop to read characters from the input stream sequentially

• Use a branching statement to process a selection structure

• Use a looping statement to process a repetition structure

• An example is given in the book, Figure 6.1


Example: (a|b)*c

#include <stdio.h>

int main(void)
{
    int c = getchar();                       /* priming read */
    while (c != '\n' && c != EOF) {
        while (c == 'a' || c == 'b')         /* (a|b)* */
            c = getchar();
        if (c == 'c')                        /* c */
            printf("token recognized!\n");
        else
            printf("token unrecognizable!\n");
        c = getchar();                       /* advance past the character just examined */
    }
    return 0;
}


Lab 1

• Write a scanner to recognize tokens in a declaration

• Complete scanner.cpp (follow the link in the course schedule)


Context-Free Grammars and BNFs

• Example: simple grammar

• → separates left and right sides

• | indicates a choice
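The simple grammar itself appears as a figure in the original slides; a representative grammar of the same kind (an illustrative assumption, not necessarily the exact one shown) is:

  sentence → noun-phrase verb-phrase .
  noun-phrase → article noun
  article → a | the
  noun → girl | dog
  verb-phrase → verb noun-phrase
  verb → sees | pets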


Context-Free Grammars and BNFs (cont’d.)

• Metasymbols: symbols used to describe the grammar rules

• Some notations use angle brackets and pure text metasymbols
  – Example (in that notation): <sentence> ::= <noun-phrase> <verb-phrase>

• Derivation: the process of building a string (sentence) in the language by beginning with the start symbol and replacing left-hand sides by choices of right-hand sides in the rules
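For instance, using the representative grammar sketched above, one derivation (replacing the leftmost nonterminal at each step) is:

  sentence ⇒ noun-phrase verb-phrase .
           ⇒ article noun verb-phrase .
           ⇒ the noun verb-phrase .
           ⇒ the girl verb-phrase .
           ⇒ the girl verb noun-phrase .
           ⇒ the girl sees noun-phrase .
           ⇒ the girl sees article noun .
           ⇒ the girl sees a noun .
           ⇒ the girl sees a dog .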


Context-Free Grammars and BNFs (cont’d.)

• Some problems with this simple grammar:
  – A legal sentence does not necessarily make sense
  – Positional properties (such as capitalization at the beginning of the sentence) are not represented
  – Grammar does not specify whether spaces are needed
  – Grammar does not specify input format or termination symbol


Context-Free Grammars and BNFs (cont’d.)

• Context-free grammar: consists of a series of grammar rules

• Each rule has a single phrase structure name on the left, then a metasymbol, followed by a sequence of symbols or other phrase structure names on the right

• Nonterminals: names for phrase structures, since they are broken down into further phrase structures

• Terminals: words or token symbols that cannot be broken down further


Context-Free Grammars and BNFs (cont’d.)

• Productions: another name for grammar rules
  – Typically there are as many productions in a context-free grammar as there are nonterminals

• Backus-Naur form: uses only the metasymbols "→" and "|"

• Start symbol: a nonterminal representing the entire top-level phrase being defined

• Language of the grammar: the set of strings of terminals that can be derived from the start symbol; this is the language defined by the context-free grammar


Context-Free Grammars and BNFs (cont’d.)

• A grammar is context-free when nonterminals appear singly on the left sides of productions
  – There is no context under which only certain replacements can occur

• Anything not expressible using context-free grammars is a semantic, not a syntactic, issue

• BNF form of language syntax makes it easier to write translators

• Parsing stage can be automated


Context-Free Grammars and BNFs (cont’d.)

• Rules can express recursion
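For example, the rules for numbers used later in these slides are recursive in their first alternative: a number is either a single digit or a (smaller) number followed by one more digit:

  number → number digit | digit
  digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9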


Parse Trees and Abstract Syntax Trees

• Syntax establishes structure, not meaning
  – But meaning is related to syntax

• Syntax-directed semantics: the process of associating the semantics of a construct with its syntactic structure
  – Must construct the syntax so that it reflects the semantics to be attached later

• Parse tree: graphical depiction of the replacement process in a derivation


Parse Trees and Abstract Syntax Trees (cont’d.)

• Nodes that have at least one child are labeled with nonterminals

• Leaves (nodes with no children) are labeled with terminals

• The structure of a parse tree is completely specified by the grammar rules of the language and a derivation of the sequence of terminals

• All terminals and nonterminals in a derivation are included in the parse tree


Parse Trees and Abstract Syntax Trees (cont’d.)

• Not all terminals and nonterminals are needed to determine completely the syntactic structure of an expression or sentence


Parse Trees and Abstract Syntax Trees (cont’d.)

• Abstract syntax trees (or syntax trees): trees that abstract the essential structure of the parse tree
  – Do away with terminals that are redundant

• Example: see Example 1 and Example 2 below


Parse Trees and Abstract Syntax Trees (cont’d.)

• Can write out rules for abstract syntax similar to BNF rules, but they are of less interest to a programmer

• Abstract syntax is important to a language designer and translator writer

• Concrete syntax: ordinary syntax


Example 1: the number 234

Parse tree (grammar: number → number digit | digit), annotated with computed values:

  number                        (value = 23 * 10 + 4 = 234)
    number                      (value = 2 * 10 + 3 = 23)
      number                    (value = 2)
        digit                   (derives 2)
      digit                     (derives 3)
    digit                       (derives 4)

Abstract syntax tree: the nonterminal labels and unit productions are removed; only the digits 2, 3, and 4 remain, and the same values are computed at the corresponding nodes: 2, then 2 * 10 + 3 = 23, then 23 * 10 + 4 = 234.

Example 2: the expression (2 + 3) * 4

Parse tree, annotated with computed values:

  expr                          (value = 5 * 4 = 20)
    expr
      (
      expr                      (value = 2 + 3 = 5)
        expr                    (derives 2 through number and digit)
        +
        expr                    (derives 3 through number and digit)
      )
    *
    expr                        (derives 4 through number and digit)

Abstract syntax tree:

  *                             (value = 5 * 4 = 20)
    +                           (value = 2 + 3 = 5)
      2
      3
    4


Ambiguity, Associativity, and Precedence

• Two different derivations can lead to the same parse tree or to different parse trees

• Ambiguous grammar: one for which two distinct parse or syntax trees are possible

• Example: derivation for 234 given earlier
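As a sketch of the problem (using the ambiguous rule expr → expr + expr | expr * expr | ( expr ) | number, which reappears in the precedence discussion below), the string 2 + 3 * 4 has two distinct parse trees:

  Tree 1, grouping as 2 + (3 * 4), value 14:

    expr
      expr (2)
      +
      expr
        expr (3)
        *
        expr (4)

  Tree 2, grouping as (2 + 3) * 4, value 20:

    expr
      expr
        expr (2)
        +
        expr (3)
      *
      expr (4)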


Ambiguity, Associativity, and Precedence (cont’d.)

• Certain derivations that are constructed in a fixed order correspond to unique parse trees

• Leftmost derivation: the leftmost remaining nonterminal is singled out for replacement at each step
  – Each parse tree has a unique leftmost derivation

• Ambiguity of a grammar can be tested by searching for two different leftmost derivations


Ambiguity, Associativity, and Precedence (cont’d.)

• Ambiguous grammars present difficulties
  – Must either revise the grammar to remove the ambiguity or state a disambiguating rule

• The usual way to revise the grammar is to write a new grammar rule (called a "term") that establishes a precedence cascade


Precedence

• Establish a "precedence cascade" to force the matching of different operators at different points in the parse tree

• e.g.,
  1. expr → expr + expr | term
  2. term → term * term | (expr) | number
  3. number → number digit | digit
  4. digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
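With this cascade, an unparenthesized * can only be generated inside a term, so 2 + 3 * 4 must group as 2 + (3 * 4): the higher precedence of * over + is built into the grammar itself.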

Associativity

• Associativity: the order in which two or more successive operators at the same precedence level are evaluated

• Left-associativity:
  – Parse 3+4+5 as (3+4)+5

• Right-associativity:
  – Parse 3+4+5 as 3+(4+5)

Two parse trees

• [Figure: the two possible parse trees for 3+4+5, corresponding to left- and right-associativity]

Right- and left-recursive rules

• Right-recursive:
  – expr → term + expr
  – Right recursion causes right-associativity

• Left-recursive:
  – expr → expr + term
  – Left recursion causes left-associativity


Solution shouldn’t change the language

• Fully-parenthesized expressions:
    expr → ( expr + expr ) | ( expr * expr ) | NUMBER
  so:  ((2 + 3) * 4)
  and: (2 + (3 * 4))

• Prefix expressions:
    expr → + expr expr | * expr expr | NUMBER
  so:  * + 2 3 4
  and: + 2 * 3 4


Scheme uses prefix form, but keeps the parentheses. Why?

• Scheme allows any number of arguments to the arithmetic operators:
    expr → ( op exprlist ) | NUMBER
    exprlist → exprlist expr | empty
    empty →                         (empty right-hand side)
  so: (+), (+ 1), (+ 1 2 3 4), etc.
  [- and / require at least one argument]


Unary Minus

• Add unary - (negation) so that at most one unary minus is allowed in each expression, and it must come at the beginning of the expression:
  – -2 + 3 is legal (and equals 1)
  – -2 + (-3) is legal
  – -2 + -3 is not legal

• Answer:
    expr → expr + term | term | - term
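As a quick check (assuming the term rule from the precedence cascade above): -2 + 3 derives via expr ⇒ expr + term ⇒ - term + term, while -2 + -3 is rejected because a term cannot begin with a minus sign, and -2 + (-3) is still legal because the parenthesized operand is again a full expr.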


EBNFs and Syntax Diagrams

• Extended Backus-Naur form (or EBNF): introduces new notation to handle common issues

• Use curly braces to indicate 0 or more repetitions
  – Assumes that any operator involved in a curly-bracket repetition is left-associative
  – Example: expr → term { + term }

• Use square brackets to indicate optional parts
  – Example: if-statement → if ( expression ) statement [ else statement ]


Optional part

• label-declaration-part → label label-list ';' | empty

• label-declaration-part → [ label label-list ';' ]

• Right associativity:
  – expr → term @ expr | term
  – expr → term [ @ expr ]


Notes on use of EBNF

• Use {…} only for left-recursive rules:
    expr → term + expr | term
  should become
    expr → term [ + expr ]

• Do not start a rule with {…}: write
    expr → term { + term }
  not
    expr → { term + } term

• Exception to the previous rule: simple token repetition, e.g.
    expr → { - } term …

• Square brackets can be used anywhere, however:
    expr → expr + term | term | unaryop term
  should be written as
    expr → [ unaryop ] term { + term }


EBNFs and Syntax Diagrams (cont’d.)

• Syntax diagram: indicates the sequence of terminals and nonterminals encountered in the right-hand side of the rule


EBNFs and Syntax Diagrams (cont’d.)

• Use circles or ovals for terminals, and squares or rectangles for nonterminals
  – Connect them with lines and arrows indicating the appropriate sequencing

• Can condense several rules into one diagram

• Use loops to indicate repetition


Parsing Techniques and Tools

• A grammar written in BNF, EBNF, or syntax diagrams describes the strings of tokens that are syntactically legal
  – It also describes how a parser must act to parse correctly

• Recognizer: accepts or rejects strings based on whether they are legal strings in the language

• Bottom-up parser: constructs derivations and parse trees from the leaves to the roots
  – Matches the input against the right side of a rule and reduces it to the nonterminal on the left


Parsing Techniques and Tools (cont’d.)

• Bottom-up parsers are also called shift-reduce parsers
  – They shift tokens onto a stack prior to reducing strings to nonterminals

• Top-down parser: expands nonterminals to match incoming tokens and directly construct a derivation

• Parser generator: a program that automates top-down or bottom-up parsing

• Bottom-up parsing is the preferred method for parser generators (also called compiler compilers)


Parsing Techniques and Tools (cont’d.)

• Recursive-descent parsing: turns nonterminals into a group of mutually recursive procedures based on the right-hand sides of the BNFs
  – Tokens are matched directly against input tokens as constructed by a scanner
  – Nonterminals are interpreted as calls to the procedures corresponding to those nonterminals


Routine match()

void match(TokenType expected)
{
    if (token == expected)
        getToken();            /* accept the token and read the next one */
    else
        error();
}


Points to notice

• getToken is called whenever a new token is needed

• A call to getToken must precede the first call to the sentence procedure

• Errors need only be detected when actual tokens are expected


Parsing Techniques and Tools (cont’d.)

• Left-recursive rules may present problems in a recursive-descent parser
  – Example: expr → expr + term | term
  – May cause an infinite recursive loop
  – No way to decide which of the two choices to take until a + is seen

• The EBNF description expresses the recursion as a loop:
    expr → term { + term }

• Thus, curly brackets in EBNF represent left recursion removal by the use of a loop


Example

• Left-recursive rules cannot be turned directly into a recursive-descent procedure

• e.g., expr → expr + term | term

  void expr()
  {
      expr();                 /* immediate recursive call with no input consumed: infinite recursion */
      if (token == "+") {
          match("+");
          term();
      }
  }


Use EBNF to remove left-recursion

• Use the EBNF rule: expr → term { + term }

  void expr(void)
  {
      term();
      while (token == "+") {
          match("+");
          term();
          /* perform left-associative operations here */
      }
  }


Parsing Techniques and Tools (cont’d.)

• Code for a right-recursive rule such as:
    expr → term @ expr | term

• This corresponds to the use of square brackets in EBNF:
    expr → term [ @ expr ]
  – This process is called left-factoring

• In both left-recursive and left-factoring situations, EBNF rules or syntax diagrams correspond naturally to the code of a recursive-descent parser


Right-recursive rules

• Rewriting causes no problem:
    expr → term @ expr | term

  void expr(void)
  {
      term();
      if (token == "@") {
          match("@");
          expr();             /* the rule is right-recursive, so expr calls itself last */
      }
  }


Optional constructs

• Rewriting is not possible for choices that begin with the same prefix:
    if-statement → if ( expression ) statement
                 | if ( expression ) statement else statement

• Left-factoring: use [ ] to factor out the common prefix:
    if-statement → if ( expression ) statement [ else statement ]


Recursive-descent code for <if-statement>

void ifStatement()
{
    match("if");
    match("(");
    expression();
    match(")");
    statement();
    if (token == "else") {     /* optional else part */
        match("else");
        statement();
    }
}


Parsing Techniques and Tools (cont’d.)

• Single-symbol lookahead: using a single token to direct a parse

• Predictive parser: a parser that commits itself to a particular action based only on the lookahead

• Grammar must satisfy certain conditions to make this decision-making process work
  – Parser must be able to distinguish between the choices in a rule
  – For an optional part, no token beginning the optional part can also come after the optional part


Conditions

• For A ::= α1 | α2 | … | αn:
    First(αi) ∩ First(αj) = ∅ for all i ≠ j,
  where First(αi) (the first set) is the set of tokens that can begin the string αi

• For A ::= [α] or A ::= {α}:
    First(α) ∩ Follow(α) = ∅,
  where Follow(α) (the follow set) is the set of tokens that can follow α


Selection example

  factor → ( expr ) | number
  number → digit { digit }
  digit → 0 | 1 | … | 9

  First( '(' expr ')' ) = { '(' }
  First( number ) = First( digit ) = First(0) ∪ First(1) ∪ … ∪ First(9) = { 0, …, 9 }
  First( factor ) = First( '(' expr ')' ) ∪ First( number ) = { '(', 0, …, 9 }

  The two First sets are disjoint, so a predictive parser can choose between the alternatives of factor by looking at a single token.

Optional Part Example

• [Figures: two worked examples of the optional-part condition appear here in the original slides]

Option example (3)

• declaration → declaration-specifiers [ init-declaration-list ] ';'

• First(init-declaration-list) = { ID, *, '(' }

• Follow(init-declaration-list) = { ';', ',' }

• First(init-declaration-list) ∩ Follow(init-declaration-list) = ∅, so the optional part satisfies the predictive-parsing condition


Parsing Techniques and Tools (cont’d.)

• YACC: a widely used parser generator
  – The freeware version is called Bison
  – Generates a C program that uses a bottom-up algorithm to parse the grammar

• YACC generates a procedure yyparse from the grammar, which must be called from a main procedure

• YACC assumes that tokens are recognized by a scanner procedure called yylex, which must be provided


The format of YACC specification

%{ /* code to insert at the beginning of the parser */

%}

/* other YACC definitions, if necessary */

%%

/* grammar and associated actions */

%%

/* auxiliary procedures */

An example

• [Figures: a complete YACC specification example, shown across three slides in the original]
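The worked example above appears only as figures in the original slides. As an illustration only (a minimal sketch, not the book's example), a YACC/Bison specification for a small expression grammar, with single digits serving as NUMBER tokens, could look roughly like this:

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *msg);
%}

%token NUMBER
%left '+'
%left '*'

%%

line : expr '\n'         { printf("result = %d\n", $1); }
     ;

expr : expr '+' expr     { $$ = $1 + $3; }
     | expr '*' expr     { $$ = $1 * $3; }
     | '(' expr ')'      { $$ = $2; }
     | NUMBER            { $$ = $1; }
     ;

%%

/* a very small yylex: single digits are NUMBER tokens; everything else
   (+, *, parentheses, newline) is returned as itself */
int yylex(void)
{
    int c = getchar();
    while (c == ' ' || c == '\t')
        c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return NUMBER;
    }
    return (c == EOF) ? 0 : c;
}

void yyerror(const char *msg)
{
    printf("%s\n", msg);
}

int main(void)
{
    return yyparse();
}

The %left declarations give * higher precedence than + (later declarations bind tighter), which is YACC's disambiguating-rule alternative to rewriting the grammar as a precedence cascade.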

Scanner generator

• LEX/Flex is a scanner generator, used on Unix machines

• Generates the yylex procedure, which can be used with yyparse, the parser generated by YACC/Bison
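As a sketch only (assuming the classic yacc -d conventions, where the token codes are written to y.tab.h), a Flex specification that supplies yylex, as an alternative to the hand-written yylex in the sketch above, might look like this; it recognizes multi-digit numbers and passes every other character through to the parser:

%{
#include <stdlib.h>
#include "y.tab.h"      /* token codes (e.g., NUMBER) generated by yacc -d / bison -y */
%}
%option noyywrap

%%
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
[ \t]     { /* skip blanks */ }
\n|.      { return yytext[0]; }
%%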


Lexics vs. Syntax vs. Semantics

• The division between lexical and syntactic structure is not fixed: a number can be a token or defined by a grammar rule.

• The implementation can often decide (scanners are faster, but parsers are more flexible).

• Division between syntax and semantics is in some ways similar: some authors call all static structure "syntax".

• Our view: if it isn't in the grammar (or the disambiguating rules), it's semantics.


Lexics vs. Syntax vs. Semantics (cont’d)

• Specific details of formatting, such as white-space conventions, are left to the scanner
  – Need to be stated as lexical conventions separate from the grammar

• Also desirable to allow a scanner to recognize structures such as literals, constants, and identifiers
  – Faster and simpler, and reduces the size of the parser

• Must rewrite the grammar to express the use of a token rather than a nonterminal representation


Lexics vs. Syntax vs. Semantics (cont’d.)

• Example: a number should be a token, written as NUMBER in the grammar rules
  – Uppercase indicates it is a token whose structure is determined by the scanner

• Lexics: the lexical structure of a programming language


Lexics vs. Syntax vs. Semantics (cont’d.)

• Some rules are context-sensitive and cannot be written as context-free rules

• Examples:
  – Declaration before use for variables
  – No redeclaration of identifiers within a procedure

• These are semantic properties of a language

• Another conflict occurs between predefined identifiers and reserved words
  – Reserved words cannot be used as identifiers
  – Predefined identifiers can be redefined in a program


Lexics vs. Syntax vs. Semantics (cont’d.)

• Syntax and semantics can become interdependent when semantic information is required to distinguish ambiguous parsing situations


Case Study: Building a Syntax Analyzer for TinyAda

• TinyAda: a small language that illustrates the syntactic features of many high-level languages

• TinyAda includes several kinds of declarations, statements, and expressions

• Rules for declarations, statements, and expressions are indirectly recursive, allowing for nested declarations, statements, and expressions

• Parsing shell: applies the grammar rules to check whether tokens are of the correct types
  – Later, we will add mechanisms for semantic analysis
