An Introduction to Grammars and Parsing

Hayo Thielecke
University of Birmingham
www.cs.bham.ac.uk/~hxt

23 November 2010

Outline of the parsing part of the module

1 Grammars and Regular Expressions

2 Formal definition of grammars, derivations, and parsers

3 From grammars to parsing methods

4 From grammars to parse trees as data structures

5 Parser generators

Recall Regular Expression matching

A regular expression

(a | b c)∗

It matches:

a

bc

The empty string

abc

bca

bcabcbcbcaaaaaaaaabc

and so on: there are infinitely many strings that match

Intuition: reg exps are like while and if

A regular expression

(a | b c)∗

while (randomBool()) {
    if (randomBool()) {
        print('a');
    } else {
        print('b');
        print('c');
    }
}
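For concreteness, here is a minimal runnable Java version of the sketch above (the class name RegExpDemo and the randomBool helper are my own; the slides leave them abstract). Every string it prints matches (a | b c)∗.

import java.util.Random;

// Runnable version of the sketch above: each loop iteration appends
// either "a" or "bc", so the output always matches (a | b c)*.
public class RegExpDemo {
    private static final Random RNG = new Random();

    private static boolean randomBool() { return RNG.nextBoolean(); }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        while (randomBool()) {
            if (randomBool()) {
                out.append('a');
            } else {
                out.append("bc");
            }
        }
        System.out.println(out);   // e.g. "abcbca", or the empty string
    }
}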

Grammars also describe patterns

A context-free grammar for the Dyck language

D → [D ]D (1)

D → (2)

The grammar generates all well-bracketed strings:

The empty string

[ ]

[ ] [ ]

[ [ ] ]

[ ] [ [ ] ] [ ]

and so on: there are infinitely many strings that are generated

Intuition: grammars are like recursive methods

A context-free grammar for the Dyck language

D → [D ]D (1)

D → (2)

void D()
{
    switch (randomBool()) {
    case true:
        print('['); D(); print(']'); D(); break;
    case false:
        break;
    }
}
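Again, a minimal runnable Java version of the sketch (my own class and helper names; note that real Java cannot switch on a boolean, so the runnable version uses if, and the coin is biased towards stopping so that the output stays short). Every string it prints is well bracketed.

import java.util.Random;

// Runnable version of the sketch above. The recursive branch corresponds to
// rule D -> [ D ] D; doing nothing corresponds to rule D -> (empty).
public class DyckGenerator {
    private static final Random RNG = new Random();

    static void D() {
        if (RNG.nextInt(3) == 0) {       // recurse with probability 1/3 only
            System.out.print('[');       // rule D -> [ D ] D
            D();
            System.out.print(']');
            D();
        }                                // else rule D -> (empty): print nothing
    }

    public static void main(String[] args) {
        D();                             // prints some random well-bracketed string
        System.out.println();
    }
}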

Grammars and brackets

Grammars are good at expressing various forms of bracketing structure. Dyck languages with enough kinds of brackets [1, . . . , [n can simulate stacks and hence all other context-free languages.

D →
D → [j D ]j D        (one such rule for each kind of bracket j)

XML is essentially a Dyck language:

D → Fnord

D → <tag1> D </tag1> D

D → <tag2> D </tag2> D

...

Brackets and Regular Expressions

A regular expression for brackets

((\[)*(\])*)*

Note: the square bracket [ is escaped as \[. The round parentheses and the Kleene star

( . . . )*

are not escaped.

All strings of matching brackets are generated this way

Example: [ ] [ [ ] ] [ ]

But it overgenerates

Example junk: ] ] ] [
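A quick way to check this overgeneration concretely (a minimal snippet I added, using java.util.regex; it is not from the slides): the pattern accepts the junk string as happily as the well-bracketed one, because it never checks that the brackets match up.

import java.util.regex.Pattern;

// The regular expression from the slide accepts any string of [ and ],
// well bracketed or not.
public class BracketRegexDemo {
    public static void main(String[] args) {
        String regex = "((\\[)*(\\])*)*";
        System.out.println(Pattern.matches(regex, "[][[]][]"));  // true: well bracketed
        System.out.println(Pattern.matches(regex, "]]]["));      // true: junk
    }
}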

Motivation for grammars

Write a program that reads strings like this and evaluates them:

(2+3*(2-3-4))*2

In particular, brackets and precedence must be handled correctly (* binds more tightly than +).

If you attempt brute-force hacking, you may end up with something that is inefficient, incorrect, or both.

This sort of problem was a major challenge in compiling in the 1950s and 60s.

Parsing technology makes it straightforward.

Moreover, the techniques scale up to more realistic problems.

Why do you need to learn about grammars?

Grammars are widespread in programming.

XML is based on grammars and needs to be parsed.

Knowing grammars makes it much easier to learn the syntax of a new programming language.

Powerful tools exist for parsing (e.g., yacc, bison, ANTLR, SableCC, . . . ). But you have to understand grammars to use them.

Grammars give us examples of some more advanced object-oriented programming: the Composite Pattern and polymorphic methods.

You may need parsing in your final-year project for reading complex input.

What do we need in a grammar

Some symbols can occur in the actual syntax

We also need other symbols that act as placeholders

Rules then say how to replace the placeholders

E.g. in Java, placeholders include “expression”, “statement”, etc.

Grammars: formal definition

A context-free grammar consists of

some terminal symbols a, b, . . . , +, ),. . .

some non-terminal symbols A, B, S ,. . .

a distinguished non-terminal start symbol S

some rules of the form

A → X1 . . . Xn

where n ≥ 0, A is a non-terminal, and the Xi are symbols.

Notation: Greek letters

Mathematicians and computer scientists are inordinately fond of Greek letters.

α alpha

β beta

γ gamma

ε epsilon

Notational conventions for grammars

We will use Greek letters α, β, . . . , to stand for strings of symbols that may contain both terminals and non-terminals.

In particular, ε is used for the empty string (of length 0).

We will write A, B, . . . for non-terminals.

Terminal symbols are usually written in typewriter font, like for, while, [.

These conventions are handy once you get used to them and are found in most books.

Derivations

If A → α is a rule, we can replace A by α for any strings β and γ on the left and right:

β A γ ⇒ β α γ

This is one derivation step.

A string w consisting only of terminal symbols is generated by the grammar if there is a sequence of derivation steps leading to it from the start symbol S:

S ⇒ · · · ⇒ w

An example derivation

Recall the rules

D → [D ]D (1)

D → (2)

There is a unique leftmost derivation for each string in the language. For example, we derive [ ] [ ] as follows:

D

⇒ [D ]D

⇒ [ ]D

⇒ [ ] [D ]D

⇒ [ ] [ ]D

⇒ [ ] [ ]

The language of a grammar

In this context, a language is a set of strings (of terminal symbols).

A language is called context-free if it is the language of some context-free grammar.

For a useful grammar, there are usually infinitely many strings in its language (e.g., all Java programs).

Wilhelm von Humboldt wrote (c. 1830) that language “makes infinite use of finite means”.

Two different grammars can define the same language. Sometimes we may redesign the grammar, as long as the language remains the same.

Two grammars for the same language

Recall

D → [D ]D (1)

D → (2)

The usual presentation is easier to understand, but not suitable for parsing:

D → [ D ]

D → D D

D →

Grammars can encode regular expressions

The alternative α | β can be expressed as follows:

A → α

A → β

Repetition α∗ can be expressed as follows:

A → αA

A →

Hence we can use | and ∗ in grammars; this notation is sometimes called BNF, for Backus-Naur form. For example, (a | b c)∗ can be written as A → a A | b c A | ε. Regular expressions are still useful, as they are simple and efficient (as in grep and lex).

Grammar notations

Warning: notations differ and clash. In reg exps, brackets are used for character classes, e.g.

[A-D] means A | B | C | D

In BNF, brackets can stand for option:

[α] means α | ε

Sometimes {α} is written for α∗, e.g. in Sun’s Java grammar. Other notation: → may be written as “::=” or “:”.

Nesting in syntax

Specifically, we need to be very careful about bracketing and nesting. Compare:

while(i < n)
  a[i] = 0;
  i = i + 1;

and

while(i < n)
{
  a[i] = 0;
  i = i + 1;
}

These snippets look very similar. But their difference becomes clear when they are indented properly, which requires parsing.

Example from programming language syntax

Some rules for statements S in Java or C:

S → if (E ) S else S

S → while (E ) S

S → V = E;

S → { B }

B → S B

B →

Here V is for variables and E is for expressions.

E → E - 1

E → (E )

E → 1

E → E == 0

V → foo

A Java grammar rule from Sun

This is one of more than 60 grammar rules for the syntax of Java:

Primary:

( Expression )

this [Arguments]

super SuperSuffix

Literal

new Creator

Identifier { . Identifier }[ IdentifierSuffix]

BasicType BracketsOpt .class

void.class

Note: the colon is like → and a new line is like |. More on http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html.

Example from English syntax

Subject-verb-object sentences:

E → S V O

S → D N

V → sat | spat

O → P D N

N → cat | mat

D → the | a

P → on | beside

Example sentence: the cat sat on the mat.

No Yoda sayings like “the mat the cat sat on”.

Definition of a parser

Suppose a grammar is given. A parser for that grammar is a program such that for any input string w:

If the string w is in the language of the grammar, the parser finds a derivation of the string. In our case, it will always be a leftmost derivation.

If string w is not in the language of the grammar, the parser must reject it, for example by raising a parsing exception.

The parser always terminates.

Recursion in grammars

A symbol may occur on the right-hand side of one of its rules:

E → (E )

We often have mutual recursion in grammars:

S → { B }

B → S B

Mutual recursion also exists in Java: for example, method f calls g and g calls f.

Recursion in grammars and in Java

Compare recursion in grammars, methods and classes:

T → . . .T . . .T . . .

int sumTree(Tree t)
{   ...
    ... return sumTree(t.left) + sumTree(t.right);
}

and classes

public class Tree
{   ...
    public Tree left;
    public Tree right;
}

Automata and grammars—what we are not going to do

Pushdown automata (stack machines) are covered in Models of Computation.

Independently of formal automata models, we can use a programming perspective: grammars give us classes or methods.

For experts: we let Java manage the stack for us by using methods

The parsing stack is part of the Java call stack

Parsing techniques

There are many different parsing technologies (LL(k), LR(1), LALR(1), . . . ).

Here we consider only predictive parsers, sometimes also called recursive descent parsers. They correspond to translating grammar rules into code as described below.

The hard part is choosing the rules according to the lookahead.

ANTLR works similarly, but with more lookahead.

Predictive parser

A predictive parser can be constructed from grammar rules.

The parser is allowed to “look ahead” in the input; based on what it sees there, it then makes predictions.

Canonical example: matching brackets. If the parser sees a [ as the next input symbol, it “predicts” that the input contains something in brackets.

More technically: switch on the lookahead; [ labels one case.

Overview: from grammars to Java code

Methods for processing the language follow the structure of the grammar:

Each non-terminal gives a method (with mutual recursion between such methods).

Each grammar gives us a Java class hierarchy (with mutual recursion between such classes).

Each word in the language of the grammar gives us an object of the class corresponding to the start symbol (a parse tree).

Recursive methods

From grammars to mutually recursive methods:

For each non-terminal A there is a method A. The method body is a switch statement that chooses a rule for A.

For each rule A → X1 . . . Xn, there is a branch in the switch statement. There are method calls for all the non-terminals among X1, . . . , Xn.

Each grammar gives us some recursive methods. For each derivation in the language, we have a sequence of method calls.

Parsing with lookahead

D → [D ]D (1)

D → (2)

We also need to know where else in the grammar a D could occur:

S → D $

Idea: suppose you are trying to parse a D. Look at the first symbol in the input:

if it is a [, use the first rule;

if it is a ] or $, use the second rule.

Translation to code (only recognizing)

void parseD() throws SyntaxError
{
    switch (lookahead()) {        // what is in the input?
    case '[':                     // If I have seen a [
        match('[');               // remove the [
        parseD();                 // now parse what is inside
        match(']');               // make sure there is a ]
        parseD();                 // now parse what follows
        break;                    // done in this case
    case ']': case '$':           // If I have seen a ] or $
        break;                    // just return
    default: throw new SyntaxError();
    }
}

How do we get the symbols for the case labels?

Parsing with lookahead is easy if every rule for a given non-terminal starts with a different terminal symbol.

In that case, the lookahead immediately tells us which rule to choose.

But what if not? The right-hand side could instead start with a non-terminal, or be the empty string.

More general methods for using the lookahead: the FIRST and FOLLOW construction.

FIRST and FOLLOW sets

We define FIRST, FOLLOW and nullable:

A terminal symbol b is in first(α) if there is a derivation

α⇒∗ b β

(b is the first symbol in something derivable from α).

A terminal symbol b is in follow(X) if there is a derivation

S ⇒∗ α X b γ

(b can appear immediately behind X in some derivation).

α is nullable if α ⇒∗ ε (we can derive the empty string from it).

FIRST and FOLLOW give the case labels

FIRST and FOLLOW give us the case labels for the branches of the switch statement.

A branch for A → α gets the labels in FIRST(α).

A branch for A → ε gets the labels in FOLLOW(A).

FIRST and FOLLOW are tedious to compute by hand. We won’t go into the details here.

Parser generators like ANTLR compute this sort of information automagically.

Case labels for the Dyck language

D → [D ]D (1)

D → (2)

For the rule beginning with the bracket, obviously [ is the case label. For the rule with the empty right-hand side, the FIRST set is empty, so its case labels come from FOLLOW(D): here ] and $.

A problem: left recursion

Left recursion is a problem for the simple parsers we have written, and for some parser generators, like ANTLR and JavaCC. Example:

E → E − E

E → 1

The symbol 1 is in the lookahead for both rules, so it cannot guide our choice between them. If you have this situation and you want to use ANTLR, you need to rewrite the grammar. There is a standard technique for eliminating left recursion, described in most compiler books. Left recursion is not a problem for LR parsers like yacc or SableCC.

Left recursion elimination example

Consider this natural-looking but ambiguous grammar:

E → E - E

E → (E )

E → 1 | 2 | 3 | . . .

We observe that E = F - . . . - F, where F → ( E ) | 1 | 2 | 3 | . . . (a recognizer sketch for the rewritten grammar follows after the rules).

E → F R

R → - F R

R →

F → ( E )

F → 1 | 2 | 3 | . . .
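To see why the rewritten grammar suits predictive parsing, here is a hedged, self-contained recognizer sketch in the style of the parseD recognizer above (the class name, the single-digit case for F, and the '$' end marker are my own assumptions, not from the slides):

// Recognizer for the rewritten (non-left-recursive) grammar:
//   E -> F R      R -> - F R | (empty)      F -> ( E ) | 1 | 2 | ...
// One symbol of lookahead is enough to pick the rule in each method.
class SyntaxError extends Exception { }

public class ExprRecognizer {
    private final String input;   // expression followed by '$', e.g. "(1-2)-3$"
    private int pos = 0;

    ExprRecognizer(String input) { this.input = input; }

    private char lookahead() { return input.charAt(pos); }

    private void match(char c) throws SyntaxError {
        if (lookahead() == c) { pos++; } else { throw new SyntaxError(); }
    }

    void parseE() throws SyntaxError {        // E -> F R
        parseF();
        parseR();
    }

    void parseR() throws SyntaxError {
        if (lookahead() == '-') {             // R -> - F R
            match('-');
            parseF();
            parseR();
        }                                     // else R -> (empty): just return
    }

    void parseF() throws SyntaxError {
        if (lookahead() == '(') {             // F -> ( E )
            match('(');
            parseE();
            match(')');
        } else if (Character.isDigit(lookahead())) {
            match(lookahead());               // F -> 1 | 2 | ...
        } else {
            throw new SyntaxError();
        }
    }

    public static void main(String[] args) throws SyntaxError {
        ExprRecognizer r = new ExprRecognizer("(1-2)-3$");
        r.parseE();
        r.match('$');
        System.out.println("accepted");
    }
}

With the original left-recursive grammar, parseE would have called itself immediately and looped forever; after the rewrite, every choice is decided by one symbol of lookahead.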

Lookahead and programming languages

Many constructs start with a keyword telling us immediately what it is. Some languages are easy to parse with lookahead, e.g. Lisp and Scheme. Idea: everything is a prefix operator.

E → (+ E E)

E → (* E E)

E → 1

Obvious with 2 symbols of lookahead. Can do with 1 by tweaking the grammar. C and Java are not so easy to parse (and C++ is worse). Any type name can begin a function definition, for instance.

The lookahead and match methods

A predictive parser relies on two methods for accessing the input string:

char lookahead() returns the next symbol in the input, without removing it.

void match(char c) compares the next symbol in the input to c. If they are the same, the symbol is removed from the input. Otherwise, parsing is stopped with an error; in Java, this can be done by throwing an exception.
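A minimal self-contained sketch of these two methods (my own class; the input is a plain String terminated by '$', and SyntaxError is just an ordinary exception), combined with the parseD recognizer from the earlier slide:

// Minimal input handling for a predictive parser: lookahead() peeks,
// match(c) consumes one symbol or fails with a SyntaxError.
class SyntaxError extends Exception { }

public class DyckParser {
    private final String input;   // e.g. "[[][]]$", with '$' marking the end
    private int pos = 0;

    DyckParser(String input) { this.input = input; }

    char lookahead() {
        return input.charAt(pos);             // peek without consuming
    }

    void match(char c) throws SyntaxError {
        if (lookahead() == c) {
            pos++;                            // consume the expected symbol
        } else {
            throw new SyntaxError();
        }
    }

    // The recognizer from the slide "Translation to code (only recognizing)".
    void parseD() throws SyntaxError {
        switch (lookahead()) {
        case '[':
            match('['); parseD(); match(']'); parseD();
            break;
        case ']': case '$':
            break;
        default:
            throw new SyntaxError();
        }
    }

    public static void main(String[] args) throws SyntaxError {
        DyckParser p = new DyckParser("[[][]]$");
        p.parseD();
        p.match('$');                         // make sure all input was consumed
        System.out.println("accepted");
    }
}

In a real compiler, these two methods would sit in front of a lexer rather than a raw String, as the later slide on parsers and lexers points out.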

Parsers and parse trees

Parse trees as such are abstract mathematical structures

We can represent them as data structures

Many modern tools give you such representations of parse trees

Example: DOM tree parsers versus streaming XML parsers.

ANTLR and SableCC build trees, yacc does not.

We will see how to represent them in an OO way

These classes give return types for the parsing methods

Parse trees abstractly

The internal nodes are labelled with non-terminals. If there is a rule A → X1 . . . Xn, then an internal node can have the label A and children X1, . . . , Xn. The root node of the whole tree is labelled with the start symbol. The leaf nodes are labelled with terminal symbols or ε.

[Diagram: a schematic parse tree, with the start symbol at the root, non-terminals such as A and B at the internal nodes, and terminals a, . . . , z and ε at the leaves.]

Example: parse trees

D → [D ]D (1)

D → (2)

Here is a parse tree for the string [ ]:

D
├── [
├── D
│   └── ε
├── ]
└── D
    └── ε

Parse trees and derivations

Parse trees and derivations are equivalent; they contain the same information.

Intuition: Parse trees are extended in space, derivations in time.

For each derivation of a word, there is a parse tree for the word. (Idea: each step using A → α tells us that the children of some A-labelled node are labelled with the symbols in α.)

For each parse tree, there is a (unique leftmost) derivation. (Idea: walk over the tree in depth-first order; each internal node gives us a rule.)

Hence parse trees and leftmost derivations are isomorphic

Parse tree traversal and derivation

Parse tree for [ ]:

D
├── [
├── D
│   └── ε
├── ]
└── D
    └── ε

A depth-first walk over this tree gives the leftmost derivation:

D
⇒ [D ]D
⇒ [ ]D
⇒ [ ]

Problem: ambiguous grammars

A grammar is ambiguous if there is a string that has more than one parse tree. Standard example:

E → E − E

E → 1

One such string is 1-1-1. It could mean (1-1)-1 or 1-(1-1), depending on how you parse it. Ambiguous grammars are a problem for parsing, as we do not know what is intended. As with left recursion, one can try to rewrite the grammar to eliminate the problem.

From grammars to Java classes

We translate a grammar to some mutually recursive Java classes:

For each non-terminal A there is an abstract class A

For each rule A→ X1 . . .Xn, there is a concrete subclass of A.

It has instance fields for all non-terminals among X1, . . . , Xn.

(Instead of an abstract class, we could also use an interface for each non-terminal A.)

Constructors from rules

In the grammar, the rules do not have names. We could pick names, or just number the rules for each non-terminal A as A1, . . . , An. Example:

D → [D ]D (1)

D → (2)

Then we translate this to two subclasses of class D, with constructors

D1(D x, D y) and D2()

Method toString in the tree classes

The toString method of the class for rule

A→ X1 . . .Xn

concatenates the toString of all the Xi:

if Xi is a terminal symbol, it is already a string;

if Xi is a non-terminal, its toString method is called.

Calling toString on a parse tree prints the word at the leaves of the tree.

Example: Dyck language constructors

abstract class D {
    public abstract String toString();
}

class D2 extends D
// D ->
{
    D2() { }
    public String toString() { return ""; }
}

Example continued

class D1 extends D
// D -> [ D ] D
{
    private D left, right;

    D1(D l, D r) {
        left = l; right = r;
    }

    public String toString() {
        return "[ " + left.toString() + " ] "
             + right.toString();
    }
}
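As a quick usage check (my own snippet, assuming the D, D1 and D2 classes above are compiled together), we can build the tree for [ [ ] ] by hand and print it:

// Builds the parse tree for "[ [ ] ]" by hand and prints it.
public class DyckTreeDemo {
    public static void main(String[] args) {
        D inner = new D1(new D2(), new D2());   // tree for [ ]
        D tree  = new D1(inner, new D2());      // tree for [ [ ] ]
        System.out.println(tree.toString());    // prints the word at the leaves: [ [ ] ] (up to spacing)
    }
}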

Building the parse tree

For the parser that only recognizes, we had a void return type. We extend our translation of grammar rules to code:

The method for non-terminal A has as its return type the abstract class that we created for A.

Whenever we call a method for a non-terminal, we remember its return value in a local variable.

At the end of the translation of a rule, we call the constructor.

Translation to code and building the parse tree

D parseD() throws SyntaxError
{
    D left, right, result = null;
    switch (lookahead()) {
    case '[':
        match('[');
        left = parseD();
        match(']');
        right = parseD();
        result = new D1(left, right);
        break;
    case ']': case '$':
        result = new D2();
        break;
    default: throw new SyntaxError();
    }
    return result;
}

Abstract characterization of parsing as an inverse

Printing then parsing: same tree.

[Diagram: toString maps a Tree to a String, and parse maps a String back to a Tree (or a SyntaxError); the composite of the two is the identity on trees.]

Parsing then printing: almost the same string.

[Diagram: parse maps a String to a Tree (or a SyntaxError), and toString maps that back to a String; the result is the pretty-printed version of the original string.]

COMPOSITE pattern and (parse) trees

The representation of parse trees is an instance of the COMPOSITE pattern in object-oriented software. Gamma et al. define it as: “Compose objects into tree structures to represent part-whole hierarchies.” (From Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides.) This is essentially the representation of parse trees described above. It describes a “hierarchy” as a tree, where a node is the whole and its children are the parts.

Composite pattern in UML notation

Consider the binary tree grammar:

B → B B | 1 | 2 | . . .

In UML notation, we have the following class diagram:

[Class diagram: abstract class B, with concrete subclasses Leaf and Node; a Node has two fields of type B.]

A B “is a” Leaf or a Node, and a Node “has a” B.
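A hedged Java rendering of this class diagram (the field names, the sum operation and the demo main are my own choices, not from the slides):

// Composite pattern for the binary tree grammar  B -> B B | 1 | 2 | ...
abstract class B {
    abstract int sum();                      // example operation on the whole tree
}

class Leaf extends B {                       // B -> 1 | 2 | ...
    private final int value;
    Leaf(int value) { this.value = value; }
    int sum() { return value; }
}

class Node extends B {                       // B -> B B
    private final B left, right;             // a Node "has a" B (twice)
    Node(B left, B right) { this.left = left; this.right = right; }
    int sum() { return left.sum() + right.sum(); }
}

public class CompositeDemo {
    public static void main(String[] args) {
        B tree = new Node(new Leaf(1), new Node(new Leaf(2), new Leaf(3)));
        System.out.println(tree.sum());      // prints 6
    }
}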

Trees in functional languages

Modern functional languages such as Haskell and ML (with its dialects OCaml and F#) have much better constructs for representing trees. E.g.,

type tree =
  | Leaf of int
  | Node of tree * tree

Adding methods in the parse tree

To perform useful work after the parse tree has been constructed, we call its methods. Canonical example: expression evaluation. Suppose we have a grammar for arithmetical expressions:

E → E + E

E → E ∗ E

E → 1 | 2 | 3 | . . .

Each expression “knows” how to evaluate itself (depending on whether it is an addition, a multiplication, . . . ). The methods in the parse tree can be given additional parameters to move information around the tree.

Expression evaluation as a method of the parse tree

abstract class Expression { abstract int eval(); ... }

class Plus extends Expression
{
    ....
    public int eval()
    {
        return left.eval() + right.eval();
    }
    ....
}
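The slides elide the rest of the classes; a completed sketch might look like this (the Literal and Times classes and the sample main are my additions, not the lecturer's code):

// Expression evaluation as methods of the parse tree (completed sketch).
abstract class Expression {
    abstract int eval();
}

class Literal extends Expression {           // E -> 1 | 2 | 3 | ...
    private final int value;
    Literal(int value) { this.value = value; }
    int eval() { return value; }
}

class Plus extends Expression {              // E -> E + E
    private final Expression left, right;
    Plus(Expression left, Expression right) { this.left = left; this.right = right; }
    int eval() { return left.eval() + right.eval(); }
}

class Times extends Expression {             // E -> E * E
    private final Expression left, right;
    Times(Expression left, Expression right) { this.left = left; this.right = right; }
    int eval() { return left.eval() * right.eval(); }
}

public class EvalDemo {
    public static void main(String[] args) {
        // The tree for 2 + 3 * 4, built by hand rather than by a parser.
        Expression e = new Plus(new Literal(2), new Times(new Literal(3), new Literal(4)));
        System.out.println(e.eval());        // prints 14
    }
}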

Adding parameters to the methods

Suppose we also have variables: E → x | . . .

abstract class Expression
{
    abstract int eval(Environment env);
}

class Variable extends Expression
{
    private String name;

    public int eval(Environment env) {
        return env.get(name);
    }
}
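The slides use an Environment but do not define it; a minimal version (my own assumption) is just a thin wrapper around a map from variable names to integer values:

import java.util.HashMap;
import java.util.Map;

// Minimal environment: maps variable names to their integer values.
// Typical use: env.put("x", 42); then e.eval(env) looks x up when needed.
class Environment {
    private final Map<String, Integer> bindings = new HashMap<>();

    void put(String name, int value) { bindings.put(name, value); }

    int get(String name) {
        Integer value = bindings.get(name);
        if (value == null) {
            throw new IllegalArgumentException("unbound variable: " + name);
        }
        return value;
    }
}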

OO and parsing

Parse trees are an instance of the composite pattern.

The classes made for rules A→ α extend the class for A

The return type of a parsing method for A is the abstract class A, but what is actually returned is an instance of one of its subclasses.

The methods in the abstract class are overridden by the subclasses. During a tree walk, we rely on dynamic dispatch.

Polymorphism is crucial.

The parsing methods could all be static (like C functions), but it could also be done in a more OO style.

Parser generators

Except when the grammar is very simple, one typically does not program a parser by hand from scratch. Instead, one uses a parser generator. Compare:

Java code  --(compiler)-->  Java byte code

Annotated grammar  --(parser generator)-->  Parser

Examples of parser generators: yacc, bison, ANTLR, JavaCC, SableCC, . . .

Parsers and lexers

In the real world, lookahead and match are calls to the lexical analyzer (lexer), which matches regular expressions; lookahead returns tokens, not raw characters.

Example: the substring 4711666.42 in the input becomes a single token FLOAT.

We ignored all that to keep the parser as simple as possible (only single-letter keywords).

But our simple predictive parsers are sufficient to demonstrate the principles.

More on parser generators

Parser generators use some ASCII syntax rather than symbols like →.

With yacc, one attaches parsing actions to each production that tell the parser what to do.

Some parsers construct the parse tree automatically. All one has to do is tree-walking.

Parser generators often come with a collection of useful grammars for Java, XML, HTML and other languages.

If you need to parse a non-standard language, you need a grammar suitable for input to the parser generator.

Pitfalls: ambiguous grammars, left recursion

Yacc parser generator

“Yet Another Compiler-Compiler”

An early (1975) parser generator geared towards C.

Technically, an LALR(1) parser. LALR(1) is too complicated to explain here.

Very influential. You should have heard about it, for historical reasons.

Still widely referred to: “This is the yacc for 〈blah〉” means “This largely automates doing 〈blah〉 while hiding much of the complexity”.

The Linux version is called bison; see http://www.gnu.org/software/bison/.

ANTLR parser generator

Works with Java or C

Download and documentation at http://antlr.org

uses LL(k): similar to our predictive parsers, but using more than one symbol of lookahead

Parse tree can be constructed automatically

you just have to annotate the grammar to tell ANTLR when to construct a node

The parse trees are a somewhat messy data structure, not very typed or OO

ANTLR has been used in several student projects here

See http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator for a simple example.

ANTLR syntax example

Example of an annotated grammar rule from the ANTLR website tutorial:

expr returns [int value]
    : e=multExpr {$value = $e.value;}
      ( '+' e=multExpr {$value += $e.value;}
      | '-' e=multExpr {$value -= $e.value;}
      )*
    ;

In grammar notation:

E → M (+ M | - M)∗

The ANTLR code above does not build a parse tree, but evaluates the expression to an int while parsing.

JavaCC parser generator

JavaCC is a parser generator aimed at Java.

See https://javacc.dev.java.net/ for downloads and documentation.

uses LL(1) if possible

Blurb: “Java Compiler Compiler [tm] (JavaCC [tm]) is the most popular parser generator for use with Java [tm] applications.”

SableCC parser generator

Works with Java

Download and documentation at http://sablecc.org

uses LALR(1) for parsing, just like Yacc

SableCC has no problem with left recursion, as LR parsing does not only depend on the lookahead

you may get cryptic errors about shift/reduce and reduce/reduce conflicts if the grammar is unsuitable

SableCC constructs an object-oriented parse tree, similar to the way we have constructed Java classes

uses the Visitor Pattern for processing the parse tree

SableCC has been used in several student projects here

Visitor pattern for walking the parse tree

Having to modify the methods in the tree classes is poor software engineering.

It would be much better if we had a general-purpose tree walker into which we can plug whatever functionality is desired.

The canonical way to do that is the Visitor Design pattern.

The tree is separate from specialized code that “visits” it.

The tree classes only have “accept” methods for visitor objects.

See the Design Patterns book by Gamma, Helm, Johnson and Vlissides, or www.sablecc.org. A minimal sketch follows below.
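Here is a minimal, self-contained sketch of the pattern (my own toy classes, reusing the B/Leaf/Node shape from the composite-pattern slide; SableCC-generated visitors are more elaborate, but the idea is the same): the tree classes only know how to accept a visitor, and all specialized work lives in the visitor.

// Visitor interface: one visit method per concrete tree class.
interface BVisitor<R> {
    R visitLeaf(Leaf leaf);
    R visitNode(Node node);
}

abstract class B {
    // The only method the tree classes need: hand control to the visitor.
    abstract <R> R accept(BVisitor<R> visitor);
}

class Leaf extends B {
    final int value;
    Leaf(int value) { this.value = value; }
    <R> R accept(BVisitor<R> visitor) { return visitor.visitLeaf(this); }
}

class Node extends B {
    final B left, right;
    Node(B left, B right) { this.left = left; this.right = right; }
    <R> R accept(BVisitor<R> visitor) { return visitor.visitNode(this); }
}

// One possible visitor: counts the leaves without touching the tree classes.
class LeafCountVisitor implements BVisitor<Integer> {
    public Integer visitLeaf(Leaf leaf) { return 1; }
    public Integer visitNode(Node node) {
        return node.left.accept(this) + node.right.accept(this);
    }
}

public class VisitorDemo {
    public static void main(String[] args) {
        B tree = new Node(new Leaf(1), new Node(new Leaf(2), new Leaf(3)));
        System.out.println(tree.accept(new LeafCountVisitor()));   // prints 3
    }
}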

Parser generator overview

Parser generator | LR or LL | Tree processing
Yacc/Bison       | LALR(1)  | Parsing actions in C
SableCC          | LALR(1)  | Visitor Pattern in Java
ANTLR            | LL(k)    | Tree grammars + Java or C++
JavaCC           | LL(k)    | JJTree + Visitors in Java

Summary for parsing

We have seen some basics of parsing (grammars, derivations,parse trees).

We have translated grammars to Java code.

We have touched upon some more advanced material(FIRST/FOLLOW, parser generators, Visitor pattern).

The code contains generally useful programming concepts: mutual recursion of methods and data, the Composite pattern, abstract classes, polymorphism, exceptions, . . .

In practice, one would use a parser generator rather than reinvent the wheel.

One still needs to have some idea of what a parser generator does, as it is not just a button to click on.

Books and further reading on parsing

I hope these slides are detailed enough, but if you want to dig deeper: there are lots of books on compilers. The ones which I know best are:

Appel, Modern Compiler Implementation in Java.

Aho, Sethi, and Ullman, nicknamed “The Dragon Book”.

Parsing is covered in both, but the Composite Pattern for trees and Visitors for tree walking only in Appel. See also the websites for ANTLR (http://antlr.org) and SableCC (http://sablecc.org).

Streams, XML, parsing: reading structured input

The stream hierarchy cooks raw input, e.g., 12345 → integer

Regular expression matching can be done on streams

Regular expressions ≠ Perl “regex” (which deserve sneer quotes) with exponential runtime

Regular expressions are also part of other languages, e.g., DTDs for XML

XML needs to be parsed; it is a language of brackets like Dyck

XML parsers may build a tree or not, just like general-purpose parsers

Parsers are more powerful than regular expressions

Regular expression constructs can be used in grammars

Some relevant security research

Relevant to our flagship MSc in Computer Security and my Secure Programming module, http://www.cs.bham.ac.uk/~hxt/2010/20010/

Could also be a UG final-year project topic. Email me if interested.

Inefficient backtracking for reg exp matching takes forever on some inputs. Attackers can force a Denial of Service (DoS) attack (see the sketch after this list).

Regular Expression Denial of Service Attacks and Defenses

XML parsing: loads of opening tags overflow the parsing stack. Attackers can force a Denial of Service (DoS) attack.

XML Denial of Service Attacks and Defenses
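As a hedged illustration of the first point (my own snippet, not from the cited paper; exact timings depend on the JVM, and it needs Java 11+ for String.repeat): the pattern (a+)+b makes a backtracking matcher such as java.util.regex take time that grows roughly exponentially with the number of a's when the match fails.

import java.util.regex.Pattern;

// Catastrophic backtracking: (a+)+b never matches a string of a's followed
// by 'c', but a backtracking engine tries exponentially many ways to split
// the a's between the two nested loops before giving up.
public class RedosDemo {
    public static void main(String[] args) {
        Pattern evil = Pattern.compile("(a+)+b");
        for (int n = 20; n <= 26; n++) {
            String attack = "a".repeat(n) + "c";
            long start = System.nanoTime();
            boolean found = evil.matcher(attack).matches();
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " a's: matched=" + found + " in " + millis + " ms");
        }
    }
}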
