

Language Processors

Axel-Tobias Schreiner
Department of Computer Science
Rochester Institute of Technology

This volume contains copies of the overhead slides used in class. This information is available online as part of the World Wide Web; it contains hypertext references to itself and to the documentation for various programming languages in the Web. The example programs are included into this text from the original sources.

So that it may be viewed on other platforms, this text also exists as a PDF document. With the Acrobat Reader from Adobe the text can be printed on Windows systems.

The text is not a complete transcript of the lectures. A significant knowledge of Java and some facts about grammars and programming language features is assumed; for self study one would have to consult introductory books on Java, programming language concepts, and compiler construction.

Contents

0 Introduction
1 Overview
2 Grammars and Trees
3 Recursive Descent
4 LR Parsing
5 Lexical Analysis
6 Visitors
7 Code Generation
8 LL Parsing With Objects
9 Interpretation
10 Procedures
11 Block Structure


References

These slides are developed in both environments of Mac OS X using Adobe FrameMaker, PhotoShop, and Distiller. Drawings are created with OmniGraffle. The slides are available in the World Wide Web:

http://www.cs.rit.edu/~ats/lp-2002-2/

Today there are lots of programming languages and tools for compiler construction and even more books about them. Here are just a few useful ones:

Aho/Sethi/Ullman  0-201-10088-6  Compilers
    http://www.aw.com/catalog/academic/product/1,4096,0201100886,00.html
Appel  0-521-58388-8  Modern Compiler Construction With Java
    http://www.cs.princeton.edu/~appel/modern/java/
Engel  0-201-30972-6  Programming for the Java Virtual Machine
    http://www.aw.com/catalog/academic/product/1,4096,0201309726,00.html
Flanagan  0-596-00283-1  Java in a Nutshell (4th Edition)
    http://www.oreilly.com/catalog/javanut4/
Holub  0-130255252-5  Compiler Design in C
    http://www.holub.com/compiler/compiler.html
Watt  0-13-025786-9  Programming Language Processors in Java
    http://vig.prenhall.com/catalog/academic/product/1,4096,0130257869,00.html
Wirth  0-201-40353-6  Compiler Construction
    http://www.amazon.com/exec/obidos/tg/detail/-/0201403536/qid=1037988913/sr=1-10/ref=sr_1_10/002-2277797-5218468?v=glance&s=books#product-details


1 Overview

1.1 Compilation process

Here is an overview as to what goes on inside a compiler:

In several steps input characters are turned into a sequence of terminal symbols and a symbol table and then into a tree representing the program expressed by the input characters.

The tree can then be converted into something that is more or less executable and will produce the effect expected by the program.

The compilation phases run more or less in parallel, i.e., the intermediate data may or may not be available explicitly.

How compilation progresses depends on the implemented language, e.g., if declare before use is required (as in C) or not (as in Java).

[Figure: compilation pipeline. characters → lexical analysis → symbols and symbol table → syntax analysis, semantic analysis → tree → either an interpreter (eval) or code generation → instructions or byte codes → hardware or virtual machine]

1.2 Terminology

Syntax

What are terminal symbols? This is answered through lexical analysis.

How does a sequence of symbols form a sentence? This is answered through syntax analysis.

These aspects can be handled very well by formalisms and tools.

Semantics

What sentences are acceptable — because they make sense.

What does a sentence mean?

This is much harder to describe and test.

Compiler

Consider C or C++: a source program is compiled into an object.

The object is combined with library functions into an image.

The image can be executed (directly) on the target machine.

Interpreter

Consider Perl: a source program might be compiled into an intermediate form.

This is then read by an interpreter which produces the desired effect.

Compiler vs. Interpreter

There is no clear boundary between compilers and interpreters: a large function library might (almost) constitute an interpreter; compiling into a very primitive intermediate form requires (almost) a full-scale compiler. A bytecode interpreter such as the Java Virtual Machine (almost) acts like target hardware. A just-in-time (JIT) compiler usually performs simple expansions from bytecodes into machine instructions.


2 Grammars and Trees

This chapter introduces grammars and related terminology. It describes several grammar specifications and their relationship to trees representing sentences.

2.1 Syntax Graphs

Wirth first described the syntax of Pascal using named, directed graphs where (blue) vertices with round corners represent terminal symbols and (black) rectangular vertices reference other graphs. Arithmetic expressions may be described as follows:

[Figure: syntax graphs for sum, product, and term with the terminals +, -, *, /, %, Number, (, and )]

Nassi-Shneiderman diagrams can be adapted to provide a more restricted topology:

[Figure: the graphs for sum, product, and term drawn as Nassi-Shneiderman diagrams]

http://www.cs.inf.ethz.ch/~wirth/

http://www.cs.berkeley.edu/~jasonh/cs39i-seminar/project1/BenShneiderman/

http://www.nassi.com/ike.htm

http://www.smartdraw.com/resources/centers/software/nassi.htm


Each graph describes a phrase, i.e., eventually an acceptable sequence of terminal symbols.

The graphs can be used to create arithmetic expressions: Start traversing sum. When traversing a blue vertex, write its terminal symbol. When traversing a black vertex, traverse the corresponding graph.

The graphs can be used to check if arithmetic expressions are well formed. Start traversing sum. You may traverse a blue vertex if the next input symbol matches the vertex. You may traverse a black vertex by traversing its corresponding graph.

In general the graphs are used to check if a sequence of terminal symbols is acceptable: one needs to traverse the graph and compare the blue vertices with the terminal sequence; black vertices require that other graphs can be similarly traversed and compared.

For a sum one needs a product. After one or more plus or minus signs there can be another product. The product requires at least one term which could be a Number.

The process is non-trivial once the blue vertices are not unique enough as signposts directing the graph traversal.

2.2 Grammars

Formally, a grammar consists of a finite set of nonterminal symbols, a finite set of terminal symbols, a start symbol which is a nonterminal, and a finite set of ordered pairs of sequences of nonterminal and terminal symbols. For example:

    nonterminals: a, b
    terminals: c, d
    start symbol: a
    pairs: (ac, ab), (a, )

Chomsky distinguishes four different kinds of grammars based on the structure of the pairs.

In a context-free grammar each pair must have a single nonterminal as the first sequence of the pair. It is known that a push-down automaton (PDA, stack machine) is sufficient for language recognition based on a context-free grammar. As an example, the grammar above is not context-free, but with the following pairs it would be:

    pairs: (a, b), (a, ac), (b, d)

In a regular grammar a pair consists either of a nonterminal and a terminal, or of a nonterminal and a sequence consisting of a nonterminal and a terminal. A finite state automaton (FSA) can be constructed from the grammar and perform language recognition based on a regular grammar. As an example, the grammar above with the context-free pairs is in fact regular.

It turns out that the pattern matching performed by commands like grep can (for the most part) be done with an FSA. This is where the regular expressions describing the patterns got their name.
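The difference between the two sets of pairs above can be checked mechanically. The following is a minimal sketch and not part of the course sources: the class GrammarSketch, its nested class Pair, and the method isContextFree are hypothetical names, and every symbol is assumed to be a one-character string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: a grammar reduced to its nonterminals and its ordered pairs. */
class GrammarSketch {
    /** One ordered pair of symbol sequences (hypothetical helper). */
    static final class Pair {
        final List<String> left;
        final List<String> right;
        Pair(List<String> left, List<String> right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Nonterminals of the example grammar. */
    static final Set<String> NONTERMINALS = new HashSet<>(Arrays.asList("a", "b"));

    /** The pairs (ac, ab), (a, ) from the first example. */
    static List<Pair> firstPairs() {
        return Arrays.asList(
            new Pair(Arrays.asList("a", "c"), Arrays.asList("a", "b")),
            new Pair(Arrays.asList("a"), Collections.<String>emptyList()));
    }

    /** The pairs (a, b), (a, ac), (b, d) from the second example. */
    static List<Pair> secondPairs() {
        return Arrays.asList(
            new Pair(Arrays.asList("a"), Arrays.asList("b")),
            new Pair(Arrays.asList("a"), Arrays.asList("a", "c")),
            new Pair(Arrays.asList("b"), Arrays.asList("d")));
    }

    /** Context-free: every pair starts with exactly one nonterminal. */
    static boolean isContextFree(Set<String> nonterminals, List<Pair> pairs) {
        for (Pair pair : pairs) {
            if (pair.left.size() != 1 || !nonterminals.contains(pair.left.get(0))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isContextFree(NONTERMINALS, firstPairs()));   // false
        System.out.println(isContextFree(NONTERMINALS, secondPairs()));  // true
    }
}
```

The first set fails the test because the pair (ac, ab) has two symbols in its first sequence.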

http://web.mit.edu/linguistics/www/chomsky.home.html


2.3 Grammar Rules — Backus-Naur Form

Grammars are often specified in Backus-Naur Form (BNF). The typical grammar for arithmetic expressions has the following rules:

    sum: product | sum '+' product | sum '-' product;
    product: term | product '*' term | product '/' term | product '%' term;
    term: '+' term | '-' term | 'Number' | '(' sum ')';

A context-free rule consists of a nonterminal on the left and of a sequence of expressions on the right, joined by | denoting alternatives and terminated by a semicolon. An expression consists of a sequence of terminals and nonterminals.

The nonterminal on the left has to be unique; therefore, a rule describes all pairs from the grammar that have this nonterminal as their first element and one of the expressions as their second element. By convention, the first rule designates the start symbol.

Nonterminals are represented by identifiers — the names on Wirth’s syntax graphs.

Quoted strings represent terminals either directly or with descriptive identifiers such as Number which represent a category of terminals.

One could add rules to reduce all terminals to single character strings. If Number were unquoted above:

    Number : digit | Number digit;
    digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

In general, however, this is likely to make it very difficult to quickly perform graph traversal when such a grammar is turned into a syntax graph and used for recognition.

BNF can describe itself:

    grammar: rule | grammar rule;
    rule: 'Id' ':' rhs ';';
    rhs: expression | rhs '|' expression;
    expression: /* empty */ | expression item;
    item: 'Id' | 'String';

The description cannot distinguish between identifiers used for nonterminals and for terminals. Note that an expression can be empty. Considering pairs, this should at most happen for one expression on a right hand side.


2.4 Extended Backus-Naur Form

BNF has to use recursion to express repetition. For many people recursive constructs are harder to understand than iterative constructs.

Extended BNF (EBNF) adds a notation to express repetition. For example, the Internet Requests For Comments (RFC) or an XML Document Type Definition (DTD) use

    ( )   grouping
    *     zero or more occurrences
    +     one or more occurrences
    ?     zero or one occurrence

Other versions use braces { } to denote repetition and brackets [ ] to denote an optional part.

Arithmetic expressions can now be described as follows:

    sum: product ( '+' product | '-' product )*;
    product: term ( '*' term | '/' term | '%' term )*;
    term: ( '+' | '-' )* ( 'Number' | '(' sum ')' );

This corresponds much more closely to syntax graphs, especially when the restricted topology of Nassi-Shneiderman diagrams is used:

[Figure: Nassi-Shneiderman building blocks for syntax graphs: Terminal, Nonterminal, Sequence, Alternative, Optional (?), Some (+), Many (*)]

EBNF can also describe itself:

    grammar: rule+;
    rule: 'Id' ':' alt ';';
    alt: seq ( '|' seq )*;
    seq: ( term )+;
    term: item ( '?' | '+' | '*' )?;
    item: 'Id' | 'String' | '(' alt ')';

Empty expressions should not be necessary in EBNF.

2.5 Syntax Trees

The start symbol of a context-free grammar produces syntax trees: Nodes are nonterminals or terminals. If a node has descendants the node must be a nonterminal and there must be a rule consisting of the nonterminal and the (ordered) sequence of descendants. As an example the rule

    term: 'Number' | '(' sum ')' | '+' term | '-' term;

can produce the following trees (and more):

[Figure: four syntax trees with root term and the descendants ( sum ), + term, Number, and - term]

The ordered sequence of all terminal leaves of a syntax tree (with the start symbol as a root) is called a sentence. These trees, with term as the root, only show the sentence Number.

A language is a set of sentences, i.e., sequences of terminals. A grammar for a language must exactly produce all sentences. A language can have more than one grammar.

This works, more or less, for EBNF, too. As an example the rule

    term: ('+'|'-')* ( 'Number' | '(' sum ')' );

can produce the following trees (and more):

[Figure: two syntax trees with root term, one with the descendants + - ... Number and one with the descendants + - ... ( sum )]

EBNF may create nodes with arbitrary degrees.

2.6 Ambiguity

A grammar is called ambiguous if there is a sentence for which there is more than one syntax tree. As an example the grammar described by the single rule

    sum: 'Number' | sum '+' sum;

can produce the following different trees for the same sentence:

[Figure: two different syntax trees with root sum for the same sentence]

Note that only a grammar can be ambiguous; this has nothing to do with a language. As an example the grammar consisting of the rule

    sum: 'Number' | sum '+' 'Number';

is not ambiguous and produces the same sentences as the grammar above.
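To see how quickly ambiguity grows, one can count the syntax trees which each of the two rules above permits for a sentence with n Numbers joined by '+'. This is an illustrative sketch only; the class AmbiguitySketch and its method names are assumptions and do not appear in the course sources.

```java
/** Sketch only: counts syntax trees for Number '+' Number '+' ... with n Numbers. */
class AmbiguitySketch {
    /** Trees under sum: 'Number' | sum '+' sum; assumes numbers >= 1. */
    static long treesAmbiguous(int numbers) {
        long[] count = new long[numbers + 1];
        count[1] = 1;                                  // a single Number has one tree
        for (int size = 2; size <= numbers; size++) {
            // any '+' can be the topmost one: left Numbers go to the
            // left subtree, the remaining ones to the right subtree
            for (int left = 1; left < size; left++) {
                count[size] += count[left] * count[size - left];
            }
        }
        return count[numbers];
    }

    /** Trees under sum: 'Number' | sum '+' 'Number'; assumes numbers >= 1. */
    static long treesLeftRecursive(int numbers) {
        // only the last '+' can be the topmost one, so there is no choice
        return numbers == 1 ? 1 : treesLeftRecursive(numbers - 1);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " Numbers: " + treesAmbiguous(n)
                + " vs. " + treesLeftRecursive(n));
        }
    }
}
```

For three Numbers the first rule already permits two trees, for five Numbers there are fourteen; the second rule always permits exactly one.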

2.7 Interpreter Trees

Certain simplified trees are more interesting for practical purposes. Here the nodes represent certain terminals and the branches are similar to the corresponding syntax tree. As an example

[Figure: four syntax trees with root term and the descendants ( sum ), + term, Number, and - term]

is simplified as

[Figure: the simplified trees sum, term, Number, and a node - above term]

except that the green nodes still have to be replaced by trees. And

[Figure: syntax tree with root sum and the descendants sum - product, where both operands lead through product and term to Number]

is simplified as

[Figure: a node - above two Number leaves]

The simplified trees still contain all the information which is needed for evaluation.

Interpreter trees are derived from syntax trees; therefore, meaning, such as operator precedence, is usually associated with syntax trees. This is why an ambiguous grammar is usually not acceptable for the purposes of compiler construction.

3 Recursive Descent

A grammar rigorously describes a language. A grammar can be specified in EBNF and EBNF can be represented as a syntax graph:

[Figure: Nassi-Shneiderman syntax graphs for sum, product, and term]

The Nassi-Shneiderman look of syntax graphs suggests that a syntax graph (and thus a grammar) can be interpreted as a recognition program.

This chapter discusses recursive descent, a programming style where a recognition program is handcrafted from a syntax graph, and shows how to evaluate and store arithmetic expressions.

3.1 Overview

Expression reads lines with arithmetic expressions from standard input, encodes them as interpreter trees, and evaluates them using float arithmetic. If the command-line argument -c is specified, an ArrayList with all trees is written to standard output. This ArrayList can be read with Go and evaluated in all arithmetic types:

    $ cd expr/java
    $ javac -source 1.4 -classpath ../.. Expression.java
    $ javac -source 1.4 -classpath ../.. Go.java
    $ echo 'Main-Class: expr.java.Expression' > manifest
    $ cd ../.. && jar cfm expr/java/cmp.jar expr/java/manifest \
    > expr/java/Expression*.class expr/java/VM*.class
    $ echo 'Main-Class: expr.java.Go' > manifest
    $ cd ../.. && jar cfm expr/java/go.jar expr/java/manifest \
    > expr/java/Go.class expr/java/VM*.class

    $ java -ea -jar cmp.jar -c | java -jar go.jar
    2+3
    32768 * 2147483648 + 128
    4 * 9223372036854775807
    4 * 9.2
    byte    short   int     long            float           double
    5       5       5       5               5.0             5.0
    -128    128     128     70368744177792  7.0368744E13    7.0368744177792E13
    -4      -4      -4      -4              3.6893488E19    3.6893488147419103E19
    36      36      36      36              36.8            36.8

Class files are collected in one archive cmp.jar for compilation and immediate execution, and in another archive go.jar for execution only. The archives contain manifests so that they can be executed with the -jar option of java. All the instructions are stored in a makefile so that the command make test will compile the classes, collect the archives, and run the test.

Expression demonstrates how to use a StreamTokenizer to analyze a text, how to program a recognizer using recursive descent, and how to output persistent objects.

The interpreter trees are assembled from node objects provided by a factory VM. Node classes extend Number — this permits expressions to be used in place of literals.

Go shows how to input and use persistent objects.

VM and Go are collected in the archive go.jar which will be reused.

-source 1.4 is specified so that the new assert statement may be used. -ea enables all assertion checking.

../code/expr/java/makefile


3.2 Recursive Descent Analysis

Nassi-Shneiderman diagrams are usually used to describe subroutines. Therefore, a suitable syntax graph can be written up as a recognizer function:

expr/java/Expression.java

    /** recognizes <tt>sum: product (('+'|'-') product)*;</tt>
        @param s source of first input symbol, advanced beyond sum.
        @return tree with evaluators.
        @see Expression#line
      */
    public static Number sum (StreamTokenizer s)
                             throws ParseException, IOException {
      Number result = product(s);
      for (;;)
        switch (s.ttype) {
        case '+':
          s.nextToken();
          result = vm.Add(result, product(s));
          continue;
        case '-':
          s.nextToken();
          result = vm.Sub(result, product(s));
          continue;
        default:
          return result;
        }
    }

StreamTokenizer produces the input symbols, here mostly single characters. ttype contains the next symbol once nextToken() has been called. It is up to the recursive descent procedures to dispose of ttype by calling nextToken() once the symbol has been used.

vm is a factory object which produces objects subclassed from Number that can be used to build trees for arithmetic expressions.

product() is implemented just like sum(). Unfortunately, the switch pretty much precludes extending the function by overriding and calling super. The Java class notes suggest a more flexible architecture.

The methods are static because this parser has no global state such as a symbol table or input and output connections.
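Since product() is not shown, the following self-contained sketch illustrates how it might mirror sum(). It is an assumption-laden simplification rather than the code from Expression.java: values are computed directly as long instead of building trees with the vm factory, errors are reported with IllegalArgumentException instead of ParseException, and the class ProductSketch and the method evaluate are hypothetical names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Sketch only: recursive descent that evaluates with long arithmetic. */
class ProductSketch {
    /** sum: product (('+'|'-') product)*; */
    static long sum(StreamTokenizer s) throws IOException {
        long result = product(s);
        for (;;)
            switch (s.ttype) {
            case '+': s.nextToken(); result += product(s); continue;
            case '-': s.nextToken(); result -= product(s); continue;
            default: return result;
            }
    }

    /** product: term (('*'|'/'|'%') term)*; same pattern as sum. */
    static long product(StreamTokenizer s) throws IOException {
        long result = term(s);
        for (;;)
            switch (s.ttype) {
            case '*': s.nextToken(); result *= term(s); continue;
            case '/': s.nextToken(); result /= term(s); continue;
            case '%': s.nextToken(); result %= term(s); continue;
            default: return result;
            }
    }

    /** term: '+' term | '-' term | '(' sum ')' | Number; integers only. */
    static long term(StreamTokenizer s) throws IOException {
        switch (s.ttype) {
        case '+': s.nextToken(); return term(s);
        case '-': s.nextToken(); return -term(s);
        case '(': {
            s.nextToken();
            long result = sum(s);
            if (s.ttype != ')') throw new IllegalArgumentException("expecting )");
            s.nextToken();
            return result;
        }
        case StreamTokenizer.TT_WORD: {
            long result = Long.parseLong(s.sval);
            s.nextToken();
            return result;
        }
        default:
            throw new IllegalArgumentException("missing term");
        }
    }

    /** Evaluates one expression held in a string. */
    static long evaluate(String text) {
        StreamTokenizer s = new StreamTokenizer(new StringReader(text));
        s.resetSyntax();
        s.wordChars('0', '9');          // digit sequences are reported as words
        s.whitespaceChars(0, ' ');      // skip control characters and space
        try {
            s.nextToken();              // read the first symbol ahead
            long result = sum(s);
            if (s.ttype != StreamTokenizer.TT_EOF)
                throw new IllegalArgumentException("trailing input");
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2 + 3 * 4"));    // 14
        System.out.println(evaluate("(2 + 3) * 4"));  // 20
    }
}
```

Operator precedence follows from the call structure alone: sum() calls product() for each operand, so multiplication binds tighter than addition.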

[Figure: Nassi-Shneiderman syntax graphs for sum, product, and term]

http://www.cs.rit.edu/~ats/java-2001-1/html/skript-18.html



    /** recognizes term: '+'term | '-'term | '('sum')' | Number;
        @param s source of first input symbol, advanced beyond term.
        @return tree with evaluators.
        @see Expression#sum
      */
    public static Number term (StreamTokenizer s)
                              throws ParseException, IOException {
      switch (s.ttype) {
      case '+':
        s.nextToken();
        return term(s);
      case '-':
        s.nextToken();
        return vm.Minus(term(s));
      case '(':
        s.nextToken();
        Number result = sum(s);
        if (s.ttype != ')') throw new ParseException("expecting )");
        s.nextToken();
        return result;
      case StreamTokenizer.TT_WORD:
        if (Pattern.matches("^[0-9]+$", s.sval))
          result = vm.Long(s.sval);
        else if (Pattern.matches("^([0-9]+\\.[0-9]*|\\.[0-9]+)$", s.sval))
          result = vm.Double(s.sval);
        else throw new ParseException("expecting a number");
        s.nextToken();
        return result;
      }
      throw new ParseException("missing term");
    }
  }

This recognition technique is called recursive descent because the functions usually call each other recursively while they build the syntax tree from the root (sum) to the leaves (term).

Note that term() does not quite agree with the (more efficient) syntax graph for term.


sum cannot describe an input line because an input line has to be terminated with a line feed which cannot be present in term. It is a good idea to use another syntax graph to deal with a line:

[Figure: syntax graph for line: any number of \n, then sum, then \n]

    /** recognizes <tt>line: '\n'* sum '\n';</tt>.
        @param s source of first input symbol, may be at end of file.
        @return tree for sum, null if only end of file is found.
        @throws ParseException for syntax error.
        @throws IOException discovered on s.
      */
    public static Number line (StreamTokenizer s)
                              throws ParseException, IOException {
      for (;;)
        switch (s.nextToken()) {
        default:
          Number result = sum(s);
          if (s.ttype != StreamTokenizer.TT_EOL)
            throw new ParseException("expecting nl");
          return result;
        case StreamTokenizer.TT_EOL:
          continue;                     // ignore empty line
        case StreamTokenizer.TT_EOF:
          return null;
        }
    }

This function reads one symbol ahead and silently ignores line separators. Following a sum there definitely has to be a line separator which will be disposed of by the next call to line(); this is a change in invariants compared to sum() and term(), which advance beyond the symbols they have used.

At end of input line() delivers null rather than a tree.

Errors are reported with a subclass of Exception so that they can be caught separately:

    /** indicates parsing errors.
      */
    public static class ParseException extends Exception {
      public ParseException (String msg) {
        super(msg);
      }
    }

The main program creates a StreamTokenizer for standard input and configures it to report numerical sequences as words. line() delivers one tree at a time which is either evaluated or stored in an ArrayList.


    package expr.java;
    import java.io.InputStreamReader;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.StreamTokenizer;
    import java.util.ArrayList;
    import java.util.regex.Pattern;
    /** recognizes, stores, and evaluates arithmetic expressions.
      */
    public abstract class Expression {
      /** node factory.
        */
      protected static VM vm = new VM();
      /** reads lines from standard input, parses, and evaluates them
          or writes them to standard output.
          @param args if <tt>-c</tt> is specified, an <tt>ArrayList</tt> is written.
        */
      public static void main (String args []) throws Exception {
        boolean cflag = args.length > 0 && args[0].equals("-c");
        ArrayList lines = cflag ? new ArrayList() : null;
        StreamTokenizer scanner =
          new StreamTokenizer(new InputStreamReader(System.in));
        scanner.resetSyntax();
        scanner.commentChar('#');        // comments from # to end-of-line
        scanner.wordChars('0', '9');     // parse decimal numbers as words
        scanner.wordChars('.', '.');
        scanner.whitespaceChars(0, ' '); // ignore control-* and space
        scanner.eolIsSignificant(true);  // need '\n'

        do
          try {
            Number n = Expression.line(scanner);
            if (n != null)
              if (cflag) lines.add(n);
              else System.out.println(n.floatValue());
          } catch (java.lang.Exception e) {
            System.err.println(scanner +": "+ e);
            while (scanner.ttype != scanner.TT_EOL
                   && scanner.nextToken() != scanner.TT_EOF)
              ;
          }
        while (scanner.ttype == scanner.TT_EOL);
        if (cflag) {
          ObjectOutputStream out = new ObjectOutputStream(System.out);
          out.writeObject(lines);
          out.close();
        }
      }

Trees are written using an ObjectOutputStream, because ArrayList and all nodes are Serializable. It is important to call close() to ensure that the last buffer is written.

Following an error the StreamTokenizer has to be advanced to the next line separator; however, an IOException should still terminate the scan.

StreamTokenizer provides an input position indication as result of toString() — this is useful for error messages.

Expression is abstract because it does not make sense to create objects. However, the static methods can still be called. Recursive descent really is a matter of functions rather than methods.

package should be used to restrict the visibility of names. Sources and class files have to reside in a directory which mirrors the package path; this directory is located either inside an archive on the CLASSPATH or below a directory on the CLASSPATH.

import should not use wildcards — this way there is no doubt where a class is supposed to be found.

3.3 StreamTokenizer as a Scanner

A StreamTokenizer is constructed from a Reader. It reads bytes(!) and returns characters in the range 0..255 and symbols as results of nextToken() and in ttype.

StreamTokenizer uses byte classification to specify symbols. The defaults for StreamTokenizer are strange, but following resetSyntax() one can set up everything from scratch:

commentChar() defines a character that starts a comment which ends with a line separator. Additionally there can be slashStarComments() and slashSlashComments().

ordinaryChars() resets a range of characters so that they are returned directly by nextToken() and in ttype.

quoteChar() defines a character that starts a string which extends to a matching character or to a line separator or to the end of input. The character is returned by nextToken() and in ttype, the body of the string can be found in sval. The string body can contain the usual backslash escapes \a, \b, \f, \n, \r, \t, \v, \x, and \ooo; they are converted to characters.

whitespaceChars() defines a range of characters which will be skipped. eolIsSignificant() can be set to obtain TT_EOL for a line separator.

parseNumbers() can be set. In this case, digits, minus(!) and period(!) can follow in a word, and if a word looks like a floating point number its value is set in nval and nextToken() and ttype return TT_NUMBER.

wordChars() defines a range of characters which can start a word and follow in a word. The word is stored in sval and nextToken() and ttype return TT_WORD.

If StreamTokenizer can be configured to do the job it should definitely be used as a scanner. It is not useful to deal with full Unicode.
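The effect of this configuration can be observed by listing the symbols for a short input. The sketch below reuses the settings from the main program of Expression; the class ScannerSketch, the method tokens, and the labels WORD, CHAR, and EOL are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: shows which symbols the configured scanner delivers. */
class ScannerSketch {
    /** Returns a printable description of every symbol in text. */
    static List<String> tokens(String text) {
        StreamTokenizer scanner = new StreamTokenizer(new StringReader(text));
        scanner.resetSyntax();             // forget the default classification
        scanner.commentChar('#');          // comments from # to end-of-line
        scanner.wordChars('0', '9');       // decimal numbers are words
        scanner.wordChars('.', '.');
        scanner.whitespaceChars(0, ' ');   // ignore control characters and space
        scanner.eolIsSignificant(true);    // report '\n' as TT_EOL

        List<String> result = new ArrayList<>();
        try {
            while (scanner.nextToken() != StreamTokenizer.TT_EOF)
                switch (scanner.ttype) {
                case StreamTokenizer.TT_WORD:
                    result.add("WORD " + scanner.sval);
                    break;
                case StreamTokenizer.TT_EOL:
                    result.add("EOL");
                    break;
                default:                   // ordinary character, returned directly
                    result.add("CHAR " + (char) scanner.ttype);
                }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [WORD 2, CHAR +, WORD 3.5, EOL, CHAR (, WORD 4, CHAR ), EOL]
        System.out.println(tokens("2 + 3.5 # note\n(4)\n"));
    }
}
```

Note that the line separator has to be classified as white space for TT_EOL to be reported, and that the comment does not swallow the line separator which ends it.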


3.4 Interpreter Trees

The interpreter should store numerical values and provide arithmetic in a selectable type, from byte through double. This is accomplished by deriving all nodes from Number and implementing methods such as intValue() to evaluate the subtrees and combine them appropriately.

VM encapsulates all interpreter classes and provides factory methods to create the nodes so that a subclass of VM could arrange to replace some or all interpreter classes.

expr/java/VM.java

    package expr.java;
    /** factory to store and evaluate arithmetic expressions.
      */
    public class VM {
      /** factory method: create addition node.
        */
      public Number Add (Number left, Number right) {
        return new Add(left, right);
      }

Node is the base class for all nodes. As a simplification, arithmetic is mapped to longValue() and doubleValue().

expr/java/VM.java

      /** defines most value-functions so that subclasses need only deal
          with long and double arithmetic.
        */
      protected abstract static class Node extends Number {
        /** maps byte arithmetic to long.
            @return truncated long value.
          */
        public byte byteValue () {
          return (byte)longValue();
        }

Node is Serializable because it is a subclass of Number; therefore, trees can be serialized.
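The following sketch illustrates how such a mapping makes one tree deliver different results depending on the requested type. It is not the code from VM.java: only byteValue() is shown in the excerpt above, so the other mappings, in particular floatValue() by way of doubleValue(), are assumptions, and NodeSketch is a hypothetical container class.

```java
/** Sketch only: integral types map to long, float maps to double. */
class NodeSketch {
    /** Base class: subclasses implement longValue() and doubleValue(). */
    abstract static class Node extends Number {
        public byte byteValue() { return (byte) longValue(); }
        public short shortValue() { return (short) longValue(); }
        public int intValue() { return (int) longValue(); }
        public float floatValue() { return (float) doubleValue(); }
    }

    /** Addition node with two operand subtrees. */
    static class Add extends Node {
        private final Number left;
        private final Number right;

        Add(Number left, Number right) {
            this.left = left;
            this.right = right;
        }

        public long longValue() { return left.longValue() + right.longValue(); }
        public double doubleValue() { return left.doubleValue() + right.doubleValue(); }
    }

    public static void main(String[] args) {
        // literals are wrapper objects, operators are nodes
        Number tree = new Add(Long.valueOf(127), Long.valueOf(1));
        System.out.println(tree.byteValue());    // -128, truncated from long 128
        System.out.println(tree.intValue());     // 128
        System.out.println(tree.doubleValue());  // 128.0
    }
}
```

The truncation explains values such as -128 in the byte column of the sample output in section 3.1: the sum is computed as a long and only then cast down.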


expr/java/VM.java

      /** represents a binary operator.
          Must be subclassed to provide evaluation.
        */
      protected abstract static class Binary extends Node {
        /** left operand subtree.
          */
        protected final Number left;
        /** right operand subtree.
          */
        protected final Number right;
        /** builds a node with two subtrees.
            @param left left subtree.
            @param right right subtree.
          */
        protected Binary (Number left, Number right) {
          this.left = left; this.right = right;
          assert left != null;
          assert right != null;
        }
      }

Binary and similarly Unary are base classes for binary and unary operator nodes. The constructor stores the subtrees and thus forces the correct structure for the subclasses. The subclasses could implement toString() to provide a symbolic description of the interpreter tree.

expr/java/VM.java

      /** implements addition.
        */
      protected static class Add extends Binary {
        /** builds a node with two subtrees.
            @param left left subtree.
            @param right right subtree.
          */
        protected Add (Number left, Number right) {
          super(left, right);
        }
        /** implements long addition.
            @return sum of subtree values.
          */
        public long longValue () {
          return left.longValue() + right.longValue();
        }
        /** implements double addition.
            @return sum of subtree values.
          */
        public double doubleValue () {
          return left.doubleValue() + right.doubleValue();
        }
      }

Add() uses package access to generate an addition node. This can be overridden in a subclass of VM; therefore, the actual node class Add is completely hidden.

An operator class such as Add only needs to implement the arithmetic methods for the various datatypes that call the corresponding methods for the subtrees.

The entire system is based on Number; therefore, literals can be stored using the wrapper classes. It makes sense to hide this behind factory methods as well.

expr/java/VM.java

      /** factory method: create integer value.
        */
      public Number Long (String value) {
        return new Long(value);
      }

VM is a typical, completely polymorphic interpreter: literals are stored so that they deliver values of arbitrary types, operators combine values in arbitrary types. expr.java.Go demonstrates that one can select a datatype after an expression has been compiled and obtain different evaluation results.

The advantage of such an interpreter is of course that no semantic check is required. However, the polymorphic behaviour is relatively expensive and is only available in interpreters. In principle one could select and bind to specific types during code generation.

3.5 Interpreting a Tree

Go reads an ArrayList with Number objects from standard input and evaluates it using different types.

expr/java/Go.java

    package expr.java;
    import java.io.ObjectInputStream;
    import java.util.ArrayList;
    /** executes arithmetic expressions from standard input.
      */
    public class Go {
      /** loads an <tt>ArrayList</tt> with <tt>Number</tt> elements
          from standard input and evaluates them.
          @param args ignored
        */
      public static void main (String args []) throws Exception {
        ObjectInputStream in = new ObjectInputStream(System.in);
        ArrayList lines = (ArrayList)in.readObject();
        System.out.println("byte\tshort\tint\tlong\tfloat\tdouble");
        for (int i = 0; i < lines.size(); ++ i) {
          Number n = (Number)lines.get(i);
          assert n != null;
          System.out.println(n.byteValue()+"\t"+n.shortValue()
            +"\t"+n.intValue()+"\t"+n.longValue()
            +"\t"+n.floatValue()+"\t"+n.doubleValue());
        }
      }
    }

Persistent objects are written with an ObjectOutputStream and read with an ObjectInputStream. The result of readObject() has to be cast to the appropriate class.

The stream only contains the object data; the class files have to be reachable through the CLASSPATH or by way of a ClassLoader, otherwise there will be exceptions. RMI uses MarshalledObject to provide a way to load the class files from a server prior to reviving the object.

If an ObjectInputStream is read until end of file, an EOFException has to be caught.
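As an illustration of the last point, the sketch below writes several objects individually and reads them back until the stream is exhausted. This differs from Go, which reads a single ArrayList; the class PersistenceSketch and its methods are hypothetical, and a byte array stands in for standard output and standard input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: persistent Number objects, read until end of file. */
class PersistenceSketch {
    /** Writes each value as a separate object. */
    static byte[] storeAll(List<? extends Number> values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Number n : values) out.writeObject(n);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }                                  // closing ensures the last buffer is written
        return bytes.toByteArray();
    }

    /** Reads objects until EOFException signals the end of the stream. */
    static List<Number> loadAll(byte[] data) {
        List<Number> result = new ArrayList<>();
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                result.add((Number) in.readObject());  // cast to the expected class
            }
        } catch (EOFException e) {
            // end of stream: this is the regular way out of the loop
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    /** Round trip used for demonstration. */
    static String demo() {
        return loadAll(storeAll(Arrays.asList(Long.valueOf(5), Double.valueOf(36.8))))
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());        // [5, 36.8]
    }
}
```

Writing one ArrayList, as Expression does, avoids the exception because the reader knows that exactly one object is present.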

3.6 Summary

StreamTokenizer is useful to quickly construct scanners as long as an object can be configured as needed. The example demonstrates that differentiating integers and floating point values is difficult. Moreover, it is difficult to support operators consisting of more than one character.

Recursive descent is useful to quickly construct a recognizer if no tools can be used and if it is clear that the grammar is suitable. The resulting parser is usually hard to maintain.

Left recursion cannot be implemented using recursive descent:

sum: product | sum '+' product | sum '-' product;

A naively programmed recognizer would go into a recursive loop.

Right recursion is feasible but it results in trees that interpret operators as right associative and is therefore not very suitable for arithmetic expressions.
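The effect on associativity can be shown with subtraction. The sketch below is deliberately minimal and rests on assumptions that are not in the original: Numbers are single digits, the input is well formed, and the names AssociativitySketch, iterative, and rightRecursive are hypothetical.

```java
/** Sketch only: the same input evaluated with iteration and right recursion. */
class AssociativitySketch {
    /** sum: Number ('-' Number)*; iteration groups to the left. */
    static long iterative(String input) {
        int pos = 0;
        long result = input.charAt(pos++) - '0';
        while (pos < input.length() && input.charAt(pos) == '-') {
            pos++;                                    // dispose of '-'
            result -= input.charAt(pos++) - '0';      // combine with what is there
        }
        return result;
    }

    /** sum: Number | Number '-' sum; right recursion groups to the right. */
    static long rightRecursive(String input, int pos) {
        long left = input.charAt(pos) - '0';
        if (pos + 1 < input.length() && input.charAt(pos + 1) == '-') {
            return left - rightRecursive(input, pos + 2);  // rest is evaluated first
        }
        return left;
    }

    public static void main(String[] args) {
        System.out.println(iterative("8-3-2"));          // 3, i.e. (8-3)-2
        System.out.println(rightRecursive("8-3-2", 0));  // 7, i.e. 8-(3-2)
    }
}
```

This is why sum() and product() are written as loops: the EBNF repetition keeps the usual left associative meaning without requiring left recursion.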

As another example:

output: 'write' sum ( ',' sum )* ','? ';' ;

The intention is for a trailing comma to suppress the output of a line separator. Recursive descent cannot decide following a comma whether to search for a sum or for a trailing semicolon. It takes looking at two input symbols to decide — the grammar is not suitable for recursive descent with one input symbol lookahead.

Serialization is an elegant technique to store an interpreter ready to be used or to send data from a compiler to an analyzer or generator program. If implemented carefully, execution (VM) can be completely independent of analysis (Expression).

4 LR Parsing

Language recognition is usually implemented by processing a grammar with a parser generator. Most parser generators require a scanner or a screener to extract symbols from a sequence of input characters. This chapter discusses one of the classic parser generators, the next chapter describes a scanner generator.

4.1 The Idea

Knuth invented a technique for parser generation which Horning explained very well in an article in Compiler Construction, An Advanced Course.

extra: ^ sum EndOfInput;

A rule is added to the grammar which consists of the start symbol followed by a terminal for the end of input. The position before the start symbol is marked. For simplicity, rules are specified in BNF but with only a single alternative on the right hand side; there can be several rules for each nonterminal.

A marked rule is called a configuration. This first configuration is expanded into the start state of the parser. The end state requires the mark to be positioned past the end of input symbol.

sum: ^ product;
sum: ^ product '+' sum;
sum: ^ product '-' sum;
product: ^ term;
product: ^ term '*' product;
product: ^ term '%' product;
product: ^ term '/' product;
term: ^ '+' term;
term: ^ '-' term;
term: ^ '(' sum ')';
term: ^ Number;

The mark was in front of a nonterminal; therefore in a transitive operation all rules for that nonterminal are added and marked at the beginning.
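The transitive operation can be sketched in Java as a small work list algorithm. This is only an illustration under assumptions: the class Closure, the encoding of a configuration as "rule index:mark position", and the helper show() are hypothetical and not taken from any parser generator. Applied to the first configuration, it produces that configuration plus the eleven marked rules listed above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: expand configurations into a state by transitively adding rules. */
class Closure {
    /** one alternative per rule: element 0 is the nonterminal, the rest is the right hand side. */
    static final String[][] RULES = {
        { "extra", "sum", "EndOfInput" },
        { "sum", "product" },
        { "sum", "product", "'+'", "sum" },
        { "sum", "product", "'-'", "sum" },
        { "product", "term" },
        { "product", "term", "'*'", "product" },
        { "product", "term", "'%'", "product" },
        { "product", "term", "'/'", "product" },
        { "term", "'+'", "term" },
        { "term", "'-'", "term" },
        { "term", "'('", "sum", "')'" },
        { "term", "Number" },
    };

    /** a configuration is encoded as "rule:mark"; mark counts the symbols to its left. */
    static Set<String> closure(Set<String> start) {
        Set<String> state = new LinkedHashSet<>(start);
        List<String> todo = new ArrayList<>(start);
        while (!todo.isEmpty()) {
            String[] c = todo.remove(todo.size() - 1).split(":");
            String[] rule = RULES[Integer.parseInt(c[0])];
            int mark = Integer.parseInt(c[1]);
            if (mark + 1 >= rule.length) continue;   // complete configuration
            String next = rule[mark + 1];            // symbol following the mark
            for (int r = 0; r < RULES.length; ++r)   // terminals match no rule
                if (RULES[r][0].equals(next) && state.add(r + ":0"))
                    todo.add(r + ":0");
        }
        return state;
    }

    /** formats a configuration in the notation used on the slides. */
    static String show(String configuration) {
        String[] c = configuration.split(":");
        String[] rule = RULES[Integer.parseInt(c[0])];
        int mark = Integer.parseInt(c[1]);
        StringBuilder s = new StringBuilder(rule[0]).append(":");
        for (int i = 1; i < rule.length; ++i) {
            if (i == mark + 1) s.append(" ^");
            s.append(" ").append(rule[i]);
        }
        if (mark + 1 == rule.length) s.append(" ^");
        return s.append(";").toString();
    }

    public static void main(String[] args) {
        for (String c : closure(Collections.singleton("0:0")))
            System.out.println(show(c));
    }
}
```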

A set of configurations is called a state.

Transitions and new states result by moving the mark individually in each configuration across one symbol and transitively adding more rules as above. Equal states are combined.

The set of states is finite. The result is a matrix of states and transitions triggered by symbols.

term: Number ^;

Once the mark reaches the end of a rule, the configuration is termed complete. If a state contains more than one complete configuration there is a conflict. Horning explains how conflicts can be circumvented by considering lookahead.

The matrix can be computed ahead of time and the grammar can be checked for suitability (no conflicts). A terminal causes a transition; the new state is placed on the stack. Note that this corresponds to accepting the terminal from many states. Once a complete configuration is reached a rule has been recognized. At this point a number of states corresponding to the rule length are removed from the stack, thus uncovering the state from which the rule was started. Now a transition is made with the rule's nonterminal.

http://jhorning4.home.attbi.com/pro-home.html

http://www-cs-faculty.stanford.edu/~knuth/

26

4.2 yacc and bison

yacc [Johnson, 1978] and bison [Corbett and Stallman] both accept a table of grammar rules in BNF and actions written in C or a dialect such as Objective C or C++, and construct a C function to recognize a sentence.

If some input is acceptable to the right hand side of a rule the corresponding action is executed. One can think of the action as an observer for the rule.

A rule acts as a pattern to select part of the input and carry out the action on it; however, because the right hand side can contain nonterminals, the rules can call each other.

yacc and bison are grammar checkers because they test if a grammar is suitable for the parsing technique (LALR(1), not quite LR(1)). If it is, the grammar is not ambiguous.

Even if the table contains no actions, yacc and bison will produce a function that together with a scanner can check if an input sequence is a sentence for the grammar.

4.3 CUP

CUP [Hudson et al. 1996] is a yacc look-alike implemented in Java. It constructs a parser class in Java. In the current version 0.10k error recovery seems to work correctly.

Unlike yacc, however, CUP has a fairly bizarre input syntax and requires the scanner to pass specific Symbol objects to the recognizer which may not be reused.

http://dinosaur.compilertools.net/

http://www.gnu.org/software/bison/bison.html

http://www.cs.princeton.edu/~appel/modern/java/CUP/

27

4.4 jay

jay is yacc, still implemented in C but retargeted to Java. The sources were taken from BSD-Lite, the algorithm was not modified. jay generates a Java program that can optionally contain an improved trace. Actions are written in Java.

jay has been retargeted to C# by the Mono project.

jay acts as a filter for a program skeleton which is compiled conditionally. The skeleton could be modified to change the interface and packaging of the resulting program.

Input

Input to yacc or jay consists of three parts which are separated by lines containing %%.

Terminals have to be declared in the first part using %token statements; only single character literals can be specified directly.

Parallel to the state stack yacc and jay maintain a value stack. jay stacks Object references, yacc stacks int values. Other classes or types can also be specified in the first part of the input using %token and %type statements.

The second part contains the grammar rules in BNF and the actions. An action normally follows an alternative of a rule, but it can also be embedded. An anonymous nonterminal is then generated in place of the action.

expr/jay/Expression.jay

%token <String> Int, Real
%type <Number> line, sum, product, term
%%

line : sum                   // $$ = $1
     | /* null */            { $$ = null; }

sum : product                // $$ = $1
    | sum '+' product        { $$ = vm.Add($1, $3); }
    | sum '-' product        { $$ = vm.Sub($1, $3); }

product : term               // $$ = $1
        | product '*' term   { $$ = vm.Mul($1, $3); }
        | product '/' term   { $$ = vm.Div($1, $3); }
        | product '%' term   { $$ = vm.Mod($1, $3); }

term : '+' term              { $$ = $2; }
     | '-' term              { $$ = vm.Minus($2); }
     | '(' sum ')'           { $$ = $2; }
     | Int                   { $$ = vm.Long($1); }
     | Real                  { $$ = vm.Double($1); }

In an action $$ references the value that will be pushed for the nonterminal onto the value stack. $i references the value that is on the stack for the i-th symbol in the alternative. The default action is $$ = $1; i.e., the value corresponding to the first symbol is passed up. The actions above build an interpreter tree.

http://www.cs.rit.edu/~ats/projects/jay/

http://www.go-mono.org

28

Operation

yacc and jay generate a pushdown automaton (PDA). The current terminal and the state on the stack select the operation of the PDA.

If -v is specified, yacc and jay describe the PDA in a file y.output.

yacc and jay primarily compute the transition matrix — this is not necessarily possible for an arbitrary grammar.

Consequences

Every symbol in an alternative of a rule corresponds to a slot on the state stack and a parallel slot on the value stack.

The state stack contains states, the value stack can be used by the actions.

For yacc the scanner places a value into a global variable yylval which is pushed onto the value stack during a shift operation.

For jay the object returned by sending value() to the scanner is pushed.

An action is executed during a reduce operation. It can access the value stack using $i and the value corresponding to the nonterminal for the upcoming goto operation using $$.

yacc defines int as the type for value stack elements, jay uses Object. These types can be changed for specific nonterminals with %type and for terminals with %token statements. yacc requires a %union statement to list alternative type names, jay permits any class or interface name.

Individual references can also be typed using the bizarre syntax $<type>i. Even $<type>$ might be needed for yacc but definitely not for jay.

operation      effect
shift state    the current terminal is accepted, state is put on the stack.
reduce rule    the stacks contain a phrase (sequence of states) which can be replaced by the nonterminal of rule; the syntax tree grows, the phrase is removed from both stacks.
goto state     a nonterminal created by reduce is accepted, state is put on the stack.

[Figure: the scanner turns characters into terminals; the selected operation works on the state stack and the parallel value stack.]
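The interplay of shift, reduce, and goto on the two parallel stacks can be tried out with a hand-built automaton. The sketch below is not output of yacc or jay: the class TinyPda, its state numbering, and the restriction to single-digit numbers are assumptions made for illustration. It handles the small grammar extra: sum EndOfInput; sum: Number; sum: sum '+' Number; and keeps a value slot for every state slot.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: hand-coded PDA for  sum: Number | sum '+' Number  with a value stack. */
class TinyPda {
    static final char END = '\0';   // stands for EndOfInput

    /** goto operation: transition with nonterminal sum from an uncovered state. */
    static int gotoSum(int state) {
        if (state == 0) return 2;
        throw new IllegalStateException("no goto from state " + state);
    }

    static long parse(String input) {
        Deque<Integer> states = new ArrayDeque<>();
        Deque<Long> values = new ArrayDeque<>();
        states.push(0); values.push(0L);        // start state, value slot unused
        int pos = 0;
        for (;;) {
            char t = pos < input.length() ? input.charAt(pos) : END;
            boolean number = t >= '0' && t <= '9';
            int state = states.peek();
            switch (state) {
            case 0: case 3:                     // shift Number
                if (!number)
                    throw new IllegalArgumentException("Number expected at " + pos);
                states.push(state == 0 ? 1 : 4);
                values.push((long) (t - '0'));
                ++pos;
                break;
            case 1: {                           // reduce sum: Number, $$ = $1
                long result = values.pop(); states.pop();
                states.push(gotoSum(states.peek()));
                values.push(result);
                break;
            }
            case 2:                             // accept or shift '+'
                if (t == END) return values.peek();
                if (t != '+')
                    throw new IllegalArgumentException("'+' expected at " + pos);
                states.push(3); values.push(0L); ++pos;
                break;
            case 4: {                           // reduce sum: sum '+' Number, $$ = $1 + $3
                long right = values.pop(); values.pop(); long left = values.pop();
                states.pop(); states.pop(); states.pop();
                states.push(gotoSum(states.peek()));
                values.push(left + right);
                break;
            }
            default:
                throw new IllegalStateException("state " + state);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1+2+3"));
    }
}
```

Each reduce removes as many slots as the rule has symbols, which uncovers the state from which the rule was started; the goto then pushes the new state together with the value computed for $$.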

29

Scanner Interface

The third part of the input to yacc and jay can contain almost arbitrary C or Java code. This is a good place to implement the scanner: yacc requires a function yylex() which must return characters or %token values and zero at end of input, jay prescribes an interface yyInput with the methods advance(), token(), and value().


%%
  /** lexical analyzer for arithmetic expressions. */
  public static class Scanner extends StreamTokenizer implements yyInput {
    public Scanner (Reader r) {
      super(r);
      resetSyntax();
      commentChar('#');          // comments from # to end-of-line
      wordChars('0', '9');       // parse decimal numbers as words
      wordChars('.', '.');
      whitespaceChars(0, ' ');   // ignore control-* and space
      eolIsSignificant(true);    // need '\n'
    }
    /** moves to next input token.
        Consumes end of line and pretends (once) that it is end of file.
        @return false at end of file and once at each end of line.
      */
    public boolean advance () throws IOException {
      if (ttype != TT_EOF) nextToken();
      return ttype != TT_EOF && ttype != TT_EOL;
    }
    /** determines current input, sets value to String for Int and Real.
        @return Int, Real or token's character value.
      */
    public int token () {
      switch (ttype) {
      case TT_EOL:
      case TT_EOF:   assert false;   // should not happen
      case TT_WORD:  value = sval;
                     return sval.indexOf(".") < 0 ? Int : Real;
      default:       value = null;
                     return ttype;
      }
    }
    /** value associated with current input. */
    protected Object value;
    /** produces value associated with current input.
        @return value.
      */
    public Object value () {
      return value;
    }
  }
} // end of class Expression

30

Wrap-Up

yacc creates a function yyparse() which returns 0 if successful. jay is required to create a Java source file, i.e., a class definition. The class starts in a code section delimited by %{ and %} in the first part of the input and ends with a single } at the end of the third part.


%{
  package expr.jay;
  import java.io.InputStreamReader;
  import java.io.IOException;
  import java.io.Reader;
  import java.io.ObjectOutputStream;
  import java.io.StreamTokenizer;
  import java.util.ArrayList;
  import expr.java.VM;
  import jay.yydebug.yyAnim;           // needed for animation only
  import jay.yydebug.yyDebugAdapter;   // needed for trace only
  /** recognizes, stores, and evaluates arithmetic expressions. */
  public class Expression {
    /** node factory. */
    protected static VM vm = new VM();
    /** reads lines from standard input, parses, and evaluates them
        or writes them to standard output.
        @param args <tt>-c</tt> to write an <tt>ArrayList</tt>,<br>
                    <tt>-t</tt> to write a trace,<br>
                    <tt>0..3</tt> to provide an animation.
      */
    public static void main (String args []) throws Exception {
      boolean cflag = args.length > 0 && args[0].equals("-c");
      ArrayList lines = cflag ? new ArrayList() : null;
      Object trace = null;
      if (!cflag && args.length > 0)
        if (args[0].equals("-t"))
          trace = new yyDebugAdapter();
        else
          trace = new yyAnim("Expression", Integer.parseInt(args[0]));
      Expression parser = new Expression();
      Scanner scanner = new Scanner(new InputStreamReader(System.in));
      while (scanner.ttype != scanner.TT_EOF)
        try {
          Number n = (Number)parser.yyparse(scanner, trace);
          if (n != null)
            if (cflag) lines.add(n);
            else System.out.println(n.floatValue());
        } catch (yyException ye) {
          System.err.println(scanner+": "+ye);
        }
      if (cflag) {
        ObjectOutputStream out = new ObjectOutputStream(System.out);
        out.writeObject(lines);
        out.close();
      }
    }
%}

31

The code section can start with package and import statements. It must at least contain the header of the class in which jay will create the method yyparse(), a few statically nested classes, and int constants for the %token names.

yyparse() needs a yyInput object that advances in the input upon advance() and returns the next terminal for token() and the corresponding value for value(). The value assigned to $$ in the action for the start symbol is returned by yyparse(). The interface yyInput does not depend on the parser class but it is generated as a member for convenience.

The value stack contains Object references. %type and %token can be used to declare more specific classes for nonterminals and terminals which are referenced through $i.

%type <A> nonterminal
%token <A> terminal
%%

nonterminal: terminal { $$ = $i; }       // assigns class A to Object
nonterminal: terminal                    // executes $$ = yyDefault($1), defined to return $1
nonterminal: terminal { $$ = $<>i; }     // assigns Object to Object
nonterminal: terminal { $$ = $<B>i; }    // assigns class B to Object

yacc interprets the names in <> as alternatives of a union on the value stack which is declared using %union in the first part of the input.

Both yacc and jay perform type checking once any type is declared explicitly.

32

4.5 The Algorithm

The following Nassi-Shneiderman diagram [based on Schreiner, unix/mail 3/91] describes the parser generated by jay and points out how trace (interface jay.yydebug.yyDebug) and error recovery are accomplished.

An object implementing yyDebug can be passed to yyparse() as a second argument. There are two suitable implementations, described below; one of them provides a graphical animation of the parsing algorithm.

[Nassi-Shneiderman diagram, rendered as an outline]

yyLoop:
  yyDebug.push
  yyDiscarded:
    yyInput.advance, yyDebug.lex
    operation from table:
      shift:
        -- yyErrorFlag (to 0)
        yyDebug.shift
        continue yyLoop
      accept:
        yyDebug.accept, return last $$
      error, depending on yyErrorFlag:
        0:
          yyerror(), yyDebug.error
          then as for 1,2
        1,2:
          yyErrorFlag = 3
          until stack empty:
            if error is acceptable on top of stack:
              yyDebug.shift error
              continue yyLoop
            yyDebug.pop
          yyDebug.reject, yyException
        3:
          if eof reject, yyException
          yyDebug.discard
          continue yyDiscarded
      reduce:
        yyDebug.reduce
        { action }
        yyDebug.shift // goto
        continue yyLoop

33

4.6 Precedence

yacc and jay can deal with some ambiguous grammars. In particular, arithmetic expressions can be described ambiguously; a table specifies associativity and increasing precedence to disambiguate the rules.

%left '+' '-'        // left associative
%right '^'           // right associative, higher precedence
%nonassoc DUMMY      // not associative, still higher precedence
%%

expr: expr '+' expr
    | expr '^' expr
    | '-' expr %prec DUMMY   // explicit precedence

%nonassoc could be used for comparisons. Here it serves to illustrate that a placeholder name can be inserted into the precedence table to be referenced by a specific precedence for a rule.

Using a precedence table can be dangerous: wherever yacc and jay can resolve a conflict with the table they do so silently and do not report the underlying ambiguity in the grammar.
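How a precedence table settles a shift/reduce conflict can be sketched as a small decision function. This follows the rule described in the yacc documentation (compare the precedence of the rule with that of the lookahead terminal and fall back on associativity if they are equal); the class Precedence, its enums, and the numeric levels are hypothetical names chosen for this sketch and are not part of yacc or jay.

```java
/** Sketch: resolving a shift/reduce conflict with a precedence table. */
class Precedence {
    enum Assoc { LEFT, RIGHT, NONASSOC }
    enum Decision { SHIFT, REDUCE, ERROR }

    /** @param rulePrec precedence level of the complete rule (from its last
                        terminal or from %prec).
        @param tokenPrec precedence level of the lookahead terminal.
        @param tokenAssoc associativity declared for the lookahead terminal.
      */
    static Decision resolve(int rulePrec, int tokenPrec, Assoc tokenAssoc) {
        if (tokenPrec > rulePrec) return Decision.SHIFT;    // terminal binds tighter
        if (tokenPrec < rulePrec) return Decision.REDUCE;   // rule binds tighter
        switch (tokenAssoc) {                                // same level
        case LEFT:  return Decision.REDUCE;
        case RIGHT: return Decision.SHIFT;
        default:    return Decision.ERROR;
        }
    }

    public static void main(String[] args) {
        // levels for the table above: '+' '-' are 1, '^' is 2, DUMMY is 3
        System.out.println(resolve(1, 1, Assoc.LEFT));    // expr '+' expr, next is '+'
        System.out.println(resolve(1, 2, Assoc.RIGHT));   // expr '+' expr, next is '^'
        System.out.println(resolve(2, 2, Assoc.RIGHT));   // expr '^' expr, next is '^'
        System.out.println(resolve(3, 2, Assoc.RIGHT));   // '-' expr %prec DUMMY, next is '^'
    }
}
```

With these levels a + b + c reduces the first sum before shifting the second '+', a ^ b ^ c keeps shifting, and a unary minus is reduced before any binary operator is considered.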

4.7 Conflicts

shift/reduce

If a state contains complete and incomplete configurations the parser can reduce a nonterminal prior to accepting a terminal or it can still accept the terminal and progress to another state.

statement: IF condition THEN statement
         | IF condition THEN statement ELSE statement

This is the infamous dangling else problem: statement can be reduced without the ELSE or the ELSE can be shifted.

yacc and jay report a shift/reduce conflict and proceed to shift in order to accept the longest phrase possible. This is desirable in most cases.

if (a < b)
then if (c < d)
     then ...

An else is now shifted, i.e., it becomes part of the innermost if.

reduce/reduce

If a state contains several complete configurations the parser can reduce one of several nonterminals prior to accepting a terminal.

statement: Variable '=' condition
         | Variable '=' expression

condition: expression
         | expression '<' expression

Here, either one of the two alternatives for statement can be reduced. yacc and jay report a reduce/reduce conflict and proceed to reduce using the first alternative. This need not be the correct choice.

34

4.8 Animation

If the command line option -t is specified jay generates a parser which accepts an additional object as a parameter for yyparse(). This object must implement jay.yydebug.yyDebug, intended to trace the execution of the parser. (yyparse() does not specify yyDebug in its signature so that the package is not needed for an untraced parser.)

jay.yydebug.yyDebugAdapter implements a trace reported as diagnostic output.

jay.yydebug.yyAnim is a Frame implementing a graphical trace inspired by Holub's Compiler Design in C. During construction the bits IN and OUT control whether a TextArea is added which acts as a terminal screen for input and output.

If continue is clicked the parser continues until there is an output into an area where the Checkbox is set.

Older implementations of the JVM tend to block completely while waiting for keyboard input; in this case IN must definitely be set so that input is obtained from a TextArea.

yyAnim contains a yyAnimPanel, which receives yyDebug messages, and possibly a TextArea with Checkbox. If required, a yyInputStream is entered as a KeyListener and set as System.in and a yyPrintStream, which sends output to the TextArea, is set as System.err and System.out. Additional threads are not needed because inputs happen in the event thread and the parser produces output in the main thread; therefore the animation can be paused through thread synchronization.

The building blocks are suitable for an unsigned applet; however, a separate thread is required for the parser and the scanner must be connected to the streams directly.

35

4.9 Error Recovery

If the next terminal does not fit the state on the stack and if a reduction is not possible an input error results.

yyparse() now assumes a fictitious terminal error and calls yyerror(). jay creates a parser that tries to pass a String array of acceptable terminals to be included in an error message and there is a suitable default implementation of the method.

If necessary the stack is popped until error can be accepted or until the stack is empty.

In the first case recognition continues with error just like a normal terminal. In the second case yyparse() terminates in Java with a yyException and in C returns a non-zero value.

Following error there have to be three successful shift operations before another error triggers another call of yyerror().

This can be abbreviated by calling yyerrok; in C or setting yyErrorFlag=0; in Java. However, this can lead to an infinite loop if it is done in an action itself triggered by error.

Principles

The grammar has to be extended with rules containing error in plausible positions.

These rules should be close to the start symbol — as an ultimate safeguard.

These rules should be far away from the start symbol — to recover as soon as possible.

It turns out that a good strategy is to make iterations as robust as possible. Details are explained in Schreiner/Friedman Introduction to Compiler Construction with Unix.

36

Examples

Recover is part of the jay distribution. It reads lines with different kinds of sequences and demonstrates error recovery for typical iterations. By activating a trace one can check if terminals are ignored during error recovery.

%%
line: // null
    | line OPT opt '\n'    { yyErrorFlag = 0; System.out.println("opt"); }
    | line SEQ seq '\n'    { yyErrorFlag = 0; System.out.println("seq"); }
    | line LIST list '\n'  { yyErrorFlag = 0; System.out.println("list"); }

Optional sequence

Recover accepts lines starting with opt optionally followed by one or more words. Inputting a comma would be an error.

opt: // null
   | opt WORD   { yyErrorFlag = 0; }
   | opt error

Sequence

Recover accepts lines starting with seq followed by one or more words. Inputting no word or a comma would be an error.

seq: WORD
   | seq WORD   { yyErrorFlag = 0; }
   | error
   | seq error

List

Recover accepts lines starting with list followed by one or more words separated by commas. Inputting no word or omitting a comma would be an error.

list: WORD
    | list ',' WORD     { yyErrorFlag = 0; }
    | error
    | list error
    | list error WORD   { yyErrorFlag = 0; }
    | list ',' error

The general principle is to insert error in every place in an iteration where a terminal could appear. yyErrorFlag=0; is specified exactly following a terminal. One can check that all alternatives are in fact necessary.

Caution: older implementations of yacc (and CUP) had a defective error recovery logic which this technique provoked to execute very strange reductions.

37

4.10 Summary

yacc and its descendants have very simple conventions and are more powerful than the LL(1) parser generators discussed next. The syntax for rules and actions is trivial. yacc's algorithm can cope with certain ambiguities and can use precedence tables; this simplifies the design of grammars and makes the parsers more efficient.

jay is implemented in C; the sources may be distributed under the Berkeley copyright. jay can be compiled on many platforms; however, it is not as portable as a reimplementation in Java such as CUP. On the other hand, creating jay from yacc took hardly any effort and it was definitely worthwhile to reuse a reliable implementation which has been available for many years. [However, Corbett implements %nonassoc differently from Johnson’s original. A trivial precedence table with a single %nonassoc is silently ignored.]

yacc and bison are at pains to permit having more than one parser in the same program. jay lets the user wrap yyparse() into a class; this keeps the namespace quite uncluttered.

In C the value stack can be a union. In Java the values must be objects; this is not quite as efficient. yyDefault() may have to be overridden to do a deep copy. The trivial action $$=$1; might not happen if the right hand side of a rule is empty. One will usually construct a tree, however, and thus the lack of a union in Java tends to be irrelevant.

yacc can use #line directives and force the C compiler to relate error messages directly to the source presented to yacc rather than to the source created by yacc; this is impossible with jay and Java.

Tracing and animation are much more elegantly implemented in jay and Java.

39

5 Lexical Analysis

It turns out that the symbol representations for most programming languages can be described by regular grammars, i.e., finite state automata are useful for extracting symbols from a sequence of characters.

It also turns out that search patterns as used by Unix commands as early as ed can also be implemented with finite state automata — a set of positions in the pattern is the current state. This is why patterns are usually referred to as regular expressions.

Consequently, patterns are a very convenient way to express the symbol representations for a programming language. This chapter deals with tools that convert patterns into functions for lexical analysis.

5.1 lex and flex

lex [Lesk, 1978] and flex [Paxson] each accept a table of egrep-like patterns and C statements and construct a C function yylex() to recognize and process pieces of the input.

If part of the input matches a pattern the corresponding C statement is executed. Unrecognized parts of the input are silently copied.

lex and flex are program generators: the patterns serve as a control structure to select actions. The programs are quite compatible but for the fact that flex buffers the input.

Example: Line Numbering

%{
/*
 * line numbering
 */
%}

%%

\n      ECHO;
^.*$    printf("%d\t%s", yylineno, yytext);

A lex program has three parts, separated by lines with %%. The first part can contain C code enclosed in %{ and %}, the third part is simply copied.

This program contains one pattern which copies newline characters and one pattern that recognizes the complete text on a line and prints it preceded by a line number.

$ lex -t num.l > num.c; cc -o num num.c -ll
$ flex -l -t num.l > num.c; cc -o num num.c -L/usr/local/gnu/lib -lfl

The libraries contain (among other things) a main program which calls the generated function yylex() once.

http://www.gnu.org/software/flex/

40

5.2 JLex

JLex [Berk 1996] is a reimplementation of lex in Java. The pattern/action syntax is fairly similar but the prolog caters to many Java-isms. JLex is a bit more parser-oriented than lex.

import java.io.IOException;

%%

%public
%class Num

%type void
%eofval{
  return;
%eofval}

%line

%{
  public static void main (String args []) throws IOException {
    new Num(System.in).yylex();
  }
%}

%%

\n      { System.out.println(); }
.*$     { System.out.println((yyline+1)+"\t"+yytext()); }

Again, there are three parts to the input, separated by lines with %%. The first part is copied, the second part contains configuration options and Java code within %{ and %}, and the third part is the pattern/action table.

Unlike lex, JLex does not support ^ to match the beginning of a line. If %line is specified as an option, yyline counts the input line numbers from 0. The text matching the pattern is available as a String returned by yytext(). All text must be recognized — a mismatch is reported as an error.

http://www.cs.princeton.edu/~appel/modern/java/JLex/

41

5.3 JLex Options

Options should start at the beginning of a line.

%public                       make generated class public
%class name                   defines name of generated class
%implements name              add one interface to class header
%{ %}                         add code to the class
%init{ %init}                 add code to the constructor
%initthrow{ %initthrow}       add list of exceptions to the constructor header
%type name                    defines return type of yylex()
%function name                redefines name of yylex()
%yylexthrow{ %yylexthrow}     add list of exceptions to yylex() header
%eofval{ %eofval}             add code to return value from yylex() at end of input
%integer %intwrap             use int or Integer as return types
%yyeof                        defines end of file constant for %integer
%eof{ %eof}                   add code to function executed at end of input
%eofthrow{ %eofthrow}         add list of exceptions to this code's header
name = definition             define text replacement for {name} in patterns
%state name, ...              define list of state names other than YYINITIAL
%char                         activates yychar for character position counting
%line                         activates yyline for line counting
%notunix                      recognizes \r or \n for newline
%full %unicode                uses 8-bit or 16-bit character set rather than ASCII
%ignorecase                   lets character classes match either case

5.4 Patterns

lex and JLex patterns are very similar to egrep:

abc                  non-whitespace characters denote themselves unless...
"abc"                characters in double quotes denote themselves
\"                   denotes double quote only within double quotes
\n \t \b \f \r       denote newline, tab, backspace, formfeed, and carriage return
\ooo \xhh \uhhhh     denote characters as octal, hexadecimal, and Unicode values
\^C                  denotes a control character
\x                   denotes any other character x
.                    matches any character but newline
[abd-x...]           matches any character from the class
[^abd-x...]          matches any character not from the class
$                    matches end of line
x*                   matches zero or more occurrences
x+                   matches one or more occurrences
x?                   matches zero or one occurrence
xy                   concatenation
x|y                  alternatives, lower precedence
(...)                grouping
{name}               reference to pattern defined in the options section

Ambiguities are resolved in favour of the longest match and the first pattern.
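The rule that ambiguities are resolved in favour of the longest match and, among equally long matches, the first pattern can be imitated with java.util.regex. This sketch only illustrates the selection rule; it does not show how lex or JLex work internally (they build a finite state automaton), and the class LongestMatch with its four patterns is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pick the longest match, and the first pattern among equals. */
class LongestMatch {
    static final String[] NAMES = { "keyword", "identifier", "integer", "other" };
    static final Pattern[] PATTERNS = {
        Pattern.compile("if"),
        Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile(".|\\n"),
    };

    /** @param pos position in input, must be less than input.length().
        @return "name:text" for the symbol starting at pos.
      */
    static String next(String input, int pos) {
        int best = -1, length = -1;
        for (int p = 0; p < PATTERNS.length; ++p) {
            Matcher m = PATTERNS[p].matcher(input).region(pos, input.length());
            // strictly longer only: an earlier pattern wins a tie
            if (m.lookingAt() && m.end() - pos > length) {
                best = p;
                length = m.end() - pos;
            }
        }
        return NAMES[best] + ":" + input.substring(pos, pos + length);
    }

    public static void main(String[] args) {
        System.out.println(next("if x", 0));
        System.out.println(next("iffy", 0));
        System.out.println(next("42+", 0));
    }
}
```

Here if is reported as a keyword because the keyword pattern is listed before the identifier pattern, while iffy is an identifier because the identifier pattern matches more characters.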

42

Typical Patterns

alpha = [a-zA-Z_]
alnum = [a-zA-Z_0-9]
oct = [0-7]
dec = [0-9]
hex = [0-9a-fA-F]
sign = [+-]?
exp = ([eE]{sign}{dec}+)
L = [lL]
X = [xX]

"/*"([^*]|"*"+[^/*])*"*"+"/"                     Java comment
"//".*$

"{"[^}]*"}"                                      Pascal comment
"(*"([^*]|"*"+[^*)])*"*"+")"

\"([^\"\\\n]|\\.|\\\n)*\"                        string (mostly)
'([^'\n]|'')+'                                   Pascal string

0{oct}+                                          octal
0{oct}+{L}                                       octal long
0{X}{hex}+                                       hexadecimal
0{X}{hex}+{L}                                    hexadecimal long
{dec}+                                           decimal
{dec}+{L}                                        decimal long

{dec}+"."{dec}*{exp}?|{dec}*"."{dec}+{exp}?|{dec}+{exp}     floating point

'([^'\\\n]|\\[^0-7\n]|\\[0-7][0-7]?[0-7]?)'      character

{alpha}{alnum}*                                  identifier

.|\n                                             rest, any one character

States

A pattern can be preceded by a list of state names in angle brackets. The pattern is then only recognized if one of the states was entered by calling yybegin() with a state name. The initial (and often only) state is YYINITIAL.

States can be used to apply essentially separate scanners to parts of the input. For an extensive example look at the scanner of jag.

http://www.cs.rit.edu/~ats/projects/jag/

43

5.5 JLex and jay

If a StreamTokenizer is not sufficient, JLex is the tool of choice to implement a scanner for a parser generated, e.g., with jay.

The patterns describe input symbols and the actions arrange for the information that the parser expects. It is a good idea to specify %line and implement a position indication for error messages.

expr/jlex/Scanner.lex

package expr.jlex;
/** scanner for arithmetic expressions. */
%%
%public
%class Scanner
%implements Expression.yyInput
%{
  /** returned by {@link #token()}. */
  protected int token;
  /** returned by {@link #value()}. */
  protected Object value;
  /** current input symbol. */
  public int token () {
    return token;
  }
  /** null or string associated with current input symbol. */
  public Object value () {
    return value;
  }
  /** position for error message. */
  public String toString () {
    return "("+(yyline+1)+")";
  }
%}
%type boolean
%function advance
%eofval{
  return false;
%eofval}
%line
comment = ("#".*)
space = [\ \t\b\015]+
digit = [0-9]
integer = {digit}+
real = ({digit}+"."{digit}*|{digit}*"."{digit}+)
%%
{space}     { }
{comment}   { }
{integer}   { token = Expression.Int; value = yytext(); return true; }
{real}      { token = Expression.Real; value = yytext(); return true; }
\n|.        { token = yytext().charAt(0); value = null; return true; }

44

The parser now has to perform error recovery on lines. main() is changed to connect to the JLex scanner and yyerror() is overridden to include the position indication.

expr/jlex/Expression.jay

%{
  ...
  /** reads lines from standard input, parses, and evaluates them
      or writes them to standard output.
      @param args <tt>-c</tt> to write an <tt>ArrayList</tt>.
    */
  public static void main (String args []) throws Exception {
    boolean cflag = args.length > 0 && args[0].equals("-c");
    ArrayList lines = cflag ? new ArrayList() : null;
    final Scanner scanner = new Scanner(new InputStreamReader(System.in));
    Expression parser = new Expression(lines) {
      public void yyerror (String message, String[] expecting) {
        System.err.print(scanner+" ");
        super.yyerror(message, expecting);
      }
    };
    parser.yyparse(scanner);
  ...
  /** set up to collect or evaluate sums.
      @param lines null or container
    */
  public Expression (List lines) {
    this.lines = lines;
  }
  /** null or container to collect sums. */
  protected final List lines;
%}
...
%%
lines : /* null */
      | lines '\n'
      | lines sum '\n'   { if (lines != null) lines.add($2);
                           else System.out.println($2.floatValue()); }
      | lines error
      | lines sum error  { if (lines != null) lines.add($2);
                           else System.out.println($2.floatValue()); }

$ java -jar jlex.jar Scanner.lex
$ mv Scanner.lex.java Scanner.java
$ javac Scanner.java # JLex needs no runtime support

5.6 Summary

JLex is implemented in Java and available on many platforms. The generator algorithm is very powerful and still simple to use, not only for scanners but also for filters that work on character streams outside line boundaries. JLex can manage character and line positions in the input.

Unfortunately, the Java connection requires a multitude of keywords to influence the generated class.

45

6 Visitors

6.1 Design Pattern

The Visitor design pattern is intended to introduce the occupants of a container to a visitor object. It usually requires two methods. The container and/or its elements implement something like

public void visit (Visitor visitor, Object data) { ... }

so that the visitor can be introduced to the container or an element. A Visitor implements something like

public void visit (Object element, Object data) { ... }

so that the element can be introduced to the visitor object. data provides an opportunity to pass extra information along the visit.

If the container has a linked structure and if the visitor knows about element connections it can implement its visit() to pursue its own path through the container. This does, however, defeat the idea that a container conceals its interior structure.

6.2 Divide and Conquer

A syntax tree or an interpreter tree can be viewed as a container with element objects from many classes. If a visitor is charged with evaluation or code generation it is likely to apply different algorithms to different kinds of nodes. The following interface tries to simplify decision making:

expr/visitor/VM.java

  public interface Visitor {
    Object visit (Node node, Object data);
    Object visit (Add node, Object data);
    Object visit (Sub node, Object data);
    Object visit (Mul node, Object data);
    Object visit (Div node, Object data);
    Object visit (Mod node, Object data);
    Object visit (Minus node, Object data);
    Object visit (Long node, Object data);
    Object visit (Double node, Object data);
    /** indicates the visit is over. */
    Object done (Object data);
  }

Each tree node class is sent to the Visitor with a separate method. This tends to separate concerns, but it has the unfortunate effect that the visitor is very closely tied to the complete set of classes used in the tree.
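The double dispatch behind this design can be seen in a reduced, self-contained form. The following sketch is not the course's VM class: VisitorSketch, Num, and Eval are hypothetical names, and only two node classes are modelled. Each node's visit() calls back the visitor method overloaded for its own class, so the visitor needs no instanceof tests.

```java
/** Sketch: a visitor with one method per node class, used for evaluation. */
class VisitorSketch {
    interface Visitor {
        Object visit(Add node, Object data);
        Object visit(Num node, Object data);
    }

    abstract static class Node {
        /** each subclass forwards to the visitor method for its own class. */
        abstract Object visit(Visitor visitor, Object data);
    }

    static class Add extends Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        Object visit(Visitor visitor, Object data) { return visitor.visit(this, data); }
    }

    static class Num extends Node {
        final long value;
        Num(long value) { this.value = value; }
        Object visit(Visitor visitor, Object data) { return visitor.visit(this, data); }
    }

    /** evaluates a tree; every visit returns a Long. */
    static class Eval implements Visitor {
        public Object visit(Add node, Object data) {
            long left = (Long) node.left.visit(this, data);
            long right = (Long) node.right.visit(this, data);
            return left + right;      // boxed to Long
        }
        public Object visit(Num node, Object data) {
            return node.value;        // boxed to Long
        }
    }

    static long eval(Node tree) {
        return (Long) tree.visit(new Eval(), null);
    }

    public static void main(String[] args) {
        System.out.println(eval(new Add(new Num(1), new Add(new Num(2), new Num(3)))));
    }
}
```

Adding a node class to such a tree means adding a method to the Visitor interface and to every implementation, which is the close coupling mentioned above.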

46

6.3 Nodes

Node classes are quite trivial to implement. The architecture is similar to Interpreter Trees, page 20.


public class VM {
  /** defines access to subtrees.
      Needs to be <tt>public</tt> for parsers in other packages.
    */
  public abstract static class Node extends ArrayList {
    /** subclass must implement this as
        <tt>return visitor.visit(this, data);</tt>
        to achieve switch into proper visitor method.
      */
    public abstract Object visit (Visitor visitor, Object data);
    /** convenience method to visit a subtree.
        @param n index of subtree.
      */
    public Object visit (int n, Visitor visitor, Object data) {
      return ((Node)get(n)).visit(visitor, data);
    }
  }

An ArrayList is a quick way to link to subtrees. The base class of all tree nodes ensures that all nodes can be visited and provides convenient access to subtrees.


  /** factory method: create addition node. */
  public Node Add (Node left, Node right) {
    return new Add(left, right);
  }
  /** represents addition.
      This is <tt>public</tt> for the <tt>Visitor</tt>. */
  public static class Add extends Node {
    /** builds a node with two subtrees.
        @param left left subtree.
        @param right right subtree. */
    protected Add (Node left, Node right) {
      add(left); add(right);
    }

    public Object visit (Visitor visitor, Object data) {
      return visitor.visit(this, data);
    }
  }

Add is typical for all node classes. Each class must implement a specific call to its own method in Visitor.

47

6.4 Visitors

Operations on the program tree require implementing the Visitor interface. Dump prints an indented preorder display of the tree:

expr/visitor/Dump.java

package expr.visitor;
/** visitor to display a <tt>VM</tt> tree on diagnostic output. */
public class Dump implements VM.Visitor {
  /** this should not happen. */
  public Object visit (VM.Node node, Object data) {
    assert false : "call to Node visit";
    return null;
  }
  /** preorder display node and subtrees. */
  protected Object subtrees (VM.Node node, Object data) {
    print(data, node.getClass().getName());
    for (int n = 0; n < node.size(); ++ n)
      node.visit(n, this, indent(data));
    return null;
  }
  /** preorder display node and linked values. */
  protected Object leaves (VM.Node node, Object data) {
    print(data, node.getClass().getName());
    for (int n = 0; n < node.size(); ++ n)
      print(indent(data), node.get(n));
    return null;
  }
  /** indented display. */
  protected void print (Object indent, Object value) {
    System.err.println((indent != null ? indent.toString() : "")+value);
  }
  /** indent one more level. */
  protected String indent (Object indent) {
    return (indent != null ? indent.toString() : "")+" ";
  }
  public Object visit (VM.Add node, Object data) {
    return subtrees(node, data);
  }

  public Object visit (VM.Long node, Object data) {
    return leaves(node, data);
  }

Most nodes, like Add, have subtrees and their display is handled by a common method. Leaves are handled separately to display descendants which are not nodes. The data argument of the Visitor methods is used to manage indentation by propagating a string.

48

expr/visitor/Expression.jay

  /** reads lines from standard input, parses, and evaluates them
      or writes them to standard output.
      @param args classnames of visitors. */
  public static void main (String args []) throws Exception {
    Class[] classes = new Class[args.length];
    for (int n = 0; n < args.length; ++ n)
      classes[n] = Class.forName(args[n]);
    final Scanner scanner = new Scanner(new InputStreamReader(System.in));
    Expression parser = new Expression(classes) {
      public void yyerror (String message, String[] expecting) {
        System.err.print(scanner+" ");
        super.yyerror(message, expecting);
      }
    };
    parser.yyparse(scanner);
  }
  /** set up to collect or evaluate sums.
      @param classes array of visitor classes. */
  public Expression (Class[] classes) {
    this.classes = classes;
  }
  /** array of visitor classes. */
  protected final Class[] classes;
  /** list of serializable trees. */
  protected final List lines = new ArrayList();
...
%%
program : lines {
    try {
      ObjectOutputStream out = new ObjectOutputStream(System.out);
      out.writeObject(lines);
      out.close();
    } catch (Exception e) {
      throw new yyException("output error ["+e+"]");
    }
  }

lines : /* null */
  | lines '\n'
  | lines sum '\n' { visit($2); }
  | lines error
  | lines sum error { visit($2); }

Expression sets up an array of Visitor class names and starts parsing as usual. The grammar is extended with a new start symbol to serialize an ArrayList of lines once recognition is (almost) complete. This avoids a reference from the static method main() to a field of the parser object, but the output is written even if parsing fails at the very end.

49

The tree built for each complete line is visited by an object from each Visitor class:

expr/visitor/Expression.jay

  /** run visitors on tree. */
  protected void visit (VM.Node tree) throws yyException {
    if (tree != null)
      for (int n = 0; n < classes.length; ++ n)
        try {
          VM.Visitor visitor = (VM.Visitor)classes[n].newInstance();
          tree.visit(visitor, null);
          visitor.done(lines);
        } catch (Exception e) {
          throw new yyException(classes[n].getName()
            +": cannot instantiate ["+e+"]");
        }
  }

At the end of the visit, the visitor is given a chance to store something in lines for subsequent serialization.

$ java -jar cmp.jar expr.visitor.Dump expr.visitor.Eval expr.visitor.Tree |
> java -jar go.jar
2+3
expr.visitor.VM$Add
 expr.visitor.VM$Long
  2
 expr.visitor.VM$Long
  3
5.0
byte short int long float double
5 5 5 5 5.0 5.0

Dump displays the tree, Eval evaluates the tree, and Tree creates an interpreter tree over the library described in Interpreter Trees, page 20 which can be evaluated by the class described in Interpreting a Tree, page 23.

50

Evaluation

expr/visitor/Eval.java

package expr.visitor;
/** visitor to evaluate a <tt>VM</tt> tree using floating point arithmetic
    and display the result on diagnostic output. */
public class Eval implements VM.Visitor {
  /** this should not happen. */
  public Object visit (VM.Node node, Object data) {
    assert false : "call to Node visit";
    return null;
  }
  /** tracks last computed value. */
  protected float result;
  public Object visit (VM.Add node, Object data) {
    return new Float(result =
      ((Number)node.visit(0, this, data)).floatValue()
      + ((Number)node.visit(1, this, data)).floatValue());
  }
...
  public Object visit (VM.Double node, Object data) {
    Number result = (Number)node.get(0);
    this.result = result.floatValue();
    return result;
  }
  public Object done (Object data) {
    System.err.println(result);
    return null;
  }
}

Visitor methods for evaluation return an object representing a subtree value. As it stands, done() is not given that object; therefore, a scalar instance variable result is used to track the most recently computed value which is printed by done().

Leaf nodes such as Double contain a constant value represented as a reference to a wrapper class object. This must be considered in each Visitor implementation.

51

Copying a Tree

expr/visitor/Tree.java

package expr.visitor;
import java.util.Collection;
/** visitor to create a <tt>expr.java.VM</tt> tree. */
public class Tree implements VM.Visitor {
  /** this should not happen. */
  public Object visit (VM.Node node, Object data) {
    assert false : "call to Node visit";
    return null;
  }
  /** tracks last computed value. */
  protected Number result;
  /** factory for target tree nodes.
      Could be a construction argument, but not for the default constructor. */
  protected expr.java.VM vm = new expr.java.VM();
  public Object visit (VM.Add node, Object data) {
    return result = vm.Add(
      (Number)node.visit(0, this, data),
      (Number)node.visit(1, this, data));
  }
...
  public Object visit (VM.Double node, Object data) {
    return result = vm.Double(node.get(0).toString());
  }
  /** @param data result is added if Collection. */
  public Object done (Object data) {
    if (data instanceof Collection) ((Collection)data).add(result);
    return result;
  }
}

Tree uses the same architecture as Eval and builds an interpreter tree using the factory described in Interpreter Trees, page 20. done() inserts the result in a Collection for later serialization by Expression.

52

6.5 Inheritance

Extending a Visitor turns out to be possible, but it requires some ugly casting. The compiler for arithmetic expressions described above evolved into a small language and new node classes were added:

/** factory to store simple language for visitors. */
public class VM extends expr.visitor.VM {
  /** what a <tt>VM</tt> visitor must do. */
  public interface Visitor extends expr.visitor.VM.Visitor {
    Object visit (Stmts node, Object data);
...
  }
  /** factory method: create list of statements node. */
  public Node Stmts () {
    return new Stmts();
  }
  /** represents list of statements. */
  public static class Stmts extends Node {
    protected Stmts () { }
    public Object visit (expr.visitor.VM.Visitor visitor, Object data) {
      return ((Visitor)visitor).visit(this, data);
    }
  }
...

The interface is extended with one more method for each new node class.

Each new node class is derived from Node. Objects must be created through a factory method with package access to the protected constructor.

The ugly casting happens in visit() in each node class. The parameter is a superclass visitor which, however, has to be cast to the new interface so that the version of visit() specific to the new node class is reached for the callback.
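The consequence of this cast can be tried out in isolation. The sketch below uses made-up names (ExtendDemo, BaseVisitor, ExtVisitor, Count) and only assumes the structure described above: a new node class must accept the old visitor type and downcast it, so a visitor written against the old interface fails at run time rather than at compile time.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of visitor inheritance; all type names are illustrative. */
final class ExtendDemo {
    interface BaseVisitor {
        Object visit(Num node, Object data);
    }

    abstract static class Node {
        abstract Object visit(BaseVisitor visitor, Object data);
    }

    static final class Num extends Node {
        Object visit(BaseVisitor visitor, Object data) { return visitor.visit(this, data); }
    }

    /** the extended interface adds one method for the new node class. */
    interface ExtVisitor extends BaseVisitor {
        Object visit(Stmts node, Object data);
    }

    static final class Stmts extends Node {
        final List<Node> body = new ArrayList<Node>();
        Object visit(BaseVisitor visitor, Object data) {
            // the parameter type is fixed by Node; the cast reaches visit(Stmts, Object)
            return ((ExtVisitor) visitor).visit(this, data);
        }
    }

    /** counts Num leaves; implements the extended interface. */
    static final class Count implements ExtVisitor {
        public Object visit(Num node, Object data) { return Integer.valueOf(1); }
        public Object visit(Stmts node, Object data) {
            int sum = 0;
            for (Node n : node.body) sum += (Integer) n.visit(this, data);
            return Integer.valueOf(sum);
        }
    }

    static Node sample() {
        Stmts stmts = new Stmts();
        stmts.body.add(new Num());
        stmts.body.add(new Num());
        return stmts;
    }

    static int count(Node tree) { return (Integer) tree.visit(new Count(), null); }

    /** true if a visitor that only knows the old interface is rejected at run time. */
    static boolean rejectsOldVisitor(Node tree) {
        BaseVisitor old = new BaseVisitor() {
            public Object visit(Num node, Object data) { return null; }
        };
        try { tree.visit(old, null); return false; }
        catch (ClassCastException e) { return true; }
    }

    public static void main(String[] args) {
        System.out.println(count(sample()));
        System.out.println(rejectsOldVisitor(sample()));
    }
}
```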

6.6 Summary

The Visitor design pattern provides a methodical way to process syntax and interpreter trees. It has a divide-and-conquer approach built in. Trees for visiting and a Visitor interface can be generated mechanically, e.g., in a fashion as supported by JavaCC.

However, a Visitor knows all node classes at once and interacts with their interconnections. It seems likely that a Visitor implementation requires substantial knowledge about all node classes, thus defeating the principle of information hiding.

It is possible, albeit ugly, to inherit from a Visitor implementation. It is not likely, however, that a parser generator tool can facilitate such inheritance because a grammar can usually not be extended with new nonterminals.

../problems/3/sl/

http://www.webgain.com/products/java_cc/

53

7 Code Generation

Once a syntax tree has been verified, code generation can be based on tree traversal, i.e., on a variant of the Visitor pattern. It turns out that the selection (divide and conquer) mechanism should include not only a node itself but also some information about its subtrees.

This chapter describes a tool for visitor generation and applies it to the problem of code generation for typical machine architectures.

7.1 jag

jag was designed to simplify the implementation of visitors in Java and avoid the drawbacks described in the previous chapter. jag reads a table of patterns to be recognized in a tree and actions to be executed at the appropriate points in the tree. The actions can, among other things, decide whether and how to visit subtrees.

jag’s ultimate ancestor is Wilcox’ template coder which was developed around 1970 as part of Cornell’s implementation of PL/I. The key difference is that jag only uses classes to specify patterns.

jag is implemented with jay and JLex as a compiler emitting visitors as Java code which requires a small runtime system that participates in action selection. Earlier implementations were self-contained interpreters, but compilation has the advantage that powerful actions can be written in Java while the selection mechanism still remains very transparent. Compilation, however, is much more difficult to implement.

jag is still an experimental tool. It is employed here to present the principles of efficient code generation for typical machine architectures in a compact fashion.

Nodes

jag visitors operate on trees. Leaves may belong to arbitrary classes; all other nodes must implement the java.util.List interface.

The nodes implemented for the Visitor design pattern (see Nodes, page 46) were based on ArrayList and thus can be used for jag visitors as well. The visit() methods in the nodes are no longer required.


http://home.nycap.rr.com/pflass/plc/V5628003.htm

54

JLex, jay, and jag

The parser and scanner for arithmetic expressions remain unchanged. Expression is modified slightly to load one or more serialized jag visitors as resources and pass each expression to them:

expr/jag/Expression.jay

  /** reads lines from standard input, parses, and visits them
      with visitors specified as arguments.
      Each visitor operates either on the original tree
      or on the object returned by the preceding visitor.
      Any objects eventually returned are serialized to standard output.
      @param args top-level resource names of visitors. */
  public static void main (String args []) throws Exception {
    if (args == null || args.length == 0) {
      System.err.println("no visitors specified");
      System.exit(1);
    }
    final Scanner scanner = new Scanner(new InputStreamReader(System.in));
    Expression parser = new Expression() {
      public void yyerror (String message, String[] expecting) {
        System.err.print(scanner+" ");
        super.yyerror(message, expecting);
      }
    };
    List lines = (List)parser.yyparse(scanner);
    Visitor visitor [] = new Visitor[args.length];
    for (int a = 0; a < args.length; ++ a) {
      ObjectInputStream in = new ObjectInputStream(
        Expression.class.getResourceAsStream("/"+args[a]));
      visitor[a] = (Visitor)in.readObject();
      in.close();
    }

    List output = new ArrayList();
    for (int l = 0; l < lines.size(); ++ l) {
      Object tree = lines.get(l);
      for (int j = 0; j < visitor.length; ++ j) {
        Object result = visitor[j].visit(tree);   // run generator
        if (result == null) System.err.println();
        else tree = result;
      }
      if (tree != null) output.add(tree);
    }

    if (output.size() != 0) {
      ObjectOutputStream out = new ObjectOutputStream(System.out);
      out.writeObject(output);
      out.close();
    }
  }

The visitors are serialized and stored as files at the top level of the jar file containing the other parts of the compiler.

55

Example: Postfix

Arithmetic expressions can be uniquely expressed in postfix notation where operators follow their operands. Postfix notation can be produced by visiting an interpreter tree in postorder, i.e., working on roots after working on their subtrees.

expr/jag/Postfix.jag

package expr.jag;
import edu.rit.cs.jag.Visitor;
import java.io.ObjectOutputStream;
import java.io.PrintStream;
import java.io.Serializable;
/** generates visitor to traverse a {@link VM} tree
    to generate postfix as diagnostic output. */
public class Postfix implements Serializable, Visitor {
%%

VM.Node: VM.Node VM.Node { $1; $2; $0; }
  | VM.Node { $1; $0; };

VM.Double: Double { `" "+$1.floatValue()`; };
VM.Long: Long { `" "+$1.longValue()`; };

VM.Add: { `" add"`; };
VM.Mul: { `" mul"`; };
VM.Sub: { `" sub"`; };
VM.Div: { `" div"`; };
VM.Mod: { `" mod"`; };
VM.Minus: { `" minus"`; };

%%
  /** serialize the visitor to standard output. */
  public static void main (String args []) throws Exception {
    ObjectOutputStream out = new ObjectOutputStream(System.out);
    out.writeObject(new Postfix() {
      /** overwritten: diagnostic output. */
      protected PrintStream out () { return System.err; }
    });
    out.close();
  }
}

This produces the following output:

$ java -jar cmp.jar Postfix >/dev/null
2 + 3
 2 3 add

56

Input Format

Input to jag consists of three parts, separated by lines with %%. The first part must start a class which implements edu.rit.cs.jag.Visitor. The second part contains the table with patterns and actions which jag will convert into the methods required by the Visitor interface. The third part contains Java code and must close the class started in the first part. Typically, it contains a main() method to instantiate and serialize a visitor.

table entry consists of a class name and a sequence of patterns and actions separated by the | symbol. When a tree node is visited, the class of the node, a superclass or an implemented interface, each with increasing distance from the class, are looked up in the table.

pattern consists of a list of zero or more class names. The action following the pattern is executed if the classes of the list of direct descendants of the tree node are assignable to the list of classes in the pattern. Patterns are considered sequentially, i.e., patterns with more specific classes must be placed near the beginning of the table entry for a node class.

null in place of a class name in a pattern means that the node cannot have a descendant in this position. Clearly, a reference to this descendant cannot appear in the corresponding action.

... if the pattern is followed by three periods, it will match a tree node which has more descendants than those specified explicitly in the pattern.

action is Java code enclosed in { and }. The special syntax described next may be used in an action to simplify producing output and to reference or visit descendants of the matched tree node. If there is no action following a pattern, the search for a matching table entry continues by considering another class related to the tree node.

` object ` is replaced by out().print(object), i.e., this simplifies printing. jag generates an instance variable out initialized to null and a method out() to return out or System.out by default. out can be assigned to and out() can be overwritten to redirect printing.

$n references the n’th descendant of the tree node; the reference is cast to the corresponding class name in the pattern.

$0 references the tree node itself, cast to the class name of the table entry.

$n; calls visit(); to visit the n’th descendant of the tree node.

$0; calls visit0(); for the tree node to execute the action for a class related to the tree node which is specified with an empty pattern. In a sense this is a visit to the tree node as if it had no descendants. As the Postfix example shows, this can vastly simplify the design of a visitor.

{$n} {$0} also call visit() and visit0() respectively, but the generated calls are not terminated with a semicolon, i.e., this notation provides convenient access to the object that those methods return.

$$ can be assigned to to set the result of the action which is in turn returned by the aforementioned calls to visit() or visit0(). This is intended for visitors which evaluate trees or create new trees.
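The matching rule for a single pattern can be sketched in plain Java. This is not jag's actual code; it is a hypothetical reading of the description above. It assumes that a null entry in a pattern means a null descendant in that position and that a trailing ... also accepts zero extra descendants. The names PatternDemo, matches, and firstMatch are invented for the sketch.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of pattern matching against a node's descendants. */
final class PatternDemo {
    /**
     * @param pattern  classes expected for the leading descendants; a null entry
     *                 is assumed to demand a null descendant in that position.
     * @param more     true if the pattern ends in "...".
     * @param children direct descendants of the tree node.
     */
    static boolean matches(Class<?>[] pattern, boolean more, List<?> children) {
        if (more ? children.size() < pattern.length : children.size() != pattern.length)
            return false;
        for (int n = 0; n < pattern.length; ++n) {
            Object child = children.get(n);
            if (pattern[n] == null) {
                if (child != null) return false;
            } else if (!pattern[n].isInstance(child)) {   // assignability test
                return false;
            }
        }
        return true;
    }

    /** patterns are tried sequentially; returns the index of the first match or -1. */
    static int firstMatch(Class<?>[][] patterns, List<?> children) {
        for (int p = 0; p < patterns.length; ++p)
            if (matches(patterns[p], false, children)) return p;
        return -1;
    }

    static final Class<?>[][] SPECIFIC_FIRST = {
        { Double.class, Double.class },
        { Serializable.class, Serializable.class } };

    static final Class<?>[][] GENERAL_FIRST = {
        { Serializable.class, Serializable.class },
        { Double.class, Double.class } };

    static List<Object> children(Object... values) { return Arrays.asList(values); }

    public static void main(String[] args) {
        System.out.println(firstMatch(SPECIFIC_FIRST, children(2.0, 3.0)));
        System.out.println(firstMatch(SPECIFIC_FIRST, children(2.0, "x")));
        // the general pattern shadows the specific one if it comes first
        System.out.println(firstMatch(GENERAL_FIRST, children(2.0, 3.0)));
    }
}
```

With GENERAL_FIRST the Double Double pattern can never be selected, which is why more specific patterns have to be placed near the beginning of a table entry.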

57

Implementation

jag generates out, out(), and the methods specified in the Visitor interface:

expr/jag/Postfix.java

  /** print action stream. Can be replaced to reroute output. */
  protected transient java.io.PrintStream out;
  /** returns print action stream. Defaults to System.out. */
  protected java.io.PrintStream out () { ... }
  /** visits an object. This is a facade for the class search.
      @param object to be visited, not null.
      @return result of action.
      @throws NoRuleException if no rule can be found. */
  public Object visit (Object object) throws Exception { ... }
  /** visits an object, ignoring possible descendants. */
  public Object visit0 (Object object) throws Exception { ... }
  /** visits an object.
      @param _action index.
      @param _object to be visited, not null.
      @return result of action.
      @throws NoRuleException if no rule can be found. */
  public Object visit (int _action, Object _object) throws Exception {
    Object _result = null;                            // $$
    switch (_action) {
    case 12:            // line number of action in input to jag
      visit(((java.util.List)_object).get(0));        // $1;
      visit(((java.util.List)_object).get(1));        // $2;
      visit0(_object);                                // $0;
      break;
...
    default:
      throw new Error(_action+": unexpected action index");
    }
    return _result;
  }
  /** maps class to array of {@link edu.rit.cs.jag.Rule}. */
  protected final java.util.HashMap _rules = new java.util.HashMap();
  {
    _rules.put(VM.Node.class, new edu.rit.cs.jag.Rule[]{    // VM.Node: VM.Node VM.Node
      new edu.rit.cs.jag.Rule(new Class[]{VM.Node.class,VM.Node.class}, false, 12),
      new edu.rit.cs.jag.Rule(new Class[]{VM.Node.class}, false, 13)    // | VM.Node
    });
...
  }

switch contains the actions, accessed by input line number. _rules contains the data for the table search which is implemented in the runtime support class edu.rit.cs.jag.Rule. There should be at most one action per line and names should not start with _.

58

A call to visit(someNode) starts a traversal or continues it recursively and returns the result of the corresponding action; NoRuleException is thrown if no suitable rule can be found.

visit0(someNode) has the same effect but searches as if someNode has no subtrees. Both methods are covers for a static method Rule.visit() which implements the actual search and calls the Visitor back with visit(action, someNode).
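The class search itself can be pictured as a breadth-first walk over superclasses and interfaces. The sketch below is only one plausible reading of "increasing distance"; the actual order used by Rule.visit() is not shown in these slides, in particular how a superclass and an interface at the same distance are ranked. LookupDemo, searchOrder, and lookup are invented names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a table lookup by class distance. */
final class LookupDemo {
    interface Commutative { }
    static class Node extends ArrayList<Object> { }
    static class Add extends Node implements Commutative { }
    static class Sub extends Node { }

    /** the class itself, then superclasses and interfaces in breadth-first order. */
    static List<Class<?>> searchOrder(Class<?> start) {
        List<Class<?>> order = new ArrayList<Class<?>>();
        Deque<Class<?>> queue = new ArrayDeque<Class<?>>();
        queue.add(start);
        while (!queue.isEmpty()) {
            Class<?> next = queue.remove();
            if (order.contains(next)) continue;         // reached earlier on a shorter path
            order.add(next);
            if (next.getSuperclass() != null) queue.add(next.getSuperclass());
            queue.addAll(Arrays.asList(next.getInterfaces()));
        }
        return order;
    }

    /** returns the entry of the nearest related class, or null if there is none. */
    static String lookup(Map<Class<?>, String> table, Class<?> nodeClass) {
        for (Class<?> c : searchOrder(nodeClass)) {
            String entry = table.get(c);
            if (entry != null) return entry;
        }
        return null;
    }

    static Map<Class<?>, String> table() {
        Map<Class<?>, String> table = new HashMap<Class<?>, String>();
        table.put(List.class, "List entry");
        table.put(Commutative.class, "Commutative entry");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(lookup(table(), Add.class));   // Commutative is nearer than List
        System.out.println(lookup(table(), Sub.class));   // only List is related to Sub
    }
}
```

In jag the search would continue past an entry whose patterns do not match; the sketch leaves that step out and only shows the ordering.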

jag is a relatively simple preprocessor that facilitates the coding of a visitor with a pattern-based table search. jag performs a few semantic checks, e.g., it only permits one table entry per class and does not permit references to subtrees that were not mentioned in a pattern. jag’s scanner is a bit tricky — states are introduced to use different regular expressions to analyze rules, actions, and printing shorthands.

Assembly

Expression demonstrates that it is advantageous to package a serialized Visitor as a resource file at the top level of the jar file containing the compiler:

$ java -jar jlex.jar Scanner.lex && mv Scanner.lex.java Scanner.java # scanner
$ jay < skeleton.java Expression.jay > Expression.java # parser
$ java -jar jag.jar < Postfix.jag > Postfix.java # visitor
$ javac -classpath ../..:jag-go.jar -source 1.4 Expression.java Postfix.java
$ java -classpath jag-go.jar -ea expr.jag.Postfix > Postfix # serialized visitor
$ echo Main-Class: expr.jag.Expression > manifest
$ cd ../.. && jar cfm expr/jag/cmp.jar expr/jag/manifest expr/jag/*.class \
> expr/visitor/VM*.class expr/java/VM*.class # compiler and tree node classes
$ jar xf jag-go.jar edu && jar uf cmp.jar edu Postfix # plus visitor and runtime

Expression also shows that many implementations of Visitor can be applied to a tree within the same program. They can also be cascaded to successively create new trees:

$ java -jar cmp.jar Postfix Value | java -jar go.jar
32768 * 2147483648 + 128
 32768 2147483648 mul 128 add
byte short int long float double
-1 -1 2147483647 70368744177792 7.0368744E13 7.0368744177792E13

$ java -jar cmp.jar Postfix Tree | java -jar go.jar
32768 * 2147483648 + 128
 32768 2147483648 mul 128 add
byte short int long float double
-128 128 128 70368744177792 7.0368744E13 7.0368744177792E13

Value and Tree are both applied to the same tree as the Postfix visitor. Value uses floating point arithmetic to evaluate the tree and creates a Double object with the result, Tree uses the classes discussed in Interpreter Trees, page 20, to represent the tree for further processing. Both results can be passed to the evaluator discussed in Interpreting a Tree, page 23; however, only the second result produces different results for the various kinds of arithmetic used in the general evaluator.

59

Example: Evaluation

expr/jag/Value.jag

VM.Double: Double { $$ = ($1); };    // reference, don't visit.
VM.Long: Long { $$ = ($1); };

VM.Add: VM.Node VM.Node {
    $$ = new Double(((Number){$1}).doubleValue() + ((Number){$2}).doubleValue());
  };
VM.Sub: VM.Node VM.Node {
    $$ = new Double(((Number){$1}).doubleValue() - ((Number){$2}).doubleValue());
  };
VM.Mul: VM.Node VM.Node {
    $$ = new Double(((Number){$1}).doubleValue() * ((Number){$2}).doubleValue());
  };
VM.Div: VM.Node VM.Node {
    $$ = new Double(((Number){$1}).doubleValue() / ((Number){$2}).doubleValue());
  };
VM.Mod: VM.Node VM.Node {
    $$ = new Double(((Number){$1}).doubleValue() % ((Number){$2}).doubleValue());
  };
VM.Minus: VM.Node {
    $$ = new Double(-((Number){$1}).doubleValue());
  };

Each action creates a Number object by combining the values of the subtrees.

The syntax to reference the Double and Long descendants at the leaves of the expression tree is noteworthy: if $1 were followed by a semicolon, jag would compile it as a visit to the descendant, which is impossible because the descendant Double or Long does not implement List. Parentheses come to the syntactic rescue.

60

Example: Tree Generation

expr/jag/Tree.jag

public class Tree implements Serializable, Visitor {
  /** factory object for target tree.
      <tt>static</tt> because <tt>Rule</tt> is <tt>static</tt>. */
  protected transient expr.java.VM vm;
  /** ensures factory creation. */
  protected expr.java.VM vm () {
    if (vm == null) vm = new expr.java.VM();
    return vm;
  }
%%

VM.Double: Double { $$ = ($1); };    // reference, don't visit.
VM.Long: Long { $$ = ($1); };

VM.Add: VM.Node VM.Node {
    $$ = vm().Add((Number){$1}, (Number){$2});
  };
VM.Sub: VM.Node VM.Node {
    $$ = vm().Sub((Number){$1}, (Number){$2});
  };
VM.Mul: VM.Node VM.Node {
    $$ = vm().Mul((Number){$1}, (Number){$2});
  };
VM.Div: VM.Node VM.Node {
    $$ = vm().Div((Number){$1}, (Number){$2});
  };
VM.Mod: VM.Node VM.Node {
    $$ = vm().Mod((Number){$1}, (Number){$2});
  };
VM.Minus: VM.Node {
    $$ = vm().Minus((Number){$1});
  };

Generating a new tree is just as straightforward as evaluation. The only complication results from the fact that Expression expects serialized visitors whereas the factory object for the Interpreter Trees, page 20, cannot be serialized — an earlier oversight. The problem is easily circumvented by using self-initializing code.
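The self-initializing idiom is independent of jag and can be demonstrated with the standard serialization classes alone. In the sketch below, Factory stands in for the non-serializable expr.java.VM and Builder for the visitor; both names and the helper roundTrip() are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Sketch: a serializable object with a lazily created, non-serializable helper. */
final class LazyDemo {
    /** stands in for a factory that does not implement Serializable. */
    static final class Factory {
        String make(String value) { return "node(" + value + ")"; }
    }

    static final class Builder implements Serializable {
        private static final long serialVersionUID = 1L;
        /** skipped by serialization; null again after deserialization. */
        private transient Factory factory;
        /** ensures factory creation on first use. */
        Factory factory() {
            if (factory == null) factory = new Factory();
            return factory;
        }
    }

    /** serializes to memory and reads the copy back. */
    static Builder roundTrip(Builder builder) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(builder);
        out.close();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        try {
            return (Builder) in.readObject();
        } finally {
            in.close();
        }
    }

    static String demo() {
        try {
            Builder original = new Builder();
            original.factory();                  // factory exists before serialization
            Builder copy = roundTrip(original);  // works because the field is transient
            return copy.factory().make("2");     // factory is recreated on demand
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Without the transient modifier, writeObject() would fail with a NotSerializableException as soon as the factory had been created.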

61

7.2 Code Generation

We now turn to code generation for arithmetic expressions and typical hardware architectures. As an example consider - (1 % ( 2 + 3 * 4 - 5 / 6 )) corresponding to the following tree:

One approach would be to represent the expression in postfix notation and simulate a stack on each hardware architecture, but there are more appropriate solutions for each individual architecture.

A goal is to minimize the memory requirements during evaluation. This is accomplished by varying the order in which the subtrees are visited based on the operator in the root node — one of the central reasons for implementing jag.

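[Figure: expression tree for - (1 % ( 2 + 3 * 4 - 5 / 6 )). The root is the unary -, its operand is %; the operands of % are 1 and a binary -; the operands of that - are + (with 2 and 3 * 4) and / (with 5 and 6).]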

62

Operator properties can be marked using interfaces — even when reusing an already existing implementation of tree nodes:

expr/jag/VM.java

package expr.jag;
/** factory to store arithmetic expressions for jag.
    The superclass is already based on a <tt>List</tt> representation;
    other features of the superclass are unused. */
public class VM extends expr.visitor.VM {
  /** marks numerical leaves. */
  public interface Number { }
  /** marks commutative operators. */
  public interface Commutative { }
  /** factory method: create commutative addition node. */
  public Node Add (Node left, Node right) {
    return new Add(left, right);
  }

  public static class Add extends expr.visitor.VM.Add implements Commutative {
    protected Add (Node left, Node right) {
      super(left, right);
    }
  }
...
  /** factory method: create floating point number. */
  public Node Double (String value) {
    return new Double(value);
  }

  public static class Double extends expr.visitor.VM.Double implements Number {
    protected Double (String value) {
      super(value);
    }
  }
}

Note that self-contained code for the next four examples is distributed with jag and employs its own classes for tree representation.


63

7.3 0-Address Code

A 0-address or stack machine works on arguments on a (more or less infinite) stack. Arithmetic instructions such as add have no explicit arguments:

There have to be additional instructions such as load and store to transfer information between some storage and the stack.

A naive postorder visit to an expression tree produces 0-address code but it can waste memory. More efficient code for the tree introduced in Code Generation, page 61, can be designed by exploiting commutativity:

$ java -jar cmp.jar G0
- (1 % ( 2 + 3 * 4 - 5 / 6 ))
load 1.0
load 4.0
load 3.0
mul
load 2.0
add
load 5.0
load 6.0
div
sub
mod
minus

arith/G0.jag

List: Serializable Serializable { $1; $2; $0; }    // default: postorder
  | Serializable { $1; $0; };

Arith.Commutative: Double Serializable { $2; $1; $0; };    // optimization

Arith.Add: { `"add\n"`; };    // machine instructions
Arith.Mul: { `"mul\n"`; };
Arith.Sub: { `"sub\n"`; };
Arith.Div: { `"div\n"`; };
Arith.Mod: { `"mod\n"`; };
Arith.Minus: { `"minus\n"`; };
Double: { `"load\t"+$0.floatValue()+"\n"`; };
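For comparison, the same selection can be coded by hand without jag. This is a sketch only: the Op class, the commutative flag, and the names StackCodeDemo, gen, and compile are invented here and stand in for the Arith node classes, which are not shown in this text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hand-written sketch of the 0-address rules; Op is a hypothetical node class. */
final class StackCodeDemo {
    /** interior node: instruction name plus operands (Double leaves or Op nodes). */
    static final class Op {
        final String name;
        final boolean commutative;
        final List<Object> args;
        Op(String name, boolean commutative, Object... args) {
            this.name = name;
            this.commutative = commutative;
            this.args = Arrays.asList(args);
        }
    }

    static void gen(Object tree, List<String> code) {
        if (tree instanceof Double) {                        // leaf: load constant
            code.add("load " + ((Double) tree).floatValue());
            return;
        }
        Op op = (Op) tree;
        if (op.commutative && op.args.size() == 2 && op.args.get(0) instanceof Double) {
            gen(op.args.get(1), code);                       // optimization: subtree first,
            gen(op.args.get(0), code);                       // constant last
        } else {
            for (Object arg : op.args) gen(arg, code);       // default: postorder
        }
        code.add(op.name);                                   // machine instruction
    }

    static List<String> compile(Object tree) {
        List<String> code = new ArrayList<String>();
        gen(tree, code);
        return code;
    }

    /** the tree for - (1 % ( 2 + 3 * 4 - 5 / 6 )). */
    static Object example() {
        Op mul = new Op("mul", true, 3.0, 4.0);
        Op add = new Op("add", true, 2.0, mul);
        Op div = new Op("div", false, 5.0, 6.0);
        Op sub = new Op("sub", false, add, div);
        Op mod = new Op("mod", false, 1.0, sub);
        return new Op("minus", false, mod);
    }

    public static void main(String[] args) {
        for (String line : compile(example())) System.out.println(line);
    }
}
```

The instanceof chain plays the role of the pattern table; every new special case has to be worked into that chain by hand, which is what the table-driven search is meant to avoid.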

[Figure: stack machine. An arithmetic instruction pops the right and left operands from the stack and pushes the result.]

64

7.4 1-Address Code

A 1-address machine works much like a pocket calculator with a display. Arithmetic instructions such as add have a single argument and combine it with the display:

There have to be additional instructions such as load and store to transfer information between some storage and the display.

Efficient code for the tree introduced in Code Generation, page 61, can be designed by exploiting commutativity. Constants must be loaded directly:

$ java -jar cmp.jar G1
- (1 % ( 2 + 3 * 4 - 5 / 6 ))
load 5.0
div 6.0
store temp1
load 4.0
mul 3.0
add 2.0
sub temp1
store temp1
load 1.0
mod temp1
store temp1
load 0.0
sub temp1

The code can be substantially improved if there is a 0-address instruction minus to change the sign of the display.

Most hardware architectures need to store temporary results. An edu.rit.cs.jag.Temp object manipulates an integer value (initialized to zero) as follows:

get() increments the current value and returns the new value.
ref() returns the current value.
free() returns the current value and decrements it.

This compile-time stack turns out to be sufficient to manage temporary memory for most hardware architectures.
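The three operations are small enough to restate as code. The class below is a stand-in written from the description above, not the source of edu.rit.cs.jag.Temp; TempDemo and demo() are invented for the example.

```java
/** Stand-in for a compile-time stack of temporary locations. */
final class TempDemo {
    static final class Temp {
        private int value;                    // initialized to zero

        /** increments the current value and returns the new value. */
        int get() { return ++value; }

        /** returns the current value. */
        int ref() { return value; }

        /** returns the current value and decrements it. */
        int free() { return value--; }
    }

    /** allocates two nested temporaries and releases them in reverse order. */
    static String demo() {
        Temp temp = new Temp();
        return "store temp" + temp.get()
            + ", store temp" + temp.get()
            + ", sub temp" + temp.free()
            + ", sub temp" + temp.free()
            + ", in use " + temp.ref();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Java evaluates the operands of + from left to right, so the calls in demo() happen in the order in which they are written; the generated actions rely on the same property.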

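[Figure: 1-address machine. load and store move values between memory and the display; an arithmetic instruction such as add combines a memory operand with the display and leaves the result there.]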

65

arith/G1.jag

public class G1 implements Serializable, Visitor {
  /** manage a stack of temporary locations. */
  protected Temp temp = new Temp();
%%

List: Serializable Double {
    $1;                               // optimization: constant operand
    $0; `$2.floatValue()+"\n"`;
  }

  | Serializable Serializable {
    $2;                               // default: optimized postorder
    `"store temp"+temp.get()+"\n"`;
    $1;
    $0; `"temp"+temp.free()+"\n"`;
  };

Arith.Commutative: Double Serializable {
    $2;                               // optimization: commutative
    $0; `$1.floatValue()+"\n"`;
  };

Arith.Minus: Double {                 // optimization: constant operand
    `"load -("+$1.floatValue()+")\n"`;
  }

  | Serializable {
    $1;                               // default: postorder
    `"store temp"+temp.get()+"\n"`;
    `"load 0.0\n"`;
    `"sub temp"+temp.free()+"\n"`;
  };

Arith.Add: { `"add "`; };             // machine instructions
Arith.Mul: { `"mul "`; };
Arith.Sub: { `"sub "`; };
Arith.Div: { `"div "`; };
Arith.Mod: { `"mod "`; };
Double: { `"load "+$0.floatValue()+"\n"`; };

66

7.5 2-Address Code

2-address machines are the theoretical version of a very common architecture. They are also theoretically important because their code (triples) can easily be optimized and mapped to other architectures.

An arithmetic instruction has two arguments, combines them, and overwrites one of them with the result:

There should be an additional instruction such as move to transfer information between storage cells.

Reasonable code for the tree introduced in Code Generation, page 61, can be designed by exploiting commutativity. Constants must be loaded directly:

$ java -jar cmp.jar G2
- (1 % ( 2 + 3 * 4 - 5 / 6 ))
move temp1,0.0
move temp2,1.0
move temp3,5.0
div temp3,6.0
move temp4,3.0
mul temp4,4.0
add temp4,2.0
sub temp3,temp4
mod temp2,temp3
sub temp1,temp2

The code can be substantially improved if there is a 1-address instruction minus to change the sign of a storage cell.

Temp can be used for memory management by creating each result in the location which Temp would create next. Mixed-Mode Code for a Register-Machine, page 70, illustrates the dramatic improvements which can result from a more intelligent temporary storage management.

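[Figure: 2-address machine. An arithmetic instruction combines a left and a right operand in memory and overwrites one of them with the sum.]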

67

arith/G2.jag

List: Double Double {                 // constant operands
    `"move temp"+temp.get()+","`; $1; `"\n"`;
    $0; `"temp"+temp.free()+","`; $2; `"\n"`;
  }

  | Serializable Double {
    $1;                               // optimization: constant right operand
    $0; `"temp"+temp.get()+","`; $2; `"\n"`;
    temp.free();
  }

  | Double Serializable {             // constant left operand
    `"move temp"+temp.get()+","`; $1; `"\n"`;
    $2;
    $0; `"temp"+temp.ref()+",temp"+temp.get()+"\n"`;
    temp.free(); temp.free();
  }

  | Serializable Serializable {
    $2;                               // default: optimized postorder
    temp.get();
    $1;
    $0; `"temp"+temp.ref()+",temp"+temp.get()+"\n"`;
    temp.free(); temp.free();
  };

Arith.Commutative: Double Double      // revert to superclass of node

  | Double Serializable {
    $2;                               // optimization: commutative
    $0; `"temp"+temp.get()+","`; $1; `"\n"`;
    temp.free();
  };

Arith.Minus: Double {
    `"move temp"+temp.get()+","+"-("`; $1; `")\n"`;
    temp.free();
  }

  | Serializable {
    `"move temp"+temp.get()+",0.0\n"`;
    $1;
    `"sub temp"+temp.ref()+",temp"+temp.get()+"\n"`;
    temp.free(); temp.free();
  };

Arith.Add: { `"add "`; };             // machine instructions
Arith.Mul: { `"mul "`; };
Arith.Sub: { `"sub "`; };
Arith.Div: { `"div "`; };
Arith.Mod: { `"mod "`; };
Double: { `$0.floatValue()`; };

The elegance of the pattern search through superclasses implemented by jag lies in the fact that a few general rules will already do the work. Specialized rules can be added, even later, to improve code generation in certain situations. Arith.Commutative demonstrates, however, that the sequential search through the list of patterns for one node class has to be considered when an action is added for a special case: the Double Double case has to be explicitly excluded before the Double Serializable case can be dealt with.


7.6 3-Address Code

3-address machines are another important theoretical case because their code (quadruples) can also be optimized and mapped to other architectures easily.

An arithmetic instruction has three arguments, combines two, and overwrites the third with the result:

There should be an additional instruction such as move to transfer information between storage cells.

Code for the tree introduced in Code Generation, page 61, is more compact and easier to generate than in the 2-address case:

$ java -jar cmp.jar G3
- (1 % ( 2 + 3 * 4 - 5 / 6 ))
mul temp1,3.0,4.0
add temp1,2.0,temp1
div temp2,5.0,6.0
sub temp1,temp1,temp2
mod temp1,1.0,temp1
sub temp1,0.0,temp1

The code can be improved if there is a 1-address instruction minus to change the sign of a storage cell.

Temp can be used for memory management: to avoid explicit storage management, each result is always created in the storage cell which Temp.get() would allocate next:

[Figure: memory and arithmetic unit; storage cells labeled left, right, and sum]

arith/G3.jag

List: Double Double {
      $0; `"temp"+temp.get()+","`;      // constant operands
      $1; `","`; $2; `"\n"`; temp.free();
    }
  | Serializable Double {
      $1;                               // optimization: constant right operand
      $0; `"temp"+temp.get()+","`;
      `"temp"+temp.ref()+","`; $2; `"\n"`; temp.free();
    }
  | Double Serializable {
      $2;                               // optimization: constant left operand
      $0; `"temp"+temp.get()+","`;
      $1; `","`; `"temp"+temp.ref()+"\n"`; temp.free();
    }
  | Serializable Serializable {
      $1;                               // default: proper postorder
      temp.get(); $2;
      $0; `"temp"+temp.ref()+","`;
      `"temp"+temp.ref()+","`; `"temp"+temp.get()+"\n"`;
      temp.free(); temp.free();
    };

Arith.Minus: Double {
      `"move temp"+temp.get()+","`;     // constant operand
      `"-("`; $1; `")\n"`; temp.free();
    }
  | Serializable {
      $1;                               // default: postorder
      `"sub temp"+temp.get()+","`;
      `"0.0,"`; `"temp"+temp.free()+"\n"`;
    };

Arith.Add: { `"add "`; };               // machine instructions
Arith.Mul: { `"mul "`; };
Arith.Sub: { `"sub "`; };
Arith.Div: { `"div "`; };
Arith.Mod: { `"mod "`; };
Double: { `$0.floatValue()`; };
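The rules above can be traced with a small plain-Java sketch. It is not the generated visitor: the Temp class below is a hypothetical stand-in whose get(), ref(), and free() behave as the listing suggests (a counter used like a stack), and Node, Num, Bin, and Neg are illustrative tree classes invented for this example. Like the rules, the sketch does not handle an expression consisting of a single constant.

```java
class ThreeAddressSketch {
    /** Hypothetical stand-in for Temp: a counter used like a stack of cells. */
    static final class Temp {
        private int top = 0;
        int get()  { return ++top; }   // reserve the next cell, return its number
        int ref()  { return top; }     // number of the most recently reserved cell
        int free() { return top--; }   // release that cell, return its number
    }

    static abstract class Node {}
    static final class Num extends Node {
        final double value;
        Num(double value) { this.value = value; }
    }
    static final class Bin extends Node {
        final String op; final Node left, right;
        Bin(String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
    }
    static final class Neg extends Node {
        final Node operand;
        Neg(Node operand) { this.operand = operand; }
    }

    static double num(Node n) { return ((Num) n).value; }

    /** Emits code leaving the result in the cell which temp.get() would return next. */
    static void gen(Node n, Temp temp, StringBuilder out) {
        if (n instanceof Neg) {
            Node operand = ((Neg) n).operand;
            if (operand instanceof Num) {             // constant operand
                out.append("move temp").append(temp.get()).append(",-(")
                   .append(num(operand)).append(")\n");
                temp.free();
            } else {                                  // default: postorder
                gen(operand, temp, out);
                out.append("sub temp").append(temp.get()).append(",0.0,temp")
                   .append(temp.free()).append('\n');
            }
            return;
        }
        Bin b = (Bin) n;
        boolean leftLeaf = b.left instanceof Num, rightLeaf = b.right instanceof Num;
        if (leftLeaf && rightLeaf) {                  // constant operands
            out.append(b.op).append(" temp").append(temp.get()).append(',')
               .append(num(b.left)).append(',').append(num(b.right)).append('\n');
            temp.free();
        } else if (rightLeaf) {                       // constant right operand
            gen(b.left, temp, out);
            out.append(b.op).append(" temp").append(temp.get()).append(",temp")
               .append(temp.ref()).append(',').append(num(b.right)).append('\n');
            temp.free();
        } else if (leftLeaf) {                        // constant left operand
            gen(b.right, temp, out);
            out.append(b.op).append(" temp").append(temp.get()).append(',')
               .append(num(b.left)).append(",temp").append(temp.ref()).append('\n');
            temp.free();
        } else {                                      // proper postorder
            gen(b.left, temp, out);
            temp.get();                               // protect the left result
            gen(b.right, temp, out);
            int left = temp.ref();
            out.append(b.op).append(" temp").append(left).append(",temp").append(left)
               .append(",temp").append(temp.get()).append('\n');
            temp.free(); temp.free();
        }
    }

    /** Generates code for - (1 % ( 2 + 3 * 4 - 5 / 6 )). */
    static String demo() {
        Node tree = new Neg(new Bin("mod", new Num(1),
            new Bin("sub",
                new Bin("add", new Num(2), new Bin("mul", new Num(3), new Num(4))),
                new Bin("div", new Num(5), new Num(6)))));
        StringBuilder out = new StringBuilder();
        gen(tree, new Temp(), out);
        return out.toString();
    }
}
```

Under these assumptions demo() returns the six instructions of the G3 session shown earlier. Each rule frees every cell it reserves, so the counter is back at its starting value when a subtree is done and the result sits in the cell that would be allocated next.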


7.7 Mixed-Mode Code for a Register-Machine

A register machine is a special case of a 2-address machine. It has relatively few, fast registers with simple addresses which serve as source and target for arithmetic instructions.

Modern RISC architectures arrange for very many registers; however, register management is still not trivial because registers are also needed to address memory.

Efficient code for the tree introduced in Code Generation, page 61, can be as follows:

$ java -jar cmp.jar Reg >/dev/null
- (1 % ( 2 + 3 * 4 - 5 / 6 ))
        load    r0, 3
        mul     r0, 4
        add     r0, 2
        load    r1, 5
        div     r1, 6
        subr    r0, r1
        load    r1, 1
        modr    r1, r0
        load    r0, 0
        subr    r0, r1

This code is much shorter than previous versions because the registers are not managed as a stack.

Expression can generate Long and Double leaves. A typical register machine has different registers for integer and floating point arithmetic and in the spirit of C arithmetic must employ Double if at least one operand belongs to that type. Two auxiliary classes, R.Long and R.Float, both derived from a class R, manage the registers and their objects represent the result type of a generated machine instruction.

[Figure: register machine; labels: register, arithmetic, left, right, memory, sum]

expr/jag/Reg.jag

VM.Node: VM.Long VM.Long {              // special case; two Longs
      $$ = new R.Long((String){$0}, new R.Long((Long){$1}), (Long){$2});
    }
  | VM.Number VM.Number {               // special case: two constants, at least one Double
      $$ = new R.Float((String){$0}, new R.Float((Number){$1}), (Number){$2});
    }
  | VM.Long VM.Node {{                  // special case: Long and expression
      R right = (R)$2;
      $$ = right instanceof R.Long
        ? (R)new R.Long((String){$0}, new R.Long((Long){$1}), right)
        : (R)new R.Float((String){$0}, new R.Float((Long){$1}), right);
    }}
  | VM.Number VM.Node {{                // special case: Double and expression
      R right = (R)$2;
      $$ = new R.Float((String){$0}, new R.Float((Number){$1}), right);
    }}
  | VM.Node VM.Long {{                  // special case: expression and Long
      R left = (R)$1;
      $$ = left instanceof R.Long
        ? (R)new R.Long((String){$0}, left, (Long){$2})
        : (R)new R.Float((String){$0}, left, (Long){$2});
    }}
  | VM.Node VM.Number {{                // special case: expression and Double
      R left = (R)$1;
      $$ = new R.Float((String){$0}, left, (Number){$2});
    }}
  | VM.Node VM.Node {{                  // default: postorder
      R left = (R){$1}, right = (R)$2;
      if (left instanceof R.Long && right instanceof R.Long)
        $$ = new R.Long((String){$0}, left, right);
      else
        $$ = new R.Float((String){$0}, left, right);
    }};

This visitor uses the marking interfaces introduced in Code Generation, page 61, to distinguish the individual types of constant leaves. The sequential search through the patterns requires that all special cases precede the most general postorder action.

The double braces are necessary if local variables are declared within a single case of the generated action switch.


A few more special cases for commutative operators further improve the generated code:

expr/jag/Reg.jag

VM.Commutative: VM.Long VM.Long         // revert to superclass of node
  | VM.Number VM.Number                 // revert to superclass of node
  | VM.Long VM.Node {{                  // optimized: flip Long to right
      R right = (R)$2;
      $$ = right instanceof R.Long
        ? (R)new R.Long((String){$0}, right, (Long){$1})
        : (R)new R.Float((String){$0}, right, (Long){$1});
    }}
  | VM.Number VM.Node {{                // optimized: flip Double to right
      R right = (R)$2;
      $$ = new R.Float((String){$0}, right, (Number){$1});
    }};

Root-only rules contain the machine instructions as usual:

expr/jag/Reg.jag

VM.Add: { $$ = "add"; };                // machine instructions
VM.Mul: { $$ = "mul"; };
VM.Sub: { $$ = "sub"; };
VM.Div: { $$ = "div"; };
VM.Mod: { $$ = "mod"; };

VM.Number: Number { $$ = ($1); }; // reference, don't visit

A minus sign is mapped to sub:

expr/jag/Reg.jag

VM.Minus: VM.Long {
      $$ = new R.Long("sub", new R.Long(new Long(0)), (Long){$1});
    }
  | VM.Number {
      $$ = new R.Float("sub", new R.Float(new Float(0)), (Number){$1});
    }
  | VM.Node {{
      R right = (R)$1;
      if (right instanceof R.Long)
        $$ = new R.Long("sub", new R.Long(new Long(0)),right);
      else
        $$ = new R.Float("sub", new R.Float(new Long(0)),right);
    }};


Register allocation has to be reinitialized between two expressions — otherwise the result of an expression would remain behind in a register permanently. This would normally be the responsibility of whoever passes an expression tree to the Reg visitor; however, Expression does not cater to any visitor specifically. Subclassing and once-only code can come to the rescue:

expr/jag/Reg.jag

/** generates visitor to generate register machine code
    for a mixed-mode {@link VM} tree.
  */
public class Reg implements Serializable, Visitor {
%%
...
%%
  /** serialize the visitor to standard output. */
  public static void main (String args []) throws Exception {
    ObjectOutputStream out = new ObjectOutputStream(System.out);
    out.writeObject(new Reg() {
      /** overwritten: diagnostic output. */
      protected PrintStream out () { return System.err; }
      /** false for top-level visit. */
      protected transient boolean inner;
      /** kludge to reset register allocation at top level
          and return null to avoid serialization.
        */
      public Object visit (Object object) throws Exception {
        Object result = null;
        if (!inner) {
          inner = true; super.visit(object); inner = false; R.reset();
        } else
          result = super.visit(object);
        return result;
      }
    });
    out.close();
  }
}

Finally, it should be noted that this code generator has not been designed to deal with trivial expressions consisting only of a constant value.


7.8 Register Allocation

R is an abstract class representing a register as an array index. The array element is used to track the reservation. This model would have to be refined once registers have to be paged to memory.

expr/jag/R.java

package expr.jag;
/** manages registers and output for a register machine.
    Nested classes use package access to register descriptions.
  */
public abstract class R {
  /** makes all registers available. */
  public static void reset () {
    for (int r = 0; r < Long.reg.length; ++ r) Long.reg[r] = false;
    for (int r = 0; r < Float.reg.length; ++ r) Float.reg[r] = false;
  }
  /** allocates a register.
      @throws RuntimeException if there is no register available.
    */
  protected static int get (boolean reg []) {
    for (int r = 0; r < reg.length; ++ r)
      if (! reg[r]) { reg[r] = true; return r; }
    throw new RuntimeException("no more registers");
  }
  /** register containing this result. */
  protected int rX;

get() is the register allocation algorithm shared by the subclasses.

rX is used in each object to indicate which register the object represents.
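The allocation strategy in get() is a first-fit search over a reservation array. The following stand-alone sketch shows its observable behavior. FirstFitRegisters and its free() method are names invented for this illustration, and it keeps the array per instance rather than in static fields as R does.

```java
import java.util.Arrays;

class FirstFitRegisters {
    private final boolean[] inUse;      // true if the register is reserved

    FirstFitRegisters(int count) { inUse = new boolean[count]; }

    /** Reserves and returns the lowest-numbered free register. */
    int get() {
        for (int r = 0; r < inUse.length; ++r)
            if (!inUse[r]) { inUse[r] = true; return r; }
        throw new RuntimeException("no more registers");
    }

    /** Makes a single register available again. */
    void free(int r) { inUse[r] = false; }

    /** Makes all registers available, as R.reset() does between expressions. */
    void reset() { Arrays.fill(inUse, false); }
}
```

Because the search always starts at index 0, a freed low-numbered register is reused before any higher one. This is why the register machine session above alternates between r0 and r1 only.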


Concrete subclasses of R represent registers for Long and Double arithmetic. The constructors create machine instructions and track the result registers.

The necessary constructors can be inferred from the jag rules in the preceding section.

expr/jag/R.java

  /** describes a result in a general register. */
  public static class Long extends R {
    /** number of general registers for integer arithmetic. */
    public static final int regs = 8;
    /** true if general register is in use. */
    protected static boolean reg [] = new boolean[regs];
    /** loads a number into a new general register.
        @throws RuntimeException if there is no register available.
      */
    public Long (java.lang.Long number) {
      rX = R.get(reg);
      System.err.println("\tload\tr"+rX+", "+number.longValue());
    }
    /** combines a general register and a number.
        @param left general register.
      */
    public Long (String opcode, R left, java.lang.Long right) {
      rX = ((Long)left).rX;             // ensure left is Long
      System.err.println("\t"+opcode+"\tr"+rX+", "+right.longValue());
    }
    /** combines two general registers.
        @param left general register, used as result.
        @param right general register, freed.
      */
    public Long (String opcode, R left, R right) {
      rX = ((Long)left).rX;             // ensure left is Long
      System.err.println("\t"+opcode+"r\tr"+rX+", r"+right.rX);
      reg[right.rX] = false;
      ((Long)right).rX = -1;            // ensure right is Long and gets trashed
    }
  }


expr/jag/R.java

  /** describes a result in a floating point register. */
  public static class Float extends R {
    /** number of registers for floating point arithmetic. */
    public static final int regs = 8;
    /** true if floating point register is in use. */
    protected static boolean reg [] = new boolean[regs];
    /** loads a number into a new floating point register.
        @throws RuntimeException if there is no register available.
      */
    public Float (Number number) {
      rX = R.get(reg);
      System.err.println("\tloadf\tfr"+rX+", "+number.floatValue());
    }
    /** transfers a number into a new floating point register
        and frees the general register.
        @param r general register.
        @throws RuntimeException if there is no register available.
      */
    public Float (R r) {
      rX = R.get(reg);
      System.err.println("\tloadfr\tfr"+rX+", r"+r.rX);
      Long.reg[r.rX] = false;
      ((Long)r).rX = -1;                // ensure r is Long and gets trashed
    }
    /** combines a register and a number.
        @param left register, converted to floating point if necessary.
      */
    public Float (String opcode, R left, Number right) {
      if (! (left instanceof Float)) left = new Float(left);
      rX = left.rX;
      System.err.println("\t"+opcode+"f\tfr"+rX+", "+right.floatValue());
    }
    /** combines two registers.
        @param left register, converted to floating point, used as result.
        @param right register, converted as needed, freed.
      */
    public Float (String opcode, R left, R right) {
      if (! (left instanceof Float)) left = new Float(left);
      rX = left.rX;
      if (! (right instanceof Float)) right = new Float(right);
      System.err.println("\t"+opcode+"fr\tfr"+rX+", fr"+right.rX);
      reg[right.rX] = false;
      right.rX = -1;                    // ensure right gets trashed
    }
  }
}

R.Float has an additional constructor to convert from Long in a general register into Double in a floating point register.


7.9 Test Generation

Code generation for register machines depends on lookahead over a single level in the expression tree. Inspection of the fundamental rules in Reg reveals that there are four binary situations and two unary situations: each descendant can be a tree or a leaf, respectively.

However, two data types, Long and Double, and simplifications because of commutativity have to be considered in each case, too. This increases the number of test cases from 6 to 28.

Widen is a visitor which clones arithmetic expressions by inserting Long values for Double values in copies of the original tree and producing all possible combinations.

expr/jag/Widen.jag

VM.Add: { $$ = " + "; };
VM.Mul: { $$ = " * "; };
VM.Sub: { $$ = " - "; };
VM.Div: { $$ = " / "; };
VM.Mod: { $$ = " % "; };
VM.Minus: { $$ = "- "; };

VM.Long: Long {
      $$ = new String[] { ""+($1) };    // reference, don't visit
    };

VM.Double: Double {                     // create two cases
      $$ = new String[] { ""+($1).longValue(), ""+($1).longValue()+"." };
    };

VM.Node: VM.Node VM.Node {{             // create all combinations
      String[] left = (String[]){$1}, right = (String[])$2;
      String[] rs = new String[left.length * right.length];
      for (int l = 0; l < left.length; ++ l)
        for (int r = 0; r < right.length; ++ r)
          rs[l + r*left.length] = "("+left[l]+{$0}+right[r]+")";
      $$ = rs;
    }}
  | VM.Node {{
      String[] sub = (String[])$1;
      String[] rs = new String[sub.length];
      for (int s = 0; s < sub.length; ++ s)
        rs[s] = "("+{$0}+sub[s]+")";
      $$ = rs;
    }};

Each rule returns an array of strings with fully parenthesized expressions. The same coding trick as used in Reg to reset register allocation is used in Widen to print the array of strings at the top level of the visit without the benefit of cooperation by Expression.
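The combination step can be tried outside of jag with a few lines of plain Java. This is only a sketch of the indexing scheme used in the binary rule; WidenSketch, leaf(), and combine() are names invented for the illustration.

```java
class WidenSketch {
    /** Two variants of a Double leaf: as a Long and as a Double literal. */
    static String[] leaf(double value) {
        long n = (long) value;
        return new String[] { "" + n, n + "." };
    }

    /** All combinations of left and right variants, fully parenthesized. */
    static String[] combine(String[] left, String op, String[] right) {
        String[] rs = new String[left.length * right.length];
        for (int l = 0; l < left.length; ++l)
            for (int r = 0; r < right.length; ++r)
                rs[l + r * left.length] = "(" + left[l] + op + right[r] + ")";
        return rs;
    }
}
```

For example, combine(leaf(1), " + ", leaf(2)) yields the four strings (1 + 2), (1. + 2), (1 + 2.), and (1. + 2.), in that order. The left variant changes fastest because the index is l + r*left.length.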


A little bit of shell programming provides a regression test of all possibilities:

$ { java -jar cmp.jar Widen >/dev/null; } 2>&1 |
> tee benchmark |
> { java -jar cmp.jar Reg >/dev/null; } 2>&1 |
> if [ -r benchmark.out ]; then
>   diff -b - benchmark.out
> else
>   tee benchmark.out
> fi
1.+2.
3.-4.
5.+(6+7.)
8.-(9+10.)
(11+12.)-13.
(14+15.)-(16+17.)
-18.
-(19+20.)

7.10 Summary

It has been demonstrated often enough that the principle behind jag (tree traversals based on pattern recognition while influencing the order of traversal) is quite useful for efficient code generation. A basic solution is available quickly and it can be continuously refined by incorporating special cases.

The examples in this chapter are intended to document that pattern recognition based on tree node classes and their hierarchical relationships is sufficiently powerful. The multiple inheritance available through interfaces does not seem to be confusing; rather, it can be used quite elegantly for marking properties and creating common ancestors of various node classes.

jag was quite simple to implement and it should be easily mastered because it encompasses very few concepts. This implementation with compiled actions is by far superior to an earlier implementation as an interpreter; this became particularly obvious when the relatively realistic example of the register machine was implemented with the earlier, interpreted version of jag.

Serializing the code generators permits a significant separation between frontend and backend of a compiler. The rules should make it possible to implement jag-based visitors for a node class library for which frontends can then be developed independently.

Code generation for the register machine demonstrates that one could defer semantic considerations until code generation time; however, those should really be dealt with earlier.

Regression testing by shell programming and successive analysis with diff is an indispensable tool specifically for refining code generators.


8 LL Parsing With Objects

This chapter describes architecture and use of the object-oriented parser generator oops.

8.1 Recognition

Recursive Descent Analysis, page 15, pointed out that a grammar like

expr/oops/expr.rfc

// typical RFC-style EBNF grammar for lines with arithmetic expressions

lines: ( sum ? '\n' )*;
sum: product ( '+' product | '-' product )*;
product: term ( '*' term | '/' term | '%' term )*;
term: ( '+' | '-' )* ( 'Number' | '(' sum ')' );

can be represented as a syntax graph (see Syntax Graphs, page 5) or as a modified Nassi-Shneiderman diagram and that these diagrams can be interpreted as recognizer programs:

The Nassi-Shneiderman diagrams can be represented as objects from the following classes, collected as member classes of edu.rit.cs.oops.Rules:

The diagrams can be viewed as an interpreter tree, i.e., the classes should support some evaluation concept — it seems reasonable to require that a tree representing a grammar should be able to recognize a program conforming to the grammar.

[Figure: Nassi-Shneiderman diagrams for the rules lines, sum, product, and term]

[Figure: diagram elements and their classes: Terminal, Nonterminal, Seq, Xor, and Rep (for ?, +, and *)]

http://www.cs.rit.edu/~ats/projects/oops/edu/doc/


Given a tree of recognizer objects, the appropriate algorithm is recursive descent in a method parse(). Therefore, the classes are derived from

public abstract class Node implements Serializable {
  /** lookahead token or set. */
  protected final Token lookahead;
  /** @throws IllegalArgumentException if lookahead is null. */
  protected Node (Token lookahead) {
    this.lookahead = lookahead;
    if (lookahead == null) throw new IllegalArgumentException("null lookahead");
  }
  /** performs recognition, without error recovery.
      This method will not be called, unless the current token
      is in the lookahead of this node.
      @param observer to which terminals must be shifted.
      @see edu.rit.cs.oops.Rules#run
    */
  public abstract void parse (Observer observer) throws IOException;

  // other methods for display ...
}

and Rules is a container for Rule objects, each containing a diagram. Rules implements

/** what a parser must do.
    A parser is a {@link Program} which fetches tokens from a {@link Scanner}
    and interacts with an {@link ObserverFactory} to process them.
  */
public interface Parser extends Program {
  /** sets scanner, prior to {@link #run run}. */
  void setScanner (Scanner scanner);
  /** returns scanner, e.g., for access to associated values. */
  Scanner getScanner ();
  /** returns output stream. */
  PrintStream getOut ();
  /** parses tokens and notifies observers, without error recovery.
      @param environment manages messaging, etc.
      @param in to be read.
      @param out to be written.
    */
  void run (Environment environment, InputStream in, PrintStream out)
    throws IOException;
  /** returns current position for messages or an empty string. */
  String position ();

  // other methods to deal with observing ...
}

where Program is a general interface consisting of run() and position(), which permits typical programs to be run in an Environment with configuration resources and properties and various message levels. Environment is implemented by Main and Applet.

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Rules.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Program.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Environment.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Main.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Applet.java


The Scanner interface is patterned after the one used by jay, page 27:

/** what a scanner must do.
    A scanner reads characters and assembles them into a {@link Token}
    which is used by a {@link Parser}.
    It may also associate a value with a token.
    @see TokenizerScanner
  */
public interface Scanner {
  /** called just prior to use.
      @param environment manages trace, etc.
      @param reader to be consumed.
    */
  void init (Environment environment, Reader reader) throws IOException;
  /** advances in input, posts values for {@link #token} and {@link #value}.
      @return false at end of input.
    */
  boolean advance () throws IOException;
  /** returns true at end of input. */
  boolean atEof ();
  /** returns current terminal.
      This should be set by each call to {@link #advance}.
      It might be called more than once.
    */
  Token token ();
  /** returns value associated with current terminal.
      This should be set by each call to {@link #advance}.
      It might be called more than once.
    */
  Object value ();
  /** returns input position for {@link Environment#message message}. */
  String toString ();
}

Token, finally, remains mostly opaque:

/** what a token must do.
    Tokens are used to represent terminals and interact with
    the lookahead sets of the nodes representing a grammar.
    @see Set
    @see TokenSets
  */
public interface Token extends Serializable {
  /** returns true if tokens are equal. */
  boolean matches (Token token);
  /** returns false. There is no terminal that can accept empty input.
      @see Set
    */
  boolean matches ();
  /** describes represented terminal. */
  String toString ();
}

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Scanner.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Token.java


Recursive Descent in Classes

Each Node contains lookahead, a single Token or a Set representing Token values which can start its part of the diagram. Parsing is based on the agreement that a Node is only entered if the current input as reported by Scanner's token() matches the lookahead.

Given lookahead, it is trivial to implement parse() — note that empty input can be accepted by certain classes:

Rules

The recursive descent starts by sending run() to a Rules container object which in turn tries to send parse() to the first Rule, i.e., to the start symbol of the grammar which the Rules object represents.

Rule

A Rule implements parse() by asking its right-hand side, i.e., a diagram represented over the other classes, to parse(). There is nothing to check because the Rule could not have been entered if the lookahead had not matched.

Terminal

Again, nothing to check — the required terminal is in the input and can immediately be consumed.

Nonterminal

Nothing to check — simply defer to the corresponding Rule.

Seq

A Seq implements parse() by sending it to each subtree in turn — after making sure that the current token matches. Alternatively, if the subtree lookahead does not match the current token but permits empty input, the subtree is skipped.

Xor

Nothing to check — the current token must match exactly one alternative to which parse() is sent.

Rep

The current token will match the body at least once; therefore, parse() is sent to the body. This continues as long as repetition is allowed and the current token matches. Finally, Rep checks that there were sufficiently many repetitions.
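The agreement described above can be illustrated with a strongly simplified sketch. It is not the oops implementation: tokens are plain strings, lookahead is a string set filled in by the constructors, Input stands in for the Scanner, there are no observers, and Rule and Nonterminal are left out (recursion would require the deferred lookahead computation of the next section). All class and method names below are chosen for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RecognizerSketch {
    /** Token cursor standing in for a Scanner. */
    static final class Input {
        private final List<String> tokens;
        private int pos = 0;
        Input(String... tokens) { this.tokens = Arrays.asList(tokens); }
        String token() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }
        void advance() { ++pos; }
        boolean atEof() { return pos >= tokens.size(); }
    }

    static abstract class Node {
        final Set<String> lookahead = new HashSet<>();
        boolean empty;                            // true if empty input is acceptable
        boolean matches(Input in) { return lookahead.contains(in.token()); }
        abstract void parse(Input in);            // only entered if matches(in)
    }

    static final class Terminal extends Node {
        Terminal(String symbol) { lookahead.add(symbol); }
        void parse(Input in) { in.advance(); }    // nothing to check
    }

    static final class Seq extends Node {
        final Node[] terms;
        Seq(Node... terms) {
            this.terms = terms;
            empty = true;
            for (Node t : terms) {                // union up to the first term requiring input
                if (empty) lookahead.addAll(t.lookahead);
                empty &= t.empty;
            }
        }
        void parse(Input in) {
            for (Node t : terms)
                if (t.matches(in)) t.parse(in);
                else if (!t.empty)                // may only be skipped if empty is acceptable
                    throw new IllegalStateException("unexpected " + in.token());
        }
    }

    static final class Xor extends Node {
        final Node[] alts;
        Xor(Node... alts) {
            this.alts = alts;
            for (Node a : alts) { lookahead.addAll(a.lookahead); empty |= a.empty; }
        }
        void parse(Input in) {                    // exactly one alternative matches
            for (Node a : alts)
                if (a.matches(in)) { a.parse(in); return; }
        }
    }

    static final class Rep extends Node {
        final Node body; final int min, max;
        Rep(Node body, int min, int max) {
            this.body = body; this.min = min; this.max = max;
            lookahead.addAll(body.lookahead);
            empty = min == 0;
        }
        void parse(Input in) {
            int n = 0;
            do { body.parse(in); ++n; } while (n < max && body.matches(in));
            if (n < min) throw new IllegalStateException("too few repetitions");
        }
    }

    /** Recognizes sum: 'Number' ( '+' 'Number' | '-' 'Number' )*; */
    static boolean recognize(String... tokens) {
        Node sum = new Seq(new Terminal("Number"),
            new Rep(new Xor(
                new Seq(new Terminal("+"), new Terminal("Number")),
                new Seq(new Terminal("-"), new Terminal("Number"))),
                0, Integer.MAX_VALUE));
        Input in = new Input(tokens);
        if (!sum.matches(in)) return false;       // a Node is only entered on a match
        try { sum.parse(in); } catch (IllegalStateException e) { return false; }
        return in.atEof();
    }
}
```

With these assumptions, recognize("Number", "+", "Number") and recognize("Number") succeed, while recognize("Number", "+") fails inside the inner Seq because the second Number is missing and cannot be skipped.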

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Set.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Token.java


Observing Recognition

An Observer design pattern provides clean separation between parse() and user actions once pieces of the input are recognized:

interface Observer {
  /** called when a {@link edu.rit.cs.oops.Parser.Terminal Terminal} is accepted.
      @see Rules.Terminal#parse
    */
  void shift (Terminal sender);
  /** called when a rule is completed.
      @see Rules.Rule#parse
    */
  void reduce ();

One can add more Terminal classes, e.g., for terminals decorated with different quotes in the grammar, to simplify a divide-and-conquer approach:

  void shift (BTerminal sender);        // back quotes
  void shift (DTerminal sender);        // double quotes

The key idea, however, is to allow each Observer to create a new Observer for a nonterminal in its rule:

  /** called when a rule is activated.
      @param name of rule's nonterminal which will be activated.
      @param rule number which will be activated.
      @return the observer to watch progress in this rule activation.
      @see Rules.Rule#parse
    */
  Observer init (String name, int rule);

Once the new Observer has seen reduce(), it is sent back to its creator:

  /** called when a nonterminal is accepted.
      @param sender the observer that watched the nonterminal.
      @see Rules.Rule#parse
    */
  void shift (Observer sender);

The creator can then ask the terminated Observer for a result; essentially, this plays the role of $$ in jay actions:

  /** returns some value, perhaps associated with the completed rule. */
  Object value ();
}

The Observer interface is embedded in an ObserverFactory which must be provided to a Rules container (Parser) and which is responsible for creating the very first Observer:

public interface ObserverFactory {
  /** called just prior to use.
      @param environment manages trace, etc.
      @param parser will send events.
      @return the top-level observer, which will receive init for rule 0,
        shift once this rule activation is complete,
        and reduce as end of input is ascertained.
    */
  Observer init (Environment environment, Parser parser);

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/ObserverFactory.java
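The life cycle of a nested Observer (created by init, fed by shift, completed by reduce, returned to its creator, and asked for its value) can be played through by hand. The interface below is a reduced, hypothetical variant: terminals are passed as strings and init takes only the rule name, unlike the oops interface shown above. The events a parser would send are issued directly in demo().

```java
class ObserverSketch {
    /** Reduced variant of the observer protocol, for illustration only. */
    interface Observer {
        Observer init(String name);       // rule activated: observer for this activation
        void shift(String terminal);      // terminal accepted
        void shift(Observer sender);      // nonterminal accepted
        void reduce();                    // rule completed
        Object value();                   // result of the completed rule
    }

    /** Adds up the numbers shifted to it or delivered by nested observers. */
    static final class Sum implements Observer {
        private long total;
        public Observer init(String name) { return new Sum(); }
        public void shift(String terminal) { total += Long.parseLong(terminal); }
        public void shift(Observer sender) { total += (Long) sender.value(); }
        public void reduce() { }
        public Object value() { return total; }
    }

    /** Issues the events for the input 1 ( 2 3 ) by hand. */
    static long demo() {
        Observer top = new Sum();
        top.shift("1");
        Observer nested = top.init("sum");    // parser enters a nested rule
        nested.shift("2");
        nested.shift("3");
        nested.reduce();                      // nested rule is complete
        top.shift(nested);                    // nested observer returns to its creator
        top.reduce();
        return (Long) top.value();
    }
}
```

Here demo() returns 6: the nested observer collects 5 on its own, and the creator picks that value up in shift(Observer), much like $$ of a completed rule is picked up in a jay action.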


8.2 Checking LL(1)

Parsing as described above is critically dependent on each Node knowing its lookahead and on simple matches between the current input and the lookahead being sufficient for deciding what to do: the LL(1) condition.

Parsing depends on seven classes. Rules defers to the first Rule and Nonterminal defers to the corresponding Rule. Therefore, only five classes need to be considered:

Terminal

No chance for ambiguity here: the current input will have to match.

Rule

If a Rule does not require input, the current input must decide if the Rule should be entered or bypassed, i.e., the lookahead and whatever might follow the Rule cannot have a symbol in common.

Xor

lookahead for all alternatives must be mutually disjoint; at most one alternative may not require input.

If any does, all lookaheads together must be disjoint from what follows.

Seq

If no input is altogether required, the lookahead must be disjoint from what follows.

Rep

lookahead must be disjoint from what follows, irrespective of whether input is required, so that the repetition can be terminated.

Therefore, to apply this algorithm to decide if a grammar is not ambiguous one needs to know for each node from these classes which symbols can start and which can follow the node. edu.rit.cs.oops.pg.Rules has the same inner class names as edu.rit.cs.oops.Rules but this package can compute lookahead and follow sets and perform the LL(1) check outlined above.

More is actually checked: each Nonterminal must have a corresponding Rule, each Rule must be used, there cannot be left recursion (that would make it impossible to compute lookahead), and there cannot be unlimited recursion.

If a grammar is represented using edu.rit.cs.oops.pg.Rules as a container, the message parser() performs all checks and returns a representation over edu.rit.cs.oops.Rules which can be serialized and/or executed.

Splitting the technology into two packages improves the runtime efficiency of the parsers and permits binary distributions. It also simplified development: parsing could be attempted before complete checking was available.

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/pg/Rules.java
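The conditions listed for Xor and Rep boil down to disjointness tests on symbol sets. A minimal sketch, assuming symbols are plain strings and that the lookahead and follow sets are already known; Ll1CheckSketch and its methods are illustrative names, not part of oops:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Ll1CheckSketch {
    static Set<String> set(String... symbols) {
        return new HashSet<>(Arrays.asList(symbols));
    }

    /** Xor: alternatives mutually disjoint, at most one without input;
        if there is one, all lookaheads together are disjoint from follow. */
    static boolean xorOk(List<Set<String>> lookaheads, List<Boolean> acceptsEmpty,
                         Set<String> follow) {
        Set<String> union = new HashSet<>();
        int empties = 0;
        for (int i = 0; i < lookaheads.size(); ++i) {
            if (!Collections.disjoint(union, lookaheads.get(i))) return false;
            union.addAll(lookaheads.get(i));
            if (acceptsEmpty.get(i)) ++empties;
        }
        if (empties > 1) return false;
        return empties == 0 || Collections.disjoint(union, follow);
    }

    /** Rep: lookahead disjoint from follow, so that the repetition can end. */
    static boolean repOk(Set<String> lookahead, Set<String> follow) {
        return Collections.disjoint(lookahead, follow);
    }
}
```

For the repetition in sum, repOk(set("+", "-"), set(")", "\n")) holds, whereas two alternatives that both start with '+' make xorOk fail regardless of what follows.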


lookahead

This is a two-pass algorithm. First, Rules asks each Rule in turn:

Rule

If this Rule is currently trying to get its lookahead there is left recursion.

If lookahead is not known, ask right-hand side.

Nonterminal

ask the Rule.

Terminal

lookahead consists of the symbol to be recognized.

Xor

lookahead is union of alternatives.

lookahead for all alternatives must be mutually disjoint; at most one alternative may not require input.

Seq

lookahead is union of terms' lookaheads until the first term that requires input. The remaining terms are not considered in this pass.

Rep

lookahead is lookahead of body.

No input is acceptable exactly if the lower bound is zero.

Unless there is left recursion, each Rule now knows its lookahead.

For the second pass, Rules only asks the start symbol Rule. Each Rule, if inactive, marks itself as active and asks the right-hand side. All other nodes, in particular Seq, unconditionally ask all children.

After the second pass, Rules checks that each Rule has been reached. All reachable interior nodes have now been asked and know their lookahead.
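The first pass can be sketched as follows. This is an illustration under simplifying assumptions, not the code of edu.rit.cs.oops.pg.Rules: symbols are strings, rules live in a map, and only the first pass is shown, so interior nodes behind a term that requires input remain unasked. All names are invented for the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LookaheadSketch {
    static abstract class Node {
        Set<String> lookahead;                    // null until computed
        boolean empty;                            // true if no input is acceptable
        abstract void compute(Map<String, Rule> rules);
        Set<String> ask(Map<String, Rule> rules) {
            if (lookahead == null) compute(rules);
            return lookahead;
        }
    }
    static final class Terminal extends Node {
        final String symbol;
        Terminal(String symbol) { this.symbol = symbol; }
        void compute(Map<String, Rule> rules) {
            lookahead = new HashSet<>(Arrays.asList(symbol));
        }
    }
    static final class Nonterminal extends Node {
        final String name;
        Nonterminal(String name) { this.name = name; }
        void compute(Map<String, Rule> rules) {   // ask the Rule
            Rule rule = rules.get(name);
            lookahead = rule.ask(rules);
            empty = rule.empty;
        }
    }
    static final class Xor extends Node {
        final Node[] alts;
        Xor(Node... alts) { this.alts = alts; }
        void compute(Map<String, Rule> rules) {   // union of alternatives
            Set<String> set = new HashSet<>();
            for (Node a : alts) { set.addAll(a.ask(rules)); empty |= a.empty; }
            lookahead = set;
        }
    }
    static final class Seq extends Node {
        final Node[] terms;
        Seq(Node... terms) { this.terms = terms; }
        void compute(Map<String, Rule> rules) {
            Set<String> set = new HashSet<>();
            boolean allEmpty = true;
            for (Node t : terms) {
                set.addAll(t.ask(rules));
                if (!t.empty) { allEmpty = false; break; }  // rest is left for the second pass
            }
            lookahead = set;
            empty = allEmpty;
        }
    }
    static final class Rep extends Node {
        final Node body; final int min;
        Rep(Node body, int min) { this.body = body; this.min = min; }
        void compute(Map<String, Rule> rules) {
            lookahead = body.ask(rules);
            empty = min == 0;                     // exactly if the lower bound is zero
        }
    }
    static final class Rule extends Node {
        final Node rhs;
        boolean active;                           // true while asking the right-hand side
        Rule(Node rhs) { this.rhs = rhs; }
        void compute(Map<String, Rule> rules) {
            if (active) throw new IllegalStateException("left recursion");
            active = true;
            try { lookahead = rhs.ask(rules); empty = rhs.empty; }
            finally { active = false; }
        }
    }

    /** sum: product ( '+' product )*;  product: 'Number' | '(' sum ')'; */
    static Set<String> demo() {
        Map<String, Rule> rules = new HashMap<>();
        rules.put("sum", new Rule(new Seq(new Nonterminal("product"),
            new Rep(new Seq(new Terminal("+"), new Nonterminal("product")), 0))));
        rules.put("product", new Rule(new Xor(new Terminal("Number"),
            new Seq(new Terminal("("), new Nonterminal("sum"), new Terminal(")")))));
        for (Rule rule : rules.values()) rule.ask(rules);   // first pass
        return rules.get("sum").lookahead;
    }

    /** sum: sum '+' 'Number'; is left recursive. */
    static boolean detectsLeftRecursion() {
        Map<String, Rule> rules = new HashMap<>();
        rules.put("sum", new Rule(new Seq(new Nonterminal("sum"),
            new Terminal("+"), new Terminal("Number"))));
        try { rules.get("sum").ask(rules); return false; }
        catch (IllegalStateException e) { return true; }
    }
}
```

In demo() the recursion through '(' sum ')' is harmless because Seq stops at the terminal '(' which requires input. The grammar in detectsLeftRecursion() reaches the active Rule again before any symbol is found, which is reported as left recursion.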


follow

This is a multi-pass algorithm. Rules sends end of file as initial follow set to the start symbol Rule. Then Rules asks each Rule to propagate its follow set until nothing changes during a complete pass.

The algorithm terminates because there is a finite set of symbols and rules.

Rule

send sets or increases the current follow set. Any change is marked and arranges for another pass in Rules.

propagate clears the mark if any and sends the current follow set to the right hand side if a change was marked.

Nonterminal

sends the incoming follow set to the corresponding Rule where it may cause a change.

Terminal

no operation.

Xor

saves the incoming follow set and sends it to each alternative.

Seq

saves the incoming follow set and loops backwards over the descendants: if a descendant’s lookahead requires input, it is the predecessor’s follow set; otherwise the symbols in the lookahead are merged with the follow set and sent to the predecessor.

Rep

saves the incoming follow set. If the upper bound is 1 the follow set is sent to the body, otherwise the symbols in the body’s lookahead are merged with the follow set and sent to the body.
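The propagation steps for Seq and Rep can be written down as two small functions. This sketch assumes string symbols and precomputed lookahead sets; the surrounding fixed-point iteration over all rules is not shown, and FollowSketch with its methods is an illustrative name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

class FollowSketch {
    static Set<String> set(String... symbols) {
        return new HashSet<>(Arrays.asList(symbols));
    }

    /** Follow sets for the terms of a Seq, computed by looping backwards. */
    static List<Set<String>> seqFollow(List<Set<String>> lookaheads,
                                       List<Boolean> acceptsEmpty,
                                       Set<String> incoming) {
        LinkedList<Set<String>> follow = new LinkedList<>();
        Set<String> current = new HashSet<>(incoming);
        for (int i = lookaheads.size() - 1; i >= 0; --i) {
            follow.addFirst(current);             // what can follow term i
            Set<String> next = new HashSet<>(lookaheads.get(i));
            if (acceptsEmpty.get(i))              // term may vanish: merge with its follow set
                next.addAll(current);
            current = next;                       // sent to the predecessor
        }
        return follow;
    }

    /** Follow set sent to the body of a Rep with the given upper bound. */
    static Set<String> repBodyFollow(Set<String> bodyLookahead,
                                     Set<String> incoming, int max) {
        Set<String> result = new HashSet<>(incoming);
        if (max != 1) result.addAll(bodyLookahead);   // the body can follow itself
        return result;
    }
}
```

For sum: product ( '+' product )* with incoming follow set { ')', '\n' }, the repetition receives { ')', '\n' }, and because it accepts empty input, the leading product receives { '+', ')', '\n' }.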


8.3 Parser Generation

EBNF can describe itself:

expr/oops/rfc.rfc

// RFC-style grammar for EBNF

grammar: rule+;
rule: 'Id' ':' alt ';';
alt: seq ( '|' seq )*;
seq: ( term )+;
term: item ( '?' | '+' | '*' )?;
item: 'Id' | 'String' | '(' alt ')';

If this grammar is represented over edu.rit.cs.oops.Rules the resulting tree can recognize grammars — including the grammar above.

If an Observer represents a recognized grammar over edu.rit.cs.oops.pg.Rules the resulting tree can be asked to check the recognized grammar, i.e., itself, and represent it over edu.rit.cs.oops.Rules, i.e., as a tree that can recognize programs conforming to the recognized grammar and communicate with a suitable Observer.

In order to get started, a program edu.rit.cs.oops.pg.bnf.Boot is written by hand and executed once to represent the grammar above over edu.rit.cs.oops.pg.Rules and serialize the resulting parser tree using edu.rit.cs.oops.Main as the launcher:

public class Boot implements Program {

  public void run (Environment environment, InputStream in, PrintStream out)
                                                        throws IOException {
    Rules rules = new Rules(environment);

    //r parser: rule+;
    rules.add(rules.Rule("parser",
      rules.Rep(rules.Nonterminal("rule"), 1, Rules.Rep.INFINITY)));

    //r rule: "Id" ':' alt ';';
    rules.add(rules.Rule("rule",
      rules.Seq(new Rules.Node[] {
        rules.DTerminal("Id"),
        rules.Terminal(":"),
        rules.Nonterminal("alt"),
        rules.Terminal(";")
      })));

    //r alt: seq ( '|' seq )*;
    rules.add(rules.Rule("alt",
      rules.Seq(new Rules.Node[] {
        rules.Nonterminal("seq"),
        rules.Rep(rules.Seq(new Rules.Node[] {
          rules.Terminal("|"),
          rules.Nonterminal("seq")
        }), 0, Rules.Rep.INFINITY)
      })));

    // ...

    ObjectOutputStream o = new ObjectOutputStream(out);
    o.writeObject(rules.parser());
    o.close();
  }
}

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/pg/bnf/Boot.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Main.java


The serialized tree can recognize grammars written according to the grammar above and communicate with an Observer.

edu.rit.cs.oops.bnf.Rfc is an ObserverFactory that fits the grammar above and represents grammars over edu.rit.cs.oops.pg.Rules. In particular, it can represent the grammar above.

Using the serialized tree and Rfc, any arbitrary grammar — written according to the grammar above — can therefore be represented over edu.rit.cs.oops.pg.Rules, checked for the LL(1) condition, represented over edu.rit.cs.oops.Rules, and serialized as a recognizer tree.

Bootstrap

The details are further discussed in the oops documentation.

boot.jar is a parser-generator which compiles grammars into recognizer trees. It consists of the hand-crafted tree from Boot, the observers from Rfc, and the necessary class files:

$ java -Doops.program=edu.rit.cs.oops.pg.bnf.Boot \
>      edu.rit.cs.oops.Main > oops/program
$ echo oops.observers = edu.rit.cs.oops.pg.bnf.Rfc > oops/properties
$ echo Main-Class: edu.rit.cs.oops.Main > manifest
$ jar cfm boot.jar manifest oops/* edu/rit/cs/oops/*.class \
      edu/rit/cs/oops/pg/*.class \
      edu/rit/cs/oops/pg/bnf/Boot*.class \
      edu/rit/cs/oops/pg/bnf/Rfc*.class

rfc.jar is functionally equivalent to boot.jar, but it is built by using boot.jar to compile the grammar into the same tree (hopefully) that Boot creates manually. Note that rfc.jar does not need to contain the Boot class files anymore.

$ java -Doops.stripper=//r \
> -Doops.pg.tokenizer.comment2=/ \
> -Doops.pg.tokenizer.string.BString=\` \
> -Doops.pg.tokenizer.string.DString=\" \
> -Doops.pg.tokenizer.string.String=\' \
> -Doops.pg.tokenizer.word=Id \
> -jar boot.jar rit/cs/oops/pg/bnf/Boot.java > oops/program
$ echo oops.observers = edu.rit.cs.oops.pg.bnf.Rfc > oops/properties
$ echo Main-Class: edu.rit.cs.oops.Main > manifest
$ jar cfm rfc.jar manifest oops/* edu/rit/cs/oops/*.class \
    edu/rit/cs/oops/pg/*.class \
    edu/rit/cs/oops/pg/bnf/Rfc*.class

By convention, the implementation of Boot (or an ObserverFactory) contains the grammar rules for which it was written, marked by comments //r. boot.jar is instructed by the property oops.stripper to operate on exactly those lines in the source of Boot.
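The idea of operating only on the marked lines can be sketched as follows. This is an illustration under assumptions: the class MarkedLines is hypothetical, and the real stripper may treat unmarked lines differently (for example to preserve line numbers).

```java
import java.util.ArrayList;
import java.util.List;

final class MarkedLines {
    // Returns the text following the marker on every line that starts with it
    // (ignoring leading white space); all other lines are dropped.
    static List<String> extract(List<String> source, String marker) {
        List<String> rules = new ArrayList<>();
        for (String line : source) {
            String trimmed = line.trim();
            if (trimmed.startsWith(marker)) {
                rules.add(trimmed.substring(marker.length()).trim());
            }
        }
        return rules;
    }
}
```

Applied to the source of Boot with the marker //r, this would yield lines such as parser: rule+; while the Java statements are ignored.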

The grammar contains all terminal symbols. A reimplementation of StreamTokenizer is automatically configured to be a scanner/screener for the parser. A few properties configure comment conventions and the representation of certain terminal symbols. All properties are extensively described in the oops documentation.
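How a tokenizer can be configured for comments, quoted strings, and words is sketched below with the standard java.io.StreamTokenizer rather than the oops reimplementation. The class ScannerSketch is hypothetical; the labels Id, String, and DString merely echo the property names used above, and how oops maps its properties onto the tokenizer is an assumption.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class ScannerSketch {
    // Splits the input into tokens and labels each one with a simple category.
    static List<String> scan(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();              // every character starts out as ordinary
        st.whitespaceChars(0, ' ');    // control characters and blank separate tokens
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('_', '_');
        st.quoteChar('\'');            // single-quoted string
        st.quoteChar('"');             // double-quoted string
        st.commentChar('#');           // comment extends to the end of the line

        List<String> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
            case StreamTokenizer.TT_WORD:
                tokens.add("Id " + st.sval);
                break;
            case '\'':
                tokens.add("String " + st.sval);
                break;
            case '"':
                tokens.add("DString " + st.sval);
                break;
            default:                   // any other single character
                tokens.add("Char " + (char) st.ttype);
            }
        }
        return tokens;
    }
}
```

For the input rule: "Id" ':' alt ';'; the sketch delivers a word, a colon, a double-quoted string, a single-quoted string, a word, another single-quoted string, and a semicolon.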



http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/pg/bnf/Rfc.observers

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Tokenizer.java


rfc.jar can compile itself:

$ java -Doops.stripper=//r \
> -Doops.pg.tokenizer.comment2=/ \
> -Doops.pg.tokenizer.string.BString=\` \
> -Doops.pg.tokenizer.string.DString=\" \
> -Doops.pg.tokenizer.string.String=\' \
> -Doops.pg.tokenizer.word=Id \
> -jar rfc.jar rit/cs/oops/pg/bnf/Rfc.observers | \
> cmp - oops/program

This time the grammar is extracted from the implementation of the ObserverFactory. Rfc.observers contains the same grammar rules as Boot.java.

Usage

In order to implement an interpreter for arithmetic expressions, an ObserverFactory Eval.observers is derived from Observers, thus inheriting tracing facilities. By convention, the grammar rules are written as part of the ObserverFactory.

$ java -classpath . edu.rit.cs.oops.Stripper \
> -r RULE //r < rit/cs/oops/examples/Eval.observers \
> > edu/rit/cs/oops/examples/Eval.java

Stripper can be used to make sure rule numbers and Java code of an ObserverFactory are kept in sync.

$ javac -classpath . edu/rit/cs/oops/examples/Eval.java
$ java -Doops.stripper=//r \
> -Doops.pg.tokenizer.comment=# \
> -Doops.pg.tokenizer.hex=Number \
> -Doops.pg.tokenizer.octal=Number \
> -Doops.pg.tokenizer.digits=Number \
> -jar rfc.jar rit/cs/oops/examples/Eval.observers > oops/program
$ echo oops.observers = edu.rit.cs.oops.examples.Eval \
> > oops/properties
$ echo Main-Class: edu.rit.cs.oops.Main > manifest
$ jar cfm eval.jar manifest oops edu/rit/cs/oops/*.class \
    edu/rit/cs/oops/examples/Eval*.class

When the recognizer tree and its ObserverFactory are combined, only the class files of edu.rit.cs.oops are required. eval.jar is a compiler and interpreter for arithmetic expressions:

$ java -jar eval.jar
# comprehensive example of arithmetic expressions
1 + 010 * 0xa
81
1 * (010 + 0xa)
18
+ (1 + 22 - 33 * 4 / 5 % + 6)
21
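The results 81, 18, and 21 in the session above can be cross-checked with a small hand-written evaluator. This sketch uses plain recursive descent instead of a generated recognizer tree with observers; the class ExprSketch and the grammar in its comments are assumptions made for illustration and are not taken from Eval.observers.

```java
final class ExprSketch {
    private final String src;
    private int pos;

    private ExprSketch(String src) { this.src = src; }

    // Evaluates one expression; numbers may be decimal, octal (010), or hex (0xa).
    static long eval(String text) {
        ExprSketch p = new ExprSketch(text);
        long value = p.sum();
        p.skipBlanks();
        if (p.pos < p.src.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return value;
    }

    private void skipBlanks() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    // Consumes c if it is the next non-blank character.
    private boolean accept(char c) {
        skipBlanks();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    // sum: product (('+' | '-') product)*
    private long sum() {
        long value = product();
        for (;;) {
            if (accept('+')) value += product();
            else if (accept('-')) value -= product();
            else return value;
        }
    }

    // product: term (('*' | '/' | '%') term)*
    private long product() {
        long value = term();
        for (;;) {
            if (accept('*')) value *= term();
            else if (accept('/')) value /= term();
            else if (accept('%')) value %= term();
            else return value;
        }
    }

    // term: '+' term | '-' term | '(' sum ')' | Number
    private long term() {
        if (accept('+')) return term();
        if (accept('-')) return -term();
        if (accept('(')) {
            long value = sum();
            if (!accept(')')) throw new IllegalArgumentException("missing ')'");
            return value;
        }
        int start = pos;               // blanks were already skipped by accept()
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        if (start == pos)
            throw new IllegalArgumentException("number expected at " + pos);
        return Long.decode(src.substring(start, pos));  // handles 010 and 0xa
    }
}
```

With integer division, 33 * 4 / 5 % + 6 evaluates to 132 / 5 = 26 and 26 % 6 = 2, so the last expression yields 1 + 22 - 2 = 21.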

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/pg/bnf/Rfc.observers

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/pg/bnf/Boot.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/examples/Eval.observers

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Observers.java

http://www.cs.rit.edu/~ats/projects/oops/edu/rit/cs/oops/Stripper.java


8.4 Extensions

8.5 Summary

9 Interpretation

A syntax tree can be built over a class library that provides useful functionality for the tree. There can be methods to check the tree for semantic problems and there can be methods to interpret the tree as a program. This chapter introduces such a class library and looks at the implementation of a simple language using this library.
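What such a class library might look like in its smallest form is sketched below. The classes Tree, Num, Name, and Add are hypothetical and are not the library introduced in this chapter; they only show one method that checks a tree for a semantic problem and one that interprets it.

```java
import java.util.Map;

// Hypothetical tree classes for illustration only.
abstract class Tree {
    // Checks the subtree; throws if a semantic problem is found.
    abstract void check(Map<String, Long> memory);

    // Interprets the subtree and returns its value.
    abstract long interpret(Map<String, Long> memory);
}

final class Num extends Tree {
    private final long value;

    Num(long value) { this.value = value; }

    void check(Map<String, Long> memory) { }   // a constant is always fine

    long interpret(Map<String, Long> memory) { return value; }
}

final class Name extends Tree {
    private final String name;

    Name(String name) { this.name = name; }

    void check(Map<String, Long> memory) {
        if (!memory.containsKey(name))
            throw new IllegalStateException("undefined: " + name);
    }

    long interpret(Map<String, Long> memory) { return memory.get(name); }
}

final class Add extends Tree {
    private final Tree left, right;

    Add(Tree left, Tree right) { this.left = left; this.right = right; }

    void check(Map<String, Long> memory) {
        left.check(memory);
        right.check(memory);
    }

    long interpret(Map<String, Long> memory) {
        return left.interpret(memory) + right.interpret(memory);
    }
}
```

Calling check() before interpret() separates semantic analysis from execution, so a faulty tree is rejected before any part of it runs.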

10 Procedures

This chapter looks at the implementation of functions, procedures, and parameter passing.

11 Block Structure

Algol introduced the concept of block structure, where the visibility of names and the lifetime of values depend on nesting in the source code. This chapter discusses the necessary algorithms.
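Name lookup under block structure can be sketched with a stack of tables, one per open block. The class Scopes is a hypothetical illustration and not the algorithm developed in this chapter; it assumes enter() has been called before the first declare().

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class Scopes {
    // One table per open block; the innermost block is at the head.
    private final Deque<Map<String, Integer>> blocks = new ArrayDeque<>();

    void enter() { blocks.push(new HashMap<>()); }   // a block begins

    void leave() { blocks.pop(); }                   // its names disappear

    // Declares a name in the innermost block, hiding outer declarations.
    void declare(String name, int value) { blocks.peek().put(name, value); }

    // Searches from the innermost block outward; null if the name is unknown.
    Integer lookup(String name) {
        for (Map<String, Integer> block : blocks) {
            Integer value = block.get(name);
            if (value != null) return value;
        }
        return null;
    }
}
```

Leaving a block discards its table, which makes an outer declaration of the same name visible again.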
