CS 153: Concepts of Compiler Design
October 21 Class Meeting
Department of Computer Science
San Jose State University
Fall 2015
Instructor: Ron Mak
www.cs.sjsu.edu/~mak
Computer Science Dept. Fall 2015: October 21
CS 153: Concepts of Compiler Design © R. Mak
Backus Naur Form (BNF)
A text-based way to describe source language syntax. Named after John Backus and Peter Naur.
Text-based means it can be read by a program.
Example: A compiler-compiler that can automatically generate a parser for a source language after reading (and parsing) the language’s syntax rules written in BNF.
Backus Naur Form (BNF), cont’d
Uses certain meta-symbols.
Symbols that are part of BNF itself but are not necessarily part of the syntax of the source language.
::= “is defined as”
| “or”
< > Surround names of nonterminal (not literal) items
BNF Example: U.S. Postal Address
<postal-address>   ::= <name-part> <street-part> <city-state-part>
<name-part>        ::= <first-part> <last-name>
                     | <first-part> <last-name> <suffix>
<first-part>       ::= <first-name> | <capital-letter> .
<suffix>           ::= Sr. | Jr. | <roman-numeral>
<street-part>      ::= <house-number> <street-name>
                     | <house-number> <street-name> <apartment-number>
<city-state-part>  ::= <city-name> , <state-code> <ZIP-code>
<first-name>       ::= <name>
<last-name>        ::= <name>
<street-name>      ::= <name>
<city-name>        ::= <name>
<house-number>     ::= <number>
<apartment-number> ::= <number>
<state-code>       ::= <capital-letter> <capital-letter>
<capital-letter>   ::= A|B|C|D|E|F|G|H|I|J|K|L|M
                     |N|O|P|Q|R|S|T|U|V|W|X|Y|Z
<name>             ::= …
<number>           ::= …
etc.
Example:
Mary Jane
123 Easy Street
San Jose, CA 95192
BNF: Optional and Repeated Items
To show optional items in BNF, use the vertical bar |.
Example: “An expression is a simple expression optionally followed by a relational operator and another simple expression.”
<expression> ::= <simple expression>
| <simple expression> <rel op> <simple expression>
BNF: Optional and Repeated Items
BNF uses recursion for repeated items.
Example: “A digit sequence is a digit followed by zero or more digits.”
Right recursive:
<digit sequence> ::= <digit> | <digit> <digit sequence>
Left recursive:
<digit sequence> ::= <digit> | <digit sequence> <digit>
BNF Example: Pascal Number
<digit sequence>   ::= <digit> | <digit> <digit sequence>
<unsigned integer> ::= <digit sequence>
<unsigned real>    ::= <unsigned integer>.<digit sequence>
                     | <unsigned integer>.<digit sequence> <e> <scale factor>
                     | <unsigned integer> <e> <scale factor>
<scale factor>     ::= <unsigned integer> | <sign> <unsigned integer>
<e>                ::= E | e
<sign>             ::= + | -
<number>           ::= <unsigned integer> | <unsigned real>
Repetition via recursion (see <digit sequence>).
The sign is optional (see <scale factor>).
BNF Example: Pascal IF Statement
<if statement> ::= IF <expression> THEN <statement> | IF <expression> THEN <statement> ELSE <statement>
It should be straightforward to write a parsing method from either the syntax diagram or the BNF.
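As a rough illustration of that claim, here is a minimal Java sketch of a parsing method that follows the BNF rule for the IF statement. It is not code from the course: the class name IfStatementSketch, the whitespace-separated tokens, and the simplification that an <expression> or a non-IF <statement> is a single placeholder token are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one parsing method per nonterminal, mirroring the BNF rule.
final class IfStatementSketch {
    private final List<String> tokens;
    private int pos = 0;

    private IfStatementSketch(String source) {
        // Assumption: tokens are separated by whitespace.
        tokens = Arrays.asList(source.trim().split("\\s+"));
    }

    /** Returns true if the whole input is a single well-formed IF statement. */
    static boolean isValid(String source) {
        IfStatementSketch parser = new IfStatementSketch(source);
        try {
            parser.parseIfStatement();
            return parser.pos == parser.tokens.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // <if statement> ::= IF <expression> THEN <statement>
    //                  | IF <expression> THEN <statement> ELSE <statement>
    private void parseIfStatement() {
        expect("IF");
        parseExpression();
        expect("THEN");
        parseStatement();
        if (peek().equalsIgnoreCase("ELSE")) {   // the second alternative
            pos++;
            parseStatement();
        }
    }

    // Assumption: an expression is one placeholder token.
    private void parseExpression() {
        consumePlaceholder();
    }

    // Assumption: a statement is a nested IF statement or one placeholder token.
    private void parseStatement() {
        if (peek().equalsIgnoreCase("IF")) {
            parseIfStatement();
        } else {
            consumePlaceholder();
        }
    }

    private void consumePlaceholder() {
        String t = peek();
        if (t.isEmpty() || t.equalsIgnoreCase("THEN") || t.equalsIgnoreCase("ELSE")) {
            throw new IllegalStateException("Unexpected '" + t + "'");
        }
        pos++;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "";
    }

    private void expect(String keyword) {
        if (!peek().equalsIgnoreCase(keyword)) {
            throw new IllegalStateException("Expected " + keyword);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(isValid("IF x THEN y ELSE z"));
    }
}
```

Note how the two BNF alternatives share a common prefix, so the method parses the prefix once and then checks for ELSE.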
Grammars and Languages
A grammar defines a language.
Grammar = the set of all the BNF rules (or syntax diagrams)
Language = the set of all the legal strings of tokens according to the grammar
Legal string of tokens = a syntactically correct statement
Grammars and Languages
A statement is in the language (it’s syntactically correct) if it can be derived by the grammar.
Each grammar rule “produces” a token string in the language.
The sequence of productions required to arrive at a syntactically correct token string is the derivation of the string.
Grammars and Languages, cont’d
Example: A very simplified expression grammar:
<expr>  ::= <digit> | <expr> <op> <expr> | ( <expr> )
<digit> ::= 0|1|2|3|4|5|6|7|8|9
<op>    ::= + | *
What strings (expressions) can we derive from this grammar?
Derivations and Productions
Is (1 + 2)*3 valid in our expression language?
<expr>  ::= <digit> | <expr> <op> <expr> | ( <expr> )
<digit> ::= 0|1|2|3|4|5|6|7|8|9
<op>    ::= + | *
GRAMMAR RULE                     PRODUCTION
                                 <expr>
<expr> ::= <expr> <op> <expr>    <expr> <op> <expr>
<expr> ::= <digit>               <expr> <op> <digit>
<digit> ::= 3                    <expr> <op> 3
<op> ::= *                       <expr>*3
<expr> ::= ( <expr> )            ( <expr> )*3
<expr> ::= <expr> <op> <expr>    ( <expr> <op> <expr> )*3
<expr> ::= <digit>               ( <expr> <op> <digit> )*3
<digit> ::= 2                    ( <expr> <op> 2 )*3
<op> ::= +                       ( <expr> + 2 )*3
<expr> ::= <digit>               ( <digit> + 2 )*3
<digit> ::= 1                    ( 1 + 2 )*3
Yes! The expression is valid.
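The same validity question can be checked mechanically. The Java sketch below is a hypothetical recognizer (the class name ExprRecognizer is not from the course) for this expression language. Because the rule <expr> ::= <expr> <op> <expr> is left recursive, the sketch assumes an equivalent iterative form, expr = primary { op primary } with primary = digit | ( expr ), which accepts the same strings. Spaces are simply dropped before matching, which is a simplification.

```java
// Hypothetical recognizer for the simplified expression grammar.
// It answers only "is this string in the language?"; it builds no parse tree.
final class ExprRecognizer {
    private final String s;
    private int pos = 0;

    private ExprRecognizer(String input) {
        s = input.replace(" ", "");   // simplification: ignore spaces
    }

    static boolean isValid(String input) {
        ExprRecognizer r = new ExprRecognizer(input);
        return r.expr() && r.pos == r.s.length();
    }

    // expr -> primary { op primary }   (assumed rewrite of the left-recursive rule)
    private boolean expr() {
        if (!primary()) {
            return false;
        }
        while (pos < s.length() && isOp(s.charAt(pos))) {
            pos++;
            if (!primary()) {
                return false;
            }
        }
        return true;
    }

    // primary -> digit | ( expr )
    private boolean primary() {
        if (pos >= s.length()) {
            return false;
        }
        char c = s.charAt(pos);
        if (c >= '0' && c <= '9') {
            pos++;
            return true;
        }
        if (c == '(') {
            pos++;
            if (!expr()) {
                return false;
            }
            if (pos < s.length() && s.charAt(pos) == ')') {
                pos++;
                return true;
            }
        }
        return false;
    }

    private static boolean isOp(char c) {
        return c == '+' || c == '*';
    }

    public static void main(String[] args) {
        System.out.println(isValid("(1 + 2)*3"));
    }
}
```

Note that 12 is rejected: in this grammar a <digit> is a single character, so two digits in a row need an operator between them.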
Extended BNF (EBNF)
Extended BNF (EBNF) adds meta-symbols { } and [ ]
{ } Surround items to be repeated zero or more times.
[ ] Surround optional items.
Originally developed by Niklaus Wirth. Inventor of Pascal. Early user of syntax diagrams.
Extended BNF (EBNF)
Repetition (one or more):
BNF:
<digit sequence> ::= <digit> | <digit> <digit sequence>
EBNF:
<digit sequence> ::= <digit> { <digit> }
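One practical reason to prefer the EBNF form is that { } maps directly onto a loop, whereas the BNF form suggests a recursive call. A minimal Java sketch of that mapping follows; the class and method names are made up for illustration.

```java
// Hypothetical sketch: <digit sequence> ::= <digit> { <digit> }
final class DigitSequence {
    /**
     * Tries to match a digit sequence beginning at index start.
     * Returns the index just past the sequence, or -1 if no digit is found there.
     */
    static int match(String text, int start) {
        int pos = start;
        // The mandatory first <digit>.
        if (pos >= text.length() || !isDigit(text.charAt(pos))) {
            return -1;
        }
        pos++;
        // { <digit> } : zero or more further digits, written as a loop.
        while (pos < text.length() && isDigit(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        System.out.println(match("123abc", 0));
    }
}
```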
Extended BNF, cont’d
Optional items.
BNF:
<if statement> ::= IF <expression> THEN <statement>
                 | IF <expression> THEN <statement> ELSE <statement>
EBNF:
<if statement> ::= IF <expression> THEN <statement> [ ELSE <statement> ]
Extended BNF, cont’d
Optional items.
BNF:
<expression> ::= <simple expression>
               | <simple expression> <rel op> <simple expression>
EBNF:
<expression> ::= <simple expression> [ <rel op> <simple expression> ]
Compiler-Compilers
Professional compiler writers generally do not write scanners and parsers from scratch.
A compiler-compiler is a tool for writing compilers.
It can include:
  A scanner generator
  A parser generator
  Parse tree utilities
Compiler-Compilers, cont’d
Feed a compiler-compiler a grammar written in a textual form such as BNF or EBNF.
The compiler-compiler outputs code written in a high-level language that implements a scanner, parser, and a parse tree.
Popular Compiler-Compilers
Yacc
  “Yet another compiler-compiler”
  Generates a bottom-up parser written in C.
  GNU version: Bison
Lex
  Generates a scanner written in C.
  GNU version: Flex
JavaCC
  Generates a scanner and a top-down parser written in Java.
JJTree
  Preprocessor for JavaCC grammars.
  Generates Java code to build and process parse trees.
The code generated by a compiler-compiler can be in a high-level language such as C or Java. However, you may find the code to be ugly and hard to read.
JavaCC Compiler-Compiler
Feed JavaCC the grammar for a source language and it will automatically generate a scanner and a parser.
Specify the source language tokens with regular expressions.
  JavaCC generates a scanner for the source language.
Specify the source language syntax rules with Extended BNF.
  JavaCC generates a parser for the source language.
JavaCC Compiler-Compiler, cont’d
The generated scanner and parser are written in Java.
Note: JavaCC calls the scanner the “tokenizer”.
Download and Install
Download and install JavaCC and JJTree:
  http://javacc.java.net/
Download the examples from the book Generating Parsers with JavaCC:
  http://generatingparserswithjavacc.com/examples.html
The JavaCC Eclipse Plug-in
How to install the plug-in:
1. Start Eclipse.
2. Help > Install New Software ...
3. Click the Add button.
4. Name: SF Eclipse JavaCC
   Location: http://eclipse-javacc.sourceforge.net/
5. Click the OK button.
6. Select “SF JavaCC Eclipse Plug-in”.
Or, visit http://sourceforge.net/projects/eclipse-javacc/
JavaCC Regular Expressions
Literals
  <HELLO : "hello">
  (HELLO is the token name; "hello" is the token string.)
Character classes
  <SPACE_OR_COMMA : [" ", ","]>
Character ranges
  <LOWER_CASE : ["a"-"z"]>
Alternates
  <UPPER_OR_LOWER : ["A"-"Z"] | ["a"-"z"]>
JavaCC Regular Expressions
Negation
  <NO_DIGITS : ~["0"-"9"]>
Repetition
  <THREE_A : ("a"){3}>
  <TWO_TO_FOUR_A : ("a"){2,4}>
Quantifiers
  <ONE_OR_MORE_A : ("a")+>
  <ZERO_OR_ONE_SIGN : (["+", "-"])?>
  <ZERO_OR_MORE_DIGITS : (["0"-"9"])*>
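For readers who know java.util.regex better than JavaCC notation, the sketch below pairs each JavaCC regular expression from these two slides with a rough java.util.regex analogue. This is only a comparison aid: JavaCC does not use java.util.regex, and the class name RegexAnalogues and the choice of analogues are assumptions of this example.

```java
import java.util.regex.Pattern;

// Hypothetical side-by-side: JavaCC token definitions and approximate Java regex analogues.
final class RegexAnalogues {
    // <SPACE_OR_COMMA : [" ", ","]>
    static final Pattern SPACE_OR_COMMA = Pattern.compile("[ ,]");
    // <LOWER_CASE : ["a"-"z"]>
    static final Pattern LOWER_CASE = Pattern.compile("[a-z]");
    // <UPPER_OR_LOWER : ["A"-"Z"] | ["a"-"z"]>
    static final Pattern UPPER_OR_LOWER = Pattern.compile("[A-Z]|[a-z]");
    // <NO_DIGITS : ~["0"-"9"]>   (one character that is not a digit)
    static final Pattern NO_DIGITS = Pattern.compile("[^0-9]");
    // <THREE_A : ("a"){3}>
    static final Pattern THREE_A = Pattern.compile("a{3}");
    // <TWO_TO_FOUR_A : ("a"){2,4}>
    static final Pattern TWO_TO_FOUR_A = Pattern.compile("a{2,4}");
    // <ONE_OR_MORE_A : ("a")+>
    static final Pattern ONE_OR_MORE_A = Pattern.compile("a+");
    // <ZERO_OR_ONE_SIGN : (["+", "-"])?>
    static final Pattern ZERO_OR_ONE_SIGN = Pattern.compile("[+-]?");
    // <ZERO_OR_MORE_DIGITS : (["0"-"9"])*>
    static final Pattern ZERO_OR_MORE_DIGITS = Pattern.compile("[0-9]*");

    /** True if the whole string matches the pattern. */
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(TWO_TO_FOUR_A, "aaa"));
    }
}
```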
Example: helloworld1.jj
Recognize tokens “hello” and “world”.

options {
    BUILD_PARSER=false;
}

PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
PARSER_END(HelloWorld)

// Main Java method of the tokenizer
TOKEN_MGR_DECLS : {
    public static void main(String[] args) {
        java.io.StringReader sr = new java.io.StringReader(args[0]);
        SimpleCharStream scs = new SimpleCharStream(sr);
        HelloWorldTokenManager mgr = new HelloWorldTokenManager(scs);

        for (Token t = mgr.getNextToken(); t.kind != EOF;
             t = mgr.getNextToken()) {
            debugStream.println("Found token: " + t.image);
        }
    }
}

// Literals
TOKEN : {
    <HELLO : "hello">
  | <WORLD : "world">
}

Demo
Example: helloworld2.jj
Recognize tokens “hello” and “world”.
Ignore case.
Ignore spaces and commas between tokens.
Use lexical actions.

options {
    BUILD_PARSER=false;
    IGNORE_CASE=true;
}

PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
PARSER_END(HelloWorld)

TOKEN_MGR_DECLS : {
    public static void main(String[] args) {
        java.io.StringReader sr = new java.io.StringReader(args[0]);
        SimpleCharStream scs = new SimpleCharStream(sr);
        HelloWorldTokenManager mgr = new HelloWorldTokenManager(scs);

        while (mgr.getNextToken().kind != EOF) {}
    }
}

// Character class
SKIP : {
    <IGNORE : [" ", ","]>
}

// Lexical actions
TOKEN : {
    <HELLO : "hello">
        { debugStream.println("HELLO token: " + matchedToken.image); }
  | <WORLD : "world">
        { debugStream.println("WORLD token: " + matchedToken.image); }
}

Demo
Example: helloworld3.jj
Recognize words other than “hello” and “world” as identifiers.

options {
    BUILD_PARSER=false;
    IGNORE_CASE=true;
}

PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
PARSER_END(HelloWorld)

TOKEN_MGR_DECLS : {
    public static void main(String[] args) { ... }
}

SKIP : {
    <IGNORE : [" ", ","]>
}

TOKEN : {
    <HELLO : "hello">
        { debugStream.println("HELLO token: " + matchedToken.image); }
  | <WORLD : "world">
        { debugStream.println("WORLD token: " + matchedToken.image); }
  | <IDENTIFIER : ["a"-"z"](["a"-"z", "0"-"9", "_"])*>
        { debugStream.println("IDENTIFIER token: " + matchedToken.image); }
}

Demo
Why aren’t “hello” and “world” recognized as identifiers?
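One way to think about that question: the JavaCC token manager prefers the longest match, and when several token definitions match the same longest prefix, the one listed first in the grammar file wins. The Java sketch below imitates that selection rule with java.util.regex. It is a hypothetical simulation for illustration (class name and output format are made up), not the code JavaCC generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of "longest match, earliest rule breaks ties".
final class LongestMatchSketch {
    // LinkedHashMap keeps the rules in declaration order, as in the .jj file.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        int flags = Pattern.CASE_INSENSITIVE;   // stands in for IGNORE_CASE=true
        RULES.put("HELLO", Pattern.compile("hello", flags));
        RULES.put("WORLD", Pattern.compile("world", flags));
        RULES.put("IDENTIFIER", Pattern.compile("[a-z][a-z0-9_]*", flags));
    }
    private static final Pattern SKIP = Pattern.compile("[ ,]+");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Matcher skip = SKIP.matcher(input).region(pos, input.length());
            if (skip.lookingAt()) {             // drop spaces and commas
                pos = skip.end();
                continue;
            }
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer only: on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("Lexical error at position " + pos);
            }
            out.add(bestKind + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, hello2 world"));
    }
}
```

Here "Hello" matches both HELLO and IDENTIFIER with the same length, so the earlier HELLO rule is chosen, while "hello2" is longer as an IDENTIFIER and is reported as one.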
Example: helloworld4.jj
Use private tokens DIGIT and LETTER.

options {
    BUILD_PARSER=false;
    IGNORE_CASE=true;
}

PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
PARSER_END(HelloWorld)

TOKEN_MGR_DECLS : {
    public static void main(String[] args) { ... }
}

SKIP : {
    <IGNORE : [" ", ","]>
}

TOKEN : {
    <HELLO : "hello">
        { debugStream.println("HELLO token: " + matchedToken.image); }
  | <WORLD : "world">
        { debugStream.println("WORLD token: " + matchedToken.image); }
  | <IDENTIFIER : <LETTER> (<LETTER> | <DIGIT> | "_")*>
        { debugStream.println("IDENTIFIER token: " + matchedToken.image); }
    // Private tokens
  | <#DIGIT : ["0"-"9"]>
  | <#LETTER : ["a"-"z"]>
}

Demo
Example: helloworld5.jj
What’s the tokenizer doing? Turn on debugging output.

options {
    BUILD_PARSER=false;
    IGNORE_CASE=true;
    DEBUG_TOKEN_MANAGER=true;
}

PARSER_BEGIN(HelloWorld)
    public class HelloWorld {}
PARSER_END(HelloWorld)

TOKEN_MGR_DECLS : {
    public static void main(String[] args) { ... }
}

SKIP : {
    <IGNORE : [" ", ","]>
}

TOKEN : {
    <HELLO : "hello">
        { debugStream.println("HELLO token: " + matchedToken.image); }
  | <WORLD : "world">
        { debugStream.println("WORLD token: " + matchedToken.image); }
  | <IDENTIFIER : <LETTER> (<LETTER> | <DIGIT> | "_")*>
        { debugStream.println("IDENTIFIER token: " + matchedToken.image); }
  | <#DIGIT : ["0"-"9"]>
  | <#LETTER : ["a"-"z"]>
}

Demo
Review: DFA for a Pascal Identifier or Number
[Figure: state diagram of the DFA, states 0 through 12, with transitions labeled letter, digit, +, -, ., E, and [other]]
private static final int matrix[][] = {
    /*          letter digit    +     -     .     E  other */
    /*  0 */ {     1,    4,    3,    3,  ERR,    1,  ERR },
    /*  1 */ {     1,    1,   -2,   -2,   -2,    1,   -2 },
    /*  2 */ {   ERR,  ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  3 */ {   ERR,    4,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  4 */ {    -5,    4,   -5,   -5,    6,    9,   -5 },
    /*  5 */ {   ERR,  ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  6 */ {   ERR,    7,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  7 */ {    -8,    7,   -8,   -8,   -8,    9,   -8 },
    /*  8 */ {   ERR,  ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
    /*  9 */ {   ERR,   11,   10,   10,  ERR,  ERR,  ERR },
    /* 10 */ {   ERR,   11,  ERR,  ERR,  ERR,  ERR,  ERR },
    /* 11 */ {   -12,   11,  -12,  -12,  -12,  -12,  -12 },
    /* 12 */ {   ERR,  ERR,  ERR,  ERR,  ERR,  ERR,  ERR },
};
Negative numbers in the matrix are accepting states.
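To show how such a matrix can drive a scanner, here is a minimal Java sketch of the table-driven loop. The transition values are copied from the slide; everything else is an assumption made for the example: the class name MatrixScanner, the value chosen for ERR, treating both E and e as the E column, and treating end of input as [other]. The reading of the accepting states (2 = identifier, 5 = integer, 8 = real without exponent, 12 = real with exponent) is inferred from the matrix, not stated on the slide.

```java
// Hypothetical table-driven scanner loop built around the slide's transition matrix.
final class MatrixScanner {
    private static final int ERR = -99999;   // assumed sentinel; the slide does not give its value

    // Column indexes: letter, digit, +, -, ., E, other
    private static final int LETTER = 0, DIGIT = 1, PLUS = 2, MINUS = 3,
                             DOT = 4, E = 5, OTHER = 6;

    private static final int[][] MATRIX = {
        /*  0 */ {   1,   4,   3,   3, ERR,   1, ERR },
        /*  1 */ {   1,   1,  -2,  -2,  -2,   1,  -2 },
        /*  2 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
        /*  3 */ { ERR,   4, ERR, ERR, ERR, ERR, ERR },
        /*  4 */ {  -5,   4,  -5,  -5,   6,   9,  -5 },
        /*  5 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
        /*  6 */ { ERR,   7, ERR, ERR, ERR, ERR, ERR },
        /*  7 */ {  -8,   7,  -8,  -8,  -8,   9,  -8 },
        /*  8 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
        /*  9 */ { ERR,  11,  10,  10, ERR, ERR, ERR },
        /* 10 */ { ERR,  11, ERR, ERR, ERR, ERR, ERR },
        /* 11 */ { -12,  11, -12, -12, -12, -12, -12 },
        /* 12 */ { ERR, ERR, ERR, ERR, ERR, ERR, ERR },
    };

    private static int column(char ch) {
        if (ch == 'E' || ch == 'e') return E;            // assumption: either case
        if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')) return LETTER;
        if (ch >= '0' && ch <= '9') return DIGIT;
        if (ch == '+') return PLUS;
        if (ch == '-') return MINUS;
        if (ch == '.') return DOT;
        return OTHER;
    }

    /**
     * Scans the first token of text. Returns the accepting state number
     * (2, 5, 8, or 12), or -1 if the matrix signals an error.
     */
    static int scan(String text) {
        int state = 0;
        int i = 0;
        while (state >= 0) {                 // negative entry: accept or error
            // Assumption: running off the end of the input counts as [other].
            int col = i < text.length() ? column(text.charAt(i)) : OTHER;
            i++;
            state = MATRIX[state][col];
        }
        return state == ERR ? -1 : -state;
    }

    public static void main(String[] args) {
        System.out.println(scan("3.14"));
    }
}
```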
The JavaCC Tokenizer
The JavaCC tokenizer (scanner) is a DFA created from the regular expressions.
The generated Java code implements state-transitions with switch statements instead of a matrix.
The JavaCC Tokenizer, cont’d
Example (from HelloWorldTokenManager.java):

static private int jjMoveStringLiteralDfa0_0()
{
    switch(curChar)
    {
        case 72:
        case 104:
            return jjMoveStringLiteralDfa1_0(0x4L);
        case 87:
        case 119:
            return jjMoveStringLiteralDfa1_0(0x8L);
        default :
            return jjMoveNfa_0(0, 0);
    }
}
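In the generated method above, 72 and 104 are the character codes for H and h, and 87 and 119 are the codes for W and w, which is how IGNORE_CASE shows up in the switch. For comparison with the matrix approach, the following hand-written Java sketch encodes a much smaller DFA with switch statements. It is a hypothetical illustration (class name, states, and token names are made up) and is far simpler than what JavaCC emits.

```java
// Hypothetical hand-written DFA that uses a switch on the state instead of a matrix.
final class SwitchDfaSketch {
    /** Classifies the whole string as IDENTIFIER, INTEGER, or ERROR. */
    static String classify(String text) {
        int state = 0;   // 0 = start, 1 = in identifier, 2 = in integer, -1 = error
        for (int i = 0; i < text.length() && state != -1; i++) {
            char c = text.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case 0:  // first character decides which token we are in
                    state = letter ? 1 : (digit ? 2 : -1);
                    break;
                case 1:  // identifier: letters or digits may follow
                    state = (letter || digit) ? 1 : -1;
                    break;
                case 2:  // integer: only digits may follow
                    state = digit ? 2 : -1;
                    break;
                default:
                    state = -1;
            }
        }
        switch (state) {
            case 1:  return "IDENTIFIER";
            case 2:  return "INTEGER";
            default: return "ERROR";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("abc1"));
    }
}
```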
Assignment #5
Use JavaCC to create a simple Java tokenizer (i.e., scanner).
Write the .jj file. Due Friday, October 31.
Compiler Team Project
Write a compiler for a procedure-oriented source language that will generate code for the Java virtual machine (JVM).
The source language should be a procedural, non-object-oriented language or subset thereof, or a language that the team invents.
Tip: Start with a simple language! No Scheme, Lisp, or Lisp-like languages.
Compiler Team Project
The object language must be Jasmin, the assembly language for the Java virtual machine.
You will be provided an assembler that translates Jasmin assembly language programs into .class files for the JVM.
You must use the JavaCC compiler-compiler.
You can also use any Java code from the Writing Compilers and Interpreters book.
Compile and run source programs written in your language.
Compiler Team Project Deliverables
Java and JavaCC source files of a working compiler.
The JavaCC .jj (or .jjt) grammar files. Do not include the Java files that JavaCC generated.
Compiler Team Project Deliverables, cont’d
Written report (5-10 pp. single spaced)
Include: Syntax diagrams for key source language constructs.
Include: Code diagrams for key source language constructs.
Optional: UML diagrams for key compiler components.
Compiler Team Project Deliverables, cont’d
Instructions on how to build your compiler, if the build is not standard or not obvious.
Instructions on how to run your compiler (scripts OK).
Sample source programs to compile and execute.
Sample output from executing your source programs.
Post Mortem Report
Private individual post mortem report (up to 1 page from each student)
What did you learn from this course?
An assessment of your accomplishments for your project.
An assessment of each of your project team members.