Foundations of Software Design

Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs
Marti Hearst, Fall 2002

Page 1: Foundations of Software Design

Foundations of Software Design

Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs
Marti Hearst, Fall 2002

Page 2: Foundations of Software Design

How Do Computers Work (Revisited)?

• Bits & Bytes
• Binary Numbers
• Number Systems
• Orders of Magnitude
• Gates
• Boolean Logic
• Circuits
• CPU
• Machine Instructions
• Assembly Language
• Programming Languages
• Address Space
• Code vs. Data
• Compiler

Page 3: Foundations of Software Design

Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

The Compiler

• What is a compiler?
  – A recognizer (of some source language L).
  – A translator (of programs written in L into programs written in some object or target language L').
• A compiler is itself a program, written in some host language.
• It operates in phases.

Diagram labels: Machine Instructions, Assembly Language, Programming Languages, Compiler

Page 4: Foundations of Software Design

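Converting Java to Byte Code

• When you compile a Java program, javac produces byte codes (stored in the class file).
• javac does not convert the byte codes to machine code.
• Instead, the Java Virtual Machine (VM) executes them when you run the program called java. It interprets the byte codes and may compile frequently executed code to machine code at run time (JIT compilation).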

Page 5: Foundations of Software Design

Two translation paths (diagram):

• C code is translated by the C compiler (gcc or cc) into Assembly Language and then Machine Code.
• Java code is translated by the Java compiler (javac or JIT) into Byte code (class file), which runs in the Java Virtual Machine.
• The JVM is created once; each individual program is loaded and run in the JVM.

Page 6: Foundations of Software Design

Compiler Compilers

• Which came first: the compiler or the program?
  – The very first one has to be written in assembly language!
  – This is why most programming languages today start with the C code generator.
• After you have created the first compiler for a given language, say Java, then you …
• Use that compiler to compile itself!!

Page 7: Foundations of Software Design

Compiling Your Compiler

1. Write the first Java compiler using C (javac in C). Compile it using gcc.
2. Write the second Java compiler using Java (javac in Java). Compile it using javac.
3. Write other Java programs. Compile them using javac.

Page 8: Foundations of Software Design

Compiler in more detail

1. Lexical analyzer (scanner)
2. Syntax analyzer (parser)
3. Semantic analyzer
4. Intermediate Code Generator
5. Optimizer
6. Code Generator

Page 9: Foundations of Software Design

The Scanner

• Task:
  – Translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes).
• How it’s done:
  – Specify lexemes using Regular Expressions
  – Convert these Regular Expressions into Finite Automata

Page 10: Foundations of Software Design

Lexemes and Tokens

Here are some Java lexemes and the corresponding tokens:

Lexeme   ;            =        index   tmp     37        102
Token    SEMI-COLON   ASSIGN   IDENT   IDENT   INT-LIT   INT-LIT

Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers).

Given the source code:

    position = initial + rate * 60 ;

a Java scanner would return the following sequence of tokens:

    IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON
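The grouping of characters into lexemes can be sketched by hand for this one statement. The class name TinyScanner and the method tokenize are hypothetical names chosen for illustration, and the sketch covers only the token kinds that appear on this slide (it does not separate keywords from identifiers).

```java
import java.util.ArrayList;
import java.util.List;

class TinyScanner {
    // Returns the token names for the lexemes found in src, in order.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates lexemes but produces no token
            } else if (Character.isJavaIdentifierStart(c)) {
                // Consume the whole identifier lexeme.
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                tokens.add("IDENT");
            } else if (Character.isDigit(c)) {
                // Consume a run of digits as one integer literal.
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("INT-LIT");
            } else {
                switch (c) {
                    case '=': tokens.add("ASSIGN"); break;
                    case '+': tokens.add("PLUS"); break;
                    case '*': tokens.add("TIMES"); break;
                    case ';': tokens.add("SEMI-COLON"); break;
                    default:
                        // A character that starts no lexeme is a lexical error.
                        throw new IllegalArgumentException("lexical error: unexpected character '" + c + "'");
                }
                i++;
            }
        }
        return tokens;
    }
}
```

Calling TinyScanner.tokenize("position = initial + rate * 60 ;") yields the eight token names listed above, and the lexemes index and tmp both map to IDENT.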

Page 11: Foundations of Software Design

The Scanner

• Also called the Lexer
• How it works:
  – Reads characters from the source program.
  – Groups the characters into lexemes (sequences of characters that "go together").
  – Each lexeme corresponds to a token; the scanner returns the next token (plus maybe some additional information) to the parser.
  – The scanner may also discover lexical errors (e.g., erroneous characters).
• The definitions of what is a lexeme, token, or bad character all depend on the source language.

Page 12: Foundations of Software Design

Two kinds of Automata

Deterministic (DFA):
  – No state has more than one outgoing edge with the same label.

Non-Deterministic (NFA):
  – States may have more than one outgoing edge with the same label.
  – Edges may be labeled with ε (epsilon), the empty string.
  – The automaton can take an epsilon transition without looking at the current input character.

Page 13: Foundations of Software Design

Regular Expressions to Finite Automata

• Generating a scanner:

Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA
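As a rough illustration of the last stage, here is a hand-built table-driven DFA. It assumes a small language chosen for this sketch: either the single character 0, or a digit 1-9 followed by any number of digits. The class name DecimalDfa and its state numbering are hypothetical; a scanner generator would produce a much larger table automatically.

```java
class DecimalDfa {
    static final int DEAD = -1; // no transition: the input is rejected

    // NEXT[state][characterClass]; classes are 0 = '0', 1 = '1'..'9', 2 = anything else.
    static final int[][] NEXT = {
        {1, 2, DEAD},       // state 0: start
        {DEAD, DEAD, DEAD}, // state 1: saw a lone "0"
        {2, 2, DEAD},       // state 2: saw a nonzero digit, then any digits
    };
    static final boolean[] ACCEPTING = {false, true, true};

    static int charClass(char c) {
        if (c == '0') return 0;
        if (c >= '1' && c <= '9') return 1;
        return 2;
    }

    // Deterministic: each (state, class) pair has exactly one table entry.
    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length() && state != DEAD; i++) {
            state = NEXT[state][charClass(input.charAt(i))];
        }
        return state != DEAD && ACCEPTING[state];
    }
}
```

The loop never backtracks and reads each character once, which is why scanners are usually run from a DFA table rather than directly from an NFA.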

Page 14: Foundations of Software Design

BNF

• Backus-Naur form, Backus-Normal form
  – A set of rules (or productions)
  – Each of which expresses the ways symbols of the language can be grouped together
• Non-terminals are written upper-case
• Terminals are written lower-case
• The start symbol is the left-hand side of the first production
• The rules for a context-free grammar (CFG) are often referred to as its BNF

Page 15: Foundations of Software Design

Java Identifier Definition

Described in the Java specification:
  – http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591
  – “An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
  – An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).”
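The two rules quoted above can be sketched as a small check. The class name IdentifierCheck is hypothetical, and RESERVED holds only a handful of reserved words for illustration, not the full list from the specification. The standard methods Character.isJavaIdentifierStart and Character.isJavaIdentifierPart supply the "Java letter" and "Java letter or digit" tests.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class IdentifierCheck {
    // Illustrative subset only: a few keywords plus the Boolean and null literals.
    static final Set<String> RESERVED = new HashSet<String>(
        Arrays.asList("class", "int", "while", "for", "true", "false", "null"));

    static boolean isIdentifier(String s) {
        // Rule 2: must not be spelled like a keyword, Boolean literal, or null.
        if (s.isEmpty() || RESERVED.contains(s)) return false;
        // Rule 1: the first character must be a Java letter ...
        if (!Character.isJavaIdentifierStart(s.charAt(0))) return false;
        // ... and the rest must be Java letters or Java digits.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }
}
```

For example, isIdentifier("tmp") is true, while isIdentifier("37abc") and isIdentifier("while") are both false.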

Page 16: Foundations of Software Design


Java Identifier Definition

Page 17: Foundations of Software Design

Java Integer Literals

• An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8)
• Examples: 0  2  0372  0xDadaCafe  1996  0x00FF00FF

(opt means optional)

Page 18: Foundations of Software Design

Defining Java Decimal Numerals

A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer.
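The decimal rule translates almost word for word into a regular expression. The sketch below adds hexadecimal and octal patterns so that the example literals from the integer-literal slide can be classified. The class name IntLiteral is hypothetical, and the patterns deliberately ignore the optional l or L type suffix.

```java
import java.util.regex.Pattern;

class IntLiteral {
    // "0", or a digit 1-9 optionally followed by more digits.
    static final Pattern DECIMAL = Pattern.compile("0|[1-9][0-9]*");
    // "0x" or "0X" followed by one or more hexadecimal digits.
    static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");
    // "0" followed by one or more octal digits.
    static final Pattern OCTAL = Pattern.compile("0[0-7]+");

    static String classify(String s) {
        if (DECIMAL.matcher(s).matches()) return "decimal";
        if (HEX.matcher(s).matches()) return "hexadecimal";
        if (OCTAL.matcher(s).matches()) return "octal";
        return "not an integer literal";
    }
}
```

With these patterns, 1996 is classified as decimal, 0372 as octal, and 0xDadaCafe as hexadecimal, while 09 matches none of them.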

Page 19: Foundations of Software Design

Defining Floating-Point Literals

A floating-point literal has the following parts: a whole-number part, a decimal point (represented by an ASCII period character), a fractional part, an exponent, and a type suffix. The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer.

Page 20: Foundations of Software Design


From the Lucene HTML Scanner

Page 21: Foundations of Software Design

The Functionality of the Parser

• Input: sequence of tokens from lexical analysis
• Output: parse tree of the program
  – a parse tree is generated if the input is a legal program
  – if the input is an illegal program, syntax errors are issued
• Note:
  – Instead of a parse tree, some parsers directly produce:
    • abstract syntax tree (AST) + symbol table, or
    • intermediate code, or
    • object code

Page 22: Foundations of Software Design

Parser vs. Scanner

Phase     Input                  Output
Scanner   String of characters   String of tokens
Parser    String of tokens       Parse tree

Page 23: Foundations of Software Design

The Parser

• Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
• Finds syntax errors.
  – Example: position = * 5 ;
  – corresponds to the sequence of tokens: IDENT ASSIGN TIMES INT-LIT SEMI-COLON
  – All are legal tokens, but that sequence of tokens is erroneous.
• Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
• Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

Page 24: Foundations of Software Design

What must the parser do?

1. Recognizer: not all strings of tokens are programs
   – must distinguish between valid and invalid strings of tokens
2. Translator: must expose program structure
   • e.g., associativity and precedence
   • must return the parse tree

We need:
  – A language for describing valid strings of tokens
    • context-free grammars
    • (analogous to regular expressions in the scanner)
  – A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
    • the parser
    • (analogous to the state machine in the scanner)

Page 25: Foundations of Software Design

Parser Example

position = initial + rate * 60 ;

        =
       / \
position   +
          / \
    initial   *
             / \
         rate   60
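One way to obtain this tree is a recursive-descent parser. The sketch below assumes a small grammar written for this example (stmt → IDENT = expr ; and expr → term { + term } and term → factor { * factor } and factor → IDENT | INT-LIT). The class name TinyParser is hypothetical, lexemes are assumed to be separated by spaces, and the tree is returned in a parenthesized prefix form instead of as node objects.

```java
class TinyParser {
    private final String[] toks;
    private int pos = 0;

    private TinyParser(String source) { toks = source.trim().split("\\s+"); }

    // Parses one assignment statement and returns its tree, e.g. "(= a (+ b c))".
    static String parse(String source) {
        TinyParser p = new TinyParser(source);
        String tree = p.stmt();
        if (p.pos != p.toks.length) throw new IllegalArgumentException("unexpected input after ;");
        return tree;
    }

    private String peek() { return pos < toks.length ? toks[pos] : "<end>"; }

    private void expect(String lexeme) {
        if (!peek().equals(lexeme)) {
            throw new IllegalArgumentException("syntax error: expected " + lexeme + " but found " + peek());
        }
        pos++;
    }

    private String stmt() {
        String target = ident();
        expect("=");
        String value = expr();
        expect(";");
        return "(= " + target + " " + value + ")";
    }

    // '+' binds more loosely than '*', so expr is built out of terms.
    private String expr() {
        String left = term();
        while (peek().equals("+")) { pos++; left = "(+ " + left + " " + term() + ")"; }
        return left;
    }

    private String term() {
        String left = factor();
        while (peek().equals("*")) { pos++; left = "(* " + left + " " + factor() + ")"; }
        return left;
    }

    private String factor() {
        String t = peek();
        if (t.matches("[0-9]+")) { pos++; return t; }
        return ident();
    }

    private String ident() {
        String t = peek();
        if (!t.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("syntax error at " + t);
        }
        pos++;
        return t;
    }
}
```

TinyParser.parse("position = initial + rate * 60 ;") returns (= position (+ initial (* rate 60))), the same shape as the tree above, because the grammar gives * higher precedence than +. The erroneous input position = * 5 ; is rejected with an exception even though every lexeme in it is a legal token.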

Page 26: Foundations of Software Design

The Semantic Analyzer

• The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors.
• Annotates and/or changes the abstract syntax tree
  – (e.g., it might annotate each node that represents an expression with its type).
  – Example with before and after:

Before:

    =
      position
      +
        initial
        *
          rate
          60

After:

    = (float)
      position (float)
      + (float)
        initial (float)
        * (float)
          rate (float)
          int-to-float() (float)
            60 (int)
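The before-and-after example can be reproduced with a short tree walk. This is a sketch under stated assumptions: the class TypeAnnotator, its Node type, and the symbol table contents (position, initial, and rate all declared float) are hypothetical, and only the two types int and float are handled.

```java
import java.util.HashMap;
import java.util.Map;

class TypeAnnotator {
    // AST node: leaves have no children; an int-to-float node has only a left child.
    static class Node {
        final String label;
        Node left, right;
        String type; // filled in by annotate()
        Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
    }

    static Node leaf(String label) { return new Node(label, null, null); }

    // Annotates every node with a type, inserting int-to-float where operand types differ.
    static void annotate(Node n, Map<String, String> symbols) {
        if (n.left == null) { // leaf: integer literal or variable
            n.type = n.label.matches("[0-9]+") ? "int" : symbols.get(n.label);
            if (n.type == null) throw new IllegalStateException("undeclared variable: " + n.label);
            return;
        }
        annotate(n.left, symbols);
        annotate(n.right, symbols);
        if (n.label.equals("=") && n.left.type.equals("int") && n.right.type.equals("float")) {
            throw new IllegalStateException("type error: cannot assign float to int");
        }
        n.left = widen(n.left, n.right.type);
        n.right = widen(n.right, n.left.type);
        n.type = n.left.type;
    }

    // Wraps an int operand in a conversion node when the other operand is float.
    static Node widen(Node n, String otherType) {
        if (n.type.equals("int") && otherType.equals("float")) {
            Node conv = new Node("int-to-float", n, null);
            conv.type = "float";
            return conv;
        }
        return n;
    }

    static String show(Node n) {
        if (n.left == null) return n.label + ":" + n.type;
        String s = "(" + n.label + ":" + n.type + " " + show(n.left);
        if (n.right != null) s += " " + show(n.right);
        return s + ")";
    }

    // Builds the tree for "position = initial + rate * 60" and annotates it.
    static String example() {
        Map<String, String> symbols = new HashMap<String, String>();
        symbols.put("position", "float");
        symbols.put("initial", "float");
        symbols.put("rate", "float");
        Node tree = new Node("=", leaf("position"),
            new Node("+", leaf("initial"), new Node("*", leaf("rate"), leaf("60"))));
        annotate(tree, symbols);
        return show(tree);
    }
}
```

In the string returned by example(), every node carries float except the literal 60, which stays int beneath a new int-to-float node, matching the "after" tree.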

Page 27: Foundations of Software Design

Intermediate Code Generator

The intermediate code generator translates from abstract-syntax tree to intermediate code.
  – One possibility is 3-address code.
  – Here's an example of 3-address code for the abstract-syntax tree shown above:

    temp1 = int-to-float(60)
    temp2 = rate * temp1
    temp3 = initial + temp2
    position = temp3
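A post-order walk of the tree is enough to produce this kind of code: generate code for the children first, then emit one instruction that stores the node's result in a fresh temporary. The class ThreeAddress and its Node type below are hypothetical names for a minimal sketch, not the generator of any particular compiler.

```java
import java.util.ArrayList;
import java.util.List;

class ThreeAddress {
    // AST node: leaves have no children; a conversion node has only a left child.
    static class Node {
        final String label;
        final Node left, right;
        Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
    }

    static Node leaf(String label) { return new Node(label, null, null); }

    private final List<String> code = new ArrayList<String>();
    private int temps = 0;

    // Emits code for n and returns the name that holds its value.
    private String gen(Node n) {
        if (n.left == null) return n.label; // variable or constant: no code needed
        String l = gen(n.left);
        if (n.right == null) { // unary conversion such as int-to-float
            String t = "temp" + (++temps);
            code.add(t + " = " + n.label + "(" + l + ")");
            return t;
        }
        String r = gen(n.right);
        String t = "temp" + (++temps);
        code.add(t + " = " + l + " " + n.label + " " + r);
        return t;
    }

    // Translates an assignment tree whose root is "=" with a variable on the left.
    static List<String> translate(Node assign) {
        ThreeAddress g = new ThreeAddress();
        String value = g.gen(assign.right);
        g.code.add(assign.left.label + " = " + value);
        return g.code;
    }

    // The annotated tree for "position = initial + rate * 60".
    static List<String> example() {
        Node sixty = new Node("int-to-float", leaf("60"), null);
        Node tree = new Node("=", leaf("position"),
            new Node("+", leaf("initial"), new Node("*", leaf("rate"), sixty)));
        return translate(tree);
    }
}
```

ThreeAddress.example() returns the same four instructions listed above, in the same order.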

Page 28: Foundations of Software Design

The Optimizer

• Examines the program and rewrites it in ways that preserve the meaning but are more efficient.
• Incredibly complex programs and algorithms
• Example:
  – Move the declaration of temp outside the loop so it isn’t re-declared every time the loop is executed
  – Change 2*5 to 10 since it is a constant (no need to do an expensive multiply at run time)
  – If we removed the line with temp, the program might even skip the loop altogether
    • You can see in advance that count ends up = 30

    int count = 0;
    for (int j=0; j < 2*5; j++) {
        int temp = j + 1;
        count += 3;
    }
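The rewrites described on this slide can be applied by hand to see that the meaning is preserved. The class and method names below are hypothetical, and each method shows one stage that an optimizer might reach. Real optimizers work on intermediate code, not on source text.

```java
class LoopOptimization {
    // The loop exactly as written on the slide.
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1; // never used afterwards
            count += 3;
        }
        return count;
    }

    // Constant 2*5 folded to 10, and the unused temp removed.
    static int afterFoldingAndCleanup() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    // The loop adds 3 exactly 10 times, so the whole loop can be replaced by its result.
    static int afterLoopElimination() {
        return 30;
    }
}
```

All three methods return 30, which is the sense in which the rewritten program preserves the meaning of the original.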

Page 29: Foundations of Software Design

The Code Generator

• The code generator generates object code from (optimized) intermediate code.

    LOADF rate,R1
    MULF #60.0,R1
    LOADF initial,R2
    ADDF R2,R1
    STOREF R1,position
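To check that these five instructions compute position = initial + rate * 60, they can be run on a toy simulator. The slide does not define the instruction set, so the semantics here are assumptions: each instruction has the form OP source,destination, a leading # marks an immediate constant, and MULF and ADDF combine the source into the destination register. The class TinyMachine is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TinyMachine {
    // The object code from the slide.
    static final String[] PROGRAM = {
        "LOADF rate,R1",
        "MULF #60.0,R1",
        "LOADF initial,R2",
        "ADDF R2,R1",
        "STOREF R1,position",
    };

    // Runs the program against the given variable memory and returns that memory.
    static Map<String, Double> run(String[] program, Map<String, Double> memory) {
        Map<String, Double> regs = new HashMap<String, Double>();
        for (String line : program) {
            String[] parts = line.trim().split("\\s+");
            String[] operands = parts[1].split(",");
            String src = operands[0], dst = operands[1];
            switch (parts[0]) {
                case "LOADF":  regs.put(dst, memory.get(src)); break;                    // memory -> register
                case "MULF":   regs.put(dst, regs.get(dst) * value(src, regs)); break;   // dst = dst * src
                case "ADDF":   regs.put(dst, regs.get(dst) + value(src, regs)); break;   // dst = dst + src
                case "STOREF": memory.put(dst, regs.get(src)); break;                    // register -> memory
                default: throw new IllegalArgumentException("unknown instruction: " + parts[0]);
            }
        }
        return memory;
    }

    // An operand is either an immediate constant (#60.0) or a register name.
    static double value(String operand, Map<String, Double> regs) {
        return operand.startsWith("#") ? Double.parseDouble(operand.substring(1)) : regs.get(operand);
    }
}
```

With rate set to 2.0 and initial set to 10.0 in memory, running PROGRAM stores 130.0 in position.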

Page 30: Foundations of Software Design

Tools

• Scanner Generator
  – Used to create a scanner automatically
  – Input:
    • a regular expression for each token to be recognized
  – Output:
    • a finite state machine
  – Examples:
    • lex or flex (produce C code), or JLex (produces Java)
• Compiler Compilers
  – yacc (produces C) or JavaCC (produces Java, also has a scanner generator).

Page 31: Foundations of Software Design


From the Lucene HTML Parser

Page 32: Foundations of Software Design


From the Lucene HTML Parser

Page 33: Foundations of Software Design


Graphs / Networks

Page 34: Foundations of Software Design

Slide adapted from Goodrich & Tamassia

What is a Graph?

Pages 35-46: Foundations of Software Design

Slides adapted from Goodrich & Tamassia

Page 47: Foundations of Software Design

Next Time

• Graph Traversal
• Directed Graphs (digraphs)
• DAGs
• Weighted Graphs