Language Translation Issues Lecture 5: Dolores Zage

Upload: radwan

Post on 21-Jan-2016


TRANSCRIPT

Page 1: Language Translation Issues

Language Translation Issues

Lecture 5:

Dolores Zage

Page 2: Language Translation Issues

Programming Language Syntax

Syntax: the arrangement of words as elements in a sentence to show their relationship.

In C, X = Y + Z represents a valid sequence of symbols; XY +- does not.

Syntax provides significant information for understanding a program and for translating it into an object program.

Rules: 2 + 3 x 4 is 14, not 20, which would be (2 + 3) x 4. The interpretation is specified by the syntax, and the syntax guides the translator.

Page 3: Language Translation Issues

General Syntactic Criteria

Provide a common notation between the programmer and the programming language processor.

The choice is constrained only slightly by the necessity to communicate particular items of information.

For example, the fact that a variable is to be represented as a real can be conveyed by an explicit declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN.

General criteria: easy to read, easy to write, easy to translate, and unambiguous.

Page 4: Language Translation Issues

Readability

The algorithm is apparent from inspection of the text (self-documenting).

Natural statement formats.

Liberal use of key words and noise words.

Provision for embedded comments.

Unrestricted-length identifiers.

Mnemonic operator symbols.

COBOL's design emphasizes readability, often at the expense of ease of writing and translation.

Page 5: Language Translation Issues

Writeability

Enhanced by concise and regular structures. (Notice the contrast with readability, which favors verbose, different-looking constructs that help us distinguish program features.)

FORTRAN: implicit naming does not help us catch misspellings. For example, indx and index are both valid integer variables, even though the programmer wanted indx to be index.

Redundancy can be good: it makes a program easier to read and allows for error checking.

Page 6: Language Translation Issues

Ease of Translation

The key to easy translation is regularity of structure.

LISP can be translated with a few short, easy rules, but it is a bear to read.

COBOL has a large number of syntactic constructs, which makes it hard to translate.

Page 7: Language Translation Issues

Lack of ambiguity

Central problem in every language design!

An ambiguous construction allows two or more different interpretations.

These usually arise not in the structure of individual program elements but in the interplay between structures.

The dangling else is a classic example:

Page 8: Language Translation Issues

If then else

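if (boolean) then if (boolean) then statement1 else statement2

[Figure: two flowcharts over the conditions B1 and B2 and the statements S1 and S2, one for each way of pairing the else with a then.]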

Page 9: Language Translation Issues

Resolve dangling else

ALGOL: include a begin … end delimiter around the embedded conditional.

Ada: use the delimiter end if.

C and Pascal: the final else is paired with the nearest unmatched then.
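
The nearest-then rule can be sketched as a tiny recursive parser. This is only an illustration: the token list, the tuple-shaped tree, and the name `parse_statement` are assumptions made for the example, not part of any real compiler.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement; return (tree, position of the next token)."""
    if tokens[pos] == "if":
        condition = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        then_part, pos = parse_statement(tokens, pos + 3)
        else_part = None
        # Greedy rule: an 'else' seen here belongs to this, the nearest, 'then'.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_statement(tokens, pos + 1)
        return ("if", condition, then_part, else_part), pos
    # Anything else is treated as a simple statement.
    return tokens[pos], pos + 1


tree, _ = parse_statement("if B1 then if B2 then S1 else S2".split())
print(tree)
```

Because the inner call consumes the else before the outer call gets a chance to look for one, S2 ends up attached to the test on B2, and the outer conditional has no else part.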

Page 11: Language Translation Issues

Character set

ASCII provides 26 letters; other languages have hundreds of letters.

Identifiers, key words, and reserved words.

Blanks may be not significant except in literal character-string data (FORTRAN), or they may be used as separators.

Delimiters: begin … end, { }

Page 12: Language Translation Issues

Other elements

Identifiers, operators, key words, reserved words.

Free vs. fixed format: in free format, statements may be written anywhere on a line; in fixed format (FORTRAN), the first five characters are reserved for labels.

Statements: simple (no embedding) or structured, also called nested (embedded).

Page 13: Language Translation Issues

Overall Program-Subprogram Structure

Separate subprogram definitions (common blocks in FORTRAN).

Separate data definitions (class mechanism).

Nested subprogram definitions (Pascal: nesting one subprogram inside another).

Separate interface definitions: the package interface in Ada; in C you can do this with an include file.

Data descriptions separated from executable statements (COBOL data and environment divisions).

Unseparated subprogram definitions: no organization, as in early BASIC and SNOBOL.

Page 14: Language Translation Issues

Stages in Translation

The process of translating a program from its original syntax into executable form is central in every programming language implementation.

Translation can be quite simple, as in LISP and Prolog, but more often it is quite complex.

Most languages could be implemented with only trivial translation if you wrote a software interpreter and were willing to accept slow execution speeds.

Page 15: Language Translation Issues

Stages in Translation

The syntactic recognition parts of compiler theory are fairly standard.

Analysis of the source program: the structure of the program must be laboriously built up character by character during translation.

Synthesis of the object program: construction of the executable program from the output of the semantic analysis.

Page 16: Language Translation Issues

Structure of a Compiler

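[Figure: compiler structure. A symbol table and other tables are shared by all phases.]

Source program recognition phases:

source program -> Lexical analysis -> lexical tokens

lexical tokens -> Syntactic analysis -> parse tree

parse tree -> Semantic analysis -> intermediate code

Object code generation phases:

intermediate code -> Optimization -> optimized intermediate code

optimized intermediate code -> Code generation -> object code

object code + object code from other compilations -> Linking -> executable code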

Page 17: Language Translation Issues

Analysis of the Source Program

Lexical analysis (tokenizing).

Parsing (syntactic analysis).

Semantic analysis:

symbol-table maintenance

insertion of implicit information (default settings)

macro processing and compile-time operations (#ifdefs)
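
As a rough illustration of the first analysis step, here is a minimal tokenizer sketch built on Python's `re` module. The token classes (`NUMBER`, `IDENT`, `OP`) and the function name `tokenize` are assumptions chosen for the example, not a description of any particular compiler.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Split source text into (kind, text) pairs, dropping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens


print(tokenize("X = Y + Z"))
print(tokenize("XY +-"))
```

Both inputs tokenize without complaint. Rejecting XY +- is the job of the later parsing phase, which checks how the tokens are arranged.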

Page 18: Language Translation Issues

Synthesis of the Object Program

Optimization.

Code generation: the internal representation must be formed into assembly language statements, machine code, or some other object form.

Linking and loading: resolving references to external data or other subprograms.

Page 19: Language Translation Issues

Translator Groupings

Crudely grouped by the number of passes they make over the source code.

Standard: two passes. The first decomposes the program into components and collects information such as variable name usage; the second generates an object program from the collected information.

One pass: fast compilation. Pascal was designed so that it could be compiled in one pass.

Three or more passes: used if execution speed is paramount.

Page 21: Language Translation Issues

Formal Translation Models

Based on the context-free theory of languages.

The formal definition of the syntax of a programming language is called a grammar.

A grammar consists of a set of rules (productions) that specify the sequences of characters (lexical items) that form allowable programs in the language being defined.

Page 22: Language Translation Issues

Chomsky Hierarchy

Language syntax was one of the earliest formal models to be applied to programming language design.

In 1959 Chomsky outlined a model of grammars.

Page 23: Language Translation Issues

Classes of grammar and abstract machines

Chomsky Level    Grammar Class        Machine Class
0                Unrestricted         Turing machine
1                Context sensitive    Linear-bounded automaton
2                Context free         Pushdown automaton
3                Regular              Finite-state automaton

Type 2 grammars are our BNF grammars. Types 2 and 3 are what we use in programming languages.

A type n language is one that is generated by a type n grammar, where there is no grammar of type n + 1 that also generates it. Every grammar of type n is, by definition, also a grammar of type n - 1.

Page 24: Language Translation Issues

Grammar

To Chomsky, a grammar is a 4-tuple (V, T, P, Z), where

V is an alphabet

T, a subset of V, is an alphabet of terminal symbols

P is a finite set of rewriting rules

Z, the distinguished symbol, is a member of V - T

The language of a grammar is the set of terminal strings that can be derived from Z.

The difference among the four types is in the form of the rewriting rules allowed in P.

Page 25: Language Translation Issues

Type 0 or phrase structure

Rules can have the form u ::= v, with u in V+ and v in V*.

That is, the left part u can also be a sequence of symbols, and the right part can be empty.

abc -> dca

a -> nil

Page 26: Language Translation Issues

Type 1 or context sensitive or context dependent

Restrict the rewriting rules to the form xUy ::= xuy: we are allowed to rewrite U as u only in the context x…y.

In all productions a -> b, the length of the left side a must be less than or equal to the length of the right side b.

Page 27: Language Translation Issues

G = ({S, B, C}, {a, b, c}, S, P)

P:

S -> aSBC

S -> abC

bB -> bb

bC -> bc

CB -> BC

cC -> cc

What language is generated by this context-sensitive grammar?

Page 28: Language Translation Issues

Deciding the language?

Always start with the start rule. In this case it is S, but it can be any nonterminal (look at the 4-tuple definition).

Create a tree starting with the start symbol and apply the productions, finishing when only terminals remain.

“Generalize” the pattern.

Page 29: Language Translation Issues

Identifying L given G

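P:

1. S -> aSBC

2. S -> abC

3. bB -> bb

4. bC -> bc

5. CB -> BC

6. cC -> cc

Derivation for n = 1:

S => abC => abc

Derivation for n = 2:

S => aSBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc

Derivation for n = 3:

S => aSBC => aaSBCBC => aaabCBCBC => aaabBCCBC => aaabBCBCC => aaabBBCCC => aaabbBCCC => aaabbbCCC => aaabbbcCC => aaabbbccC => aaabbbccc

L = { a^n b^n c^n where n >= 1 }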
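
The "apply the productions and generalize" procedure can also be tried mechanically. The sketch below is one possible approach, assuming a breadth-first search over sentential forms; the function name `generate` and the length bound are choices made for this example.

```python
from collections import deque

# Productions of the grammar above, as (left side, right side) pairs.
PRODUCTIONS = [
    ("S", "aSBC"),
    ("S", "abC"),
    ("bB", "bb"),
    ("bC", "bc"),
    ("CB", "BC"),
    ("cC", "cc"),
]


def generate(max_len):
    """Return every terminal string of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    results = set()
    while queue:
        form = queue.popleft()
        if form.islower():  # only the terminals a, b, c remain
            results.add(form)
            continue
        for left, right in PRODUCTIONS:
            start = form.find(left)
            while start != -1:
                new_form = form[:start] + right + form[start + len(left):]
                # No rule shortens a form, so overlong forms can be dropped.
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                start = form.find(left, start + 1)
    return sorted(results, key=len)


print(generate(9))
```

The search explores every order in which the rules can be applied, including dead ends such as aabcBC where no rule fits, and keeps only the forms that end up entirely in terminals.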

Page 30: Language Translation Issues

Type 2 or context free

U can be rewritten as u regardless of the context in which it appears

Each rule in this type of grammar has only one symbol, a nonterminal, on the left-hand side.

It also allows a rule to go to the empty string.

Page 31: Language Translation Issues

Context Free Expression Grammar

E -> E + T | E - T | T

T -> T * F | T / F | F

F -> number | name | (E)
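
To see how this grammar drives evaluation, here is a small recursive-descent sketch with one function per nonterminal. It is an illustration under stated assumptions: the left-recursive rules are rewritten as loops (a standard workaround, since a direct recursive call on E would never terminate), numbers are unsigned integers, and the name `evaluate` and its `names` dictionary are invented for the example.

```python
import re


def evaluate(text, names=None):
    """Evaluate an arithmetic expression following the E / T / F grammar."""
    names = names or {}
    # Characters that match none of these patterns are skipped in this sketch.
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():  # E -> E + T | E - T | T, left recursion written as a loop
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # T -> T * F | T / F | F
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():  # F -> number | name | (E)
        token = advance()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit():
            return int(token)
        if token in names:
            return names[token]
        raise SyntaxError(f"unexpected token {token!r}")

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate("2 + 3 * 4"), evaluate("(2 + 3) * 4"))
```

Because T sits below E in the grammar, multiplication binds tighter than addition without any separate precedence table, and the loops make both levels left associative.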

Page 32: Language Translation Issues

Type 3 - regular grammars

Restrict the rules once more: all rules must have the form U ::= N or U ::= WN, where U and W are nonterminals and N is a terminal.

Page 33: Language Translation Issues

Grammars

As we move from type 3 to type 2 to type 1 to type 0, the resulting languages become more complex.

Types 2 and 3 became important in programming languages.

Type 3 provides a model (the finite-state machine, FSM) for building lexical analyzers.

Type 2 (BNF) is used for developing parse trees of programs.

Page 34: Language Translation Issues

BNF Grammars

Consider the structure of an English sentence. We usually describe it as a sequence of categories:

subject / verb / object

Examples:

The girl / played / baseball.

The boy / cooked / dinner.

Page 35: Language Translation Issues

BNF Grammars

Each category can be further divided. For example, subject is represented by article noun, which gives

article / noun / verb / object

There are other possible sentence structures besides the simple declarative ones, such as questions.

auxiliary verb / subject / predicate

Is / the boy / cooking dinner?

Page 36: Language Translation Issues

Represent sentences by a set of rules

<sentence> ::= <declarative> | <question>

<declarative> ::= <subject> <verb> <object> .

<subject> ::= <article> <noun>

<question> ::= <auxiliary verb> <subject> <predicate>

This specific notation is called BNF (Backus-Naur form) and was developed in the late 1950s by John Backus as a way to express the syntactic definition of ALGOL. At about the same time Chomsky developed a similar grammatical form, the context-free grammar. BNF grammars and context-free grammars are equivalent in power; the differences are only in notation. For this reason the terms BNF grammar and context-free grammar are used interchangeably.

Page 37: Language Translation Issues

Syntax

A BNF grammar is composed of a finite set of BNF grammar rules, which define a language

Syntax is concerned with form rather than meaning. A (programming) language consists of a set of syntactically correct programs, each of which is simply a sequence of characters.

Page 38: Language Translation Issues

Production Rules

A grammar is a set of production rules:

<real-number> ::= <integer_part> . <fraction>

<integer_part> ::= <digit> | <integer_part> <digit>

<fraction> ::= <digit> | <digit> <fraction>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Names in angle brackets are nonterminals; the digits and the decimal point are tokens, or terminals.
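
Each production can be read as a small recognizer function. The sketch below mirrors the four rules one for one; the function names are invented for the example, and it assumes the digit rule is meant to cover all ten digits, 0 through 9.

```python
def is_digit(s):
    # <digit> ::= 0 | 1 | ... | 9
    return len(s) == 1 and s in "0123456789"


def is_integer_part(s):
    # <integer_part> ::= <digit> | <integer_part> <digit>
    return is_digit(s) or (
        len(s) > 1 and is_integer_part(s[:-1]) and is_digit(s[-1])
    )


def is_fraction(s):
    # <fraction> ::= <digit> | <digit> <fraction>
    return is_digit(s) or (
        len(s) > 1 and is_digit(s[0]) and is_fraction(s[1:])
    )


def is_real_number(s):
    # <real-number> ::= <integer_part> . <fraction>
    head, dot, tail = s.partition(".")
    return dot == "." and is_integer_part(head) and is_fraction(tail)


for candidate in ["13.13", "13.", ".5", "7"]:
    print(candidate, is_real_number(candidate))
```

Note how the two recursive rules differ: the integer part peels digits off the right end and the fraction peels them off the left, matching the left-recursive and right-recursive forms of the productions.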

Page 39: Language Translation Issues

Doesn’t Have to Make Sense!

A syntactically correct program need not make any sense semantically.

If it is executed, it would not have to compute anything useful; it might not compute anything at all.

For example, look at our simple declarative sentences: the syntax subject verb object is fulfilled, but the sentence does not make any sense.

The home / ran / the girl.

Page 40: Language Translation Issues

Parse Trees

Production rules are rules for building strings of tokens.

Beginning with the starting nonterminal, you can use the rules to build a tree.

The parse tree:

each leaf is labeled with a terminal or is empty

nonleaf nodes are labeled with nonterminals

it generates the string formed by reading the terminals at its leaves from left to right

A string is in a language only if it is generated by some parse tree.

Page 41: Language Translation Issues

Parse tree

<real-number> ::= <integer_part> . <fraction>

<integer_part> ::= <digit> | <integer_part> <digit>

<fraction> ::= <digit> | <digit> <fraction>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

String: 13.13

<real-number>
    <integer_part>
        <integer_part>
            <digit> 1
        <digit> 3
    .
    <fraction>
        <digit> 1
        <fraction>
            <digit> 3

Page 42: Language Translation Issues

Use of Formal Grammar

Important to both the language user and the language implementor.

The user may consult it to answer subtle questions about program form, punctuation, and structure.

The implementor may use it to determine all the possible cases of input program structures that are allowed.

It provides a common, agreed-upon definition.

Page 43: Language Translation Issues

BNF grammar or Context free

Assigns a structure to each string in the language.

The structure is always a tree because of the restrictions on BNF grammar rules.

The parse tree provides an intuitive semantic structure.

BNF does a good job of defining the syntax of a language.

Page 44: Language Translation Issues

Syntax not defined by BNF notation

Despite the elegance, power, and simplicity of BNF grammars, there are areas of language syntax that they cannot express (contextual dependence).

Example: the same identifier may not be defined twice in the same scope.

Also, every language can be defined by multiple grammars.

Problem: ambiguity (the dangling else). An example from English:

They / are / flying planes

They / are flying / planes

Page 45: Language Translation Issues

Ambiguity

Ambiguity is often a property of a given grammar.

G: S -> SS | 0 | 1

This grammar, which generates binary strings, is ambiguous because there is a string in the language that has two distinct parse trees.

Page 46: Language Translation Issues

Ambiguous Grammar

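[Figure: two distinct parse trees for the same three-symbol string (leaves 0, 0, 1). One groups the first two symbols under an inner S S; the other groups the last two.]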

Page 47: Language Translation Issues

Ambiguous Grammar

If every grammar for a given language is ambiguous, then the language is inherently ambiguous. However, the language of binary strings is not inherently ambiguous, because there is an unambiguous grammar that generates it:

G: T -> 0T | 1T | 0 | 1
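
One way to make the difference between the two grammars concrete is to count parse trees. The sketch below does this by brute force for short strings; the function names are invented for the example, and the counting approach is just one convenient illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_ambiguous(s):
    """Number of parse trees for s under S -> SS | 0 | 1."""
    if len(s) == 1:
        return 1 if s in "01" else 0
    # S -> SS: try every way of splitting s into two non-empty parts.
    return sum(
        count_ambiguous(s[:i]) * count_ambiguous(s[i:])
        for i in range(1, len(s))
    )


@lru_cache(maxsize=None)
def count_unambiguous(s):
    """Number of parse trees for s under T -> 0T | 1T | 0 | 1."""
    if not s or s[0] not in "01":
        return 0
    if len(s) == 1:
        return 1
    # The first symbol fixes the rule; only the tail is left to parse.
    return count_unambiguous(s[1:])


for text in ["0", "01", "001", "0010"]:
    print(text, count_ambiguous(text), count_unambiguous(text))
```

Under the first grammar the count grows with the number of ways to bracket the string, while under the second every binary string has exactly one tree.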

Page 48: Language Translation Issues

Expressions

We need control structures for expressions.

Implicit (default) control structures are in effect unless modified by the programmer through some explicit structure.

Explicit control structures modify the implicit sequence.

Page 49: Language Translation Issues

Sequencing with Arithmetic Expressions

Root = (-B ± sqrt(B^2 - 4 * A * C)) / (2 * A)

There are 15 separate operations in this formula.

In a programming language this can be stated as a single expression.

Page 50: Language Translation Issues

Sequencing with Arithmetic Expressions

Expressions are a powerful and natural device for expressing sequences of operations; however, they raise new problems.

The sequence-control mechanisms that operate to determine the order of operations within an expression are complex and subtle

Page 51: Language Translation Issues

Tree-Structure Representation

Clarifies the control structure of the expression

        *
      /   \
     +     -
    / \   / \
   a   b c   d

(a+b) * (c-a)

Page 52: Language Translation Issues

Syntax for Expressions

For a programming language we must have a notation for writing trees as linear sequences of symbols.

There are three common ones: prefix, postfix, and infix.

Page 53: Language Translation Issues

Expression Notation

prefix     op E1 E2    +ab
postfix    E1 E2 op    ab+
infix      E1 op E2    a+b

Postfix and prefix are nice: you do not have to use parentheses.

infix        postfix    prefix
(a+b)*c      ab+c*      *+abc
a+b*c        abc*+      +a*bc
a+b+c        ab+c+      ++abc
(a+b)+c      ab+c+      ++abc
a + (b+c)    abc++      +a+bc
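
The rows of the table can be reproduced by walking an expression tree in three different orders. In this sketch a tree is assumed to be either a one-letter operand or a tuple (operator, left, right); that representation and the function names are choices made for the example.

```python
def prefix(tree):
    """Operator first, then both operands."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return op + prefix(left) + prefix(right)


def postfix(tree):
    """Both operands first, then the operator."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return postfix(left) + postfix(right) + op


def infix(tree):
    """Operator in the middle; parentheses keep the grouping explicit."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return "(" + infix(left) + op + infix(right) + ")"


tree = ("*", ("+", "a", "b"), "c")  # the tree for (a+b)*c
print(infix(tree), postfix(tree), prefix(tree))
```

Only the infix walk needs parentheses to stay unambiguous, which is the point made above about prefix and postfix.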

Page 54: Language Translation Issues

Which of the following is a valid expression (either postfix or prefix)?

B C * D - + * A B C - B B B * *
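
A question like this can be checked mechanically by counting operands. The sketch below assumes every operator is binary and that tokens are separated by spaces; the function names are invented for the example.

```python
OPERATORS = set("+-*/")


def is_valid_postfix(tokens):
    """True if the token list is a well-formed postfix expression."""
    depth = 0  # number of complete operands seen so far
    for token in tokens:
        if token in OPERATORS:
            if depth < 2:  # a binary operator needs two operands before it
                return False
            depth -= 1  # two operands are replaced by one result
        else:
            depth += 1
    return depth == 1


def is_valid_prefix(tokens):
    """True if the token list is a well-formed prefix expression."""
    # Read right to left, a prefix expression has the shape of a postfix one.
    return is_valid_postfix(tokens[::-1])


print(is_valid_postfix("a b + c *".split()))
print(is_valid_prefix("* + a b c".split()))
```

Applying both checks to each candidate sequence tells you whether it is valid in either notation.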

Page 55: Language Translation Issues

Expression Notation - Infix

However, infix is familiar and easy to read.

Infix is suited to binary operators; unary operators and multi-argument function calls must be exceptions to the general infix property.

But how do we decode a+b*c?

Precedence (order of operations).

Associativity (normally left to right).

Page 56: Language Translation Issues

Precedence

Give operators precedence levels.

Higher-precedence operators are evaluated before lower-precedence operators.

Without precedence rules, parentheses would be needed in expressions.

This works well with the familiar mathematical symbols but breaks down with new operators not from classical mathematics (?: in C).

Page 57: Language Translation Issues

Associativity

What if operators with the same precedence are grouped together?

Operators + - / * are left associative.

1+2+3+4: left associative

a=b=c=2+3: right associative

2^3^4 (exponentiation): right associative

Mixfix notation: symbols or keywords are interspersed with the components of an expression, as in IF a>b THEN a ELSE b.
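
These rules can be observed directly in a language that follows them. The short sketch below uses Python, where subtraction is left associative, the exponentiation operator ** is right associative, and the conditional expression is a mixfix form; the variable names are arbitrary.

```python
left = 10 - 4 - 3    # grouped as (10 - 4) - 3
right = 2 ** 3 ** 2  # grouped as 2 ** (3 ** 2)

a, b = 7, 5
larger = a if a > b else b  # mixfix: keywords interleaved with operands

print(left, (10 - 4) - 3, 10 - (4 - 3))
print(right, 2 ** (3 ** 2), (2 ** 3) ** 2)
print(larger)
```

Printing both possible groupings next to the unparenthesized form shows which one the language actually chose.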

Page 58: Language Translation Issues

Abstract Syntax Tree

Infix, postfix, and prefix use different notations, but all have the same meaningful components.

An abstract syntax tree is a way to represent these components independently of the notation.

infix      postfix    prefix
(a+b)*c    ab+c*      *+abc

      *
     / \
    +   c
   / \
  a   b

Page 59: Language Translation Issues

Side Effects

The use of operations that have side effects in expressions is the basis of a long-standing controversy in programming language design

Side effects are implicit results. For example an operation may return an explicit result, as in the sum returned as the result of an addition, but it may also modify the values stored in other data objects.

Page 60: Language Translation Issues

a * fun(x) + a

First, we must fetch the r-value of a, and fun(x) must be evaluated.

Notice that the addition requires the value of a and the result of the multiplication.

It is clearly desirable to fetch a once and use it twice.

Moreover, it should make no difference whether fun(x) is evaluated before or after the value of a is fetched.

Page 61: Language Translation Issues

a * fun(x) + a

However, if fun has the side effect of changing the value of a, then the exact order of evaluation is critical!

If a has the initial value 1, and fun(x) returns 3 and also changes the value of a to 2, then the possible values for this expression are:

evaluate each term in sequence: 1 * 3 + 2 = 5

evaluate a only once: 1 * 3 + 1 = 4

call fun(x) before evaluating a: 2 * 3 + 2 = 8

All are correct according to the syntax.
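
The three outcomes can be reproduced by spelling out each evaluation order by hand. This is a sketch only: the dictionary holding a, the order labels, and the function names are assumptions made for the example, and the orders are written out explicitly rather than left to the language.

```python
def run(order):
    """Evaluate a * fun(x) + a under one explicit evaluation order."""
    state = {"a": 1}

    def fun(x):
        state["a"] = 2  # the side effect: fun changes a
        return 3

    if order == "in sequence":
        first_a = state["a"]       # fetch a (1)
        result = fun(0)            # call fun, a becomes 2
        return first_a * result + state["a"]
    if order == "fetch a once":
        a = state["a"]             # fetch a (1) and reuse it
        result = fun(0)
        return a * result + a
    if order == "call fun first":
        result = fun(0)            # a becomes 2 before any fetch
        return state["a"] * result + state["a"]
    raise ValueError(order)


for order in ["in sequence", "fetch a once", "call fun first"]:
    print(order, run(order))
```

Each branch is a legitimate reading of the same expression, which is why a language definition has to say which one applies.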

Page 62: Language Translation Issues

Positions on side effects in expressions

Outlaw them! Disallow functions with side effects, or make their results undefined.

Allow them, but make it clear exactly what the order of evaluation is, so the programmer can make proper use of it.

The latter is the most general, but in many language definitions this question is ignored, and the result is that different implementations provide conflicting interpretations.