Parsing COMP 3002 School of Computer Science


  • Parsing

    COMP 3002, School of Computer Science

  • 2

    The Structure of a Compiler

    [Diagram: program text → tokenizer → token stream → parser → intermediate representation → code generator → machine code. The tokenizer and parser together form the syntactic analyzer.]

  • 3

    Role of the Parser
    • Converts a token stream into an intermediate representation
      – Captures the meaning (rather than the text) of the program
      – Usually, the intermediate representation is a parse tree

    id, "x"

    =

    id, "x"

    +

    number, "1"

    x = x + 1
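    A minimal sketch (not from the slides) of how such a parse tree might be represented in Java; the class and field names are invented here for illustration only.

      // Illustrative parse-tree node: a label, an optional lexeme, and children.
      import java.util.ArrayList;
      import java.util.List;

      class ParseNode {
          final String label;                      // e.g. "=", "+", "id", "number"
          final String lexeme;                     // e.g. "x", "1" (null for operators)
          final List<ParseNode> children = new ArrayList<>();

          ParseNode(String label, String lexeme) {
              this.label = label;
              this.lexeme = lexeme;
          }
      }

      // The tree for x = x + 1 could then be built as:
      //   ParseNode plus = new ParseNode("+", null);
      //   plus.children.add(new ParseNode("id", "x"));
      //   plus.children.add(new ParseNode("number", "1"));
      //   ParseNode assign = new ParseNode("=", null);
      //   assign.children.add(new ParseNode("id", "x"));
      //   assign.children.add(plus);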

  • 4

    Kinds of Parsers
    • Universal
      – Can parse any grammar
      – Cocke-Younger-Kasami and Earley's algorithms
      – Not efficient enough to be used in compilers
    • Top-down
      – Builds parse trees from the top (root) down
    • Bottom-up
      – Builds parse trees from the bottom (leaves) up

  • 5

    Errors in Parsing
    • Lexical errors
      – Misspelled identifiers, keywords, or operators
    • Syntactic errors
      – Misplaced or mismatched parentheses, a case statement outside of any switch statement, ...
    • Semantic errors
      – Type mismatches between operators and operands
    • Logical errors
      – Bugs: the programmer said one thing but meant something else
        • if (x = y) { ... }
        • if (x == y) { ... }

  • 6

    Error Reporting
    • A parser should
      – report the presence of errors clearly and correctly, and
      – recover from errors quickly enough to detect further errors

  • 7

    Error Recovery Modes
    • Panic-Mode
      – Discard input symbols until a "synchronizing token" is found
      – Examples (in Java): semicolon, '}'
    • Phrase-Level
      – Replace a prefix of the remaining input to correct it
      – Example: insert ';' or '{'
      – Must be careful to avoid infinite loops
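    A minimal Java sketch (not from the slides) of the panic-mode idea: on an error, tokens are discarded until a synchronizing token appears. The token-stream interface here is invented for illustration.

      import java.util.Iterator;
      import java.util.Set;

      // Illustrative panic-mode recovery: skip tokens until a synchronizing
      // token such as ";" or "}" is seen, then let the parser resume.
      class PanicMode {
          static void recover(Iterator<String> tokens, Set<String> syncTokens) {
              while (tokens.hasNext()) {
                  String t = tokens.next();
                  if (syncTokens.contains(t)) {
                      return;              // resume parsing after the synchronizing token
                  }
                  // otherwise the token is simply discarded
              }
          }
      }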

  • 8

    Error Recovery Modes (Cont'd)
    • Error Productions
      – Specify common errors as part of the language specification
    • Global Correction
      – Compute the smallest set of changes that will make the program syntactically correct (impractical and usually not useful)

  • 9

    Context-Free Grammars
    • CF grammars are used to define languages
    • Specified using BNF notation
      – A set of non-terminals N
      – A set of terminals T
      – A list of rewrite rules (productions)
      – The LHS of each rule contains one non-terminal symbol
      – The RHS of each rule contains a regular expression over the alphabet N ∪ T
      – A special non-terminal is usually designated as the start symbol
        • Usually, the start symbol is the LHS of the first production

  • 10

    Context-Free Grammars and Compilers
    • In a compiler
      – N consists of language constructs (function, block, if-statement, expression, ...)
      – T consists of tokens

  • 11

    Grammar Example
    • Non-terminals: E, T, F
      – E = expression
      – T = term
      – F = factor
    • Terminals: id, +, *, (, )
    • Start symbol: E
    • "Mathematical formulae using the + and * operators"

      E → E + T | T
      T → T * F | F
      F → ( E ) | id

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

      E → E + E | E * E | ( E ) | id

  • 12

    Derivations
    • From a grammar specification, we can derive any string in the language
      – Start with the start symbol
      – While the current string contains some non-terminal N
        • expand N using a rewrite rule with N on the LHS

      E → E + E | E * E | ( E ) | id

      E ⇒ E + E
        ⇒ E + id
        ⇒ E * E + id
        ⇒ E * id + id
        ⇒ ( E ) * id + id
        ⇒ ( E + E ) * id + id
        ⇒ ( id + E ) * id + id
        ⇒ ( id + id ) * id + id

  • 13

    Derivation Example
    • Derive id + id * id with each of these grammars:

      E → E + T | T
      T → T * F | F
      F → ( E ) | id

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

      E → E + E | E * E | ( E ) | id

  • 14

    Derivation Example
    • Derive:
      – id * id + id
      – id + id * id

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

  • 15

    Terminology
    • The strings of terminals that we derive from the start symbol are called sentences
    • The strings of terminals and non-terminals are called sentential forms

  • 16

    Leftmost and Rightmost Derivations
    • A derivation is leftmost if at each stage we always expand the leftmost non-terminal
    • A derivation is rightmost if at each stage we always expand the rightmost non-terminal
    • Give a leftmost and a rightmost derivation of
      – id * id + id

      E → E + E | E * E | ( E ) | id

  • 17

    Derivations and Parse Trees
    • A parse tree is a graphical representation of a derivation
    • Internal nodes are labelled with non-terminals
      – The root is the start symbol
    • Leaves are labelled with terminals
      – The derived string is read off by a left-to-right traversal of the leaves
    • When applying an expansion E → ABC...Z
      – The node labelled E gets children labelled A, B, C, ..., Z

  • 18

    Derivations and Parse Trees - Example

      E → E + E | E * E | ( E ) | id

      E ⇒ E + E
        ⇒ E + id
        ⇒ E * E + id
        ⇒ E * id + id
        ⇒ ( E ) * id + id
        ⇒ ( E + E ) * id + id
        ⇒ ( id + E ) * id + id
        ⇒ ( id + id ) * id + id

      [Parse tree for ( id + id ) * id + id: root E with children E, +, E; the right E derives id; the left E has children E, *, E, where the right E derives id and the left E derives ( E ), with the inner E deriving id + id.]

  • 19

    Ambiguity
    • Different parse trees for the same sentence result in ambiguity

      E → E + E | E * E | ( E ) | id

      Derivation 1:  E ⇒ E + E ⇒ E + E * E ⇒ a + b * c
      Derivation 2:  E ⇒ E * E ⇒ E + E * E ⇒ a + b * c

      [Two parse trees for a + b * c: the first derivation gives a tree whose root + has children a and the subtree b * c; the second gives a tree whose root * has children the subtree a + b and c.]

  • 20

    Ambiguity – Cont'd
    • Ambiguity is usually bad
    • The same program means two different things
    • We try to write grammars that avoid ambiguity

  • 21

    Context-Free Grammars and Regular Expressions
    • CFGs are more powerful than regular expressions
      – Converting a regular expression to a CFG is trivial
      – The CFG S → aSb | ε generates a language that is not regular
    • But not all-powerful
      – The language { a^m b^n c^m d^n : n, m > 0 } cannot be expressed by a CFG

  • 22

    Enforcing Order of Operations
    • We can write a CFG to enforce a specific order of operations
      – Example: + and *
      – Exercises:
        • Add a comparison operator with lower precedence than +
        • Add an exponentiation operator with higher precedence than *

      E → PE
      PE → TE + PE | TE
      TE → id * TE | id | ( E )

  • 23

    Picking Up - Context-Free Grammars
    • CFGs can specify programming languages
    • It's not enough to write a correct CFG
      – An ambiguous CFG can give two different parse trees for the same string
        • The same program has two different meanings!
      – Not all CFGs are easy to parse efficiently
    • We look at restricted classes of CFGs
      – Sufficiently restricted grammars can be parsed easily
      – The parser can be generated automatically

  • 24

    Parser Generators
    • Benefits of parser generators
      – No need to write code (just the grammar)
      – The parser always corresponds exactly to the grammar specification
      – Can check for errors or ambiguities in grammars
      – No surprise programs
    • Drawbacks
      – Need to write a grammar in a restricted class [LL(1), LR(1), LR(k), ...]
      – Must be able to understand when and why a grammar is not LL(1) or LR(1) or LR(k)
        – Means learning a bit of parsing theory
        – Means learning how to make your grammar LL(1), LR(1), or LR(k)

  • 25

    Ambiguity
    • This grammar is ambiguous
      – consider the input
        • a - b - c
    • Rewrite this grammar to be unambiguous
    • Rewrite this grammar so that - becomes left-associative:
      • a - b - c ~ ((a - b) - c)

      E → E - E | id

  • 26

    Solutions

      E → id M
      M → - E | ε

      E → M | id
      M → E - id

  • 27

    A Common Ambiguity – The Dangling Else
    • Show that this grammar is ambiguous
    • Remove the ambiguity
      – Implement the "else matches innermost if" rule

      stmt → if expr then stmt
           | if expr then stmt else stmt
           | other

  • 28

    Solution

      stmt → matched_stmt | open_stmt
      matched_stmt → if expr then matched_stmt else matched_stmt | other
      open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt

  • 29

    Left Recursion
    • A top-down parser expands the leftmost non-terminal based on the next token
    • Left recursion is difficult for top-down parsing
    • Immediate left recursion:
      – A → Aα | β
      – Rewrite as: A → βA' and A' → αA' | ε
    • More complicated left recursion occurs when A can derive a string starting with A
      – A ⇒+ Aα

  • 30

    Removing Left Recursion
    • Removing immediate left recursion is easy
    • Simple case:
      – A → Aα | β
      – Rewrite as: A → βA' and A' → αA' | ε
    • More complicated (see the worked instance below):
      – A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
      – Rewrite as:
        • A → β1A' | β2A' | ... | βnA'
        • A' → α1A' | α2A' | ... | αmA' | ε
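    A concrete instance (not spelled out on the slide, but it matches the grammars on slide 11): applying the simple case to E → E + T | T, with α = + T and β = T, gives E → T E' and E' → + T E' | ε; doing the same for T → T * F | F gives T → F T' and T' → * F T' | ε, which is exactly how the second grammar on slide 11 is obtained from the first.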

  • 31

    Algorithm for Removing All Left Recursion
    • Textbook, page 213

  • 32

    Left Factoring
    • Left factoring is a technique for making a grammar suitable for top-down parsing
    • For each non-terminal A, find the longest prefix α common to two or more alternatives
      – Replace A → αβ1 | αβ2 | ... | αβn with
      – A → αA' and A' → β1 | β2 | ... | βn
    • Repeat until no two alternatives have a common prefix
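    A small example (not from the slides, using the dangling-else grammar of slide 27): in stmt → if expr then stmt else stmt | if expr then stmt, the two alternatives share the prefix α = if expr then stmt, so left factoring gives stmt → if expr then stmt rest and rest → else stmt | ε, where rest is a new non-terminal introduced here for illustration.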

  • 33

    Left Factoring Example
    • Left-factor the following grammar

      E → PE
      PE → TE + PE | TE
      TE → id * TE | id | ( E )

  • 34

    Summary of Grammar-Manipulation Tricks
    • Eliminating ambiguity
      – Different parse trees for the same program
    • Enforcing order of operations
      – Left-associative
      – Right-associative
    • Eliminating left recursion
      – Gets rid of potential "infinite recursions"
    • Left factoring
      – Allows choosing between alternative productions based on the current input symbol

  • 35

    Exercise

    • Remove left recursion
    • Left-factor

      rexpr → rexpr + rterm | rterm
      rterm → rterm rfactor | rfactor
      rfactor → rfactor * | rprimary
      rprimary → a | b

  • 36

    Top-Down Parsing
    • Top-down parsing is the problem of constructing a pre-order traversal of the parse tree
    • This results in a leftmost derivation
    • The expansion of the leftmost non-terminal is determined by looking at a prefix of the input
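    A minimal recursive-descent sketch (not from the slides) of this idea in Java, using the non-left-recursive expression grammar from slide 11 (E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id). Tokens are simplified to plain strings and the end of input is marked by "$"; the class and method names are invented for illustration.

      import java.util.List;

      class RecursiveDescent {
          private final List<String> tokens;      // e.g. ["id", "+", "id", "*", "id", "$"]
          private int pos = 0;

          RecursiveDescent(List<String> tokens) { this.tokens = tokens; }

          private String lookahead() { return tokens.get(pos); }

          private void match(String expected) {
              if (!lookahead().equals(expected))
                  throw new RuntimeException("expected " + expected + " but saw " + lookahead());
              pos++;
          }

          void parseE()      { parseT(); parseEPrime(); }            // E  -> T E'

          void parseEPrime() {                                        // E' -> + T E' | eps
              if (lookahead().equals("+")) { match("+"); parseT(); parseEPrime(); }
              // otherwise choose E' -> eps (lookahead is then in FOLLOW(E') = { ), $ })
          }

          void parseT()      { parseF(); parseTPrime(); }             // T  -> F T'

          void parseTPrime() {                                        // T' -> * F T' | eps
              if (lookahead().equals("*")) { match("*"); parseF(); parseTPrime(); }
          }

          void parseF() {                                             // F  -> ( E ) | id
              if (lookahead().equals("(")) { match("("); parseE(); match(")"); }
              else match("id");
          }

          public static void main(String[] args) {
              new RecursiveDescent(List.of("id", "+", "id", "*", "id", "$")).parseE();
              System.out.println("parsed OK");
          }
      }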

  • 37

    LL(1) and LL(k)
    • If the correct expansion can always be determined by looking ahead at most k symbols, then the grammar is an LL(k) grammar
    • LL(1) grammars are the most common

  • 38

    FIRST(α)
    • Let α be any string of grammar symbols
    • FIRST(α) is the set of terminals that begin strings that can be derived from α
      – If α can derive ε then ε is also in FIRST(α)
    • Why is FIRST useful?
      – Suppose A → α | β and FIRST(α) and FIRST(β) are disjoint
      – Then, by looking at the next input symbol we know which production to use next

  • 39

    Computing FIRST(X)
    • If X is a terminal then FIRST(X) = {X}
    • If X is a non-terminal and X → Y1 Y2 ... Yk
      – i = 0; define FIRST(Y0) = {ε}
      – while ε is in FIRST(Yi) (and i < k)
        • Add FIRST(Yi+1) \ {ε} to FIRST(X)
        • i = i + 1
      – if (i = k or X → ε)
        • Add ε to FIRST(X)
    • Repeat the above step for all non-terminals until nothing is added to any FIRST set
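    A Java sketch (not the slides' pseudocode verbatim) of the usual fixpoint computation of FIRST for every non-terminal. A production's right-hand side is a list of symbols, an empty list stands for an ε-production, and ε itself is represented by the string "eps"; all names are illustrative.

      import java.util.*;

      class FirstSets {
          static final String EPS = "eps";

          // prods maps each non-terminal to the list of its right-hand sides
          static Map<String, Set<String>> compute(Map<String, List<List<String>>> prods) {
              Map<String, Set<String>> first = new HashMap<>();
              for (String a : prods.keySet()) first.put(a, new HashSet<>());

              boolean changed = true;
              while (changed) {                          // repeat until no FIRST set grows
                  changed = false;
                  for (var entry : prods.entrySet()) {
                      String a = entry.getKey();
                      for (List<String> rhs : entry.getValue()) {
                          Set<String> add = firstOfString(rhs, first, prods.keySet());
                          if (first.get(a).addAll(add)) changed = true;
                      }
                  }
              }
              return first;
          }

          // FIRST of a string X1 X2 ... Xk, given the FIRST sets computed so far
          static Set<String> firstOfString(List<String> symbols,
                                           Map<String, Set<String>> first,
                                           Set<String> nonTerminals) {
              Set<String> result = new HashSet<>();
              for (String x : symbols) {
                  if (!nonTerminals.contains(x)) {       // terminal: FIRST(X) = {X}
                      result.add(x);
                      return result;
                  }
                  Set<String> fx = first.get(x);
                  for (String s : fx) if (!s.equals(EPS)) result.add(s);
                  if (!fx.contains(EPS)) return result;  // X cannot vanish, so stop here
              }
              result.add(EPS);                           // every Xi can derive epsilon
              return result;
          }
      }

    The firstOfString helper is exactly the "FIRST of a string X1 X2 ... Xk" computation described on slide 41.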

  • 40

    Example
    • Compute FIRST(E), FIRST(PE), FIRST(TE), FIRST(TE')

      E → PE
      PE → TE + E | TE
      TE → id TE' | ( E )
      TE' → * E | ε

  • 41

    Computing FIRST(X1 X2 ... Xk)
    • Given FIRST(X) for every symbol X, we can compute FIRST(X1 X2 ... Xk) for any string of symbols X1 X2 ... Xk:
      – i = 0; define FIRST(X0) = {ε}
      – while ε is in FIRST(Xi) (and i < k)
        • Add FIRST(Xi+1) \ {ε} to FIRST(X1 X2 ... Xk)
        • i = i + 1
      – if (i = k)
        • Add ε to FIRST(X1 X2 ... Xk)

  • 42

    FOLLOW(A)
    • Let A be any non-terminal
    • FOLLOW(A) is the set of terminals a that can appear immediately to the right of A in some sentential form
      – I.e. S ⇒* αAaβ for some α and β, where S is the start symbol
      – Also, if A can be the rightmost symbol in some sentential form, then $ (the end-of-input marker) is in FOLLOW(A)

  • 43

    Computing FOLLOW(A)
    • Place $ into FOLLOW(S)
    • Repeat until nothing changes:
      – if A → αBβ then add FIRST(β) \ {ε} to FOLLOW(B)
      – if A → αB then add FOLLOW(A) to FOLLOW(B)
      – if A → αBβ and ε is in FIRST(β) then add FOLLOW(A) to FOLLOW(B)
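    A Java sketch of the same fixpoint computation of FOLLOW, reusing the illustrative FirstSets class from the earlier sketch; the start symbol and the "$" end marker are passed in explicitly.

      import java.util.*;

      class FollowSets {
          static Map<String, Set<String>> compute(Map<String, List<List<String>>> prods,
                                                  Map<String, Set<String>> first,
                                                  String startSymbol) {
              Map<String, Set<String>> follow = new HashMap<>();
              for (String a : prods.keySet()) follow.put(a, new HashSet<>());
              follow.get(startSymbol).add("$");          // $ goes into FOLLOW(S)

              boolean changed = true;
              while (changed) {
                  changed = false;
                  for (var entry : prods.entrySet()) {
                      String a = entry.getKey();
                      for (List<String> rhs : entry.getValue()) {
                          for (int i = 0; i < rhs.size(); i++) {
                              String b = rhs.get(i);
                              if (!prods.containsKey(b)) continue;   // only non-terminals have FOLLOW
                              List<String> beta = rhs.subList(i + 1, rhs.size());
                              Set<String> firstBeta =
                                  FirstSets.firstOfString(beta, first, prods.keySet());
                              // add FIRST(beta) \ {eps} to FOLLOW(B)
                              for (String s : firstBeta)
                                  if (!s.equals(FirstSets.EPS) && follow.get(b).add(s)) changed = true;
                              // if beta is empty or can derive eps, FOLLOW(A) flows into FOLLOW(B)
                              if (firstBeta.contains(FirstSets.EPS)
                                      && follow.get(b).addAll(follow.get(a))) changed = true;
                          }
                      }
                  }
              }
              return follow;
          }
      }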

  • 44

    Example
    • Compute FOLLOW(E), FOLLOW(PE), FOLLOW(TE), FOLLOW(TE')

      E → PE
      PE → TE + E | TE
      TE → id TE' | ( E )
      TE' → * E | ε

  • 45

    FIRST and FOLLOW Example

    • FIRST(F) = FIRST(T) = FIRST(E) = { (, id }
    • FIRST(E') = { +, ε }
    • FIRST(T') = { *, ε }
    • FOLLOW(E) = FOLLOW(E') = { ), $ }
    • FOLLOW(T) = FOLLOW(T') = { +, ), $ }
    • FOLLOW(F) = { +, *, ), $ }

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

  • 46

    LL(1) Grammars
    • Left-to-right parsers producing a leftmost derivation, looking ahead by at most 1 symbol
    • Grammar G is LL(1) iff for every two productions of the form A → α | β
      – FIRST(α) and FIRST(β) are disjoint
      – If ε is in FIRST(β) then FIRST(α) and FOLLOW(A) are disjoint (and vice versa)
    • Most programming language constructs are LL(1), but careful grammar writing is required
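    As a quick check against the example on slide 45 (the check itself is not spelled out on the slides): for E' → + T E' | ε we have FIRST(+ T E') = { + } and FIRST(ε) = { ε }, which are disjoint; and since ε is in FIRST(ε), we also need FIRST(+ T E') = { + } to be disjoint from FOLLOW(E') = { ), $ }, which it is. T' → * F T' | ε passes the same test, so that grammar is LL(1).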

  • 47

    LL(1) Prediction Tables
    • LL(1) languages can be parsed efficiently through the use of a prediction table
      – Rows are non-terminals
      – Columns are input symbols (terminals)
      – Entries are productions

  • 48

    Constructing the LL(1) Prediction Table
    • The following algorithm constructs the LL(1) prediction table
    • For each production A → α in the grammar
      – For each terminal a in FIRST(α), set M[A, a] = A → α
      – If ε is in FIRST(α), then for each terminal b in FOLLOW(A), set M[A, b] = A → α
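    A Java sketch of this construction, reusing the illustrative FirstSets class from the earlier sketches; a table cell simply stores the right-hand side to expand to, and an already-occupied cell signals that the grammar is not LL(1).

      import java.util.*;

      class PredictionTable {
          static Map<String, Map<String, List<String>>> build(
                  Map<String, List<List<String>>> prods,
                  Map<String, Set<String>> first,
                  Map<String, Set<String>> follow) {
              Map<String, Map<String, List<String>>> table = new HashMap<>();
              for (var entry : prods.entrySet()) {
                  String a = entry.getKey();
                  table.put(a, new HashMap<>());
                  for (List<String> rhs : entry.getValue()) {
                      Set<String> firstRhs = FirstSets.firstOfString(rhs, first, prods.keySet());
                      for (String t : firstRhs)
                          if (!t.equals(FirstSets.EPS))
                              put(table, a, t, rhs);           // M[A, t] = A -> rhs for t in FIRST(rhs)
                      if (firstRhs.contains(FirstSets.EPS))
                          for (String b : follow.get(a))
                              put(table, a, b, rhs);           // M[A, b] = A -> rhs for b in FOLLOW(A)
                  }
              }
              return table;
          }

          private static void put(Map<String, Map<String, List<String>>> table,
                                  String a, String t, List<String> rhs) {
              // a conflict here means two productions compete for the same cell: not LL(1)
              if (table.get(a).put(t, rhs) != null)
                  throw new IllegalStateException("grammar is not LL(1): conflict at M[" + a + ", " + t + "]");
          }
      }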

  • LL(1) Prediction Table Example

      E → T E'
      E' → + T E' | ε
      T → F T'
      T' → * F T' | ε
      F → ( E ) | id

    Prediction table M (rows: non-terminals; columns: id, +, *, (, ), $; empty entries are errors):

      M[E, id] = E → T E'        M[E, (] = E → T E'
      M[E', +] = E' → + T E'     M[E', )] = E' → ε       M[E', $] = E' → ε
      M[T, id] = T → F T'        M[T, (] = T → F T'
      M[T', *] = T' → * F T'     M[T', +] = T' → ε       M[T', )] = T' → ε       M[T', $] = T' → ε
      M[F, id] = F → id          M[F, (] = F → ( E )

  • LL(1) Prediction Table Example
    • Use the prediction table (the same table as on the previous slide) to find the derivation of
      – id + id * id + id
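    A sketch of the standard stack-based loop that drives such a table (not from the slides; it reuses the table shape from the PredictionTable sketch above). The input is a list of tokens ending with "$"; printing each expansion as it happens yields the leftmost derivation.

      import java.util.*;

      class TableDrivenParser {
          static void parse(List<String> input, String startSymbol,
                            Map<String, Map<String, List<String>>> table) {
              Deque<String> stack = new ArrayDeque<>();
              stack.push("$");
              stack.push(startSymbol);
              int pos = 0;

              while (!stack.isEmpty()) {
                  String top = stack.pop();
                  String lookahead = input.get(pos);
                  if (top.equals("$") && lookahead.equals("$")) return;       // accept
                  if (!table.containsKey(top)) {                              // top is a terminal
                      if (!top.equals(lookahead))
                          throw new RuntimeException("expected " + top + " but saw " + lookahead);
                      pos++;                                                  // match and advance
                  } else {                                                    // top is a non-terminal
                      List<String> rhs = table.get(top).get(lookahead);
                      if (rhs == null)
                          throw new RuntimeException("no prediction for " + top + " on " + lookahead);
                      // push the production's right-hand side in reverse order
                      for (int i = rhs.size() - 1; i >= 0; i--) stack.push(rhs.get(i));
                      // printing top + " -> " + rhs here would print the leftmost derivation
                  }
              }
          }
      }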

  • 51

    LL(1) Parser Generators
    • Given a grammar G, an LL(1) parser generator can
      – Compute FIRST(A) and FOLLOW(A) for every non-terminal A in G
      – Determine whether G is LL(1)
      – Construct the prediction table for G
      – Create code that parses any string in the language of G and produces the parse tree
    • In Assignment 2 we will use such a parser generator (javacc)

  • 52

    Summary
    • Programming languages can be specified with context-free grammars
    • Some of these grammars are easy to parse and generate a unique parse tree for any program
    • An LL(1) grammar is one for which a leftmost derivation can be constructed with only one symbol of lookahead
    • LL(1) parser generators exist and can produce efficient parsers given only the grammar
