

1 Syntax
Sudeshna Sarkar
25 Aug 2008

2 Top-Down and Bottom-Up
Top-down: only searches for trees that can be answers (i.e. S's), but also suggests trees that are not consistent with any of the words.
Bottom-up: only forms trees consistent with the words, but suggests trees that make no sense globally.

3 Problems
Even with the best filtering, backtracking methods are doomed if they don't address certain problems:
- Ambiguity
- Shared subproblems

4 Ambiguity

5 Shared Sub-Problems
No matter what kind of search (top-down, bottom-up, or mixed) we choose, we don't want to unnecessarily redo work we've already done.

6 Shared Sub-Problems
Consider: "A flight from Indianapolis to Houston on TWA"

7 Shared Sub-Problems
Assume a top-down parse making bad initial choices on the Nominal rule, in particular:
  Nominal -> Nominal Noun
  Nominal -> Nominal PP

8-11 Shared Sub-Problems

12 Parsing
CKY and Earley: both are dynamic programming solutions that run in O(n^3) time.
CKY is bottom-up; Earley is top-down.

13 Sample Grammar

14 Dynamic Programming
DP methods fill tables with partial results and:
- do not do too much avoidable repeated work;
- solve exponential problems in polynomial time (sort of);
- efficiently store ambiguous structures with shared sub-parts.

15 CKY Parsing
First we'll limit our grammar to epsilon-free, binary rules (more later).
Consider the rule A -> B C: if there is an A in the input, then there must be a B followed by a C in the input. If the A spans from i to j in the input, then there must be some k such that i < k < j.
IF A -> B C is a rule in the grammar, THEN there must be a B spanning [i,k] and a C spanning [k,j], for some i < k < j.
In Chomsky Normal Form, all rules are of the form A -> B C or A -> w; that is, rules can expand to either two non-terminals or to a single terminal.

29 Binarization Intuition
Eliminate chains of unit productions.
Introduce new intermediate non-terminals into the grammar that distribute rules with length > 2 over several rules.
So S -> A B C turns into
  S -> X C
  X -> A B
where X is a symbol that doesn't occur anywhere else in the grammar. (A short code sketch of this step appears after slide 38 below.)

30 CNF Conversion

31 CKY Algorithm

32 Example
Filling column 5

33-36 Example

37 END

38 Statistical parsing
Over the last 12 years statistical parsing has succeeded wonderfully!
NLP researchers have produced a range of (often free, open source) statistical parsers, which can parse any sentence and often get most of it correct.
These parsers are now a commodity component.
The parsers are still improving year-on-year.
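Slide 29's binarization step can be made concrete in a few lines of code. Below is a minimal sketch in Python, assuming rules are (LHS, RHS-list) pairs; the "@S->A_B" naming scheme for the fresh symbols is an illustrative choice, not the lecture's, and the sketch does not do the unit-production or epsilon elimination that full CNF conversion also involves.

  # Split any rule with more than two right-hand-side symbols by peeling
  # off the first two symbols into a fresh intermediate non-terminal,
  # so S -> A B C becomes X -> A B and S -> X C (slide 29's transformation).
  def binarize(rules):
      out = []
      for lhs, rhs in rules:
          rhs = list(rhs)
          while len(rhs) > 2:
              new_sym = "@%s->%s" % (lhs, "_".join(rhs[:2]))  # hypothetical naming scheme
              out.append((new_sym, rhs[:2]))
              rhs = [new_sym] + rhs[2:]
          out.append((lhs, rhs))
      return out

  print(binarize([("S", ["A", "B", "C"])]))
  # [('@S->A_B', ['A', 'B']), ('S', ['@S->A_B', 'C'])]

The transformed grammar generates the same strings, and the intermediate "@" symbols can be spliced out of the resulting trees afterwards, in the transform-parse-detransform spirit discussed later in the transcript.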
39 Classical NLP Parsing
Wrote symbolic grammar and lexicon:
  Grammar:            Lexicon:
  S  -> NP VP         NN  -> interest
  NP -> (DT) NN       NNS -> rates
  NP -> NN NNS        NNS -> raises
  NP -> NNP           VBP -> interest
  VP -> V NP          VBZ -> rates
Used proof systems to prove parses from words.
This scaled very badly and didn't give coverage. Minimal grammar on the "Fed raises ..." sentence: 36 parses. Simple 10-rule grammar: 592 parses. Real-size broad-coverage grammar: millions of parses.

40 Classical NLP Parsing: The problem and its solution
Very constrained grammars attempt to limit unlikely/weird parses for sentences, but the attempt makes the grammars not robust: many sentences have no parse.
A less constrained grammar can parse more sentences, but simple sentences end up with ever more parses.
Solution: we need mechanisms that allow us to find the most likely parse(s). Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences, but still quickly find the best parse(s).

41 The rise of annotated data: The Penn Treebank
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP (NP (DT a) (NN round))
        (PP (IN of)
          (NP (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))

42 The rise of annotated data
Going into it, building a treebank seems a lot slower and less useful than building a grammar.
But a treebank gives us many things:
- Reusability of the labor
- Broad coverage
- Frequencies and distributional information (see the short code sketch after slide 46 below)
- A way to evaluate systems

43 Human parsing
Humans often do ambiguity maintenance:
  Have the police ... eaten their supper?
                  ... come in and look around.
                  ... taken out and shot.
But humans also commit early and are garden-pathed:
  The man who hunts ducks out on weekends.
  The cotton shirts are made from grows in Mississippi.
  The horse raced past the barn fell.

44 Phrase structure grammars = context-free grammars
G = (T, N, S, R)
- T is a set of terminals
- N is a set of nonterminals
  (For NLP, we usually distinguish a set P ⊂ N of preterminals, which always rewrite as terminals)
- S is the start symbol (one of the nonterminals)
- R is a set of rules/productions of the form X -> γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
A grammar G generates a language L.

45 Probabilistic or stochastic context-free grammars (PCFGs)
G = (T, N, S, R, P)
- T is a set of terminals
- N is a set of nonterminals
  (For NLP, we usually distinguish a set P ⊂ N of preterminals, which always rewrite as terminals)
- S is the start symbol (one of the nonterminals)
- R is a set of rules/productions of the form X -> γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
- P(R) gives the probability of each rule
A grammar G generates a language model L.

46 Soundness and completeness
A parser is sound if every parse it returns is valid/correct.
A parser terminates if it is guaranteed not to go off into an infinite loop.
A parser is complete if, for any given grammar and sentence, it is sound, produces every valid parse for that sentence, and terminates.
(For many purposes, we settle for sound but incomplete parsers: e.g., probabilistic parsers that return a k-best list.)
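Slide 42 lists frequencies and distributional information among the things a treebank provides. The sketch below (Python) shows how rule counts could be read off bracketings like the one on slide 41; the tokenizer, the tuple tree representation, and the shortened example string are illustrative simplifications, not a real treebank reader.

  import re
  from collections import Counter

  def read_tree(tokens):
      # Parse one "( LABEL child ... )" bracketing into (label, children...).
      assert tokens.pop(0) == "("
      label = tokens.pop(0)
      children = []
      while tokens[0] != ")":
          if tokens[0] == "(":
              children.append(read_tree(tokens))
          else:
              children.append(tokens.pop(0))   # a word
      tokens.pop(0)                            # drop ")"
      return (label, *children)

  def count_rules(tree, counts):
      # Record the rule LABEL -> child-labels at this node, then recurse.
      label, *children = tree
      rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
      counts[(label, rhs)] += 1
      for c in children:
          if not isinstance(c, str):
              count_rules(c, counts)

  # A shortened fragment of slide 41's bracketing, for illustration only.
  text = "(S (NP-SBJ (DT The) (NN move)) (VP (VBD followed) (NP (DT a) (NN round))))"
  counts = Counter()
  count_rules(read_tree(re.findall(r"\(|\)|[^\s()]+", text)), counts)
  for rule, n in counts.items():
      print(rule, n)

Normalizing such counts per left-hand side is one standard way to obtain the rule probabilities P(R) of the PCFGs defined on slide 45.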
47 Top-down parsing
Top-down parsing is goal directed.
A top-down parser starts with a list of constituents to be built. It rewrites the goals in the goal list by matching one against the LHS of a grammar rule and expanding it with the RHS, attempting to match the sentence to be derived.
If a goal can be rewritten in several ways, then there is a choice of which rule to apply (search problem).
Can use depth-first or breadth-first search, and goal ordering.

48 Top-down parsing

49 Bottom-up parsing
Bottom-up parsing is data directed.
The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the LHS of the rule. Parsing is finished when the goal list contains just the start category.
If the RHS of several rules match the goal list, then there is a choice of which rule to apply (search problem).
Can use depth-first or breadth-first search, and goal ordering.
The standard presentation is as shift-reduce parsing.

50 Shift-reduce parsing: one path
Sentence: cats scratch people with claws

  Stack              Remaining input                Action
  cats               scratch people with claws      SHIFT
  N                  scratch people with claws      REDUCE
  NP                 scratch people with claws      REDUCE
  NP scratch         people with claws              SHIFT
  NP V               people with claws              REDUCE
  NP V people        with claws                     SHIFT
  NP V N             with claws                     REDUCE
  NP V NP            with claws                     REDUCE
  NP V NP with       claws                          SHIFT
  NP V NP P          claws                          REDUCE
  NP V NP P claws                                   SHIFT
  NP V NP P N                                       REDUCE
  NP V NP P NP                                      REDUCE
  NP V NP PP                                        REDUCE
  NP VP                                             REDUCE
  S                                                 REDUCE

What other search paths are there for parsing this sentence? (A small code sketch of this style of parsing follows slide 56 below.)

51 Problems with top-down parsing
- Left-recursive rules.
- A top-down parser will do badly if there are many different rules for the same LHS. Consider: if there are 600 rules for S, 599 of which start with NP, but one of which starts with V, and the sentence starts with V.
- Useless work: expands things that are possible top-down but not there.
- Top-down parsers do well if there is useful grammar-driven control: search is directed by the grammar.
- Top-down is hopeless for rewriting parts of speech (preterminals) with words (terminals). In practice that is always done bottom-up, as lexical lookup.
- Repeated work: anywhere there is common substructure.

52 Problems with bottom-up parsing
- Unable to deal with empty categories: a termination problem, unless rewriting empties as constituents is somehow restricted (but then it's generally incomplete).
- Useless work: locally possible, but globally impossible.
- Inefficient when there is great lexical ambiguity (grammar-driven control might help here).
- Conversely, it is data-directed: it attempts to parse the words that are there.
- Repeated work: anywhere there is common substructure.

53 Repeated work

54 Principles for success: take 1
If you are going to do parsing-as-search with a grammar as is:
- Left-recursive structures must be found, not predicted.
- Empty categories must be predicted, not found.
Doing these things doesn't fix the repeated work problem: both TD (LL) and BU (LR) parsers can (and frequently do) do work exponential in the sentence length on NLP problems.

55 Principles for success: take 2
Grammar transformations can fix both left-recursion and epsilon productions.
Then you parse the same language but with different trees.
Linguists tend to hate you, but this is a misconception: they shouldn't. You can fix the trees post hoc: the transform-parse-detransform paradigm.

56 Principles for success: take 3
Rather than doing parsing-as-search, we do parsing as dynamic programming. This is the most standard way to do things. Q.v. CKY parsing, next time.
It solves the problem of doing repeated work.
But there are also other ways of solving the problem of doing repeated work:
- Memoization (remembering solved subproblems); also next time.
- Doing graph-search rather than tree-search.
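Slide 50 traces one shift-reduce path by hand. The sketch below (Python) searches the shift/reduce choices by backtracking and prints the first action sequence that ends with S on the stack and no input left. The grammar and lexicon are assumptions reconstructed from the categories on the slide; since they contain only unary and binary rules, the final reductions it finds differ from the slide's flat VP -> V NP PP step, which itself illustrates the slide's closing question about other search paths.

  # A tiny backtracking shift-reduce recognizer (illustrative sketch).
  # RULES and LEXICON are assumptions, not the lecture's exact grammar.
  RULES = [
      ("NP", ("N",)),
      ("NP", ("NP", "PP")),
      ("PP", ("P", "NP")),
      ("VP", ("V", "NP")),
      ("VP", ("VP", "PP")),
      ("S",  ("NP", "VP")),
  ]
  LEXICON = {"cats": "N", "people": "N", "claws": "N", "scratch": "V", "with": "P"}

  def parse(stack, words, trace):
      # Success: all words consumed and the stack is exactly [S].
      if not words and stack == ["S"]:
          return trace
      # Try every REDUCE whose right-hand side matches the top of the stack.
      for lhs, rhs in RULES:
          n = len(rhs)
          if tuple(stack[-n:]) == rhs:
              found = parse(stack[:-n] + [lhs], words,
                            trace + [("REDUCE", " ".join(stack[:-n] + [lhs]))])
              if found:
                  return found
      # Otherwise SHIFT the next word (as its preterminal category).
      if words:
          new_stack = stack + [LEXICON[words[0]]]
          return parse(new_stack, words[1:],
                       trace + [("SHIFT", " ".join(new_stack))])
      return None

  for action, state in parse([], "cats scratch people with claws".split(), []):
      print(action, state)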
57 Probabilistic or stochastic context-free grammars (PCFGs)
G = (T, N, S, R, P)
- T is a set of terminals
- N is a set of nonterminals
  (For NLP, we usually distinguish a set P ⊂ N of preterminals, which always rewrite as terminals)
- S is the start symbol (one of the nonterminals)
- R is a set of rules/productions of the form X -> γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
- P(R) gives the probability of each rule
A grammar G generates a language model L.

58 PCFGs: Notation
- w_1n = w_1 ... w_n = the word sequence from 1 to n (a sentence of length n)
- w_ab = the subsequence w_a ... w_b
- N^j_ab = the nonterminal N^j dominating w_a ... w_b
- We'll write P(N^i -> ζ^j) to mean P(N^i -> ζ^j | N^i)
- We'll want to calculate max_t P(t =>* w_ab)

59 The probability of trees and strings
P(t): the probability of a tree is the product of the probabilities of the rules used to generate it.
P(w_1n): the probability of the string is the sum of the probabilities of the trees which have that string as their yield:
  P(w_1n) = Σ_j P(w_1n, t_j)   where t_j is a parse of w_1n
          = Σ_j P(t_j)

60 A Simple PCFG (in CNF)
  S  -> NP VP   1.0        NP -> NP PP         0.4
  VP -> V NP    0.7        NP -> astronomers   0.1
  VP -> VP PP   0.3        NP -> ears          0.18
  PP -> P NP    1.0        NP -> saw           0.04
  P  -> with    1.0        NP -> stars         0.18
  V  -> saw     1.0        NP -> telescope     0.1

61-62 (The two parse trees, t1 and t2, for the example sentence below.)

63 Tree and String Probabilities
w_15 = astronomers saw stars with ears
P(t1) = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072
P(t2) = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0006804
P(w_15) = P(t1) + P(t2) = 0.0009072 + 0.0006804 = 0.0015876
(These numbers are recomputed in the code sketch after slide 67 below.)

64 Chomsky Normal Form
All rules are of the form X -> Y Z or X -> w.
This makes parsing easier/more efficient.

65 N-ary Trees in Treebank
(Pipeline: Lexicon and Grammar; N-ary Trees in Treebank -> TreeAnnotations.annotateTree -> Binary Trees -> Parsing. TODO: CKY parsing, Treebank binarization.)

66 An example: before binarization
(Tree for "cats scratch people with claws": ROOT -> S, S -> NP VP, with a flat VP -> V NP PP.)

67 After binarization
(The same tree with the flat VP split into binary nodes.)
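Slide 63's arithmetic can be checked mechanically. The sketch below (Python) encodes the slide-60 grammar and the two parses of "astronomers saw stars with ears", and multiplies rule probabilities exactly as slide 59 prescribes; the tuple encoding of trees is an illustrative choice.

  # P(t) as the product of the rule probabilities used in the tree (slide 59),
  # using the PCFG of slide 60.
  RULE_P = {
      ("S", ("NP", "VP")): 1.0,  ("VP", ("V", "NP")): 0.7,
      ("VP", ("VP", "PP")): 0.3, ("PP", ("P", "NP")): 1.0,
      ("P", ("with",)): 1.0,     ("V", ("saw",)): 1.0,
      ("NP", ("NP", "PP")): 0.4, ("NP", ("astronomers",)): 0.1,
      ("NP", ("ears",)): 0.18,   ("NP", ("saw",)): 0.04,
      ("NP", ("stars",)): 0.18,  ("NP", ("telescope",)): 0.1,
  }

  def tree_prob(tree):
      label, *children = tree
      rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
      p = RULE_P[(label, rhs)]
      for c in children:
          if not isinstance(c, str):
              p *= tree_prob(c)
      return p

  # t1 attaches the PP to the NP "stars"; t2 attaches it to the VP.
  t1 = ("S", ("NP", "astronomers"),
             ("VP", ("V", "saw"),
                    ("NP", ("NP", "stars"),
                           ("PP", ("P", "with"), ("NP", "ears")))))
  t2 = ("S", ("NP", "astronomers"),
             ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
                    ("PP", ("P", "with"), ("NP", "ears"))))

  print(tree_prob(t1))                  # ~0.0009072
  print(tree_prob(t2))                  # ~0.0006804
  print(tree_prob(t1) + tree_prob(t2))  # P(w_15) ~0.0015876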
68 The CKY algorithm (1960/1965)
function CKY(words, grammar) returns most probable parse/prob
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A]  = B
            added = true

69 The CKY algorithm (1960/1965), continued
  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = B
            added = true
  return buildTree(score, back)

70 (The chart for "cats scratch walls with claws": one cell score[i][j] per span, from score[0][1] ... score[4][5] up to score[0][5], initially empty.)

71 Filling the span-1 cells with lexical rules:
  for i = 0; i < #(words); i++:  if A -> words[i] in grammar, score[i][i+1][A] = P(A -> words[i])

72 // handle unaries
(The span-1 cells also receive entries such as NP, via unary rules over the lexical categories.)

73 Filling longer spans with binary rules:
  prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
For each A, only keep the A -> B C with the highest probability.

74 // handle unaries
(Unary rules are applied again after each longer cell is filled.)

75-76 (The completed chart, with probabilities such as 1.600E-4 and 5.187E-6 in the larger cells.)
Call buildTree(score, back) to get the best parse.
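The pseudocode on slides 68-69 translates almost line for line into Python. The sketch below is one such translation, using the CNF grammar of slide 60; that grammar has no unary rules over nonterminals, so the "handle unaries" loops are omitted here, and dictionary-based score/back tables stand in for the 3-D arrays (all names are illustrative).

  from collections import defaultdict

  LEX = {   # preterminal rules A -> word, with probabilities (slide 60)
      ("P", "with"): 1.0, ("V", "saw"): 1.0, ("NP", "astronomers"): 0.1,
      ("NP", "ears"): 0.18, ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
      ("NP", "telescope"): 0.1,
  }
  BIN = {   # binary rules A -> B C, with probabilities (slide 60)
      ("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
      ("PP", "P", "NP"): 1.0, ("NP", "NP", "PP"): 0.4,
  }

  def cky(words):
      n = len(words)
      score = defaultdict(float)   # (begin, end, A) -> best probability so far
      back = {}                    # (begin, end, A) -> (split, B, C) or the word
      for i, w in enumerate(words):             # span-1 cells (slide 71)
          for (A, word), p in LEX.items():
              if word == w:
                  score[(i, i + 1, A)] = p
                  back[(i, i + 1, A)] = w
      for span in range(2, n + 1):              # longer spans (slides 69/73)
          for begin in range(n - span + 1):
              end = begin + span
              for split in range(begin + 1, end):
                  for (A, B, C), p in BIN.items():
                      prob = score[(begin, split, B)] * score[(split, end, C)] * p
                      if prob > score[(begin, end, A)]:
                          score[(begin, end, A)] = prob
                          back[(begin, end, A)] = (split, B, C)
      return score, back

  def build_tree(back, begin, end, A):
      entry = back[(begin, end, A)]
      if isinstance(entry, str):
          return (A, entry)
      split, B, C = entry
      return (A, build_tree(back, begin, split, B), build_tree(back, split, end, C))

  words = "astronomers saw stars with ears".split()
  score, back = cky(words)
  print(score[(0, len(words), "S")])        # ~0.0009072, matching slide 63's t1
  print(build_tree(back, 0, len(words), "S"))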