parsing potsdam, 10 may 2012 - hasso plattner instituteparsing potsdam, 10 may 2012 saeedeh momtazi...

55
Natural Language Processing Parsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book

Upload: others

Post on 12-Mar-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Natural Language ProcessingParsing

Potsdam, 10 May 2012

Saeedeh MomtaziInformation Systems Group

based on the slides of the course book

Page 2: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Parsing

Finding structural relationship between words in a sentence

ApplicationsSpell checkingSpeech recognitionMachine translationLanguage modeling

Saeedeh Momtazi | NLP | 10.05.2012

2

Page 3: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Outline

1 Phrase Structure

2 Syntactic ParsingCKY Algorithm

3 Statistical Parsing

Saeedeh Momtazi | NLP | 10.05.2012

3

Page 4: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Outline

1 Phrase Structure

2 Syntactic ParsingCKY Algorithm

3 Statistical Parsing

Saeedeh Momtazi | NLP | 10.05.2012

4

Page 5: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Constituency

Working based on Constituency (Phrase structure)

Organizing words into nested constituentsShowing that groups of words within utterances can act as singleunitsForming coherent classes from these units that can behave insimilar ways

With respect to their internal structureWith respect to other units in the language

Considering a head word for each constituent

Saeedeh Momtazi | NLP | 10.05.2012

5

Page 6: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Constituency

the writer talked to the audiences about his new book.

the writer talked about his new book to the audiences. 4

about his new book the writer talked to the audiences. 4

the writer talked book to the audiences about his new. 7

Saeedeh Momtazi | NLP | 10.05.2012

6

Page 7: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Context Free Grammar (CFG)

Grammar G consists of

Terminals (T )Non-terminals (N)Start symbol (S)Rules (R)

Saeedeh Momtazi | NLP | 10.05.2012

7

Page 8: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CFG

TerminalsThe set of words in the text

Non-TerminalsThe constituents in a language(noun phrase, verb phrase, ....)

Start symbolThe main constituent of the language(sentence)

RulesEquations that consist of a single non-terminal on the left andany number of terminals and non-terminals on the right

Saeedeh Momtazi | NLP | 10.05.2012

8

Page 9: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CFGS → NP VPS → VPNP → NNP → Det NNP → NP NPNP → NP PPVP → VVP → VP PPVP → VP NPPP → Prep NP

N → bookV → bookDet → theN → flightPrep → throughN → Houston

Saeedeh Momtazi | NLP | 10.05.2012

9

Page 10: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CFG

Saeedeh Momtazi | NLP | 10.05.2012

10

Page 11: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CFG

Saeedeh Momtazi | NLP | 10.05.2012

11

Page 12: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Outline

1 Phrase Structure

2 Syntactic ParsingCKY Algorithm

3 Statistical Parsing

Saeedeh Momtazi | NLP | 10.05.2012

12

Page 13: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Parsing

Taking a string and a grammar and returning proper parsetree(s) for that string

Covering all and only the elements of the input string

Reaching the start symbol at the top of the string

The system cannot select the correct tree among all thepossible trees

Saeedeh Momtazi | NLP | 10.05.2012

13

Page 14: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Main Grammar Fragments

Sentence

Noun PhraseAgreement

Verb PhraseSub-categorization

Saeedeh Momtazi | NLP | 10.05.2012

14

Page 15: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Grammar Fragments: Sentence

DeclarativesA plane left.S → NP VP

ImperativesLeave!S → VP

Yes-No QuestionsDid the plane leave?S → Aux NP VP

WH QuestionsWhen did the plane leave?S → NPWH Aux NP VP

Saeedeh Momtazi | NLP | 10.05.2012

15

Page 16: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Grammar Fragments: NP

Each NP has a central critical noun called head

The head of an NP can be expressed usingPre-nominals: the words that can come before the headPost-nominals: the words that can come after the head

Saeedeh Momtazi | NLP | 10.05.2012

16

Page 17: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Grammar Fragments: NP

Pre-nominalsSimple lexical items: the, this, a, an, ...a car

Simple possessivesJohn’s car

Complex recursive possessivesJohn’s sister’s friend’s car

Quantifiers, cardinals, ordinals...three cars

Adjectiveslarge cars

Saeedeh Momtazi | NLP | 10.05.2012

17

Page 18: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Grammar Fragments: NP

Post-nominalsPrepositional phrasesflight from Seattle

Non-finite clausesflight arriving before noon

Relative clausesflight that serves breakfast

Saeedeh Momtazi | NLP | 10.05.2012

18

Page 19: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Agreement

Having constraints that hold among various constituentsConsidering these constraints in a rule or set of rules

Example: determiners and the head nouns in NPs have toagree in number

This flight 4

Those flights 4

This flights 7

Those flight 7

Grammars that do not consider constraints will over-generateAccepting and assigning correct structures to grammatical examples (this

flight)

But also accepting incorrect examples (these flight)

Saeedeh Momtazi | NLP | 10.05.2012

19

Page 20: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Agreement at sentence level

Considering similar constraints at sentence level

Example: subject and verb in sentences have to agree in numberand person

John flies 4

We fly 4

John fly 7

We flies 7

Saeedeh Momtazi | NLP | 10.05.2012

20

Page 21: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Agreement

Possible CFG solution

Ssg → NPsg VPsg

Spl → NPpl VPpl

NPsg → Detsg Nsg

NPpl → Detpl Npl

VPsg → Vsg NPsg

VPpl → Vpl NPpl

...

Shortcoming:Introducing many rules in the system

Saeedeh Momtazi | NLP | 10.05.2012

21

Page 22: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Grammar Fragments: VP

VPs consist of a head verb along with zero or moreconstituents called arguments

VP → VVP → V NPVP → V PPVP → V NP PPVP → V NP NP

disappearprefer a morning flightfly on Thursdayleave Boston in the morninggive me the flight number

ArgumentsObligatory: complementOptional: adjunct

Saeedeh Momtazi | NLP | 10.05.2012

22

Page 23: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Sub-categorization

Even though there are many valid VP rules, not all verbs areallowed to participate in all VP rules

disappear a morning flight 7

Solution:Subcategorizing the verbs according to the sets of VP rules thatthey can participate inThis is a modern take on the traditional notion oftransitive/intransitiveModern grammars may have 100s or such classes

Saeedeh Momtazi | NLP | 10.05.2012

23

Page 24: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Sub-categorization

Example:

SneezeFindGiveHelpPreferTold

John sneezedPlease find [a flight to NY]NP

Give [me]NP[a cheaper fair]NP

Can you help [me]NP[with a flight]PP

I prefer [to leave earlier]TO-VP

I was told [United has a flight]S

John sneezed the book 7

I prefer United has a flight 7

Give with a flight 7

Saeedeh Momtazi | NLP | 10.05.2012

24

Page 25: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Sub-categorization

The over-generation problem also exists in VP rulesPermitting the presence of strings containing verbs andarguments that do not go together

John sneezed the bookVP → V NP

Solution:Similar to agreement phenomena, we need a way to formallyexpress the constraints

Saeedeh Momtazi | NLP | 10.05.2012

25

Page 26: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Parsing Algorithms

Top-DownStarting with the rules that give us an S, since trees should berooted with an SWorking on the way down from S to the words

Bottom-UpStarting with trees that link up with the words, since trees shouldcover the input wordsWorking on the way up from words to larger and larger trees

Saeedeh Momtazi | NLP | 10.05.2012

26

Page 27: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Top-Down vs. Bottom-Up

Top-DownOnly searches for trees that can be answers (i.e. S’s)But also suggests trees that are not consistent with any of thewords

Bottom-UpOnly forms trees consistent with the wordsBut suggests trees that make no sense globally

Saeedeh Momtazi | NLP | 10.05.2012

27

Page 28: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Top-Down vs. Bottom-Up

In both cases, we left out how to keep track of the search spaceand how to make choices

SolutionsBacktracking

Making a choice, if it works out then fineIf not, then back up and make a different choice⇒ duplicated work

Dynamic programmingAvoiding repeated workSolving exponential problems in polynomial timeStoring ambiguous structures efficiently

Saeedeh Momtazi | NLP | 10.05.2012

28

Page 29: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Dynamic Programming Methods

CKY: bottom-upEarly: top-down

Saeedeh Momtazi | NLP | 10.05.2012

29

Page 30: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Outline

1 Phrase Structure

2 Syntactic ParsingCKY Algorithm

3 Statistical Parsing

Saeedeh Momtazi | NLP | 10.05.2012

30

Page 31: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Chomsky Normal Form

Each grammar can be represented by a set of binary rules

A→ B C

A→ w

A, B, C are noun-terminals w is a terminal

Saeedeh Momtazi | NLP | 10.05.2012

31

Page 32: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Chomsky Normal Form

Converting to Chomsky normal form

A→ B C D

X → B C

A→ X D

X does not occur anywhere else in the the grammar

Saeedeh Momtazi | NLP | 10.05.2012

32

Page 33: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Chomsky Normal Form

Converting to Chomsky normal form

A→ B

B → C D

A→ C D

Saeedeh Momtazi | NLP | 10.05.2012

33

Page 34: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY Parsing

A→ B C

If there is an A somewhere in the input, then there must be a Bfollowed by a C in the input

If the A spans from i to j in the input, then there must be a ksuch that i < k < j

B spans from i to kC spans from k to j

Saeedeh Momtazi | NLP | 10.05.2012

34

Page 35: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY Parsing

[0,1] [0,2] [0,3] [0,4] [0,5]

[1,2] [1,3] [1,4] [1,5]

[2,3] [2,4] [2,5]

[3,4] [3,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

35

Page 36: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

[0,1] [0,2] [0,3] [0,4] [0,5]

Det → the[1,2]

[1,2] [1,3] [1,4] [1,5]

N → flight[2,3]

[2,3] [2,4] [2,5]

Prep → through[3,4]

[3,4] [3,5]

N → houston[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

36

Page 37: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2] [0,3] [0,4] [0,5]

Det → the[1,2]

[1,2] [1,3] [1,4] [1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4] [2,5]

Prep → through[3,4]

[3,4] [3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

37

Page 38: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2] [0,3] [0,4] [0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

[1,3] [1,4] [1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4] [2,5]

Prep → through[3,4]

[3,4] [3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

38

Page 39: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2] [0,3] [0,4] [0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

[1,3] [1,4] [1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4] [2,5]

Prep → through[3,4]

[3,4]

PP→Prep[3,4],NP[4,5]

[3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

39

Page 40: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2]

NP → NP[0,1], NP[1,3]

VP → VP[0,1], NP[1,3]

S → VP[0,3]

[0,3] [0,4] [0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

[1,3] [1,4] [1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4] [2,5]

Prep → through[3,4]

[3,4]

PP→Prep[3,4],NP[4,5]

[3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

40

Page 41: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2]

NP → NP[0,1], NP[1,3]

VP → VP[0,1], NP[1,3]

S → VP[0,3]

[0,3] [0,4] [0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

[1,3] [1,4] [1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4]

NP → NP[2,3], PP[3,5]

[2,5]

Prep → through[3,4]

[3,4]

PP→Prep[3,4],NP[4,5]

[3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

41

Page 42: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2]

NP → NP[0,1], NP[1,3]

VP → VP[0,1], NP[1,3]

S → VP[0,3]

[0,3] [0,4] [0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

[1,3] [1,4]

NP → NP[1,3], PP[3,5]

[1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4]

NP → NP[2,3], PP[3,5]

[2,5]

Prep → through[3,4]

[3,4]

PP→Prep[3,4],NP[4,5]

[3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

Saeedeh Momtazi | NLP | 10.05.2012

42

Page 43: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1] [0,2]

NP → NP[0,1], NP[1,3]

VP → VP[0,1], NP[1,3]

S → VP[0,3]

[0,3] [0,4]

VP → VP[0,1], NP[1,5]

VP' → VP[0,3], PP[3,5]

S → VP[0,5]

S → VP'[0,5]

[0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

[1,3] [1,4]

NP → NP[1,3], PP[3,5]

[1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4]

NP → NP[2,3], PP[3,5]

[2,5]

Prep → through[3,4]

[3,4]

PP→Prep[3,4],NP[4,5]

[3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

ambiguity

Saeedeh Momtazi | NLP | 10.05.2012

43

Page 44: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Outline

1 Phrase Structure

2 Syntactic ParsingCKY Algorithm

3 Statistical Parsing

Saeedeh Momtazi | NLP | 10.05.2012

44

Page 45: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Probabilistic Context FreeGrammar (PCFG)

Grammar G consists of

Terminals (T )Non-terminals (N)Start symbol (S)Rules (R)Probability function (P)

P : R → [0, 1]∀X ∈ N,

∑X→λ∈R P(X → λ) = 1

Saeedeh Momtazi | NLP | 10.05.2012

45

Page 46: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

CFGS → NP VPS → VPNP → NNP → Det NNP → NP NPNP → NP PPVP → VVP → VP PPVP → VP NPPP → Prep NP

N → bookV → bookDet → theN → flightPrep → throughN → Houston

Saeedeh Momtazi | NLP | 10.05.2012

46

Page 47: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

PCFGS → NP VPS → VPNP → NNP → Det NNP → NP NPNP → NP PPVP → VVP → VP PPVP → VP NPPP → Prep NP

0.90.10.30.40.10.20.10.30.61.0

N → bookV → bookDet → theN → flightPrep → throughN → Houston

0.51.01.00.41.00.1

Saeedeh Momtazi | NLP | 10.05.2012

47

Page 48: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Treebank

A treebank is a corpus in which each sentence has been pairedwith a parse tree

These are generally created byParsing the collection with an automatic parserCorrecting each parse by human annotators if required

Requirement:detailed annotation guidelines that provide

A POS tagsetA grammarAnnotation schema

Instructions for how to deal with particular grammaticalconstructions

Saeedeh Momtazi | NLP | 10.05.2012

48

Page 49: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Penn Treebank

Penn Treebank is a widely used treebank for EnglishMost well-known section: Wall Street Journal Section

1 M words from 1987-1989

(S (NP (NNP John))(VP (VPZ flies)

(PP (IN to)(NNP Paris)))

(. .))

Saeedeh Momtazi | NLP | 10.05.2012

49

Page 50: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Statistical Parsing

Considering the corresponding probabilities while parsing asentenceSelecting the parse tree which has the highest probability

Tree and string probabilitiesP(t): the probability of a tree t

Product of the probabilities of the rules used to generate the tree

P(s): the probability of a string sSum of the probabilities of the trees which created to parse thestring

Saeedeh Momtazi | NLP | 10.05.2012

50

Page 51: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

PCFGS → NP VPS → VPNP → NNP → Det NNP → NP NPNP → NP PPVP → VVP → VP PPVP → VP NPPP → Prep NP

0.90.10.30.40.10.20.10.30.61.0

N → bookV → bookDet → theN → flightPrep → throughN → Houston

0.51.01.00.41.00.1

Saeedeh Momtazi | NLP | 10.05.2012

51

Page 52: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Statistical Parsing

P(t) = 0.2× 0.4× 1.0× 1.0× 0.4× 1.0× 0.1 = 0.0032

Saeedeh Momtazi | NLP | 10.05.2012

52

Page 53: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Probabilistic CKY ParsingN → book[0,1]

V → book[0,1]

NP → N[0,1]

VP → V[0,1]

S → VP[0,1]

[0,1]

0.1*0.15*0.16=0.00240.6*0.1*0.16=0.00960.1*0.0096=0.00096

[0,2]

NP → NP[0,1], NP[1,3]

VP → VP[0,1], NP[1,3]

S → VP[0,3]

[0,3] [0,4]

VP → VP[0,1], NP[1,5]

VP' → VP[0,3], PP[3,5]

S → VP[0,5]

S → VP'[0,5]

[0,5]

Det → the[1,2]

[1,2]

NP → Det[1,2], N[2,3]

0.4*1.0*0.4=0.16

[1,3] [1,4]

NP → NP[1,3], PP[3,5]

[1,5]

N → flight[2,3]

NP → N[2,3]

[2,3] [2,4]

NP → NP[2,3], PP[3,5]

[2,5]

Prep → through[3,4]

[3,4]

PP→Prep[3,4],NP[4,5]

[3,5]

N → houston[4,5]

NP → N[4,5]

[4,5]

Book the flight through Houston0 1 2 3 4 5

0.51.0

0.3*0.5=0.150.1*1.0=0.1

0.1*0.1=0.01

1.0

0.40.3*0.4=0.12

Saeedeh Momtazi | NLP | 10.05.2012

53

Page 54: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Exercise

Implement the probabilistic CKY algorithm which works basedon the grammar rules R.

Saeedeh Momtazi | NLP | 10.05.2012

54

Page 55: Parsing Potsdam, 10 May 2012 - Hasso Plattner InstituteParsing Potsdam, 10 May 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book ... 0.9 0.1 0.3

Further Reading

Speech and Language ProcessingChapters 12, 13, 14, 15

Saeedeh Momtazi | NLP | 10.05.2012

55