CS60057 Speech & Natural Language Processing
Autumn 2007, Lecture 16, 5 September 2007

Page 1: CS60057 Speech &Natural Language Processing

CS60057 Speech & Natural Language Processing

Autumn 2007

Lecture 16

5 September 2007

Page 2: CS60057 Speech &Natural Language Processing

Parsing with features

We need to constrain the rules in CFGs, for example:
- to coerce agreement within and between constituents
- to pass features around
- to enforce subcategorisation constraints

Features can be easily added to our grammars. Later we'll see that feature bundles can completely replace constituents.

Page 3: CS60057 Speech &Natural Language Processing

Parsing with features

Rules can stipulate values, or placeholders (variables) for values. Features can be used within the rule, or passed up via the mother nodes.

Example: subject-verb agreement

S → NP VP  [if NP and VP agree in number]
  - the number of the NP depends on the noun and/or determiner
  - the number of the VP depends on the verb

S → NP(num=X) VP(num=X)
NP(num=X) → det(num=X) n(num=X)
VP(num=X) → v(num=X) NP(num=?)
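A minimal sketch (not the course's code) of how such num variables can be checked by unification of flat feature dictionaries; the unify_flat helper, the use of None for an unbound variable, and the build_s wrapper are illustrative assumptions:

# Minimal sketch: checking subject-verb agreement by unifying flat num
# features, where None plays the role of an unbound variable such as X or "?".

def unify_flat(f1, f2):
    """Unify two flat feature dicts; None acts as an unbound value."""
    result = dict(f1)
    for attr, val in f2.items():
        if attr not in result or result[attr] is None:
            result[attr] = val                 # bind the open value
        elif val is not None and result[attr] != val:
            return None                        # clash: unification fails
    return result

# S -> NP(num=X) VP(num=X): the rule succeeds only if the two nums unify.
def build_s(np_feats, vp_feats):
    return unify_flat(np_feats, vp_feats)

print(build_s({"num": "sg"}, {"num": "sg"}))   # {'num': 'sg'}  -> S is built
print(build_s({"num": "sg"}, {"num": "pl"}))   # None           -> blocked
print(build_s({"num": None}, {"num": "pl"}))   # {'num': 'pl'}  -> variable bound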

Page 4: CS60057 Speech &Natural Language Processing

Declarative nature of features

The rules can be used in various ways:
- to build an NP only if det and n agree (bottom-up)
- when generating an NP, to choose an n which agrees with the det (if working left-to-right) (top-down)
- to show that the num value for an NP comes from its components (percolation)
- to ensure that the num value is correctly set when generating an NP (inheritance)
- to block ill-formed input

NP(num=X) → det(num=X) n(num=X)

this → det(num=sg)    these → det(num=pl)    the → det(num=?)
man → n(num=sg)       men → n(num=pl)

[Tree on slide: det(num=sg) + n(num=sg) for "this man" builds NP(num=sg); combining "this" with "men" (n(num=pl)) is blocked by the number mismatch.]

Page 5: CS60057 Speech &Natural Language Processing

Use of variables

Unbound (unassigned) variables (i.e. variables with a free value): "the" can combine with any value for num. Unification means that the num value for "the" is set to sg.

NP(num=X) → det(num=X) n(num=X)

this → det(num=sg)    these → det(num=pl)    the → det(num=?)
man → n(num=sg)       men → n(num=pl)

[Tree on slide: det(num=?) + n(num=sg) for "the man" unifies to NP(num=sg).]

Page 6: CS60057 Speech &Natural Language Processing

Parsing with features

Features must be compatible. The formalism should allow features to remain unspecified. Feature mismatch can be used to block false analyses and to disambiguate, e.g. they can fish ~ he can fish ~ he cans fish.

The formalism may use attribute-value pairs, or rely on argument position, e.g.:

NP(_num,_sem) → det(_num) n(_num,_sem)
an  = det(sing)
the = det(_num)
man = n(sing,hum)

Page 7: CS60057 Speech &Natural Language Processing

Parsing with features

Using features to impose subcategorization constraints:

VP → v          e.g. dance
VP → v NP       e.g. eat
VP → v NP NP    e.g. give
VP → v PP       e.g. wait (for)

VP(_num) → v(_num,intr)
VP(_num) → v(_num,trans) NP
VP(_num) → v(_num,ditrans) NP NP
VP(_num) → v(_num,prepobj(_case)) PP(_case)
PP(_case) → prep(_case) NP

dance  = v(plur,intr)
dances = v(sing,intr)
danced = v(_num,intr)
waits  = v(sing,prepobj(for))
for    = prep(for)

Page 8: CS60057 Speech &Natural Language Processing

Parsing with features (top-down)

Grammar:
S → NP(_num) VP(_num)
NP(_num) → det(_num) n(_num)
VP(_num) → v(_num,intrans)
VP(_num) → v(_num,trans) NP(_1)

Input: the man shot those elephants

[Derivation sketched on the slide:]
- Expand S → NP(_num) VP(_num), then NP(_num) → det(_num) n(_num).
- the = det(_num), man = n(sing), so _num is bound to sing.
- Try VP(sing) → v(sing,intrans): fails, since shot = v(sing,trans).
- Try VP(sing) → v(sing,trans) NP(_1): shot matches; expand NP(_1) → det(_1) n(_1).
- those = det(pl), elephants = n(pl), so _1 is bound to pl.

Page 9: CS60057 Speech &Natural Language Processing

Feature structures

Instead of attaching features to the symbols, we can parse with symbols made up entirely of attribute-value pairs: "feature structures".

They can be used in the same way as seen previously. Values can be atomic ... or embedded feature structures.

[ATTR1 VAL1, ATTR2 VAL2, ATTR3 VAL3]

[CAT NP, NUMBER SG, PERSON 3]

[CAT NP, AGR [NUM SG, PERS 3]]

Page 10: CS60057 Speech &Natural Language Processing


Unification

Probabilistic CFG


Page 11: CS60057 Speech &Natural Language Processing


Feature Structures

A set of feature-value pairs

No feature occurs in more than one feature-value pair

(a partial function from features to values)

Circular structures are prohibited.

Page 12: CS60057 Speech &Natural Language Processing

Structured Feature Structure

Part of a third-person singular NP:

[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]

Page 13: CS60057 Speech &Natural Language Processing

Reentrant Feature Structure

Two features can share a feature structure as their value. This is not the same thing as the two features having equivalent values!

Two distinct feature structure values:

[HEAD [AGR [NUM SG, PERS 3]],
 SUBJ [AGR [NUM SG, PERS 3]]]

One shared value (reentrant feature structure):

[HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]

Page 14: CS60057 Speech &Natural Language Processing

They can be coindexed:

[CAT S,
 HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]

Page 15: CS60057 Speech &Natural Language Processing

Parsing with feature structures

Grammar rules can specify assignments to, or equations between, feature structures.

These are expressed as "feature paths", e.g. HEAD.AGR.NUM = SG

[CAT S,
 HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]

Page 16: CS60057 Speech &Natural Language Processing

Subsumption (⊑)

A (partial) ordering of feature structures, based on relative specificity.

A structure that carries less information is more general: it subsumes the more specific one.

Page 17: CS60057 Speech &Natural Language Processing

Subsumption

A more abstract (less specific) feature structure subsumes an equally or more specific one.

A feature structure F subsumes a feature structure G (F ⊑ G) if and only if:
- For every feature x in F, F(x) ⊑ G(x) (where F(x) means the value of the feature x of the feature structure F).
- For all paths p and q in F such that F(p) = F(q), it is also the case that G(p) = G(q).

An atomic feature structure neither subsumes nor is subsumed by another atomic feature structure. Variables subsume all other feature structures. Informally, a feature structure F subsumes a feature structure G (F ⊑ G) if all parts of F subsume the corresponding parts of G.
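An illustrative sketch of this check for feature structures encoded as nested dicts with string-valued atoms; the reentrancy clause (shared paths) is deliberately ignored to keep the example short, and the encoding itself is an assumption:

# Sketch only: subsumption between feature structures as nested dicts.
def subsumes(f, g):
    """Return True if F subsumes G (F is at most as specific as G)."""
    if isinstance(f, str) or isinstance(g, str):
        return f == g                      # atomic values must match exactly
    for feat, f_val in f.items():
        if feat not in g or not subsumes(f_val, g[feat]):
            return False                   # G must contain everything F has
    return True

f1 = {"NUMBER": "SG"}
f3 = {"NUMBER": "SG", "PERSON": "3"}
print(subsumes(f1, f3))   # True:  [NUMBER SG] subsumes the more specific structure
print(subsumes(f3, f1))   # False: the more specific structure does not subsume it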

Page 18: CS60057 Speech &Natural Language Processing

Subsumption Example

Consider the following feature structures:

(1) [PERSON 3]

(2) [NUMBER SG]

(3) [NUMBER SG, PERSON 3]

Then (1) ⊑ (3) and (2) ⊑ (3), but there is no subsumption relation between (1) and (2).

Page 19: CS60057 Speech &Natural Language Processing

Feature Structures in the Grammar

We will incorporate feature structures and the unification process as follows:
- All constituents (non-terminals) will be associated with feature structures.
- Sets of unification constraints will be associated with grammar rules, and these constraints must be satisfied for the rule to be applied.

These attachments accomplish the following goals:
- To associate feature structures with both lexical items and instances of grammatical categories.
- To guide the composition of feature structures for larger grammatical constituents based on the feature structures of their component parts.
- To enforce compatibility constraints between specified parts of grammatical constituents.

Page 20: CS60057 Speech &Natural Language Processing

Feature unification

Feature structures can be unified if:
- They have like-named attributes that have the same value:
  [NUM SG] ⊔ [NUM SG] = [NUM SG]
- Like-named attributes that are "open" get the value assigned:
  [CAT NP, NUMBER ??, PERSON 3] ⊔ [NUMBER SG, PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]

Page 21: CS60057 Speech &Natural Language Processing

Feature unification

Complementary features are brought together. Unification is recursive. Coindexed structures are identical (not just copies): an assignment to one affects all.

[CAT NP, NUMBER SG] ⊔ [PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]

[CAT NP, AGR [NUM SG]] ⊔ [CAT NP, AGR [PERS 3]] = [CAT NP, AGR [NUM SG, PERS 3]]
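A sketch of recursive unification for feature structures encoded as nested dicts (an assumed encoding, not the lecture's code); it returns a new structure or None on failure, and does not model coindexing/structure sharing:

# Illustrative recursive unification over nested dicts.
def unify(f1, f2):
    if isinstance(f1, str) or isinstance(f2, str):
        return f1 if f1 == f2 else None        # atomic values must be identical
    result = dict(f1)
    for feat, val in f2.items():
        if feat in result:
            sub = unify(result[feat], val)     # recurse on shared features
            if sub is None:
                return None                    # clash somewhere below: fail
            result[feat] = sub
        else:
            result[feat] = val                 # complementary feature: copy over
    return result

a = {"CAT": "NP", "AGR": {"NUM": "SG"}}
b = {"CAT": "NP", "AGR": {"PERS": "3"}}
print(unify(a, b))   # {'CAT': 'NP', 'AGR': {'NUM': 'SG', 'PERS': '3'}}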

Page 22: CS60057 Speech &Natural Language Processing

Example

Rule: [CAT NP, AGR _1 ⊔ _2, SEM _3] → [CAT DET, AGR _1]  [CAT N, AGR _2, SEM _3]

Lexicon:
a:   [CAT DET, AGR [VAL INDEF, NUM SG]]
the: [CAT DET, AGR [VAL DEF]]
man: [CAT N, LEX "man", AGR [NUM SG], SEM HUM]

Page 23: CS60057 Speech &Natural Language Processing

the man

the: [CAT DET, AGR _1 = [VAL DEF]]
man: [CAT N, LEX "man", AGR _2 = [NUM SG], SEM _3 = HUM]

Unifying _1 and _2 for the NP's AGR gives [VAL DEF, NUM SG].

Result: [CAT NP, AGR [VAL DEF, NUM SG], SEM HUM]

Page 24: CS60057 Speech &Natural Language Processing

a man

a:   [CAT DET, AGR _1 = [VAL INDEF, NUM SG]]
man: [CAT N, LEX "man", AGR _2 = [NUM SG], SEM _3 = HUM]

Unifying _1 and _2 for the NP's AGR gives [VAL INDEF, NUM SG].

Result: [CAT NP, AGR [VAL INDEF, NUM SG], SEM HUM]

Page 25: CS60057 Speech &Natural Language Processing

Types and inheritance

Feature typing allows us to constrain the possible values a feature can have, e.g. num = {sing, plur}. It allows grammars to be checked for consistency, and can make parsing easier.

We can express general "feature co-occurrence conditions" and "feature inheritance rules". Both of these allow us to make the grammar more compact.

Page 26: CS60057 Speech &Natural Language Processing

Co-occurrence conditions and inheritance rules

General rules (beyond simple unification) which apply automatically, and so do not need to be stated (and repeated) in each rule or lexical entry.

Examples:

[cat=np] ⇒ [num=??, gen=??, case=??]
[cat=v, num=sg] ⇒ [tns=pres]
[attr1=val1] ⇒ [attr2=val2]

Page 27: CS60057 Speech &Natural Language Processing

Inheritance rules

Inheritance rules can be overridden, e.g.:

[cat=n] ⇒ [gen=??, sem=??]

sex = {male, female}
gen = {masc, fem, neut}

[cat=n, gen=fem, sem=hum] ⇒ [sex=female]

uxor     → [cat=n, gen=fem, sem=hum]
agricola → [cat=n, gen=fem, sem=hum, sex=male]

Page 28: CS60057 Speech &Natural Language Processing

Unification in Linguistics

- Lexical Functional Grammar (if interested, see the PARGRAM project)
- GPSG, HPSG
- Construction Grammar
- Categorial Grammar

Page 29: CS60057 Speech &Natural Language Processing

Unification

Joining the contents of two feature structures into one new structure (the union of the two originals).

The result is the most general feature structure that is subsumed by both.

The unification of two contradictory feature structures is undefined (unification fails).

Page 30: CS60057 Speech &Natural Language Processing

Unification Constraints

Each grammar rule will be associated with a set of unification constraints.

β0 → β1 ... βn   {set of unification constraints}

Each unification constraint will be in one of the following forms:

<βi feature path> = Atomic value
<βi feature path> = <βj feature path>

Page 31: CS60057 Speech &Natural Language Processing

Unification Constraints -- Example

For example, the rule

S → NP VP

(only if the number of the NP is equal to the number of the VP)

will be represented as follows:

S → NP VP
<NP NUMBER> = <VP NUMBER>
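One way to picture such an annotated rule in code (a hypothetical encoding for illustration, not the representation used later in the Earley section): store each constraint as a pair of feature paths and check it against the daughters' feature structures.

# Hypothetical encoding of "S -> NP VP, <NP NUMBER> = <VP NUMBER>" as data.
def follow(path, fs):
    """Follow a feature path (tuple of feature names) inside a nested dict."""
    for feat in path:
        fs = fs.get(feat)
        if fs is None:
            return None
    return fs

rule = {
    "lhs": "S",
    "rhs": ["NP", "VP"],
    "constraints": [(("NP", "NUMBER"), ("VP", "NUMBER"))],
}

def constraints_ok(rule, daughters):
    # daughters maps each RHS category to its feature structure
    for path1, path2 in rule["constraints"]:
        v1 = follow(path1[1:], daughters[path1[0]])
        v2 = follow(path2[1:], daughters[path2[0]])
        if v1 is not None and v2 is not None and v1 != v2:
            return False                   # number mismatch blocks the rule
    return True

print(constraints_ok(rule, {"NP": {"NUMBER": "SG"}, "VP": {"NUMBER": "SG"}}))  # True
print(constraints_ok(rule, {"NP": {"NUMBER": "PL"}, "VP": {"NUMBER": "SG"}}))  # False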

Page 32: CS60057 Speech &Natural Language Processing

Agreement Constraints

S → NP VP
  <NP NUMBER> = <VP NUMBER>

S → Aux NP VP
  <Aux AGREEMENT> = <NP AGREEMENT>

NP → Det NOMINAL
  <Det AGREEMENT> = <NOMINAL AGREEMENT>
  <NP AGREEMENT> = <NOMINAL AGREEMENT>

NOMINAL → Noun
  <NOMINAL AGREEMENT> = <Noun AGREEMENT>

VP → Verb NP
  <VP AGREEMENT> = <Verb AGREEMENT>

Page 33: CS60057 Speech &Natural Language Processing

Agreement Constraints -- Lexicon Entries

Aux → does      <Aux AGREEMENT NUMBER> = SG
                <Aux AGREEMENT PERSON> = 3
Aux → do        <Aux AGREEMENT NUMBER> = PL
Det → these     <Det AGREEMENT NUMBER> = PL
Det → this      <Det AGREEMENT NUMBER> = SG
Verb → serves   <Verb AGREEMENT NUMBER> = SG
                <Verb AGREEMENT PERSON> = 3
Verb → serve    <Verb AGREEMENT NUMBER> = PL
Noun → flights  <Noun AGREEMENT NUMBER> = PL
Noun → flight   <Noun AGREEMENT NUMBER> = SG

Page 34: CS60057 Speech &Natural Language Processing

Head Features

Certain features are copied from children to parent in feature structures. Example: the AGREEMENT feature in NOMINAL is copied into NP.

The features for most grammatical categories are copied from one of the children to the parent. The child that provides the features is called the head of the phrase, and the features copied are referred to as head features. A verb is the head of a verb phrase, and a nominal is the head of a noun phrase.

We may reflect these constructs in feature structures as follows:

NP → Det NOMINAL
  <Det HEAD AGREEMENT> = <NOMINAL HEAD AGREEMENT>
  <NP HEAD> = <NOMINAL HEAD>

VP → Verb NP
  <VP HEAD> = <Verb HEAD>

Page 35: CS60057 Speech &Natural Language Processing

Subcategorization Constraints

For verb phrases, we can represent subcategorization constraints using three techniques:
- Atomic Subcat symbols
- Encoding Subcat lists as feature structures
- Minimal Rule Approach (using lists directly)

We may use any of these representations.

Page 36: CS60057 Speech &Natural Language Processing

Atomic Subcat Symbols

VP → Verb
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = INTRANS

VP → Verb NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = TRANS

VP → Verb NP NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = DITRANS

Verb → slept   <Verb HEAD SUBCAT> = INTRANS
Verb → served  <Verb HEAD SUBCAT> = TRANS
Verb → gave    <Verb HEAD SUBCAT> = DITRANS

Page 37: CS60057 Speech &Natural Language Processing

Encoding Subcat Lists as Features

Verb → gave
  <Verb HEAD SUBCAT FIRST CAT> = NP
  <Verb HEAD SUBCAT SECOND CAT> = NP
  <Verb HEAD SUBCAT THIRD> = END

VP → Verb NP NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT FIRST CAT> = <NP CAT>
  <VP HEAD SUBCAT SECOND CAT> = <NP CAT>
  <VP HEAD SUBCAT THIRD> = END

We are only encoding lists using positional features.

Page 38: CS60057 Speech &Natural Language Processing

Minimal Rule Approach

In fact, we do not need symbols like SECOND and THIRD; they are just used to encode lists. We can use lists directly (similar to LISP):

<SUBCAT FIRST CAT> = NP
<SUBCAT REST FIRST CAT> = NP
<SUBCAT REST REST> = END

Page 39: CS60057 Speech &Natural Language Processing

Subcategorization Frames for Lexical Entries

We can use two different notations to represent subcategorization frames for lexical entries (verbs).

Verb → want
  <Verb HEAD SUBCAT FIRST CAT> = NP

Verb → want
  <Verb HEAD SUBCAT FIRST CAT> = VP
  <Verb HEAD SUBCAT FIRST FORM> = INFINITIVE

Equivalently, as a feature structure:

[ORTH WANT,
 CAT VERB,
 HEAD [SUBCAT < [CAT NP], [CAT VP, HEAD [VFORM INFINITIVE]] >]]

Page 40: CS60057 Speech &Natural Language Processing

Implementing Unification

The representation we have used cannot facilitate the destructive merger aspect of the unification algorithm. For this reason, we add additional features (additional edges to the DAGs) into our feature structures.

Each feature structure will consist of two fields:
- Content field -- This field can be NULL or may contain an ordinary feature structure.
- Pointer field -- This field can be NULL or may contain a pointer to another feature structure.

If the pointer field of a DAG is NULL, the content field of the DAG contains the actual feature structure to be processed. If the pointer field of a DAG is not NULL, the destination of that pointer represents the actual feature structure to be processed.

Page 41: CS60057 Speech &Natural Language Processing

Extended Feature Structures

The ordinary feature structure

[NUMBER SG, PERSON 3]

is represented in extended form as

[CONTENT [NUMBER [CONTENT SG, POINTER NULL],
          PERSON [CONTENT 3,  POINTER NULL]],
 POINTER NULL]

Page 42: CS60057 Speech &Natural Language Processing

Extended DAG

[Figure: the same structure drawn as a DAG with content (C) and pointer (P) edges; the NUMBER node's content is SG, the PERSON node's content is 3, and all pointer fields are NULL.]

Page 43: CS60057 Speech &Natural Language Processing

Unification of Extended DAGs

[NUMBER SG] ⊔ [PERSON 3] = [NUMBER SG, PERSON 3]

[Figure: the two argument structures drawn as extended DAGs, each with content (C) and pointer (P) fields, before unification.]

Page 44: CS60057 Speech &Natural Language Processing

Unification of Extended DAGs (cont.)

[Figure: after unification, the pointer field of one DAG is set to point at the other and a new PERSON edge is added, so both DAGs now dereference to [NUMBER SG, PERSON 3].]

Page 45: CS60057 Speech &Natural Language Processing

Unification Algorithm

function UNIFY(f1, f2) returns fstructure or failure

  f1real ← real contents of f1   /* dereference f1 */
  f2real ← real contents of f2   /* dereference f2 */

  if f1real is Null then { f1.pointer ← f2; return f2 }
  else if f2real is Null then { f2.pointer ← f1; return f1 }
  else if f1real and f2real are identical then { f1.pointer ← f2; return f2 }
  else if f1real and f2real are complex feature structures then {
      f2.pointer ← f1
      for each feature in f2real do {
          otherfeature ← find or create a feature corresponding to feature in f1real
          if UNIFY(feature.value, otherfeature.value) returns failure then
              return failure
      }
      return f1
  }
  else return failure
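A runnable sketch of the same idea under an assumed representation: each node carries content and pointer fields as on the previous slides, atomic contents are strings, and complex contents are dicts of feature name to node. The Node class and helper names are illustrative.

# Sketch of the destructive UNIFY above, using content/pointer nodes.
class Node:
    def __init__(self, content=None):
        self.content = content      # None, a str (atomic), or dict feature -> Node
        self.pointer = None         # forwarding pointer set during unification

def deref(f):
    while f.pointer is not None:
        f = f.pointer
    return f

def unify(f1, f2):
    f1, f2 = deref(f1), deref(f2)
    if f1 is f2:
        return f1                                   # already the same structure
    if f1.content is None:
        f1.pointer = f2; return f2
    if f2.content is None:
        f2.pointer = f1; return f1
    if isinstance(f1.content, str) and f1.content == f2.content:
        f1.pointer = f2; return f2                  # identical atomic values
    if isinstance(f1.content, dict) and isinstance(f2.content, dict):
        f2.pointer = f1
        for feat, value in f2.content.items():
            other = f1.content.setdefault(feat, Node())   # find or create in f1
            if unify(value, other) is None:
                return None                               # failure below
        return f1
    return None                                           # atomic clash

a = Node({"NUMBER": Node("SG")})
b = Node({"NUMBER": Node("SG"), "PERSON": Node("3")})
print(unify(a, b) is not None)            # True: compatible structures
print(sorted(deref(a).content.keys()))    # ['NUMBER', 'PERSON']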

Page 46: CS60057 Speech &Natural Language Processing

Example - Unification of Complex Structures

[AGREEMENT (1) [NUMBER SG],
 SUBJECT [AGREEMENT (1)]]
⊔
[SUBJECT [AGREEMENT [PERSON 3]]]
=
[AGREEMENT (1) [NUMBER SG, PERSON 3],
 SUBJECT [AGREEMENT (1)]]

Page 47: CS60057 Speech &Natural Language Processing

Example - Unification of Complex Structures (cont.)

[Figure: the same unification carried out on the extended DAG representation, following content and pointer edges for the SUB, AGR, NUM and PER features.]

Page 48: CS60057 Speech &Natural Language Processing

Parsing with Unification Constraints

Let us assume that we have augmented our grammar with sets of unification constraints. What changes do we need to make to a parser so that it can use them?

- Building feature structures and associating them with sub-trees.
- Unifying feature structures when sub-trees are created.
- Blocking ill-formed constituents.

Page 49: CS60057 Speech &Natural Language Processing

Earley Parsing with Unification Constraints

What do we have to do to integrate unification constraints into the Earley parser?

- Build feature structures (represented as DAGs) and associate them with states in the chart.
- Unify feature structures as states are advanced in the chart.
- Block ill-formed states from entering the chart.

The main change will be in the COMPLETER function of the Earley parser. This routine will invoke the unifier to unify two feature structures.

Page 50: CS60057 Speech &Natural Language Processing

Building Feature Structures

NP → Det NOMINAL
  <Det HEAD AGREEMENT> = <NOMINAL HEAD AGREEMENT>
  <NP HEAD> = <NOMINAL HEAD>

corresponds to the feature structure

[NP [HEAD (1)],
 Det [HEAD [AGREEMENT (2)]],
 NOMINAL [HEAD (1) [AGREEMENT (2)]]]

Page 51: CS60057 Speech &Natural Language Processing

Augmenting States with DAGs

Each state will have an additional field to contain the DAG representing the feature structure corresponding to the state.

When a rule is first used by PREDICTOR to create a state, the DAG associated with the state will simply consist of the DAG retrieved from the rule. For example:

S → • NP VP, [0,0], [], Dag1
  where Dag1 is the feature structure corresponding to S → NP VP.

NP → • Det NOMINAL, [0,0], [], Dag2
  where Dag2 is the feature structure corresponding to NP → Det NOMINAL.

Page 52: CS60057 Speech &Natural Language Processing

What does COMPLETER do?

When COMPLETER advances the dot in a state, it should unify the feature structure of the newly completed state with the appropriate part of the feature structure of the state being advanced.

If this unification is successful, the new state gets the result of the unification as its DAG and is entered into the chart. If it fails, nothing is entered into the chart.

Page 53: CS60057 Speech &Natural Language Processing

A Completion Example

Parsing the phrase "that flight", after "that" has been processed:

NP → Det • NOMINAL, [0,1], [SDet], Dag1

Dag1 = [NP [HEAD (1)],
        Det [HEAD [AGREEMENT [NUMBER SG]]],
        NOMINAL [HEAD (1) [AGREEMENT [NUMBER SG]]]]

A newly completed state:

NOMINAL → Noun •, [1,2], [SNoun], Dag2

Dag2 = [NOMINAL [HEAD (1)],
        Noun [HEAD (1) [AGREEMENT [NUMBER SG]]]]

To advance the NP state, the parser unifies the feature structure found under the NOMINAL feature of Dag2 with the feature structure found under the NOMINAL feature of Dag1.

Page 54: CS60057 Speech &Natural Language Processing

Earley Parse

function EARLEY-PARSE(words, grammar) returns chart

  ENQUEUE((γ → • S, [0,0], dagγ), chart[0])
  for i from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return chart

Page 55: CS60057 Speech &Natural Language Processing

Predictor and Scanner

procedure PREDICTOR((A → α • B β, [i,j], dagA))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j], dagB), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j], dagA))
  if B ∈ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j] •, [j,j+1], dagB), chart[j+1])
  end

Page 56: CS60057 Speech &Natural Language Processing

Completer and UnifyStates

procedure COMPLETER((B → γ •, [j,k], dagB))
  for each (A → α • B β, [i,j], dagA) in chart[j] do
    if newdag ← UNIFY-STATES(dagB, dagA, B) does not fail then
      ENQUEUE((A → α B • β, [i,k], newdag), chart[k])
  end

procedure UNIFY-STATES(dag1, dag2, cat)
  dag1cp ← CopyDag(dag1)
  dag2cp ← CopyDag(dag2)
  UNIFY(FollowPath(cat, dag1cp), FollowPath(cat, dag2cp))
  return dag2cp

Page 57: CS60057 Speech &Natural Language Processing


Enqueue

procedure ENQUEUE(state,chart-entry)

if state is not subsumed by a state in chart-entry then

Add state at the end of chart-entry

end

Page 58: CS60057 Speech &Natural Language Processing


Probabilistic Parsing

Slides by Markus Dickinson, Georgetown University

Page 59: CS60057 Speech &Natural Language Processing

Motivation and Outline

Previously, we used CFGs to parse with, but some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse.

How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus ...

Where we're going:
- Probabilistic Context-Free Grammars (PCFGs)
- Lexicalized PCFGs
- Dependency Grammars

Page 60: CS60057 Speech &Natural Language Processing

Statistical Parsing

Basic idea:
- Start with a treebank: a collection of sentences with syntactic annotation, i.e., already-parsed sentences.
- Examine which parse trees occur frequently.
- Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule from its frequency.

That is, we'll have a CFG augmented with probabilities.

Page 61: CS60057 Speech &Natural Language Processing

Using Probabilities to Parse

P(T): probability of a particular parse tree

P(T) = Π_{n∈T} p(r(n))

i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree.

Example: given the probabilities on p. 449, compute the probability of the parse tree shown on the slide (the tree for "book that flight"; see the next slide).

Page 62: CS60057 Speech &Natural Language Processing

Computing probabilities

We have the following rules and probabilities (adapted from Figure 12.1):

S → VP        .05
VP → V NP     .40
NP → Det N    .20
V → book      .30
Det → that    .05
N → flight    .25

P(T) = P(S → VP) * P(VP → V NP) * ... * P(N → flight)
     = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 × 10⁻⁵
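A one-line check of this arithmetic: the parse probability is just the product of the probabilities of the rules used in the tree.

# Quick check of the product above.
from math import prod

rule_probs = {
    "S -> VP": 0.05, "VP -> V NP": 0.40, "NP -> Det N": 0.20,
    "V -> book": 0.30, "Det -> that": 0.05, "N -> flight": 0.25,
}
print(prod(rule_probs.values()))   # ≈ 1.5e-05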

Page 63: CS60057 Speech &Natural Language Processing

Using probabilities

So, the probability for that parse is 0.000015. What's the big deal?

Probabilities are useful for comparing with other probabilities. Whereas we couldn't decide between two parses using a regular CFG, we now can.

For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP:

A: [book [TWA] [flights]]
B: [book [TWA flights]]

Probabilities allow us to choose B (see Figure 12.2).

Page 64: CS60057 Speech &Natural Language Processing

Obtaining the best parse

Call the best parse T(S), where S is your sentence: the tree with the highest probability, i.e.

T(S) = argmax_{T ∈ parse-trees(S)} P(T)

We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse. CYK is a form of dynamic programming, and a chart parser like the Earley parser.

Page 65: CS60057 Speech &Natural Language Processing

The CYK algorithm

Base case:
- Add words to the chart.
- Store P(A → w_i) for every category A in the chart.

Recursive case (this is what makes it dynamic programming, because we only calculate B and C once):
- Rules must be of the form A → B C, i.e., exactly two items on the RHS (we call this Chomsky Normal Form, CNF).
- Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C): P(B) * P(C) * P(A → B C).
- For a given A, only keep the maximum probability (again, this is dynamic programming).
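A compact sketch of probabilistic CYK over a toy CNF grammar. The grammar, the dict-based chart, and the collapsed rule S → V NP (standing in for S → VP and VP → V NP, probability .05 * .40 = .02, so the grammar stays binary) are illustrative assumptions, not the textbook's pseudocode.

# Illustrative probabilistic CYK: chart[(i, j)][A] holds the best probability
# of deriving words[i:j] from category A, for a grammar in CNF.
from collections import defaultdict

lexical = {("V", "book"): 0.30, ("Det", "that"): 0.05, ("N", "flight"): 0.25}
binary = {("VP", "V", "NP"): 0.40, ("NP", "Det", "N"): 0.20, ("S", "V", "NP"): 0.02}

def cyk(words):
    n = len(words)
    chart = defaultdict(dict)                       # (i, j) -> {A: prob}
    for i, w in enumerate(words):                   # base case: A -> w_i
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = p
    for span in range(2, n + 1):                    # recursive case: A -> B C
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = chart[(i, k)][B] * chart[(k, j)][C] * p
                        if prob > chart[(i, j)].get(A, 0.0):   # keep the max
                            chart[(i, j)][A] = prob
    return chart[(0, n)]

print(cyk(["book", "that", "flight"]))   # {'VP': ≈3e-04, 'S': ≈1.5e-05}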

Page 66: CS60057 Speech &Natural Language Processing

Problems with PCFGs

It's still only a CFG, so dependencies on non-CFG information are not captured. E.g., pronouns are more likely to be subjects than objects:

P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj]

It ignores lexical information (statistics), which is usually crucial for disambiguation:

(T1) America sent [[250,000 soldiers] [into Iraq]]
(T2) America sent [250,000 soldiers] [into Iraq]

send with an into-PP is always attached high (T2) in the PTB!

To handle lexical information, we'll turn to lexicalized PCFGs.

Page 67: CS60057 Speech &Natural Language Processing

Lexicalized Grammars

Head information is passed up in a syntactic analysis, e.g. VP[head *1] → V[head *1] NP.

If you follow this down all the way to the bottom of a tree, you wind up with a head word. In some sense, we can say that Book that flight is not just an S, but an S rooted in book; thus, book is the headword of the whole sentence.

By adding headword information to nonterminals, we wind up with a lexicalized grammar.

Page 68: CS60057 Speech &Natural Language Processing

Lexicalized PCFGs

Lexicalized parse trees: each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter. The headword for a node is set to the headword of its head daughter.

[Tree on slide: for "book that flight", S and VP carry the headword book, while NP and NOMINAL carry flight.]

Page 69: CS60057 Speech &Natural Language Processing

Incorporating Head Probabilities: Wrong Way

Simply adding the headword w to a node won't work. The node A becomes A[w], e.g.

P(A[w] → β | A) = Count(A[w] → β) / Count(A)

The probabilities are too small, i.e., we don't have a big enough corpus to calculate them:

VP(dumped) → VBD(dumped) NP(sacks) PP(into)   3 × 10⁻¹⁰
VP(dumped) → VBD(dumped) NP(cats) PP(into)    8 × 10⁻¹¹

These probabilities are tiny, and others will never occur.

Page 70: CS60057 Speech &Natural Language Processing

Incorporating head probabilities: Right way

Previously, we conditioned on the mother node (A): P(A → β | A).

Now, we can condition on the mother node and the headword of A, h(A): P(A → β | A, h(A)). We're no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head.

E.g., P(VP → VBD NP PP | VP, dumped)

Page 71: CS60057 Speech &Natural Language Processing

Calculating rule probabilities

We'll write the probability more generally as P(r(n) | n, h(n)), where n = node, r = rule, and h = headword.

We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total:

P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β)
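A toy illustration of this relative-frequency estimate; the counts are made up for the example, not taken from a treebank.

# Toy relative-frequency estimate of P(rule | category, headword).
from collections import Counter

# (category, headword, expansion) counts as they might be collected from a treebank
counts = Counter({
    ("VP", "dumped", "VBD NP PP"): 6,
    ("VP", "dumped", "VBD NP"): 3,
    ("VP", "dumped", "VBD PP"): 1,
})

def rule_prob(cat, head, expansion):
    total = sum(c for (c_cat, c_head, _), c in counts.items()
                if c_cat == cat and c_head == head)
    return counts[(cat, head, expansion)] / total if total else 0.0

print(rule_prob("VP", "dumped", "VBD NP PP"))   # 6 / 10 = 0.6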

Page 72: CS60057 Speech &Natural Language Processing

Adding info about word-word dependencies

We want to take into account one other factor: the probability of being a head word (in a given context), P(h(n) = word | ...).

We condition this probability on two things: (1) the category of the node, n, and (2) the headword of the mother, h(m(n)):

P(h(n) = word | n, h(m(n))), shortened as P(h(n) | n, h(m(n)))

e.g. P(sacks | NP, dumped)

What we're really doing is factoring in how words relate to each other. We will call this a dependency relation later: sacks is dependent on dumped, in this case.

Page 73: CS60057 Speech &Natural Language Processing

Putting it all together

See p. 459 for an example lexicalized parse tree for workers dumped sacks into a bin.

For rules r, category n, head h, mother m:

P(T) = Π_{n∈T} p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))

The first factor, e.g. P(VP → VBD NP PP | VP, dumped), carries subcategorization information; the second, e.g. P(sacks | NP, dumped), carries dependency information between words.

Page 74: CS60057 Speech &Natural Language Processing

Dependency Grammar

Capturing relations between words (e.g. dumped and sacks) is moving in the direction of dependency grammar (DG).

In DG, there is no such thing as constituency. The structure of a sentence is purely the binary relations between words. John loves Mary is represented as:

LOVE → JOHN
LOVE → MARY

where A → B means that B depends on A.

Page 75: CS60057 Speech &Natural Language Processing


Dependency parsing

Page 76: CS60057 Speech &Natural Language Processing

Dependency Grammar/Parsing

A sentence is parsed by relating each word to the other words in the sentence which depend on it.

- The idea of dependency structure goes back a long way: to Pāṇini's grammar (c. 5th century BCE).
- Constituency is a new-fangled 20th-century invention.
- Modern work is often linked to the work of L. Tesnière (1959); it is the dominant approach in the "East" (Eastern bloc/East Asia).
- Among the earliest kinds of parsers in NLP, even in the US: David Hays, one of the founders of computational linguistics, built an early (first?) dependency parser (Hays 1962).

Page 77: CS60057 Speech &Natural Language Processing

Dependency structure

Words are linked from head (regent) to dependent. Warning: some people draw the arrows one way, some the other (Tesnière has them point from head to dependent). We usually add a fake ROOT so that every word is a dependent.

Example (dependency arcs shown on the slide): Shaw Publishing acquired 30 % of American City in March

Page 78: CS60057 Speech &Natural Language Processing

Relation between CFG and dependency parse

A dependency grammar has a notion of a head. Officially, CFGs don't, but modern linguistic theory and all modern statistical parsers (Charniak, Collins, Stanford, ...) do, via hand-written phrasal "head rules":
- The head of a Noun Phrase is a noun/number/adj/...
- The head of a Verb Phrase is a verb/modal/...

The head rules can be used to extract a dependency parse from a CFG parse (follow the heads).

A phrase structure tree can be obtained from a dependency tree, but the dependents are flat (no VP!).

Page 79: CS60057 Speech &Natural Language Processing

Propagating head words

A small set of rules propagates heads.

[Tree on slide for "John Smith, the president of IBM, announced his resignation yesterday": the headwords propagate as S(announced), NP(Smith) over NNP John, NNP Smith; NP(president) over DT the, NN president, with PP(of) over IN of and NP over NNP IBM; VP(announced) over VBD announced, NP(resignation) over PRP$ his, NN resignation, and NP over NN yesterday.]

Page 80: CS60057 Speech &Natural Language Processing

Extracted structure

NB: not all dependencies are shown here.

Dependencies are inherently untyped, though some work, like Collins (1996), types them using the phrasal categories.

[Figure on slide: dependencies extracted from the tree, e.g. within [John Smith], between [the president] and of [IBM], and from announced to [his resignation] and [yesterday], each shown with the phrasal configuration (NP; NP-NP; S → NP VP; VP → VBD NP) it came from.]

Page 81: CS60057 Speech &Natural Language Processing

Dependency Conditioning Preferences

Sources of information:
- bilexical dependencies
- distance of dependencies
- valency of heads (number of dependents)

A word's dependents (adjuncts, arguments) tend to fall near it in the string.

These next 6 slides are based on slides by Jason Eisner and Noah Smith.

Page 82: CS60057 Speech &Natural Language Processing

Probabilistic dependency grammar: generative model

1. Start with the left wall $
2. Generate the root w0
3. Generate left children w-1, w-2, ..., w-ℓ from the FSA λ_w0
4. Generate right children w1, w2, ..., wr from the FSA ρ_w0
5. Recurse on each wi for i in {-ℓ, ..., -1, 1, ..., r}, sampling αi (steps 2-4)
6. Return α-ℓ ... α-1 w0 α1 ... αr

[Figure: the root w0 below the wall $, with left children w-1 ... w-ℓ generated by λ_w0 and right children w1 ... wr generated by ρ_w0, each recursively expanded.]
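A sketch of this generative story in code; the left/right automata are simplified here to a single stop probability plus a uniform choice over a toy vocabulary, which is purely illustrative and not Eisner's actual parameterisation.

# Illustrative generative sketch: each head generates left and right dependent
# sequences, then recurses on each dependent.
import random

VOCAB = ["takes", "it", "two", "to", "tango"]
P_STOP = 0.6   # probability of generating no further child on a side

def generate(head, depth=0):
    """Return the subtree of `head` as a flat word sequence (left ... head ... right)."""
    if depth > 3:                              # keep the toy recursion bounded
        return [head]
    left, right = [], []
    while random.random() > P_STOP:            # generate left children w-1, w-2, ...
        left.append(generate(random.choice(VOCAB), depth + 1))
    while random.random() > P_STOP:            # generate right children w1, w2, ...
        right.append(generate(random.choice(VOCAB), depth + 1))
    sequence = []
    for child in reversed(left):               # w-l ... w-1
        sequence += child
    sequence.append(head)
    for child in right:                        # w1 ... wr
        sequence += child
    return sequence

random.seed(0)
root = random.choice(VOCAB)                    # step 2: generate the root w0
print(generate(root))                          # a short word sequence around the root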

Page 83: CS60057 Speech &Natural Language Processing

Naïve Recognition/Parsing

Example sentence: It takes two to tango

[Figure: a naïve parse item spans positions i..j..k with a parent p and child c, giving O(n⁵) combinations, or O(n⁵ N³) with N nonterminals.]

Page 84: CS60057 Speech &Natural Language Processing

Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)

- Triangles: spans of words where the tall side of the triangle is the head, the other side is a dependent, and no non-head word is expecting more dependents.
- Trapezoids: spans of words where the larger side is the head, the smaller side is a dependent, and the smaller side is still looking for dependents on its side of the trapezoid.

Page 85: CS60057 Speech &Natural Language Processing

Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)

[Figure: a parse of "It takes two to tango" built from triangles and trapezoids, ending in the goal item.]

One trapezoid per dependency. A triangle is a head with some left (or right) subtrees.

Page 86: CS60057 Speech &Natural Language Processing

Cubic Recognition/Parsing (Eisner & Satta, 1999)

[Figure: the combination rules. Joining items over i..j and j..k gives O(n³) combinations for building trapezoids and O(n³) for building triangles; assembling the goal item over 0..i..n takes O(n) combinations.]

This gives O(n³) dependency grammar parsing.

Page 87: CS60057 Speech &Natural Language Processing

Evaluation of Dependency Parsing: simply use (labeled) dependency accuracy

GOLD                           PARSED
1 2 We       SUBJ              1 2 We       SUBJ
2 0 eat      ROOT              2 0 eat      ROOT
3 5 the      DET               3 4 the      DET
4 5 cheese   MOD               4 2 cheese   OBJ
5 2 sandwich SUBJ              5 2 sandwich PRED

Accuracy = number of correct dependencies / total number of dependencies
         = 2 / 5 = 0.40 = 40%
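A short calculation of the labeled accuracy figure above; the tuple layout (dependent position, head position, word, label) is an assumption mirroring the table.

# Labeled dependency accuracy: a dependency is correct if head and label match.
gold = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 5, "the", "DET"),
        (4, 5, "cheese", "MOD"), (5, 2, "sandwich", "SUBJ")]
parsed = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 4, "the", "DET"),
          (4, 2, "cheese", "OBJ"), (5, 2, "sandwich", "PRED")]

correct = sum(1 for g, p in zip(gold, parsed) if g[1] == p[1] and g[3] == p[3])
print(correct / len(gold))   # 2 / 5 = 0.4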

Page 88: CS60057 Speech &Natural Language Processing

McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers

- Builds a discriminative dependency parser; can condition on rich features in that context.
- Best-known recent dependency parser; lots of recent dependency parsing activity connected with the CoNLL 2006/2007 shared tasks.
- Doesn't/can't report constituent LP/LR, but evaluating dependencies correct: accuracy is similar to, but a fraction below, dependencies extracted from Collins: 90.9% vs. 91.4%; combining them gives 92.2% [all lengths].
- Stanford parser on lengths up to 40: pure generative dependency model 85.0%, lexicalized factored parser 91.0%.

Page 89: CS60057 Speech &Natural Language Processing

McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers

- The score of a parse is the sum of the scores of its dependencies.
- Each dependency's score is a linear function of features times weights.
- Feature weights are learned by MIRA, an online large-margin algorithm, but you could think of it as using a perceptron or maxent classifier.
- Features cover: head and dependent word and POS separately; head and dependent word and POS bigram features; words between head and dependent; length and direction of the dependency.
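A sketch of the edge-factored scoring idea described above; the feature templates and weights are made-up toy values, and MIRA's weight updates are not shown.

# Toy edge-factored scorer: the parse score is the sum over dependencies of a
# dot product between sparse features and learned weights.
weights = {"hw=eat|dw=We": 1.2, "hpos=V|dpos=PRON": 0.7,
           "dist=1": 0.3, "dir=left": 0.1}

def edge_features(head, dep):
    return [f"hw={head['word']}|dw={dep['word']}",
            f"hpos={head['pos']}|dpos={dep['pos']}",
            f"dist={abs(head['idx'] - dep['idx'])}",
            f"dir={'left' if dep['idx'] < head['idx'] else 'right'}"]

def edge_score(head, dep):
    return sum(weights.get(f, 0.0) for f in edge_features(head, dep))

def parse_score(dependencies):
    return sum(edge_score(h, d) for h, d in dependencies)

eat = {"word": "eat", "pos": "V", "idx": 2}
we = {"word": "We", "pos": "PRON", "idx": 1}
print(parse_score([(eat, we)]))   # 1.2 + 0.7 + 0.3 + 0.1 = 2.3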

Page 90: CS60057 Speech &Natural Language Processing

Extracting grammatical relations from statistical constituency parsers

[de Marneffe et al., LREC 2006]

Exploit the high-quality syntactic analysis done by statistical constituency parsers to get the grammatical relations [typed dependencies]. Dependencies are generated by pattern-matching rules.

Example: Bills on ports and immigration were submitted by Senator Brownback

[Figure: the constituency parse of the sentence and the extracted typed dependencies: nsubjpass(submitted, Bills), auxpass(submitted, were), agent(submitted, Brownback), nn(Brownback, Senator), prep_on(Bills, ports), cc_and(ports, immigration).]

Page 91: CS60057 Speech &Natural Language Processing

Evaluating Parser Output

Dependency relations are also useful for comparing parser output to a treebank.

Traditional measures of parser accuracy:
- Labeled bracketing precision: # correct constituents in parse / # constituents in parse
- Labeled bracketing recall: # correct constituents in parse / # constituents in treebank parse

There are known problems with these measures, so people are trying to use dependency-based measures instead: how many dependency relations did the parse get correct?