CS60057 Speech & Natural Language Processing
Autumn 2007, Lecture 16, 5 September 2007

Page 1: CS60057 Speech &Natural Language Processing

CS60057 Speech & Natural Language Processing

Autumn 2007

Lecture 16

5 September 2007

Page 2: CS60057 Speech &Natural Language Processing

Parsing with features

We need to constrain the rules in CFGs, for example:
- to coerce agreement within and between constituents
- to pass features around
- to enforce subcategorisation constraints

Features can be easily added to our grammars. Later we'll see that feature bundles can completely replace constituents.

Page 3: CS60057 Speech &Natural Language Processing

Parsing with features

Rules can stipulate values, or placeholders (variables) for values. Features can be used within the rule, or passed up via the mother nodes.

Example: subject-verb agreement

S → NP VP  [if NP and VP agree in number]
  - the number of the NP depends on the noun and/or determiner
  - the number of the VP depends on the verb

S → NP(num=X) VP(num=X)
NP(num=X) → det(num=X) n(num=X)
VP(num=X) → v(num=X) NP(num=?)
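A minimal sketch (not the course's code) of how such num variables can be checked by unification of flat feature dictionaries; the unify_flat helper, the use of None for an unbound variable, and the build_s wrapper are illustrative assumptions:

# Minimal sketch: checking subject-verb agreement by unifying flat num
# features, where None plays the role of an unbound variable such as X or "?".

def unify_flat(f1, f2):
    """Unify two flat feature dicts; None acts as an unbound value."""
    result = dict(f1)
    for attr, val in f2.items():
        if attr not in result or result[attr] is None:
            result[attr] = val                 # bind the open value
        elif val is not None and result[attr] != val:
            return None                        # clash: unification fails
    return result

# S -> NP(num=X) VP(num=X): the rule succeeds only if the two nums unify.
def build_s(np_feats, vp_feats):
    return unify_flat(np_feats, vp_feats)

print(build_s({"num": "sg"}, {"num": "sg"}))   # {'num': 'sg'}  -> S is built
print(build_s({"num": "sg"}, {"num": "pl"}))   # None           -> blocked
print(build_s({"num": None}, {"num": "pl"}))   # {'num': 'pl'}  -> variable bound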

Page 4: CS60057 Speech &Natural Language Processing

Declarative nature of features

The rules can be used in various ways:
- to build an NP only if det and n agree (bottom-up)
- when generating an NP, to choose an n which agrees with the det (if working left-to-right) (top-down)
- to show that the num value for an NP comes from its components (percolation)
- to ensure that the num value is correctly set when generating an NP (inheritance)
- to block ill-formed input

NP(num=X) → det(num=X) n(num=X)

this → det(num=sg)    these → det(num=pl)    the → det(num=?)
man → n(num=sg)       men → n(num=pl)

[Tree on slide: det(num=sg) + n(num=sg) for "this man" builds NP(num=sg); combining "this" with "men" (n(num=pl)) is blocked by the number mismatch.]

Page 5: CS60057 Speech &Natural Language Processing

Use of variables

Unbound (unassigned) variables (i.e. variables with a free value): "the" can combine with any value for num. Unification means that the num value for "the" is set to sg.

NP(num=X) → det(num=X) n(num=X)

this → det(num=sg)    these → det(num=pl)    the → det(num=?)
man → n(num=sg)       men → n(num=pl)

[Tree on slide: det(num=?) + n(num=sg) for "the man" unifies to NP(num=sg).]

Page 6: CS60057 Speech &Natural Language Processing

Parsing with features

Features must be compatible. The formalism should allow features to remain unspecified. Feature mismatch can be used to block false analyses and to disambiguate, e.g. they can fish ~ he can fish ~ he cans fish.

The formalism may use attribute-value pairs, or rely on argument position, e.g.:

NP(_num,_sem) → det(_num) n(_num,_sem)
an  = det(sing)
the = det(_num)
man = n(sing,hum)

Page 7: CS60057 Speech &Natural Language Processing

Parsing with features

Using features to impose subcategorization constraints:

VP → v          e.g. dance
VP → v NP       e.g. eat
VP → v NP NP    e.g. give
VP → v PP       e.g. wait (for)

VP(_num) → v(_num,intr)
VP(_num) → v(_num,trans) NP
VP(_num) → v(_num,ditrans) NP NP
VP(_num) → v(_num,prepobj(_case)) PP(_case)
PP(_case) → prep(_case) NP

dance  = v(plur,intr)
dances = v(sing,intr)
danced = v(_num,intr)
waits  = v(sing,prepobj(for))
for    = prep(for)

Page 8: CS60057 Speech &Natural Language Processing

Parsing with features (top-down)

Grammar:
S → NP(_num) VP(_num)
NP(_num) → det(_num) n(_num)
VP(_num) → v(_num,intrans)
VP(_num) → v(_num,trans) NP(_1)

Input: the man shot those elephants

[Derivation sketched on the slide:]
- Expand S → NP(_num) VP(_num), then NP(_num) → det(_num) n(_num).
- the = det(_num), man = n(sing), so _num is bound to sing.
- Try VP(sing) → v(sing,intrans): fails, since shot = v(sing,trans).
- Try VP(sing) → v(sing,trans) NP(_1): shot matches; expand NP(_1) → det(_1) n(_1).
- those = det(pl), elephants = n(pl), so _1 is bound to pl.

Page 9: CS60057 Speech &Natural Language Processing

Feature structures

Instead of attaching features to the symbols, we can parse with symbols made up entirely of attribute-value pairs: "feature structures".

They can be used in the same way as seen previously. Values can be atomic ... or embedded feature structures.

[ATTR1 VAL1, ATTR2 VAL2, ATTR3 VAL3]

[CAT NP, NUMBER SG, PERSON 3]

[CAT NP, AGR [NUM SG, PERS 3]]

Page 10: CS60057 Speech &Natural Language Processing


Unification

Probabilistic CFG


Page 11: CS60057 Speech &Natural Language Processing


Feature Structures

A set of feature-value pairs

No feature occurs in more than one feature-value pair

(a partial function from features to values)

Circular structures are prohibited.

Page 12: CS60057 Speech &Natural Language Processing

Structured Feature Structure

Part of a third-person singular NP:

[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]

Page 13: CS60057 Speech &Natural Language Processing

Reentrant Feature Structure

Two features can share a feature structure as their value. This is not the same thing as the two features having equivalent values!

Two distinct feature structure values:

[HEAD [AGR [NUM SG, PERS 3]],
 SUBJ [AGR [NUM SG, PERS 3]]]

One shared value (reentrant feature structure):

[HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]

Page 14: CS60057 Speech &Natural Language Processing

They can be coindexed:

[CAT S,
 HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]

Page 15: CS60057 Speech &Natural Language Processing

Parsing with feature structures

Grammar rules can specify assignments to, or equations between, feature structures.

These are expressed as "feature paths", e.g. HEAD.AGR.NUM = SG

[CAT S,
 HEAD [AGR (1) [NUM SG, PERS 3]],
 SUBJ [AGR (1)]]

Page 16: CS60057 Speech &Natural Language Processing

Subsumption (⊑)

A (partial) ordering of feature structures, based on relative specificity.

A structure that carries less information is more general: it subsumes the more specific one.

Page 17: CS60057 Speech &Natural Language Processing

Subsumption

A more abstract (less specific) feature structure subsumes an equally or more specific one.

A feature structure F subsumes a feature structure G (F ⊑ G) if and only if:
- For every feature x in F, F(x) ⊑ G(x) (where F(x) means the value of the feature x of the feature structure F).
- For all paths p and q in F such that F(p) = F(q), it is also the case that G(p) = G(q).

An atomic feature structure neither subsumes nor is subsumed by another atomic feature structure. Variables subsume all other feature structures. Informally, a feature structure F subsumes a feature structure G (F ⊑ G) if all parts of F subsume the corresponding parts of G.
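An illustrative sketch of this check for feature structures encoded as nested dicts with string-valued atoms; the reentrancy clause (shared paths) is deliberately ignored to keep the example short, and the encoding itself is an assumption:

# Sketch only: subsumption between feature structures as nested dicts.
def subsumes(f, g):
    """Return True if F subsumes G (F is at most as specific as G)."""
    if isinstance(f, str) or isinstance(g, str):
        return f == g                      # atomic values must match exactly
    for feat, f_val in f.items():
        if feat not in g or not subsumes(f_val, g[feat]):
            return False                   # G must contain everything F has
    return True

f1 = {"NUMBER": "SG"}
f3 = {"NUMBER": "SG", "PERSON": "3"}
print(subsumes(f1, f3))   # True:  [NUMBER SG] subsumes the more specific structure
print(subsumes(f3, f1))   # False: the more specific structure does not subsume it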

Page 18: CS60057 Speech &Natural Language Processing

Subsumption Example

Consider the following feature structures:

(1) [PERSON 3]

(2) [NUMBER SG]

(3) [NUMBER SG, PERSON 3]

Then (1) ⊑ (3) and (2) ⊑ (3), but there is no subsumption relation between (1) and (2).

Page 19: CS60057 Speech &Natural Language Processing

Feature Structures in the Grammar

We will incorporate feature structures and the unification process as follows:
- All constituents (non-terminals) will be associated with feature structures.
- Sets of unification constraints will be associated with grammar rules, and these constraints must be satisfied for the rule to be applied.

These attachments accomplish the following goals:
- To associate feature structures with both lexical items and instances of grammatical categories.
- To guide the composition of feature structures for larger grammatical constituents based on the feature structures of their component parts.
- To enforce compatibility constraints between specified parts of grammatical constituents.

Page 20: CS60057 Speech &Natural Language Processing

Feature unification

Feature structures can be unified if:
- They have like-named attributes that have the same value:
  [NUM SG] ⊔ [NUM SG] = [NUM SG]
- Like-named attributes that are "open" get the value assigned:
  [CAT NP, NUMBER ??, PERSON 3] ⊔ [NUMBER SG, PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]

Page 21: CS60057 Speech &Natural Language Processing

Feature unification

Complementary features are brought together. Unification is recursive. Coindexed structures are identical (not just copies): an assignment to one affects all.

[CAT NP, NUMBER SG] ⊔ [PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]

[CAT NP, AGR [NUM SG]] ⊔ [CAT NP, AGR [PERS 3]] = [CAT NP, AGR [NUM SG, PERS 3]]
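A sketch of recursive unification for feature structures encoded as nested dicts (an assumed encoding, not the lecture's code); it returns a new structure or None on failure, and does not model coindexing/structure sharing:

# Illustrative recursive unification over nested dicts.
def unify(f1, f2):
    if isinstance(f1, str) or isinstance(f2, str):
        return f1 if f1 == f2 else None        # atomic values must be identical
    result = dict(f1)
    for feat, val in f2.items():
        if feat in result:
            sub = unify(result[feat], val)     # recurse on shared features
            if sub is None:
                return None                    # clash somewhere below: fail
            result[feat] = sub
        else:
            result[feat] = val                 # complementary feature: copy over
    return result

a = {"CAT": "NP", "AGR": {"NUM": "SG"}}
b = {"CAT": "NP", "AGR": {"PERS": "3"}}
print(unify(a, b))   # {'CAT': 'NP', 'AGR': {'NUM': 'SG', 'PERS': '3'}}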

Page 22: CS60057 Speech &Natural Language Processing

Example

Rule: [CAT NP, AGR _1 ⊔ _2, SEM _3] → [CAT DET, AGR _1]  [CAT N, AGR _2, SEM _3]

Lexicon:
a:   [CAT DET, AGR [VAL INDEF, NUM SG]]
the: [CAT DET, AGR [VAL DEF]]
man: [CAT N, LEX "man", AGR [NUM SG], SEM HUM]

Page 23: CS60057 Speech &Natural Language Processing

the man

the: [CAT DET, AGR _1 = [VAL DEF]]
man: [CAT N, LEX "man", AGR _2 = [NUM SG], SEM _3 = HUM]

Unifying _1 and _2 for the NP's AGR gives [VAL DEF, NUM SG].

Result: [CAT NP, AGR [VAL DEF, NUM SG], SEM HUM]

Page 24: CS60057 Speech &Natural Language Processing

a man

a:   [CAT DET, AGR _1 = [VAL INDEF, NUM SG]]
man: [CAT N, LEX "man", AGR _2 = [NUM SG], SEM _3 = HUM]

Unifying _1 and _2 for the NP's AGR gives [VAL INDEF, NUM SG].

Result: [CAT NP, AGR [VAL INDEF, NUM SG], SEM HUM]

Page 25: CS60057 Speech &Natural Language Processing

Types and inheritance

Feature typing allows us to constrain the possible values a feature can have, e.g. num = {sing, plur}. It allows grammars to be checked for consistency, and can make parsing easier.

We can express general "feature co-occurrence conditions" and "feature inheritance rules". Both of these allow us to make the grammar more compact.

Page 26: CS60057 Speech &Natural Language Processing

Co-occurrence conditions and inheritance rules

General rules (beyond simple unification) which apply automatically, and so do not need to be stated (and repeated) in each rule or lexical entry.

Examples:

[cat=np] ⇒ [num=??, gen=??, case=??]
[cat=v, num=sg] ⇒ [tns=pres]
[attr1=val1] ⇒ [attr2=val2]

Page 27: CS60057 Speech &Natural Language Processing

Inheritance rules

Inheritance rules can be overridden, e.g.:

[cat=n] ⇒ [gen=??, sem=??]

sex = {male, female}
gen = {masc, fem, neut}

[cat=n, gen=fem, sem=hum] ⇒ [sex=female]

uxor     → [cat=n, gen=fem, sem=hum]
agricola → [cat=n, gen=fem, sem=hum, sex=male]

Page 28: CS60057 Speech &Natural Language Processing

Unification in Linguistics

- Lexical Functional Grammar (if interested, see the PARGRAM project)
- GPSG, HPSG
- Construction Grammar
- Categorial Grammar

Page 29: CS60057 Speech &Natural Language Processing

Unification

Joining the contents of two feature structures into one new structure (the union of the two originals).

The result is the most general feature structure that is subsumed by both.

The unification of two contradictory feature structures is undefined (unification fails).

Page 30: CS60057 Speech &Natural Language Processing

Unification Constraints

Each grammar rule will be associated with a set of unification constraints.

β0 → β1 ... βn   {set of unification constraints}

Each unification constraint will be in one of the following forms:

<βi feature path> = Atomic value
<βi feature path> = <βj feature path>

Page 31: CS60057 Speech &Natural Language Processing

Unification Constraints -- Example

For example, the rule

S → NP VP

(only if the number of the NP is equal to the number of the VP)

will be represented as follows:

S → NP VP
<NP NUMBER> = <VP NUMBER>
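One way to picture such an annotated rule in code (a hypothetical encoding for illustration, not the representation used later in the Earley section): store each constraint as a pair of feature paths and check it against the daughters' feature structures.

# Hypothetical encoding of "S -> NP VP, <NP NUMBER> = <VP NUMBER>" as data.
def follow(path, fs):
    """Follow a feature path (tuple of feature names) inside a nested dict."""
    for feat in path:
        fs = fs.get(feat)
        if fs is None:
            return None
    return fs

rule = {
    "lhs": "S",
    "rhs": ["NP", "VP"],
    "constraints": [(("NP", "NUMBER"), ("VP", "NUMBER"))],
}

def constraints_ok(rule, daughters):
    # daughters maps each RHS category to its feature structure
    for path1, path2 in rule["constraints"]:
        v1 = follow(path1[1:], daughters[path1[0]])
        v2 = follow(path2[1:], daughters[path2[0]])
        if v1 is not None and v2 is not None and v1 != v2:
            return False                   # number mismatch blocks the rule
    return True

print(constraints_ok(rule, {"NP": {"NUMBER": "SG"}, "VP": {"NUMBER": "SG"}}))  # True
print(constraints_ok(rule, {"NP": {"NUMBER": "PL"}, "VP": {"NUMBER": "SG"}}))  # False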

Page 32: CS60057 Speech &Natural Language Processing

Agreement Constraints

S → NP VP
  <NP NUMBER> = <VP NUMBER>

S → Aux NP VP
  <Aux AGREEMENT> = <NP AGREEMENT>

NP → Det NOMINAL
  <Det AGREEMENT> = <NOMINAL AGREEMENT>
  <NP AGREEMENT> = <NOMINAL AGREEMENT>

NOMINAL → Noun
  <NOMINAL AGREEMENT> = <Noun AGREEMENT>

VP → Verb NP
  <VP AGREEMENT> = <Verb AGREEMENT>

Page 33: CS60057 Speech &Natural Language Processing

Agreement Constraints -- Lexicon Entries

Aux → does      <Aux AGREEMENT NUMBER> = SG
                <Aux AGREEMENT PERSON> = 3
Aux → do        <Aux AGREEMENT NUMBER> = PL
Det → these     <Det AGREEMENT NUMBER> = PL
Det → this      <Det AGREEMENT NUMBER> = SG
Verb → serves   <Verb AGREEMENT NUMBER> = SG
                <Verb AGREEMENT PERSON> = 3
Verb → serve    <Verb AGREEMENT NUMBER> = PL
Noun → flights  <Noun AGREEMENT NUMBER> = PL
Noun → flight   <Noun AGREEMENT NUMBER> = SG

Page 34: CS60057 Speech &Natural Language Processing

Head Features

Certain features are copied from children to parent in feature structures. Example: the AGREEMENT feature in NOMINAL is copied into NP.

The features for most grammatical categories are copied from one of the children to the parent. The child that provides the features is called the head of the phrase, and the features copied are referred to as head features. A verb is the head of a verb phrase, and a nominal is the head of a noun phrase.

We may reflect these constructs in feature structures as follows:

NP → Det NOMINAL
  <Det HEAD AGREEMENT> = <NOMINAL HEAD AGREEMENT>
  <NP HEAD> = <NOMINAL HEAD>

VP → Verb NP
  <VP HEAD> = <Verb HEAD>

Page 35: CS60057 Speech &Natural Language Processing

Subcategorization Constraints

For verb phrases, we can represent subcategorization constraints using three techniques:
- Atomic Subcat symbols
- Encoding Subcat lists as feature structures
- Minimal Rule Approach (using lists directly)

We may use any of these representations.

Page 36: CS60057 Speech &Natural Language Processing

Atomic Subcat Symbols

VP → Verb
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = INTRANS

VP → Verb NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = TRANS

VP → Verb NP NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT> = DITRANS

Verb → slept   <Verb HEAD SUBCAT> = INTRANS
Verb → served  <Verb HEAD SUBCAT> = TRANS
Verb → gave    <Verb HEAD SUBCAT> = DITRANS

Page 37: CS60057 Speech &Natural Language Processing

Encoding Subcat Lists as Features

Verb → gave
  <Verb HEAD SUBCAT FIRST CAT> = NP
  <Verb HEAD SUBCAT SECOND CAT> = NP
  <Verb HEAD SUBCAT THIRD> = END

VP → Verb NP NP
  <VP HEAD> = <Verb HEAD>
  <VP HEAD SUBCAT FIRST CAT> = <NP CAT>
  <VP HEAD SUBCAT SECOND CAT> = <NP CAT>
  <VP HEAD SUBCAT THIRD> = END

We are only encoding lists using positional features.

Page 38: CS60057 Speech &Natural Language Processing

Minimal Rule Approach

In fact, we do not need symbols like SECOND and THIRD; they are just used to encode lists. We can use lists directly (similar to LISP):

<SUBCAT FIRST CAT> = NP
<SUBCAT REST FIRST CAT> = NP
<SUBCAT REST REST> = END

Page 39: CS60057 Speech &Natural Language Processing

Subcategorization Frames for Lexical Entries

We can use two different notations to represent subcategorization frames for lexical entries (verbs).

Verb → want
  <Verb HEAD SUBCAT FIRST CAT> = NP

Verb → want
  <Verb HEAD SUBCAT FIRST CAT> = VP
  <Verb HEAD SUBCAT FIRST FORM> = INFINITIVE

Equivalently, as a feature structure:

[ORTH WANT,
 CAT VERB,
 HEAD [SUBCAT < [CAT NP], [CAT VP, HEAD [VFORM INFINITIVE]] >]]

Page 40: CS60057 Speech &Natural Language Processing

Implementing Unification

The representation we have used cannot facilitate the destructive merger aspect of the unification algorithm. For this reason, we add additional features (additional edges to the DAGs) into our feature structures.

Each feature structure will consist of two fields:
- Content field -- This field can be NULL or may contain an ordinary feature structure.
- Pointer field -- This field can be NULL or may contain a pointer to another feature structure.

If the pointer field of a DAG is NULL, the content field of the DAG contains the actual feature structure to be processed. If the pointer field of a DAG is not NULL, the destination of that pointer represents the actual feature structure to be processed.

Page 41: CS60057 Speech &Natural Language Processing

Extended Feature Structures

The ordinary feature structure

[NUMBER SG, PERSON 3]

is represented in extended form as

[CONTENT [NUMBER [CONTENT SG, POINTER NULL],
          PERSON [CONTENT 3,  POINTER NULL]],
 POINTER NULL]

Page 42: CS60057 Speech &Natural Language Processing

Extended DAG

[Figure: the same structure drawn as a DAG with content (C) and pointer (P) edges; the NUMBER node's content is SG, the PERSON node's content is 3, and all pointer fields are NULL.]

Page 43: CS60057 Speech &Natural Language Processing

Unification of Extended DAGs

[NUMBER SG] ⊔ [PERSON 3] = [NUMBER SG, PERSON 3]

[Figure: the two argument structures drawn as extended DAGs, each with content (C) and pointer (P) fields, before unification.]

Page 44: CS60057 Speech &Natural Language Processing

Unification of Extended DAGs (cont.)

[Figure: after unification, the pointer field of one DAG is set to point at the other and a new PERSON edge is added, so both DAGs now dereference to [NUMBER SG, PERSON 3].]

Page 45: CS60057 Speech &Natural Language Processing

Unification Algorithm

function UNIFY(f1, f2) returns fstructure or failure

  f1real ← real contents of f1   /* dereference f1 */
  f2real ← real contents of f2   /* dereference f2 */

  if f1real is Null then { f1.pointer ← f2; return f2 }
  else if f2real is Null then { f2.pointer ← f1; return f1 }
  else if f1real and f2real are identical then { f1.pointer ← f2; return f2 }
  else if f1real and f2real are complex feature structures then {
      f2.pointer ← f1
      for each feature in f2real do {
          otherfeature ← find or create a feature corresponding to feature in f1real
          if UNIFY(feature.value, otherfeature.value) returns failure then
              return failure
      }
      return f1
  }
  else return failure
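A runnable sketch of the same idea under an assumed representation: each node carries content and pointer fields as on the previous slides, atomic contents are strings, and complex contents are dicts of feature name to node. The Node class and helper names are illustrative.

# Sketch of the destructive UNIFY above, using content/pointer nodes.
class Node:
    def __init__(self, content=None):
        self.content = content      # None, a str (atomic), or dict feature -> Node
        self.pointer = None         # forwarding pointer set during unification

def deref(f):
    while f.pointer is not None:
        f = f.pointer
    return f

def unify(f1, f2):
    f1, f2 = deref(f1), deref(f2)
    if f1 is f2:
        return f1                                   # already the same structure
    if f1.content is None:
        f1.pointer = f2; return f2
    if f2.content is None:
        f2.pointer = f1; return f1
    if isinstance(f1.content, str) and f1.content == f2.content:
        f1.pointer = f2; return f2                  # identical atomic values
    if isinstance(f1.content, dict) and isinstance(f2.content, dict):
        f2.pointer = f1
        for feat, value in f2.content.items():
            other = f1.content.setdefault(feat, Node())   # find or create in f1
            if unify(value, other) is None:
                return None                               # failure below
        return f1
    return None                                           # atomic clash

a = Node({"NUMBER": Node("SG")})
b = Node({"NUMBER": Node("SG"), "PERSON": Node("3")})
print(unify(a, b) is not None)            # True: compatible structures
print(sorted(deref(a).content.keys()))    # ['NUMBER', 'PERSON']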

Page 46: CS60057 Speech &Natural Language Processing

Example - Unification of Complex Structures

[AGREEMENT (1) [NUMBER SG],
 SUBJECT [AGREEMENT (1)]]
⊔
[SUBJECT [AGREEMENT [PERSON 3]]]
=
[AGREEMENT (1) [NUMBER SG, PERSON 3],
 SUBJECT [AGREEMENT (1)]]

Page 47: CS60057 Speech &Natural Language Processing

Example - Unification of Complex Structures (cont.)

[Figure: the same unification carried out on the extended DAG representation, following content and pointer edges for the SUB, AGR, NUM and PER features.]

Page 48: CS60057 Speech &Natural Language Processing

Parsing with Unification Constraints

Let us assume that we have augmented our grammar with sets of unification constraints. What changes do we need to make to a parser so that it can use them?

- Building feature structures and associating them with sub-trees.
- Unifying feature structures when sub-trees are created.
- Blocking ill-formed constituents.

Page 49: CS60057 Speech &Natural Language Processing

Earley Parsing with Unification Constraints

What do we have to do to integrate unification constraints into the Earley parser?

- Build feature structures (represented as DAGs) and associate them with states in the chart.
- Unify feature structures as states are advanced in the chart.
- Block ill-formed states from entering the chart.

The main change will be in the COMPLETER function of the Earley parser. This routine will invoke the unifier to unify two feature structures.

Page 50: CS60057 Speech &Natural Language Processing

Building Feature Structures

NP → Det NOMINAL
  <Det HEAD AGREEMENT> = <NOMINAL HEAD AGREEMENT>
  <NP HEAD> = <NOMINAL HEAD>

corresponds to the feature structure

[NP [HEAD (1)],
 Det [HEAD [AGREEMENT (2)]],
 NOMINAL [HEAD (1) [AGREEMENT (2)]]]

Page 51: CS60057 Speech &Natural Language Processing

Augmenting States with DAGs

Each state will have an additional field to contain the DAG representing the feature structure corresponding to the state.

When a rule is first used by PREDICTOR to create a state, the DAG associated with the state will simply consist of the DAG retrieved from the rule. For example:

S → • NP VP, [0,0], [], Dag1
  where Dag1 is the feature structure corresponding to S → NP VP.

NP → • Det NOMINAL, [0,0], [], Dag2
  where Dag2 is the feature structure corresponding to NP → Det NOMINAL.

Page 52: CS60057 Speech &Natural Language Processing

What does COMPLETER do?

When COMPLETER advances the dot in a state, it should unify the feature structure of the newly completed state with the appropriate part of the feature structure of the state being advanced.

If this unification is successful, the new state gets the result of the unification as its DAG and is entered into the chart. If it fails, nothing is entered into the chart.

Page 53: CS60057 Speech &Natural Language Processing

A Completion Example

Parsing the phrase "that flight", after "that" has been processed:

NP → Det • NOMINAL, [0,1], [SDet], Dag1

Dag1 = [NP [HEAD (1)],
        Det [HEAD [AGREEMENT [NUMBER SG]]],
        NOMINAL [HEAD (1) [AGREEMENT [NUMBER SG]]]]

A newly completed state:

NOMINAL → Noun •, [1,2], [SNoun], Dag2

Dag2 = [NOMINAL [HEAD (1)],
        Noun [HEAD (1) [AGREEMENT [NUMBER SG]]]]

To advance the NP state, the parser unifies the feature structure found under the NOMINAL feature of Dag2 with the feature structure found under the NOMINAL feature of Dag1.

Page 54: CS60057 Speech &Natural Language Processing

Earley Parse

function EARLEY-PARSE(words, grammar) returns chart

  ENQUEUE((γ → • S, [0,0], dagγ), chart[0])
  for i from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return chart

Page 55: CS60057 Speech &Natural Language Processing

Predictor and Scanner

procedure PREDICTOR((A → α • B β, [i,j], dagA))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j], dagB), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j], dagA))
  if B ∈ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j] •, [j,j+1], dagB), chart[j+1])
  end

Page 56: CS60057 Speech &Natural Language Processing

Completer and UnifyStates

procedure COMPLETER((B → γ •, [j,k], dagB))
  for each (A → α • B β, [i,j], dagA) in chart[j] do
    if newdag ← UNIFY-STATES(dagB, dagA, B) does not fail then
      ENQUEUE((A → α B • β, [i,k], newdag), chart[k])
  end

procedure UNIFY-STATES(dag1, dag2, cat)
  dag1cp ← CopyDag(dag1)
  dag2cp ← CopyDag(dag2)
  UNIFY(FollowPath(cat, dag1cp), FollowPath(cat, dag2cp))
  return dag2cp

Page 57: CS60057 Speech &Natural Language Processing


Enqueue

procedure ENQUEUE(state,chart-entry)

if state is not subsumed by a state in chart-entry then

Add state at the end of chart-entry

end

Page 58: CS60057 Speech &Natural Language Processing


Probabilistic Parsing

Slides by Markus Dickinson, Georgetown University

Page 59: CS60057 Speech &Natural Language Processing

Motivation and Outline

Previously, we used CFGs to parse with, but some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse.

How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus ...

Where we're going:
- Probabilistic Context-Free Grammars (PCFGs)
- Lexicalized PCFGs
- Dependency Grammars

Page 60: CS60057 Speech &Natural Language Processing

Statistical Parsing

Basic idea:
- Start with a treebank: a collection of sentences with syntactic annotation, i.e., already-parsed sentences.
- Examine which parse trees occur frequently.
- Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule from its frequency.

That is, we'll have a CFG augmented with probabilities.

Page 61: CS60057 Speech &Natural Language Processing

Using Probabilities to Parse

P(T): probability of a particular parse tree

P(T) = Π_{n∈T} p(r(n))

i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree.

Example: given the probabilities on p. 449, compute the probability of the parse tree shown on the slide (the tree for "book that flight"; see the next slide).

Page 62: CS60057 Speech &Natural Language Processing

Computing probabilities

We have the following rules and probabilities (adapted from Figure 12.1):

S → VP        .05
VP → V NP     .40
NP → Det N    .20
V → book      .30
Det → that    .05
N → flight    .25

P(T) = P(S → VP) * P(VP → V NP) * ... * P(N → flight)
     = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 × 10⁻⁵
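A one-line check of this arithmetic: the parse probability is just the product of the probabilities of the rules used in the tree.

# Quick check of the product above.
from math import prod

rule_probs = {
    "S -> VP": 0.05, "VP -> V NP": 0.40, "NP -> Det N": 0.20,
    "V -> book": 0.30, "Det -> that": 0.05, "N -> flight": 0.25,
}
print(prod(rule_probs.values()))   # ≈ 1.5e-05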

Page 63: CS60057 Speech &Natural Language Processing

Using probabilities

So, the probability for that parse is 0.000015. What's the big deal?

Probabilities are useful for comparing with other probabilities. Whereas we couldn't decide between two parses using a regular CFG, we now can.

For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP:

A: [book [TWA] [flights]]
B: [book [TWA flights]]

Probabilities allow us to choose B (see Figure 12.2).

Page 64: CS60057 Speech &Natural Language Processing

Obtaining the best parse

Call the best parse T(S), where S is your sentence: the tree with the highest probability, i.e.

T(S) = argmax_{T ∈ parse-trees(S)} P(T)

We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse. CYK is a form of dynamic programming, and a chart parser like the Earley parser.

Page 65: CS60057 Speech &Natural Language Processing

The CYK algorithm

Base case:
- Add words to the chart.
- Store P(A → w_i) for every category A in the chart.

Recursive case (this is what makes it dynamic programming, because we only calculate B and C once):
- Rules must be of the form A → B C, i.e., exactly two items on the RHS (we call this Chomsky Normal Form, CNF).
- Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C): P(B) * P(C) * P(A → B C).
- For a given A, only keep the maximum probability (again, this is dynamic programming).
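A compact sketch of probabilistic CYK over a toy CNF grammar. The grammar, the dict-based chart, and the collapsed rule S → V NP (standing in for S → VP and VP → V NP, probability .05 * .40 = .02, so the grammar stays binary) are illustrative assumptions, not the textbook's pseudocode.

# Illustrative probabilistic CYK: chart[(i, j)][A] holds the best probability
# of deriving words[i:j] from category A, for a grammar in CNF.
from collections import defaultdict

lexical = {("V", "book"): 0.30, ("Det", "that"): 0.05, ("N", "flight"): 0.25}
binary = {("VP", "V", "NP"): 0.40, ("NP", "Det", "N"): 0.20, ("S", "V", "NP"): 0.02}

def cyk(words):
    n = len(words)
    chart = defaultdict(dict)                       # (i, j) -> {A: prob}
    for i, w in enumerate(words):                   # base case: A -> w_i
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = p
    for span in range(2, n + 1):                    # recursive case: A -> B C
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = chart[(i, k)][B] * chart[(k, j)][C] * p
                        if prob > chart[(i, j)].get(A, 0.0):   # keep the max
                            chart[(i, j)][A] = prob
    return chart[(0, n)]

print(cyk(["book", "that", "flight"]))   # {'VP': ≈3e-04, 'S': ≈1.5e-05}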

Page 66: CS60057 Speech &Natural Language Processing

Problems with PCFGs

It's still only a CFG, so dependencies on non-CFG information are not captured. E.g., pronouns are more likely to be subjects than objects:

P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj]

It ignores lexical information (statistics), which is usually crucial for disambiguation:

(T1) America sent [[250,000 soldiers] [into Iraq]]
(T2) America sent [250,000 soldiers] [into Iraq]

send with an into-PP is always attached high (T2) in the PTB!

To handle lexical information, we'll turn to lexicalized PCFGs.

Page 67: CS60057 Speech &Natural Language Processing

Lexicalized Grammars

Head information is passed up in a syntactic analysis, e.g. VP[head *1] → V[head *1] NP.

If you follow this down all the way to the bottom of a tree, you wind up with a head word. In some sense, we can say that Book that flight is not just an S, but an S rooted in book; thus, book is the headword of the whole sentence.

By adding headword information to nonterminals, we wind up with a lexicalized grammar.

Page 68: CS60057 Speech &Natural Language Processing

Lexicalized PCFGs

Lexicalized parse trees: each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter. The headword for a node is set to the headword of its head daughter.

[Tree on slide: for "book that flight", S and VP carry the headword book, while NP and NOMINAL carry flight.]

Page 69: CS60057 Speech &Natural Language Processing

Incorporating Head Probabilities: Wrong Way

Simply adding the headword w to a node won't work. The node A becomes A[w], e.g.

P(A[w] → β | A) = Count(A[w] → β) / Count(A)

The probabilities are too small, i.e., we don't have a big enough corpus to calculate them:

VP(dumped) → VBD(dumped) NP(sacks) PP(into)   3 × 10⁻¹⁰
VP(dumped) → VBD(dumped) NP(cats) PP(into)    8 × 10⁻¹¹

These probabilities are tiny, and others will never occur.

Page 70: CS60057 Speech &Natural Language Processing

Incorporating head probabilities: Right way

Previously, we conditioned on the mother node (A): P(A → β | A).

Now, we can condition on the mother node and the headword of A, h(A): P(A → β | A, h(A)). We're no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head.

E.g., P(VP → VBD NP PP | VP, dumped)

Page 71: CS60057 Speech &Natural Language Processing

Calculating rule probabilities

We'll write the probability more generally as P(r(n) | n, h(n)), where n = node, r = rule, and h = headword.

We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total:

P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β)
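A toy illustration of this relative-frequency estimate; the counts are made up for the example, not taken from a treebank.

# Toy relative-frequency estimate of P(rule | category, headword).
from collections import Counter

# (category, headword, expansion) counts as they might be collected from a treebank
counts = Counter({
    ("VP", "dumped", "VBD NP PP"): 6,
    ("VP", "dumped", "VBD NP"): 3,
    ("VP", "dumped", "VBD PP"): 1,
})

def rule_prob(cat, head, expansion):
    total = sum(c for (c_cat, c_head, _), c in counts.items()
                if c_cat == cat and c_head == head)
    return counts[(cat, head, expansion)] / total if total else 0.0

print(rule_prob("VP", "dumped", "VBD NP PP"))   # 6 / 10 = 0.6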

Page 72: CS60057 Speech &Natural Language Processing

Adding info about word-word dependencies

We want to take into account one other factor: the probability of being a head word (in a given context), P(h(n) = word | ...).

We condition this probability on two things: (1) the category of the node, n, and (2) the headword of the mother, h(m(n)):

P(h(n) = word | n, h(m(n))), shortened as P(h(n) | n, h(m(n)))

e.g. P(sacks | NP, dumped)

What we're really doing is factoring in how words relate to each other. We will call this a dependency relation later: sacks is dependent on dumped, in this case.

Page 73: CS60057 Speech &Natural Language Processing

Putting it all together

See p. 459 for an example lexicalized parse tree for workers dumped sacks into a bin.

For rules r, category n, head h, mother m:

P(T) = Π_{n∈T} p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))

The first factor, e.g. P(VP → VBD NP PP | VP, dumped), carries subcategorization information; the second, e.g. P(sacks | NP, dumped), carries dependency information between words.

Page 74: CS60057 Speech &Natural Language Processing

Dependency Grammar

Capturing relations between words (e.g. dumped and sacks) is moving in the direction of dependency grammar (DG).

In DG, there is no such thing as constituency. The structure of a sentence is purely the binary relations between words. John loves Mary is represented as:

LOVE → JOHN
LOVE → MARY

where A → B means that B depends on A.

Page 75: CS60057 Speech &Natural Language Processing


Dependency parsing

Page 76: CS60057 Speech &Natural Language Processing

Dependency Grammar/Parsing

A sentence is parsed by relating each word to the other words in the sentence which depend on it.

- The idea of dependency structure goes back a long way: to Pāṇini's grammar (c. 5th century BCE).
- Constituency is a new-fangled 20th-century invention.
- Modern work is often linked to the work of L. Tesnière (1959); it is the dominant approach in the "East" (Eastern bloc/East Asia).
- Among the earliest kinds of parsers in NLP, even in the US: David Hays, one of the founders of computational linguistics, built an early (first?) dependency parser (Hays 1962).

Page 77: CS60057 Speech &Natural Language Processing

Dependency structure

Words are linked from head (regent) to dependent. Warning: some people draw the arrows one way, some the other (Tesnière has them point from head to dependent). We usually add a fake ROOT so that every word is a dependent.

Example (dependency arcs shown on the slide): Shaw Publishing acquired 30 % of American City in March

Page 78: CS60057 Speech &Natural Language Processing

Relation between CFG and dependency parse

A dependency grammar has a notion of a head. Officially, CFGs don't, but modern linguistic theory and all modern statistical parsers (Charniak, Collins, Stanford, ...) do, via hand-written phrasal "head rules":
- The head of a Noun Phrase is a noun/number/adj/...
- The head of a Verb Phrase is a verb/modal/...

The head rules can be used to extract a dependency parse from a CFG parse (follow the heads).

A phrase structure tree can be obtained from a dependency tree, but the dependents are flat (no VP!).

Page 79: CS60057 Speech &Natural Language Processing

Propagating head words

A small set of rules propagates heads.

[Tree on slide for "John Smith, the president of IBM, announced his resignation yesterday": the headwords propagate as S(announced), NP(Smith) over NNP John, NNP Smith; NP(president) over DT the, NN president, with PP(of) over IN of and NP over NNP IBM; VP(announced) over VBD announced, NP(resignation) over PRP$ his, NN resignation, and NP over NN yesterday.]

Page 80: CS60057 Speech &Natural Language Processing

Extracted structure

NB: not all dependencies are shown here.

Dependencies are inherently untyped, though some work, like Collins (1996), types them using the phrasal categories.

[Figure on slide: dependencies extracted from the tree, e.g. within [John Smith], between [the president] and of [IBM], and from announced to [his resignation] and [yesterday], each shown with the phrasal configuration (NP; NP-NP; S → NP VP; VP → VBD NP) it came from.]

Page 81: CS60057 Speech &Natural Language Processing

Dependency Conditioning Preferences

Sources of information:
- bilexical dependencies
- distance of dependencies
- valency of heads (number of dependents)

A word's dependents (adjuncts, arguments) tend to fall near it in the string.

These next 6 slides are based on slides by Jason Eisner and Noah Smith.

Page 82: CS60057 Speech &Natural Language Processing

Probabilistic dependency grammar: generative model

1. Start with the left wall $
2. Generate the root w0
3. Generate left children w-1, w-2, ..., w-ℓ from the FSA λ_w0
4. Generate right children w1, w2, ..., wr from the FSA ρ_w0
5. Recurse on each wi for i in {-ℓ, ..., -1, 1, ..., r}, sampling αi (steps 2-4)
6. Return α-ℓ ... α-1 w0 α1 ... αr

[Figure: the root w0 below the wall $, with left children w-1 ... w-ℓ generated by λ_w0 and right children w1 ... wr generated by ρ_w0, each recursively expanded.]
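A sketch of this generative story in code; the left/right automata are simplified here to a single stop probability plus a uniform choice over a toy vocabulary, which is purely illustrative and not Eisner's actual parameterisation.

# Illustrative generative sketch: each head generates left and right dependent
# sequences, then recurses on each dependent.
import random

VOCAB = ["takes", "it", "two", "to", "tango"]
P_STOP = 0.6   # probability of generating no further child on a side

def generate(head, depth=0):
    """Return the subtree of `head` as a flat word sequence (left ... head ... right)."""
    if depth > 3:                              # keep the toy recursion bounded
        return [head]
    left, right = [], []
    while random.random() > P_STOP:            # generate left children w-1, w-2, ...
        left.append(generate(random.choice(VOCAB), depth + 1))
    while random.random() > P_STOP:            # generate right children w1, w2, ...
        right.append(generate(random.choice(VOCAB), depth + 1))
    sequence = []
    for child in reversed(left):               # w-l ... w-1
        sequence += child
    sequence.append(head)
    for child in right:                        # w1 ... wr
        sequence += child
    return sequence

random.seed(0)
root = random.choice(VOCAB)                    # step 2: generate the root w0
print(generate(root))                          # a short word sequence around the root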

Page 83: CS60057 Speech &Natural Language Processing

Naïve Recognition/Parsing

Example sentence: It takes two to tango

[Figure: a naïve parse item spans positions i..j..k with a parent p and child c, giving O(n⁵) combinations, or O(n⁵ N³) with N nonterminals.]

Page 84: CS60057 Speech &Natural Language Processing

Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)

- Triangles: spans of words where the tall side of the triangle is the head, the other side is a dependent, and no non-head word is expecting more dependents.
- Trapezoids: spans of words where the larger side is the head, the smaller side is a dependent, and the smaller side is still looking for dependents on its side of the trapezoid.

Page 85: CS60057 Speech &Natural Language Processing

Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)

[Figure: a parse of "It takes two to tango" built from triangles and trapezoids, ending in the goal item.]

One trapezoid per dependency. A triangle is a head with some left (or right) subtrees.

Page 86: CS60057 Speech &Natural Language Processing

Cubic Recognition/Parsing (Eisner & Satta, 1999)

[Figure: the combination rules. Joining items over i..j and j..k gives O(n³) combinations for building trapezoids and O(n³) for building triangles; assembling the goal item over 0..i..n takes O(n) combinations.]

This gives O(n³) dependency grammar parsing.

Page 87: CS60057 Speech &Natural Language Processing

Evaluation of Dependency Parsing: simply use (labeled) dependency accuracy

GOLD                           PARSED
1 2 We       SUBJ              1 2 We       SUBJ
2 0 eat      ROOT              2 0 eat      ROOT
3 5 the      DET               3 4 the      DET
4 5 cheese   MOD               4 2 cheese   OBJ
5 2 sandwich SUBJ              5 2 sandwich PRED

Accuracy = number of correct dependencies / total number of dependencies
         = 2 / 5 = 0.40 = 40%
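A short calculation of the labeled accuracy figure above; the tuple layout (dependent position, head position, word, label) is an assumption mirroring the table.

# Labeled dependency accuracy: a dependency is correct if head and label match.
gold = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 5, "the", "DET"),
        (4, 5, "cheese", "MOD"), (5, 2, "sandwich", "SUBJ")]
parsed = [(1, 2, "We", "SUBJ"), (2, 0, "eat", "ROOT"), (3, 4, "the", "DET"),
          (4, 2, "cheese", "OBJ"), (5, 2, "sandwich", "PRED")]

correct = sum(1 for g, p in zip(gold, parsed) if g[1] == p[1] and g[3] == p[3])
print(correct / len(gold))   # 2 / 5 = 0.4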

Page 88: CS60057 Speech &Natural Language Processing

McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers

- Builds a discriminative dependency parser; can condition on rich features in that context.
- Best-known recent dependency parser; lots of recent dependency parsing activity connected with the CoNLL 2006/2007 shared tasks.
- Doesn't/can't report constituent LP/LR, but evaluating dependencies correct: accuracy is similar to, but a fraction below, dependencies extracted from Collins: 90.9% vs. 91.4%; combining them gives 92.2% [all lengths].
- Stanford parser on lengths up to 40: pure generative dependency model 85.0%, lexicalized factored parser 91.0%.

Page 89: CS60057 Speech &Natural Language Processing

McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers

- The score of a parse is the sum of the scores of its dependencies.
- Each dependency's score is a linear function of features times weights.
- Feature weights are learned by MIRA, an online large-margin algorithm, but you could think of it as using a perceptron or maxent classifier.
- Features cover: head and dependent word and POS separately; head and dependent word and POS bigram features; words between head and dependent; length and direction of the dependency.
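A sketch of the edge-factored scoring idea described above; the feature templates and weights are made-up toy values, and MIRA's weight updates are not shown.

# Toy edge-factored scorer: the parse score is the sum over dependencies of a
# dot product between sparse features and learned weights.
weights = {"hw=eat|dw=We": 1.2, "hpos=V|dpos=PRON": 0.7,
           "dist=1": 0.3, "dir=left": 0.1}

def edge_features(head, dep):
    return [f"hw={head['word']}|dw={dep['word']}",
            f"hpos={head['pos']}|dpos={dep['pos']}",
            f"dist={abs(head['idx'] - dep['idx'])}",
            f"dir={'left' if dep['idx'] < head['idx'] else 'right'}"]

def edge_score(head, dep):
    return sum(weights.get(f, 0.0) for f in edge_features(head, dep))

def parse_score(dependencies):
    return sum(edge_score(h, d) for h, d in dependencies)

eat = {"word": "eat", "pos": "V", "idx": 2}
we = {"word": "We", "pos": "PRON", "idx": 1}
print(parse_score([(eat, we)]))   # 1.2 + 0.7 + 0.3 + 0.1 = 2.3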

Page 90: CS60057 Speech &Natural Language Processing

Extracting grammatical relations from statistical constituency parsers

[de Marneffe et al., LREC 2006]

Exploit the high-quality syntactic analysis done by statistical constituency parsers to get the grammatical relations [typed dependencies]. Dependencies are generated by pattern-matching rules.

Example: Bills on ports and immigration were submitted by Senator Brownback

[Figure: the constituency parse of the sentence and the extracted typed dependencies: nsubjpass(submitted, Bills), auxpass(submitted, were), agent(submitted, Brownback), nn(Brownback, Senator), prep_on(Bills, ports), cc_and(ports, immigration).]

Page 91: CS60057 Speech &Natural Language Processing

Evaluating Parser Output

Dependency relations are also useful for comparing parser output to a treebank.

Traditional measures of parser accuracy:
- Labeled bracketing precision: # correct constituents in parse / # constituents in parse
- Labeled bracketing recall: # correct constituents in parse / # constituents in treebank parse

There are known problems with these measures, so people are trying to use dependency-based measures instead: how many dependency relations did the parse get correct?