
Page 1: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Semi-Supervised Approaches for Learning to Parse Natural Languages

Slides are from Rebecca Hwa, Ray Mooney

Page 2: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

The Role of Parsing in Language Applications…

• As a stand-alone application
  – Grammar checking
• As a pre-processing step
  – Question Answering
  – Information extraction
• As an integral part of a model
  – Speech Recognition
  – Machine Translation

Page 3: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Parsing

• Parsers provide syntactic analyses of sentences

Input: I saw her
Parse: (S (NP (PN I)) (VP (VB saw) (NP (PN her))))

Page 4: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Challenges in Building Parsers

• Disambiguation
  – Lexical disambiguation
  – Structural disambiguation
• Rule Exceptions
  – Many lexical dependencies
• Manual Grammar Construction
  – Limited coverage
  – Difficult to maintain

Page 5: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Meeting these Challenges: Statistical Parsing

• Disambiguation?
  – Resolve local ambiguities with global likelihood
• Rule Exceptions?
  – Lexicalized representation
• Manual Grammar Construction?
  – Automatic induction from large corpora
  – A new challenge: how to obtain training corpora?
  – Make better use of unlabeled data with machine learning techniques and linguistic knowledge

Page 6: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Roadmap

• Parsing as a learning problem
• Semi-supervised approaches
  – Sample selection
  – Co-training
  – Corrected Co-training
• Conclusion and further directions

Page 7: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Parsing Ambiguities

Input: “I saw her duck with a telescope”

T1 (the PP attaches to the NP “her duck”):
(S (NP (PN I)) (VP (VB saw) (NP (NP (PN her) (N duck)) (PP (P with) (NP (DET a) (N telescope))))))

T2 (the PP attaches to the VP):
(S (NP (PN I)) (VP (VB saw) (NP (PN her) (N duck)) (PP (P with) (NP (DET a) (N telescope)))))

Page 8: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Disambiguation with Statistical Parsing

Pr(T1 | W) vs. Pr(T2 | W)

[The two candidate parse trees T1 and T2 from the previous slide]

W = “I saw her duck with a telescope”

Page 9: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

A Statistical Parsing Model

• Probabilistic Context-Free Grammar (PCFG)
• Associate probabilities with production rules
• Likelihood of the parse is computed from the rules used
• Learn rule probabilities from training data

Example of PCFG rules:
0.7 NP → DET N
0.3 NP → PN
0.5 DET → a
0.4 DET → the
0.1 DET → an
...

The most likely parse for a sentence W:

argmax_{Ti ∈ Trees(W)} Pr(Ti | W) = argmax_{Ti ∈ Trees(W)} Pr(Ti, W) / Pr(W)

Pr(Ti, W) = Π_r Pr(RHS_r | LHS_r)

where r ranges over the production rules used in Ti.
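To make the model concrete, here is a minimal Python sketch (not from the slides): a PCFG stored as a rule-to-probability table, with a parse represented simply as the list of productions it uses. Candidate parses are assumed to come from a parser.

    from math import prod

    # Toy PCFG from the slide: Pr(RHS | LHS) for each production.
    pcfg = {
        ("NP", ("DET", "N")): 0.7,
        ("NP", ("PN",)): 0.3,
        ("DET", ("a",)): 0.5,
        ("DET", ("the",)): 0.4,
        ("DET", ("an",)): 0.1,
    }

    def tree_probability(productions):
        """Pr(Ti, W): product of Pr(RHS | LHS) over the rules used in Ti."""
        return prod(pcfg[rule] for rule in productions)

    def best_parse(candidate_parses):
        """argmax over Ti of Pr(Ti | W); Pr(W) is constant, so rank by Pr(Ti, W)."""
        return max(candidate_parses, key=tree_probability)

    # e.g. an NP built with NP -> DET N and DET -> the:
    tree_probability([("NP", ("DET", "N")), ("DET", ("the",))])   # 0.28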

Page 10: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Sentence Probability

• Assume productions for each node are chosen independently.
• Probability of a derivation is the product of the probabilities of its productions.

P(D1) = 0.1 x 0.5 x 0.5 x 0.6 x 0.6 x 0.5 x 0.3 x 1.0 x 0.2 x 0.2 x 0.5 x 0.8 = 0.0000216

[Derivation D1 for “book the flight through Houston”: the PP “through Houston” attaches to the Nominal “flight”, with each production annotated with its probability]

Page 11: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Syntactic Disambiguation

• Resolve ambiguity by picking the most probable parse tree.

[Derivation D2 for “book the flight through Houston”: the PP attaches to the VP, with each production annotated with its probability]

P(D2) = 0.1 x 0.3 x 0.5 x 0.6 x 0.5 x 0.6 x 0.3 x 1.0 x 0.5 x 0.2 x 0.2 x 0.8 = 0.00001296

Page 12: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Sentence Probability

• Probability of a sentence is the sum of the probabilities of all of its derivations.

P(“book the flight through Houston”) = P(D1) + P(D2) = 0.0000216 + 0.00001296 = 0.00003456
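The arithmetic on these three slides can be checked directly; a small sketch, with the rule probabilities read off the annotated derivations:

    from math import prod, isclose

    # Rule probabilities along each derivation, as annotated on the trees.
    d1 = [0.1, 0.5, 0.5, 0.6, 0.6, 0.5, 0.3, 1.0, 0.2, 0.2, 0.5, 0.8]
    d2 = [0.1, 0.3, 0.5, 0.6, 0.5, 0.6, 0.3, 1.0, 0.5, 0.2, 0.2, 0.8]

    p_d1, p_d2 = prod(d1), prod(d2)   # 0.0000216 and 0.00001296
    sentence_prob = p_d1 + p_d2       # sum over all derivations of the sentence
    assert isclose(sentence_prob, 0.00003456)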

Page 13: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

PCFG: Supervised Training

• If parse trees are provided for training sentences, a grammar and its parameters can all be estimated directly from counts accumulated from the treebank (with appropriate smoothing).

Tree Bank → Supervised PCFG Training → English grammar:

S → NP VP      0.9
S → VP         0.1
NP → Det A N   0.5
NP → NP PP     0.3
NP → PropN     0.2
A → ε          0.6
A → Adj A      0.4
PP → Prep NP   1.0
VP → V NP      0.7
VP → VP PP     0.3

[Example treebank trees such as: (S (NP John) (VP (V put) (NP the dog) (PP in the pen)))]

Page 14: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Estimating Production Probabilities

• The set of production rules can be taken directly from the set of rewrites in the treebank.
• Parameters can be directly estimated from frequency counts in the treebank.

P(α → β | α) = count(α → β) / Σ_γ count(α → γ) = count(α → β) / count(α)
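A sketch of this maximum-likelihood estimate, assuming each treebank tree has already been flattened into its list of (LHS, RHS) productions (that representation is an assumption, not part of the slides):

    from collections import Counter

    def estimate_pcfg(treebank):
        """P(alpha -> beta | alpha) = count(alpha -> beta) / count(alpha)."""
        rule_counts, lhs_counts = Counter(), Counter()
        for tree in treebank:            # each tree: a list of (lhs, rhs) pairs
            for lhs, rhs in tree:
                rule_counts[(lhs, rhs)] += 1
                lhs_counts[lhs] += 1     # count(alpha) sums over all its rewrites
        return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}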

Page 15: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Vanilla PCFG Limitations

• Since probabilities of productions do not rely on specific words or concepts, only general structural disambiguation is possible (e.g., prefer to attach PPs to Nominals).
• Consequently, vanilla PCFGs cannot resolve syntactic ambiguities that require semantics to resolve, e.g., “ate with a fork” vs. “ate with meatballs”.
• In order to work well, PCFGs must be lexicalized, i.e., productions must be specialized to specific words by including their head word in their LHS non-terminals (e.g., VP-ate).

Page 16: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Example of Importance of Lexicalization

• A general preference for attaching PPs to NPs rather than VPs can be learned by a vanilla PCFG.
• But the desired preference can depend on specific words.

John put the dog in the pen.

[Figure: the PCFG parser, with the English grammar shown earlier, produces the desired parse, in which the PP “in the pen” attaches to the VP headed by “put”]

Page 17: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Example of Importance of Lexicalization

• A general preference for attaching PPs to NPs rather than VPs can be learned by a vanilla PCFG.
• But the desired preference can depend on specific words.

John put the dog in the pen.

[Figure: the vanilla PCFG instead attaches “in the pen” inside the NP “the dog”, producing an incorrect parse, marked X]

Page 18: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Head Words

• Syntactic phrases usually have a word in them that is most “central” to the phrase.
• Linguists have defined the concept of a lexical head of a phrase.
• Simple rules can identify the head of any phrase by percolating head words up the parse tree (see the sketch below).
  – Head of a VP is the main verb
  – Head of an NP is the main noun
  – Head of a PP is the preposition
  – Head of a sentence is the head of its VP
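A minimal sketch of head percolation under the rules above; the head table and the (label, children) tree encoding are illustrative assumptions, not the slides’ notation:

    # Hypothetical head table: which child category supplies the head.
    HEAD_CHILD = {"S": "VP", "VP": "VBD", "NP": "NN", "PP": "IN"}

    def head_word(tree):
        """Percolate the head word up from the head child of each phrase."""
        label, children = tree
        if isinstance(children, str):           # preterminal: (POS, word)
            return children
        wanted = HEAD_CHILD.get(label)
        for child in children:
            if wanted and child[0].startswith(wanted):
                return head_word(child)
        return head_word(children[-1])          # fallback: rightmost child

    tree = ("S", [("NP", [("NNP", "John")]),
                  ("VP", [("VBD", "liked"),
                          ("NP", [("DT", "the"), ("NN", "dog")])])])
    assert head_word(tree) == "liked"           # head of S = head of its VP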

Page 19: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Lexicalized Productions

• Specialized productions can be generated by including the head word and its POS of each non-terminal as part of that non-terminal’s symbol.

[Lexicalized tree for “John liked the dog in the pen”: every non-terminal carries its head, e.g. S[liked-VBD], VP[liked-VBD], NP[dog-NN], PP[in-IN], NP[pen-NN]]

Nominal[dog-NN] → Nominal[dog-NN] PP[in-IN]

Page 20: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Lexicalized Productions

[Lexicalized tree for “John put the dog in the pen”: here the PP attaches to the VP, so the rule is lexicalized as:]

VP[put-VBD] → VP[put-VBD] PP[in-IN]

Page 21: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Parameterizing Lexicalized Productions

• Accurately estimating parameters on such a large number of very specialized productions could require enormous amounts of treebank data.

• Need some way of estimating parameters for lexicalized productions that makes reasonable independence assumptions so that accurate probabilities for very specific rules can be learned.

Page 22: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Collins’ Parser

• Collins’ (1999) parser assumes a simple generative model of lexicalized productions.
• Models productions based on context to the left and the right of the head daughter:
  – LHS → Ln Ln-1 … L1 H R1 … Rm-1 Rm
• First generate the head (H), then repeatedly generate left (Li) and right (Ri) context symbols until the symbol STOP is generated.

Page 23: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Sample Production Generation

VP[put-VBD] → VBD[put-VBD] NP[dog-NN] PP[in-IN]

is generated around the head as:

STOP  VBD[put-VBD]  NP[dog-NN]  PP[in-IN]  STOP
(L1)      (H)          (R1)        (R2)     (R3)

PL(STOP | VP[put-VBD]) × PH(VBD | VP[put-VBD]) × PR(NP[dog-NN] | VP[put-VBD]) × PR(PP[in-IN] | VP[put-VBD]) × PR(STOP | VP[put-VBD])
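Read as a probability, the generation above is a product of one head factor and the STOP-terminated left and right context factors. A sketch, where the parameter tables and their values are purely hypothetical:

    def production_probability(lhs, head, lefts, rights, P_H, P_L, P_R):
        """P_L/P_H/P_R factors for LHS -> L1..Ln H R1..Rm, STOP-terminated."""
        p = P_H[(head, lhs)]
        for sym in list(lefts) + ["STOP"]:
            p *= P_L[(sym, lhs)]
        for sym in list(rights) + ["STOP"]:
            p *= P_R[(sym, lhs)]
        return p

    # VP[put-VBD] -> VBD[put-VBD] NP[dog-NN] PP[in-IN], as on the slide.
    P_H = {("VBD[put-VBD]", "VP[put-VBD]"): 0.9}       # hypothetical values
    P_L = {("STOP", "VP[put-VBD]"): 0.95}
    P_R = {("NP[dog-NN]", "VP[put-VBD]"): 0.3,
           ("PP[in-IN]", "VP[put-VBD]"): 0.2,
           ("STOP", "VP[put-VBD]"): 0.5}

    p = production_probability("VP[put-VBD]", "VBD[put-VBD]",
                               [], ["NP[dog-NN]", "PP[in-IN]"], P_H, P_L, P_R)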

Page 24: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Estimating Production Generation Parameters

• Estimate PH, PL, and PR parameters from treebank data.

PR(NP[dog-NN] | VP[put-VBD]) = Count(NP[dog-NN] right of head in a VP[put-VBD] production) / Count(symbol right of head in a VP[put-VBD] production)

PR(PP[in-IN] | VP[put-VBD]) = Count(PP[in-IN] right of head in a VP[put-VBD] production) / Count(symbol right of head in a VP[put-VBD] production)

• Smooth estimates by linearly interpolating with simpler models conditioned on just the POS tag or no lexical info:

smPR(PP[in-IN] | VP[put-VBD]) = λ1 PR(PP[in-IN] | VP[put-VBD]) + (1 - λ1) (λ2 PR(PP[in-IN] | VP[VBD]) + (1 - λ2) PR(PP[in-IN] | VP))
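A sketch of that linear interpolation; the lambda weights here are assumptions (in practice they would be tuned, e.g. on held-out data):

    def smoothed_pr(p_lex, p_pos, p_bare, lam1=0.7, lam2=0.6):
        """Back off from fully lexicalized to POS-only to unlexicalized."""
        return lam1 * p_lex + (1 - lam1) * (lam2 * p_pos + (1 - lam2) * p_bare)

    # e.g. PR(PP[in-IN] | VP[put-VBD]) backed off to VP[VBD], then plain VP:
    p = smoothed_pr(p_lex=0.20, p_pos=0.15, p_bare=0.10)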

Page 25: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Parsing Evaluation Metrics

• PARSEVAL metrics measure the fraction of the constituents that match between the computed and human parse trees. If P is the system’s parse tree and T is the human parse tree (the “gold standard”):
  – Recall = (# correct constituents in P) / (# constituents in T)
  – Precision = (# correct constituents in P) / (# constituents in P)
• Labeled precision and labeled recall require getting the non-terminal label on the constituent node correct for it to count as correct.
• F1 is the harmonic mean of precision and recall.
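A sketch of these metrics, simplifying a parse to a set of labeled spans (label, start, end); this toy version ignores duplicate constituents:

    def parseval(system, gold):
        """Labeled precision/recall/F1 over (label, start, end) constituents."""
        correct = len(set(system) & set(gold))
        precision = correct / len(system)
        recall = correct / len(gold)
        f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f1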

Page 26: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Treebank Results

• Results of current state-of-the-art systems on the English Penn WSJ treebank are slightly greater than 90% labeled precision and recall.

Page 27: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Supervised Learning Avoids Manual Construction

• Training examples are pairs of problems and answers
• Training examples for parsing: a collection of (sentence, parse tree) pairs (a treebank)
  – From the treebank, get maximum likelihood estimates for the parsing model
• New challenge: treebanks are difficult to obtain
  – Needs human experts
  – Takes years to complete

Page 28: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Learning to Classify vs. Learning to Parse

Learning to classify: train a model to decide whether a prepositional phrase should modify the verb before it or the noun. Training examples:

(v, saw, duck, with, telescope)
(n, saw, duck, with, feathers)
(v, saw, stars, with, telescope)
(n, saw, stars, with, Oscars)

Learning to parse: train a model to decide the most likely parse for a sentence W. Training examples:

[S [NP-SBJ [NNP Ford] [NNP Motor] [NNP Co.]] [VP [VBD acquired] [NP [NP [CD 5] [NN %]] [PP [IN of] [NP [NP [DT the] [NNS shares]] [PP [IN in] [NP [NNP Jaguar] [NNP PLC]]]]]]] .]

[S [NP-SBJ [NNP Pierre] [NNP Vinken]] [VP [MD will] [VP [VB join] [NP [DT the] [NN board]] [PP [IN as] [NP [DT a] [NN director]]]]] .]

Page 29: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Hwa’s Approach

• Sample selection
  – Reduce the amount of training data by picking more useful examples
• Co-training
  – Improve parsing performance from unlabeled data
• Corrected Co-training
  – Combine ideas from both sample selection and co-training

Page 30: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Roadmap

• Parsing as a learning problem
• Semi-supervised approaches
  – Sample selection
    • Overview
    • Scoring functions
    • Evaluation
  – Co-training
  – Corrected Co-training
• Conclusion and further directions

Page 31: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Sample Selection

• Assumption
  – Have lots of unlabeled data (cheap resource)
  – Have a human annotator (expensive resource)
• Iterative training session
  – Learner selects sentences to learn from
  – Annotator labels these sentences
• Goal: predict the benefit of annotation
  – Learner selects sentences with the highest Training Utility Values (TUVs)
  – Key issue: scoring function to estimate TUV

Page 32: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Algorithm

Initialize:
  Train the parser on a small treebank (seed data) to get the initial parameter values.
Repeat:
  Create a candidate set by randomly sampling the unlabeled pool.
  Estimate the TUV of each sentence in the candidate set with a scoring function, f.
  Pick the n sentences with the highest score (according to f).
  A human labels these n sentences, and they are added to the training set.
  Re-train the parser with the updated training set.
Until (no more data).
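A sketch of this loop; the parser and annotator interfaces (train, annotate) and the candidate-pool size are assumptions, with score_f standing in for the scoring function f:

    import random

    def sample_selection(parser, seed, unlabeled, annotate, score_f, n, rounds):
        """Active learning: train, pick the n highest-TUV sentences, re-train."""
        labeled = list(seed)
        parser.train(labeled)
        for _ in range(rounds):
            candidates = random.sample(unlabeled, min(500, len(unlabeled)))
            for sent in sorted(candidates, key=score_f, reverse=True)[:n]:
                unlabeled.remove(sent)
                labeled.append(annotate(sent))   # the human labels the sentence
            parser.train(labeled)                # re-train on the updated set
        return parser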

Page 33: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Scoring Function

• Approximate the TUV of each sentence
  – True TUVs are not known
• Need relative ranking
• Ranking criteria
  – Knowledge about the domain
    • e.g., sentence clusters, sentence length, …
  – Output of the hypothesis
    • e.g., error-rate of the parse, uncertainty of the parse, …

Page 34: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Proposed Scoring Functions

• Using domain knowledge: flen
  – long sentences tend to be complex
• Uncertainty about the output of the parser: fte
  – tree entropy
• Minimize mistakes made by the parser: ferror
  – use an oracle scoring function → find sentences with the most parsing inaccuracies

Page 35: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Entropy

• Measure of uncertainty in a distribution
  – Uniform distribution → very uncertain
  – Spike distribution → very certain
• Expected number of bits for encoding a probability distribution, X:

H(X) = -Σ_x p(x) log p(x)
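A direct transcription in Python, counting bits with log base 2:

    from math import log2

    def entropy(p):
        """H(X) = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
        return -sum(px * log2(px) for px in p if px > 0)

    entropy([0.25, 0.25, 0.25, 0.25])   # uniform: 2.0 bits, very uncertain
    entropy([1.0, 0.0, 0.0, 0.0])       # spike: 0.0 bits, very certain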

Page 36: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Tree Entropy Scoring Function

• Distribution over parse trees for sentence W:

Σ_{Ti ∈ Trees(W)} Pr(Ti | W) = 1

• Tree entropy: uncertainty of the parse distribution:

TE(W) = -Σ_{Ti ∈ Trees(W)} Pr(Ti | W) log Pr(Ti | W)

• Scoring function: ratio of the actual parse tree entropy to that of a uniform distribution:

fte(W) = TE(W) / log(|Trees(W)|)
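A sketch of fte, assuming the parser can enumerate the parse distribution Pr(Ti | W) for a sentence:

    from math import log2

    def f_te(parse_probs):
        """Tree entropy normalized by the entropy of a uniform distribution
        over the same number of parses, i.e. log2 |Trees(W)|."""
        te = -sum(p * log2(p) for p in parse_probs if p > 0)
        return te / log2(len(parse_probs))

    f_te([0.25, 0.25, 0.25, 0.25])   # 1.0: parser is maximally uncertain
    f_te([0.97, 0.01, 0.01, 0.01])   # ~0.12: parser is confident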

Page 37: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Experimental Setup

• Parsing model:
  – Collins Model 2
• Candidate pool:
  – WSJ sec 02-21, with the annotation stripped
• Initial labeled examples: 500 sentences
• Per iteration: add 100 sentences
• Testing metric: f-score (precision/recall)
• Test data:
  – ~2000 unseen sentences (from WSJ sec 00)
• Baseline:
  – Annotate data in sequential order

Page 38: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Training Examples vs. Parsing Performance

[Figure: parsing performance on test sentences (f-score, 80–90) against number of training sentences (0–40,000), for four selection strategies: sequential, length, tree entropy, and oracle]

Page 39: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Parsing Performance vs. Constituents Labeled

[Figure: number of constituents in training sentences (0–800,000) needed to reach a given parsing performance on test sentences (f-score: 87.5, 88, 88.7), for baseline, length, tree entropy, and oracle]

Page 40: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Co-Training [Blum and Mitchell, 1998]

• Assumptions
  – Have a small treebank
  – No further human assistance
  – Have two different kinds of parsers
• A subset of each parser’s output becomes new training data for the other
• Goal:
  – select sentences that are labeled with confidence by one parser but labeled with uncertainty by the other parser

Page 41: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Algorithm

Initialize:
  Train two parsers on a small treebank (seed data) to get the initial models.
Repeat:
  Create a candidate set by randomly sampling the unlabeled pool.
  Each parser labels the candidate set and estimates the accuracy of its output with a scoring function, f.
  Choose examples according to some selection method, S (using the scores from f).
  Add them to the parsers’ training sets.
  Re-train the parsers with the updated training sets.
Until (no more data).
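A sketch of the loop with two parsers teaching each other; the parser interface (train, parse, score) and the select helper, which applies the scoring function f and a selection method S from the following slides, are assumptions:

    import random

    def co_train(parser_a, parser_b, seed, unlabeled, select, rounds, pool=500):
        """Each parser's confidently scored output trains the other parser."""
        train_a, train_b = list(seed), list(seed)
        for _ in range(rounds):
            parser_a.train(train_a)
            parser_b.train(train_b)
            cand = random.sample(unlabeled, min(pool, len(unlabeled)))
            out_a = [(s, parser_a.parse(s), parser_a.score(s)) for s in cand]
            out_b = [(s, parser_b.parse(s), parser_b.score(s)) for s in cand]
            train_b += select(teacher=out_a, student=out_b)   # A teaches B
            train_a += select(teacher=out_b, student=out_a)   # B teaches A
        return parser_a, parser_b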

Page 42: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Scoring Functions

• Evaluate the quality of each parser’s output
• Ideally, the function measures accuracy
  – Oracle fF-score
    • combined precision/recall of the parse
• Practical scoring functions
  – Conditional probability fcprob
    • Prob(parse | sentence)
  – Others (joint probability, entropy, etc.)

Page 43: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Selection Methods

• Above-n: Sabove-n
  – The score of the teacher’s parse is greater than n
• Difference: Sdiff-n
  – The score of the teacher’s parse is greater than that of the student’s parse by n
• Intersection: Sint-n
  – The score of the teacher’s parse is one of its n% highest while the score of the student’s parse for the same sentence is one of the student’s n% lowest
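Sketches of the three methods, assuming teacher and student are parallel lists of (sentence, parse, score) triples as in the co-training sketch above:

    def s_above_n(teacher, n):
        """Teacher's parse scores above the threshold n."""
        return [(s, t) for s, t, ft in teacher if ft > n]

    def s_diff_n(teacher, student, n):
        """Teacher's score exceeds the student's score by more than n."""
        return [(s, t) for (s, t, ft), (_, _, fs) in zip(teacher, student)
                if ft - fs > n]

    def s_int_n(teacher, student, n):
        """Teacher score in its top n%, student score in its bottom n%."""
        k = max(1, len(teacher) * n // 100)
        top = {s for s, _, _ in sorted(teacher, key=lambda x: -x[2])[:k]}
        low = {s for s, _, _ in sorted(student, key=lambda x: x[2])[:k]}
        return [(s, t) for s, t, _ in teacher if s in top and s in low]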

Page 44: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Experimental Setup

• Co-training parsers:
  – Lexicalized Tree Adjoining Grammar parser [Sarkar, 2002]
  – Lexicalized Context Free Grammar parser [Collins, 1997]
• Seed data: 1000 parsed sentences from WSJ sec 02
• Unlabeled pool: rest of WSJ sec 02-21, stripped
• Consider 500 unlabeled sentences per iteration
• Development set: WSJ sec 00
• Test set: WSJ sec 23
• Results: graphs for the Collins parser

Page 45: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Co-Training using fcprob

[Figure: parsing performance on the test set (f-score, 79.5–81.3) against number of training sentences (0–5,000), for three selection methods: above-70%, diff-30%, and int-30%]

Page 46: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Roadmap

• Parsing as a learning problem
• Semi-supervised approaches
  – Sample selection
  – Co-training
  – Corrected Co-training
• Conclusion and further directions

Page 47: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Corrected Co-Training

• A human reviews and corrects the machine outputs before they are added to the training set
• Can be seen as a variant of sample selection [cf. Muslea et al., 2000]
• Applied to base NP detection [Pierce & Cardie, 2001]

Page 48: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Algorithm

Initialize:
  Train two parsers on a small treebank (seed data) to get the initial models.
Repeat:
  Create a candidate set by randomly sampling the unlabeled pool.
  Each parser labels the candidate set and estimates the accuracy of its output with a scoring function, f.
  Choose examples according to some selection method, S (using the scores from f).
  A human reviews and corrects the chosen examples.
  Add them to the parsers’ training sets.
  Re-train the parsers with the updated training sets.
Until (no more data).

Page 49: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Selection Methods and Corrected Co-Training

• Two scoring functions: fF-score, fcprob
• Three selection methods: Sabove-n, Sdiff-n, Sint-n

Page 50: Semi-Supervised Approaches  for Learning to Parse  Natural Languages

Corrected Co-Training using fcprob (Reviews)

[Figure: parsing performance on the test set (f-score, 79–87) against number of training sentences (0–15,000), for above-70%, diff-30%, int-30%, and no selection]