ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation

Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu




Nutshell Version

tractable training + unannotated text + "max ent" features + sequence models
→ contrastive estimation with lattice neighborhoods

Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• "Max ent" features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)


“Red leaves don’t hide blue jays.”

Maximum Likelihood Estimation (Supervised)

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

[Figure: fit p to make the observed (word sequence, tag sequence) pair likely; the competition is everything in Σ* × Λ*.]

Maximum Likelihood Estimation (Unsupervised)

red leaves don’t hide blue jays
?   ?    ?   ?   ?    ?

[Figure: only the word sequence x is observed; make p(x) likely, summing over all taggings, against everything in Σ* × Λ*. This is what EM does.]

Focusing Probability Mass

[Figure: each objective is a ratio; training pushes probability mass toward the numerator set and away from the rest of the denominator set.]

Conditional Estimation (Supervised)

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

[Figure: numerator = the observed tagged sentence; denominator = (x) × Λ*, i.e., all taggings of the observed words. A different denominator!]

Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words

*For generative models.

Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words
Contrastive Estimation    | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?

*For generative models.

This talk is about denominators ... in the unsupervised case.

A good denominator can improve accuracy and tractability.

Language Learning (Syntax)

red leaves don’t hide blue jays

EM asks: Why didn’t he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words?

Language Learning (Syntax)

red leaves don’t hide blue jays

Why did he pick that sequence for those words?
Why not say "leaves red ..." or "... hide don’t ..." or ...?

What is a syntax model supposed to explain?
Each learning hypothesis corresponds to a denominator / neighborhood.

The Job of Syntax: "Explain why each word is necessary."
→ DEL1WORD neighborhood

red leaves don’t hide blue jays
    leaves don’t hide blue jays
red        don’t hide blue jays
red leaves       hide blue jays
red leaves don’t      blue jays
red leaves don’t hide      jays
red leaves don’t hide blue
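The DEL1WORD neighborhood is easy to enumerate directly. A minimal sketch (the function name `del1word` is ours, not from the talk):

```python
def del1word(sentence):
    """DEL1WORD neighborhood of a sentence: the sentence itself
    ("delete up to 1 word" includes deleting none) plus every
    variant with exactly one word removed -- n + 1 strings."""
    words = sentence.split()
    neighbors = [sentence]
    for i in range(len(words)):
        neighbors.append(" ".join(words[:i] + words[i + 1:]))
    return neighbors

# Reproduces the slide's list for the example sentence:
for s in del1word("red leaves don't hide blue jays"):
    print(s)
```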

The Job of Syntax: "Explain the (local) order of the words."
→ TRANS1 neighborhood

red leaves don’t hide blue jays
leaves red don’t hide blue jays
red don’t leaves hide blue jays
red leaves hide don’t blue jays
red leaves don’t blue hide jays
red leaves don’t hide jays blue
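TRANS1 is equally easy to enumerate (again a sketch; `trans1` is our name):

```python
def trans1(sentence):
    """TRANS1 neighborhood: the sentence plus every variant with one
    adjacent word pair transposed -- n strings for an n-word sentence."""
    words = sentence.split()
    neighbors = [sentence]
    for i in range(len(words) - 1):
        swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        neighbors.append(" ".join(swapped))
    return neighbors

# Reproduces the slide's list for the example sentence:
for s in trans1("red leaves don't hide blue jays"):
    print(s)
```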

[Figure: numerator = p summed over all taggings of "red leaves don’t hide blue jays"; denominator = p summed over all taggings of every sentence in its TRANS1 neighborhood.]

[Figure: the same TRANS1 neighborhood packed into a lattice over the words, so the denominator is a sum over all lattice paths, with any tagging.]

The New Modeling Imperative

numerator / denominator ("neighborhood")

"Make the good sentence likely, at the expense of those bad neighbors."
A good sentence hints that a set of bad ones is nearby.
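Spelled out as a formula, in our notation (θ for the model weights, u_θ(x, y) for the unnormalized score of a tagged sentence, N(x_i) for the neighborhood of the i-th observed sentence), consistent with the numerator/denominator picture:

```latex
\mathcal{L}(\theta)
  = \sum_i \log p_\theta\bigl(x_i \mid N(x_i)\bigr)
  = \sum_i \log
    \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
         {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)},
\qquad u_\theta(x, y) = \exp\bigl(\theta \cdot \mathbf{f}(x, y)\bigr).
```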

This talk is about denominators ... in the unsupervised case.

A good denominator can improve accuracy and tractability.

Log-Linear Models

score of (x, y) ÷ partition function

Computing the partition function is undesirable: it sums over all possible taggings of all possible sentences!

Conditional Estimation (Supervised): sum over all taggings of 1 sentence.
Contrastive Estimation (Unsupervised): sum over all taggings of a few sentences.
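To make the contrast concrete, here is a toy numeric sketch of the contrastive log-likelihood for one sentence, with the intractable partition function replaced by a sum over a small neighborhood. The function name and the log-scores are ours, invented for illustration:

```python
import math

def contrastive_log_likelihood(log_scores, observed):
    """log p(observed | neighborhood), where log_scores maps each
    sentence in the neighborhood to log sum_y u(x, y), i.e. its
    unnormalized log-score with taggings already summed out."""
    log_denominator = math.log(sum(math.exp(s) for s in log_scores.values()))
    return log_scores[observed] - log_denominator

# Made-up log-scores for a tiny TRANS1-style neighborhood:
log_scores = {
    "red leaves don't hide blue jays": 2.0,  # the observed sentence
    "leaves red don't hide blue jays": 0.5,  # its bad neighbors
    "red leaves hide don't blue jays": 0.1,
}
print(contrastive_log_likelihood(log_scores, "red leaves don't hide blue jays"))
```

Raising the observed sentence's score, or lowering a neighbor's, increases this objective; no sum over Σ* is ever needed.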

A Big Picture: Sequence Model Estimation

[Figure: a map of estimation methods against three desiderata (tractable sums, overlapping features, unannotated data): generative MLE, p(x, y); log-linear conditional estimation, p(y | x); generative EM, p(x); log-linear MLE, p(x, y); log-linear EM, p(x); and log-linear CE with lattice neighborhoods, which aims for all three at once.]

Contrastive Neighborhoods

• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.

Neighborhoods

Neighborhood    | Size  | Lattice Arcs | Perturbations
DEL1WORD        | n+1   | O(n)         | delete up to 1 word
TRANS1          | n     | O(n)         | transpose any bigram
DELORTRANS1     | O(n)  | O(n)         | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²)        | delete any contiguous subsequence
Σ* (EM)         | ∞     | –            | replace each word with anything
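The O(n²) size of DEL1SUBSEQUENCE can be checked by brute force. A sketch (the enumeration function is ours, and including the original sentence and the full deletion is our choice):

```python
def del1subsequence(words):
    """'Delete any contiguous subsequence' variants of a word list,
    plus the original -- O(n^2) strings."""
    out = {tuple(words)}
    for i in range(len(words)):
        for j in range(i, len(words)):
            out.add(tuple(words[:i] + words[j + 1:]))
    return out

words = "red leaves don't hide blue jays".split()
n = len(words)
# With all-distinct words, the count is exactly n*(n+1)/2 + 1 = 22 here:
assert len(del1subsequence(words)) == n * (n + 1) // 2 + 1
```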

The Merialdo (1994) Task

Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.

The dictionary is a form of supervision.

Trigram Tagging Model

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
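A sketch of how these two feature templates could be extracted for one candidate tagging (the function name, dictionary representation, and padding symbols are our own choices, not the talk's):

```python
def tagging_features(words, tags):
    """Feature counts for one (sentence, tagging) pair:
    tag trigrams (with boundary padding) and tag/word pairs."""
    feats = {}
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    for i in range(len(padded) - 2):
        key = ("trigram", padded[i], padded[i + 1], padded[i + 2])
        feats[key] = feats.get(key, 0) + 1
    for w, t in zip(words, tags):
        key = ("tag/word", t, w)
        feats[key] = feats.get(key, 0) + 1
    return feats

f = tagging_features("red leaves don't hide blue jays".split(),
                     ["JJ", "NNS", "MD", "VB", "JJ", "NNS"])
```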

[Chart: tagging accuracy on ambiguous words (96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions). Bars, low to high: random 35.1; EM (the Merialdo (1994) setting), DEL1WORD, and DEL1SUBSEQUENCE in the 58.7–66.6 range; DA (Smith & Eisner, 2004) 70.0; LENGTH (≈ log-linear EM), TRANS1, and DELORTRANS1 at 78.8–79.3; supervised HMM 97.2 and supervised CRF 99.5, with a 10× data condition also marked.]

What if we damage the POS dictionary?

Dictionary includes ...
■ all words
■ words from the 1st half of the corpus
■ words with count ≥ 2
■ words with count ≥ 3

The dictionary excludes OOV words, which can get any tag.

[Chart: tagging accuracy on all words (96K words, 17 coarse POS tags, uninformative initializer) for DELORTRANS1, LENGTH, EM, and random under the four dictionary conditions above. Accuracy falls as the dictionary is damaged: from roughly 90 with the full dictionary for the best methods, down through the 70s and 60s, to the low 50s for the weakest method on the most damaged dictionary.]

Trigram Tagging Model + Spelling

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
• 1- to 3-character suffixes; contains hyphen; contains digit
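The added spelling features might look like this (a sketch; the function and feature names are ours):

```python
def spelling_features(word):
    """Spelling features for one word, per the slide's feature set:
    1- to 3-character suffixes, plus contains-hyphen and
    contains-digit indicators."""
    feats = [("suffix", word[-k:]) for k in (1, 2, 3) if len(word) >= k]
    if "-" in word:
        feats.append(("contains-hyphen",))
    if any(c.isdigit() for c in word):
        feats.append(("contains-digit",))
    return feats

print(spelling_features("jays"))  # suffixes: s, ys, ays
```

Because suffixes are shared across word types, these features let the model tag words the damaged dictionary no longer covers.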

[Chart: the same all-words tagging accuracy plot, now adding DELORTRANS1 + spelling and LENGTH + spelling. With spelling features, DELORTRANS1 stays near 90 (roughly 89.8–91.9) across all four dictionary conditions, while the other curves still degrade as the dictionary is damaged.]

Spelling features aided recovery, but only with a smart neighborhood.


The model need not be finite-state.

[Chart: unsupervised dependency parsing attachment accuracy (the Klein & Manning (2004) model) for EM, LENGTH, and TRANS1, each with a clever and an uninformative initializer; scores span 23.6 to 48.7, with the CE neighborhoods ahead of EM. See our paper at the IJCAI 2005 Grammatical Inference workshop.]

To Sum Up ...

Contrastive Estimation means picking your own denominator: for tractability, or for accuracy (or, as in our case, for both).

It’s a particularly good fit for log-linear models. Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners): unsupervised sequence models with "max ent" features, all in time for ACL 2006.
