ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation

Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu




Nutshell Version

tractable training + unannotated text + "max ent" features + sequence models
→ contrastive estimation with lattice neighborhoods

Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• "Max ent" features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)


“Red leaves don’t hide blue jays.”

Maximum Likelihood Estimation (Supervised)

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

[Figure: fit p to make the observed (word sequence, tag sequence) pair likely; the competition is everything in Σ* × Λ*.]

Maximum Likelihood Estimation (Unsupervised)

red leaves don’t hide blue jays
?   ?    ?   ?   ?    ?

[Figure: only the word sequence x is observed; make p(x) likely, summing over all taggings, against everything in Σ* × Λ*. This is what EM does.]

Focusing Probability Mass

[Figure: each objective is a ratio; training pushes probability mass toward the numerator set and away from the rest of the denominator set.]

Conditional Estimation (Supervised)

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

[Figure: numerator = the observed tagged sentence; denominator = (x) × Λ*, i.e., all taggings of the observed words. A different denominator!]

Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words

*For generative models.

Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words
Contrastive Estimation    | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?

*For generative models.

This talk is about denominators ... in the unsupervised case.

A good denominator can improve accuracy and tractability.

Language Learning (Syntax)

red leaves don’t hide blue jays

EM asks: Why didn’t he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words?

Language Learning (Syntax)

red leaves don’t hide blue jays

Why did he pick that sequence for those words?
Why not say "leaves red ..." or "... hide don’t ..." or ...?

What is a syntax model supposed to explain?
Each learning hypothesis corresponds to a denominator / neighborhood.

The Job of Syntax: "Explain why each word is necessary."
→ DEL1WORD neighborhood

red leaves don’t hide blue jays
    leaves don’t hide blue jays
red        don’t hide blue jays
red leaves       hide blue jays
red leaves don’t      blue jays
red leaves don’t hide      jays
red leaves don’t hide blue
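The DEL1WORD neighborhood is easy to enumerate directly. A minimal sketch (the function name `del1word` is ours, not from the talk):

```python
def del1word(sentence):
    """DEL1WORD neighborhood of a sentence: the sentence itself
    ("delete up to 1 word" includes deleting none) plus every
    variant with exactly one word removed -- n + 1 strings."""
    words = sentence.split()
    neighbors = [sentence]
    for i in range(len(words)):
        neighbors.append(" ".join(words[:i] + words[i + 1:]))
    return neighbors

# Reproduces the slide's list for the example sentence:
for s in del1word("red leaves don't hide blue jays"):
    print(s)
```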

The Job of Syntax: "Explain the (local) order of the words."
→ TRANS1 neighborhood

red leaves don’t hide blue jays
leaves red don’t hide blue jays
red don’t leaves hide blue jays
red leaves hide don’t blue jays
red leaves don’t blue hide jays
red leaves don’t hide jays blue
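TRANS1 is equally easy to enumerate (again a sketch; `trans1` is our name):

```python
def trans1(sentence):
    """TRANS1 neighborhood: the sentence plus every variant with one
    adjacent word pair transposed -- n strings for an n-word sentence."""
    words = sentence.split()
    neighbors = [sentence]
    for i in range(len(words) - 1):
        swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        neighbors.append(" ".join(swapped))
    return neighbors

# Reproduces the slide's list for the example sentence:
for s in trans1("red leaves don't hide blue jays"):
    print(s)
```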

[Figure: numerator = p summed over all taggings of "red leaves don’t hide blue jays"; denominator = p summed over all taggings of every sentence in its TRANS1 neighborhood.]

[Figure: the same TRANS1 neighborhood packed into a lattice over the words, so the denominator is a sum over all lattice paths, with any tagging.]

The New Modeling Imperative

numerator / denominator ("neighborhood")

"Make the good sentence likely, at the expense of those bad neighbors."
A good sentence hints that a set of bad ones is nearby.
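Spelled out as a formula, in our notation (θ for the model weights, u_θ(x, y) for the unnormalized score of a tagged sentence, N(x_i) for the neighborhood of the i-th observed sentence), consistent with the numerator/denominator picture:

```latex
\mathcal{L}(\theta)
  = \sum_i \log p_\theta\bigl(x_i \mid N(x_i)\bigr)
  = \sum_i \log
    \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
         {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)},
\qquad u_\theta(x, y) = \exp\bigl(\theta \cdot \mathbf{f}(x, y)\bigr).
```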

This talk is about denominators ... in the unsupervised case.

A good denominator can improve accuracy and tractability.

Log-Linear Models

score of (x, y) ÷ partition function

Computing the partition function is undesirable: it sums over all possible taggings of all possible sentences!

Conditional Estimation (Supervised): sum over all taggings of 1 sentence.
Contrastive Estimation (Unsupervised): sum over all taggings of a few sentences.
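To make the contrast concrete, here is a toy numeric sketch of the contrastive log-likelihood for one sentence, with the intractable partition function replaced by a sum over a small neighborhood. The function name and the log-scores are ours, invented for illustration:

```python
import math

def contrastive_log_likelihood(log_scores, observed):
    """log p(observed | neighborhood), where log_scores maps each
    sentence in the neighborhood to log sum_y u(x, y), i.e. its
    unnormalized log-score with taggings already summed out."""
    log_denominator = math.log(sum(math.exp(s) for s in log_scores.values()))
    return log_scores[observed] - log_denominator

# Made-up log-scores for a tiny TRANS1-style neighborhood:
log_scores = {
    "red leaves don't hide blue jays": 2.0,  # the observed sentence
    "leaves red don't hide blue jays": 0.5,  # its bad neighbors
    "red leaves hide don't blue jays": 0.1,
}
print(contrastive_log_likelihood(log_scores, "red leaves don't hide blue jays"))
```

Raising the observed sentence's score, or lowering a neighbor's, increases this objective; no sum over Σ* is ever needed.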

A Big Picture: Sequence Model Estimation

[Figure: a map of estimation methods against three desiderata (tractable sums, overlapping features, unannotated data): generative MLE, p(x, y); log-linear conditional estimation, p(y | x); generative EM, p(x); log-linear MLE, p(x, y); log-linear EM, p(x); and log-linear CE with lattice neighborhoods, which aims for all three at once.]

Contrastive Neighborhoods

• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.

Neighborhoods

Neighborhood    | Size  | Lattice Arcs | Perturbations
DEL1WORD        | n+1   | O(n)         | delete up to 1 word
TRANS1          | n     | O(n)         | transpose any bigram
DELORTRANS1     | O(n)  | O(n)         | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²)        | delete any contiguous subsequence
Σ* (EM)         | ∞     | –            | replace each word with anything
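The O(n²) size of DEL1SUBSEQUENCE can be checked by brute force. A sketch (the enumeration function is ours, and including the original sentence and the full deletion is our choice):

```python
def del1subsequence(words):
    """'Delete any contiguous subsequence' variants of a word list,
    plus the original -- O(n^2) strings."""
    out = {tuple(words)}
    for i in range(len(words)):
        for j in range(i, len(words)):
            out.add(tuple(words[:i] + words[j + 1:]))
    return out

words = "red leaves don't hide blue jays".split()
n = len(words)
# With all-distinct words, the count is exactly n*(n+1)/2 + 1 = 22 here:
assert len(del1subsequence(words)) == n * (n + 1) // 2 + 1
```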

The Merialdo (1994) Task

Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.

The dictionary is a form of supervision.

Trigram Tagging Model

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
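A sketch of how these two feature templates could be extracted for one candidate tagging (the function name, dictionary representation, and padding symbols are our own choices, not the talk's):

```python
def tagging_features(words, tags):
    """Feature counts for one (sentence, tagging) pair:
    tag trigrams (with boundary padding) and tag/word pairs."""
    feats = {}
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    for i in range(len(padded) - 2):
        key = ("trigram", padded[i], padded[i + 1], padded[i + 2])
        feats[key] = feats.get(key, 0) + 1
    for w, t in zip(words, tags):
        key = ("tag/word", t, w)
        feats[key] = feats.get(key, 0) + 1
    return feats

f = tagging_features("red leaves don't hide blue jays".split(),
                     ["JJ", "NNS", "MD", "VB", "JJ", "NNS"])
```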

[Chart: tagging accuracy on ambiguous words (96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions). Bars, low to high: random 35.1; EM (the Merialdo (1994) setting), DEL1WORD, and DEL1SUBSEQUENCE in the 58.7–66.6 range; DA (Smith & Eisner, 2004) 70.0; LENGTH (≈ log-linear EM), TRANS1, and DELORTRANS1 at 78.8–79.3; supervised HMM 97.2 and supervised CRF 99.5, with a 10× data condition also marked.]

What if we damage the POS dictionary?

Dictionary includes ...
■ all words
■ words from the 1st half of the corpus
■ words with count ≥ 2
■ words with count ≥ 3

The dictionary excludes OOV words, which can get any tag.

[Chart: tagging accuracy on all words (96K words, 17 coarse POS tags, uninformative initializer) for DELORTRANS1, LENGTH, EM, and random under the four dictionary conditions above. Accuracy falls as the dictionary is damaged: from roughly 90 with the full dictionary for the best methods, down through the 70s and 60s, to the low 50s for the weakest method on the most damaged dictionary.]

Trigram Tagging Model + Spelling

red leaves don’t hide blue jays
JJ  NNS  MD  VB  JJ  NNS

feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
• 1- to 3-character suffixes; contains hyphen; contains digit
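The added spelling features might look like this (a sketch; the function and feature names are ours):

```python
def spelling_features(word):
    """Spelling features for one word, per the slide's feature set:
    1- to 3-character suffixes, plus contains-hyphen and
    contains-digit indicators."""
    feats = [("suffix", word[-k:]) for k in (1, 2, 3) if len(word) >= k]
    if "-" in word:
        feats.append(("contains-hyphen",))
    if any(c.isdigit() for c in word):
        feats.append(("contains-digit",))
    return feats

print(spelling_features("jays"))  # suffixes: s, ys, ays
```

Because suffixes are shared across word types, these features let the model tag words the damaged dictionary no longer covers.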

[Chart: the same all-words tagging accuracy plot, now adding DELORTRANS1 + spelling and LENGTH + spelling. With spelling features, DELORTRANS1 stays near 90 (roughly 89.8–91.9) across all four dictionary conditions, while the other curves still degrade as the dictionary is damaged.]

Spelling features aided recovery, but only with a smart neighborhood.


The model need not be finite-state.

[Chart: unsupervised dependency parsing attachment accuracy (the Klein & Manning (2004) model) for EM, LENGTH, and TRANS1, each with a clever and an uninformative initializer; scores span 23.6 to 48.7, with the CE neighborhoods ahead of EM. See our paper at the IJCAI 2005 Grammatical Inference workshop.]

To Sum Up ...

Contrastive Estimation means picking your own denominator: for tractability, or for accuracy (or, as in our case, for both).

It’s a particularly good fit for log-linear models. Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners): unsupervised sequence models with "max ent" features, all in time for ACL 2006.
