TRANSCRIPT
ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation
Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
Nutshell Version
tractable training + unannotated text + “max ent” features + sequence models
→ contrastive estimation with lattice neighborhoods
Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• “Max ent” features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)
Maximum Likelihood Estimation (Supervised)
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
[Figure: the observed pair (x, y) receives probability mass p; the normalization is over all of Σ* × Λ*, every possible sentence paired with every possible tagging.]
Maximum Likelihood Estimation (Unsupervised)
red leaves don’t hide blue jays
? ? ? ? ? ?
[Figure: probability mass goes to the observed sentence x under every possible tagging; the normalization is again over all of Σ* × Λ*.]
This is what EM does.
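In symbols (a reconstruction from the slide fragments; θ denotes the model parameters, Λ* the set of tag sequences), the two objectives are:

```latex
% Supervised MLE: maximize the joint likelihood of the observed (x, y) pairs
\max_\theta \prod_i p_\theta(x_i, y_i)

% Unsupervised MLE (what EM locally maximizes): the tags are hidden, so sum them out
\max_\theta \prod_i p_\theta(x_i) \;=\; \max_\theta \prod_i \sum_{y \in \Lambda^*} p_\theta(x_i, y)
```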
Focusing Probability Mass
[Figure: training moves probability mass toward the numerator event and away from the rest of the denominator.]
Conditional Estimation (Supervised)
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
[Figure: the numerator is the observed tagged sentence (x, y); the denominator is {x} × Λ*, the observed sentence under every possible tagging.]
A different denominator!
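In symbols (a reconstruction; u_θ is the model's unnormalized score):

```latex
% Conditional estimation: normalize only over taggings of the observed
% sentence, {x_i} x Lambda^*, rather than over all of Sigma^* x Lambda^*
\max_\theta \prod_i p_\theta(y_i \mid x_i)
  \;=\; \max_\theta \prod_i \frac{u_\theta(x_i, y_i)}{\sum_{y' \in \Lambda^*} u_\theta(x_i, y')}
```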
Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words

*For generative models.
Objective Functions (continued)

Objective                 | Optimization Algorithm                                | Numerator                                                                              | Denominator
MLE                       | Count & Normalize*                                    | tags & words                                                                           | Σ* × Λ*
MLE with hidden variables | EM*                                                   | words                                                                                  | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling                                     | tags & words                                                                           | (words) × Λ*
Perceptron                | Backprop                                              | tags & words                                                                           | hypothesized tags & words
Contrastive Estimation    | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?

*For generative models.
This talk is about denominators ... in the unsupervised case.
A good denominator can improve accuracy and tractability.
Language Learning (Syntax)
red leaves don’t hide blue jays
Why didn’t he say, “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words?
(This is the question EM implicitly asks.)
Language Learning (Syntax)
red leaves don’t hide blue jays
Why did he pick that sequence for those words?
Why not say “leaves red ...” or “... hide don’t ...” or ...
What is a syntax model supposed to explain?
Each learning hypothesis corresponds to a denominator / neighborhood.
The Job of Syntax: “Explain why each word is necessary.”
→ DEL1WORD neighborhood
red leaves don’t hide blue jays
    leaves don’t hide blue jays
    red don’t hide blue jays
    red leaves hide blue jays
    red leaves don’t blue jays
    red leaves don’t hide jays
    red leaves don’t hide blue
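The DEL1WORD neighborhood can be sketched as a simple generator (a toy illustration; the talk represents neighborhoods as lattices, not explicit lists):

```python
def del1word(words):
    """DEL1WORD neighborhood: the sentence itself ("up to 1" includes
    deleting nothing) plus every version with exactly one word removed."""
    neighbors = [list(words)]
    for i in range(len(words)):
        neighbors.append(words[:i] + words[i + 1:])
    return neighbors

sent = "red leaves don't hide blue jays".split()
for n in del1word(sent):
    print(" ".join(n))
```

For a 6-word sentence this yields the 7 sentences shown on the slide (size n + 1).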
The Job of Syntax: “Explain the (local) order of the words.”
→ TRANS1 neighborhood
red leaves don’t hide blue jays
    leaves red don’t hide blue jays
    red leaves hide don’t blue jays
    red don’t leaves hide blue jays
    red leaves don’t hide jays blue
    red leaves don’t blue hide jays
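Likewise, TRANS1 as a toy generator (again, the real implementation encodes this as a lattice):

```python
def trans1(words):
    """TRANS1 neighborhood: the sentence itself plus every version with
    one adjacent word pair (bigram) transposed."""
    neighbors = [list(words)]
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        neighbors.append(swapped)
    return neighbors

sent = "red leaves don't hide blue jays".split()
for n in trans1(sent):
    print(" ".join(n))
```

A 6-word sentence has 5 adjacent pairs, so the neighborhood has size n = 6 including the original.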
[Figure: the numerator sums p over all taggings of the observed sentence; the denominator sums p over all taggings of every sentence in the TRANS1 neighborhood.]
[Figure: the sentences in the TRANS1 neighborhood (with any tagging) represented compactly as a lattice, so the denominator can be summed efficiently.]
The New Modeling Imperative
numerator / denominator (“neighborhood”)
“Make the good sentence likely, at the expense of those bad neighbors.”
A good sentence hints that a set of bad ones is nearby.
This talk is about denominators ... in the unsupervised case.
A good denominator can improve accuracy and tractability.
Log-Linear Models
Each (x, y) gets a score; the partition function normalizes the scores into probabilities.
Computing the partition function is undesirable: it sums over all possible taggings of all possible sentences!
Conditional Estimation (supervised): sum over the taggings of 1 sentence.
Contrastive Estimation (unsupervised): sum over the taggings of a few sentences.
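Written out (following the paper's formulation; u_θ is the unnormalized log-linear score and N(x) is the neighborhood of x):

```latex
% Contrastive estimation: the numerator sums over taggings of the observed
% sentence; the denominator sums over the (small) neighborhood N(x_i)
\max_\theta \prod_i
  \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
       {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)}
\qquad \text{where } u_\theta(x, y) = \exp\bigl(\theta^\top \mathbf{f}(x, y)\bigr)
```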
A Big Picture: Sequence Model Estimation
[Diagram relating three desiderata (tractable sums, overlapping features, unannotated data) to estimation methods: generative MLE of p(x, y); log-linear conditional estimation of p(y | x); generative EM for p(x); log-linear MLE of p(x, y); log-linear EM for p(x); and log-linear CE with lattice neighborhoods, which sits at the intersection of all three.]
Contrastive Neighborhoods
• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.
There is an art to choosing neighborhood functions.
Neighborhoods

neighborhood    | size  | lattice arcs | perturbations
DEL1WORD        | n+1   | O(n)         | delete up to 1 word
TRANS1          | n     | O(n)         | transpose any bigram
DELORTRANS1     | O(n)  | O(n)         | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²)        | delete any contiguous subsequence
Σ* (EM)         | ∞     | –            | replace each word with anything
The Merialdo (1994) Task
Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.
The dictionary is a form of supervision.
Trigram Tagging Model
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
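A sketch of how such features might be counted for one tagged sentence (the function name, the feature encoding, and the boundary symbols `<s>` / `</s>` are illustrative choices, not from the talk):

```python
def extract_features(words, tags):
    """Count tag-trigram and tag/word-pair features for one tagged sentence.
    The boundary symbols <s> and </s> are an illustrative choice."""
    feats = {}
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    for i in range(2, len(padded)):
        key = ("TRI",) + tuple(padded[i - 2:i + 1])  # tag trigram feature
        feats[key] = feats.get(key, 0) + 1
    for w, t in zip(words, tags):
        key = ("PAIR", t, w)  # tag/word pair feature
        feats[key] = feats.get(key, 0) + 1
    return feats

sent = "red leaves don't hide blue jays".split()
tags = ["JJ", "NNS", "MD", "VB", "JJ", "NNS"]
feats = extract_features(sent, tags)
```

The weight vector θ would then score a tagging by a dot product with these counts.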
[Chart: tagging accuracy on ambiguous words (96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions). Bars (35.1 to 99.5) compare random, EM (Merialdo, 1994), DA (Smith & Eisner, 2004), and the CE neighborhoods DEL1SUBSEQUENCE, DEL1WORD, TRANS1, DELORTRANS1, and LENGTH (≈ log-linear EM) against supervised HMM and CRF baselines, including a 10× data condition; the best neighborhoods reach 78.8 to 79.3, far above EM.]
What if we damage the POS dictionary?
Dictionary includes: ■ all words ■ words from 1st half of corpus ■ words with count ≥ 2 ■ words with count ≥ 3
The dictionary excludes OOV words, which can get any tag.
[Chart: tagging accuracy on all words (96K words, 17 coarse POS tags, uninformative initializer) for DELORTRANS1, LENGTH, EM, and random under each dictionary condition (all words; 1st half of corpus; count ≥ 2; count ≥ 3; OOV words may get any tag). Accuracies fall from about 90 with the full dictionary into the 50s and 60s with the weakest dictionaries, with the CE neighborhoods degrading most gracefully.]
Trigram Tagging Model + Spelling
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
• 1- to 3-character suffixes, contains hyphen, contains digit
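The added spelling features might be extracted like this (conjoining each feature with the tag is an assumption for illustration, as is the encoding):

```python
def spelling_features(word, tag):
    """Spelling features from the slide: 1- to 3-character suffixes,
    contains-hyphen, contains-digit (each conjoined with the tag here,
    which is an assumption for illustration)."""
    feats = [("SUF", tag, word[-k:]) for k in (1, 2, 3) if len(word) >= k]
    if "-" in word:
        feats.append(("HYPHEN", tag))
    if any(c.isdigit() for c in word):
        feats.append(("DIGIT", tag))
    return feats

print(spelling_features("well-known", "JJ"))
```

Such features let an unknown word share evidence with dictionary words that end the same way.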
[Chart: tagging accuracy on all words under the same dictionary-damage conditions, now comparing DELORTRANS1 and LENGTH with and without spelling features against EM and random. With spelling features, accuracies stay between 73.6 and 91.9 across damage conditions, well above the corresponding no-spelling bars.]
Spelling features aided recovery, but only with a smart neighborhood.
Unsupervised Dependency Parsing
[Chart: attachment accuracy for EM, TRANS1, and LENGTH with clever vs. uninformative initializers (values 23.6 to 48.7), compared against Klein & Manning (2004).]
See our paper at the IJCAI 2005 Grammatical Inference workshop.
To Sum Up ...
Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both).
It’s a particularly good fit for log-linear models: unsupervised sequence models with max ent features, all in time for ACL 2006.
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).