Expectation-Propagation for the Generative Aspect Model
Tom Minka and John Lafferty
Carnegie Mellon University
Main points
• Aspect model is important and difficult
  – PCA for discrete data
• How variational methods can fail
• Extensions of Expectation-Propagation
• Using EP for maximum likelihood
Generative aspect model
Each document mixes aspects in different proportions
[Figure: documents placed between Aspect 1 and Aspect 2 according to their mixing proportions λ_i and 1−λ_i]
(Hofmann 1999; Blei, Ng, & Jordan 2001)
First interpretation
A document's word distribution is a mixture of the aspects:
  p(word w) = λ p(w | aspect 1) + (1−λ) p(w | aspect 2)
  λ ~ Dirichlet(α_1, ..., α_J)
Words are drawn by multinomial sampling from this mixture (sketch below).
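A minimal sketch of this generative process in Python. The aspect word distributions and Dirichlet parameters are made-up illustrative values, not from the talk.

import numpy as np

rng = np.random.default_rng(0)

aspects = np.array([[0.7, 0.2, 0.1],   # p(w | aspect 1), illustrative
                    [0.1, 0.3, 0.6]])  # p(w | aspect 2), illustrative
alpha = np.array([1.0, 1.0])           # Dirichlet prior on mixing weights

def sample_document(n_words):
    lam = rng.dirichlet(alpha)            # document-specific proportions
    p_w = lam @ aspects                   # mixture distribution over words
    return rng.multinomial(n_words, p_w)  # multinomial sampling of counts

print(sample_document(10))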
Second interpretation
[Graphical model: Doc i → mixing weights λ_i → per-word aspect assignments z_i1, z_i2, z_i3 ("Aspect 1" or "Aspect 2") → Word 1, Word 2, Word 3]
Each word first draws its aspect assignment z from λ_i, then is sampled from that aspect (sketch below).
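The same process with the per-word assignments z made explicit, reusing the made-up parameters from the previous sketch:

import numpy as np

rng = np.random.default_rng(0)

aspects = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])
alpha = np.array([1.0, 1.0])

def sample_document_with_z(n_words):
    lam = rng.dirichlet(alpha)
    z = rng.choice(len(alpha), size=n_words, p=lam)  # latent aspect per word
    words = [rng.choice(aspects.shape[1], p=aspects[zi]) for zi in z]
    return z, words

print(sample_document_with_z(3))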
Two tasks
Inference:
• Given aspects and document i, what is the posterior for λ_i?
Learning:
• Given some documents, what are the (maximum likelihood) aspects?
Variational method
• Blei, Ng, & Jordan, NIPS’01
• Uses the (λ, z) interpretation
• Inference: factored variational distribution q(λ) q(z)
• Learning:
  – E-step: q(λ, z)
  – M-step: aspects, aspect weights p(λ | α)
Expectation Propagation
• Uses the (λ only) interpretation
• Inference: EP with a Dirichlet approximation q(λ) to the posterior
• Learning:
  – E-step: q(λ)
  – M-step: aspects, aspect weights p(λ | α)
• Fewer latent variables is better
Geometric interpretation
• Likelihood is composed of terms of the form
  t_w(λ) = p(w)^{n_w} = (∑_a λ_a p(w|a))^{n_w}
• Want a Dirichlet approximation (sketch below):
  t̃_w(λ) = ∏_a λ_a^{β_wa}
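A quick way to see the shapes involved: evaluate one exact term and a Dirichlet-form approximation on a grid, for the two-aspect case λ = (λ, 1−λ). The word probabilities and the β exponents below are made-up values; in the method, β comes from VB or EP (next slides).

import numpy as np

p_w_a = np.array([0.4, 0.3])   # p(w | a) for the two aspects, illustrative
n_w = 3                        # occurrences of word w in the document

lam = np.linspace(1e-6, 1 - 1e-6, 1001)
t_exact = (lam * p_w_a[0] + (1 - lam) * p_w_a[1]) ** n_w   # exact term
beta = np.array([1.7, 1.3])    # made-up exponents β_wa
t_tilde = lam ** beta[0] * (1 - lam) ** beta[1]            # Dirichlet form
print(np.trapz(t_exact, lam), np.trapz(t_tilde, lam))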
Variational method
• Bound each term via Jensen's inequality:
  ∑_a λ_a p(w|a) ≥ const × ∏_a λ_a^{β_wa}
• Coupled equations (fixed point; sketch below):
  β_wa ∝ exp(⟨log λ_a⟩) p(w|a)
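A minimal sketch of these coupled updates for a single document, assuming two aspects and two word types with made-up probabilities and counts; ⟨log λ_a⟩ is the standard Dirichlet expectation ψ(γ_a) − ψ(∑ γ).

import numpy as np
from scipy.special import digamma

p_w_a = np.array([[0.4, 0.3],    # p(w | a): rows = words, cols = aspects
                  [0.6, 0.7]])   # illustrative values
n_w = np.array([4, 6])           # word counts in the document
alpha = np.array([1.0, 1.0])     # Dirichlet prior

beta = np.full_like(p_w_a, 0.5)  # responsibilities β_wa
for _ in range(100):
    gamma = alpha + n_w @ beta                # Dirichlet posterior params
    e_log_lam = digamma(gamma) - digamma(gamma.sum())
    beta = np.exp(e_log_lam) * p_w_a          # β_wa ∝ exp(⟨log λ_a⟩) p(w|a)
    beta /= beta.sum(axis=1, keepdims=True)   # normalize over aspects
print(gamma)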
Moment matching
• Context function: all but one occurrence
  q^\w(λ) = t̃_w(λ)^{n_w − 1} ∏_{w'≠w} t̃_w'(λ)^{n_w'}
• Moment match:
  t_w(λ) q^\w(λ) ↔ t̃_w(λ) q^\w(λ)
• Fixed point equations for β (sketch below)
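Below is a minimal EP sketch for the two-aspect case, where q(λ) is a Beta distribution and each occurrence of word w contributes a term t_w(λ) = λ p(w|1) + (1−λ) p(w|2). The tilted distribution t_w(λ) Beta(λ; a, b) is a two-component Beta mixture, so its mean and variance are available in closed form. The word probabilities and counts are illustrative; the sketch keeps one factor per occurrence and skips the safeguards (e.g. against invalid cavity parameters) a robust implementation needs.

import numpy as np

p1 = [0.4, 0.6]              # p(w | aspect 1) for word types 0 and 1
p2 = [0.3, 0.7]              # p(w | aspect 2)
doc = [0] * 4 + [1] * 6      # ten word occurrences
a, b = 1.0, 1.0              # Beta(1,1) prior on λ

fac = np.zeros((len(doc), 2))    # factor exponents per occurrence
for _ in range(50):              # sweep EP updates until convergence
    for i, w in enumerate(doc):
        ca, cb = a - fac[i, 0], b - fac[i, 1]   # cavity: delete one factor
        # tilted moments: t_w(λ) Beta(λ; ca, cb) is a Beta mixture
        w1 = p1[w] * ca / (ca + cb)             # weight of Beta(ca+1, cb)
        w2 = p2[w] * cb / (ca + cb)             # weight of Beta(ca, cb+1)
        Z = w1 + w2
        m1, m2 = (ca + 1) / (ca + cb + 1), ca / (ca + cb + 1)
        s1 = (ca + 1) * (ca + 2) / ((ca + cb + 1) * (ca + cb + 2))
        s2 = ca * (ca + 1) / ((ca + cb + 1) * (ca + cb + 2))
        mean = (w1 * m1 + w2 * m2) / Z
        var = (w1 * s1 + w2 * s2) / Z - mean ** 2
        # moment-match a Beta(a, b) to the tilted mean and variance
        scale = mean * (1 - mean) / var - 1
        a, b = mean * scale, (1 - mean) * scale
        fac[i] = [a - ca, b - cb]               # updated factor exponents

print(a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1)))  # mean, variance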
One term
t_w(λ) = 0.4 λ + 0.3 (1−λ)
Ten word document
Moments for ten word doc:

              Mean   Variance
Variational   0.365  0.0178
EP            0.393  0.0612
Exact         0.393  0.0613
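Exact moments like these can be checked by quadrature. The specific ten-word document behind the table is not given in the slides, so the word probabilities and counts below are illustrative stand-ins:

import numpy as np

p1 = np.array([0.4, 0.6])    # p(w | aspect 1), illustrative
p2 = np.array([0.3, 0.7])    # p(w | aspect 2), illustrative
n  = np.array([4, 6])        # word counts, ten words total

lam = np.linspace(1e-6, 1 - 1e-6, 20001)
post = np.ones_like(lam)     # uniform Beta(1,1) prior
for w in range(len(n)):
    post *= (lam * p1[w] + (1 - lam) * p2[w]) ** n[w]
post /= np.trapz(post, lam)

mean = np.trapz(lam * post, lam)
var = np.trapz((lam - mean) ** 2 * post, lam)
print(mean, var)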
1000 word document
Moments for 1000 word doc:

              Mean   Variance
Variational   0.252  0.0002
EP            0.252  0.0015
Exact         0.250  0.0015
General behavior
• For long documents, VB recovers the correct mean, but not the correct variance
• It optimizes a global criterion, yet is only accurate locally
• What about learning?
Learning
• Want MLE for the aspects
• Requires marginalizing λ
• p(doc) is the probability of generating the doc from a random λ
• Consider the extreme approximation (sketch below):
  – Probability of generating from the most likely λ
  – Ignores uncertainty in λ (the "Occam factor")
  – Used by Hofmann (1999)
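The contrast in one sketch: the marginal p(doc) integrates the document likelihood over λ, while the Max approximation just plugs in the best λ and so pays no Occam penalty. The likelihood below reuses the single-term example from earlier; it is illustrative only.

import numpy as np

lam = np.linspace(1e-6, 1 - 1e-6, 10001)

def p_doc_given_lam(lam):
    # a made-up document likelihood p(doc | λ)
    return (0.4 * lam + 0.3 * (1 - lam)) ** 10

lik = p_doc_given_lam(lam)
marginal = np.trapz(lik, lam)   # p(doc) under a uniform prior on λ
max_approx = lik.max()          # most likely λ; ignores uncertainty
print(marginal, max_approx)     # max_approx >= marginal: no Occam factor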
Toy problem
• Two aspects over two words
  – p(w=1 | a) describes each
• One aspect fixed to p(w=1 | a=2) = 1
• λ is uniform, i.e. α = 1
• p(w=1) = λ p(w=1 | a=1) + (1−λ) p(w=1 | a=2)
[Figure: the unit interval of p(w=1), with Aspect 1 at λ = 1 and Aspect 2 at λ = 0]
The exact likelihood of this model can be computed by quadrature (sketch below).
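A sketch of the exact log-likelihood as a function of the free parameter p(w=1 | a=1), integrating over the uniform λ. The document word counts are illustrative stand-ins, not the data from the talk.

import numpy as np

docs = [(2, 8), (5, 5), (9, 1)]        # (count of word 1, count of word 2)
lam = np.linspace(1e-6, 1 - 1e-6, 5001)

def log_lik(p11):
    total = 0.0
    for n1, n2 in docs:
        p_w1 = lam * p11 + (1 - lam)   # p(w=1 | a=2) is fixed to 1
        lik = p_w1 ** n1 * (1 - p_w1) ** n2
        total += np.log(np.trapz(lik, lam))
    return total

for p11 in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(p11, log_lik(p11))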
Exact likelihood vs. Max
[Plot: exact likelihood vs. the Max approximation as a function of p(w=1 | a=1), for 10 docs; dotted lines mark the observed frequencies n_i1 / n_i]
EP likelihood vs. VB
[Plot: EP likelihood vs. VB as a function of p(w=1 | a=1)]
VB not much different from Max
100 documents
[Plot: likelihoods as a function of p(w=1 | a=1)]
EP better, VB worse
Longer documents
[Plot: likelihoods as a function of p(w=1 | a=1)]
VB not helped
Why VB fails
• Doesn't capture the posterior variance of λ
  – No Occam factor
• Gets worse with more documents
• Doesn't matter if the posterior is sharp!
  – Longer documents
  – More distinct aspects
  Both sharpen the posterior, but do not help VB
Larger vocabulary
[Three plots: likelihood comparisons with larger vocabularies; 10 docs, length 10]
More documents
[Plot: likelihood comparison; 100 docs, length 10]
TREC documents
Perplexity
• EP and VB models have nearly the same perplexity (!)
• Perplexity is document probability, normalized by length (sketch below)
  – Dilutes the penalty of extreme aspects
• Better measure: classification error
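A sketch of the perplexity computation under the usual per-word normalization, perplexity = exp(−∑_i log p(doc_i) / ∑_i n_i). The log probabilities below are placeholders for a model's output.

import numpy as np

log_p_docs = np.array([-35.2, -71.8, -12.4])  # log p(doc_i), placeholders
doc_lengths = np.array([10, 20, 4])           # n_i, placeholders

perplexity = np.exp(-log_p_docs.sum() / doc_lengths.sum())
print(perplexity)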
Summary & Future
• Approximate inference can lead to biased learning
  – Must preserve the "Occam factor" terms
• Moment matching does well
• Simpler methods may also work
  – Area of the aspect simplex
• Make visualizations!