CS 679: Text Mining
Lecture #13: Gibbs Sampling for LDA
Credit: Many slides are from presentations by Tom Griffiths of Berkeley.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
Announcements
• Required reading for today: Griffiths & Steyvers, “Finding Scientific Topics”
• Final Project Proposal
  • Clear and detailed: ideally, the first half of your project report!
  • Talk to me about ideas
  • Teams are an option
  • Due date to be specified
Objectives
• Gain further understanding of LDA
• Understand the intractability of inference with the model
• Gain further insight into Gibbs sampling
• Understand how to estimate the parameters of interest in LDA using a collapsed Gibbs sampler
Latent Dirichlet Allocation (slightly different symbols this time)
(Plate diagram: $z_i$ and $w_i$ sit inside a plate of size $N_d$, nested in a plate of size $D$; $\phi^{(j)}$ sits inside a plate of size $T$.)
• $\theta^{(d)} \sim \text{Dirichlet}(\alpha)$: distribution over topics for each document
• $z_i \sim \text{Categorical}(\theta^{(d)})$: topic assignment for each word
• $\phi^{(j)} \sim \text{Dirichlet}(\beta)$: distribution over words for each topic
• $w_i \sim \text{Categorical}(\phi^{(z_i)})$: word generated from assigned topic
• $\alpha$, $\beta$: Dirichlet priors
(Blei, Ng, & Jordan, 2001; 2003)
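As the next slide notes, generating data from fixed parameters is the easy direction. Below is a minimal sketch of the generative process above in NumPy; the corpus sizes, vocabulary size, and hyperparameter values are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the lecture)
T, W, D, N_d = 2, 5, 3, 10    # topics, vocabulary size, documents, words/doc
alpha, beta = 0.5, 0.5        # symmetric Dirichlet hyperparameters

# phi^(j) ~ Dirichlet(beta): a distribution over words for each topic
phi = rng.dirichlet(np.full(W, beta), size=T)       # shape (T, W)

corpus = []
for d in range(D):
    # theta^(d) ~ Dirichlet(alpha): a distribution over topics for this document
    theta = rng.dirichlet(np.full(T, alpha))
    doc = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)    # z_i ~ Categorical(theta^(d))
        w = rng.choice(W, p=phi[z])   # w_i ~ Categorical(phi^(z_i))
        doc.append(w)
    corpus.append(doc)
print(corpus)
```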
The Statistical Problem of Meaning
Generating data from parameters is easy
Learning parameters from data is hard
What does it mean to identify the “meaning” of a document?
Estimation of the LDA Generative Model
• Maximum likelihood estimation (EM): similar to the method presented by Hofmann for pLSI (1999)
• Deterministic approximate algorithms:
  • Variational EM (Blei, Ng & Jordan, 2001; 2003)
  • Expectation propagation (Minka & Lafferty, 2002)
• Markov chain Monte Carlo: our focus
  • Full Gibbs sampler (Pritchard et al., 2000)
  • Collapsed Gibbs sampler (Griffiths & Steyvers, 2004): the papers you read for today
Review: Markov Chain Monte Carlo (MCMC)
• Samples from a Markov chain converge to a target distribution
• Allows sampling from an unnormalized posterior distribution
• Can compute approximate statistics from intractable distributions
(MacKay, 2002)
Review: Gibbs Sampling
• The most straightforward kind of MCMC
• For variables $x_1, \dots, x_n$, requires the full (or “complete”) conditional distribution for each variable: $P(x_i \mid \mathbf{x}_{-i})$
• Draw $x_i^{(t)}$ from $P(x_i \mid \mathbf{x}_{-i})$, where
$$\mathbf{x}_{-i} = x_1^{(t)}, x_2^{(t)}, \dots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \dots, x_n^{(t-1)}$$
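As a concrete illustration of this update schedule, here is a hypothetical two-variable example: a Gibbs sampler for a bivariate standard normal with correlation rho, chosen only because its full conditionals have a simple closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.9          # correlation of the (assumed) bivariate normal target
x1, x2 = 0.0, 0.0  # x^(0): arbitrary initialization
samples = []

for t in range(5000):
    # Draw x1 from p(x1 | x2) = N(rho * x2, 1 - rho^2), using x2^(t-1)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    # Draw x2 from p(x2 | x1), using the *current* x1^(t)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

# After burn-in, the samples approximate draws from the joint target
print(np.corrcoef(np.array(samples[1000:]).T))  # off-diagonal ≈ rho
```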
Bayesian Inference in LDA
• We would like to reason with the full joint distribution: $P(\mathbf{w}, \mathbf{z}, \Theta, \Phi \mid \alpha, \beta)$
• Given $\mathbf{w}$, the distribution over the latent variables is desirable, but the denominator (the marginal likelihood) is intractable to compute:
$$P(\mathbf{z}, \Theta, \Phi \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z}, \Theta, \Phi)}{P(\mathbf{w})}$$
• We marginalize the model parameters out of the joint distribution so that we can focus on the words in the corpus ($\mathbf{w}$) and their assigned topics ($\mathbf{z}$):
$$P(\mathbf{w}, \mathbf{z}) = \int_{\Theta} \int_{\Phi} P(\mathbf{w}, \mathbf{z}, \Theta, \Phi) \, d\Phi \, d\Theta$$
This leads to our use of the term “collapsed sampler”
Posterior Inference in LDA
• From this marginalized joint distribution, we can compute the posterior distribution over topics for a given corpus ($\mathbf{w}$):
$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}'} P(\mathbf{w}, \mathbf{z}')}$$
• But there are $T^n$ possible topic assignments, where $n$ is the number of word tokens in the corpus! I.e., inference is still intractable!
• Working with this topic posterior is only tractable up to a constant multiple: $P(\mathbf{z} \mid \mathbf{w}) \propto P(\mathbf{w}, \mathbf{z})$
Collapsed Gibbs Sampler for LDA
• Since we’re now focusing on the topic posterior, namely: $P(\mathbf{z} \mid \mathbf{w}) \propto P(\mathbf{w}, \mathbf{z}) = P(\mathbf{w} \mid \mathbf{z}) \, P(\mathbf{z})$
• Let’s find these factors by marginalizing separately:
$$P(\mathbf{w} \mid \mathbf{z}, \beta) = \int_{\Phi} P(\mathbf{w} \mid \mathbf{z}, \Phi) \, p(\Phi \mid \beta) \, d\Phi$$
$$P(\mathbf{z} \mid \alpha) = \int_{\Theta} P(\mathbf{z} \mid \Theta) \, p(\Theta \mid \alpha) \, d\Theta$$
• These integrals have closed forms:
$$P(\mathbf{w} \mid \mathbf{z}) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^W} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma\big(n_j^{(w)} + \beta\big)}{\Gamma\big(n_j^{(\cdot)} + W\beta\big)}$$
$$P(\mathbf{z}) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma\big(n_j^{(d)} + \alpha\big)}{\Gamma\big(n_\cdot^{(d)} + T\alpha\big)}$$
• Where $n_j^{(w)}$ is the number of times word $w$ is assigned to topic $j$, $n_j^{(\cdot)} = \sum_w n_j^{(w)}$, $n_j^{(d)}$ is the number of times topic $j$ occurs in document $d$, and $n_\cdot^{(d)} = \sum_j n_j^{(d)}$
Collapsed Gibbs Sampler for LDA
• We only sample each $z_i$! Complete (or full) conditionals can now be derived for each $z_i$ in $\mathbf{z}$:
$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto P(w_i \mid z_i = j, \mathbf{z}_{-i}, \mathbf{w}_{-i}) \, P(z_i = j \mid \mathbf{z}_{-i}) = \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$
• Where:
  • $d_i$ is the document in which word $w_i$ occurs
  • $n_{-i,j}^{(w)}$ is the number of times (ignoring position $i$) word $w$ is assigned to topic $j$
  • $n_{-i,j}^{(d)}$ is the number of times (ignoring position $i$) topic $j$ is used in document $d$
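In code, this full conditional is a few lines of count arithmetic. A minimal sketch, assuming dense count matrices n_wt (word-topic) and n_dt (document-topic) whose names are my own:

```python
import numpy as np

def conditional(wi, di, n_wt, n_dt, alpha, beta):
    """P(z_i = j | z_{-i}, w) for all topics j, assuming the counts n_wt and
    n_dt already EXCLUDE token i (the '-i' subscripts in the formula)."""
    W = n_wt.shape[0]                # vocabulary size
    # Word-topic term: (n^(w_i)_{-i,j} + beta) / (n^(.)_{-i,j} + W*beta)
    left = (n_wt[wi, :] + beta) / (n_wt.sum(axis=0) + W * beta)
    # Document-topic term: its denominator (n^(d_i)_{-i,.} + T*alpha)
    # is constant in j, so it cancels when we normalize below.
    right = n_dt[di, :] + alpha
    p = left * right
    return p / p.sum()               # normalize over topics j
```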
Steps for deriving the complete conditionals
1. Begin with the full joint distribution over the data, latent variables, and model parameters, given the fixed parameters $\alpha$ and $\beta$ of the prior distributions.
2. Write out the desired collapsed joint distribution and set it equal to the appropriate integral over the full joint in order to marginalize over $\Theta$ and $\Phi$.
3. Perform algebra and group like terms.
4. Expand the generic notation by applying the closed-form definitions of the Multinomial, Categorical, and Dirichlet distributions.
5. Transform the representation: change the product indices from products over documents and word sequences to products over cluster labels and token counts.
6. Simplify by combining products, adding exponents, and pulling constant multipliers outside of integrals.
7. When you have integrals over terms in the form of the kernel of a Dirichlet distribution, consider how to convert the result into a familiar distribution (see the identity below).
8. Once you have the expression for the joint, derive the expression for the conditional distribution.
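Step 7 turns on a single identity: the integral of a Dirichlet kernel over the simplex is the inverse of the Dirichlet normalizing constant. For example, for one topic's word distribution $\phi_j$ (a worked instance of the step, in the notation above):

```latex
\int_{\Delta} \prod_{w=1}^{W} \phi_{j,w}^{\; n_j^{(w)} + \beta - 1} \, d\phi_j
  \;=\; \frac{\prod_{w=1}^{W} \Gamma\!\big(n_j^{(w)} + \beta\big)}
             {\Gamma\!\big(n_j^{(\cdot)} + W\beta\big)}
```

Applying this once per topic (and the analogous identity once per document for $\theta^{(d)}$) yields the closed forms for $P(\mathbf{w} \mid \mathbf{z})$ and $P(\mathbf{z})$ given earlier.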
Collapsed Gibbs Sampler for LDA
• For $t = 1$ to $N_{\text{iter}}$:
  • For variables $z_1, \dots, z_n$ (i.e., for $i = 1$ to $n$):
    • Draw $z_i^{(t)}$ from $P(z_i \mid \mathbf{z}_{-i}, \mathbf{w})$
Collapsed Gibbs Sampler for LDA
This is nicer than your average Gibbs sampler:
• Memory: the counts (the $n$ statistics) can be cached in two sparse matrices
• No special functions, simple arithmetic
• The distributions on $\Theta$ and $\Phi$ are analytic in the topic assignments $\mathbf{z}$ and words $\mathbf{w}$, and can later be recomputed from the samples in a given iteration of the sampler:
$$\hat{\phi}_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta} \qquad \hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_\cdot^{(d)} + T\alpha}$$
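Putting the pieces together, here is a minimal sketch of the whole collapsed sampler (dense NumPy count matrices rather than the sparse matrices mentioned above; all function and variable names are my own):

```python
import numpy as np

def lda_gibbs(docs, T, alpha, beta, W, iters=1000, seed=0):
    """docs: list of lists of word ids in [0, W). Returns the assignments z
    and point estimates of phi and theta from the final iteration."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))                  # word-topic counts
    n_dt = np.zeros((len(docs), T))          # document-topic counts
    z = [[int(rng.integers(T)) for _ in doc] for doc in docs]  # random init
    for d, doc in enumerate(docs):           # populate counts from init
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1
            n_dt[d, z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                n_wt[w, j] -= 1; n_dt[d, j] -= 1   # remove token i: the -i counts
                # Full conditional (document-term denominator is constant in j)
                p = ((n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta)
                     * (n_dt[d] + alpha))
                j = int(rng.choice(T, p=p / p.sum()))
                z[d][i] = j
                n_wt[w, j] += 1; n_dt[d, j] += 1   # restore counts
    phi = (n_wt + beta) / (n_wt.sum(axis=0) + W * beta)                   # (W, T)
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)
    return z, phi, theta
```

On a toy corpus like the one in the next slides (T = 2, five documents of ten words each over a small vocabulary), a hypothetical call might look like `z, phi, theta = lda_gibbs(corpus, T=2, alpha=0.5, beta=0.5, W=5)`.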
Gibbs sampling in LDA: a worked example
(Figure: a toy corpus of $n = 50$ word tokens with $T = 2$ topics, $N_d = 10$ words per document, and $M = 5$ documents. A table lists each token's index $i$, its word $w_i$ (MATHEMATICS, KNOWLEDGE, RESEARCH, WORK, SCIENTIFIC, ..., JOY), its document $d_i$, and its topic assignment $z_i$.)
• Iteration 1: each $z_i$ is assigned randomly
• Iteration 2: visiting each token $i$ in turn, remove its current assignment from the counts, compute
$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$
and draw a new $z_i$ from it
• After many iterations (e.g., 1000), the assignments are samples from the posterior, and the topics become coherent
A Visual Example: Bars
• A toy problem; just a metaphor for inference on text
• pixel = word; image = document
• Sample each pixel from a mixture of topics
(Figures: the topics are bars of pixels; documents generated from the topics; evolution of the topics (the $\Phi$ matrix) over iterations of the sampler.)
Interpretable decomposition
• SVD gives a basis for the data, but not an interpretable one
• The true basis is not orthogonal, so rotation does no good
Effects of Hyper-parameters
• $\alpha$ and $\beta$ control the relative sparsity of $\Theta$ and $\Phi$:
  • smaller $\alpha$: fewer topics per document
  • smaller $\beta$: fewer words per topic
• Good assignments $\mathbf{z}$ are a compromise in sparsity
Bayesian model selection
How many topics do we need?
A Bayesian would consider the posterior: $P(T \mid \mathbf{w}) \propto P(\mathbf{w} \mid T) \, P(T)$
Computing $P(\mathbf{w} \mid T)$ involves summing over assignments $\mathbf{z}$
Sweeping T
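Griffiths & Steyvers approximate $P(\mathbf{w} \mid T)$ by the harmonic mean of $P(\mathbf{w} \mid \mathbf{z})$ over posterior samples of $\mathbf{z}$. Below is a sketch in log space; the helper names and the use of logsumexp for numerical stability are my own choices.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_p_w_given_z(n_wt, beta):
    """log P(w | z) from word-topic counts n_wt, via the closed form above."""
    W, T = n_wt.shape
    return (T * (gammaln(W * beta) - W * gammaln(beta))
            + gammaln(n_wt + beta).sum()
            - gammaln(n_wt.sum(axis=0) + W * beta).sum())

def log_harmonic_mean(log_likes):
    """log of the harmonic mean of P(w|z) over S samples z ~ P(z | w, T)."""
    log_likes = np.asarray(log_likes)
    return np.log(len(log_likes)) - logsumexp(-log_likes)

# For each candidate T: run the sampler, score each retained sample with
# log_p_w_given_z, combine via log_harmonic_mean, and prefer the T that
# maximizes the estimate (a uniform prior P(T) drops out of the comparison).
```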
Analysis of PNAS abstracts
• Used all D = 28,154 abstracts from 1991-2001
• Used any word occurring in at least five abstracts and not on the “stop” list (W = 20,551)
• Segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus
• Also used PNAS class designations for 2001 (Acknowledgment: Kevin Boyack)
Running the algorithm
• Memory requirements linear in T(W + D); runtime proportional to nT
• T = 50, 100, 200, 300, 400, 500, 600, (1000)
• Ran 8 chains for each T: burn-in of 1000 iterations, then 10 samples per chain at a lag of 100
• All runs completed in under 30 hours on the Blue Horizon supercomputer at San Diego
How many topics?
Topics by Document Length
A Selection of Topics
(Each column shows one topic's most probable words, ordered by $P(w \mid z)$:)
• FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
• HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
• MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
• STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
• NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
• TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
Cold topics, Hot topics
(Figure: mean topic proportions by year, 1991-2001, for the topics with the strongest linear trends.)
Hot topics (increasing over the decade):
• Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
• Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
• Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics (decreasing over the decade):
• Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
• Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
• Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Conclusions
• Estimation/inference in LDA is more or less straightforward using Gibbs sampling; i.e., easy!
• Not so easy in all graphical models
Coming Soon
Topical n-grams