Finding Scientific topics
August 26 - 31, 2011
Topic Modeling
1. A document as a probabilistic mixture of topics.
2. A topic as a probability distribution over words.
3. The words are assumed known and the number of words is fixed.
If there are T topics, the probability of a word w is
P(w) = sum over j = 1, …, T of P(w | z = j) P(z = j).
Here { w } denotes words and { z } denotes topics.
The conditional probability P(w | z) indicates which words are important to a topic.
For a particular document, P(z), the distribution over topics, determines how the topics are mixed together in forming the document.
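The mixture formula above can be sketched numerically; all sizes and probability values below are made up for illustration (T = 2 topics over a 3-word vocabulary):

```python
p_z = [0.7, 0.3]                      # P(z): topic proportions for one document
p_w_given_z = [
    [0.5, 0.4, 0.1],                  # P(w | z = 1)
    [0.1, 0.2, 0.7],                  # P(w | z = 2)
]

# P(w) = sum_j P(w | z = j) P(z = j), for each word in the vocabulary
p_w = [sum(p_z[j] * p_w_given_z[j][w] for j in range(2)) for w in range(3)]
print(p_w)  # a proper distribution: the entries sum to 1
```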
An example of “soft classification”: a document is not assigned only one topic (a single class).
For a document, P(z) gives an indication of which topics should be assigned to it.
How should we think about / visualize P(z) and P(w | z)?
What do we want to know? What do we want to compute from the input data?
Inputs:
a. A document, or a collection of documents, with the list of words appearing in it, { w1, …, wn } (repetition allowed, perhaps with unimportant words deleted, such as the articles ‘the’, ‘an’ and the prepositions ‘on’, ‘of’).
b. The number of topics, T.
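Input (a) can be produced by a small preprocessing step; a minimal sketch, assuming a hypothetical stop-word list:

```python
# Hypothetical stop-word list: articles and prepositions to delete
STOP_WORDS = {"the", "an", "a", "on", "of", "in"}

def word_list(text):
    """Turn raw text into the word list {w1, ..., wn}, repetition allowed."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

words = word_list("The model of topics in the collection")
print(words)  # ['model', 'topics', 'collection']
```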
We want to know (compute) P(z) and P(w | z) for each topic z.
There is one P(z) (or D of them, where D is the number of documents), and there are T distributions P(w | z = j). What form should P(z) and P(w | z) take?
Multinomial distributions ( http://en.wikipedia.org/wiki/Multinomial_distribution )
What does this mean? Each P(z) is a non-negative vector with T components (summing to 1). Each P(w | z) is a non-negative vector with W components (summing to 1), where W here is the vocabulary size.
One possible solution
Question: how many variables are there? Problems with this approach: local maxima and slow convergence.
Bayesian Approach
Estimate phi and theta indirectly via the following generative model:
theta ~ Dirichlet(alpha) (topic proportions for the document)
phi(j) ~ Dirichlet(beta) (word distribution for topic j)
z_i ~ Multinomial(theta) (topic of word position i)
w_i ~ Multinomial(phi(z_i)) (the word itself)
Dirichlet Distribution (http://en.wikipedia.org/wiki/Dirichlet_distribution )
What does this generative model say?
It tells us how the observed data are thought to be generated.
Where is the prior?
Idea: use the generative model to explain the input data.
Alpha and beta are the hyperparameters.
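The generative model can be simulated end-to-end; a toy sketch with made-up sizes (T topics, vocabulary size V, document length N), sampling a symmetric Dirichlet via normalized Gamma draws:

```python
import random

random.seed(0)
T, V, N = 3, 5, 20          # topics, vocabulary size, words per document
alpha, beta = 1.0, 1.0      # hyperparameters

def dirichlet(dim, conc):
    """Sample a symmetric Dirichlet(conc) vector via normalized Gamma draws."""
    g = [random.gammavariate(conc, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs):
    """Draw one index according to a probability vector."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

theta = dirichlet(T, alpha)                     # theta ~ Dirichlet(alpha)
phi = [dirichlet(V, beta) for _ in range(T)]    # phi(j) ~ Dirichlet(beta)
z = [categorical(theta) for _ in range(N)]      # z_i ~ Multinomial(theta)
w = [categorical(phi[zi]) for zi in z]          # w_i ~ Multinomial(phi(z_i))
```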
The goal is to evaluate the posterior distribution P(z | w).
This is difficult because the denominator P(w) cannot be computed: it sums over all T^W possible topic assignments. (Know very well what the notations Z and W stand for.)
However, we do have a closed form for the numerator, P(w, z) = P(w | z) P(z).
P(z | theta) = P(z1, …, zW | theta) = P(z1 | theta) ⋯ P(zW | theta), assuming conditional independence of the zi given theta; then integrate theta out against its Dirichlet prior to obtain P(z).
This gives Equation 3 (with D = 1, one document).
Equation 2 can be obtained similarly.
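A sketch of the integration that yields Equation 3 for a single document (D = 1), where n_j is the number of words assigned topic j; the closed form follows from the normalizing constant of the Dirichlet density:

```latex
P(\mathbf{z})
  = \int P(\mathbf{z}\mid\theta)\,p(\theta\mid\alpha)\,d\theta
  = \int \prod_{j=1}^{T}\theta_j^{\,n_j}\;
    \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}
    \prod_{j=1}^{T}\theta_j^{\,\alpha-1}\,d\theta
  = \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\,
    \frac{\prod_{j=1}^{T}\Gamma(n_j+\alpha)}{\Gamma(W+T\alpha)}
```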
The goal is to evaluate the posterior distribution P(z | w).
This is difficult because the denominator cannot be computed. But what can we do with P(z | w)? Recall that our goal is to estimate theta (topic proportions) and phi (topics).
Suppose we knew the true topic assignments (z1, …, zW); then theta could be estimated as
theta_i = (number of words assigned topic i) / (total number of words, W).
How about phi? Analogously, phi_w(j) = (number of times word w is assigned topic j) / (number of words assigned topic j).
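With known assignments, both estimates reduce to counting; a sketch on toy data (‘A’/‘B’/‘C’ is a hypothetical 3-word vocabulary):

```python
words = ["A", "A", "C", "B", "C", "A"]
z     = [ 1,   1,   2,   2,   2,   1 ]   # topic assigned to each word
T, W = 2, len(words)

# theta_i = (# words assigned topic i) / W   (topics numbered 1..T)
theta = [z.count(j + 1) / W for j in range(T)]

# phi_w(j) = (# times word w is assigned topic j) / (# words assigned topic j)
vocab = ["A", "B", "C"]
phi = [[sum(1 for wd, zz in zip(words, z) if wd == v and zz == j + 1) / z.count(j + 1)
        for v in vocab] for j in range(T)]
print(theta, phi)
```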
More About Dirichlet Distribution
The Dirichlet distribution is defined on the (T − 1)-dimensional standard simplex. The density function of the Dirichlet distribution with parameters (alpha_1, …, alpha_T) is proportional to
theta_1^(alpha_1 − 1) ⋯ theta_T^(alpha_T − 1).
When alpha_i is close to zero, probability concentrates near theta_i = 0. On the other hand, when alpha_i is away from zero (large), probability moves away from theta_i = 0. An example with T = 3.
More About Dirichlet Distribution
The expected value and variance of each component theta_i are given by
E[theta_i] = alpha_i / alpha_0 and Var[theta_i] = alpha_i (alpha_0 − alpha_i) / (alpha_0^2 (alpha_0 + 1)), where alpha_0 = alpha_1 + … + alpha_T.
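These moment formulas can be checked by Monte Carlo; a sketch with made-up parameters, again sampling the Dirichlet via normalized Gamma draws:

```python
import random

random.seed(1)
alphas = [2.0, 3.0, 5.0]   # example parameters, T = 3
a0 = sum(alphas)

# Closed-form moments of the first component theta_1
mean_1 = alphas[0] / a0
var_1 = alphas[0] * (a0 - alphas[0]) / (a0 ** 2 * (a0 + 1))

# Monte Carlo check: sample theta_1 many times and compare
samples = []
for _ in range(20000):
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    samples.append(g[0] / s)
mc_mean = sum(samples) / len(samples)
mc_var = sum((x - mc_mean) ** 2 for x in samples) / len(samples)
```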
Suppose we knew the true topic assignments (z1, …, zW); then theta could be estimated as
theta_i = (number of words assigned topic i) / (total number of words, W), and similarly for phi.
Of course, we don’t know the ground truth, only the distribution P(Z | W). We need to know P(theta | Z, W).
By Bayes’ rule, we have P(theta | Z, W) ∝ P(Z | theta) P(theta).
Therefore, P(theta | Z, W) is itself another Dirichlet distribution, with parameters (n_1 + alpha, …, n_T + alpha), where n_j is the number of words assigned topic j. For a given Z, what should the estimated theta be? Its posterior mean, theta_j = (n_j + alpha) / (W + T·alpha). This gives Equation 6 (and Equation 7 similarly).
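The posterior-mean estimate is again simple counting; a sketch with a made-up assignment:

```python
# Posterior-mean estimate of theta given an assignment z (cf. Equation 6):
# theta_j = (n_j + alpha) / (W + T * alpha)
z = [1, 1, 2, 2, 3, 1, 2, 3, 1, 1]   # toy assignment, topics 1..T
T, W, alpha = 3, len(z), 1.0

n = [z.count(j + 1) for j in range(T)]                  # words per topic, n_j
theta = [(n[j] + alpha) / (W + T * alpha) for j in range(T)]
print(theta)  # smoothed toward uniform by the prior alpha
```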
The point, of course, is that we don’t know the exact topic assignment (z1, …, zW), only its distribution P(z).
For a probability distribution P(x), the expectation of a function f can be estimated as E[f(x)] ≈ (1/n) Σ f(y_i), where the y_i are samples from P(x).
For example, we can use this formula to estimate the mean and variance of a distribution from its samples.
More samples give a more accurate estimate.
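A minimal illustration of the sample-average estimator, using a standard normal for P(x) and f(x) = x^2 (so the true expectation is 1):

```python
import random

random.seed(2)
# E[f(x)] ≈ (1/n) Σ f(y_i) with y_i sampled from P(x)
n = 50000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]   # y_i ~ N(0, 1)
estimate = sum(y * y for y in samples) / n             # estimates E[x^2] = 1
```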
The goal is to evaluate the posterior distribution P(z | w).
This is difficult because the denominator cannot be computed.
Instead, use Markov chain Monte Carlo (MCMC) to simulate the distribution.
What this means is that we draw samples from P(z | w); for each sample generated, we obtain one estimate of theta and phi according to Equations 6 and 7.
Simulating P(z | w) using MCMC (Markov Chain Monte Carlo)
Much more on this later… The steps are:
1. Initialize the topic assignments (z1, …, zW), s = 0.
2. Iterate (say, three thousand times): for each i = 1, …, W, resample the current assignment zi according to the conditional probability
P(zi = j | z−i, w) ∝ (n(wi, j) + beta) / (n(·, j) + V·beta) × (n(j) + alpha) / (W − 1 + T·alpha),
where the counts n exclude position i: n(w, j) is the number of times word w is assigned topic j, n(·, j) is the total number of words assigned topic j, and V is the vocabulary size. One cycle through all i gives a new topic assignment (z1, …, zW), s = s + 1.
3. Generate samples.
What does the formula say?
Example: suppose there are T = 3 topics, the dictionary contains 3 words (A, B, C), and the word list has 10 words (W = 10). Take alpha = beta = 1. The word list is { A, A, C, B, C, A, C, A, B, B }.
With the initial topic assignment
{ 1, 1, 2, 2, 3, 1, 2, 3, 1 , 1}
How do we apply the formula? The first word is A; in the word list, A has been assigned topics 1, 1, 1, and 3.
P ( z1 | Z-1, W) = (0.5159, 0.1848, 0.3003)
Suppose the sampled value is z1 = 3; the assignment becomes
{ 3, 1, 2, 2, 3, 1, 2, 3, 1, 1 }
Next, compute P(z2 | Z−2, W) and sample a new z2 value.
What are the effects of alpha and beta? (prior and large W)
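The worked example can be sketched as a small collapsed Gibbs sampler. The conditional below uses the common convention that the counts exclude the current word, which may give slightly different numbers from those on the slide; topics are numbered 0..2 instead of 1..3:

```python
import random

random.seed(3)
# Toy setup from the example: T = 3 topics, vocabulary {A, B, C} (V = 3),
# W = 10 words, alpha = beta = 1.
words = ["A", "A", "C", "B", "C", "A", "C", "A", "B", "B"]
z     = [ 0,   0,   1,   1,   2,   0,   1,   2,   0,   0 ]  # initial topics
T, V, W, alpha, beta = 3, 3, len(words), 1.0, 1.0
vocab = {"A": 0, "B": 1, "C": 2}

def conditional(i):
    """P(z_i = j | z_-i, w); counts exclude position i."""
    wi = vocab[words[i]]
    p = []
    for j in range(T):
        nwj = sum(1 for k in range(W)
                  if k != i and vocab[words[k]] == wi and z[k] == j)
        nj = sum(1 for k in range(W) if k != i and z[k] == j)
        p.append((nwj + beta) / (nj + V * beta)
                 * (nj + alpha) / (W - 1 + T * alpha))
    s = sum(p)
    return [x / s for x in p]

def sample(probs):
    r, acc = random.random(), 0.0
    for j, p in enumerate(probs):
        acc += p
        if r < acc:
            return j
    return len(probs) - 1

p1 = conditional(0)          # distribution for the first word (A)
for sweep in range(100):     # a few Gibbs sweeps over all positions
    for i in range(W):
        z[i] = sample(conditional(i))
```

Under this counting convention, topic 1 is the most probable choice for the first word, as on the slide, though the exact probabilities differ slightly.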
Summary
The goal is to infer theta from the data, with theta itself a distribution.
The Dirichlet distribution is a prior distribution on theta (a Bayesian approach). Therefore, it is a distribution on the space of distributions. No particular form of theta is assumed (nonparametric). The base probability space is finite and discrete, X = {1, …, T}. Things become much more complicated when X is no longer discrete, for example when X is the set of real numbers.
We will need a more sophisticated mathematical language, and that will be the goal for the next two to three weeks.