CS246 Latent Dirichlet Analysis


Page 1:

CS246

Latent Dirichlet Analysis

Page 2:

LSI

- LSI uses SVD to find the best rank-K approximation.
- The result is difficult to interpret, especially with negative numbers.
- Q: Can we develop a more interpretable method?

Page 3:

Theory of LDA (Model-based Approach)

- Develop a simplified model of how users write a document based on topics.
- Fit the model to the existing corpus and "reverse engineer" the topics used in each document.
- Q: How do we write a document?
  A: (1) Pick the topic(s). (2) Start writing on the topic(s) with related terms.

Page 4:

Two Probability Vectors

- For every document d, we assume that the user first picks the topics to write about.
  P(z|d): the probability of picking topic z when the user writes each word in document d. The document-topic vector of d.
- We also assume that every topic is associated with each term with a certain probability.
  P(w|z): the probability of picking the term w when the user writes on topic z. The topic-term vector of z.

\sum_{i=1}^{T} P(z_i | d) = 1, \qquad \sum_{j=1}^{W} P(w_j | z) = 1

Page 5:

Probabilistic Topic Model

- There exist T topics.
- The topic-term vector for each topic is set before any document is written: P(wj|zi) is set for every zi and wj.
- Then for every document d:
  - The user decides the topics to write on, i.e., the P(zi|d)'s.
  - For each word in d:
    - The user selects a topic zi with probability P(zi|d).
    - The user selects a word wj with probability P(wj|zi).

Page 6:

Probabilistic Document Model

[Figure: two topics generate three documents. Topic 1 (money, bank, loan) and Topic 2 (river, stream, bank) are the P(w|z) vectors; DOC 1 draws from Topic 1 with P(z|d) = 1.0, DOC 3 from Topic 2 with 1.0, and DOC 2 mixes both at 0.5/0.5. A superscript on each word (e.g., money¹, river²) marks the topic that generated it.]

Page 7:

Example: Calculating Probability

- z1 = {w1: 0.8, w2: 0.1, w3: 0.1}
- z2 = {w1: 0.1, w2: 0.2, w3: 0.7}
- d's topics are {z1: 0.9, z2: 0.1}.
- d has three terms {w3², w1¹, w2¹}, where the superscripts mark each word's topic assignment, as in the earlier figure.

Q: What is the probability that a user will write such a document?
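Treating the superscripts as fixed topic assignments (my reading, based on the notation of the document-model figure), the probability is the product of P(z|d) · P(w|z) over the three words:

P = [P(z2|d) P(w3|z2)] · [P(z1|d) P(w1|z1)] · [P(z1|d) P(w2|z1)]
  = (0.1 × 0.7) × (0.9 × 0.8) × (0.9 × 0.1) ≈ 0.0045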

Page 8:

Corpus Generation Probability

- T: # topics, D: # documents, M: # words per document
- Probability of generating the corpus C:

P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(w_{i,j} | z_{i,j}) \, P(z_{i,j} | d_i)
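Note that the product treats each word's topic assignment z_{i,j} as given. As a sanity check, here is a minimal Python sketch (my code; all names are illustrative) that evaluates log P(C) for a corpus in which every word carries its topic assignment, exactly the quantity defined above:

```python
import math

def corpus_log_prob(docs, p_w_given_z, p_z_given_d):
    """docs[i] is a list of (word, topic) pairs for document i;
    p_w_given_z[z][w] = P(w|z), p_z_given_d[i][z] = P(z|d_i)."""
    return sum(math.log(p_w_given_z[z][w]) + math.log(p_z_given_d[i][z])
               for i, doc in enumerate(docs)
               for w, z in doc)
```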

Page 9:

Generative Model vs Inference (1)

[Figure: the same probabilistic document model as Page 6 — known P(w|z) and P(z|d) vectors generate the three documents, with every word's topic marked by a superscript.]

Page 10:

Generative Model vs Inference (2)

[Figure: the same picture with every probability and topic superscript replaced by "?" — in inference we observe only the documents and must recover P(w|z), P(z|d), and the per-word topic assignments.]

Page 11:

Probabilistic Latent Semantic Indexing (pLSI)

- Basic idea: we pick the P(zj|di), P(wk|zj), and zij values that maximize the corpus generation probability.
- Maximum-likelihood estimation (MLE).
- More discussion later on how to compute the P(zj|di), P(wk|zj), and zij values that maximize the probability.

Page 12:

Problem of pLSI

- Q: 1M documents, 1000 topics, 1M words, 1000 words/doc. How much input data? How many variables do we have to estimate? (See the count sketched below.)
- Q: Too much freedom. How can we avoid overfitting?
- A: Add constraints to reduce the degrees of freedom.
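A rough count under those numbers (my arithmetic, not from the slides): the input is 1M docs × 1000 words/doc = 10⁹ word occurrences, while the free parameters number about 10⁹ document-topic probabilities (1M × 1000), 10⁹ topic-term probabilities (1000 × 1M), and 10⁹ topic assignments — roughly as many parameters as data points, which is why pLSI can overfit badly.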

Page 13:

Latent Dirichlet Analysis (LDA)

- When term probabilities are selected for each topic: the topic-term probability vector (P(w1|zj), …, P(wW|zj)) is sampled randomly from a Dirichlet distribution.
- When users select topics for a document: the document-topic probability vector (P(z1|d), …, P(zT|d)) is sampled randomly from a Dirichlet distribution.

\mathrm{Dir}(p_1, \ldots, p_T;\ \alpha_1, \ldots, \alpha_T) = \frac{\Gamma\left(\sum_{j=1}^{T} \alpha_j\right)}{\prod_{j=1}^{T} \Gamma(\alpha_j)} \prod_{j=1}^{T} p_j^{\alpha_j - 1}

\mathrm{Dir}(p_1, \ldots, p_W;\ \beta_1, \ldots, \beta_W) = \frac{\Gamma\left(\sum_{j=1}^{W} \beta_j\right)}{\prod_{j=1}^{W} \Gamma(\beta_j)} \prod_{j=1}^{W} p_j^{\beta_j - 1}

Page 14:

What is Dirichlet Distribution?

- Multinomial distribution: given the probability pᵢ of each event eᵢ, what is the probability that each event eᵢ occurs αᵢ times after n trials? We assume the pᵢ's; the distribution assigns a probability to the αᵢ's.

f(\alpha_1, \ldots, \alpha_k;\ n, p_1, \ldots, p_k) = \frac{\Gamma(n+1)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}

- Dirichlet distribution: the "inverse" of the multinomial distribution. We assume the αᵢ's; the distribution assigns a probability to the pᵢ's.

f(p_1, \ldots, p_k;\ \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\left(\sum_{i=1}^{k} \alpha_i\right) + 1\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}

Page 15:

Dirichlet Distribution

Q: Given α₁, α₂, …, α_k, what are the most likely p₁, p₂, …, p_k values?

f(p_1, \ldots, p_k;\ \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\left(\sum_{i=1}^{k} \alpha_i\right) + 1\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}
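A quick derivation (not on the slide): maximizing the density above subject to p₁ + … + p_k = 1 means maximizing \sum_i \alpha_i \log p_i + \lambda (1 - \sum_i p_i), which gives p_i = \alpha_i / \sum_j \alpha_j. So for this form of the density, the most likely probability vector points in the "direction" of the α's.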

Page 16:

Normalized Probability Vector and Simplex

- Remember that \sum_{i=1}^{T} P(z_i | d) = 1 and \sum_{j=1}^{W} P(w_j | z) = 1.
- When (p₁, …, pₙ) satisfies p₁ + … + pₙ = 1, the points lie on a "simplex plane".

[Figure: the vector (p₁, p₂, p₃) and its 2-simplex plane.]

Page 17:

Effect of α Values

[Figure: samples of (p₁, p₂, p₃) on the 2-simplex for (α₁, α₂, α₃) = (1, 1, 1) versus (α₁, α₂, α₃) = (1, 2, 1).]

Page 18:

Effect of α Values

[Figure: samples of (p₁, p₂, p₃) on the 2-simplex for (α₁, α₂, α₃) = (1, 2, 1) versus (α₁, α₂, α₃) = (1, 2, 3).]

Page 19:

Effect of α Values

[Figure: samples of (p₁, p₂, p₃) on the 2-simplex for (α₁, α₂, α₃) = (1, 1, 1) versus (α₁, α₂, α₃) = (5, 5, 5).]

Page 20:

Effect of α Values

[Figure: another view of the (α₁, α₂, α₃) = (1, 1, 1) versus (5, 5, 5) comparison on the simplex.]

Page 21:

Minor Correction

The formula

f(p_1, \ldots, p_k;\ \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\left(\sum_{i=1}^{k} \alpha_i\right) + 1\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \, p_1^{\alpha_1} \cdots p_k^{\alpha_k}

is not the "standard" Dirichlet distribution. The "standard" Dirichlet distribution formula is:

f(p_1, \ldots, p_k;\ \alpha_1, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, p_1^{\alpha_1 - 1} \cdots p_k^{\alpha_k - 1}

The non-standard version was used to make the connection to the multinomial distribution clear. From now on, we use the standard formula.

Page 22:

Back to LDA Document Generation Model

- For each topic z: pick the word probability vector, the P(wj|z)'s, by taking a random sample from Dir(β₁, …, β_W).
- For every document d: the user decides its topic vector, the P(zi|d)'s, by taking a random sample from Dir(α₁, …, α_T).
  - For each word in d:
    - The user selects a topic z with probability P(z|d).
    - The user selects a word w with probability P(w|z).
- Once all is said and done, we have:
  - P(wj|z): the topic-term vector for each topic
  - P(zi|d): the document-topic vector for each document
  - a topic assignment for every word in each document

(A runnable sketch of this process follows.)
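A minimal sketch of this generative process in Python, assuming NumPy and symmetric hyperparameters; the sizes and the values of alpha and beta below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, D, M = 2, 5, 16, 16      # topics, vocabulary size, documents, words per doc
alpha, beta = 0.5, 0.1          # symmetric Dirichlet hyperparameters (illustrative)

phi = rng.dirichlet([beta] * W, size=T)     # one P(w|z) vector per topic
corpus = []
for _ in range(D):
    theta = rng.dirichlet([alpha] * T)      # P(z|d) for this document
    zs = rng.choice(T, size=M, p=theta)     # a topic for each word position
    ws = [int(rng.choice(W, p=phi[z])) for z in zs]   # a word given its topic
    corpus.append(list(zip(ws, zs)))        # (word, topic) pairs; topics are latent in practice
```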

Page 23:

Symmetric Dirichlet Distribution

- In principle, we need to assume two vectors, (α₁, …, α_T) and (β₁, …, β_W), as input parameters.
- In practice, we often assume all αᵢ's are equal to α and all βᵢ's are equal to β: two scalar values α and β, not two vectors. This is the symmetric Dirichlet distribution.
- Q: What is the implication of this assumption?
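One way to answer (my gloss, not on the slide): a symmetric prior expresses no a priori preference for any particular topic within a document, or any particular term within a topic; only the overall concentration — how peaked or uniform the sampled vectors are — is controlled.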

Page 24:

Effect of α Value on Symmetric Dirichlet

- Q: What does it mean? How will the sampled document-topic vectors change as α grows?
- Common choice: α = 50/T, β = 200/W

[Figure: samples of (p₁, p₂, p₃) on the 2-simplex for α = 2 versus α = 6.]
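The visual answer (my summary of the figure): as α grows, samples concentrate near the uniform vector, so documents mix many topics about evenly; a small α pushes samples toward the corners of the simplex, so each document is dominated by a few topics.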

Page 25:

Plate Notation

[Figure: the LDA plate diagram. The P(w|z) vectors sit on a plate of size T (one per topic); inside a plate of size M (one per document), each document has its P(z|d) vector; inside a nested plate of size N (one per word), a latent topic z is drawn and generates the observed word w.]

Page 26:

LDA as Topic Inference

Given a corpus:

  d1: w11, w12, …, w1m
  …
  dN: wN1, wN2, …, wNm

find the P(z|d), P(w|z), and zij that are most "consistent" with the given corpus.

Q: How can we compute such P(z|d), P(w|z), and zij? The best method so far is to use the Monte Carlo method together with Gibbs sampling.

Page 27:

Monte Carlo Method (1)

- A class of methods that compute a number through repeated random sampling of certain event(s).
- Q: How can we compute π? (A classic example is sketched below.)
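The textbook illustration, as a minimal Python sketch (my example): sample points uniformly in the unit square and count the fraction that land inside the quarter circle; that fraction estimates π/4.

```python
import random

def estimate_pi(n=1_000_000):
    inside = sum(1 for _ in range(n)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4 * inside / n   # quarter-circle area / square area = pi/4

print(estimate_pi())   # ~3.14 for large n
```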

Page 28:

Monte Carlo Method (2)

1. Define the domain of possible events.
2. Generate the events randomly from the domain using a certain probability distribution.
3. Perform a deterministic computation using the events.
4. Aggregate the results of the individual computations into the final result.

Q: How can we take random samples from a particular distribution?

Page 29:

Gibbs Sampling

- Q: How can we take a random sample x from the distribution f(x)?
- Q: How can we take a random sample (x, y) from the distribution f(x, y)?
- Gibbs sampling: given the current sample (x₁, …, xₙ), pick an axis xᵢ and take a random sample of the xᵢ value, conditioned on all the other (x₁, …, xₙ) values.
- In practice, we iterate over the xᵢ's sequentially. (A toy example follows.)
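A toy sketch (my example, not from the slides): Gibbs sampling from a standard bivariate normal with correlation rho, where each conditional distribution is a one-dimensional normal we can sample directly.

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n=10_000):
    """Sample (x, y) from a standard bivariate normal with correlation rho
    by alternately resampling each coordinate given the other."""
    sd = math.sqrt(1 - rho * rho)       # conditional standard deviation
    x = y = 0.0
    samples = []
    for _ in range(n):
        x = random.gauss(rho * y, sd)   # sample x | y
        y = random.gauss(rho * x, sd)   # sample y | x
        samples.append((x, y))
    return samples
```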

Page 30:

Markov-Chain Monte-Carlo Method (MCMC)

- Gibbs sampling is in the class of Markov-Chain sampling: the next sample depends only on the current sample.
- Markov-Chain Monte-Carlo method: generate random events using Markov-Chain sampling and apply the Monte Carlo method to compute the result.

Page 31:

Applying MCMC to LDA

- Let us apply the Monte Carlo method to estimate the LDA parameters.
- Q: How can we map the LDA inference problem to random events?
- We first focus on identifying the topics {zij} for each word {wij}.
- Event: assignment of the topics {zij} to the wij's. The assignment should be done according to P({zij}|C).
- Q: How do we sample according to P({zij}|C)?
- Q: Can we use Gibbs sampling? How will it work?
- Q: What is P(zij | {z−ij}, C)?

Page 32:

P(zij = t | z−ij, C)

- nwt: how many times the word w has been assigned to the topic t
- ndt: how many words in the document d have been assigned to the topic t

P(z_{ij} = t \mid z_{-ij}, C) \propto \frac{n_{w_{ij},t} + \beta}{\sum_{w=1}^{W} (n_{wt} + \beta)} \cdot \frac{n_{d_i,t} + \alpha}{\sum_{k=1}^{T} (n_{d_i,k} + \alpha)}

Q: What is the meaning of each term? (The first factor is a smoothed estimate of P(wij|t); the second is a smoothed estimate of P(t|di).)

Page 33:

LDA with Gibbs Sampling

- For each word wij:
  - Assign it to topic t with probability

    \frac{n_{w_{ij},t} + \beta}{\sum_{w=1}^{W} (n_{wt} + \beta)} \cdot \frac{n_{d_i,t} + \alpha}{\sum_{k=1}^{T} (n_{d_i,k} + \alpha)}

  - For the prior topic t of wij, decrease nwt and ndt by 1.
  - For the new topic t of wij, increase nwt and ndt by 1.
- Repeat the process many times (at least hundreds of times).
- Once the process is over, we have zij for every wij, together with the counts nwt and ndt, from which

P(w|t) = \frac{n_{wt} + \beta}{\sum_{i=1}^{W} (n_{w_i t} + \beta)}, \qquad P(t|d) = \frac{n_{dt} + \alpha}{\sum_{k=1}^{T} (n_{dk} + \alpha)}

(A runnable sketch follows.)
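A compact sketch of this sampler in Python with NumPy (my code; names and default values are illustrative). It implements the update above: remove the word's current assignment from the counts, sample a new topic in proportion to the two factors, then put the counts back.

```python
import numpy as np

def lda_gibbs(docs, T, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of
    word ids in [0, W). Returns per-word topic assignments and the
    count matrices n_wt, n_dt used to estimate P(w|t) and P(t|d)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_wt = np.zeros((W, T))                             # word-topic counts
    n_dt = np.zeros((D, T))                             # document-topic counts
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random initialization
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1
            n_dt[d, z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_wt[w, t] -= 1                          # remove current assignment
                n_dt[d, t] -= 1
                p = ((n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta)
                     * (n_dt[d] + alpha))                # proportional to P(w|t) * P(t|d)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                n_wt[w, t] += 1                          # add new assignment
                n_dt[d, t] += 1
    return z, n_wt, n_dt
```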

Page 34:

Result of LDA (Latent Dirichlet Analysis)

- TASA corpus: 37,000 text passages from educational materials collected by Touchstone Applied Science Associates
- Set T = 300 (300 topics)

Page 35:

Inferred Topics

Page 36:

Word Topic Assignments

Page 37:

LDA Algorithm: Simulation

- Two topics: River, Money
- Five words: "river", "stream", "bank", "money", "loan"
- Generate 16 documents by randomly mixing the two topics and using the LDA model

Topic-term probabilities used for generation:

          river   stream   bank   money   loan
  River    1/3     1/3     1/3     –       –
  Money     –       –      1/3    1/3     1/3

Page 38:

Generated Documents and Initial Topic Assignment before Inference

[Figure: the 16 generated documents with their initial topic assignments. The first 6 and the last 3 documents are purely from one topic; the others are mixtures. White dot: "River". Black dot: "Money".]

Page 39:

Topic Assignment after LDA Inference

[Figure: the same 16 documents after 64 iterations. The first 6 and the last 3 documents are purely from one topic; the others are mixtures.]

Page 40:

Inferred Topic-Term Matrix

Model parameter:

          river   stream   bank   money   loan
  River    0.33    0.33    0.33    –       –
  Money     –       –      0.33   0.33    0.33

Estimated parameter:

          river   stream   bank   money   loan
  River    0.25    0.40    0.35    –       –
  Money     –       –      0.32   0.29    0.39

Not perfect, but very close, especially given the small data size.

Page 41:

SVD vs LDA

- Both perform the following decomposition:

[Diagram: a document-term matrix factored into a document-topic matrix times a topic-term matrix.]

- SVD views this as matrix approximation.
- LDA views this as probabilistic inference based on a generative model. Each entry corresponds to a "probability", giving better interpretability.

Page 42:

LDA as Soft Classification

- Soft vs hard clustering/classification.
- After LDA, every document is assigned to a small number of topics with some weights. Documents are not assigned exclusively to one topic: soft clustering.

Page 43:

Summary

- Probabilistic topic model
  - Generative model of documents
  - pLSI and overfitting
  - LDA, MCMC, and probabilistic interpretation
- Statistical parameter estimation
  - Multinomial distribution and Dirichlet distribution
  - Monte Carlo method
  - Gibbs sampling
  - Markov-Chain class of sampling