
Page 1: Topic Models - UMD

Topic Models

Computational Linguistics: Jordan Boyd-Graber
University of Maryland
INTRODUCTION

Computational Linguistics: Jordan Boyd-Graber | UMD Topic Models | 1 / 1

Page 2: Topic Models - UMD

Why topic models?

• Suppose you have a huge number of documents
• Want to know what's going on
• Can't read them all (e.g., every New York Times article from the '90s)
• Topic models offer a way to get a corpus-level view of major themes
• Unsupervised


Page 4: Topic Models - UMD

Roadmap

• What are topic models
• How to know if you have a good topic model
• How to go from raw data to topics

Page 5: Topic Models - UMD

What are Topic Models?

Conceptual Approach

From an input corpus and a number of topics K → words to topics

Corpus (example New York Times headlines):
• Forget the Bootleg, Just Download the Movie Legally
• Multiplex Heralded As Linchpin To Growth
• The Shape of Cinema, Transformed At the Click of a Mouse
• A Peaceful Crew Puts Muppets Where Its Mouth Is
• Stock Trades: A Better Deal For Investors Isn't Simple
• The three big Internet portals begin to distinguish among themselves as shopping malls
• Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

Page 6: Topic Models - UMD

What are Topic Models?

Conceptual Approach

From an input corpus and a number of topics K → words to topics

TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: play, film, movie, theater, production, star, director, stage
TOPIC 3: sell, sale, store, product, business, advertising, market, consumer

Page 7: Topic Models - UMD

What are Topic Models?

Conceptual Approach

• For each document, what topics are expressed by that document?

[Figure: the corpus headlines from the previous slide, each labeled with its dominant topic (TOPIC 1, TOPIC 2, or TOPIC 3)]

Page 8: Topic Models - UMD

What are Topic Models?

Topics from Science

[Figure: example topics learned from articles in the journal Science]

Page 9: Topic Models - UMD

What are Topic Models?

Why should you care?

• Neat way to explore / understand corpus collections
  - E-discovery
  - Social media
  - Scientific data
• NLP applications
  - Word sense disambiguation
  - Discourse segmentation
  - Machine translation
• Psychology: word meaning, polysemy
• Inference is (relatively) simple

Page 10: Topic Models - UMD

What are Topic Models?

Matrix Factorization Approach

Dataset ≈ Topic Assignment × Topics
(M × V) ≈ (M × K) × (K × V)

K: number of topics
M: number of documents
V: size of vocabulary

• If you use singular value decomposition (SVD), this technique is called latent semantic analysis.
• Popular in information retrieval.
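The factorization above can be sketched directly with NumPy's SVD on a toy count matrix (hypothetical values, not real data): keep the top K singular vectors and the product of the two factors is the best rank-K approximation of the dataset.

```python
import numpy as np

# Toy M x V count matrix: M = 4 documents over V = 6 word types.
X = np.array([
    [2, 1, 0, 0, 1, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 1, 2, 1, 1],
    [0, 0, 0, 1, 2, 2],
], dtype=float)

K = 2  # number of retained "topics" (singular vectors)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

doc_topic = U[:, :K] * s[:K]    # M x K: document representations
topic_word = Vt[:K, :]          # K x V: "topic" directions over the vocabulary
X_hat = doc_topic @ topic_word  # best rank-K approximation of X

print(np.round(X_hat, 2))
```

Note a key difference from the generative approach that follows: SVD factors can contain negative entries, so the rows are not probability distributions over words.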


Page 12: Topic Models - UMD

What are Topic Models?

Alternative: Generative Model

• How your data came to be
• Sequence of probabilistic steps
• Posterior inference
• Blei, Ng, Jordan. Latent Dirichlet Allocation. JMLR, 2003.


Page 14: Topic Models - UMD

What are Topic Models?

Multinomial Distribution

• Distribution over discrete outcomes
• Represented by a non-negative vector that sums to one
• Picture representation: points on a simplex with corners (1,0,0), (0,1,0), (0,0,1) and interior points such as (1/3,1/3,1/3), (1/2,1/2,0), (1/4,1/4,1/2)
• Comes from a Dirichlet distribution
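Drawing multinomial parameter vectors from a Dirichlet is one line in NumPy. A small sketch (hypothetical parameter values) showing that every draw lands on the simplex, and that the concentration controls where draws cluster:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse Dirichlet: mass piles up near the corners of the simplex.
alpha = np.array([0.1, 0.1, 0.1])
phi = rng.dirichlet(alpha, size=5)   # 5 draws, each a 3-outcome multinomial
print(phi.round(3))

# Dense Dirichlet: draws cluster near the center (1/3, 1/3, 1/3).
alpha_dense = np.array([10.0, 10.0, 10.0])
phi_dense = rng.dirichlet(alpha_dense, size=5)
print(phi_dense.round(3))
```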


Page 16: Topic Models - UMD

What are Topic Models?

Dirichlet Distribution

[Figure: Dirichlet densities on the simplex for several parameter settings]


Page 20: Topic Models - UMD

What are Topic Models?

Dirichlet Distribution

• If φ ∼ Dir(α), w ∼ Mult(φ), and n_k = |{w_i : w_i = k}|, then

  p(φ | α, w) ∝ p(w | φ) p(φ | α)                 (1)
             ∝ ∏_k φ_k^{n_k} · ∏_k φ_k^{α_k − 1}   (2)
             ∝ ∏_k φ_k^{α_k + n_k − 1}             (3)

• Conjugacy: this posterior has the same form as the prior
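Conjugacy makes the posterior update purely mechanical: add the observed counts n_k to the prior parameters α_k. A minimal sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0, 1.0])        # symmetric Dirichlet prior

# Simulate 100 observed draws w ~ Mult(phi) from a hidden multinomial.
true_phi = np.array([0.7, 0.2, 0.1])
w = rng.choice(3, size=100, p=true_phi)
n = np.bincount(w, minlength=3)          # n_k = |{w_i : w_i = k}|

posterior = alpha + n                    # posterior is Dir(alpha + n), by conjugacy
posterior_mean = posterior / posterior.sum()
print(posterior_mean.round(3))
```

With enough observations the posterior mean concentrates near the hidden φ, while the prior keeps every outcome's probability strictly positive.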


Page 22: Topic Models - UMD

What are Topic Models?

Generative Model

TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: play, film, movie, theater, production, star, director, stage
TOPIC 3: sell, sale, store, product, business, advertising, market, consumer

Page 23: Topic Models - UMD

What are Topic Models?

Generative Model

[Figure: the corpus headlines, each labeled with its dominant topic (TOPIC 1, TOPIC 2, or TOPIC 3)]

Page 24: Topic Models - UMD

What are Topic Models?

Generative Model

Example document:

"Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ..."

TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: play, film, movie, theater, production, star, director, stage
TOPIC 3: sell, sale, store, product, business, advertising, market, consumer


Page 28: Topic Models - UMD

What are Topic Models?

Generative Model Approach

[Plate diagram: λ → β_k (plate over K topics); α → θ_d → z_n → w_n (plate over N word positions, nested in a plate over M documents)]

• For each topic k ∈ {1, ..., K}, draw a multinomial distribution β_k from a Dirichlet distribution with parameter λ
• For each document d ∈ {1, ..., M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
• For each word position n ∈ {1, ..., N}, select a hidden topic z_n from the multinomial distribution parameterized by θ_d
• Choose the observed word w_n from the distribution β_{z_n}
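The four generative steps above translate line for line into code. This is a sketch that simulates the LDA generative story forward (hyperparameter values λ and α are hypothetical; real corpora would use word strings rather than integer word ids):

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, M, N = 8, 3, 4, 10   # vocab size, topics, documents, words per document
lam, alpha = 0.1, 0.5      # Dirichlet hyperparameters (hypothetical values)

# For each topic k, draw beta_k ~ Dir(lambda): a multinomial over the vocabulary.
beta = rng.dirichlet(np.full(V, lam), size=K)   # K x V

docs = []
for d in range(M):
    # For each document d, draw theta_d ~ Dir(alpha): its topic proportions.
    theta = rng.dirichlet(np.full(K, alpha))
    # For each word position, select a hidden topic z_n ~ Mult(theta) ...
    z = rng.choice(K, size=N, p=theta)
    # ... then choose the observed word w_n ~ Mult(beta[z_n]).
    w = np.array([rng.choice(V, p=beta[zn]) for zn in z])
    docs.append(w)

print(docs[0])   # one synthetic document, as a sequence of word ids
```

Inference runs this story in reverse: given only the `docs`, recover likely values for `beta`, `theta`, and `z`.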


We use statistical inference to uncover the most likely unobserved variables given observed data.

Page 33: Topic Models - UMD

What are Topic Models?

Topic Models: What’s Important

• Topic models (latent variables)
  - Topics to word types: a multinomial distribution
  - Documents to topics: a multinomial distribution
• Focus in this talk: statistical methods
  - Model: story of how your data came to be
  - Latent variables: missing pieces of your story
  - Statistical inference: filling in those missing pieces
• We use latent Dirichlet allocation (LDA), a fully Bayesian version of pLSI, the probabilistic version of LSA



Page 36: Topic Models - UMD

Evaluation

Evaluation

Held-out log likelihood: train each candidate model (Model A, Model B, Model C) on the corpus, then measure how well it predicts held-out documents such as "Sony Ericsson's Infinite Hope for a Turnaround", "For Search, Murdoch Looks to a Deal With Microsoft", and "Price War Brews Between Amazon and Wal-Mart". Here the models score −4.8, −15.16, and −23.42.

• Held-out log likelihood measures predictive power, not what the topics are
• How you compute it is important too (Wallach et al. 2009)
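The held-out comparison above can be sketched with scikit-learn, whose `score` method returns an approximate held-out log likelihood (a variational lower bound) — itself an example of the computational choices Wallach et al. caution about. The toy training and held-out documents here are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = [
    "movie film theater director stage",
    "computer internet phone service system",
    "sale store product market consumer",
    "film star movie theater production",
    "internet machine computer phone technology",
    "business market store sale sell",
]
held_out = ["movie film star theater", "computer phone internet service"]

vec = CountVectorizer().fit(train)
X_train, X_test = vec.transform(train), vec.transform(held_out)

# Compare candidate models by (approximate) held-out log likelihood.
scores = {}
for K in (2, 3):
    lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(X_train)
    scores[K] = lda.score(X_test)   # variational bound, not an exact likelihood
    print(f"K={K}: held-out score {scores[K]:.2f}")
```

Higher (less negative) scores indicate better predictive power on unseen documents; as the slides stress, this says nothing about whether the topics are interpretable.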

Page 37: Topic Models - UMD

Evaluation

Word Intrusion

TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: play, film, movie, theater, production, star, director, stage
TOPIC 3: sell, sale, store, product, business, advertising, market, consumer

Page 38: Topic Models - UMD

Evaluation

Word Intrusion

1. Take the highest probability words from a topic

Original Topic

dog, cat, horse, pig, cow

2. Take a high-probability word from another topic and add it

Topic with Intruder

dog, cat, apple, horse, pig, cow

3. We ask users to find the word that doesn’t belong

Hypothesis

If the topics are interpretable, users will consistently choose the true intruder.
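The three steps above — plus the model-precision score defined on the next slide — fit in a few lines. A sketch using the slides' dog/cat example, with hypothetical user clicks:

```python
import random

random.seed(0)

# 1. Highest-probability words from a topic, 2. plus an intruder from another topic.
topic_words = ["dog", "cat", "horse", "pig", "cow"]
intruder = "apple"

# 3. Show the shuffled list to users and ask which word doesn't belong.
shown = topic_words + [intruder]
random.shuffle(shown)   # order of words is shuffled for each user
print(shown)

# Hypothetical clicks from four users.
clicks = ["apple", "apple", "cow", "apple"]

# Model precision: fraction of users who clicked on the true intruder.
model_precision = sum(c == intruder for c in clicks) / len(clicks)
print(model_precision)   # 0.75
```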


Page 41: Topic Models - UMD

Evaluation

Word Intrusion

• Order of words was shuffled
• Which intruder was selected varied
• Model precision: percentage of users who clicked on the intruder


Page 43: Topic Models - UMD

Evaluation

Word Intrusion: Which Topics are Interpretable?

New York Times, 50 LDA Topics

[Figure: histogram of model precision across the 50 LDA topics; example topics include {committee, legislation, proposal, republican, taxis}, {fireplace, garage, house, kitchen, list}, {americans, japanese, jewish, states, terrorist}, {artist, exhibition, gallery, museum, painting}]

Model precision: percentage of correct intruders found

Page 44: Topic Models - UMD

Evaluation

Interpretability and Likelihood

Model precision on New York Times

[Figure: model precision and topic log odds plotted against predictive (held-out) log likelihood for CTM, LDA, and pLSI with 50, 100, and 150 topics, on New York Times and Wikipedia corpora]

Within a model, higher likelihood ≠ higher interpretability

Page 45: Topic Models - UMD

Evaluation

Interpretability and Likelihood

Topic log odds on Wikipedia

[Figure: model precision and topic log odds plotted against predictive (held-out) log likelihood for CTM, LDA, and pLSI with 50, 100, and 150 topics, on New York Times and Wikipedia corpora]

Across models, higher likelihood ≠ higher interpretability

Page 46: Topic Models - UMD

Evaluation

Evaluation Takeaway

• Measure what you care about
• If you care about prediction, likelihood is good
• If you care about a particular task, measure that
