topic models - indian institute of technology...

93
Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without footnote mark Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 1 / 71

Upload: others

Post on 03-Sep-2019

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topic Models

Pawan Goyal

CSE, IITKGP

October 13-14, 2014

Footnotetext without footnote markPawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 1 / 71

Page 2: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Why Topic Modeling?

Information OverloadAs more information becomes available, it becomes more difficult to find anddiscover what we need.

Topic ModelingProvides methods for automatically organizing, understanding, searching andsummarizing large electronic archives

Discover the hidden themes that pervade the collection

Annotate the documents according to those themes

Use annotations to organize, summarize, and search the texts

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 2 / 71

Page 3: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Why Topic Modeling?

Information OverloadAs more information becomes available, it becomes more difficult to find anddiscover what we need.

Topic ModelingProvides methods for automatically organizing, understanding, searching andsummarizing large electronic archives

Discover the hidden themes that pervade the collection

Annotate the documents according to those themes

Use annotations to organize, summarize, and search the texts

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 2 / 71

Page 4: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Applications: Discover Topics from a corpus

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 3 / 71

Page 5: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Applications: Model the evolution of topics over time

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 4 / 71

Page 6: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Applications: Model connections between topics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 5 / 71

Page 7: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Applications: Discover influential articles

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 6 / 71

Page 8: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Link Prediction using Relational Topic Models

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 7 / 71

Page 9: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Applications: Organize and browse large corpora

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 8 / 71

Page 10: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Intuition: Documents exhibit multiple topics

This articles is about using data analysis to determine the number of genes anorganism needs to survive

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 9 / 71

Page 11: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Intuition: Documents exhibit multiple topics

Highlighted words: ‘blue’: data analysis, ‘pink’: evolutionary biology, ‘yellow’:genetics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 9 / 71

Page 12: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Intuition: Documents exhibit multiple topics

The article blends genetics, data analysis and evolutionary biology in differentproportions

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 9 / 71

Page 13: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Intuition: Documents exhibit multiple topics

Knowing that this article blends those topics would help situate it in a collectionof scientific articles

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 9 / 71

Page 14: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topic Model: Basic Idea

A generative statistical model that captures this intuition.

Generative ModelDocuments are mixture of topics, where a topic is a probability distribution overwords.

genetics topic has words about genetics with high probability and theevolutionary biology topic has words about evolutionary biology with highprobability.Technically, the generative model assumes that the topics are generated first,before the documents.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 10 / 71

Page 15: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topic Model: Basic Idea

A generative statistical model that captures this intuition.

Generative ModelDocuments are mixture of topics, where a topic is a probability distribution overwords.genetics topic has words about genetics with high probability and theevolutionary biology topic has words about evolutionary biology with highprobability.

Technically, the generative model assumes that the topics are generated first,before the documents.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 10 / 71

Page 16: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topic Model: Basic Idea

A generative statistical model that captures this intuition.

Generative ModelDocuments are mixture of topics, where a topic is a probability distribution overwords.genetics topic has words about genetics with high probability and theevolutionary biology topic has words about evolutionary biology with highprobability.Technically, the generative model assumes that the topics are generated first,before the documents.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 10 / 71

Page 17: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Generative Model for LDA

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 11 / 71

Page 18: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Generative Model for LDA

Wordsare generated in a two-stage process: Firstly, randomly choose a distributionover topics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 11 / 71

Page 19: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Generative Model for LDA

To generate each word in the document: randomly choose a topic from thedistribution over topics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 11 / 71

Page 20: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Generative Model for LDA

Once this topic has been chosen, randomly choose a word from thecorresponding distribution over vocabulary

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 11 / 71

Page 21: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

What does the statistical model reflect?

All the document in the collection share the same set of topics, but eachdocument exhibits those topics in different proportions

Each word in each document is drawn from one of the topics, where theselected topic is chosen from the per-document distribution over topics

In the example article, the distribution over topics would place probability ongenetics, data analytics and evolutionary biology, and each word is drawn fromone of those three topics.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 12 / 71

Page 22: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

What does the statistical model reflect?

All the document in the collection share the same set of topics, but eachdocument exhibits those topics in different proportions

Each word in each document is drawn from one of the topics, where theselected topic is chosen from the per-document distribution over topics

In the example article, the distribution over topics would place probability ongenetics, data analytics and evolutionary biology, and each word is drawn fromone of those three topics.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 12 / 71

Page 23: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Real Inference with LDA for the example article

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 13 / 71

Page 24: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Central Problem of LDA

The documents themselves are observed, while the topic structure - thetopics, per-document topic distributions, and the per-document per-wordtopic assignments - is hidden structure.

The central computational problem is to use the observed documents toinfer the hidden topic structure, i.e. reversing the generative process.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 14 / 71

Page 25: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Goal: The posterior distribution

Infer the hidden variablesCompute their distribution conditioned on the documents

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 15 / 71

Page 26: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topics from TASA corpus

37,000 text passages from educational materials (300 topics)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 16 / 71

Page 27: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topics from TASA corpus

Documents with different content can be generated by choosing differentdistributions over topics.

Equal probability to first two topics:

about a person who has taken toomany drugs and how that affected color perceptions.

Equal probability to the last two topics: about a person who experienceda loss of memory, which required a visit to the doctor.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 17 / 71

Page 28: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topics from TASA corpus

Documents with different content can be generated by choosing differentdistributions over topics.

Equal probability to first two topics: about a person who has taken toomany drugs and how that affected color perceptions.

Equal probability to the last two topics:

about a person who experienceda loss of memory, which required a visit to the doctor.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 17 / 71

Page 29: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Topics from TASA corpus

Documents with different content can be generated by choosing differentdistributions over topics.

Equal probability to first two topics: about a person who has taken toomany drugs and how that affected color perceptions.

Equal probability to the last two topics: about a person who experienceda loss of memory, which required a visit to the doctor.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 17 / 71

Page 30: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Generative model and statistical inference

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 18 / 71

Page 31: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Important points

bag-of-words assumption: The generative process does not make anyassumptions about the order of words in the documents.

capturing polysemy: The way that the model is defined, there is no notionof mutual exclusivity that restricts words to be part of one topic only. Ex:both ‘money’ and ‘river’ topics can give high probability to the word ‘bank’.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 19 / 71

Page 32: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Graphical Model (Notation)

Nodes are random variables

Edges denote possible dependence

Observed variables are shaded

Plates denote replicated structure

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 20 / 71

Page 33: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Graphical Model (Notation)

Structure of the graph defines the pattern of conditional dependencebetween the ensemble of random variables

E.g., this graph corresponds to

p(y,x1, . . . ,xN) = p(y)N

∏n=1

p(xn|y)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 21 / 71

Page 34: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

LDA: Graphical Model

Each piece of the structure is a random variable.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 22 / 71

Page 35: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Latent Dirichlet Allocation: Generative Model

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 23 / 71

Page 36: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

What is Latent Dirichlet Allocation (LDA)?

‘Latent’ has the same sense in LDA as in Latent semantic indexing, i.e.capturing topics as latent variables

The distribution that is used to draw the per-document topic distributionsis called a Dirichlet distribution. This result is used to allocate the wordsof the documents to different topics.

Dirichlet DistributionThe Dirichlet distribution is an exponential family distribution over the simplex,i.e. positive vectors that sum to one

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 24 / 71

Page 37: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

What is Latent Dirichlet Allocation (LDA)?

‘Latent’ has the same sense in LDA as in Latent semantic indexing, i.e.capturing topics as latent variables

The distribution that is used to draw the per-document topic distributionsis called a Dirichlet distribution. This result is used to allocate the wordsof the documents to different topics.

Dirichlet DistributionThe Dirichlet distribution is an exponential family distribution over the simplex,i.e. positive vectors that sum to one

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 24 / 71

Page 38: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dirichlet Distribution

αis: hyper-parameters of the model which can be interpreted as: αj canbe interpreted as an observation count for the number of times topic j issampled in a document

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 25 / 71

Page 39: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dirichlet Distribution

αis: hyper-parameters of the model which can be interpreted as: Thesepriors can be interpreted as forces in the topic distributions with higher α

moving the topics away from the corners of the simplex

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 25 / 71

Page 40: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dirichlet Distribution

αis: hyper-parameters of the model which can be interpreted as: Whenα < 1, there is a bias to pick topic distributions favoring just a few topics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 25 / 71

Page 41: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dirichlet Distribution

αis: hyper-parameters of the model which can be interpreted as: It isconvenient to use a symmetirc Dirichlet distribution with a singlehyper-parameter α1 = α2 . . . = α

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 25 / 71

Page 42: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 1

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 26 / 71

Page 43: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 10

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 27 / 71

Page 44: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 100

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 28 / 71

Page 45: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 1

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 29 / 71

Page 46: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 0.1

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 30 / 71

Page 47: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 0.01

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 31 / 71

Page 48: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Effect of α: α = 0.001

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 32 / 71

Page 49: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Online Implementations

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 33 / 71

Page 50: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Latent Dirichlet Allocation: Statistical Inference

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 34 / 71

Page 51: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

LDA: computing posterior

Joint distribution of the hidden and observed variables

What is posterior?Conditional distribution of the hidden variables given the observed variables

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 35 / 71

Page 52: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

LDA: computing posterior

Joint distribution of the hidden and observed variables

What is posterior?Conditional distribution of the hidden variables given the observed variables

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 35 / 71

Page 53: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Computing Posterior

Numerator is the joint distribution of all the random variables, which canbe easily computed for any settings of the hidden variables

Denominator is the marginal probability of the observations, probability ofseeing the observed corpus under any topic model

This marginal is intractable to compute!The sum is over all possible ways of assigning each observed word of thecollection to one of the topics. Number of words can be of the order of millions!!

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 36 / 71

Page 54: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Computing Posterior

Numerator is the joint distribution of all the random variables, which canbe easily computed for any settings of the hidden variables

Denominator is the marginal probability of the observations, probability ofseeing the observed corpus under any topic model

This marginal is intractable to compute!The sum is over all possible ways of assigning each observed word of thecollection to one of the topics. Number of words can be of the order of millions!!

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 36 / 71

Page 55: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Can we approximate it?

Algorithms to approximate it fall in two categories:

Sampling-based AlgorithmsCollect samples from the posterior to approximate it with an empiricaldistribution

Variational MethodsDeterministic alternative to sampling-based algorithms

The inference problem is transformed to an optimization problem

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 37 / 71

Page 56: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Can we approximate it?

Algorithms to approximate it fall in two categories:

Sampling-based AlgorithmsCollect samples from the posterior to approximate it with an empiricaldistribution

Variational MethodsDeterministic alternative to sampling-based algorithms

The inference problem is transformed to an optimization problem

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 37 / 71

Page 57: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Gibbs Sampling

A form of Markov chain Monte Carlo (MCMC), which simulates ahigh-dimensional distribution by sampling on lower-dimensional subset ofvariables where each subset is conditioned on the value of all others

Sampling is done sequentially and proceeds until the sampled valuesapproximate the target distribution

It directly estimates the posterior distribution over z , and uses this toprovide estimates for β and θ

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 38 / 71

Page 58: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Gibbs Sampling

Suppose we have a word token i for which we want to find the topicassignment probability : p(zi = j)

Represent the collection of documents by a set of word indices wi anddocument indices di for this token i

Gibbs sampling considers each word token in turn and estimates theprobability of assigning the current word token to each topic, conditionedon the topic assignment to all other word tokens

From this conditional distribution, a topic is sampled and stored as thenew topic assignment for this word token

This conditional is written as P(zi = j|z−i,wi,di, .)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 39 / 71

Page 59: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Gibbs Sampling

Let us define two matrices CWT and CDT of dimensions W×T and D×Trespectively.

CwjWT contains the number of times word w is assigned to topic j, not

including the current instance

CdjWT contains the number of times topic j is assigned to some word

token in document d, not including the current instance

P(zi = j|z−i,wi,di, .) ∝Cwij

WT + η

W

∑w=1

CwjWT + Wη

CdijDT + α

T

∑t=1

CdjDT + Tα

The left part is the probability of word w under topic j (How likely a word isfor a topic) whereas

the right part is the probability of topic j under the current topic distributionfor document d (How dominant a topic is in a document)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 40 / 71

Page 60: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Gibbs Sampling

Let us define two matrices CWT and CDT of dimensions W×T and D×Trespectively.

CwjWT contains the number of times word w is assigned to topic j, not

including the current instance

CdjWT contains the number of times topic j is assigned to some word

token in document d, not including the current instance

P(zi = j|z−i,wi,di, .) ∝Cwij

WT + η

W

∑w=1

CwjWT + Wη

CdijDT + α

T

∑t=1

CdjDT + Tα

The left part is the probability of word w under topic j (How likely a word isfor a topic) whereas

the right part is the probability of topic j under the current topic distributionfor document d (How dominant a topic is in a document)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 40 / 71

Page 61: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Algorithm

Start: Each word token is assigned to a random topic in [1 . . .T]

For each word token, a new topic is sampled as per P(zi = j|z−i,wi,di, .),adjusting the matrices CWT and CDT

A single pass through all word tokens in the document is one Gibbssample

After the burnin period, these samples are saved at regularly spacedintervals, to prevent correlations between samples

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 41 / 71

Page 62: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Estimating θ and β

βi(j) =

CijWT + η

W

∑k=1

CkjWT + Wη

θj(d) =

CdjDT + α

T

∑k=1

CdkDT + Tα

These values correspond to predictive distributions ofsampling a new token of word i from topic j, and

samping a new token in document d from topic j

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 42 / 71

Page 63: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

An Example

The algorithm can be illustrated by generating artificial data from a known topicmodel and applying the algorithm to check whether it is able to infer theoriginal genrative structure.

ExampleLet topic 1 give equal probability to MONEY, LOAN, BANK and topic 2give equal probability to words RIVER, STREAM, and BANK

βMONEY(1) = βLOAN

(1) = βBANK(1) = 1/3

βRIVER(2) = βSTREAM

(2) = βBANK(2) = 1/3

We generate 16 documents by arbitrarily mixing two topics.

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 43 / 71

Page 64: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Initial Structure

Colors reflect initial random assignment, black = topic 1, while = topic 2

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 44 / 71

Page 65: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

After 64 iterations of Gibbs Sampling

βMONEY(1) = 0.32,βLOAN

(1) = 0.29,βBANK(1) = 0.39

βRIVER(2) = 0.25,βSTREAM

(2) = 0.4,βBANK(2) = 0.35

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 45 / 71

Page 66: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Computing Similarities

Document SimilaritySimilarity between documents di and d2 can be measured by the similaritybetween their topic distributions θ(d1) and θ(d2)

KL divergence : D(p,q) =T

∑j=1

pjlog2pj

qj

Symmetrized KL divergence: 12 [D(p,q) + D(q,p)] seems to work well

Similarity with respect to query qMaximize the conditional probability of query given the document:p(q|di) = ∏

wk∈qp(wk|di)

= ∏wk∈q

T

∑j=1

P(wk|z = j)P(z = j|di)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 46 / 71

Page 67: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Computing Similarities

Similarity between two wordsHaving observed a single word in a new context, what are the other words thatmight appear in the same context, based on the topic interpretation for theobserved word?

p(w2|w1) =T

∑j=1

p(w2|z = j)p(z = j|wi)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 47 / 71

Page 68: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Example

Observed and predicted responses for the word ’PLAY’

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 48 / 71

Page 69: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Modeling Science

DataThe OCR’ed collection of Science from 1990-2000

17K documents

11M words

20K unique terms (stop words and rare words removed)

Model100-topic model using variational inference

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 49 / 71

Page 70: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Modeling Science

DataThe OCR’ed collection of Science from 1990-2000

17K documents

11M words

20K unique terms (stop words and rare words removed)

Model100-topic model using variational inference

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 49 / 71

Page 71: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Example Inference

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 50 / 71

Page 72: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Example Topics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 51 / 71

Page 73: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Modeling Richer Assumptions in Topic Models

Correlated topic models

Dynamic topic models

Measuring scholarly impact

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 52 / 71

Page 74: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Correlated Topic Models

The Dirichlet is an exponential family distribution on the simplex, positivevectors that sum to one

However, the near independence of components makes it a poor choicefor modeling topic proportions

An article about fossil fuels is more likely to also be about geology thanabout genetics

Using logistic normal distribution

A multivariate normal distribution of a k-dimensional vector x = [X1,X2, . . . ,Xk]can be written as

x∼ Nk(µ,Σ)

with k-dimensional mean vector µ and k× k covariance matrix Σ

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 53 / 71

Page 75: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Correlated Topic Model (CTM)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 54 / 71

Page 76: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

CTM supports more topics and provides a better fit thanLDA

Held-out log probability on Science

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 55 / 71

Page 77: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Correlated Topics

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 56 / 71

Page 78: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dynamic Topic Models

LDA assumptionLDA assumes that the order of documents does not matter

Not appropriate for corpora that span hundreds of years

We might want to track how language changes over time

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 57 / 71

Page 79: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dynamic Topic Models

βk,t|βk,t−1 ∼ N(βk,t−1,σ2I)

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 58 / 71

Page 80: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dynamic Topic Models

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 59 / 71

Page 81: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Dynamic Topic Models

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 60 / 71

Page 82: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Measuring Scholarly Impact

How to model influence?Idea from Dynamic Topic Models, influential articles reflect futurechanges in language use

The influence of an article is a latent variable

Inflential articles affect the drift of the topics that they discuss

The posterior gives a retrospective estimate of influential article

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 61 / 71

Page 83: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Measuring Scholarly Impact

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 62 / 71

Page 84: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Measuring Scholarly Impact

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 63 / 71

Page 85: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Relational Topic Models

Connected ObservationsCitation networks of documents

Hyperlinked networks of web-pages

Friend-connected social network profiles

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 64 / 71

Page 86: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Relational Topic Models

LDA needs to be adapted to a model of content and connection

RTMs find hidden structure in both types of data

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 65 / 71

Page 87: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Relational Topic Models

Works in a supervised framework, allowing predictions about new and unlinkeddata

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 66 / 71

Page 88: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Link Prediction Task using RTM

Given a new document, which documents is it likely to link to?

RTM allows for such predictionslinks given the new words of a document

words given the links of a new document

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 67 / 71

Page 89: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Supervised settings of LDA

Use data points paired with response variablesUser reviews paired with a number of stars

Web pages paired with a number of likes

Documents paired with links to other documents

Images paired with a category

Supervised topic modesare topic models of documents and responses, fit to find topics predictive ofthe response

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 68 / 71

Page 90: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Supervised settings of LDA

Use data points paired with response variablesUser reviews paired with a number of stars

Web pages paired with a number of likes

Documents paired with links to other documents

Images paired with a category

Supervised topic modesare topic models of documents and responses, fit to find topics predictive ofthe response

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 68 / 71

Page 91: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Supervised LDA

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 69 / 71

Page 92: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Supervised LDA

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 70 / 71

Page 93: Topic Models - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/topic_models.pdf · Topic Models Pawan Goyal CSE, IITKGP October 13-14, 2014 Footnotetext without

Prediction

Pawan Goyal (IIT Kharagpur) Topic Models October 13-14, 2014 71 / 71