Topic Models and Its Applications
Zhaochun Ren, University of Amsterdam


Page 1: Topic Models and Its Applications - UvA

Topic Models and Its Applications

Zhaochun Ren University of Amsterdam

Page 2

2

Outline

• Review of topic models
• Topic models for IR
• Other applications of topic models
• Conclusion & outlook


Page 4

4

Model a document as a sample of mixed topics. Example (forum posts):

From cmo44, on Jun 18, 2010 8:40 AM: Yep. Same here. Safari crashes many times per day on all sorts of websites. And in particular on websites that contain embedded videos such as YouTube videos. But then Safari has always been unreliable - it's as bad on the iPhone.

From bearduk, on Jun 22, 2010 3:11 PM: Same here for me. It seems to happen on pages that have a number of images on them for me. Safari just cuts out and I get a blank screen for a second and then the iPad home screen. Sites like do:while that have a number of images seem to crash regularly.

Topic A   Topic B   Topic C   Topic D   Topic E   …

Page 5

5

Overview of topic models

• Idea: find low-dimensional descriptions of high-dimensional text

• Topic models enable, for a text document:
  • (1) Summarization: find a concise restatement
  • (2) Similarity: evaluate the closeness of texts

Page 6

6

Overview of topic models

• Applications of topic models:
  • Document summarization
  • Facilitating navigation/browsing
  • Information retrieval models
  • Document segmentation
  • Social media topic tracking
  • Event detection
  • etc.

Page 7

7

Introduction to topic models

• Expectation Maximization
• Probabilistic LSA
• Latent Dirichlet Allocation
  • Variational EM
  • Gibbs Sampling


Page 9

9

The general EM algorithm

• Given a joint distribution $p(W, Z \mid \theta)$, the goal of the EM algorithm is to maximize the likelihood $p(W \mid \theta)$ with respect to $\theta$.

Iterate:

E-step: evaluate the posterior $p(Z \mid W, \theta^{old})$

M-step: estimate $\theta^{new}$ by maximizing the expected complete-data log-likelihood:

$\theta^{new} = \arg\max_{\theta} \sum_{Z} p(Z \mid W, \theta^{old}) \ln p(W, Z \mid \theta)$
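The E- and M-steps can be sketched concretely. The following toy example is not from the slides: the mixture-of-two-coins setup and all names are illustrative. Here W is a set of observed flip sequences, Z the hidden coin identity of each sequence, and θ the two coin biases:

```python
# Minimal EM sketch for a mixture of two biased coins (illustrative only).
def em_two_coins(sequences, theta=(0.6, 0.4), iters=50):
    pA, pB = theta
    for _ in range(iters):
        # E-step: posterior p(Z | W, theta_old) for each sequence
        countsA = countsB = totalA = totalB = 0.0
        for seq in sequences:
            h, n = sum(seq), len(seq)
            likA = (pA ** h) * ((1 - pA) ** (n - h))
            likB = (pB ** h) * ((1 - pB) ** (n - h))
            respA = likA / (likA + likB)   # p(Z = A | W, theta_old)
            respB = 1.0 - respA
            countsA += respA * h; totalA += respA * n
            countsB += respB * h; totalB += respB * n
        # M-step: theta_new maximizes the expected complete-data log-likelihood
        pA, pB = countsA / totalA, countsB / totalB
    return pA, pB

flips = [[1, 1, 1, 1, 0], [0, 0, 0, 1, 0], [1, 1, 1, 0, 1], [0, 1, 0, 0, 0]]
pA, pB = em_two_coins(flips)
```

The responsibilities gradually separate the heads-heavy sequences from the tails-heavy ones, and the two biases converge near 0.8 and 0.2.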

Page 10

10

Introduction to topic models

• Expectation Maximization
• Probabilistic LSA
• Latent Dirichlet Allocation
  • Variational EM
  • Gibbs Sampling

Page 11

11

Probabilistic LSA

• Generative process:
  • select a document d with p(d)
  • select a topic z with p(z|d)
  • select a word w with p(w|z)

(Plate diagram: d → z → w, with N words per document and D documents.)

$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$
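As a small numeric illustration of the joint p(d, w_n) (the sizes and randomly drawn factor matrices are assumptions for illustration, not from the slides):

```python
import numpy as np

# Hypothetical pLSA quantities: 2 documents, 3 topics, 4 vocabulary words.
rng = np.random.default_rng(0)
p_d = np.array([0.5, 0.5])                       # p(d)
p_z_given_d = rng.dirichlet(np.ones(3), size=2)  # p(z|d), rows sum to 1
p_w_given_z = rng.dirichlet(np.ones(4), size=3)  # p(w|z), rows sum to 1

# p(d, w) = p(d) * sum_z p(w|z) p(z|d), computed for all (d, w) at once
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z)
```

Because each factor is a proper distribution, the resulting joint over (d, w) sums to 1.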

Page 12

12

Probabilistic LSA

Page 13

13

Introduction to topic models

• Expectation Maximization
• Probabilistic LSA
• Latent Dirichlet Allocation
  • Variational EM
  • Gibbs Sampling

Page 14

14

Latent Dirichlet Allocation

• Each topic can be represented as a finite mixture of words

• Each document in the corpus can be represented as a mixture of multiple topics

• “Bag of words” assumption

Page 15

15

Latent Dirichlet Allocation

• Generative process:
  • For each topic z: draw a word distribution φ_z ~ Dirichlet(β)
  • For each document d:
    • Draw a topic distribution θ_d ~ Dirichlet(α)
    • For each word position:
      • Draw a topic z ~ Multinomial(θ_d)
      • Draw a word w ~ Multinomial(φ_z)

(Plate diagram of LDA, with plates over the N words in each of D documents and over the K topics.)
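The generative process above can be sketched directly; the sizes below (K topics, V vocabulary words, D documents of N words each) are assumed toy values:

```python
import numpy as np

# Illustrative sampler for the LDA generative process (toy sizes).
rng = np.random.default_rng(1)
K, V, D, N = 3, 10, 5, 20          # topics, vocab size, docs, words per doc
alpha, beta = 0.5, 0.1

phi = rng.dirichlet(np.full(V, beta), size=K)    # word distribution per topic
docs = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))     # topic distribution per doc
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)               # topic ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])              # word ~ Multinomial(phi_z)
        words.append(int(w))
    docs.append(words)
```

Note the "bag of words" assumption: word order within a document carries no information in this process.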

Page 16

16

Approximate inference

• For the LDA model, computing the exact posterior distribution is intractable

• Two main approximate inference methods:
  • variational EM algorithm
  • Gibbs sampling

Page 17

17

Gibbs sampling

$p(\vec{w}, \vec{z} \mid \alpha, \beta) = \left( \frac{\Gamma\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \right)^{K} \left( \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)^{M} \cdot \prod_{k=1}^{K} \frac{\prod_{v=1}^{V} \Gamma(n_k^v + \beta_v)}{\Gamma\left(\sum_{v=1}^{V} (n_k^v + \beta_v)\right)} \cdot \prod_{m=1}^{M} \frac{\prod_{k=1}^{K} \Gamma(n_m^k + \alpha_k)}{\Gamma\left(\sum_{k=1}^{K} (n_m^k + \alpha_k)\right)}$

$p(z_i = j \mid \vec{z}_{-i}, \vec{w}) \propto \frac{n_{-i,j}^{v} + \beta_v}{\sum_{v=1}^{V} \left(n_{-i,j}^{v} + \beta_v\right)} \cdot \frac{n_{-i,j}^{m} + \alpha_j}{\sum_{k=1}^{K} \left(n_{-i,k}^{m} + \alpha_k\right)}$
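A minimal collapsed Gibbs sampler following the sampling equation above, on an assumed toy corpus (symmetric priors; the document-length denominator is dropped since it is constant in j):

```python
import numpy as np

# Collapsed Gibbs sampling for LDA (toy corpus; word ids in [0, V)).
rng = np.random.default_rng(2)
docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]
K, V, alpha, beta = 2, 5, 0.5, 0.1

# Count tables: n_kv (topic-word), n_mk (doc-topic), n_k (topic totals).
n_kv = np.zeros((K, V)); n_mk = np.zeros((len(docs), K)); n_k = np.zeros(K)
z = [[int(rng.integers(K)) for _ in d] for d in docs]
for m, d in enumerate(docs):
    for i, w in enumerate(d):
        k = z[m][i]; n_kv[k, w] += 1; n_mk[m, k] += 1; n_k[k] += 1

for _ in range(100):                      # Gibbs sweeps
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k = z[m][i]                   # remove current assignment (-i counts)
            n_kv[k, w] -= 1; n_mk[m, k] -= 1; n_k[k] -= 1
            # p(z_i = j | z_-i, w) ∝ (n_jw + β)/(n_j + Vβ) · (n_mj + α)
            p = (n_kv[:, w] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
            k = int(rng.choice(K, p=p / p.sum()))
            z[m][i] = k
            n_kv[k, w] += 1; n_mk[m, k] += 1; n_k[k] += 1

# Point estimate of the topic-word distributions from the final counts
phi = (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + V * beta)
```

In practice the burn-in and sample-averaging concerns discussed later in these slides apply; this sketch simply runs a fixed number of sweeps.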

Page 18

18

Outline

• Review of topic models
• Topic models for IR
• Other applications of topic models
• Conclusion & outlook

Page 19

19

LDA-Based Document Models for Ad-hoc Retrieval

• Motivation
• Method
• Experimental results
• Conclusion

Page 20

20

LDA-Based Document Models for Ad-hoc Retrieval

• Motivation
  1. Using topic models for document representation is an area of considerable interest;
  2. Document clustering is helpful for the language modeling framework;
  3. The assumption in the unigram model is too simple to effectively model a large-scale collection of documents.

Page 21

21

Overview of cluster-based retrieval

• The overall likelihood of observing document d from the cluster model is

$P(w_1, w_2, \ldots, w_{N_d}) = \sum_{z=1}^{K} P(z) \prod_{i=1}^{N_d} P(w_i \mid z)$

• By incorporating the cluster information into the language model as smoothing, we have:

$P(w \mid D) = \frac{N_d}{N_d + \mu} P_{ML}(w \mid D) + \left(1 - \frac{N_d}{N_d + \mu}\right) P(w \mid \mathrm{cluster})$
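The smoothing formula can be sketched as follows. The corpus-level counts standing in for P(w|cluster) and the μ value are assumptions for illustration (μ = 1000 is the setting used later in these slides):

```python
from collections import Counter

# Sketch of the cluster-smoothed document language model.
def smoothed_prob(word, doc, cluster_counts, mu=1000.0):
    doc_counts = Counter(doc)
    n_d = len(doc)
    p_ml = doc_counts[word] / n_d                              # P_ML(w|D)
    p_cluster = cluster_counts[word] / sum(cluster_counts.values())
    lam = n_d / (n_d + mu)                                     # N_d / (N_d + mu)
    return lam * p_ml + (1 - lam) * p_cluster

doc = ["topic", "model", "retrieval", "model"]
cluster = Counter({"topic": 10, "model": 30, "retrieval": 5, "query": 5})
p = smoothed_prob("model", doc, cluster)
```

With a short document and large μ, the estimate leans heavily on the cluster model, which is the intended behavior of Dirichlet-style smoothing.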

Page 22

22

LDA-Based Document Models for Ad-hoc Retrieval

• Method
  • LDA topic model
  • LDA-based retrieval
  • Complexity


Page 24

24

LDA retrieval model

• Query likelihood model

$P(Q \mid D) = \prod_{q \in Q} P(q \mid D)$

Page 25

25

LDA retrieval model

• Query likelihood model

$P(Q \mid D) = \prod_{q \in Q} P(q \mid D)$

• Combine the LDA model with the cluster retrieval model:

$P(w \mid D) = \lambda \left( \frac{N_d}{N_d + \mu} P_{ML}(w \mid D) + \left(1 - \frac{N_d}{N_d + \mu}\right) P(w \mid c) \right) + (1 - \lambda) P_{LDA}(w \mid D)$
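Scoring with the combined model can be sketched as below. The `p_lda` function is a hypothetical stub standing in for a trained LDA model's P_LDA(w|D); the vocabulary and documents are illustrative, while λ = 0.7 matches the value chosen later in these slides:

```python
import math
from collections import Counter

VOCAB = ["topic", "model", "retrieval", "query", "document"]

def p_lda(word, doc_id):
    return 1.0 / len(VOCAB)            # placeholder for P_LDA(w|D)

def lbdm_word_prob(word, doc, doc_id, cluster_counts, lam=0.7, mu=1000.0):
    n_d = len(doc)
    p_ml = Counter(doc)[word] / n_d
    p_c = cluster_counts[word] / sum(cluster_counts.values())
    # Dirichlet-smoothed part, then linear interpolation with the LDA part
    dir_part = n_d / (n_d + mu) * p_ml + mu / (n_d + mu) * p_c
    return lam * dir_part + (1 - lam) * p_lda(word, doc_id)

def query_log_likelihood(query, doc, doc_id, cluster_counts):
    # log P(Q|D) = sum over query terms of log P(q|D)
    return sum(math.log(lbdm_word_prob(q, doc, doc_id, cluster_counts))
               for q in query)

doc = ["topic", "model", "model", "retrieval"]
cluster = Counter({w: 10 for w in VOCAB})
score = query_log_likelihood(["model", "query"], doc, "d1", cluster)
```

Documents are then ranked by this log-likelihood; the LDA term lets a document score on query words it never contains literally, which is the effect discussed in the experiments below.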

Page 26

26

LDA retrieval model

• For each document d:

$P(w \mid d, \hat{\theta}, \hat{\phi}) = \sum_{z=1}^{K} P(w \mid z, \hat{\phi})\, P(z \mid \hat{\theta}, d)$

with the factors estimated from the Gibbs counts:

$P_{LDA}(w \mid D) = \sum_{z=1}^{K} \frac{n^{(w)}_{-i,z} + \beta}{\sum_{v} \left(n^{(v)}_{-i,z} + \beta\right)} \cdot \frac{n^{(d)}_{-i,z} + \alpha}{\sum_{k} \left(n^{(d)}_{-i,k} + \alpha\right)}$

Page 27

27

LDA-Based Document Models for Ad-hoc Retrieval

• Experimental setup

4.2 Parameters. There are several parameters that need to be determined in our experiments. For the retrieval experiments, the proportion of the LDA part in the linear combination must be specified (λ in Eq. (6)). For the LDA estimation, the number of topics must be specified; the number of iterations and the number of Markov chains also need to be carefully tuned due to their influence on performance and running time. We use the AP collection as our training collection to estimate the parameters. The WSJ, FT, SJMN, and LA collections are used for testing whether the parameters optimized on AP can be used consistently on other collections. At the current stage of our work, the parameters are selected through exhaustive search or manual hill-climbing search. All parameter values are tuned based on average precision since retrieval is our final task. The parameter selection process, including the training set selection, also follows Liu and Croft (2004) to make the results comparable. Mean average precision is used as the basis of evaluation throughout this study.

We use symmetric Dirichlet priors in the LDA estimation with α = 50/K and β = 0.01, which are common settings in the literature. Our experience shows that retrieval results are not very sensitive to the values of these parameters.

4.2.1 LDA Estimation. Document models consisting of mixtures of topics, like pLSI and LDA, have previously been tested mostly on small collections due to their relatively long running time. Section 3.3 shows that the iteration number in LDA estimation plays an important role in its complexity. Generally, more iterations mean that the Markov chain reaches equilibrium with higher probability, and after a certain number of iterations (the burn-in period) the invariant distribution of the Markov chain is equivalent to the true distribution. So it would be ideal if we could take samples right after the Markov chain reaches equilibrium. However, in practice, convergence detection of Markov chains is still an open research question; that is, no realistic method can be applied on the large IR collections to determine the convergence of the chain. Researchers in the area of topic modeling tend to use a large number of iterations to guarantee convergence. However, in IR tasks it is almost impossible to run a very large number of iterations due to the size of the data set. Besides, a finely tuned topic model does not necessarily mean good retrieval performance; instead, a less accurate distribution of topics may be good enough for IR purposes. Furthermore, we have λ and µ in our model to adjust the influence of the LDA model. For example, if the LDA estimation is coarse, we may reduce the smoothing weight and let the LDA estimation share a part of the smoothing.

Figure 2. Retrieval results (in average precision) on AP with different numbers of iterations (10-190). K=400; λ=0.7; 1 Markov chain. (Average precision ranges roughly from 0.22 to 0.27.)

Figure 3. Retrieval results (in average precision) on AP with different numbers of Markov chains (1-10). K=400; λ=0.7; 30 iterations.

In order to get a good iteration number that is effective for IR applications, we use the AP collection for training, maximizing the average precision score as the optimization criterion since it is our final evaluation metric. We try different iteration numbers, and also run experiments with different numbers of Markov chains, each of which is initialized with a

Table 1. Statistics of data sets

Collection  Contents                            # of docs  Size    Queries                                    # of Queries with Relevant Docs
AP          Associated Press newswire 1988-90   242,918    0.73Gb  TREC topics 51-150 (title only)            99
FT          Financial Times 1991-94             210,158    0.56Gb  TREC topics 301-400 (title only)           95
SJMN        San Jose Mercury News 1991          90,257     0.29Gb  TREC topics 51-150 (title only)            94
LA          LA Times                            131,896    0.48Gb  TREC topics 301-400 (title only)           98
WSJ         Wall Street Journal 1987-92         173,252    0.51Gb  TREC topics 51-100 & 151-200 (title only)  100

Page 28

28

LDA-Based Document Models for Ad-hoc Retrieval

different random number, to see how many chains are needed for our purposes. The results are presented in Figure 2 and Figure 3, respectively. After 50 iterations and with more than 3 Markov chains, performance is quite stable, so we use these values in the final retrieval experiments. The running time of each iteration with large topic numbers can be expensive; 30 iterations and 2 chains are a good trade-off between accuracy and running time, and these values are used in the parameter-selecting experiments, especially when selecting a suitable number of topics.

Selecting the right number of topics is also an important problem in topic modeling. Nonparametric models like the Chinese Restaurant Process (Blei et al, 2004; Teh et al, 2004) are not practical to use for large data sets to automatically decide the number of topics. A range of 50 to 300 topics is typically used in the topic modeling literature. 50 topics are often used for small collections and 300 for relatively large collections, which are still much smaller than the IR collections we use. It is well known that larger data sets may need more topics in general, and it is confirmed here by our experiments with different values of K (100, 200, …) on the AP collection. K=800 gives the best average precision, as shown in Table 2. This number is much less than the corresponding optimal K value (2000) in the cluster model (Liu and Croft, 2004). As we explained in Section 3.3, in the cluster model, one document can be based on one topic, and in the LDA model, the mixture of topics for each document is more powerful and expressive; thus a smaller number of topics is used. Empirically, even with more parsimonious parameter settings like K=400, 30 iterations, 2 Markov chains, statistically significant improvements can also be achieved on most of the collections.

Table 2. Retrieval results (in average precision) on AP with different numbers of topics (K).

K                  100     200     300     400     500     600     700     800     900     1000    1500
Average precision  0.2431  0.2520  0.2579  0.2590  0.2557  0.2578  0.2609  0.2621  0.2613  0.2585  0.2579

4.2.2 Parameters in Retrieval Model. In order to select a suitable value of λ, we use a similar procedure as above on the AP collection and find 0.7 to be the best value in our search. From the experiments on the testing collections, we also find that λ = 0.7 is the best value, or almost the best value, for the other collections.

We set the Dirichlet prior µ = 1000 since the best results are consistently obtained with this setting. The value of µ needs to be adjusted when the other combination methods discussed in Section 3.2 are applied.

4.3 Experimental Results In all experiments, both the queries and documents are stemmed, and stopwords are removed.

4.3.1 Retrieval Experiments The retrieval results on the AP collection are presented in Table 3, with comparisons to the result of query likelihood retrieval (QL) and cluster-based retrieval (CBDM). Statistically significant

improvements of LDA-based retrieval (LBDM) over both QL and CBDM are observed at many recall levels, with 21.64% and 13.97% improvement in average precision respectively.

Table 3. Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM) and retrieval with the LDA-based document models (LBDM). The evaluation measure is average precision. AP data set. Stars indicate statistically significant differences in performance with 95% confidence according to the Wilcoxon test.

Recall      QL      CBDM    LBDM    %chg over QL  %chg over CBDM
Rel.        21819   21819   21819
Rel. Retr.  10130   10751   12064   +19.09*       +12.21*
0.00        0.6422  0.6485  0.6795  +5.8*         +4.8*
0.10        0.4339  0.4517  0.4844  +11.6*        +7.2*
0.20        0.3477  0.3713  0.4131  +18.8*        +11.2*
0.30        0.2977  0.3170  0.3661  +23.0*        +15.5*
0.40        0.2454  0.2668  0.3110  +26.8*        +16.6*
0.50        0.2081  0.2274  0.2666  +28.1*        +17.2*
0.60        0.1696  0.1794  0.2245  +32.4*        +25.1*
0.70        0.1298  0.1444  0.1665  +28.3*        +15.3*
0.80        0.0865  0.1002  0.1180  +36.5*        +17.8*
0.90        0.0480  0.0571  0.0694  +44.7         +21.6
1.00        0.0220  0.0201  0.0187  -15.1         -6.8
Avg         0.2179  0.2326  0.2651  +21.64*       +13.97*

With the parameter settings λ = 0.7, 50 iterations and 3 Markov chains, we run experiments on the other collections and present results in Table 4. We compare the results with CBDM, and the results of the query likelihood model are also listed as a reference. On all five collections, LDA-based retrieval achieves improvements over both query likelihood retrieval and cluster-model based retrieval, and four of the improvements are significant (over CBDM). Considering that CBDM has already obtained significant improvements over the query likelihood model (and Okapi-style weighting, see Liu and Croft) on all of these collections, and is therefore a high baseline, the significant performance improvements from LBDM are very encouraging.

Unlike the basic document representation, the LDA-based document model is not limited to only the literal words in a document, but instead describes a document with many other related highly probable words from the topics of this document. For example, for the query “buyout leverage”, the document “AP900403-0219”, which talks about “Farley Unit Defaults On Pepperell Buyout Loan”, is a relevant document. However, this document focuses on the “buyout” part, and does not contain the exact query term “leverage”, which makes this document rank very low. Using the LDA-based representation, this document is closely related to two topics that have strong connections with the term “leverage”: one is the economic topic that is strongly associated with this document because the document contains many representative terms of this topic, such as “million”, “company”, and “bankruptcy”; the other is the money market topic which is closely connected to “bond”, also a very frequent word in this document. In this way, the document is ranked higher with the LDA-based document model. Having multiple topics represent a document tends to give a clearer association

Page 29

29

Topic models for query expansion

• Method
• Experiments
• Conclusion


Page 31

31

Topic models for query expansion

• Use topic models to calculate a multinomial distribution $\theta = p(w \mid q)$ for a given query q = {q_1, q_2, …, q_k}
• Following the relevance model (RM) method, investigate whether topics can be used for query expansion:

$p_{TM}(w \mid q) = \sum_{t_i} p_{TM}(w \mid t_i)\, p_{TM}(t_i \mid q)$

Page 32

32

Topic models for query expansion

• Combine topic models with the RM query expansion method

$p(w \mid q) = \sum_{D_i \in C} \left( \lambda\, p_{RM}(w \mid D_i) + (1-\lambda)\, p_{TM}(w \mid D_i, q) \right) p(D_i \mid q)$

$p_{TM}(w \mid D_i, q) = \sum_{t_m} p(w \mid t_m)\, p(t_m \mid D_i, q) \propto \sum_{t_m} p(w \mid t_m)\, p(t_m \mid D_i)\, p(q \mid t_m)$
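A small illustration of ranking candidate expansion terms by p_TM(w|q); the vocabulary, topic matrices, and query mixture below are hypothetical stand-ins for a trained topic model, not values from the slides:

```python
import numpy as np

VOCAB = ["train", "snow", "delay", "ticket", "station"]
p_w_given_t = np.array([                 # p(w|t): 2 topics over 5 words
    [0.40, 0.30, 0.20, 0.05, 0.05],      # weather/disruption topic
    [0.10, 0.00, 0.10, 0.40, 0.40],      # travel/ticketing topic
])
p_t_given_q = np.array([0.8, 0.2])       # p(t|q) for a disruption-heavy query

# p_TM(w|q) = sum_t p(w|t) p(t|q), then take the top-scoring words
p_w_given_q = p_t_given_q @ p_w_given_t
expansion = [VOCAB[i] for i in np.argsort(-p_w_given_q)[:3]]
```

The highest-probability words under the query's topic mixture become the expansion terms added to the original query.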

Page 33

33

Topic models for query expansion

• Method
• Experiments
• Conclusion

Page 34

34

36 X. Yi and J. Allan

Table 2. Retrieval performance with TREC topics 301-400 (title-only) on one testing corpus (FT) by using different topic models for query expansion and for document smoothing. There are overall 3233 relevant documents. Bold font highlights the best result in each column. Parameters tuned on the training corpus for using typical topic models on the top feedback documents are not well generalized to this FT testing corpus: Q-CBQE, Q-LBQE, MFB perform worse than the QL baseline.

Method   Rel.Retr.  0.00    0.10    0.20    0.40    0.60    P@5     P@10    P@100   MAP
QL       1879       0.6142  0.4615  0.3987  0.2989  0.2136  0.3747  0.3242  0.1117  0.2614
CBDM     2092       0.6057  0.4766  0.4106  0.3042  0.2234  0.3768  0.3221  0.1144  0.2738
BT-LBDM  2074       0.6082  0.4821  0.4068  0.3062  0.2153  0.3705  0.3211  0.1142  0.2681
BT-PBDM  2034       0.6147  0.4747  0.4127  0.2952  0.2187  0.3789  0.3126  0.1144  0.2675
RMDE-1   1946       0.6067  0.4832  0.4329  0.3400  0.2344  0.3726  0.3284  0.1178  0.2836
MBDM     2099       0.5983  0.4810  0.4076  0.3058  0.2169  0.3705  0.3200  0.1137  0.2718
LBDM     2216       0.6338  0.4899  0.4072  0.3213  0.2329  0.3705  0.3147  0.1227  0.2787
PBDM     2226       0.6341  0.4993  0.4201  0.3207  0.2382  0.3958  0.3200  0.1229  0.2823
RMDE     2134       0.6320  0.4914  0.4294  0.3212  0.2334  0.3726  0.3221  0.1207  0.2811
CBQE     2016       0.6067  0.4681  0.4005  0.3210  0.2063  0.3537  0.3053  0.1166  0.2634
LBQE     2007       0.6198  0.4770  0.4034  0.3092  0.2114  0.3663  0.3168  0.1179  0.2663
PBQE     1981       0.6203  0.4648  0.3917  0.2983  0.2090  0.3642  0.3074  0.1159  0.2607
Q-CBQE   2151       0.5638  0.4517  0.3719  0.3015  0.2186  0.3516  0.3032  0.1254  0.2544
Q-LBQE   2028       0.5856  0.4671  0.3709  0.2928  0.2085  0.3411  0.2863  0.1193  0.2541
MFB      2283       0.5351  0.4303  0.3658  0.2793  0.2112  0.3347  0.2979  0.1254  0.2469
LDA-RM   2266       0.6113  0.4874  0.4295  0.3432  0.2576  0.3663  0.3263  0.1276  0.2947
RM       2313       0.6103  0.4844  0.4326  0.3592  0.2626  0.3768  0.3389  0.1295  0.3006

(Columns 0.00-0.60 are interpolated recall-precision.)

4.1 Results and Analysis

Table 2 shows the best retrieval results on one of the four testing corpora (FT). Our results of CBDM and LBDM are only slightly different from earlier results [7,12] due to small differences in the implementations. Table 3 further shows the pair-wise significance test results of the MAP differences between some well-performed methods and other methods on the FT corpus. MAP results on the other testing corpora (WSJ, SJMN and LA) and the tuning corpus (AP) are shown in Table 4.

We have the following observations: (1) Using topic models for document smoothing can improve IR performance over the typical smoothing technique; complicated topic models like LDA and PAM have some benefits: LBDM and PBDM achieve higher MAPs than CBDM on every corpus. (2) The document expansion approach RMDE, which borrows the idea from RM to do document smoothing and does not actually identify topics in the collection, usually performs better than CBDM, and sometimes similar to LBDM. (3) LBDM usually performs better than PBDM although PAM is more powerful for topic representation; thus, for retrieval, more complicated topic models may not bring further improvement. (4) Topic models trained with the whole corpus are too coarse-grained to be useful for query expansion. (5) Topic models trained with the query-dependent feedback documents can perform extremely well on the training corpus; however, they are sensitive to the tuned parameters and not always well generalized

Page 35

35

Topic models for query expansion

• Method
• Experiments
• Conclusion

Page 36

36

Topic models for query expansion

• Topics discovered in the query-related feedback documents can help retrieval
• Topic-based methods using these query-related topics for retrieval are sensitive to parameters
• The method that combines RM successfully improves some queries' results

Page 37

37

Outline

• Review of topic models
• Topic models for IR
• Other applications of topic models
• Conclusion & outlook

Page 38

38

Topic models in social text streams

• Dynamic topic modeling on social streams
• User behavior modeling on Twitter
• Topic models for entity linking

Page 39

39

Topic modeling in social text streams

• Dynamic topic modeling on social streams
• User behavior modeling on Twitter
• Topic models for entity linking

Page 40

40

Dynamic topic modeling

• Input data: a stream of document batches X_1, X_2, …, X_i arriving at time periods t_1, t_2, …, t_i
• Output: topic distribution p(z|t) at each time period t
• Social text streams exhibit the concept drift phenomenon

Page 41

41

Application: Hierarchical multi-label classification on social text streams

Hierarchical multi-label classification:
• learn a hypothesis function $f : X \to \{0,1\}^C$ from training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{|D|}$ to predict a label vector y for a given input document x
• label assignments follow the T-property (labels respect the hierarchy)
• some social text streams belong to multiple hierarchical labels

Example tweets: "I really feel like Smullers", "200,000 people travel with book as ticket", "There are quite cramped trains", "I think the train will soon stop again because of snow..."

Example label hierarchy (rooted at ROOT): Communication, Personal experience, Traveler, Retail on station, Product, Personal report, Parking; with leaves such as Product Experience, Smullers, Compliment, Complaint, Incident.

Page 42

42

Hierarchical multi-label classification on social text streams

• Document expansion
• Dynamic topic modeling (DTMs, Blei et al. 2006)
• Structural learning based text classification

Hierarchical multi-label classification for short documents in social streams:
• Learn from previous time periods, and predict an output when a new document arrives
• Concept drift phenomenon

Page 43

43

Experimental setup

• Dataset
  • tweets related to a transportation company
  • from 18th January 2010 to 5th June 2012
  • 145,692 tweets posted by 77,161 Twitter users
  • annotations: 493 nodes in 13 subsets

Page 44

44

Time-aware topic extraction (1)

Top-5 topic words per time period (example topics tracked over time):
• Period 1: train, train cancel, snow fall, froze, clumsy work
• Period 2: station, winter chaos, chocomel, wheel, change
• Period 3: train schedule, winter chaos, station, hot drinks, ede-wageningen
• Period 4: netherlands, train, bomb, NS company, police
• Period 5: bomb, NS, pains, police, train

Page 45

45

Topic modeling in social text streams

• Dynamic topic modeling on social streams
• User behavior modeling on Twitter
• Topic models for entity linking

Page 46

46

Application: personalized time-aware tweets summarization

• Data preprocessing
• Tweets propagation model
• Document summarization: sentence extraction

Time-aware tweets summarization: select the most representative tweets for each time period as a summary.
Personalized time-aware tweets summarization: the summary needs to be relevant to the user's interests.

Page 47

47

Tweet Propagation Model

Key idea

Page 48

48

Tweet Propagation Model

(Diagram: the user's interest distributions θ_{u,t1} and θ_{u,t2} evolve over time, linking candidate tweets RT_{t1}, RT_{t2} with the user's own tweets.)

• Probability of the user's interests at each time period
• Probability of topics at each time period

Page 49

49

Tweet Propagation Model (1)

The topic set at time period t is partitioned into private, common, and burst topics:

Z_t = Z_t^u ∪ Z_t^com ∪ Z_t^B

• private topics z ∈ Z_t^u: the user's private interest distribution θ_{u,t}^u evolves from θ_{u,t-1}^u
• common topics z ∈ Z_t^com: θ_{u,t}^com depends on the previous distributions of the user's circle, {θ_{u_i,t-1}^com | u_i ∈ C_u}
• burst topics z ∈ Z_t^B: period-specific bursty topics

A tweet can be represented using the probabilistic distributions p(z | p, t).

Page 50

[Figure: plate diagram of the Tweet Propagation Model, combining social-aware interests p(z | u_i, t-1) of circle members weighted by λ_{u,c,t}, time-aware topic tracking p(w | z_{t-1}) with word prior β_t, and the resulting per-period user topic distribution p(z | u_t).]

Page 51

Tweet Propagation Model (2)

Gibbs EM sampling is used in the inference process: Gibbs sampling is combined with EM estimation of the parameters α_{u,t} and β_t. The circle weight λ_{u,c,t} is derived via Markov random walks:

λ_{u,c_i,t} = μ · Σ_{j≠i} sim(θ_{c_i,t}, θ_{c_j,t} | θ_{u,t}) · λ_{u,c_j,t} + (1 − μ) / |C_{u,t}|
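The random-walk update above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses plain cosine similarity between followee topic distributions, drops the conditioning on θ_{u,t}, and all function names are hypothetical.

```python
import math

def cosine(p, q):
    # Cosine similarity between two topic distributions (lists of probabilities).
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def random_walk_lambda(theta_c, mu=0.85, iters=50):
    """Iteratively estimate influence weights lambda_{u,c,t} over the
    circle members c of user u, mixing a similarity-weighted walk with
    a uniform restart, in the spirit of the slide's update rule."""
    C = len(theta_c)
    lam = [1.0 / C] * C                      # uniform initialisation
    for _ in range(iters):
        new = []
        for i in range(C):
            walk = sum(cosine(theta_c[i], theta_c[j]) * lam[j]
                       for j in range(C) if j != i)
            new.append(mu * walk + (1.0 - mu) / C)
        total = sum(new)
        lam = [x / total for x in new]       # renormalise to a distribution
    return lam
```

Circle members whose topic distributions resemble many others accumulate larger weight, much like nodes with many strong edges in PageRank-style walks.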

Page 52

Approach

Dynamic topic models based on the author topic model

Detect dynamic interests

Detect dynamic topics

Personalized time-aware tweets summarization

Novelty, coverage, diversity

Relevant to user's interests

Preprocessing & data enrichment

Page 53

Overall performance

Results are reported for two settings, 60 tweets per period and 40 tweets per period:

Metrics  TPM-A  TPM-T  TPM-S  UBM    TLDA   AT     TF-IDF  Centroid  Lex-R
R-1      0.428  0.403  0.395  0.387  0.374  0.355  0.341   0.302     0.291
R-2      0.125  0.119  0.116  0.114  0.112  0.102  0.095   0.081     0.077
R-W      0.159  0.153  0.149  0.146  0.142  0.144  0.137   0.118     0.115

Metrics  TPM-A  TPM-T  TPM-S  UBM    TLDA   AT     TF-IDF  Centroid  Lex-R
R-1      0.513  0.497  0.482  0.461  0.374  0.423  0.411   0.362     0.369
R-2      0.149  0.142  0.139  0.134  0.127  0.122  0.116   0.097     0.095
R-W      0.197  0.191  0.189  0.178  0.176  0.166  0.161   0.131     0.119
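The R-1 and R-2 rows are ROUGE-1 and ROUGE-2 recall scores. As an illustration only (the experiments would have used the official ROUGE toolkit, not this function), a minimal single-reference ROUGE-N recall can be computed as:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams covered by the
    candidate summary, with clipped counts. A simplified
    single-reference version of the metric reported in the tables."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    return overlap / sum(ref.values())
```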

Page 54

Topic modeling in social text streams

• Dynamic topic modeling on social streams
• User behavior modeling on Twitter
• Topic models for entity linking

Page 55

Author topic model

• distribution of authors over topics
• distribution of topics over words
• uniform distribution of documents over authors

Page 56

Author topic model

• For each topic k: draw a multinomial φ_k from a Dirichlet prior β
• For each author a: draw a multinomial θ_a from a Dirichlet prior α
• For each document m, for each word n:
  • Draw an author x uniformly from the document's author set a_m
  • Draw a topic z_{m,n} from θ_x
  • Draw a word w_{m,n} from the multinomial φ_{z_{m,n}}
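The generative process above can be sketched directly. This is a toy sampler under assumed inputs: φ and θ are taken as given rather than drawn from their Dirichlet priors, and the function name is hypothetical.

```python
import random

def sample_author_topic_corpus(phi, theta, doc_authors, doc_len, seed=0):
    """Generative process of the author-topic model (sketch):
    phi[k][v]   -- per-topic word distribution (in the model, drawn from Dir(beta))
    theta[a][k] -- per-author topic distribution (in the model, drawn from Dir(alpha))
    For every word: pick an author uniformly from the document's author
    set, a topic from that author's theta, then a word from phi."""
    rng = random.Random(seed)

    def draw(dist):
        # Draw an index from a discrete distribution given as a list of probs.
        r, acc = rng.random(), 0.0
        for i, p in enumerate(dist):
            acc += p
            if r < acc:
                return i
        return len(dist) - 1

    docs = []
    for authors, n in zip(doc_authors, doc_len):
        words = []
        for _ in range(n):
            x = rng.choice(authors)        # author assignment x
            z = draw(theta[x])             # topic z_{m,n} from the author's mixture
            words.append(draw(phi[z]))     # word w_{m,n} from the topic
        docs.append(words)
    return docs
```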

Page 57

Author topic model

[Figure 2: An illustration of 8 topics (Topics 31, 61, 71, 100 and 19, 24, 29, 87) from a 100-topic solution for the NIPS collection. Each topic is shown with the 10 words and authors that have the highest probability conditioned on that topic; for example, a mixture-estimation topic (likelihood, mixture, EM, density, Gaussian), a character-recognition topic (recognition, character, handwritten, digits), a reinforcement-learning topic (reinforcement, policy, action, reward; top authors Singh_S, Barto_A, Sutton_R), a kernel-methods topic (kernel, support, vector, SVM; top authors Smola_A, Scholkopf_B, Burges_C), a speech-recognition topic, a Bayesian-methods topic, a generic "model" topic, and a topic centered on Hinton_G and collaborators.]

For each topic, the top 10 most likely authors are well-known authors in terms of NIPS papers written on these topics (e.g., Singh, Barto, and Sutton in reinforcement learning). While most (order of 80 to 90%) of the 100 topics in the model are similarly specific in terms of semantic content, the remaining 2 topics we display illustrate some of the other types of "topics" discovered by the model. Topic 71 is somewhat generic, covering a broad set of terms typical to NIPS papers, with a somewhat flatter distribution over authors compared to other topics. Topic 100 is somewhat oriented towards Geoff Hinton's group at the University of Toronto, containing the words that commonly appeared in NIPS papers authored by members of that research group, with an author list largely consisting of Hinton plus his past students and postdocs.

Figure 3 shows similar types of results for 4 selected topics from the CiteSeer data set, where again topics on speech recognition and Bayesian learning show up. However, since CiteSeer is much broader in content (covering computer science in general) compared to NIPS, it also includes a large number of topics not

[Figure 3: An illustration of 4 topics (Topics 10, 209, 87, 20) from a 300-topic solution for the CiteSeer collection. Each topic is shown with the 10 words and authors that have the highest probability conditioned on that topic: speech recognition (speech, recognition, word, speaker; top authors Waibel_A, Gauvain_J, Lamel_L), probabilistic/Bayesian methods (probabilistic, Bayesian, Monte Carlo; top authors Friedman_N, Heckerman_D), user interfaces (user, interface, interactive; top author Shneiderman_B), and solar astrophysics (stars, solar, magnetic; top author Linsky_J).]

seen in NIPS, from user interfaces to solar astrophysics (Figure 3). Again the author lists are quite sensible: for example, Ben Shneiderman is a widely-known senior figure in the area of user interfaces.

For the NIPS data set, 2000 iterations of the Gibbs sampler took 12 hours of wall-clock time on a standard PC workstation (22 seconds per iteration). CiteSeer took 111 hours for 700 iterations (9.5 minutes per iteration). The full list of tables can be found at http://www.datalab.uci.edu/author-topic, for both the 100-topic NIPS model and the 300-topic CiteSeer model. In addition there is an online JAVA browser for interactively exploring authors, topics, and documents.

The results above use a single sample from the Gibbs sampler. Across different samples each sample can contain somewhat different topics, i.e., somewhat different sets of most probable words and authors given the topic, since according to the author-topic model there is not a single set of conditional probabilities, θ and φ, but rather a distribution over these conditional probabilities. In the experiments in the sections below, we average over multiple samples (restricted to 10 for computational convenience) in a Bayesian fashion for predictive purposes.

4.2 Evaluating predictive power

In addition to the qualitative evaluation of topic-author and topic-word results shown above, we also evaluated the proposed author-topic model in terms of perplexity, i.e., its ability to predict words on new unseen documents. We divided the D = 1,740 NIPS papers into a training set of 1,557 papers with a total of 2,057,729 words, and a test set of 183 papers of
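Perplexity is the exponentiated negative average per-word log-likelihood on held-out documents: perplexity = exp(-Σ_d log p(w_d) / Σ_d N_d). A generic sketch follows; the `word_prob` callback stands in for the model's predictive probability and is an assumption, not from the paper.

```python
import math

def perplexity(docs, word_prob):
    """Perplexity of held-out documents: exp of the negative average
    per-word log-likelihood. `word_prob(doc, i)` must return the model's
    probability of the i-th word of `doc` (model-specific callback)."""
    log_lik, n_words = 0.0, 0
    for doc in docs:
        for i in range(len(doc)):
            log_lik += math.log(word_prob(doc, i))
            n_words += 1
    return math.exp(-log_lik / n_words)
```

A model that assigns uniform probability 1/V to every word has perplexity exactly V, which is the usual sanity check for this computation.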

Page 58

Extension: entity-topic model for entity linking

• Extended from the Author-topic model
• Entity linking task

Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 105–115, Jeju Island, Korea, 12–14 July 2012. ©2012 Association for Computational Linguistics

An Entity-Topic Model for Entity Linking

Xianpei Han Le Sun Institute of Software, Chinese Academy of Sciences

HaiDian District, Beijing, China. {xianpei, sunle}@nfs.iscas.ac.cn

Abstract

Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of mention’s context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of document’s topic coherence, assuming that “a mention’s referent entity should be coherent with the document’s main topics”. In this paper, we propose a generative model – called entity-topic model, to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model.

1 Introduction

Entity Linking (EL) has received considerable research attention in recent years (McNamee & Dang, 2009; Ji et al., 2010). Given many name mentions in a document, the goal of EL is to predict their referent entities in a given knowledge base (KB), such as Wikipedia (www.wikipedia.org). For example, as

shown in Figure 1, an EL system should identify that the referent entities of the three mentions WWDC, Apple and Lion are, correspondingly, the entities Apple Worldwide Developers Conference, Apple Inc. and Mac OS X Lion in the KB. The EL problem appears in many different guises throughout the areas of natural language processing, information retrieval and text mining. For instance, in many applications we need to collect all appearances of a specific entity in different documents; EL is an effective way to resolve such an information integration problem. Furthermore, EL can bridge the mentions in documents with the semantic information in knowledge bases (e.g., Wikipedia and Freebase [2]), and thus can provide a solid foundation for knowledge-rich methods.

Figure 1. A Demo of Entity Linking

Unfortunately, the accurate EL is often hindered by the name ambiguity problem, i.e., a name may refer to different entities in different contexts. For example, the name Apple may refer to more than 20 entities in Wikipedia, such as Apple Inc., Apple (band) and Apple Bank. Traditionally, there have been two distinct directions in EL to resolve the name ambiguity problem: one focusing on the effects of mention’s context compatibility and the other dealing with the effects of document’s topic coherence. EL methods based on context

[2] www.freebase.com

[Figure 1 shows the document "At the WWDC conference, Apple introduces its new operating system release - Lion." linked to knowledge-base entries including Apple Worldwide Developers Conference, Apple Inc., Mac OS X Lion, Steve Jobs, iPhone, and California.]

Page 59

Entity-topic model

For each entity e, our model samples its context word distribution ξ_e from a V-dimensional Dirichlet distribution with hyperparameter δ.

Finally, the full entity-topic model is shown in Figure 3 using the plate representation.


Figure 3. The plate representation of the entity-topic model

2.3 The Probability of a Corpus

Using the entity-topic model, the probability of generating a corpus D = {d1, d2, ..., dD} given hyperparameters α, β, γ and δ can be expressed as:

P(D; α, β, γ, δ) = ∏_d P(m_d, w_d; α, β, γ, δ)
  = ∏_d Σ_{e_d} P(e_d | α, β) P(m_d | e_d, γ) P(w_d | e_d, δ)
  = ∫_φ P(φ | β) ∫_ψ P(ψ | γ) ∏_d Σ_{e_d} P(m_d | e_d, ψ)
    × ∫_ξ P(ξ | δ) Σ_{a_d} P(a_d | e_d) P(w_d | a_d, ξ)
    × ∫_θ P(θ | α) P(e_d | θ, φ) dθ dξ dψ dφ     (2.1)

where m_d and e_d are, correspondingly, the set of mentions and their entity assignments in document d, and w_d and a_d are, correspondingly, the set of words and their entity assignments in document d.

3 Inference using Gibbs Sampling

In this section, we describe how to resolve the entity linking problem using the entity-topic model. Overall, there are two inference tasks for EL:

1) The prediction task. Given a document d, predicting its entity assignments (e_d for mentions and a_d for words) and topic assignments (z_d). Notice that here the EL decisions are just the prediction of per-mention entity assignments (e_d).

2) The knowledge discovery task. Given a corpus D = {d1, d2, ..., dD}, estimating the global knowledge (including the entity distribution of topics φ, the name distribution ψ and the context word distribution ξ of entities) from data.

Unfortunately, due to the heavy correlation between topics, entities, mentions and words (the correlation is also demonstrated in Eq. (2.1), where the integral is intractable due to the coupling between θ, φ, ψ and ξ), the accurate inference of the above two tasks is intractable. For this reason, we propose an approximate inference algorithm, a Gibbs sampling algorithm for the entity-topic model, obtained by extending the well-known Gibbs sampling algorithm for LDA (Griffiths & Steyvers, 2004). In Gibbs sampling, we first construct the posterior distribution P(z, e, a | D); this posterior distribution is then used to: 1) estimate θ, φ, ψ and ξ; and 2) predict the entities and the topics of all documents in D. Specifically, we first derive the joint posterior distribution from Eq. (2.1) as:

P(z, e, a | D) ∝ P(z) P(e | z) P(m | e) P(a | e) P(w | a)

where

P(z) = (Γ(Tα) / Γ(α)^T)^D ∏_{d=1}^{D} [∏_t Γ(α + C^{DT}_{dt})] / Γ(Tα + C^{DT}_{d*})     (3.1)
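Eq. (3.1) is a product of Dirichlet-multinomial terms over the document-topic counts, so it is most conveniently evaluated in log space. A sketch follows, with C^{DT} passed as a plain count matrix and the function name hypothetical:

```python
import math

def log_p_z(C_dt, alpha):
    """log P(z) from Eq. (3.1). C_dt[d][t] counts how many mentions in
    document d are assigned topic t (the role of C^{DT}_{dt}); alpha is
    the symmetric Dirichlet hyperparameter; C^{DT}_{d*} is the row sum."""
    D, T = len(C_dt), len(C_dt[0])
    # Normalising constant (Gamma(T*alpha) / Gamma(alpha)^T)^D in log space.
    lp = D * (math.lgamma(T * alpha) - T * math.lgamma(alpha))
    for row in C_dt:
        lp += sum(math.lgamma(alpha + c) for c in row)   # prod_t Gamma(alpha + C_dt)
        lp -= math.lgamma(T * alpha + sum(row))          # Gamma(T*alpha + C_d*)
    return lp
```

For a single document with two topics, alpha = 1 and counts [2, 0], this evaluates to log(1/3), matching the sequential Polya-urn probability (1/2)·(2/3).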

is the probability of the joint topic assignment z to all mentions m in corpus D, and

P(e | z) = (Γ(Eβ) / Γ(β)^E)^T ∏_{t=1}^{T} [∏_e Γ(β + C^{TE}_{te})] / Γ(Eβ + C^{TE}_{t*})     (3.2)

is the conditional probability of the joint entity assignments e to all mentions m in corpus D given all topic assignments z, and

P(m | e) = (Γ(Kγ) / Γ(γ)^K)^E ∏_{e=1}^{E} [∏_m Γ(γ + C^{EM}_{em})] / Γ(Kγ + C^{EM}_{e*})     (3.3)

is the conditional probability of all mentions m given all per-mention entity assignments e, and

P(a | e) = ∏_{d=1}^{D} ∏_{e ∈ e_d} (C^{DE}_{de} / C^{DE}_{d*})^{C^{DA}_{de}}     (3.4)

is the conditional probability of the joint entity assignments a to all words w in corpus D given all per-mention entity assignments e.

Figure 2. The document generative process, with Dir(·), Mult(·) and Unif(·) denoting, correspondingly, the Dirichlet, Multinomial and Uniform distributions

Given the entity list E = {e1, e2, ..., eE} in the knowledge base, the word list V = {w1, w2, ..., wV}, the entity name list K = {n1, n2, ..., nK} and the global knowledge described above, the generation process of a document collection (corpus) D = {d1, d2, ..., dD} is shown in Figure 2. To demonstrate the generation process, we also show how the document in Figure 1 can be generated using our model in the following steps:

Step 1: The model generates the topic distribution of the document as θ_d = {Apple Inc. 0.45, Operating System (OS) 0.55};

Step 2: For the three mentions in the document:
i. According to the topic distribution θ_d, the model generates their topic assignments as z1 = Apple Inc., z2 = Apple Inc., z3 = OS;
ii. According to the topic knowledge φ_{Apple Inc.}, φ_{OS} and the topic assignments z1, z2, z3, the model generates their entity assignments as e1 = Apple Worldwide Developers Conference, e2 = Apple Inc., e3 = Mac OS X Lion;
iii. According to the name knowledge of the entities Apple Worldwide Developers Conference, Apple Inc. and Mac OS X Lion, our model generates the three mentions as m1 = WWDC, m2 = Apple, m3 = Lion;

Step 3: For all words in the document:
i. According to the referent entity set in the document, e_d = {Apple Worldwide Developers Conference, Apple Inc., Mac OS X Lion}, the model generates the target entities they describe, e.g. a3 = Apple Worldwide Developers Conference and a4 = Apple Inc.;
ii. According to their target entity and the context knowledge of these entities, the model generates the context words in the document. For example, according to the context knowledge of the entity Apple Worldwide Developers Conference, the model generates its context word w3 = conference, and according to the context knowledge of the entity Apple Inc., the model generates its context word w4 = introduces.

Through the above generative process, we can see that all entities in a document are extracted from the document’s underlying topics, ensuring the topic coherence; and all words in a document are extracted from the context word distributions of its referent entities, resulting in the context compatibility. Furthermore, the generation of topics, entities, mentions and words are highly correlated, thus our model can capture the correlation between the topic coherence and the context compatibility.

2.2 Global Knowledge Generative Process

The entity-topic model relies on three types of global knowledge (including the topic knowledge, the entity name knowledge and the entity context knowledge) to generate a document. Unfortunately, all three types of global knowledge are unknown and thus need to be estimated from data. In this paper we estimate the global knowledge through Bayesian inference by also incorporating the knowledge generation process into our model.

Specifically, given the topic number T, the entity number E, the name number K and the word number V, the entity-topic model generates the global knowledge as follows:

1) φ | β ~ Dir(β)

For each topic z, our model samples its entity distribution φ_z from an E-dimensional Dirichlet distribution with hyperparameter β.

2) ψ | γ ~ Dir(γ)

For each entity e, our model samples its name distribution ψ_e from a K-dimensional Dirichlet distribution with hyperparameter γ.

3) ξ | δ ~ Dir(δ)

For each entity e, our model samples its context word distribution ξ_e from a V-dimensional Dirichlet distribution with hyperparameter δ.

Given the topic knowledge φ, the entity name knowledge ψ and the entity context knowledge ξ:
1. For each doc d in D, sample its topic distribution θ_d ~ Dir(α);
2. For each of the M_d mentions m_i in doc d:
   a) Sample a topic assignment z_i ~ Mult(θ_d);
   b) Sample an entity assignment e_i ~ Mult(φ_{z_i});
   c) Sample a mention m_i ~ Mult(ψ_{e_i});
3. For each of the N_d words w_i in doc d:
   a) Sample a target entity it describes from d's referent entities, a_i ~ Unif(e_{m_1}, e_{m_2}, ..., e_{m_{M_d}});
   b) Sample a describing word using a_i's context word distribution, w_i ~ Mult(ξ_{a_i}).
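The document generative process above can be sketched as follows. This is a toy generator under assumed inputs: the global knowledge φ, ψ, ξ and the document's θ_d are passed in as ready-made distributions rather than sampled from their Dirichlet priors, and the function name is hypothetical.

```python
import random

def generate_document(theta_d, phi, psi, xi, n_mentions, n_words, seed=0):
    """Entity-topic model document generation (sketch of Figure 2):
    theta_d[t] -- document topic distribution
    phi[t][e]  -- per-topic entity distribution
    psi[e][k]  -- per-entity name distribution
    xi[e][v]   -- per-entity context-word distribution
    Returns (mentions, words) as index lists."""
    rng = random.Random(seed)

    def draw(dist):
        # Draw an index from a discrete distribution given as a list of probs.
        r, acc = rng.random(), 0.0
        for i, p in enumerate(dist):
            acc += p
            if r < acc:
                return i
        return len(dist) - 1

    mentions, entities = [], []
    for _ in range(n_mentions):
        z = draw(theta_d)                  # topic assignment z_i
        e = draw(phi[z])                   # entity assignment e_i
        entities.append(e)
        mentions.append(draw(psi[e]))      # mention name m_i
    words = []
    for _ in range(n_words):
        a = rng.choice(entities)           # target entity a_i ~ Unif(e_d)
        words.append(draw(xi[a]))          # context word w_i
    return mentions, words
```

Entities are tied to the document's topics and words are tied to the chosen entities, which is exactly the coupling of topic coherence and context compatibility the model exploits.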


Page 60

Take home notes

• Overview of topic models
• LDA for ad-hoc retrieval
• LDA for query expansion
• Topic models of social text streams

Page 61

Future work of topic models

• Non-parametric topic models
• Large-scale topic modeling
• Parallel processing to enhance efficiency
• Active learning to enhance performance
• Continuous time periods for time-aware topic modeling

Page 62

Topic Models and Its Applications
Zhaochun Ren
ISLA, University of Amsterdam
[email protected]