
Supplementary Material for Von Mises-Fisher Clustering Models

Siddharth Gopal, Carnegie Mellon University

Yiming Yang, Carnegie Mellon University


Contents

1 EM algorithm for VMF mixtures (vMFmix)
  1.1 E-step
  1.2 M-step
2 Inference for Bayesian von-Mises Fisher Mixture Model (B-vMFmix)
  2.1 Variational Inference
  2.2 Collapsed Gibbs sampling
3 Inference for Hierarchical von Mises-Fisher Mixture Model (H-vMFmix)
4 Inference for Temporal von Mises-Fisher Mixture Model (T-vMFmix)
5 Datasets
6 Clustering performance using true cluster labels
  6.1 Results on the TDT4 dataset
  6.2 Results on the TDT5 Dataset
  6.3 Results on the News20 Dataset
  6.4 Results on the CNAE Dataset
  6.5 Results on the K9 Dataset
7 Results of Bayesian vs non-Bayesian model
8 Results of Hierarchical Bayesian methods (H-vMFmix) vs other models (vMFmix, B-vMFmix)
9 Results of Temporal Bayesian methods vs other models
10 Other implementation Details
  10.1 Calculating Iν(a)
  10.2 Experimental Settings
  10.3 Computational Details


1 EM algorithm for VMF mixtures (vMFmix)

Given data $X = \{x_i\}_{i=1}^{N}$ where each $x_i \in \mathbb{R}^D$, the generative process for vMFmix with $K$ classes is defined by,

z_i \sim \mathrm{Mult}(\cdot|\pi), \quad i = 1, 2, \ldots, N

x_i \sim \mathrm{vMF}(\cdot|\mu_{z_i}, \kappa), \quad i = 1, 2, \ldots, N

To estimate the hidden variables and cluster parameters, we can develop a simple EM algorithm. In the E-step, the expected value of the hidden variables $z_i = (z_{i1}, z_{i2}, \ldots, z_{iK})$ is calculated. In the M-step, the parameters $\{\pi, \kappa, \boldsymbol{\mu} = \{\mu_1, \mu_2, \ldots, \mu_K\}\}$ are optimized to maximize the likelihood.

1.1 E-step

The expected value of $z_{ik}$ is given by,

E[z_{ik}] = \frac{\pi_k\,\mathrm{vMF}(x_i|\mu_k, \kappa)}{\sum_{j=1}^{K} \pi_j\,\mathrm{vMF}(x_i|\mu_j, \kappa)}

1.2 M-step

Maximizing the likelihood w.r.t the cluster parameters leads to the following estimates,

\pi_k = \frac{1}{N}\sum_{i=1}^{N} E[z_{ik}]

R_k = \sum_{i=1}^{N} E[z_{ik}]\, x_i, \qquad \mu_k = \frac{R_k}{\|R_k\|}

Using the procedure given in [2], the approximate estimate for $\kappa$ is given by,

r = \frac{\sum_{k=1}^{K} \|R_k\|}{N}, \qquad \kappa = \frac{rD - r^3}{1 - r^2}

For a more detailed derivation, see [2].
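To make the E- and M-steps above concrete, the following is a minimal NumPy/SciPy sketch of the EM loop for vMFmix (our own illustration, not the authors' code). It assumes the rows of X are L2-normalized and uses scipy.special.ive to evaluate log C_D(kappa) stably; the function names are ours.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function of the first kind

def log_cd(kappa, D):
    """log of the vMF normalization constant C_D(kappa); log I_nu(k) = log(ive(nu, k)) + k."""
    nu = D / 2.0 - 1.0
    return nu * np.log(kappa) - (D / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def vmf_mix_em(X, K, n_iter=50, seed=0):
    """EM for a vMF mixture with a single shared concentration kappa.
    X: (N, D) array with L2-normalized rows."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    R = rng.dirichlet(np.ones(K), size=N)          # random responsibilities to start, (N, K)
    kappa = 10.0
    for _ in range(n_iter):
        # M-step
        pi = R.sum(axis=0) / N                     # mixing weights
        Rk = R.T @ X                               # (K, D) weighted sums
        mu = Rk / np.linalg.norm(Rk, axis=1, keepdims=True)
        r = np.linalg.norm(Rk, axis=1).sum() / N
        kappa = (r * D - r ** 3) / (1 - r ** 2)    # approximation from [2]
        # E-step: log of pi_k * vMF(x_i | mu_k, kappa), normalized per row
        logp = np.log(pi) + log_cd(kappa, D) + kappa * (X @ mu.T)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
    return pi, mu, kappa, R
```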


2 Inference for Bayesian von-Mises Fisher Mixture Model (B-vMFmix)

We outline the procedures for variational inference as well as collapsed Gibbs sampling for posterior inference in the Bayesian von Mises-Fisher model. Given data $X = \{x_i\}_{i=1}^{N}$ where each $x_i \in \mathbb{R}^D$, B-vMFmix assumes the following generative model with $K$ classes:

\pi \sim \mathrm{Dirichlet}(\cdot|\alpha)

\mu_k \sim \mathrm{vMF}(\cdot|\mu_0, C_0), \quad k = 1, 2, \ldots, K

\kappa_k \sim \mathrm{logNormal}(\cdot|m, \sigma^2), \quad k = 1, 2, \ldots, K

z_i \sim \mathrm{Mult}(\cdot|\pi), \quad i = 1, 2, \ldots, N

x_i \sim \mathrm{vMF}(\cdot|\mu_{z_i}, \kappa_{z_i}), \quad i = 1, 2, \ldots, N

In the model, the user-specified prior parameters are $\{\alpha, \mu_0, C_0, m, \sigma^2\}$. We learn these prior parameters using empirical Bayes and do not rely on the user to set any prior parameters.

2.1 Variational Inference

We assume the following factored form for the approximate posterior,

q(\pi) \equiv \mathrm{Dirichlet}(\cdot|\rho)

q(\mu_k) \equiv \mathrm{vMF}(\cdot|\psi_k, \gamma_k)

q(z_i) \equiv \mathrm{Mult}(\cdot|\lambda_i)

q(\kappa_k) \equiv \text{distribution discussed later}

The likelihood of the data is given by

P(D) = P(X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}, \pi \mid m, \sigma^2, \mu_0, C_0, \alpha) = P(\pi|\alpha) \prod_{i=1}^{N} \mathrm{Mult}(z_i|\pi)\,\mathrm{vMF}(x_i|\mu_{z_i}, \kappa_{z_i}) \prod_{k=1}^{K} \mathrm{vMF}(\mu_k|\mu_0, C_0)\,\mathrm{logNormal}(\kappa_k|m, \sigma^2)

\log P(D) = \log \mathrm{Dirichlet}(\pi|\alpha) + \sum_{i=1}^{N} \log \mathrm{Mult}(z_i|\pi) + \sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik} \log \mathrm{vMF}(x_i|\mu_k, \kappa_k) + \sum_{k=1}^{K} \log \mathrm{vMF}(\mu_k|\mu_0, C_0) + \sum_{k=1}^{K} \log\big(\mathrm{logNormal}(\kappa_k|m, \sigma^2)\big)

\log P(D) = -\log B(\alpha) + \sum_{k=1}^{K} (\alpha - 1)\log(\pi_k) + \sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\log \pi_k + \sum_{i=1}^{N}\sum_{k=1}^{K} z_{ik}\big(\log C_D(\kappa_k) + \kappa_k\, x_i^\top \mu_k\big) + \sum_{k=1}^{K}\big(\log C_D(C_0) + C_0\, \mu_k^\top \mu_0\big) + \sum_{k=1}^{K}\Big(-\log(\kappa_k) - \tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{(\log(\kappa_k) - m)^2}{2\sigma^2}\Big)


where $C_D(x)$ is the normalization constant of the vMF distribution with concentration parameter $x$ and dimensionality $D$. Specifically,

\log C_D(\kappa_k) = d' \log(\kappa_k) - (d' + 1)\log(2\pi) - \log I_{d'}(\kappa_k), \qquad \text{where } d' = \frac{D}{2} - 1

The variational lower bound is given by

\mathrm{VLB} = E_q[\log P(D)] - H(q)

The posterior expectations are given by,

E_q[z_{ik}] = \lambda_{ik}

E_q[\log(\pi_k)] = \Psi(\rho_k) - \Psi\Big(\sum_j \rho_j\Big)

E_q[\mu_k] = \frac{I_{D/2}(\gamma_k)}{I_{D/2 - 1}(\gamma_k)}\,\psi_k

Optimizing the VLB w.r.t. $\rho$,

\rho_k = \alpha + \sum_{i=1}^{N} \lambda_{ik}

Optimizing the VLB w.r.t. $Z$,

\lambda_{ik} \propto \exp\big(E_q[\log \mathrm{vMF}(x_i|\mu_k, \kappa_k)] + E_q[\log(\pi_k)]\big)

where

E_q[\log \mathrm{vMF}(x_i|\mu_k, \kappa_k)] = E_q[\log C_D(\kappa_k)] + E_q[\kappa_k]\, x_i^\top E_q[\mu_k]

Optimizing the VLB w.r.t. the cluster means $\mu$,

\gamma_k = \left\| E_q[\kappa_k]\left(\sum_{i=1}^{N} E_q[z_{ik}]\, x_i\right) + C_0\mu_0 \right\|, \qquad \psi_k = \frac{E_q[\kappa_k]\left(\sum_{i=1}^{N} E_q[z_{ik}]\, x_i\right) + C_0\mu_0}{\gamma_k}

where $E_q[z_{ik}] = \lambda_{ik}$.

Optimizing the VLB w.r.t. $\kappa$:

1. Sampling method: In the sampling scheme, we do not assume any specific form of posterior distribution for the cluster concentration parameters ($\kappa_k$'s),

q(\kappa_k) \equiv \text{no specific form}

During variational inference we need to calculate $E_q[\log C_D(\kappa_k)]$. $C_D(\kappa_k)$ contains a modified Bessel function term whose expectation is difficult to compute for common distributions. We were not able to find any suitable posterior distribution for $\kappa_k$ that enables straightforward computation of $\log C_D(\kappa_k)$. Therefore, we rely on a partial MCMC sampling scheme to estimate the distribution of $\kappa_k$ given the rest of the posterior distributions.

NOTE: In [3] an unnormalized conjugate prior was used, but the problem is that it is not clear whether it is even a valid distribution!

The true posterior distribution of κk is given by,

P(\kappa_k|X, m, \sigma^2, \mu_0, C_0, \alpha) = \frac{P(\kappa_k, X|m, \sigma^2, \mu_0, C_0, \alpha)}{P(X|m, \sigma^2, \mu_0, C_0, \alpha)}
\propto P(\kappa_k, X|m, \sigma^2, \mu_0, C_0, \alpha)
\propto \int_Z \int_\pi \int_{\boldsymbol{\mu}} \int_{\boldsymbol{\kappa}_{-k}} P(\kappa_k, X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}_{-k}, \pi|m, \sigma^2, \mu_0, C_0, \alpha)
\approx E_q\big[P(\kappa_k, X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}_{-k}, \pi|m, \sigma^2, \mu_0, C_0, \alpha)\big]

Using Jensen's inequality, we have,

E_q\big[P(\kappa_k, X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}_{-k}|m, \sigma^2, \mu_0, C_0, \pi)\big] \geq \exp\Big(E_q\big[\log P(\kappa_k, X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}_{-k}|m, \sigma^2, \mu_0, C_0, \pi)\big]\Big)

where the inner expectation (after discarding terms which do not depend on $\kappa_k$) is,

E_q\big[\log P(\kappa_k, X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}_{-k}|m, \sigma^2, \mu_0, C_0, \pi)\big]
= \left[\sum_{i=1}^{N} \lambda_{ik}\big(\log C_D(\kappa_k) + \kappa_k\, x_i^\top E_q[\mu_k]\big)\right] - \log(\kappa_k) - \frac{(\log(\kappa_k) - m)^2}{2\sigma^2} + \text{constants}
= n_k \log C_D(\kappa_k) + \kappa_k\left(\sum_{i=1}^{N} \lambda_{ik}\, x_i^\top E_q[\mu_k]\right) - \log(\kappa_k) - \frac{(\log(\kappa_k) - m)^2}{2\sigma^2} + \text{constants}

where

n_k = \sum_{i=1}^{N} \lambda_{ik}

This implies that,

P(\kappa_k|X, m, \sigma^2, \mu_0, C_0, \pi) \propto \exp\left(n_k \log C_D(\kappa_k) + \kappa_k\left(\sum_{i=1}^{N} \lambda_{ik}\, x_i^\top E_q[\mu_k]\right)\right) \mathrm{logNormal}(\kappa_k|m, \sigma^2) \qquad (1)

Based on the above, it is easy to see that the computation becomes easier if we use $\mathrm{logNormal}(\cdot|m, \sigma^2)$ as the proposal distribution for MCMC sampling, i.e. the logNormal likelihood terms cancel.

However, we still recommend using a different proposal distribution, such as a logNormal or Normal centered around the current value of the parameter, for convergence reasons (note that the $\kappa$'s are relatively large positive values, so using a Normal proposal distribution works fine). This is because the posteriors of the $\kappa_k$'s are reasonably concentrated around a small region, and since the prior distribution is reasonably vague, simply using $\mathrm{logNormal}(\cdot|m, \sigma^2)$ as the proposal distribution increases the number of iterations needed for convergence.

Note that we can estimate $E_q[\log C_D(\kappa_k)]$ and $E_q\!\left[\frac{I_{D/2}(\kappa_k)}{I_{D/2-1}(\kappa_k)}\right]$ by drawing samples after a specific burn-in period, and use these for computing $E_q[z_{ik}]$ and $E_q[\mu_k]$. Because of the sampling, the variational bound is no longer guaranteed to be a valid lower bound. However, we observed that the posterior of the $\kappa_k$'s is highly concentrated around the mode, and therefore estimating $E_q[\log C_D(\kappa_k)]$ from samples is accurate enough for good performance.

Another alternative is to use grid search, where we approximate the probability density over a finite range of points in an interval. We then use this probability density over the interval to calculate the expectation of various quantities. In practice, this worked equally well (although a little slower).
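As an illustration, below is a minimal sketch (ours, with placeholder burn-in and step-size settings) of the partial Metropolis-Hastings update for a single $\kappa_k$ under the unnormalized density in (1), using a log-normal random-walk proposal around the current value.

```python
import numpy as np
from scipy.special import ive

def log_cd(kappa, D):
    """log C_D(kappa), the vMF normalization constant, computed stably via ive."""
    nu = D / 2.0 - 1.0
    return nu * np.log(kappa) - (D / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def log_target(kappa, n_k, s_k, m, sigma2, D):
    """Unnormalized log posterior of kappa_k from (1):
    n_k = sum_i lambda_ik,  s_k = sum_i lambda_ik * x_i^T E_q[mu_k]."""
    log_prior = -np.log(kappa) - (np.log(kappa) - m) ** 2 / (2 * sigma2)
    return n_k * log_cd(kappa, D) + kappa * s_k + log_prior

def sample_kappa(kappa0, n_k, s_k, m, sigma2, D, n_samples=200, burn_in=300, step=0.1, seed=0):
    """Random-walk MH in log(kappa) space; returns post-burn-in samples."""
    rng = np.random.default_rng(seed)
    kappa = kappa0
    samples = []
    for t in range(burn_in + n_samples):
        prop = kappa * np.exp(step * rng.standard_normal())   # log-normal proposal around current value
        # asymmetry correction for the multiplicative proposal: q(k|k') / q(k'|k) = k' / k
        log_acc = (log_target(prop, n_k, s_k, m, sigma2, D) + np.log(prop)
                   - log_target(kappa, n_k, s_k, m, sigma2, D) - np.log(kappa))
        if np.log(rng.uniform()) < log_acc:
            kappa = prop
        if t >= burn_in:
            samples.append(kappa)
    return np.array(samples)
```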

2. Bounding method: In this scheme, we assume that the posterior distribution of the cluster concentration parameters ($\kappa_k$'s) is log-normal,

q(\kappa_k) \equiv \mathrm{logNormal}(\cdot|a_k, b_k)

In order to find the parameters of the posterior of the $k$-th cluster, i.e. $a_k, b_k$, we optimize,

\left(\sum_i \lambda_{ik}\right) E_q[\log C_D(\kappa_k)] + E_q[\kappa_k] \sum_i \lambda_{ik}\, x_i^\top E_q[\mu_k] - E_q[\log(\kappa_k)] - E_q\!\left[\frac{(\log(\kappa_k) - m)^2}{2\sigma^2}\right] - H(q)

where

E_q[\kappa_k] = \exp(a_k + 0.5\, b_k^2)

E_q[\log(\kappa_k)] = a_k

E_q\!\left[\frac{(\log(\kappa_k) - m)^2}{2\sigma^2}\right] = \frac{1}{2\sigma^2}\big(E_q[\log(\kappa_k)^2] - 2m\, E_q[\log(\kappa_k)] + m^2\big)

E_q[\log(\kappa_k)^2] = a_k^2 + b_k^2

H(q(\kappa_k)) = 0.5 + 0.5\log(2\pi b_k^2) + a_k

E_q[\log C_D(\kappa_k)] = d'\, E_q[\log(\kappa_k)] - (d' + 1)\log(2\pi) - E_q[\log(I_{d'}(\kappa_k))]

The difficult term is the computation of $E_q[\log(I_{d'}(\kappa_k))]$. We upper bound this term using a Turán-type inequality. From [4], we use the fact that,

\frac{u\, I_v'(u)}{I_v(u)} \leq \sqrt{u^2 + v^2} \quad\Longrightarrow\quad \frac{I_v'(u)}{I_v(u)} \leq \sqrt{1 + \frac{v^2}{u^2}} = \frac{\sqrt{u^2 + v^2}}{u}

Integrating over $u > 0$ (the additive constant of integration does not depend on $u$ and can be dropped from the optimization),

\log(I_v(u)) \leq \sqrt{u^2 + v^2} - v\log\big(v + \sqrt{u^2 + v^2}\big) + v\log(u) + \text{const}

E_q[\log(I_v(u))] \leq E_q\big[\sqrt{u^2 + v^2}\big] - v\, E_q\big[\log\big(v + \sqrt{u^2 + v^2}\big)\big] + v\, E_q[\log(u)] + \text{const}

The individual expectations can be computed using the delta method approximation, i.e.,

E[f(x)] \approx f(E[x]) + \frac{f''(E[x])\,\mathrm{Var}[x]}{2}
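For instance, here is a minimal sketch of the delta-method approximation applied to $E_q[\sqrt{\kappa^2 + v^2}]$ under a log-normal $q(\kappa)$, checked against Monte Carlo; the numbers a, b, v are made-up illustration values, not settings from the paper.

```python
import numpy as np

def delta_expectation(f, f2, mean, var):
    """Second-order delta method: E[f(x)] ~ f(E[x]) + f''(E[x]) * Var[x] / 2."""
    return f(mean) + 0.5 * f2(mean) * var

# Example: E_q[sqrt(kappa^2 + v^2)] with q(kappa) = logNormal(a, b^2) and fixed v = d'.
a, b, v = 3.0, 0.2, 50.0
mean = np.exp(a + 0.5 * b ** 2)                      # E[kappa] for a log-normal
var = (np.exp(b ** 2) - 1) * np.exp(2 * a + b ** 2)  # Var[kappa] for a log-normal

f = lambda x: np.sqrt(x ** 2 + v ** 2)
f2 = lambda x: v ** 2 / (x ** 2 + v ** 2) ** 1.5     # second derivative of sqrt(x^2 + v^2)

approx = delta_expectation(f, f2, mean, var)

# Monte Carlo check
rng = np.random.default_rng(0)
samples = np.exp(rng.normal(a, b, size=200000))
print(approx, f(samples).mean())
```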


Empirical Bayes estimate for prior parameters

The empirical Bayes estimate for $m, \sigma^2$ is given by,

\max_{m, \sigma^2} \; -\frac{K}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^{K}\big(E_q[\log(\kappa_k)^2] - 2m\, E_q[\log(\kappa_k)] + m^2\big)

\Rightarrow \quad m = \frac{1}{K}\sum_{k=1}^{K} E_q[\log(\kappa_k)], \qquad \sigma^2 = \frac{1}{K}\sum_{k=1}^{K} E_q[\log(\kappa_k)^2] - m^2

The empirical Bayes estimate for $\mu_0, C_0$ is given by,

\max_{\mu_0, C_0} \; K\log C_D(C_0) + C_0\, \mu_0^\top \left(\sum_{k=1}^{K} E_q[\mu_k]\right)

Following [7], the updates are given by,

\mu_0 = \frac{\sum_{k=1}^{K} E_q[\mu_k]}{\left\|\sum_{k=1}^{K} E_q[\mu_k]\right\|}, \qquad C_0 = \frac{rD - r^3}{1 - r^2} \quad \text{where } r = \frac{\left\|\sum_{k=1}^{K} E_q[\mu_k]\right\|}{K}

The empirical Bayes estimate for $\alpha$ is given by,

\max_{\alpha > 0} \; -\log B(\alpha) + (\alpha - 1)s \qquad \text{where } s = \sum_{k=1}^{K} E_q[\log \pi_k], \quad B(\alpha) = \frac{\Gamma(\alpha)^K}{\Gamma(K\alpha)}

Since there is no closed-form solution, we need to apply some numerical optimization, such as gradient descent, to find the MLE.
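A minimal sketch of this one-dimensional optimization, using a bounded scalar solver from SciPy instead of hand-rolled gradient descent (the value of s below is a hypothetical stand-in for $\sum_k E_q[\log \pi_k]$):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def neg_objective(alpha, s, K):
    """Negative of -log B(alpha) + (alpha - 1) s, with log B(alpha) = K*gammaln(alpha) - gammaln(K*alpha)."""
    log_B = K * gammaln(alpha) - gammaln(K * alpha)
    return -(-log_B + (alpha - 1.0) * s)

K = 30
s = -120.0   # hypothetical value of sum_k E_q[log pi_k]; always negative since pi_k < 1
res = minimize_scalar(lambda a: neg_objective(a, s, K), bounds=(1e-6, 1e3), method="bounded")
alpha_hat = res.x
```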

2.2 Collapsed Gibbs sampling

The standard Gibbs sampling of the parameters can be easily derived by calculating their conditional distributions. The problem is that sampling of the high-dimensional vMF mean parameter $\mu_k$ is hard (computationally expensive). However, because the vMF distributions are conjugate, we can exploit this to completely integrate out the mean parameters and the mixing distribution. This implies that there are only two sets of parameters, $\{Z, \boldsymbol{\kappa}\}$. The conditional distributions are given by,

P(z_i = t \mid Z_{-i}, X, \boldsymbol{\kappa}, m, \sigma^2, \mu_0, C_0)
= \int_\pi \int_{\boldsymbol{\mu}} P(x_i|z_i, \mu_{z_i}, \kappa_{z_i})\, P(z_i|\pi)\, P(\pi|\alpha) \prod_{j \neq i} P(z_j|\pi)\, P(x_j|\mu_{z_j}, \kappa_{z_j}) \prod_{k=1}^{K} P(\mu_k|\mu_0, C_0)\, \mathrm{logNormal}(\kappa_k|m, \sigma^2)

\propto \int_\pi P(\pi|\alpha)\, P(z_i|\pi) \prod_{j \neq i} P(z_j|\pi) \;\cdot\; \prod_{k=1}^{K} \int_{\mu_k} P(x_i|z_i, \mu_k, \kappa_k) \prod_{j \neq i,\, z_j = k} P(x_j|\mu_k, \kappa_k)\; P(\mu_k|\mu_0, C_0)

\propto \frac{\Gamma(K\alpha)\, \prod_{k=1}^{K}\Gamma(\alpha + n_{k,-i} + I(t = k))}{\Gamma(K\alpha + N)\, \prod_{k=1}^{K}\Gamma(\alpha)} \;\prod_{k=1}^{K} \frac{C_D(\kappa_k)^{n_{k,-i} + I(k = t)}\, C_D(C_0)}{C_D\!\left(\left\|\kappa_k\Big(I(t = k)\, x_i + \sum_{j \neq i,\, z_j = k} x_j\Big) + C_0\mu_0\right\|\right)}

\propto (\alpha + n_{t,-i})\; C_D(\kappa_t)\; \frac{C_D\!\left(\left\|\kappa_t \sum_{j \neq i,\, z_j = t} x_j + C_0\mu_0\right\|\right)}{C_D\!\left(\left\|\kappa_t\Big(\sum_{j: z_j = t} x_j\Big) + C_0\mu_0\right\|\right)}

Similarly, we derive the conditional distribution for κk,

P(\kappa_k|\ldots) \propto \int_{\mu_k} \prod_{i: z_i = k} P(x_i|\mu_k, \kappa_k)\; P(\mu_k|\mu_0, C_0)\; \mathrm{logNormal}(\kappa_k|m, \sigma^2)

\propto \frac{C_D(\kappa_k)^{n_k}\, C_D(C_0)}{C_D\!\left(\left\|\kappa_k \sum_{j: z_j = k} x_j + C_0\mu_0\right\|\right)}\; \mathrm{logNormal}(\kappa_k|m, \sigma^2)

Since the conditional distribution is not of a standard form, we need to use a step of MCMC sampling (with an appropriate proposal distribution) to draw $\kappa_k$. The main advantage of sampling is that no approximation of the posterior is required; it is guaranteed that the samples will eventually be drawn from the true posterior. The downside is that it is not clear how many samples must be drawn.
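As a concrete illustration, below is a minimal sketch (our own, not the authors' implementation) of one collapsed Gibbs update of a single assignment $z_i$, following the conditional derived above; it assumes the rows of X are L2-normalized and that kappa holds the current concentration values.

```python
import numpy as np
from scipy.special import ive

def log_cd(kappa, D):
    """log C_D(kappa), the vMF normalization constant, via the scaled Bessel function."""
    nu = D / 2.0 - 1.0
    return nu * np.log(kappa) - (D / 2.0) * np.log(2 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

def resample_assignment(i, X, z, kappa, alpha, C0, mu0, rng):
    """One collapsed Gibbs update of z_i; X: (N, D), z: (N,) ints, kappa: (K,)."""
    N, D = X.shape
    K = len(kappa)
    log_prob = np.empty(K)
    for t in range(K):
        members = (z == t)
        members[i] = False                      # exclude x_i from its own cluster statistics
        n_t = members.sum()                     # n_{t,-i}
        s_t = X[members].sum(axis=0)            # sum of the other points currently in cluster t
        without_i = np.linalg.norm(kappa[t] * s_t + C0 * mu0)
        with_i = np.linalg.norm(kappa[t] * (s_t + X[i]) + C0 * mu0)
        log_prob[t] = (np.log(alpha + n_t) + log_cd(kappa[t], D)
                       + log_cd(without_i, D) - log_cd(with_i, D))
    log_prob -= log_prob.max()                  # normalize in log space for stability
    prob = np.exp(log_prob)
    return rng.choice(K, p=prob / prob.sum())
```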

Empirical Bayes estimate for prior parameters

The empirical Bayes estimation in the context of MCMC sampling is done as described in [5]. The prior parameters are estimated by maximizing the sum of the marginal likelihoods of all the samples.


3 Inference for Hierarchical von Mises-Fisher Mixture Model (H-vMFmix)

Given data $X = \{x_i\}_{i=1}^{N}$ where each $x_i \in \mathbb{R}^D$, and a hierarchy specified by the parent function $pa(\cdot)$, assume the following generative model for the data:

\pi \sim \mathrm{Dirichlet}(\cdot|\alpha)

\mu_k \sim \mathrm{vMF}(\cdot|\mu_{pa(k)}, \kappa_{pa(k)}), \quad k = 1, 2, \ldots, K

\kappa_k \sim \mathrm{logNormal}(\cdot|m, \sigma^2), \quad k = 1, 2, \ldots, K

z_i \sim \mathrm{Mult}(\cdot|\pi), \; z_i \in \{\text{leaf nodes}\}, \quad i = 1, 2, \ldots, N

x_i \sim \mathrm{vMF}(\cdot|\mu_{z_i}, \kappa_{z_i}), \quad i = 1, 2, \ldots, N

Assume the following factored form for the posterior,

q(\pi) \equiv \mathrm{Dirichlet}(\cdot|\rho), \qquad q(\mu_k) \equiv \mathrm{vMF}(\cdot|\psi_k, \gamma_k)

q(z_i) \equiv \mathrm{Mult}(\cdot|\lambda_i), \qquad q(\kappa_k) \equiv \text{no specific form}

The posterior distribution of the cluster assignment variables at the leaf nodes is given by,

\rho_k = \alpha + \sum_{i=1}^{N} \lambda_{ik}

\lambda_{ik} \propto \exp\big(E_q[\log \mathrm{vMF}(x_i|\mu_k, \kappa_k)] + E_q[\log(\pi_k)]\big)

where

E_q[\log(\pi_k)] = \Psi(\rho_k) - \Psi\Big(\sum_j \rho_j\Big), \qquad E_q[\log \mathrm{vMF}(x_i|\mu_k, \kappa_k)] = E_q[\log C_D(\kappa_k)] + E_q[\kappa_k]\, x_i^\top E_q[\mu_k]

The posterior distributions of the cluster means are updated as follows.

For leaf nodes,

\gamma_k = \left\| E_q[\kappa_k]\left(\sum_{i=1}^{N} E_q[z_{ik}]\, x_i\right) + E_q[\kappa_{pa(k)}]\, E_q[\mu_{pa(k)}] \right\|, \qquad \psi_k = \frac{E_q[\kappa_k]\left(\sum_{i=1}^{N} E_q[z_{ik}]\, x_i\right) + E_q[\kappa_{pa(k)}]\, E_q[\mu_{pa(k)}]}{\gamma_k}

where $E_q[z_{ik}] = \lambda_{ik}$.

For non-leaf nodes,

\gamma_k = \left\| E_q[\kappa_k] \sum_{c: pa(c) = k} E_q[\mu_c] + E_q[\kappa_{pa(k)}]\, E_q[\mu_{pa(k)}] \right\|, \qquad \psi_k = \frac{E_q[\kappa_k] \sum_{c: pa(c) = k} E_q[\mu_c] + E_q[\kappa_{pa(k)}]\, E_q[\mu_{pa(k)}]}{\gamma_k}

The posterior computation for κk’s follows in a manner similar to the previous section.


4 Inference for Temporal von Mises-Fisher Mixture Model (T-vMFmix)

Given data $X = \{\{x_{t,i}\}_{i=1}^{N_t}\}_{t=1}^{T}$ where each $x_{t,i} \in \mathbb{R}^D$, assume the following generative model for the data:

\pi \sim \mathrm{Dirichlet}(\cdot|\alpha)

\mu_{t,k} \sim \mathrm{vMF}(\cdot|\mu_{t-1,k}, C_0), \quad t = 1, 2, \ldots, T; \; k = 1, 2, \ldots, K

\kappa_k \sim \mathrm{logNormal}(\cdot|m, \sigma^2), \quad k = 1, 2, \ldots, K

z_{t,i} \sim \mathrm{Mult}(\cdot|\pi), \quad i = 1, 2, \ldots, N_t

x_{t,i} \sim \mathrm{vMF}(\cdot|\mu_{t, z_{t,i}}, \kappa_{z_{t,i}}), \quad t = 1, 2, \ldots, T; \; i = 1, 2, \ldots, N_t

Note that the data $X$ is given in the form of $T$ time-steps. Each time-step $t$ is associated with $N_t$ instances $x_{t,i}$, $i = 1, 2, \ldots, N_t$. The approximate posterior follows the factored form given below,

q(\pi) \equiv \mathrm{Dirichlet}(\cdot|\rho)

q(\kappa_k) \equiv \text{no specific form}

q(\mu_{t,k}) \equiv \mathrm{vMF}(\cdot|\psi_{t,k}, \gamma_{t,k})

q(z_{t,i}) \equiv \mathrm{Mult}(\cdot|\lambda_{t,i})

Note that for simplicity we have assumed a fully factorized form for the posterior distribution, with no dependencies between the $\mu$'s from adjacent time-points in the posterior. The likelihood of the data is given by,

P(X, Z, \pi, \boldsymbol{\mu}, \boldsymbol{\kappa}|\mu_{-1}, C_0, \alpha, m, \sigma^2) = P(\pi|\alpha) \prod_{k=1}^{K} P(\kappa_k|m, \sigma^2) \prod_{t=1}^{T}\left[\prod_{k=1}^{K} P(\mu_{t,k}|\mu_{t-1,k}, C_0) \prod_{i=1}^{N_t} P(z_{t,i}|\pi) \prod_{k=1}^{K} P(x_{t,i}|\mu_{t,k}, \kappa_k)^{z_{t,i,k}}\right]

\log P(X, Z, \pi, \boldsymbol{\mu}, \boldsymbol{\kappa}|\mu_{-1}, C_0, \alpha, m, \sigma^2) = \sum_{k=1}^{K}\left[-\log(\kappa_k) - \tfrac{1}{2}\log(2\pi) - \log(\sigma) - \frac{(\log(\kappa_k) - m)^2}{2\sigma^2}\right] - \log B(\alpha) + \sum_{k=1}^{K}(\alpha - 1)\log \pi_k
+ \sum_{t=1}^{T}\left[\sum_{k=1}^{K}\big(\log C_D(C_0) + C_0\, \mu_{t,k}^\top \mu_{t-1,k}\big) + \sum_{i=1}^{N_t}\sum_{k=1}^{K} z_{t,i,k}\log \pi_k + \sum_{i=1}^{N_t}\sum_{k=1}^{K} z_{t,i,k}\big(\log C_D(\kappa_k) + \kappa_k\, x_{t,i}^\top \mu_{t,k}\big)\right]

Proceeding in a similar manner, by optimizing the variational lower bound we get the following updates for the posterior parameters,


\lambda_{t,i,k} \propto \exp\big(E_q[\log \pi_k] + E_q[\log P(x_{t,i}|\mu_{t,k}, \kappa_k)]\big)

\rho_k = \alpha + \sum_{t=1}^{T}\sum_{i=1}^{N_t} \lambda_{t,i,k}

\gamma_{t,k} = \left\| E_q[\kappa_k]\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\, x_{t,i} + C_0 E_q[\mu_{t-1,k}] + I(t < T)\, C_0 E_q[\mu_{t+1,k}] \right\|, \qquad \psi_{t,k} = \frac{E_q[\kappa_k]\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\, x_{t,i} + C_0 E_q[\mu_{t-1,k}] + I(t < T)\, C_0 E_q[\mu_{t+1,k}]}{\gamma_{t,k}}

For the partial MCMC sampling for κk,

E_q[\log P(\kappa_k, X, Z, \boldsymbol{\mu}, \boldsymbol{\kappa}_{-k}, \boldsymbol{\eta}|\ldots)] = -\log(\kappa_k) - \frac{(\log(\kappa_k) - m)^2}{2\sigma^2} + \left(\sum_{t=1}^{T}\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\right)\log C_D(\kappa_k) + \kappa_k\sum_{t=1}^{T}\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\, x_{t,i}^\top E_q[\mu_{t,k}] + \text{constants}

If instead one were to perform MLE estimation of the $\kappa_k$'s without assuming the log-normal prior,

\max_{\kappa_k} \; \left(\sum_{t=1}^{T}\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\right)\log C_D(\kappa_k) + \kappa_k\sum_{t=1}^{T}\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\, x_{t,i}^\top E_q[\mu_{t,k}]

\Rightarrow \quad s = \frac{\sum_{t=1}^{T}\sum_{i=1}^{N_t} E_q[z_{t,i,k}]\, x_{t,i}^\top E_q[\mu_{t,k}]}{\sum_{t=1}^{T}\sum_{i=1}^{N_t} E_q[z_{t,i,k}]}, \qquad r = |s|, \qquad \kappa_k = \frac{rD - r^3}{1 - r^2}

For the MLE estimation of $C_0$,

\max_{C_0} \; KT\log C_D(C_0) + C_0 \sum_{t=1}^{T}\sum_{k=1}^{K} E_q[\mu_{t,k}]^\top E_q[\mu_{t-1,k}]

\Rightarrow \quad s = \frac{\sum_{t=1}^{T}\sum_{k=1}^{K} E_q[\mu_{t,k}]^\top E_q[\mu_{t-1,k}]}{KT}, \qquad r = |s|, \qquad C_0 = \frac{rD - r^3}{1 - r^2}

Note that $C_0$ determines the level of influence of the previous time-step on the next time-step. In some sense it acts as a regularization term forcing the parameters of the next time-step to be similar to the previous one. We recommend that it be set manually rather than learnt from the data. The reason is that, since the estimation is done through maximum likelihood, the likelihood can generally be improved by moving $C_0 \to 0$, i.e. towards no regularization.


5 Datasets

Table 1: Dataset Statistics

Dataset    #Instances    #Training    #Testing    #True Clusters    #Features
TDT4       622           311          311         34                8895
TDT5       6366          3183         3183        126               20733
CNAE       1079          539          540         9                 1079
NIPS       2483          1241         1242        -                 14036
K9         2340          1170         1170        20                21839
NEWS20     18744         11269        7505        20                53975

6 Clustering performance using true cluster labels

We present the complete results of all the methods on all the datasets using the following evaluation metrics,

1. Mutual Information (MI) (see footnote 1)

2. Normalized Mutual Information (NMI) (see footnote 1)

3. Rand Index (RI) (see footnote 1)

4. Adjusted Rand Index (ARI) (see footnote 2)

5. Purity (see footnote 1)

6. Macro-F1 (ma-F1): a measure similar to purity. Each cluster is assigned to the class label which is most frequent in the cluster; then the standard Macro-F1 classification measure is used to measure the performance. This measure complements the purity measure in the sense that we want to detect all clusters, not just the ones which have a large number of instances (see the sketch after this list).
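The following is a small sketch of how purity and this Macro-F1 variant can be computed from cluster assignments (our own illustration; it assumes scikit-learn's f1_score for the Macro-F1 part).

```python
import numpy as np
from sklearn.metrics import f1_score

def purity_and_macro_f1(y_true, clusters):
    """Map each cluster to its most frequent true label, then score.
    y_true, clusters: integer arrays of length N."""
    y_true = np.asarray(y_true)
    clusters = np.asarray(clusters)
    y_pred = np.empty_like(y_true)
    for c in np.unique(clusters):
        members = (clusters == c)
        labels, counts = np.unique(y_true[members], return_counts=True)
        y_pred[members] = labels[np.argmax(counts)]   # majority label of the cluster
    purity = np.mean(y_pred == y_true)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    return purity, macro_f1
```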

Methods compared,

1. von Mises-Fisher Mixture Model (vMFmix), trained using the EM algorithm.

2. Multinomial Mixture Model (MN) with a Dirichlet prior.

3. K-means (K-means): The standard K-means algorithm with hard assignments after each iteration.

4. Soft K-means (SK-means): The simplest Gaussian mixture model with K unknown means and a single variance parameter learnt from the data. (Presetting the variance to 1 performs so badly that almost all scores are zero.) Note that there is no common prior for the means.

5. Latent Dirichlet Allocation (LDA): Although not suitable for single-labeled datasets, we still show the results by assigning each document to the class with the largest posterior probability.

The methods are compared on all the datasets on which true cluster labels are available. Note that there is no splitting of the data into separate training and testing sets. For vMFmix, K-means and SK-means the Tf-Idf normalized representation is used (K-means and SK-means perform better with Tf-Idf normalization than without it). For MN and LDA, a word-count based representation is used. To make a fair comparison, the same random initial values of the cluster assignment variables are used for all the algorithms. For all the experimental settings, refer to the last section. After the publication of the paper, we realized that using K-means++ initialization instead of random initialization boosts performance on some of the datasets. We leave this comparison for further studies.

Footnote 1: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Footnote 2: http://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index

6.1 Results on the TDT4 dataset

[Figure: MI, NMI, RI, ARI, Purity, and Macro-F1 versus the number of clusters (10 to 35) on the TDT4 dataset, for vMF-mix, MN-mix, K-means, SK-means, and LDA.]


6.2 Results on the TDT5 Dataset

[Figure: MI, NMI, RI, ARI, Purity, and Macro-F1 versus the number of clusters (10 to 35) on the TDT5 dataset, for vMF-mix, MN-mix, K-means, SK-means, and LDA.]


6.3 Results on the News20 Dataset

[Figure: MI, NMI, RI, ARI, Purity, and Macro-F1 versus the number of clusters (10 to 35) on the News20 dataset, for vMF-mix, MN-mix, K-means, SK-means, and LDA.]


6.4 Results on the CNAE Dataset

[Figure: MI, NMI, RI, ARI, Purity, and Macro-F1 versus the number of clusters (10 to 35) on the CNAE dataset, for vMF-mix, MN-mix, K-means, SK-means, and LDA.]


6.5 Results on the K9 Dataset

[Figure: MI, NMI, RI, ARI, Purity, and Macro-F1 versus the number of clusters (10 to 35) on the K9 dataset, for vMF-mix, MN-mix, K-means, SK-means, and LDA.]


7 Results of Bayesian vs non-Bayesian model

The Bayesian vMF mixture model (B-vMFmix), trained using the variational method with the sampling scheme, is compared with the simple vMF mixture model (vMFmix) trained through the EM algorithm. We compare the methods using the log-likelihood of the data on a held-out test set. Note that true cluster labels are not required for this evaluation.

[Figure: Held-out test log-likelihood versus the number of clusters (10 to 50) for B-vMF-mix and vMF-mix on (a) the TDT4, (b) TDT5, (c) News20, (d) CNAE, (e) K9, and (f) NIPS datasets.]


8 Results of Hierarchical Bayesian methods (H-vMFmix) vs other models (vMFmix, B-vMFmix)

H-vMFmix and B-vMFmix are trained through variational inference, and vMFmix is trained through the EM algorithm. We compare the methods using the log-likelihood of the data on a held-out test set. The results are presented on 5 different hierarchies with varying depth and branch factor (the branch factor is the number of children under each node). More specifically, we tested with (h=height, b=branch factor). For example, (h=3, b=4) is a hierarchy 3 levels deep: the zeroth level is the root node, under which there are 4 children which form the first level; each of these 4 children in turn has 4 children, leading to 16 nodes in the second level; and each of these 16 nodes has 4 children, which gives 64 nodes at the third level. We present the results on (h=3,b=3), (h=3,b=4), (h=4,b=3), (h=2,b=10), and (h=3,b=5).

[Figure: Held-out test log-likelihood for H-vMF-mix, B-vMF-mix, and vMF-mix on the five hierarchies 27 (3,3), 64 (3,4), 81 (4,3), 100 (2,10), and 125 (3,5), on (a) the TDT4, (b) TDT5, (c) News20, (d) CNAE, (e) K9, and (f) NIPS datasets.]


9 Results of Temporal Bayesian methods vs other models

In this section, the results of the temporal Bayesian vMF mixture model (T-vMFmix) are compared with the simple vMF mixture model (vMFmix) as well as the non-hierarchical Bayesian vMF mixture model (B-vMFmix). T-vMFmix and B-vMFmix are trained through variational inference, and vMFmix is trained through the EM algorithm. We used the following 4 datasets which have associated time-stamps,

1. NIPS dataset: This dataset, outlined in Table 1, has 2483 research papers spread over a span of 17 years. The collection was divided into 17 time-points based on the year of publication of each paper.

2. TDT-5 dataset: This dataset, outlined in Table 1, has 6366 news stories over a span of a few months (Apr 03 to Sept 03). The collection was divided into 22 time-points, where each time-point corresponds to one week of news. Note that we selected only those news stories which have been judged by humans to belong to one or more preselected news 'stories' [1].

3. NYTC-{politics, elections} datasets: This is a collection of New York Times news articles from the period 1996-2007. We formed two different datasets, the first one being more general than the second one. The first subset consists of all news articles which have been tagged as politics, and the second subset consists of articles which have been tagged elections. For computational reasons, we removed all words which occurred fewer than 50 times and news stories which have fewer than 50 words. This leads to the nytc-politics dataset consisting of 48509 news articles with 16089 words, and the nytc-elections dataset consisting of 23665 news articles and 10473 words.

We test the methods by predicting the likelihood of the data at the next time-point given all the articles from the beginning, using 30 topics. For ease of visualization, we plot the relative improvement in log-likelihood of the B-vMFmix and T-vMFmix models over the simple vMFmix model (this is because the log-likelihoods from adjacent time-points are not comparable and fluctuate wildly).

[Figure: Improvement in log-likelihood over vMF-mix for T-vMF-mix and B-vMF-mix by year (1996-2007) on (a) the NYTC-politics and (b) the NYTC-elections datasets.]


[Figure: Improvement in log-likelihood over vMF-mix for T-vMF-mix and B-vMF-mix on (a) the NIPS dataset by year (1988-2002) and (b) the TDT5 dataset by month (April to September 2003).]

Lastly, we show some qualitative results using D-vMF-mix. Figure 5 shows the changing vocabulary for one of the topics in the NIPS dataset over time. The features with the largest weight are shown every 3 years.

Figure 5: The change in vocabulary over the years for one of the topics from the NIPS dataset

(1987): excitatory, inhibitor, activity, synaptic, firing, cells
(1991): activity, synaptic, firing, inhibitory, excitatory, neuron
(1994): activity, cortical, cortex, stimulus, neuronal, response
(1997): cortical, visual, neuronal, orientation, spatial, tuning
(2000): scene, spikes, spike, quantitative, visual, stimuli
(2003): scene, accessible, quantitative, dissimilarity, fold, distinguish

For illustrative purposes, the representation of the clusters generated by D-vMF-mix (using 5 clusters) on a dimension-reduced version of the NIPS dataset (Figure 6a) is shown. The data is first projected onto its first 3 principal components, followed by a projection onto a sphere. The different colors denote cluster membership, and the colored lines from the center denote how the mean parameter of each cluster moves over time.

[Figure 6(a): A 3D representation of the NIPS data.]


10 Other implementation Details

10.1 Calculating Iν(a)

We used the approximation given in [6],

\log I_\nu(a) = \sqrt{a^2 + (\nu + 1)^2} + \left(\nu + \tfrac{1}{2}\right)\log\frac{a}{\nu + \tfrac{1}{2} + \sqrt{a^2 + (\nu + 1)^2}} - \tfrac{1}{2}\log\left(\tfrac{a}{2}\right) + \left(\nu + \tfrac{1}{2}\right)\log\frac{2\nu + \tfrac{3}{2}}{2(\nu + 1)} - \tfrac{1}{2}\log(2\pi)

I_\nu(a) = \exp(\log I_\nu(a))
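Below is a short check of the approximation against SciPy's exponentially scaled Bessel function, scipy.special.ive; the grouping of terms follows our reconstruction of the formula above, and the values of nu and a are illustrative only.

```python
import numpy as np
from scipy.special import ive

def log_bessel_approx(nu, a):
    """Approximate log I_nu(a) using the expression above (as reconstructed)."""
    s = np.sqrt(a ** 2 + (nu + 1.0) ** 2)
    return (s
            + (nu + 0.5) * np.log(a / (nu + 0.5 + s))
            - 0.5 * np.log(a / 2.0)
            + (nu + 0.5) * np.log((2 * nu + 1.5) / (2 * (nu + 1.0)))
            - 0.5 * np.log(2 * np.pi))

def log_bessel_exact(nu, a):
    """Reference value: log I_nu(a) = log(ive(nu, a)) + a."""
    return np.log(ive(nu, a)) + a

nu, a = 500.0, 2000.0   # magnitudes typical of high-dimensional text data
print(log_bessel_approx(nu, a), log_bessel_exact(nu, a))
```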

10.2 Experimental Settings

All the clustering experiments were run until one of the following conditions was met:

1. The increase in likelihood was less than 1e-1

2. The convergence in parameters was less than 1e-2

3. At most 200 iterations.

For sampling the concentration parameters, the proposal distribution was a log-normal distribution with mean equal to the current iterate and variance 1. The first 300 iterations were used for burn-in and the last 200 iterations were used to estimate the samples (500 iterations seemed to be good enough). Since there are only K concentration parameters, this does not affect the computational complexity at all.

Since all the algorithms are EM-based and the objective is not convex, there is certainly a possibility of local maxima. Therefore it is necessary to initialize the EM algorithm appropriately. For all the vMF mixture models we initialized the concentration parameters with relatively low values like 5 or 10, which implies that the initial distribution of each cluster is very broad. This performed much better than setting the initial values of the concentration parameters to their MLE (the MLE value of κ was relatively high and makes the cluster distribution immediately very 'pointy', so it is difficult for instances to get reassigned to different clusters after the first iteration).

10.3 Computational Details

Given a fixed number of clusters K and N instances, each with dimensionality D, most clustering algorithms iterate between two steps,

1. Assigning instances to clusters (either soft or hard).

2. Calculating the cluster parameters.

In most cases, each step takes O(KND) per iteration. If hard assignments are used, one can 'potentially' reduce the computation of step 2 by maintaining sufficient statistics, which reduces it to O(ND). In any case the bottleneck is step 1, i.e. deciding which cluster each instance should be assigned to (either partially or fully). For all the algorithms, this involves calculating a distance matrix. For instance, consider the simple K-means algorithm: given a data matrix X (dimension N×D) and cluster means M (dimension K×D), step 1 involves the computation of,

\text{squared Euclidean distance matrix } d(X, M) = \big((X \circ X)\, e_D\big)\, e_K^\top - 2 X M^\top + e_N\big((M \circ M)\, e_D\big)^\top

where $e_D = \mathbf{1}_D$, $e_K = \mathbf{1}_K$, and $e_N = \mathbf{1}_N$ are all-ones vectors.
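A minimal NumPy sketch of this computation (ours, for illustration); broadcasting plays the role of the all-ones vectors, and the shapes below are arbitrary example values.

```python
import numpy as np

def squared_distances(X, M):
    """Pairwise squared Euclidean distances between rows of X (N x D) and rows of M (K x D),
    computed with a single dense matrix product as in the expression above."""
    x_sq = (X * X).sum(axis=1, keepdims=True)      # ((X o X) e_D), shape (N, 1)
    m_sq = (M * M).sum(axis=1, keepdims=True).T    # ((M o M) e_D)^T, shape (1, K)
    return x_sq - 2.0 * (X @ M.T) + m_sq           # broadcasting supplies e_K^T and e_N

# tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
M = rng.standard_normal((3, 4))
D2 = squared_distances(X, M)                       # shape (6, 3)
```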

Calculating the first term is O(ND) and calculating the last term is O(KD); the bottleneck is the computation of the middle term, i.e. the matrix product $XM^\top$, which is O(KND). This is the general case for most clustering problems. Sometimes, if we have special structure in the distance metric, we might be able to do better: for example, if the distance is Euclidean as above and we are doing hard assignments, then we can use the triangle inequality to reduce the number of computations.

This matrix product is the bottleneck for 'most' clustering algorithms: for K-means and vMF the bottleneck is the matrix product $XM^\top$, and for the Multinomial mixture it is the matrix product $X\log(M)^\top$. For more complicated topic models like LDA it is even worse, as each word in a document can belong to a different cluster. Fortunately, the systems community has hammered on matrix multiplication over several decades. There are well-known standards and widespread implementations that have squeezed out even the last bit of performance. Depending on the data we have (sparse or dense), the bottleneck will be one of two things,

1. Multiplication of a dense matrix and dense matrix.

2. Multiplication of a sparse matrix and dense matrix.

The former is more widely studied than the latter. Both are, however, popular enough to warrant independent functions in the BLAS interface (Basic Linear Algebra Subprograms): gemm and its respective sparse interface. In my limited experience, I have found the Intel MKL library to be the fastest matrix multiplication out there, far better than my self-compiled ATLAS, clAmdBlas, GotoBLAS, OpenBLAS, etc. It seamlessly harnessed however many cores were required (on a multicore machine) without dropping performance, something that other tools struggled to do. For example, clAmdBlas just opened too many threads for its own good.

Using such a parallel matrix library blurs the computation time between different algorithms. As long as the distance and parameter calculations can be written in terms of matrix products, all the clustering algorithms can run equally fast. In our experiments, we did not find any noticeable difference in running time per iteration between the various clustering algorithms.


Bibliography

[1] James Allan, Jaime G. Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. Topic detection and tracking pilot study final report. 1998.

[2] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. JMLR, 6(2):1345, 2006.

[3] Mark Bangert, Philipp Hennig, and Uwe Oelfke. Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy. In ICMLA, 2010.

[4] Arpad Baricz, Saminathan Ponnusamy, and Matti Vuorinen. Functional inequalities for modified Bessel functions. Expositiones Mathematicae, 29(4):399-414, 2011.

[5] George Casella. Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485-500, 2001.

[6] Kurt Hornik and Bettina Grun. movMF: An R package for fitting mixtures of von Mises-Fisher distributions.

[7] Suvrit Sra. A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x). Computational Statistics, 27(1):177-190, 2012.
