
K-means

vs.

GMM & PLSA

MATH 285 Project

Weiqian Hou

Fall 2015

1 Abstract

We have learned K-means (Lloyd's algorithm) in class, which can be seen as an Expectation Maximization (EM) algorithm. In this paper, we would like to compare it with two other EM algorithms for clustering. GMM (Gaussian Mixture Modeling) and PLSA (Probabilistic Latent Semantic Analysis) are two algorithms that use probability to cluster the data. GMM assumes that all data points are generated from a set of Gaussian models with the same set of mixture weights. PLSA, a natural extension of GMM, assigns different mixture weights to each data point. We compare K-means with GMM on two toy datasets, and then apply K-means and PLSA to a real document dataset to see the difference.

2 Introduction

The EM algorithm is a general scheme of repeatedly computing expected likelihoods and then maximizing the model. K-means (Lloyd's algorithm) can be seen as an EM algorithm. The E-step: each object is assigned to the nearest centroid, so that it is assigned to the most likely cluster. The M-step: the centroids are recomputed. These two steps are iterated until convergence. Since we are already very familiar with K-means, we spend more time on the two new algorithms (GMM and PLSA) and their performance.
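As a point of reference, here is a minimal MATLAB sketch of one run of Lloyd's algorithm. It is illustrative only: the function name lloyd_kmeans and its arguments are ours, it uses pdist2 from the Statistics and Machine Learning Toolbox, and the experiments in this report use MATLAB's built-in kmeans (see the Appendix) rather than this code.

function labels = lloyd_kmeans(X, K, maxIter)
% Minimal sketch of Lloyd's algorithm; X is an n-by-d data matrix, K the number of clusters.
    n = size(X, 1);
    centroids = X(randperm(n, K), :);          % initialize with K distinct random points
    for it = 1:maxIter
        D = pdist2(X, centroids);              % E-step: distance of every point to every centroid
        [~, labels] = min(D, [], 2);           % assign each point to its nearest centroid
        for k = 1:K                            % M-step: recompute each centroid
            if any(labels == k)
                centroids(k, :) = mean(X(labels == k, :), 1);
            end
        end
    end
end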

In section 3, we would introduce a most popular EM which is known as“Gaussian Mixture Modeling” (GMM), where the model are multivariate Gaus-sian distributions. It is also called a soft version of K-means. It allows overlap-ping of different group of data, while K-means is a hard cut. Then, we would testordinary K-means and GMM on Iris flower data set which we used in homeworkand one more toy data set to know more about their differences.

In section 4, we would like to introduce PLSA, and apply both K-means andPLSA to a real document data set (KOS blog entries). PLSA is designed fordocument clustering and it is more flexible than the GMM method. So we canimagine it would do a better job than K-means on documents clustering.

3 Gaussian Mixture Modeling (GMM)

3.1 GMM

A Gaussian mixture model (GMM) is a parametric probability density function represented as a weighted sum of K component Gaussian densities,

p(x|θ) = ∑_{k=1}^{K} P(k) g_k(x|θ_k).

Each component is a multivariate Gaussian density given by

g_k(x|θ_k) = 1 / ((2π)^{d/2} |Σ_k|^{1/2}) exp{ −(1/2) (x − µ_k)' Σ_k^{−1} (x − µ_k) },

where x ∈ R^d and θ_k = {µ_k, Σ_k}. The P(k) are the mixture weights, which satisfy ∑_{k=1}^{K} P(k) = 1.
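As a small illustration of the density above, the mixture can be evaluated directly with mvnpdf from the Statistics and Machine Learning Toolbox. This is only a sketch with illustrative names (gmm_density, P, mu, Sigma); it is not part of the code used in the experiments.

function p = gmm_density(X, P, mu, Sigma)
% Evaluate p(x|theta) = sum_k P(k) g_k(x|theta_k) at each row of X.
% P: 1-by-K mixture weights; mu: K-by-d means; Sigma: d-by-d-by-K covariances.
    p = zeros(size(X, 1), 1);
    for k = 1:numel(P)
        p = p + P(k) * mvnpdf(X, mu(k, :), Sigma(:, :, k));  % weighted Gaussian component
    end
end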

3.2 The EM Algorithm for Gaussian Mixture Models

For a sequence of T training vectors X = {x_1, . . . , x_T}, the GMM likelihood, assuming independence between the vectors, can be written as

p(X|θ) = ∏_{t=1}^{T} p(x_t|θ) = ∏_{t=1}^{T} ∑_{k=1}^{K} P(k) g_k(x_t|θ_k).

As an EM algorithm, GMM fitting has an E-step and an M-step.

Step 0: Make an initial guess for P(k) and θ_k, and compute

w_tk = Pr(k|x_t, θ_k) = P(k) g_k(x_t|θ_k) / ∑_{i=1}^{K} P(i) g_i(x_t|θ_i).

Step 1 (E-step): Formulate the expected log-likelihood function E(log p(X|θ)) with the probabilities w_tk, based on the current parameter settings.

Step 2 (M-step): Maximize the current expected log-likelihood function and get a new set of P(k) and θ_k.

Step 3: Repeat Step 1 and Step 2 until convergence (the parameters appear not to change much from one iteration to the next).
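The experiments use Mo Chen's emgm implementation (Section 3.3); purely for illustration, the update rules above can be written as one EM pass roughly as follows. The names here are ours, mvnpdf is from the Statistics Toolbox, implicit expansion (MATLAB R2016b or later) is assumed, and a small ridge is added to keep the covariances invertible.

function [P, mu, Sigma, W] = gmm_em_step(X, P, mu, Sigma)
% One EM pass for a GMM, following the w_tk / P(k) / theta_k updates above (sketch only).
% X: T-by-d data; P: 1-by-K weights; mu: K-by-d means; Sigma: d-by-d-by-K covariances.
    [T, d] = size(X);
    K = numel(P);
    W = zeros(T, K);
    for k = 1:K                                  % E-step: responsibilities w_tk
        W(:, k) = P(k) * mvnpdf(X, mu(k, :), Sigma(:, :, k));
    end
    W = W ./ sum(W, 2);                          % normalize over the K components
    Nk = sum(W, 1);                              % effective number of points per component
    P = Nk / T;                                  % M-step: new mixture weights
    for k = 1:K
        mu(k, :) = (W(:, k)' * X) / Nk(k);       % new mean
        Xc = X - mu(k, :);                       % centered data (implicit expansion)
        Sigma(:, :, k) = (Xc' * (W(:, k) .* Xc)) / Nk(k) + 1e-6 * eye(d);  % new covariance + ridge
    end
end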

3.3 K-means Vs. GMM

The Iris flower data set (150×4) is a multivariate data set introduced by Ronald Fisher in his 1936 paper. The data set consists of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), each with 50 samples. Each column is a feature measured from each sample: the length and width of the sepals and petals, in centimetres. Below is the original data plotted with the true labels.


Figure 1: Iris Data with True Labels

We applied K-means to the Iris data set in our homework, and the error rate was 0.1067, which was not good enough. Now we apply the GMM algorithm to it and compare the results of K-means and GMM in the two plots below (Figure 2: K-means Vs. GMM on Iris Data). All the MATLAB code and output are listed in the Appendix.

Here, we use Mo Chen's "EM algorithm for Gaussian mixture model" obtained from the MATLAB File Exchange (http://www.mathworks.com/matlabcentral/fileexchange/26184-em-algorithm-for-gaussian-mixture-model/content/emgm/emgm.m). Input: the matrix of Iris data and the number of clusters. Output: class labels.


Figure 2: K-means Vs. GMM on Iris Data

The performance of GMM is better than that of K-means: the three clusters in the GMM plot are closer to the original ones. We also compute the error rate (the percentage of misclassified points), where smaller is better. The error rate of GMM is 0.0333, while that of K-means is 0.1067, which is consistent with the plots.

Let us quickly look at another example, a toy dataset without true labels, which gives us a bigger picture of K-means and GMM. Later, we discuss their pros and cons in the conclusion.


Figure 3: K-means and GMM on a Toy Dataset

4 Probabilistic Latent Semantic Analysis (PLSA)

4.1 PLSA

Probabilistic Latent Semantic Analysis (PLSA), developed by Th. Hofmann in 1999, was initially used for text-based applications (indexing, retrieval, clustering). The idea is that a document is a mixture of topics, and each topic has its own characteristic word distribution. The model can be completely defined by specifying the joint distribution

P(w, d) = ∑_z P(z) P(d|z) P(w|z) = P(d) ∑_z P(z|d) P(w|z),

where z is a topic, d is a document, and w is a word. As another EM algorithm, it can be presented as an E step and an M step.

E step: Formulate the expected log-likelihood function E(log P(w, d)) with the initial P(d), P(z|d), P(w|z), which can be written as

E(log P(w_i, d_j)) = ∑_{i=1}^{m} ∑_{j=1}^{n} F_ij log P(w_i, d_j),

where F_ij is the joint probability of the m × n frequency matrix (the frequency of each of the m words in each of the n documents) and ∑_{ij} F_ij = 1.

M step: Maximize the current expected log-likelihood function, and get a final set of P(z|d), P(w|z).
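The experiments use the PLSA implementation from Ata Kaban's teaching material (Section 4.4). For illustration only, the standard Hofmann-style EM updates for the conditional form P(d) ∑_z P(z|d) P(w|z) can be sketched in MATLAB as below; N, Pwz, Pzd, and plsa_sketch are our own illustrative names, and implicit expansion (R2016b or later) is assumed.

function [Pwz, Pzd] = plsa_sketch(N, K, nIter)
% Schematic PLSA EM updates (sketch only, not the implementation used in the Appendix).
% N: W-by-D word-document count matrix; K: number of topics; nIter: number of EM iterations.
    [Wn, D] = size(N);
    Pwz = rand(Wn, K); Pwz = Pwz ./ sum(Pwz, 1);      % random init of P(w|z); columns sum to 1
    Pzd = rand(K, D);  Pzd = Pzd ./ sum(Pzd, 1);      % random init of P(z|d); columns sum to 1
    for it = 1:nIter
        NewPwz = zeros(Wn, K);
        NewPzd = zeros(K, D);
        for d = 1:D
            Pzdw = Pzd(:, d) .* Pwz';                 % E-step: unnormalized P(z|d,w), K-by-Wn
            Pzdw = Pzdw ./ max(sum(Pzdw, 1), eps);    % normalize over topics
            NewPwz = NewPwz + (Pzdw .* N(:, d)')';    % M-step accumulator for P(w|z)
            NewPzd(:, d) = Pzdw * N(:, d);            % M-step accumulator for P(z|d)
        end
        Pwz = NewPwz ./ max(sum(NewPwz, 1), eps);     % renormalize columns of P(w|z)
        Pzd = NewPzd ./ max(sum(NewPzd, 1), eps);     % renormalize columns of P(z|d)
    end
end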

4.2 About the KOS Data Set

“KOS blog entries” is one of the data sets in the “Bag of Words” Data Set downloaded from http://archive.ics.uci.edu/ml/datasets/Bag+of+Words, and its original source is dailykos.com.


From Wikipedia (https://en.wikipedia.org/wiki/Daily_Kos) we know that “Daily Kos is an American political blog that publishes news and opinions from a liberal point of view. It functions as a discussion forum and group blog for a variety of netroots activists whose efforts are primarily directed toward influencing and strengthening progressive policies and candidates.”

The data set has two columns (D and W) and N observations, together with a vocabulary list. D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection. The data set description says that the stop words have been removed and that only words occurring more than ten times are kept. But we still find some stop words there, like “be”, “back”, and so on. Thus, we do some further data processing.

The data set we downloaded has 3430 documents, 6906 unique words, and 467,714 total words. First, we use the programming language R to construct a W×D (6906×3430) matrix, whose entries are the frequency of each word in each document. Then, we use the MATLAB porterStemmer function (https://github.com/faridani/reverse-stemmer/blob/master/Matlab/porterStemmer.m) to combine words with the same stem; for example, “abandoned” and “abandoning” are counted as “abandon”. After that, the matrix shrinks to 4614×3430. Finally, we remove the stop words again and get a 4567×3430 matrix.

4.3 Visualizing the Data

The data set has no class labels, and for copyright reasons no file names or other document-level metadata. So we can only check the clusters by visualization.

First, we apply SVD to the centered data set, and it turns out that the first 2 principal components (PCs) explain 85.86% of the variance (Figure 4(a)). So we can draw a 2-dimensional plot with the first 2 PCs to get a big picture of the whole data set (Figure 4(b)).


Figure 4: Cumulative Contribution of First 20 PC & Data in First 2 dimensions
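The full script is in Appendix B; the centering and SVD step can be summarized by the short sketch below, which uses diag(S) for the explained variance (a slight variation on the Appendix code) and illustrative variable names.

Xc = X - repmat(mean(X, 1), size(X, 1), 1);              % center the 4567x3430 word-document matrix
[U, S, V] = svd(Xc, 'econ');
s2 = diag(S).^2;
explained = cumsum(s2) / sum(s2);                        % cumulative fraction of variance per PC
figure; plot(U(:, 1), U(:, 2), 'b.', 'MarkerSize', 10);  % data in the first two principal directions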

No obvious clusters can be seen in the scatter plot. The points are dense on the right side and become gradually sparser toward the left.


However, we find two main groups of points when we zoom in on the main part of the data (Figure 5(a)). By intuition, we would say there are probably 2 clusters here. Then, we try to confirm this from the number of classes of documents; that is, the words can be divided into two clusters if there are two kinds of documents in total. Thus, we flip the matrix and make the documents the observations (D × W). Using the same SVD method, we draw the first-2-PC plot of the documents, which clearly shows two clusters (Figure 5(b)).


Figure 5: Visualizing the Data to Decide the Number of Clusters

4.4 K-means Vs. PLSA

There are two ways to carry out the PLSA clustering after we apply the PLSA algorithm. One way is PLSA + K-means:

1. Apply the PLSA function (teaching material for Machine Learning, Practical Assignment 3, written by Ata Kaban, 2005) to our W × D matrix; output P(w|z), P(z|d).

2. Apply K-means to P(w|z).

The resulting plot is shown in Figure 6(a); Figure 6(b) is the scatter plot of P(w|z).



Figure 6: PLSA plus K-means on P(w|z)

In this case, however, this method is not suitable because we cannot see any pattern of classes in the plot of P(w|z). That is why applying K-means to P(w|z) does not yield a good solution. Thus, we turn to the other way, PLSA + voting. First of all, we have to flip the matrix again.

1. Input the D × W matrix into the PLSA function and get the so-called “P(z|d)”, which is actually P(z|w).

2. Use majority voting to decide which topic each word should fall into. Say P(topic 1|word) > P(topic 2|word); then the word belongs to topic 1 (see the sketch below).
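In MATLAB, this voting step is just a column-wise maximum over the topic posteriors. A minimal sketch, assuming S is the 2-by-W matrix of P(topic|word) returned by the PLSA call in Appendix B:

[~, topicOfWord] = max(S, [], 1);   % topicOfWord(j) = argmax over z of P(z | word j)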

We plot the result of this method, and compare it with K-means clustering.


Figure 7: PLSA vs. K-means

Figure 7(a) shows that the result of PLSA clustering is much closer to the two clusters found in Figure 5(a) than that of K-means clustering. Although we do not have true class labels to compute an error rate, we can still conclude from these two plots that the performance of PLSA is much better than that of K-means. Another advantage of PLSA is that it outputs the corresponding P(d|z) simultaneously. We can use it to check the performance of the word clustering, because it can also be used to cluster the documents, whose number of clusters is very obvious in Figure 5(b). So we take the matrix P(document|topic), which is the output of the last step, and apply K-means to it. The results are shown in the 2D and 3D plots below (Figure 8(a) and (b)).


Figure 8: Corresponding PLSA Clusters of Documents in 2D and 3D Plot

You may think the result is not perfect, because several dots fall into the wrong group. Yet, the error rate will not be high given the large total number of points. In fact, the result is not always the same due to different initial values: we ran the code many times, and some results are better than this one while others are worse. We show more results in the Appendix so that you can get an overview of the performance of PLSA.

5 Conclusion

We like using K-means clustering in many situations because it is easy to apply and converges quickly, so it is computationally faster than several other EM algorithms even with a large number of variables. That is why we often use it together with other algorithms, like the PLSA mentioned in this report and the Ncut learned in class. However, K-means is not good at finding clusters of different sizes, shapes, and densities, as the three examples here show.

GMM works very well on data of different sizes and densities, as the Iris data and toy data plots in Section 3.3 show. It is also very fast. Yet the result is not stable; that is, the clustering is a little different each time. The reason is that it is completely exploratory and sensitive to violations of its distributional assumptions.

PLSA is a natural extension of GMM, but the PLSA model is more flexible than the GMM model in that a different set of mixture weights is set for each data point, while GMM uses the same set of mixture weights for all the data points of a particular class. Although PLSA was initially designed for document clustering, it can still be used in other fields such as images. The idea of PLSA is to cluster words based on different topics, which are determined by the different kinds of documents. After applying PLSA, we get P(word|topic) and P(topic|document) simultaneously. That is, we may cluster words and documents at the same time, and their classifications are consistent. Although its result changes a little bit every time, it is still relatively stable. The only disadvantage of PLSA is that it can overfit when the amount of training data is limited.

6 Future Work

Next, we might extract some words from the two groups and compare them with a dictionary to see which topic each group belongs to, so that we can identify the two main topics here. Generally, political articles cover two kinds of topics: liberal and conservative. For example, if we find that the larger group of documents belongs to the liberal topic based on its words, we could conclude that this blog publishes news and opinions from a liberal point of view.


References

[1] T. Hofmann, “Probabilistic Latent Semantic Analysis,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 1999, pp. 289–296.

[2] C. Ding, T. Li, and W. Peng, “Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-square Statistic, and a Hybrid Method,” in Proceedings of the 21st National Conference on Artificial Intelligence, Volume 1, Boston, Massachusetts, 2006, pp. 342–347.

[3] L. Si and R. Jin, “Adjusting Mixture Weights of Gaussian Mixture Model via Regularized Probabilistic Latent Semantic Analysis,” in Advances in Knowledge Discovery and Data Mining, T. B. Ho, D. Cheung, and H. Liu, Eds., Springer Berlin Heidelberg, 2005, pp. 622–631.

[4] H. Zhang, R. Edwards, and L. Parker, “Regularized Probabilistic Latent Semantic Analysis with Continuous Observations,” in 2012 11th International Conference on Machine Learning and Applications (ICMLA), 2012, vol. 1, pp. 560–563.

[5] “Can You Find Love through Text Analytics?,” Loren on the Art of MATLAB.

[6] University of California, Irvine, “The EM Algorithm for Gaussian Mixtures,” CS 274A Probabilistic Learning: Theory and Algorithms.

[7] D. Reynolds, “Gaussian Mixture Models,” in Encyclopedia of Biometrics, Springer US, 2009, pp. 659–663.

[8] D. Oneata, “Probabilistic Latent Semantic Analysis,” The University of Edinburgh School of Informatics.


Appendices

A K-means vs. GMM on Iris Data

% import iris data
fileID = fopen('iris.data');
C = textscan(fileID,'%3.1f, %3.1f, %3.1f, %3.1f, %s');
fclose(fileID);

%% form data matrix and plot data with true labels
X = [C{1:4}]; % concatenate the first four cells to form a matrix

% true labels
labels = zeros(size(X,1),1);
labels(strcmp(C{5}, 'Iris-setosa')) = 1;
labels(strcmp(C{5}, 'Iris-versicolor')) = 2;
labels(strcmp(C{5}, 'Iris-virginica')) = 3;

% display the three true clusters
figure; gcplot(X, labels); axis equal
legend('Iris-setosa','Iris-versicolor','Iris-virginica')
title('Iris data with true clusters')

%% 2 K-means with three clusters
labels_kmeans = kmeans(X, 3, 'Replicates', 10);
figure; gcplot(X, labels_kmeans); axis equal
title('Iris data with three clusters by K-means')

computing_percentage_of_misclassified_points(labels_kmeans, labels)

ans =

0.1067

%% 3 EM algorithm for Gaussian mixture model. There are two ways to do it.
X1 = transpose(X);
% 1st method: input the initial k-means labels from the last step (which were not good enough).
[label] = emgm(X1, transpose(labels_kmeans));
figure; gcplot(X, transpose(label)); axis equal
title('Iris data with three clusters by GMM Algorithm')

computing_percentage_of_misclassified_points(label, labels)

ans =

0.0333

% 2nd way: input k = 3. This is not stable, i.e. the result is not always the same,
% but the best performance is the same.
[label] = emgm(X1, 3);
figure; gcplot(X, transpose(label)); axis equal
title('Iris data with three clusters by GMM Algorithm')

computing_percentage_of_misclassified_points(label, labels)


ans =

0.0333

B K-means vs. PLSA on KOS Data

%% R code: reshape the data set to a W x D matrix
kos <- read.table("docword.kos.txt", header = FALSE)
colnames(kos) <- c("doc","word","N")
attach(kos)
kos2 <- reshape(kos, direction="wide", v.names="N", timevar="doc", idvar="word")

vocab <- read.table("vocab.kos.txt", header = FALSE)
vocab$word <- seq.int(nrow(vocab))
kos3 <- merge(vocab, kos2, by="word")

kos4 <- kos3[,-c(1:2)]
kos4[is.na(kos4)] <- 0

%% matlab code: do porterStemmer
fileID = fopen('vocab.kos.txt');
C = textscan(fileID,'%s%s');
fclose(fileID);

label = [C{1:2}]; % set 2 cells to form a matrix

for j = 1:size(label,1)
    label(j,2) = cellstr(porterStemmer(char(label(j,1)))); % save stem word in 2nd column.
end

A = dataset(label);
export(A,'file','labelkos.txt');
% labelkos.txt has 2 columns: 1st is the former vocabulary, 2nd is the stem vocabulary

%% R code: combine the words with same stem
label <- read.table("labelkos.txt", header = T)

Nkos <- cbind(label, kos4)
Nkos2 <- aggregate(. ~ Nkos$label_2, data = Nkos, sum)

Nkos3 <- Nkos2[,-c(2:3)]
fitlabel <- Nkos3[,1]
fitkos <- Nkos3[,-1]
write.table(fitlabel, "fitlabel.txt", sep="\t", col.names=F, row.names=F)
write.table(fitkos, "fitkos.txt", sep="\t", col.names=F, row.names=F)

% final dataset: "fitlabel.txt" is 4614x1; "fitkos.txt" is 4614x3430.
fileID = fopen('fitkos.txt');
C = textscan(fileID, repmat('%f',1,3430), 'Delimiter',' '); % C is a cell array
fclose(fileID);

X = [C{1:3430}]; % concatenate the first 3430 cells to form a matrix

fileID = fopen('fitlabel.txt');
C = textscan(fileID, '%s', 'Delimiter',' '); % C is a cell array
fclose(fileID);

fitlabel = [C{1:1}];

% remove more stop words.
% Source of stopwords: http://norm.al/2009/04/14/list-of-english-stop-words/
stopwords_cellstring = {'a', 'about', 'above', 'above', 'across', 'after', ...
'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', ...
'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', ...
'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', ...
'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', ...
'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', ...
'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', ...
'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', ...
'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', ...
'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', ...
'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', ...
'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', ...
'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', ...
'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', ...
'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', ...
'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', ...
'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', ...
'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', ...
'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', ...
'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', ...
'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', ...
'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', ...
'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', ...
'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', ...
'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', ...
'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', ...
'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', ...
'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third', 'this', 'those', ...
'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', ...
'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', ...
'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', ...
'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', ...
'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', ...
'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', ...
'you', 'your', 'yours', 'yourself', 'yourselves', 'the'};

fitlabel1 = fitlabel(~ismember(fitlabel, stopwords_cellstring));
index = ismember(fitlabel, stopwords_cellstring);
d = find(ismember(fitlabel, stopwords_cellstring));
X(d,:) = []; % X is finally a 4567x3430 matrix

center = mean(X,1); % average over the rows
X_tilde = X - repmat(center, size(X,1), 1);
[U,S,V] = svd(X_tilde,'econ');

%% the variance contributed by the first 20 PC
explained = cumsum(S.^2/sum(S.^2));
figure; plot(1:size(S,1), explained);
xlim([1 20]); ylim([0 1]);
line([2 2],[0 explained(2)],'Color','r')
line([0 2],[explained(2) explained(2)],'Color','r')
title('Cumulative sum of S^2 divided by sum of S^2')
xlabel('Column')
ylabel('% variance explained')
explained(2)

ans =

0.8586

%% drawing a two-dimensional scatter plot
figure; plot(U(:,1), U(:,2), 'b.', 'MarkerSize', 10);
title('KOS Profiles and Words')
xlabel('Dimension 1')
ylabel('Dimension 2')
xlim([-.45 0.02]); ylim([-.6 .5])

% zoom in on the right side of the plot
figure; plot(U(:,1), U(:,2), 'b.', 'MarkerSize', 10);
title('KOS Profiles and Words')
xlabel('Dimension 1')
ylabel('Dimension 2')
xlim([-.3 0.02]); ylim([-.25 .2])

% 2D plot of documents
X1 = transpose(X);
X1_tilde = X1 - repmat(mean(X1,1), size(X1,1), 1);
[U,S,V] = svd(X1_tilde,'econ');
figure; plot(U(:,1), U(:,2), 'b.', 'MarkerSize', 10);
title('2D Plot of Documents')
xlabel('Dimension 1')
ylabel('Dimension 2')

%% kmeans 2 clusters of words
labels_kmeans = kmeans(X,2);
figure; gcplot(X, labels_kmeans);
title('Clustering Words by Using K-means')

%% PLSA 2 clusters of words + kmeans
[W,S] = PLSA(X,2,150); % W: P(word|topic); S: P(topic|doc)
figure; plot(W(:,1), W(:,2), 'b.', 'MarkerSize', 9);
title('2D Plot of P(w|z)')
xlim([0 0.018]); ylim([0 .008])
label_W = kmeans(W,2);
figure; gcplot(X, label_W);
title('Clustering Words by Using PLSA + K-means')

% PLSA 2 clusters of words + voting
[W,S] = PLSA(X1,2,150); % input X1 (DxW matrix), so S: P(topic|word)
[M,I] = max(S,[],1);
figure; gcplot(X, I);
title('Clustering Words by Using PLSA + Voting')

% PLSA 2 clusters of docs + kmeans
labels_docs = kmeans(W,2,'Replicates',20);
figure; gcplot(X1, labels_docs);
title('Clustering Documents by Using PLSA + K-means')



Figure 9: More Results for PLSA Clusters of Words and Documents
