Experimental Analysis of GTM

Elias Pampalk

In the past years many different data mining techniques have been developed. The goal of the seminar Kosice-Vienna is to compare some of them and to determine which is preferable for which type of dataset. To this end we analyze three datasets with different properties regarding dimensionality and amount of data. I summarize the generative topographic mapping algorithm, discuss the results of the experiments, and show the similarity to related architectures.

1. INTRODUCTION

Nonlinear methods for statistical data analysis have become more and more popular thanks to the rapid development of computers. The fields in which they are applied are as varied as the methods themselves.

Generative topographic mapping (GTM) was developed by [Bishop et al. 1997] as a principled alternative to the self-organizing map (SOM) algorithm [Kohonen 1982], in which a set of unlabelled data vectors is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. While the SOM algorithm has achieved many successes in practical applications, it also suffers from significant deficiencies, many of which are highlighted in [Kohonen 1995]. They include: the absence of a cost function, the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced to the heuristic origins of the SOM algorithm. The GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages. The datasets used to analyze the strengths and weaknesses of GTM range from low dimensional with few vectors to very high dimensional with many vectors.

In Section 2 I give an overview of related work. I summarize the method in Section 3. In Section 4 I describe the experiments I performed and discuss the similarities to the SOM. Conclusions are presented in Section 5.

2. RELATED WORK

A lot of research has been done in the field of statistical data analysis. GTM belongs to the family of unsupervised methods; another representative is, for example, k-means clustering. An important aspect of unsupervised training is data visualization: the high-dimensional data space is mapped onto a (usually two-dimensional) space. The two main criteria are preserving the topology and clustering the data. The most common tool is the SOM, but there are others, for example Sammon's mapping. The GTM is a rather new probabilistic re-formulation of the SOM. The developers of the GTM algorithm have compared it with a batch version of the SOM and found that the computational cost of the GTM algorithm is about a third higher.

3. THE METHOD

3.1 Principles

GTM consists of a constrained mixture of Gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm. It is defined by specifying a set of points {x_i} in latent space, together with a set of basis functions {Φ_j(x)}. The adaptive parameters W and β define a constrained mixture of Gaussians with centers WΦ(x_i) and a common covariance matrix given by β^{-1}I. After initializing W and β, training involves alternating between the E-step, in which the posterior probabilities are evaluated, and the M-step, in which W and β are re-estimated.

3.2 Details

The continuous function y = f(x, W) defines a mapping from the latent space into the data space, where x is a latent variable (vector), W are the parameters of the mapping (matrix), and y is a vector in the higher-dimensional data space. The transformation f(x, W) maps the latent-variable space into a non-Euclidean manifold embedded within the data space.

Defining a probability distribution p(x) on the latent-variable space induces a corresponding distribution p(y|W) in the data space.

Since in reality the data t will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model. A radially symmetric Gaussian distribution centered on f(x, W) is chosen:

    p(t | x, W, \beta) = \left( \frac{\beta}{2\pi} \right)^{D/2} \exp\left\{ -\frac{\beta}{2} \, \| f(x, W) - t \|^2 \right\},

where D is the dimension of the data space and β^{-1} is the variance of the Gaussian distribution.

The distribution in the data space, for a given value of W, is then obtained by integration over the x-distribution:

    p(t | W, \beta) = \int p(t | x, W, \beta) \, p(x) \, dx.

For a given data set D = (t_1, ..., t_N) of N data points, the parameters W and β can be determined using maximum likelihood. The log likelihood function is given by

    L(W, \beta) = \ln \prod_{n=1}^{N} p(t_n | W, \beta).

The (prior) distribution p(x) is chosen so that the integral over x can be solved:

    p(x) = \frac{1}{K} \sum_{i=1}^{K} \delta(x - x_i),

where K is the number of latent points.

Each point x_i is mapped to a corresponding point f(x_i, W) in the data space, which forms the center of a Gaussian density function

    p(t | W, \beta) = \frac{1}{K} \sum_{i=1}^{K} p(t | x_i, W, \beta)

and the log likelihood function becomes

    L(W, \beta) = \sum_{n=1}^{N} \ln \left\{ \frac{1}{K} \sum_{i=1}^{K} p(t_n | x_i, W, \beta) \right\}.

This corresponds to a constrained Gaussian mixture model, since the centers of the Gaussians, given by f(x_i, W), cannot move independently but are related through the function f(x, W). If the mapping function f(x, W) is smooth and continuous, the projected points f(x_i, W) will necessarily have a topographic ordering in the data space.

If a particular parametrized form for f(x, W) is chosen which is a differentiable function of W (for example a feed-forward network with sigmoidal hidden units), then any standard non-linear optimization method, such as conjugate gradients, can be used. However, since the model consists of a mixture distribution, the EM algorithm is used.

f(x, W) is chosen to be given by a generalized linear regression model of the form

    f(x, W) = W \Phi(x),

where the elements of Φ(x) consist of M fixed basis functions Φ_j(x), and W is a D×M matrix.
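To make this concrete, the following is a minimal sketch of the whole model in plain Matlab, written independently of the GTM toolbox. All variable names are my own, T is assumed to be an N x Dd matrix of data vectors, the random initialization of W is a simplification (the toolbox uses principal components, see Section 3.3), and implicit expansion (Matlab R2016b or later) is assumed.

    % Minimal GTM sketch: square latent grid, Gaussian basis functions,
    % and the EM training loop.
    [N, Dd] = size(T);
    kside = 10; mside = 4;            % 10x10 latent points, 4x4 basis functions
    sigma = 1.0; lambda = 1e-3;       % relative basis width, weight decay
    K = kside^2; M = mside^2;
    sq = @(A, B) sum(A.^2, 2) + sum(B.^2, 2)' - 2 * A * B';  % pairwise squared distances
    [u, v] = meshgrid(linspace(-1, 1, kside));
    X = [u(:) v(:)];                  % K x 2 latent points {x_i}
    [cu, cv] = meshgrid(linspace(-1, 1, mside));
    C = [cu(:) cv(:)];                % M x 2 basis function centers
    w2 = (sigma * 2 / (mside - 1))^2; % squared absolute basis width
    Phi = [exp(-sq(X, C) / (2 * w2)) ones(K, 1)];  % K x (M+1) activations, bias column
    W = 0.1 * randn(M + 1, Dd);       % mapping weights (random init for brevity)
    beta = 1;                         % inverse noise variance
    for cycle = 1:30
        % E-step: responsibility of latent point i for data point n; the
        % (beta/2pi)^(Dd/2) and 1/K factors cancel in the normalization
        lnP = -(beta / 2) * sq(Phi * W, T);   % K x N
        lnP = lnP - max(lnP, [], 1);          % guard against underflow
        R = exp(lnP);
        R = R ./ sum(R, 1);
        % M-step: re-estimate W (regularized least squares) and beta
        G = Phi' * spdiags(sum(R, 2), 0, K, K) * Phi;
        W = (G + lambda * eye(M + 1)) \ (Phi' * (R * T));
        beta = N * Dd / sum(sum(R .* sq(Phi * W, T)));
    end

The centers of the constrained mixture are the rows of Phi*W: they can only move by changing W, which is exactly what produces the topographic ordering discussed above.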

3.3 Matlab Toolbox

The GTM toolbox for Matlab [Svensen 1999] provides a set of functions to generate GTMs and use them for visualization of data.

There are standard functions for the two main steps: setup and training. Setup refers to the process of generating an initial GTM model, made up of a set of components (Matlab matrices). Training refers to adapting the initial model to a data set, in order to improve the fit to that data.

The standard initialization (gtm_stp2) consists of the following steps: First a latent variable sample is generated. Then the centers of the basis functions are generated and the activations of the basis functions are computed, given the latent variable sample. Finally an initial weight matrix mapping from the output of the basis functions to the data space, and an initial value for the inverse variance of the Gaussian mixture, are computed using the first two principal components.
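In plain Matlab that initialization might look roughly as follows. This is a hypothetical sketch of the description above, not the toolbox code: X and Phi are the latent grid and basis activations from the sketch in Section 3.2, and taking the noise variance from the third eigenvalue is a simplification.

    [V, E] = eig(cov(T));                      % eigenvectors of the data covariance
    [e, order] = sort(diag(E), 'descend');     % eigenvalues, largest first
    A = V(:, order(1:2)) * diag(sqrt(e(1:2))); % Dd x 2 projection, scaled by std. dev.
    W = Phi \ (X * A' + mean(T));              % least squares: Phi*W maps the latent
                                               % grid onto the principal plane
    beta = 1 / e(3);                           % noise variance from the 3rd eigenvalue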

The standard training function (gtm_trn) basically consists of two steps: In the E-step a matrix is calculated containing the responsibilities assumed by each Gaussian mixture component for each of the data points. These responsibilities are used in the M-step for calculating new parameters of the Gaussian mixture. Both steps are executed in each training cycle.

4. EXPERIMENTS

4.1 Introduction

4.1.1 Diagrams. GTM defines a distribution for each data point in the latent space. There are a few ways to visualize this. The first would be to look at the distribution of each single data point; normally this is not desired. The second would be to plot the means of the data points. However, the distribution could be multi-modal, in which case the mean can give a very misleading summary of the distribution. So another way would be to plot the means together with the corresponding modes. The problem with this approach is that with many data points it becomes difficult to recognize anything. Therefore, for each experiment I plot a diagram of the means, to represent the data points and to easily recognize clusters, and I plot the means with their corresponding modes, to indicate whether the distributions are multi-modal. Further, I plot the sum of the distributions of all points in the data set; this is easily done by adding the single distributions and normalizing the result. The fourth plot I always make is the log-likelihood. At [http://student.ifs.tuwien.ac.at/~elan/elias/kosice/] a complete description of the experiments, including the Matlab scripts used, can be found.
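In terms of the responsibility matrix R and latent grid X from the sketch in Section 3.2 (my names, so an assumption), all three plotted quantities are one-liners:

    means = R' * X;                   % N x 2: posterior mean per data point
    [~, imax] = max(R, [], 1);        % dominant latent point per data point
    modes = X(imax, :);               % N x 2: posterior mode per data point
    dens = sum(R, 2) / size(R, 2);    % summed distribution over the grid, normalized
    plot(means(:, 1), means(:, 2), 'o');   % the means diagram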

4.1.2 Parameters. The following parameters must be specified when using the GTM toolbox; a more detailed description can be found in [Svensen 1999]. The settings are collected in a short example after the list.

Number of Latent Points. Using an L-dimensional latent space for visualization, it is recommended to have on the order of O(10^L) latent points in the support of each basis function. The latent points lie within a regular square grid; settings such as 10x5 are not possible. With too few points per basis function the smoothness of the mapping is lost. The number of latent points is limited computationally, but a very high number would be ideal.

Number of Basis Functions. A limited number of basis functions necessarily restricts the possible forms the mapping can take. The number of basis functions must be a square of a whole number; settings such as 2x3 are not possible.

Sigma. A scalar giving the relative width of the basis functions. The absolute width is calculated as sigma times the distance between two neighboring basis function centers. When basis functions overlap, their responses are correlated, which causes the smoothness of the mapping. More or narrower basis functions allow a more flexible mapping, while fewer or broader functions prescribe a smoother mapping. Sigma = 1.0 is a good starting point.

Cycles. The GTM implementation in Matlab uses a batch algorithm. In the test runs I made, the log likelihood had converged after 5 to 30 cycles.

Lambda. The weight regularization factor governs the degree of weight decay applied during training. It controls the scaling by restricting the magnitude of the weights. It is recommended to set it to 10^{-3}; all my experiments used this setting.
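Expressed in the variables of the sketch from Section 3.2 (again my names, not toolbox syntax), the settings used for figure 3 below would read:

    kside = 20;     % 20x20 = 400 latent points
    mside = 2;      % 2x2 = 4 basis functions
    sigma = 3.0;    % relative basis width
    cycles = 10;    % training cycles
    lambda = 1e-3;  % weight regularization factor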

4.2 Animals

4.2.1 Description. A toy example, very useful for testing the various parameters: small, easy to handle, and intuitively interpretable. 16 animals, described by 13 attributes.

4.2.2 Raw Data Experiments. The following four diagrams illustrate the effects of the GTM parameters.

Fig. 1. Legend for the figures 2, 3 and 4.

Fig. 2. Parameters: 3x3 latent points, 2x2 basis functions, 1.0 sigma, 10 cycles.

A low number of latent points produces a low resolution of the distribution. Since this dataset only contains a few basic clusters, a low number of latent points is sufficient. In figure 2 (means) you can see that the clusters are separated: horse, zebra and cow are mapped to the upper left; dove, hen, duck and goose to the upper right; tiger and lion to the middle left; cat to the center; owl, hawk and eagle to the middle and lower right; fox, dog and wolf to the lower left. Notice that convergence is reached after about 5 cycles.

Figure 3 shows the effect of a higher number of latent points: the form of the distribution has a higher resolution. Compared to figure 2, a clearer picture of the data structure is revealed. The low number of basis functions and the high sigma cause the smoothness of the mapping. Each hill in the distribution represents a cluster. The peak at (-1, 1) is caused by two identical vectors. Because the standard setup procedure uses a principal component initialization, the clusters are almost at the same locations as in figure 2.

Fig. 3. Parameters: 20x20 latent points, 2x2 basis functions, 3.0 sigma, 10 cycles.

Fig. 4. Parameters: 20x20 latent points, 2x2 basis functions, 0.2 sigma, 10 cycles.

Decreasing sigma reduces the smoothness of the mapping, as can be seen in figure 4. The hills in the distribution are much higher and rougher than in figure 3. Notice that the distributions of the data points are now multi-modal (modes diagram).

Increasing the number of basis functions dramatically increases the flexibility of the mapping. Figure 5 shows the result: the distribution is spiked. Because there are more basis functions than data points, the log-likelihood diagram looks strange. Even with these parameter settings, topology and clusters are basically correct.

Fig. 5. Parameters: 20x20 latent points, 5x5 basis functions, 0.2 sigma, 10 cycles.

4.2.3 Evaluation. I encountered no problems with this dataset. The results made sense and the standard functions worked fine. Computation time was no problem, the generated clusters look very nice, and finding good parameters was not difficult.

4.3 MIS

4.3.1 Description. This data describes characteristics of software modules (size, complexity measures, ...); a medium-sized data set with low dimensionality: 420 vectors, described by 13 attributes.

Fig. 6. Legend for the figures 7, 8 and 9. The numbers represent how often the source code has been modified. A plotted circle represents a source code which has been modified up to 50 times, and so on.

4.3.2 Raw Data Experiments. It is difficult to find parameters that generate a result other than a mapping onto one point. Figure 7 shows one of my best tries. Notice the log-likelihood diagram; I suspect a numerical error. On the right side of the means diagram there are mainly data points representing source codes which have been modified a lot. On the left side there is one big cluster of the rest.

Fig. 7. Parameters: 20x20 latent points, 10x10 basis functions, 2.5 sigma, 10 cycles.

Fig. 8. Parameters: 30x30 latent points, 2x2 basis functions, 1.0 sigma, 20 cycles.

4.3.3 Normalized by Attribute Experiments. In figure 8, data points representing a high modification count are again mainly on the right side of the means diagram, and there is one big cluster of the rest. Notice that the distribution has a form similar to waves; possibly this is a side effect of the GTM algorithm. This strange form was generated with almost any parameter setting.

Fig. 9. Parameters: 20x20 latent points, 2x2 basis functions, 4.0 sigma, 20 cycles.

4.3.4 Vector Length Normalized Experiments. In figure 9 there seems to be no left-right separation between data points with a low and a high modification count. What can be seen is that there are small clusters of similar data points. One example is the top right, where many squares occupy the same space. This can also be seen in the distribution: there is a high peak at (-1, -1).

4.3.5 Evaluation. I used the standard procedures. Time was not a real problem, but I was not able to produce good-looking results with the raw data. The convergence problems I had might be caused by numerical errors. With the vector lengths normalized, a very strange form was generated; it reminded me of waves in water after dropping a stone. Normalizing the attributes produced more or less random results; at least I could not detect any structure.

4.4 TIME Magazine

4.4.1 Description. Newspaper articles from the TIME Magazine of the 1960s; a medium-sized data set with very high dimensionality: 420 vectors (articles), described by 5923 attributes.

4.4.2 Vector Length Normalized Experiments. Matlab does not provide functions that would make it easy to analyze maps with many (420) vectors. To read a document plotted at a certain point on the map, I have to find out by hand which vector it is. One way to do this is to mark one vector after the other with a special symbol (for example '+' instead of 'o'), which makes it possible to distinguish that vector from the others; analyzing a whole map this way is impossible within acceptable time limits. The approach I chose was a comparison, as sketched below. I used the results of Michael Dittenbach, who has trained a flat SOM and labeled the nodes. I took some nodes that 'looked good' and marked these clusters with symbols so I could recognize them in the diagrams.
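The marking itself is a simple plotting trick; a sketch under the naming assumptions of Section 4.1.1, where cluster is a hypothetical index vector of the documents belonging to one labeled SOM node:

    plot(means(:, 1), means(:, 2), 'o'); hold on
    plot(means(cluster, 1), means(cluster, 2), '+');  % highlight one labeled cluster
    hold off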

viet(1)          SOM node (1, 8);   Content: South Vietnam, religious crisis.
viet(2)          SOM node (1, 7);   Content: South Vietnam.
viet(3)          SOM node (1, 6);   Content: South Vietnam, military.
moscow, ...      SOM node (11, 7);  Content: Russia, communism, Khrushchev.
khrushch(1)      SOM node (11, 8);  Content: Khrushchev, Cold War.
khrushch(2)      SOM node (10, 7);  Content: Russia, economy.
khrushch(3)      SOM node (10, 8);  Content: Russia, culture.
zanzibar, kenya  SOM node (3, 4);   Content: Kenya, Zanzibar.
nato, nuclear    SOM node (2, 11);  Content: NATO, nuclear weapons.

Fig. 10. Legend for the figures 11 and 12.

Figure 11 shows a mapping of the TIME magazine documents. The clusters are basically identical with those found by the SOM. Especially the South Vietnam cluster is separated very nicely. Overlap of the symbols '+' and 'x' produces a '*', as seen at the top left of the means diagram. Because of the low number of basis functions, the clusters in the center are cramped. The connection between the triangles pointing down and the pentagrams came out very nicely. Notice the convergence after the second cycle; this can be observed with all other parameter settings as well.

Figure 12 shows a mapping with more basis functions. The clusters are the same again, except that the Russian documents have moved closer together. It is hard to recognize in this plot, but the triangles pointing down [khrushch(1)] are not all in the same region: all but one are at the top left with the other documents on Russia, and the remaining one is at the top right with the documents on NATO and nuclear weapons. The reason is that this document is about Russian nuclear weapons. The South Vietnam cluster is now located at the lower right. While with the MIS and animals experiments principal component analysis was used to initialize the variables, with the TIME dataset I had to use a random initialization.

4.4.3 Evaluation. The quality of the mappings seems to be good. The results are very similar to those of the flat SOM: I observed 9 clusters that had been found by the SOM, and GTM found the same clusters and topology with almost any parameter settings.

For practical use with text analysis it would be necessary to develop a better user interface; it is difficult to match the plotted diagrams to the corresponding documents.

Fig. 11. Parameters: 10x10 latent points, 3x3 basis functions, 2.0 sigma, 5 cycles.

Fig. 12. Parameters: 10x10 latent points, 4x4 basis functions, 4.0 sigma, 5 cycles.

I encountered several problems while working with this dataset. First of all it was not possible to use the standard procedures because of the high dimensionality. The principal component analysis which is used by default to initialize the model took too long, and in my case also too much memory (over 500MB). The GTM toolbox also provides the possibility of a random initialization, which I used. The next problem was numerical errors: some of the functions in the GTM toolbox by default use algorithms which are fast but not very precise, which caused divisions by zero when the means were calculated. Another problem was the efficiency of some calculations; I was able to solve it by changing some matrix multiplications into loops. Clearly this toolbox was not developed for high-dimensional data. The remaining problem is the time it takes to initialize, train and visualize a map. With average parameters one run takes me about 20 minutes; almost half of that time is spent calculating the distribution of the entire dataset (if plotting modes and means is enough, a run only takes half the time).
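The loop rewrite trades vectorization for memory. Roughly, as a hypothetical sketch (with Y = Phi*W the K mixture centers and T the N x Dd data matrix): instead of forming the full distance computation in one expression, process one center at a time.

    dist = zeros(K, N);
    for i = 1:K
        diff = T - Y(i, :);              % N x Dd: one mixture center at a time
        dist(i, :) = sum(diff.^2, 2)';   % squared distances to center i
    end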

5. CONCLUSION

I have presented the results of experiments with different datasets, explained the problems of the GTM toolbox for Matlab, and shown the similarity between GTM and SOM.

For datasets with only a few clusters the GTM is a good alternative to the SOM. For datasets with many unclear clusters, as in text data mining, it is necessary to develop a better interface for working with the results.

REFERENCES

Bishop, C. M., Svensén, M., and Williams, C. K. I. 1997. GTM: The Generative Topographic Mapping.

Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59-69.

Kohonen, T. 1995. Self-Organizing Maps. Springer, Berlin.

Svensen, M. 1999. The GTM Toolbox - User’s Guide.