
PMSB 2006, Tuusula (Finland)

A. Bertoni, G. Valentini, DSI - Univ. Milano

Alberto Bertoni, Giorgio Valentini {bertoni,valentini}@dsi.unimi.it

http://homes.dsi.unimi.it/~valenti

Model order selection for clustered bio-molecular data

DSI - Dipartimento di Scienze dell’Informazione

Università degli Studi di Milano



Motivations and objectives

Bio-medical motivations:

• Finding “robust” subclasses of pathologies using bio-molecular data.

• Discovering reliable structures in high-dimensional bio-molecular data.

More general motivations:

• Assessing the reliability of clusterings discovered in high-dimensional data

• Estimating the significance of the discovered clusterings

Objectives:

• Development of stability-based methods designed to discover structures in high-dimensional bio-molecular data

• Development of methods to find multiple and hierarchical structures in the data

• Assessing the significance of the solutions through the application of statistical tests in the context of unsupervised model order selection problems.



Model order selection through stability-based procedures

• In this conceptual framework multiple clusterings are obtained by introducing perturbations (e.g. subsampling, Ben-Hur et al., 2002; noise injection, McShane et al., 2003) into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations.

A general stability-based procedure to estimate the reliability of a given clustering:

1. Randomly perturb the data many times according to a given perturbation procedure.

2. Apply a given clustering algorithm to the perturbed data.

3. Apply a given clustering similarity measure (e.g. Jaccard similarity) to multiple pairs of k-clusterings obtained according to steps 1 and 2.

4. Use the similarity measures to assess the stability of a given clustering.

5. Repeat steps 1 to 4 for multiple values of k and select the most stable clustering(s) as the most reliable.
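Step 3 can be illustrated with a pair-counting Jaccard coefficient between two labelings of the same set of points. This is a minimal sketch, assuming the pair-counting form of the Jaccard similarity; the function name `jaccard_similarity` is hypothetical and does not come from the slides.

```python
from itertools import combinations


def jaccard_similarity(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two clusterings of the same points.

    labels_a[i] and labels_b[i] are the cluster labels assigned to point i.
    """
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1          # pair co-clustered in both clusterings
        elif same_a:
            n10 += 1          # co-clustered only in the first clustering
        elif same_b:
            n01 += 1          # co-clustered only in the second clustering
    denom = n11 + n10 + n01
    # Convention (an assumption): two clusterings with no co-clustered pairs agree fully.
    return n11 / denom if denom else 1.0
```

The measure depends only on which pairs of points share a cluster, so it is unaffected by a relabeling of the clusters.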



A stability based method based on random projections (1)

• Data perturbation through a randomized mapping $\mu: \mathbb{R}^d \to \mathbb{R}^{d'}$, $d' < d$, such that for every pair $p, q \in \mathbb{R}^d$ and $0 < \epsilon < 1/2$ the distance between $\mu(p)$ and $\mu(q)$ approximates the distance between $p$ and $q$ up to a distortion $\epsilon$.

• An example of a randomized mapping (Plus-Minus-One randomized map, Achlioptas, 2001): $\mu(p) = Ap$, where $A$ is a $d' \times d$ matrix with $A_{ij} \in \{-1, 1\}$ and $P(A_{ij} = 1) = P(A_{ij} = -1) = 1/2$.

• In (Bertoni and Valentini, 2006) we proposed to choose $d'$ according to the Johnson-Lindenstrauss (JL) lemma (1984):

Given a data set $D$ with $|D| = n$ examples there exists an $\epsilon$-distortion embedding into $\mathbb{R}^{d'}$ with $d' = c \log n / \epsilon^2$, where $c$ is a suitable constant.

• Using randomized maps that obey the JL lemma, we may perturb the data introducing only bounded distortions, approximately preserving the structure of the original data.
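A Plus-Minus-One projection with a JL-style target dimension might be sketched as follows. The helper names `jl_dimension` and `pmo_projection` are hypothetical, the constant `c` is left to the caller because the slides do not fix it, and the 1/sqrt(d') scaling is the normalization used in Achlioptas's construction rather than something stated on the slide.

```python
import math
import random


def jl_dimension(n, eps, c):
    """Target dimension d' = c * log(n) / eps^2, rounded up (c is caller-supplied)."""
    return math.ceil(c * math.log(n) / eps ** 2)


def pmo_projection(data, d_prime, rng):
    """Project each row of `data` (a list of d-dimensional lists) to d_prime dimensions.

    A has entries drawn uniformly from {-1, +1}; the result is scaled by
    1/sqrt(d_prime) (an assumption, see the lead-in).
    """
    d = len(data[0])
    scale = 1.0 / math.sqrt(d_prime)
    A = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(d_prime)]
    return [[scale * sum(a * x for a, x in zip(row, p)) for row in A] for p in data]


# Usage: two independent perturbations of the same data set.
rng = random.Random(0)
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
proj_a = pmo_projection(points, 2, rng)
proj_b = pmo_projection(points, 2, rng)
```

Each call draws a fresh matrix, so repeated calls give independent perturbations of the same data.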



A stability based method based on random projections (2): the MOSRAM algorithm

MOSRAM (Model Order Selection by Randomized Maps):

Input:
    D: a dataset;
    kmax: max number of clusters;
    n: number of pairs of random projections;
    μ: a randomized map;
    Clust: a clustering algorithm;
    sim: a clustering similarity measure.

Output:
    M(i,k): list of similarity measures for each k (1≤i≤n, 2≤k≤kmax)

begin
    for k := 2 to kmax do
        for i := 1 to n do
            proja := μ(D)
            projb := μ(D)
            Ca := Clust(proja, k)
            Cb := Clust(projb, k)
            M(i,k) := sim(Ca, Cb)
        endfor
    endfor
end.
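The pseudocode translates almost line by line into Python. In this sketch the randomized map, the clustering algorithm and the similarity measure are passed in as callables; the function name `mosram` and the dictionary layout of the result are illustrative choices, not part of the original algorithm description.

```python
def mosram(D, kmax, n, mu, clust, sim):
    """Return M, where M[k] is the list of n similarities measured for k clusters.

    mu(D)       -> a randomly perturbed copy of D (a new draw on every call)
    clust(X, k) -> cluster labels for the rows of X, using k clusters
    sim(a, b)   -> similarity in [0, 1] between two labelings
    """
    M = {}
    for k in range(2, kmax + 1):
        M[k] = []
        for _ in range(n):
            proj_a = mu(D)            # two independent perturbations of D
            proj_b = mu(D)
            c_a = clust(proj_a, k)
            c_b = clust(proj_b, k)
            M[k].append(sim(c_a, c_b))
    return M
```

Any clustering routine that returns one label per data point can be plugged in as `clust`, for example a PAM or k-means implementation.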



Using the distribution of the similarities to estimate the stability

• $S_k$ ($0 \leq S_k \leq 1$) is the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm to pairs of random, independently perturbed data: $S_k = sim(Clust(\mu(D), k), Clust(\mu(D), k))$, where $\mu$ is a randomized perturbation procedure. The intuitive idea is that if $S_k$ is concentrated close to 1, the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.

• $f_k(s)$ is the density function of $S_k$, with cumulative distribution function $F_k(\bar{s}) = \int_{-\infty}^{\bar{s}} f_k(s)\,ds$.

• $g(k)$ is a parameter of concentration (Ben-Hur et al., 2002).

• We may observe the following fact: $E[S_k]$ can be used as a good index of the reliability of the k-clusterings.

• $E[S_k]$ may be estimated through the empirical mean $\xi_k = \frac{1}{n} \sum_{i=1}^{n} M(i,k)$, where $M(i,k)$ is the i-th similarity measured for the k-clusterings.

• Note that we use the overall distribution of the similarity measures to assess the stability of the k-clusterings.
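The empirical means can be computed directly from the similarity lists produced by the stability procedure. A minimal sketch, assuming the similarities are stored as a dictionary mapping each k to its list of n values; the name `stability_summary` is hypothetical, and the use of the population variance is an assumption, since the slides do not say which estimator is reported.

```python
from statistics import mean, pvariance


def stability_summary(M):
    """Return (k, empirical mean, variance) tuples sorted by decreasing mean."""
    rows = [(k, mean(sims), pvariance(sims)) for k, sims in M.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Usage with made-up similarity values:
summary = stability_summary({2: [1.0, 1.0], 3: [0.5, 0.7]})
```

The first row of the result is the most stable clustering according to the empirical mean.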



A χ²-based method to estimate the significance of the discovered clusterings (1)

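• We may sort the empirical means $\xi_k$, $k \in K = \{2, 3, \dots, k_{max}\}$: $p$ is the index permutation such that $\xi_{p(1)} \geq \xi_{p(2)} \geq \dots \geq \xi_{p(|K|)}$.

• For each k-clustering, we consider two groups of pairwise clustering similarity values separated by a threshold $t^o$. Thus we may obtain: $P(S_k > t^o) = 1 - F_k(t^o)$.

• $x_k = P(S_k > t^o)\,n$ is the number of times the similarity values are larger than $t^o$, where $n$ is the number of repeated similarity measurements. Hence $x_k$ may be interpreted as the number of successes from a binomial population with parameter $\theta_k$.

• Setting $X_k$ as a random variable that counts how many times $S_k > t^o$, we have, for sufficiently large $n$, approximately: $\frac{X_k - n\theta_k}{\sqrt{n\theta_k(1-\theta_k)}} \sim N(0,1)$.

• The unknown $\theta_k$ is estimated through its pooled estimate $\hat{\theta} = \frac{\sum_{k \in K} x_k}{|K|\,n}$.

We can compute the following statistic: $Y = \sum_{k \in K} \frac{(x_k - n\hat{\theta})^2}{n\hat{\theta}(1-\hat{\theta})} \sim \chi^2_{|K|-1}$.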
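A small worked example may help to fix ideas. It assumes the standard χ² statistic for testing the equality of several proportions with a pooled estimate, and uses made-up counts ($n = 10$ measurements for each of three clusterings, with $x = (10, 9, 0)$ values above the threshold); with so few measurements the normal approximation is rough, so the numbers are purely illustrative.

```latex
\hat{\theta} = \frac{10 + 9 + 0}{3 \cdot 10} = \frac{19}{30},
\qquad n\hat{\theta} = \frac{19}{3},
\qquad n\hat{\theta}(1-\hat{\theta}) = 10 \cdot \frac{19}{30} \cdot \frac{11}{30} = \frac{209}{90}

Y = \frac{(10 - \tfrac{19}{3})^2 + (9 - \tfrac{19}{3})^2 + (0 - \tfrac{19}{3})^2}{\tfrac{209}{90}}
  = \frac{546/9}{209/90} = \frac{5460}{209} \approx 26.12

\chi^2_{0.01,\,2} = -2\ln(0.01) \approx 9.21 < Y
```

Since $Y$ exceeds the critical value with $|K| - 1 = 2$ degrees of freedom, the three proportions would be declared different at the 0.01 level in this toy case.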



A χ²-based method to estimate the significance of the discovered clusterings (2)

• Using the previous Y statistic we can test the following alternative hypotheses:
- $H_0$: all the $\theta_k$ are equal to $\theta$ (the considered k-clusterings are equally reliable)
- $H_a$: the $\theta_k$ are not all equal (the considered k-clusterings are not equally reliable)

• If $Y \geq \chi^2_{\alpha, |K|-1}$ we may reject the null hypothesis at significance level $\alpha$, that is we may conclude that with probability $1-\alpha$ the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.

• The test is iterated until no significant difference of the similarities between the k-clusterings is detected. We start by considering all the k-clusterings. If a difference at significance level $\alpha$ is registered according to the statistical test, we exclude the last clustering (according to the sorting of the $\xi_k$) and repeat the test with the remaining k-clusterings. This process is iterated until no significant difference is detected: the remaining (top sorted) k-clusterings represent the estimated stable numbers of clusters (at significance level $\alpha$).
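The iterated test can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the statistic is the standard χ² test for equal proportions with a pooled estimate, the χ² tail probability is computed with the incomplete-gamma recurrence for integer degrees of freedom, and the names `chi2_sf`, `y_statistic` and `stable_clusterings`, as well as the handling of a pooled estimate equal to 0 or 1, are choices made here.

```python
import math


def chi2_sf(y, df):
    """P(chi-square with integer df >= 1 exceeds y), for y >= 0."""
    z = y / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                 # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))      # Q(1/2, z)
    while a < df / 2.0:                          # Q(a+1, z) = Q(a, z) + z^a e^-z / Gamma(a+1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def y_statistic(x, n):
    """Chi-square statistic for equality of the proportions x[j] / n."""
    theta = sum(x) / (len(x) * n)                # pooled estimate
    if theta in (0.0, 1.0):
        return 0.0                               # assumption: identical counts, no evidence of difference
    return sum((xk - n * theta) ** 2 for xk in x) / (n * theta * (1.0 - theta))


def stable_clusterings(M, t0, alpha):
    """M[k] is the list of n similarities for k clusters; returns the retained k values."""
    n = len(next(iter(M.values())))
    ks = sorted(M, key=lambda k: sum(M[k]) / n, reverse=True)   # sort by empirical mean
    x = {k: sum(s > t0 for s in M[k]) for k in ks}              # counts above threshold
    while len(ks) > 1:
        y = y_statistic([x[k] for k in ks], n)
        if chi2_sf(y, len(ks) - 1) > alpha:
            break                                # no significant difference: stop
        ks.pop()                                 # drop the last (least stable) clustering
    return ks
```

The threshold `t0` and the level `alpha` are left to the caller; the experiments reported in the following slides use a 0.01 significance level.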



Experiments with high dimensional synthetic data (I)

• 1000-dimensional synthetic data

• data distributed according to a multivariate gaussian distribution

• 2 or 6 clusters of data (as highlighted by the PCA projection to the two principal components)

Histograms of the similarity measures obtained by applying PAM clustering to 100 pairs of PMO projections from 1000 to 471-dimensional subspaces (ε=0.2):



Experiments with high dimensional synthetic data (II)

2 and 6 clusters are selected at 0.01 significance level.

Similarity (rows sorted according to the means):

k     p-value   mean     variance
2     ----      1.0000   0.0000
6     1.0000    1.0000   0.0000
7     0.0000    0.9217   0.0016
8     0.0000    0.8711   0.0033
9     0.0000    0.8132   0.0042
5     0.0000    0.8090   0.0104
3     0.0000    0.8072   0.0157
10    0.0000    0.7715   0.0056
4     0.0000    0.7642   0.0158

Figure: Empirical cumulative distribution of the similarity measures for different k-clusterings.



Detection of multiple structures

3, 6 and 12 clusters are selected at 0.01 significance level.

k     p-value      mean     variance
3     --------     1.0000   0.0000e+00
6     1.0000e+00   0.9979   1.6185e-05
12    1.0000e+00   0.9907   8.0657e-05
13    6.9792e-03   0.9809   2.8658e-04
14    2.2928e-06   0.9754   3.3594e-04
15    0.0000e+00   0.9580   6.8150e-04
7     0.0000e+00   0.9435   2.3055e-03
8     0.0000e+00   0.8954   4.6829e-03
5     0.0000e+00   0.8947   1.5433e-02
11    0.0000e+00   0.8897   3.2340e-03
9     0.0000e+00   0.8706   6.9421e-03
10    0.0000e+00   0.8691   5.0763e-03
4     0.0000e+00   0.8609   9.3463e-03
2     0.0000e+00   0.8532   2.3234e-02




Discovering significant structures in bio-molecular data (Leukemia data, Golub et al., 1999)

2 and 3 clusters are selected at 0.01 significance level

Similarity:

k     p-value      mean     variance
2     ---------    0.8285   0.0077
3     7.3280e-01   0.8060   0.0124
4     2.3279e-06   0.6589   0.0060
5     9.5199e-11   0.6012   0.0073
6     6.3282e-15   0.5424   0.0057
7     0.0000e+00   0.5160   0.0062
8     0.0000e+00   0.4865   0.0050
9     0.0000e+00   0.4819   0.0060
10    0.0000e+00   0.4744   0.0049




Comparison with other methods

Method                                     Leukemia                Lymphoma
                                           (Golub et al., 1999)    (Alizadeh et al., 2000)
Class. risk (Lange et al., 2004)           k=3                     k=2
Gap statistic (Tibshirani et al., 2001)    k=10                    k=4
Clest (Dudoit and Fridlyand, 2002)         k=3                     k=2
Figure of Merit (Levine & Domany, 2001)    k=2,8,10                k=2,9
Model Explorer (Ben-Hur et al., 2002)      k=2                     k=2
MOSRAM                                     k=2,3                   k=2
"True" number k                            k=2,3                   k=2,(3)*

* Note that the subdivision of Lymphoma samples in 3 classes (DLBCL, CLL and FL) is performed on histopathological and morphological basis and this classification does not seem to correspond to the bio-molecular classification (Alizadeh et al., 2000)



Conclusions

• The proposed stability method based on random projections is well-suited to discover structures in high-dimensional bio-medical data.

• The reliability of the discovered k-clusterings may be estimated by exploiting the distribution of the pairwise clustering similarities and a χ²-based statistical test tailored to unsupervised model order selection.

• The χ²-based test assumes that the random variables are normally distributed. We are developing a new distribution-independent approach based on the Bernstein inequality to assess the significance of the discovered k-clusterings.
