General Prediction Strength Methods for Estimating the
Number of Clusters in a Dataset
Moira Regelson, Ph.D.
September 15, 2005
Motivation
• k-means (or k-medoids) clustering will happily divide any dataset into k clusters, whether or not k is appropriate.
Elongated Clusters
[Scatter plot: two elongated clusters in the plane, both axes spanning roughly 8 to 22]
Overview
• Review of previous methods
• Re-formulation and extension of Tibshirani’s prediction strength method
• Contrast results for different cluster configurations
• Application to gene co-expression network
Different Methods for Deciding Number of Clusters
• Methods based on internal indices – depend on between- and within-cluster sums of squared error (BSS and WSS)
• Methods based on external indices – depend on comparisons between different partitionings
• Evaluate the indices for different values of k and decide which is “best”
Internal Index Methods
Internal Indices
• Calinski & Harabasz
• Hartigan
• Krzanowski & Lai
• n=number of samples
• p=dimension of samples
Calinski and Harabasz (1974)
• For each number of clusters k ≥ 2, define the index

  I_k = \frac{\mathrm{trace}(BSS_k)/(k-1)}{\mathrm{trace}(WSS_k)/(n-k)}

• The estimated number of clusters is the k which maximizes I_k.
Hartigan
• For each number of clusters k ≥ 1, define the index

  I_k = \left(\frac{\mathrm{trace}(WSS_k)}{\mathrm{trace}(WSS_{k+1})} - 1\right)(n - k - 1)

• The estimated number of clusters is the smallest k ≥ 1 such that I_k ≤ 10.
Krzanowski and Lai (1985)
• For each number of clusters k ≥ 2, define the indices

  d_k = (k-1)^{2/p}\,\mathrm{trace}(WSS_{k-1}) - k^{2/p}\,\mathrm{trace}(WSS_k), \quad\text{and}\quad I_k = \left|\frac{d_k}{d_{k+1}}\right|

• The estimated number of clusters is the k which maximizes I_k.
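As a concrete illustration (not the authors' code), the three internal indices can be sketched in Python from a sequence of trace(WSS_k) values; the `wss` curve and sample sizes below are invented numbers chosen to have an obvious elbow at k = 2.

```python
def calinski_harabasz(bss_k, wss_k, n, k):
    """CH index: between/within variance ratio; maximize over k >= 2."""
    return (bss_k / (k - 1)) / (wss_k / (n - k))

def hartigan_estimate(wss, n, threshold=10.0):
    """Smallest k with I_k = (WSS_k/WSS_{k+1} - 1)(n - k - 1) <= threshold.
    wss[i] is trace(WSS) for k = i + 1 clusters."""
    for k in range(1, len(wss)):
        i_k = (wss[k - 1] / wss[k] - 1.0) * (n - k - 1)
        if i_k <= threshold:
            return k
    return len(wss)

def krzanowski_lai_estimate(wss, p):
    """k maximizing I_k = |d_k / d_{k+1}|, with
    d_k = (k-1)^(2/p) * WSS_{k-1} - k^(2/p) * WSS_k."""
    def d(k):  # valid for k >= 2
        return (k - 1) ** (2 / p) * wss[k - 2] - k ** (2 / p) * wss[k - 1]
    candidates = range(2, len(wss))  # need both d(k) and d(k+1)
    return max(candidates, key=lambda k: abs(d(k) / d(k + 1)))

# Toy WSS curve with a clear elbow at k = 2 (invented numbers):
wss = [1000.0, 200.0, 190.0, 185.0]
print(hartigan_estimate(wss, n=100))      # -> 2
print(krzanowski_lai_estimate(wss, p=2))  # -> 2
```

Both elbow detectors agree on this toy curve; Calinski–Harabasz would additionally need the BSS trace for each k.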
The silhouette width method (Kaufman and Rousseeuw, 1990)
• Silhouettes compare the average dissimilarity between observation i and the other observations in its own cluster to its dissimilarity from the other clusters.
• a_i = average dissimilarity between observation i and the other observations in its cluster
• b_i = minimum, over the other clusters, of the average dissimilarity between observation i and that cluster’s observations
• The silhouette width of observation i is

  I_{ik} = \frac{b_i - a_i}{\max(a_i, b_i)}
The silhouette width method (cont.)
• The overall silhouette width is the average over all observations:

  I_k = \frac{1}{n} \sum_i I_{ik}

• The estimated number of clusters is the k for which I_k is maximized.
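A minimal sketch of the silhouette computation from a precomputed dissimilarity matrix; the four points on a line and their labels are invented for the example.

```python
def silhouette_widths(dist, labels):
    """Per-observation silhouette I_ik = (b_i - a_i) / max(a_i, b_i),
    given a full dissimilarity matrix and cluster labels."""
    n = len(labels)
    clusters = set(labels)
    widths = []
    for i in range(n):
        # a_i: average dissimilarity to the other members of i's cluster
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        a_i = sum(own) / len(own)
        # b_i: smallest average dissimilarity to any other cluster
        b_i = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        widths.append((b_i - a_i) / max(a_i, b_i))
    return widths

def overall_silhouette(dist, labels):
    w = silhouette_widths(dist, labels)
    return sum(w) / len(w)

# Two tight, well-separated pairs on a line: 0, 0.1, 10, 10.1
pts = [0.0, 0.1, 10.0, 10.1]
dist = [[abs(p - q) for q in pts] for p in pts]
print(overall_silhouette(dist, [0, 0, 1, 1]))  # close to 1
```

With well-separated clusters the width approaches 1; mislabelling one point would drive its individual width negative.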
Gap (uniform) or Gap(pc) (Tibshirani et al., 2000)
• For each number of clusters k,

  I_k = \frac{1}{B}\sum_b \log(\mathrm{trace}(WSS_k^b)) - \log(\mathrm{trace}(WSS_k))

• where the WSS_k^b come from B reference datasets generated under a null distribution.
Gap statistic (cont.)
• The estimated number of clusters is the smallest k ≥ 1 that satisfies

  gap_k \ge gap_{k+1} - s_{k+1}

• s_k = standard deviation over the reference datasets.
• The uniform gap statistic samples from a uniform distribution; the “pc” (principal component) statistic samples from a uniform box aligned with the principal components of the dataset (Sarle, 1983).
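The gap decision rule can be sketched as follows; `obs` and `ref` below are invented stand-ins for log(trace(WSS_k)) on the data and on B reference datasets, constructed so the gap flattens after k = 2.

```python
from statistics import mean, pstdev
from math import sqrt

def gap_estimate(obs_logw, ref_logw):
    """Smallest k with gap_k >= gap_{k+1} - s_{k+1}.
    obs_logw[k-1] = log trace(WSS_k) on the data;
    ref_logw[b][k-1] = the same on reference dataset b."""
    n_b = len(ref_logw)
    gaps, s = [], []
    for k in range(len(obs_logw)):
        ref_k = [row[k] for row in ref_logw]
        gaps.append(mean(ref_k) - obs_logw[k])
        # sd over references, inflated to account for simulation error
        s.append(pstdev(ref_k) * sqrt(1 + 1 / n_b))
    for k in range(len(gaps) - 1):
        if gaps[k] >= gaps[k + 1] - s[k + 1]:
            return k + 1
    return len(gaps)

# Made-up curves where the gap stops growing after k = 2:
obs = [2.0, 1.0, 0.9]
ref = [[2.1, 1.6, 1.5], [2.1, 1.6, 1.5]]
print(gap_estimate(obs, ref))  # -> 2
```

In a full implementation the reference datasets would be drawn uniformly (or in the principal-component box) and clustered for each k to produce `ref_logw`.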
External Index Methods
External Indices/Approaches
• Comparing Partitionings
• Rand Index
• Tibshirani
• Clest
• General Prediction Strength
Comparing Partitionings: The Contingency Table
• Partitionings U = {u1, ..., uR} and V = {v1, ..., vS} of n objects into R and S clusters

  U/V | v1   v2   ...  vS
  u1  | n11  n12  ...  n1S
  u2  | n21  n22  ...  n2S
  ... | ...  ...  ...  ...
  uR  | nR1  nR2  ...  nRS
Comparing Partitionings: The Contingency Table
• n_rs = number of objects in both u_r and v_s.

  U/V | v1   v2   ...  vS
  u1  | n11  n12  ...  n1S
  u2  | n21  n22  ...  n2S
  ... | ...  ...  ...  ...
  uR  | nR1  nR2  ...  nRS
Comparing Partitionings: The Contingency Table
• n_{r.} = \sum_{s=1}^{S} n_{rs} = total points in cluster u_r
• n_{.s} = \sum_{r=1}^{R} n_{rs} = total points in cluster v_s

  U/V | v1   v2   ...  vS
  u1  | n11  n12  ...  n1S
  u2  | n21  n22  ...  n2S
  ... | ...  ...  ...  ...
  uR  | nR1  nR2  ...  nRS
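A contingency table and its marginals can be computed directly from two label vectors; the two five-object partitionings below are invented for the example.

```python
from collections import Counter

def contingency(u_labels, v_labels):
    """table[r][s] = number of objects in cluster u_r and cluster v_s."""
    rows = sorted(set(u_labels))
    cols = sorted(set(v_labels))
    counts = Counter(zip(u_labels, v_labels))
    return [[counts[(r, c)] for c in cols] for r in rows]

u = [0, 0, 1, 1, 1]
v = [0, 1, 1, 1, 1]
table = contingency(u, v)
print(table)                                     # [[1, 1], [0, 3]]
row_totals = [sum(row) for row in table]         # n_{r.}
col_totals = [sum(col) for col in zip(*table)]   # n_{.s}
print(row_totals, col_totals)                    # [2, 3] [1, 4]
```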
Rand Index (Rand, 1971; Hubert and Arabie, 1985)
• Rand index and adjusted Rand index (here m = 2):

  Rand = \frac{\binom{n}{m} + 2\sum_{r,s}\binom{n_{rs}}{m} - \sum_r\binom{n_{r.}}{m} - \sum_s\binom{n_{.s}}{m}}{\binom{n}{m}}

  Adj\,Rand = \frac{\sum_{r,s}\binom{n_{rs}}{m} - \sum_r\binom{n_{r.}}{m}\sum_s\binom{n_{.s}}{m}\Big/\binom{n}{m}}{\frac{1}{2}\left[\sum_r\binom{n_{r.}}{m} + \sum_s\binom{n_{.s}}{m}\right] - \sum_r\binom{n_{r.}}{m}\sum_s\binom{n_{.s}}{m}\Big/\binom{n}{m}}
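For m = 2 the Rand and adjusted Rand indices follow directly from the table's pair counts; a sketch using `math.comb` (the contingency table is invented):

```python
from math import comb

def rand_indices(table):
    """Rand and adjusted Rand (m = 2) from a contingency table."""
    n = sum(sum(row) for row in table)
    sum_rs = sum(comb(x, 2) for row in table for x in row)
    sum_r = sum(comb(sum(row), 2) for row in table)
    sum_s = sum(comb(sum(col), 2) for col in zip(*table))
    total = comb(n, 2)
    rand = (total + 2 * sum_rs - sum_r - sum_s) / total
    expected = sum_r * sum_s / total       # chance agreement
    adj = (sum_rs - expected) / (0.5 * (sum_r + sum_s) - expected)
    return rand, adj

# Identical partitions agree perfectly:
print(rand_indices([[3, 0], [0, 2]]))  # (1.0, 1.0)
```

The adjustment subtracts the agreement expected by chance, which is why Adj Rand can be much lower than Rand on the same pair of partitionings.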
Clustering as a supervised classification problem
• Input data split repeatedly into a training and a test set for a given choice of k (number of clusters)
• Clustering method applied to the two sets to arrive at k “observed” training and test set clusters.
• Use the training data to construct a classifier for predicting the training set cluster labels.
• Apply classifier to test set data -> predicted test set clusters.
• Measure of agreement calculated based on the comparison of predicted to observed test set clusters (external index).
Predicting the number of clusters
• Use cluster reproducibility measures for different k to estimate the true number of clusters in the data set.
• Assumes that choosing the correct number of clusters leads to less random assignment of samples to clusters and to greater cluster reproducibility.
Tibshirani Prediction Strength (Tibshirani et al., 2001)
• Specify kmax and the maximum number of iterations, B.
• For k in {2, ..., kmax}, repeat B times:
1. Split the data set into a training set and a test set.
2. Apply the clustering procedure to partition the training set into k clusters; record the cluster labels as outcomes.
3. Construct a classifier using the training set and cluster labels.
4. Apply the resulting classifier to the test set -> “predicted” labels.
5. Apply the clustering procedure to the test set to arrive at the “observed” labels.
6. Compute a measure of agreement (external index) ps1(k,b) comparing the sets of labels obtained in steps 4 and 5.
Tibshirani PS (cont.)
7. Switch the roles of the test and training sets to arrive at another estimate of the index, ps2(k,b).
8. Use ps1(k,b) and ps2(k,b) to compute the mean value ps(k,b) = (ps1(k,b) + ps2(k,b))/2 and standard error se(k,b) = |ps1(k,b) − ps2(k,b)|/2. Use these to define pse(k,b) = ps(k,b) + se(k,b) = max(ps1(k,b), ps2(k,b)).
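Steps 1–6 can be sketched end to end. This is an illustrative reimplementation, not the authors' code: it substitutes a tiny Lloyd's k-means (with farthest-point initialization) for PAM, nearest-centroid assignment for the classifier, and the m = 2 co-membership fraction for the external index; the two well-separated 1-d clusters are invented test data.

```python
import random

def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means on 1-d points; farthest-point init for stability."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: abs(p - centers[j]))].append(p)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def assign(points, centers):
    return [min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points]

def co_membership_agreement(pred, obs, k):
    """Worst-case fraction of within-cluster pairs (under the observed test
    clustering) that the predicted labels also place together."""
    worst = 1.0
    for s in range(k):
        idx = [i for i, o in enumerate(obs) if o == s]
        pairs = [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:]]
        if not pairs:
            continue
        same = sum(1 for i, j in pairs if pred[i] == pred[j])
        worst = min(worst, same / len(pairs))
    return worst

random.seed(0)
data = ([random.gauss(0, 0.5) for _ in range(40)]
        + [random.gauss(10, 0.5) for _ in range(40)])
random.shuffle(data)
train, test = data[:40], data[40:]        # step 1: split

k = 2
train_centers = kmeans(train, k)          # steps 2-3: cluster the training set
pred = assign(test, train_centers)        # step 4: "classify" the test set
obs = assign(test, kmeans(test, k))       # step 5: cluster the test set
print(co_membership_agreement(pred, obs, k))  # step 6: near 1.0 here
```

Repeating this over B random splits (and swapping the roles of the halves, steps 7–8) yields the pse(k,b) values.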
Tibshirani PS (cont.)
• pse(k) = median of pse(k,b) over all random splits
• Values of pse(k) used to estimate the number of clusters in the dataset using a threshold rule
[Plot: PSE(k, m=2) versus k = 2, ..., 10, values ranging from 0 to 1]
Clest (Dudoit and Fridlyand, 2002)
• Step “A” is identical to steps 1–6 of Tibshirani PS. Denote the external indices computed in step A.6 by (s_{k,1}, s_{k,2}, ..., s_{k,B}). Then
B. Let t_k = median(s_{k,1}, ..., s_{k,B}) denote the observed similarity statistic for the k-cluster partition of the data.
C. Generate B0 datasets under the null hypothesis of k = 1. Briefly, for each reference dataset, repeat the procedure described in steps A and B above to obtain B0 similarity statistics t_{k,1}, ..., t_{k,B0}.
• Let \bar{t}_k^{\,0} denote the average of the B0 statistics.
• Let d_k = t_k − \bar{t}_k^{\,0} denote the difference between the observed similarity statistic and its estimated expected value under the null hypothesis of k = 1.
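The Clest decision quantity d_k can be sketched in a few lines; the observed and null similarity statistics below are invented placeholders for the values produced in steps A–C.

```python
from statistics import median, mean

def clest_dk(observed_stats, null_stats):
    """d_k = t_k - mean of null statistics, where t_k is the median
    observed similarity for the k-cluster partition."""
    t_k = median(observed_stats)       # step B
    t0_k = mean(null_stats)            # average over the B0 reference sets
    return t_k - t0_k

# Observed agreement clearly above the k = 1 null (invented numbers):
print(clest_dk([0.9, 0.85, 0.95], [0.5, 0.55, 0.45]))  # close to 0.4
```

Large positive d_k indicates the k-cluster partition is more reproducible than expected under the one-cluster null.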
General Prediction Strength
• Re-formulation of Tibshirani PS
• Extension to m-tuplets
Tibshirani re-formulation
• Originally, measure of agreement formulated as
• Note: partitioning around medoid (PAM) clustering used
  ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{x_i \ne x_{i'} \in v_s} I\left(D[U(X_{te})]_{ii'} = 1\right)

  U(X_{te})_j = \arg\min_r \mathrm{dist}(\mathrm{medoid}(u_r), x_j), \quad x_j \in X_{te}

  D[U]_{ii'} = \begin{cases} 1 & x_i, x_{i'} \in u_r \text{ for some cluster } u_r \in U \\ 0 & \text{otherwise} \end{cases}
General Prediction Strength
• Re-formulation of Tibshirani PS and extension to m-tuplets:

  PS(k, m) = \min_{1 \le s \le k} p_m(s), \quad \text{where } p_m(s) = \frac{\sum_{r=1}^{k} \binom{n_{rs}}{m}}{\binom{n_{.s}}{m}}

• Adding a standard error in cross-validation gives PSE(k, m), which is used with a threshold.
• Intuitive interpretation: the fraction of m-tuplets in a test-set cluster that co-cluster in the training set.
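PS(k, m) can be computed directly from a contingency table with `math.comb`. As a check, the sketch below reproduces the asymmetry example on the next slide (0.17 one way, 0.83 with the roles of U and V reversed).

```python
from math import comb

def prediction_strength(table, m):
    """PS(k, m) = min over columns s of sum_r C(n_rs, m) / C(n_.s, m)."""
    scores = []
    for col in zip(*table):
        n_dot_s = sum(col)
        if n_dot_s < m:
            continue  # column too small to form any m-tuplet
        scores.append(sum(comb(x, m) for x in col) / comb(n_dot_s, m))
    return min(scores)

# Contingency table from the asymmetry example:
table = [
    [5,  0,  0,  0,  0],
    [5, 50,  0,  0,  0],
    [5,  0, 50,  0,  0],
    [5,  0,  0, 50,  0],
    [5,  0,  0,  0, 50],
]
print(round(prediction_strength(table, 2), 2))        # -> 0.17
transposed = [list(row) for row in zip(*table)]
print(round(prediction_strength(transposed, 2), 2))   # -> 0.83
```

The asymmetry arises because the min is taken over the columns only: splitting one reference cluster across many predicted clusters is penalized far more heavily in one direction than the other.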
Asymmetry in PS
• Difference between Rand index and PSE(k,m):
• Rand = 0.95, adjusted Rand = 0.86.
• PSE(k,m=2) = 0.17, but when role of U and V reversed, PSE(k,m=2) = 0.83.
  U/V | v1  v2  v3  v4  v5
  u1  |  5   0   0   0   0
  u2  |  5  50   0   0   0
  u3  |  5   0  50   0   0
  u4  |  5   0   0  50   0
  u5  |  5   0   0   0  50
Tests on Simulated Data
Simulations
1. A single cluster containing 200 points uniformly distributed on (-1, 1) in each of 10 dimensions.
2. Three normally distributed clusters in 2-d with centers at (0,0), (0,5), and (5,-3) and 25, 25, and 50 observations in each respective cluster.
3. Four normally distributed clusters in 3-d with centers randomly chosen from N(0, 5*I) and cluster size randomly chosen from {25, 50}.
4. Four normally distributed clusters in 10-d with centers randomly chosen from N(0, 1.9*I) and cluster size randomly chosen from {25, 50}.
In 3 & 4, simulations with clusters with minimum distance less than one unit were discarded.
Simulations
5. Two elongated clusters in 3-d, generated by choosing equally spaced points between (-0.5, 0.5) and adding normal noise with sd 0.1 to each feature; 10 is then added to each feature of the points in the second cluster.
(a) 100 points per cluster
(b) 200 points per cluster (to illustrate the effect of an increased number of observations)
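Simulation 5 can be generated with a few lines of stdlib Python; this is an illustrative sketch of the recipe above, not the original simulation code.

```python
import random

def elongated_clusters(points_per_cluster, seed=0):
    """Two elongated clusters in 3-d: equally spaced points on the main
    diagonal from -0.5 to 0.5, Gaussian noise (sd 0.1) on each feature,
    and +10 added to every feature of the second cluster."""
    rng = random.Random(seed)
    n = points_per_cluster
    data = []
    for shift in (0.0, 10.0):
        for i in range(n):
            t = -0.5 + i / (n - 1)  # equally spaced in [-0.5, 0.5]
            data.append([t + rng.gauss(0, 0.1) + shift for _ in range(3)])
    return data

data = elongated_clusters(100)   # variant (a); use 200 for variant (b)
print(len(data), len(data[0]))   # 200 3
```

Because each cluster stretches along the diagonal, its within-cluster spread is far larger than the noise sd, which is exactly what trips up variance-based internal indices.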
Elongated Clusters
[Scatter plot: the two elongated clusters of simulation 5, both axes spanning roughly 8 to 22]
Predicted # Clusters (counts over 50 simulated datasets per configuration; in the original table, columns ran “No Pred” followed by k = 1–10 for the competing methods, and k = 1–10 for Prediction Strength with cutoff 0.8)

sim1 (1 cluster, 10d)
Hartigan: 7 27 12 4 | Calinski: NA 3 7 8 12 12 10 | Krzanowski-Lai: NA 8 8 5 7 5 3 5 7 2 | Silhouette: NA 33 4 1 1 1 1 3 6 12 | Gap (uniform): 50 | Gap (pc): 50 | Clest*: 48 2
PS m=2: 50 | m=3: 50 | m=5: 50 | m=10: 50

sim2 (3 clusters, 2d)
Hartigan: 4 5 4 9 8 8 7 5 | Calinski: NA 50 | Krzanowski-Lai: NA 29 3 4 2 1 1 5 5 | Silhouette: NA 6 44 | Gap (uniform): 11 39 | Gap (pc): 12 38 | Clest*: 1 49
PS m=2: 49 1 | m=3: 50 | m=5: 1 2 47 | m=10: 12 5 33

sim3 (4 clusters, 3d)
Hartigan: 11 2 10 10 4 4 4 5 | Calinski: NA 1 6 43 | Krzanowski-Lai: NA 2 6 37 1 1 1 2 | Silhouette: NA 8 13 29 | Gap (uniform): 5 7 16 20 2 | Gap (pc): 12 6 15 17 | Clest*: 1 20 29
PS m=2: 6 44 | m=3: 6 44 | m=5: 1 7 42 | m=10: 4 5 6 35

sim4 (4 clusters, 10d)
Hartigan: 50 | Calinski: NA 2 9 39 | Krzanowski-Lai: NA 42 1 2 1 2 2 | Silhouette: NA 5 11 34 | Gap (uniform): 1 2 20 27 | Gap (pc): 5 6 7 32 | Clest*: 1 49
PS m=2: 3 3 44 | m=3: 5 3 5 38 | m=5: 6 4 5 35 | m=10: 19 4 2 25

sim5a (2 clusters, 3d, 100 pts/cluster)
Hartigan: 35 1 14 | Calinski: NA 15 29 4 2 | Krzanowski-Lai: NA 49 1 | Silhouette: NA 50 | Gap (uniform): 31 9 4 6 | Gap (pc): 50 | Clest*: 44 6
PS m=2: 46 3 1 | m=3: 48 2 | m=5: 49 1 | m=10: 50

sim5b (2 clusters, 3d, 200 pts/cluster)
Hartigan: 50 | Calinski: NA 20 29 1 | Krzanowski-Lai: NA 50 | Silhouette: NA 50 | Gap (uniform): 34 5 6 5 | Gap (pc): 50
PS m=2: 27 23 | m=3: 50 | m=5: 50 | m=10: 50
Results of Simulations
• PS performs consistently well
• Not all values of m perform equally well in all the simulations (m=3 and m=5 do best overall)
• Performance is especially noticeable on the elongated-cluster simulation
• Clest performs comparably to PS
• Of the internal index methods, Hartigan seems least robust
• The Calinski and Krzanowski-Lai indices and the silhouette width method cannot predict a single cluster
Application to Gene Co-Expression Networks
DNA Microarrays
• Expression level of thousands of genes at once
• Lots of processing and normalization
Use of Microarrays
• Within an experiment, compare cell types, e.g. “normal” vs. “diseased”.
• Generally examined for differences in expression levels between cell types.
• Look for genes that characteristically vary with disease.
Gene co-expression network
• Use DNA microarray data to annotate genes by clustering them on the basis of their expression profiles across several microarrays.
• Studying co-expression patterns can provide insight into the underlying cellular processes (Eisen et al., 1998, Tamayo et al., 1999).
Building a network
• Use Pearson correlation coefficient as a co-expression measure.
• Threshold correlation coefficient to arrive at gene co-expression network.
Building a network (cont.)
• Node corresponds to expression profile of a given gene.
• Nodes connected (aij=1) if they have significant pairwise expression profile association across perturbations (cell or tissue samples).
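Thresholding a Pearson correlation matrix into an unweighted adjacency matrix can be sketched as follows; the three toy expression profiles are invented.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def adjacency(profiles, threshold=0.7):
    """a_ij = 1 if |cor(profile_i, profile_j)| exceeds the threshold."""
    n = len(profiles)
    return [
        [1 if i != j and abs(pearson(profiles[i], profiles[j])) > threshold else 0
         for j in range(n)]
        for i in range(n)
    ]

# Three toy expression profiles: the first two move together
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 3.2, 3.9],   # nearly collinear with the first
    [4.0, 1.0, 3.0, 2.0],   # unrelated
]
a = adjacency(profiles)
print(a[0][1], a[0][2])  # 1 0
```

The 0.7 default mirrors the threshold used for the brain-tumor network later in the talk.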
Topological Overlap Matrix
• TOM given by (Ravasz et al., 2002)

  t_{ij} = \frac{\sum_u a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}

• k_i is the connectivity of node i:

  k_i = \sum_{j=1}^{n} a_{ij}
TOM (cont.)
• t_{ij} = 1 if the node with fewer links satisfies two conditions:
– all of its neighbors are also neighbors of the other node, and
– it is linked to the other node.
• In contrast, t_{ij} = 0 if nodes i and j are unlinked and the two nodes have no common neighbors.
• t_{ij} is a similarity measure; the associated dissimilarity measure is d_{ij} = 1 − t_{ij}.
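The topological overlap matrix can be sketched directly from the adjacency matrix. On a triangle (three mutually linked nodes) every pair attains full overlap; setting t_ii = 1 on the diagonal is a convention added here.

```python
def topological_overlap(a):
    """t_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    with t_ii = 1 by convention; k_i = sum_j a_ij."""
    n = len(a)
    k = [sum(row) for row in a]
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                t[i][j] = 1.0
                continue
            shared = sum(a[i][u] * a[u][j] for u in range(n))
            t[i][j] = (shared + a[i][j]) / (min(k[i], k[j]) + 1 - a[i][j])
    return t

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
tom = topological_overlap(triangle)
print(tom[0][1])  # -> 1.0
```

Feeding 1 − t_ij into a clustering routine (as the talk does with PAM) then groups genes by shared network neighborhoods rather than raw correlation alone.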
Gene Co-Expression Network for Brain Tumor Data
• Brain tumor (glioblastoma) microarray data (previously described in Freije et al (2004), Mischel et al (2005), and Zhang and Horvath (2005)).
• Network constructed using correlation threshold of 0.7 and the 1800 most highly connected genes.
Gene Co-Expression Network for Brain Tumor Data (cont.)
• Used PS method (with PAM) on TOM with m=(2,5,10)
• m=2, m=5 -> 5 clusters– Same as Mischel et al.
• m=10 -> 4 clusters – reasonable interpretation?
[Plot: PSE(k, m) versus k = 1, ..., 8 for m = 2, 5, 10, values ranging from 0 to 1]
Classical Multi-Dimensional Scaling
• Used to visualize the abstract TOM dissimilarity
• “Principal component analysis”
[Two 3-d scatter plots of the genes in the first three scaling coordinates (cmd1, cmd2, cmd3)]
Inspection of Heatmap
• Red for highly expressed genes
• Green for low expression
• Consistent expression across genes (rows) in clusters
=> Either 4 or 5 clusters justified
[Heatmap: genes (rows, labeled by ORI probe IDs) across samples, red for high and green for low expression, with cluster bands marked]
Conclusion
• There are several indices for evaluating clusterings – external indices compare different partitionings, internal indices do not
• Indices can be used to predict the number of clusters
• The Prediction Strength index method works across different cluster configurations
• Fairly simple and intuitive
• Effective on elongated clusters
• Results of varying m reflect hierarchical structure in the data
Acknowledgements
• Steve Horvath
• Meghna Kamath
• Fred Fox and the Tumor Cell Biology Training Grant (USHHS Institutional National Research Service Award #T32 CA09056)
• Stan Nelson and the UCLA Microarray Core Facility
• NIH Program Project grant #1U19AI063603-01
References
• http://www.genetics.ucla.edu/labs/horvath/GeneralPredictionStrength
• CALINSKI, R. & HARABASZ, J. (1974). A dendrite method for cluster analysis. Communications in Statistics 3, 1-27.
• DUDOIT, S. & FRIDLYAND, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3, RESEARCH0036.
• EISEN, M. B., SPELLMAN, P. T., BROWN, P. O. & BOTSTEIN, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8.
• FREIJE, W. A., CASTRO-VARGAS, F. E., FANG, Z., HORVATH, S., et al. (2004).
• HARTIGAN, J. A. (1985). Statistical theory in clustering. J. Classification 2, 63-76.
• HUBERT, L. & ARABIE, P. (1985). Comparing partitions. Journal of Classification 2, 193-218.
• KAUFMAN, L. & ROUSSEEUW, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
• KRZANOWSKI, W. & LAI, Y. (1985). A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics, 23-34.
• MISCHEL, P., ZHANG, B., CARLSON, M., FANG, Z., FREIJE, W., CASTRO, E., SCHECK, A., LIAU, L., KORNBLUM, H., GESCHWIND, D., CLOUGHESY, T., HORVATH, S. & NELSON, S. (2005). Hub genes predict survival for brain cancer patients.
• RAND, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846-850.
• RAVASZ, E., SOMERA, A. L., MONGRU, D. A., OLTVAI, Z. N. & BARABASI, A. L. (2002). Hierarchical organization of modularity in metabolic networks. Science 297, 1551-5.
• SARLE, W. (1983). Cubic Clustering Criterion. SAS Institute, Inc.
• TAMAYO, P., SLONIM, D., MESIROV, J., ZHU, Q., KITAREEWAN, S., DMITROVSKY, E., LANDER, E. S. & GOLUB, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96, 2907-12.
• TIBSHIRANI, R., WALTHER, G., BOTSTEIN, D. & BROWN, P. (2001). Cluster validation by prediction strength. Stanford.
• TIBSHIRANI, R., WALTHER, G. & HASTIE, T. (2000). Estimating the number of clusters in a dataset via the gap statistic. Department of Biostatistics, Stanford.
• YEUNG, K. Y., HAYNOR, D. R. & RUZZO, W. L. (2001). Validating clustering for gene expression data. Bioinformatics 17, 309-18.
• ZHANG, B. & HORVATH, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology.
Extra Slides
WSS and BSS
  BSS_{ij} = \sum_g n_g (\bar{x}_i^{\,g} - \bar{x}_i)(\bar{x}_j^{\,g} - \bar{x}_j)

  TSS_{ij} = \sum_{s=1}^{n} (x_{is} - \bar{x}_i)(x_{js} - \bar{x}_j) = \sum_g WSS_{ij}^{\,g} + BSS_{ij}
Tibshirani re-formulation
• Originally, measure of agreement formulated as
• Note: partitioning around medoid (PAM) clustering used
  ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{x_i \ne x_{i'} \in v_s} I\left(D[U(X_{te})]_{ii'} = 1\right)

  U(X_{te})_j = \arg\min_r \mathrm{dist}(\mathrm{medoid}(u_r), x_j), \quad x_j \in X_{te}

  D[U]_{ii'} = \begin{cases} 1 & x_i, x_{i'} \in u_r \text{ for some cluster } u_r \in U \\ 0 & \text{otherwise} \end{cases}
Tibshirani re-formulation detail
• Since the matrix D[U] is already an indicator matrix, this can be re-written as:

  ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{x_i \in v_s} \sum_{x_{i'} \ne x_i \in v_s} D[U(X_{te})]_{ii'}

• Now we can partition the observations x_i in cluster v_s by the cluster u_r to which they belong:

  ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{r=1}^{k} \sum_{\substack{x_i \in v_s \\ x_i \in u_r}} \sum_{x_{i'} \ne x_i \in v_s} D[U(X_{te})]_{ii'}

• We can also divide the observations x_{i'} in cluster v_s by whether or not they belong to cluster u_r:

  ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{r=1}^{k} \sum_{\substack{x_i \in v_s \\ x_i \in u_r}} \left[ \sum_{\substack{x_{i'} \ne x_i \in v_s \\ x_{i'} \in u_r}} D[U(X_{te})]_{ii'} + \sum_{\substack{x_{i'} \in v_s \\ x_{i'} \notin u_r}} D[U(X_{te})]_{ii'} \right]
Tibshirani re-formulation detail (cont.)
• By the definition of D, if x_i ∈ u_r and x_{i'} ∉ u_r, then D[U(X_{te})]_{ii'} = 0, and if x_i ∈ u_r and x_{i'} ∈ u_r, then D[U(X_{te})]_{ii'} = 1. Therefore,

  ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{r=1}^{k} \sum_{\substack{x_i \in v_s \\ x_i \in u_r}} \sum_{\substack{x_{i'} \ne x_i \in v_s \\ x_{i'} \in u_r}} 1
        = \min_{1 \le s \le k} \frac{\sum_{r=1}^{k} n_{rs}(n_{rs}-1)}{n_{.s}(n_{.s}-1)}
        = \min_{1 \le s \le k} \frac{2\sum_{r=1}^{k} \binom{n_{rs}}{2}}{2\binom{n_{.s}}{2}}

• Thus, ps(k) corresponds to PS(k, m=2):

  PS(k, m) = \min_{1 \le s \le k} p_m(s), \quad \text{where } p_2(s) = \frac{\sum_{r=1}^{k} \binom{n_{rs}}{2}}{\binom{n_{.s}}{2}}