different perspectives at clustering: the number-of-clusters case b. mirkin school of computer...

Different Perspectives at Clustering: The “Number-of-Clusters” Case

B. MirkinSchool of Computer ScienceBirkbeck College, University of London

IFCS 2006

Different Perspectives at Number of Clusters: Talk Outline

Clustering and K-Means: A discussionClustering goals and four perspectivesNumber of clusters in:

- Classical statistics perspective- Machine learning perspective- Data Mining perspective (including a simulation study with 8

methods)- Knowledge discovery perspective (including a comparative genomics

project)

WHAT IS CLUSTERING; WHAT IS DATAK-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Interpretation AidsWARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward ClusteringDATA RECOVERY MODELS: Statistics Modelling as Data Recovery;

Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One ClusteringDIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of ClustersGENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

Example: W. Jevons (1835-1882), updated in Mirkin 1996

Pluto doesn’t fit in the two clusters of planets

Example: A Few ClustersClustering interface to WEB search engines (Grouper):Query: Israel (after O. Zamir and O. Etzioni 2001)

Cluster # sites Interpretation1ViewRefine

24 Society, religion• Israel and Iudaism• Judaica collection

2ViewRefine

12 Middle East, War, History• The state of Israel• Arabs and Palestinians

3ViewRefine

31 Economy, Travel• Israel Hotel Association• Electronics in Israel

Clustering: Main Steps

Data collectingData pre-processingFinding clusters (the only step appreciated in conventional clustering)

InterpretationDrawing conclusions

Conventional Clustering: Cluster Algorithms

Single Linkage: Nearest Neighbour Ward Agglomeration Conceptual Clustering

K-means Kohonen SOM ………………….

K-Means: a generic clustering method

Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence

K= 3 hypothetical centroids (@)

* *

* * * * * * * * @ @

@** * * *


Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids

(seeds) 1. Assign points to the centroids according to Minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence

* *

* * * * * * * * @ @

@** * * *


Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to Minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence

* *

* * * * * * * * @ @

@** * * *


Entities are presented as multidimensional points (*) 0. Put K hypothetical centroids (seeds) 1. Assign points to the centroids according to Minimum distance rule 2. Put centroids in gravity centres of thus obtained clusters 3. Iterate 1. and 2. until convergence 4. Output final centroids and clusters

* * @ * * * @ * * * *

** * * *

@

Advantages of K-Means

Conventional: Models typology building Computationally effective Can be incremental, `on-line’Unconventional: Associates feature salience with feature scales and correlation/association

Applicable to mixed scale data

Drawbacks of K-Means •No advice on:•Data pre-processing•Number of clusters•Initial setting

•Instability of results•Criterion can be inadequate•Insufficient interpretation aids

Initial Centroids: Correct

Two cluster case

Initial Centroids: Correct

Initial Final

Different Initial Centroids

Different Initial Centroids: Wrong, even though in different clusters

Initial Final

Two types of goals (with no clear-cut borderline)

Engineering goals

Data analysis goals

Engineering goals (examples)

Devising a market segmentation to minimise the promotion and advertisement expenses

Dividing a large scheme into modules to minimise the cost

Organisation structure design

Data analysis goals (examples)

Recovery of the distribution function

Prediction

Revealing patterns in data

Enhancing knowledge with additional concepts

and regularitiesEach of these is realised

in a different perspective at clustering

Clustering Perspectives

Classical statistics: Recovery of a multimodal distribution

function

Machine learning: Prediction

Data mining: Revealing patterns in data

Knowledge discovery: additional concepts

and regularities

Clustering Perspectives at # Clusters

Classical statistics: As many as meaningful modes (mixture items) Machine learning: As many as needed for acceptable prediction Data mining: As many as meaningful patterns in data (including incomplete clustering) Knowledge discovery:

As many as needed to produce concepts and regularities adequate to the domain

Main Sources for Deriving # Clusters

Classical statistics: Model of the world Machine learning: Cost & Accuracy Trade Off Data mining: Data Knowledge discovery: Domain knowledge

Classical Statistics Perspective

There must be a model of data generation

E.g., Mixture of Gaussians

The task: identify all parameters of the model by using observed data

E.g., The number of Gaussians and their probabilities, means and covariances

Mixture of 3 Gaussian densities

Classical statistics perspective on K-Means

But a maximum likelihood method with spherical Gaussians of the same variance:

- within a cluster, all variables are independent and Gaussian with the same cluster-independent variance (z - scoring is a must then);

- the issue of the number of clusters can be approached with conventional approaches to hypothesis testing

Machine learning perspective

Clusters should be of help in learning data incrementally generated The number should be specified by the trade-off between accuracy and cost A criterion should guarantee partitioning of the feature space with clearly separated high density areas;A method should be proven to be consistent with the criterion on the population

Machine learning on K-Means

The number of clusters: to be specified according to prediction goals Pre-processing: no advice An incremental version of K-means converges to a minimum of the summary within-cluster variance, under conventional assumptions of data generation (McQueen 1967 – major reference, though the method is traced a dozen or two years earlier)

Data mining perspective

Data recovery framework fordata mining methods

Type of Data Similarity Temporal Entity-to-feature Co-occurrence

Type of Model Regression Principal

components Clusters

Model:Data = Model_Derived_Data + Residual

Pythagoras:

Data2 = Model_Derived_Data2 + Residual2

The better fit the better the model

K-Means as a data recovery method

Representing a partitionCluster k:

Centroid

ckv (v - feature)

Binary 1/0 membership

zik (i - entity)

Basic equations (analogous to PCA)

y – data entry, z - membership

c - cluster centroid, N – cardinality

i - entity, v - feature /category, k - cluster

K

k Si

V

vkviv

V

vk

K

kkv

N

i

V

viv

k

cyNcy1 1

2

1 1

2

1 1

2 )(

,1

ivikz

kvc

K

kivy

Meaning of Data scatter

The sum of contributions of features – the basis for feature pre-processing (dividing by range, not std) Proportional to the summary variance

V

v

N

iiv

N

i

V

viv yyD

1 1

2

1 1

22

Contribution of a feature F to a partition

Proportional to correlation ratio 2 if F is quantitative a contingency coefficient between cluster

partition and F, if F is nominal:Pearson chi-square (Poisson normalised)

Goodman-Kruskal tau-b (Range normalised)

Fv

k

K

kkvNc

1

2Contrib(F) =

Contribution of a quantitative feature to a partition

Proportional to correlation ratio 2 if F is quantitative

2

1

222 /)(

K

kkkpNN

Contribution of a nominal feature to a partition

Proportional to a contingency coefficient Pearson chi-square (Poisson normalised)

Goodman-Kruskal tau-b (Range normalised)

Bj=1

22

1

/)()( ji

K

kjiij BppppNFContr

jj pB

Pythagorean Decomposition of data scatter for interpretation

Contribution based description of clusters

C. Dickens: FCon = 0

M. Twain: LenD < 28

L. Tolstoy: NumCh > 3 or

Direct = 1

Principal Cluster Analysis (Anomalous Pattern) Method

yiv =cv zi + eiv,

where zi = 1 if iS, zi = 0 if iSWith Euclidean distance squared

Si

V

vSviv

V

vSSv

N

i

V

viv cyNcy

1

2

1

2

1 1

2 )(

Si

SSS

N

i

cidNcdid ),()0,()0,(1

cS must be anomalous, that is, interesting

Initial setting with Anomalous Single Cluster for iK-Means

Tom Sawyer

iK-Means with Anomalous Single Clusters

0

Tom Sawyer

12

3

Anomalous clusters + K-meansAfter extracting 2 clusters (how one can know that 2 is right?)

Final

Simulation study of 8 methods (joint work with Mark Chiang)Number-of clusters methods:

• Variance based:Hartigan(HK)

Calinski & Harabasz (CH) Jump Statistic (JS)• Structure based:

Silhouette Width (SW)• Consensus based:

Consensus Distribution area (CD)Consensus Distribution mean (DD)

• Sequential extraction of APs:Least Square (LS)Least Moduli (LM)

Data generation for the experiment

• Gaussian Mixture (6,7,9 clusters) with:•Cluster spatial size:

- Constant (spherical) - k-proportional - k2-proportional

•Cluster spread (distance between centroids)

SpreadSpherica

l

PPCA model

k-proport. k2-proport.

Large 2 () 10 () 10 ()

Small 0.2 () 0.5 () 2 ()

Evaluation of results: Estimated clustering versus that generated

•Number of clusters

•Distance between centroids

•Similarity between partitions

Distance between estimated centroids (o) and those generated (o )

Prime Assignment

G1(p1)

G2(p2)

G3(p3)

e1(q1)e2(q2)

e3(q3)

e4(q4)

e5(q5)

g1------e2g2------e4g3------e5

Distance between estimated centroids (o) and those generated (o )

Final Assignment

G1(p1)G2(p2)

G3(p3)

e1(q1)e2(q2)

e3(q3)

e4(q4)

e5(q5)

g1------e2, e1g2------e4, e3g3------e5

Distance between centroids: quadratic and city-block

1.Assignment

2.Distancing

g1(p1)------e1(q1), e2(q2)g2(p2)------e3(q3), e4(q4)g3(p3)------e5(q5)

d1=(q1*d(g1,e1)+q2*d(g1,e2))/(q1+q2)

d2=(q3*d(g2,e3)+q4*d(g2,e4))/(q3+q4)

d3=(q5*d(g3,e5)/q5

Distance between centroids: quadratic and city-block

1.Assignment

2.Distancing

3. Averaging

p1*d1+p2*d2+p3*d3

Similarity between partitionsaccording to their confusion table

• Relative distance (Mirkin-Cherny 1970)

• Tchouprov coefficient (Cramer 1943)

• Adjusted Rand Index (Arabie-Hubert, 1985)

• Average Overlap (Mirkin 2005)

Resultsat 9 clusters, 1000 entities, 20 features generated

Estimated number of clusters

Distance between Centroids

Adjust Rand Index

Large spread

Small spread

Large spread

Small spread

Large spread

Small spread

HK

CH

JS

SW

CD

DD

LS

LM

Knowledge discovery perspective on clustering

Conforming to and enhancing domain knowledge:Informal considerations so farRelevant items: Decision trees External validation

A case to generalise

Entities with a similarity measureClustering interpretation tool developedClustering method using a similarity threshold leading to a number of clustersDomain knowledge leading to constraints to the similarity thresholdBest fitting interpretation provides for the best number of clusters

Entities with a similarity measure

740 Homologous Protein Families (HPFs)

(in 30 herpes-virus genomes)Homology defined by a protein sequence fragment:

Sequence neighbourhood based similarity measure on HPFs

F1F2F3

Interpretation tool: Mapping to an evolutionary tree over genomes

Different aggregations – different histories

y o z m k a p b l w d c r n s f g h e x j u t i q v

F3 F2 F1

Algorithm: ADDI-S (Mirkin JoC 1987), a data approximation technique

To maximize Contribution to Data Scatter, Average within-cluster similarity c multiplied by the cluster’s size #SAlgorithm ADDI-S: Take S={ j } for arbitrary j Given S, find c=c(S) and similarities b(i,S) to

S for all entities i in and out of S; Check the differences b(i,S)-c/2. If they are

consistent, change the state of a most contributing entity. Else, stop and output S.

Resulting S: a tightness property.Holzinger (1941) B-coefficient, Arkadiev&Braverman (1964, 1967) Specter, Mirkin (1976, 1987) ADDI family, Ben-Dor, Shamir, Yakhini (1999) CAST

Algorithm: ADDI-S (Mirkin 1987), a data approximation techniques Number of clusters: Depends on

similarity shift threshold b b(ij) b(ij) – b

b

Domain knowledge: Function is known at some HPFs

287 pairs of HPFs with known function of which 86 are SYNONYMOUS (same function)

0.42 0.67 Similarity

density

synonymous

Non-synonymous

Two values: Min error No non-synonymous

Knowledge enhancing

Analyzing the reconstructed contents in 3 family ancestors and HUCA (the root)Analyzing differences between b=.42 and b=.67 cluster reconstructionsAnalyzing gene arrangement within genomes

Glycoprotein L’s HPFs are sequence-dissimilar, but they are always followed in genomes by glycolase that is mapped to HUCA

Glyc L GlycolaseTherefore glycoprotein L must be in HUCA too

Final HPFs and APFs

HPFs with a sequence-based similarity measureInterpretation: parsimonious historiesClustering ADDI-S using a similarity threshold leading to a number of clustersDomain knowledge: 86 pairs should be in same clusters, and 201 in different clusters 2 suggested similarity thresholdsBest fitting: 102 APF (aggregating 249 HPFs) and 491 singleton HPFs

Whole HPF aggregation method’s structure (joint work with R. Camargo, T. Fenner, P. Kellam, G. Loizou)

Conclusion I: Number of clusters?

Engineering perspective: defined by cost/effect Classical statistics perspective: can and should be determined from data with a model Machine learning perspective: can be specified according to the prediction accuracy to achieveData mining perspective: not to pre-specify; only those are of interest that bear interesting patternsKnowledge discovery perspective: not to pre-specify; those that are best in knowledge enhancing

Conclusion II: Each other data analysis concept

Classical statistics perspective: can be determined from data with a model Machine learning perspective: prediction accuracy to achieveData mining perspective: data approximationKnowledge discovery perspective: knowledge enhancing

Variance based methodsHartigan (HK):-calculate HT=(Wk/Wk+1-1)(N-k-1), where N is the number of entities-find the k which HT is less than a threshold 10Calinski and Harabasz (CH):-calculate CH=((T-Wk)/(k-1))/(Wk/(N-k)), where T is the data scatter-find the k which maximize CHWk is given K, the smallest within-cluster summary distance to centroids among those found at different K-Means initializations

Variance based methods

k Si k

Jump Statistic (JS):•for each entity i, clustering S={S1,S2,…,Sk}, and

centroids C={C1,C2,…,Ck}

•calculate d(i, Sk)=(yi-Ck)TΓ-1(yi-Ck) and dk=( d(i, Sk))/P*N

where P is the number of features, N is the number of rows and Γ is the covariance matrix of y•select a transformation power, typically P/2•calculate the jumps JS=d

•find the k which maximize JS

2/Pk 2/

1Pk

2/0P-d and d ≡0

Structure based methodsSilhouette Width (SW):• for each entity i, a(i)=average dissimilarity between i and all other entities of the cluster to which i belongs and b(i) is the minimum of average dissimilarity of i to all entities in other cluster • for each other cluster Sk, d(i, Sk)=average dissimilarity between i

and all entities of Sk

• (i)=min(d(i, Sk)) over Sk • s(i)=[b(i)-a(i)]/max(a(i),b(i))• calculate the average s=

• find the K maximizing the average s

i

s(i)/N

Consensus based methods

For each different K-means initializations•Find the connectivity matrix•Calculate the consensus matrix

•Calculate the cumulative distribution matrix CDF•Calculate the area under CDF A(k)

•Calculate Δ(K+1)=

•find the k which maximize Δ(k)

2 ,)(

)()1(1 ),(

KKA

KAKAKKA

Consensus based area(CD)

Consensus based methods

avdis(K)= μK*(1- μK)- σK2

μK is the mean of the consensus matrixσK is the variance of the consensus matrix

davdis(K)=(avdis(K)-avdis(K+1))/avdis(K+1)

Find the K which maximize davdis(DD)

Sequential cluster extractionIntelligent K-Means:• Anomalous Pattern (Initial clusters)• Removal of singletons• K-Means

• Euclidean Distance + The within-cluster mean Least Square Criterion (LS)

• Manhattan Distance + The within-cluster median Least Modules Criterion (LM)

different perspectives at clustering: the number-of-clusters case b. mirkin school of computer...

Documents

clustering slide

clusters clustering

outline clustering

reliability slide

different initial centroids

mixed scale data slide

discussion clustering

data types