![Page 1: Different Perspectives at Clustering: The Number-of-Clusters Case B. Mirkin School of Computer Science Birkbeck College, University of London IFCS 2006](https://reader035.vdocuments.us/reader035/viewer/2022070306/5515d234550346d46f8b46b2/html5/thumbnails/1.jpg)
Different Perspectives at Clustering: The “Number-of-Clusters” Case
B. Mirkin
School of Computer Science
Birkbeck College, University of London
IFCS 2006
Different Perspectives at Number of Clusters: Talk Outline
Clustering and K-Means: A discussion
Clustering goals and four perspectives
Number of clusters in:
- Classical statistics perspective
- Machine learning perspective
- Data Mining perspective (including a simulation study with 8 methods)
- Knowledge discovery perspective (including a comparative genomics project)
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
Example: W. Jevons (1835-1882), updated in Mirkin 1996
Pluto doesn’t fit in the two clusters of planets
Example: A Few Clusters
Clustering interface to WEB search engines (Grouper)
Query: Israel (after O. Zamir and O. Etzioni 2001)
Cluster (View | Refine)   # sites   Interpretation
1                         24        Society, religion
                                    • Israel and Judaism
                                    • Judaica collection
2                         12        Middle East, War, History
                                    • The state of Israel
                                    • Arabs and Palestinians
3                         31        Economy, Travel
                                    • Israel Hotel Association
                                    • Electronics in Israel
Clustering: Main Steps
Data collecting
Data pre-processing
Finding clusters (the only step appreciated in conventional clustering)
Interpretation
Drawing conclusions
Conventional Clustering: Cluster Algorithms
Single Linkage: Nearest Neighbour
Ward Agglomeration
Conceptual Clustering
K-means
Kohonen SOM
………………….
K-Means: a generic clustering method
Entities are presented as multidimensional points (*)
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in gravity centres of thus obtained clusters
3. Iterate 1. and 2. until convergence
[Figure: scatter of points (*) with K = 3 hypothetical centroids (@)]
K-Means: a generic clustering method
Entities are presented as multidimensional points (*)
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in gravity centres of thus obtained clusters
3. Iterate 1. and 2. until convergence
4. Output final centroids and clusters
[Figure: the same points (*) with the final positions of the three centroids (@)]
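Steps 0 to 4 above can be sketched in plain Python. This is a minimal illustration rather than the talk's own implementation: the function names, the squared Euclidean distance, and the choice to leave the centroid of an empty cluster where it was are assumptions made here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, seeds, max_iter=100):
    """Batch K-Means: alternate the minimum distance rule and centroid updates."""
    centroids = [tuple(s) for s in seeds]      # step 0: hypothetical centroids
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:               # step 3: stop when nothing changes
            break
        labels = new_labels
        # Step 2: move each centroid to the gravity centre of its cluster.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:                        # assumption: empty clusters keep their centroid
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels                   # step 4: output


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, seeds=[(0, 0), (10, 10)])
```

Running the same function with other seeds is a quick way to see how strongly the result can depend on the initial setting.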
Advantages of K-Means
Conventional:
- Models typology building
- Computationally effective
- Can be incremental, 'on-line'
Unconventional:
- Associates feature salience with feature scales and correlation/association
- Applicable to mixed scale data
Drawbacks of K-Means
• No advice on:
  • Data pre-processing
  • Number of clusters
  • Initial setting
• Instability of results
• Criterion can be inadequate
• Insufficient interpretation aids
Initial Centroids: Correct
Two cluster case
Initial Centroids: Correct
[Figure: two panels, Initial and Final]
Different Initial Centroids
Different Initial Centroids: Wrong, even though in different clusters
[Figure: two panels, Initial and Final]
Two types of goals (with no clear-cut borderline)
Engineering goals
Data analysis goals
Engineering goals (examples)
Devising a market segmentation to minimise the promotion and advertisement expenses
Dividing a large scheme into modules to minimise the cost
Organisation structure design
Data analysis goals (examples)
Recovery of the distribution function
Prediction
Revealing patterns in data
Enhancing knowledge with additional concepts and regularities
Each of these is realised in a different perspective at clustering
Clustering Perspectives
Classical statistics: Recovery of a multimodal distribution function
Machine learning: Prediction
Data mining: Revealing patterns in data
Knowledge discovery: additional concepts and regularities
Clustering Perspectives at # Clusters
Classical statistics: As many as meaningful modes (mixture items)
Machine learning: As many as needed for acceptable prediction
Data mining: As many as meaningful patterns in data (including incomplete clustering)
Knowledge discovery: As many as needed to produce concepts and regularities adequate to the domain
Main Sources for Deriving # Clusters
Classical statistics: Model of the world
Machine learning: Cost & Accuracy Trade Off
Data mining: Data
Knowledge discovery: Domain knowledge
Classical Statistics Perspective
There must be a model of data generation
E.g., Mixture of Gaussians
The task: identify all parameters of the model by using observed data
E.g., The number of Gaussians and their probabilities, means and covariances
Mixture of 3 Gaussian densities
Classical statistics perspective on K-Means
K-Means is but a maximum likelihood method with spherical Gaussians of the same variance:
- within a cluster, all variables are independent and Gaussian with the same cluster-independent variance (z-scoring is a must then);
- the issue of the number of clusters can be approached with conventional approaches to hypothesis testing
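The link can be made explicit with a short derivation. It is a sketch under the assumption of a hard-assignment (classification) likelihood; the symbols $\sigma^2$ (the common variance), $S_k$ (cluster $k$), $c_{kv}$ (its centroid), $N$ (number of entities) and $V$ (number of features) are introduced here for illustration.

```latex
\log L(S, c, \sigma)
  = -\frac{NV}{2}\,\log\!\left(2\pi\sigma^{2}\right)
    - \frac{1}{2\sigma^{2}} \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V}
      \left(y_{iv} - c_{kv}\right)^{2}
```

For any fixed $\sigma$, maximising $\log L$ over partitions and centroids amounts to minimising the within-cluster sum of squared distances to centroids, which is the K-Means criterion.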
Machine learning perspective
Clusters should be of help in learning data incrementally generated
The number should be specified by the trade-off between accuracy and cost
A criterion should guarantee partitioning of the feature space with clearly separated high density areas
A method should be proven to be consistent with the criterion on the population
Machine learning on K-Means
The number of clusters: to be specified according to prediction goals
Pre-processing: no advice
An incremental version of K-means converges to a minimum of the summary within-cluster variance, under conventional assumptions of data generation (MacQueen 1967 is the major reference, though the method is traced a dozen or two years earlier)
Data mining perspective
Data recovery framework for data mining methods
Type of Data: Similarity; Temporal; Entity-to-feature; Co-occurrence
Type of Model: Regression; Principal components; Clusters
Model: Data = Model_Derived_Data + Residual
Pythagoras: Data² = Model_Derived_Data² + Residual²
The better the fit, the better the model
K-Means as a data recovery method
Representing a partition
Cluster k:
Centroid c_kv (v - feature)
Binary 1/0 membership z_ik (i - entity)
Basic equations (analogous to PCA)
y – data entry, z - membership
c - cluster centroid, N – cardinality
i - entity, v - feature /category, k - cluster
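$$y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}$$
$$\sum_{i=1}^{N}\sum_{v=1}^{V} y_{iv}^{2} = \sum_{k=1}^{K}\sum_{v=1}^{V} c_{kv}^{2} N_k + \sum_{k=1}^{K}\sum_{i\in S_k}\sum_{v=1}^{V} (y_{iv} - c_{kv})^{2}$$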
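The decomposition of the data scatter into a part explained by the centroids and the within-cluster remainder can be checked numerically. The sketch below assumes that centroids are within-cluster means; the function name and the toy data are hypothetical.

```python
def scatter_decomposition(data, labels):
    """Return (data scatter, explained part, unexplained part) for a partition."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row)
    total = sum(y * y for row in data for y in row)   # sum of all squared entries
    explained = 0.0
    unexplained = 0.0
    for rows in clusters.values():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        # N_k times the squared norm of the centroid.
        explained += len(rows) * sum(c * c for c in centroid)
        # Squared distances of the cluster's entities to the centroid.
        unexplained += sum((y - c) ** 2 for row in rows for y, c in zip(row, centroid))
    return total, explained, unexplained


total, explained, unexplained = scatter_decomposition(
    [[1, 2], [3, 4], [-2, -1], [-4, -3]], [0, 0, 1, 1])
```

The two parts add up to the total for any partition, so minimising the unexplained part is the same as maximising the explained one.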
Meaning of Data scatter
The sum of contributions of features: the basis for feature pre-processing (dividing by range, not std)
Proportional to the summary variance
$$D^{2} = \sum_{i=1}^{N}\sum_{v=1}^{V} y_{iv}^{2} = \sum_{v=1}^{V}\sum_{i=1}^{N} y_{iv}^{2}$$
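A small sketch of the pre-processing this suggests: centre each feature and divide it by its range, then read the feature contributions off the data scatter. Centring by the mean and the helper names are assumptions made here; features with zero range are not handled.

```python
def standardise_by_range(data):
    """Subtract the feature mean and divide by the feature range (max - min)."""
    columns = list(zip(*data))
    means = [sum(col) / len(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]   # assumed non-zero
    return [[(y - m) / r for y, m, r in zip(row, means, ranges)] for row in data]


def feature_contributions(data):
    """Contribution of each feature to the data scatter: its sum of squares."""
    return [sum(y * y for y in col) for col in zip(*data)]


scaled = standardise_by_range([[0, 10], [2, 30], [4, 50]])
contributions = feature_contributions(scaled)
```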
Contribution of a feature F to a partition
Proportional to:
- the correlation ratio η², if F is quantitative
- a contingency coefficient between the cluster partition and F, if F is nominal: Pearson chi-square (Poisson normalised) or Goodman-Kruskal tau-b (Range normalised)
$$\mathrm{Contrib}(F) = \sum_{v\in F}\sum_{k=1}^{K} c_{kv}^{2} N_k$$
Contribution of a quantitative feature to a partition
Proportional to correlation ratio η² if F is quantitative
$$N\eta^{2} = N\left(\sigma^{2} - \sum_{k=1}^{K} p_k \sigma_k^{2}\right)/\sigma^{2}$$
Here p_k is the proportion of entities in cluster k, σ² is the variance of F, and σ_k² is its variance within cluster k.
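The correlation ratio is easy to compute directly from its definition. The sketch below uses population variances (division by the number of entities); the function name is hypothetical, and a feature with zero variance is not handled.

```python
def correlation_ratio(values, labels):
    """Share of a feature's variance explained by a partition (eta squared)."""
    n = len(values)
    mean = sum(values) / n
    total_var = sum((y - mean) ** 2 for y in values) / n
    within = 0.0
    for lab in set(labels):
        cluster = [y for y, l in zip(values, labels) if l == lab]
        m = sum(cluster) / len(cluster)
        var_k = sum((y - m) ** 2 for y in cluster) / len(cluster)
        within += len(cluster) / n * var_k      # p_k times the within-cluster variance
    return (total_var - within) / total_var


eta_squared = correlation_ratio([1, 3, 7, 9], [0, 0, 1, 1])
```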
Contribution of a nominal feature to a partition
Proportional to a contingency coefficient:
Pearson chi-square (Poisson normalised): B_j = p_j
Goodman-Kruskal tau-b (Range normalised): B_j = 1
$$\mathrm{Contr}(F) = N\sum_{k=1}^{K}\sum_{j\in F} (p_{kj} - p_k p_j)^{2}/(p_k B_j)$$
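As an illustration, the nominal-feature contribution can be computed from cluster labels and category labels. The sketch reads the formula as a double sum over clusters k and categories j of (p_kj - p_k p_j)² / (p_k B_j), multiplied by N; the function name and the `normalisation` switch are hypothetical.

```python
from collections import Counter


def nominal_contribution(labels, categories, normalisation="range"):
    """Contribution of a nominal feature with B_j = 1 ('range') or B_j = p_j ('poisson')."""
    n = len(labels)
    joint = Counter(zip(labels, categories))
    p_cluster = {k: c / n for k, c in Counter(labels).items()}
    p_category = {j: c / n for j, c in Counter(categories).items()}
    total = 0.0
    for k, p_k in p_cluster.items():
        for j, p_j in p_category.items():
            p_kj = joint[(k, j)] / n            # Counter gives 0 for unseen pairs
            b_j = 1.0 if normalisation == "range" else p_j
            total += (p_kj - p_k * p_j) ** 2 / (p_k * b_j)
    return n * total


range_value = nominal_contribution([0, 0, 1, 1], ["a", "a", "b", "b"], "range")
poisson_value = nominal_contribution([0, 0, 1, 1], ["a", "a", "b", "b"], "poisson")
```

With B_j = p_j the value coincides with the Pearson chi-square statistic of the cluster-by-category contingency table.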
Pythagorean Decomposition of data scatter for interpretation
Contribution based description of clusters
C. Dickens: FCon = 0
M. Twain: LenD < 28
L. Tolstoy: NumCh > 3 or Direct = 1
Principal Cluster Analysis (Anomalous Pattern) Method
$$y_{iv} = c_v z_i + e_{iv},$$
where z_i = 1 if i ∈ S, z_i = 0 if i ∉ S
With Euclidean distance squared:
$$\sum_{i=1}^{N}\sum_{v=1}^{V} y_{iv}^{2} = N_S\sum_{v=1}^{V} c_{Sv}^{2} + \sum_{i\in S}\sum_{v=1}^{V} (y_{iv} - c_{Sv})^{2} + \sum_{i\notin S}\sum_{v=1}^{V} y_{iv}^{2}$$
$$\sum_{i=1}^{N} d(i,0) = N_S\, d(c_S,0) + \sum_{i\in S} d(i,c_S) + \sum_{i\notin S} d(i,0)$$
c_S must be anomalous, that is, interesting
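One way to search for such a cluster is an alternating procedure in the style of K-Means with two centres, one of which stays fixed at the reference point 0. The sketch below is a hedged reading of that idea, not the talk's code: it assumes the data are already shifted so that the reference point is the origin, seeds the cluster with the entity farthest from it, and uses hypothetical names throughout.

```python
def anomalous_pattern(data, max_iter=100):
    """Extract one cluster that is far from the origin (the reference point)."""
    def sq_norm(a):
        return sum(x * x for x in a)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Seed: the entity farthest from the reference point.
    seed = max(range(len(data)), key=lambda i: sq_norm(data[i]))
    centroid = list(data[seed])
    members = []
    for _ in range(max_iter):
        # An entity joins S if it is closer to the centroid than to the origin.
        new_members = [i for i, row in enumerate(data)
                       if sq_dist(row, centroid) < sq_norm(row)]
        if not new_members or new_members == members:
            break
        members = new_members
        rows = [data[i] for i in members]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
    return members, centroid


data = [[0.1, 0], [-0.1, 0], [0, 0.1], [0, -0.1], [5, 5], [5, 6]]
members, centroid = anomalous_pattern(data)
```

Only the centroid of S is updated; the origin never moves, which is what makes the extracted cluster anomalous relative to it.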
Initial setting with Anomalous Single Cluster for iK-Means
Tom Sawyer
iK-Means with Anomalous Single Clusters
[Figure labels: 0; Tom Sawyer; 1, 2, 3]
Anomalous clusters + K-means
After extracting 2 clusters (how can one know that 2 is right?)
[Figure: Final]
Simulation study of 8 methods (joint work with Mark Chiang)
Number-of-clusters methods:
• Variance based:
  Hartigan (HK)
  Calinski & Harabasz (CH)
  Jump Statistic (JS)
• Structure based:
  Silhouette Width (SW)
• Consensus based:
  Consensus Distribution area (CD)
  Consensus Distribution mean (DD)
• Sequential extraction of APs:
  Least Square (LS)
  Least Moduli (LM)
Data generation for the experiment
• Gaussian Mixture (6, 7, 9 clusters) with:
• Cluster spatial size:
  - Constant (spherical)
  - k-proportional
  - k²-proportional
• Cluster spread (distance between centroids):
Spread   Spherical   PPCA model, k-proport.   PPCA model, k²-proport.
Large    2 ()        10 ()                    10 ()
Small    0.2 ()      0.5 ()                   2 ()
Evaluation of results: Estimated clustering versus that generated
•Number of clusters
•Distance between centroids
•Similarity between partitions
Distance between estimated centroids (o) and those generated (o )
Prime Assignment
Generated centroids (weights): G1 (p1), G2 (p2), G3 (p3)
Estimated centroids (weights): e1 (q1), e2 (q2), e3 (q3), e4 (q4), e5 (q5)
g1------e2
g2------e4
g3------e5
Distance between estimated centroids (o) and those generated (o )
Final Assignment
Generated centroids (weights): G1 (p1), G2 (p2), G3 (p3)
Estimated centroids (weights): e1 (q1), e2 (q2), e3 (q3), e4 (q4), e5 (q5)
g1------e2, e1
g2------e4, e3
g3------e5
Distance between centroids: quadratic and city-block
1. Assignment
2. Distancing
g1(p1)------e1(q1), e2(q2)
g2(p2)------e3(q3), e4(q4)
g3(p3)------e5(q5)
d1=(q1*d(g1,e1)+q2*d(g1,e2))/(q1+q2)
d2=(q3*d(g2,e3)+q4*d(g2,e4))/(q3+q4)
d3=(q5*d(g3,e5))/q5
Distance between centroids: quadratic and city-block
1.Assignment
2.Distancing
3. Averaging
p1*d1+p2*d2+p3*d3
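The distancing and averaging steps can be sketched as follows. The assignment of estimated centroids to generated ones is taken as given here, since the slides show it only pictorially; the function names and the dictionary format of `assignment` are hypothetical.

```python
def squared_euclidean(a, b):
    """Quadratic distance between two centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def city_block(a, b):
    """City-block (Manhattan) distance between two centroids."""
    return sum(abs(x - y) for x, y in zip(a, b))


def centroid_distance(generated, p, estimated, q, assignment, dist):
    """Weighted distance between generated and estimated centroids.

    assignment maps the index of a generated centroid to the list of indices
    of the estimated centroids assigned to it.
    """
    total = 0.0
    for g, e_indices in assignment.items():
        weight = sum(q[e] for e in e_indices)
        # Distancing: q-weighted average distance from g to its estimated centroids.
        d_g = sum(q[e] * dist(generated[g], estimated[e]) for e in e_indices) / weight
        # Averaging: weight the result by the generated cluster's proportion.
        total += p[g] * d_g
    return total


generated = [(0, 0), (10, 0)]
estimated = [(1, 0), (0, 2), (10, 1)]
value = centroid_distance(generated, [0.5, 0.5], estimated, [0.25, 0.25, 0.5],
                          {0: [0, 1], 1: [2]}, squared_euclidean)
```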
Similarity between partitions according to their confusion table
• Relative distance (Mirkin-Cherny 1970)
• Tchouprov coefficient (Cramer 1943)
• Adjusted Rand Index (Arabie-Hubert, 1985)
• Average Overlap (Mirkin 2005)
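Of these, the Adjusted Rand Index is the one reported in the results, and it can be computed from the confusion table alone. The sketch below uses the standard pair-counting formula; the function name is hypothetical, and the degenerate case in which both partitions are trivial (zero denominator) is not handled.

```python
from math import comb


def adjusted_rand_index(table):
    """Adjusted Rand Index from a confusion (contingency) table given as rows."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(x, 2) for row in table for x in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    maximum = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (maximum - expected)


same = adjusted_rand_index([[2, 0], [0, 2]])
crossed = adjusted_rand_index([[1, 1], [1, 1]])
```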
Results at 9 clusters, 1000 entities, 20 features generated
[Table: rows HK, CH, JS, SW, CD, DD, LS, LM; columns Estimated number of clusters, Distance between Centroids, Adjusted Rand Index, each split into Large spread and Small spread]
Knowledge discovery perspective on clustering
Conforming to and enhancing domain knowledge:
Informal considerations so far
Relevant items:
- Decision trees
- External validation
A case to generalise
Entities with a similarity measure
Clustering interpretation tool developed
Clustering method using a similarity threshold leading to a number of clusters
Domain knowledge leading to constraints to the similarity threshold
Best fitting interpretation provides for the best number of clusters
Entities with a similarity measure
740 Homologous Protein Families (HPFs) (in 30 herpes-virus genomes)
Homology defined by a protein sequence fragment:
Sequence neighbourhood based similarity measure on HPFs
[Figure labels: F1, F2, F3]
Interpretation tool: Mapping to an evolutionary tree over genomes
Different aggregations, different histories
[Figure: tree over genomes with leaves y o z m k a p b l w d c r n s f g h e x j u t i q v; families F3, F2, F1 mapped onto it]
Algorithm: ADDI-S (Mirkin JoC 1987), a data approximation technique
To maximize Contribution to Data Scatter: average within-cluster similarity c multiplied by the cluster's size #S
Algorithm ADDI-S:
- Take S={ j } for arbitrary j
- Given S, find c=c(S) and similarities b(i,S) to S for all entities i in and out of S
- Check the differences b(i,S)-c/2. If any of them is inconsistent with the entity's current state (positive for i outside S, or negative for i inside S), change the state of a most contributing entity. Else, stop and output S.
Resulting S: a tightness property.
Holzinger (1941) B-coefficient, Arkadiev & Braverman (1964, 1967) Specter, Mirkin (1976, 1987) ADDI family, Ben-Dor, Shamir, Yakhini (1999) CAST
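The flavour of this one-entity-at-a-time search can be shown with a simplified sketch. It is not ADDI-S as published: instead of testing b(i,S)-c/2, it evaluates the criterion directly for every possible change of state and applies the best improving one. It also assumes a symmetric similarity matrix with a zero diagonal and averages over all #S² pairs, so the criterion becomes the sum of within-cluster similarities divided by #S. All names are hypothetical.

```python
def cluster_score(similarity, members):
    """Average within-cluster similarity times cluster size (zero diagonal assumed)."""
    total = sum(similarity[i][j] for i in members for j in members if i != j)
    return total / len(members)


def greedy_tight_cluster(similarity, start):
    """Grow or shrink S one entity at a time while the criterion improves."""
    members = {start}
    score = cluster_score(similarity, members)
    while True:
        best_gain, best_entity = 0.0, None
        for i in range(len(similarity)):
            candidate = members ^ {i}          # change the state of entity i
            if not candidate:                  # never empty the cluster
                continue
            gain = cluster_score(similarity, candidate) - score
            if gain > best_gain:
                best_gain, best_entity = gain, i
        if best_entity is None:                # no improving change: S is final
            return sorted(members)
        members ^= {best_entity}
        score += best_gain


similarity = [
    [0, 1, 1, -1],
    [1, 0, 1, -1],
    [1, 1, 0, -1],
    [-1, -1, -1, 0],
]
cluster = greedy_tight_cluster(similarity, start=0)
```

Because negative similarities penalise admission, subtracting a larger threshold from all similarities yields smaller and therefore more numerous clusters.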
Algorithm: ADDI-S (Mirkin 1987), a data approximation technique
Number of clusters: depends on the similarity shift threshold b
b(ij) → b(ij) - b
Domain knowledge: Function is known at some HPFs
287 pairs of HPFs with known function, of which 86 are SYNONYMOUS (same function)
[Figure: density of similarity for synonymous and non-synonymous pairs; similarity axis marked at 0.42 and 0.67]
Two values (0.42 and 0.67): Min error; No non-synonymous
Knowledge enhancing
Analyzing the reconstructed contents in 3 family ancestors and HUCA (the root)
Analyzing differences between b=.42 and b=.67 cluster reconstructions
Analyzing gene arrangement within genomes
Glycoprotein L's HPFs are sequence-dissimilar, but they are always followed in genomes by glycolase that is mapped to HUCA
[Figure: Glyc L followed by Glycolase]
Therefore glycoprotein L must be in HUCA too
Final HPFs and APFs
HPFs with a sequence-based similarity measure
Interpretation: parsimonious histories
Clustering: ADDI-S using a similarity threshold leading to a number of clusters
Domain knowledge: 86 pairs should be in same clusters, and 201 in different clusters; 2 suggested similarity thresholds
Best fitting: 102 APF (aggregating 249 HPFs) and 491 singleton HPFs
Whole HPF aggregation method’s structure (joint work with R. Camargo, T. Fenner, P. Kellam, G. Loizou)
Conclusion I: Number of clusters?
Engineering perspective: defined by cost/effect
Classical statistics perspective: can and should be determined from data with a model
Machine learning perspective: can be specified according to the prediction accuracy to achieve
Data mining perspective: not to pre-specify; only those are of interest that bear interesting patterns
Knowledge discovery perspective: not to pre-specify; those that are best in knowledge enhancing
Conclusion II: Each other data analysis concept
Classical statistics perspective: can be determined from data with a model
Machine learning perspective: prediction accuracy to achieve
Data mining perspective: data approximation
Knowledge discovery perspective: knowledge enhancing
Variance based methods
Hartigan (HK):
- calculate HT = (W_k/W_{k+1} - 1)(N - k - 1), where N is the number of entities
- find the first k at which HT is less than the threshold 10
Calinski and Harabasz (CH):
- calculate CH = ((T - W_k)/(k - 1))/(W_k/(N - k)), where T is the data scatter
- find the k which maximizes CH
W_k is, for a given number of clusters k, the smallest within-cluster summary distance to centroids among those found at different K-Means initializations
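Both rules need only the values W_k, so they can be sketched without any clustering code. The functions below take a dictionary mapping k to W_k for consecutive k; their names are hypothetical, and taking the first k that falls below the threshold is the reading of Hartigan's rule assumed here.

```python
def hartigan_k(w, n, threshold=10.0):
    """First k at which HT = (W_k / W_{k+1} - 1)(N - k - 1) drops below the threshold."""
    for k in sorted(w):
        if k + 1 not in w:                 # W_{k+1} is needed to evaluate HT at k
            break
        ht = (w[k] / w[k + 1] - 1) * (n - k - 1)
        if ht < threshold:
            return k
    return None                            # no k in the supplied range qualifies


def calinski_harabasz_k(w, n, t):
    """The k > 1 maximising CH = ((T - W_k)/(k - 1)) / (W_k/(N - k))."""
    scores = {k: ((t - w_k) / (k - 1)) / (w_k / (n - k))
              for k, w_k in w.items() if k > 1}
    return max(scores, key=scores.get)


# Hypothetical within-cluster sums for k = 1..5 on 100 entities.
w = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 95.0, 5: 92.0}
k_hartigan = hartigan_k(w, n=100)
k_ch = calinski_harabasz_k(w, n=100, t=1000.0)
```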
Variance based methods
Jump Statistic (JS):
• for each entity i, clustering S={S1,S2,…,Sk}, and centroids C={C1,C2,…,Ck}
• calculate
$$d(i, S_k) = (y_i - C_k)^{T}\,\Gamma^{-1}(y_i - C_k), \qquad d_k = \Big(\sum_{k}\sum_{i\in S_k} d(i, S_k)\Big)/(P N)$$
where P is the number of features, N is the number of rows and Γ is the covariance matrix of y
• select a transformation power, typically P/2
• calculate the jumps
$$JS = d_k^{-P/2} - d_{k-1}^{-P/2}, \qquad d_0^{-P/2} \equiv 0$$
• find the k which maximizes JS
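Given the distortions d_k, the jumps follow in a few lines. The sketch assumes the distortions are supplied for consecutive k starting at 1 and that the transformation power is P/2; the function name and the example values are hypothetical.

```python
def jump_statistic_k(distortions, n_features):
    """Return the k with the largest jump and the dictionary of all jumps."""
    power = n_features / 2
    previous = 0.0                         # the transformed d_0 is defined as 0
    jumps = {}
    for k in sorted(distortions):
        transformed = distortions[k] ** (-power)
        jumps[k] = transformed - previous
        previous = transformed
    return max(jumps, key=jumps.get), jumps


best_k, jumps = jump_statistic_k({1: 4.0, 2: 1.0, 3: 0.8}, n_features=2)
```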
Structure based methods
Silhouette Width (SW):
• for each entity i, a(i) = average dissimilarity between i and all other entities of the cluster to which i belongs
• for each other cluster S_k, d(i, S_k) = average dissimilarity between i and all entities of S_k
• b(i) = min(d(i, S_k)) over S_k, the minimum of the average dissimilarities of i to the other clusters
• s(i) = [b(i)-a(i)]/max(a(i),b(i))
• calculate the average s = Σ_i s(i)/N
• find the K maximizing the average s
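These steps translate almost line by line into code. In the sketch below the dissimilarity is passed in as a function, and entities in singleton clusters are given s(i) = 0; that convention and the function name are assumptions made here, and the case max(a(i), b(i)) = 0 is not handled.

```python
def silhouette_width(items, labels, dissimilarity):
    """Average silhouette width s over all entities."""
    clusters = {}
    for index, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(index)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            continue                       # assumed convention: s(i) = 0
        # a(i): average dissimilarity to the rest of the entity's own cluster.
        a = sum(dissimilarity(items[i], items[j]) for j in own) / len(own)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(dissimilarity(items[i], items[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(labels)


width = silhouette_width([0, 1, 10, 11], [0, 0, 1, 1], lambda x, y: abs(x - y))
```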
Consensus based methods
Consensus Distribution area (CD)
For different K-means initializations:
• Find the connectivity matrix
• Calculate the consensus matrix
• Calculate the cumulative distribution function CDF
• Calculate the area under CDF, A(K)
• Calculate
$$\Delta(K+1) = \begin{cases} A(K), & K = 1 \\ \dfrac{A(K+1) - A(K)}{A(K)}, & K \ge 2 \end{cases}$$
• find the K which maximizes Δ(K)
Consensus based methods
avdis(K) = μ_K*(1 - μ_K) - σ_K²
μ_K is the mean of the consensus matrix
σ_K² is the variance of the consensus matrix
davdis(K) = (avdis(K) - avdis(K+1))/avdis(K+1)
Find the K which maximizes davdis (DD)
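A sketch of the avdis computation from several K-Means runs at the same K follows. It assumes that a consensus entry is the share of runs in which two entities fall in the same cluster, and that the mean and variance are taken over the off-diagonal pairs only; both choices and the function names are assumptions.

```python
def consensus_entries(label_runs):
    """Consensus values for all pairs i < j across a list of label vectors."""
    n = len(label_runs[0])
    entries = []
    for i in range(n):
        for j in range(i + 1, n):
            together = sum(1 for labels in label_runs if labels[i] == labels[j])
            entries.append(together / len(label_runs))
    return entries


def avdis(label_runs):
    """mu * (1 - mu) minus the variance of the consensus entries."""
    entries = consensus_entries(label_runs)
    mu = sum(entries) / len(entries)
    variance = sum((m - mu) ** 2 for m in entries) / len(entries)
    return mu * (1 - mu) - variance


stable = avdis([[0, 0, 1, 1], [0, 0, 1, 1]])
unstable = avdis([[0, 0, 1, 1], [0, 1, 0, 1]])
```

When all runs agree, every consensus entry is 0 or 1 and avdis is zero; disagreement between runs pushes it up.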
Sequential cluster extraction
Intelligent K-Means:
• Anomalous Pattern (Initial clusters)
• Removal of singletons
• K-Means
• Euclidean Distance + The within-cluster mean: Least Square Criterion (LS)
• Manhattan Distance + The within-cluster median: Least Moduli Criterion (LM)
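The difference between the two criteria comes down to the centre and the distance used within a cluster, as the sketch below illustrates for a single cluster. The function names and the toy cluster are hypothetical.

```python
from statistics import mean, median


def least_squares_cost(cluster):
    """LS: squared Euclidean distances to the within-cluster mean."""
    centre = [mean(col) for col in zip(*cluster)]
    return sum((y - c) ** 2 for row in cluster for y, c in zip(row, centre))


def least_moduli_cost(cluster):
    """LM: Manhattan distances to the within-cluster (feature-wise) median."""
    centre = [median(col) for col in zip(*cluster)]
    return sum(abs(y - c) for row in cluster for y, c in zip(row, centre))


cluster = [(0, 0), (2, 0), (10, 0)]
ls_cost = least_squares_cost(cluster)
lm_cost = least_moduli_cost(cluster)
```

The outlying entity at 10 pulls the mean to 4 but leaves the median at 2, which is why the LM version is the less sensitive of the two to extreme values.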