cluster validation and discovery of multiple clusterings


CLUSTER VALIDATION AND DISCOVERY OF MULTIPLE CLUSTERINGS

Yang Lei

ORCID: 0000-0003-3780-6510

Doctor of Philosophy

August 2016

Department of Computing and Information Systems
The University of Melbourne

Australia

Submitted in total fulfilment of the requirements

of the degree of Doctor of Philosophy

Produced on archival quality paper


ABSTRACT

Cluster analysis is an important unsupervised learning process in data analysis. It aims to group data objects into clusters, so that data objects in the same group are more similar and data objects in different groups are more dissimilar. There are many open challenges in this area. In this thesis, we focus on two: discovery of multiple clusterings and cluster validation.

Many clustering methods focus on discovering one single 'best' solution from the data. However, data can be multi-faceted in nature. Particularly when datasets are large and complex, there may be several useful clusterings in the data. In addition, users may seek different perspectives on the same dataset, requiring multiple clustering solutions. Multiple clustering analysis, which has attracted considerable attention in recent years, aims to discover multiple reasonable and distinctive clustering solutions from the data. Many methods have been proposed on this topic, and one popular technique is meta-clustering. Meta-clustering explores multiple reasonable and distinctive clusterings by analyzing a large set of base clusterings. However, there may exist poor-quality and redundant base clusterings, which hinder the generation of high-quality and diverse clustering views. In addition, the generated clustering views may not all be relevant, and checking all the returned solutions costs users time and energy. To tackle these problems, we propose a filtering method and a ranking method to achieve higher-quality and more distinctive clustering solutions.

Cluster validation refers to the procedure of evaluating the quality of clusterings, which is critical for clustering applications. Cluster validity indices (CVIs) are often used to quantify the quality of clusterings. They can generally be classified into two categories, external measures and internal measures, distinguished by whether or not external information is used during the validation procedure. In this thesis, we focus on external cluster validity indices. There are many open challenges in this area. We focus on two of them: (a) CVIs for fuzzy clusterings, and (b) bias issues for CVIs.

External CVIs are often used to quantify the quality of a clustering by comparing it against the ground truth. Most external CVIs are designed for crisp clusterings (where each data object belongs to exactly one cluster). How to evaluate the quality of soft clusterings (where a data object can belong to more than one cluster) is a challenging problem. One common way to achieve this is by hardening a soft clustering to a crisp clustering and then evaluating it using a crisp CVI. However, hardening may cause information loss. To address this problem, we generalize a class of popular information-theoretic crisp external CVIs to directly evaluate the quality of soft clusterings, without the need for a hardening step.

There is an implicit assumption when using external CVIs to evaluate the quality of a clustering: that they work correctly. If this assumption does not hold, misleading results may occur. Thus, identifying and understanding the bias behaviors of external CVIs is crucial. Along these lines, we identify novel bias behaviors of external CVIs and analyze the type of bias both theoretically and empirically.


DECLARATION

This is to certify that:

(a) The thesis comprises only my original work towards the degree of Doctor of Philosophy except where indicated in the Preface;

(b) Due acknowledgement has been made in the text to all other material used;

(c) The thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Yang Lei


PREFACE

This thesis has been written at the Department of Computing and Information Systems, The University of Melbourne. Each chapter is based on manuscripts published or under review for publication. I declare that I am the primary author and have contributed to more than 50% of each of these papers.

Chapter 3 is based on the papers:

- Y. Lei, N. X. Vinh, J. Chan and J. Bailey, “FILTA: Better View Discovery from Collections of Clusterings via Filtering”. Published in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2014), pp. 145-160, 2014.

- Y. Lei, N. X. Vinh, J. Chan and J. Bailey, “rFILTA: Relevant and Non-Redundant View Discovery from Collections of Clusterings via Filtering and Ranking”. To appear in Knowledge and Information Systems (KAIS).

Chapter 4 is based on the papers:

- Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized Information Theoretic Cluster Validity Indices for Soft Clusterings”. Published in Proceedings of the IEEE Symposium Series on Computational Intelligence, pages 24-31, 2014.

- Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Extending Information-Theoretic Validity Indices for Fuzzy Clustering”. To appear in IEEE Transactions on Fuzzy Systems.


Chapter 5 is based on the paper:

- Y. Lei, J. C. Bezdek, S. Romano, N. X. Vinh, J. Chan and J. Bailey, “Ground Truth Bias in External Cluster Validity Indices”. Under second round of review at Pattern Recognition.


ACKNOWLEDGMENTS

As I am about to submit my thesis, I have conflicting feelings. On the one hand, I am happy about finishing my PhD study and entering a new stage of my life. On the other hand, I have started missing the people, the office, the campus and everything here. These four years are the most precious time in my life and I have so many wonderful memories. During these four years, I felt lucky and grateful to have met so many people from whom I have learnt and with whom I have shared many happy experiences.

Firstly, I want to thank my supervisor, Prof. James Bailey. He is also my mentor, from whom I can always get support and guidance. With his wisdom, strong academic skills, always positive attitude, patience, care and support, I have successfully overcome many obstacles and have grown and learnt a lot. With his helpful guidance and encouragement, I have improved in many aspects and built up my confidence. I also want to thank Prof. Jim Bezdek, who gave me many helpful suggestions in work and life. I am impressed not only by his strong academic expertise and life experience but also by his passion for research and his professional research attitude. In addition, I want to thank Dr. Nguyen Xuan Vinh and Dr. Jeffrey Chan, who have provided me with many helpful suggestions and much support. I also want to thank my lab colleague and collaborator Simone, who is always available for help and support. It has been a pleasure to work with him.

Next, I want to thank Prof. Christopher Leckie and A/Prof. Shanika Karunasekera, who kindly served on my advisory committee. Thanks for their always supportive and helpful suggestions for my research. I also want to thank Prof. Rao Kotagiri. It was always enjoyable to chat with him in the kitchen or corridor. His strong academic expertise and enthusiasm for research have had a big influence on me. Moreover, I would like to thank Prof. Justin Zobel. His research methods course started my research journey at Melbourne University.

Last but not least, and there are many others that I am indebted to but cannot thank in this limited space, I would like to thank my many lovely friends from our lab and our building: Sergey, Goce, Jiazhen, Florin, Shuo, Yun, Yamuna, Xingjun, Yunzhe, Mohadeseh, Liyan, Pallab, Andrey and Qingyu. Because of you, my life has been colorful and delightful.

Finally, I want to thank all of my family. Thanks for your unconditional support and love, which are the strongest motivation for me.

Thank you all,
Yang


NOTATIONS

Symbols and descriptions:

N            total number of data objects in dataset X
X            a dataset of data objects, X = {x1, . . . , xN}
Y            the relevant features of X
xi           a data object, xi ∈ Rd
U            a c × N matrix, U = [uik], representing a partition (clustering) of X
uik          the degree of membership of object k in the ith cluster
ui           the ith cluster of a crisp clustering U, where U = {u1, . . . , uc}
c            the number of clusters in a partition
ni           the number of objects in cluster ui
wi           the cluster centre of cluster ui
nij          an entry of the contingency table built from two clusterings
PX           the space of possible clusterings of X
C            a set of (base) clusterings, C = {C1, . . . , Cl}
Ci           a crisp/hard clustering
V            a set of clustering views, V = {V1, . . . , VR}
Vi           a clustering view, Vi ∈ PX
Q(C)         a function measuring the quality of a clustering C, Q : PX → R+
Sim(Ci, Cj)  a function measuring the similarity between two clusterings Ci and Cj, Sim : PX × PX → R+
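To make the membership-matrix notation concrete, the following sketch (NumPy; the matrices are invented examples, not data from the thesis) shows a crisp and a soft partition matrix U for c = 2 clusters and N = 4 objects, along with the hardening step discussed in the abstract:

```python
import numpy as np

# Crisp partition: each column (one per object) has exactly one entry equal to 1.
U_crisp = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 1]])

# Soft (fuzzy) partition: column entries are degrees of membership summing to 1.
U_soft = np.array([[0.9, 0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.8, 0.9]])

# Both satisfy the column-sum constraint: sum_i u_ik = 1 for every object k.
assert np.allclose(U_crisp.sum(axis=0), 1)
assert np.allclose(U_soft.sum(axis=0), 1)

# Hardening: assign each object to its maximum-membership cluster.
# Note the membership values 0.7 vs 0.3 and 0.9 vs 0.1 are collapsed to the
# same crisp assignment, which is the information loss mentioned above.
hardened = np.argmax(U_soft, axis=0)
print(hardened)  # array of cluster indices, one per object
```

Hardening U_soft here recovers the same grouping as U_crisp, even though the two soft columns per cluster carry quite different degrees of certainty.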


CONTENTS

1 introduction
    1.1 Cluster Analysis
    1.2 Scope of the Thesis
        1.2.1 Clustering Techniques
        1.2.2 Cluster Validity
        1.2.3 Discovery of Multiple Clusterings and External Cluster Validity Indices
    1.3 Focus of the Thesis
        1.3.1 Discovery of Multiple Clusterings
        1.3.2 External Cluster Validity Indices
            1.3.2.1 External CVIs for Soft Clustering
            1.3.2.2 Ground Truth Bias of External CVIs
    1.4 Structure of the Thesis
    1.5 Summary

2 background knowledge
    2.1 Traditional Clustering Methods
        2.1.1 Basic Notations
        2.1.2 Crisp / Hard Clustering Methods
        2.1.3 Soft Clustering Methods
    2.2 Multiple Clustering Analysis
        2.2.1 Discovery of Multiple Clusterings
        2.2.2 Other Related Clustering Methods
            2.2.2.1 Ensemble Clustering
            2.2.2.2 Multiview Clustering
            2.2.2.3 Subspace Clustering
            2.2.2.4 Constrained Clustering
    2.3 Cluster Validity
        2.3.1 External Cluster Validity Indices
        2.3.2 Internal Cluster Validity Indices
    2.4 Summary

3 discovery of multiple clusterings
    3.1 Introduction
    3.2 Related Work
    3.3 rFILTA Framework
        3.3.1 Filtering Base Clusterings
        3.3.2 Clustering Quality and Diversity Measures
            3.3.2.1 Quality
            3.3.2.2 Diversity
        3.3.3 Discovering the Meta-Clusters
        3.3.4 Meta-Cluster Ranking
        3.3.5 Discovering the Clustering Views via Ensemble Clustering
    3.4 Time Complexity Analysis
    3.5 Experimental Results
        3.5.1 Base Clustering Generation Methods
        3.5.2 Evaluation of the Clustering Views
        3.5.3 Parameter Setting
            3.5.3.1 The Impact of the Number of Selected Base Clusterings
            3.5.3.2 Impact of the Regularization Parameter
        3.5.4 Synthetic Datasets Evaluation
            3.5.4.1 4 Gaussian 2D dataset
            3.5.4.2 8 Gaussian 3D dataset
        3.5.5 Real Datasets
            3.5.5.1 CMUFace Dataset
            3.5.5.2 Card Dataset
            3.5.5.3 Flower Dataset
            3.5.5.4 Evaluation of Running Time for Each Step in rFILTA
    3.6 Summary

4 soft clustering validation
    4.1 Introduction
    4.2 Related Work
    4.3 Soft Generalization of Information-theoretic based CVIs
        4.3.1 Technique for Soft Generalization
        4.3.2 Soft Generalization of IT based CVIs
    4.4 Evaluation Methodology
        4.4.1 Implementation and Settings
        4.4.2 Datasets
            4.4.2.1 Synthetic Data
            4.4.2.2 Real-World Data
        4.4.3 Experimental Design
    4.5 Experimental Results
        4.5.1 FCM Tests with the Synthetic Gaussian Datasets
            4.5.1.1 Results on datasets with overlapping clusters (G1)
            4.5.1.2 Results on datasets with different sized clusters (G2)
            4.5.1.3 Results on datasets with different sized and overlapping clusters (G3)
            4.5.1.4 Results on datasets with different data sizes (G4)
        4.5.2 FCM Tests with Real-World Datasets
        4.5.3 EM Tests with Synthetic and Real-World Datasets
    4.6 Theoretical Analysis
    4.7 Summary

5 bias of cluster validity indices
    5.1 Introduction
        5.1.1 Example 1 - NC bias of RI and JI
        5.1.2 Example 2 - GT bias of RI
    5.2 Related Work
    5.3 Definitions
    5.4 Pair-counting External Cluster Validity Measures
    5.5 Numerical Experiments
        5.5.1 Type 1: GT1 bias Evaluation
        5.5.2 Type 2: GT2 bias Evaluation
        5.5.3 Summary for All 26 Comparison Measures
    5.6 Bias Due to Ground Truth for the Rand Index
        5.6.1 Quadratic Entropy and Rand Index
            5.6.1.1 Havrda-Charvat Generalized Entropy
            5.6.1.2 Quadratic Entropy and VI
            5.6.1.3 Quadratic Entropy and Rand Index
        5.6.2 GT bias of RI
            5.6.2.1 General Case of GT bias
            5.6.2.2 GT1 bias and GT2 bias
    5.7 Example of GT Bias for Adjusted Rand Index (ARI)
    5.8 Summary

6 conclusions
    6.1 Summary of Thesis
    6.2 Future Directions
    6.3 Final Words

Appendix

a rfilta experimental results
    a.1 Isolet Dataset
    a.2 WebKB Dataset
    a.3 Object Dataset


LIST OF FIGURES

Figure 1.1   Two reasonable but distinctive clusterings on the same data.
Figure 1.2   Examples of crisp clustering and soft clustering. Each symbol represents a data object. The colors of objects indicate their cluster membership.
Figure 1.3   Example of a biased external CVI. V2 is preferred by this external CVI, S, as it received a higher similarity score. However, both V1 and V2 are random clusterings which show no meaningful clustering structure. S should not show preference to either of the clusterings.
Figure 3.1   The existing meta-clustering framework.
Figure 3.2   Illustrative example for the meta-clustering process.
Figure 3.3   Clustering views generated from the unfiltered and filtered base clusterings on the CMUFace dataset.
Figure 3.4   The meta-clustering framework with our additional, proposed filtering and ranking steps highlighted with shaded squares.
Figure 3.5   iVAT diagrams for different numbers of filtered base clusterings with β = 0.6 on the flower dataset.
Figure 3.6   iVAT diagrams generated from 100 filtered base clusterings and different β values, for the flower data.
Figure 3.7   iVAT diagrams of the unfiltered and filtered base clusterings on the 2D Gaussian dataset.
Figure 3.8   Top 8 clustering views returned from the unfiltered base clusterings on the 2D Gaussian dataset.
Figure 3.9   7 clustering views generated from the filtered base clusterings on the 2D Gaussian dataset. They correspond to the 7 ground truth clustering views.
Figure 3.10  MBM scores of the clustering views generated from the unfiltered (30 views) and filtered (7 views) base clusterings on the 2D Gaussian dataset. The x axis indicates the top K clustering views.
Figure 3.11  iVAT diagrams of the unfiltered and filtered base clusterings on the 3D Gaussian dataset.
Figure 3.12  The top 4 clustering views discovered from the 700 unfiltered base clusterings on the 3D Gaussian dataset.
Figure 3.13  The 3 clustering views discovered from the 100 filtered base clusterings on the 3D Gaussian dataset. They correspond to the 3 ground truth clustering views.
Figure 3.14  The MBM scores of the clustering views generated from the unfiltered (52 views) and filtered (3 views) base clusterings on the 3D Gaussian dataset.
Figure 3.15  Two ground truth clustering views on the CMUFace dataset. The first row is the person view and the second row is the pose view.
Figure 3.16  Results for the unfiltered base clusterings on the CMUFace dataset.
Figure 3.17  Results for the filtered base clusterings on the CMUFace dataset when β = 0.6.
Figure 3.18  Results for the filtered set of base clusterings on the CMUFace dataset when β = 0.4.
Figure 3.19  The MBM scores for the two sets of clustering views on the CMUFace dataset.
Figure 3.20  Two ground truth clustering views on the Card dataset. The first row is the color view and the second row is the suits view.
Figure 3.21  The alternative clusterings generated by minCEntropy. The first row is the color view, generated given the suits view as reference clustering. The second row is generated given the color view as reference clustering. The score above each image is the percentage of the dominant class in that cluster. The AMI similarity scores between each clustering view and the two ground truth views respectively are presented at the right side of each clustering view.
Figure 3.22  Results on the 600 unfiltered base clusterings on the card dataset.
Figure 3.23  Results for the filtered base clusterings on the card dataset.
Figure 3.24  The MBM scores for the two sets of clustering views from the unfiltered and filtered sets of base clusterings on the card dataset.
Figure 3.25  Example images of buttercup, sunflower, windflower and daisy flowers, from left to right and top to bottom.
Figure 3.26  Two ground truth clustering views on the flower dataset. The first row is the color view and the second row is the shape view.
Figure 3.27  Results for the unfiltered base clusterings on the flower dataset.
Figure 3.28  Results for the 100 filtered base clusterings on the flower dataset.
Figure 3.29  Results of the filtered base clusterings on the flower dataset.
Figure 3.30  The MBM scores for two sets of clustering views generated from the unfiltered and filtered base clusterings on the flower dataset.
Figure 3.31  Running time of the different steps in rFILTA on the CMUFace and isolet datasets, in seconds.
Figure 4.1   Scatter plots for datasets (Ovp1, Ovp5) in group G1, (Dens1, Dens6) in group G2, (OvpDens1, OvpDens6) in group G3 and (NSize1, NSize9) in group G4. Points in the same cluster have the same color.
Figure 4.2   Overall success rates of the soft CVIs for FCM partitions on the synthetic and real-world datasets. Error bars indicate the standard deviation.
Figure 4.3   Success rates of soft CVIs for FCM partitions on 25 synthetic datasets.
Figure 4.4   Overall success rates of the soft CVIs for EM partitions on the synthetic and real-world datasets. Error bars indicate the standard deviation.
Figure 4.5   Success rates of soft CVIs for EM partitions on 25 synthetic datasets.
Figure 4.6   Results on data X30.
Figure 5.1   The average RI and JI values over 100 partitions at each c with uniformly generated ground truth. The symbol ↑ means the measure is a similarity measure and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.
Figure 5.2   The average RI and JI values over 100 partitions at each c with skewed ground truth. The symbol ↑ means the measure is a similarity measure and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.
Figure 5.3   The relationship between NC bias and GT bias in Definitions 5.1-5.4.
Figure 5.4   100-trial average values of the RI, JI and ARI external CVIs with variable ground truth to investigate GT1 bias, ctrue = 2, 50. The symbol ↑ means the measure is a similarity measure and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.
Figure 5.5   100-trial average values of the RI, JI and ARI with unbalanced ground truth to investigate GT2 bias, ctrue = 5. n1 = 10% ∗ N (left), n1 = 90% ∗ N (right). The symbol ↑ means the measure is a similarity measure and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.
Figure 5.6   100-trial average RI values with ci ranging from 2 to 9 for ctrue = 3 (dotted line). In the legend, n1 : n2 : n3 indicates the sizes of the three clusters and the corresponding H2(UGT) values. The type of NC bias is also indicated in the legend.
Figure 5.7   100-trial average RI values with c in {2, . . . , 12} for ctrue = 4. p1 = |u1|/N, in 9 steps from 0.1 to 0.9, with the other 3 clusters uniformly distributed. When p1 > p∗ = 0.683, the RI decreases as c increases (e.g., p1 = 0.7, 0.8, 0.9). When p1 < p∗ = 0.683, the RI increases with c (e.g., p1 = 0.1, . . . , 0.6).
Figure 5.8   The relationship between p∗ and r, for r in {2, . . . , 50}.
Figure 5.9   100-trial average ARI values for two different ground truths and sets of candidates with different numbers of clusters.
Figure A.1   Results on the isolet dataset.
Figure A.2   Results on the webkb dataset.
Figure A.3   Example images of the nine selected objects.
Figure A.4   Two ground truth clustering views on the object dataset. The first row is the color view and the second row is the shape view.
Figure A.5   The results on the unfiltered base clusterings on the object dataset.
Figure A.6   The results for the filtered base clusterings on the object dataset with β = 0.3.
Figure A.7   MBM scores for clustering views generated from the unfiltered and filtered base clusterings on the object dataset.


LIST OF TABLES

Table 2.1  Summary of alternative clustering methods. CMI is short for conditional mutual information. 'Simul.' or 'Seq.' indicates whether an approach discovers alternative clusterings simultaneously or sequentially.
Table 2.2  Contingency table based on partitions U and V, nij = |ui ∩ vj|.
Table 2.3  A summary of external cluster validity indices. U and V are two clusterings. ai is the row sum of the contingency table (Table 2.2), bj is the column sum, and nij is an entry of the contingency table. k11, k10, k01 and k00 are counts of four types of pairs of objects (Equations 2.10 - 2.13).
Table 4.1  Information-theoretic cluster validity indices.
Table 4.2  Real-world datasets: N = number of points, d = number of dimensions and cGT = number of ground truth classes.
Table 4.3  Success rate (% of successes in 100 trials) of nine indices for FCM on 10 real-world datasets. The highlighted numbers indicate success rates above 85%. The highlighted datasets have at least one result above 85%.
Table 4.4  Contingency matrix for U2 and V.
Table 4.5  Contingency matrix for U4 and V.
Table 4.6  Contingency matrix for U3 and V.
Table 4.7  The values of NMIsj, NMIss, NMIsr and NMIsM and corresponding entropies for U2, U3 and U4.
Table 5.1  Glossary of the different types of bias discussed in this thesis.
Table 5.2  Pair-counting based comparison measures (external CVIs). k11, k10, k01 and k00 are counts of four types of pairs of objects (refer to Equations 2.10 - 2.13 in Chapter 2).


1 INTRODUCTION

1.1 cluster analysis

Cluster analysis is an important unsupervised technique in exploratory data analysis. It attempts to group data objects into clusters, so that data objects within the same cluster are more similar than those in different clusters. It is often used at the initial stage of data analysis, when there is little knowledge available about the data. Cluster analysis has been successfully used in many areas, including bioinformatics [Jakobsson and Rosenberg, 2007], market segmentation [Baudat and Anouar, 2000; Punj and Stewart, 1983], information retrieval [Wu et al., 2013] and image processing [Bezdek et al., 2006].

There are several fundamental questions to be answered when performing cluster analysis.

assessment of clustering tendency Before applying any clustering algorithm to a given dataset, it is natural to ask: "does the dataset contain meaningful clusters?" This is an important question to answer, as a clustering solution can always be generated by applying a clustering algorithm, whether or not there is a genuine grouping structure in the underlying data. Thus, before clustering is performed, we first need to test and analyze whether there are valid clusters and groupings in the data. This step is called assessment of clustering tendency.
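As an illustration of how clustering tendency can be quantified, the sketch below implements the Hopkins statistic, one common tendency test (shown purely as an example; it is not a method proposed in this thesis, and the function and dataset here are invented for illustration). It compares nearest-neighbour distances from uniformly random probe points to the data against nearest-neighbour distances among the data itself: values near 0.5 suggest uniformly random data, while values near 1 suggest genuine grouping structure.

```python
import numpy as np

def hopkins(X, m=None, rng=None):
    """Hopkins statistic: ~0.5 for uniform data, approaching 1 for clustered data."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # m uniformly random probe points inside the bounding box of X
    probes = rng.uniform(lo, hi, size=(m, d))
    # m points sampled from X itself
    idx = rng.choice(n, size=m, replace=False)
    samples = X[idx]

    def nn_dist(points, exclude_self=False):
        # distance from each point to its nearest neighbour in X
        dists = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            dists[np.arange(m), idx] = np.inf  # ignore a sample's distance to itself
        return dists.min(axis=1)

    u = nn_dist(probes)                       # probe -> nearest data point
    w = nn_dist(samples, exclude_self=True)   # sample -> nearest other data point
    return u.sum() / (u.sum() + w.sum())

# Two tight, well-separated Gaussian blobs: the statistic should be well above 0.5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
print(hopkins(X, rng=1))
```

On the clustered data above the statistic is close to 1, while replacing X with points drawn uniformly over the same box would pull it towards 0.5.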

clustering techniques If natural patterns exist in the data, how do we group them into meaningful clusters? Many clustering techniques have been developed to solve this problem [Jain and Dubes, 1988]. We call this step clustering techniques. Different clustering techniques are designed based on different cluster models and induction principles [Estivill-Castro, 2002]. For example: (a) different ways of clustering, e.g., hierarchical or partitional [Jain, 2010]; (b) different data types, e.g., text data [Aggarwal and Zhai, 2012], biological data [Jakobsson and Rosenberg, 2007] and multimedia data [Hinneburg and Keim, 1998]; (c) different characteristics of data, e.g., big data [Shirkhorshidi et al., 2014] and high dimensional data [Parsons et al., 2004]; (d) different theoretical bases, e.g., matrix factorization based methods [Xu et al., 2003] and generative model based methods [Dempster et al., 1977]; (e) how many clusterings are to be discovered, e.g., single or multiple [Bailey, 2013].

cluster validity After we generate a clustering, is this clustering meaningful or useful for users? Many different clustering solutions can be generated on the same data by different clustering techniques. In addition, for a given clustering algorithm, for example k-means, many candidate partitions can be discovered depending on different initializations or different parameter settings (e.g., the number of clusters). Among these many different solutions, how do we know which one is the ‘best’? Cluster validity tries to solve this problem. Cluster validation is the procedure for quantitatively evaluating the quality of clusterings. This step is crucial for clustering applications [Jain and Dubes, 1988].

In this thesis, we focus on proposing new solutions and insights for these two questions, that is, clustering techniques and cluster validity.

1.2 scope of the thesis

1.2.1 Clustering Techniques

As there is no precise definition of a ‘cluster’, many clustering algorithms have been proposed based on different cluster models and induction principles [Estivill-Castro, 2002]. Many clustering algorithms can be formulated as optimization problems. They define an objective function which characterizes what the algorithm designers believe is a good clustering. Then the algorithm optimizes this objective function to obtain the best clustering solution. Several typical clustering algorithms are:


- K-means is a popular centroid based clustering algorithm [Jain, 2010]. It has been employed in many applications due to its simplicity and ease of implementation. It partitions the data into several clusters so that each data object belongs to the cluster with the nearest mean (cluster centroid).

- Another popular type of clustering method is hierarchical clustering [Milligan and Cooper, 1986]. The generated clustering contains nested clusters which can be depicted in a tree-like diagram (called a dendrogram), so that the clusters at a lower level are subclusters of those at a higher level. By cutting the tree at a certain level according to the desired number of clusters, a clustering solution is generated.

- DBSCAN is a popular density-based clustering algorithm [Ester et al., 1996]. It identifies data objects located in higher density regions as clusters and data objects in sparse regions as outliers or noise.

- The expectation-maximization (EM) algorithm is a famous clustering algorithm based on statistical distribution models [Dempster et al., 1977]. It assumes the clusters are drawn from certain distribution models, e.g., Gaussian mixture models. The models are then iteratively optimized to find parameter settings that better fit the dataset.

Despite the success of all these approaches, these methods focus on discovering a single ‘best’ clustering solution from the data. However, data can be multi-faceted in nature, especially since nowadays data is becoming increasingly complex and growing in size. The underlying structure of the data can also be complicated. There may be more than one way to cluster the data. For example, Figure 1.1 shows a small dataset consisting of 9 images of objects. They can either be clustered according to their different colors (Figure 1.1a) or according to their different shapes (Figure 1.1b). These two distinctive clustering solutions are both reasonable. Thus, multiple reasonable but distinctive clustering solutions may exist in the same data. In addition, for the clustering task, there is no single ‘best’ answer; a ‘good’ clustering solution is user dependent. Different users may possess different perspectives and requirements for grouping or clustering the data. These different groupings, which correspond to different perspectives of users, can


(a) A clustering of 3 clusters. Objects are grouped according to their different colors.

(b) A clustering of 3 clusters. Objects are grouped according to their different shapes.

Figure 1.1: Two reasonable but distinctive clusterings on the same data.

all be reasonable. Thus, multiple clustering solutions are required. Finally, cluster analysis is an exploratory learning process. Users would like to explore the data and gain different insights, which can help them explore new markets or design new strategies. Thus, it will be helpful if we can provide multiple reasonable and distinctive solutions to assist users in their work. In summary, discovering multiple meaningful and distinctive clusterings is a necessary and important task in the cluster analysis area. This has stimulated considerable recent research on the topic of multiple clustering analysis [Bailey, 2013]. The MultiClust workshops, “Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of Data”, have been successfully held in association with top ranking conferences in recent years, KDD 2010, ECML/PKDD 2011, SDM 2012 and ICML 2013, and have attracted many researchers to this emerging field. In this thesis, we focus on multiple clustering analysis in Chapter 3.

1.2.2 Cluster Validity

Cluster validation is the procedure of evaluating the goodness of a clustering. It is a crucial task in clustering applications. It helps users to select appropriate clustering


parameters, like the number of clusters, or even the appropriate clustering model and algorithm [Jain and Dubes, 1988].

Cluster validity indices (CVIs) are often used to measure the quality of a clustering in

a quantitative way. There have been a large number of cluster validity indices proposed and successfully used [Aggarwal and Reddy, 2013]. They can be generally classified into two categories, external cluster validity indices and internal cluster validity indices [Wu et al., 2009]. They are distinguished in terms of whether or not external information is used during the validation procedure. If the data are labeled, the ground truth partition (class labels) can be used with an external CVI to explore the match between a candidate clustering (partition) and the ground truth partition. When the data are unlabeled (the real case), an important post-clustering question is how to evaluate different candidate partitions. This job falls to internal CVIs.

In this thesis, we focus on external CVIs in Chapters 4 and 5.
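As a concrete instance of an external CVI, the classic Rand index scores a candidate partition against the ground truth using the four pair counts k11, k10, k01 and k00 that underlie the pair-counting measures of Table 5.2. The sketch below is an illustration, not the thesis's own code:

```python
from itertools import combinations

def pair_counts(u, v):
    """k11/k10/k01/k00: pairs grouped together in both / only u / only v / neither."""
    k11 = k10 = k01 = k00 = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:
            k11 += 1
        elif same_u:
            k10 += 1
        elif same_v:
            k01 += 1
        else:
            k00 += 1
    return k11, k10, k01, k00

def rand_index(u, v):
    """Fraction of object pairs on which the two partitions agree."""
    k11, k10, k01, k00 = pair_counts(u, v)
    return (k11 + k00) / (k11 + k10 + k01 + k00)

ground_truth = [0, 0, 0, 1, 1, 1]
candidate = [0, 0, 1, 1, 1, 1]   # one object moved to the other cluster
renamed = [1, 1, 1, 0, 0, 0]     # same grouping, cluster labels swapped

ri_candidate = rand_index(ground_truth, candidate)
ri_renamed = rand_index(ground_truth, renamed)
```

Note the index depends only on how pairs are grouped, not on cluster names, so the relabelled partition still scores a perfect 1.0.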

1.2.3 Discovery of Multiple Clusterings and External Cluster Validity Indices

External CVIs are often called comparison measures; that is, they compare the similarity or dissimilarity between two clusterings. These measures can also be used to measure the similarities between clusterings when discovering multiple distinctive clusterings [Dang and Bailey, 2010a]. They can be used as well to compare clusterings as part of finding a consensus, which can reduce the bias and errors of the individual clusterings [Strehl and Ghosh, 2003].

1.3 focus of the thesis

In this thesis, we tackle some existing challenges in multiple clustering analysis and external cluster validity indices.


1.3.1 Discovery of Multiple Clusterings

Multiple clustering analysis aims to discover a set of reasonable and distinctive clustering solutions from the same dataset. There are two important criteria to be considered when discovering multiple clusterings:

- Clustering quality: the discovered multiple clusterings should be of good quality.

- Clustering dissimilarity: the discovered multiple clusterings are expected to be distinctive from each other.

Many methods have been proposed to discover multiple clusterings based on these two requirements, and one very popular technique is meta-clustering [Caruana et al., 2006; Zhang and Li, 2011]. Meta-clustering generates a large number of base clusterings using different approaches [Caruana et al., 2006]: running different clustering algorithms, running a specific algorithm several times with different initializations, or using random feature weighting in the distance function. These base clusterings may then be grouped into meta-clusters (groups of base clusterings) based on their similarity with each other. Then, (base) clusterings within the same group can be combined using consensus (ensemble) clustering to generate a consensus view of that group. This results in one or more distinctive clustering views of the dataset, each offering a different perspective or explanation.
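The grouping step of this pipeline can be sketched as follows: given base clusterings represented as label lists, group together those whose pairwise (Rand-style) similarity exceeds a threshold. This is a deliberately simplified greedy stand-in, not the actual method of [Caruana et al., 2006]; the names `meta_cluster` and `threshold` are our own.

```python
from itertools import combinations

def similarity(u, v):
    """Fraction of object pairs on which two clusterings agree (Rand-style)."""
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)

def meta_cluster(base, threshold):
    """Greedily group base clusterings that are mutually similar enough."""
    groups = []
    for c in base:
        for g in groups:
            if all(similarity(c, m) >= threshold for m in g):
                g.append(c)
                break
        else:
            groups.append([c])
    return groups

# Toy base clusterings of 4 objects: one view duplicated, one distinct view.
base = [
    [0, 0, 1, 1],   # view A
    [0, 0, 1, 1],   # redundant copy of view A
    [0, 1, 0, 1],   # view B
]
groups = meta_cluster(base, threshold=0.9)
```

Each resulting group would then be combined by consensus (ensemble) clustering into a single clustering view of the dataset.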

main challenges However, a major drawback and challenge with meta-clustering is that its effectiveness is highly dependent on the quality and diversity of the generated base clusterings. Specifically, if the base clusterings are of low quality, then the ensemble step will be influenced by such clusterings and may produce low quality clustering views. In addition, if there are redundant, noisy base clusterings that are similar to one or more of the clustering views, then it is possible that some of the distinct clustering views are mistakenly merged into one, resulting in the loss of interesting clustering views. Another challenge with meta-clustering is the large number of generated clustering views. Depending on the dataset and the base clustering generation mechanism, one may


produce a large number of clustering views. It can be time consuming to examine them all.

contributions To address the problems described above, we propose a new approach, ranked and filtered meta-clustering (rFILTA), aiming at detecting multiple high quality and distinctive clustering views. It achieves this by filtering out low quality and redundant base clusterings, and ranking the resulting clustering views (Chapter 3).

1.3.2 External Cluster Validity Indices

External CVIs evaluate the quality of a clustering by comparing its similarity (dissimilarity) with a ground truth (gold standard). The ground truth is usually generated by an expert in the data domain. It identifies the primary substructure of interest to the expert. This partition provides a benchmark for comparison with candidate partitions. It is believed that the more similar a clustering is to the ground truth, the better its quality. There are many challenges in this area. In this thesis, we target two of them: (a) how to evaluate soft clusterings using external CVIs; and (b) bias problems of external CVIs.

1.3.2.1 External CVIs for Soft Clustering

Most external cluster validity indices are designed for comparing two crisp partitions [Jain and Dubes, 1988]. However, partitions can also be soft, i.e., fuzzy, probabilistic or possibilistic partitions [Anderson et al., 2010]. In crisp / hard clustering, one data object belongs to one cluster exclusively. In soft clustering, one data object can belong to more than one cluster. For example, in Figure 1.2, the dataset contains two clusters. The cluster memberships of data objects are indicated by their colors. A crisp clustering solution is shown in Figure 1.2a, while a soft clustering solution is shown in Figure 1.2b. As we can see from the left figure, the color of each data object is either red or blue, which means that each data object belongs to one cluster exclusively. However, in the right figure, the data objects located between the two clusters are colored between


(a) Example of crisp clustering. Each object belongs to one cluster (red or blue) exclusively.

(b) Example of soft clustering. The objects located between the two clusters (red and blue), colored e.g., aqua/light blue/orange, belong to both clusters.

Figure 1.2: Examples of crisp clustering and soft clustering. Each symbol represents a data object. The colors of objects indicate their cluster membership.

blue and red. If 0 indicates blue and 1 indicates red, the colors between them correspond to fractional membership values between 0 and 1.

main challenges Soft partitions are usually converted to crisp partitions by assigning each object unequivocally to the cluster with the highest membership (fuzzy partitions), probability (probabilistic partitions), or typicality (possibilistic partitions). They are then evaluated by employing the crisp external validity indices. However, this kind of conversion may cause a loss of information [Campello, 2007]. For example, different soft partitions may be converted to the same crisp partition.
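The information loss caused by hardening is easy to demonstrate: below, two clearly different fuzzy partitions, one confident and one very uncertain, harden to the identical crisp partition under the usual maximum-membership rule. A minimal sketch; the matrix convention (rows are clusters, columns are objects) follows the notation used in Chapter 2.

```python
def harden(soft):
    """Maximum-membership hardening: object k goes to argmax_i soft[i][k]."""
    n = len(soft[0])
    return [max(range(len(soft)), key=lambda i: soft[i][k]) for k in range(n)]

# Two different fuzzy partitions of 3 objects over 2 clusters ...
u_confident = [[0.9, 0.8, 0.1],
               [0.1, 0.2, 0.9]]
u_uncertain = [[0.6, 0.55, 0.45],
               [0.4, 0.45, 0.55]]

crisp_a = harden(u_confident)
crisp_b = harden(u_uncertain)
# ... yet both harden to the same crisp partition, losing the distinction.
```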

contributions To solve this problem, we develop external CVIs to evaluate soft clusterings directly, without the need for a hardening step. We generalize a popular class of external CVIs, the information-theoretic based external CVIs, which were developed for crisp clusterings, to evaluate soft clusterings (Chapter 4).
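For reference, the crisp information-theoretic index that such generalizations start from can be sketched as follows: the mutual information between two label vectors, normalized here by the maximum of the two entropies, which is one of several normalization choices, analogous to the NMI variants discussed in Chapter 4. The function name `nmi_max` is our own.

```python
import math
from collections import Counter

def nmi_max(u, v):
    """Crisp normalized mutual information, max-entropy normalization."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    entropy = lambda p: -sum(c / n * math.log(c / n) for c in p.values())
    # I(U;V) = sum over joint cells of p(a,b) * log( p(a,b) / (p(a) p(b)) )
    mi = sum(c / n * math.log(n * c / (pu[a] * pv[b]))
             for (a, b), c in puv.items())
    return mi / max(entropy(pu), entropy(pv))

score_same = nmi_max([0, 0, 1, 1], [1, 1, 0, 0])    # same grouping, renamed
score_indep = nmi_max([0, 1, 0, 1], [0, 0, 1, 1])   # statistically independent
```

The index is 1 for identical groupings regardless of cluster labels, and 0 when the two label vectors carry no information about each other.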


1.3.2.2 Ground Truth Bias of External CVIs

As we introduced earlier, external CVIs are used to evaluate the quality of a clustering by comparing its similarity with the ground truth partition. The general idea of this evaluation methodology is that the more similar a candidate is to the ground truth (a larger value of the similarity measure, i.e., the external CVI), the better this partition approximates the labeled structure in the data.

(a) Ground truth U contains 3 clusters.

(b) Random clustering V1. The similarity score between U and V1 is S(U, V1) = 0.58.

(c) Random clustering V2. The similarity score between U and V2 is S(U, V2) = 0.62.

Figure 1.3: Example of a biased external CVI. V2 is preferred by this external CVI, S, as it gets a higher similarity score. However, both V1 and V2 are random clusterings which show no meaningful clustering structure. S should not show preference for either of the clusterings.


main challenges However, this evaluation methodology implicitly assumes that the similarity measure works correctly, i.e., that a larger similarity score indicates a partition that is really more similar to the ground truth. But this assumption may not always hold. When this assumption is false, the evaluation results and conclusions will be misleading. One of the reasons that could cause the assumption to be false is that the measures have bias issues. That is, some measures are biased towards preferring certain clusterings, i.e., giving them a higher similarity score, even though they are not more similar to the ground truth than the other candidate partitions being evaluated. This can cause misleading results for users of these biased measures. We provide an example in Figure 1.3. The ground truth partition contains 3 clusters, shown in Figure 1.3a. There are two clusterings, V1 and V2, to be evaluated using one external CVI, S. The similarity scores are S(U, V1) = 0.58 and S(U, V2) = 0.62. According to the similarity scores, we would choose V2 as the better solution. However, as we can see from Figures 1.3b and 1.3c, these two clusterings are both random clusterings without any meaningful patterns. A reasonable external CVI should not show preference for either of them.
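This phenomenon is easy to reproduce with a measure that is not corrected for chance. Below, the plain Rand index (a stand-in for the unspecified measure S above, not the thesis's own experiment) gives purely random clusterings systematically higher scores when they use more clusters, one of the bias effects of the kind studied in Chapter 5.

```python
import random
from itertools import combinations

def rand_index(u, v):
    """Fraction of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)

rng = random.Random(0)
ground_truth = [k % 3 for k in range(30)]    # 3 balanced classes

# Average score of 50 purely random clusterings, for 2 vs. 10 random clusters.
scores = {}
for c in (2, 10):
    trials = [rand_index(ground_truth, [rng.randrange(c) for _ in range(30)])
              for _ in range(50)]
    scores[c] = sum(trials) / len(trials)
# Random 10-cluster partitions score higher on average than random
# 2-cluster ones, although neither resembles the ground truth.
```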

contributions Recognizing and understanding the bias behavior of the CVIs is therefore crucial. In this thesis, we study the bias problem of crisp external CVIs in Chapter 5. We identify a new type of bias of external CVIs which has rarely been recognized. In addition, we study the empirical and theoretical implications of this new type of bias.

1.4 structure of the thesis

The structure of the thesis is organized as follows.

Chapter 2: In this chapter, we introduce the background knowledge about clustering techniques and cluster validity. In particular:

- Several representative traditional clustering techniques are reviewed in Section 2.1.

- Related work in multiple clustering analysis is reviewed in Section 2.2.


- Three types of popular external cluster validity indices are reviewed in Section 2.3.1.

- Several popular internal cluster validity indices are reviewed in Section 2.3.2.

Chapter 3: In this chapter, we propose a simple and effective filtering algorithm that can be flexibly used in conjunction with any meta-clustering method. In addition, we propose an unsupervised method to rank the returned clustering views. In particular:

- The proposed framework rFILTA with filtering and ranking methods is introduced in Section 3.3.

- Extensive experimental results with detailed analysis are described in Section 3.5.

This chapter is based on the papers:

(a) Y. Lei, N. X. Vinh, J. Chan and J. Bailey, “FILTA: Better View Discovery from Collections of Clusterings via Filtering”. Published in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2014), pp. 145-160, 2014.

(b) Y. Lei, N. X. Vinh, J. Chan and J. Bailey, “rFILTA: Relevant and Non-Redundant View Discovery from Collections of Clusterings via Filtering and Ranking”. To appear in Knowledge and Information Systems (KAIS).

Chapter 4: In this chapter, eight popular crisp information-theoretic based cluster validity indices (refer to Chapter 2) are generalized to evaluate soft clusterings. Of the eight generalized indices, we advocate a normalized version of the soft mutual information cluster validity index (NMIsM) as the best overall choice. Finally, we provide a theoretical analysis which helps explain the superior performance of NMIsM compared to the other three normalizations of soft mutual information. In particular:

- The soft generalization approach is introduced in Section 4.3.

- The extensive experiments with analysis are presented in Section 4.5.

- The theoretical analysis is provided in Section 4.6.

This chapter is based on the papers:


(a) Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized Information Theoretic Cluster Validity Indices for Soft Clusterings”. Published in Proceedings of the IEEE Symposium Series on Computational Intelligence, pp. 24-31, 2014.

(b) Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Extending Information-Theoretic Validity Indices for Fuzzy Clustering”. To appear in IEEE Transactions on Fuzzy Systems.

Chapter 5: In this chapter, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. We study the empirical and theoretical implications of GT bias. In particular:

- The definitions of several bias problems are presented in Section 5.3.

- Experimental testing of 26 pair-counting based cluster validity indices is presented in Section 5.5.

- Theoretical analysis of why and when GT bias happens is provided in Section 5.6.

This chapter is based on the paper:

(a) Y. Lei, J. C. Bezdek, S. Romano, N. X. Vinh, J. Chan and J. Bailey, “Ground Truth Bias in External Cluster Validity Indices”. Under second round review in Pattern Recognition.

Chapter 6: In this chapter, we summarize the contributions of the thesis and outline possible future directions.

1.5 summary

In this chapter, we reviewed three basic questions to be answered in cluster analysis. We scoped our discussion in this thesis to two of them, that is, clustering techniques and cluster validity (Chapter 2).

Due to the increasing complexity and size of datasets, multiple reasonable and distinctive clusterings might exist in the same data instead of one single clustering. In addition, users usually possess different perspectives and requirements on the same data. Thus, the task of discovering multiple clusterings is crucial, and multiple clustering analysis has been developing rapidly as a result. In this thesis, we introduce a method for generating multiple distinctive clusterings of better quality in Chapter 3.

External cluster validity indices evaluate the quality of clusterings by comparing their similarities with the ground truth. Some challenges exist in this area. We will discuss how to evaluate the quality of soft clusterings with external CVIs in Chapter 4, and the bias problems of external CVIs in Chapter 5.


2 BACKGROUND KNOWLEDGE

Abstract

In this chapter, we review some popular clustering techniques and cluster validity indices. In particular, some representative traditional clustering techniques are introduced in Section 2.1. Then, we present several representative methods in multiple clustering analysis in Section 2.2. Cluster validity indices are generally classified into two categories: external and internal. External cluster validity indices are discussed in Section 2.3.1. Then we review some internal cluster validity indices in Section 2.3.2.

2.1 traditional clustering methods

Clustering aims to divide data objects into groups, so that data objects in the same group are similar, whereas data objects in different groups are dissimilar. Many clustering methods have been developed for this task [Aggarwal and Reddy, 2013]. Typical clustering algorithms can be categorized using many different classifications; e.g., if classified according to the types of generated clustering solutions, these algorithms can be divided into crisp / hard clustering (one data object belongs to one cluster exclusively) and soft clustering (one data object can belong to more than one cluster). Next, we introduce some basic notations. Then, we introduce several representative clustering methods for each type.

2.1.1 Basic Notations

Let O = {o1, . . . , oN} denote N objects, each associated with a vector xi ∈ Rd in the case of numeric data. We seek to group the objects into c clusters. For each object, we associate one or more cluster labels with it. There are four types of class labels we can associate with each object: crisp, fuzzy, probabilistic and possibilistic. Let c denote the number of classes, 1 < c < N. We define three sets¹ of label vectors in Rc as follows:

N_{pc} = \{ y \in \mathbb{R}^c : \forall i \; y_i \in [0, 1], \; \exists j \; y_j > 0 \} \quad (2.1a)

N_{fc} = \{ y \in N_{pc} : \textstyle\sum_{i=1}^{c} y_i = 1 \} \quad (2.1b)

N_{hc} = \{ y \in N_{fc} : \forall i \; y_i \in \{0, 1\} \} \quad (2.1c)

Here, Nhc is the canonical (unit vector) basis of Rc. The i-th vertex of Nhc, i.e., e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T with the 1 in the i-th position, is the crisp label for class i, 1 ≤ i ≤ c.
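The nesting Nhc ⊂ Nfc ⊂ Npc can be checked directly. The small sketch below encodes the three membership tests of Equation (2.1) for plain Python sequences (the function names are our own), using the example label vectors discussed in this section.

```python
def in_npc(y):
    """Possibilistic label (2.1a): entries in [0, 1], at least one positive."""
    return all(0.0 <= yi <= 1.0 for yi in y) and any(yi > 0.0 for yi in y)

def in_nfc(y, eps=1e-9):
    """Fuzzy/probabilistic label (2.1b): possibilistic and entries sum to 1."""
    return in_npc(y) and abs(sum(y) - 1.0) < eps

def in_nhc(y):
    """Crisp label (2.1c): fuzzy and every entry is exactly 0 or 1."""
    return in_nfc(y) and all(yi in (0.0, 1.0) for yi in y)

crisp = (0.0, 1.0, 0.0)           # e_2: the crisp label for class 2
fuzzy = (0.1, 0.2, 0.7)           # in N_f3 but not N_h3
possibilistic = (0.7, 0.3, 0.6)   # in N_p3 but not N_f3 (sum != 1)
```

A crisp label passes all three tests, reflecting the inclusion chain of the three sets.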

The set Nfc is a piece of a hyperplane, and is the convex hull of Nhc. For example, the vector y = (0.1, 0.2, 0.7)^T is a constrained label vector in Nf3; its entries lie between 0 and 1 and sum to 1. There are at least two interpretations for the elements of Nfc. If y comes from a method such as maximum likelihood estimation in mixture decomposition, y is a (usually posterior) probabilistic label, and y_i is interpreted as the probability that, given x, it is generated from the class or component i of the mixture [Titterington et al., 1985]. On the other hand, if y is a label vector for some x ∈ Rd generated by, say, the fuzzy c-means clustering model [Bezdek, 1981], y is a fuzzy label for x, and y_i is interpreted as the membership of x in class i. An important point is that Nfc has the same structure for probabilistic and fuzzy labels.

Finally, Npc = [0, 1]^c \ {0} is the unit (hyper)cube in Rc, excluding the origin. As an example, vectors such as z = (0.7, 0.3, 0.6)^T in Np3 are called possibilistic label vectors, and in this case, z_i is interpreted as the possibility that x is generated from class i. Labels in Npc are produced, e.g., by possibilistic clustering algorithms [Krishnapuram and Keller, 1993]. Note that Nhc ⊂ Nfc ⊂ Npc.

Clustering in unlabeled data is the assignment of one of three types of labels to

each object in O. We define a partition of X on N objects as a c × N matrix U = [U_1 . . . U_k . . . U_N] = [u_{ik}], where U_k denotes the k-th column of U and u_{ik} indicates

¹ For the class label types fuzzy and probabilistic, the same set of label vectors is used; hence we represent both as Nfc.


the degree of membership of object k in cluster i. The label vectors in Equation (2.1) can be used to define three types of c-partitions:

M_{pcN} = \{ U \in \mathbb{R}^{cN} : \forall k \; U_k \in N_{pc}, \; \forall i \; \textstyle\sum_{k=1}^{N} u_{ik} > 0 \} \quad (2.2a)

M_{fcN} = \{ U \in M_{pcN} : \forall k \; U_k \in N_{fc} \} \quad (2.2b)

M_{hcN} = \{ U \in M_{fcN} : \forall k \; U_k \in N_{hc} \} \quad (2.2c)

where MpcN (Equation 2.2a) are possibilistic c-partitions, MfcN (Equation 2.2b) are fuzzy or probabilistic c-partitions, and MhcN (Equation 2.2c) are crisp (hard) c-partitions. For convenience, we call the set MpcN \ MhcN the soft c-partitions of O; it contains the fuzzy, probabilistic and possibilistic c-partitions and excludes the crisp c-partitions.

Next we describe a representative sample of crisp/hard clustering methods.

2.1.2 Crisp / Hard Clustering Methods

k-means clustering The k-means algorithm is a crisp clustering method. It represents each cluster by a centroid and assigns each object to its nearest centroid when producing a clustering. It does this by optimizing the following sum of squared error (SSE) objective:

\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in u_i} \| x - w_i \|^2 \quad (2.3)

where u_i is the i-th cluster in clustering U, and w_i is the cluster centre of u_i. The general procedure of the k-means algorithm is described in Algorithm 2.1.

Algorithm 2.1 K-means Clustering Algorithm
1: Choose k initial centroids from the N data objects.
2: Assign each data object to the cluster with the closest cluster centroid.
3: Recompute the cluster centroids for all clusters.
4: Repeat steps 2 and 3 until the cluster centroids do not change.
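A minimal, self-contained sketch of Algorithm 2.1 (Lloyd's iteration) for lists of coordinate tuples; initialization by sampling k data objects is one simple choice, and the function and variable names are our own.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 1: initial centroids
    for _ in range(iters):
        # step 2: assign each object to its nearest centroid
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        # step 3: recompute each centroid as the mean of its members
        new = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new.append(centroids[i])              # keep an emptied centroid
        if new == centroids:                          # step 4: converged
            break
        centroids = new
    return labels, centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels, centroids = kmeans(points, k=2)
```

On this well-separated toy data the iteration recovers the two groups regardless of which objects are sampled as initial centroids; on harder data the initialization sensitivity discussed below applies.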


The time complexity of k-means is O(Nkdl), where l is the number of iterations and is usually small. Thus, the time complexity of k-means can be considered linear in N if the number of clusters k is much smaller than N.

K-means was proposed over 50 years ago, but it still enjoys wide application due to its simplicity, ease of implementation, high efficiency and empirical success [Jain, 2010]. However, k-means has some limitations which have to be considered when using it: (a) k-means is sensitive to the centroid initialization; it often happens that suboptimal solutions are discovered with poor initialization. One popular way to address this problem is to run k-means multiple times with different initializations and choose the optimal solution. Some meta-clustering methods (refer to Section 2.2.1) make use of this property of k-means to generate diverse and reasonable clusterings. (b) k-means is not good at detecting clusters with non-spherical shapes; thus, for datasets with irregularly shaped clusters, k-means may not be a good choice. (c) k-means tends to generate clusterings with equal-sized clusters; this is also called the uniform effect of k-means [Wu et al., 2009], which arises from its objective function. (d) k-means works well on well-separated clusters rather than overlapping clusters. (e) k-means is sensitive to outliers.

hierarchical clustering Hierarchical clustering is another popular type of clustering method which can detect nested clusters. There are two types of hierarchical clustering approaches, agglomerative and divisive [Jain and Dubes, 1988].

- Agglomerative: This is a bottom-up method. It starts with N clusters, that is, each data object is considered as a cluster; then, at each step, it merges the two closest clusters, until one cluster containing all N data objects remains, or until the k desired clusters are generated.

- Divisive: This is a top-down method. It starts with one cluster which contains all N objects; then, at each step, one cluster is split, until N singleton clusters or the k desired clusters have been produced.

Agglomerative hierarchical clustering is more commonly used. We use it as a representative example; its clustering procedure is described in Algorithm 2.2.


Algorithm 2.2 Agglomerative Hierarchical Clustering Algorithm
1: Allocate all objects to singleton clusters.
2: Compute the proximity matrix among all singleton clusters.
3: Merge the closest two clusters into one cluster.
4: Update the proximity matrix entries involving the merged clusters.
5: Repeat steps 3 and 4 until one cluster remains.
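A minimal sketch of the agglomerative procedure with single linkage (the distance between two clusters is that of their closest pair of members), stopping once k clusters remain. For clarity it recomputes distances each round rather than maintaining a proximity matrix as Algorithm 2.2 does; the names are our own.

```python
import math

def single_linkage(points, k):
    clusters = [[p] for p in points]                  # step 1: singletons
    while len(clusters) > k:
        # steps 2-3: find and merge the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                # step 4: merge and update
    return clusters

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (9.0, 0.0)]
clusters = single_linkage(points, k=3)    # two close pairs plus one singleton
```

Stopping at different values of k corresponds to cutting the dendrogram at different levels, which is the flexibility noted below.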

The time complexity of agglomerative hierarchical clustering can be O(N³), which is quite expensive compared to k-means. In hierarchical clustering, merge or split decisions, once made at any given level in the hierarchy, cannot be undone, hence errors can propagate. The advantages of hierarchical clustering are: (a) it provides users with the flexibility of choosing the number of clusters by reviewing the hierarchy; (b) it does not depend on any specific type of attribute, as it only requires the proximity matrix as input. For example, it has been used to group a set of clusterings into meta-clusters (groups of clusterings) in the meta-clustering method [Caruana et al., 2006].

density-based clustering Density-based clustering methods discover clusters by considering the density of the data space. A high density region is considered to be a cluster. Such regions are separated by sparse areas, and data objects in these (sparse) areas are usually considered as outliers. The most famous and popular method in this category is DBSCAN [Ester et al., 1996]. It labels each object as core, border, or noise; these labels are then used to form clusters. The definitions of these labels are as follows:

- Core objects: An object is a core object if the number of its neighbour objects within its radius Eps exceeds a given threshold, MinPts.

- Border objects: A border object falls within the radius of a core object but is not a core object itself.

- Noise objects: A noise object is any object that is neither a core object nor a border object.

Algorithm 2.3 provides an overview of the DBSCAN algorithm. This type of method does not make assumptions about the number of clusters or their distribution. Thus, it can discover clusters with arbitrary shapes. However, its performance is sensitive to its parameter settings.


Algorithm 2.3 DBSCAN Algorithm
1: Label all objects as core, border, or noise objects.
2: Eliminate noise objects.
3: Put an edge between all core objects that are within Eps of each other.
4: Make each group of connected core objects into a separate cluster.
5: Assign each border object to one of the clusters of its associated core objects.

In the next section, we describe soft clustering methods.

2.1.3 Soft Clustering Methods

fuzzy c-means Fuzzy c-means (FCM) clustering [Bezdek, 1981] can be considered the soft version of k-means. It is similar to k-means, except that each data object o_i can belong to more than one cluster, with its association to cluster u_j expressed by a weight u_ij ∈ [0, 1]. Let U ∈ M_fcN denote a fuzzy clustering. The general procedure of FCM is as follows. First, initialize the weights u_ij for all data objects. After initialization, repeatedly compute the cluster centroids and the association weights u_ij until the stopping criterion is satisfied (e.g., the difference between two successive updates of the centroids is less than a certain threshold ε). FCM minimizes the objective function:

\min_{(U,W)} J_m(U, W; X) = \min_{(U,W)} \Big\{ \sum_{k=1}^{N} \sum_{i=1}^{c} (u_{ik})^m \, \|x_k - w_i\|_A^2 \Big\}    (2.4)

where m ≥ 1 is a parameter, U ∈ M_fcN, W = {w_1, w_2, ..., w_c} is a vector of (unknown) cluster centers (prototypes) with w_i ∈ R^d, 1 ≤ i ≤ c, and \|x\|_A = \sqrt{x^T A x} is any inner product norm. The parameter m in Equation (2.4) controls the amount of uncertainty assigned to each x_k about its membership in each of the c clusters in U. At m = 1, Equation (2.4) defines the HCM (k-means) model; for m > 1, we have fuzzy clusters. Optimal partitions U* of X are taken from pairs (U*, W*) that are local extrema of J_m. If D_{ikA} = \|x_k - w_i\|_A > 0 for all i and k, m > 1, and X contains at least c distinct points, then (U, W) ∈ M_fcN × R^{cd} may minimize J_m only if [Bezdek, 1981]:

u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{D_{ikA}}{D_{jkA}} \right)^{2/(m-1)} \right]^{-1}, \quad 1 ≤ i ≤ c, \; 1 ≤ k ≤ N;    (2.5a)

w_i = \frac{\sum_{k=1}^{N} u_{ik}^m \, x_k}{\sum_{k=1}^{N} u_{ik}^m}, \quad 1 ≤ i ≤ c.    (2.5b)

The most popular algorithm for approximating solutions of Equation (2.4) for m > 1 is Picard iteration through Equations (2.5a) and (2.5b). This type of iteration, often called alternating optimization (AO), simply loops through one cycle of estimates W_{t-1} → U_{t-1} → W_t and then checks \|W_t - W_{t-1}\|_A ≤ ε.
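The AO loop can be sketched directly from Equations (2.5a) and (2.5b); this sketch fixes A = I (the Euclidean norm), and the random initialization and the small floor on the distances are choices of the sketch rather than part of the derivation.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Alternating-optimization sketch of FCM (Eqs. 2.5a and 2.5b, A = I)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Random fuzzy membership matrix U (c x N); each column sums to 1.
    U = rng.random((c, N))
    U /= U.sum(axis=0)
    for _ in range(max_iter):
        Um = U ** m
        # Eq. (2.5b): weighted centroids.
        W = Um @ X / Um.sum(axis=1, keepdims=True)
        # Distances D_ik between centroid w_i and object x_k.
        D = np.linalg.norm(X[None, :, :] - W[:, None, :], axis=-1)
        D = np.maximum(D, 1e-12)            # guard against division by zero
        # Eq. (2.5a), rearranged: u_ik proportional to D_ik^(-2/(m-1)).
        U_new = 1.0 / (D ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() < eps:   # stopping criterion on memberships
            U = U_new
            break
        U = U_new
    return U, W

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
U, W = fuzzy_c_means(X, c=2, m=2.0)
labels = U.argmax(axis=0)   # harden the fuzzy memberships for inspection
```

Hardening `U` by `argmax` recovers a crisp partition, illustrating how the m > 1 fuzzy model relates back to the HCM (k-means) case.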

expectation-maximization (em) algorithm The Expectation-Maximization (EM) algorithm seeks local maxima of the likelihood function of X. Here X is assumed to be a set of N observations drawn i.i.d. from a mixed population of c p-variate probability distributions, which have {p_i} as their prior probabilities (or mixing proportions) and {g(x|i)} as their class-conditional probability density functions (PDFs). The log-likelihood function is:

L(g; X) = \sum_{k=1}^{N} \log f(x_k; g)    (2.6)

where f(x; g) = \sum_{i=1}^{c} p_i \, g(x|i; (w_i, S_i)) is the PDF of the mixture of the c components {p_i g(x|i)}. Here g = (q_1, q_2, ..., q_c), and q_i = (p_i, w_i, S_i) is the unknown parameter vector of component i. Let P = [p_{ik}] denote the Maximum Likelihood (ML) estimate of the matrix of posterior probabilities. When the component densities are Gaussian, (g, P) may maximize L only if:


p_i = \frac{1}{N} \sum_{k=1}^{N} p_{ik}, \quad 1 ≤ i ≤ c;    (2.7a)

w_i = \sum_{k=1}^{N} p_{ik} \, x_k \Big/ \sum_{k=1}^{N} p_{ik}, \quad 1 ≤ i ≤ c;    (2.7b)

S_i = \sum_{k=1}^{N} p_{ik} (x_k - w_i)(x_k - w_i)^T \Big/ \sum_{k=1}^{N} p_{ik}, \quad 1 ≤ i ≤ c;    (2.7c)

p_{ik} = \frac{p_i \, g(x_k|i; (w_i, S_i))}{f(x_k; g)}, \quad 1 ≤ i ≤ c, \; 1 ≤ k ≤ N.    (2.8)

Knowing either P or g allows us to compute the other set of variables (estimates) through Equations (2.7) or (2.8). Alternating optimization (AO) can be used to iterate g_{t-1} → P_{t-1} → g_t. This EM scheme can be interpreted either as a clustering algorithm, i.e., finding a partition P of X, or as a statistical method of parametric estimation, i.e., estimating g.
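The AO iteration for Gaussian components can be transcribed directly from Equations (2.7a)-(2.7c) and (2.8); the evenly strided initialization of the means and the small covariance regularizer are assumptions of this sketch, not part of the derivation.

```python
import numpy as np

def gmm_em(X, c=2, n_iter=50):
    """EM sketch for a c-component Gaussian mixture (Eqs. 2.7a-2.7c and 2.8)."""
    N, d = X.shape
    p = np.full(c, 1.0 / c)               # mixing proportions p_i
    w = X[:: max(N // c, 1)][:c].copy()   # means w_i: strided init (sketch choice)
    S = np.array([np.eye(d)] * c)         # covariances S_i
    for _ in range(n_iter):
        # E-step, Eq. (2.8): posterior probabilities p_ik.
        dens = np.empty((c, N))
        for i in range(c):
            diff = X - w[i]
            maha = np.einsum('nd,de,ne->n', diff, np.linalg.inv(S[i]), diff)
            dens[i] = (p[i] * np.exp(-0.5 * maha)
                       / np.sqrt((2 * np.pi) ** d * np.linalg.det(S[i])))
        P = dens / dens.sum(axis=0)
        # M-step, Eqs. (2.7a)-(2.7c).
        Nk = P.sum(axis=1)
        p = Nk / N                                                   # (2.7a)
        w = P @ X / Nk[:, None]                                      # (2.7b)
        for i in range(c):
            diff = X - w[i]
            S[i] = (P[i, :, None] * diff).T @ diff / Nk[i] \
                   + 1e-6 * np.eye(d)                                # (2.7c) + ridge
    return p, w, S, P

blob1 = np.array([[0, 0], [0.3, 0], [0, 0.3], [0.3, 0.3]])
X = np.vstack([blob1, blob1 + 5.0])
p, w, S, P = gmm_em(X, c=2)
labels = P.argmax(axis=0)   # reading P as a (soft) partition of X
```

Taking the `argmax` of `P` shows the clustering interpretation of EM mentioned above, while `(p, w, S)` is the parametric-estimation reading.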

summary Typical clustering methods, whether crisp or soft clustering approaches, focus on discovering one single ‘best’ clustering solution. However, multiple meaningful and distinctive clustering solutions may exist in the same data simultaneously. Next, we introduce multiple clustering analysis.

2.2 multiple clustering analysis

Multiple clustering analysis aims to discover multiple interesting and distinctive clustering solutions from the data. Many methods have been proposed in this area. The literature uses different terms for multiple clusterings, such as non-redundant clustering [Niu et al., 2010; Gondek and Hofmann, 2004, 2005; Cui et al., 2010] and (multiple) alternative clusterings [Bae and Bailey, 2006; Davidson and Qi, 2008; Qi and Davidson, 2009; Bae et al., 2010; Dang and Bailey, 2010b; Vinh and Epps,


2010; Dang and Bailey, 2010a; Kontonasios and De Bie, 2015; Dang and Bailey, 2015, 2014]. In the following, we give a brief review of the popular methods in this area.

2.2.1 Discovery of Multiple Clusterings

Multiple clustering methods can be categorized from different perspectives, and there are no clear boundaries among the categories. Here, we propose a classification into two types:

- Meta-clustering methods aim at discovering multiple clusterings from a collection of base clusterings;

- Alternative clustering methods discover multiple high quality and dissimilar clusterings by searching the clustering space, guided by defined criteria which characterize alternative solutions.

Meta-clustering aims to find multiple clustering views by generating and evaluating a large set of base clusterings. It is an unsupervised method which is well suited to the initial period of data analysis, and it is popular due to its simplicity and ease of implementation. Several methods have been proposed in this category. The method in [Caruana et al., 2006] first generates base clusterings by either random initialization or random feature weighting, then groups these base clusterings into multiple meta-clusters and presents the meta-clusters to the users for evaluation. Based on this idea, Zhang and Li [Zhang and Li, 2011] proposed a method that extends [Caruana et al., 2006] with consensus clustering in order to capture multiple clustering views. Work in [Phillips et al., 2011] proposed a sampling method for discovering a large set of good quality base clusterings; after that, the k-center clustering method [Gonzalez, 1985] is used to select the k most dissimilar solutions as the clustering views.

For meta-clustering methods, the quality and diversity of the generated clustering views depend on the quality and diversity of the base clusterings. For the generation of base clusterings, there is considerable freedom: users can use any clustering methods to produce diverse (ensured by the different clustering models used in different clustering methods) and good quality (ensured by each clustering model) base clusterings. However, as the base clusterings are generated in an unguided manner, there is a risk of producing


many noisy base clusterings. This raises three primary concerns: (a) noisy base clusterings may cause redundancy among the generated clustering views; (b) noisy base clusterings may cause interesting clustering views to be missed; (c) a large number of clustering views may be generated, requiring much time and effort for users to study.

Alternative clustering methods propose different multi-objective criteria to characterize the quality and dissimilarity of alternative clusterings, which they optimize in order to produce multiple clusterings. We further divide them into several groups: unsupervised alternative clustering methods; alternative clustering with constraints; orthogonal transformation approaches; and information-theoretic approaches.

unsupervised alternative clustering methods The methods in this category build a multi-objective function which characterizes both the quality and the diversity of (multiple) alternative clusterings; optimizing this objective function produces multiple alternative clusterings simultaneously. These methods are classified in terms of the different ways of quantifying the quality and diversity of clusterings.

[Jain et al., 2008] presented two approaches, Dec-kmeans and Conv-EM, for discovering two disparate clusterings simultaneously in an unsupervised manner. The first approach, Dec-kmeans, extends the k-means objective function by adding terms which measure the decorrelation between pairs of clusterings. The Conv-EM approach is a regularized EM method for learning a convolution of mixtures of distributions, where each mixture distribution models a clustering. Similar to Dec-kmeans, Conv-EM incorporates a decorrelation term in the objective function to ensure the dissimilarity of the discovered clusterings. Dec-kmeans is restricted to discovering clusterings with spherical clusters due to the k-means objective function; depending on the assumed distribution, Conv-EM is also constrained to discover a certain type of clusters. These two methods only discover one alternative clustering, and the objective function becomes more complicated if extended to discover multiple solutions.

[Dasgupta and Ng, 2010] proposed an unsupervised method based on spectral clustering. It takes the 2nd to (m+1)th eigenvectors of the Laplacian matrix as the basis of the m views and produces multiple clusterings by applying the 2-means algorithm to each of these eigenvectors, discovering an alternative clustering of 2 clusters per eigenvector. The


distinctiveness of the clusterings is achieved through the orthogonality of these eigenvectors. However, to ensure the quality of these clusterings, the number of clusterings to be generated, m, has to be chosen carefully, because the quality of the clusterings decreases as m increases. Furthermore, the assumption of two clusters per clustering is quite restrictive.

The CAMI algorithm [Dang and Bailey, 2010a] is designed to find two distinctive clusterings simultaneously by extending the EM framework. The quality of the clusterings is achieved by maximizing their log-likelihoods, where each clustering is modelled as a Gaussian mixture model, while the mutual information between the two clusterings is minimized to realize the dissimilarity. This work was extended in [Dang and Bailey, 2015] to discover multiple alternative clusterings (more than two) in a sequential way.

alternative clusterings with constraints The methods in this category are semi-supervised. A new alternative clustering is generated by incorporating certain constraints, which are usually derived from the existing clustering.

The COALA method [Bae and Bailey, 2006] first derives a set of “cannot-link” constraints from the existing clustering to characterize the dissimilarity. Then, based on the agglomerative hierarchical clustering method, pairs of clusters are merged while satisfying criteria which consider both quality and diversity (the latter based on the cannot-link constraints). This is a simple, heuristic approach; however, it is tied to a specific type of hierarchical method and only discovers one alternative solution.

Work in [Qi and Davidson, 2009] proposed a flexible framework to discover an alternative feature space while considering both quality and diversity. It is formulated as a constrained optimization problem. By minimizing an objective function based on the Kullback-Leibler (KL) divergence, which ensures the diversity of clusterings, subject to certain constraints, which ensure the quality of the alternative clustering, it generates a new feature space that not only preserves the inherent characteristics of the data but also presents a different clustering structure with respect to the given clustering. Any algorithm can then be applied to this new feature space to generate an alternative clustering. However, this method only generates one alternative clustering.


The work in [Bae et al., 2010] proposed a clustering comparison measure, named ADCO, based on density profiles. This measure considers the data distribution along each attribute to compare the structure of different clusterings, instead of considering only the membership of points to clusters; this allows ADCO to compare clusterings generated from different datasets. The proposed MAXIMUS algorithm then uses ADCO to discover alternative clusterings iteratively. It first calculates the maximum dissimilarity between the given clustering and the alternative clustering by forming an integer programming model. The generated distribution information is then used as constraints to guide the clustering process towards an alternative clustering.

orthogonal transformation approaches Methods in this category discover multiple alternative clusterings by learning alternative feature spaces. An alternative feature space not only captures the characteristics of the data (quality factor) but also highlights different underlying structures with respect to the existing clustering (diversity factor). Any clustering algorithm can then be applied to the new feature space to discover alternative clusterings.

[Cui et al., 2007] proposed two approaches that iteratively discover feature spaces orthogonal to the existing clustering. The two approaches use different representations to characterize the existing clustering: the first represents the given clustering by its k cluster centroids (generated by the k-means method), while the second characterizes it by a feature subspace obtained by applying principal component analysis (PCA) to the k cluster centres of the existing clustering. An alternative feature space is then learnt by projecting the original data onto a feature space orthogonal to these representations, and a traditional clustering algorithm can be applied to this alternative feature space to discover alternative clusterings. This work was extended in [Cui et al., 2010] to find multiple clusterings by transforming the feature space iteratively. These works might not be appropriate for low dimensional data, due to the dimensionality reduction after feature space projection.

Davidson and Qi [Davidson and Qi, 2008] proposed a transformation based method

known as ADFT. It finds an alternative clustering by learning an alternative feature space with the aid of constraints. It first characterizes the existing clustering by generating a set of constraints, from which a distance matrix is learnt. Another distance function (matrix) is then generated that is alternative to the learnt one. Finally, an alternative feature space is obtained by transforming the original data based on the learnt alternative distance matrix. This method could be used in situations where the dimensionality of the dataset is smaller than the number of clusters.

Work in [Niu et al., 2010] proposed the mSC algorithm, which treats different subspaces as different views. It first clusters the features into different subsets, which are taken as different views. Then, in each view (a feature subspace), a clustering is generated by optimizing a multi-objective function built from the spectral clustering objective and a Hilbert-Schmidt Independence Criterion (HSIC) penalty term, which measures the dependence between different subspaces (views). The final output is a transformed feature space of the original data, on which a clustering algorithm can be applied to find an alternative clustering. This method discovers multiple clusterings simultaneously; a sequential version is proposed in [Niu et al., 2014].

Two approaches proposed in [Dang and Bailey, 2014] aim at finding multiple alternative clusterings by feature space transformation. The first is a PCA based method which incorporates the HSIC technique to measure the dependencies between subspaces while transforming the original feature spaces into new spaces with maximal variances. The second is a spectral clustering based method which applies the kernel discriminant analysis (KDA) technique [Baudat and Anouar, 2000] to characterize the existing clustering in the form of a new feature space with fewer dimensions. It then generates the eigenvectors of the Laplacian matrix of the original feature space subject to new constraints, which form a subspace orthogonal to the new feature space found by KDA. These two methods can be extended to find multiple clusterings.

information-theoretic approaches The methods in this category are based on information theoretic concepts. Generally speaking, these methods build a multi-objective function that characterizes the quality and diversity of clusterings; by optimizing this objective function, multiple alternative clusterings can be


generated. Information-theoretic concepts are employed to quantify the quality and/or diversity of the clusterings.

First, we introduce the information bottleneck (IB) method, on which several methods in this category are based. The IB method is an information theoretic method that can be considered a clustering method. The problem can be described by the following objective function:

\min_{C} \left[ I(X;C) - \beta \, I(Y;C) \right]    (2.9)

where X represents the data objects, Y indicates the relevant features, and C is the desired clustering. The general idea of the IB method is that it generates a clustering C by squeezing the information of X into C while preserving as much information as possible about Y.

The conditional information bottleneck method (CIB) [Gondek and Hofmann, 2003]

and the non-redundant clustering method [Gondek and Hofmann, 2004] generalize the information bottleneck method to find an alternative clustering C that preserves as much information as possible about Y conditioned on the given knowledge Z, i.e., max I(C; Y|Z) subject to certain constraints. However, both methods require the joint distributions of these variables, which may not be available.

Gondek and Hofmann [Gondek and Hofmann, 2005] find non-redundant clusterings using ensemble methods. Given an existing clustering solution, their method produces several local clusterings within each cluster of the given clustering and then combines these local clusterings into an “orthogonal” and high quality clustering of the data by maximizing the mutual information within the ensemble. This method is conceptually simple, can be implemented efficiently, and does not depend on any specific clustering algorithm; however, it can only discover one alternative.

The NACI algorithm proposed by [Dang and Bailey, 2010b] discovers alternative clusterings by optimizing a multi-objective function. Given an existing clustering C−, it ensures the quality of the alternative clustering C by maximizing the mutual information between the clustering C and the data X, I(X, C), whilst ensuring the dissimilarity between C and C− by minimizing their mutual information I(C, C−). To avoid the requirement of the joint distribution for computing the


mutual information, it employs the quadratic mutual information form coupled with the Parzen window, a non-parametric density estimation technique; a Gaussian kernel is placed at each sample when estimating the density distribution. This method can discover an alternative clustering with non-linear boundaries between the clusters.

minCEntropy [Vinh and Epps, 2010] is similar to NACI. It employs conditional entropy (CE) to quantify the quality and dissimilarity of alternative clusterings. Specifically, for computing the conditional entropy, it employs Havrda-Charvat's α structural entropy in combination with the Parzen window density estimation technique. It discovers alternative clusterings sequentially.

Ye et al. [Ye et al., 2015] proposed a method, SmIB, for discovering alternative clusterings based on the multivariate information bottleneck method [Slonim et al., 2006], an extension of the IB method that provides a general principled framework involving multiple variables. The existing clustering is taken as side information while building the multi-objective function, which characterizes the quality and diversity of clusterings using mutual information. They employ the MeanNN differential entropy estimator [Faivishevsky and Goldberger, 2010] to measure the information contained in the data.

summary Alternative clustering discovers high quality and dissimilar views by searching the clustering space guided by criteria about what constitutes an alternative. One may discover alternatives either iteratively or simultaneously [Bae and Bailey, 2006; Cui et al., 2007; Davidson and Qi, 2008; Jain et al., 2008; Vinh and Epps, 2010; Hossain et al., 2013; Niu et al., 2014; Dang and Bailey, 2014, 2015].

Compared with meta-clustering, alternative clustering is more efficient for discovering alternative views. However, it restricts the definition of an alternative to certain objective functions, which may cause the search process to miss some interesting clustering views due to mismatches between the objective function and the underlying view structure. It can be difficult to define an objective function characterizing what an alternative is, especially in the initial period of exploratory data analysis, when there is


little information about the data available. The discussed alternative clustering methods are summarized in Table 2.1.


| Algorithm | Quality Measure | Dissimilarity Measure | Simul. or Seq. | # of Alternative Clusterings | Feature Space |
|---|---|---|---|---|---|
| [Gondek and Hofmann, 2003] | MI | MI | seq. | 1 | whole space |
| [Gondek and Hofmann, 2004] | MI | MI | seq. | 1 | whole space |
| [Gondek and Hofmann, 2005] | MI | MI | seq. | 1 | whole space |
| [Bae and Bailey, 2006] | average linkage | constraints | seq. | 1 | whole space |
| [Cui et al., 2007] | k-means | orthogonal constraints | seq. | 1 | whole space |
| [Davidson and Qi, 2008] | distance function | transformation | seq. | 1 | whole space |
| [Jain et al., 2008] | k-means; EM | decorrelation | simul. | 2 | whole space |
| [Qi and Davidson, 2009] | KL | constraints | seq. | 1 | whole space |
| [Bae et al., 2010] | k-means | similarity measure | seq. | 1 | whole space |
| [Vinh and Epps, 2010] | MI | MI | seq. | ≥2 | whole space |
| [Dang and Bailey, 2010b] | MI | MI | seq. | 1 | whole space |
| [Cui et al., 2010] | k-means; EM | orthogonal constraints | seq. | ≥2 | whole space |
| [Dasgupta and Ng, 2010] | spectral eigen. | orthogonal | simul. | ≥2 | whole space |
| [Dang and Bailey, 2010a] | EM | MI | simul. | 2 | whole space |
| [Niu et al., 2010] | spectral | HSIC | simul. | ≥2 | subspace |
| [Dang and Bailey, 2014] | k-means; spectral | HSIC; orthogonal constraints | seq. | ≥2 | subspace |
| [Dang and Bailey, 2015] | Maximum Likelihood | MI | simul. | ≥2 | whole space |
| [Ye et al., 2015] | MI | MI | seq. | 1 | whole space |

Table 2.1: Summary of Alternative Clustering Methods. CMI is short for conditional mutual information. ‘Simul.’ or ‘Seq.’ indicates whether an approach discovers alternative clusterings simultaneously or sequentially.


2.2.2 Other Related Clustering Methods

There are several topics relevant to the discovery of multiple clusterings; they are briefly reviewed as follows.

2.2.2.1 Ensemble Clustering

Cluster ensemble, or consensus clustering, combines a collection of partitions of the data into a single solution, aiming to improve the quality and stability of the individual clusterings [Strehl and Ghosh, 2003; Topchy et al., 2005; Gionis et al., 2007; Nguyen and Caruana, 2007; Liu et al., 2015]. Strehl and Ghosh [Strehl and Ghosh, 2003] proposed three ensemble clustering methods:

- Cluster-based Similarity Partitioning Algorithm (CSPA). In CSPA, an N × N similarity matrix is derived from the base clusterings to be combined, where N is the number of data points. The similarity between two data points is defined as the fraction of base clusterings in which the two objects are in the same cluster. An ensemble solution is then generated by applying the graph partitioning algorithm METIS [Karypis and Kumar, 1998] to the derived similarity matrix. CSPA is the simplest heuristic of the three, but its computational and storage complexity are both quadratic in N; the following two methods are less computationally expensive.

- Hyper-Graph Partitioning Algorithm (HGPA). The base clusterings are represented as a hypergraph, with each cluster mapped to a hyperedge. The HGPA algorithm treats ensemble clustering as a hypergraph partitioning problem and produces an ensemble solution by cutting a minimal number of hyperedges. The hypergraph partitioning package HMETIS [Karypis et al., 1999] is used here.

- Meta-CLustering Algorithm (MCLA). The meta-clustering algorithm² (MCLA) is based on clustering clusters. MCLA

² Note that the meta-clustering here is different from the meta-clustering method introduced in multiple clustering analysis (Section 2.2.1). In this thesis, except here, ‘meta-clustering’ always refers to the multiple clustering method.


groups and collapses related hyperedges (using the graph partitioning package METIS), then assigns each data object to the collapsed hyperedge in which it participates most strongly.
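The CSPA similarity matrix described above is easy to sketch. Since METIS is an external package, this sketch substitutes a simple thresholded connected-components step for the graph partitioning; that substitution, and the 0.5 threshold, are assumptions of the sketch.

```python
import numpy as np

def coassociation_matrix(base_clusterings):
    """CSPA-style N x N similarity: fraction of base clusterings in which
    each pair of objects falls in the same cluster."""
    B = np.asarray(base_clusterings)                  # shape (n_clusterings, N)
    return (B[:, :, None] == B[:, None, :]).mean(axis=0)

def consensus_labels(S, threshold=0.5):
    """Connected components of the graph {(i, j) : S_ij > threshold}
    (a stand-in for METIS partitioning, an assumption of this sketch)."""
    N = len(S)
    labels = np.full(N, -1)
    cid = 0
    for i in range(N):
        if labels[i] != -1:
            continue
        stack, labels[i] = [i], cid
        while stack:
            j = stack.pop()
            for k in np.flatnonzero(S[j] > threshold):
                if labels[k] == -1:
                    labels[k] = cid
                    stack.append(k)
        cid += 1
    return labels

base = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [1, 1, 0, 0]])   # third clustering relabels but agrees on structure
S = coassociation_matrix(base)
```

The third base clustering uses different cluster ids but induces the same co-memberships, illustrating that CSPA is insensitive to cluster labelling.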

However, instead of combining all the available clusterings into one single solution, it has been demonstrated that a better clustering can often be achieved by combining only a subset of the available solutions [Fern and Lin, 2008; Azimi and Fern, 2009; Naldi et al., 2013]; this is the cluster ensemble selection problem. It has been shown that quality and diversity are two important factors that influence the performance of a cluster ensemble [Fern and Lin, 2008; Hadjitodorov et al., 2006; Naldi et al., 2013]. Cluster ensemble and cluster ensemble selection methods typically focus on discovering a single high quality solution from a collection of clusterings, rather than multiple solutions.

2.2.2.2 Multiview Clustering

Multiview clustering can be considered the reverse process of multiple clustering discovery. The analyzed dataset typically has multiple sources or is described by multiple representations. Multiview clustering focuses on establishing a consensus solution which reflects a comprehensive understanding of the multiple views and is more stable and robust than a clustering obtained from a single view. Several works have been developed on this topic.

Generally speaking, there are two ways of proceeding with multiview clustering. The first is centralized algorithms [Chaudhuri et al., 2009; Zhou and Burges, 2007; Wiswedel et al., 2010; Kailing et al., 2004; de Sa, 2005], which deal with the multiple representations simultaneously to seek the solution. The other type is distributed [Long et al., 2008; Topchy et al., 2005; Boulis and Ostendorf, 2004; Januzaj et al., 2004]: one clustering is learnt from each view individually, and a consensus clustering is then generated by combining these individual clusterings from the multiple views. [Hua and Pei, 2012] study a novel problem of mining mutual subspace clusters from multiple sources; this method integrates subspace clustering into multiview clustering.


2.2.2.3 Subspace Clustering

Traditional clustering algorithms consider the full feature space of an input dataset to explore its underlying structure. However, for high dimensional data, it is difficult to detect meaningful clusters due to the so-called “curse of dimensionality” [Kriegel et al., 2009]: many of the attributes are often irrelevant, which can hide existing clusters, and the distances between pairs of data points become increasingly similar as the dimensionality grows [Müller et al., 2009b]. Subspace clustering is a popular approach to this high-dimensionality problem and has developed rapidly in recent years [Assent et al., 2007; Moise and Sander, 2008; Müller et al., 2009a; Günnemann et al., 2009; Sim et al., 2013].

Subspace clustering attempts to localize all the subspaces in which meaningful clusters may exist. Each cluster is associated with a set of relevant features (dimensions), and clusters may exist in multiple, possibly overlapping subspaces [Parsons et al., 2004]. A subspace clustering algorithm requires a search method for localizing the subspaces that meaningful clusters may fit in, together with a definition of the cluster model; different subspace clustering methods differ in their search strategies and cluster model definitions. Some well known examples include CLIQUE [Agrawal et al., 1998], MAFIA [Nagesh et al., 2001] and ENCLUS [Cheng et al., 1999].

Some methods connect subspace clustering with alternative clustering [Günnemann et al., 2010, 2012; Niu et al., 2012]. These works take a collection of diverse subspaces as different views and hypothesize that different views may reveal different clustering structures.

2.2.2.4 Constrained Clustering

Constrained clustering, or clustering with constraints, is a class of semi-supervised learning algorithms [Basu et al., 2008]. It incorporates prior knowledge about the data, in the form of constraints, into the clustering process with the aim of improving clustering results. There are different types of constraints [Davidson and Ravi, 2007]: for example, must-link and cannot-link are instance level constraints which indicate the relationship between data objects, that is, whether data objects should be assigned to the same cluster or to different clusters; there are also cluster level constraints which influence the inter-cluster distances and cluster composition. There are likewise different ways of making use of the constraints: some methods attempt to satisfy the constraints while performing the clustering process, while others first learn a new distance measure based on the constraints, which is then used in the clustering algorithm.

Some methods have incorporated the concept of constraints into the process of discovering multiple alternative clusterings [Bae and Bailey, 2006; Davidson and Qi, 2008; Qi and Davidson, 2009]. The general idea is that constraints can be derived from the existing or given clustering [Bae and Bailey, 2006; Davidson and Qi, 2008]; the alternative clustering is then generated by satisfying these constraints while performing the clustering process.

Summary of Clustering Techniques

We have reviewed a few representative traditional clustering techniques, recent developments in multiple clustering analysis, and several types of clustering approaches that relate to multiple clustering analysis. We want to emphasize the necessity and importance of the multiple clustering analysis task. Next, we review another important question in cluster analysis, namely cluster validity.

2.3 cluster validity

Cluster validation is the procedure of quantitatively evaluating the quality of a clustering. Cluster validity indices (CVIs) are often used to measure this quality. A large number of cluster validity measures have been proposed and successfully used for this task. They can be generally classified into two categories, external cluster validity indices and internal cluster validity indices [Jain and Dubes, 1988], which are distinguished by whether or not external information is used during the validation procedure. Next, we present a brief review of these two types of validation measures.


Table 2.2: Contingency table based on partitions U ∈ M_hrN and V ∈ M_hcN, with n_{ij} = |u_i ∩ v_j|.

Cluster      v_1    v_2    ...   v_c   | Sums
u_1          n_11   n_12   ...   n_1c  | a_1
u_2          n_21   n_22   ...   n_2c  | a_2
...          ...    ...    ...   ...   | ...
u_r          n_r1   n_r2   ...   n_rc  | a_r
Sums         b_1    b_2    ...   b_c   | \sum_{ij} n_{ij} = N

2.3.1 External Cluster Validity Indices

External CVIs (or comparison measures) are often considered as similarity (or dissimilarity) measures between the ground truth partition and a candidate partition. The ground truth partition, usually produced by an expert in the data domain, identifies the primary substructure of interest to the expert and provides a benchmark for comparison with candidate partitions. The underlying assumption is that the more similar the evaluated clustering is to the ground truth, the better its quality. Next, we introduce an important concept which is the basis for external CVIs.

Let U and V be two crisp partitions, U ∈ M_hrN and V ∈ M_hcN. We denote the clusters in U and V as {u_1, ..., u_r} and {v_1, ..., v_c}. An r × c contingency table can be constructed to summarize the overlap between pairs of clusters in the compared clusterings U and V, as shown in Table 2.2. Note that the numbers of clusters in U and V need not be equal, i.e., r ≠ c is allowed. The entry n_{ij} = |u_i ∩ v_j| is the number of objects shared by clusters u_i and v_j. The row sum a_i is the number of objects in cluster u_i, and the column sum b_j is the number of objects in cluster v_j. Many external CVIs are built upon the contingency table.
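The contingency table can be computed directly; the following is a minimal sketch using scikit-learn's `contingency_matrix` helper, assuming the two clusterings are given as label arrays:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Two example partitions of N = 6 objects, given as label arrays.
U = np.array([0, 0, 0, 1, 1, 1])  # r = 2 clusters
V = np.array([0, 0, 1, 1, 2, 2])  # c = 3 clusters

n = contingency_matrix(U, V)  # r x c matrix with entries n_ij = |u_i ∩ v_j|
a = n.sum(axis=1)             # row sums a_i = |u_i|
b = n.sum(axis=0)             # column sums b_j = |v_j|

print(n)  # [[2 1 0]
          #  [0 1 2]]
print(a, b)  # [3 3] [2 2 2]
```

These row and column sums are exactly the a_i and b_j used throughout the formulas below.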

pair-counting based measures Pair-counting based CVIs are a group of popular measures based on counting the agreements and disagreements between two crisp partitions in terms of pairs of objects. The pairs of objects are divided into four groups [Rand, 1971]:

- k11, the number of pairs that are in the same cluster in both U and V ;


- k00, the number of pairs that are in different clusters in both U and V ;

- k10, the number of pairs that are in the same cluster in U but in different clusters in V;

- k01, the number of pairs that are in different clusters in U but in the same cluster in V;

where k_{11} + k_{10} + k_{01} + k_{00} = \binom{N}{2}. These four counts can be computed from the contingency table as [Rand, 1971]:

k_{11} = \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij}(n_{ij} - 1)    (2.10)

k_{00} = \frac{1}{2} \left( N^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij}^2 - \left( \sum_{i=1}^{r} a_i^2 + \sum_{j=1}^{c} b_j^2 \right) \right)    (2.11)

k_{10} = \frac{1}{2} \left( \sum_{i=1}^{r} a_i^2 - \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij}^2 \right)    (2.12)

k_{01} = \frac{1}{2} \left( \sum_{j=1}^{c} b_j^2 - \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij}^2 \right)    (2.13)

where a_i is the ith row sum and b_j is the jth column sum of the contingency table (Table 2.2). The sum k_{11} + k_{00} is interpreted as the total number of agreements between U and V, and the sum k_{10} + k_{01} as the total number of disagreements. Pair-counting based external CVIs are computed from these four counts. We introduce four popularly used pair-counting based CVIs.

Rand Index (RI) [Rand, 1971] is a popular pair-counting based CVI and is often used for comparing the similarity between two clusterings. It ranges in [0, 1].

RI(U, V) = \frac{k_{11} + k_{00}}{k_{11} + k_{10} + k_{01} + k_{00}}    (2.14)

However, the value of RI often lies within a narrow range of [0.5, 1]. In addition, the baseline value of RI does not take on a constant value [Vinh et al., 2010]. The baseline value is the expected value of the similarity measure when comparing two independent clusterings, and ideally it should be constant. Thus, an adjusted version of RI, the Adjusted Rand Index (ARI) [Hubert and Arabie, 1985], was proposed:

ARI(U, V) = \frac{k_{11} - \frac{(k_{11}+k_{10})(k_{11}+k_{01})}{k_{11}+k_{10}+k_{01}+k_{00}}}{\frac{(k_{11}+k_{10}) + (k_{11}+k_{01})}{2} - \frac{(k_{11}+k_{10})(k_{11}+k_{01})}{k_{11}+k_{10}+k_{01}+k_{00}}}    (2.15)

Jaccard Index (JI) [Jaccard, 1908] is another popular pair-counting based CVI. It is similar to RI but neglects the term k_{00}, which is considered 'neutral', i.e., neither agreement nor disagreement.

JI(U, V) = \frac{k_{11}}{k_{11} + k_{10} + k_{01}}    (2.16)

Mirkin metric [Mirkin and Chernyi, 1970] is defined as:

Mirkin(U, V) = 2(k_{10} + k_{01}) = N(N - 1)(1 - RI(U, V))    (2.17)

It corresponds to the Hamming distance between the binary vector representations of the two partitions, and is a simple transformation of the RI.
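As a sanity check on Equations 2.10-2.17, the pair counts and the resulting indices can be computed directly from the contingency table. The following sketch (the helper name `pair_counts` is ours) cross-checks the ARI formula against scikit-learn's `adjusted_rand_score`:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def pair_counts(U, V):
    """k11, k10, k01, k00 of Equations 2.10-2.13, from the contingency table."""
    n = contingency_matrix(U, V)
    N = n.sum()
    a, b = n.sum(axis=1), n.sum(axis=0)
    sum_nij2 = (n ** 2).sum()
    k11 = (n * (n - 1)).sum() // 2
    k10 = (np.sum(a ** 2) - sum_nij2) // 2
    k01 = (np.sum(b ** 2) - sum_nij2) // 2
    k00 = (N ** 2 + sum_nij2 - np.sum(a ** 2) - np.sum(b ** 2)) // 2
    return k11, k10, k01, k00

U = [0, 0, 0, 1, 1, 1]
V = [0, 0, 1, 1, 2, 2]
k11, k10, k01, k00 = pair_counts(U, V)
N = len(U)
assert k11 + k10 + k01 + k00 == N * (N - 1) // 2  # all (N choose 2) pairs

RI = (k11 + k00) / (k11 + k10 + k01 + k00)        # Eq. 2.14
JI = k11 / (k11 + k10 + k01)                      # Eq. 2.16
mirkin = 2 * (k10 + k01)                          # Eq. 2.17

# ARI (Eq. 2.15), checked against scikit-learn's implementation:
exp = (k11 + k10) * (k11 + k01) / (k11 + k10 + k01 + k00)
ARI = (k11 - exp) / (((k11 + k10) + (k11 + k01)) / 2 - exp)
assert np.isclose(ARI, adjusted_rand_score(U, V))
```

For this example the counts are (k11, k10, k01, k00) = (2, 4, 1, 8), giving RI = 10/15 and Mirkin = 10 = N(N-1)(1-RI).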

information-theoretic based measures Information-theoretic (IT) based measures are built upon fundamental concepts from information theory [Cover and Thomas, 2012], and are a commonly used approach for crisp clustering comparison [Meilă, 2007; Strehl and Ghosh, 2003]. Given two crisp partitions U ∈ M_hrN and V ∈ M_hcN, IT based CVIs can also be computed from the contingency table built on U and V (Table 2.2). Before introducing the measures, we first describe several basic information-theoretic concepts.

Entropy measures the uncertainty of a random variable. In the context of clustering, each crisp clustering U ∈ M_hrN is associated with a discrete random variable taking r values, and H(U) denotes the entropy of the random variable associated with clustering U.


The entropy of partition U is defined as:

H(U) = -\sum_{i=1}^{r} p(u_i) \log p(u_i)    (2.18)

where p(u_i) is the probability that an object falls into cluster u_i. If each object is equally likely to be picked, then p(u_i) = |u_i|/N = a_i/N.

The joint entropy H(U, V) and mutual information (MI) I(U, V) between partitions

U and V can be written as:

H(U, V) = -\sum_{i=1}^{r} \sum_{j=1}^{c} p(u_i, v_j) \log p(u_i, v_j)

I(U, V) = \sum_{i=1}^{r} \sum_{j=1}^{c} p(u_i, v_j) \log \frac{p(u_i, v_j)}{p(u_i) p(v_j)}

where p(u_i, v_j) = |u_i ∩ v_j|/N = n_{ij}/N is the joint probability that an object falls into cluster u_i in clustering U and cluster v_j in clustering V. Intuitively, the mutual information measures the information shared between clusterings U and V: the more information they share, the larger the MI value. MI is upper-bounded by the following quantities [Vinh et al., 2010]:

I(U, V) \leq \min\{H(U), H(V)\} \leq \sqrt{H(U) H(V)} \leq \frac{1}{2}(H(U) + H(V)) \leq \max\{H(U), H(V)\} \leq H(U, V)    (2.19)
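These quantities are straightforward to compute from the contingency table. A small sketch (assuming label-array inputs; scikit-learn's `mutual_info_score`, which uses natural logarithms, serves as a cross-check):

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def entropies_and_mi(U, V):
    """H(U), H(V) (Eq. 2.18) and I(U, V) from the contingency table, natural log."""
    p = contingency_matrix(U, V).astype(float)
    p /= p.sum()                  # joint probabilities p(u_i, v_j) = n_ij / N
    pu, pv = p.sum(axis=1), p.sum(axis=0)
    H_U = -np.sum(pu * np.log(pu))
    H_V = -np.sum(pv * np.log(pv))
    nz = p > 0                    # 0 log 0 is taken as 0
    I = np.sum(p[nz] * np.log(p[nz] / np.outer(pu, pv)[nz]))
    return H_U, H_V, I

U = [0, 0, 0, 1, 1, 1]
V = [0, 0, 1, 1, 2, 2]
H_U, H_V, I = entropies_and_mi(U, V)
assert np.isclose(I, mutual_info_score(U, V))
# The chain of bounds in Eq. 2.19, using H(U,V) = H(U) + H(V) - I(U,V):
assert I <= min(H_U, H_V) <= np.sqrt(H_U * H_V) <= 0.5 * (H_U + H_V) \
       <= max(H_U, H_V) + 1e-12 <= H_U + H_V - I + 1e-12
```

For this example H(U) = log 2 and H(V) = log 3, since the clusters are equally sized.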

Different variants of normalized mutual information (NMI) have been proposed; they are distinguished by their normalization factors, all of which aim at scaling MI to the range [0, 1]. For example,

NMI_{max}(U, V) = \frac{I(U, V)}{\max\{H(U), H(V)\}}    (2.20)


As the baseline values of the NMI variants are not constant, adjusted-for-chance versions of these variants have been proposed [Vinh et al., 2010]. Taking NMI_{max} as an example:

AMI_{max}(U, V) = \frac{I(U, V) - E\{I(U, V)\}}{\max\{H(U), H(V)\} - E\{I(U, V)\}}    (2.21)

The AMI is 1 when the two clusterings are identical and 0 when any commonality between the clusterings is due to chance. Based on the hypergeometric model [Vinh et al., 2010], the expected mutual information between U and V, E{I(U, V)}, is computed as:

E\{I(U, V)\} = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{n_{ij} = \max(a_i + b_j - N, 0)}^{\min(a_i, b_j)} \frac{n_{ij}}{N} \log\left(\frac{N \cdot n_{ij}}{a_i b_j}\right) \frac{a_i! \, b_j! \, (N - a_i)! \, (N - b_j)!}{N! \, n_{ij}! \, (a_i - n_{ij})! \, (b_j - n_{ij})! \, (N - a_i - b_j + n_{ij})!}    (2.22)
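Equation 2.22 can be implemented directly; using log-gamma avoids overflow in the factorials. This is an illustrative sketch (the function name `expected_mi` is ours), which can be verified against scikit-learn's `adjusted_mutual_info_score` with the max normalizer:

```python
from math import exp, lgamma, log

def expected_mi(a, b):
    """E{I(U, V)} under the hypergeometric model (Eq. 2.22).
    a, b: row and column sums of the contingency table; N = sum(a) = sum(b)."""
    N = sum(a)
    emi = 0.0
    for ai in a:
        for bj in b:
            # the n_ij = 0 term contributes nothing, so start at max(.., 1)
            for nij in range(max(ai + bj - N, 1), min(ai, bj) + 1):
                # log of a_i! b_j! (N-a_i)! (N-b_j)!
                #        / (N! n_ij! (a_i-n_ij)! (b_j-n_ij)! (N-a_i-b_j+n_ij)!)
                lw = (lgamma(ai + 1) + lgamma(bj + 1)
                      + lgamma(N - ai + 1) + lgamma(N - bj + 1)
                      - lgamma(N + 1) - lgamma(nij + 1)
                      - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                      - lgamma(N - ai - bj + nij + 1))
                emi += (nij / N) * log(N * nij / (ai * bj)) * exp(lw)
    return emi
```

For the running 2 × 3 example (a = [3, 3], b = [2, 2, 2]), substituting this E{I} into Equation 2.21 together with I(U, V) and max{H(U), H(V)} reproduces scikit-learn's AMI value.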

Although AMI has a constant baseline, it suffers from selection bias: it is biased towards selecting clusterings containing more clusters when compared against a reference clustering. Standardized mutual information (SMI) has been proposed to solve this problem [Romano et al., 2014]:

SMI(U, V) = \frac{I(U, V) - E\{I(U, V)\}}{\sqrt{Var(I(U, V))}}    (2.23)

The variance of the mutual information between U and V under the hypergeometric hypothesis is Var(I(U, V)) = E\{I^2(U, V)\} - (E\{I(U, V)\})^2.

The variation of information (VI) [Meilă, 2007] is a dissimilarity measure. It has been proved to be a true metric on the space of clusterings.

VI(U, V) = H(U, V) - I(U, V)    (2.24)


The normalized version of VI (NVI) ranges in [0, 1]; it is the normalized distance measure equivalent to NMI_{joint}:

NVI(U, V) = 1 - \frac{I(U, V)}{H(U, V)}    (2.25)
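All of the information-theoretic indices above are available in, or easily derived from, scikit-learn. A brief sketch; the VI and NVI lines use the identities H(U, V) = H(U) + H(V) - I(U, V) and I(U, U) = H(U):

```python
from sklearn.metrics import (adjusted_mutual_info_score, mutual_info_score,
                             normalized_mutual_info_score)

U = [0, 0, 0, 1, 1, 1]
V = [0, 0, 1, 1, 2, 2]

nmi = normalized_mutual_info_score(U, V, average_method='max')  # Eq. 2.20
ami = adjusted_mutual_info_score(U, V, average_method='max')    # Eq. 2.21

I = mutual_info_score(U, V)    # natural-log mutual information
H_U = mutual_info_score(U, U)  # I(U, U) = H(U)
H_V = mutual_info_score(V, V)
H_UV = H_U + H_V - I           # joint entropy H(U, V)
vi = H_UV - I                  # Eq. 2.24
nvi = 1 - I / H_UV             # Eq. 2.25

assert abs(adjusted_mutual_info_score(U, U) - 1.0) < 1e-10  # identical clusterings
```

Note that since E{I} ≥ 0, the adjusted value never exceeds the corresponding normalized value.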

set matching based measures Set matching based measures compare the similarity (or dissimilarity) between clusterings U and V by considering the matching between clusters from the two clusterings.

The F measure is a popular set matching based CVI and is often used in the field of document clustering [Wu et al., 2010]. Assume U is the ground truth and V is the generated clustering to be evaluated. The F measure between class u_i in U and cluster v_j in V evaluates how well cluster v_j matches class u_i, and is computed as:

F(u_i, v_j) = \frac{2 n_{ij}}{a_i + b_j}    (2.26)

The F measure between two partitions U and V is computed by taking the weighted sum, over the classes u_i in U, of the maximum F measure over the clusters v_j in V:

F(U, V) = \frac{1}{N} \sum_{i=1}^{r} a_i \max_j F(u_i, v_j)    (2.27)

The Van Dongen measure [Van Dongen, 2000] is a symmetric measure. It is based on the maximum intersections of clusters and is defined as follows:

VD(U, V) = 2N - \sum_{i=1}^{r} \max_j n_{ij} - \sum_{j=1}^{c} \max_i n_{ij}    (2.28)

Set matching based measures suffer from two problems: (a) the numbers of clusters in the two clusterings may differ, which is problematic because some clusters are then left out of consideration; and (b) even when the numbers of clusters are the same, these measures do not consider the unmatched parts of the matched clusters. We summarize the discussed external CVIs in Table 2.3.
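Both set matching measures follow directly from the contingency table; a short sketch (the helper names are ours):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def f_measure(U, V):
    """Eq. 2.27: weighted best-match F between classes of U and clusters of V."""
    n = contingency_matrix(U, V).astype(float)
    N = n.sum()
    a = n.sum(axis=1, keepdims=True)  # class sizes a_i (column vector)
    b = n.sum(axis=0, keepdims=True)  # cluster sizes b_j (row vector)
    F = 2 * n / (a + b)               # F(u_i, v_j) = 2 n_ij / (a_i + b_j)
    return (a.ravel() * F.max(axis=1)).sum() / N

def van_dongen(U, V):
    """Eq. 2.28: VD(U, V) = 2N - sum_i max_j n_ij - sum_j max_i n_ij."""
    n = contingency_matrix(U, V)
    N = n.sum()
    return 2 * N - n.max(axis=1).sum() - n.max(axis=0).sum()

U = [0, 0, 0, 1, 1, 1]
V = [0, 0, 1, 1, 2, 2]
print(round(float(f_measure(U, V)), 3))  # 0.8
print(int(van_dongen(U, V)))             # 3
```

The F measure is asymmetric (it treats U as the ground truth), whereas the Van Dongen measure is symmetric in U and V.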


Table 2.3: A summary of external cluster validity indices. U and V are two clusterings; a_i and b_j are the row and column sums of the contingency table (Table 2.2); n_{ij} is the entry of the contingency table; k_{11}, k_{10}, k_{01} and k_{00} are the counts of the four types of pairs of objects (Equations 2.10-2.13).

External Cluster Validity Index      | Type                  | Equation
Rand Index                           | Pair-counting         | (k_{11} + k_{00}) / (k_{11} + k_{10} + k_{01} + k_{00})
Adjusted Rand Index                  | Pair-counting         | see Equation 2.15
Jaccard Index                        | Pair-counting         | k_{11} / (k_{11} + k_{10} + k_{01})
Mirkin metric                        | Pair-counting         | 2(k_{10} + k_{01})
Mutual Information                   | Information-theoretic | \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}}{N} \log \frac{N \cdot n_{ij}}{a_i b_j}
Normalized Mutual Information        | Information-theoretic | I(U,V) / \max\{H(U), H(V)\}
Adjusted Mutual Information          | Information-theoretic | (I(U,V) - E\{I(U,V)\}) / (\max\{H(U), H(V)\} - E\{I(U,V)\})
Standardized Mutual Information      | Information-theoretic | (I(U,V) - E\{I(U,V)\}) / \sqrt{Var(I(U,V))}
Variation of Information             | Information-theoretic | H(U,V) - I(U,V)
Normalized Variation of Information  | Information-theoretic | 1 - I(U,V) / H(U,V)
F Measure                            | Set matching          | \frac{1}{N} \sum_{i=1}^{r} a_i \max_j \frac{2 n_{ij}}{a_i + b_j}
Van Dongen Measure                   | Set matching          | 2N - \sum_{i=1}^{r} \max_j n_{ij} - \sum_{j=1}^{c} \max_i n_{ij}


2.3.2 Internal Cluster Validity Indices

Internal cluster validation measures evaluate the quality of clusterings based only on the data itself. Many internal cluster validity indices have been proposed [Liu et al., 2010]. Most of them build the index by considering two factors: cluster cohesion and cluster separation. We introduce several popular internal CVIs in the following.

Dunn Index (DI) uses the minimum pairwise distance between objects in different clusters as the inter-cluster separation and the maximum diameter among all clusters as the intra-cluster compactness. It takes the ratio of separation over cohesion:

DI(U) = \frac{\min_{i \neq j} \{\delta(u_i, u_j)\}}{\max_{1 \leq l \leq r} \{\Delta(u_l)\}}    (2.29)

where \delta(u_i, u_j) = \min_{x \in u_i, y \in u_j} d(x, y), \Delta(u_l) = \max_{x, y \in u_l} d(x, y) is the diameter of cluster u_l, and d(x, y) is the distance between data objects x and y. A larger DI value is preferred.

Davies-Bouldin index (DB) is computed based on a ratio of intra-cluster and inter-cluster distances. First, for clusters u_i and u_j, we define

cluster distances. First we can define, for cluster ui and uj ,

D_{i,j} = \frac{\bar{d}_i + \bar{d}_j}{d_{i,j}}    (2.30)

where \bar{d}_i = \frac{1}{n_i} \sum_{x \in u_i} d(x, w_i) and w_i is the centroid of cluster u_i; \bar{d}_i measures the intra-cluster cohesion. d_{i,j} = d(w_i, w_j) measures the inter-cluster separation. Then we can get

DB(U) = \frac{1}{r} \sum_{i=1}^{r} \max_{j \neq i} \{D_{i,j}\}    (2.31)

A smaller DB value is preferred.


Silhouette Coefficient (SC) [Rousseeuw, 1987] measures how similar an object is to its own cluster compared to other clusters. It ranges in [-1, 1], and a higher SC value is desired. For any data object x in cluster u_i, we define

s(x) = \frac{b(x) - a(x)}{\max(b(x), a(x))}    (2.32)

where a(x) = \frac{1}{n_i - 1} \sum_{y \in u_i, y \neq x} d(x, y) and b(x) = \min_{j \neq i} \frac{1}{n_j} \sum_{y \in u_j} d(x, y). a(x) measures how similar x is to its own cluster, and a small value means x is well matched; b(x) measures how similar x is to its closest neighbouring cluster, and a larger value is desired. For a clustering U, we obtain

SC(U) = \frac{1}{r} \sum_{i=1}^{r} \left( \frac{1}{n_i} \sum_{x \in u_i} s(x) \right)    (2.33)
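The internal indices above can be computed with scikit-learn (Davies-Bouldin, and the silhouette averaged over all objects) together with a small manual Dunn index; a sketch on synthetic blobs. Note that scikit-learn's `silhouette_score` averages s(x) over all objects rather than taking the per-cluster average of Equation 2.33:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

def dunn_index(X, labels):
    """Eq. 2.29: minimum inter-cluster distance over maximum cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    diameter = max(cdist(c, c).max() for c in clusters)
    separation = min(cdist(ci, cj).min()
                     for i, ci in enumerate(clusters)
                     for cj in clusters[i + 1:])
    return separation / diameter

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

di = dunn_index(X, labels)            # larger is better
db = davies_bouldin_score(X, labels)  # Eq. 2.31, smaller is better
sc = silhouette_score(X, labels)      # mean s(x) over all objects, in [-1, 1]
```

On well-separated blobs such as these, all three indices should indicate a good clustering (large DI and SC, small DB).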

Summary of Cluster Validity Indices

We have reviewed several popular external CVIs and a few representative internal CVIs. When external information about the data is available, for example the ground truth, external CVIs can be used to choose the best clustering algorithm for a specific dataset. In addition, external CVIs can be used to measure the similarity or dissimilarity between pairs of clusterings; thus, they are often used when generating multiple clusterings [Dang and Bailey, 2010b] or an ensemble clustering [Strehl and Ghosh, 2003]. When external information is not available, internal CVIs can be used for cluster evaluation: they can be used to choose the best clustering algorithm as well as the optimal number of clusters.

2.4 summary

In this chapter, we have reviewed several important clustering techniques, including traditional clustering and multiple clustering analysis, and we reviewed and compared the popular methods in multiple clustering analysis. As discussed, alternative clustering methods focus on discovering alternative clusterings by defining different criteria to characterize what constitutes an alternative. However, in the initial period of exploratory data analysis, when we do not have much knowledge about the data, we sometimes do not even know what alternatives might exist. Thus, meta-clustering can be a better choice in this situation. In Chapter 3, we will discuss how to generate multiple clusterings of better quality based on meta-clustering.

We also discussed another important question in cluster analysis, namely cluster validity.

Cluster validity indices can be generally classified into two types: external and internal. We have reviewed several popular classes of external cluster validity indices as well as several popular internal cluster validity indices. In this thesis, we focus on external cluster validity indices; we will discuss some challenges of external measures in Chapter 4 and Chapter 5.


3 DISCOVERY OF MULTIPLE CLUSTERINGS

Abstract Meta-clustering is a popular approach for finding multiple clusterings in a dataset, taking a large number of base clusterings as input for further user navigation and refinement. However, the effectiveness of meta-clustering is highly dependent on the distribution of the base clusterings, and open challenges exist with regard to its stability and noise tolerance. In addition, the returned clustering views may not all be relevant, so there is an open challenge in how to rank those clustering views. In this chapter, we propose a simple and effective filtering algorithm that can be flexibly used in conjunction with any meta-clustering method. In addition, we propose an unsupervised method to rank the returned clustering views. We evaluate the framework (rFILTA) on both synthetic and real world datasets, and show how its use can enhance clustering view discovery for complex scenarios.

3.1 introduction

Meta-clustering is a popular unsupervised method for discovering multiple clusterings. It is especially suited to the exploratory stages of data analysis. In particular, meta-clustering generates a large number of base clusterings using different approaches [Caruana et al., 2006], including: running different clustering algorithms, running a specific algorithm several times with different initializations, or using random feature weighting in the distance function. These base clusterings may then be meta-clustered into groups (Figure 3.2b). Afterwards, the (base) clusterings within the same group can be combined using consensus (ensemble) clustering to generate a consensus view of that group (Figure 3.2c). This results in one or more distinctive clustering views of the dataset, each offering a different perspective or explanation. The general procedure of meta-clustering is illustrated in Figure 3.1. However, there are some limitations with existing meta-clustering methods.

In this chapter we present results from the following manuscripts:
Y. Lei, N. X. Vinh, J. Chan and J. Bailey, "FILTA: Better View Discovery from Collections of Clusterings via Filtering". Published in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2014), pp. 145-160, 2014.
Y. Lei, N. X. Vinh, J. Chan and J. Bailey, "rFILTA: Relevant and Non-Redundant View Discovery from Collections of Clusterings via Filtering and Ranking". To appear in Knowledge and Information Systems (KAIS).

Figure 3.1: The existing meta-clustering framework (generate raw base clusterings, group into meta-clusters, run ensemble/consensus clustering on each meta-cluster, output multiple clustering views).

Figure 3.2: Illustrative example of the meta-clustering process. (a) The base clusterings; each symbol represents a base clustering. (b) The base clusterings are grouped into meta-clusters. (c) One clustering view is generated for each meta-cluster.
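The base clustering generation strategies listed above can be sketched as follows. This is an illustrative setup, not the thesis's exact protocol; for squared-Euclidean k-means, random feature weighting in the distance function is equivalent to rescaling the features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)

base_clusterings = []
for seed in range(20):
    # (a) the same algorithm run with different random initializations
    base_clusterings.append(
        KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X))
    # (b) random feature weighting: for squared-Euclidean k-means, weighting
    # the distance function is equivalent to rescaling each feature
    w = rng.random(X.shape[1])
    base_clusterings.append(
        KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X * w))

print(len(base_clusterings))  # 40 raw base clusterings
```

Running different clustering algorithms (e.g., spectral or hierarchical clustering) would extend this pool in the same way.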

motivation for filtering A major drawback and challenge with meta-clustering is that its effectiveness is highly dependent on the quality and diversity of the generated base clusterings. Specifically, if the base clusterings are of low quality, then the ensemble step will be affected and may in turn produce low quality clustering views. In addition, if there are redundant and noisy base clusterings that are similar to one or more of the clustering views, then it is possible that some of the distinct views are


Figure 3.3: Clustering views generated from the unfiltered and filtered base clusterings on the CMUFace dataset. (a) Four representative clustering views generated from the unfiltered base clusterings; the similarity scores between each clustering view and the two ground truth views (Person and Pose) are shown above each view. (b) Two clustering views generated from the filtered base clusterings, with the corresponding similarity scores.

mistakenly merged into one, resulting in the loss of interesting clustering views. This can occur if the base clusterings representing two distinct views are connected via a chain of noisy but similar base clusterings. The grouping algorithm may then mistakenly group all of these base clusterings into one meta-cluster (Figure 3.2b), and subsequently one clustering view will be produced by the ensemble step (Figure 3.2c) when it finds a consensus view from the merged meta-cluster. In this way, users may miss some interesting clustering views.

We have experienced these problems in our experiments on both synthetic and real world datasets. To illustrate, we use an example from a real dataset (shown in Figure 3.3). The CMUFace dataset, which contains images of three different persons with different poses (left, front and right), consists of two reasonable clustering views,


Person and Pose1. From the CMUFace dataset, we generate a set of (raw) base clusterings (with k = 3 clusters) using a number of standard base clustering generation algorithms (see Section 3.5.1). Some of the generated base clusterings contain the Person or Pose clustering views, so it should be possible to recover both views. We then applied meta-clustering on the generated (raw) base clusterings and found 23 clustering views. To avoid clutter, we show four representative clustering views in Figure 3.3a. Each row in Figure 3.3a is a clustering view of three clusters, where each cluster is shown as the mean of all the images in it. Above each clustering view, we show the similarity score2, ranging within [0, 1], between this clustering view and the two ground truth clustering views, i.e., the Person and Pose views, respectively. The larger the value, the more similar the clustering view is to that ground truth clustering view.

As we can see from Figure 3.3a, only one of the ground truth clustering views is discovered, namely the Person view (the first clustering view), with a similarity score of 0.83. We could not discover the Pose view among the other clustering views. In addition, many of the clustering views are of poor quality and do not correspond to any underlying view, e.g., the other three clustering views shown in Figure 3.3a. The reason is that some of the generated base clusterings are noisy and of low quality (i.e., correspond to neither ground truth clustering view) and/or form bridges between the base clusterings of the two clustering views. This large amount of noise causes many redundant and poor quality clustering views to be found, and also causes the Pose view to be lost in the noise (e.g., the third clustering view/row in Figure 3.3a has a significant number of pose images in its clusters, but the bridging base clusterings have caused some Person view images to be merged with it). These observations stimulate the following question, which is the motivation for the proposed filtering method: can we filter out the redundant/similar and noisy base clusterings to avoid discovering poor quality views or missing out on significant ones?

Figure 3.3b provides an example of the benefits of filtering. It shows the two clustering views, i.e., Person and Pose, generated using a filtered set of base clusterings as input (we explain our filtering approach in Section 3.3.1). More specifically, after filtering out

1 Please refer to Section 3.5.5.1 for more details about the dataset and experiments.
2 The similarity between two clusterings is measured by adjusted mutual information (AMI), which will be introduced in Section 3.3.2.


Figure 3.4: The meta-clustering framework with our additional, proposed filtering and ranking steps highlighted (generate raw base clusterings, filter raw base clusterings, group into meta-clusters, rank meta-clusters, run ensemble/consensus clustering on each meta-cluster, output the top K multiple clustering views).

the poor quality base clusterings, we avoid clustering views of poor quality. In addition, filtering the redundant base clusterings helps to expose the other reasonable clustering view, i.e., the Pose view (Figure 3.3b). More examples are presented in the experiments (Section 3.5).

motivation for ranking Another challenge with meta-clustering is the large number of generated clustering views. Depending on the dataset and the base clustering generation mechanism, we may produce a large number of clustering views, and it is time consuming to examine them all. Our filtering step helps reduce the number of clustering views by removing those of poor quality. However, depending on the complexity of the dataset, the generated base clusterings and the requirements of users, we do not know how many potentially interesting clustering views exist; there may still be a large number of them after filtering. Hence, it is helpful to rank these clustering views based on importance and diversity, and provide users with the top K, to facilitate their analysis.

A question that may arise is: 'can we obtain high quality and diverse clustering views with ranking alone, i.e., without filtering?' The answer is no. The ranking step helps solve the problem of the large number of clustering views. However, the problems introduced previously (refer to Section 3.1), e.g., missing interesting clustering views, cannot be solved by ranking. Thus, we need both filtering and ranking to obtain good quality and diverse clustering views.


To solve the problems described above, we propose a new approach, ranked and filtered meta-clustering (rFILTA), which aims to detect multiple high quality and distinctive clustering views by filtering, ranking and analyzing a given set of base clusterings. Algorithmically, we propose an information theoretic criterion to perform the filtering.

In addition, we show how to employ a visual method to automatically determine the meta-clusters within the filtered base clusterings. We then rank these meta-clusters in terms of their importance, measured by the proposed heuristic criteria. Finally, we perform consensus clustering on the top K meta-clusters to produce the top K clustering views. Figure 3.4 shows the whole process. The novelty of our approach lies in the addition of a filtering step and a ranking step to the existing meta-clustering framework [Caruana et al., 2006; Zhang and Li, 2011], highlighted as the shaded boxes in Figure 3.4. Our focus is two-fold: (a) investigating how to filter the given raw base clusterings to generate a set of better clustering views, in terms of quality and diversity, compared to unfiltered meta-clustering; and (b) ranking the clustering views in terms of their importance. We assume that we are given a set of base clusterings. The generation of appropriate base clusterings, which has been considered by numerous existing works in the literature [Jain and Dubes, 1988; Caruana et al., 2006; Phillips et al., 2011; Zhang and Li, 2011; Fern and Lin, 2008; Azimi and Fern, 2009; Hadjitodorov et al., 2006], is outside the scope of this work. This is not a significant limitation, as any existing base clustering generation technique can be used. An important advantage of our method is that the filtering step and the ranking step are independent of the other steps in this framework and thus may be easily integrated with them. The contributions of this chapter are:

- We propose a novel filtering based meta-clustering approach for discovering multiple high quality and diverse views from a given set of base clusterings. Our filtering step can enhance any existing meta-clustering method. In particular, we propose a mutual information based filtering criterion which considers the quality and diversity of base clusterings simultaneously, and we provide a parameter that allows users to flexibly control the balance between fewer views of higher quality and more views of relatively lower quality.


- We identify the desirability of a ranking mechanism to assist in selecting a small number of informative views.

- We propose several heuristic ranking schemes for ranking the multiple clustering views in terms of their importance, to further assist users in analyzing a large number of clustering views.

- We evaluate and demonstrate the performance of our new rFILTA framework on 8 datasets (2 synthetic and 6 real world datasets).

The rest of the chapter is organized as follows. We review the related work in Section 3.2. Our rFILTA framework is introduced in Section 3.3, and the time complexities of the different steps involved in the rFILTA framework are analyzed in Section 3.4. Finally, we present extensive experimental results and analysis on 8 datasets in Section 3.5 and conclude in Section 3.6.

3.2 related work

The research in this chapter is related to several topics: meta-clustering, alternative clustering and cluster ensembles (consensus clustering). These methods have been reviewed in Chapter 2.

Our proposed framework in Figure 3.4 combines all of the above clustering paradigms. The critical difference of our work, compared to the others, is that we place each clustering paradigm into its most relevant place. In particular, we employ alternative clustering as one of the mechanisms for generating diverse base clusterings. Alternative clustering employs objective functions to guide the search process, so it may discover alternative clustering views faster than meta-clustering, which employs a random clustering generation scheme (such as random initialization or random feature weighting). However, if the objective function defined in an alternative clustering method cannot characterize the underlying structure of the dataset appropriately, it cannot discover the alternative clustering view3. On the other hand, meta-clustering can cover

3 We demonstrate this point in the experiments; refer to Figure 3.21.


the space of clusterings more comprehensively than alternative clustering, by flexibly employing different means of generation. Thus, we take alternative clustering as one of the generation methods. Finally, we group the clusterings and generate the consensus view for each group via consensus clustering. This is a more flexible approach than generating a single consensus view for the whole set of base clusterings, as the base clusterings may reflect very different structures of the data and thus may not be reasonably combined into a single consensus view. In summary, our rFILTA framework incorporates the strengths of the related techniques to improve upon existing meta-clustering methods.

We next describe the rFILTA framework.

3.3 rfilta framework

A clustering C is a hard partition of a dataset X, denoted by C = {c1, . . . , ck}, where each ci is a cluster, ci ∩ cj = ∅ for i ≠ j, and ⋃i ci = X. We denote the space of possible clusterings on X as PX. We use C to denote a set of (base) clusterings, i.e., C = {C1, . . . , Cl}. Let a set of clustering views be denoted by V = {V1, . . . , VR}, where a clustering view Vi is a clustering on X, Vi ∈ PX. Even though a clustering view is just a clustering, we use the view nomenclature to distinguish between the initial base clusterings and the final, returned clusterings (the set of clustering views) at the end of the meta-clustering process.

The rFILTA framework consists of a number of steps, illustrated in Figure 3.4 and Algorithm 3.1. In the following sections, we describe each of these steps.

3.3.1 Filtering Base Clusterings

The quality of a clustering C is measured by a function Q(C): PX → R+, and the diversity between two clusterings can be computed according to a similarity measure Sim(Ci, Cj): PX × PX → R+. The filtering problem can be formalized as follows.


Algorithm 3.1 Framework of rFILTA
Input:
    C = {C1, . . . , Cl}, the generated base clusterings
    K, the number of returned clustering views
    L, the number of base clusterings selected during the filtering step
    β ∈ [0, 1], the trade-off parameter which balances quality and diversity during filtering
Output:
    The top K clustering views, V′K = {V′1, . . . , V′K}

1: C′ ← Filtering(C, L, β), where C′ = {C′1, . . . , C′L} (Section 3.3.1)
2: Cmc ← Grouping(C′), where Cmc = {Cmc1, . . . , CmcR}, Cmci ∩ Cmcj = ∅, ⋃i Cmci = C′ (Section 3.3.3)
3: C′mc ← Ranking(Cmc, K), where C′mc = {C′mc1, . . . , C′mcK}, C′mc ⊂ Cmc (Section 3.3.4)
4: V′K ← Consensus(C′mc), where V′K = {V′1, . . . , V′K} (Section 3.3.5)
5: return V′K

Definition 3.1. Given a set of raw base clusterings C = {C1, . . . , Cl}, we seek a set of clustering views V = {V1, . . . , VR} generated from C, such that ∑_{Vi∈V} Q(Vi) is maximized and ∑_{Vi,Vj∈V, i≠j} Sim(Vi, Vj) is simultaneously minimized.

We solve this problem by selecting, from the given raw base clusterings C, a subset of clusterings C′ which are of high quality and diversity. The quality and diversity of the base clusterings have a big impact on the quality and diversity of the clustering views that are ultimately extracted. Next we discuss the quality and diversity criteria for clusterings.

3.3.2 Clustering Quality and Diversity Measures

We employ an information theoretic criterion, namely the mutual information (Chapter 2), for measuring both clustering quality and diversity. As a clustering quality measure, mutual information is a well-known criterion for clustering discovery, which can discover both linear and non-linear clusterings [Dang and Bailey, 2010b]. For measuring similarity between clusterings, mutual information can detect linear or non-linear relationships between random variables [Vinh et al., 2010]. More specifically, the quality of a clustering C is measured by the amount of shared information with the data X, i.e.,


I(X;C). The more information that is shared, the better the clustering models the data. In contrast, the mutual information between two clusterings, I(Ci;Cj), quantifies their similarity: the less mutual information shared between the clusterings, the more dissimilar they are. There are many variations of mutual information (see [Vinh et al., 2010] for more information). We choose the Adjusted Mutual Information (AMI, Chapter 2), an adjusted-for-chance version of the normalized mutual information4 [Strehl and Ghosh, 2003], for measuring the similarity between two clusterings. We selected AMI as its value lies in the interpretable range of 0 to 1, and it uses a principled approach to normalize to such a range.

3.3.2.1 Quality

The average quality of the selected set of base clusterings can be optimized as:

max_{C′} (1/|C′|) ∑_{Ci∈C′} I(X; Ci)  ≡  min_{C′} (1/|C′|) ∑_{Ci∈C′} H(X|Ci)    (3.1)

where the right hand side results from I(X;C) = H(X) − H(X|C), and H(X) is a constant (H(·) is the Shannon entropy function). Computation of the mutual information I(X;C) requires the joint density function p(X,C), which is difficult to estimate for high dimensional data. Instead of directly estimating the joint densities, we may use the meanNN differential entropy estimator for computing the conditional entropy H(X|C) [Faivishevsky and Goldberger, 2010], due to its desirable properties of efficiently estimating density functions in high dimensional data and being parameterless. It is defined as

H(X|C) ≈ ∑_{j=1}^{nc} (1/(nj − 1)) ∑_{i≠l | ci=cl=j} log ‖xi − xl‖    (3.2)

4 The normalized version of mutual information scales mutual information to [0, 1].
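As a concrete illustration, the estimator in Equation (3.2) can be sketched in a few lines of NumPy. This is our own simplified sketch: the function name is ours, and the dimension-dependent additive constants of the full meanNN estimator are omitted, as in the equation above.

```python
import numpy as np

def mean_nn_conditional_entropy(X, labels):
    """Sketch of the meanNN conditional entropy estimator of Equation (3.2):
    H(X|C) ~ sum over clusters j of 1/(n_j - 1) times the sum of
    log ||x_i - x_l|| over ordered pairs i != l inside cluster j.
    X: (N, d) data array; labels: length-N cluster assignment."""
    h = 0.0
    for j in np.unique(labels):
        cluster = X[labels == j]
        n_j = len(cluster)
        if n_j < 2:
            continue  # a singleton cluster contributes no pairs
        # pairwise Euclidean distances within the cluster
        diffs = cluster[:, None, :] - cluster[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        off_diag = dists[~np.eye(n_j, dtype=bool)]
        h += np.log(off_diag).sum() / (n_j - 1)
    return h
```

A clustering that respects tight, well-separated groups yields small within-cluster distances and hence a lower estimated conditional entropy than a clustering that mixes the groups.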


3.3.2.2 Diversity

The diversity can be optimized by minimizing the average similarity between clusterings, as:

min_{C′} (1/|C′|²) ∑_{Ci,Cj∈C′} AMI(Ci; Cj)

Filtering Criterion and Incremental Selection Strategy

We wish to select a subset of base clusterings, C′, that achieves high quality and diversity simultaneously. Inspired by the mutual information based feature selection literature [Peng et al., 2005], which maximizes feature relevancy while minimizing feature redundancy, we propose a clustering selection criterion that combines the quality and diversity of clusterings:

min_{C′⊂C, |C′|=L}  (β·β0/|C′|) ∑_{Ci∈C′} H(X|Ci) + ((1 − β)/|C′|²) ∑_{Ci,Cj∈C′, i≠j} AMI(Ci; Cj)    (3.3)

where L is a user defined parameter specifying the number of base clusterings C′ to be selected, and β ∈ [0, 1] is a trade-off parameter that balances the emphasis placed on quality and diversity during selection. When β = 0.5, we place equal emphasis on quality and diversity. The influence of β is discussed in the experiments, Section 3.5.3.2. To make sure the first term is on the same scale as the second term, we rescale H(X|Ci) to [0, 1] by multiplying it with β0 = (H(X|Ci) − min{H(X|Ci)}) / (max{H(X|Ci)} − min{H(X|Ci)}). Thus, our selection method aims to select L base clusterings C′ from the given raw base clusterings C, optimizing the dual-objective criterion in Equation (3.3).

A simple incremental search strategy can be used to select a good subset C′ for criterion (3.3) as follows. Initially, we select the clustering solution with the highest


quality among the given clusterings C. Then, we incrementally select the next solution from the set C \ C′ as:

arg min_{Ci∈C\C′}  β·β0·H(X|Ci) + ((1 − β)/|C′|) ∑_{Cj∈C′} AMI(Ci; Cj)    (3.4)

with the aim of selecting a next clustering that has high quality and small average similarity with the already selected clusterings in C′. This process repeats until we reach the desired number L of base clusterings.
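The incremental strategy above can be sketched as follows; `h_cond` (the conditional entropies of the base clusterings, rescaled to [0, 1]) and `ami` (the pairwise AMI matrix) are assumed to be precomputed, and the function name is ours.

```python
import numpy as np

def filter_clusterings(h_cond, ami, L, beta=0.5):
    """Greedy filtering sketch for criterion (3.4).
    h_cond: length-l array of H(X|C_i) rescaled to [0, 1];
    ami: (l, l) matrix of pairwise AMI(C_i; C_j);
    returns the indices of the L selected base clusterings."""
    ami = np.asarray(ami)
    l = len(h_cond)
    selected = [int(np.argmin(h_cond))]  # start from the highest-quality clustering
    while len(selected) < L:
        best, best_score = None, np.inf
        for i in range(l):
            if i in selected:
                continue
            redundancy = ami[i, selected].mean()  # average similarity to chosen set
            score = beta * h_cond[i] + (1 - beta) * redundancy
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With β = 0.5, a near-duplicate of an already selected clustering is penalized by its high AMI to the chosen set, so a slightly lower-quality but diverse clustering is preferred.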

3.3.3 Discovering the Meta-Clusters

We have obtained a filtered set of base clusterings from the filtering process. Next we group them into clusters at the meta level and then perform ensemble clustering on each meta-cluster for view generation. We first explain the measure used to compute the similarity between the base clusterings, then explain a visualization technique called iVAT [Wang et al., 2010] for determining the potential number of meta-clusters. We then introduce a method that combines with iVAT to automatically determine the appropriate number of meta-clusters and perform the grouping, and finally describe how to obtain the views from the meta-clusters.

Measuring the Similarity between Clusterings: In order to divide the selected clusterings into groups, we need a similarity measure to compare clusterings. Several measures of clustering similarity have been proposed in the literature [Jain and Dubes, 1988]. Here we use the AMI for measuring the similarity between clusterings. The distance between two clusterings is then 1 − AMI(Ci; Cj).
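As a sketch, the pairwise 1 − AMI dissimilarity matrix can be computed with scikit-learn's `adjusted_mutual_info_score`; the function name below is ours.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def clustering_dissimilarity_matrix(clusterings):
    """Pairwise 1 - AMI distances between clusterings, where each
    clustering is a length-N label vector. Only the upper triangle
    is computed, by symmetry, and mirrored."""
    l = len(clusterings)
    D = np.zeros((l, l))
    for i in range(l):
        for j in range(i + 1, l):
            d = 1.0 - adjusted_mutual_info_score(clusterings[i], clusterings[j])
            D[i, j] = D[j, i] = d
    return D
```

Two label vectors describing the same partition (even under relabelling) have AMI 1 and distance 0, which is exactly the invariance needed when comparing clusterings.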

Grouping the Base Clusterings into Meta-Clusters: After filtering the base clusterings to obtain C′, we compute the pairwise dissimilarity matrix between all members of C′ as a prelude to grouping them into meta-clusters. There are two challenges in this grouping step: a) determining the number of relevant meta-clusters; and b) partitioning


the clusterings into meta-clusters. Next, we describe a visualization technique for assessing the number of meta-clusters in a set of base clusterings. Then, an automatic method for determining the number of meta-clusters and partitioning the clusterings into meta-clusters will be presented.

The VAT method [Wang et al., 2010] is a visualization tool for cluster tendency

assessment. By reordering a pairwise dissimilarity matrix of a set of data objects, it can reveal the hidden clustering structure of the data by visualizing the intensity image of the reordered dissimilarity matrix. The number of clusters in a set of data objects can be visually identified by the number of "dark blocks" displayed along the diagonal of the VAT image. In this work, we use the iVAT method [Wang et al., 2010; Havens and Bezdek, 2012], an advanced version of VAT that presents clearer blocks in the images of the reordered dissimilarity matrix. Each clustering can be taken as a data object, and we utilize the iVAT method to visualize the number of potential meta-clusters.

For grouping the set of clusterings, existing research uses hierarchical clustering [Caruana et al., 2006; Zhang and Li, 2011]. However, the hierarchical clustering approach does not automatically determine the number of clusters. Hence, we propose an alternative, CLODD [Havens et al., 2009], a clustering method which automatically extracts the number of clusters and produces a hard partition of the data objects. We choose CLODD as it also works on reordered dissimilarity matrices generated by the iVAT method, so the two methods complement each other well. Nevertheless, we stress that the rFILTA framework is not restricted to any particular grouping method, and users can choose the one they prefer according to their requirements and knowledge. As mentioned above, there will be dense blocks along the diagonal of the ordered dissimilarity matrix if clusters exist in this set of clusterings. The CLODD algorithm discovers the number of meta-clusters and produces a hard partition of these clusterings by optimizing an objective function which assesses the dense diagonal block structure of the reordered dissimilarity matrix. At the end of this step, we have multiple meta-clusters.

For the CLODD method, two parameters need to be set, namely NCmin and NCmax, the minimum and maximum numbers of meta-clusters. We set NCmin = 1 and NCmax to the number of base clusterings to be grouped. For all other


parameters involved in the CLODD method, we use the values suggested in the original work [Havens et al., 2009] as defaults.

3.3.4 Meta-Cluster Ranking

When the generated base clusterings are widely distributed in the clustering space and the trade-off parameter β is chosen to be small, many clustering views may be produced, even after filtering. When there are many clustering views, it is helpful to rank them and show the top K clustering views for users to analyze. Examining many irrelevant clustering views, which is a possibility when there is no ranking, is time consuming and frustrating. The challenge is how to rank the clustering views. There are different definitions characterizing what a good clustering view is, according to the different requirements of different users. It is hard to define a criterion that satisfies all these different requirements. Moreover, there is no standard 'right' ranking for us to learn from.

In this section, we propose several heuristic ranking schemes for ranking the meta-clusters based on their characteristics. We then apply ensemble clustering on the returned top K meta-clusters to produce the top K clustering views. These schemes are reasonable options in terms of different considerations, and users can choose among them according to their requirements.

Recall that a meta-cluster is a set of clusterings which ideally correspond to a clustering view, and a clustering view is a clustering that represents a meta-cluster. Differently from traditional cluster evaluation, we have to consider the quality of the meta-clusters, as their members are themselves clusterings. The quality of the members of a meta-cluster has a big impact on the goodness of its corresponding clustering view. Next we present several properties that can be considered for measuring the goodness of a meta-cluster.

(a) Cohesion and Separation
Similar to cluster evaluation, we can borrow the notions of cohesion and separation, which are used for measuring the goodness of a cluster, to measure the goodness of a meta-cluster. The more compact a meta-cluster, the more similar the clusterings within it. This indicates that the clusterings within this meta-cluster can be repeatedly found by some of the available clustering generation methods, so the meta-cluster is more likely to correspond to a reasonable clustering view. If the meta-cluster is well separated from the other meta-clusters, it indicates that this meta-cluster is different from the others and corresponds to a distinctive clustering view. Based on these ideas, and inspired by the popular internal cluster evaluation index, the Silhouette Coefficient (Chapter 2), we build a Meta-Cluster Cohesion and Separation (mcCS) index for measuring the cohesion and separation of a meta-cluster, as follows. For a clustering Ci in a meta-cluster Cm, we define

mcCS(Ci) = (metaInter(Ci) − metaIntra(Ci)) / max{metaInter(Ci), metaIntra(Ci)}

where metaIntra(Ci) is the average dissimilarity of clustering Ci to all other clusterings in the same meta-cluster Cm, and metaInter(Ci) is the smallest average dissimilarity of Ci to any meta-cluster of which Ci is not a member. The lower metaIntra(Ci), the better the cohesion; the larger metaInter(Ci), the better the meta-cluster is separated from the others. The dissimilarity between a pair of clusterings, Ci and Cj, is computed as 1 − AMI(Ci; Cj). For the meta-cluster Cm, we compute its cohesion and separation by taking the average mcCS of all its member clusterings. The larger this value, the more likely the meta-cluster corresponds to a reasonable and distinctive clustering view.

mcCS(Cm) = (1/|Cm|) ∑_{Ci∈Cm} mcCS(Ci)    (3.5)
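The mcCS index, computed from a precomputed 1 − AMI dissimilarity matrix, can be sketched as below. This is our own sketch (function name included); it assumes at least two meta-clusters, so that metaInter is defined for every clustering.

```python
import numpy as np

def mcCS_scores(D, meta_labels):
    """Sketch of the meta-cluster cohesion/separation index of Eq. (3.5).
    D: (L, L) clustering dissimilarity matrix (entries 1 - AMI);
    meta_labels: length-L meta-cluster assignment.
    Returns a dict {meta-cluster id: mcCS(Cm)}."""
    meta_labels = np.asarray(meta_labels)
    ids = np.unique(meta_labels)
    per_clustering = np.zeros(len(meta_labels))
    for i in range(len(meta_labels)):
        own = meta_labels[i]
        same = (meta_labels == own)
        same[i] = False  # exclude the clustering itself from metaIntra
        intra = D[i, same].mean() if same.any() else 0.0
        # smallest average dissimilarity to any other meta-cluster
        inter = min(D[i, meta_labels == m].mean() for m in ids if m != own)
        per_clustering[i] = (inter - intra) / max(inter, intra)
    return {m: per_clustering[meta_labels == m].mean() for m in ids}
```

Compact, well-separated meta-clusters score close to 1, mirroring the behaviour of the Silhouette Coefficient for ordinary clusters.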

(b) Size of the Meta-Cluster
We also consider the size of each meta-cluster.5 If the size of a meta-cluster is large, it indicates that the clustering view is popular according to the available

5 Even though we are not sampling the clustering space uniformly, the size of the meta-cluster can be considered one reasonable standard.


clustering generation methods. We are then more confident that this clustering view corresponds to a popular one, since many base clusterings are represented by it. A meta-cluster with a smaller size is not necessarily bad or unimportant; it may simply be that the generation techniques cannot find it easily. Preferring small sizes is also a valid choice, but in this chapter we choose to prefer large meta-clusters.

S(Cm) = 1/|Cm|    (3.6)

(c) Quality of the Meta-Cluster
We believe that a meta-cluster with better quality clustering members may result in a clustering view with better quality. We compute the quality of a meta-cluster Cm as the average quality of its clustering members. For each clustering Ci, we quantify its quality by the conditional entropy H(X|Ci), as in Equation (3.2). The smaller the value, the more likely the meta-cluster is of high quality.

Q(Cm) = (1/|Cm|) ∑_{Ci∈Cm} H(X|Ci)    (3.7)

According to these different criteria, we obtain different rankings of the generated meta-clusters. We can choose any of these rankings for returning the top K meta-clusters. In this chapter, we rank the meta-clusters by the harmonic mean of the above rankings, which works reasonably well in our case. We stress, however, that any combination of the above measures could be used for ranking [Sheng et al., 2005; Pihur et al., 2007; Jaskowiak et al., 2016].

averMC(Cm) = 3 / (1/RankCS(Cm) + 1/RankS(Cm) + 1/RankQ(Cm))    (3.8)
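The rank combination of Equation (3.8) can be sketched as follows. The direction conventions are our reading of the criteria above: a larger mcCS is better, while S(Cm) and Q(Cm) (Equations (3.6) and (3.7)) are smaller-is-better; the function names are ours.

```python
import numpy as np

def rank_asc(values):
    """1-based ranks; the smallest value gets rank 1."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def combined_meta_cluster_ranking(mccs, size_score, quality):
    """Sketch of Eq. (3.8): harmonic mean of the three per-criterion
    ranks; returns meta-cluster indices ordered best-first."""
    r_cs = rank_asc(-np.asarray(mccs))        # best mcCS -> rank 1
    r_s = rank_asc(np.asarray(size_score))    # S(Cm) = 1/|Cm|, smaller is better
    r_q = rank_asc(np.asarray(quality))       # Q(Cm), smaller is better
    aver = 3.0 / (1.0 / r_cs + 1.0 / r_s + 1.0 / r_q)
    return [int(i) for i in np.argsort(aver)]
```

A meta-cluster ranked first on every criterion gets averMC = 1 and appears first; the harmonic mean rewards meta-clusters that do very well on at least one criterion.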


3.3.5 Discovering the Clustering Views via Ensemble Clustering

In this final step, we use three ensemble clustering algorithms (Chapter 2), CSPA, HGPA and MCLA, to find a consensus view for each meta-cluster. Among these three algorithms, MCLA produces the best ensemble clustering in terms of AMI between the generated clustering view and the ground truth clustering view. Thus, we choose to present results generated by the MCLA ensemble clustering algorithm. However, our framework is not restricted to any specific ensemble clustering method, and users can choose the one they prefer according to their requirements. At the end of this step, we have a set of high quality and diverse views of the data.

3.4 time complexity analysis

In this part, we analyze the time complexity of the proposed rFILTA framework. As rFILTA consists of different steps that involve different methods, its complexity depends on the time complexities of these algorithms. Next, we provide a detailed analysis of the time complexity of each step in rFILTA.

time complexity of generation step We employ 7 clustering methods in the generation step (refer to Section 3.5.1 for details). Their time complexities are: k-means O(INkd); random feature weighting method O(INkd); random sampling method O(INkd); spectral clustering O(N³); EM clustering O(INkd²); information theory based clustering O(N²); minCEntropy O(N²d), where I indicates the number of iterations (we set I = 100 as default), k the number of clusters in a clustering, d the number of features of a data object, and N the number of data objects. Thus, the overall time complexity of the generation step is O(INkd² + N²d + N³).

time complexity of filtering step Before the filtering step, we can precompute the meanNN differential entropy H(X|Ci) for all the l base clusterings and


also the AMI between each pair of clusterings. In fact, only l(l − 1)/2 pairwise AMI values need to be computed, due to symmetry.

In particular, for one clustering Ci, the computation of its meanNN differential entropy H(X|Ci) costs O(N²d); thus it costs O(l·N²d) for all l clusterings. The entropy of a clustering, H(Ci), costs O(N). The mutual information between a pair of clusterings Ci and Cj, I(Ci, Cj), costs O(N + k²). It costs O(kN) to compute the expectation of the mutual information between a pair of clusterings, E{I(Ci, Cj)}. Thus, the computation of the AMI of a pair of clusterings costs about O(kN), and it costs O(l²kN) to compute the AMI for all l(l − 1)/2 pairs of clusterings. The precomputation step therefore costs O(N²ld + l²Nk). The incremental selection procedure costs O(L·l).

Overall, the time complexity of the filtering step is dominated by the precomputation

step, that is, O(N²ld + l²Nk).

time complexity of grouping step After the filtering step, only L base clusterings remain for the following steps. The pairwise dissimilarity matrix has been computed in the filtering step. Using the dissimilarity matrix as input for the iVAT algorithm, we get the reordered dissimilarity matrix as output; this step costs O(L²). Taking the reordered dissimilarity matrix as input for the CLODD algorithm, we get the meta-cluster groups as output, which costs O(L³·Np·qmax). As the CLODD method uses particle swarm optimization [Havens et al., 2009], Np is the number of particles for each swarm and qmax is the maximum number of swarm iterations. We set Np = 20 and qmax = 1000 in our experiments.

time complexity of ranking step In the ranking step, we proposed three criteria for ranking meta-clusters. Their time complexities are discussed as follows.

- Cohesion and Separation: As the dissimilarities between each pair of clusterings have been precomputed, the complexity of computing the mcCS score for all L clusterings is O(LR), where R is the number of generated meta-clusters. In the worst case, L meta-clusters are generated, i.e., each clustering is a meta-cluster, and it


becomes O(L²). Thus, the overall complexity of this step is O(L²) in the worst case.

- Quality: As the quality H(X|Ci) of each clustering Ci has been precomputed, this step costs O(L).

- Size: This step costs O(L).

Thus, the overall complexity of the ranking step is O(L²).

time complexity of ensemble step The time complexity of the ensemble step with MCLA is O(Nk²L²).

overall complexity In summary, the overall time complexity of the rFILTA framework is O((INkd² + N²d + N³) + (N²ld + l²Nk) + (L³·Np·qmax)). It may be dominated by the generation, filtering or grouping step, depending on the specific dataset.

3.5 experimental results

In this section, we evaluate the performance of our rFILTA method against the existing meta-clustering methods, which we collectively refer to as the unfiltered meta-clustering method. We compare their ability to recover known clustering views in 8 datasets (2 synthetic datasets and 6 real world datasets). We also evaluate and compare against an alternative clustering method for discovering multiple clustering views. In addition, we show that the proposed ranking scheme works well.

In the following, we first introduce the clustering methods employed in the generation step. Then we introduce the evaluation scheme for validating the generated multiple clustering views. Afterwards, the parameter settings are discussed. Finally, we show and analyze the experimental results on the different datasets, which demonstrate the performance of the proposed filtering and ranking steps in the rFILTA framework.


3.5.1 Base Clustering Generation Methods

We employ 7 clustering generation methods in our experiments, many of which have been used previously in other meta-clustering algorithms.

- K-means with random initializations [Caruana et al., 2006].

- Random feature weighting method, where feature weights are drawn from the Zipf distribution [Caruana et al., 2006].

- Random sampling that selects {50%, 60%, 70%, 80%, 90%, 100%} of objects and features, and then applies k-means on the sampled objects and features. The objects not initially sampled are then assigned to the nearest clusters by the k-nearest neighbour method.

- Spectral clustering [Caruana et al., 2006] using the similarity measure S = exp(−‖xi − xj‖²/σ²) with the shape parameter σ = max{‖xi − xj‖}/2^(k/8), where k is randomly chosen from k = 0, . . . , 64.

- EM-based mixture model clustering method with different initializations.

- Information theory based clustering algorithm [Faivishevsky and Goldberger, 2010].

- An alternative clustering method, minCEntropy [Vinh and Epps, 2010], with different reference clusterings generated by the k-means method.
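Two of the simpler generation schemes above, k-means with random initialisation and k-means on Zipf-weighted features, can be sketched with scikit-learn as follows. The function name and parameter choices (e.g. the Zipf exponent of 2) are illustrative, not taken from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(X, k, n_runs=10, seed=0):
    """Illustrative base-clustering generation: (1) k-means with random
    initialisation, and (2) k-means on randomly Zipf-weighted features.
    Returns a list of label vectors, two per run."""
    rng = np.random.RandomState(seed)
    clusterings = []
    for _ in range(n_runs):
        # (1) k-means with a fresh random initialisation
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=rng.randint(10**9))
        clusterings.append(km.fit_predict(X))
        # (2) k-means after weighting each feature by a Zipf draw
        w = rng.zipf(2.0, size=X.shape[1]).astype(float)
        km_w = KMeans(n_clusters=k, init="random", n_init=1,
                      random_state=rng.randint(10**9))
        clusterings.append(km_w.fit_predict(X * w))
    return clusterings
```

The random restarts and heavy-tailed feature weights push k-means toward different local structures, which is exactly the diversity the generation step relies on.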

In general, we generate 700 base clusterings for each dataset; each clustering algorithm generates 100 clusterings. The number of clusters in each generated clustering is the same as in the ground truth views.

For the data selection, we would like to keep the number of clusters in both ground truth views consistent. If the numbers of clusters in the ground truth views were different, then base clusterings with different numbers of clusters would have to be generated, which would be challenging for the subsequent grouping and ensemble steps. Research on techniques for the grouping and ensemble steps is out of the scope of this work; our focus is on validating the proposed filtering and ranking methods. Therefore, we choose to keep the number


of clusters in the ground truth views consistent and generate base clusterings with the same number of clusters, so as to be less distracted by the other steps in the framework.

3.5.2 Evaluation of the Clustering Views

In order to evaluate the goodness of the discovered clustering views, we need to answer the following questions:

- How many ground truth clustering views can be recovered (diversity)?

- How well do the generated clustering views match the multiple sets of ground truth clustering views (quality)?

- How well do the returned top K clustering views match the ground truth clustering views (ranking)?

Inspired by Mean Average Precision (MAP) [Manning et al., 2008], a popular measure for evaluating ranked retrieval of documents in information retrieval, we propose a Mean Best Matching (MBM) score to evaluate our method. Here, we assess the matching between the returned top K clustering views and the ground truth labels using AMI. In more detail, given multiple ground truth labels G = {G1, . . . , GH} and a set of ranked clustering views Vr = {Vr1 , . . . , Vrm}, the MBM for the top K clustering views Vrk = {Vr1 , . . . , Vrk}, where k ≤ m, is defined as:

MBM(Vrk) = (1/H) ∑_{i=1}^{H} max_{Vj∈Vrk} AMI(Gi; Vj)    (3.9)

where MBM(Vrk) ∈ [0, 1]. MBM(Vrk) takes the value 0 when no ground truth view is recovered at all, and 1 when all the ground truth views are recovered and matched perfectly by the generated clustering views. The MBM score increases when the generated clustering views match the ground truth labels better, or when a new ground truth view is recovered.
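Equation (3.9) can be sketched directly with scikit-learn's `adjusted_mutual_info_score`; the function name is ours.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def mean_best_matching(ground_truths, top_k_views):
    """Mean Best Matching (Eq. 3.9): for each ground truth view, take
    the best AMI over the returned top-K clustering views, then
    average over the H ground truth views."""
    return float(np.mean([
        max(adjusted_mutual_info_score(g, v) for v in top_k_views)
        for g in ground_truths
    ]))
```

If every ground truth view appears verbatim among the returned views, each inner maximum is 1 and the score is 1; dropping a view from the returned set lowers the score.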


[Figure 3.5 images: panels (a)–(g) show iVAT diagrams for L = 100, 200, 300, 400, 500, 600 and 700.]

Figure 3.5: iVAT diagrams for different numbers of filtered base clusterings with β = 0.6 on the flower dataset.

3.5.3 Parameter Setting

In this part, we discuss the two important parameters involved in our method: L, the number of selected base clusterings, and β, the regularization parameter used for balancing quality and diversity during filtering.

3.5.3.1 The Impact of the Number of Selected Base Clusterings

The number of selected clusterings L does not have a high impact on the quality of view generation by our method. We take the flower dataset as an example to show the impact of L. In Figure 3.5, we show the iVAT diagrams constructed when L is varied from 100 to 700 selected base clusterings (recall that there are 700 raw base clusterings). The iVAT diagrams are mostly stable from L = 100 to 500, meaning that rFILTA is quite robust to noise and relatively insensitive to the choice of L. We recommend choosing L between 15% and 25% of the number of raw base clusterings; for example, for the flower dataset, L = 100. We observe similar patterns on the other evaluated datasets.


3.5.3.2 Impact of the Regularization Parameter

The regularization parameter β ∈ [0, 1] balances quality and diversity during the clustering filtering procedure. For example, when β = 0.5, we treat quality and diversity as equally important. When β → 0, the filtering process places more emphasis on diversity, which generally increases the number of potential clustering views but at the risk of including more poor quality solutions. In contrast, when β → 1, the filtering procedure focuses on quality, which will result in high quality clustering views, but some relevant clustering views may be filtered out. Thus, users can tune this parameter according to their specific needs for view detection. In our experiments, we chose the value of β for each dataset by testing and tuning it within the [0, 1] range according to intuition about the balance between quality and diversity. Given that we usually do not have the cluster labels, the iVAT diagrams can be used

as one way to help users with this investigation. In particular, we propose to 'slide' β within the [0, 1] range and inspect the iVAT reordered matrix and the consensus views that emerge. We take the flower dataset as an example and illustrate a number of iVAT diagrams (Figure 3.6) constructed with different β values and L = 100. We can see that as β decreases, the iVAT diagram becomes fuzzier, which means that the selected base clusterings are more diverse but their quality is decreasing. Depending on their requirements, users who want the dominant and easily identified clustering views can put more focus on quality by setting β to a large value; users who would like more diverse, though perhaps not dominant, clustering views can put more focus on diversity by setting β to a smaller value. When β is decreased beyond a certain point, the structure and fuzziness of the iVAT diagram do not change much, which indicates that diversity may have reached its limit.

Next we present detailed experimental results and analysis on 8 datasets, including 2 synthetic datasets and 6 real world datasets.


[Figure 3.6 images: panels (a)–(f) show iVAT diagrams for β = 0.8, 0.7, 0.6, 0.5, 0.4 and 0.3.]

Figure 3.6: iVAT diagrams generated from 100 filtered base clusterings and different β values, for the flower data.

3.5.4 Synthetic Datasets Evaluation

In this section, we use 2 synthetic datasets to demonstrate that our rFILTA method is able to discover high quality and diverse clustering views by filtering out poor quality and redundant base clusterings.

3.5.4.1 4 Gaussian 2D dataset

The first synthetic dataset is a 2D four Gaussian dataset with 200 data objects (refer to Figure 3.8a), with 7 ground truth clustering views (refer to Figure 3.9). We generate 700 base clusterings with 2 clusters each.

We first perform meta-clustering on the whole set of generated base clusterings. The iVAT diagram of the unfiltered base clusterings is shown in Figure 3.7a. We got 30


[Figure 3.7 images: (a) iVAT diagram of the 700 unfiltered base clusterings, in which 30 blocks (meta-clusters) are found; (b) iVAT diagram of the 100 filtered base clusterings, in which 7 blocks (meta-clusters) are found.]

Figure 3.7: iVAT diagrams of the unfiltered and filtered base clusterings on the 2D Gaussian dataset.

-5 0 5 10 15-5

0

5

10

15

(a) view1-5 0 5 10 15

-5

0

5

10

15

(b) view2-5 0 5 10 15

-5

0

5

10

15

(c) view3-5 0 5 10 15

-5

0

5

10

15

(d) view4

-5 0 5 10 15-5

0

5

10

15

(e) view5-20 -10 0 10 20

-5

0

5

10

15

(f) view6-5 0 5 10 15

-5

0

5

10

15

(g) view7-5 0 5 10 15

-5

0

5

10

15

(h) view8

Figure 3.8: Top 8 clustering views returned from the unfiltered base clusterings on the 2DGaussian dataset.

meta-clusters, highlighted by green dashed lines surrounding the blocks, where each block corresponds to a meta-cluster. To avoid clutter, we choose the top 8 meta-clusters, apply ensemble clustering to each of them and obtain the top 8 clustering views,

Figure 3.9: 7 clustering views (views 1–7) generated from the filtered base clusterings on the 2D Gaussian dataset. They correspond to the 7 ground truth clustering views.

shown in Figure 3.8. We can see that the first 4 clustering views are reasonable, while clustering views 5 to 8 are of poor quality (the remaining clustering views are also of poor quality; refer to Figure 3.10). As introduced before, this dataset contains 7 ground truth clustering views; the unfiltered meta-clustering method only recovered 4 of them, i.e., the first 4 clustering views shown in Figure 3.8. Next we apply our filtering approach to the same set of 700 base clusterings. We filter out 600 of the low quality and similar base clusterings, setting L = 100 and β = 0.6. The iVAT diagram of the filtered set of base clusterings is shown in Figure 3.7b and has 7 blocks (meta-clusters). The clustering views, generated by applying ensemble clustering to these meta-clusters, are shown in Figure 3.9. These 7 clustering views correspond to the 7 ground truth clustering views.
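The ensemble-clustering step applied to each meta-cluster can be illustrated with evidence accumulation: build a co-association matrix from the member base clusterings and then cluster it. This is a generic sketch (with a naive average-linkage agglomeration), not necessarily the exact consensus function used in rFILTA:

```python
import numpy as np

def consensus_labels(base_labels, k):
    """Evidence-accumulation consensus over one meta-cluster of base clusterings.

    base_labels: (m, n) array, m base clusterings over n objects.
    Builds the co-association matrix and cuts it into k clusters with a
    simple average-linkage agglomeration.
    """
    m, n = base_labels.shape
    # co-association: fraction of base clusterings putting i and j together
    C = np.zeros((n, n))
    for labels in base_labels:
        C += (labels[:, None] == labels[None, :])
    C /= m
    # naive average-linkage agglomeration on similarity C
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = C[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    out = np.empty(n, dtype=int)
    for cid, members in enumerate(clusters):
        out[members] = cid
    return out
```

Because the co-association matrix averages over a meta-cluster's members, label permutations across base clusterings (e.g. cluster 0 and 1 swapped) do not affect the consensus.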

By comparing the results obtained from the unfiltered and filtered base clusterings, we make the following observations:

- Missing clustering views: some interesting clustering views may be missed when performing meta-clustering on the unfiltered base clusterings. This is because some meta-clusters are incorrectly connected by chains of noisy base clusterings, causing these meta-clusters to be grouped into one clustering view.


We then miss some clustering views. Our filtering step can help clean out these noisy clusterings and recover those missed interesting clustering views.

- Poor quality clustering views: from the unfiltered base clusterings, we may generate many poor quality clustering views, because the generated base clusterings may include many poor quality ones (refer to Figure 3.7a and Figure 3.8). Our filtering step can help clean out these poor quality base clusterings and retain the good quality ones (refer to Figure 3.7b and Figure 3.9).

To obtain further insights, let us examine the MBM scores for these two sets of clustering views, shown in Figure 3.10. The x axis indicates the number of top K clustering views returned by the evaluated framework, ranked by our average ranking scheme; e.g., K = 4 indicates the top 4 clustering views. The y axis shows the corresponding MBM scores. The blue crosses indicate the results from the unfiltered base clusterings, and the red circles indicate the results from the filtered base clusterings. We make the following observations:

(a) There are 30 clustering views generated from the unfiltered base clusterings, and 7 clustering views generated from the filtered base clusterings with L = 100 and β = 0.8.

Figure 3.10: MBM scores of the clustering views generated from the unfiltered (30 views) and filtered (7 views) base clusterings on the 2D Gaussian dataset. The x axis indicates the top K clustering views; the y axis shows the corresponding MBM scores.


(b) The MBM scores for the unfiltered base clusterings increase up to K = 4 clustering views. An increase in MBM score may occur because a new ground truth clustering view is recovered, or because the newly returned kth clustering view matches one of the recovered ground truth views better; for the unfiltered base clusterings, it is the former. The MBM scores then plateau for K ≥ 5 clustering views. When the MBM score does not change with increasing K and the best possible matching of all ground truth clustering views (i.e., MBM(Cm) = 1) has not been reached, it means that either no new ground truth clustering view is discovered, or the newly returned kth clustering view does not match the discovered ground truth views better than the first k − 1 returned views do. Here, it is both. Thus, with the unfiltered base clusterings, we discovered 4 ground truth clustering views.

(c) The MBM scores for the top K = 7 clustering views returned from the filtered set of base clusterings keep increasing, because each newly returned view recovers a new ground truth clustering view. Finally, all 7 ground truth clustering views are recovered, reaching an almost perfect match with an MBM score close to 1.
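As a hedged illustration of the plateau behaviour described above (MBM's formal definition appears earlier in the thesis), one formulation consistent with it scores the top-K returned views by giving each ground truth view its best match so far and averaging:

```python
def mbm_score(sim, K):
    """Best-matching score of the top-K returned views against ground truth.

    sim[g][c]: similarity (e.g. AMI) between ground-truth view g and
    returned view c, with views assumed sorted by rank.  For each
    ground-truth view, take its best match among the top-K returned views,
    then average over ground-truth views.
    """
    return sum(max(row[:K]) for row in sim) / len(sim)
```

Under this formulation the score is non-decreasing in K, and it rises exactly when a new ground truth view is recovered or an existing one is matched better, as observed in Figure 3.10.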

3.5.4.2 8 Gaussian 3D dataset

Next, we show the experimental results on a 3D synthetic dataset with 800 data objects, which contains 8 Gaussian clusters. We generate 700 base clusterings with 2 clusters each on this dataset. There are 3 ground truth clustering views for this dataset (refer to Figure 3.13).

The iVAT diagram of the unfiltered base clusterings is shown in Figure 3.11a. From this set of unfiltered base clusterings, we discovered 52 clustering views, corresponding to the 52 meta-clusters presented as diagonal blocks in the iVAT diagram. We show the top 4 clustering views in Figure 3.12. The first 3 correspond to the three ground truth clustering views; however, the fourth is of poor quality.

For the filtered base clusterings, we keep 100 base clusterings (L = 100) after filtering

with β = 0.7. The iVAT diagram of the filtered base clusterings is shown in Figure 3.11b.


We can see that there are three clearly separated blocks, corresponding to the three ground truth clustering views (Figure 3.13).

The MBM scores for these two sets of clustering views are shown in Figure 3.14. We can observe the following:

(a) There are 52 clustering views generated from the unfiltered base clusterings. After filtering with L = 100 and β = 0.7, we discovered 3 clustering views.

(b) The top 3 clustering views returned from the unfiltered base clusterings recover the three ground truth clustering views and match them perfectly, with MBM(C3) = 1. The MBM scores are invariant after the 3rd clustering view, because the first 3 returned views have already recovered and perfectly matched all 3 ground truth views. Inspecting the remaining clustering views, we find that they contain many poor quality and redundant ones, caused by the redundant and poor quality base clusterings.

(c) The 3 clustering views obtained from the filtered set of base clusterings recover and match the three ground truth clustering views perfectly.

In this set of experiments, we find that we can recover the 3 ground truth clustering views perfectly from the unfiltered base clusterings. Do we still need filtering? We observe that after filtering out the redundant and poor quality clusterings, the irrelevant clustering views are eliminated and only the 3 ground truth clustering views are obtained in this case. In addition, for the case of ‘missing clustering views’ found in the experiments on the 4 Gaussian dataset, ranking alone does not solve the problem. Thus, the filtering step is necessary. We will further discuss the necessity of filtering in the following experiments on real datasets.

3.5.5 Real Datasets

In the following section, we evaluate rFILTA on 6 real datasets, which cover a variety of dataset types. As the results for some of the datasets are similar, we present the

Figure 3.11: iVAT diagrams of the unfiltered and filtered base clusterings on the 3D Gaussian dataset. (a) The 700 base clusterings: 52 blocks (meta-clusters) are found. (b) The 100 filtered base clusterings: 3 blocks (meta-clusters) are found.

Figure 3.12: The top 4 clustering views (views 1–4) discovered from the 700 unfiltered base clusterings on the 3D Gaussian dataset.

Figure 3.13: The 3 clustering views (views 1–3) discovered from the 100 filtered base clusterings on the 3D Gaussian dataset. They correspond to the 3 ground truth clustering views.

Figure 3.14: The MBM scores of the clustering views generated from the unfiltered (52 views) and filtered (3 views) base clusterings on the 3D Gaussian dataset.

results of 3 datasets, i.e., the CMUFace, card and flower datasets, in this section, and the other 3 datasets, i.e., the isolet, WebKB and object datasets, in Appendix A.

3.5.5.1 CMUFace Dataset

The CMUFace dataset from the UCI Machine Learning Repository [Bache and Lichman, 2013] is commonly used for the discovery of alternative clusterings [Cui et al., 2007]. It contains 624 images (32 × 30 pixels) of 20 people, along with different attributes of these people, e.g., pose (straight, left, right, up). Two dominant clustering views exist in this dataset: person (identity) and pose. In our experiment, we randomly selected the images of three people, giving 93 images in total. We applied Principal Component Analysis (PCA) to reduce the number of features to 18, which retains more than 90% of the variance of the original data. Again we generated 700 base clusterings, from which rFILTA selected L = 100. The two ground truth clustering views are shown in Figure 3.15; each image is represented as the mean of the images within its cluster.

Figure 3.15: Two ground truth clustering views on the CMUFace dataset. The first row is the person view and the second row is the pose view.

Next, we show three sets of experimental results on this dataset to demonstrate the benefits of filtering and ranking.
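The PCA step of keeping enough components to retain a target fraction of variance can be sketched with a plain SVD. This is a generic illustration (the thesis fixes the output at 18 components for this dataset):

```python
import numpy as np

def pca_to_variance(X, target=0.90):
    """Project X onto the fewest principal components explaining >= target variance."""
    Xc = X - X.mean(axis=0)                       # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)                 # explained-variance ratios
    d = int(np.searchsorted(np.cumsum(var), target) + 1)
    return Xc @ Vt[:d].T, d
```

Equivalently, scikit-learn's `PCA(n_components=0.90)` picks the component count from a variance fraction directly.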

Benefits of filtering

In this experiment, we demonstrate the necessity of filtering. For the 700 unfiltered base clusterings, the iVAT diagram is shown in Figure 3.16a. We found 23 clustering views in this set of unfiltered base clusterings. To avoid clutter, we show the top 4 clustering views in Figure 3.16b. Each row represents a clustering view consisting of three clusters, each shown as the mean of all the images in that cluster. The number above each image is the percentage of the dominant class in the cluster and indicates its purity (we use the same measures and approach in later experiments). The first row is the person view; however, the pose view is not present in the other three views, nor in the other 19 clustering views, because the pose clustering view is hidden among the other meta-clusters due to the noisiness of the base clusterings. From this set of experiments, we find that we may miss some interesting clustering views in the unfiltered base clusterings due to the noisiness of the generated base clusterings.
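The purity number shown above each cluster image (the percentage of the dominant class in the cluster) can be computed as:

```python
from collections import Counter

def cluster_purity(cluster_members, true_labels):
    """Fraction of the dominant true class within one cluster.

    cluster_members: indices of the objects in the cluster;
    true_labels:     ground-truth label per object.
    """
    counts = Counter(true_labels[i] for i in cluster_members)
    return max(counts.values()) / len(cluster_members)
```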

Figure 3.16: Results for the unfiltered base clusterings on the CMUFace dataset. (a) The iVAT diagram of the unfiltered base clusterings: 23 blocks (meta-clusters) are found. (b) The top 4 clustering views. The score above each image is the percentage of the dominant class in that cluster; the AMI similarity scores between each clustering view and the two ground truth views are shown at its right.

Next, we show the results on the 100 filtered base clusterings with β = 0.6 in Figure 3.17. The iVAT diagram of the filtered base clusterings in Figure 3.17a contains two clearly separated blocks. The clustering views obtained from these two blocks, shown in Figure 3.17b, are exactly the person and pose views we are looking for. Compared with the results in Figure 3.16, the iVAT diagram after filtering is less noisy, and the blocks are clearer and well separated. After filtering out the noisy base clusterings, we have recovered the hidden pose view.

Benefits of ranking

When we obtain multiple clustering views, particularly when their number is large, it is time consuming to examine them all. Our filtering step can help

Figure 3.17: Results for the filtered base clusterings on the CMUFace dataset when β = 0.6. (a) The iVAT diagram of the filtered base clusterings: 2 blocks (meta-clusters) are discovered. (b) The 2 clustering views generated from the filtered base clusterings, with purity scores above each image and AMI similarity scores against the two ground truth views at the right of each view.

reduce the number of clustering views by filtering out the poor quality and similar ones, and discover the good clustering views (Figure 3.17). However, sometimes many potentially interesting clustering views remain after filtering, especially when we do not know how many interesting clustering views exist. This depends on several factors, e.g., the complexity of the dataset, the distribution of the generated base clusterings, and the requirements of users. Hence, it is helpful to rank these clustering views to make them easier for users to analyze.

For example, users may want to explore more of the potentially interesting clustering views that

exist in the generated base clusterings by adjusting the tradeoff parameter β. When β is large, we may get a few good and diverse clustering views (e.g., Figure 3.17). When users decrease β, the filtered set of base clusterings becomes more diverse, which may result in more clustering views. For the same set of 700 base clusterings, when we decrease β to 0.4, we discover 16 clustering views after filtering. As shown in Figure 3.18a, the iVAT diagram contains more blocks and the blocks are generally fuzzier than in the iVAT diagram with β = 0.6 (see Figure 3.17a). It is not easy

Figure 3.18: Results for the filtered set of base clusterings on the CMUFace dataset when β = 0.4. (a) The iVAT diagram of the 100 filtered base clusterings: 16 blocks (meta-clusters) are discovered. (b) The top 4 clustering views generated from the filtered base clusterings.

to examine all of them. Thus, it is helpful to rank the clustering views and recommend the top K to users to facilitate their analysis. As shown in Figure 3.18b, among the top 4 clustering views returned by our ensemble ranking scheme, the first row is the person view and the second row is the pose view. In this way, users can check fewer views but still be presented with the interesting ones.
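One simple way to realize such a ranking, assuming each clustering view has scores under several criteria (the exact quality and diversity criteria used by rFILTA's ranking schemes are defined earlier in the chapter), is to average the views' ranks across criteria:

```python
def average_rank(scores_by_criterion):
    """Aggregate several per-view score lists by averaging their ranks.

    scores_by_criterion: list of lists, one score list per criterion,
    higher score = better.  Returns view indices ordered best-first.
    """
    n = len(scores_by_criterion[0])
    total = [0.0] * n
    for scores in scores_by_criterion:
        # rank 0 = best under this criterion
        order = sorted(range(n), key=lambda i: -scores[i])
        for rank, i in enumerate(order):
            total[i] += rank
    return sorted(range(n), key=lambda i: total[i])
```

A view that is mediocre under every criterion accumulates a large rank total and falls to the bottom of the recommendation list.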

Benefits of filtering + ranking

Can we directly rank the clustering views generated from the unfiltered base clusterings, without filtering? There is one problem with this: when the generated base clusterings are noisy, some of the potentially interesting clustering views cannot be discovered from them, and ranking does not help recover those missed clustering views. Thus, we need filtering, which can help discover the potentially interesting clustering views hidden in the generated base clusterings.

The MBM scores for the two sets of discovered clustering views, i.e., those from filtering with β = 0.4 and those from the unfiltered base clusterings, are shown in Figure 3.19. In summary:

Figure 3.19: The MBM scores for the two sets of clustering views on the CMUFace dataset.

(a) We discovered 23 clustering views from the unfiltered base clusterings and 16 clustering views from the filtered base clusterings with L = 100 and β = 0.4.

(b) The clustering views discovered from the filtered base clusterings achieve better MBM scores than those from the unfiltered base clusterings. This is because after filtering, the two ground truth clustering views are discovered and matched well. From the unfiltered base clusterings, we only discover one of the ground truth views.

(c) The top 2 clustering views returned from the filtered base clusterings correspond to the two ground truth clustering views.

(d) Without filtering, only one ground truth clustering view can be discovered, even with ranking. Without ranking, the two good and diverse clustering views corresponding to the two ground truth views remain hidden among the 16 generated clustering views.

3.5.5.2 Card Dataset

The card dataset6 consists of 52 images of playing cards, which can be explained from different perspectives. A deck of cards can be clustered in terms of different suits (heart, club,

6 The card images were downloaded from https://code.google.com/p/vectorized-playing-cards/.

Figure 3.20: Two ground truth clustering views on the card dataset. The first row is the color view and the second row is the suits view.

diamond and spade), different colors (red, black, and mixed), and different ranks (1–13). In our experiments, we randomly chose three suits along with all the cards belonging to them, giving 39 cards in total and two clustering views, i.e., suits (spade, diamond and heart) and color (red, black and mixed). We scaled these images to 100 × 140 pixels and described their features using HOG descriptors [Dalal and Triggs, 2005] with 2 × 2 cells7. We further applied Principal Component Analysis (PCA) to reduce the number of features to 18, which retains more than 90% of the variance of the original data. The two ground truth clustering views are shown in Figure 3.20; each image is represented as the mean of the images within its cluster.

In this set of experiments, we discuss and demonstrate two points. Firstly, we investigate the influence of alternative clustering methods in our framework. Secondly, we show the performance of our filtering and ranking functions.

Alternative Clustering on Card Dataset

In our rFILTA framework, we use alternative clustering as one of the methods for generating diverse base clusterings. As we discussed in the introduction, alternative clustering algorithms restrict the definition of alternative clustering to certain types of objective functions. When the definition of the alternative clustering is

7 The code of feature extraction is available at https://github.com/adikhosla/feature-extraction.

Figure 3.21: The alternative clusterings generated by minCEntropy. The first row is the color view generated given the suits view as the reference clustering; the second row is generated given the color view as the reference clustering. The score above each image is the percentage of the dominant class in that cluster; the AMI similarity scores between each clustering view and the two ground truth views are shown at its right.

not suited to the clustering structure underlying the data, this approach may not find the alternative clusterings. In this set of experiments, we compare the performance of the alternative clustering method (minCEntropy) and rFILTA (without alternative clustering as a generation method) on the card dataset.

We apply the alternative clustering algorithm minCEntropy to the card dataset to generate alternative clusterings. We take one of the two ground truth clustering views (i.e., color and suits) as the reference clustering to find the other, using the default parameter settings of minCEntropy. The results are shown in Figure 3.21. The first row is the color view obtained by taking the suits view as the reference clustering; this alternative clustering is discovered successfully. The second row is the alternative clustering generated by taking the color view as the given clustering. However, it is neither the suits view nor any other explainable view. A possible reason is that the definition of the alternative clustering in this algorithm does not capture the structure of the suits view. Hence, alternative clustering can fail to find interesting clustering views if its definition does not characterize them properly.


Unfiltered Meta-Clustering without Alternative Clustering Generation on Card Dataset

In this set of experiments, we generate 600 base clusterings using the 6 available clustering generation methods, excluding the alternative clustering method minCEntropy. We first perform meta-clustering on the whole set of 600 base clusterings. The results are shown in Figure 3.22. As we can see from the iVAT diagram in Figure 3.22a, the diagonal blocks are not clearly separated. We discovered 135 clustering views from these unfiltered base clusterings; note that without ranking, this is a very large number to evaluate. The top 4 clustering views are shown in Figure 3.22b. The first row is the color view and the second row is the suits view. In the color view, the three clusters indicate red, black and mixed color respectively, from left to right. In the suits view, the three clusters correspond to spades, diamonds and hearts respectively, from left to right.

Figure 3.22: Results on the 600 unfiltered base clusterings on the card dataset. (a) The iVAT diagram of the 600 unfiltered base clusterings: 138 blocks (meta-clusters) are discovered. (b) The top 4 views generated from the unfiltered base clusterings.


Comparing the results of the above two sets of experiments, i.e., alternative clustering and meta-clustering, we observe that the suits view is discovered by meta-clustering but not by alternative clustering (Figure 3.21). This means that the suits view can be captured by some of the 6 generation methods, but not by the definition of alternative clustering in the minCEntropy algorithm.

rFILTA without Alternative Clustering Generation on Card Dataset

In this set of experiments, we demonstrate the performance of the filtering and ranking functions in the rFILTA framework. We apply rFILTA to the same 600 base clusterings used in the previous experiments. After filtering with β = 0.3, we keep 100 base clusterings. The results are shown in Figure 3.23. The iVAT diagram of the 100 filtered base clusterings is shown in Figure 3.23a. Compared with the unfiltered one (Figure 3.22a), some of the meta-clusters (presented as dark blocks along the diagonal) appear more clearly after filtering. The top 4 clustering views are shown in Figure 3.23b; the first row is the color view and the fourth row is the suits view.

The MBM scores for the two sets of clustering views generated from the unfiltered and filtered sets of base clusterings are shown in Figure 3.24. In summary:

(a) We discovered 135 clustering views from the unfiltered base clusterings and 23 clustering views from the filtered base clusterings with L = 100 and β = 0.3.

(b) The top 2 clustering views from the unfiltered base clusterings recover the two ground truth views; the top 4 clustering views from the filtered base clusterings recover the two ground truth views.

(c) The best MBM scores of the clustering views from the filtered set of base clusterings are higher than those from the unfiltered set, meaning that the recovered clustering views from the filtered set of base clusterings are of better quality.

Figure 3.23: Results for the filtered base clusterings on the card dataset. (a) The iVAT diagram of the 100 filtered base clusterings with β = 0.3: 23 blocks (meta-clusters) are discovered. (b) The top 4 clustering views generated from the 100 filtered base clusterings.

Figure 3.24: The MBM scores for the two sets of clustering views from the unfiltered and filtered sets of base clusterings on the card dataset.

3.5.5.3 Flower Dataset

The flower image dataset [Nilsback and Zisserman, 2006] consists of 17 species of flowers with 80 images of each. We chose images from 4 species: Buttercup, Daisy,

Figure 3.25: Example images of buttercup, sunflower, windflower and daisy flowers, from left to right and top to bottom.

Figure 3.26: Two ground truth clustering views on the flower dataset. The first row is the color view and the second row is the shape view.

Windflower and Sunflower (Figure 3.25). For each species, we randomly chose 16 images, giving 64 images in total. There are two natural clustering views in this dataset: color (white and yellow) and shape (sharp and round). To reduce distraction and focus on the flowers, we processed the images by blacking out the background. We scaled these images to 120 × 120 pixels and extracted their features in the same way as for the card dataset; finally, each image is represented by 22 features. We generate 700 base clusterings on this dataset with 2 clusters. The two ground truth clustering views are shown in Figure 3.26.

We first show the results on the unfiltered base clusterings in Figure 3.27. We got

62 clustering views from this set of unfiltered base clusterings. The top 4 clustering views are shown in Figure 3.27b. The first row is the color view, containing two clusters

Figure 3.27: Results for the unfiltered base clusterings on the flower dataset. (a) The iVAT diagram of the 700 unfiltered base clusterings: 62 blocks (meta-clusters) are discovered. (b) The top 4 views from the 700 unfiltered base clusterings. The first row is the color view and the second row is the shape view.

representing the two colors, yellow and white. The second row is the shape view, covering the two shapes, sharp and round. The results after filtering with L = 100 and β = 0.6 are shown in Figure 3.28. As we can see from Figure 3.28a, the iVAT diagram contains two clearly separated blocks (meta-clusters) after filtering out the irrelevant clusterings (compare with the unfiltered iVAT diagram in Figure 3.27a). The clustering views generated from these two meta-clusters, shown in Figure 3.28b, are exactly the color and shape views. After filtering out the irrelevant base clusterings, we can discover the different interesting clustering views.

To further demonstrate the utility of ranking, we show another set of results in Figure 3.29 with L = 100 and β = 0.3. When the tradeoff parameter is decreased to β = 0.3, more diverse clusterings are included. Thus, the iVAT diagram in Figure 3.29a is fuzzier and untidier than that for the higher β = 0.6 in Figure 3.28a. We generated 9 clustering views from this filtered set of base clusterings. The top 4 clustering

Figure 3.28: Results for the 100 filtered base clusterings on the flower dataset. (a) The iVAT diagram of the 100 filtered base clusterings with β = 0.6: 2 blocks are discovered. (b) The 2 clustering views obtained from the filtered base clusterings. The first row is the color view and the second row is the shape view.

views are shown in Figure 3.29b. As we can see, the first row is the color view and the second row is the shape view.

The MBM scores for the clustering views generated from the unfiltered base clusterings

and the filtered base clusterings with β = 0.3 are shown in Figure 3.30. In summary:

(a) We generated 62 clustering views from the unfiltered base clusterings and 9 clustering views from the filtered base clusterings with β = 0.3.

(b) The top 2 clustering views from both sets of clusterings recover and match well with the ground truth clustering views.

(c) The ranking function works well, ranking the color and shape views as the top 2.

3.5.5.4 Evaluation of Running Time for Each Step in rFILTA

In this set of experiments, we show in Figure 3.31 the running time of the different steps of the rFILTA framework on two representative datasets, namely the CMUFace (Section 3.5.5.1) and isolet (Section A.1) datasets. As we can see from Figures 3.31a and 3.31b,



(a) The iVAT diagram of the 100 filtered base clusterings with β = 0.3 on the flower dataset. 9 blocks (meta-clusters) are discovered.

[Figure 3.29b annotations: view scores (Color 1, Shape 0.009), (Color 0.009, Shape 1), (Color 0.1, Shape 0.08) and (Color 0.08, Shape 0.004)]

(b) The top 4 views generated from the filtered base clusterings. The first row is the color view and the second row is the shape view.

Figure 3.29: Results of the filtered base clusterings on flower dataset.

the grouping step takes much more time than the other steps. This is because the CLODD method used in the grouping step is a genetic algorithm, which is slow. Moreover, the generation step takes more time on the isolet dataset than on the CMUFace dataset, because the isolet dataset is larger and has more features than the CMUFace dataset. As grouping is not a contribution of this work, we leave exploring faster alternatives for grouping to future work.

3.6 summary

Meta-clustering is an important tool for discovering multiple views from data by analyzing a large set of raw base clusterings. It does not require any prior knowledge, nor does it pose any assumptions on the data, which makes it especially suitable for exploratory data analysis. However,



Figure 3.30: The MBM scores for the two sets of clustering views generated from the unfiltered and filtered base clusterings on the flower dataset.


(a) The running time of the different steps of rFILTA on the CMUFace dataset.


(b) The running time of the different steps of rFILTA on the isolet dataset.

Figure 3.31: Running time in seconds of the different steps of rFILTA on the CMUFace and isolet datasets.

the generation of a large set of high-quality base clusterings is a challenging problem. There may exist poor quality and similar solutions, which will affect the generation of high quality and diverse views.

In this chapter, we have introduced a clustering selection method for filtering out the poor quality and redundant clusterings from a set of raw base clusterings. This has the effect of lifting the quality of the clustering views generated by the meta-clustering methods applied to the filtered set of clusterings. In particular, we proposed a mutual information based filtering criterion which considers the quality and diversity of clusterings simultaneously. By optimizing this objective function via a simple incremental procedure, we can select a subset of good and diverse base clusterings. Meta-clustering on this filtered set of base clusterings can then yield multiple good and diverse views. In addition, we proposed a scheme to rank multiple clusterings. We demonstrated that ranking is important when the number of potentially interesting clustering views is large. We believe rFILTA is a simple and useful tool in the area of multiple clustering exploration and analysis.
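The incremental selection procedure is only summarized here; one plausible greedy form of a quality-versus-redundancy tradeoff with parameter β could look like the following sketch. The scoring rule, the `quality` values, the `sim` table and the helper name `greedy_filter` are illustrative assumptions, not the exact rFILTA criterion.

```python
import numpy as np

def greedy_filter(clusterings, quality, nmi, L, beta):
    """Greedily select up to L base clusterings, trading off the quality of a
    candidate against its redundancy (max pairwise NMI) with the selected set.
    Illustrative sketch only, not the thesis's exact objective."""
    selected = []
    candidates = list(range(len(clusterings)))
    while candidates and len(selected) < L:
        def score(i):
            redundancy = max((nmi(clusterings[i], clusterings[j])
                              for j in selected), default=0.0)
            return quality[i] - beta * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage: 5 base clusterings identified only by index, with assumed
# qualities and a dummy pairwise-similarity table standing in for NMI.
quality = [0.9, 0.85, 0.8, 0.4, 0.3]
sim = np.array([[1.0, 0.95, 0.2, 0.1, 0.1],
                [0.95, 1.0, 0.2, 0.1, 0.1],
                [0.2, 0.2, 1.0, 0.1, 0.1],
                [0.1, 0.1, 0.1, 1.0, 0.3],
                [0.1, 0.1, 0.1, 0.3, 1.0]])
picked = greedy_filter(list(range(5)), quality,
                       lambda i, j: sim[i, j], L=3, beta=0.6)
```

Note how clustering 1, a near-duplicate of the already-selected clustering 0, is skipped despite its high quality: with β = 0.6 its redundancy penalty outweighs it, which is the intended filtering behaviour.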


4 SOFT CLUSTERING VALIDATION

Abstract

In this chapter, we generalize and test eight popular information-theoretic cluster validity indices for fuzzy partitions generated by the fuzzy c-means (FCM) algorithm and probabilistic partitions built by the expectation-maximization (EM) algorithm for the Gaussian mixture model. We provide explanations and insights about the performance of these indices. Of the eight generalized indices, we advocate a normalized version of the soft mutual information cluster validity index (NMIsM) as the best overall choice, as it outperforms the other seven indices for both FCM and EM according to our tests on synthetic and real data. The superiority of NMIsM is most pronounced for datasets with overlapped and/or different sized clusters. Finally, we provide a theoretical analysis which helps explain the superior performance of NMIsM compared to the other three normalizations of soft mutual information.

4.1 introduction

Many popular external CVIs, designed for comparing two crisp partitions, were introduced in Chapter 2 [Jain and Dubes, 1988]. However, partitions can also be soft, i.e., fuzzy, probabilistic or possibilistic [Anderson et al., 2010] (see Chapter 2). How to evaluate the

In this chapter we present results from the following manuscripts:

Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized Information Theoretic Cluster Validity Indices for Soft Clusterings”. Published in Proceedings of the IEEE Symposium Series on Computational Intelligence, pages 24-31, Dec. 9-12, 2014.

Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Extending Information-Theoretic Validity Indices for Fuzzy Clustering”. To appear in IEEE Transactions on Fuzzy Systems.


soft clusterings using external evaluation is a challenging problem. One approach is to “harden” them to crisp partitions by assigning each object to the cluster with the highest membership (fuzzy partitions), posterior probability (probabilistic partitions), or typicality (possibilistic partitions). The hardened partitions are then evaluated with crisp external validity indices. However, hardening may cause loss of information [Campello, 2007], as an infinite number of different soft partitions can be converted to the same crisp partition. Hence, several methods have been proposed for generalizing some crisp CVIs to non-crisp cases [Campello, 2007; Hüllermeier and Rifqi, 2009; Brouwer, 2009; Anderson et al., 2010; Hüllermeier et al., 2012]. A method reported in [Anderson et al., 2010] can be used to generalize any CVI that is a function of the standard contingency table (see Table 2.2 in Chapter 2) to a soft index. The generalized soft indices can then be used to compare two partitions of any type. However, these methods are designed only to generalize RI and other pair-counting based CVIs to compare soft clusterings.

Information-theoretic measures, which have been introduced in Chapter 2.3.1, form a fundamental class of measures for comparing pairs of crisp partitions and have been shown to outperform other classes of comparison measures in a number of common scenarios [Strehl and Ghosh, 2003; Meilă, 2007; Vinh et al., 2010]. However, the CVIs discussed in those papers are designed for comparing crisp partitions and cannot compare soft ones.

In this chapter, we use the method developed in [Anderson et al., 2010] to generalize eight information-theoretic CVIs (IT-CVIs). We demonstrate the effectiveness of the generalized soft indices on the fuzzy clusters and probabilistic clusters found by the fuzzy c-means (FCM) algorithm and the expectation-maximization (EM) algorithm (refer to Chapter 2) applied to the Gaussian mixture decomposition (GMD) problem, respectively. We use the FCM and EM algorithms because they are two popular approaches for producing soft clusterings. In addition, we provide a theoretical analysis that explains the performance of the measures. We further test and demonstrate the effectiveness of the generalized indices on relatively large datasets.

Specifically, we test and demonstrate the effectiveness of the eight IT-CVIs on 25 synthetic and 10 real-world datasets, in terms of their ability to indicate the correct number of components in synthetic datasets generated from various Gaussian mixtures, and in real-world datasets. Here, the “correct” number of components refers either to the known number of components in the mixture from which Gaussian clusters are drawn, or to the number of classes in labeled ground truth subsets of real data. Our contributions can be summarized as follows:

- Via experimental evaluation, we demonstrate that the generalized information-theoretic indices can be effective on fuzzy partitions generated by FCM and probabilistic partitions generated by the EM algorithm.

- We test and demonstrate the effectiveness of the generalized measures on relatively large datasets.

- We analyze the experimental results and recommend a normalized version of the soft mutual information cluster validity index (NMIsM) as the IT-CVI which generally performs better than the other seven soft information-theoretic measures on datasets with overlapping and/or different sized clusters.

- We prove a theorem which helps explain why NMIsM performs better than the other three normalized versions of soft mutual information, namely NMIsj, NMIss and NMIsr, in certain scenarios.

In the rest of the chapter, we first review related work in Section 4.2. In Section 4.3, we describe how to generalize the IT-based CVIs to the soft case. We then introduce the evaluation methodology in Section 4.4. Extensive experimental results on synthetic and real-world datasets are discussed in Section 4.5. Finally, a theoretical analysis of the superior performance of NMIsM is provided in Section 4.6. The chapter is concluded in Section 4.7.
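The information loss caused by hardening, noted above, can be seen in a small sketch: two clearly different fuzzy partitions collapse to the same crisp partition under maximum-membership assignment, so any crisp CVI scores them identically. (The example matrices are illustrative; rows are clusters, columns are objects.)

```python
import numpy as np

# Two different fuzzy partitions of 3 objects into 2 clusters.
U1 = np.array([[0.9, 0.2, 0.6],
               [0.1, 0.8, 0.4]])
U2 = np.array([[0.6, 0.4, 0.51],
               [0.4, 0.6, 0.49]])

def harden(U):
    """One-hot partition assigning each object (column) to its
    maximum-membership cluster (row)."""
    H = np.zeros_like(U)
    H[U.argmax(axis=0), np.arange(U.shape[1])] = 1.0
    return H

# Both soft partitions harden to the same crisp partition,
# so the distinction between them is lost to any crisp index.
same_after_hardening = np.array_equal(harden(U1), harden(U2))  # True
```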

4.2 related work

Several methods have been proposed for generalizing indices based on the popular pair-counting based cluster validity measure Rand Index (RI, Chapter 2) and related indices to the non-crisp case [Campello, 2007; Hüllermeier and Rifqi, 2009; Brouwer, 2009; Hüllermeier et al., 2012; Anderson et al., 2010]. Next, we provide a brief review of these methods.

Rand proved that RI = 1 if and only if the compared crisp partitions are the

same [Rand, 1971]. Campello [Campello, 2007] generalized the crisp RI to the fuzzy case in two steps. First, the original RI (Equation (2.14)) was rewritten using concepts from set theory. Then the fuzzy extension was accomplished using generalized fuzzy set-theoretic operators. However, Campello noted that the generalized fuzzy RI can only be used for comparing a fuzzy partition to a crisp reference partition. Campello's index can achieve the value of 1 when both of the compared partitions are crisp (and equal).

The authors in [Hüllermeier and Rifqi, 2009; Hüllermeier et al., 2012] generalized RI

based on the concepts of agreements and disagreements between the two compared partitions. In the original RI, the agreements and disagreements are defined according to the counts of the four types of pairs of objects, i.e., (k11 + k00) and (k10 + k01) (Section 2.3.1 in Chapter 2). Fuzzy generalization of the agreements and disagreements proceeds by calculating the distance between pairs of objects using their membership vectors. This generalized RI is a pseudo-metric and can be used to compare two fuzzy partitions. On a subclass of fuzzy partitions, this generalized RI is even a metric.

In [Brouwer, 2009], Brouwer extended RI to the fuzzy case by generalizing the four

counts k11, k10, k01 and k00. A new concept called ‘bonding’ was proposed and the original k11, k10, k01 and k00 were rewritten in a form based on bonding. Two data objects are bonded if they are located in the same cluster in a partition. The bonding-based formulation was then extended to the fuzzy case by employing cosine similarity, instead of the dot product, to measure the correlation between the membership vectors of pairs of objects.

In [Anderson et al., 2010], the authors proposed a method to extend pair-counting based CVIs (including RI) that are based on the standard contingency table (Table 2.2 in Chapter 2) to soft indices, which can be used to compare two partitions of any type.

An issue with all the previously described methods is that they are designed only to generalize RI and other pair-counting based CVIs to compare soft clusterings.


Next, we describe how to generalize crisp information-theoretic based cluster validity indices to the soft case.

4.3 soft generalization of information-theoretic based cvis

4.3.1 Technique for Soft Generalization

Let U ∈ Mhcn and V ∈ Mhrn. The c × r contingency table of these two crisp partitions U and V is shown in Table 2.2 (see Chapter 2). Anderson et al. [Anderson et al., 2010] observed that the contingency table can be constructed as the product D = UV^T. For crisp partitions, this formulation reduces to the regular contingency table. Based on the contingency matrix D = UV^T, the authors of [Anderson et al., 2010] proposed generalizations of 15 pair-counting based crisp comparison indices for use with soft partitions. Any comparison index that depends only on the entries of the contingency matrix can be generalized using the following equation:

D* = φ UV^T = [ N / ∑_{i=1}^{c} a_i ] UV^T    (4.1)

where φ is a scaling factor that is needed in the possibilistic case, and recall that a_i = ∑_{j=1}^{r} n_ij (see Table 2.2 in Chapter 2), which ensures memberships are normalized to 1. For crisp, fuzzy or probabilistic partitions, φ = 1; these are the types of partitions we are interested in. Next, we introduce how to generalize the crisp information-theoretic based CVIs to the soft case using this generalization technique.
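Equation (4.1) amounts to a single matrix product; a minimal sketch with numpy follows (partitions stored with rows as clusters and columns as objects, the orientation consistent with D = UV^T; the example matrices are illustrative):

```python
import numpy as np

# Crisp partition U (c = 2 clusters) and fuzzy partition V (r = 2 clusters)
# of N = 4 objects; rows are clusters, columns are objects, columns sum to 1.
U = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
V = np.array([[0.8, 0.7, 0.3, 0.1],
              [0.2, 0.3, 0.7, 0.9]])

N = U.shape[1]
phi = N / U.sum()          # equals 1 for crisp/fuzzy/probabilistic partitions
D_star = phi * U @ V.T     # generalized contingency table, Equation (4.1)
# D_star = [[1.5, 0.5], [0.4, 1.6]]; its entries sum to N, as for a
# regular contingency table.
```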


Table 4.1: Information-theoretic cluster validity indices.

Index Name | Expression | Range | Find
MI | MI(U,V) | [0, min{H(U), H(V)}] | Max
NMIjoint | MI(U,V) / H(U,V) | [0, 1] | Max
NMImax | MI(U,V) / max{H(U), H(V)} | [0, 1] | Max
NMIsum | 2 MI(U,V) / (H(U) + H(V)) | [0, 1] | Max
NMIsqrt | MI(U,V) / √(H(U) H(V)) | [0, 1] | Max
NMImin | MI(U,V) / min{H(U), H(V)} | [0, 1] | Max
Variation of Information (VI) | H(U,V) − MI(U,V) | [0, log n] | Min
Normalized VI (NVI*) | 1 − MI(U,V) / H(U,V) | [0, 1] | Min

* NVI is the normalized distance measure equivalent to NMIjoint.

4.3.2 Soft Generalization of IT based CVIs

The information-theoretic based CVIs were introduced in Chapter 2. Recall the basic concepts of entropy, joint entropy and mutual information:

H(U) = −∑_{i=1}^{c} (a_i/N) log(a_i/N)    (4.2)

H(U,V) = −∑_{i=1}^{c} ∑_{j=1}^{r} (n_ij/N) log(n_ij/N)    (4.3)

MI(U,V) = ∑_{i=1}^{c} ∑_{j=1}^{r} (n_ij/N) log[ (n_ij/N) / ((a_i b_j)/N²) ]    (4.4)

Recall that H(U) is different from the partition entropy of U, PE(U) = −(1/N) ∑_{k=1}^{N} ∑_{i=1}^{c} u_ik log_a(u_ik), where a ∈ (1, ∞) [Bezdek, 1981].

Eight popular crisp, external information-theoretic cluster validity indices (IT-CVIs) are defined based on the above information-theoretic concepts (Equations (4.2)-(4.4)). Their definitions are listed in Table 4.1.

Next, we generalize the eight popular information-theoretic based CVIs listed in Table 4.1. The entropy of a soft clustering U is defined as H(U) = −∑_{i=1}^{c} (a_i/N) log(a_i/N), where a_i is the sum of the i-th row of the generalized contingency table D*. Similarly, we define the joint entropy of two soft clusterings U and V as H(U,V) = −∑_{i=1}^{c} ∑_{j=1}^{r} (n_ij/N) log(n_ij/N), where n_ij is taken from D*. Finally, we define MI(U,V) = ∑_{i=1}^{c} ∑_{j=1}^{r} (n_ij/N) log[ (n_ij/N) / (a_i b_j/N²) ]. Now the soft versions of the eight IT-CVIs listed in Table 4.1 can be computed from the generalized contingency table D* and are denoted as MIs, NMIsj, NMIsM, NMIss, NMIsr, NMIsm, VIs and NVIs.

Next, we introduce the evaluation methodology used for assessing the effectiveness of the generalized IT-CVIs.
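The generalized quantities above can be computed directly from D*. A minimal sketch follows (natural logarithms, nonzero marginals assumed; the function name is illustrative):

```python
import numpy as np

def soft_it_cvis(D):
    """Soft entropies, MI and the eight IT-CVIs of Table 4.1, computed from a
    generalized contingency table D* (sketch; assumes nonzero marginals)."""
    N = D.sum()
    a = D.sum(axis=1) / N              # row marginals a_i / N
    b = D.sum(axis=0) / N              # column marginals b_j / N
    P = D / N
    H_U = -np.sum(a * np.log(a))
    H_V = -np.sum(b * np.log(b))
    nz = P > 0                         # 0 log 0 is treated as 0
    H_UV = -np.sum(P[nz] * np.log(P[nz]))
    MI = np.sum(P[nz] * np.log(P[nz] / np.outer(a, b)[nz]))
    return {
        "MIs": MI,
        "NMIsj": MI / H_UV,
        "NMIsM": MI / max(H_U, H_V),
        "NMIss": 2 * MI / (H_U + H_V),
        "NMIsr": MI / np.sqrt(H_U * H_V),
        "NMIsm": MI / min(H_U, H_V),
        "VIs": H_UV - MI,
        "NVIs": 1 - MI / H_UV,
    }

# Sanity check: identical crisp partitions give NMI = 1 and VI = 0.
U = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
scores = soft_it_cvis(U @ U.T)
```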

4.4 evaluation methodology

In this section, we present the experimental framework and the datasets that we use to compare the performance of the IT-CVIs.

4.4.1 Implementation and Settings

In this section, we describe the parameter settings of FCM and EM, specifically initialization and termination. We modified the fcm function from the MATLAB Fuzzy Logic Toolbox and a MATLAB implementation of the EM algorithm1 to accommodate our initialization and termination criteria.

Initialization Rule

Both FCM and EM randomly draw c distinct points from the data X as the initial cluster centers. The fuzzifier for FCM is set at m = 2 and the model norm is Euclidean. For the EM algorithm, the initial covariance matrices for the c clusters are diagonal, where the i-th element on the diagonal is the variance of the i-th feature of X, and the initial prior probabilities are 1/c.

1 http://www.dcorney.com/ClusteringMatlab.html


Termination Rule

FCM and EM are both terminated when the difference between two successive estimates of the cluster centers satisfies ‖W_{t+1} − W_t‖_∞ < ε, where W_t = {w_1, . . . , w_c} and ε = 10^{−3}. The maximum number of iterations is 100.
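The termination test can be expressed directly as a sup-norm comparison of successive center estimates (a sketch; `centers_converged` is an illustrative helper, not the toolbox API):

```python
import numpy as np

def centers_converged(W_new, W_old, eps=1e-3):
    """Stop when the sup-norm change in the cluster centers falls below eps,
    i.e. ||W_{t+1} - W_t||_inf < eps."""
    return float(np.max(np.abs(np.asarray(W_new) - np.asarray(W_old)))) < eps

W_t = np.array([[0.0, 0.0], [1.0, 1.0]])
W_t1 = np.array([[0.0005, 0.0], [1.0, 0.9998]])
converged = centers_converged(W_t1, W_t)   # max change 5e-4 < 1e-3
```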

4.4.2 Datasets

We evaluate the generalized IT-CVIs on both synthetic and real-world datasets.

4.4.2.1 Synthetic Data

To evaluate the soft IT measures, we use 25 synthetic datasets, each containing five ground truth clusters, sampled from mixtures of two-dimensional Gaussian distributions. The 25 synthetic datasets span a number of attributes, e.g., shapes of clusters, the amount of overlap between clusters, cluster sizes (i.e., the number of samples in each cluster) and sample sizes. Three of these properties showed the largest impact on the CVIs, namely the amount of overlap between clusters, cluster sizes and sample sizes. Hence, in the rest of the chapter we focus on these three attributes. We generated four groups of datasets, called G1, G2, G3 and G4. The number of samples in each dataset in groups G1, G2 and G3 is n = 1000, while the sample sizes of the datasets in group G4 are varied. We describe the generated datasets:

- G1: Varying cluster overlap with equal sized clusters. There are five equal sized clusters in the first group of datasets. We vary the overlap between two of the clusters by moving the mean of Cluster5 towards Cluster3 while keeping the other three clusters' means fixed. The Gaussian components for these datasets have the following parameters: the covariance matrices for all clusters are identity matrices. The means of Cluster1, Cluster2, Cluster3 and Cluster4 are µ1 = [2, 0], µ2 = [2, 13], µ3 = [13, 10] and µ4 = [8, 17], respectively. The mean of the fifth cluster (Cluster5) is µ5 = [13, 3 + i], where i = 1, . . . , 5. That is, we increase the amount of overlap between Cluster3 and Cluster5 by moving Cluster5 up towards Cluster3 along the y axis. Thus, we generate five datasets, ‘Ovp#’, where # ∈ {1, . . . , 5}. The scatter plots for the two extremes of G1, ‘Ovp1’ and ‘Ovp5’, are shown in Figures 4.1a and 4.1b.

Figure 4.1: Scatter plots for datasets (Ovp1, Ovp5) in group G1, (Dens1, Dens6) in group G2, (OvpDens1, OvpDens6) in group G3 and (NSize1, NSize9) in group G4. Points in the same cluster have the same color.

- G2: Varying cluster sizes without overlapping clusters. The second group of datasets is generated by varying the cluster sizes. For each dataset, the five clusters are well separated, i.e., non-overlapping, with fixed means of µ1 = [2, 0], µ2 = [2, 13], µ3 = [13, 10], µ4 = [8, 17] and µ5 = [13, 3]. The size of Cluster5 is n5 = 100 ∗ i, where i = 1, . . . , 6, and a quarter of the remaining n − n5 objects are drawn for each of the other four clusters. Thus, we generated six datasets, ‘Dens#’, where # ∈ {1, . . . , 6}. The scatter plots for the two extremes in G2, ‘Dens1’ and ‘Dens6’, are shown in Figures 4.1c and 4.1d.

- G3: Varying cluster sizes with overlapping clusters. With the first two groups of datasets, we test the influence of a single factor (overlap or cluster size) on the success of the generalized measures. However, real-world datasets are often more complicated and contain both overlapping and different sized clusters. To mimic this type of structure, we generated a third type of dataset. For each dataset in G3, the means of the five clusters are µ1 = [2, 0], µ2 = [2, 13], µ3 = [13, 10], µ4 = [8, 17] and µ5 = [13, 8]. The means of Cluster3 and Cluster5 are close to each other, so these clusters tend to overlap. We vary the size of Cluster5 as n5 = 100 ∗ i, where i = 1, . . . , 6, and a quarter of the remaining n − n5 objects are drawn for each of the other clusters. Thus, we generate six datasets, ‘OvpDens#’, where # ∈ {1, . . . , 6}. The scatter plots for the two extremes in G3, ‘OvpDens1’ and ‘OvpDens6’, are shown in Figures 4.1e and 4.1f.

- G4: Varying data sizes with equal sized, non-overlapping clusters. The fourth group of datasets is generated by varying the number of samples, to test the influence of data size on the performance of the soft CVIs. To facilitate the comparison, we fix the other two factors, i.e., cluster overlap and cluster sizes, while generating the datasets. Specifically, for each dataset, the five clusters are well separated, with fixed means of µ1 = [2, 0], µ2 = [2, 13], µ3 = [20, 13], µ4 = [11, 20] and µ5 = [18, 5]. The five clusters are equal sized, i.e., n1 = n2 = n3 = n4 = n5 = n/5. The sizes of the datasets are n ∈ {100, 500, 1000, 5000, 10^4, 5 × 10^4, 10^5, 5 × 10^5, 10^6}. Thus, we generate nine datasets, ‘NSize#’, where # ∈ {1, . . . , 9}. The scatter plots for the two extremes in G4, ‘NSize1’ and ‘NSize9’, are shown in Figures 4.1g and 4.1h.

Please note that dataset ‘Ovp5’ in G1 is actually the same as ‘OvpDens2’ in G3. Thus, we have 25 datasets overall.
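The G1 construction can be sketched as follows, using the parameters stated above (a hedged re-implementation, not the original generation code; `make_ovp` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ovp(i, n=1000):
    """Sample the G1 dataset 'Ovp<i>': five identity-covariance 2-D Gaussian
    clusters of equal size, with the mean of Cluster5 at [13, 3 + i] moving
    up toward Cluster3 as i grows."""
    means = np.array([[2, 0], [2, 13], [13, 10], [8, 17], [13, 3 + i]],
                     dtype=float)
    sizes = [n // 5] * 5
    X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=s)
                   for m, s in zip(means, sizes)])
    y = np.repeat(np.arange(5), sizes)   # ground truth labels
    return X, y

X, y = make_ovp(5)   # 'Ovp5', the most overlapped variant
```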

4.4.2.2 Real-World Data

Datasets from the UCI machine learning repository [Bache and Lichman, 2013] are often used as benchmarks for evaluating external validity measures [Brouwer, 2009; Hüllermeier et al., 2012]. These datasets have ground truth partitions provided by physically labeled subsets. We use 10 real-world datasets: nine of them are from the UCI repository and one is the large MNIST dataset, a collection of handwritten digits. More details about this dataset appear in [Havens et al., 2012]. The details of the datasets are shown in Table 4.2, where N, d and cGT denote the number of objects, features and classes, respectively.


Table 4.2: Real-world datasets: N = number of points, d = number of dimensions and cGT = number of ground truth classes.

Dataset | N | d | cGT
Sonar | 208 | 60 | 2
Pima-diabetes | 768 | 8 | 2
Heart-statlog | 270 | 13 | 2
Haberman | 306 | 3 | 2
Wine | 178 | 13 | 3
Vehicle | 846 | 18 | 4
Iris | 150 | 4 | 3
Zoo | 101 | 17 | 7
Vertebral Column | 310 | 6 | 3
MNIST | 70000 | 784 | 10

4.4.3 Experimental Design

We test the effectiveness of the generalized soft indices via their ability to estimate the number of labeled clusters in the synthetic datasets, or classes in the real-world datasets. In order to provide a baseline, we include two other soft CVIs that are not information-theoretic in nature, namely the soft versions [Anderson et al., 2010] of the Rand Index (RI) and the Adjusted Rand Index (ARI, Hubert and Arabie version [Hubert and Arabie, 1985]). We denote these as RIs and ARIs, respectively.

The general idea is to run FCM and EM on each dataset to generate a set of partitions with different numbers of clusters. Each of the nine generalized soft indices is then computed on every partition, where the comparison matrix V in Equation (4.1) is the ground truth partition of the data. The number of clusters is varied for the computed partition U. The number of clusters that optimizes the evaluated index for a dataset is denoted cpre. Let ctrue be the number of known clusters in the synthetic datasets, and let cGT denote the number of labeled classes in the real-world datasets. If cpre = ctrue for the synthetic data, or cpre = cGT for the real-world data, then we declare the prediction of this index on this dataset a success. This is a popular strategy for evaluating the effectiveness of CVIs. However, it makes an implicit assumption: that the computed partition with ctrue (or cGT) clusters is the best partition among all the partitions with different numbers of clusters. This assumption does not always hold; sometimes the best partition may not be the one with ctrue or cGT clusters, in which case this evaluation strategy can fail.

We ran FCM and EM on each dataset with the number of clusters c ranging from 2 to 3 × ctrue for the synthetic datasets and from 2 to 3 × cGT for the real-world datasets. In order to reduce the influence of random initialization of FCM and EM, we generate 100 partitions for each c, and evaluate the nine soft indices on each of the 100 partitions, so that we can make histograms that depict the percentage of successes (success rate) for each index over the 100 trials.
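The protocol above can be sketched as follows, for indices where larger is better (Min-type indices such as VIs would take the minimum instead); the trial scores in the usage example are illustrative, not measured values:

```python
def predict_c(index_scores):
    """c_pre: the candidate number of clusters whose partition attains the
    best (here: maximum) index value."""
    return max(index_scores, key=index_scores.get)

def success_rate(trials, c_target):
    """Fraction of trials whose predicted c matches the target (c_true or c_GT)."""
    hits = sum(predict_c(t) == c_target for t in trials)
    return hits / len(trials)

# Toy usage: each trial maps a candidate c to the index value obtained on
# the partition computed with that c.
trials = [{2: 0.41, 3: 0.90, 4: 0.72},
          {2: 0.38, 3: 0.88, 4: 0.91}]
rate = success_rate(trials, c_target=3)   # one of the two trials picks c = 3
```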

4.5 experimental results

We begin with the analysis of the results for FCM partitions on the synthetic datasets. After that, the results of the generalized IT-CVIs on the real-world datasets are discussed. We then discuss the results for EM partitions.

4.5.1 FCM Tests with the Synthetic Gaussian Datasets

In this section, we evaluate and compare the indices with respect to the cluster overlap, cluster size and data size factors on the FCM-generated clusterings. We first analyze the results across all the synthetic datasets (groups G1, G2, G3 and G4) to obtain a high level understanding, and then analyze the indices on each of these groups.



(a) Overall success rates of the soft CVIs for FCM partitions on the 25 synthetic datasets (2500 trials).


(b) Overall (FCM) success rates of the indices on the 10 real-world datasets (1000 trials).

Figure 4.2: Overall success rates of the soft CVIs for FCM partitions on the synthetic and real-world datasets. Error bars indicate the standard deviation.

The overall success rate for an index is the total number of successes across the 25 datasets divided by the total number of partitions, i.e., 25 × 100. The indices are sorted in descending order by their success rates, displayed from left to right in Figure 4.2a. In general, Figure 4.2a shows that NMIsM (see Table 4.1 for a reminder of the definitions) performs best among the nine soft CVIs on these synthetic datasets, having a success rate of approximately 80%. RIs, ARIs, NMIsj, NMIss, NMIsr and VIs achieve success rates of 62-72%. In contrast, MIs and NMIsm perform poorly, having success rates of about 5%. A possible reason for this is that MI tends to increase monotonically with the number of clusters [Vinh et al., 2010]. Hence, MIs is likely to favour partitions with more clusters. For NMIsm, if the sizes of the discovered clusters are more equally distributed, the entropy of the generated soft partition, H(U), increases with the number of clusters c. This is more likely to occur with FCM, as FCM tends to favour evenly sized clusters. Note that the entropy of the ground truth labels, H(V) = q, is constant. At some c, H(U) > H(V), and subsequently NMIsm(U,V) = MIs(U,V)/H(V) = MIs/q, so NMIsm becomes equivalent to a scaled version of MIs and shares its deficiencies.

Next, we analyze the results on each of the four groups of datasets. For convenience of comparison, we order the indices as in Figure 4.2a in subsequent graphs.
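The monotone tendency of MI noted above can be made concrete with a refinement argument: if a coarse partition is a function of a finer one (obtained by merging its clusters), the finer partition's MI with the ground truth can only be equal or larger. A small sketch, using natural-log MI as in Equation (4.4) (the example labelings are illustrative):

```python
import numpy as np

def mutual_info(u, v):
    """MI (in nats) between two crisp labelings given as integer arrays."""
    n = len(u)
    mi = 0.0
    for i in np.unique(u):
        for j in np.unique(v):
            nij = np.sum((u == i) & (v == j))
            if nij:
                pij = nij / n
                mi += pij * np.log(pij / ((np.sum(u == i) / n)
                                          * (np.sum(v == j) / n)))
    return mi

v = np.repeat([0, 1, 2], 20)                       # ground truth, 3 classes
coarse = np.repeat([0, 0, 1], 20)                  # 2 clusters
fine = np.repeat([0, 1, 2, 3], [20, 20, 10, 10])   # splits both coarse clusters

mi_fine, mi_coarse = mutual_info(fine, v), mutual_info(coarse, v)
# Refining a partition never decreases MI with the ground truth, so an
# unnormalized MI index is biased toward partitions with more clusters.
```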

4.5.1.1 Results on datasets with overlapping clusters (G1)

The results on the first group of five datasets are illustrated in Figure 4.3a. The first seven CVIs perform similarly when evaluating the FCM partitions for the first four datasets, but differently on the dataset Ovp5, which has the most overlapping clusters among the five datasets. Note that a missing vertical bar means that there were no successes for the given index on a particular dataset. For Ovp5, only NMIsM, RIs and ARIs perform relatively well, with success rates of about 60%, while the rest of the soft measures perform poorly (nearly 0%). This suggests that the efficacy of NMIsj, NMIss, NMIsr and VIs in evaluating FCM partitions is more severely affected by overlap, while NMIsM, RIs and ARIs are more robust to this factor. At the other extreme, Figure 4.3a shows that MIs and NMIsm are inadequate for the G1 datasets.

4.5.1.2 Results on datasets with different sized clusters (G2)

The bar chart in Figure 4.3b shows the results on the group of datasets G2, where the cluster sizes are varied and the overlap is fixed. The first seven measures provide identical evaluations for all six datasets in G2. As in the first tests, MIs and NMIsm show poor performance. This indicates that the first seven CVIs are not influenced much by datasets containing different sized, non-overlapping clusters. Compared to the results on the first group of datasets (G1), it seems that FCM, in common with most other clustering algorithms, has more difficulty finding partitions that match the ground truth when there is overlap than on well-separated clusters.

4.5.1.3 Results on datasets with different sized and overlapping clusters (G3)

The success rates of the CVIs for FCM partitions generated from this group of datasets, where the clusters are overlapping and their sizes are varied, are shown in Figure 4.3c. There are some significant differences between the graphs in Figures 4.3a and 4.3b and the chart in Figure 4.3c, which corresponds to this set of tests. Specifically, NMIsM is the only index in the experiments with G3 that succeeds on a positive fraction of the 100 trials for each of the six datasets in G3. The other eight indices have relatively poor performance. In particular, note the dropoff in performance of the soft ARIs, which did well for G1 and G2, but is quite ineffective here. These results suggest that only NMIsM has (relatively) consistent, good success rates on more complicated datasets, like those in this group of experiments.

4.5.1.4 Results on datasets with different data sizes (G4)

Unlike the previous bar graphs, we use a line graph in Figure 4.3d to show the trend of the success rates of all nine indices with increasing data size. The x axis indicates the number of samples in the datasets; the y axis represents the success rates. There are only two visible curves in Figure 4.3d: the upper curve comprises 7 coincident plots, which correspond to the indices NMIsM, RIs, ARIs, NMIsj, NMIss, NMIsr and VIs. The lower curve comprises two coincident plots, which represent MIs and NMIsm. Recall that all of these datasets contain five well-separated, equal-sized clusters, so FCM is expected to find clusterings similar to the ground truth on these datasets. We make several observations from this graph:

- The first seven soft CVIs work well (achieving success rates above 80%) on all these datasets, while the last two, i.e., MIs and NMIsm, work poorly (with success rates of about 20%). Apparently the first seven measures identify the correct number of clusters when there are reasonable FCM partitions, while MIs and NMIsm do not.


- The size of the dataset does not much impact the performance of these measures, i.e., the success rates of the first seven measures are consistently high and the success rates of MIs and NMIsm are always low.

Summary of experimental results on the synthetic datasets: The results show that NMIsM performs better than the other eight soft CVIs. This suggests that NMIsM might be preferred for detecting the right number of clusters when validating FCM partitions. We point out that NMIsM performs better than the other three variants of NMI, i.e., NMIsj, NMIss and NMIsr,² on some datasets, even though they are all based on soft mutual information and differ only in their normalizations. We will discuss this further in Section 4.6, but first we present the results from analyzing the real datasets.

4.5.2 FCM Tests with Real-World Datasets

Next, we analyze the experimental results for FCM on real-world datasets. The overall success rates of the nine indices on the 10 real-world datasets are shown in Figure 4.2b. Note that the indices are shown in the same order as in Figure 4.2a. Most of the indices keep the same ranking as they had on the synthetic datasets (Figure 4.2a), but NMIsr is not as effective as NMIsj, VIs and NMIss in this set of experiments. NMIsM again works best, while MIs and NMIsm are again the worst CVIs. The success rates of these indices on each real-world dataset are summarized in Table 4.3. The highlighted entries in the table show that NMIsM performs better than the other measures.

The last row of Table 4.3 shows the column sums. The higher the number, the greater the overall success on these 10 datasets; a perfect score would be 10. NMIsM, with a score of 5.9, is clearly superior to the other eight indices. RIs comes in second, with a sum of 5. The last two columns, MIs and NMIsm, are tied for last place at 2.06.

The last column of Table 4.3 shows the row sums for the nine indices, and a perfect

score would be 9. The first three datasets (Sonar, Pima-diabetes and Heart-Statlog) have relatively high scores, indicating that these datasets contain fairly distinctive clusters for FCM. Heart-Statlog in particular is nearly perfect, so it likely contains clusters which are easily detected by FCM and which coincide quite well with the ground truth labels. On the other hand, the values cpre and ctrue are never matched in 100 trials (zero row sums) by FCM partitions of the datasets “Vehicle”, “Zoo” and “MINIST”. This means one of two things: either these datasets do not have clusters in their numerical feature spaces that can be recovered by FCM, or there are distinguishable clusters in these feature spaces, but they are not recognizable as local minima of the FCM objective function.

² Because of its very poor performance, we do not include NMIsm in the comparison discussion in Section 4.5.1.


[Figure 4.3: Success rates of soft CVIs for FCM partitions on 25 synthetic datasets. (a) Success rates of soft CVIs on the synthetic datasets (G1) as a function of overlap (Ovp1–Ovp5). (b) Success rates of soft CVIs on the synthetic datasets (G2) as a function of cluster size (Dens1–Dens6). (c) Success rates of soft CVIs on the synthetic datasets (G3) as a function of cluster size with overlapping clusters (OvpD1–OvpD6). (d) Success rates of soft CVIs on the synthetic datasets (G4) as a function of sample size (n = 100 to 10^6); NMIsM, RIs, ARIs, NMIsj, NMIss, NMIsr and VIs share one curve, and MIs and NMIsm share another.]


Table 4.3: Success rate (% of successes in 100 trials) of nine indices for FCM on 10 real-world datasets. The highlighted numbers indicate success rates above 85%. The highlighted datasets have at least one result above 85%.

Dataset           NMIsM  RIs   ARIs  NMIsj  NMIss  NMIsr  VIs   MIs   NMIsm  Row Sums
Sonar             0.95   1     1     0.93   0.93   0.93   1     0.87  0.87   8.48
Pima-diabetes     0.96   1     0.98  0.89   0.89   0.85   1     0     0      6.57
Heart-statlog     0.99   1     1     0.99   0.99   0.99   1     0.99  0.99   8.94
Haberman          0      1     0     0      0      0      1     0     0      2
Wine              1      0     1     1      1      1      0     0.20  0.20   5.4
Vehicle           0      0     0     0      0      0      0     0     0      0
Iris              1      1     1     1      1      0      0     0     0      5
Zoo               0      0     0     0      0      0      0     0     0      0
Vertebral Column  1      0     0     0      0      0      0     0     0      1
MINIST            0      0     0     0      0      0      0     0     0      0
Column Sums       5.9    5     4.98  4.81   4.81   3.77   4     2.06  2.06
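The row and column sums reported in the text can be checked mechanically from the per-dataset entries of Table 4.3. The sketch below is an illustration (not thesis code) that recomputes both sets of sums.

```python
import numpy as np

# Per-dataset success rates from Table 4.3, columns ordered as
# NMIsM, RIs, ARIs, NMIsj, NMIss, NMIsr, VIs, MIs, NMIsm.
rates = np.array([
    [0.95, 1, 1,    0.93, 0.93, 0.93, 1, 0.87, 0.87],  # Sonar
    [0.96, 1, 0.98, 0.89, 0.89, 0.85, 1, 0,    0   ],  # Pima-diabetes
    [0.99, 1, 1,    0.99, 0.99, 0.99, 1, 0.99, 0.99],  # Heart-statlog
    [0,    1, 0,    0,    0,    0,    1, 0,    0   ],  # Haberman
    [1,    0, 1,    1,    1,    1,    0, 0.20, 0.20],  # Wine
    [0,    0, 0,    0,    0,    0,    0, 0,    0   ],  # Vehicle
    [1,    1, 1,    1,    1,    0,    0, 0,    0   ],  # Iris
    [0,    0, 0,    0,    0,    0,    0, 0,    0   ],  # Zoo
    [1,    0, 0,    0,    0,    0,    0, 0,    0   ],  # Vertebral Column
    [0,    0, 0,    0,    0,    0,    0, 0,    0   ],  # MINIST
])

col_sums = rates.sum(axis=0)  # overall score per index (a perfect score is 10)
row_sums = rates.sum(axis=1)  # overall score per dataset (a perfect score is 9)
print(col_sums, row_sums)
```

The column sums reproduce the last row of the table (NMIsM leads with 5.9; MIs and NMIsm trail at 2.06), and the row sums reproduce its last column.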


4.5.3 EM Tests with Synthetic and Real-World Datasets

In this section, we present and analyze the experimental results for EM partitions on the 25 synthetic and 10 real-world datasets. Across the datasets, the results for EM partitions generally show similar trends to those for FCM partitions. Thus, to avoid repetition, our analysis includes less detail than the previous analysis of FCM partitions.

The overall success rates of the nine indices on the 25 synthetic datasets and 10 real-world datasets are shown in Figure 4.4. Figure 4.4a shows that these indices behave quite similarly for EM partitions as they did for FCM partitions on the synthetic datasets (Figure 4.2a). That is, the indices generally keep the same ranking as they did for FCM partitions. In particular, NMIsM shows the highest success rates, while MIs and NMIsm show the lowest. The other six indices perform similarly well.

In Figure 4.4b, we can observe the performance of the generalized indices for EM partitions on the real-world datasets. Generally speaking, except for MIs and NMIsm, the other five IT-CVIs perform better than the two generalized pair-counting based CVIs, i.e., RIs and ARIs. In particular, NMIsM performs better than the other three normalized mutual information indices, i.e., NMIsj, NMIss and NMIsr. VIs performs best overall, which is a new and interesting observation.

Next, the results for each of the four groups of synthetic datasets are shown in Figure 4.5. Generally, these indices perform similarly to how they did for FCM partitions (Figure 4.3). For example, for the group of datasets G3, which contain different sized and overlapping clusters, NMIsM shows superior performance to the other measures. In Figure 4.5d, the x axis indicates the sizes of the datasets and the y axis represents the success rates; each line represents an index. Similar to the FCM case, there appear to be only two lines in Figure 4.5d. In fact, there are nine: the results for NMIsM, RIs, ARIs, NMIsj, NMIss, NMIsr and VIs almost overlap and appear as one line, while the results for MIs and NMIsm almost overlap and appear as the other.


[Figure 4.4: Overall success rates of the soft CVIs for EM partitions on the synthetic and real-world datasets. Error bars indicate the standard deviation. (a) Overall success rates of soft CVIs for EM partitions on 25 synthetic datasets (2500 trials). (b) Overall (EM) success rates of the indices on 10 real-world datasets (1000 trials).]

Summary of EM tests: The generalized IT-CVIs show similar performance for EM partitions as they did for FCM partitions on the synthetic datasets; NMIsM performs best. For the real datasets, NMIsM still performs relatively well, but VIs performs best, which differs slightly from the FCM results.


[Figure 4.5: Success rates of soft CVIs for EM partitions on 25 synthetic datasets. (a) Success rates of soft CVIs on the synthetic datasets (G1) as a function of overlap (Ovp1–Ovp5). (b) Success rates of soft CVIs on the synthetic datasets (G2) as a function of cluster size (Dens1–Dens6). (c) Success rates of soft CVIs on the synthetic datasets (G3) as a function of cluster size with overlapping clusters (OvpD1–OvpD6). (d) Success rates of soft CVIs on the synthetic datasets (G4) as a function of sample size (n = 100 to 10^6).]


4.6 theoretical analysis

Our experiments suggest that NMIsM has superior performance to NMIsj, NMIss and NMIsr (Figures 4.3a and 4.3c). In this section, we provide a theoretical explanation of why NMIsM outperforms the other three variants of NMI in certain situations. First, we define two measures of change in the computation of NMI:

Definition 4.1. Let V ∈ Mhrn be a crisp reference partition (ground truth), r ≥ 3. Let U′ ∈ Mf(r−k)n and U∗ ∈ Mfrn be two soft partitions of n objects with r − k and r clusters respectively. The relative change in MIs with respect to U′ on moving from U′ to U∗ (note that the number of clusters increases by k, from r − k to r) is

    α = (MI(U∗,V) − MI(U′,V)) / MI(U′,V)    (4.5)

Let NMI∗ denote any of the three normalizations {NMIsj, NMIss, NMIsr} of MIs, and let B∗(U,V) denote the corresponding denominator (the normalization factors shown in Table 4.1). The relative change in the denominator of any of these CVIs with respect to U′ on moving from U′ to U∗ is

    β = (B∗(U∗,V) − B∗(U′,V)) / B∗(U′,V)    (4.6)

Theorem 4.1. Let V ∈ Mhrn be a crisp reference partition (ground truth), r ≥ 3. Let U′ ∈ Mf(r−k)n and U∗ ∈ Mfrn be two soft partitions of n objects with r − k and r clusters respectively, where r − k ≥ 2 and k ≥ 1. Let NMI∗ denote any of the three normalizations {NMIsj, NMIss, NMIsr} of MIs. If MI(U∗,V) > MI(U′,V) and H(V) ≥ H(U∗), H(U′), then

    (A) NMIsM(U∗,V) > NMIsM(U′,V), and    (4.7)

    (B) NMI∗(U∗,V) = ((1 + α)/(1 + β)) NMI∗(U′,V).    (4.8)


Proof. (A) If H(V) ≥ H(U∗), H(U′), then max{H(U′), H(V)} = max{H(U∗), H(V)} = H(V). By hypothesis, MI(U∗,V) > MI(U′,V), so

    NMIsM(U∗,V) = MI(U∗,V)/max{H(U∗), H(V)} = MI(U∗,V)/H(V) > MI(U′,V)/H(V) = NMIsM(U′,V).

This completes the proof of (A).

(B) Rearranging Equation (4.5) yields MI(U∗,V) = (1 + α)MI(U′,V). Similarly, rearranging Equation (4.6) yields B∗(U∗,V) = (1 + β)B∗(U′,V). Then for any of the three normalized forms of MIs (NMIsj, NMIss and NMIsr) we have

    NMI∗(U∗,V) = MI(U∗,V)/B∗(U∗,V) = (1 + α)MI(U′,V) / ((1 + β)B∗(U′,V)) = ((1 + α)/(1 + β)) NMI∗(U′,V).

This completes the proof of (B).
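As a concrete instance of statement (B), consider the partitions U2 (2 clusters) and U3 (3 clusters) whose values appear in Table 4.7 later in this section, taking NMIsj's denominator to be the joint entropy H(U,V) (per Table 4.1) and using base-2 logarithms:

```latex
\alpha = \frac{MI(U_3,V) - MI(U_2,V)}{MI(U_2,V)} = \frac{0.905 - 0.792}{0.792} \approx 0.143,
\qquad
\beta = \frac{H(U_3,V) - H(U_2,V)}{H(U_2,V)} = \frac{2.248 - 1.724}{1.724} \approx 0.304.
```

Since α < β, statement (B) gives NMIsj(U3,V) = (1.143/1.304) × 0.460 ≈ 0.403 (Table 4.7 lists 0.402 after rounding): NMIsj decreases even though the mutual information increased, while NMIsM, normalized by the unchanged H(V), increases from 0.500 to 0.571.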

Statement (A) shows that when H(V) ≥ H(U∗), H(U′) and the number of clusters in the soft partition increases from r − k in U′ to r in U∗, then whenever MI also increases, i.e., MI(U∗,V) > MI(U′,V), we have NMIsM(U∗,V) > NMIsM(U′,V). In contrast, the other three forms of normalized MIs depend on the relative changes of both their numerators and denominators: if α > β, then NMI∗(U∗,V) > NMI∗(U′,V); if α = β, then NMI∗(U∗,V) = NMI∗(U′,V); and if α < β, then NMI∗(U∗,V) < NMI∗(U′,V). Thus, when MI(U∗,V) > MI(U′,V), NMIsM will favour U∗ (r clusters, matching the number of clusters in the reference partition V) over U′, which has r − k clusters. But for the other three measures NMI∗, α and β are sensitive to changes from U′ to U∗, and hence can fluctuate easily, making these three measures unstable and their performance more uncertain. Next, we discuss a specific case of Theorem 4.1, when the ground truth is balanced.

Definition 4.2. Let U ∈ Mfcn be any crisp, fuzzy or probabilistic partition of n objects with c clusters. Then U is balanced if and only if ∑_{k=1}^{n} u_{ik} = n/c, 1 ≤ i ≤ c.

In other words, each of the c clusters in U is allocated the same amount of membership. When U is crisp, this is equivalent to saying that each of the c crisp clusters contains the same number of objects. The importance of this concept is contained in the following well-known result.

Proposition 4.2. Let U ∈ Mfcn be any crisp, fuzzy or probabilistic partition with c > 1. The entropy H(U) = −∑_{i=1}^{c} p(u_i) log p(u_i), where p(u_i) = (∑_{k=1}^{n} u_{ik})/n, is maximum if and only if U is balanced. The maximum entropy of U is max_{U ∈ Mfcn}{H(U)} = log c.


Proof. Regard the row sums of U as c “events”. Here are three well-known facts from information theory [Cover and Thomas, 2012]: (i) 0 ≤ H(U) ≤ log c; (ii) H(U) = 0 when exactly one of the p(u_i)'s is 1 and all the rest are zero; (iii) H(U) = log c if and only if all of the events have the same probability p(u_i) = 1/c, i ∈ {1, . . . , c}.

(Forward direction ⇒) Assume that H(U) = log c. By fact (iii), this can happen only when U is balanced.

(Backward direction ⇐) Assume that U is balanced. Then its row sums are all equal by Definition 4.2, that is, the c “events” are all equally likely. Again by fact (iii), this guarantees that H(U) is maximum, with value H(U) = log c.
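Proposition 4.2 is easy to check numerically. The two-line sketch below is an illustration (not part of the thesis experiments); it uses base-2 logarithms, so the maximum is log2 c.

```python
import math

def entropy(p):
    """Shannon entropy (base 2) of a membership distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

c = 3
balanced = [1 / c] * c        # each cluster receives the same membership share
skewed = [0.8, 0.15, 0.05]    # an unbalanced allocation of the same total mass

print(entropy(balanced), math.log2(c), entropy(skewed))
```

The balanced allocation attains the maximum log2 3 ≈ 1.585 (the H(V) value recurring in Table 4.7), while the skewed allocation falls strictly below it.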

We are now in a position to show why NMIsM is the best normalization of the mutual information when the reference partition is balanced (datasets G1 in our experiments).

Corollary 4.3. Let V ∈ Mhrn be a crisp, balanced reference partition. If MI(U∗,V) > MI(U′,V), then statements (A) and (B) in Theorem 4.1 hold.

Proof. Since V is balanced, Proposition 4.2 gives H(V) = log r ≥ H(U∗). Also, H(V) = log r > log(r − k) ≥ H(U′). Hence H(V) ≥ H(U∗), H(U′), and Theorem 4.1 applies.

In summary, the better performance of NMIsM over the other three NMI measures is due to their denominators (normalization factors) having different sensitivities to changes in the number of clusters in candidate partitions. NMIsM is more robust to these changes, while the other three indices suffer from sensitivity to α and β (see Equations 4.5 and 4.6).

Next, we give a simple example that illustrates the content of Theorem 4.1 and explains the behavior of the four NMI measures, i.e., NMIsM, NMIsj, NMIss and NMIsr, when the data contains a balanced ground truth with overlapping clusters. This will help us understand the performance of these four NMI measures on datasets G1.

Figure 4.6a is a scatter plot of the dataset X30, which has n = 30 data objects in ctrue = 3 balanced clusters, drawn from a mixture of three two-dimensional Gaussian distributions, all with identity covariance matrices. The means of the three clusters, named Cluster1, Cluster2 and Cluster3, are µ1 = [2, 0], µ2 = [2, 13] and µ3 = [3, 0] respectively. Cluster1 and Cluster3 are overlapping. We conduct the same experiments


[Figure 4.6: Results on data X30. (a) Scatter plot for data X30. (b) Success rates of NMIsM, NMIsj, NMIss and NMIsr on X30: NMIsM succeeds in 70 of 100 trials while NMIsj, NMIss and NMIsr fail to succeed at all.]

Table 4.4: Contingency matrix for U2 and V.

         v1    v2    v3
    u1   9.7   9.8   0.1
    u2   0.3   0.2   9.9

Table 4.5: Contingency matrix for U4 and V.

         v1    v2    v3
    u1   5.6   2.3   0
    u2   0.2   0.1   5.1
    u3   4.1   7.5   0
    u4   0.1   0.1   4.9

Table 4.6: Contingency matrix for U3 and V.

         v1    v2    v3
    u1   0.1   0.1   9.9
    u2   4.2   7.6   0
    u3   5.7   2.3   0.1

as in Section 4.5.1.1 and show the results in Figure 4.6b. The success rate of NMIsM is about 70%, while the other three normalized forms of MIs are near 0%. This is very similar to the results obtained on data Ovp5 in group G1 (see Figure 4.3a).

Next, we examine the 70 successful cases out of 100 and analyze why NMIsM is successful while the other measures are not. For these 70 cases, NMIsj, NMIss and NMIsr sometimes achieve their maximum values when cpre = 2, while NMIsM chooses


Table 4.7: The values of NMIsj, NMIss, NMIsr and NMIsM and the corresponding entropies for U2, U3 and U4.

             NMIsj  NMIss  NMIsr  NMIsM  MIs    H(V)   H(U)   H(U,V)
    (U2,V)   0.460  0.630  0.652  0.500  0.792  1.585  0.931  1.724
    (U3,V)   0.402  0.574  0.574  0.571  0.905  1.585  1.568  2.248
    (U4,V)   0.336  0.503  0.506  0.460  0.881  1.585  1.917  2.621

cpre = ctrue = 3. (This does not mean the three measures always prefer ctrue − 1 clusters; it depends on the specific dataset and clustering algorithm.) Tables 4.4, 4.5 and 4.6 are the contingency tables for a representative set of three partitions from the 70 cases. These partitions have c = 2, c = 3 and c = 4 clusters, named U2, U3 and U4, respectively. Ideally, the measures should identify U3 as most similar to the ground truth partition V.

Consider the contingency matrices U2V^T and U4V^T, shown in Tables 4.4 and 4.5 respectively. Intuitively, the entries nij in a contingency matrix are the shared data memberships between clusters ui (∈ U) and vj (∈ V). The sum of each column is 10. Note, for these two tables, the lack of dominant entries on the diagonals. The contingency matrix U3V^T is shown in Table 4.6. The dominant shared data memberships in Table 4.6 between ui and vj are 5.7, 7.6 and 9.9 out of 10 respectively, which indicates that U3 is a reasonable clustering with respect to the ground truth and should be chosen over U2 or U4.

The values of the four measures (along with the corresponding entropies H(U), H(V) and joint entropies H(U,V)) on U2, U3 and U4 are shown in Table 4.7. Only NMIsM prefers U3 over U2 and U4, while the other three normalizations of MIs all (incorrectly) prefer U2. This agrees with Theorem 4.1: we know that NMIsM(U3,V) > NMIsM(U2,V), since the mutual information increases from U2 to U3. In contrast, the other three measures, NMIsj, NMIss and NMIsr, have NMI∗(U2,V) > NMI∗(U3,V) because α < β. For U4, all the measures have lower values than for U3, as α < β from U3 to U4.
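The Table 4.7 entries can be recomputed directly from the contingency matrices in Tables 4.4 and 4.6. The sketch below is an illustration; it assumes base-2 logarithms and the standard normalizations from Table 4.1 (the maximum marginal entropy for NMIsM, the joint entropy for NMIsj).

```python
import numpy as np

def soft_mi(C):
    """Soft mutual information and marginal entropies (base 2) from the
    contingency matrix C of a candidate partition U against the reference V."""
    n = C.sum()
    pu, pv = C.sum(axis=1) / n, C.sum(axis=0) / n
    P, PI = C / n, np.outer(pu, pv)
    m = P > 0                                   # treat 0 * log 0 as 0
    mi = float((P[m] * np.log2(P[m] / PI[m])).sum())
    H = lambda p: float(-(p[p > 0] * np.log2(p[p > 0])).sum())
    return mi, H(pu), H(pv)

U2V = np.array([[9.7, 9.8, 0.1], [0.3, 0.2, 9.9]])                   # Table 4.4
U3V = np.array([[0.1, 0.1, 9.9], [4.2, 7.6, 0.0], [5.7, 2.3, 0.1]])  # Table 4.6

for C in (U2V, U3V):
    mi, hu, hv = soft_mi(C)
    # MI, NMIsM, NMIsj -- compare with the 0.792/0.500/0.460 (U2) and
    # 0.905/0.571/0.402 (U3) entries of Table 4.7.
    print(round(mi, 3), round(mi / max(hu, hv), 3), round(mi / (hu + hv - mi), 3))
```

The recomputation confirms the reversal in the text: NMIsM increases from U2 to U3, while NMIsj decreases.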

NMI measures when we have balanced ground truth with overlapping clusters, is dueto their denominators (normalization factors) having different sensitivity to changes in

Page 145: Cluster Validation and Discovery of Multiple Clusterings

122 soft clustering validation

the number of clusters in candidate partitions. NMIsM is more robust to the changes,while the other three measures are more sensitive and easily influenced.

4.7 summary

This chapter has presented an organized study of eight IT-CVIs for FCM partitions and EM partitions on 25 synthetic and 10 real-world datasets. We demonstrated that soft generalizations of the eight IT-CVIs are quite capable of identifying the “correct” number of clusters or classes from candidate partitions generated by FCM and EM on these synthetic and real-world datasets. The results of this study provide a reasonably strong empirical argument for the effectiveness of generalized IT-CVIs for both fuzzy and probabilistic cluster validity. In particular, NMIsM is superior to the other seven generalized IT-CVIs for both FCM and EM partitions on datasets with overlapping and/or various sized clusters. Finally, the proposed theorem provides a theoretical reason to expect better performance from NMIsM than from the other three variants of NMI, i.e., NMIsj, NMIss and NMIsr, in certain situations.

To the best of our knowledge, this is the first cluster validity study which demonstrates that the distribution of the ground truth clusters can bias the value of an external CVI. Our theorem covers a very specific case for one external CVI, but suggests a much richer question for further research: to what extent does the distribution of the ground truth affect any external cluster validity index? This question is investigated in the next chapter.

5 BIAS OF CLUSTER VALIDITY INDICES

Abstract

External cluster validity indices (CVIs) are used to quantify the quality of a clustering by comparing the similarity between the clustering and a ground truth partition. However, some external CVIs show biased behavior when selecting the most similar clustering, and users may consequently be misguided by such results. Recognizing and understanding the bias behavior of CVIs is therefore crucial.

It has been noticed that some external CVIs exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand Index (RI) exhibits a monotone increasing (NCinc) bias, while the Jaccard Index (JI) suffers from a monotone decreasing (NCdec) bias. This type of bias has been previously recognized in the literature.

In this chapter, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. This type of bias occurs if a change in the reference partition causes a change in the bias status (e.g., NCinc, NCdec) of a CVI. For example, the NCinc bias of the RI can be changed to NCdec bias by skewing the distribution of clusters in the ground truth partition. It is important for users to be aware of this new type of biased behavior, since it may affect the interpretation of CVI results. In addition, we study the empirical and theoretical implications of GT bias.

In this chapter we present results from the following manuscript: Y. Lei, J. C. Bezdek, S. Romano, N. X. Vinh, J. Chan and J. Bailey, “Ground Truth Bias in External Cluster Validity Indices”. Under second round review in Pattern Recognition.


5.1 introduction

External CVIs (or comparison measures), introduced in Chapter 2.3.1, are often interpreted as similarity (or dissimilarity) measures between the ground truth and candidate partitions. The ground truth partition, which is usually generated by an expert in the data domain, identifies the primary substructure of interest to the expert. This partition provides a benchmark for comparison with candidate partitions. The general idea of this evaluation methodology is that the more similar a candidate is to the ground truth (a larger value for the similarity measure), the better this partition approximates the labeled structure in the data.

However, this evaluation methodology implicitly assumes that the similarity measure works correctly, i.e., that a larger similarity score indicates a partition that really is more similar to the ground truth. This assumption may not always hold, and when it is false the evaluation results will be misleading. One reason the assumption can fail is that a measure may have bias issues. That is, some measures are biased towards certain clusterings, even though those clusterings are not more similar to the ground truth than the other candidate partitions being evaluated. This can mislead users employing these biased measures. Thus, recognizing and understanding the bias behavior of CVIs is crucial.

The Rand Index (RI, a similarity measure), introduced in Chapter 2.3.1, is a very popular pair-counting based validation measure that is still widely used in many applications [Johnson et al., 2010; Erisoglu et al., 2011; Zakaria et al., 2012; Wang et al., 2013; Xu et al., 2014; Ryali et al., 2015]. It has been noticed that the RI tends to favor candidate partitions with larger numbers of clusters when the number of subsets in the ground truth is fixed [Vinh et al., 2010], i.e., it tends to increase as the number of clusters increases (we call this NCinc bias in this work, where NC = number of clusters). NC bias means that the CVI's preference is influenced by the number of clusters in the candidate partitions. For example, some measures may prefer partitions with a larger number of clusters (NCinc bias), while others may prefer partitions with a smaller number of clusters (NCdec bias). The following example illustrates NC bias for two popular measures, the Rand Index (RI) and the Jaccard Index (JI).


5.1.1 Example 1 - NC bias of RI and JI

In this example, we illustrate NC bias for RI and JI. We randomly generate a set of candidate partitions with different numbers of clusters, together with a random ground truth. We use RI and JI to choose the most similar partition from the candidate partitions by comparing the similarity between each of them and the ground truth. As there is no difference in the generation methodology of the candidate partitions, we expect them to be treated equally on average. A measure without NC bias should treat these candidate partitions equally, without preference for any partition based on its number of clusters. However, if a measure prefers, e.g., the partition with a larger number of clusters (gives a higher value to the partition with a larger number of clusters, if it is a similarity measure), we say it possesses NC bias, more specifically NCinc bias.

Let UGT be a ground truth partition with ctrue subsets. Consider a set of N = 100,000 objects and let the number of clusters in the candidate partitions c vary from 2 to cmax, where cmax = 3 · ctrue. We randomly generate a ground truth partition UGT with ctrue = 5. Then for each c, 2 ≤ c ≤ 15, we generate 100 partitions randomly, and calculate the RI and JI between UGT and each generated partition. Finally, we compute the average values of these two measures at each value of c. The results are shown in Figure 5.1. Note that the RI and JI are max-optimal (larger values are preferred). Evidently, RI monotonically increases and JI monotonically decreases as c increases. Figure 5.1 shows that for this experiment, the RI points to c = 15, its maximum over the range of c, and the JI points to c = 2, its maximum over the range of c. Hence both indices exhibit NC bias (RI shows NCinc bias and JI shows NCdec bias).
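A minimal simulation of this experiment (an illustration, not the thesis code; it uses a much smaller N and fewer trials for speed) reproduces both trends. It also includes the skewed ground truth of Example 2 (Section 5.1.2), which previews the GT-bias reversal of the RI.

```python
import numpy as np

def pair_counts(a, b):
    """Pair-counting quantities from the contingency table of labelings a, b."""
    n = len(a)
    cont = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(cont, (a, b), 1)
    comb2 = lambda x: x * (x - 1) / 2.0
    s = comb2(cont).sum()                 # pairs co-clustered in both a and b
    sa = comb2(cont.sum(axis=1)).sum()    # pairs co-clustered in a
    sb = comb2(cont.sum(axis=0)).sum()    # pairs co-clustered in b
    return comb2(n), s, sa, sb

def ri_ji(a, b):
    T, s, sa, sb = pair_counts(a, b)
    ri = (T + 2 * s - sa - sb) / T        # Rand Index: agreeing pairs / all pairs
    ji = s / (sa + sb - s)                # Jaccard Index
    return ri, ji

rng = np.random.default_rng(0)
n, c_true, trials = 2000, 5, 20           # far smaller than N = 100,000, for speed

def avg_scores(gt, c):
    """Mean (RI, JI) over random candidate partitions with c clusters."""
    vals = [ri_ji(gt, rng.integers(0, c, n)) for _ in range(trials)]
    return np.mean(vals, axis=0)

gt_uniform = rng.integers(0, c_true, n)
# Skewed ground truth of Example 2: roughly 80% of objects in the first cluster.
gt_skewed = np.where(rng.random(n) < 0.8, 0, rng.integers(1, c_true, n))

for name, gt in [("uniform", gt_uniform), ("skewed", gt_skewed)]:
    print(name, "c=2:", avg_scores(gt, 2), "c=15:", avg_scores(gt, 15))
```

With the uniform ground truth, the average RI is larger at c = 15 than at c = 2 (NCinc bias) while the average JI behaves the opposite way (NCdec bias); skewing the ground truth reverses the RI trend.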

of clusters? The answer is no. We have discovered that the overall bias of some CVIs,including the RI, may change their NC bias tendencies depending on the distribution ofthe subsets in the ground truth. The change in the NC bias status of an external CVI dueto the different ground truths is called GT bias. This kind of changeable bias behaviorcaused by the ground truth has not been recognized previously in the literature. It isimportant to be aware of this phenomenon, since it affects how a user should interpretclustering validation results. Next, we give an example of GT bias (GT = ground truth).


[Figure 5.1: The average RI and JI values over 100 partitions at each c with uniformly generated ground truth. (a) Average RI values with random UGT containing 5 subsets. (b) Average JI values with random UGT containing 5 subsets. The symbol ↑ means the measure is a similarity one and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.]

[Figure 5.2: The average RI and JI values over 100 partitions at each c with skewed ground truth. (a) Average RI values with skewed ground truth. (b) Average JI values with skewed ground truth. The symbol ↑ means the measure is a similarity one and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.]

5.1.2 Example 2 - GT bias of RI

We use a similar experimental setup to Example 1, but change the distribution of the subsets in the ground truth by randomly assigning 80% of the objects to the first cluster and then randomly assigning the remaining 20% of the labels to the other four clusters, c = 2, 3, 4, 5. Thus, the distribution of the ground truth is heavily skewed (non-uniform). The average values of RI and JI are shown in Figure 5.2. The shape of JI in Figures 5.1b and 5.2b is the same: JI still decreases monotonically with c, exhibiting NCdec bias and indicating c = 2 as its preferred choice. Now consider the results for the RI: the trend seen in Figure 5.1a is reversed. The RI in Figure 5.2a is maximum at c = 2, and decreases monotonically as c increases. So the NC bias of RI has changed from NCinc to NCdec; thus, RI shows GT bias. To summarize, Examples 1 and 2 show that NC bias is possessed by some external CVIs due to monotonic tendencies of the underlying mathematical model. Beyond this, some external CVIs can be influenced by GT bias, which is due to the way the distribution of the ground truth interacts with the elements of the CVI.

In this chapter, we study the empirical and theoretical implications of GT bias. To

the best of our knowledge, this is the first extensive study of this property for externalcluster validity indices. In this work, our contributions can be summarized as follows:

(a) We identify the GT bias effect for external validation measures and explain its importance.

(b) We test and discuss NC bias for 26 popular pair-counting based external validation measures.

(c) We prove that RI and four related indices suffer from GT bias, and provide theoretical explanations of why and when GT bias happens for RI and the four related indices.

(d) We present experimental results that support our analysis.

(e) We present an empirical example showing that the Adjusted Rand Index (ARI) also suffers from a modified GT bias.
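The skewed-ground-truth experiment of Example 2 can be reproduced in miniature. This is a hedged sketch, not the thesis's code: the small N, the trial count and all helper names are our own choices.

```python
import random
from itertools import combinations

def pair_counts(u, v):
    """Brute-force pair counts for two label vectors u and v."""
    k11 = k10 = k01 = k00 = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:
            k11 += 1
        elif same_u:
            k10 += 1
        elif same_v:
            k01 += 1
        else:
            k00 += 1
    return k11, k10, k01, k00

def rand_index(u, v):
    k11, k10, k01, k00 = pair_counts(u, v)
    return (k11 + k00) / (k11 + k10 + k01 + k00)

random.seed(0)
N = 120
# Skewed ground truth: 80% of the objects in cluster 1, the rest assigned
# uniformly at random to clusters 2-5.
gt = [1] * (8 * N // 10) + [random.randint(2, 5) for _ in range(N - 8 * N // 10)]

# Average RI over random candidate partitions with c clusters each.
avg_ri = {}
for c in (2, 5, 10):
    scores = [rand_index(gt, [random.randint(1, c) for _ in range(N)])
              for _ in range(30)]
    avg_ri[c] = sum(scores) / len(scores)
    print(c, round(avg_ri[c], 3))
```

With a heavily skewed ground truth, the averages shrink as c grows, i.e., the RI exhibits NCdec bias here, matching the reversed trend of Figure 5.2a.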

The remainder of the chapter is organized as follows. In Section 5.2 we discuss work related to the bias problems of some external validation measures. We introduce relevant notations and definitions of NC bias and GT bias in Section 5.3. In Section 5.4, we briefly introduce some background knowledge about the 26 pair-counting based external validation


measures. In Section 5.5, we test the influence of NC bias and GT bias on these 26 measures. Theoretical analysis of GT bias on the RI is presented in Section 5.6. An experimental example, showing that ARI has GT bias in certain scenarios, is presented in Section 5.7. The chapter is concluded in Section 5.8.

5.2 related work

Several works have discussed the bias behavior of external CVIs. As the conditions imposed in discussions of biased behavior vary, we classify these conditions into three categories for convenience of discussion: i) general bias; ii) NC bias; iii) GT bias.

general bias It has been noticed that the RI exhibits a monotonic trend as both the number of subsets in the ground truth and the number of clusters in the candidate partitions increase [Fowlkes and Mallows, 1983; Vendramin et al., 2010; Albatineh, 2010]. However, in our case, we consider the monotonic bias behavior of an external CVI as a function of the number of clusters in the candidate partitions when the number of subsets in the ground truth is fixed.

Wu et al. [Wu et al., 2009] observed that some external CVIs were unduly influenced by the well known tendency of k-means to equalize cluster sizes. They noted that certain CVIs tended to prefer approximately balanced k-means solutions even though the ground truth distribution was heavily skewed. The only case considered in [Wu et al., 2009] was the special case when all of the candidate partitions had the same number of clusters. We develop the general case, allowing candidate partitions to have different numbers of clusters.

Wu et al. [Wu et al., 2010] studied the use of the external CVI known as the F-measure for evaluation of clusters in the context of document retrieval. They found that the F-measure tends to assign higher scores to partitions containing a large number of clusters, which they called the “incremental effect” of the F-measure. These authors also found that the F-measure has a “prior-probability effect”, i.e., the F-measure tends to assign higher scores to partitions with higher prior probabilities for the relevant


documents. Wu et al. only discussed using the F-measure for accepting or rejecting proposed documents; they did not consider the multiclass case.

nc bias The NC bias problem of some external CVIs has been noticed in the literature [Milligan and Cooper, 1986; Vinh et al., 2010; Romano et al., 2014]. Nguyen et al. [Vinh et al., 2010] pointed out that some external validation measures, such as the mutual information (MI) (also studied in [Romano et al., 2014]) and the normalized mutual information (NMI), suffered from NCinc bias. Based on this observation, they proposed adjustments to the information-theoretic based measures. However, they did not notice that the CVIs may show different NC bias behavior with different ground truth partitions.

gt bias Milligan and Cooper [Milligan and Cooper, 1986] tested 5 external CVIs, i.e., RI, Adjusted Rand Index (ARI, Hubert & Arabie) [Arabie and Boorman, 1973], ARI (Morey & Agresti) [Morey and Agresti, 1984], Fowlkes & Mallows (FM) [Fowlkes and Mallows, 1983] and Jaccard Index (JI), by comparing partitions with variable numbers of clusters, generated by hierarchical clustering algorithms, against the ground truth. The empirical tests showed that the RI suffered from NCinc bias, and FM and JI suffered from NCdec bias. However, it was mentioned in this work that “... the bias with the Rand index would be to select a solution with a larger number of clusters. The only exception occurred when two clusters were hypothesized to be present in the data. In this case, the bias was reversed.” This empirical observation can be related to our work. However, beyond this isolated observation there was no analysis or further discussion of this reversed bias behavior of RI. In this work, we provide a comprehensive empirical and theoretical study of this kind of changeable bias behavior due to the distribution of the ground truth.

In summary, GT bias is a less well known but equally important type of bias. This type of bias occurs if a change in the reference partition causes a change in the NC bias status of an external CVI. In the following, we provide an empirical and theoretical study of GT bias for pair-counting based CVIs.


5.3 definitions

In this section, we first introduce the notations used in this work. Then we provide the definitions of the different bias behaviors, i.e., NC bias and GT bias, with the latter having two subtypes, GT1 bias and GT2 bias.

This section contains definitions for the types of bias exerted on external CVIs by their functional forms (NC bias) and by the distribution of the ground truth partition (GT bias). We will call the influence of the number of clusters in the ground truth partition U_GT Type 1, or GT1, bias, and the influence of the size distribution of the subsets in U_GT Type 2, or GT2, bias. A summary of the definitions is available in Table 5.1.

Definition 5.1. Let U_GT ∈ M_hrN be any crisp ground truth partition with r subsets, where 2 ≤ r ≤ N. Let CP = {V_1, …, V_m}, where V_i ∈ M_hciN, be a set of candidate partitions with different numbers of clusters, where 2 ≤ c_i ≤ N. We compare U_GT with each V_i ∈ CP using an external Cluster Validity Index (CVI) and choose the one that is the best match to U_GT. There are two types of external CVIs: max-optimal (larger value is better) similarity measures such as Rand's index (RI); and min-optimal (smaller value is better) dissimilarity measures such as the Mirkin metric (refer to Table 5.2).

We say an external CVI has NC bias if it shows bias behavior with respect to the number of clusters in V_i when comparing V_i ∈ CP to the ground truth U_GT. There are three types of NC bias:

(a) if a max-optimal (min-optimal) CVI tends to assign higher (smaller) scores to the partitions V_i ∈ CP with larger c_i, then we say this CVI has NCinc (NC increase) bias;

(b) if a max-optimal (min-optimal) CVI tends to assign smaller (higher) scores to the partitions V_i ∈ CP with larger values of c_i, then we say this CVI has NCdec (NC decrease) bias;

(c) if a CVI tends to be indifferent to the values of c_i for the partitions V_i ∈ CP, we say that this CVI has no NC bias, i.e., NCneu (NC neutral) bias.


Next, we define ground truth bias (GT bias), which occurs if the use of a different ground truth partition alters the NC bias status of an external CVI.

Definition 5.2. Let Q and Q′ denote the NC bias status of an external CVI with respect to two ground truth partitions, U_GT and U′_GT respectively, so Q, Q′ ∈ {NCinc, NCdec, NCneu}. If Q ≠ Q′, then the CVI has ground truth bias (GT bias).

For example, given U_GT ≠ U′_GT, if a CVI shows, e.g., NCinc bias with U_GT, and shows, e.g., NCneu bias with U′_GT, then this CVI has GT bias. Definition 5.2 characterizes GT bias as a transition effect on the NC bias status of a CVI. There are quite a few subcases of GT bias depending on the properties of U_GT and U′_GT relative to each other. In this chapter we study two specific cases of GT bias, GT1 bias and GT2 bias. Generally speaking, if an external CVI changes its bias status with two ground truths U_GT1 and U_GT2, then it has: i) GT1 bias, when the subsets in these two ground truths are uniformly distributed but the numbers of subsets differ; ii) GT2 bias, when these two ground truths have the same number of subsets but different distributions. The formal definitions of GT1 bias and GT2 bias are as follows.
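Definition 5.2 can be operationalized as a small helper: classify the NC bias trend of a CVI from its averaged score curve, and flag GT bias when the status changes between two ground truths. This is our own illustrative sketch; the function names, the tolerance and the toy score curves are hypothetical, not from the thesis.

```python
def nc_bias_status(avg_scores, tol=0.01):
    """Classify the NC bias trend of a max-optimal CVI from its average
    scores, listed in increasing order of the number of clusters c_i."""
    delta = avg_scores[-1] - avg_scores[0]
    if delta > tol:
        return "NCinc"
    if delta < -tol:
        return "NCdec"
    return "NCneu"

def has_gt_bias(scores_gt1, scores_gt2, tol=0.01):
    """Definition 5.2: GT bias means the NC bias status changes when the
    ground truth partition changes."""
    return nc_bias_status(scores_gt1, tol) != nc_bias_status(scores_gt2, tol)

# Toy averaged-score curves (hypothetical numbers, for illustration only):
print(nc_bias_status([0.50, 0.55, 0.60]))                   # NCinc
print(has_gt_bias([0.50, 0.55, 0.60], [0.50, 0.45, 0.40]))  # True
```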

Definition 5.3. Let U_GT ∈ M_hrN be a balanced crisp ground truth partition with r subsets {u_1, …, u_r}, i.e., p_i = |u_i|/N = 1/r, and let U′_GT ∈ M_hr′N be a balanced crisp ground truth partition with r′ subsets {u′_1, …, u′_r′}, i.e., p′_i = |u′_i|/N = 1/r′, where r ≠ r′. We say an external CVI has GT1 bias if the NC bias status for U_GT is different from that of U′_GT.

For example, given U_GT with 2 balanced subsets and U′_GT with 5 balanced subsets, if a CVI shows, e.g., NCneu bias with U_GT and NCinc bias with U′_GT, then this CVI has GT1 bias.

Definition 5.4. Let U_GT ∈ M_hrN be a crisp ground truth partition with r subsets {u_1, …, u_r}, P = {p_1, …, p_r} = {|u_1|/N, …, |u_r|/N} and p_2 = p_3 = … = p_r = (1 − p_1)/(r − 1). Let U′_GT ∈ M_hr′N be another crisp ground truth partition with r′ subsets {u′_1, …, u′_r′} and P′ = {p′_1, …, p′_r′} = {|u′_1|/N, …, |u′_r′|/N}, p′_2 = p′_3 = … = p′_r′ = (1 − p′_1)/(r′ − 1), where r = r′ and p_1 ≠ p′_1. We say an external CVI has GT2 bias if it exhibits different types of NC bias for U_GT and U′_GT.


For example, given U_GT ∈ M_h5N with p_1 = 0.2 and U′_GT ∈ M_h5N with p′_1 = 0.8, if an external CVI shows, e.g., NCinc bias for U_GT and, e.g., NCdec bias for U′_GT, then this CVI has GT2 bias.

Figure 5.3 illustrates the relationship between NC bias and GT bias that is contained in Definitions 5.1–5.4. In this figure, CP denotes a set of crisp candidate partitions with different numbers of clusters, and CVI denotes an external CVI. U_GT ≠ U′_GT are different crisp ground truth partitions, with U_GT ∈ M_hrN and U′_GT ∈ M_hr′N. Recall that we summarized the different bias problems discussed in this work in Table 5.1.

[Figure 5.3: a decision diagram. Comparing CP against U_GT with a CVI yields NC bias status Q, and comparing CP against U′_GT yields Q′, with Q, Q′ ∈ {NCinc, NCneu, NCdec}. If Q = Q′, there is no GT bias. If Q ≠ Q′ and U_GT, U′_GT are balanced with r ≠ r′, the CVI has GT1 bias. If Q ≠ Q′ with r = r′ and p_1 ≠ p′_1, the CVI has GT2 bias.]

Figure 5.3: The relationship between NC bias and GT bias in Definitions 5.1–5.4.


Table 5.1: Glossary of the different biases discussed in this chapter.

NC bias: An external CVI shows bias behavior with respect to the number of clusters in the compared clusterings.

NCinc bias (NC increase): One of the NC bias statuses. An external CVI prefers clusterings with a larger number of clusters.

NCdec bias (NC decrease): One of the NC bias statuses. An external CVI prefers clusterings with a smaller number of clusters.

NCneu bias (NC neutral): One of the NC bias statuses. An external CVI has no bias for clusterings with respect to the number of clusters.

GT bias: An external CVI shows different NC bias status when varying the ground truth.

GT1 bias: A subtype of GT bias. An external CVI shows different NC bias status for two ground truths with uniform distribution but with different numbers of subsets.

GT2 bias: A subtype of GT bias. An external CVI shows different NC bias status for two ground truths that have the same number of subsets but with different subset distributions.


Next, we introduce the pair-counting based CVIs whose GT bias problems we evaluate.

5.4 pair-counting external cluster validity measures

In this section, we list the 26 pair-counting based measures whose NC bias and GT bias problems we evaluate. This is a non-exhaustive list of measures that have been previously studied [Anderson et al., 2010; Albatineh et al., 2006].

Table 5.2: Pair-counting based comparison measures (external CVIs). k11, k10, k01 and k00 are counts of the four types of pairs of objects (refer to Equations 2.10–2.13 in Chapter 2). C(N, 2) denotes N(N − 1)/2.

# / Name [Reference] / Symbol / Formula / Find

1. Rand Index [Rand, 1971]: RI = (k11 + k00) / (k11 + k10 + k01 + k00). Max.
2. Adjusted Rand Index, Hubert and Arabie [Hubert and Arabie, 1985]: ARI = (k11 − (k11 + k10)(k11 + k01)/T) / (((k11 + k10) + (k11 + k01))/2 − (k11 + k10)(k11 + k01)/T), where T = k11 + k10 + k01 + k00. Max.
3. Mirkin [Mirkin, 1996]: Mirkin = 2(k10 + k01). Min.
4. Jaccard Index [Jaccard, 1908]: JI = k11 / (k11 + k10 + k01). Max.
5. Hubert [Hubert, 1977]: H = ((k11 + k00) − (k10 + k01)) / (k11 + k10 + k01 + k00). Max.
6. Wallace [Wallace, 1983]: W1 = k11 / (k11 + k10). Max.
7. Wallace [Wallace, 1983]: W2 = k11 / (k11 + k01). Max.
8. Fowlkes & Mallows [Fowlkes and Mallows, 1983]: FM = k11 / sqrt((k11 + k10)(k11 + k01)). Max.
9. Minkowski [Jiang et al., 2004]: MK = sqrt((k10 + k01) / (k11 + k10)). Min.
10. Hubert's Gamma [Jain and Dubes, 1988]: Γ = (k11 k00 − k10 k01) / sqrt((k11 + k10)(k11 + k01)(k01 + k00)(k10 + k00)). Max.
11. Yule [Sneath et al., 1973]: Y = (k11 k00 − k10 k01) / (k11 k10 + k01 k00). Max.
12. Dice [Dice, 1945]: Dice = 2 k11 / (2 k11 + k10 + k01). Max.
13. Kulczynski [Kulczyński, 1928]: K = (1/2)(k11/(k11 + k10) + k11/(k11 + k01)). Max.
14. McConnaughey [McConnaughey and Laut, 1964]: MC = (k11² − k10 k01) / ((k11 + k10)(k11 + k01)). Max.
15. Peirce [Peirce, 1884]: PE = (k11 k00 − k10 k01) / ((k11 + k01)(k10 + k00)). Max.
16. Sokal & Sneath [Sokal and Sneath, 1963]: SS1 = (1/4)(k11/(k11 + k10) + k11/(k11 + k01) + k00/(k10 + k00) + k00/(k01 + k00)). Max.
17. Baulieu [Baulieu, 1989]: B1 = (C(N,2)² − C(N,2)(k10 + k01) + (k10 − k01)²) / C(N,2)². Max.
18. Russel & Rao [Russell et al., 1940]: RR = k11 / (k11 + k10 + k01 + k00). Max.
19. Fager & McGowan [Fager and McGowan, 1963]: FMG = k11 / sqrt((k11 + k10)(k11 + k01)) − 1/(2 sqrt(k11 + k10)). Max.
20. Pearson: P = (k11 k00 − k10 k01) / ((k11 + k10)(k11 + k01)(k01 + k00)(k10 + k00)). Max.
21. Baulieu [Baulieu, 1989]: B2 = (k11 k00 − k10 k01) / C(N,2)². Max.
22. Sokal & Sneath [Sokal and Sneath, 1963]: SS2 = k11 / (k11 + 2(k10 + k01)). Max.
23. Sokal & Sneath [Sokal and Sneath, 1963]; Ochiai [Ochiai, 1957]: SS3 = k11 k00 / sqrt((k11 + k10)(k11 + k01)(k10 + k00)(k01 + k00)). Max.
24. Gower & Legendre [Gower and Legendre, 1986]; Sokal & Sneath [Sokal and Sneath, 1963]: GL = (k11 + k00) / (k11 + (1/2)(k10 + k01) + k00). Max.
25. Rogers & Tanimoto [Rogers and Tanimoto, 1960]: RT = (k11 + k00) / (k11 + 2(k10 + k01) + k00). Max.
26. Goodman & Kruskal [Goodman and Kruskal, 1954]; Yule [Yule, 1900]: GK = (k11 k00 − k10 k01) / (k11 k00 + k10 k01). Max.
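All 26 formulas above are built from the pair counts k11, k10, k01 and k00. Rather than enumerating the N(N − 1)/2 object pairs, these counts can be obtained from the contingency table using standard identities; the sketch below is our own code (helper names are ours) and evaluates a few of the listed measures.

```python
from collections import Counter

def pair_counts(u, v):
    """Pair counts from the contingency table: k11 = pairs together in
    both partitions, k10 = together only in u, k01 = together only in v,
    k00 = apart in both."""
    N = len(u)
    n_ij = Counter(zip(u, v))                       # contingency table
    sum_nij2 = sum(n * n for n in n_ij.values())
    sum_ui2 = sum(n * n for n in Counter(u).values())
    sum_vj2 = sum(n * n for n in Counter(v).values())
    k11 = (sum_nij2 - N) // 2
    k10 = (sum_ui2 - sum_nij2) // 2
    k01 = (sum_vj2 - sum_nij2) // 2
    k00 = N * (N - 1) // 2 - k11 - k10 - k01
    return k11, k10, k01, k00

def rand_index(u, v):
    k11, k10, k01, k00 = pair_counts(u, v)
    return (k11 + k00) / (k11 + k10 + k01 + k00)

def jaccard(u, v):
    k11, k10, k01, _ = pair_counts(u, v)
    return k11 / (k11 + k10 + k01)

u = [1, 1, 1, 2, 2, 3]
v = [1, 1, 2, 2, 3, 3]
print(pair_counts(u, v))   # -> (1, 3, 2, 9)
print(rand_index(u, u))    # identical partitions -> 1.0
```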

5.5 numerical experiments

In this section, we evaluate and discuss the 26 pair-counting based external cluster validity indices listed in Table 5.2 with respect to NC bias, GT1 bias and GT2 bias. We use the same experimental setup as in Example 1 (Section 5.1.1) and Example 2 (Section 5.1.2). We found that RI and 4 related CVIs show GT1 and GT2 bias behavior.


5.5.1 Type 1: GT1 bias Evaluation

We use the same experimental setting as in Example 1. The ground truth partition U_GT is randomly generated with c_true subsets which are in each case uniformly distributed in size, where c_true ∈ {2, 10, 20, 30, 50}. Then, we randomly generate 100 candidate partitions with c clusters, where c ranges from 2 to 3·c_true. We performed this experiment on all 26 comparison measures shown in Table 5.2, and we focus our discussion on the results from three representative measures, the RI, JI and ARI (indices #1, #4 and #2 in Table 5.2, respectively), with c_true = 2, 50 (Figure 5.4).

When c_true = 2, the RI trend is flat, that is, it has NCneu bias. But when c_true = 50, the RI favors solutions with a larger number of clusters, i.e., it shows NCinc bias. Thus, the number of subsets in the random ground truth partition U_GT does influence the NC bias behavior of the RI. According to Definition 5.3, this indicates that RI has GT1 bias. Comparing Figures 5.4c and 5.4d shows that the Jaccard index does not seem to suffer from GT bias due to the number of subsets in U_GT: the JI exhibits NCdec bias in both, decreasing monotonically as c increases from 2 to 6 (Figure 5.4c) or 2 to 150 (Figure 5.4d). Figures 5.4e and 5.4f show that the ARI is not monotonic for either value of c_true, and is not affected by the number of subsets in U_GT. Thus, ARI has NCneu bias. We remark that these observed bias behaviors of the evaluated external CVIs are specific to these experimental settings.
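The flat curve at c_true = 2 and the rising curve at c_true = 50 admit a simple back-of-the-envelope explanation; this model is our own approximation, not an argument from the thesis. For a random candidate partition with c equally likely labels, a pair of objects contributes to the RI when it is together in both partitions or apart in both.

```python
def expected_ri(p_same_gt, c):
    """Approximate expected RI between a fixed ground truth and a random
    candidate whose c labels are i.i.d. uniform. p_same_gt is the chance
    that a random pair of objects is together in the ground truth."""
    return p_same_gt * (1 / c) + (1 - p_same_gt) * (1 - 1 / c)

for ctrue in (2, 50):
    p_same = ctrue * (1 / ctrue) ** 2        # approx. sum(p_i^2) for balanced GT
    print(ctrue, [round(expected_ri(p_same, c), 3) for c in (2, 5, 20)])
```

For c_true = 2 (p_same = 1/2) the expectation is 1/2 for every c, reproducing the flat NCneu curve of Figure 5.4a; for c_true = 50 (p_same = 0.02) it grows with c, reproducing the NCinc trend of Figure 5.4b.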

5.5.2 Type 2: GT2 bias Evaluation

We use an experimental setup similar to that in Example 2. We generate a ground truth by randomly assigning 10%, 20%, …, 90% of the objects to the first cluster, and then randomly assigning the remaining cluster labels to the rest of the data objects. Here we discuss c_true = 5. Figure 5.5 shows the results for the RI, JI and ARI with the size of the first cluster either n_1 = 0.1N or n_1 = 0.9N.

Figures 5.5a and 5.5b show that the RI suffers from GT2 bias according to Definition 5.4. It is monotone increasing with n_1 = 10,000 (NCinc bias), but monotone de-


[Figure 5.4: six panels plotting Average Values of Index against # Clusters in Candidate Partitions. (a) RI with random U_GT ∈ M_h2N, i.e., c_true = 2: NCneu. (b) RI with random U_GT ∈ M_h50N, i.e., c_true = 50: NCinc. (c) Jaccard with c_true = 2: NCdec. (d) Jaccard with c_true = 50: NCdec. (e) ARI with c_true = 2: NCneu. (f) ARI with c_true = 50: NCneu.]

Figure 5.4: Average values over 100 trials of the RI, JI and ARI external CVIs with variable ground truth, to investigate GT1 bias (c_true = 2, 50). The symbol ↑ means the measure is a similarity one and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.


creasing with n_1 = 90,000 (NCdec bias). Note that the graphs in Figures 5.5a and 5.5b are reflections of each other about the horizontal line at 0.5. The Jaccard index in Figures 5.5c and 5.5d exhibits the same NC bias status as it did in Figures 5.4c and 5.4d. Specifically, JI decreases monotonically with c, so it still has NCdec bias, but it does not seem to be affected by GT2 bias. The ARI in Figures 5.5e and 5.5f does not show any influence due to GT2 bias; it has NCneu bias under these two sets of experimental settings. So, from our empirical results, ARI would appear to be preferable to the RI and the JI in this setting. To summarize, these examples illustrate that the RI can suffer from GT1 bias and GT2 bias; that JI can suffer from NCdec bias but not GT1 bias or GT2 bias; and that ARI does not suffer from NC bias or GT bias, under the experimental setup we have used here.

5.5.3 Summary for All 26 Comparison Measures

The overall results of similar experiments for all 26 indices in Table 5.2 led to the conclusion that 5 of the 26 external CVIs suffer from GT1 bias and GT2 bias under these experimental settings. These measures are:

RI(U, V) = (k11 + k00) / (k11 + k00 + k10 + k01) = RI    (5.1)

H(U, V) = ((k11 + k00) − (k10 + k01)) / (k11 + k00 + k10 + k01) = 2RI − 1    (5.2)

GL(U, V) = (k11 + k00) / (k11 + (1/2)(k10 + k01) + k00) = 2 / (1 + 1/RI)    (5.3)

RT(U, V) = (k11 + k00) / (k11 + 2(k10 + k01) + k00) = 1 / (2/RI − 1)    (5.4)

Mirkin(U, V) = 2(k10 + k01) = N(N − 1)(1 − RI)    (5.5)

Please note that the external CVIs in Equations 5.2, 5.3, 5.4 and 5.5 are all functions of the RI. This observation forms the basis for our analysis in the next section.
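These functional relationships are easy to verify numerically. The following sketch is our own check, using arbitrary pair counts that sum to N(N − 1)/2; it confirms all four identities.

```python
# Numerical check (our own) that H, GL, RT and Mirkin are functions of RI,
# using arbitrary pair counts with k11 + k10 + k01 + k00 = N(N - 1)/2.
N = 7
k11, k10, k01, k00 = 5, 3, 2, 11           # sums to 21 = N(N - 1)/2

RI = (k11 + k00) / (k11 + k10 + k01 + k00)
H = ((k11 + k00) - (k10 + k01)) / (k11 + k10 + k01 + k00)
GL = (k11 + k00) / (k11 + 0.5 * (k10 + k01) + k00)
RT = (k11 + k00) / (k11 + 2 * (k10 + k01) + k00)
Mirkin = 2 * (k10 + k01)

assert abs(H - (2 * RI - 1)) < 1e-12                   # Eq. 5.2
assert abs(GL - 2 / (1 + 1 / RI)) < 1e-12              # Eq. 5.3
assert abs(RT - 1 / (2 / RI - 1)) < 1e-12              # Eq. 5.4
assert abs(Mirkin - N * (N - 1) * (1 - RI)) < 1e-12    # Eq. 5.5
print("all four identities hold; RI =", RI)
```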


[Figure 5.5: six panels plotting Average Values of Index against # Clusters in Candidate Partitions, with skewed ground truth, c_true = 5. (a) RI, n_1 = 10%·N: NCinc. (b) RI, n_1 = 90%·N: NCdec. (c) Jaccard, n_1 = 10%·N: NCdec. (d) Jaccard, n_1 = 90%·N: NCdec. (e) ARI, n_1 = 10%·N: NCneu. (f) ARI, n_1 = 90%·N: NCneu.]

Figure 5.5: Average values over 100 trials of the RI, JI and ARI with unbalanced ground truth, to investigate GT2 bias, c_true = 5; n_1 = 10%·N (left), n_1 = 90%·N (right). The symbol ↑ means the measure is a similarity one and hence larger values indicate higher similarity. The vertical lines indicate the correct number of clusters.


5.6 bias due to ground truth for the rand index

In this section, we provide a theoretical analysis of the GT bias, GT1 bias and GT2 bias of the Rand Index. More specifically, we will analyze the underlying reason for the GT bias of RI based on its relationship with the quadratic entropy. Then, based on that relationship, we will discuss theoretically when RI shows GT bias, GT1 bias and GT2 bias, according to the distribution of the ground truth and the number of subsets in the ground truth.

We first provide some background on the Havrda-Charvat generalized entropy and its relationship with RI.

5.6.1 Quadratic Entropy and Rand Index

The Havrda-Charvat entropy [Havrda and Charvát, 1967] is a generalization of the Shannon entropy. The quadratic entropy is the Havrda-Charvat generalized entropy with β = 2.

5.6.1.1 Havrda-Charvat Generalized Entropy

The Havrda-Charvat generalized entropy for a crisp partition U with r clusters, U = {u_1, …, u_r}, is:

H_β(U) = (1 / (1 − 2^(1−β))) (1 − Σ_{i=1}^{r} (|u_i|/N)^β)    (5.6)

where β is any real number > 0 and β ≠ 1. Since H is a continuous function of β, when β = 1:

H_1(U) = −Σ_{i=1}^{r} (|u_i|/N) log(|u_i|/N)    (5.7)


which is the Shannon entropy H_S(U). When β = 2 we have the quadratic entropy:

H_2(U) = 2(1 − Σ_{i=1}^{r} (|u_i|/N)²)    (5.8)

It can be shown that in the case of statistically independent random variables U and V,

H_β(U, V) = H_β(U) + H_β(V) − (1 − 2^(1−β)) H_β(U) H_β(V)    (5.9)

When β = 2, Equation 5.9 becomes

H_2(U, V) = H_2(U) + H_2(V) − (1/2) H_2(U) H_2(V)    (5.10)
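The entropies above can be computed directly from cluster sizes. Below is a minimal sketch (our own helper names) which also illustrates that H_β approaches the Shannon entropy (in bits) as β → 1, and that H_2 of a balanced c-cluster partition is 2(1 − 1/c).

```python
import math

def havrda_charvat(sizes, beta):
    """Havrda-Charvat generalized entropy (Eq. 5.6) of a crisp partition,
    given its cluster sizes; requires beta > 0 and beta != 1."""
    N = sum(sizes)
    return (1 - sum((n / N) ** beta for n in sizes)) / (1 - 2 ** (1 - beta))

def quadratic_entropy(sizes):
    """H_2 (Eq. 5.8): the beta = 2 special case."""
    N = sum(sizes)
    return 2 * (1 - sum((n / N) ** 2 for n in sizes))

def shannon(sizes):
    """Shannon entropy in bits, for comparison with the beta -> 1 limit."""
    N = sum(sizes)
    return -sum((n / N) * math.log2(n / N) for n in sizes)

balanced = [25, 25, 25, 25]                 # c = 4 balanced clusters
print(quadratic_entropy(balanced))          # 2 * (1 - 1/4) = 1.5
# As beta -> 1, the generalized entropy approaches the Shannon entropy:
print(havrda_charvat(balanced, 1.000001), shannon(balanced))
```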

5.6.1.2 Quadratic Entropy and VI

Directly analyzing the bias of RI is difficult; hence we use its relationship with VI to do so. In this section, we first introduce VI and show how it can be written in terms of the quadratic entropy.

In [Meilă, 2007], Meilă showed that the Variation of Information (VI) is a metric by expressing it as a function of Shannon's entropy. Consider a crisp partition V with c subsets V = {v_1, …, v_c}; then

VI(U, V) = H_S(U|V) + H_S(V|U)    (5.11)
         = 2H_S(U, V) − H_S(U) − H_S(V)

The VI is not one of the 26 indices in Table 5.2, but this information-theoretic CVI can be computed from the contingency table, and it will help us analyze the GT bias of the 5 external CVIs discussed in Section 5.5.3.

Simovici [Simovici, 2007] showed that replacing Shannon's entropy in Equation 5.11 by the generalized entropy of Equation 5.6 still yields a metric:

VI_β(U, V) = H_β(U|V) + H_β(V|U)    (5.12)


           = 2H_β(U, V) − H_β(U) − H_β(V)    (5.13)

For β = 2, this becomes

VI_2(U, V) = 2H_2(U, V) − H_2(U) − H_2(V)    (5.14)

Based on the concepts introduced above, we next show how to derive the relationship between RI and the quadratic entropy (i.e., the Havrda-Charvat generalized entropy with β = 2). This relationship will help us explain why RI shows GT bias.

5.6.1.3 Quadratic Entropy and Rand Index

Let U and V be two crisp partitions of N samples with r clusters and c clusters respectively. Then the relationship between VI_2(U, V) and RI(U, V) can be derived as follows [Simovici, 2007]. First, based on Equations 5.13 and 5.6, we can rewrite VI_β(U, V) as

VI_β(U, V) = 2H_β(U, V) − H_β(U) − H_β(V)    (5.15)

  = (2 / (1 − 2^(1−β))) (1 − Σ_{i=1}^{r} Σ_{j=1}^{c} (|u_i ∩ v_j|/N)^β) − (1 / (1 − 2^(1−β))) (1 − Σ_{i=1}^{r} (|u_i|/N)^β) − (1 / (1 − 2^(1−β))) (1 − Σ_{j=1}^{c} (|v_j|/N)^β)

  = (1 / (1 − 2^(1−β))) [2(1 − Σ_{i=1}^{r} Σ_{j=1}^{c} (|u_i ∩ v_j|/N)^β) − (1 − Σ_{i=1}^{r} (|u_i|/N)^β) − (1 − Σ_{j=1}^{c} (|v_j|/N)^β)]

  = (1 / (N^β (1 − 2^(1−β)))) (Σ_{i=1}^{r} |u_i|^β + Σ_{j=1}^{c} |v_j|^β − 2 Σ_{i=1}^{r} Σ_{j=1}^{c} |u_i ∩ v_j|^β)

Now setting β = 2, we get

VI_2(U, V) = (2/N²) (Σ_{i=1}^{r} |u_i|² + Σ_{j=1}^{c} |v_j|² − 2 Σ_{i=1}^{r} Σ_{j=1}^{c} |u_i ∩ v_j|²)


  = (2/N²)(2k10 + 2k01)

  = (2(N − 1)/N)(1 − RI(U, V))    (5.16)

Equation 5.16 shows that VI_2 and RI are inversely related. Thus, by analyzing the bias behavior of VI_2, it will be easy to understand the behavior of RI. Next, we will analyze the GT bias behavior of VI_2 based on the concept of quadratic entropy.
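The inverse VI_2–RI relationship can be checked numerically. The sketch below is our own code: it computes VI_2 from cluster and intersection sizes via the closed form reached just before Equation 5.16, and compares it with (2(N − 1)/N)(1 − RI) on a small pair of partitions.

```python
from collections import Counter
from itertools import combinations

def vi2(u, v):
    """VI_2 computed from cluster and intersection sizes."""
    N = len(u)
    su = sum(n * n for n in Counter(u).values())
    sv = sum(n * n for n in Counter(v).values())
    suv = sum(n * n for n in Counter(zip(u, v)).values())
    return (2 / N ** 2) * (su + sv - 2 * suv)

def rand_index(u, v):
    """Brute-force RI: fraction of object pairs on which u and v agree."""
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)

u = [1, 1, 1, 2, 2, 3, 3, 3]
v = [1, 2, 1, 2, 3, 3, 1, 3]
N = len(u)
lhs = vi2(u, v)
rhs = (2 * (N - 1) / N) * (1 - rand_index(u, v))
print(lhs, rhs)   # the two sides agree
```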

5.6.2 GT bias of RI

In this section, we first discuss the general case of GT bias for RI, providing a series of theoretical statements that help explain why and when RI shows GT bias. Then, we discuss two specific cases, GT1 bias and GT2 bias, for RI, and provide related theoretical statements which explain when RI shows GT1 bias and GT2 bias.

5.6.2.1 General Case of GT bias

We introduce Lemma 5.1 to build the foundation for analyzing the GT bias of VI_2, and then of RI.

Lemma 5.1. Given U ∈ M_hrN and V ∈ M_hcN, two statistically independent crisp partitions of N data objects, we have

VI_2(U, V) = H_2(U) + (1 − H_2(U)) H_2(V)    (5.17)

Proof. As U and V are statistically independent, we can substitute Equation 5.10 into Equation 5.14, obtaining

VI_2(U, V) = 2H_2(U, V) − H_2(U) − H_2(V)

  = 2(H_2(U) + H_2(V) − (1/2) H_2(U) H_2(V)) − H_2(U) − H_2(V)

  = H_2(U) + H_2(V) − H_2(U) H_2(V)

  = H_2(U) + (1 − H_2(U)) H_2(V)    (5.18)
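Lemma 5.1 can be exercised on a concrete pair of statistically independent partitions. In the sketch below (our own construction), objects are triples (i, j, t) on an r × c × m grid; labelling by i and by j gives n_ij = |u_i||v_j|/N for every contingency-table cell, i.e., statistical independence.

```python
from collections import Counter
from itertools import product

def h2(labels):
    """Quadratic entropy H_2 (Eq. 5.8) of a labelling."""
    N = len(labels)
    return 2 * (1 - sum((n / N) ** 2 for n in Counter(labels).values()))

def vi2(u, v):
    """VI_2 from cluster and intersection sizes."""
    N = len(u)
    su = sum(n * n for n in Counter(u).values())
    sv = sum(n * n for n in Counter(v).values())
    suv = sum(n * n for n in Counter(zip(u, v)).values())
    return (2 / N ** 2) * (su + sv - 2 * suv)

# Statistically independent partitions on an r x c x m grid: each object
# is a triple (i, j, t); U labels by i, V labels by j, so every cell of
# the contingency table has n_ij = m = |u_i| * |v_j| / N.
r, c, m = 3, 4, 5
objects = list(product(range(r), range(c), range(m)))
u = [i for i, j, t in objects]
v = [j for i, j, t in objects]

lhs = vi2(u, v)
rhs = h2(u) + (1 - h2(u)) * h2(v)    # right-hand side of Lemma 5.1
print(lhs, rhs)
```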


Next, we introduce an important theorem of this chapter, which demonstrates why and when RI shows GT bias by comparing the quadratic entropy of the ground truth U_GT, H_2(U_GT), with 1.

Theorem 5.2. Let U_GT ∈ M_hrN be a ground truth partition with r subsets, and let CP = {V_1, …, V_m} be a set of candidate partitions with different numbers of clusters, where V_i ∈ M_hciN contains c_i clusters which are uniformly distributed (balanced), 2 ≤ c_i ≤ N. Assuming U_GT and V_i ∈ CP are statistically independent, RI suffers from GT bias. In addition, according to the relationship between H_2(U_GT) and 1, we have:

(a) if H_2(U_GT) < 1, RI suffers from NCdec bias (i.e., RI decreases as c_i increases);

(b) if H_2(U_GT) = 1, RI is unbiased, i.e., has NCneu bias (RI has no preference as c_i increases);

(c) if H_2(U_GT) > 1, RI suffers from NCinc bias (i.e., RI increases as c_i increases).

Proof. According to Lemma 5.1,

VI_2(U_GT, V) = H_2(U_GT) + (1 − H_2(U_GT)) H_2(V) = a + bx    (5.19)

where a = H_2(U_GT), b = 1 − H_2(U_GT) = 1 − a, and x = H_2(V). As any V_i ∈ CP is uniformly distributed (balanced), H_2(V_i) = 2(1 − Σ_{j=1}^{c_i} (1/c_i)²) (refer to Equation 5.8), and H_2(V_i) increases as c_i increases. It is clear from Equation 5.19 that for fixed U_GT, VI_2 can be regarded as a straight line with y-intercept a = H_2(U_GT) and slope b = 1 − H_2(U_GT) = 1 − a, so the rate of growth (or decrease, or neither (flat)) of VI_2 depends on b. In other words, VI_2 could be increasing, decreasing or flat as c_i increases. More specifically,

(a) b > 0 ⇒ H_2(U_GT) < 1, and VI_2 increases as x (and c_i) increases;

(b) b = 0 ⇒ H_2(U_GT) = 1, and VI_2 is constant as x (and c_i) increases;

(c) b < 0 ⇒ H_2(U_GT) > 1, and VI_2 decreases as x (and c_i) increases.


According to Equation 5.16, we know that VI_2 and RI are inversely related. Thus,

(a) H_2(U_GT) < 1 ⇒ RI has NCdec bias;

(b) H_2(U_GT) = 1 ⇒ RI has NCneu bias;

(c) H_2(U_GT) > 1 ⇒ RI has NCinc bias.

Given a ground truth partition U_GT, Theorem 5.2 provides a test for the NC bias status of the RI: compute the quadratic entropy H_2(U_GT) of the reference partition U_GT, compare it to the value 1, and use Theorem 5.2 to determine the type of bias. Figure 5.6 illustrates the relationship between H_2(U_GT) and 1 on the Rand index graphically. Figure 5.6 is based on the same experimental setting as in Example 2 but with N = 1000, a different distribution P (proportions of subsets) in the ground truth, and c_true = 3. Next, we show that we can also judge the NC bias and GT bias of RI by comparing Σ_{i=1}^{r} (p_i)² with 1/2, and introduce Corollary 5.3, which is the basis for the following theorems.

Corollary 5.3. Let U_GT ∈ M_hrN be a ground truth partition with r subsets {u_1, …, u_r}, and let P = {p_1, …, p_r} with p_i = |u_i|/N. Let CP = {V_1, …, V_m} be a set of generated partitions with different numbers of clusters, where V_i ∈ M_hciN contains c_i clusters which are balanced, 2 ≤ c_i ≤ N. Assuming U_GT and V_i ∈ CP are statistically independent, we have

(a) if Σ_{i=1}^{r} (p_i)² > 1/2, then RI has NCdec bias;

(b) if Σ_{i=1}^{r} (p_i)² = 1/2, then RI has NCneu bias;

(c) if Σ_{i=1}^{r} (p_i)² < 1/2, then RI has NCinc bias.

Proof. According to Theorem 5.2, we know that, depending on the relationship between H_2(U_GT) and 1, i.e., on the slope b in Equation 5.19, RI shows different NC bias status. As H_2(U_GT) = 2(1 − Σ_{i=1}^{r} p_i²) (Equation 5.8), then

b = 1 − H_2(U_GT) = 1 − 2(1 − Σ_{i=1}^{r} p_i²)


[Figure 5.6: average RI values plotted against the number of clusters in the candidate partitions (2 to 9), with a dotted line at c_true = 3. Legend (n_1 : n_2 : n_3, H_2(U_GT), NC bias): 600:50:350 (1.03, NCinc); 600:100:300 (1.08, NCinc); 600:150:250 (1.11, NCinc); 600:200:200 (1.12, NCinc); 700:50:250 (0.89, NCdec); 700:100:200 (0.92, NCdec); 700:150:150 (0.93, NCdec); 2000:100:20 (0.22, NCdec).]

Figure 5.6: 100-trial average RI values with c_i ranging from 2 to 9 for c_true = 3 (dotted line). In the legend, n_1 : n_2 : n_3 indicates the sizes of the three clusters and the corresponding H_2(U_GT) values. The type of NC bias is also indicated in the legend.

$$= 2\sum_{i=1}^{r} p_i^2 - 1 = 2\Big(\sum_{i=1}^{r} p_i^2 - \frac{1}{2}\Big) \qquad (5.20)$$

Thus, we know that the slope $b$, i.e., the relationship between $H_2(U_{GT})$ and 1, depends on the relationship between $\sum_{i=1}^{r} p_i^2$ and $\frac{1}{2}$. So the three assertions of the corollary follow by noting the relationship between $\sum_{i=1}^{r} p_i^2$ and $\frac{1}{2}$, and the definition of $H_2(U_{GT})$.
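Corollary 5.3 yields a bias test that needs only the ground truth cluster sizes. A minimal sketch in Python (the function name is ours; exact rational arithmetic avoids false ties):

```python
from fractions import Fraction

def rand_index_nc_bias(cluster_sizes):
    """Predict the NC bias of the Rand index from the ground truth
    cluster sizes, using Corollary 5.3: compare sum(p_i^2) with 1/2."""
    n = sum(cluster_sizes)
    s = sum(Fraction(size, n) ** 2 for size in cluster_sizes)
    if s > Fraction(1, 2):
        return "NCdec"   # RI tends to decrease with the number of clusters
    if s == Fraction(1, 2):
        return "NCneu"
    return "NCinc"       # RI tends to increase with the number of clusters

# The 700:50:250 ground truth from Figure 5.6: sum(p_i^2) = 0.555 > 1/2.
print(rand_index_nc_bias([700, 50, 250]))   # NCdec
# The 600:200:200 ground truth: sum(p_i^2) = 0.44 < 1/2.
print(rand_index_nc_bias([600, 200, 200]))  # NCinc
```

Note that the balanced two-cluster case gives $\sum p_i^2 = \frac{1}{2}$ exactly, i.e., NCneu bias, in agreement with Theorem 5.5 below.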

Next, we introduce another theorem, which helps us understand how the prior probabilities $\{p_i\}$ and the number of subsets $r$ in the ground truth $U_{GT}$ influence the NC bias status of RI.

Theorem 5.4. Let $U_{GT} \in M_{hrN}$ be a ground truth partition with $r$ subsets $\{u_1, \ldots, u_r\}$, and let $P = \{p_1, \ldots, p_r\}$ with $p_i = |u_i|/N$. Let $P' = \{p'_1, p'_2, \ldots, p'_r\}$ be $P$ reordered in descending order, where $p'_1 \ge p'_2 \ge \ldots \ge p'_r$. Let $CP = \{V_1, \ldots, V_m\}$ be a set of generated partitions with different numbers of clusters, where $V_i \in M_{hc_iN}$ contains $c_i$ clusters which are balanced, $2 \le c_i \le N$. Assuming $U_{GT}$ and $V_i \in CP$ are statistically independent,


then RI has GT bias. In addition, depending on $P$ and $r$, we have:

When $r > 2$:

(a) if $p'_1 > \frac{1}{2}$, and

if $p'_1(p'_1 - \frac{1}{2}) > \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then RI has NCdec bias;

if $p'_1(p'_1 - \frac{1}{2}) = \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then RI has NCneu bias;

if $p'_1(p'_1 - \frac{1}{2}) < \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then RI has NCinc bias.

(b) if $p'_1 = \frac{1}{2}$, then RI has NCinc bias;

(c) if $p'_1 < \frac{1}{2}$, then RI has NCinc bias.

When $r = 2$:

(a) if $p'_1 > \frac{1}{2}$, then RI has NCdec bias;

(b) if $p'_1 = \frac{1}{2}$, then RI has NCneu bias.

Proof. According to Corollary 5.3, we know that the relationship between $\sum_{i=1}^{r} p_i^2$ and $\frac{1}{2}$ influences the NC bias status of RI. It is straightforward to see that $p_i$ and $r$ influence the relationship between $\sum_{i=1}^{r} p_i^2$ and $\frac{1}{2}$; thus $p_i$ and $r$ can potentially alter the NC bias status of RI. We have

$$\sum_{i=1}^{r} (p'_i)^2 - \frac{1}{2} = \sum_{i=1}^{r} (p'_i)^2 - \frac{1}{2}\sum_{i=1}^{r} p'_i = \sum_{i=1}^{r} p'_i\Big(p'_i - \frac{1}{2}\Big) = p'_1\Big(p'_1 - \frac{1}{2}\Big) + \sum_{i=2}^{r} p'_i\Big(p'_i - \frac{1}{2}\Big) \qquad (5.21)$$

Please note that $p'_1$ is the proportion of the biggest cluster in the ground truth, based on which we discuss and summarize the influence of $p_i$ and $r$ on the NC bias status of RI. We can discuss the relationship between $p'_1(p'_1 - \frac{1}{2})$ and $\sum_{i=2}^{r} p'_i(p'_i - \frac{1}{2})$, which is equivalent to the relationship between $\sum_{i=1}^{r} p_i^2$ and $\frac{1}{2}$, for the different NC bias statuses.

When $r > 2$:

(a) if $p'_1 > \frac{1}{2}$, then because $\sum_{i=1}^{r} p'_i = 1$, we have $p'_2, \ldots, p'_r < \frac{1}{2}$. Thus, with the help of Corollary 5.3, we have:

if $p'_1(p'_1 - \frac{1}{2}) > \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then $\sum_{i=1}^{r} (p'_i)^2 > \frac{1}{2}$, thus RI has NCdec bias;

if $p'_1(p'_1 - \frac{1}{2}) = \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then $\sum_{i=1}^{r} (p'_i)^2 = \frac{1}{2}$, thus RI has NCneu bias;

if $p'_1(p'_1 - \frac{1}{2}) < \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then $\sum_{i=1}^{r} (p'_i)^2 < \frac{1}{2}$, thus RI has NCinc bias.

(b) if $p'_1 = \frac{1}{2}$, we have $p'_2, \ldots, p'_r < \frac{1}{2}$, then $\sum_{i=1}^{r} (p'_i)^2 < \frac{1}{2}$, thus RI has NCinc bias.

(c) if $p'_1 < \frac{1}{2}$, we have $p'_2, \ldots, p'_r < \frac{1}{2}$, then $\sum_{i=1}^{r} (p'_i)^2 < \frac{1}{2}$, thus RI has NCinc bias.

When $r = 2$:

(a) if $p'_1 > \frac{1}{2}$, then $p'_1(p'_1 - \frac{1}{2}) > p'_2(\frac{1}{2} - p'_2)$, so $\sum_{i=1}^{2} (p'_i)^2 > \frac{1}{2}$, thus RI has NCdec bias.

(b) if $p'_1 = \frac{1}{2}$, we have $p'_2 = \frac{1}{2}$, then $\sum_{i=1}^{2} (p'_i)^2 = \frac{1}{2}$, thus RI has NCneu bias.

Thus, the RI suffers from GT bias according to the distribution $P$ of the ground truth and the number of clusters $r$ in the ground truth.

Theorem 5.4 tells us how the ground truth distribution $P'$ and the number of clusters $r$ of $U_{GT}$ affect the Rand index, and helps us judge the NC bias status based on $P'$ and $r$. For example, if $r > 2$, $p'_1 > \frac{1}{2}$ and $p'_1(p'_1 - \frac{1}{2}) > \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i)$, then RI has NCdec bias (e.g., $r = 3$, $p'_1 = \frac{2}{3}$, $p'_2 = \frac{1}{4}$ and $p'_3 = \frac{1}{12}$). If $r = 2$ and $p'_1 = \frac{1}{2}$, then RI has NCneu bias. Thus RI has GT bias. Up to this point, the discussion and theoretical analysis have concerned GT bias in general.
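The dominant-cluster condition of Theorem 5.4 can be checked numerically against the $\sum_i p_i^2$ test of Corollary 5.3; algebraically the two comparisons are identical, since $p'_1(p'_1 - \frac{1}{2}) - \sum_{i=2}^{r} p'_i(\frac{1}{2} - p'_i) = \sum_{i=1}^{r} (p'_i)^2 - \frac{1}{2}$. A small sketch (function names are ours):

```python
from fractions import Fraction as F

def dominant_cluster_test(props):
    """Theorem 5.4 style test: compare p'_1 (p'_1 - 1/2)
    against sum_{i>=2} p'_i (1/2 - p'_i)."""
    p = sorted(props, reverse=True)
    lhs = p[0] * (p[0] - F(1, 2))
    rhs = sum(q * (F(1, 2) - q) for q in p[1:])
    return "NCdec" if lhs > rhs else ("NCneu" if lhs == rhs else "NCinc")

def sum_squares_test(props):
    """Corollary 5.3 test: compare sum(p_i^2) with 1/2."""
    s = sum(q * q for q in props)
    return "NCdec" if s > F(1, 2) else ("NCneu" if s == F(1, 2) else "NCinc")

# Worked example from the text: r = 3, p' = (2/3, 1/4, 1/12).
example = [F(2, 3), F(1, 4), F(1, 12)]
print(dominant_cluster_test(example), sum_squares_test(example))  # NCdec NCdec
```

Here $p'_1(p'_1 - \frac{1}{2}) = \frac{1}{9} \approx 0.111$ exceeds $\sum_{i=2}^{3} p'_i(\frac{1}{2} - p'_i) = \frac{14}{144} \approx 0.097$, so both tests report NCdec bias.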

Next, we discuss GT1 bias and GT2 bias of the RI, which are two specific types of GT bias with certain conditions imposed on the ground truth. This will also help explain and judge the NC bias behaviors of the indices in the empirical evaluation shown in Section 5.5.

5.6.2.2 GT1 bias and GT2 bias

We start by introducing a theorem for the GT1 bias of RI.

Theorem 5.5. Let $U_{GT} \in M_{hrN}$ be a crisp ground truth partition with $r$ balanced subsets $\{u_1, \ldots, u_r\}$, i.e., $p_i = |u_i|/N = \frac{1}{r}$. Let $CP = \{V_1, \ldots, V_m\}$ be a set of generated partitions with different numbers of clusters, where $V_i \in M_{hc_iN}$ contains $c_i$ clusters which are balanced, $2 \le c_i \le N$. Assuming $U_{GT}$ and $V_i \in CP$ are statistically independent, then RI suffers from GT1 bias. More specifically,


(a) if r = 2, then RI has NCneu bias;

(b) if r > 2, then RI has NCinc bias.

Proof. Corollary 5.3 shows that the NC bias of the RI depends on the relationship between $\sum_{i=1}^{r} (p_i)^2$ and $\frac{1}{2}$. Since $p_i = \frac{1}{r}$, we have $\sum_{i=1}^{r} (p_i)^2 = \frac{1}{r}$. Then, according to Corollary 5.3: i) if $r = 2$, then $\sum_{i=1}^{r} (p_i)^2 = \frac{1}{2}$ and RI has NCneu bias; ii) if $r > 2$, then $\sum_{i=1}^{r} (p_i)^2 < \frac{1}{2}$ and RI has NCinc bias. By Definition 5.3, different values of $r$ in $U_{GT}$, i.e., $r = 2$ or $r > 2$, result in different NC bias status for the RI; thus RI has GT1 bias.

Theorem 5.5 provides an explanation of how GT1 bias influences RI; for example, it makes the behavior of RI in the GT1 bias testing in Section 5.5.1 (Figures 5.4a and 5.4b) easier to understand. Next, we introduce a theorem for the GT2 bias of RI.

Theorem 5.6. Let $U_{GT} \in M_{hrN}$ be a ground truth partition with $r$ subsets $\{u_1, \ldots, u_r\}$. Assume the first cluster $u_1$ in the ground truth has variable size, and the remaining clusters $\{u_2, \ldots, u_r\}$ are uniformly distributed in size across the remaining $N - |u_1|$ objects. Let $P = \{p_1, \ldots, p_r\}$ with $p_i = |u_i|/N$, $0 < p_i < 1$ and $\sum_{i=1}^{r} p_i = 1$, so $p_2 = p_3 = \ldots = p_r = \frac{1 - p_1}{r - 1}$. Let $CP = \{V_1, \ldots, V_m\}$ be a set of generated partitions with different numbers of clusters, where $V_i \in M_{hc_iN}$ contains $c_i$ clusters which are balanced, $1 \le i \le m$. Assuming $U_{GT}$ and $V_i \in CP$ are statistically independent, then RI suffers from GT2 bias. More specifically, let $p^* = \frac{2 + \sqrt{2(r-1)(r-2)}}{2r}$; we have:

When $r > 2$,

(a) if $p_1 > p^*$, then RI has NCdec bias;

(b) if $p_1 = p^*$, then RI has NCneu bias;

(c) if $p_1 < p^*$, then RI has NCinc bias.

When $r = 2$,

(a) if $p_1 = p^*$, then RI has NCneu bias;

(b) if $p_1 \ne p^*$, then RI has NCdec bias.


Proof. According to Corollary 5.3, we know that the relationship between $\sum_{i=1}^{r} (p_i)^2$ and $\frac{1}{2}$ determines the NC bias status of the RI. As $p_i = \frac{1 - p_1}{r - 1}$, $i = 2, \ldots, r$, we have:

$$\sum_{i=1}^{r} p_i^2 - \frac{1}{2} = p_1^2 + \sum_{i=2}^{r} p_i^2 - \frac{1}{2} = p_1^2 + (r-1)\Big(\frac{1-p_1}{r-1}\Big)^2 - \frac{1}{2} = \frac{r}{r-1}p_1^2 - \frac{2}{r-1}p_1 + \frac{3-r}{2(r-1)} \qquad (5.22)$$

Equation 5.22 is quadratic in $p_1$, and in our case has one real positive root $p^* = \frac{2 + \sqrt{2(r-1)(r-2)}}{2r}$. Then:

When $r > 2$:

(a) if $p_1 > p^*$, then $\sum_{i=1}^{r} (p_i)^2 > \frac{1}{2}$ $\Rightarrow$ RI has NCdec bias.

(b) if $p_1 = p^*$, then $\sum_{i=1}^{r} (p_i)^2 = \frac{1}{2}$ $\Rightarrow$ RI has NCneu bias.

(c) if $p_1 < p^*$, then $\sum_{i=1}^{r} (p_i)^2 < \frac{1}{2}$ $\Rightarrow$ RI has NCinc bias.

When $r = 2$:

(a) if $p_1 = p^*$, then $\sum_{i=1}^{r} (p_i)^2 = \frac{1}{2}$ $\Rightarrow$ RI has NCneu bias.

(b) if $p_1 \ne p^*$, then $\sum_{i=1}^{r} (p_i)^2 > \frac{1}{2}$ $\Rightarrow$ RI has NCdec bias.

Then, if $U_{GT}$ and $U'_{GT}$ satisfy the conditions, for example $r = r' = 2$, $p_1 = p^*$ and $p'_1 \ne p^*$, then RI will have different NC bias status for $U_{GT}$ and $U'_{GT}$; thus RI suffers from GT2 bias.

Theorem 5.6 provides an explanation of how GT2 bias affects the RI. For example, in the GT2 bias testing (Section 5.5.2) with $r = 4$, when $p_1 = 0.8 > p^* = \frac{1+\sqrt{3}}{4} \approx 0.683$, the RI tends to decrease as $c_i$ increases. Figure 5.7 illustrates GT2 bias on the RI graphically; the basis of this figure is the same experimental setting as Example 2 in Section 5.1.2, with $N = 1000$ and $c_{true} = 4$. We also show the relationship between $r$ and $p^*$ in Figure 5.8 ($r$ takes integer values from 2 to 50). Note that $\lim_{r \to \infty} p^* = \frac{1}{\sqrt{2}} \approx 0.7071$.
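The critical proportion $p^*$ and the resulting bias prediction are directly computable. A minimal sketch (function names are ours):

```python
import math

def p_star(r):
    """Critical proportion of the dominant cluster from Theorem 5.6:
    p* = (2 + sqrt(2 (r-1) (r-2))) / (2 r)."""
    return (2 + math.sqrt(2 * (r - 1) * (r - 2))) / (2 * r)

def gt2_nc_bias(p1, r):
    """Predict the NC bias of RI for a ground truth with a dominant
    cluster of proportion p1 and r-1 equal-sized remaining clusters."""
    ps = p_star(r)
    if r == 2:
        return "NCneu" if p1 == ps else "NCdec"
    if p1 > ps:
        return "NCdec"
    if p1 < ps:
        return "NCinc"
    return "NCneu"

print(round(p_star(4), 3))   # 0.683, the threshold used in Figure 5.7
print(gt2_nc_bias(0.8, 4))   # NCdec
```

As a sanity check, `p_star(r)` approaches $1/\sqrt{2} \approx 0.7071$ as $r$ grows, matching the limit noted above.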


Next, we conclude our study by giving an experimental example showing that the ARI can exhibit GT bias in certain scenarios.

5.7 example of gt bias for adjusted rand index (ari)

In this section we illustrate that, depending on the set of candidate partitions, the ARI can show GT bias behavior in certain scenarios. Recall that the ARI in Figures 5.1 and 5.2 had NCneu bias for the method of partition generation used there. We now conduct experiments with a different set of candidates and discover that the ARI can be made to exhibit GT bias. We conduct two sets of experiments using the following protocol. We first generate a ground truth $U_{GT1}$ by randomly choosing 20% of the object labels from $N = 100{,}000$ objects to identify the first cluster. Then, we randomly choose 20% of the object labels from the remaining 80,000 objects as the second cluster, and finally, we randomly assign the rest of the cluster labels $[3, c_{true}]$ to the remaining objects, where $c_{true} \ge 3$. We generate a second ground truth partition $U_{GT2}$ in the

Figure 5.7: 100 trial average RI values with $c_i$ in $\{2, \ldots, 12\}$ for $c_{true} = 4$. Here $p_1 = |u_1|/N$ takes 9 values from 0.1 to 0.9, and the other 3 clusters are uniformly distributed in size. When $p_1 > p^* = 0.683$, the RI decreases as $c_i$ increases (e.g., $p_1 = 0.7, 0.8, 0.9$); when $p_1 < p^* = 0.683$, the RI increases with $c_i$ (e.g., $p_1 = 0.1, \ldots, 0.6$).


Figure 5.8: The relationship between $p^*$ and $r$, for $r \in \{2, \ldots, 50\}$.

following way. We randomly choose 20% of the object labels from $N = 100{,}000$ objects as the first cluster. Then we randomly choose 50% of the object labels from the remaining 80,000 objects as the second cluster, and finally, we assign the rest of the cluster labels $[3, c_{true}]$ to the rest of the objects, where $c_{true} \ge 3$. We set $c_{true} = 5$ for both $U_{GT1}$ and $U_{GT2}$. For these two sets of experiments, we generate 100 candidate partitions $CP$ in this

way. For each candidate $V_i \in CP$, we copy the first cluster from $U_{GT1}$ or $U_{GT2}$ as the first cluster in $V_i$. Then, we randomly assign the rest of the cluster labels $[2, c_i]$ to the other 80,000 objects, where $c_i$ ranges from 2 to 15. The results are shown in Figure 5.9. In these two experiments the ARI shows NCinc bias with $U_{GT1}$ and NCdec bias with $U_{GT2}$. Comparing Figures 5.9a and 5.9b shows that, for these experiments, the ARI suffers from GT bias. The exploration of this interesting phenomenon is beyond the scope of this chapter, and is an interesting direction for future work.
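The generation protocol above can be sketched as follows (a small-scale illustration with our own function names; the experiments in this section use $N = 100{,}000$):

```python
import random

def make_ground_truth(n, c_true, first_frac, second_frac, seed=0):
    """Build a label vector: first cluster = first_frac of all objects,
    second cluster = second_frac of the remainder, and the rest labeled
    uniformly at random from [3, c_true]."""
    rng = random.Random(seed)
    labels = [0] * n
    objs = list(range(n))
    rng.shuffle(objs)
    n1 = int(first_frac * n)
    n2 = int(second_frac * (n - n1))
    for o in objs[:n1]:
        labels[o] = 1
    for o in objs[n1:n1 + n2]:
        labels[o] = 2
    for o in objs[n1 + n2:]:
        labels[o] = rng.randint(3, c_true)
    return labels

def make_candidate(gt, c, seed=0):
    """Copy the ground truth's first cluster; relabel everything else
    uniformly at random from [2, c]."""
    rng = random.Random(seed)
    return [1 if g == 1 else rng.randint(2, c) for g in gt]

gt1 = make_ground_truth(1000, c_true=5, first_frac=0.2, second_frac=0.2)
cand = make_candidate(gt1, c=10)
# The first cluster is preserved exactly in the candidate.
assert all(x == 1 for g, x in zip(gt1, cand) if g == 1)
```

Averaging the ARI between such candidates and the ground truth over many trials, for each $c$, reproduces the kind of curves summarized in Figure 5.9.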

5.8 summary

This chapter examines several types of bias that may affect external cluster validity indices that are used to evaluate the quality of candidate partitions by comparison with a ground truth partition. They are: i) one of two types of NC bias (NCinc, NCdec),


Figure 5.9: 100 trial average ARI values for two different ground truths and sets of candidate partitions with different numbers of clusters. (a) The average ARI values for $U_{GT1}$ and random partitions with different numbers of clusters (NCinc). (b) The average ARI values for $U_{GT2}$ and random partitions with different numbers of clusters (NCdec).

which arises when the mathematical model of the external CVI tends to be monotonic in the number of clusters in candidate partitions; ii) GT bias, which arises when the ground truth partition alters the NC bias status of an external CVI; iii) GT1 bias, which arises when the number of clusters in the ground truth partition alters the NC bias status of an external CVI; iv) GT2 bias, which arises when the distribution of the ground truth partition alters the NC bias status of an external CVI.

Numerical experiments with 26 pair-counting based external CVIs established that, for the method described in the examples, 5 of the 26 suffer from GT1 bias and/or GT2 bias, viz., the indices due to Rand (#1), Mirkin (#3), Hubert (#5), Gower and Legendre (#24) and Rogers and Tanimoto (#25), the numbers referring to rows in Table 5.2. In fact, the four indices Mirkin (#3), Hubert (#5), Gower and Legendre (#24) and Rogers and Tanimoto (#25) are all functions of RI. We point out that the observed bias behavior (NC bias, GT1 bias and GT2 bias) of the 26 tested indices was based on a particular way of obtaining candidate partitions: in our experiments, the "clustering algorithm" used to generate the CPs was random draws from $M_{hc_iN}$. It is entirely possible that sets of CPs secured, for example, by running clustering algorithms on a dataset will NOT exhibit the same bias tendencies. This is just another difficulty of external cluster validity indices, as illustrated by the fact that we could change the bias status of the ARI by changing the method of securing the candidate partitions.


The major point of this work is to draw attention to the fact that there can be a GT bias problem for external CVIs.

We then formulated an explanation for both types of GT bias of the Rand index based on the Havrda-Charvat quadratic entropy. Our theory explains how RI's NC bias behavior is influenced by the distribution of the ground truth partition and also by the number of clusters in the ground truth. Our major result is Theorem 5.2, which provides a computable test that predicts the NC bias behavior of the Rand index, and hence of all external CVIs related to it. The Rand index has been one of the most popular external CVIs due to its simple, natural interpretation, and has recently been applied in much research work [Johnson et al., 2010; Erisoglu et al., 2011; Zakaria et al., 2012; Wang et al., 2013; Xu et al., 2014; Ryali et al., 2015]. Thus, the identified GT bias behavior of RI, with the corresponding explanation, could be helpful for users who apply RI in their work. Finally, we gave an experimental example showing that the ARI can suffer from GT bias in certain scenarios.

We believe this to be the first systematic study of the effects of ground truth on the NC bias behavior of external cluster validity indices. We have termed this effect GT bias.


6 CONCLUSIONS

In this chapter, we summarize the contributions we have made in the thesis. In addition, we outline the future directions of research, and conclude with a few final words.

6.1 summary of thesis

Cluster analysis is an important unsupervised learning process in data analysis. It aims to group data objects into groups (clusters), so that the data objects in the same cluster are more similar, while the data objects in different clusters are more dissimilar. Many clustering techniques have been proposed for this task. Traditional clustering methods focus on discovering a single 'best' clustering from the data. However, there might be multiple reasonable but distinctive clusterings existing in the data. In this thesis, we studied the problem of discovering multiple clusterings. Cluster validity refers to the procedure of evaluating the quality of clusterings. It is important and essential for clustering applications. External cluster validity indices (CVIs) are used to evaluate the quality of a candidate partition by comparing its similarity with a ground truth. Many challenges still exist in this topic; in this thesis, we focused on two of them. First, we studied the problem of soft clustering evaluation using external CVIs. Second, we studied the bias problem of external CVIs. This thesis was organized around these three problems.

In Chapter 2, we reviewed background knowledge about clustering techniques and cluster validity indices. In particular, we introduced traditional clustering methods and multiple clustering analysis methods. In addition, we discussed external and internal cluster validity indices.

In Chapter 3, we studied the multiple clusterings discovery problem. Meta-clustering is an important tool for discovering multiple views from data by analyzing a large set of


raw base clusterings. However, there may exist poor quality and similar base clusterings which will affect the generation of high quality and diverse views. To tackle this issue, we introduced a (base) clustering selection method for filtering out the poor quality and redundant clusterings from a set of raw base clusterings. In addition, we proposed a scheme to rank multiple clustering views. We believe the proposed framework, rFILTA, is a simple and useful tool in the area of multiple clustering exploration and analysis.

In Chapter 4, we reviewed soft clustering evaluation using external CVIs. We generalized eight popular information-theoretic (IT) based crisp external CVIs to evaluate soft clusterings. We demonstrated that the soft generalizations of the eight IT CVIs are quite capable of identifying the "correct" number of clusters or classes from candidate partitions generated by the FCM and EM algorithms. The results of this study provided a reasonably strong empirical argument for the effectiveness of the generalized IT CVIs for both fuzzy and probabilistic cluster validity. In particular, NMIsM is superior to the other seven generalized IT CVIs for both FCM and EM partitions on datasets with overlapping and/or variously sized clusters. Finally, Theorem 4.1 provided a theoretical reason to expect better performance of NMIsM over the other three variants of NMI, i.e., NMIsj, NMIss and NMIsr, in certain situations. In [Horta and Campello, 2015], three information-theoretic based measures, named 03MI, 03VI and 05MI and initially discussed for hard/crisp clusterings in the 2000s, were evaluated for soft clusterings; for evaluating soft clusterings, that work used a similar idea to the one we proposed in Chapter 4, which was based on our independently published work [Lei et al., 2014].

In Chapter 5, we studied and analyzed the bias problem of external CVIs. We examined several types of bias that may affect external cluster validity indices that are used to evaluate the quality of candidate partitions by comparison with a ground truth partition. They are: i) one of two types of NC bias (NCinc, NCdec), which arises when the mathematical model of the external CVI tends to be monotonic in the number of clusters in candidate partitions; ii) GT bias, which arises when the ground truth partition alters the NC bias status of an external CVI; iii) GT1 bias, which arises when the number of clusters in the ground truth partition alters the NC bias status of an external CVI; iv) GT2 bias, which arises when the distribution of the ground truth partition alters the NC bias status of an external CVI.


We then formulated an explanation for both types of GT bias of the Rand index based on the Havrda-Charvat quadratic entropy. Our theory explains how RI's NC bias behavior is influenced by the distribution of the ground truth partition and also by the number of clusters in the ground truth. Our major results were outlined in Theorem 5.2, which provides a computable test that predicts the NC bias behavior of the Rand index, and hence of all external CVIs related to it. We believe this to be the first systematic study of the effects of ground truth on the NC bias behavior of external cluster validity indices. We have termed this effect GT bias.

6.2 future directions

Our study has addressed several important problems under the topics of multiple clustering analysis and external cluster validity indices. However, there are still many interesting and challenging problems to be solved.

(a) In the rFILTA framework proposed in Chapter 3, the generated base clusterings are required to contain the same number of clusters to facilitate the grouping step. However, this condition can sometimes be too strict and unrealistic. In addition, the current grouping method can be expensive. Thus, it will be an interesting direction to develop more effective and efficient methods for grouping the base clusterings.

Each step involved in the rFILTA framework is important and relevant to the effectiveness of the whole framework. Thus, it will also be interesting to investigate different generation methods, e.g., to capture more interesting clustering views in the data efficiently, as well as different ranking schemes and different ensemble methods, to improve the effectiveness of the whole rFILTA framework.

(b) In Chapter 4, we generalized a popular class of crisp external validation measures, i.e., IT based measures, to evaluate soft clusterings. Set matching based measures are another popular class of crisp external validation measures (Chapter 2). It will be an interesting direction to generalize the crisp set matching


based measures to the soft case and compare them with the existing soft external CVIs.

In [Horta and Campello, 2015], the authors provided a conceptual and experimental overview of about 30 soft external CVIs. In Chapter 4, the generalized IT based measures were only compared against a limited subset of existing soft CVIs, and with respect to success rate only. It will be interesting to compare the measures proposed in Chapter 4 with a comprehensive list of existing soft external CVIs (see [Horta and Campello, 2015]), and to take into account more properties of the measures, to better evaluate their effectiveness and understand their behaviors.

In addition, inspired by [Horta and Campello, 2015], we note that the generalized IT based measures discussed in Chapter 4 do not satisfy the maximum property; that is, these measures may not attain their maximum of 1 when two equivalent partitions are compared. How to correct the generalized IT based measures to satisfy the maximum property will be another interesting direction to work on.

(c) We studied the bias problem, namely NC bias and GT bias, of pair-counting based measures in Chapter 5. The findings from this work alert us to the importance of recognizing and understanding the performance of CVIs in different settings. The correct performance of the CVIs is crucial for their further applications, for example, evaluating the goodness of clusterings, choosing appropriate clustering algorithms, and comparing clusterings while generating multiple clusterings.

In the future, it will be interesting to study this phenomenon in the more general setting afforded by non pair-counting based external CVIs, for example, crisp information-theoretic based CVIs. In addition, the soft clustering validation measures could also have a bias problem. It will be interesting to investigate the bias problem of the generalized information-theoretic based CVIs proposed in Chapter 4.

In addition, it is also meaningful to explore different settings which may cause and influence the bias behavior of cluster validity indices, for example, different generation strategies for candidate partitions. We have provided an example of


the ARI at the end of Chapter 5 to show that the ARI can also show GT bias with certain generation settings for candidate partitions. It will be interesting to study why and when this happens, and how to correct this type of bias.

(d) Network motifs [Milo et al., 2002] are defined as recurrent and statistically significant patterns of interconnections in a network. They can provide deep insight into the functional abilities of the network. Depending on the class of network, motifs may have different structures. Based on the concept of motifs, we might design new comparison measures that compare the similarity between two clusterings by considering their motifs [Benson et al., 2016] instead of pairs of objects. In addition, a motif based comparison measure could be used in multiple clustering discovery to ensure the dissimilarity of the discovered clusterings.

(e) The data may contain multiple meaningful and distinctive clustering views simultaneously. How to evaluate the generated multiple clustering solutions is a challenging task. Most of the existing work evaluates the generated multiple clustering views by considering the quality of each clustering view and the dissimilarities among the views. However, how to evaluate the quality of a clustering view is itself challenging. In Chapter 3, we measured the quality of the generated clustering views by comparing their similarities with the known ground truth. However, only a limited number of ground truths are known in reality. Due to the subjectiveness of data interpretation, it is unclear how many possible clustering views exist in the data and what they are. Thus, generated clustering views which are not similar to the known ground truth might also be meaningful. It will be interesting to investigate how to evaluate multiple clustering views effectively in the future.

6.3 final words

Cluster analysis is an important tool in exploratory data analysis. It is often used to explore the natural patterns and meaningful insights in the data. It


also faces challenges stimulated by huge and complex data. Traditional clustering techniques, which have focused on discovering one single best clustering, cannot always satisfy users' requirements. For the same dataset, users may seek multiple meaningful, unknown insights and explanations of the data, so that they can explore new markets or design new effective strategies to assist in their work. Motivated by these new challenges, multiple clustering analysis has appeared and attracted considerable attention in recent years. It is an emerging field that aims to discover multiple meaningful and distinctive patterns from the data. It attempts to provide users with more possible ways to interpret the data and more insights to understand the data. In addition, it attempts to extract as much, and as diverse, information from the data as possible. We believe work on the discovery of multiple clusterings is important and useful for advancing the development of data analysis. Our proposed rFILTA framework is a simple and easy-to-use tool which aims to help users explore the data and gain more insights from it.

Due to the unsupervised nature of cluster analysis, cluster validity is especially important for the successful application of clustering. For example, for high dimensional data, it is difficult to directly visualize the data to judge clustering quality. In addition, many clustering solutions can be generated from the same dataset by applying different clustering techniques. It is difficult and inefficient to check each solution manually, especially for large and complex data. Quantitative evaluation of the quality of clusterings provides users with a concrete idea about the goodness of clusterings, which can help them make more informed decisions. Thus, the study of cluster validity is also very important for data analysis.

Cluster validity indices (CVIs) are an important tool for assessing clustering results. One important class of CVIs is external measures, which make use of external knowledge to evaluate the quality of clusterings. External CVIs are also called comparison measures, as they are often used to compare the similarity or dissimilarity between two clusterings. They evaluate the quality of clusterings by comparing them against the ground truth, which is usually defined by a domain expert. It is believed that the more similar a clustering is to the ground truth


(gold standard), the better its quality. Thus, we can evaluate the quality of clusterings, or choose appropriate clustering algorithms for a dataset, by judging their generated clusterings with external CVIs.

Different types of clusterings with different characteristics bring new challenges to the cluster validity task. In this thesis, we tackled the challenge of validating soft clusterings using external CVIs. In addition, we studied another crucial problem in cluster validity, that is, the bias problem of cluster validity indices. Biases in the CVIs can cause wrong conclusions to be drawn about the correct number of clusters, how close two clusterings are, and other applications of cluster validation. Hence, it is important to understand and identify such biases in CVIs. In this thesis, we focused on pair-counting based CVIs and their biases.

This thesis has studied several specific problems and tackled several specific challenges in the broad area of cluster analysis. We hope our work can provide starting points for researchers interested in this area and enable them to tackle its growing challenges.


APPENDIX


A RFILTA EXPERIMENTAL RESULTS

a.1 isolet dataset

The isolet dataset from the UCI machine learning repository [Bache and Lichman, 2013] contains 7797 records with 617 features, which come from 150 subjects speaking the name of each letter of the alphabet twice. There are two clustering views (speaker and letter) in this dataset. In our experiment, we randomly selected 10 persons along with 10 letters, resulting in a 200-record dataset. We generated 700 base clusterings that contain the speaker and letter views, and selected 100 base clusterings using rFILTA (β = 0.6).

The results are shown in Figure A.1. Compared with the iVAT diagram of the unfiltered base clusterings in Figure A.1a, the iVAT diagram of the filtered base clusterings in Figure A.1b contains clearer blocks. This may be because the clustering views are more easily identified after filtering out the irrelevant base clusterings. The MBM scores of these two sets of clustering views are shown in Figure A.1c. In summary:

(a) We discovered 92 clustering views from the unfiltered base clusterings and 28 clustering views from the filtered set of base clusterings with L = 100, β = 0.6.

(b) The best MBM scores of the filtered clustering views are a little higher than those of the unfiltered clustering views. This is because the quality of the clustering views generated from the filtered base clusterings is better than that of the views generated from the unfiltered base clusterings.


(a) iVAT diagram of the 700 unfiltered base clusterings. 92 blocks (meta-clusters) are discovered.

(b) iVAT diagram of the 100 filtered base clusterings with β = 0.6. 28 blocks are discovered.

(c) The MBM scores for the clustering views from the unfiltered and filtered sets of base clusterings.

Figure A.1: Results on the isolet dataset.

a.2 webkb dataset

The WebKB dataset 1 contains webpages collected mainly from four universities: Cornell, Texas, Washington and Wisconsin. We selected all documents from those four

1 www.cs.cmu.edu/~webkb


(a) iVAT diagram of the 700 raw base clusterings. 112 blocks (meta-clusters) are discovered.

(b) iVAT diagram of the 100 filtered base clusterings with β = 0.8. 72 blocks (meta-clusters) are discovered.

(c) The MBM scores for the clustering views from the unfiltered and filtered base clusterings.

Figure A.2: Results on the webkb dataset.

universities that fall under one of four page types, namely course, faculty, project and student. We preprocessed the data by removing common words and rare words (those appearing less than twice), and by stemming. Finally, we chose the 350 words with the highest variance and used TF-IDF weighting to construct the feature vectors. The final data matrix contains 1041 documents and 350 words. This dataset can be clustered either by the four universities or by the four page types.
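The TF-IDF weighting mentioned above can be sketched as follows, using the common formulation idf(w) = log(N / df(w)) with raw term counts as tf; the exact variant used for the thesis experiments may differ:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build TF-IDF vectors for tokenized documents (lists of words).

    tf: raw term count within the document;
    idf: log(N / document frequency of the term)."""
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[w] * idf[w] for w in vocab])
    return rows
```

Note that a word appearing in every document gets idf = 0, so ubiquitous terms contribute nothing, which is why common-word removal and the high-variance word selection above matter for clustering.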


We generated 700 base clusterings and selected 100 base clusterings with β = 0.8. The results are shown in Figure A.2. Comparing the iVAT diagrams from the unfiltered and filtered base clusterings, the iVAT diagram of the filtered base clusterings in Figure A.2b reveals clearer blocks, while the iVAT diagram from the unfiltered base clusterings in Figure A.2a is fuzzy and not clearly separated. The MBM scores of these two sets of clustering views are shown in Figure A.2c.

(a) We generated 112 clustering views from the unfiltered base clusterings and 22 clustering views from the filtered set of base clusterings with L = 100, β = 0.8.

(b) The clustering views generated from the filtered base clusterings match the ground truth views better than the ones generated from the unfiltered base clusterings.

a.3 object dataset

The Amsterdam Library of Object Images (ALOI) consists of 110250 images of 1000 common objects. For each object, a number of photos are taken from different angles and under various lighting conditions. We chose 9 objects with different colors and shapes, for a total of 108 images (Figure A.3). We processed them in the same way as the card dataset and extracted 15 features for each image. The two ground truth clustering views are shown in Figure A.4. In this set of experiments, we generated 700 base clusterings with 3 clusters. We would

like to demonstrate the performance of the filtering and ranking functions in our rFILTA framework. The experimental results on the unfiltered set of base clusterings are shown in Figure A.5. As we can see from the iVAT diagram of the 700 base clusterings in Figure A.5a, there is one big block and two small blocks along the diagonal. We finally generate 3 clustering views, shown in Figure A.5b. The first row is the color view, containing three clusters: red, green and yellow. We do not find the shape view in the unfiltered base clusterings. Next, we show the results on the filtered set of base clusterings with L = 100, β = 0.95 in Figure A.6. The iVAT diagram contains multiple


Figure A.3: Example images of the nine selected objects.

Figure A.4: Two ground truth clustering views on the object dataset. The first row is the color view and the second row is the shape view.

clear blocks. The top 4 clustering views are shown in Figure A.6b. The first row is the color view and the fourth row is the shape view. Comparing the two sets of results from the unfiltered base clusterings and the filtered

base clusterings, we have some observations. Fewer clustering views are generated from the unfiltered base clusterings than from the filtered base clusterings. This may be because many of the generated base clusterings connect different clustering views in the clustering space, so that they appear there as one big meta-cluster. After filtering, we clean out these connecting base clusterings and the different clustering views are separated clearly. Thus, the iVAT diagram of the


(a) The iVAT diagram of the unfiltered base clusterings on the object dataset. 3 blocks (meta-clusters) are discovered.

(b) The top 3 clustering views obtained from the unfiltered base clusterings on the object dataset. The first row is the color view. (Per-view matches to the ground truth: Color 0.79, Shape 0.11; Color 0.03, Shape 0.02; Color 0.08, Shape 0.08.)

Figure A.5: The results on the unfiltered base clusterings on the object dataset.

unfiltered base clusterings only contains one big dark block and two small blocks, while the iVAT diagram of the filtered base clusterings contains multiple clear blocks. From the unfiltered base clusterings, the shape view is not discovered, because its meta-cluster is concealed in the big block. After filtering, the shape view is discovered and the quality of the color view is increased. The MBM scores for the clustering views generated from the unfiltered and filtered base

clusterings are shown in Figure A.7. In summary:

(a) We found 3 clustering views from the unfiltered set of base clusterings and 9 clustering views from the filtered base clusterings with L = 100, β = 0.95.

(b) The MBM scores for the 3 clustering views generated from the unfiltered base clusterings are invariant, as only the color view is recovered and its quality does not improve with K.

(c) The top 4 clustering views returned from the filtered set of base clusterings recover and match well the two ground truth views, with MBM(C4) = 0.9.


(a) The iVAT diagram of the filtered base clusterings on the object dataset. 9 blocks (meta-clusters) are discovered.

(b) The top 4 views obtained from the filtered base clusterings on the object dataset. The first row is the color view and the fourth row is the shape view. (Per-view matches to the ground truth: Color 1, Shape 0; Color 0.39, Shape 0.59; Color 0.3, Shape 0.51; Color 0.11, Shape 0.79.)

Figure A.6: The results for the filtered base clusterings on the object dataset with β = 0.95.


Figure A.7: MBM scores for clustering views generated from the unfiltered and filtered base clusterings on the object dataset.


REFERENCES

C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. CRC Press, 2013. (Cited on pages 5 and 15.)

C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data, pages 77–128. Springer, 2012. (Cited on page 2.)

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications, volume 27. ACM, 1998. (Cited on page 34.)

A. N. Albatineh. Means and variances for a family of similarity indices used in cluster analysis. Journal of Statistical Planning and Inference, 140(10):2828–2838, 2010. (Cited on page 128.)

A. N. Albatineh, M. Niewiadomska-Bugaj, and D. Mihalko. On similarity indices and correction for chance agreement. Journal of Classification, 23(2):301–313, 2006. (Cited on page 134.)

D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller. Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Transactions on Fuzzy Systems, 18(5):906–918, 2010. (Cited on pages 7, 95, 96, 98, 99, 105, and 134.)

P. Arabie and S. A. Boorman. Multidimensional scaling of measures of distance between partitions. Journal of Mathematical Psychology, 10(2):148–203, 1973. (Cited on page 129.)

I. Assent, R. Krieger, E. Müller, and T. Seidl. DUSC: Dimensionality unbiased subspace clustering. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 409–414. IEEE, 2007. (Cited on page 34.)


J. Azimi and X. Fern. Adaptive cluster ensemble selection. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI 2009), volume 9, pages 992–997, 2009. (Cited on pages 33 and 52.)

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml. (Cited on pages 77, 104, and 167.)

E. Bae and J. Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In Proceedings of the Sixth International Conference on Data Mining (ICDM'06), pages 53–62. IEEE, 2006. (Cited on pages 22, 25, 29, 31, and 35.)

E. Bae, J. Bailey, and G. Dong. A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings. Data Mining and Knowledge Discovery, 21(3):427–471, 2010. (Cited on pages 22, 26, and 31.)

J. Bailey. Alternative clustering analysis: A review. In C. Aggarwal and C. Reddy, editors, Data Clustering: Algorithms and Applications. CRC Press, 2013. (Cited on pages 2 and 4.)

S. Basu, I. Davidson, and K. Wagstaff. Constrained clustering: Advances in algorithms, theory, and applications. CRC Press, 2008. (Cited on page 34.)

G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385–2404, 2000. (Cited on pages 1 and 27.)

F. Baulieu. A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6(1):233–246, 1989. (Cited on page 135.)

A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016. (Cited on page 161.)

J. C. Bezdek. Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, 1981. (Cited on pages 16, 20, 21, and 100.)


J. C. Bezdek, J. Keller, R. Krisnapuram, and N. Pal. Fuzzy models and algorithms for pattern recognition and image processing, volume 4. Springer Science & Business Media, 2006. (Cited on page 1.)

C. Boulis and M. Ostendorf. Combining multiple clustering systems. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 63–74. Springer, 2004. (Cited on page 33.)

R. K. Brouwer. Extending the Rand, adjusted Rand and Jaccard indices to fuzzy partitions. Journal of Intelligent Information Systems, 32(3):213–235, 2009. (Cited on pages 96, 98, and 104.)

R. J. Campello. A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment. Pattern Recognition Letters, 28(7):833–841, 2007. (Cited on pages 8, 96, and 98.)

R. Caruana, M. Elhaway, N. Nguyen, and C. Smith. Meta clustering. In Proceedings of the Sixth International Conference on Data Mining (ICDM 2006), pages 107–118. IEEE, 2006. (Cited on pages 6, 19, 23, 47, 52, 59, and 66.)

K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 129–136. ACM, 2009. (Cited on page 33.)

C. H. Cheng, A. W. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 84–93. ACM, 1999. (Cited on page 34.)

T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012. (Cited on pages 38 and 119.)

Y. Cui, X. Z. Fern, and J. G. Dy. Non-redundant multi-view clustering via orthogonalization. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 133–142. IEEE, 2007. (Cited on pages 26, 29, 31, and 77.)


Y. Cui, X. Z. Fern, and J. G. Dy. Learning multiple nonredundant clusterings. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(3):15, 2010. (Cited on pages 22, 26, and 31.)

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893. IEEE, 2005. (Cited on page 83.)

X. H. Dang and J. Bailey. Generation of alternative clusterings using the CAMI approach. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM 2010), pages 118–129, 2010a. (Cited on pages 5, 23, 25, and 31.)

X. H. Dang and J. Bailey. A hierarchical information theoretic technique for the discovery of non linear alternative clusterings. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 573–582. ACM, 2010b. (Cited on pages 22, 28, 31, 44, and 55.)

X. H. Dang and J. Bailey. Generating multiple alternative clusterings via globally optimal subspaces. Data Mining and Knowledge Discovery, 28(3):569–592, 2014. (Cited on pages 23, 27, 29, and 31.)

X. H. Dang and J. Bailey. A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7–30, 2015. (Cited on pages 23, 25, 29, and 31.)

S. Dasgupta and V. Ng. Mining clustering dimensions. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 263–270, 2010. (Cited on pages 24 and 31.)

I. Davidson and Z. Qi. Finding alternative clusterings using constraints. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM 2008), pages 773–778. IEEE, 2008. (Cited on pages 22, 26, 29, 31, and 35.)

I. Davidson and S. Ravi. The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Mining and Knowledge Discovery, 14(1):25–61, 2007. (Cited on page 34.)


V. R. de Sa. Spectral clustering with two views. In ICML Workshop on Learning with Multiple Views, 2005. (Cited on page 33.)

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977. (Cited on pages 2 and 3.)

L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945. (Cited on page 135.)

M. Erisoglu, N. Calis, and S. Sakallioglu. A new algorithm for initial cluster centers in k-means algorithm. Pattern Recognition Letters, 32(14):1701–1705, 2011. (Cited on pages 124 and 155.)

M. Ester, H. P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, volume 96, pages 226–231, 1996. (Cited on pages 3 and 19.)

V. Estivill-Castro. Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter, 4(1):65–75, 2002. (Cited on pages 1 and 2.)

E. W. Fager and J. A. McGowan. Zooplankton species groups in the North Pacific: co-occurrences of species can be used to derive groups whose members react similarly to water-mass types. Science, 140(3566):453–460, 1963. (Cited on page 135.)

L. Faivishevsky and J. Goldberger. Nonparametric information theoretic clustering algorithm. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 351–358, 2010. (Cited on pages 29, 56, and 66.)

X. Z. Fern and W. Lin. Cluster ensemble selection. Statistical Analysis and Data Mining, 1(3):128–141, 2008. (Cited on pages 33 and 52.)

E. B. Fowlkes and C. L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983. (Cited on pages 128, 129, and 134.)


A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):4, 2007. (Cited on page 32.)

D. Gondek and T. Hofmann. Conditional information bottleneck clustering. In Proceedings of the 3rd IEEE International Conference on Data Mining, Workshop on Clustering Large Data Sets, pages 36–42. Citeseer, 2003. (Cited on pages 28 and 31.)

D. Gondek and T. Hofmann. Non-redundant data clustering. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), pages 75–82. IEEE, 2004. (Cited on pages 22, 28, and 31.)

D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 70–77. ACM, 2005. (Cited on pages 22, 28, and 31.)

T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. (Cited on page 23.)

L. A. Goodman and W. H. Kruskal. Measures of association for cross classifications. Journal of the American Statistical Association, 49(268):732–764, 1954. (Cited on page 136.)

J. C. Gower and P. Legendre. Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, 1986. (Cited on page 136.)

S. Günnemann, E. Müller, I. Färber, and T. Seidl. Detection of orthogonal concepts in subspaces of high dimensional data. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1317–1326. ACM, 2009. (Cited on page 34.)

S. Günnemann, I. Färber, E. Müller, and T. Seidl. ASCLU: Alternative subspace clustering. In MultiClust Workshop, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Citeseer, 2010. (Cited on page 34.)


S. Günnemann, I. Färber, and T. Seidl. Multi-view clustering using mixture models in subspace projections. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 132–140. ACM, 2012. (Cited on page 34.)

S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova. Moderate diversity for better cluster ensembles. Information Fusion, 7(3):264–275, 2006. (Cited on pages 33 and 52.)

T. C. Havens and J. C. Bezdek. An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Transactions on Knowledge and Data Engineering, 24(5):813–822, 2012. (Cited on page 59.)

T. C. Havens, J. C. Bezdek, J. M. Keller, and M. Popescu. Clustering in ordered dissimilarity data. International Journal of Intelligent Systems, 24(5):504–528, 2009. (Cited on pages 59, 60, and 64.)

T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami. Fuzzy c-means algorithms for very large data. IEEE Transactions on Fuzzy Systems, 20(6):1130–1146, 2012. (Cited on page 104.)

J. Havrda and F. Charvát. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika, 3(1):30–35, 1967. (Cited on page 141.)

A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, volume 98, pages 58–65, 1998. (Cited on page 2.)

D. Horta and R. J. Campello. Comparing hard and overlapping clusterings. Journal of Machine Learning Research, 16:2949–2997, 2015. (Cited on pages 158 and 160.)

M. S. Hossain, N. Ramakrishnan, I. Davidson, and L. T. Watson. How to "alternatize" a clustering algorithm. Data Mining and Knowledge Discovery, 27(2):193–224, 2013. (Cited on page 29.)


M. Hua and J. Pei. Clustering in applications with multiple data sources: a mutual subspace clustering approach. Neurocomputing, 92:133–144, 2012. (Cited on page 33.)

L. Hubert. Nominal scale response agreement as a generalized correlation. British Journal of Mathematical and Statistical Psychology, 30(1):98–103, 1977. (Cited on page 134.)

L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985. (Cited on pages 38, 105, and 134.)

E. Hüllermeier and M. Rifqi. A fuzzy variant of the Rand index for comparing clustering structures. In IFSA/EUSFLAT Conf., pages 1294–1298, 2009. (Cited on pages 96 and 98.)

E. Hüllermeier, M. Rifqi, S. Henzgen, and R. Senge. Comparing fuzzy partitions: A generalization of the Rand index and related measures. IEEE Transactions on Fuzzy Systems, 20(3):546–556, 2012. (Cited on pages 96, 98, and 104.)

P. Jaccard. Nouvelles recherches sur la distribution florale. 1908. (Cited on pages 38 and 134.)

A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010. (Cited on pages 2, 3, and 18.)

A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988. (Cited on pages 1, 2, 5, 7, 18, 35, 52, 58, 95, and 135.)

P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Statistical Analysis and Data Mining: The ASA Data Science Journal, 1(3):195–210, 2008. (Cited on pages 24, 29, and 31.)

M. Jakobsson and N. A. Rosenberg. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 23(14):1801–1806, 2007. (Cited on pages 1 and 2.)


E. Januzaj, H.-P. Kriegel, and M. Pfeifle. Scalable density-based distributed clustering. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 231–244. Springer, 2004. (Cited on page 33.)

P. A. Jaskowiak, D. Moulavi, A. C. Furtado, R. J. Campello, A. Zimek, and J. Sander. On strategies for building effective ensembles of relative clustering validity criteria. Knowledge and Information Systems, 47(2):329–354, 2016. (Cited on page 62.)

D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386, 2004. (Cited on page 135.)

R. A. Johnson, K. D. Wright, H. Poppleton, K. M. Mohankumar, D. Finkelstein, S. B. Pounds, V. Rand, S. E. Leary, E. White, C. Eden, et al. Cross-species genomics matches driver mutations and cell compartments to model ependymoma. Nature, 466(7306):632–636, 2010. (Cited on pages 124 and 155.)

K. Kailing, H. P. Kriegel, A. Pryakhin, and M. Schubert. Clustering multi-represented objects with noise. In Advances in Knowledge Discovery and Data Mining, pages 394–403. Springer, 2004. (Cited on page 33.)

G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998. (Cited on page 32.)

G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(1):69–79, 1999. (Cited on page 32.)

K. N. Kontonasios and T. De Bie. Subjectively interesting alternative clusterings. Machine Learning, 98(1-2):31–56, 2015. (Cited on page 23.)

H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1, 2009. (Cited on page 34.)


R. Krishnapuram and J. M. Keller. A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2):98–110, 1993. (Cited on page 16.)

S. Kulczyński. Die Pflanzenassoziationen der Pieninen. Imprimerie de l'Université, 1928. (Cited on page 135.)

Y. Lei, J. C. Bezdek, J. Chan, N. Xuan Vinh, S. Romano, and J. Bailey. Generalized information theoretic cluster validity indices for soft clusterings. In Computational Intelligence and Data Mining (CIDM), 2014 IEEE Symposium on, pages 24–31. IEEE, 2014. (Cited on page 158.)

H. Liu, T. Liu, J. Wu, D. Tao, and Y. Fu. Spectral ensemble clustering. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 715–724. ACM, 2015. (Cited on page 32.)

Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures. In 2010 IEEE International Conference on Data Mining, pages 911–916. IEEE, 2010. (Cited on page 43.)

B. Long, S. Y. Philip, and Z. M. Zhang. A general model for multiple view unsupervised learning. In SDM, pages 822–833, 2008. (Cited on page 33.)

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval, volume 1. Cambridge University Press, 2008. (Cited on page 67.)

B. H. McConnaughey and L. P. Laut. The determination and analysis of plankton communities. Lembaga Penelitian Laut, 1964. (Cited on page 135.)

M. Meilă. Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007. (Cited on pages 38, 40, 96, and 142.)

G. W. Milligan and M. C. Cooper. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21(4):441–458, 1986. (Cited on pages 3 and 129.)


R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002. (Cited on page 161.)

B. Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996. (Cited on page 134.)

B. Mirkin and L. Chernyi. Measurement of the distance between distinct partitions of a finite set of objects. Autom Tel, 5:120–127, 1970. (Cited on page 38.)

G. Moise and J. Sander. Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 533–541. ACM, 2008. (Cited on page 34.)

L. C. Morey and A. Agresti. The measurement of classification agreement: An adjustment to the Rand statistic for chance agreement. Educational and Psychological Measurement, 44(1):33–37, 1984. (Cited on page 129.)

E. Müller, I. Assent, S. Günnemann, R. Krieger, and T. Seidl. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In Proceedings of the Ninth IEEE International Conference on Data Mining (ICDM 2009), pages 377–386. IEEE, 2009a. (Cited on page 34.)

E. Müller, S. Günnemann, I. Assent, and T. Seidl. Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment, 2(1):1270–1281, 2009b. (Cited on page 34.)

H. Nagesh, S. Goil, and A. Choudhary. Adaptive grids for clustering massive data sets. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM 2001), volume 477, 2001. (Cited on page 34.)

M. C. Naldi, A. Carvalho, and R. J. Campello. Cluster ensemble selection based on relative validity indexes. Data Mining and Knowledge Discovery, 27(2):259–289, 2013. (Cited on page 33.)


N. Nguyen and R. Caruana. Consensus clusterings. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 607–612. IEEE, 2007. (Cited on page 32.)

M. E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447–1454, 2006. (Cited on page 87.)

D. Niu, J. G. Dy, and M. I. Jordan. Multiple non-redundant spectral clustering views. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 831–838, 2010. (Cited on pages 22, 27, and 31.)

D. Niu, J. G. Dy, and Z. Ghahramani. A nonparametric Bayesian model for multiple clustering with overlapping feature views. In International Conference on Artificial Intelligence and Statistics, pages 814–822, 2012. (Cited on page 34.)

D. Niu, J. G. Dy, and M. I. Jordan. Iterative discovery of multiple alternative clustering views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1340–1353, 2014. (Cited on pages 27 and 29.)

A. Ochiai. Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bull. Jpn. Soc. Sci. Fish, 22(9):526–530, 1957. (Cited on page 136.)

L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6(1):90–105, 2004. (Cited on pages 2 and 34.)

C. S. Peirce. The numerical measure of the success of predictions. Science, pages 453–454, 1884. (Cited on page 135.)

H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005. (Cited on page 57.)

J. M. Phillips, P. Raman, and S. Venkatasubramanian. Generating a diverse set of high-quality clusterings. arXiv, 1108.0017, 2011. (Cited on pages 23 and 52.)


V. Pihur, S. Datta, and S. Datta. Weighted rank aggregation of cluster validationmeasures: a monte carlo cross-entropy approach. Bioinformatics, 23(13):1607–1615,2007. (Cited on page 62.)

G. Punj and D. W. Stewart. Cluster analysis in marketing research: Review and sug-gestions for application. Journal of Marketing Research, pages 134–148, 1983. (Citedon page 1.)

Z. Qi and I. Davidson. A principled and flexible framework for finding alternative cluster-ings. In Proceedings of the 15th ACM SIGKDD International Conference on Knowl-edge Discovery and Data Mining, pages 717–726. ACM, 2009. (Cited on pages 22, 25,31, and 35.)

W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of theAmerican Statistical Association, 66(336):846–850, 1971. (Cited on pages 36, 37, 98,and 134.)

D. J. Rogers and T. T. Tanimoto. A computer program for classifying plants. Science,132(3434):1115–1118, 1960. (Cited on page 136.)

S. Romano, J. Bailey, V. Nguyen, and K. Verspoor. Standardized mutual informationfor clustering comparisons: one step further in adjustment for chance. In Proceedingsof the 31st International Conference on Machine Learning (ICML 2014), pages 1143–1151, 2014. (Cited on pages 40 and 129.)

P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation ofcluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.(Cited on page 44.)

P. F. Russell, T. R. Rao, et al. On habitat and association of species of anopheline larvae in south-eastern Madras. Journal of the Malaria Institute of India, 3(1):153–178, 1940. (Cited on page 135.)

S. Ryali, T. Chen, A. Padmanabhan, W. Cai, and V. Menon. Development and validation of consensus clustering-based framework for brain segmentation using resting fMRI. Journal of Neuroscience Methods, 240:128–140, 2015. (Cited on pages 124 and 155.)

W. Sheng, S. Swift, L. Zhang, and X. Liu. A weighted sum validity function for clustering with a hybrid niching genetic algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(6):1156–1167, 2005. (Cited on page 62.)

A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, and T. Herawan. Big data clustering: a review. In International Conference on Computational Science and Its Applications, pages 707–720. Springer, 2014. (Cited on page 2.)

K. Sim, V. Gopalkrishnan, A. Zimek, and G. Cong. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery, 26(2):332–397, 2013. (Cited on page 34.)

D. Simovici. On generalized entropy and entropic metrics. Journal of Multiple-Valued Logic and Soft Computing, 13(4/6):295, 2007. (Cited on pages 142 and 143.)

N. Slonim, N. Friedman, and N. Tishby. Multivariate information bottleneck. Neural Computation, 18(8):1739–1789, 2006. (Cited on page 29.)

P. H. Sneath, R. R. Sokal, et al. Numerical taxonomy. The principles and practice of numerical classification. 1973. (Cited on page 135.)

R. R. Sokal and P. H. A. Sneath. Principles of numerical taxonomy. A Series of books in biology. San Francisco: W. H. Freeman, 1963. (Cited on pages 135 and 136.)

A. Strehl and J. Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617, 2003. (Cited on pages 5, 32, 38, 44, 56, and 96.)

D. M. Titterington, A. F. Smith, U. E. Makov, et al. Statistical analysis of finite mixture distributions, volume 7. Wiley, New York, 1985. (Cited on page 16.)

A. Topchy, A. K. Jain, and W. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1866–1881, 2005. (Cited on pages 32 and 33.)


S. van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Report-Information Systems, (12):1–36, 2000. (Cited on page 41.)

L. Vendramin, R. J. Campello, and E. R. Hruschka. Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4):209–235, 2010. (Cited on page 128.)

N. X. Vinh and J. Epps. minCEntropy: A novel information theoretic approach for the generation of alternative clusterings. In Proceedings of the 10th International Conference on Data Mining (ICDM 2010), pages 521–530. IEEE, 2010. (Cited on pages 22, 29, 31, and 66.)

N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010. (Cited on pages 37, 39, 40, 55, 56, 96, 108, 124, and 129.)

D. L. Wallace. Comment. Journal of the American Statistical Association, 78(383):569–576, 1983. (Cited on page 134.)

C. D. Wang, J. H. Lai, D. Huang, and W. S. Zheng. SVStream: a support vector-based algorithm for clustering data streams. IEEE Transactions on Knowledge and Data Engineering, 25(6):1410–1424, 2013. (Cited on pages 124 and 155.)

L. Wang, U. T. Nguyen, J. C. Bezdek, C. A. Leckie, and K. Ramamohanarao. iVAT and aVAT: enhanced visual analysis for cluster tendency assessment. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 16–27. 2010. (Cited on pages 58 and 59.)

B. Wiswedel, F. Höppner, and M. R. Berthold. Learning in parallel universes. Data Mining and Knowledge Discovery, 21(1):130–152, 2010. (Cited on page 33.)

J. Wu, H. Xiong, and J. Chen. Adapting the right measures for k-means clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 877–886. ACM, 2009. (Cited on pages 5, 18, and 128.)


J. Wu, H. Yuan, H. Xiong, and G. Chen. Validation of overlapping clustering: A random clustering perspective. Information Sciences, 180(22):4353–4369, 2010. (Cited on pages 41 and 128.)

W. Wu, H. Xiong, and S. Shekhar. Clustering and information retrieval, volume 11. Springer Science & Business Media, 2013. (Cited on page 1.)

K. S. Xu, M. Kliger, and A. O. Hero III. Adaptive evolutionary clustering. Data Mining and Knowledge Discovery, 28(2):304–336, 2014. (Cited on pages 124 and 155.)

W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003. (Cited on page 2.)

Y. Ye, R. Liu, and Z. Lou. Incorporating side information into multivariate information bottleneck for generating alternative clusterings. Pattern Recognition Letters, 51:70–78, 2015. (Cited on pages 29 and 31.)

G. Yule. On the association of attributes in statistics. Philosophical Transactions of the Royal Society of London, 194:257–319, 1900. (Cited on page 136.)

J. Zakaria, A. Mueen, and E. Keogh. Clustering time series using unsupervised-shapelets. In Proceedings of the 12th International Conference on Data Mining (ICDM 2012), pages 785–794, 2012. (Cited on pages 124 and 155.)

Y. Zhang and T. Li. Extending consensus clustering to explore multiple clustering views. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM 2011), pages 920–931, 2011. (Cited on pages 6, 23, 52, and 59.)

D. Zhou and C. J. Burges. Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 1159–1166. ACM, 2007. (Cited on page 33.)


Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Lei, Yang
Title: Cluster validation and discovery of multiple clusterings
Date: 2016
Persistent Link: http://hdl.handle.net/11343/121995
File Description: Thesis - Cluster Validation and Discovery of Multiple Clusterings

Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.