
Page 1: CLUSTERING MIXED-TYPE DATA

CLUSTERING MIXED-TYPE DATA

MARIANTHI MARKATOU

DEPT. OF BIOSTATISTICS

UNIVERSITY AT BUFFALO

STATISTICAL & COMPUTATIONAL CHALLENGES IN PRECISION MEDICINE

NOVEMBER 6-9, 2018

INSTITUTE FOR MATHEMATICS & ITS APPLICATIONS

Page 2: CLUSTERING MIXED-TYPE DATA

Collaborators

Alex Foss*, Senior Scientist, Sandia National Laboratories
Bonnie Ray, VP, Talkspace
Aliza Heching, IBM Watson Research Center

Page 3: CLUSTERING MIXED-TYPE DATA

OUTLINE

Introduction & Problem Statement
Background/Literature Review
KAMILA: A new algorithm for clustering mixed-type data
Software Platform
Examples
References

Page 4: CLUSTERING MIXED-TYPE DATA

INTRODUCTION

Precision Medicine: an approach to disease treatment and prevention that is designed to optimize efficiency or therapeutic benefit for groups of patients. These subgroups are identified via data-driven techniques. The data are obtained from large, heterogeneous combinations of sources:

Demographic information: age, gender, race, ethnicity, economic status;
Diagnostic testing: inflammation score, tumor size, disease stage;
High-throughput "omics" data sets: interval-scale gene expression data, categorical SNP data.

Page 5: CLUSTERING MIXED-TYPE DATA

INTRODUCTION/LITERATURE REVIEW

Many "big data" sets contain variables of both interval and categorical (nominal/ordinal) scale. Many commonly used approaches for clustering mixed-scale (or mixed-type) data adapt existing techniques for single-type data. That is:

Interval-scale data are discretized, and techniques for clustering categorical-scale data are then applied;
Categorical-scale variables are dummy-coded, and interval-scale clustering techniques are then applied.

There are significant problems associated with both strategies.

Page 6: CLUSTERING MIXED-TYPE DATA

INTRODUCTION

Mixed-type data: refers to data that are a combination of realizations from both continuous (e.g., height, weight, systolic blood pressure) and categorical (e.g., gender, race, ethnicity, HCV genotype) random variables.

Cluster analysis: aims to identify groups of similar units in a data set.

Definition: A cluster is a set of data points that is compact and isolated, that is, a set of points that are similar to each other and well-separated from points not belonging to the cluster (Cormack, 1971; Gordon, 1981; Jain, 2010; McNicholas, 2016).

Page 7: CLUSTERING MIXED-TYPE DATA

INTRODUCTION

Additional characteristics of clusters that may be desirable, depending on the context, include stability of the identified clusters, independence of variables within a cluster, and the degree to which a cluster can be well represented by its centroid.

Recent literature providing interesting discussions of these issues includes: 1) Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64, 53-62; 2) McNicholas, P.D. (2016). Model-based clustering. Journal of Classification, 33, 331-373.

Page 8: CLUSTERING MIXED-TYPE DATA

PROBLEM STATEMENT

We focus on clustering mixed-scale data, that is, data sets that contain realizations of both interval and categorical (nominal and/or ordinal) scale variables.

QUESTION: Why does this problem matter?
Answer: Informative subgroup identification, i.e., clustering.

Page 9: CLUSTERING MIXED-TYPE DATA

LITERATURE REVIEW

Fundamental challenges are:

1) to equitably balance the contributions of the continuous and categorical variables;

2) to handle data sets in which only a subset of the variables is related to the underlying cluster structure of interest, something current clustering algorithms do not do properly.

In what follows, we examine each of these challenges closely.

Page 10: CLUSTERING MIXED-TYPE DATA

FRAMEWORK

There are many perspectives and definitions of clusters. Here we invoke the mixture model perspective. This perspective is particularly effective because: (a) it produces mathematically rigorous generative models; (b) it can naturally accommodate multiple-scale data without transformations and approximations; (c) it can handle dependencies within and between data types; and (d) it is flexible enough to capture a very wide range of scenarios of practical significance.

However, clustering is not always about recovering mixture components. The definition of a cluster is highly dependent upon the context of the problem and the available data.

Page 11: CLUSTERING MIXED-TYPE DATA

DATA TRANSFORMATIONS: DISCRETIZATION

Interval-scale variables are discretized, and a clustering method suitable for exclusively categorical variables is applied.

EXAMPLE: Consider a Monte Carlo study in which we cluster realizations of the random vector (V, W), defined by a mixture of two well-separated populations as follows:

Population 1: V ∼ N(0, 5.2), W ∼ Multin(n = 1, pᵀ = (.45, .45, .05, .05))

Population 2: V ∼ N(5.2, 5.2), W ∼ Multin(n = 1, pᵀ = (.05, .05, .45, .45))

Page 12: CLUSTERING MIXED-TYPE DATA

DISCRETIZATION

Sample size is N=500;

V is discretized with 1) a median split into 2 bins, 2) a tertile split into 3 bins, and so on, up to 9 bins. We use the k-modes algorithm and the latent class analysis (LCA) algorithm.

Figure 1 (next slide) shows the adjusted Rand index (ARI) for each of the discretization conditions.
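A minimal R sketch of this simulation (not the authors' code; the packages klaR for k-modes and mclust for the ARI are my choices, and 5.2 is treated as the variance of V):

```r
## Minimal sketch of the discretization experiment above (not the authors'
## code). Assumes packages 'klaR' (kmodes) and 'mclust' (adjustedRandIndex),
## and treats 5.2 as the variance of V.
library(klaR)
library(mclust)

set.seed(1)
N <- 500
z <- rbinom(N, 1, 0.5) + 1                      # true population labels (1 or 2)

V  <- rnorm(N, mean = c(0, 5.2)[z], sd = sqrt(5.2))
p1 <- c(.45, .45, .05, .05); p2 <- c(.05, .05, .45, .45)
W  <- sapply(z, function(g) sample(1:4, 1, prob = if (g == 1) p1 else p2))

## Discretize V into k quantile bins, cluster with k-modes, and score with ARI
ari <- sapply(2:9, function(k) {
  Vbin <- cut(V, breaks = quantile(V, probs = seq(0, 1, length.out = k + 1)),
              include.lowest = TRUE, labels = FALSE)
  fit  <- kmodes(data.frame(Vbin = factor(Vbin), W = factor(W)), modes = 2)
  adjustedRandIndex(fit$cluster, z)
})
setNames(ari, paste0(2:9, "_bins"))
```

The LCA arm of the study is analogous, with the k-modes call replaced by a latent class model fitted to the two categorical variables.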

Page 13: CLUSTERING MIXED-TYPE DATA

FIGURE 1: Performance of k-modes and LCA for various quantile splits of the data

Page 14: CLUSTERING MIXED-TYPE DATA

RESULTS

The k-modes algorithm performs best for the median split of the interval-scale variable and degrades as the number of bins increases;
LCA does not degrade as the number of cut points increases;
The normal-multinomial mixture model, which uses the untransformed interval-scale variable, outperforms both competing algorithms for all choices of cut point.

Page 15: CLUSTERING MIXED-TYPE DATA

NUMERICAL CODING METHOD

Involves converting categorical variables into numeric variables and clustering the new set with methods suitable for interval-scale data only (e.g., k-means).

In practice, clustering with a numerical coding technique almost always involves 0-1 dummy coding combined with standardized continuous variables. But this is ineffective, as the following proposition shows.

Consider (V, W) defined as follows:
V = Y1 with probability π ∊ [0, 1],
V = Y2 with probability 1 − π.

Page 16: CLUSTERING MIXED-TYPE DATA

Numerical Coding Methods

Y1, Y2 are two continuous random variables with means 0, μ and variances σ1² and σ2². Let η = E[V] and ν² = Var(V) = πσ1² + (1 − π)σ2² + π(1 − π)μ². Let also W = B1 with probability π and W = B2 with probability 1 − π, where B1 ∼ Bern(p1) and B2 ∼ Bern(p2). The squared Euclidean distance between population 1 and population 2 is then:

((Y1 − η)/ν − (Y2 − η)/ν)² + (B1 − B2)².

Page 17: CLUSTERING MIXED-TYPE DATA

Proposition

Let (V, W) be a mixed-type bivariate vector. Then the expectation of the continuous contribution to the distance is

E[((Y1 − Y2)/ν)²] = φ,

where φ > 1 when σ1 ≠ σ2 and φ > 2 when σ1 = σ2. Furthermore, since |B1 − B2| is 0 or 1,

0 < E[(B1 − B2)²] < 1 for all p1, p2 ∈ (0, 1).
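The value of φ follows from a one-line expectation calculation; a sketch (assuming Y1 and Y2 are independent, consistent with the mixture setup above):

```latex
% Continuous contribution under standardization by \nu
% (Y_1 \perp Y_2, means 0 and \mu, variances \sigma_1^2, \sigma_2^2):
\[
  \varphi \;=\; E\!\left[\Big(\tfrac{Y_1 - Y_2}{\nu}\Big)^{2}\right]
  \;=\; \frac{\sigma_1^{2} + \sigma_2^{2} + \mu^{2}}
             {\pi\sigma_1^{2} + (1-\pi)\sigma_2^{2} + \pi(1-\pi)\mu^{2}} .
\]
% For \sigma_1 = \sigma_2 = \sigma and \mu \neq 0, since \pi(1-\pi) \le 1/4:
\[
  \varphi \;=\; \frac{2\sigma^{2} + \mu^{2}}{\sigma^{2} + \pi(1-\pi)\mu^{2}}
  \;\ge\; \frac{2\sigma^{2} + \mu^{2}}{\sigma^{2} + \mu^{2}/4} \;>\; 2 .
\]
% The categorical contribution is E[(B_1 - B_2)^2] = P(B_1 \neq B_2)
% = p_1(1-p_2) + p_2(1-p_1) \in (0,1) for p_1, p_2 \in (0,1).
```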

Page 18: CLUSTERING MIXED-TYPE DATA

Numerical Coding

The continuous contribution has expectation > 1 for σ1 ≠ σ2;
The continuous contribution has expectation > 2 for equal variances;
The categorical contribution has expectation < 1.

This implies that the continuous and categorical variables receive unbalanced treatment.

How do we deal with this unbalanced treatment?

Perhaps the categorical variables should be up-weighted, e.g., coded 0-2 instead of 0-1? This is an ineffective strategy in the general case (see the numerical illustration below).
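A quick numerical illustration of the imbalance, using illustrative parameter values of my choosing; it also shows that the continuous-to-categorical ratio varies across settings, so no single fixed up-weight restores balance:

```r
## Numerical illustration of the imbalance (illustrative parameter values).
set.seed(1)
contrib <- function(mu, s1, s2, p1, p2, pi_ = 0.5, n = 1e5) {
  Y1 <- rnorm(n, 0, s1);   Y2 <- rnorm(n, mu, s2)
  B1 <- rbinom(n, 1, p1);  B2 <- rbinom(n, 1, p2)
  nu2 <- pi_ * s1^2 + (1 - pi_) * s2^2 + pi_ * (1 - pi_) * mu^2   # Var(V)
  c(continuous  = mean((Y1 - Y2)^2 / nu2),     # standardized continuous term
    categorical = mean((B1 - B2)^2))           # 0-1 dummy-coded term
}
contrib(mu = 3, s1 = 1, s2 = 1, p1 = .2, p2 = .8)   # approx. 3.4 vs 0.68 (ratio ~5)
contrib(mu = 1, s1 = 1, s2 = 3, p1 = .4, p2 = .6)   # approx. 2.1 vs 0.52 (ratio ~4)
## The continuous-to-categorical ratio depends on the (unknown) parameters,
## so no single fixed re-coding weight (e.g., 0-2) balances every data set.
```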

Page 19: CLUSTERING MIXED-TYPE DATA

HYBRID DISTANCE

Page 20: CLUSTERING MIXED-TYPE DATA

HYBRID DISTANCE METHOD: MODHA-SPANGLER WEIGHTING

Page 21: CLUSTERING MIXED-TYPE DATA

MODHA-SPANGLER WEIGHTING

Page 22: CLUSTERING MIXED-TYPE DATA

KAMILA ALGORITHM

► We introduce a novel clustering algorithm for clustering mixed data
► "KAy" means for MIxed LArge datasets
■ Speed of k-means
■ Desirable continuous/categorical weighting strategy of finite mixture models
■ Categorical variables are modeled as multinomial random variables
■ For continuous variables, distances to the nearest centroids are calculated, and an appropriate univariate kernel density estimator is constructed
■ Clusters in continuous dimensions are modeled as arbitrary spherical/elliptical mixtures using a kernel density-based method (see the simplified sketch below)
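The following R sketch illustrates one partition step in this spirit. It is my simplification: an ordinary Gaussian KDE of the closest-centroid distances stands in for the radial kernel density estimator of Foss et al. (2016), and all names are mine, not the package's.

```r
## Simplified illustration of a single KAMILA-style assignment step.
## conX: n x p matrix of continuous variables (standardized);
## catX: factor of length n (one categorical variable, for simplicity);
## centroids: G x p matrix of cluster centroids;
## theta: G x L matrix of category probabilities per cluster.
## NOTE: the actual algorithm (Foss et al., 2016) uses a radial kernel
## density estimator; an ordinary univariate KDE is substituted here.
kamila_assign <- function(conX, catX, centroids, theta) {
  G <- nrow(centroids)

  ## Euclidean distance from every observation to every centroid (n x G)
  dist_mat <- sapply(1:G, function(g)
    sqrt(rowSums(sweep(conX, 2, centroids[g, ])^2)))

  ## KDE fitted to the closest-centroid distances, evaluated at all distances
  dmin <- apply(dist_mat, 1, min)
  kde  <- density(dmin)
  fhat <- matrix(approx(kde$x, kde$y, xout = as.vector(dist_mat), rule = 2)$y,
                 nrow = nrow(conX))

  ## Continuous log-"density" plus categorical multinomial log-likelihood
  logcon <- log(pmax(fhat, .Machine$double.eps))
  logcat <- t(log(theta[, as.integer(catX), drop = FALSE]))   # n x G

  max.col(logcon + logcat)   # assign each observation to its best cluster
}
```

In the full algorithm this assignment step alternates with re-estimation of the centroids and of the multinomial parameters until the partition stabilizes.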

Page 23: CLUSTERING MIXED-TYPE DATA

Full KAMILA Algorithm

Page 24: CLUSTERING MIXED-TYPE DATA

RADIAL KDE

Page 25: CLUSTERING MIXED-TYPE DATA

RADIAL KDE

Page 26: CLUSTERING MIXED-TYPE DATA

DATA MODEL

Page 27: CLUSTERING MIXED-TYPE DATA

ALGORITHM

Page 28: CLUSTERING MIXED-TYPE DATA

ALGORITHM

Page 29: CLUSTERING MIXED-TYPE DATA

IDENTIFIABILITY OF THE MIXTURE MODEL

Page 30: CLUSTERING MIXED-TYPE DATA

TIME COMPLEXITY

Page 31: CLUSTERING MIXED-TYPE DATA

TIME COMPLEXITY

Page 32: CLUSTERING MIXED-TYPE DATA

SOFTWARE DEVELOPMENT

Page 33: CLUSTERING MIXED-TYPE DATA

SOFTWARE DEVELOPMENT

The software can be downloaded from https://CRAN.R-project.org/package=kamila and incorporates the KAMILA and Modha-Spangler techniques.

Weighting techniques: Hennig-Liao and others.

A Hadoop implementation of KAMILA has also been developed, designed for clustering very large data sets stored in distributed file systems. It uses the MapReduce programming model.
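A short usage sketch of the CRAN package on toy mixed-type data; the argument and output names (conVar, catFactor, numClust, numInit, finalMemb) reflect my reading of the package documentation and should be verified against ?kamila, and the data are simulated for illustration only.

```r
## Sketch of calling the CRAN 'kamila' package on toy mixed-type data.
## Argument/output names reflect my reading of the package docs -- verify
## with ?kamila before use.
library(kamila)
set.seed(1)

n   <- 200
z   <- rbinom(n, 1, 0.5) + 1                     # true (latent) cluster labels
con <- data.frame(x1 = rnorm(n, c(0, 3)[z]),     # continuous variables
                  x2 = rnorm(n, c(0, 3)[z]))
cat_ <- data.frame(w = factor(sapply(z, function(g)
  sample(letters[1:4], 1,
         prob = if (g == 1) c(.45, .45, .05, .05) else c(.05, .05, .45, .45)))))

fit <- kamila(conVar = con, catFactor = cat_, numClust = 2, numInit = 10)
table(fit$finalMemb, z)                          # recovered vs. true labels

## Modha-Spangler weighting is available in the same package via gmsClust()
## (also to be verified against the package documentation).
```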

Page 34: CLUSTERING MIXED-TYPE DATA

SIMULATIONS

Page 35: CLUSTERING MIXED-TYPE DATA

UNDERSTANDING PROSTATE CANCER MORTALITY

Page 36: CLUSTERING MIXED-TYPE DATA

UNDERSTANDING PROSTATE CANCER MORTALITY

Page 37: CLUSTERING MIXED-TYPE DATA

PROSTATE CANCER MORTALITY

Page 38: CLUSTERING MIXED-TYPE DATA

PROSTATE CANCER MORTALITY

Page 39: CLUSTERING MIXED-TYPE DATA

CONCLUSIONS

Page 40: CLUSTERING MIXED-TYPE DATA

REFERENCES

1. Foss, A., Markatou, M., Ray, B. (2018). Distance metrics and clustering methods for mixed-type data. International Statistical Review, https://doi.org/10.1111/insr.12274.

2. Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining & Knowledge Discovery, 2(3), 283-304

3. Byar, D. & Green, S. (1980). The choice of treatment for cancer patients based on covariate information: applications to prostate cancer. Bulletin du Cancer, 67, 477-490.

4. Foss, A., Markatou, M., Ray, B., Heching, A. (2016). A semi-parametric method for clustering mixed data. Machine Learning, 105(3), 419-458.

Page 41: CLUSTERING MIXED-TYPE DATA

5. Holzmann, H., Munk, A., Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics, 33(4), 753-763.

6. Tibshirani, R. & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528.

7. Foss, A., Markatou, M. (2018). kamila: Clustering mixed-type data in R and Hadoop. Journal of Statistical Software, 83(13), doi:10.18637/jss.v083.i13.

8. Modha, D. S. & Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217-237.