clustering categorical data


  • 7/28/2019 Clustering Categorical Data

    1/29

    Clustering Categorical Data

Steven X. Wang
Department of Mathematics and Statistics

    York University

    April 11, 2005


    Presentation Outline

Brief literature review
Some new algorithms for categorical data
Challenges in clustering categorical data
Future work and discussions


    Algorithms for Continuous Data

There are many clustering algorithms proposed in the literature:

1. K-means
2. EM algorithm
3. Hierarchical clustering
4. CLARANS
5. OPTICS


    Algorithms for Categorical Data

K-modes (a modification of K-means)
AutoClass (based on the EM algorithm)
ROCK and CLOPE

There are only a handful of algorithms for clustering categorical data.


    Categorical Data Structure

Categorical data has a different structure than continuous data.

Distance functions for continuous data may not be applicable to categorical data.

Algorithms for clustering continuous data cannot be applied directly to categorical data.


K-means for Clustering Continuous Data

K-means is one of the oldest and most widely used algorithms for clustering continuous data.

1) Choose the number of clusters and initialize the cluster centers.

2) Iterate until a selected convergence criterion is reached.

3) Computational complexity: O(n).
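The two-step loop above can be sketched as follows. This is a minimal illustration, not a production implementation; the function name, the random-row initialization, and the "centers stop moving" stopping rule are our own illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Minimal k-means sketch: initialize k centers, then alternate
    assignment and center-update steps until the centers stop moving."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Initialize centers from k distinct random rows of X.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence criterion
            break
        centers = new_centers
    return labels, centers
```

Each iteration touches every point once, which is where the O(n) per-iteration cost comes from.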


    Categorical Sample Space

Assume that the data set is stored in an n×p matrix, where n is the number of observations and p is the number of categorical variables.

The sample space consists of all possible combinations generated by the p variables.

The sample space is discrete and has no natural origin.
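The size of this sample space grows multiplicatively: it is the product of the number of levels of each of the p variables. A one-line illustration (the function name is ours):

```python
from math import prod

def sample_space_size(levels):
    """Number of cells in the categorical sample space, where `levels`
    lists the number of categories of each of the p variables."""
    return prod(levels)

# Three variables with 2, 3, and 4 levels span 2 * 3 * 4 = 24 combinations:
# sample_space_size([2, 3, 4]) -> 24
```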


    K-modes for Categorical Data

K-modes has exactly the same structure as k-means, i.e., choose k cluster modes and iterate until convergence.

K-modes has a fundamental flaw: the partition is sensitive to the input order, i.e., the clustering results can differ for the same data set if the input order is different.
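A minimal sketch of that shared structure (illustrative, not Huang's original K-modes; the helper names are ours). Seeding the modes from the first k records also makes the input-order sensitivity visible: permuting the rows changes the starting modes and hence, possibly, the final partition.

```python
from collections import Counter

def hamming(x, y):
    """Number of attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Attribute-wise mode of a cluster of categorical tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def kmodes(data, k, n_iter=100):
    """Minimal k-modes sketch: the k-means loop with Hamming distance
    in place of Euclidean distance and modes in place of means."""
    modes = list(data[:k])          # order-dependent initialization
    for _ in range(n_iter):
        # Assignment step: each record goes to its nearest mode.
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda j: hamming(x, modes[j]))
            clusters[j].append(x)
        # Update step: each mode becomes its cluster's attribute-wise mode.
        new_modes = [mode_of(c) if c else modes[j] for j, c in enumerate(clusters)]
        if new_modes == modes:      # convergence: modes stopped changing
            break
        modes = new_modes
    return clusters, modes
```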


    AutoClass Algorithm

This is an algorithm applicable to both continuous and categorical data.

It is a model-based algorithm that does not require the number of clusters as input.

Computational complexity: O(n log n).

The EM algorithm converges slowly and is sensitive to the initial values.


    Hamming Distance and CD vector

Hamming distance measures the number of attributes on which two categorical records differ.

Hamming distance has been used for clustering categorical data in algorithms similar to K-modes.

We construct the Categorical Distance (CD) vector to project the sample space into a 1-dimensional space.
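In code, the Hamming distance between two equal-length categorical records is simply a count of disagreements:

```python
def hamming(x, y):
    """Number of attributes on which two categorical records disagree."""
    assert len(x) == len(y), "records must have the same number of attributes"
    return sum(a != b for a, b in zip(x, y))

# Two records differing in the second and third attributes:
# hamming(("red", "small", "round"), ("red", "large", "oval")) -> 2
```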


    Example of a CD vector

[Figure: example CD vector]


More on CD Vectors

The dense region of the CD vector is not necessarily a cluster!

We can construct many CD vectors on one data set by choosing different origins.

[Figure: CD vector]


UCD: Expected CD Vector under the Null

[Figure: UCD vector]


[Figure: CD vector vs. UCD vector]


    CD Algorithm

Find a cluster center;

Construct the CD vector given the current center;

Perform a modified Chi-square test;

If we reject the null, determine the radius of the current cluster and extract the cluster;

Repeat until we fail to reject the null.
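The control flow of these steps can be sketched as below. This is only a schematic outline under our own naming: `find_center`, `cd_vector`, `chi_square_reject`, and `cluster_radius` are hypothetical stand-ins for the center search, CD-vector construction, modified Chi-square test, and radius selection that the slides describe but do not specify.

```python
def hamming(x, y):
    """Number of attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(x, y))

def cd_cluster(data, find_center, cd_vector, chi_square_reject, cluster_radius):
    """Schematic CD-algorithm loop: peel off one cluster per iteration
    until the modified Chi-square test no longer rejects the null."""
    clusters, remaining = [], list(data)
    while remaining:
        center = find_center(remaining)
        cd = cd_vector(remaining, center)   # CD vector for the current center
        if not chi_square_reject(cd):       # fail to reject: no cluster left
            break
        r = cluster_radius(cd)              # radius of the current cluster
        in_cluster = [x for x in remaining if hamming(x, center) <= r]
        clusters.append(in_cluster)         # extract the cluster
        remaining = [x for x in remaining if hamming(x, center) > r]
    if remaining:                           # points not assigned to any cluster
        clusters.append(remaining)
    return clusters
```

Note there is no convergence loop inside an iteration: each pass extracts one cluster outright, which is why the method needs no convergence criterion.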


Numerical Comparison with K-modes and AutoClass

                     CD   AutoClass       K-modes
No. of clusters       4       4      [3]   [4]   [5]
____________________________________________________
Classif. rates     100%    100%      75%   84%   82%
  Variations         0%      0%       6%   15%   10%
Inform. gain       100%    100%      67%   84%   93%
  Variations         0%      0%      10%   15%   11%
____________________________________________________

Soybean Data: n=47 and p=35. Number of clusters = 4.


Numerical Comparison with K-modes and AutoClass

                     CD   AutoClass       K-modes
No. of clusters       7       3      [6]   [7]   [8]
____________________________________________________
Classif. rates      95%     73%      74%   72%   71%
  Variations         0%      0%       6%   15%   10%
Inform. gain        92%     60%      75%   79%   81%
  Variations         0%      0%       7%    6%    6%
____________________________________________________

Zoo Data: n=101 and p=16. Number of clusters = 7.


    Computational Complexity

The upper bound of the computational complexity of our algorithm is O(kpn).

It is much less computationally intensive than K-modes and AutoClass since it does not demand convergence.


    CD Algorithm

It is based on the Hamming distance.

It does not require the input of parameters.

It has no convergence criterion.

    Ref: Zhang, Wang and Song (2005). JASA. To appear.


Difficulties in Clustering Categorical Data

Distance function
Similarity measure to organize clusters
Scalability or computational complexity


    Challenge 1: Distance Function

Hamming distance is a natural and reasonable choice if the categorical scale has no natural order (nominal data).

If we apply a method designed for nominal data, such as the CD algorithm, to ordinal data, there may be a serious loss of information because the ordering is ignored.
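To make the loss concrete: Hamming distance scores "small vs. medium" and "small vs. large" identically, while an order-aware distance does not. The rank coding and the sum-of-rank-differences distance below are purely illustrative, not a proposal from the slides.

```python
def hamming(x, y):
    """Order-blind distance: count of differing attributes."""
    return sum(a != b for a, b in zip(x, y))

# Assumed rank coding for an ordinal attribute (illustrative).
SIZE = {"small": 0, "medium": 1, "large": 2}

def ordinal_dist(x, y, rank=SIZE):
    """Order-aware distance: sum of absolute rank differences."""
    return sum(abs(rank[a] - rank[b]) for a, b in zip(x, y))

# Hamming ignores the ordering entirely:
#   hamming(("small",), ("medium",)) == hamming(("small",), ("large",)) == 1
# while the ordinal distance distinguishes the two pairs:
#   ordinal_dist(("small",), ("medium",)) == 1
#   ordinal_dist(("small",), ("large",))  == 2
```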


Challenge 2: Organization of Clusters

Organization of clusters is crucial in clustering large data sets.

Similarity measures are needed to organize clusters in hierarchical clustering.

Different similarity measures will give different results.


    Challenge 3: Scalability

In practice, an approximate answer is so much better than no answer at all.

Complexity: O(n). Scalability: O(mn).

How many variables are we dealing with?


    Challenge 1:

What to do about the ordering?

Proposing a reasonable distance function for ordinal data may require a careful examination of the dependence structure.

We need to look into different measures of association for categorical data.


    Challenge 2:

A naïve measure of similarity would be the distance between two clusters. Entropy might be a good one to try even though it is not a distance function.
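One way to make that concrete (purely illustrative, not the slides' proposal): measure the entropy increase caused by merging two clusters, summed over attributes. Identical clusters merge "for free"; heterogeneous ones pay in entropy. The function names are ours.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def merge_cost(a, b):
    """Entropy increase from merging clusters a and b (lists of tuples),
    summed over attributes: an illustrative similarity surrogate. It is
    not a distance function, but lower cost means more similar clusters."""
    n_a, n_b = len(a), len(b)
    cost = 0.0
    for j in range(len(a[0])):
        col_a = [x[j] for x in a]
        col_b = [x[j] for x in b]
        merged_entropy = entropy(col_a + col_b)
        parts_entropy = (n_a * entropy(col_a) + n_b * entropy(col_b)) / (n_a + n_b)
        cost += merged_entropy - parts_entropy
    return cost
```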


    Challenge 3:

There are many hierarchical clustering algorithms available. Any clustering algorithm could be integrated into them if the distance function and similarity measure can be defined appropriately.


    Beyond Categorical Data

The ultimate goal is to cluster any data set with a complex data structure.

Mixed data types would be the next on the list. The challenge there is again the distance function (the dependence structure between the continuous part and the categorical portion).


    More Challenges

Measure of uncertainty
Hard clustering vs. soft clustering
Parallel computing


    Thank you!