extensions to the k-means algorithm for clustering large data sets with categorical values

2001/11/06 The Lab of Intelligent Database System, IDS

Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values

Author: Zhexue Huang

Advisor: Dr. Hsu

Graduate: Yu-Wei Su


Outline

Motivation

Objective

Research Review

Notation

K-means Algorithm

K-modes Algorithm

K-prototypes Algorithm

Experiment

Conclusion

Personal opinion


Motivation

K-means methods are efficient for processing large data sets

K-means is limited to numeric data

Real-world data sets mix numeric and categorical data and can contain millions of objects


Objective

Extending K-means to categorical domains and domains with mixed numeric and categorical values


Research review

Partition methods

A partitioning algorithm organizes the N objects into K partitions (K < N)

K-means [MacQueen, 1967]

K-medoids [Kaufman and Rousseeuw, 1990]

CLARANS [Ng and Han, 1994]


Notation

$[A_1, A_2, \ldots, A_m]$ is the set of $m$ attributes; each $A_i$ describes a domain of values, denoted by $DOM(A_i)$

$X = \{X_1, X_2, \ldots, X_n\}$ is a set of $n$ objects; object $X_i$ is represented as $[x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$

$X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$

In $[x^r_1, \ldots, x^r_p, x^c_{p+1}, \ldots, x^c_m]$, the first $p$ elements are numeric values and the rest are categorical values


K-means Algorithm

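$k$ is the number of clusters and $n$ is the number of objects

$W$ is an $n \times k$ partition matrix, and $Q = \{Q_1, Q_2, \ldots, Q_k\}$ is a set of objects in the same object domain

$d(\cdot,\cdot)$ is the squared Euclidean distance between two objects

Problem P: minimise

$$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, d(X_i, Q_l)$$

subject to

$$\sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \le i \le n$$

$$w_{i,l} \in \{0, 1\}, \quad 1 \le i \le n,\ 1 \le l \le k$$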


K-means Algorithm (cont.)

Problem P can be solved by iteratively solving the following two problems:

Problem P1: fix $Q = \hat{Q}$ and solve the reduced problem $P(W, \hat{Q})$

$w_{i,l} = 1$ if $d(X_i, Q_l) \le d(X_i, Q_t)$ for $1 \le t \le k$; $w_{i,t} = 0$ for $t \ne l$

Problem P2: fix $W = \hat{W}$ and solve the reduced problem $P(\hat{W}, Q)$

$$q_{l,j} = \frac{\sum_{i=1}^{n} w_{i,l}\, x_{i,j}}{\sum_{i=1}^{n} w_{i,l}}, \quad 1 \le l \le k,\ 1 \le j \le m$$
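The two reduced problems can be illustrated with a minimal Python sketch. The function names `solve_p1` and `solve_p2`, the label-list encoding of W, the tie-breaking rule and the handling of empty clusters are assumptions made for this illustration; they are not taken from the slides.

```python
def sq_euclidean(x, q):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, q))


def solve_p1(X, Q):
    """Problem P1: with Q fixed, assign each object to its nearest centre.

    W is encoded as a label list: labels[i] == l means w[i][l] = 1.
    Assumption: ties go to the lowest cluster index.
    """
    return [min(range(len(Q)), key=lambda l: sq_euclidean(x, Q[l])) for x in X]


def solve_p2(X, labels, k):
    """Problem P2: with W fixed, q[l][j] is the mean of attribute j in cluster l."""
    m = len(X[0])
    Q = []
    for l in range(k):
        members = [x for x, lab in zip(X, labels) if lab == l]
        if not members:
            # Assumption: an empty cluster gets a zero vector as a placeholder.
            Q.append([0.0] * m)
            continue
        Q.append([sum(x[j] for x in members) / len(members) for j in range(m)])
    return Q


X = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [11.0, 9.0]]
labels = solve_p1(X, [[0.0, 0.0], [10.0, 10.0]])
centres = solve_p2(X, labels, k=2)
print(labels)   # [0, 0, 1, 1]
print(centres)  # [[1.0, 2.0], [10.0, 9.0]]
```

Storing one label per object instead of the full n x k matrix enforces the constraint that each row of W sums to 1.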


K-means Algorithm (cont.)

1. Choose an initial $Q^0$ and solve $P(W, Q^0)$ to obtain $W^0$. Set $t = 0$

2. Let $\hat{W} = W^t$ and solve $P(\hat{W}, Q)$ to obtain $Q^{t+1}$. If $P(\hat{W}, Q^t) = P(\hat{W}, Q^{t+1})$, output $\hat{W}$, $Q^t$ and stop; otherwise, go to 3

3. Let $\hat{Q} = Q^{t+1}$ and solve $P(W, \hat{Q})$ to obtain $W^{t+1}$. If $P(W^t, \hat{Q}) = P(W^{t+1}, \hat{Q})$, output $W^t$, $\hat{Q}$ and stop; otherwise, let $t = t + 1$ and go to 2
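The three steps can be sketched as a loop that alternates the two reduced problems and stops when the cost no longer changes. This is a minimal sketch, not a reference implementation: the helper names, the `max_iter` guard and the rule that an empty cluster keeps its previous centre are assumptions. The exact equality test mirrors the stopping rule on the slide; practical code would usually compare against a small tolerance.

```python
def sq_euclidean(x, q):
    return sum((a - b) ** 2 for a, b in zip(x, q))


def assign(X, Q):
    """Solve P(W, Q_hat): nearest centre for every object."""
    return [min(range(len(Q)), key=lambda l: sq_euclidean(x, Q[l])) for x in X]


def update(X, labels, Q_old):
    """Solve P(W_hat, Q): cluster means (empty cluster keeps its old centre)."""
    Q = []
    for l, q_old in enumerate(Q_old):
        members = [x for x, lab in zip(X, labels) if lab == l]
        if members:
            Q.append([sum(col) / len(members) for col in zip(*members)])
        else:
            Q.append(list(q_old))
    return Q


def cost(X, labels, Q):
    """P(W, Q) with W encoded as one cluster label per object."""
    return sum(sq_euclidean(x, Q[lab]) for x, lab in zip(X, labels))


def k_means(X, Q0, max_iter=100):
    Q = [list(q) for q in Q0]
    W = assign(X, Q)                       # step 1: W^0 from Q^0
    for _ in range(max_iter):
        Q_new = update(X, W, Q)            # step 2: Q^{t+1} from W^t
        if cost(X, W, Q_new) == cost(X, W, Q):
            return W, Q
        W_new = assign(X, Q_new)           # step 3: W^{t+1} from Q^{t+1}
        if cost(X, W_new, Q_new) == cost(X, W, Q_new):
            return W, Q_new
        W, Q = W_new, Q_new
    return W, Q


X = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [11.0, 9.0]]
W, Q = k_means(X, [[0.0, 0.0], [10.0, 10.0]])
print(W, Q)  # [0, 0, 1, 1] [[1.0, 2.0], [10.0, 9.0]]
```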


K-modes Algorithm

Using a simple matching dissimilarity measure for categorical objects

Replacing means of clusters by modes

Using a frequency-based method to find the modes


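K-modes Algorithm (cont.)

Dissimilarity measure

$$d_1(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)$$

where

$$\delta(x_j, y_j) = \begin{cases} 0 & (x_j = y_j) \\ 1 & (x_j \ne y_j) \end{cases}$$

Mode of a set

A mode of $X = \{X_1, X_2, \ldots, X_n\}$ is a vector $Q = [q_1, q_2, \ldots, q_m]$ that minimises

$$D(X, Q) = \sum_{i=1}^{n} d_1(X_i, Q)$$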
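The simple matching measure and the total dissimilarity D(X, Q) can be illustrated in a few lines of Python. The function names and the toy records below are made up for this example.

```python
def matching_dissimilarity(x, y):
    """d1(X, Y): number of attributes on which two categorical objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def total_dissimilarity(X, q):
    """D(X, Q): sum of d1(X_i, Q) over all objects in the set."""
    return sum(matching_dissimilarity(x, q) for x in X)


X = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
]
d = matching_dissimilarity(X[0], X[2])   # colour and shape differ
D = total_dissimilarity(X, ("red", "small", "round"))
print(d, D)  # 2 3
```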



Find a mode for a set

Let $n_{c_{k,j}}$ be the number of objects having the $k$th category $c_{k,j}$ in attribute $A_j$, and let

$$f_r(A_j = c_{k,j} \mid X) = \frac{n_{c_{k,j}}}{n}$$

be the relative frequency of category $c_{k,j}$ in $X$

Theorem 1

$D(X, Q)$ is minimised iff

$$f_r(A_j = q_j \mid X) \ge f_r(A_j = c_{k,j} \mid X)$$

for $q_j \ne c_{k,j}$, for all $j = 1, \ldots, m$
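Theorem 1 says a mode can be built attribute by attribute by taking a most frequent category in each column. A minimal sketch follows; `find_mode` is a name chosen here, and breaking ties by first occurrence is an assumption (any most frequent category satisfies the theorem, so a mode is not necessarily unique).

```python
from collections import Counter


def find_mode(X):
    """Return a mode of X: for each attribute, a most frequent category."""
    m = len(X[0])
    return tuple(
        Counter(x[j] for x in X).most_common(1)[0][0]  # ties: first seen wins
        for j in range(m)
    )


def total_dissimilarity(X, q):
    """D(X, Q) under the simple matching measure."""
    return sum(sum(1 for a, b in zip(x, q) if a != b) for x in X)


X = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
]
mode = find_mode(X)
print(mode)                          # ('red', 'small', 'round')
print(total_dissimilarity(X, mode))  # 3
```

In this toy set the mode happens to coincide with the first record, but in general a mode need not be an element of X.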



Two initial mode selection methods

1. Select the first K distinct records from the data set as the K modes

2. Select the K modes by a frequency-based method
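The first selection method is simple enough to sketch directly. The function name `first_k_distinct` and the error raised when the data holds fewer than K distinct records are assumptions for this illustration. The second, frequency-based method is not sketched because the slide does not give its details.

```python
def first_k_distinct(X, k):
    """Method 1: take the first k distinct records as the initial modes."""
    modes = []
    seen = set()
    for x in X:
        key = tuple(x)
        if key in seen:
            continue
        seen.add(key)
        modes.append(key)
        if len(modes) == k:
            return modes
    # Assumption: fail loudly rather than return fewer than k modes.
    raise ValueError("data set contains fewer than k distinct records")


X = [("a", "x"), ("a", "x"), ("b", "x"), ("b", "y")]
initial_modes = first_k_distinct(X, 2)
print(initial_modes)  # [('a', 'x'), ('b', 'x')]
```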



The total cost P is calculated against the whole data set each time a new Q or W is obtained

$$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,l}\, \delta(x_{i,j}, q_{l,j})$$

where $w_{i,l} \in W$ and $Q_l = [q_{l,1}, q_{l,2}, \ldots, q_{l,m}] \in Q$



1. Select K initial modes, one for each cluster

2. Allocate each object to the cluster whose mode is nearest to it. Update the mode of the cluster after each allocation according to Theorem 1

3. After all objects have been allocated to clusters, retest the dissimilarity of the objects against the current modes. If the nearest mode of an object belongs to a cluster other than its current one, reallocate the object to that cluster and update the modes of both clusters

4. Repeat 3 until no object has changed clusters
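Steps 1 to 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's implementation: the per-cluster frequency counters, the `max_passes` guard, the tie-breaking rules (lowest cluster index, first-seen category) and the rule that an emptied cluster keeps its last mode are all choices made here.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)


def k_modes(X, initial_modes, max_passes=100):
    k, m = len(initial_modes), len(X[0])
    modes = [list(q) for q in initial_modes]          # step 1
    # freq[l][j] counts the categories of attribute j inside cluster l.
    freq = [[Counter() for _ in range(m)] for _ in range(k)]
    labels = [None] * len(X)

    def nearest(x):
        return min(range(k), key=lambda l: matching_dissimilarity(x, modes[l]))

    def refresh_mode(l):
        # Theorem 1: take a most frequent category per attribute.
        for j in range(m):
            if freq[l][j]:
                modes[l][j] = freq[l][j].most_common(1)[0][0]

    def move(i, new):
        old = labels[i]
        if old is not None:
            for j, v in enumerate(X[i]):
                freq[old][j][v] -= 1
                if freq[old][j][v] == 0:
                    del freq[old][j][v]
            refresh_mode(old)
        for j, v in enumerate(X[i]):
            freq[new][j][v] += 1
        refresh_mode(new)
        labels[i] = new

    for i, x in enumerate(X):                         # step 2
        move(i, nearest(x))
    for _ in range(max_passes):                       # steps 3 and 4
        changed = False
        for i, x in enumerate(X):
            l = nearest(x)
            if l != labels[i]:
                move(i, l)
                changed = True
        if not changed:
            break
    return labels, [tuple(q) for q in modes]


X = [("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
     ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r")]
labels, modes = k_modes(X, [X[0], X[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(modes)   # [('a', 'x', 'p'), ('b', 'z', 'r')]
```

Keeping frequency counters per cluster makes each mode update proportional to the number of attributes rather than to the cluster size.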


K-prototypes Algorithm

To integrate the k-means and k-modes algorithms in order to cluster mixed-type objects

The attributes are $[A^r_1, A^r_2, \ldots, A^r_p, A^c_{p+1}, \ldots, A^c_m]$, where $m$ is the number of attributes; the first $p$ attributes are numeric and the rest are categorical


K-prototypes Algorithm (cont.)

$$d_2(X, Y) = \sum_{j=1}^{p} (x_j - y_j)^2 + \gamma \sum_{j=p+1}^{m} \delta(x_j, y_j)$$

The first term is the squared Euclidean distance measure on the numeric attributes and the second term is the simple matching dissimilarity measure on the categorical attributes

The weight $\gamma$ is used to avoid favouring either type of attribute
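The mixed dissimilarity measure can be illustrated with a short Python sketch. The function name, the convention that the first `p` positions of a record are numeric, and the value `gamma=0.5` are assumptions chosen only for the example.

```python
def mixed_dissimilarity(x, y, p, gamma):
    """d2(X, Y): squared Euclidean part on the first p (numeric) attributes
    plus gamma times the simple matching part on the remaining attributes."""
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical = sum(1 for a, b in zip(x[p:], y[p:]) if a != b)
    return numeric + gamma * categorical


x = (1.0, 2.0, "red", "small")
y = (2.0, 4.0, "red", "large")
d = mixed_dissimilarity(x, y, p=2, gamma=0.5)
print(d)  # 5.5  (numeric part 1 + 4, one categorical mismatch weighted 0.5)
```

With `gamma=0` the measure reduces to the numeric part alone, which shows how the weight controls the influence of the categorical attributes.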



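Cost function

Minimise

$$P(W, Q) = \sum_{l=1}^{k} \left( \sum_{i=1}^{n} w_{i,l} \sum_{j=1}^{p} (x_{i,j} - q_{l,j})^2 + \gamma \sum_{i=1}^{n} w_{i,l} \sum_{j=p+1}^{m} \delta(x_{i,j}, q_{l,j}) \right)$$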
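The cost function can be evaluated with a short sketch that also shows one plausible way to recompute a prototype: the mean on the numeric attributes and a most frequent category on the categorical ones. The helper names, the label-list encoding of W and the toy data are assumptions for illustration only.

```python
from collections import Counter


def mixed_dissimilarity(x, q, p, gamma):
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], q[:p]))
    categorical = sum(1 for a, b in zip(x[p:], q[p:]) if a != b)
    return numeric + gamma * categorical


def cost(X, labels, Q, p, gamma):
    """P(W, Q) with W encoded as one cluster label per object."""
    return sum(mixed_dissimilarity(x, Q[l], p, gamma) for x, l in zip(X, labels))


def update_prototype(members, p):
    """Mean of each numeric attribute, most frequent value of each categorical one."""
    n, m = len(members), len(members[0])
    means = [sum(x[j] for x in members) / n for j in range(p)]
    modes = [Counter(x[j] for x in members).most_common(1)[0][0]
             for j in range(p, m)]
    return tuple(means) + tuple(modes)


X = [(1.0, "a"), (3.0, "a"), (10.0, "b"), (12.0, "b"), (11.0, "a")]
labels = [0, 0, 1, 1, 1]
Q = [update_prototype([x for x, l in zip(X, labels) if l == c], p=1)
     for c in range(2)]
total = cost(X, labels, Q, p=1, gamma=2)
print(Q)      # [(2.0, 'a'), (11.0, 'b')]
print(total)  # 6.0
```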



Choose clusters [figure not reproduced]

Modify the mode [figure not reproduced]


Experiment

K-modes: the first data set was the soybean disease data set, with 4 diseases and 47 instances {D=10, C=10, R=10, P=17} and 21 attributes

K-prototypes: the second data set was the credit approval data set, with 2 classes and 666 instances {approval=299, reject=367}, 6 numeric and 9 categorical attributes


Experiment (cont.)


Conclusion

The k-modes algorithm is faster than the k-means and k-prototypes algorithms because it needs fewer iterations to converge

Open problem: how many clusters are in the data?

The weight $\gamma$ of the categorical term adds an additional problem


Personal opinion

Conceptual inclusion relationships

Outlier problem

Massive data sets cause efficiency problems
