extensions to the k-means algorithm for clustering large data sets with categorical values

2001/11/06 The Lab of Intelligent Database System, IDS. Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values. Author: Zhexue Huang. Advisor: Dr. Hsu. Graduate: Yu-Wei Su

Page 1: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values

2001/11/06 The Lab of Intelligent Database System, IDS

Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values

Author: Zhexue Huang

Advisor: Dr. Hsu

Graduate: Yu-Wei Su

Page 2: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Outline

Motivation
Objective
Research Review
Notation
K-means Algorithm
K-mode Algorithm
K-prototype Algorithm
Experiment
Conclusion
Personal opinion

Page 3: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Motivation

K-means methods are efficient for processing large data sets

K-means is limited to numeric data

Real-world data sets mix numeric and categorical values and can contain millions of objects

Page 4: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Objective

Extending K-means to categorical domains and domains with mixed numeric and categorical values

Page 5: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Research review

Partition methods

A partitioning algorithm organizes N objects into K partitions (K < N)

K-means [MacQueen, 1967]
K-medoids [Kaufman and Rousseeuw, 1990]
CLARANS [Ng and Han, 1994]

Page 6: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Notation

[A1, A2, ..., Am] are the m attributes; each Ai describes a domain of values, denoted by DOM(Ai)

X = {X1, X2, ..., Xn} is a set of n objects; object Xi is represented as [xi,1, xi,2, ..., xi,m]

Xi = Xk if xi,j = xk,j for 1 <= j <= m

For mixed-type objects, an object is written as

$$[x_1^r, x_2^r, \ldots, x_p^r, x_{p+1}^c, \ldots, x_m^c]$$

where the first p elements are numeric values and the rest are categorical values

Page 7: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-means Algorithm

k is the number of clusters, n is the number of objects

W is an n x k partition matrix, Q = {Q1, Q2, ..., Qk} is a set of objects in the same object domain

d(.,.) is the squared Euclidean distance between two objects

Problem P

Minimise

$$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, d(X_i, Q_l)$$

Subject to

$$\sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \le i \le n$$

$$w_{i,l} \in \{0, 1\}, \quad 1 \le i \le n,\ 1 \le l \le k$$
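As a rough illustration of Problem P, the sketch below evaluates the cost for a given partition matrix and set of centres. It assumes d is the squared Euclidean distance (the choice under which the cluster mean is the minimiser); the function names `squared_euclidean` and `kmeans_cost` and the toy data are hypothetical and not from the slides.

```python
def squared_euclidean(x, q):
    """Squared Euclidean distance between two numeric vectors (assumed form of d)."""
    return sum((a - b) ** 2 for a, b in zip(x, q))


def kmeans_cost(X, W, Q):
    """P(W, Q): sum over clusters l and objects i of W[i][l] * d(X[i], Q[l])."""
    n, k = len(X), len(Q)
    # Check the two constraints of Problem P before evaluating the cost.
    for i in range(n):
        if any(w not in (0, 1) for w in W[i]):
            raise ValueError("every w[i][l] must be 0 or 1")
        if sum(W[i]) != 1:
            raise ValueError("each object must belong to exactly one cluster")
    return sum(W[i][l] * squared_euclidean(X[i], Q[l])
               for l in range(k) for i in range(n))


# Toy data: three objects, two clusters.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
W = [[1, 0], [1, 0], [0, 1]]          # row i is the membership of object i
Q = [[0.0, 1.0], [10.0, 10.0]]        # one centre per cluster
cost = kmeans_cost(X, W, Q)
```

Here the first two objects each contribute 1 to the cost and the third contributes 0.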

Page 8: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-means Algorithm (cont.)

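Problem P can be solved by iteratively solving the following two problems:

Problem P1: fix $Q = \hat{Q}$ and solve the reduced problem $P(W, \hat{Q})$

$$w_{i,l} = 1 \ \text{if}\ d(X_i, Q_l) \le d(X_i, Q_t) \ \text{for}\ 1 \le t \le k, \qquad w_{i,t} = 0 \ \text{for}\ t \ne l$$

Problem P2: fix $W = \hat{W}$ and solve the reduced problem $P(\hat{W}, Q)$

$$q_{l,j} = \frac{\sum_{i=1}^{n} w_{i,l}\, x_{i,j}}{\sum_{i=1}^{n} w_{i,l}}, \quad 1 \le l \le k,\ 1 \le j \le m$$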

Page 9: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-means Algorithm (cont.)

1. Choose an initial $Q^0$ and solve $P(W, Q^0)$ to obtain $W^0$. Set t = 0

2. Let $\hat{W} = W^t$ and solve $P(\hat{W}, Q)$ to obtain $Q^{t+1}$. If $P(\hat{W}, Q^t) = P(\hat{W}, Q^{t+1})$, output $\hat{W}$, $Q^t$ and stop; otherwise, go to 3

3. Let $\hat{Q} = Q^{t+1}$ and solve $P(W, \hat{Q})$ to obtain $W^{t+1}$. If $P(W^t, \hat{Q}) = P(W^{t+1}, \hat{Q})$, output $W^t$, $\hat{Q}$ and stop; otherwise, let t = t+1 and go to 2
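The three steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the author's implementation: it assumes squared Euclidean distance, stores W as a list of cluster labels instead of a 0/1 matrix, keeps the old centre when a cluster becomes empty, and adds a `max_iter` guard. All function names are hypothetical.

```python
def squared_euclidean(x, q):
    return sum((a - b) ** 2 for a, b in zip(x, q))


def assign(X, Q):
    """Problem P1: with Q fixed, label each object with its nearest centre."""
    return [min(range(len(Q)), key=lambda l: squared_euclidean(x, Q[l])) for x in X]


def update(X, labels, Q):
    """Problem P2: with W fixed, move each centre to the mean of its members."""
    new_Q = []
    for l, old in enumerate(Q):
        members = [x for x, lab in zip(X, labels) if lab == l]
        if not members:               # assumption: an empty cluster keeps its centre
            new_Q.append(list(old))
        else:
            new_Q.append([sum(col) / len(members) for col in zip(*members)])
    return new_Q


def cost(X, labels, Q):
    """P(W, Q) with W encoded as one label per object."""
    return sum(squared_euclidean(x, Q[lab]) for x, lab in zip(X, labels))


def kmeans(X, Q0, max_iter=100):
    Q = [list(q) for q in Q0]
    labels = assign(X, Q)                              # step 1: W^0 from Q^0
    for _ in range(max_iter):
        new_Q = update(X, labels, Q)                   # step 2: Q^{t+1} from W^t
        if cost(X, labels, new_Q) == cost(X, labels, Q):
            return labels, Q
        Q = new_Q
        new_labels = assign(X, Q)                      # step 3: W^{t+1} from Q^{t+1}
        if cost(X, new_labels, Q) == cost(X, labels, Q):
            return labels, Q
        labels = new_labels
    return labels, Q


X = [[0, 0], [0, 2], [10, 10], [10, 12]]
labels, Q = kmeans(X, Q0=[[0, 0], [10, 10]])
```

The exact-equality stopping test mirrors the steps as stated; with floating-point data a small tolerance may be a safer choice.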

Page 10: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm

Using a simple matching dissimilarity measure for categorical objects

Replacing means of clusters by modes

Using a frequency-based method to find the modes

Page 11: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm( cont.)

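Dissimilarity measure

$$d_1(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)$$

where

$$\delta(x_j, y_j) = \begin{cases} 0 & (x_j = y_j) \\ 1 & (x_j \ne y_j) \end{cases}$$

Mode of a set

A mode of X = {X1, X2, ..., Xn} is a vector Q = [q1, q2, ..., qm] that minimises

$$D(X, Q) = \sum_{i=1}^{n} d_1(X_i, Q)$$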
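The simple matching dissimilarity counts the attributes on which two categorical objects differ, and D(X, Q) sums that count over every object in the set. A minimal sketch, with hypothetical function names and made-up example records:

```python
def delta(a, b):
    """0 when the two category values match, 1 otherwise."""
    return 0 if a == b else 1


def matching_dissimilarity(x, y):
    """d1(X, Y): number of attributes on which x and y disagree."""
    return sum(delta(a, b) for a, b in zip(x, y))


def total_dissimilarity(objects, q):
    """D(X, Q): sum of d1(Xi, Q) over all objects in the set."""
    return sum(matching_dissimilarity(x, q) for x in objects)


a = ["red", "small", "round"]
b = ["red", "large", "round"]
d_ab = matching_dissimilarity(a, b)          # the objects differ only in size
D = total_dissimilarity([a, b, a], q=a)      # only b contributes to the sum
```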

Page 12: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm( cont.)

Find a mode for a set

Let $n_{c_{k,j}}$ be the number of objects having the k-th category $c_{k,j}$ in attribute $A_j$, and let

$$f_r(A_j = c_{k,j} \mid X) = \frac{n_{c_{k,j}}}{n}$$

be the relative frequency of category $c_{k,j}$ in X

Theorem 1

D(X,Q) is minimised iff

$$f_r(A_j = q_j \mid X) \ge f_r(A_j = c_{k,j} \mid X) \quad \text{for}\ q_j \ne c_{k,j},\ \text{for all}\ j = 1, \ldots, m$$
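Theorem 1 says a mode can be built attribute by attribute by taking a most frequent category in each column. The sketch below does this with `collections.Counter`; the function names and toy records are hypothetical, and ties are broken by whichever category `Counter` returns first, which is one arbitrary but valid choice.

```python
from collections import Counter


def find_mode(objects):
    """Pick a most frequent category per attribute (one valid mode of the set)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*objects)]


def total_dissimilarity(objects, q):
    """D(X, Q) under simple matching: count of mismatched attribute values."""
    return sum(sum(a != b for a, b in zip(x, q)) for x in objects)


objects = [["a", "x"], ["a", "y"], ["b", "y"]]
mode = find_mode(objects)                    # "a" wins column 0, "y" wins column 1
D_mode = total_dissimilarity(objects, mode)
D_other = total_dissimilarity(objects, ["b", "x"])
```

Note that the mode need not be one of the objects in the set, although in this example it happens to be.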

Page 13: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm( cont.)

Two initial mode selection methods

1. Select the first K distinct records from the data set as the K modes

2. Select the K modes by a frequency-based method

Page 14: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm( cont.)

The total cost P is calculated against the whole data set each time a new Q or W is obtained

$$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,l}\, \delta(x_{i,j}, q_{l,j})$$

where $w_{i,l} \in W$ and $Q_l = [q_{l,1}, q_{l,2}, \ldots, q_{l,m}] \in Q$

Page 15: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm( cont.)

1. Select K initial modes, one for each cluster

2. Allocate each object to the cluster whose mode is nearest to it. Update the mode of the cluster after each allocation according to Theorem 1

Page 16: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-mode Algorithm( cont.)

3. After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object's nearest mode belongs to another cluster, reallocate the object to that cluster and update the modes of both clusters

4. Repeat 3 until no object has changed clusters
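Steps 1 to 4 can be sketched as below. This is an illustrative reading of the steps, not the author's code: it assumes initial modes are the first K distinct records, keeps an object in its current cluster on ties, leaves the mode of an emptied cluster unchanged, and adds a `max_passes` guard. All names are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(x, q):
    return sum(a != b for a, b in zip(x, q))


def mode_of(members):
    """Most frequent category per attribute (Theorem 1)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*members)]


def nearest(x, modes):
    return min(range(len(modes)), key=lambda l: matching_dissimilarity(x, modes[l]))


def kmodes(X, k, max_passes=100):
    # Step 1: take the first k distinct records as the initial modes.
    modes = []
    for x in X:
        if list(x) not in modes:
            modes.append(list(x))
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("fewer than k distinct records")
    # Step 2: allocate each object, updating the receiving cluster's mode each time.
    labels, clusters = [None] * len(X), [[] for _ in range(k)]
    for i, x in enumerate(X):
        l = nearest(x, modes)
        labels[i] = l
        clusters[l].append(i)
        modes[l] = mode_of([X[j] for j in clusters[l]])
    # Steps 3 and 4: reallocate until a full pass moves no object.
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(X):
            old, new = labels[i], nearest(x, modes)
            # Assumption: move only when the other mode is strictly closer.
            if matching_dissimilarity(x, modes[new]) < matching_dissimilarity(x, modes[old]):
                clusters[old].remove(i)
                clusters[new].append(i)
                labels[i] = new
                modes[new] = mode_of([X[j] for j in clusters[new]])
                if clusters[old]:
                    modes[old] = mode_of([X[j] for j in clusters[old]])
                moved = True
        if not moved:
            break
    return labels, modes


X = [["a", "x", "p"], ["b", "y", "q"], ["a", "x", "q"],
     ["b", "y", "p"], ["a", "x", "p"], ["b", "y", "q"]]
labels, modes = kmodes(X, k=2)
```

Like k-means, this procedure depends on the initial modes and on the order of the records.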

Page 17: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-prototypes Algorithm

Integrates the k-means and k-modes algorithms to cluster mixed-type objects

$$[A_1^r, A_2^r, \ldots, A_p^r, A_{p+1}^c, \ldots, A_m^c]$$

m is the number of attributes; the first p attributes are numeric and the rest are categorical

Page 18: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-prototypes Algorithm( cont.)

$$d_2(X, Y) = \sum_{j=1}^{p} (x_j - y_j)^2 + \gamma \sum_{j=p+1}^{m} \delta(x_j, y_j)$$

The first term is the squared Euclidean distance measure on the numeric attributes and the second term is the simple matching dissimilarity measure on the categorical attributes

The weight $\gamma$ is used to avoid favouring either type of attribute
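The mixed dissimilarity adds the squared differences of the first p (numeric) attributes to a weighted count of mismatches on the remaining (categorical) attributes. A minimal sketch; the function name, the example records, and the weight value 0.5 are illustrative assumptions only.

```python
def mixed_dissimilarity(x, y, p, gamma):
    """Squared distance on the first p attributes plus gamma times the mismatch count."""
    numeric = sum((a - b) ** 2 for a, b in zip(x[:p], y[:p]))
    categorical = sum(a != b for a, b in zip(x[p:], y[p:]))
    return numeric + gamma * categorical


x = [1.0, 2.0, "red", "yes"]
y = [2.0, 4.0, "red", "no"]
d = mixed_dissimilarity(x, y, p=2, gamma=0.5)   # numeric part 1 + 4, one mismatch
```

Setting the weight to 0 reduces the measure to the numeric term alone, while a very large weight makes the categorical mismatches dominate.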

Page 19: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-prototypes Algorithm( cont.)

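Cost function

Minimise

$$P(W, Q) = \sum_{l=1}^{k} \left( \sum_{i=1}^{n} w_{i,l} \sum_{j=1}^{p} (x_{i,j} - q_{l,j})^2 + \gamma \sum_{i=1}^{n} w_{i,l} \sum_{j=p+1}^{m} \delta(x_{i,j}, q_{l,j}) \right)$$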
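One way to read this cost function: each cluster prototype holds means for the numeric attributes and modes for the categorical ones, and the total cost adds the numeric and weighted categorical parts over all objects. The sketch below assumes W is stored as one cluster label per object; the names `prototype` and `kprototypes_cost`, the data, and the weight 0.5 are hypothetical.

```python
from collections import Counter


def prototype(members, p):
    """Mean of each numeric attribute, most frequent category of each categorical one."""
    columns = list(zip(*members))
    means = [sum(col) / len(col) for col in columns[:p]]
    modes = [Counter(col).most_common(1)[0][0] for col in columns[p:]]
    return means + modes


def kprototypes_cost(X, labels, Q, p, gamma):
    """P(W, Q) with labels[i] = l standing for w[i][l] = 1."""
    total = 0.0
    for x, l in zip(X, labels):
        q = Q[l]
        total += sum((a - b) ** 2 for a, b in zip(x[:p], q[:p]))
        total += gamma * sum(a != b for a, b in zip(x[p:], q[p:]))
    return total


# One numeric attribute followed by one categorical attribute (p = 1).
X = [[1.0, "a"], [3.0, "a"], [2.0, "b"], [10.0, "b"], [12.0, "b"]]
labels = [0, 0, 0, 1, 1]
Q = [prototype([x for x, l in zip(X, labels) if l == c], p=1) for c in range(2)]
cost = kprototypes_cost(X, labels, Q, p=1, gamma=0.5)
```

In this example the third object is the only categorical mismatch, so the weight scales a single unit of the cost.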

Page 20: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-prototypes Algorithm( cont.)

Choose clusters

Modify the mode

Page 21: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


K-prototypes Algorithm( cont.)

Modify the mode

Page 22: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Experiment

K-modes: the first data set was the soybean disease data set, with 4 diseases and 47 instances {D=10, C=10, R=10, p=17}, and 21 attributes

K-prototypes: the second data set was the credit approval data set, with 2 classes and 666 instances {approval=299, reject=367}, and 6 numeric and 9 categorical attributes

Page 23: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Experiment( cont.)

Page 24: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Experiment( cont.)

Page 25: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Experiment( cont.)

Page 26: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Conclusion

The k-modes algorithm is faster than the k-means and k-prototypes algorithms because it needs fewer iterations to converge

How many clusters are in the data?

The weight $\gamma$ adds an additional problem

Page 27: Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values


Personal opinion

Conceptual inclusion relationships

Outlier problem

Massive data sets cause efficiency problems