clustering categorical data


  • 7/28/2019 Clustering Categorical Data

    1/29

    Clustering Categorical Data

Steven X. Wang
Department of Mathematics and Statistics

    York University

    April 11, 2005


    Presentation Outline

Brief literature review
Some new algorithms for categorical data
Challenges in clustering categorical data
Future work and discussions


    Algorithms for Continuous Data

There are many clustering algorithms proposed in the literature:

1. K-means
2. EM algorithm
3. Hierarchical clustering
4. CLARANS
5. OPTICS


    Algorithms for Categorical Data

K-modes (a modification of K-means)
AutoClass (based on the EM algorithm)
ROCK and CLOPE

There are only a handful of algorithms for clustering categorical data.


    Categorical Data Structure

Categorical data has a different structure than continuous data.

Distance functions for continuous data may not be applicable to categorical data.

Algorithms for clustering continuous data cannot be applied directly to categorical data.


K-means for Clustering Continuous Data

K-means is one of the oldest and most widely used algorithms for clustering continuous data.

1) Choose the number of clusters and initialize the cluster centers.

2) Iterate until a selected convergence criterion is reached.

3) Computational complexity: O(n).
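The two-step loop above can be sketched as follows. This is a minimal illustration, not a production implementation; the function name, the random-row initialization, and the "centers stop moving" stopping rule are our own illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Minimal k-means sketch: initialize k centers, then alternate
    assignment and center-update steps until the centers stop moving."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Initialize centers from k distinct random rows of X.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence criterion
            break
        centers = new_centers
    return labels, centers
```

Each iteration touches every point once, which is where the O(n) per-iteration cost comes from.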


    Categorical Sample Space

Assume that the data set is stored in an n×p matrix, where n is the number of observations and p is the number of categorical variables.

The sample space consists of all possible combinations generated by the p variables.

The sample space is discrete and has no natural origin.
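The size of this sample space grows multiplicatively: it is the product of the number of levels of each of the p variables. A one-line illustration (the function name is ours):

```python
from math import prod

def sample_space_size(levels):
    """Number of cells in the categorical sample space, where `levels`
    lists the number of categories of each of the p variables."""
    return prod(levels)

# Three variables with 2, 3, and 4 levels span 2 * 3 * 4 = 24 combinations:
# sample_space_size([2, 3, 4]) -> 24
```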


    K-modes for Categorical Data

K-modes has exactly the same structure as k-means, i.e., choose k cluster modes and iterate until convergence.

K-modes has a fundamental flaw: the partition is sensitive to the input order, i.e., the clustering results can differ for the same data set if the input order is different.
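A minimal sketch of that shared structure (illustrative, not Huang's original K-modes; the helper names are ours). Seeding the modes from the first k records also makes the input-order sensitivity visible: permuting the rows changes the starting modes and hence, possibly, the final partition.

```python
from collections import Counter

def hamming(x, y):
    """Number of attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Attribute-wise mode of a cluster of categorical tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def kmodes(data, k, n_iter=100):
    """Minimal k-modes sketch: the k-means loop with Hamming distance
    in place of Euclidean distance and modes in place of means."""
    modes = list(data[:k])          # order-dependent initialization
    for _ in range(n_iter):
        # Assignment step: each record goes to its nearest mode.
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda j: hamming(x, modes[j]))
            clusters[j].append(x)
        # Update step: each mode becomes its cluster's attribute-wise mode.
        new_modes = [mode_of(c) if c else modes[j] for j, c in enumerate(clusters)]
        if new_modes == modes:      # convergence: modes stopped changing
            break
        modes = new_modes
    return clusters, modes
```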


    AutoClass Algorithm

This is an algorithm applicable to both continuous and categorical data.

It is a model-based algorithm that does not require the number of clusters as input.

Computational complexity: O(n log n).

The EM algorithm converges slowly and is sensitive to the initial values.


    Hamming Distance and CD vector

Hamming distance measures the number of attributes on which two categorical records differ.

Hamming distance has been used for clustering categorical data in algorithms similar to K-modes.

We construct the Categorical Distance (CD) vector to project the sample space into a 1-dimensional space.
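In code, the Hamming distance between two equal-length categorical records is simply a count of disagreements:

```python
def hamming(x, y):
    """Number of attributes on which two categorical records disagree."""
    assert len(x) == len(y), "records must have the same number of attributes"
    return sum(a != b for a, b in zip(x, y))

# Two records differing in the second and third attributes:
# hamming(("red", "small", "round"), ("red", "large", "oval")) -> 2
```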


    Example of a CD vector

[Figure: example CD vector]


More on CD Vectors

The dense region of the CD vector is not necessarily a cluster!

We can construct many CD vectors on one data set by choosing different origins.

[Figure: CD vector]


UCD: Expected CD Vector under the Null

[Figure: UCD vector]


[Figure: CD vector vs. UCD vector]


    CD Algorithm

Find a cluster center;

Construct the CD vector given the current center;

Perform a modified Chi-square test;

If we reject the null, determine the radius of the current cluster and extract the cluster;

Repeat until we fail to reject the null.
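The control flow of these steps can be sketched as below. This is only a schematic outline under our own naming: `find_center`, `cd_vector`, `chi_square_reject`, and `cluster_radius` are hypothetical stand-ins for the center search, CD-vector construction, modified Chi-square test, and radius selection that the slides describe but do not specify.

```python
def hamming(x, y):
    """Number of attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(x, y))

def cd_cluster(data, find_center, cd_vector, chi_square_reject, cluster_radius):
    """Schematic CD-algorithm loop: peel off one cluster per iteration
    until the modified Chi-square test no longer rejects the null."""
    clusters, remaining = [], list(data)
    while remaining:
        center = find_center(remaining)
        cd = cd_vector(remaining, center)   # CD vector for the current center
        if not chi_square_reject(cd):       # fail to reject: no cluster left
            break
        r = cluster_radius(cd)              # radius of the current cluster
        in_cluster = [x for x in remaining if hamming(x, center) <= r]
        clusters.append(in_cluster)         # extract the cluster
        remaining = [x for x in remaining if hamming(x, center) > r]
    if remaining:                           # points not assigned to any cluster
        clusters.append(remaining)
    return clusters
```

Note there is no convergence loop inside an iteration: each pass extracts one cluster outright, which is why the method needs no convergence criterion.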


Numerical Comparison with K-modes and AutoClass

                     CD   AutoClass       K-modes
No. of clusters       4       4      [3]   [4]   [5]
____________________________________________________
Classif. rates     100%    100%      75%   84%   82%
  Variations         0%      0%       6%   15%   10%
Inform. gain       100%    100%      67%   84%   93%
  Variations         0%      0%      10%   15%   11%
____________________________________________________

Soybean Data: n=47 and p=35. Number of clusters = 4.


Numerical Comparison with K-modes and AutoClass

                     CD   AutoClass       K-modes
No. of clusters       7       3      [6]   [7]   [8]
____________________________________________________
Classif. rates      95%     73%      74%   72%   71%
  Variations         0%      0%       6%   15%   10%
Inform. gain        92%     60%      75%   79%   81%
  Variations         0%      0%       7%    6%    6%
____________________________________________________

Zoo Data: n=101 and p=16. Number of clusters = 7.


    Computational Complexity

The upper bound of the computational complexity of our algorithm is O(kpn).

It is much less computationally intensive than K-modes and AutoClass since it does not demand convergence.


    CD Algorithm

It is based on the Hamming distance.

It does not require the input of parameters.

It has no convergence criterion.

    Ref: Zhang, Wang and Song (2005). JASA. To appear.


Difficulties in Clustering Categorical Data

Distance function
Similarity measure to organize clusters
Scalability or computational complexity


    Challenge 1: Distance Function

Hamming distance is a natural and reasonable choice if the categorical scale has no natural order (nominal data).

If we apply a method designed for nominal data, such as the CD algorithm, to ordinal data, there may be a serious loss of information because the ordering is ignored.
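To make the loss concrete: Hamming distance scores "small vs. medium" and "small vs. large" identically, while an order-aware distance does not. The rank coding and the sum-of-rank-differences distance below are purely illustrative, not a proposal from the slides.

```python
def hamming(x, y):
    """Order-blind distance: count of differing attributes."""
    return sum(a != b for a, b in zip(x, y))

# Assumed rank coding for an ordinal attribute (illustrative).
SIZE = {"small": 0, "medium": 1, "large": 2}

def ordinal_dist(x, y, rank=SIZE):
    """Order-aware distance: sum of absolute rank differences."""
    return sum(abs(rank[a] - rank[b]) for a, b in zip(x, y))

# Hamming ignores the ordering entirely:
#   hamming(("small",), ("medium",)) == hamming(("small",), ("large",)) == 1
# while the ordinal distance distinguishes the two pairs:
#   ordinal_dist(("small",), ("medium",)) == 1
#   ordinal_dist(("small",), ("large",))  == 2
```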


Challenge 2: Organization of Clusters

Organization of clusters is crucial in clustering large data sets.

Similarity measures are needed to organize clusters in hierarchical clustering.

Different similarity measures will give different results.


    Challenge 3: Scalability

In practice, an approximate answer is so much better than no answer at all.

Complexity: O(n). Scalability: O(mn).

How many variables are we dealing with?


    Challenge 1:

What to do about the ordering?

Proposing a reasonable distance function for ordinal data may require a careful examination of the dependence structure.

We need to look into different measures of association for categorical data.


    Challenge 2:

A naïve measure of similarity would be the distance between two clusters. Entropy might be a good one to try even though it is not a distance function.
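One way to make that concrete (purely illustrative, not the slides' proposal): measure the entropy increase caused by merging two clusters, summed over attributes. Identical clusters merge "for free"; heterogeneous ones pay in entropy. The function names are ours.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def merge_cost(a, b):
    """Entropy increase from merging clusters a and b (lists of tuples),
    summed over attributes: an illustrative similarity surrogate. It is
    not a distance function, but lower cost means more similar clusters."""
    n_a, n_b = len(a), len(b)
    cost = 0.0
    for j in range(len(a[0])):
        col_a = [x[j] for x in a]
        col_b = [x[j] for x in b]
        merged_entropy = entropy(col_a + col_b)
        parts_entropy = (n_a * entropy(col_a) + n_b * entropy(col_b)) / (n_a + n_b)
        cost += merged_entropy - parts_entropy
    return cost
```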


    Challenge 3:

There are many hierarchical clustering algorithms available. Any clustering algorithm could be integrated into them if the distance function and similarity measure can be defined appropriately.


    Beyond Categorical Data

The ultimate goal is to cluster any data set with a complex data structure.

Mixed data types would be the next on the list. The challenge there is again the distance function (the dependence structure between the continuous part and the categorical portion).


    More Challenges

Measure of uncertainty
Hard clustering vs. soft clustering
Parallel computing


    Thank you!