dm13 clustering

Upload: saurabh-kumar

Post on 07-Jul-2018


  • 8/18/2019 Dm13 Clustering

    1/35

    Outline

    Introduction

    K-means clustering

    Hierarchical clustering: COBWEB

    Classification vs. Clustering

    Classification: Supervised learning:

    Learns a method for predicting the instance class from pre-labeled (classified) instances

    Clustering

    Unsupervised learning:

    Finds “natural” grouping of instances given un-labeled data

    Clustering Methods

    Many different methods and algorithms:

    For numeric and/or symbolic data

    Deterministic vs. probabilistic

    Exclusive vs. overlapping

    Hierarchical vs. flat

    Top-down vs. bottom-up

    Clusters: exclusive vs. overlapping

    [Figure: instances a, b, c, d, e, g, h, i, j drawn two ways. Simple 2-D representation: non-overlapping clusters. Venn diagram: overlapping clusters.]

    Clustering Evaluation

    Manual inspection

    Benchmarking on existing labels

    Cluster quality measures: distance measures

    high similarity within a cluster, low across clusters
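One way to make "high similarity within a cluster, low across clusters" concrete is to compare the average distance between pairs in the same cluster with the average distance between pairs in different clusters. The sketch below is only an illustration: the function name `within_and_between`, the toy points and the two labelings are assumptions, not part of the slides.

```python
from itertools import combinations
from math import dist


def within_and_between(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: the pair counts as "within"; otherwise as "between".
        (within if lp == lq else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = within_and_between(points, [0, 0, 1, 1])  # groups kept together
bad = within_and_between(points, [0, 1, 0, 1])   # groups split up
print("good clustering:", good)
print("bad clustering: ", bad)
```

A clustering that respects the groups should show a small first number and a large second one; the split labeling reverses that relationship.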

    The distance function

    Simplest case: one numeric attribute A

    Distance(X,Y) = |A(X) - A(Y)|

    Several numeric attributes: Distance(X,Y) = Euclidean distance between X,Y

    Nominal attributes: distance is set to 1 if values are different, 0 if they are equal

    Are all attributes equally important?

    Weighting the attributes might be necessary
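The rules above can be combined into a single function for instances that mix numeric and nominal attributes. This is a minimal sketch under stated assumptions: the name `distance`, the `weights` parameter and the choice to put weights inside the squared sum are illustrative, and numeric attributes are assumed to be on comparable scales already.

```python
from math import sqrt


def distance(x, y, weights=None):
    """Weighted Euclidean-style distance between two instances.

    Numeric attributes (int or float) contribute their difference.
    Any other attribute is treated as nominal: 0 if equal, 1 if different.
    """
    if weights is None:
        weights = [1.0] * len(x)  # all attributes equally important
    total = 0.0
    for a, b, w in zip(x, y, weights):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            diff = a - b
        else:
            diff = 0.0 if a == b else 1.0
        total += w * diff * diff
    return sqrt(total)


print(distance((1.0, 2.0), (4.0, 6.0)))                             # numeric only
print(distance(("Sunny", 3.0), ("Rainy", 3.0)))                     # one nominal mismatch
print(distance(("Sunny", 3.0), ("Rainy", 3.0), weights=[4.0, 1.0])) # mismatch weighted up
```

Raising the weight of an attribute makes differences in that attribute count for more, which is one way to express that not all attributes are equally important.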

    Simple Clustering: K-means

    Works with numeric data only

    1) Pick a number (K) of cluster centers (at random)

    2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)

    3) Move each cluster center to the mean of its assigned items

    4) Repeat steps 2,3 until convergence (change in cluster assignments less than a threshold)
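The K-means loop described on this slide can be sketched in a few lines of plain Python. The function name `kmeans`, the `seed` and `max_iter` parameters and the toy points are assumptions made for illustration. As a simplification, the sketch stops when no assignment changes at all rather than when the change falls below a threshold.

```python
import random
from math import dist


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on tuples of numbers; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick K items at random as initial centers
    assignments = None
    for _ in range(max_iter):
        # Assign every item to its nearest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:  # converged: nothing changed
            break
        assignments = new_assignments
        # Move each center to the mean of the items assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```

Cluster numbers are arbitrary labels: which group ends up as cluster 0 depends on the random initial centers.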

    K-means example, step 1

    [Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]

    Pick 3 initial cluster centers (randomly)

    K-means example, step 2

    [Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]

    Assign each point to the closest cluster center

    K-means example, step 3

    [Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3 moving]

    Move each cluster center to the mean of each cluster

    K-means example, step 4b

    [Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]

    Re-compute cluster means

    K-means example, step 5

    [Figure: scatter plot on X and Y axes with cluster centers k1, k2, k3]

    Move cluster centers to cluster means

    Discussion, 1

    What can be the problems with K-means clustering?

    Discussion, 2

    Result can vary significantly depending on initial choice of seeds (number and position)

    Can get trapped in a local minimum

    Example:

    [Figure: instances and initial cluster centers]

    Q: What can be done?

    Discussion, 3

    A: To increase the chance of finding the global optimum: restart with different random seeds.
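A restart strategy needs some way to pick the best run. A common choice, assumed here rather than stated on the slide, is the within-cluster sum of squared errors (SSE). The sketch below uses one-dimensional data to stay short; `kmeans_1d`, `best_of_restarts`, the `restarts` count and the sample values are all hypothetical.

```python
import random


def kmeans_1d(values, k, rng, max_iter=100):
    """One K-means run on plain numbers; returns (sse, sorted centers)."""
    centers = rng.sample(values, k)  # random initial seeds for this run
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    # SSE: squared distance of every value to its nearest center.
    sse = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sse, sorted(centers)


def best_of_restarts(values, k, restarts=20, seed=0):
    """Run K-means several times with different seeds; keep the lowest SSE."""
    rng = random.Random(seed)
    return min(kmeans_1d(values, k, rng) for _ in range(restarts))


values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]
best_sse, best_centers = best_of_restarts(values, k=3)
print(best_sse, best_centers)
```

Restarting only raises the chance of reaching the global optimum; it does not guarantee it, so the number of restarts is a trade-off between run time and confidence.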

    K-means clustering summary

    Advantages

    Simple, understandable

    Items automatically assigned to clusters

    Disadvantages

    Must pick number of clusters beforehand

    All items forced into a cluster

    Too sensitive to outliers

    K-means clustering - outliers?

    What can be done about outliers?

    K-means variations

    K-medoids: instead of mean, use medians of each cluster

    Mean of 1,
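The reason a median-based center helps with outliers can be seen on a handful of numbers. The values below are hypothetical and chosen only for illustration, and `median_center` is an assumed helper that takes the median of each attribute separately. Strictly speaking, a medoid is an actual data item; the per-attribute median used here follows the slide's wording.

```python
from statistics import mean, median

values = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 1000]  # last value replaced by an extreme one

# The mean moves from 6 to 204, while the median stays at 6.
print(mean(values), median(values))
print(mean(with_outlier), median(with_outlier))


def median_center(members):
    """Cluster center computed as the median of each attribute."""
    return tuple(median(col) for col in zip(*members))


print(median_center([(1.0, 10.0), (2.0, 20.0), (100.0, 30.0)]))
```

Swapping this center update into a K-means loop in place of the mean leaves the rest of the algorithm unchanged.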

    Hierarchical clustering

    Bottom up

    Start with single-instance clusters

    At each step, join the two closest clusters

    Design decision: distance between clusters

    E.g. two closest instances in clusters vs. distance between means

    Top down

    Start with one universal cluster

    Find two clusters

    Proceed recursively on each subset

    Can be very fast

    Both methods produce a dendrogram

    [Figure: dendrogram with leaves g a c i e d k b j f h]
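The bottom-up procedure can be sketched as follows, with the cluster-distance design decision passed in as a function. `single_link` uses the two closest instances and `mean_link` uses the distance between means, matching the two options named on the slide. All function names and the toy points are assumptions, and the quadratic search for the closest pair is kept deliberately naive.

```python
from math import dist


def single_link(a, b):
    """Cluster distance = distance between the two closest instances."""
    return min(dist(p, q) for p in a for q in b)


def mean_link(a, b):
    """Cluster distance = distance between the cluster means."""
    def center(c):
        return tuple(sum(col) / len(c) for col in zip(*c))
    return dist(center(a), center(b))


def agglomerate(points, cluster_distance=single_link):
    """Bottom-up clustering; returns the list of merges in order."""
    clusters = [[p] for p in points]  # start with single-instance clusters
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


points = [(0.0,), (1.0,), (10.0,), (12.0,)]
merges = agglomerate(points)
for left, right in merges:
    print(left, "+", right)
```

The merge list read from first to last corresponds to a dendrogram read from the leaves up to the root.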

    Incremental clustering

    Heuristic approach (COBWEB/CLASSIT)

    Form a hierarchy of clusters incrementally

    Start:

    tree consists of empty root node

    Then:

    add instances one by one

    update tree appropriately at each stage

    to update, find the right leaf for an instance

    May involve restructuring the tree

    Base update decisions on category utility

    Clustering weather data

    ID  Outlook   Temp.  Humidity  Windy
    A   Sunny     Hot    High      False
    B   Sunny     Hot    High      True
    C   Overcast  Hot    High      False
    D   Rainy     Mild   High      False
    E   Rainy     Cool   Normal    False
    F   Rainy     Cool   Normal    True
    G   Overcast  Cool   Normal    True
    H   Sunny     Mild   High      False
    I   Sunny     Cool   Normal    False
    J   Rainy     Mild   Normal    False
    K   Sunny     Mild   Normal    True
    L   Overcast  Mild   High      True
    M   Overcast  Hot    Normal    False
    N   Rainy     Mild   High      True

    [Figure: cluster tree, panels 1, 2, 3]

    Clustering weather data (continued)

    [Figure: cluster tree, panels 3, 4, 5, shown next to the same weather table]

    Merge best host and runner-up

    Consider splitting the best host if merging doesn't help

    Final hierarchy

    ID  Outlook   Temp.  Humidity  Windy
    A   Sunny     Hot    High      False
    B   Sunny     Hot    High      True
    C   Overcast  Hot    High      False
    D   Rainy     Mild   High      False

    Oops, a and b are actually very similar

    Example: the iris data (subset)

    Clustering with cutoff

    Category utility

    Category utility: quadratic loss function defined on conditional probabilities:

    $$CU(C_1, C_2, \ldots, C_k) = \frac{\sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}$$

    Every instance in a different category: the numerator becomes its maximum,

    $$n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2$$

    where n is the number of attributes
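To see how category utility is evaluated for a given partition, the sketch below estimates every probability by counting. It assumes nominal attributes only, and the function name `category_utility` and the chosen partition are illustrative. The four rows are instances A to D of the weather data.

```python
from collections import Counter


def category_utility(clusters):
    """CU of a partition; each cluster is a list of equal-length tuples."""
    instances = [x for c in clusters for x in c]
    total, k = len(instances), len(clusters)
    n_attrs = len(instances[0])

    def sum_sq(rows):
        # Sum over attributes i and values j of Pr[a_i = v_ij]^2 within rows.
        return sum(
            (count / len(rows)) ** 2
            for i in range(n_attrs)
            for count in Counter(r[i] for r in rows).values()
        )

    base = sum_sq(instances)  # unconditional term, same for every cluster
    # Pr[C_l] is estimated as len(c) / total.
    return sum(len(c) / total * (sum_sq(c) - base) for c in clusters) / k


# Weather instances A-D: (Outlook, Temp., Humidity, Windy)
A = ("Sunny", "Hot", "High", False)
B = ("Sunny", "Hot", "High", True)
C = ("Overcast", "Hot", "High", False)
D = ("Rainy", "Mild", "High", False)

cu_pairs = category_utility([[A, B], [C, D]])
cu_single = category_utility([[A, B, C, D]])
print(cu_pairs, cu_single)
```

Putting everything into one cluster gives a utility of zero, because the conditional probabilities then equal the unconditional ones.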

    Overfitting-avoidance heuristic

    If every instance gets put into a different category, the numerator of CU becomes maximal:

    $$n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2$$

    where n is the number of attributes.

    So without k in the denominator of the CU formula, every cluster would consist of one instance
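The maximal numerator can be checked with a short derivation. It assumes N instances in total (N is a symbol introduced here, not on the slide), each placed in its own cluster, and takes n to be the number of attributes.

```latex
\begin{align*}
\Pr[C_l] &= \tfrac{1}{N}, \qquad k = N, \\
\sum_i \sum_j \Pr[a_i = v_{ij} \mid C_l]^2 &= n
  \quad \text{(per attribute, one value has probability 1 and the rest 0)}, \\
\text{numerator} &= \sum_{l=1}^{N} \tfrac{1}{N}
  \Bigl( n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2 \Bigr)
  = n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2 .
\end{align*}
```

Dividing by k = N is what penalizes this degenerate solution: the numerator is as large as it can be, but it is spread over as many clusters as there are instances.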

    Other Clustering Approaches

    EM: probability based clustering

    Bayesian clustering

    SOM: self-organizing maps

    …

    Discussion

    Can interpret clusters by using supervised learning

    learn a classifier based on clusters

    Decrease dependence between attributes?

    pre-processing step

    E.g. use principal component analysis

    Can be used to fill in missing values

    Key advantage of probabilistic clustering:

    Can estimate likelihood of data

    Use it to compare different models objectively

    Examples of Clustering Applications

    Marketing: discover customer groups and use them for targeted marketing and re-organization

    Astronomy: find groups of similar stars and galaxies

    Earthquake studies: observed earthquake epicenters should be clustered along continent faults

    Genomics: finding groups of genes with similar expressions

    …

    Clustering Summary

    unsupervised

    many approaches

    K-means: simple, sometimes useful

    K-medoids is less sensitive to outliers

    Hierarchical clustering: works for symbolic attributes

    Evaluation is a problem