Lesson 9: Clustering



Chapter 3: Cluster Analysis

3.1 Basic Concepts of Clustering

3.2 Partitioning Methods

3.3 Hierarchical Methods

3.3.1 The Principle

3.3.2 Agglomerative and Divisive Clustering

3.3.3 BIRCH

3.3.4 ROCK

3.4 Density-based Methods

3.4.1 The Principle

3.4.2 DBSCAN

3.4.3 OPTICS

3.5 Clustering High-Dimensional Data

3.6 Outlier Analysis


3.3.1 The Principle

Group data objects into a tree of clusters.

Hierarchical methods can be:

Agglomerative: bottom-up approach

Divisive: top-down approach

Hierarchical clustering has no backtracking: if a particular merge or split turns out to be a poor choice, it cannot be corrected.


3.3.2 Agglomerative and Divisive

Agglomerative Hierarchical Clustering

Bottom-up strategy. Each cluster starts with only one object. Clusters are merged into larger and larger clusters until:

All the objects are in a single cluster, or

Certain termination conditions are satisfied.

Divisive Hierarchical Clustering

Top-down strategy. Start with all objects in one cluster. Clusters are subdivided into smaller and smaller clusters until:

Each object forms a cluster on its own, or

Certain termination conditions are satisfied.


    Example

Agglomerative and divisive algorithms on a data set of five objects {a, b, c, d, e}.

[Figure: dendrogram over {a, b, c, d, e}. AGNES (agglomerative) reads left to right, Step 0 to Step 4: merge a and b into {a, b}, d and e into {d, e}, then form {c, d, e}, then {a, b, c, d, e}. DIANA (divisive) performs the same sequence in reverse, reading Step 0 to Step 4 right to left.]


    Example

AGNES

Clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.

DIANA

A cluster is split according to some principle, e.g., the maximum Euclidean distance between the closest neighboring objects in the cluster.

[Figure: the same AGNES/DIANA dendrogram over {a, b, c, d, e} as on the previous slide.]
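To make the AGNES rule concrete, here is a minimal sketch (not part of the original slides) that runs single-linkage agglomerative clustering on five made-up 2-D points standing in for a–e; the coordinates and the use of SciPy are illustrative assumptions, chosen so the merges mirror the example.

```python
# Minimal AGNES-style sketch: single-linkage agglomerative clustering
# of five hypothetical 2-D points standing in for objects a..e.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([
    [1.0, 1.0],   # a
    [1.5, 1.0],   # b
    [4.5, 5.0],   # c
    [6.0, 5.0],   # d
    [6.4, 5.0],   # e
])

# method="single" = merge the pair of clusters with the minimum
# object-to-object Euclidean distance, exactly the AGNES rule above.
Z = linkage(points, method="single", metric="euclidean")
print(Z)  # merges: {d,e}, {a,b}, {c,d,e}, {a,b,c,d,e}
```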


Distance Between Clusters

First measure: Minimum distance

$d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$

where |p − p′| is the distance between two objects p and p′.

Use cases:

An algorithm that uses the minimum distance to measure the distance between clusters is sometimes called a nearest-neighbor clustering algorithm.

If the clustering process terminates when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm.

An agglomerative algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.


Distance Between Clusters

Second measure: Maximum distance

$d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$

where |p − p′| is the distance between two objects p and p′.

Use cases:

An algorithm that uses the maximum distance to measure the distance between clusters is sometimes called a farthest-neighbor clustering algorithm.

If the clustering process terminates when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.


Distance Between Clusters

Minimum and maximum distances are extremes, implying that they are overly sensitive to outliers or noisy data.

Third measure: Mean distance

$d_{\text{mean}}(C_i, C_j) = |m_i - m_j|$

where m_i and m_j are the means of clusters C_i and C_j respectively.

Fourth measure: Average distance

$d_{\text{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$

where |p − p′| is the distance between two objects p and p′, and n_i and n_j are the numbers of objects in clusters C_i and C_j respectively.

The mean is difficult to compute for categorical data.
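The four measures are easy to compare side by side. The following sketch (illustrative, with arbitrarily chosen example points) computes all four for two small clusters:

```python
# Comparison of the four inter-cluster distance measures defined above,
# for two small example clusters (points chosen arbitrarily).
import numpy as np

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 3.0], [5.0, 3.0]])

# All pairwise distances |p - p'| with p in Ci and p' in Cj.
pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

d_min  = pairwise.min()   # nearest-neighbor / single-linkage distance
d_max  = pairwise.max()   # farthest-neighbor / complete-linkage distance
d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # |m_i - m_j|
d_avg  = pairwise.sum() / (len(Ci) * len(Cj))                # average distance

print(d_min, d_max, d_mean, d_avg)
```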


Challenges & Solutions

It is difficult to select merge or split points.

No backtracking.

Hierarchical clustering does not scale well: it examines a good number of objects before any decision to split or merge.

One promising direction to solve these problems is to combine hierarchical clustering with other clustering techniques: multiple-phase clustering.


3.3.3 BIRCH

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

Agglomerative clustering designed for clustering a large amount of numerical data.

What does the BIRCH algorithm try to solve?

Most of the existing algorithms DO NOT consider the case that datasets can be too large to fit in main memory.

They DO NOT concentrate on minimizing the number of scans of the dataset: I/O costs are very high.

The complexity of BIRCH is O(n), where n is the number of objects to be clustered.


    BIRCH: The Idea by example

[Figure: clustering process (building a tree) over data objects 1–6. Object 1 is inserted and starts Cluster 1 in a leaf node; object 2 arrives.]

If cluster 1 becomes too large (not compact) by adding object 2, then split the cluster.


    BIRCH: The Idea by example

[Figure: after the split, the leaf node holds two entries, entry 1 and entry 2, pointing to Cluster 1 and Cluster 2.]

Leaf node with two entries.


    BIRCH: The Idea by example

[Figure: object 3 arrives at the leaf node holding entry 1 and entry 2.]

entry 1 is the closest to object 3.

If cluster 1 becomes too large by adding object 3, then split the cluster.


    BIRCH: The Idea by example

[Figure: the leaf node now holds entry 1, entry 2, and entry 3; Cluster 3 is created for object 3.]

Leaf node with three entries.


    BIRCH: The Idea by example

[Figure: object 4 arrives at the leaf node holding entry 1, entry 2, and entry 3.]

entry 3 is the closest to object 4.

Cluster 2 remains compact when adding object 4, so add object 4 to cluster 2.


    BIRCH: The Idea by example

[Figure: object 5 arrives at the leaf node.]

entry 2 is the closest to object 5.

Cluster 3 becomes too large by adding object 5, so split cluster 3? BUT there is a limit to the number of entries a node can have. Thus, split the node.


    BIRCH: The Idea by example

[Figure: the node is split. A non-leaf node with entry 1 and entry 2 now points to two leaf nodes, one holding entry 1.1 and entry 1.2, the other holding entry 2.1 and entry 2.2; Cluster 4 is created for object 5.]


    BIRCH: The Idea by example

[Figure: object 6 arrives at the tree from the previous slide.]

entry 1.2 is the closest to object 6.

Cluster 3 remains compact when adding object 6, so add object 6 to cluster 3.


    BIRCH: Key Components

Clustering Feature (CF)

Summary of the statistics for a given cluster: the 0th, 1st, and 2nd moments of the cluster from the statistical point of view.

Used to compute centroids, and to measure the compactness and distance of clusters.

CF-Tree

A height-balanced tree with two parameters:

the number of entries in each node

the diameter of all entries in a leaf node

Leaf nodes are connected via prev and next pointers.


    Clustering Feature

Clustering Feature (CF): CF = (N, LS, SS)

N: number of data points

LS: linear sum of the N points: $\sum_{i=1}^{N} X_i$

SS: square sum of the N points: $\sum_{i=1}^{N} X_i^2$

Example:

Cluster 1: (2,5), (3,2), (4,3)

CF1 = ⟨3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)⟩ = ⟨3, (9, 10), (29, 38)⟩

Cluster 2: CF2 = ⟨3, (35, 36), (417, 440)⟩

Cluster 3 = Cluster 1 ∪ Cluster 2:

CF3 = CF1 + CF2 = ⟨3+3, (9+35, 10+36), (29+417, 38+440)⟩ = ⟨6, (44, 46), (446, 478)⟩
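The additivity used in CF3 = CF1 + CF2 is easy to verify in code. A minimal sketch (helper names are mine) that reproduces the numbers above:

```python
# Sketch of the CF = (N, LS, SS) summary and its additivity.
import numpy as np

def cf(points):
    """Clustering Feature of a set of points: (N, linear sum, square sum)."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def cf_merge(cf_a, cf_b):
    """Additivity: the CF of a merged cluster is the component-wise sum."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

cf1 = cf([(2, 5), (3, 2), (4, 3)])
print(cf1)                 # (3, [9. 10.], [29. 38.])
# CF2 = (3, (35, 36), (417, 440)) as given on the slide:
cf2 = (3, np.array([35.0, 36.0]), np.array([417.0, 440.0]))
print(cf_merge(cf1, cf2))  # (6, [44. 46.], [446. 478.])
```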


    Properties of Clustering Feature

A CF entry is a summary of the statistics of the cluster: a representation of the cluster.

A CF entry has sufficient information to calculate the centroid, radius, diameter, and many other distance measures.

The additivity theorem allows us to merge sub-clusters incrementally.


Distance Measures

Given a cluster with n data points:

Centroid: $x_0 = \frac{\sum_{i=1}^{n} x_i}{n}$

Radius: average distance from any point of the cluster to its centroid: $R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}$

Diameter: square root of the average mean squared distance between all pairs of points in the cluster: $D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}$
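Each of these quantities can be computed from a CF entry alone, using the identities Σᵢ(xᵢ − x₀)² = SS − LS²/n and ΣᵢΣⱼ(xᵢ − xⱼ)² = 2n·SS − 2·LS² (applied per dimension). A small sketch with illustrative helper names:

```python
# Sketch: centroid, radius, and diameter computed directly from a
# CF = (N, LS, SS) summary, following the formulas above.
import numpy as np

def centroid(n, ls, ss):
    return ls / n

def radius(n, ls, ss):
    # sum_i (x_i - x0)^2 / n  =  SS/n - (LS/n)^2, summed over dimensions
    x0 = ls / n
    return np.sqrt(np.sum(ss / n - x0 ** 2))

def diameter(n, ls, ss):
    # sum_{i,j} (x_i - x_j)^2 = 2*n*SS - 2*LS^2, summed over dimensions
    return np.sqrt(np.sum(2 * n * ss - 2 * ls ** 2) / (n * (n - 1)))

n, ls, ss = 3, np.array([9.0, 10.0]), np.array([29.0, 38.0])  # CF1 from before
print(centroid(n, ls, ss), radius(n, ls, ss), diameter(n, ls, ss))
# -> approximately [3. 3.33], 1.49, 2.58
```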


    CF Tree

B = branching factor: the maximum number of children in a non-leaf node

T = threshold for the diameter or radius of the clusters in a leaf

L = the number of entries in a leaf

A CF entry in a parent is the sum of the CF entries of the child node of that entry.

In-memory, height-balanced tree.

[Figure: a CF tree with entries CF1, CF2, ..., CFk at the root level and at the first level below it.]


    CF Tree Insertion

Start with the root.

Find the CF entry in the root closest to the data point, move to that child, and repeat the process until the closest leaf entry is found.

At the leaf:

If the point can be accommodated in the cluster, update the entry.

If this addition violates the threshold T, split the entry; if this violates the limit imposed by L, split the leaf. If its parent node is full, split that, and so on.

Update the CF entries from the leaf to the root to accommodate this point.
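A highly simplified sketch of the leaf-level decision just described (absorb vs. new entry vs. split). The values of T and L and the split handling are illustrative assumptions; a real CF tree also descends from the root and updates ancestor entries on the way back up.

```python
# Simplified leaf insertion: absorb the point into the closest CF entry if
# the radius threshold T still holds, otherwise start a new entry; flag a
# split when the leaf exceeds L entries.
import numpy as np

T = 1.5   # radius threshold per leaf entry (assumed value)
L = 3     # maximum entries per leaf (assumed value)

def radius(n, ls, ss):
    x0 = ls / n
    return np.sqrt(max(np.sum(ss / n - x0 ** 2), 0.0))

def insert(leaf, x):
    """leaf: list of [n, ls, ss] entries; x: new point (1-D numpy array)."""
    if leaf:
        # find the entry whose centroid is closest to x
        i = min(range(len(leaf)),
                key=lambda k: np.linalg.norm(leaf[k][1] / leaf[k][0] - x))
        n, ls, ss = leaf[i]
        if radius(n + 1, ls + x, ss + x ** 2) <= T:   # point fits: absorb it
            leaf[i] = [n + 1, ls + x, ss + x ** 2]
            return leaf
    leaf.append([1, x.copy(), x ** 2])                # otherwise: new entry
    if len(leaf) > L:
        # full BIRCH reseeds two leaves from the two farthest entries;
        # here we only report that a split is needed
        print("leaf overflow: split required")
    return leaf

leaf = []
for p in [np.array([0.0, 0.0]), np.array([0.5, 0.5]), np.array([5.0, 5.0])]:
    leaf = insert(leaf, p)
print(len(leaf), "entries in leaf")  # 2: the first two points were absorbed
```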


BIRCH Algorithm

Phase 1: Load the data into memory by building a CF tree (Data → Initial CF tree).

Phase 2 (optional): Condense the tree into a desirable range by building a smaller CF tree (Initial CF tree → Smaller CF tree).

Phase 3: Global clustering (Smaller CF tree → Good clusters).

Phase 4 (optional and offline): Cluster refining (Good clusters → Better clusters).


BIRCH Algorithm: Phase 1

Choose an initial value for the threshold and start inserting the data points one by one into the tree, as per the insertion algorithm.

If, in the middle of the above step, the size of the CF tree exceeds the size of the available memory, increase the value of the threshold.

Convert the partially built tree into a new tree.

Repeat the above steps until the entire dataset is scanned and a full tree is built.

Outlier handling.


BIRCH Algorithm: Phases 2, 3, and 4

Phase 2

A bridge between Phase 1 and Phase 3.

Builds a smaller CF tree by increasing the threshold.

Phase 3

Apply a global clustering algorithm to the sub-clusters given by the leaf entries of the CF tree.

Improves clustering quality.

Phase 4

Scan the entire dataset to label the data points.

Outlier handling.


    3.3.4 ROCK: for Categorical Data

Experiments show that distance functions do not lead to high-quality clusters when clustering categorical data.

Most clustering techniques assess the similarity between points to create clusters.

At each step, points that are similar are merged into a single cluster.

This localized approach is prone to errors.

ROCK: uses links instead of distances.


Example: Compute Jaccard Coefficient

Transaction items: a, b, c, d, e, f, g. Two clusters of transactions:

Cluster 1: {a,b,c} {a,b,d} {a,b,e} {a,c,d} {a,c,e} {a,d,e} {b,c,d} {b,c,e} {b,d,e} {c,d,e}

Cluster 2: {a,b,f} {a,b,g} {a,f,g} {b,f,g}

Compute the Jaccard coefficient between transactions:

$sim(T_i, T_j) = \frac{|T_i \cap T_j|}{|T_i \cup T_j|}$

Sim({a,b,c}, {b,d,e}) = 1/5 = 0.2

The Jaccard coefficient between transactions of Cluster 1 ranges from 0.2 to 0.5.

The Jaccard coefficient between transactions belonging to different clusters can also reach 0.5:

Sim({a,b,c}, {a,b,f}) = 2/4 = 0.5
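A tiny sketch reproducing the two similarity values above:

```python
# Jaccard coefficient between two transactions (sets of items).
def jaccard(ti, tj):
    ti, tj = set(ti), set(tj)
    return len(ti & tj) / len(ti | tj)

print(jaccard("abc", "bde"))  # 1/5 = 0.2  (both in Cluster 1)
print(jaccard("abc", "abf"))  # 2/4 = 0.5  (different clusters)
```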


    Example: Using Links

    Transac tion items: a,b,c,d,e,f,g Two c lusters oftransactionsThe number of links between Ti and Tj

    is the number of commonneighbors

    Ti and Tj are neighbors if

    Sim(Ti,Tj)>

    Consider =0.5Link({a,b,f}, {a,b,g}) = 5

    (common neighbors)

    Link({a,b,f},{a,b,c})=3

    (common neighbors)

    Cluster1. {a , b , c }{a , b , d }

    {a , b , e}{a , c , d }{a , c , e}{a , d , e}{b , c , d }{b , c , e}{b , d , e}{c , d , e}

    Cluster2. {a, b , f}{a , b , g}{a, f, g }

    {b, f, g }

    Link is a better measure

    than Jac card coeffic ient
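The link counts above can be reproduced directly. The sketch below assumes a transaction does not count as its own neighbor, which matches the slide's numbers:

```python
# Links as common neighbors: two transactions are neighbors when their
# Jaccard similarity is at least theta.
from itertools import combinations

cluster1 = [frozenset(c) for c in combinations("abcde", 3)]
cluster2 = [frozenset(s) for s in ("abf", "abg", "afg", "bfg")]
transactions = cluster1 + cluster2
theta = 0.5

def jaccard(ti, tj):
    return len(ti & tj) / len(ti | tj)

def neighbors(t):
    return {u for u in transactions if u != t and jaccard(t, u) >= theta}

def link(ti, tj):
    return len(neighbors(ti) & neighbors(tj))

print(link(frozenset("abf"), frozenset("abg")))  # 5
print(link(frozenset("abf"), frozenset("abc")))  # 3
```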


    ROCK

ROCK: Robust Clustering using linKs

Major ideas

Use links to measure similarity/proximity; not distance-based.

Computational complexity: $O(n^2 + n\, m_m m_a + n^2 \log n)$, where

m_a: average number of neighbors

m_m: maximum number of neighbors

n: number of objects

Algorithm: sampling-based clustering

Draw a random sample

Cluster with links

Label the data on disk


3.4.1 The Principle

Regard clusters as dense regions in the data space separated by regions of low density.

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as a termination condition

Several interesting studies:

DBSCAN: Ester et al. (KDD'96)

OPTICS: Ankerst et al. (SIGMOD'99)

DENCLUE: Hinneburg & Keim (KDD'98)

CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based)


Basic Concepts: ε-neighborhood & Core Objects

The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.

If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Example: ε = 1 cm, MinPts = 3. m and p are core objects because their ε-neighborhoods contain at least 3 points.

[Figure: points p, m, q, with ε = 1 cm circles drawn around m and p.]


Directly Density-Reachable Objects

An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object.

Example:

q is directly density-reachable from m.

m is directly density-reachable from p, and vice versa.

[Figure: points p, m, q as before.]


Density-Reachable Objects

An object p is density-reachable from object q with respect to ε and MinPts if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts.

Example:

q is density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p.

p is not density-reachable from q, because q is not a core object.

[Figure: points p, m, q as before.]


Density-Connectivity

An object p is density-connected to object q with respect to ε and MinPts if there is an object O such that both p and q are density-reachable from O with respect to ε and MinPts.

Example: p, q, and m are all density-connected.

[Figure: points p, m, q as before.]


3.4.2 DBSCAN

Searches for clusters by checking the ε-neighborhood of each point in the database.

If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created.

DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve merging a few density-reachable clusters.

The process terminates when no new point can be added to any cluster.
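A compact sketch of this loop (illustrative, not the authors' pseudocode): mark core points, then grow each cluster by repeatedly collecting directly density-reachable points.

```python
# Minimal DBSCAN sketch following the description above.
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neigh = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # incl. self
    labels = np.full(n, -1)                  # -1 = noise / unassigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue                         # assigned, or not a core object
        labels[i] = cluster
        frontier = list(neigh[i])
        while frontier:                      # expand the cluster
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neigh[j]) >= min_pts: # j is a core object: keep going
                    frontier.extend(neigh[j])
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10], [5, 5]], dtype=float)
print(dbscan(X, eps=1.5, min_pts=3))  # two clusters; (5,5) stays noise (-1)
```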


    Density-based Clustering

[Figure: DBSCAN in action, panels 1–4, with MinPts = 4.]


    Density-based Clustering

[Figure: DBSCAN in action, panels 5–8.]


    DBSCAN: Sensitive to Parameters

[Figure: DBSCAN results under different parameter settings.]


    3.4.3 OPTICS

Motivation

Very different local densities may be needed to reveal clusters in different regions.

Clusters A, B, C1, C2, and C3 cannot be detected using one global density parameter.

A global density parameter can detect either A, B, C or C1, C2, C3.

Solution: use OPTICS.

[Figure: clusters A and B, and a region C containing nested clusters C1, C2, and C3.]


OPTICS Principle

Produce a special order of the database:

with respect to its density-based clustering structure

containing information about every clustering level of the data set (up to a generating distance ε)

Which information to use? Core-distance and reachability-distance.


Core-Distance and Reachability-Distance

The core-distance of an object p is the smallest ε′ that makes p a core object. If p is not a core object, the core-distance of p is undefined.

Example (ε = 6 mm, MinPts = 5):

ε′ = 3 mm is the core-distance of p: it is the distance between p and its fourth-closest object.

The reachability-distance of an object q with respect to object p is:

reachability-distance(q, p) = max(core-distance(p), Euclidean(p, q))

Example:

Reachability-distance(q1, p) = core-distance(p) = 3 mm

Reachability-distance(q2, p) = Euclidean(q2, p)

[Figure: point p with circles of radius ε = 6 mm and ε′ = 3 mm; q1 lies inside the core-distance circle, q2 outside it.]
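Both definitions translate directly into code. In the sketch below (illustrative helper names), None plays the role of "undefined":

```python
# Core-distance and reachability-distance as defined above.
import numpy as np

def core_distance(X, p, eps, min_pts):
    """Smallest radius that makes X[p] a core object, or None if undefined."""
    d = np.sort(np.linalg.norm(X - X[p], axis=1))  # d[0] = 0, p itself
    if np.count_nonzero(d <= eps) < min_pts:
        return None                                # p is not a core object
    # e.g. MinPts = 5: distance from p to its fourth-closest other object
    return d[min_pts - 1]

def reachability_distance(X, q, p, eps, min_pts):
    """max(core-distance(p), Euclidean(p, q)); undefined if p is not core."""
    cd = core_distance(X, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, float(np.linalg.norm(X[p] - X[q])))
```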



    OPTICS Algorithm

Creates an ordering of the objects in the database and stores, for each object, its:

Core-distance

Reachability-distance from the closest core object from which the object has been directly density-reachable

This information is sufficient for the extraction of all density-based clusterings with respect to any distance ε′ that is smaller than the distance ε used in generating the order.
