1

A System for Outlier Detection and Cluster Repair

Ying Liu
Dr. Sprague
Oct 21, 2005

2

A data set

3

Clustering algorithms can generate bad clusters

hMETIS (k=6)

4

Clustering algorithms can generate bad clusters

hMETIS (k=20)

5

BIRCH

6

BIRCH

7

Clustering algorithms can generate bad clusters

BIRCH (k=20)

8

Factors affecting clustering results

Outliers
Inappropriate parameter values
Drawbacks of the clustering algorithms themselves

9

Factors affecting outlier detection results

Distributions
Boundary between outlier group and microcluster
Nested outliers

10

Two steps of cluster repair

Step 1: Outlier/outlier group detection for each cluster. Separate points which are not supposed to be together.
Step 2: Merge density connected points. Merge points which should be together.

Clusters generated by a clustering algorithm -> Outlier detection of different clusters -> Merge similar points from different clusters

11

Step 1: Cluster Repair

Outlier Detection and Evaluation by Network Flow

12

Network Flow: Maximum Flow/Minimum Cut

Ford-Fulkerson (1962)

The maximum flow problem is to find a flow f for which the total flow is maximum.
The total flow can be measured at the sink, or at any cut separating the source from the sink.

13

Outlier detection: Maximum flow/Minimum cut

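[Figure: flow network with source s, sink t, and intermediate vertices a, b, c, d; edges labeled flow/capacity: 19/19, 12/13, 7/10, 9/9, 7/7, 12/12, 28/30, 3/3, 10/11]

Augmenting paths:
s->a->b->t: 12
s->a->c->d->b->t: 7
s->c->b->t: 9
s->c->d->t: 3

maximum flow = minimum cut = 12 + 3 + 9 + 7 = 31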
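The example can be checked with a short max-flow routine. The sketch below uses the Edmonds-Karp variant (breadth-first augmenting paths). The assignment of capacities to edges is an assumption, inferred from the path flows and the flow/capacity labels on the slide, since the figure itself did not survive. The function name `max_flow_min_cut` is hypothetical.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp: return (max flow value, source side of a minimum cut)."""
    # Residual capacities, with a zero-capacity reverse entry for every edge.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the vertices reached from s
            # form the source side of a minimum cut.
            return flow, set(parent)
        # Walk back from t to collect the path, then push the bottleneck.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

# Assumed edge capacities (inferred, not read off the original figure).
network = {
    "s": {"a": 19, "c": 13},
    "a": {"b": 12, "c": 10},
    "c": {"b": 9, "d": 11},
    "d": {"b": 7, "t": 3},
    "b": {"t": 30},
}
value, source_side = max_flow_min_cut(network, "s", "t")
print(value, sorted(source_side))
```

Under these assumed capacities the cut edges are a->b, c->b, d->b and d->t, whose capacities 12, 9, 7 and 3 add up to the flow value.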

14

Outlier detection by network flow

1. Compute the k nearest neighbors of each point in a cluster of data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as source/sink s; choose its farthest vertex as the sink/source t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and use the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier groups from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Adjusting: coarsen the graph and adjust the maximum flow.
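A minimal sketch of one round of steps 1 to 4, under stated assumptions rather than as the authors' implementation. The capacity `round(100 / distance)` is a stand-in chosen only so that closer points are connected more strongly, the source is fixed instead of random, "farthest" is taken as Euclidean distance, and all points are assumed distinct. The names `knn_network`, `min_cut` and `candidate_outliers` are hypothetical.

```python
import math
from collections import deque

def knn_network(points, k):
    """Steps 1-2: undirected k-nearest-neighbor graph with integer capacities."""
    cap = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            # Assumed capacity: closer points get a stronger connection.
            c = max(1, round(100 / math.dist(p, points[j])))
            cap[i][j] = cap[j][i] = c
    return cap

def min_cut(cap, s, t):
    """Step 4: Edmonds-Karp max flow; returns (flow, vertices on the s side)."""
    residual = {u: dict(nbrs) for u, nbrs in cap.items()}
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push

def candidate_outliers(points, k, s=0):
    """Steps 1-4 for one source: return (smaller side of the cut, max flow)."""
    cap = knn_network(points, k)
    # Step 3: the sink is the vertex farthest from s.
    t = max(range(len(points)), key=lambda j: math.dist(points[s], points[j]))
    flow, s_side = min_cut(cap, s, t)
    t_side = set(cap) - s_side
    return min(s_side, t_side, key=len), flow

# Five tightly grouped points and one distant point (index 5).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
outliers, flow = candidate_outliers(points, k=2)
print(outliers, flow)
```

Steps 5 and 6 would repeat this call on the remaining points until a stop criterion is met; step 7 (coarsening) is not covered here.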

15

Loosely connected clusters

[Figure: loosely connected clusters, labeled 20, 19, 10, 1, and 2]

16

Experiments (setting up the network)

The No. 20 cluster, 591 points
Setting up the network: 7 nearest neighbors; 591 points, 5028 edges

17

Setting up the network

Compute k nearest neighbors, and make sure all vertices are connected.
Compute the capacity between two vertices from the distance.

$Capacity_c = \left(100 \cdot \frac{1}{dist_c}\right)^{4}$
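The slides do not say how connectivity is enforced. One possible approach, sketched here as an assumption, is to build the k-nearest-neighbor graph and raise k until a breadth-first search reaches every vertex. The helper names `knn_edges`, `is_connected` and `connected_knn_graph` are hypothetical.

```python
import math
from collections import deque

def knn_edges(points, k):
    """Undirected edge set linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: math.dist(p, points[j]))[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return edges

def is_connected(n, edges):
    """Breadth-first search from vertex 0; True if all n vertices are reached."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

def connected_knn_graph(points, k):
    """Raise k until the graph is connected (assumed strategy)."""
    while not is_connected(len(points), knn_edges(points, k)):
        k += 1
    return k, knn_edges(points, k)

# Two pairs of points far apart: k = 1 leaves two components.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
k_used, edges = connected_knn_graph(points, k=1)
print(k_used, sorted(edges))
```

The loop always terminates, because with k equal to the number of points minus one the graph is complete.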

18

Experiment result

Loop      Max Flow
No. 4     1267
No. 1     1269
No. 3     3256
No. 5     3937
No. 8     5939
No. 7     7717
No. 14    8962
No. 9     10148
No. 10    16194
No. 2     16533
No. 13    17793
No. 6     25378
No. 11    63797
No. 12    160515
No. 15    359560
No. 17    427908
No. 16    1307310

19

Experiment (adjusting)

18 vertices, 66 edges

Loop      Cut               Max Flow
No. 1     vertex 4          1267
No. 2     vertex 1          1269
No. 3     vertex 3          3256
No. 4     vertex 5          3937
No. 5     vertex 8          5939
No. 6     vertex 7,9,10     16531
No. 7     vertex 2          16533
No. 8     vertex 13         17793
No. 9     vertex 14         20261
No. 10    vertex 6          25378
No. 11    vertex 11         52498
No. 12    vertex 12         160515
No. 13    vertex 15         359560
No. 14    vertex 17         427908
No. 15    vertex 16         1307310

20

Stop criteria

Users input the number of outliers or outlier groups they want.
Use the maximum flow as the stop condition.

Stop when $D_{flow} \le D_{avg}$

$D_{avg}$ = average distance of the remaining data

$Capacity_c = \left(100 \cdot \frac{1}{dist_c}\right)^{4}$

$D_{flow} = \frac{100}{\left(\frac{max\_flow}{\#cross\_edge}\right)^{1/4}}$
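The stop test can be sketched in a few lines. This rests on one reading of the slide's formulas, which are assumptions here: the capacity of an edge is the fourth power of 100/dist, D_flow inverts that relation for the average capacity per cut edge, and the loop stops once D_flow is no larger than D_avg. The function names are hypothetical.

```python
def capacity(dist):
    """Assumed edge capacity: fourth power of 100 / dist."""
    return (100.0 / dist) ** 4

def d_flow(max_flow, n_cross_edges):
    """Distance that corresponds to the average capacity per cut edge."""
    return 100.0 / (max_flow / n_cross_edges) ** 0.25

def should_stop(max_flow, n_cross_edges, d_avg):
    """Assumed stop rule: the cut is no looser than a typical link."""
    return d_flow(max_flow, n_cross_edges) <= d_avg

# A cut through three edges of length 4 in data whose average distance is 5:
# the cut edges are as short as ordinary links, so the loop would stop.
print(should_stop(3 * capacity(4.0), 3, d_avg=5.0))
# A cut through two edges of length 25: still an outlier, keep going.
print(should_stop(2 * capacity(25.0), 2, d_avg=5.0))
```

With these definitions, `d_flow` applied to the capacity of a single edge gives back that edge's length, so D_flow can be compared directly with D_avg.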

21

Outlier Degree

22

Experiment (20 clusters)

[Figure: the 20 clusters, labeled 1 to 20]

23

Step 2: Cluster Repair

Merge Density Connected Points

24

Merge density connected microclusters by flexible parameters of DBSCAN

[Figure: the 20 clusters, labeled 1 to 20]

25

Flexible parameters of DBSCAN

Get the average distance d of every microcluster from each point's k nearest neighbors.

[Figure panels: No. 20 cluster, No. 19 cluster, No. 10 cluster]

26

DBSCAN

27

DBSCAN

28

DBSCAN with flexible Eps

The original DBSCAN derives a single global Eps from the least dense ε-neighborhood and sets MinPts = 4.

We use the average distance of every microcluster as its Eps. When running DBSCAN, points in different microclusters use different Eps values.
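A minimal sketch of the idea, not the authors' implementation: each point's neighborhood is searched with the Eps of its own microcluster, and that Eps is the average distance to the k nearest neighbors inside the microcluster. Neighbor search is brute force here, MinPts counts the point itself, and the names `average_knn_distance` and `flexible_dbscan` are hypothetical. The example uses points on a line with MinPts = 3 to keep it small.

```python
import math

def average_knn_distance(points, k):
    """Mean over all points of the mean distance to their k nearest neighbors."""
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        total += sum(nearest) / len(nearest)
    return total / len(points)

def flexible_dbscan(points, micro, eps_of, min_pts=4):
    """DBSCAN where point i uses eps_of[micro[i]] instead of a global Eps."""
    n = len(points)

    def neighbors(i):
        eps = eps_of[micro[i]]
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n          # None = unvisited, -1 = noise
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1       # may later be claimed as a border point
            continue
        labels[i] = cluster
        stack = list(nbrs)
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:       # core point: keep expanding
                stack.extend(nj)
        cluster += 1
    return labels

# A dense microcluster (spacing 1) next to a sparse one (spacing 3).
points = [(x, 0) for x in (0, 1, 2, 3, 4, 7, 10, 13, 16, 19)]
micro = [0] * 5 + [1] * 5
eps_of = {m: average_knn_distance([p for p, mm in zip(points, micro) if mm == m], k=2)
          for m in set(micro)}
labels = flexible_dbscan(points, micro, eps_of, min_pts=3)
print(eps_of, labels)
```

In this toy data a single global Eps taken from the dense group would turn the sparse group into noise, while one taken from the sparse group would fuse both groups; per-microcluster Eps keeps them as two clusters.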

29

Kd tree

Use a kd tree to find buckets with more than two microclusters from different original cluster results.
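The bucket search could look roughly like the sketch below. It is an interpretation, not the slide's exact procedure: the kd tree is built by median splits that cycle through the axes, leaves of at most `leaf_size` points are the buckets, and a bucket is flagged when its points come from at least `min_sources` different original clusters. All names and both parameters are hypothetical.

```python
def kd_buckets(points, ids, leaf_size, depth=0):
    """Split point indices at the median, cycling through the axes."""
    if len(ids) <= leaf_size:
        return [ids]
    axis = depth % len(points[0])
    ids = sorted(ids, key=lambda i: points[i][axis])
    mid = len(ids) // 2
    return (kd_buckets(points, ids[:mid], leaf_size, depth + 1)
            + kd_buckets(points, ids[mid:], leaf_size, depth + 1))

def mixed_buckets(buckets, cluster_of, min_sources=2):
    """Buckets whose points come from at least min_sources original clusters."""
    return [b for b in buckets
            if len({cluster_of[i] for i in b}) >= min_sources]

# Eight points on a line; the original clustering splits them 3 / 5.
points = [(x, 0) for x in range(8)]
cluster_of = [0, 0, 0, 1, 1, 1, 1, 1]
buckets = kd_buckets(points, list(range(8)), leaf_size=2)
mixed = mixed_buckets(buckets, cluster_of)
print(buckets, mixed)
```

Only the flagged buckets need the flexible-Eps DBSCAN pass, since they are where points of different original clusters sit next to each other.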

30

No. 125 bucket

31

MinPts = 4 for dim = 2

[Figure: point p with its Eps-neighborhood]

Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) with an R* tree. When Eps = avg_dist between points, the neighborhood of point p very likely includes 3 extra points besides p itself.

32

No. 125 bucket

(a) MinPts = 5 (b) MinPts = 5

33

Other controversial buckets

[Figure panels: No. 119 bucket, No. 113 bucket, No. 114 bucket]

If x% of the points of a microcluster are merged into another microcluster, then merge these two microclusters. Since more than 90% of the points of these microclusters in these buckets are merged, microclusters 24 and 28 are merged.
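The merge rule can be sketched as follows. This is one interpretation with assumptions called out: each DBSCAN cluster is "owned" by the microcluster that contributes most of its points, a point counts as merged into another microcluster when it lands in a cluster owned by that microcluster, and noise is labeled -1. The function name `microclusters_to_merge` and the sample data are hypothetical; only the ids 24 and 28 and the 90% level come from the slide.

```python
from collections import Counter, defaultdict

def microclusters_to_merge(micro, labels, threshold=0.9):
    """Pairs (m, n) where at least `threshold` of m's points were merged into n."""
    # Count how many points each microcluster contributes to each DBSCAN cluster.
    members = defaultdict(Counter)
    for m, lab in zip(micro, labels):
        if lab != -1:
            members[lab][m] += 1
    # The microcluster with the most points in a cluster owns it.
    owner = {lab: cnt.most_common(1)[0][0] for lab, cnt in members.items()}

    size = Counter(micro)
    absorbed = defaultdict(Counter)
    for m, lab in zip(micro, labels):
        if lab != -1 and owner[lab] != m:
            absorbed[m][owner[lab]] += 1
    return sorted((m, n)
                  for m, cnt in absorbed.items()
                  for n, c in cnt.items()
                  if c / size[m] >= threshold)

# 20 points of microcluster 24, 19 of which join the cluster owned by 28.
micro = [24] * 20 + [28] * 30
labels = [0] * 19 + [-1] + [0] * 30
pairs = microclusters_to_merge(micro, labels, threshold=0.9)
print(pairs)
```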

34

No. 20, 19 and 10 cluster repair

35

After repair 20 clusters

36

Conclusion

Repair clusters from two aspects:
Removing points which are loosely connected to the clusters by outlier/outlier group detection;
Merging points which are density connected by DBSCAN with flexible Eps.
Analyze microclusters of interest.
Found the relationship among outliers, outlier groups, and main clusters.

37

Questions

MinPts in high dimensional data: for 3-d, MinPts = 5; for 4-d, MinPts = 6?

For some outlier group microclusters, MinPts could be very high, because border points include points of neighboring dense microclusters within their Eps. How can each microcluster's MinPts be used as a reference?