1
A System for Outlier Detection and Cluster Repair
Ying Liu, Dr. Sprague
Oct 21, 2005
8
Factors affecting clustering results
Outliers
Inappropriate parameter values
Drawbacks of the clustering algorithms themselves
9
Factors affecting outlier detection results
Distributions
Boundary between outlier group and microcluster
Nested outliers
10
Two steps of cluster repair
1. Outlier/outlier group detection for each cluster: separate points which are not supposed to be together.
2. Merge density-connected points: merge points which should be together.
[Figure: clusters generated by a clustering algorithm -> outlier detection in the different clusters -> merge similar points from different clusters]
12
Network Flow: Maximum Flow/Minimum Cut
Ford-Fulkerson (1962)
The maximum flow problem is to find a flow f for which the total flow is maximum. The total flow can be measured at the sink, or at any cut separating the source from the sink.
13
Outlier detection: Maximum flow/Minimum cut
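[Figure: flow network with source s, sink t, and intermediate vertices a, b, c, d. Edges are labeled flow/capacity: 19/19, 12/13, 7/10, 9/9, 7/7, 12/12, 28/30, 3/3, 10/11.]
Augmenting paths and the flow each one carries:
s->a->b->t: 12
s->a->c->d->b->t: 7
s->c->b->t: 9
s->c->d->t: 3
maximum flow = minimum cut = 12 + 7 + 9 + 3 = 31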
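The example can be checked with a short maximum-flow routine. The sketch below uses the Edmonds-Karp variant of Ford-Fulkerson (breadth-first augmenting paths). The figure's edge-to-label mapping does not survive in the transcript, so the capacities in EXAMPLE are an assumed assignment that is consistent with the four listed paths; the names max_flow and EXAMPLE are illustrative only.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none is left."""
    # Residual capacities, with zero-capacity reverse edges added.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})[v] = residual.get(u, {}).get(v, 0) + c
        residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Walk back from the sink, find the bottleneck, push that much flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

# ASSUMED capacities: one assignment of the figure's labels to edges.
EXAMPLE = {
    ("s", "a"): 19, ("s", "c"): 12, ("a", "b"): 13, ("a", "c"): 10,
    ("c", "b"): 9, ("c", "d"): 11, ("d", "b"): 7, ("d", "t"): 3,
    ("b", "t"): 30,
}
print(max_flow(EXAMPLE, "s", "t"))  # 31
```

Under this assignment the two edges leaving s have capacities 19 and 12, so the cut around s has capacity 31, matching the total flow.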
14
Outlier detection by network flow
1. Compute the k nearest neighbors of each point in a cluster of data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as the source/sink s, and choose its farthest vertex as the sink/source t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and take the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier group from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Adjusting: coarsen the graph and adjust the maximum flow.
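Steps 1 to 6 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions: the capacity is taken as 100 / dist^4, the start vertex is chosen deterministically instead of at random so the result is reproducible, the loop stops after a fixed number of rounds, and step 7 (coarsening and adjusting) is left out. All function names are hypothetical.

```python
from collections import deque
import math

def knn_edges(points, k):
    """Undirected edges joining each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        edges.update((min(i, j), max(i, j)) for j in others[:k])
    return edges

def source_side_of_min_cut(nodes, capacity, s, t):
    """Run Edmonds-Karp, then return the vertices still reachable from s."""
    residual = {u: {} for u in nodes}
    for (u, v), c in capacity.items():
        if u in residual and v in residual:
            residual[u][v] = residual[v][u] = c  # undirected edge
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)  # no augmenting path left: this is the s side
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push

def peel_outliers(points, k, rounds):
    """Remove the smaller side of a minimum cut, up to `rounds` times."""
    nodes = set(range(len(points)))
    capacity = {}
    for u, v in knn_edges(points, k):
        dist = max(math.dist(points[u], points[v]), 1e-9)  # guard duplicates
        capacity[(u, v)] = 100.0 / dist ** 4  # ASSUMED capacity function
    removed = []
    for _ in range(rounds):
        if len(nodes) < 2:
            break
        s = min(nodes)  # deterministic stand-in for the random start vertex
        t = max(nodes - {s}, key=lambda j: math.dist(points[s], points[j]))
        side = source_side_of_min_cut(nodes, capacity, s, t)
        other = nodes - side
        small = side if len(side) <= len(other) else other
        removed.append(sorted(small))
        nodes -= small
    return removed
```

For a 3 x 3 unit grid plus one point far away at (10, 10), a single round with k = 2 should cut off the far point, because the two edges that reach it have far less capacity than the edges inside the grid.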
16
Experiments (setting up the network)
Cluster No. 20: 591 points
Setting up the network with 7 nearest neighbors: 591 points, 5028 edges
17
Setting up the network
Compute the k nearest neighbors, and make sure all vertices are connected.
Compute the capacity between two vertices from their distance.
$\text{Capacity} = c \cdot \left(\frac{1}{\text{dist}}\right)^{4}, \quad c = 100$
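A small sketch of the capacity computation, assuming the formula has the form Capacity = c * (1/dist)^4 with c = 100. The function name edge_capacity is hypothetical. With this form, short edges receive far more capacity than long ones, so a minimum cut tends to pass through the long edges that attach outliers to a cluster.

```python
import math

def edge_capacity(p, q, c=100.0):
    """Capacity of the edge between points p and q (ASSUMED form c * (1/dist)^4)."""
    dist = math.dist(p, q)
    return c * (1.0 / dist) ** 4

print(edge_capacity((0, 0), (1, 0)))  # 100.0
print(edge_capacity((0, 0), (2, 0)))  # 6.25
```

Doubling the distance divides the capacity by 16.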
18
Experiment result
Loop Max Flow
No. 4 1267
No. 1 1269
No. 3 3256
No. 5 3937
No. 8 5939
No. 7 7717
No. 14 8962
No. 9 10148
No. 10 16194
No. 2 16533
No. 13 17793
No. 6 25378
No. 11 63797
No. 12 160515
No. 15 359560
No. 17 427908
No. 16 1307310
19
Experiment (adjusting)
18 vertices, 66 edges
Loop    Cut                 Max Flow
No. 1   vertex 4            1267
No. 2   vertex 1            1269
No. 3   vertex 3            3256
No. 4   vertex 5            3937
No. 5   vertex 8            5939
No. 6   vertices 7, 9, 10   16531
No. 7   vertex 2            16533
No. 8   vertex 13           17793
No. 9   vertex 14           20261
No. 10  vertex 6            25378
No. 11  vertex 11           52498
No. 12  vertex 12           160515
No. 13  vertex 15           359560
No. 14  vertex 17           427908
No. 15  vertex 16           1307310
20
Stop criteria
Users input the number of outliers or outlier groups they want.
Use the maximum flow as the stop condition.
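Stop when $D_{\text{flow}} \le D_{\text{avg}}$
$D_{\text{avg}}$ = average distance of the remaining data
$\text{Capacity} = c \cdot \left(\frac{1}{\text{dist}}\right)^{4}, \quad c = 100$
$D_{\text{flow}} = \left(\frac{100}{\text{max\_flow} / \#\text{cross\_edge}}\right)^{1/4}$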
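The flow-based stop test can be sketched as below. It assumes that D_flow inverts a capacity of the form 100 * (1/dist)^4: the maximum flow divided by the number of edges crossing the cut gives an average cut-edge capacity, which is converted back into a distance. The direction of the comparison (stop once D_flow is no larger than D_avg) is also an assumption, as are the function names.

```python
def d_flow(max_flow, cross_edges, c=100.0):
    """Distance equivalent of the average capacity of a cut edge (ASSUMED form)."""
    return (c * cross_edges / max_flow) ** 0.25

def should_stop(max_flow, cross_edges, d_avg, c=100.0):
    """ASSUMED rule: stop once cut edges are no longer than the average distance."""
    return d_flow(max_flow, cross_edges, c) <= d_avg
```

With hypothetical numbers, a cut of 2 edges carrying a maximum flow of 12.5 gives D_flow = (100 * 2 / 12.5)^(1/4) = 2, so the loop would stop if the average distance of the remaining data is 2.5 and continue if it is 1.5.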
24
Merge density-connected microclusters by flexible parameters of DBSCAN
[Figure: microclusters numbered 1 to 20]
25
Flexible parameters of DBSCAN
Get the average distance d of every microcluster from each point's k nearest neighbors.
[Figure panels: cluster No. 20, cluster No. 19, cluster No. 10]
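One way to compute the average distance d of a microcluster is to average, over all of its points, the distances to each point's k nearest neighbors inside the microcluster. The sketch below uses a brute-force search for clarity; the function name is hypothetical, and whether neighbors are restricted to the same microcluster is an assumption.

```python
import math

def average_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbors in `points`."""
    total, count = 0.0, 0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for d in dists[:k]:
            total += d
            count += 1
    return total / count if count else 0.0

print(average_knn_distance([(0, 0), (1, 0), (2, 0)], k=1))  # 1.0
```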
28
DBSCAN with flexible Eps
The original DBSCAN uses the least dense ε-neighborhood as a global Eps and sets MinPts = 4.
We use the average distance of every microcluster as its Eps. When running DBSCAN, points in different microclusters use different Eps values.
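A minimal sketch of DBSCAN in which every point carries its own radius, for example the average distance of the microcluster it belongs to. The neighbor search is brute force, the function name is hypothetical, and measuring a neighborhood with the radius of the query point is an assumption about how the per-microcluster Eps is applied.

```python
import math

def dbscan_flexible_eps(points, eps, min_pts=4):
    """DBSCAN where point i uses its own radius eps[i]. Label -1 means noise."""
    n = len(points)

    def neighbors(i):
        # Neighborhood of i, including i itself, within i's own radius.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps[i]]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # j is a core point: keep expanding
                queue.extend(more)
    return labels

dense = [(0, 0), (1, 0), (0, 1), (1, 1)]
sparse = [(20, 0), (25, 0), (20, 5), (25, 5)]
print(dbscan_flexible_eps(dense + sparse, [1.5] * 4 + [7.5] * 4))
```

In this toy input the sparse group forms a cluster only because its points use the larger radius 7.5; with a single global radius of 1.5 the same four points are labeled noise.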
29
Kd tree
Use a kd-tree to find buckets that contain more than two microclusters from different original clusters.
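The bucket search can be sketched with a simple kd-tree that splits at the median until each leaf holds at most bucket_size points. The flagging rule used here (a bucket is flagged when its microclusters come from more than one original cluster) is one interpretation of the slide, and the data layout and function names are hypothetical.

```python
def kd_buckets(items, bucket_size, depth=0):
    """Split items into kd-tree leaves. Each item is (point, micro_id, cluster_id)."""
    if len(items) <= bucket_size:
        return [items]
    axis = depth % len(items[0][0])  # cycle through the coordinates
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    return (kd_buckets(items[:mid], bucket_size, depth + 1)
            + kd_buckets(items[mid:], bucket_size, depth + 1))

def mixed_buckets(items, bucket_size):
    """Microcluster ids of each bucket that mixes several original clusters."""
    result = []
    for bucket in kd_buckets(items, bucket_size):
        micro_to_cluster = {micro: cluster for _, micro, cluster in bucket}
        if len(set(micro_to_cluster.values())) > 1:
            result.append(sorted(micro_to_cluster))
    return result

items = [((0, 0), 1, "A"), ((1, 0), 1, "A"), ((10, 0), 2, "A"), ((11, 0), 3, "B")]
print(mixed_buckets(items, bucket_size=2))  # [[2, 3]]
```

Only the second bucket is reported, since microclusters 2 and 3 sit next to each other but come from different original clusters.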
31
MinPts = 4 for dim = 2
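[Figure: point p with its Eps-neighborhood]
Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) with an R*-tree. When Eps equals the average distance between points, the neighborhood of point p very likely contains 3 other points besides p itself, which is consistent with MinPts = 4.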
33
Other controversial buckets
[Figure panels: bucket No. 119, bucket No. 113, bucket No. 114]
If x% of the points of a microcluster are merged into another microcluster, then merge these two microclusters. Since more than 90% of the points of these microclusters in these buckets are merged, microclusters 24 and 28 are merged.
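The merge rule can be sketched as a comparison of microcluster labels before and after the DBSCAN pass. The input layout, the function name, and the default threshold of 0.9 (taken from the 90% figure above) are assumptions; the sketch applies the rule to whatever points it is given and does not restrict itself to particular buckets.

```python
from collections import Counter

def merge_targets(before, after, threshold=0.9):
    """Map each microcluster to the one that absorbed more than `threshold` of it.

    before[i] and after[i] are the microcluster labels of point i before and
    after the DBSCAN pass (hypothetical input layout).
    """
    sizes = Counter(before)
    moved = Counter((b, a) for b, a in zip(before, after) if a != b)
    return {b: a for (b, a), n in moved.items() if n / sizes[b] > threshold}

# Hypothetical labels: all 10 points of microcluster 24 end up in 28.
before = [24] * 10 + [28] * 5
after = [28] * 15
print(merge_targets(before, after))  # {24: 28}
```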
36
Conclusion
Repair clusters from two aspects: removing points that are loosely connected to the clusters by outlier/outlier group detection; merging points that are density-connected by DBSCAN with flexible Eps.
Analyze microclusters of interest.
Found the relationship among outliers, outlier groups, and main clusters.