![Page 1: CHAN Siu Lung, Daniel CHAN Wai Kin, Ken CHOW Chin Hung, Victor KOON Ping Yin, Bob CURE: Efficient Clustering Algorithm for Large Databases for Large Databases](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649f4a5503460f94c6c471/html5/thumbnails/1.jpg)
CHAN Siu Lung, Daniel; CHAN Wai Kin, Ken; CHOW Chin Hung, Victor; KOON Ping Yin, Bob
CURE: Efficient Clustering Algorithm for Large Databases
Contents
I. Problems in traditional clustering methods
II. Basic idea of CURE clustering
III. Improved CURE
IV. Summary
V. References
I. Problems in traditional clustering methods
Partitional Clustering
– This category of clustering methods tries to divide the data set into k clusters by optimizing some criterion function.
– The most common criterion is the square-error criterion.
– This method favors clusters whose data points are as compact and as well separated as possible.
• Errors can occur when the square error is reduced by splitting a large cluster in favor of some other group.
Figure: A large cluster split by the partitional method
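The splitting effect can be reproduced with a few lines of Python. This is a minimal sketch on hypothetical 1-D data (the values and the two groupings are made up for illustration, not taken from the slides): it computes the square error as the sum of squared distances from each point to its cluster mean, then compares the intuitive grouping with one that splits the large group.

```python
def mean(values):
    """Arithmetic mean of a list of 1-D values."""
    return sum(values) / len(values)


def square_error(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        m = mean(cluster)
        total += sum((x - m) ** 2 for x in cluster)
    return total


# Hypothetical data: one large, spread-out group and one small, tight group.
large = list(range(0, 11))   # 0, 1, ..., 10
small = [14, 16]

# Grouping a person would pick: keep the large group whole.
natural = [large, small]
# Alternative: split the large group and attach part of it to the small one.
split = [large[:7], large[7:] + small]

natural_error = square_error(natural)
split_error = square_error(split)
print(natural_error, split_error)
```

On this data the split grouping has the lower square error, so a method that only minimizes the criterion would prefer it over the intuitive grouping.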
Hierarchical Clustering
– This category of clustering methods merges a sequence of disjoint clusters into the target k clusters, at each step merging the two clusters with the minimum distance between them.
– The distance between clusters can be measured as:
• Distance between the cluster means (dmean)
• Average distance between pairs of points, one from each cluster
• Distance between the two nearest points, one from each cluster (dmin)
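The three distance measures can be written down directly. The sketch below assumes Euclidean distance and represents a cluster as a list of coordinate tuples; the function names `d_mean`, `d_ave`, and `d_min` are illustrative choices, and `d_ave` is taken to be the average pairwise distance between the two clusters.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of a list of points (tuples)."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def d_mean(a, b):
    """Distance between the means of clusters a and b."""
    return math.dist(centroid(a), centroid(b))


def d_ave(a, b):
    """Average distance over all pairs (p, q) with p in a and q in b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def d_min(a, b):
    """Distance between the two nearest points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


left = [(0, 0), (0, 2)]
right = [(3, 0), (3, 2)]
print(d_mean(left, right), d_ave(left, right), d_min(left, right))
```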
– These methods favor hyper-spherical clusters of uniform size.
– Take some elongated data as an example:
– Result of dmean:
– Result of dmin:
Problems summary
1. Traditional clustering mainly favors spherical clusters.
2. Data points within a cluster must be compact.
3. Clusters must be far enough apart from each other.
4. Cluster sizes must be uniform.
5. Outliers greatly disturb the clustering result.
II. Basic idea of CURE clustering
General CURE clustering procedure.
1. It is similar to the hierarchical clustering approach, but it uses a set of sample points as the cluster representative rather than every point in the cluster.
2. First set a target number of sample points c. Then select c well-scattered sample points from the cluster.
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.
4. These points serve as the representatives of the cluster and are the points used in the dmin cluster-merging approach.
5. After each merge, c sample points are selected from the original representatives of the merged clusters to represent the new cluster.
6. Cluster merging stops when the target of k clusters is reached.
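Steps 2 and 3 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the farthest-point heuristic used to pick "well-scattered" points, the function names, and the name `alpha` for the shrinking fraction are all assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points (tuples)."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed heuristic: farthest-point selection)."""
    mean = centroid(points)
    remaining = list(points)
    # Start with the point farthest from the centroid.
    first = max(remaining, key=lambda p: math.dist(p, mean))
    chosen = [first]
    remaining.remove(first)
    # Then repeatedly add the point farthest from the points chosen so far.
    while remaining and len(chosen) < c:
        nxt = max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen


def shrink(reps, mean, alpha):
    """Move each representative toward the centroid by a fraction alpha."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in reps]


# Hypothetical cluster: four corners of a square plus its center.
cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
mean = centroid(cluster)
reps = scattered_points(cluster, c=4)
shrunk = shrink(reps, mean, alpha=0.5)
print(reps)
print(shrunk)
```

With `alpha` close to 0 the representatives stay on the cluster boundary; with `alpha` close to 1 they collapse toward the centroid.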
Figure: Nearest and Merge steps
Pseudocode of CURE
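As a rough illustration of the general procedure, the following Python sketch merges clusters by the minimum distance between their shrunk representatives until k clusters remain. It is a simplified sketch under stated assumptions, not the authors' pseudocode: it uses a brute-force search for the nearest pair instead of a k-d tree and heap, a farthest-point heuristic for choosing scattered points, and hypothetical names (`cure`, `make_cluster`, `alpha`).

```python
import math


def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def scattered_points(candidates, mean, c):
    """Pick up to c well-scattered candidates (assumed farthest-point heuristic)."""
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < c:
        if not chosen:
            nxt = max(remaining, key=lambda p: math.dist(p, mean))
        else:
            nxt = max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen


def make_cluster(points, candidates, c, alpha):
    """Build a cluster record with scattered points and their shrunk representatives."""
    mean = centroid(points)
    scattered = scattered_points(candidates, mean, c)
    reps = [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in scattered]
    return {"points": points, "scattered": scattered, "reps": reps}


def rep_distance(u, v):
    """dmin computed over representative points only."""
    return min(math.dist(p, q) for p in u["reps"] for q in v["reps"])


def cure(points, k, c=4, alpha=0.3):
    # Every point starts as its own cluster.
    clusters = [make_cluster([p], [p], c, alpha) for p in points]
    while len(clusters) > k:
        # Brute-force search for the nearest pair of clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: rep_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        u, v = clusters[i], clusters[j]
        # New representatives come from the scattered points of the merged clusters.
        merged = make_cluster(
            u["points"] + v["points"], u["scattered"] + v["scattered"], c, alpha
        )
        clusters = [cl for idx, cl in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [cl["points"] for cl in clusters]


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = cure(data, k=2)
print(result)
```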
CURE efficiency
• The worst-case time complexity is $O(n^2 \log n)$.
• The space complexity is $O(n)$, due to the use of a k-d tree and a heap.
III. Improved CURE
Random Sampling
• When dealing with a large database, we cannot store every data point in memory.
• Merging clusters in a large database takes a very long time.
• We use random sampling to reduce both the time complexity and the memory usage.
• Assume that, to detect a cluster u, we need to capture at least a fraction f of its data points, i.e. f|u| points.
• The sample size s required to capture them can be expressed as follows: $s \ge fN + \frac{N}{|u|}\log\frac{1}{\delta} + \frac{N}{|u|}\sqrt{\left(\log\frac{1}{\delta}\right)^2 + 2f|u|\log\frac{1}{\delta}}$, where N is the size of the data set and $0 \le \delta \le 1$.
• The proof is given in reference (i). Here we just want to show that we can determine a sample size smin such that the probability of getting enough samples from every cluster u is at least 1 - δ.
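To get a feel for the sample sizes involved, the bound can be evaluated numerically. The sketch below assumes the Chernoff-style bound from the CURE paper, s ≥ fN + (N/|u|) log(1/δ) + (N/|u|) sqrt(log(1/δ)² + 2 f |u| log(1/δ)), with natural logarithms; the function and parameter names are illustrative.

```python
import math


def min_sample_size(n_total, cluster_size, f, delta):
    """Sample size so that, with probability at least 1 - delta, the sample
    contains at least f * cluster_size points of the cluster (assumed bound)."""
    log_term = math.log(1.0 / delta)      # natural logarithm is assumed
    ratio = n_total / cluster_size
    return (
        f * n_total
        + ratio * log_term
        + ratio * math.sqrt(log_term ** 2 + 2 * f * cluster_size * log_term)
    )


# Hypothetical setting: 1000 points, smallest cluster of 100 points,
# capture 10% of it with 99% confidence.
s = min_sample_size(n_total=1000, cluster_size=100, f=0.1, delta=0.01)
print(round(s))
```

Note that the bound depends on the size of the smallest cluster of interest: the smaller the cluster, the larger the sample needed.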
Partitioning and two pass clustering
• In addition, we use a two-pass approach to reduce the computation time.
• First, we divide the n data points into p partitions, each containing n/p data points.
• We then pre-cluster each partition until the number of clusters in each partition reaches n/(pq), for some q > 1.
• The clusters produced by the first pass are then used as input to the second pass, which forms the final clusters.
• The time complexity for one partition is: $O\left(\frac{n^2}{p^2} \cdot \frac{q-1}{q} \log\frac{n}{p}\right)$
• Therefore, the first-pass complexity is: $O\left(\frac{n^2}{p} \cdot \frac{q-1}{q} \log\frac{n}{p}\right)$
• And the second-pass complexity is: $O\left(\frac{n^2}{q^2} \log\frac{n}{q}\right)$
• Overall, the time complexity becomes: $O\left(\frac{n^2}{p} \cdot \frac{q-1}{q} \log\frac{n}{p} + \frac{n^2}{q^2} \log\frac{n}{q}\right)$
• The overall improvement is approximately a factor of: $\frac{1}{\frac{q-1}{pq} + \frac{1}{q^2}}$
• Also, to maintain the quality of clustering, n/(pq) should be 2 to 3 times k.
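The trade-off between p and q can be explored with a small helper. This sketch assumes the approximate improvement factor 1 / ((q - 1)/(pq) + 1/q²), obtained by comparing the two-pass cost (p partitions of n/p points, each reduced to n/(pq) clusters, followed by a second pass over n/q clusters) with a single pass over n points while ignoring the logarithmic terms. The function names and the `margin` parameter are hypothetical.

```python
def improvement_factor(p, q):
    """Approximate speed-up of partitioned two-pass clustering (log terms ignored)."""
    return 1.0 / ((q - 1) / (p * q) + 1.0 / q ** 2)


def keeps_quality(n, p, q, k, margin=2):
    """Check the rule of thumb that n/(pq) stays at least `margin` times k."""
    return n / (p * q) >= margin * k


# Hypothetical setting: 10,000 sampled points, 5 target clusters.
n, k = 10_000, 5
for p, q in [(1, 1), (2, 2), (10, 3), (50, 10)]:
    print(p, q, round(improvement_factor(p, q), 2), keeps_quality(n, p, q, k))
```

Larger p and q give a larger speed-up, but the quality check fails once n/(pq) gets too close to k.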
Outlier elimination
• Outliers can be eliminated in two ways.
1. Random sampling: with random sampling, most outlier points are filtered out.
2. Outlier elimination: because outliers do not form a compact group, their clusters grow very slowly during the cluster-merging stage. An elimination procedure is therefore run during the merging stage, removing clusters with only 1 or 2 data points from the cluster list.
• To prevent outliers from merging into proper clusters, the procedure must be triggered at the right stage. In general, it is triggered when the number of clusters has dropped to 1/3 of the number of data points.
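The elimination step can be sketched as a filter applied once the cluster count has fallen far enough. This is a minimal sketch: the function name, the parameter names, and the representation of a cluster as a plain list of points are assumptions made for illustration.

```python
def eliminate_outliers(clusters, initial_count, max_outlier_size=2, trigger_divisor=3):
    """Drop very small clusters once the number of clusters has fallen to
    initial_count / trigger_divisor or below; otherwise return them unchanged."""
    if len(clusters) * trigger_divisor > initial_count:
        return clusters  # too early in the merging stage
    return [cl for cl in clusters if len(cl) > max_outlier_size]


# Hypothetical state during merging: 12 points originally, 4 clusters left.
current = [[1, 2, 3, 4], [5], [6, 7], [8, 9, 10]]
kept = eliminate_outliers(current, initial_count=12)
print(kept)
```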
Data labeling
• Because clustering is run on a random sample, every remaining data point must be labeled with its proper cluster.
• Each data point is assigned to the cluster that has the representative point nearest to it.
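The labeling rule amounts to a nearest-representative lookup. The sketch below assumes Euclidean distance and that each cluster is given as a list of its representative points; `label_points` is an illustrative name.

```python
import math


def label_points(points, cluster_reps):
    """Return, for each point, the index of the cluster whose nearest
    representative is closest to that point."""
    labels = []
    for p in points:
        best = min(
            range(len(cluster_reps)),
            key=lambda i: min(math.dist(p, r) for r in cluster_reps[i]),
        )
        labels.append(best)
    return labels


# Hypothetical representatives for two clusters, and three unlabeled points.
reps = [[(0, 0), (1, 1)], [(10, 10), (9, 9)]]
labels = label_points([(2, 2), (8, 8), (0, 1)], reps)
print(labels)
```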
Final overview of CURE flow
Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk
Sample results with different parameters
Different shrinking factor
Different number of representatives c
Relation between execution time, number of partitions p, and number of sample points s
IV. Summary
• CURE can effectively detect the proper shape of clusters with the help of scattered representative points and centroid shrinking.
• CURE can reduce computation time and memory usage with random sampling and two-pass clustering.
• CURE can effectively remove outliers.
• The quality and effectiveness of CURE can be tuned by varying s, p, c, and α to suit different input data sets.
V. References
i. [GRS97] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.
ii. [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25(2):103-114, June 1996.
iii. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.