© Prentice Hall, Part II - Clustering
Clustering Large DB
Most clustering algorithms assume a large data structure which is memory resident.
Clustering may be performed first on a sample of the database, then applied to the entire database.
Algorithms
– BIRCH
– DBSCAN
– CURE
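
The sample-then-apply idea can be sketched as follows. This is only an illustration under stated assumptions: the slide names neither the clustering method nor how the result is "applied", so the sketch assumes plain k-means on the sample and nearest-center assignment for the full database. The helper names `nearest`, `kmeans`, and `cluster_via_sample` are hypothetical.

```python
import random

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on an in-memory list of tuples (assumed method, not from the slide)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def cluster_via_sample(db, k, sample_size, seed=0):
    """Cluster a random sample, then label every tuple of db in one pass."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)      # only the sample must fit in memory
    centers = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centers) for p in db]  # one scan over the full database
    return centers, labels

blob_a = [(0.1 * i, 0.0) for i in range(10)]
blob_b = [(10 + 0.1 * i, 10.0) for i in range(10)]
centers, labels = cluster_via_sample(blob_a + blob_b, k=2, sample_size=12)
```

Only the sample has to be memory resident; the labelling step reads each tuple of the full database once.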
Desired Features for Large Databases
One scan (or less) of DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g. sampling)
Process each tuple once
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains information about one cluster
New nodes inserted in closest entry in tree
Clustering Feature
CF triple: (N, LS, SS)
– N: Number of points in cluster
– LS: Sum of points in the cluster
– SS: Sum of squares of points in the cluster
CF Tree
– Balanced search tree
– Node has CF triple for each child
– Leaf node represents cluster and has CF value for each subcluster in it.
– Subcluster has maximum diameter
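
A minimal sketch of why the (N, LS, SS) triple is enough: two triples merge by simple addition, and the centroid and diameter of a subcluster follow from the triple alone, without revisiting the points. The class name `CF` and its methods are hypothetical, SS is assumed to be stored as a single scalar (the sum of squared norms), and the diameter is taken as the root mean squared pairwise distance.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class CF:
    n: int      # N: number of points
    ls: tuple   # LS: per-dimension sum of the points
    ss: float   # SS: sum of squared norms of the points (scalar, an assumption)

    @classmethod
    def of_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CF triples are additive: merging two subclusters adds componentwise."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def diameter(self):
        """sqrt( sum over ordered pairs of |xi - xj|^2 / (N (N - 1)) ), from the triple only."""
        if self.n < 2:
            return 0.0
        ls_sq = sum(x * x for x in self.ls)
        return sqrt(max(0.0, (2 * self.n * self.ss - 2 * ls_sq) / (self.n * (self.n - 1))))

def fits(subcluster, point, max_diameter):
    """Would absorbing point keep the subcluster within the maximum diameter?"""
    return subcluster.merge(CF.of_point(point)).diameter() <= max_diameter

cf = CF.of_point((0.0, 0.0)).merge(CF.of_point((2.0, 0.0)))
# cf.centroid() is (1.0, 0.0) and cf.diameter() is 2.0
```

A check like `fits` is one way a leaf could decide between absorbing a new point into an existing subcluster and starting a new one.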
BIRCH Algorithm
Improve Clusters
DBSCAN
Density Based Spatial Clustering of Applications with Noise
Outliers will not affect creation of clusters.
Input
– MinPts: minimum number of points in a cluster
– Eps: for each point in a cluster there must be another point in it less than this distance away.
DBSCAN Density Concepts
Eps-neighborhood: Points within Eps distance of a point.
Core point: A point whose Eps-neighborhood is dense enough (contains at least MinPts points).
Directly density-reachable: A point p is directly density-reachable from a point q if the distance between them is small (at most Eps) and q is a core point.
Density-reachable: A point is density-reachable from another point if there is a path from one to the other in which each step is directly density-reachable, so every point on the path except possibly the last is a core point.
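
The four definitions above can be written down almost literally. The sketch below is illustrative only: the function names are hypothetical, points are referred to by index into a list, a point is assumed to count as a member of its own Eps-neighborhood, and a distance exactly equal to Eps is assumed to be inside the neighborhood.

```python
from math import dist

def eps_neighborhood(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """Core point: its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q: q is core and p is within eps of q."""
    return is_core(points, q, eps, min_pts) and dist(points[p], points[q]) <= eps

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        if not is_core(points, cur, eps, min_pts):
            continue  # only core points can extend the chain
        for nxt in eps_neighborhood(points, cur, eps):
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
# With eps=1 and min_pts=3, points 1 and 2 are core; 0 and 3 are border; 4 is noise.
reachable = density_reachable(pts, 3, 1, eps=1, min_pts=3)
```

Note that the relation is not symmetric: in this example point 3 is density-reachable from point 1, but point 1 is not density-reachable from point 3, because point 3 is not a core point.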
Density Concepts
DBSCAN Algorithm
CURE
Clustering Using Representatives
Use many points to represent a cluster instead of only one
Points will be well scattered
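
One way to obtain well-scattered representatives is farthest-point selection, sketched below. The slide does not say how the points are chosen, so the selection rule and the function name `scattered_representatives` are assumptions made for illustration.

```python
from math import dist

def scattered_representatives(cluster, c):
    """Pick up to c well-scattered points from a cluster (a list of tuples)."""
    centroid = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    # Start with the point farthest from the centroid.
    reps = [max(cluster, key=lambda p: dist(p, centroid))]
    while len(reps) < min(c, len(cluster)):
        # Next: the point whose nearest already-chosen representative is farthest away.
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    return reps

line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
reps = scattered_representatives(line, 3)
# reps == [(0, 0), (4, 0), (2, 0)]: both ends first, then the middle
```

Several scattered points can trace the extent and shape of a cluster, which a single centroid cannot do.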
CURE Approach
CURE Algorithm
CURE for Large Databases