part ii - clustering© prentice hall1 clustering large db most clustering algorithms assume a large...

© Prentice Hall 1Part II - Clustering

Clustering Large DBClustering Large DB

Most clustering algorithms assume a large Most clustering algorithms assume a large data structure which is memory resident.data structure which is memory resident.

Clustering may be performed first on a Clustering may be performed first on a sample of the database then applied to the sample of the database then applied to the entire database.entire database.

AlgorithmsAlgorithms– BIRCHBIRCH– DBSCANDBSCAN– CURECURE


Desired Features for Large Desired Features for Large DatabasesDatabases

One scan (or less) of DBOne scan (or less) of DB OnlineOnline Suspendable, stoppable, resumableSuspendable, stoppable, resumable IncrementalIncremental Work with limited main memoryWork with limited main memory Different techniques to scan (e.g. Different techniques to scan (e.g.

sampling)sampling) Process each tuple onceProcess each tuple once


BIRCHBIRCH Balanced Iterative Reducing and Balanced Iterative Reducing and

Clustering using HierarchiesClustering using Hierarchies Incremental, hierarchical, one scanIncremental, hierarchical, one scan Save clustering information in a tree Save clustering information in a tree Each entry in the tree contains Each entry in the tree contains

information about one clusterinformation about one cluster New nodes inserted in closest entry in New nodes inserted in closest entry in

treetree


Clustering FeatureClustering Feature (N,LS,SS)(N,LS,SS)

– N: Number of points in clusterN: Number of points in cluster– LS: Sum of points in the clusterLS: Sum of points in the cluster– SS: Sum of squares of points in the clusterSS: Sum of squares of points in the cluster

CF TreeCF Tree– Balanced search treeBalanced search tree– Node has CF triple for each childNode has CF triple for each child– Leaf node represents cluster and has CF Leaf node represents cluster and has CF

value for each subcluster in it.value for each subcluster in it.– Subcluster has maximum diameterSubcluster has maximum diameter


BIRCH AlgorithmBIRCH Algorithm


Improve ClustersImprove Clusters


DBSCANDBSCAN

Density Based Spatial Clustering of Density Based Spatial Clustering of Applications with NoiseApplications with Noise

Outliers will not effect creation of cluster.Outliers will not effect creation of cluster. InputInput

– MinPts MinPts – minimum number of points in – minimum number of points in clustercluster

– EpsEps – for each point in cluster there must – for each point in cluster there must be another point in it less than this distance be another point in it less than this distance away.away.


DBSCAN Density ConceptsDBSCAN Density Concepts

Eps-neighborhood:Eps-neighborhood: Points within Eps Points within Eps distance of a point.distance of a point.

Core point:Core point: Eps-neighborhood dense enough Eps-neighborhood dense enough (MinPts)(MinPts)

Directly density-reachable:Directly density-reachable: A point p is A point p is directly density-reachable from a point q if the directly density-reachable from a point q if the distance is small (Eps) and q is a core point.distance is small (Eps) and q is a core point.

Density-reachable:Density-reachable: A point si density- A point si density-reachable form another point if there is a path reachable form another point if there is a path from one to the other consisting of only core from one to the other consisting of only core points.points.


Density ConceptsDensity Concepts


DBSCAN AlgorithmDBSCAN Algorithm


CURECURE

Clustering Using RepresentativesClustering Using Representatives Use many points to represent a cluster Use many points to represent a cluster

instead of only oneinstead of only one Points will be well scatteredPoints will be well scattered


CURE ApproachCURE Approach


CURE AlgorithmCURE Algorithm


CURE for Large DatabasesCURE for Large Databases

part ii - clustering© prentice hall1 clustering large db most clustering algorithms assume a large...

Documents