Part II - Clustering: Clustering Large DB


Upload: clinton-parsons

Post on 16-Dec-2015


Page 1

© Prentice Hall, Part II - Clustering

Clustering Large DB

Most clustering algorithms assume a large data structure which is memory resident.
Clustering may be performed first on a sample of the database, then applied to the entire database.
Algorithms
– BIRCH
– DBSCAN
– CURE
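The sample-then-apply idea on this slide can be sketched as follows. This is a minimal illustration, not the method of any of the three algorithms listed: it assumes a plain k-means step on the sample (the slide does not say which algorithm is run on the sample), and the names `kmeans` and `cluster_via_sample`, the data, and all parameter values are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a small, memory-resident list of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def cluster_via_sample(database, k, sample_size, seed=0):
    """Cluster a random sample, then label every tuple in a single pass."""
    rng = random.Random(seed)
    sample = rng.sample(database, sample_size)
    centers = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centers) for p in database]  # one scan of the data
    return centers, labels

# Hypothetical data: two well-separated blobs of 500 points each.
rng = random.Random(1)
db = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(500)]
db += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(500)]
centers, labels = cluster_via_sample(db, k=2, sample_size=50)
```

Only the 50 sampled points are clustered iteratively; the full database is touched once, when each tuple is assigned to its nearest center.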

Page 2

Desired Features for Large Databases

One scan (or less) of DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g. sampling)
Process each tuple once
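Several of these features (one scan, limited main memory, each tuple processed once, sampling) can be seen together in reservoir sampling. The slides do not name this technique; it is used here only as an assumed example of a scan that meets those constraints, and `reservoir_sample` is a hypothetical helper.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw a uniform random sample of k items in one pass over stream.

    Memory use is O(k) regardless of how long the stream is, and each
    item is examined exactly once.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k/(i+1)
    return reservoir

# The stream is an iterator, so it is never held in memory as a whole.
sample = reservoir_sample(iter(range(100_000)), k=5, seed=42)
```

Because the reservoir is always a valid sample of the tuples seen so far, the scan can also be stopped early and still return a usable result.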

Page 3

BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies
Incremental, hierarchical, one scan
Saves clustering information in a tree
Each entry in the tree contains information about one cluster
New points are inserted into the closest entry in the tree

Page 4

Clustering Feature

(N, LS, SS)
– N: Number of points in cluster
– LS: Sum of points in the cluster
– SS: Sum of squares of points in the cluster
CF Tree
– Balanced search tree
– Node has CF triple for each child
– Leaf node represents cluster and has CF value for each subcluster in it.
– Subcluster has maximum diameter
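The (N, LS, SS) triple is useful because it is additive: the CF of two merged subclusters is the component-wise sum of their CFs, and the centroid and diameter can be computed from the triple alone. The sketch below illustrates this under stated assumptions: SS is taken as a scalar (the sum of squared norms), diameter is the root-mean-square pairwise distance, and a flat list of leaf entries stands in for the balanced CF tree. The names `CF` and `insert` are hypothetical.

```python
from dataclasses import dataclass
import math

@dataclass
class CF:
    n: int       # N: number of points
    ls: tuple    # LS: per-dimension sum of the points
    ss: float    # SS: sum of squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CF additivity: the CF of a union is the component-wise sum."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def diameter(self):
        """Root-mean-square pairwise distance, computed from the CF only."""
        if self.n < 2:
            return 0.0
        ls2 = sum(x * x for x in self.ls)
        d2 = (2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1))
        return math.sqrt(max(d2, 0.0))  # guard against rounding below zero

def insert(leaf_entries, point, max_diameter):
    """Absorb point into the closest subcluster if the diameter bound
    still holds; otherwise start a new subcluster."""
    new = CF.from_point(point)
    if leaf_entries:
        i = min(range(len(leaf_entries)),
                key=lambda j: math.dist(leaf_entries[j].centroid(), point))
        merged = leaf_entries[i].merge(new)
        if merged.diameter() <= max_diameter:
            leaf_entries[i] = merged
            return
    leaf_entries.append(new)

entries = []
for p in [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (0.1, 0.1)]:
    insert(entries, p, max_diameter=1.0)
```

Each point is seen once and then discarded; only the triples are kept, which is what lets the structure fit in limited main memory.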

Page 5

BIRCH Algorithm

Page 6

Improve Clusters

Page 7

DBSCAN

Density Based Spatial Clustering of Applications with Noise
Outliers will not affect creation of cluster.
Input
– MinPts – minimum number of points in cluster
– Eps – for each point in cluster there must be another point in it less than this distance away.

Page 8

DBSCAN Density Concepts

Eps-neighborhood: Points within Eps distance of a point.
Core point: A point whose Eps-neighborhood is dense enough (contains at least MinPts points).
Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point.
Density-reachable: A point is density-reachable from another point if there is a path from one to the other consisting of only core points.
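The four definitions above translate almost directly into code. The following is a small sketch with brute-force distance checks; it assumes Euclidean distance, that a point counts as a member of its own Eps-neighborhood, and that "within Eps" means less than or equal to Eps. The function names and the sample points are hypothetical.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the Eps-neighborhood of q and q is a core point."""
    return (math.dist(points[p], points[q]) <= eps
            and is_core(points, q, eps, min_pts))

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        if cur == p:
            return True
        if not is_core(points, cur, eps, min_pts):
            continue  # only core points can extend the chain
        for nxt in eps_neighborhood(points, cur, eps):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# Four points on a line plus one far-away outlier (indices 0..4).
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 10)]
EPS, MIN_PTS = 1.1, 3
```

With these values, points 1 and 2 are core points while 0 and 3 are not. Point 0 is directly density-reachable from point 1, but not the other way round, which shows that the relation is not symmetric.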

Page 9

Density Concepts

Page 10

DBSCAN Algorithm
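The algorithm listing for this slide is not part of the transcript. As a stand-in, here is a minimal sketch of the standard DBSCAN procedure built on the MinPts and Eps inputs defined earlier. It is not the slide's own pseudocode: it assumes Euclidean distance, a brute-force neighborhood search, and a hypothetical `NOISE` label of -1.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)   # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE       # may later be claimed as a border point
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue             # already assigned
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also core: keep expanding
                queue.extend(nbrs)
    return labels

# Two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (8, 8), (8, 9), (9, 8), (9, 9),
        (4, 20)]
result = dbscan(data, eps=1.5, min_pts=3)
```

The isolated point never joins a cluster, which matches the slide's remark that outliers do not affect cluster creation. A real implementation would replace the brute-force `neighbors` search with a spatial index.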

Page 11

CURE

Clustering Using Representatives
Use many points to represent a cluster instead of only one
Points will be well scattered
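One way to pick well-scattered representative points is a farthest-point heuristic: start with the point farthest from the centroid, then repeatedly add the point farthest from the representatives chosen so far. The sketch below assumes that heuristic and Euclidean distance. The `shrink` step, which moves representatives toward the centroid, is part of CURE as published but is not mentioned on this slide; the function names and the value of `alpha` are hypothetical.

```python
import math

def scattered_representatives(cluster, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    distinct = list(dict.fromkeys(cluster))  # drop duplicates, keep order
    centroid = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    # Start with the point farthest from the centroid.
    reps = [max(distinct, key=lambda p: math.dist(p, centroid))]
    while len(reps) < min(c, len(distinct)):
        # Add the point whose distance to its nearest representative is largest.
        reps.append(max(
            (p for p in distinct if p not in reps),
            key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps

def shrink(reps, centroid, alpha):
    """Move each representative a fraction alpha toward the centroid."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
            for p in reps]

# Four corner points plus two interior points.
cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2), (2, 1)]
reps = scattered_representatives(cluster, c=4)
```

On this data the four corners are chosen, so the representatives trace the outline of the cluster, which a single centroid cannot do.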

Page 12

CURE Approach

Page 13

CURE Algorithm

Page 14

CURE for Large Databases