An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim


Page 1:

An Efficient Approach to Clustering in Large Multimedia Databases with Noise

Alexander Hinneburg and Daniel A. Keim

Page 2:

Outline

- Multimedia data
- Density-based clustering
- Influence and density functions
- Center-defined vs. arbitrary-shape clusters
- Comparison with other algorithms
- Algorithm
- What can we learn / have we learned?

Page 3:

Multimedia Data Examples

- Images
- CAD
- Geographic data
- Molecular biology

High-dimensional feature vectors:
- Color histograms
- Shape descriptors
- Fourier vectors

Page 4:

Density-Based Clustering (loose definition)

- Clusters are defined by a high density of points: many points with the same combination of attribute values.
- Is density irrelevant for other methods? No! Most methods look for dense areas; DENCLUE uses density directly.

Page 5:

Density-Based Clustering (stricter definition)

- Closeness to a dense area is the only criterion for cluster membership.
- DENCLUE has two variants:
  - Arbitrary-shaped clusters: similar to other density-based methods.
  - Center-defined clusters: similar to distance-based methods.

Page 6:

Idea

- Each data point has an influence that extends over a range: its influence function.
- Adding up the influence functions of all data points yields the density function.
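This idea can be sketched in a few lines (my own illustration, not the authors' code; function names are hypothetical, and the Gaussian influence function is one of the choices discussed in the paper):

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian kernel)."""
    d = np.linalg.norm(np.asarray(x) - np.asarray(y))
    return np.exp(-d**2 / (2 * sigma**2))

def density(x, data, sigma):
    """Density function: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Two tight groups of 1-D points; the density is high near each group
# and low in the empty region between them.
data = [0.0, 0.1, 0.2, 5.0, 5.1]
print(density([0.1], data, sigma=0.5) > density([2.5], data, sigma=0.5))  # True
```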

Page 7:

Influence Functions

Page 8:

Definitions

- Density attractor x*: a local maximum of the density function.
- Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (in the case of a continuous and differentiable influence function).
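A hedged sketch of finding a density attractor by hill-climbing (my own code, not the paper's; it uses the Gaussian influence function and the fixed step size σ/2 that the slides mention later for the k-means comparison):

```python
import numpy as np

def density(x, data, sigma):
    """Gaussian density at x: the sum of all influence functions."""
    d = np.linalg.norm(data - x, axis=1)
    return np.exp(-d**2 / (2 * sigma**2)).sum()

def gradient(x, data, sigma):
    """Analytic gradient of the Gaussian density function."""
    diff = data - x
    w = np.exp(-np.linalg.norm(diff, axis=1)**2 / (2 * sigma**2))
    return (w[:, None] * diff).sum(axis=0) / sigma**2

def find_attractor(x, data, sigma, max_iter=1000):
    """Climb uphill with a fixed step delta = sigma/2 in the gradient
    direction; stop as soon as the density no longer increases."""
    x = np.asarray(x, dtype=float)
    delta = sigma / 2
    best, best_d = x, density(x, data, sigma)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        gn = np.linalg.norm(g)
        if gn == 0:
            break
        x = x + delta * g / gn
        d = density(x, data, sigma)
        if d <= best_d:        # density stopped increasing: attractor reached
            break
        best, best_d = x, d
    return best

data = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
print(find_attractor([0.9], data, sigma=0.5))  # lands near the cluster around 0.2
```

All points that climb to the same attractor form one center-defined cluster.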

Page 9:

Center-Defined Clusters

- All points that are density-attracted to a given density attractor x*.
- The density function at the maximum must exceed the threshold ξ.
- Points that are attracted to smaller maxima are considered outliers.

Page 10:

Arbitrary-Shape Clusters

- Merges center-defined clusters if a path exists between them along which the density function continuously exceeds ξ.

Page 11:

Examples

Page 12:

Noise Invariance

- The density distribution of the noise is constant, so it has no influence on the number and location of the attractors.

Claim:
- The number of density attractors with and without noise is the same.
- The probability that they are identical goes to 1 for large amounts of noise.

Page 13:

Parameter Choices

- Choice of σ: try different values of σ and determine the largest interval with a constant number of clusters.
- Choice of ξ: greater than the noise level, but smaller than the smallest relevant maxima.
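The σ-selection heuristic can be illustrated with a small 1-D sketch (my own code, under the simplifying assumption that counting density maxima on a grid stands in for a full clustering run):

```python
import numpy as np

def n_clusters(data, sigma, xi, grid):
    """Count the local maxima of the Gaussian density that exceed xi."""
    dens = np.array([np.exp(-(g - data)**2 / (2 * sigma**2)).sum() for g in grid])
    peaks = [i for i in range(1, len(grid) - 1)
             if dens[i] > dens[i - 1] and dens[i] > dens[i + 1] and dens[i] > xi]
    return len(peaks)

data = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
grid = np.linspace(-1, 6, 701)
for sigma in (0.1, 0.5, 1.0, 3.0):
    # the cluster count stays constant over a wide sigma interval,
    # then drops once sigma is large enough to merge the two groups
    print(sigma, n_clusters(data, sigma, xi=0.5, grid=grid))
```

Choosing σ inside the widest interval with a constant count follows the rule on this slide.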

Page 14:

Comparison with DBSCAN

Corresponding setup:
- A square-wave influence function with radius σ models the neighborhood in DBSCAN.
- The definition of core objects in DBSCAN involves MinPts <=> ξ.
- "Density-reachable" in DBSCAN becomes "density-attracted" in DENCLUE (!?)
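The correspondence can be shown in a few lines of 1-D code (my own illustration, not from the paper): with a square-wave influence function, the density at a point is exactly the size of its σ-neighborhood, which is the count DBSCAN compares against MinPts.

```python
import numpy as np

def square_wave_density(x, data, sigma):
    """Square-wave influence: each data point contributes 1 if it lies
    within sigma of x, else 0. The resulting density is the size of the
    sigma-neighborhood of x -- the quantity DBSCAN checks for core objects."""
    d = np.abs(np.asarray(data) - x)
    return int((d <= sigma).sum())

data = [0.0, 0.1, 0.2, 5.0]
print(square_wave_density(0.1, data, sigma=0.3))  # 3
```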

Page 15:

Comparison with k-means

Corresponding setup:
- Gaussian influence function
- Step size for hill-climbing δ = σ/2

Claim:
- In DENCLUE, σ can be chosen such that k clusters are found.
- The DENCLUE result then corresponds to a global optimum of k-means.

Page 16:

Comparison with Hierarchical Methods

- Start with a very small σ to get the largest number of clusters.
- Increasing σ will merge clusters.
- Finally, a single density attractor remains.

Page 17:

Algorithm

Step 1: Construct a map of the data points.
- Uses hypercubes with edge length 2σ.
- Only populated cubes are saved.

Step 2: Determine the density attractors for all points using hill-climbing.
- Keeps track of the paths that have been taken and of the points close to them.

Page 18:

Local Density Function

- The influence function of "near" points contributes fully; far-away points are ignored.
- For the Gaussian influence function, the cut-off is chosen as 4σ.
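A small 1-D sketch of the local density function (my own code, not the paper's):

```python
import numpy as np

def local_density(x, data, sigma):
    """Local density: only points within the 4*sigma cut-off contribute;
    their Gaussian influence is counted in full, all others are ignored."""
    d = np.abs(np.asarray(data, dtype=float) - x)
    near = d[d <= 4 * sigma]                  # the 4*sigma cut-off
    return np.exp(-near**2 / (2 * sigma**2)).sum()

data = [0.0, 0.1, 0.2, 50.0]
# The point at 50.0 lies far beyond 4*sigma = 2.0 and contributes nothing.
print(local_density(0.1, data, sigma=0.5))
```

Ignoring far points introduces only a tiny error (a Gaussian at 4σ is already below e⁻⁸ of its peak) while drastically cutting the work per density evaluation.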

Page 19:

Step 1: Constructing the Map

Hypercubes contain:
- the number of data points,
- pointers to the data points,
- the sum of the data values (for computing the mean).

- Populated hypercubes are saved in a B+-tree.
- Neighboring populated cubes are connected for fast access.
- This is limited to highly populated cubes, as derived from the outlier criterion.
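A toy version of the map (my own sketch: a plain dict stands in for the B+-tree, and the field names are hypothetical):

```python
from collections import defaultdict

def build_cube_map(points, sigma):
    """Assign each point to a hypercube of edge length 2*sigma, keyed by
    its integer grid coordinates. Each cube stores the point count, the
    points themselves, and the coordinate-wise sum (for the cube mean).
    Only populated cubes ever appear in the map."""
    cubes = defaultdict(lambda: {"count": 0, "points": [], "sum": None})
    for p in points:
        key = tuple(int(c // (2 * sigma)) for c in p)   # cube coordinates
        cube = cubes[key]
        cube["count"] += 1
        cube["points"].append(p)
        if cube["sum"] is None:
            cube["sum"] = list(p)
        else:
            cube["sum"] = [a + b for a, b in zip(cube["sum"], p)]
    return dict(cubes)

pts = [(0.1, 0.1), (0.2, 0.3), (5.0, 5.0)]
cubes = build_cube_map(pts, sigma=0.5)
print(len(cubes))  # 2 populated cubes; empty cubes are never stored
```

A real implementation would additionally link each highly populated cube to its populated neighbors, as described above.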

Page 20:

Step 2: Clustering Step

- Uses only the highly populated cubes and the cubes connected to them.
- Hill-climbing based on the local density function and its gradient.
- Points within σ/2 of each hill-climbing path are attached to the resulting clusters as well.

Page 21:

Time Complexity / Efficiency

- Worst case, for N data points: O(N log N).
- Average case (without building the data structure?): O(log N).
- Explanation: only highly populated areas are considered.
- Up to 45 times faster than DBSCAN.

Page 22:

Application to Molecular Biology

- Simulation of a small but flexible peptide.
- Each conformation is a point in a 19-dimensional angle space.
- The pharmaceutical industry is interested in stable conformations.
- Non-stable conformations make up more than 50 percent of the data => noise.

Page 23:

What can we learn?

The algorithm is fast for two reasons:
- An efficient data structure: data points that are close in attribute space are stored together (similar to P-trees: fast access to the data based on attribute values).
- The optimization problem is inherently linear in the search space, whereas the k-medoids problem is quadratic!

Page 24:

Why is k-medoids quadratic in the search space?

Review: the cost function is calculated as the sum of the squared distances within each cluster. That is, the cost associated with each cluster center depends on all the other cluster centers! This can be viewed as an influence function that depends on the cluster boundaries.

Page 25:

Cost Functions

- k-medoids
- DENCLUE
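The formulas for the two cost functions did not survive in this transcript; in their standard textbook forms (my reconstruction, so the notation may differ from the original slide: m_j are the medoids, σ the smoothness parameter), they are:

```latex
% k-medoids: minimize the sum of squared distances to the nearest medoid
E = \sum_{i=1}^{N} \min_{1 \le j \le k} d(x_i, m_j)^2

% DENCLUE: climb to the local maxima of the Gaussian density function
f^{\mathrm{Gauss}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}
```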

Page 26:

Motivating a Gaussian Influence Function

- Why not use a parabola as the influence function? It has only one minimum (the mean of the data set), so we need a cut-off.
- The k-medoids cut-off depends on the cluster centers.
- A cut-off independent of the cluster centers? The Gaussian function!

Page 27:

Is DENCLUE only an Approximation to k-medoids?

Not necessarily:
- Minimizing the squared distance is a fundamental measure, but not the only one.
- Why should the "influence" depend on the density of the points? The "influence" may be determined by the system.

Page 28:

If DENCLUE is so good, can we still improve it?

- It needs a special data structure, and the hypercubes map out all of the space (a density-based idea).
- A distance-based version could look for the cluster centers only:
  - it allows using a promising starting point,
  - and it defines the partitions by proximity.

Page 29:

Conclusion

- The DENCLUE paper contains many fundamentally valuable ideas.
- The data structure is efficient.
- The algorithm is related to, but much more efficient than, k-medoids.