An Efficient Approach to Clustering in Large Multimedia Databases with Noise
Alexander Hinneburg and Daniel A. Keim
Outline
- Multimedia data
- Density-based clustering
- Influence and density functions
- Center-defined vs. arbitrary-shape clusters
- Comparison with other algorithms
- Algorithm
- What can we learn / have we learned?
Multimedia Data Examples
- Images, CAD, geographic data, molecular biology
- High-dimensional feature vectors: color histograms, shape descriptors, Fourier vectors
Density-Based Clustering (loose definition)
- Clusters are defined by a high density of points, i.e., many points with the same combination of attribute values
- Is density irrelevant for other methods? No! Most methods look for dense areas; DENCLUE uses density directly
Density-Based Clustering (stricter definition)
- Closeness to a dense area is the only criterion for cluster membership
- DENCLUE has two variants:
- Arbitrary-shaped clusters, similar to other density-based methods
- Center-defined clusters, similar to distance-based methods
Idea
- Each data point has an influence that extends over a range: the influence function
- Adding up the influence functions of all data points yields the density function
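The two definitions above can be sketched in a few lines of Python. The Gaussian influence function used here is one of the choices discussed in the paper; all names and the sample data are illustrative:

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian influence function)."""
    d = math.dist(x, y)
    return math.exp(-d * d / (2 * sigma * sigma))

def density(x, data, sigma):
    """Density function at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Two nearby points reinforce each other; a far-away point barely contributes.
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(density((0.05, 0.0), data, sigma=1.0))
```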
Influence Functions
Definitions
- Density attractor x*: a local maximum of the density function
- Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (in the case of a continuous and differentiable influence function)
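A hedged sketch of how a density attractor can be reached by following the gradient uphill. This uses normalized-gradient ascent on the Gaussian density, a simplification of the paper's hill-climbing; the fixed step size and stopping rule are my assumptions:

```python
import math

def density_and_gradient(x, data, sigma):
    """Gaussian density at x and its gradient with respect to x."""
    dens, grad = 0.0, [0.0] * len(x)
    for y in data:
        w = math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
        dens += w
        for i in range(len(x)):
            grad[i] += w * (y[i] - x[i]) / sigma ** 2
    return dens, grad

def find_attractor(x, data, sigma, step=0.1, max_iter=1000):
    """Climb uphill from x until the density stops increasing;
    the end point approximates the density attractor x*."""
    x = list(x)
    dens, grad = density_and_gradient(x, data, sigma)
    for _ in range(max_iter):
        norm = math.hypot(*grad) or 1.0
        x_new = [xi + step * gi / norm for xi, gi in zip(x, grad)]
        dens_new, grad_new = density_and_gradient(x_new, data, sigma)
        if dens_new <= dens:
            break  # overshot the maximum: stop
        x, dens, grad = x_new, dens_new, grad_new
    return tuple(x), dens
```

Points whose climbs end at (approximately) the same x* are density-attracted to it.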
Center-Defined Clusters
- All points that are density-attracted to a given density attractor x*
- The density function at the maximum must exceed the threshold ξ
- Points that are attracted to smaller maxima are considered outliers
Arbitrary-Shape Clusters
- Merges center-defined clusters if a path exists between them along which the density function continuously exceeds ξ
Examples
Noise Invariance
- The density distribution of the noise is constant
- Noise has no influence on the number and location of the density attractors
Claim
- The number of density attractors with or without noise is the same
- The probability that they are identical goes to 1 for large amounts of noise
Parameter Choices
- Choice of σ: run the algorithm with different values of σ and determine the largest interval with a constant number of clusters
- Choice of ξ: greater than the noise level, smaller than the smallest relevant maxima
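The suggested procedure for choosing σ can be sketched in one dimension: count the density maxima for a range of σ values and pick a value from the widest interval where the count stays constant. A minimal sketch; the grid search and the sample data are illustrative, not the paper's method:

```python
import math

def density(x, data, sigma):
    """1-D Gaussian density function."""
    return sum(math.exp(-(x - y) ** 2 / (2 * sigma ** 2)) for y in data)

def count_clusters(data, sigma, grid_step=0.05):
    """Count local maxima of the density function on a 1-D grid;
    each maximum is a density attractor, i.e., one cluster."""
    lo, hi = min(data) - 1.0, max(data) + 1.0
    xs = [lo + i * grid_step for i in range(int((hi - lo) / grid_step) + 1)]
    ds = [density(x, data, sigma) for x in xs]
    return sum(1 for i in range(1, len(ds) - 1)
               if ds[i - 1] < ds[i] >= ds[i + 1])

# Two tight groups: a wide range of sigma values yields 2 clusters,
# while a very large sigma merges everything into one.
data = [0.0, 0.1, 0.2, 4.0, 4.1, 4.2]
for sigma in (0.1, 0.3, 0.5, 1.0, 3.0):
    print(sigma, count_clusters(data, sigma))
```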
Comparison with DBSCAN
- Corresponding setup: a square-wave influence function, whose radius σ models the neighborhood in DBSCAN
- The definition of core objects in DBSCAN involves MinPts, corresponding to ξ
- "Density-reachable" in DBSCAN becomes "density-attracted" in DENCLUE (!?)
Comparison with k-means
- Corresponding setup: Gaussian influence function, step size for the hill-climbing of σ/2
- Claim: in DENCLUE, σ can be chosen such that k clusters are found
- The DENCLUE result then corresponds to a global optimum in k-means
Comparison with Hierarchical Methods
- Start with a very small σ to get the largest number of clusters
- Increasing σ will merge clusters, until finally only one density attractor remains
Algorithm
- Step 1: Construct a map of the data points; uses hypercubes with edge length 2σ; only populated cubes are saved
- Step 2: Determine the density attractors of all points using hill-climbing; keeps track of the paths that have been taken and of the points close to them
Local Density Function
- The influence function of "near" points contributes fully; far-away points are ignored
- For the Gaussian influence function, the cut-off is chosen as 4σ
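A sketch of this locality trick, assuming the Gaussian influence function: with a 4σ cut-off, each ignored point would have contributed at most exp(-8) ≈ 3e-4, so the local density is a close approximation of the full one:

```python
import math

def local_density(x, data, sigma):
    """Density at x using only points within the 4*sigma cut-off;
    each ignored point would contribute less than exp(-8)."""
    return sum(
        math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
        for y in data
        if math.dist(x, y) <= 4 * sigma
    )
```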
Step 1: Constructing the Map
- Each hypercube stores the number of data points, pointers to the data points, and the sum of the data values (for computing the mean)
- Populated hypercubes are saved in a B+ tree
- Neighboring populated cubes are connected for fast access
- Limited to highly populated cubes, as derived from the outlier criterion
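A minimal sketch of the map construction, assuming the cube edge length of 2σ from the algorithm slide. A plain dict keyed by integer cube coordinates stands in for the paper's B+ tree, and the population threshold for "highly populated" is illustrative:

```python
import math
from collections import defaultdict

def build_map(data, sigma, min_count=2):
    """Assign each point to the hypercube of edge length 2*sigma that
    contains it; only populated cubes are stored. Each cube keeps the
    count, the points, and the coordinate-wise sum (for the mean)."""
    cubes = defaultdict(lambda: {"count": 0, "points": [], "sum": None})
    for p in data:
        key = tuple(math.floor(c / (2 * sigma)) for c in p)
        cube = cubes[key]
        cube["count"] += 1
        cube["points"].append(p)
        cube["sum"] = (p if cube["sum"] is None
                       else tuple(a + b for a, b in zip(cube["sum"], p)))
    # Keep only "highly populated" cubes (stand-in for the outlier criterion).
    highly = {k: v for k, v in cubes.items() if v["count"] >= min_count}
    return dict(cubes), highly
```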
Step 2: Clustering Step
- Uses only the highly populated cubes and the cubes connected to them
- Hill-climbing based on the local density function and its gradient
- Points within σ/2 of each hill-climbing path are attached to the clusters as well
Time Complexity / Efficiency
- Worst case, for N data points: O(N log N)
- Average case (without building the data structure?): O(log N)
- Explanation: only highly populated areas are considered
- Up to 45 times faster than DBSCAN
Application to Molecular Biology
- Simulation of a small but flexible peptide
- Each conformation is a point in a 19-dimensional angle space
- The pharmaceutical industry is interested in stable conformations
- Non-stable conformations make up more than 50 percent => noise
What can we learn?
- The algorithm is fast for two reasons
- Efficient data structure: data points that are close in attribute space are stored together
- Similar to P-trees: fast access to the data, based on attribute values
- The optimization problem is inherently linear in the search space; the k-medoids problem is quadratic!
Why is k-medoids quadratic in the search space?
- Review: the cost function is calculated as the sum of the squared distances within each cluster
- I.e., the cost associated with each cluster center depends on all other cluster centers!
- This can be viewed as an influence function that depends on the cluster boundaries
Cost Functions: k-medoids vs. DENCLUE
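The contrast this slide points at can be made concrete in code: in k-medoids, a point's cost term contains a min over all centers, so each center's cost depends on where the other centers are, while the DENCLUE density at a point is an unconditional sum over the data. A hedged sketch (function names are mine, not the paper's):

```python
import math

def kmedoids_cost(data, medoids):
    """Sum of squared distances from each point to its *nearest* medoid.
    The min couples the medoids: moving one can change every assignment."""
    return sum(min(math.dist(p, m) ** 2 for m in medoids) for p in data)

def denclue_density(x, data, sigma):
    """Density at x: a plain sum over the data points, with no
    dependence on any other cluster center."""
    return sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)
```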
Motivating a Gaussian Influence Function
- Why not use a parabola as the influence function? It has only one minimum (the mean of the data set), and we need a cut-off
- The k-medoids cut-off depends on the cluster centers
- A cut-off independent of the cluster centers? The Gaussian function!
Is DENCLUE Only an Approximation to k-medoids?
- Not necessarily: minimizing the squared distance is a fundamental measure, but not the only one
- Why should "influence" depend on the density of points? "Influence" may be determined by the system
If DENCLUE Is So Good, Can We Still Improve It?
- It needs a special data structure that maps out all of the space
- Density-based idea: a distance-based version could look for cluster centers only
- This allows using a promising starting point and defining partitions by proximity
Conclusion
- The DENCLUE paper contains many fundamentally valuable ideas
- The data structure is efficient
- The algorithm is related to, but much more efficient than, k-medoids