An Efficient Approach to Clustering in Large Multimedia Databases with Noise
Alexander Hinneburg and Daniel A. Keim
Outline
- Multimedia data
- Density-based clustering
- Influence and density functions
- Center-defined vs. arbitrary-shape clusters
- Comparison with other algorithms
- Algorithm
- What can we learn / have we learned?
Multimedia Data Examples
- Images, CAD, geographic data, molecular biology
- High-dimensional feature vectors: color histograms, shape descriptors, Fourier vectors
Density-Based Clustering (loose definition)
- Clusters are defined by a high density of points, i.e., many points with the same combination of attribute values
- Is density irrelevant for other methods? No! Most methods look for dense areas; DENCLUE uses density directly
Density-Based Clustering (stricter definition)
- Closeness to a dense area is the only criterion for cluster membership
- DENCLUE has two variants:
- Arbitrary-shaped clusters, similar to other density-based methods
- Center-defined clusters, similar to distance-based methods
Idea
- Each data point has an influence that extends over a range: the influence function
- Adding up the influence functions of all data points yields the density function
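The two definitions above can be sketched in a few lines of Python. The Gaussian influence function used here is one of the choices discussed in the paper; all names and the sample data are illustrative:

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian influence function)."""
    d = math.dist(x, y)
    return math.exp(-d * d / (2 * sigma * sigma))

def density(x, data, sigma):
    """Density function at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Two nearby points reinforce each other; a far-away point barely contributes.
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(density((0.05, 0.0), data, sigma=1.0))
```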
Influence Functions
Definitions
- Density attractor x*: a local maximum of the density function
- Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (in the case of a continuous and differentiable influence function)
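A hedged sketch of how a density attractor can be reached by following the gradient uphill. This uses normalized-gradient ascent on the Gaussian density, a simplification of the paper's hill-climbing; the fixed step size and stopping rule are my assumptions:

```python
import math

def density_and_gradient(x, data, sigma):
    """Gaussian density at x and its gradient with respect to x."""
    dens, grad = 0.0, [0.0] * len(x)
    for y in data:
        w = math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
        dens += w
        for i in range(len(x)):
            grad[i] += w * (y[i] - x[i]) / sigma ** 2
    return dens, grad

def find_attractor(x, data, sigma, step=0.1, max_iter=1000):
    """Climb uphill from x until the density stops increasing;
    the end point approximates the density attractor x*."""
    x = list(x)
    dens, grad = density_and_gradient(x, data, sigma)
    for _ in range(max_iter):
        norm = math.hypot(*grad) or 1.0
        x_new = [xi + step * gi / norm for xi, gi in zip(x, grad)]
        dens_new, grad_new = density_and_gradient(x_new, data, sigma)
        if dens_new <= dens:
            break  # overshot the maximum: stop
        x, dens, grad = x_new, dens_new, grad_new
    return tuple(x), dens
```

Points whose climbs end at (approximately) the same x* are density-attracted to it.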
Center-Defined Clusters
- All points that are density-attracted to a given density attractor x*
- The density function at the maximum must exceed the threshold ξ
- Points that are attracted to smaller maxima are considered outliers
Arbitrary-Shape Clusters
- Merges center-defined clusters if a path exists between them along which the density function continuously exceeds ξ
Examples
Noise Invariance
- The density distribution of the noise is constant
- Noise has no influence on the number and location of the density attractors
Claim
- The number of density attractors with or without noise is the same
- The probability that they are identical goes to 1 for large amounts of noise
Parameter Choices
- Choice of σ: run the algorithm with different values of σ and determine the largest interval with a constant number of clusters
- Choice of ξ: greater than the noise level, smaller than the smallest relevant maxima
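The suggested procedure for choosing σ can be sketched in one dimension: count the density maxima for a range of σ values and pick a value from the widest interval where the count stays constant. A minimal sketch; the grid search and the sample data are illustrative, not the paper's method:

```python
import math

def density(x, data, sigma):
    """1-D Gaussian density function."""
    return sum(math.exp(-(x - y) ** 2 / (2 * sigma ** 2)) for y in data)

def count_clusters(data, sigma, grid_step=0.05):
    """Count local maxima of the density function on a 1-D grid;
    each maximum is a density attractor, i.e., one cluster."""
    lo, hi = min(data) - 1.0, max(data) + 1.0
    xs = [lo + i * grid_step for i in range(int((hi - lo) / grid_step) + 1)]
    ds = [density(x, data, sigma) for x in xs]
    return sum(1 for i in range(1, len(ds) - 1)
               if ds[i - 1] < ds[i] >= ds[i + 1])

# Two tight groups: a wide range of sigma values yields 2 clusters,
# while a very large sigma merges everything into one.
data = [0.0, 0.1, 0.2, 4.0, 4.1, 4.2]
for sigma in (0.1, 0.3, 0.5, 1.0, 3.0):
    print(sigma, count_clusters(data, sigma))
```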
Comparison with DBSCAN
- Corresponding setup: a square-wave influence function, whose radius σ models the neighborhood in DBSCAN
- The definition of core objects in DBSCAN involves MinPts, corresponding to ξ
- "Density-reachable" in DBSCAN becomes "density-attracted" in DENCLUE (!?)
Comparison with k-means
- Corresponding setup: Gaussian influence function, step size for the hill-climbing of σ/2
- Claim: in DENCLUE, σ can be chosen such that k clusters are found
- The DENCLUE result then corresponds to a global optimum in k-means
Comparison with Hierarchical Methods
- Start with a very small σ to get the largest number of clusters
- Increasing σ will merge clusters, until finally only one density attractor remains
Algorithm
- Step 1: Construct a map of the data points; uses hypercubes with edge length 2σ; only populated cubes are saved
- Step 2: Determine the density attractors of all points using hill-climbing; keeps track of the paths that have been taken and of the points close to them
Local Density Function
- The influence function of "near" points contributes fully; far-away points are ignored
- For the Gaussian influence function, the cut-off is chosen as 4σ
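A sketch of this locality trick, assuming the Gaussian influence function: with a 4σ cut-off, each ignored point would have contributed at most exp(-8) ≈ 3e-4, so the local density is a close approximation of the full one:

```python
import math

def local_density(x, data, sigma):
    """Density at x using only points within the 4*sigma cut-off;
    each ignored point would contribute less than exp(-8)."""
    return sum(
        math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
        for y in data
        if math.dist(x, y) <= 4 * sigma
    )
```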
Step 1: Constructing the Map
- Each hypercube stores the number of data points, pointers to the data points, and the sum of the data values (for computing the mean)
- Populated hypercubes are saved in a B+ tree
- Neighboring populated cubes are connected for fast access
- Limited to highly populated cubes, as derived from the outlier criterion
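A minimal sketch of the map construction, assuming the cube edge length of 2σ from the algorithm slide. A plain dict keyed by integer cube coordinates stands in for the paper's B+ tree, and the population threshold for "highly populated" is illustrative:

```python
import math
from collections import defaultdict

def build_map(data, sigma, min_count=2):
    """Assign each point to the hypercube of edge length 2*sigma that
    contains it; only populated cubes are stored. Each cube keeps the
    count, the points, and the coordinate-wise sum (for the mean)."""
    cubes = defaultdict(lambda: {"count": 0, "points": [], "sum": None})
    for p in data:
        key = tuple(math.floor(c / (2 * sigma)) for c in p)
        cube = cubes[key]
        cube["count"] += 1
        cube["points"].append(p)
        cube["sum"] = (p if cube["sum"] is None
                       else tuple(a + b for a, b in zip(cube["sum"], p)))
    # Keep only "highly populated" cubes (stand-in for the outlier criterion).
    highly = {k: v for k, v in cubes.items() if v["count"] >= min_count}
    return dict(cubes), highly
```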
Step 2: Clustering Step
- Uses only the highly populated cubes and the cubes connected to them
- Hill-climbing based on the local density function and its gradient
- Points within σ/2 of each hill-climbing path are attached to the clusters as well
Time Complexity / Efficiency
- Worst case, for N data points: O(N log N)
- Average case (without building the data structure?): O(log N)
- Explanation: only highly populated areas are considered
- Up to 45 times faster than DBSCAN
Application to Molecular Biology
- Simulation of a small but flexible peptide
- Each conformation is a point in a 19-dimensional angle space
- The pharmaceutical industry is interested in stable conformations
- Non-stable conformations make up more than 50 percent => noise
What can we learn?
- The algorithm is fast for two reasons
- Efficient data structure: data points that are close in attribute space are stored together
- Similar to P-trees: fast access to the data, based on attribute values
- The optimization problem is inherently linear in the search space; the k-medoids problem is quadratic!
Why is k-medoids quadratic in the search space?
- Review: the cost function is calculated as the sum of the squared distances within each cluster
- I.e., the cost associated with each cluster center depends on all other cluster centers!
- This can be viewed as an influence function that depends on the cluster boundaries
Cost Functions: k-medoids vs. DENCLUE
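The contrast this slide points at can be made concrete in code: in k-medoids, a point's cost term contains a min over all centers, so each center's cost depends on where the other centers are, while the DENCLUE density at a point is an unconditional sum over the data. A hedged sketch (function names are mine, not the paper's):

```python
import math

def kmedoids_cost(data, medoids):
    """Sum of squared distances from each point to its *nearest* medoid.
    The min couples the medoids: moving one can change every assignment."""
    return sum(min(math.dist(p, m) ** 2 for m in medoids) for p in data)

def denclue_density(x, data, sigma):
    """Density at x: a plain sum over the data points, with no
    dependence on any other cluster center."""
    return sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)
```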
Motivating a Gaussian Influence Function
- Why not use a parabola as the influence function? It has only one minimum (the mean of the data set), and we need a cut-off
- The k-medoids cut-off depends on the cluster centers
- A cut-off independent of the cluster centers? The Gaussian function!
Is DENCLUE Only an Approximation to k-medoids?
- Not necessarily: minimizing the squared distance is a fundamental measure, but not the only one
- Why should "influence" depend on the density of points? "Influence" may be determined by the system
If DENCLUE Is So Good, Can We Still Improve It?
- It needs a special data structure that maps out all of the space
- Density-based idea: a distance-based version could look for cluster centers only
- This allows using a promising starting point and defining partitions by proximity
Conclusion
- The DENCLUE paper contains many fundamentally valuable ideas
- The data structure is efficient
- The algorithm is related to, but much more efficient than, k-medoids