Clustering and Indexing in High-dimensional Spaces
DESCRIPTION
A slide deck on clustering and indexing in high-dimensional spaces. Outline: CLIQUE, GDR and LDR. CLIQUE (Clustering In QUEst), by Agrawal, Gehrke, Gunopulos, and Raghavan (SIGMOD'98), automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
Outline
• CLIQUE
• GDR and LDR
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
• Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space
• CLIQUE can be considered both density-based and grid-based:
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
– A cluster is a maximal set of connected dense units within a subspace
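As an illustration of the grid-and-density idea, here is a minimal sketch (not the authors' code; the function name and parameters are hypothetical) of finding dense units along a single dimension:

```python
from collections import Counter

def dense_units_1d(values, n_intervals, lo, hi, tau):
    """Partition one dimension of the data space into equal-length
    intervals and return the indices of the intervals whose fraction
    of the total points exceeds the density threshold tau."""
    width = (hi - lo) / n_intervals
    counts = Counter(
        min(int((v - lo) / width), n_intervals - 1)  # clamp v == hi into last cell
        for v in values
    )
    n = len(values)
    return {i for i, c in counts.items() if c / n > tau}

# Three of five points fall in interval 5 ([0.5, 0.6)), so only it is dense.
print(dense_units_1d([0.1, 0.5, 0.51, 0.52, 0.9], 10, 0.0, 1.0, 0.4))  # → {5}
```

Higher-dimensional dense units are intersections of such intervals, one per dimension of the subspace.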
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters:
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest
• Generate a minimal description for the clusters:
– Determine maximal regions that cover each cluster of connected dense units
– Determine a minimal cover for each cluster
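The Apriori principle here means a unit can be dense in a k-dimensional subspace only if its projections into all (k-1)-dimensional subspaces are dense. A minimal sketch of the resulting candidate-generation step (hypothetical names; subspaces represented as sorted tuples of dimension indices):

```python
from itertools import combinations

def candidate_subspaces(dense_k):
    """Apriori-style join: combine two k-dim dense subspaces sharing their
    first k-1 dimensions, then prune candidates that have any non-dense
    k-dim subset."""
    dense = set(dense_k)
    cands = set()
    for a, b in combinations(sorted(dense), 2):
        if a[:-1] == b[:-1]:                      # join condition
            cand = a + (b[-1],)
            if all(s in dense for s in combinations(cand, len(a))):
                cands.add(cand)                   # survives the prune step
    return cands

# All three 1-d subspaces are dense, so every 2-d pair is a candidate.
print(candidate_subspaces({(0,), (1,), (2,)}))   # → {(0, 1), (0, 2), (1, 2)}
# (1, 2) is not dense, so the 3-d candidate (0, 1, 2) is pruned.
print(candidate_subspaces({(0, 1), (0, 2)}))     # → set()
```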
[Figure: grid partitions of Salary (10,000) vs. age and Vacation (week) vs. age, age 20–60; dense units with density threshold = 3 are intersected to find a cluster in the (age, Salary, Vacation) subspace]
Strength and Weakness of CLIQUE
• Strength
– It automatically finds the subspaces of highest dimensionality such that high-density clusters exist in those subspaces
– It is insensitive to the order of records in the input and does not presume any canonical data distribution
– It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded for the sake of the method's simplicity
High Dimensional Indexing Techniques
• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-tree, Hybrid Tree)
– Sequential scan performs better at high dimensionality (the dimensionality curse)
• Dimensionality reduction (e.g., Principal Component Analysis (PCA)), then build the index on the reduced space
Global Dimensionality Reduction (GDR)
[Figure: globally correlated vs. uncorrelated data, each with its first principal component (PC)]
• Works well only when the data is globally correlated
• Otherwise, too many false positives result in high query cost
• Solution: find local correlations instead of a global correlation
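In its simplest form, GDR projects every point onto the first few global PCs. A self-contained sketch of reducing to one global dimension (plain Python with power iteration standing in for a library PCA; all names are hypothetical):

```python
def first_pc(points, iters=100):
    """Power iteration on the mean-centered covariance matrix to
    approximate the first principal component of d-dim points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    x = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(r[i] * r[j] for r in x) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return mean, v

def gdr_project(points):
    """Globally reduce to 1 dimension: scalar projection on the first PC."""
    mean, v = first_pc(points)
    return [sum((p[j] - mean[j]) * v[j] for j in range(len(v))) for p in points]
```

On perfectly correlated data such as points along y = x, the single retained dimension preserves all pairwise distances; when the data is not globally correlated, the lost components cause the false positives mentioned above.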
Local Dimensionality Reduction (LDR)
[Figure: GDR fits a single first PC to all the data; LDR finds Cluster1 and Cluster2, each with its own first PC]
Correlated Cluster
[Figure: a correlated cluster with its first PC (retained dim.), second PC (eliminated dim.), the mean of all points in the cluster, and the centroid of the cluster (projection of the mean on the eliminated dim.)]
• A set of locally correlated points = <PCs, subspace dim, centroid, points>
Reconstruction Distance
[Figure: point Q, its projection on the eliminated dim., the centroid of the cluster, the first PC (retained dim), the second PC (eliminated dim), and ReconstructionDistance(Q,S) measured along the eliminated dimension]
Reconstruction Distance Bound
[Figure: cluster centroid with first PC (retained dim) and second PC (eliminated dim); every point lies within MaxReconDist of the retained subspace]
ReconDist(P, S) ≤ MaxReconDist for every point P in cluster S
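The reconstruction distance can be computed as the norm of a point's component orthogonal to the retained subspace. A minimal sketch with hypothetical names, assuming the PCs form an orthonormal set:

```python
def recon_dist(q, centroid, pcs, k):
    """Reconstruction distance of point q w.r.t. a cluster S that retains
    the first k PCs: the length of q's residual after subtracting its
    projections onto the retained (orthonormal) PCs."""
    d = len(q)
    c = [q[j] - centroid[j] for j in range(d)]
    for v in pcs[:k]:
        t = sum(c[j] * v[j] for j in range(d))   # projection coefficient
        c = [c[j] - t * v[j] for j in range(d)]  # remove retained component
    return sum(x * x for x in c) ** 0.5

# Retaining only the x-axis PC, the point (3, 4) has 4 units of
# "eliminated" component, so its reconstruction distance is 4.
print(recon_dist((3, 4), (0, 0), [(1, 0), (0, 1)], 1))  # → 4.0
```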
Other constraints
• Dimensionality bound: a cluster must not retain any more dimensions than necessary, and the subspace dimensionality must be ≤ MaxDim
• Size bound: the number of points in the cluster must be ≥ MinSize
Clustering Algorithm Step 1: Construct Spatial Clusters
• Choose a set of well-scattered points as centroids (piercing set) from a random sample
• Group each point P in the dataset with its closest centroid C if Dist(P,C) is within a distance threshold
Clustering Algorithm Step 2: Choose PCs for each cluster
• Compute PCs
Clustering Algorithm Step 3: Compute Subspace Dimensionality
[Plot: fraction of points obeying the reconstruction-distance bound (0–1) vs. number of dimensions retained (0–16)]
• Assign each point to the cluster that needs the fewest dimensions to accommodate it
• The subspace dimensionality for each cluster is the minimum number of dimensions to retain so that most of its points obey the reconstruction-distance bound
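The dimensionality choice for a cluster can be written as a small search over the retained-dimension count. In this sketch (hypothetical names), `dims_needed[i]` is the number of PCs that must be retained before point i obeys the reconstruction-distance bound:

```python
def subspace_dim(dims_needed, frac=0.9):
    """Smallest number of retained dims such that at least `frac` of the
    cluster's points obey the reconstruction-distance bound; a point
    needing m dims obeys the bound once m or more dims are retained."""
    n = len(dims_needed)
    for k in range(max(dims_needed) + 1):
        if sum(1 for m in dims_needed if m <= k) / n >= frac:
            return k
    return max(dims_needed)

# Retaining 3 dims satisfies 9 of 10 points, so the single outlier that
# would need 15 dims does not inflate the cluster's subspace.
print(subspace_dim([1, 1, 2, 2, 2, 3, 3, 3, 3, 15], frac=0.9))  # → 3
```

The point of the fraction parameter is visible here: one distant point would otherwise force the cluster to keep nearly all dimensions.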
Clustering Algorithm Step 4: Recluster Points
• Assign each point P to a cluster S such that ReconDist(P,S) ≤ MaxReconDist
• If there are multiple such clusters, assign P to the first one (this overcomes the “splitting” problem: redundant copies of a cluster become empty)
Clustering Algorithm Step 5: Map Points
• Eliminate small clusters
• Map each point to its cluster's subspace (also store the reconstruction distance)
Clustering Algorithm Step 6: Iterate
• Iterate for more clusters as long as new clusters are being found among the outliers
• Overall complexity: 3 passes over the data, O(ND²K) for N points, D dimensions, and K clusters
Experiments (Part 1)
• Precision Experiments:
– Compare information loss in GDR and LDR for the same reduced dimensionality
– Precision = |Orig. Space Result| / |Reduced Space Result| (for range queries)
– Note: precision measures efficiency, not answer quality
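Because lower-bounding guarantees no false dismissals, the reduced-space result for a range query is a superset of the original-space result, and the precision defined above can be computed directly (hypothetical helper):

```python
def precision(orig_result, reduced_result):
    """Precision of the reduced-space filter step: the fraction of fetched
    candidates that are true answers. Lower precision means more false
    positives to eliminate in post-processing, i.e. higher query cost."""
    orig, reduced = set(orig_result), set(reduced_result)
    assert orig <= reduced, "no false dismissals under lower bounding"
    return len(orig) / len(reduced)

# Two of four fetched candidates are real answers: precision 0.5.
print(precision({1, 2}, {1, 2, 3, 4}))  # → 0.5
```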
Datasets
• Synthetic dataset:
– 64-d data, 100,000 points; generated clusters lie in different subspaces (cluster sizes and subspace dimensionalities follow a Zipf distribution); contains noise
• Real dataset:
– 64-d data (8×8 color histograms extracted from 70,000 images in the Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures
Precision Experiments (1)
[Charts: sensitivity of precision to skew in cluster size (skew 0–2) and to the number of clusters (1–10), comparing LDR and GDR]
Precision Experiments (2)
[Charts: sensitivity of precision to the degree of correlation (0–0.2) and to the reduced dimensionality (7–42), comparing LDR and GDR]
Index Structure
• Root node contains pointers to the root of each cluster index (and also stores the PCs and subspace dimensionality of each cluster)
• One index per cluster (Cluster 1 … Cluster K)
• Set of outliers (no index: sequential scan)
• Properties: (1) disk-based; (2) height = 1 + height(original space index); (3) almost balanced
Cluster Indices
• For each cluster S, build a multidimensional index on the (d+1)-dimensional space instead of the d-dimensional space:
– NewImage(P,S)[j] = projection of P along the jth PC, for 1 ≤ j ≤ d
– NewImage(P,S)[j] = ReconDist(P,S), for j = d+1
• Better estimate: D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S))
• Correctness (Lower Bounding Lemma): D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q)
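A sketch of the (d+1)-dimensional image (hypothetical names), assuming `pcs` is a full orthonormal basis of which the first k are retained; the reconstruction distance is the norm of the eliminated components:

```python
def new_image(p, centroid, pcs, k):
    """(d+1)-dim image stored in a cluster index: the k projections on the
    retained PCs, plus ReconDist(p, S) as the extra (k+1)th coordinate.
    Assumes pcs is a full orthonormal basis for the original space."""
    d = len(p)
    c = [p[j] - centroid[j] for j in range(d)]
    proj = [sum(c[j] * v[j] for j in range(d)) for v in pcs]
    kept = proj[:k]                                # retained coordinates
    recon = sum(t * t for t in proj[k:]) ** 0.5    # eliminated-component norm
    return kept + [recon]

# Retaining one PC of a 2-d space: (3, 4) maps to [3, 4.0],
# where 4.0 is its reconstruction distance.
print(new_image((3, 4), (0, 0), [(1, 0), (0, 1)], 1))  # → [3, 4.0]
```

Folding the reconstruction distance into the indexed point is what makes the index's distance estimate tighter than using the k retained coordinates alone, while still never exceeding the true distance.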
Effect of Extra Dimension
[Chart: I/O cost (# random disk accesses, 0–1000) vs. reduced dimensionality (12–34), comparing the d-dim and (d+1)-dim cluster indices]
Outlier Index
• Retain all dimensions
• An index may be built over the outliers; otherwise use a sequential scan (we use a sequential scan in our experiments)
Query Support
• Correctness:
– The query result is the same as with the original space index
• Point query, range query, k-NN query:
– similar to the algorithms in multidimensional index structures
– see paper for details
• Dynamic insertions and deletions:
– see paper for details
Experiments (Part 2)
• Cost Experiments:
– Compare linear scan, Original Space Index (OSI), GDR, and LDR in terms of I/O and CPU costs; the hybrid tree index structure is used for OSI, GDR, and LDR
• Cost Formulae (CPU cost in all cases is the computation time):
– Linear Scan: I/O cost (#rand accesses) = file_size/10
– OSI: I/O cost = number of index nodes visited
– GDR: I/O cost = index cost + post-processing cost (to eliminate false positives)
– LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10
I/O Cost (#random disk accesses)
[Chart: I/O cost (# random disk accesses, 0–3000) vs. reduced dimensionality (7–60), comparing LDR, GDR, OSI, and linear scan]
CPU Cost (only computation time)
[Chart: CPU cost (sec, 0–80) vs. reduced dimensionality (7–42), comparing LDR, GDR, OSI, and linear scan]
Conclusion
• LDR is a powerful dimensionality reduction technique for high-dimensional data
– It reduces dimensionality with lower loss in distance information compared to GDR
– It achieves significantly lower query cost compared to linear scan, the original space index, and GDR
• LDR has applications beyond high-dimensional indexing