1
NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)
Chen Li (UC Irvine)
Supported by NSF CAREER No. IIS-0238586
EDBT 2004
2
NN (nearest-neighbor) search
KNN: find the k nearest neighbors of an object.
NN-join: for each object in the 1st dataset, find the k nearest neighbors in the 2nd dataset.
[Figure: query object q; datasets D1 and D2]
3
Example: image search
Images represented as features (color histogram, texture moments, etc.)
Similarity search using these features: “Find 10 most similar images for the query image”
Other applications:
  Web-page search: “Find 100 most similar pages for a given page”
  GIS: “Find the 5 cities closest to Irvine”
  Data cleaning
[Figure: query image]
4
NN Algorithms
Distance measurement:
  When objects are points, distance is well defined
    Usually Euclidean
    Other distances possible
  For arbitrary-shaped objects, assume we have a distance function between them
Most algorithms assume a high-dimensional tree structure for the datasets (e.g., R-tree).
5
Search process (1-NN for example)
Most algorithms traverse the structure (e.g., R-tree) top down, and follow a branch-and-bound approach
  Keep a priority queue of nodes (“MBR”) to be visited
  Sorted based on the “minimum distance” between q and each node
Improvement:
  Use MINDIST and MINMAXDIST
  Reduce the queue size
  Avoid unnecessary disk IO’s to access MBR’s
[Figure: priority queue]
6
Problem
Queue size may be large:
  60,000 objects, 32d (image) vectors, 50 NNs
  Max queue size: 15K entries
  Avg queue size: half (7.5K entries)
  If queue can’t fit in memory, more disk IOs!
Problem worse for k-NN joins
  E.g., 1500 x 1500 join:
    Max queue size: 1.7M entries: >= 1GB memory!
    750 seconds to run
  Couldn’t scale up to 2000 objects! Disk thrashing
7
Our Solution: Nearest-Neighbor Histogram (NNH)
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
8
NNH: Nearest-Neighbor Histograms
Pivots: p1, p2, …, pm
  m: # of pivots
  They are not part of the database
For each pivot, distances of its nearest neighbors: r1, r2, …
9
Structure
Nearest Neighbor Vectors: $v(p) = \langle r_1, \ldots, r_T \rangle$
  each $r_i$ is the distance of p’s i-th NN
  T: length of each vector
Nearest Neighbor Histogram: collection of m pivots with their NN vectors
10
Outline
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
11
Estimate NN distance for query object
NNH does not give exact NN information for an object
But we can estimate an upper bound $q^{est}$ for the k-NN distance of q
Triangle inequality:
$\Delta_q \le \Delta(q, p_i) + H(p_i, k), \quad 1 \le i \le m$
where $\Delta_q$ is the distance of the k-th NN of q
12
Estimate NN for query object (cont’d)
Apply the triangle inequality to all pivots
Upper bound estimate of NN distance of q:
$q^{est} = \min_{1 \le i \le m} \left( \Delta(q, p_i) + H(p_i, k) \right)$
Complexity: O(m)
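A minimal Python sketch of this O(m) estimate, assuming Euclidean distance and an NNH held as two parallel lists. The names `estimate_knn_radius`, `pivots`, and `nn_vectors` are illustrative and do not come from the slides; `nn_vectors[i][k - 1]` plays the role of H(p_i, k).

```python
import math  # math.dist requires Python 3.8+


def estimate_knn_radius(q, pivots, nn_vectors, k):
    """Upper bound on the distance from q to its k-th nearest neighbor.

    pivots[i] is a point; nn_vectors[i][j] is the distance from pivots[i]
    to its (j + 1)-th nearest neighbor in the dataset (assumed layout).
    """
    # One triangle-inequality bound per pivot; keep the tightest one.
    return min(math.dist(q, p) + v[k - 1] for p, v in zip(pivots, nn_vectors))


# Toy NNH with two pivots and vectors of length T = 2.
pivots = [(0.0, 0.0), (10.0, 0.0)]
nn_vectors = [[1.0, 2.0], [0.5, 3.0]]
bound = estimate_knn_radius((3.0, 4.0), pivots, nn_vectors, k=1)
```

The loop touches each pivot once, which matches the stated O(m) cost for a fixed dimensionality.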
13
Utilizing estimates in NN search
More pruning: prune an mbr if:
$q^{est} < \mathrm{MINDIST}(q, mbr)$
[Figure: q, mbr, and the MINDIST between them]
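The pruning rule can be sketched inside a best-first 1-NN search as follows. This is an illustration under stated assumptions, not the authors' implementation: the dictionary-based node layout, the function names, and the `enqueued` counter are hypothetical, and a tiny hand-built tree stands in for an R-tree.

```python
import heapq
import math


def mindist(q, lo, hi):
    """MINDIST between point q and the axis-aligned box with corners lo, hi."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def nn_search(q, root, q_est=math.inf):
    """Best-first 1-NN search; returns (point, distance, nodes_enqueued).

    q_est is an upper bound on q's NN distance (for example from an NNH).
    Nodes are dicts: {"lo", "hi", "children"} or {"lo", "hi", "points"}.
    """
    best_p, best_d = None, math.inf
    enqueued = 1  # also used as a tie-breaker so dicts are never compared
    heap = [(mindist(q, root["lo"], root["hi"]), 0, root)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > min(best_d, q_est):
            break  # nothing left in the queue can hold the nearest neighbor
        if "points" in node:  # leaf: compare against the stored objects
            for p in node["points"]:
                dp = math.dist(q, p)
                if dp < best_d:
                    best_p, best_d = p, dp
        else:  # inner node: enqueue only children that survive the pruning test
            for child in node["children"]:
                dc = mindist(q, child["lo"], child["hi"])
                if dc <= min(best_d, q_est):
                    heapq.heappush(heap, (dc, enqueued, child))
                    enqueued += 1
    return best_p, best_d, enqueued


leaf_a = {"lo": (0, 0), "hi": (1, 1), "points": [(0, 0), (1, 1)]}
leaf_b = {"lo": (5, 5), "hi": (6, 6), "points": [(5, 5), (6, 6)]}
root = {"lo": (0, 0), "hi": (6, 6), "children": [leaf_a, leaf_b]}
without_bound = nn_search((4, 4), root)
with_bound = nn_search((4, 4), root, q_est=2.0)
```

In this toy tree both calls return the same neighbor, but the call with the bound never enqueues `leaf_a`, which is the queue-size saving the slide describes.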
14
Utilizing estimates in NN join
K-NN join: for each object o1 in D1, find its k-nearest neighbors in D2.
Traverse two trees top down; keep a queue of pairs
15
Utilizing estimates in NN join (cont’d)
Construct NNH for D2.
For each object o1 in D1, keep its estimated NN radius $o_1^{est}$ using NNH of D2.
Similar to k-NN query, ignore mbr for o1 if:
$o_1^{est} < \mathrm{MINDIST}(o_1, mbr)$
[Figure: o1, mbr, and the MINDIST between them]
16
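More powerful: prune MBR pairs
$\forall o_1 \in mbr_1: \quad \Delta_{o_1} \le \Delta(o_1, p_i) + H_2(p_i, k)$
$\Delta_{mbr_1} \le \mathrm{MAXDIST}(mbr_1, p_i) + H_2(p_i, k)$
$mbr_1^{est} = \min_{p_i \in H_2} \left( \mathrm{MAXDIST}(mbr_1, p_i) + H_2(p_i, k) \right)$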
17
Prune MBR pairs (cont’d)
Prune this MBR pair if:
$mbr_1^{est} < \mathrm{MINDIST}(mbr_1, mbr_2)$
[Figure: mbr1, mbr2, and the MINDIST between them]
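A small Python sketch of the MBR-pair test, assuming axis-aligned boxes given as (lo, hi) corner tuples, Euclidean distance, and an NNH of D2 held as parallel lists. All function and parameter names are illustrative and not taken from the slides.

```python
import math


def maxdist(lo, hi, p):
    """MAXDIST: largest distance from point p to any point of the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))


def box_mindist(lo1, hi1, lo2, hi2):
    """MINDIST between two axis-aligned boxes (0 if they overlap)."""
    return math.sqrt(sum(max(l2 - h1, 0.0, l1 - h2) ** 2
                         for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2)))


def mbr_estimate(lo, hi, pivots, nn_vectors, k):
    """Upper bound on the k-NN distance of every object inside the box."""
    return min(maxdist(lo, hi, p) + v[k - 1] for p, v in zip(pivots, nn_vectors))


def can_prune_pair(mbr1, mbr2, pivots, nn_vectors, k):
    """True if no object in mbr1 can have a k-nearest neighbor inside mbr2."""
    return mbr_estimate(*mbr1, pivots, nn_vectors, k) < box_mindist(*mbr1, *mbr2)


pivots = [(0.0, 0.0)]        # pivots of the NNH built on D2 (toy values)
nn_vectors = [[0.5]]         # H2(p_1, 1) = 0.5
mbr1 = ((0.0, 0.0), (1.0, 1.0))
far_mbr = ((5.0, 0.0), (6.0, 1.0))
near_mbr = ((2.0, 0.0), (3.0, 1.0))
```

With these toy values the estimate for `mbr1` is about 1.91, so the pair with `far_mbr` (MINDIST 4) is pruned while the pair with `near_mbr` (MINDIST 1) is kept.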
18
Outline
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
19
NNH Construction
If we have selected the m pivots:
  Just run KNN queries for them to construct NNH
  Time is O(m)
  Offline
Important: selecting pivots
  Size-Constraint Construction
  Error-Constraint Construction (see paper)
20
Size-constraint NNH construction
# of pivots “m” determines:
  Storage size
  Initial construction cost
  Incremental-maintenance cost
Choose m “best” pivots
21
Size-constraint NNH construction
Given m (# of pivots), assume:
  query objects are from the database D
  H(pi,k) doesn’t vary too much
Recall the bound: $\Delta_q \le \Delta(q, p_i) + H(p_i, k), \quad 1 \le i \le m$
Goal: Find pivots p1, p2, …, pm to minimize object distances to the pivots:
$\Delta(q, p_i), \quad q \in D, \; i = 1, \ldots, m$
Clustering problem:
  Many algorithms available
  Use K-means for its simplicity and efficiency
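A compact Python sketch of pivot selection by k-means, using plain Lloyd iterations. The slides only name K-means; the random initialization, the fixed iteration count, and the function name `kmeans_pivots` are assumptions made for this illustration.

```python
import math
import random


def kmeans_pivots(data, m, iterations=20, seed=0):
    """Pick m pivots as k-means centroids of `data` (plain Lloyd iterations)."""
    rng = random.Random(seed)
    pivots = [list(p) for p in rng.sample(data, m)]  # start from m data points
    for _ in range(iterations):
        clusters = [[] for _ in range(m)]
        for o in data:
            # Assign each object to its closest current pivot.
            nearest = min(range(m), key=lambda i: math.dist(o, pivots[i]))
            clusters[nearest].append(o)
        for i, members in enumerate(clusters):
            if members:  # keep the old pivot if its cluster became empty
                pivots[i] = [sum(c) / len(members) for c in zip(*members)]
    return [tuple(p) for p in pivots]


data = [(0.0,), (1.0,), (10.0,), (11.0,)]
pivots = kmeans_pivots(data, m=2)
```

Note that the resulting centroids are generally not database objects, which is consistent with the earlier remark that pivots are not part of the database.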
22
Incremental Maintenance
How to update the NNH when inserting or deleting objects?
  Need to “shift” each vector:
  Associate a valid length Ei to each NN vector.
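The slide gives only the outline of maintenance, so the following Python sketch is one plausible reading rather than the authors' procedure. It assumes that an insertion closer than the last valid entry shifts later entries right, that a deletion inside the valid prefix shifts later entries left and shortens the valid length, and that only the valid prefix of each vector is stored. The class and method names are hypothetical.

```python
import bisect
import math


class MaintainedNNH:
    """Pivots plus, per pivot, the valid prefix of its sorted NN-distance vector."""

    def __init__(self, pivots, vectors, T):
        self.pivots = list(pivots)
        self.T = T  # maximum vector length
        self.vectors = [sorted(v)[:T] for v in vectors]

    def valid_length(self, i):
        """E_i: how many leading entries of pivot i's vector are still exact."""
        return len(self.vectors[i])

    def insert(self, o):
        for p, v in zip(self.pivots, self.vectors):
            d = math.dist(p, o)
            if v and d < v[-1]:
                bisect.insort(v, d)  # later entries shift right
                del v[self.T:]       # drop whatever falls past position T

    def delete(self, o):
        for p, v in zip(self.pivots, self.vectors):
            d = math.dist(p, o)
            i = bisect.bisect_left(v, d)
            if i < len(v) and math.isclose(v[i], d):
                del v[i]  # later entries shift left; valid length shrinks by 1


h = MaintainedNNH([(0.0, 0.0)], [[1.0, 2.0]], T=2)
h.insert((1.5, 0.0))   # closer than r_2, so the vector becomes <1.0, 1.5>
h.delete((1.0, 0.0))   # removes r_1; only one entry stays valid
```

Each update costs one distance computation per pivot plus a shift within a short vector, which fits the near-zero maintenance times reported later in the deck.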
23
Outline
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
24
Experiments
Datasets:
  Corel image database
    Contains 60,000 images
    Each image represented by a 32-dimensional float vector
  Time-series data from AT&T
  Similar trends. Report results for Corel data set
Test bed:
  PC: 1.5 GHz Athlon, 512MB Mem, 80G HD, Windows 2000.
  GNU C++ in CYGWIN
25
Goal
Is the pruning using NNH estimates powerful?
  KNN queries
  NN-join queries
Is it “cheap” to have such a structure?
  Storage
  Initial construction
  Incremental maintenance
26
Improvement in k-NN search
Ran k-means algorithm to generate 400 pivots for 60K objects, and constructed NNH
Performed K-NN queries on 100 randomly selected query objects.
Queue size to measure memory usage.
  Max queue size
  Average queue size
27
Reduced Memory Requirement
28
Reduced running time
29
Effects of different # of pivots
30
Join: Reduced Memory Requirement
31
Join: Reduced running time
32
Join: Running time for different data sizes
33
Cost/Benefit of NNH

| Pivot # (m) | 10 | 50 | 100 | 150 | 200 | 250 | 300 | 350 | 400 |
|---|---|---|---|---|---|---|---|---|---|
| Construction time (sec) | 0.7 | 3.59 | 6.6 | 9.4 | 11.5 | 13.7 | 15.7 | 17.8 | 20.4 |
| Storage space (kB) | 2 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 |
| Incr. maintenance time (ms) | ~0 | ~0 | ~0 | ~0 | ~0 | ~0 | ~0 | ~0 | ~0 |
| Improved q-size (kNN) (%) | 40 | 30 | 28 | 24 | 24 | 24 | 23 | 20 | 18 |
| Improved q-size (join) (%) | 45 | 34 | 28 | 26 | 26 | 25 | 24 | 24 | 22 |

“~0” means almost zero.
For 60,000 float vectors (32-d).
34
Conclusion
NNH: efficient, effective approach to improving NN-search performance.
Can be easily embedded into current implementation of NN algorithms.
Can be efficiently constructed and maintained.
Offers substantial performance advantages.
35
Related work
Summary histograms
  E.g., [Jagadish et al VLDB98], [Mattias et al VLDB00]
  Objective: approximate frequency values
NN Search algorithms
  Many algorithms developed
  Many of them can benefit from NNH
Algorithms based on “pivots/foci/anchors”
  E.g., Omni [Filho et al, ICDE01], Vantage objects [Vleugels et al VIIS99], M-trees [Ciaccia et al VLDB97]
  Choose pivots far from each other (to represent the “intrinsic dimensionality”)
  NNH: pivots depend on how clustered the objects are
  Experiments show the differences
36
Work conducted in the Flamingo Project on Data Cleansing at UC Irvine