data structures and algorithms for proximity search

12
Data structures and algorithms for proximity search CSE 291 A: Tree-based proximity search in Euclidean space

Upload: others

Post on 27-Oct-2021

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data structures and algorithms for proximity search

Data structures and algorithms for proximity search

CSE 291

A: Tree-based proximity search in Euclidean space

Page 2: Data structures and algorithms for proximity search

The complexity of nearest neighbor search

Given n data points in Rd , how long does it take to find the nearest neighbor of aquery point q?

The complexity of nearest neighbor search

What if d = 1? Is there a clever data structure that can be used? What is thepreprocessing time and query time?

Page 3: Data structures and algorithms for proximity search

Data structures for proximity search

Given a data set of n points in Rd (or in a more general distance space), build a datastructure for efficiently answering subsequent nearest neighbor queries q.

• Data structure should take space O(n)

• Query time should be o(n)

Many data structures have been designed for this purpose, e.g.,

1 K -d trees

2 Ball trees

3 Locality-sensitive hashing

These are part of standard Python libraries for NN, and help a lot.

Spatial data structure: k-d tree

A hierarchical, rectilinear spatial partition.

For data set S ⊂ Rd :

• If |S | ≤ no : return leaf cell containing S

• Pick a coordinate 1 ≤ i ≤ d

• v = median({xi : x ∈ S})• Split S into two halves:

SL = {x ∈ S : xi < v}SR = {x ∈ S : xi ≥ v}

• Recurse on SL, SR

What is the height of the tree? And how long does it take to build?

Page 4: Data structures and algorithms for proximity search

NN search with a k-d tree

Defeatist search for query q:

• Move q down to its leaf-cell

• Return closest point in that cell

• Time: O(log(n/no)) + O(d · no)

Problem: Might not return the actual NN.

Page 5: Data structures and algorithms for proximity search

NN search with a k-d tree

Comprehensive search for query q:

• Go to q’s leaf cell, find closest point in that cell

• Backtrack up the tree, looking for closer pointsand using geometric reasoning (triangleinequality) to decide if a subtree can be ignored

• Always returns the NN

• Worst case: look at all points, O(dn)

B: Tree-based search in metric spaces

Page 6: Data structures and algorithms for proximity search

Data structure: Ball treeInstance space X with metric d .

Ball tree for a set of points S ⊂ X :

• Hierarchical partition of S with cells organized in a tree• Each node of the tree has an associated ball

B(z , r) = {x ∈ X : d(x , z) ≤ r}

that contains all points in that node

Building a ball tree

Lots of flexibility in how to split a cell, e.g.

• Pick two points in the cell

• Partition the cell according to which of those two points is closer

Page 7: Data structures and algorithms for proximity search

Defeatist search with a ball tree

Given a query q, how do we move down the tree?

B(z1, r1)<latexit sha1_base64="5zCf9DHOjDBMJTv0ETMwvnkHZAQ=">AAAB8nicbVBNSwMxEJ2tX7V+VT16CRahgpTdIuix6MVjBfsB7bJk02wbmk2WJCvUpT/DiwdFvPprvPlvTNs9aOuDgcd7M8zMCxPOtHHdb6ewtr6xuVXcLu3s7u0flA+P2lqmitAWkVyqbog15UzQlmGG026iKI5DTjvh+Hbmdx6p0kyKBzNJqB/joWARI9hYqXdTfQq8C6QC7zwoV9yaOwdaJV5OKpCjGZS/+gNJ0pgKQzjWuue5ifEzrAwjnE5L/VTTBJMxHtKepQLHVPvZ/OQpOrPKAEVS2RIGzdXfExmOtZ7Eoe2MsRnpZW8m/uf1UhNd+xkTSWqoIItFUcqRkWj2PxowRYnhE0swUczeisgIK0yMTalkQ/CWX14l7XrNc2ve/WWlUc/jKMIJnEIVPLiCBtxBE1pAQMIzvMKbY5wX5935WLQWnHzmGP7A+fwBGH2Pww==</latexit><latexit sha1_base64="5zCf9DHOjDBMJTv0ETMwvnkHZAQ=">AAAB8nicbVBNSwMxEJ2tX7V+VT16CRahgpTdIuix6MVjBfsB7bJk02wbmk2WJCvUpT/DiwdFvPprvPlvTNs9aOuDgcd7M8zMCxPOtHHdb6ewtr6xuVXcLu3s7u0flA+P2lqmitAWkVyqbog15UzQlmGG026iKI5DTjvh+Hbmdx6p0kyKBzNJqB/joWARI9hYqXdTfQq8C6QC7zwoV9yaOwdaJV5OKpCjGZS/+gNJ0pgKQzjWuue5ifEzrAwjnE5L/VTTBJMxHtKepQLHVPvZ/OQpOrPKAEVS2RIGzdXfExmOtZ7Eoe2MsRnpZW8m/uf1UhNd+xkTSWqoIItFUcqRkWj2PxowRYnhE0swUczeisgIK0yMTalkQ/CWX14l7XrNc2ve/WWlUc/jKMIJnEIVPLiCBtxBE1pAQMIzvMKbY5wX5935WLQWnHzmGP7A+fwBGH2Pww==</latexit><latexit sha1_base64="5zCf9DHOjDBMJTv0ETMwvnkHZAQ=">AAAB8nicbVBNSwMxEJ2tX7V+VT16CRahgpTdIuix6MVjBfsB7bJk02wbmk2WJCvUpT/DiwdFvPprvPlvTNs9aOuDgcd7M8zMCxPOtHHdb6ewtr6xuVXcLu3s7u0flA+P2lqmitAWkVyqbog15UzQlmGG026iKI5DTjvh+Hbmdx6p0kyKBzNJqB/joWARI9hYqXdTfQq8C6QC7zwoV9yaOwdaJV5OKpCjGZS/+gNJ0pgKQzjWuue5ifEzrAwjnE5L/VTTBJMxHtKepQLHVPvZ/OQpOrPKAEVS2RIGzdXfExmOtZ7Eoe2MsRnpZW8m/uf1UhNd+xkTSWqoIItFUcqRkWj2PxowRYnhE0swUczeisgIK0yMTalkQ/CWX14l7XrNc2ve/WWlUc/jKMIJnEIVPLiCBtxBE1pAQMIzvMKbY5wX5935WLQWnHzmGP7A+fwBGH2Pww==</latexit><latexit sha1_base64="5zCf9DHOjDBMJTv0ETMwvnkHZAQ=">AAAB8nicbVBNSwMxEJ2tX7V+VT16CRahgpTdIuix6MVjBfsB7bJk02wbmk2WJCvUpT/DiwdFvPprvPlvTNs9aOuDgcd7M8zMCxPOtHHdb6ewtr6xuVXcLu3s7u0flA+P2lqmitAWkVyqbog15UzQlmGG026iKI5DTjvh+Hbmdx6p0kyKBzNJqB/joWARI9hYqXdTfQq8C6QC7zwoV9yaOwdaJV5OKpCjGZS/+gNJ0pgKQzjWuue5ifEzrAwjnE5L/VTTBJMxHtKepQLHVPvZ/OQpOrPKAEVS2RIGzdXfExmOtZ7Eoe2MsRnpZW8m/uf1UhNd+xkTSWqoIItFUcqRkWj2PxowRYnhE0swUczeisgIK0yMTalkQ/CWX14l7XrNc2ve/WWlUc/jKMIJnEIVPLiCBtxBE1pAQMIzvMKbY5wX5935WLQWnHzmGP7A+fwBGH2Pww==</latexit>

B(z2, r2)<latexit sha1_base64="ohlksqG8Jo28OshZqm5aOBmupMo=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBahgpQkCHosevFYwX5AGsJmu2mXbjZhdyPU0J/hxYMiXv013vw3btsctPXBwOO9GWbmhSlnStv2t1VaW9/Y3CpvV3Z29/YPqodHHZVkktA2SXgieyFWlDNB25ppTnuppDgOOe2G49uZ332kUrFEPOhJSv0YDwWLGMHaSN5N/SlwL5AM3POgWrMb9hxolTgFqUGBVlD96g8SksVUaMKxUp5jp9rPsdSMcDqt9DNFU0zGeEg9QwWOqfLz+clTdGaUAYoSaUpoNFd/T+Q4VmoSh6Yzxnqklr2Z+J/nZTq69nMm0kxTQRaLoowjnaDZ/2jAJCWaTwzBRDJzKyIjLDHRJqWKCcFZfnmVdNyGYzec+8ta0y3iKMMJnEIdHLiCJtxBC9pAIIFneIU3S1sv1rv1sWgtWcXMMfyB9fkDG4yPxQ==</latexit><latexit sha1_base64="ohlksqG8Jo28OshZqm5aOBmupMo=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBahgpQkCHosevFYwX5AGsJmu2mXbjZhdyPU0J/hxYMiXv013vw3btsctPXBwOO9GWbmhSlnStv2t1VaW9/Y3CpvV3Z29/YPqodHHZVkktA2SXgieyFWlDNB25ppTnuppDgOOe2G49uZ332kUrFEPOhJSv0YDwWLGMHaSN5N/SlwL5AM3POgWrMb9hxolTgFqUGBVlD96g8SksVUaMKxUp5jp9rPsdSMcDqt9DNFU0zGeEg9QwWOqfLz+clTdGaUAYoSaUpoNFd/T+Q4VmoSh6Yzxnqklr2Z+J/nZTq69nMm0kxTQRaLoowjnaDZ/2jAJCWaTwzBRDJzKyIjLDHRJqWKCcFZfnmVdNyGYzec+8ta0y3iKMMJnEIdHLiCJtxBC9pAIIFneIU3S1sv1rv1sWgtWcXMMfyB9fkDG4yPxQ==</latexit><latexit sha1_base64="ohlksqG8Jo28OshZqm5aOBmupMo=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBahgpQkCHosevFYwX5AGsJmu2mXbjZhdyPU0J/hxYMiXv013vw3btsctPXBwOO9GWbmhSlnStv2t1VaW9/Y3CpvV3Z29/YPqodHHZVkktA2SXgieyFWlDNB25ppTnuppDgOOe2G49uZ332kUrFEPOhJSv0YDwWLGMHaSN5N/SlwL5AM3POgWrMb9hxolTgFqUGBVlD96g8SksVUaMKxUp5jp9rPsdSMcDqt9DNFU0zGeEg9QwWOqfLz+clTdGaUAYoSaUpoNFd/T+Q4VmoSh6Yzxnqklr2Z+J/nZTq69nMm0kxTQRaLoowjnaDZ/2jAJCWaTwzBRDJzKyIjLDHRJqWKCcFZfnmVdNyGYzec+8ta0y3iKMMJnEIdHLiCJtxBC9pAIIFneIU3S1sv1rv1sWgtWcXMMfyB9fkDG4yPxQ==</latexit><latexit sha1_base64="ohlksqG8Jo28OshZqm5aOBmupMo=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBahgpQkCHosevFYwX5AGsJmu2mXbjZhdyPU0J/hxYMiXv013vw3btsctPXBwOO9GWbmhSlnStv2t1VaW9/Y3CpvV3Z29/YPqodHHZVkktA2SXgieyFWlDNB25ppTnuppDgOOe2G49uZ332kUrFEPOhJSv0YDwWLGMHaSN5N/SlwL5AM3POgWrMb9hxolTgFqUGBVlD96g8SksVUaMKxUp5jp9rPsdSMcDqt9DNFU0zGeEg9QwWOqfLz+clTdGaUAYoSaUpoNFd/T+Q4VmoSh6Yzxnqklr2Z+J/nZTq69nMm0kxTQRaLoowjnaDZ/2jAJCWaTwzBRDJzKyIjLDHRJqWKCcFZfnmVdNyGYzec+8ta0y3iKMMJnEIdHLiCJtxBC9pAIIFneIU3S1sv1rv1sWgtWcXMMfyB9fkDG4yPxQ==</latexit>

When you get to a leaf: return the NN in that leaf.

Comprehensive search with a ball tree

Suppose the closest point we’ve found so far (to query q) is x .How to decide whether we need to explore a cell C?

q<latexit sha1_base64="WDXk5YbgOwagPhYkqYUboZ0/S1U=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbuboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKHSPJb3ZpqgH9GR5CFn1Fip+TgoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A1s2M5Q==</latexit><latexit sha1_base64="WDXk5YbgOwagPhYkqYUboZ0/S1U=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbuboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKHSPJb3ZpqgH9GR5CFn1Fip+TgoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A1s2M5Q==</latexit><latexit sha1_base64="WDXk5YbgOwagPhYkqYUboZ0/S1U=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbuboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKHSPJb3ZpqgH9GR5CFn1Fip+TgoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A1s2M5Q==</latexit><latexit sha1_base64="WDXk5YbgOwagPhYkqYUboZ0/S1U=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbuboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKHSPJb3ZpqgH9GR5CFn1Fip+TgoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A1s2M5Q==</latexit>

x<latexit sha1_base64="IeMWPNjdVKiUccM/0cvkCECEX9w=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbsbsQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKLSPJb3ZpqgH9GR5CFn1Fip+TQoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A4WmM7A==</latexit><latexit sha1_base64="IeMWPNjdVKiUccM/0cvkCECEX9w=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbsbsQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKLSPJb3ZpqgH9GR5CFn1Fip+TQoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A4WmM7A==</latexit><latexit sha1_base64="IeMWPNjdVKiUccM/0cvkCECEX9w=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbsbsQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKLSPJb3ZpqgH9GR5CFn1Fip+TQoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A4WmM7A==</latexit><latexit sha1_base64="IeMWPNjdVKiUccM/0cvkCECEX9w=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMeCF48t2A9oQ9lsJ+3azSbsbsQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKLSPJb3ZpqgH9GR5CFn1Fip+TQoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ3fsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukXat6btVrXlXqtTyOIpzBOVyCB9dQhztoQAsYIDzDK7w5D86L8+58LFsLTj5zCn/gfP4A4WmM7A==</latexit>

C<latexit sha1_base64="NdWAD45Fw/PqY4K7xfUQ/ALMveU=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMdCLx5bsB/QhrLZTtq1m03Y3Qgl9Bd48aCIV3+SN/+N2zYHbX0w8Hhvhpl5QSK4Nq777RS2tnd294r7pYPDo+OT8ulZR8epYthmsYhVL6AaBZfYNtwI7CUKaRQI7AbTxsLvPqHSPJYPZpagH9Gx5CFn1Fip1RiWK27VXYJsEi8nFcjRHJa/BqOYpRFKwwTVuu+5ifEzqgxnAuelQaoxoWxKx9i3VNIItZ8tD52TK6uMSBgrW9KQpfp7IqOR1rMosJ0RNRO97i3E/7x+asI7P+MySQ1KtloUpoKYmCy+JiOukBkxs4Qyxe2thE2ooszYbEo2BG/95U3SqVU9t+q1bir1Wh5HES7gEq7Bg1uowz00oQ0MEJ7hFd6cR+fFeXc+Vq0FJ585hz9wPn8AkRWMtw==</latexit><latexit sha1_base64="NdWAD45Fw/PqY4K7xfUQ/ALMveU=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMdCLx5bsB/QhrLZTtq1m03Y3Qgl9Bd48aCIV3+SN/+N2zYHbX0w8Hhvhpl5QSK4Nq777RS2tnd294r7pYPDo+OT8ulZR8epYthmsYhVL6AaBZfYNtwI7CUKaRQI7AbTxsLvPqHSPJYPZpagH9Gx5CFn1Fip1RiWK27VXYJsEi8nFcjRHJa/BqOYpRFKwwTVuu+5ifEzqgxnAuelQaoxoWxKx9i3VNIItZ8tD52TK6uMSBgrW9KQpfp7IqOR1rMosJ0RNRO97i3E/7x+asI7P+MySQ1KtloUpoKYmCy+JiOukBkxs4Qyxe2thE2ooszYbEo2BG/95U3SqVU9t+q1bir1Wh5HES7gEq7Bg1uowz00oQ0MEJ7hFd6cR+fFeXc+Vq0FJ585hz9wPn8AkRWMtw==</latexit><latexit sha1_base64="NdWAD45Fw/PqY4K7xfUQ/ALMveU=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMdCLx5bsB/QhrLZTtq1m03Y3Qgl9Bd48aCIV3+SN/+N2zYHbX0w8Hhvhpl5QSK4Nq777RS2tnd294r7pYPDo+OT8ulZR8epYthmsYhVL6AaBZfYNtwI7CUKaRQI7AbTxsLvPqHSPJYPZpagH9Gx5CFn1Fip1RiWK27VXYJsEi8nFcjRHJa/BqOYpRFKwwTVuu+5ifEzqgxnAuelQaoxoWxKx9i3VNIItZ8tD52TK6uMSBgrW9KQpfp7IqOR1rMosJ0RNRO97i3E/7x+asI7P+MySQ1KtloUpoKYmCy+JiOukBkxs4Qyxe2thE2ooszYbEo2BG/95U3SqVU9t+q1bir1Wh5HES7gEq7Bg1uowz00oQ0MEJ7hFd6cR+fFeXc+Vq0FJ585hz9wPn8AkRWMtw==</latexit><latexit sha1_base64="NdWAD45Fw/PqY4K7xfUQ/ALMveU=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mKoMdCLx5bsB/QhrLZTtq1m03Y3Qgl9Bd48aCIV3+SN/+N2zYHbX0w8Hhvhpl5QSK4Nq777RS2tnd294r7pYPDo+OT8ulZR8epYthmsYhVL6AaBZfYNtwI7CUKaRQI7AbTxsLvPqHSPJYPZpagH9Gx5CFn1Fip1RiWK27VXYJsEi8nFcjRHJa/BqOYpRFKwwTVuu+5ifEzqgxnAuelQaoxoWxKx9i3VNIItZ8tD52TK6uMSBgrW9KQpfp7IqOR1rMosJ0RNRO97i3E/7x+asI7P+MySQ1KtloUpoKYmCy+JiOukBkxs4Qyxe2thE2ooszYbEo2BG/95U3SqVU9t+q1bir1Wh5HES7gEq7Bg1uowz00oQ0MEJ7hFd6cR+fFeXc+Vq0FJ585hz9wPn8AkRWMtw==</latexit>

Let the bounding ball of C be B(z , r). Then we need to go into C if:

Page 8: Data structures and algorithms for proximity search

C: The curse of dimension in proximity search

The curse of dimension in NN search

Situation: n data points in Rd .

1 Not good: Storage is O(nd)

2 Not good: Time to compute distance is O(d) for `p norms

3 But worst of all: GeometryIt is possible to have 2O(d) points that are roughly equidistant from each other.

The best current methods for exact NN search have worst-case query time proportionalto 2d and log n.

Page 9: Data structures and algorithms for proximity search

The nightmare scenario in NN search

For any 0 < ε < 1,

• Pick 2O(ε2d) points uniformly from the unit sphere in Rd

• With high probability, all interpoint distances are (1± ε)√

2

How can this bad case be defeated?

1 Most real data doesn’t look like a uniform distribution on a sphere.Data is often “intrinsically” lower-dimensional that it appears (e.g. lies on ado-dimensional manifold in Rd , for do � d). Many methods for exact search canreplace dependence on d by do .

2 Approximate nearest neighbor search

Page 10: Data structures and algorithms for proximity search

Cover trees for metric spaces

• Hierarchical cover of an arbitrary metric space• Space O(n), permits dynamic insertion and deletion of data points• Query time O(poly(c) log n), for a dimension-related constant c

A finite set X in a metric space has expansion rate c if for any point x and any radiusr > 0,

|B(x , 2r) ∩ X | ≤ c · |B(x , r) ∩ X |.

D: Approximate nearest neighbor search

Page 11: Data structures and algorithms for proximity search

Approximate nearest neighbor

For data set S ⊂ Rd and query q, a c-approximate nearest neighbor is any x ∈ S suchthat

‖x − q‖ ≤ c ·minz∈S‖z − q‖.

Complexity of approximate NN search in Euclidean space:

• Data structure size n1+1/c2

• Query time n1/c2

This is based on locality-sensitive hashing.

Locality-sensitive hashingh1(x)

h2(x) x

Typical hash function hi :random projection + binning

hi (x) =

⌊ri · x + b

w

• ri is a random direction

• b is a random offset

• w is the bin width

• For any data set x1, . . . , xn, query q: probability < 1 of failing to return anapproximate NN.

• To reduce this probability, make t tables. Space: O(nt).

Page 12: Data structures and algorithms for proximity search

NN search structures: an impressionistic history

• 1975: The k-d tree (Bentley and Friedman).Widely used, but algorithmic guarantees on weak footing.

• 1980s-1990s: More tree structures (e.g. Clarkson, Mount).Could accommodate general metric spaces.

• 1990s-: It’s okay to fail sometimes (e.g. Clarkson, Kleinberg).

• Late 1990s-: Locality-sensitive hashing (Indyk, Motwani, Andoni).Hashing scheme with some failure probability, widely used.

• Recently: many variants of hashing; resurgence of trees.