A CUDA-Based Parallel Decoding Algorithm for the Leech Lattice Locality Sensitive Hash Family
Lee Carraher
Department of Computer Science
University of Cincinnati
May 8, 2012
Outline
I Thesis Goal
I Nearest Neighbors Problem and Applications
I LSH
I Parallel Leech Decoder
I Results
I Conclusions and Future Directions
Our Goal
The goal of this thesis is to present a CUDA-optimized parallel implementation of a bounded-distance Leech lattice decoder 1 for use in query-optimized c-approximate k-nearest neighbors using the locality sensitive hash framework of E2LSH 2.
In addition, we will present an improvement to the E2LSH algorithm that improves selectivity on real-world data problems by an order of magnitude, with no change in algorithmic or storage complexity.
1 O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg, "The leech lattice and the golay code: bounded-distance decoding and multilevel constructions," Information Theory, IEEE Transactions on, vol. 40, pp. 1030–1043, Jul 1994
2 A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pp. 459–468, Oct 2006
The Nearest Neighbor Problem
Definition (Exact NN)
Given a set of points P in Rd and a query point q, return a point p ∈ P such that p = argmin {ρ(p, q)}, where ρ is some metric function.3
And an extension to an arbitrary number of neighbors.
Definition (k-Nearest Neighbor Search)
Given a set of points P in Rd and a query point q, return the k nearest points p ∈ P to q.4
3 H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006
4 H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006
Applications of NN
I Pattern Recognition
I Image/Object Recognition - robot vision systems, object and motion tracking
I Genomics - similar and exact gene sequences in long DNA reads
I Biology/Medical - drug interaction and symptoms
I Recommender Systems - similar items based on user submission
I Data Compression
Standard Solutions for NN search
Definition (Linear Search NN)
Given a set of points P in Rd and a query point q, return the nearest point p ∈ P to q by minimizing f(p) = ρ(p, q) over all p ∈ P, where ρ is some distance function.
This algorithm has complexity Θ(nd), which becomes computationally expensive for large d and n. For this reason it is not often used in practice; instead, kd-tree based NN search is used.
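As a concrete illustration of the Θ(nd) scan (a minimal stdlib-only sketch; the function name is mine, not from the thesis):

```python
import math

def linear_search_nn(P, q, k=1):
    """Exact k-NN by exhaustive scan: Theta(n*d) distance evaluations."""
    # Score every database point against the query, then keep the k best.
    scored = sorted(P, key=lambda p: math.dist(p, q))
    return scored[:k]

# Tiny 2d example.
P = [(3, 5), (2, 1), (2, 2), (1, 2), (5, 5), (4, 6)]
print(linear_search_nn(P, (2.4, 1.9), k=2))  # → [(2, 2), (2, 1)]
```

Every query touches every point, which is exactly why the methods below try to prune the search.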
Kd Trees for NN search
Kd-trees are an extension of a binary tree in which nodes correspond to points in the database and splitting decisions are based on the value of the point along one of its axes. This method forms splitting planes which partition the points in Rd. A simple 2d example is given below.
Example points: [3,5], [2,1], [2,2], [1,2], [5,5], [4,6]
Kd Trees for NN search
(Figure: the kd-tree built from the example points (3,5), (2,1), (2,2), (1,2), (5,5), (4,6).)
I The search algorithm consists of recursively searching the binary tree.
I In 1 dimension, this is equivalent to a binary search, and is optimal.
I However, as d →∞, the recursion approaches linear search.
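A small Python sketch of the recursive kd-tree search described above (build and query names are mine; this is an illustration, not the thesis code):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split 2d points on alternating axes."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kdtree_nn(node, q, best=None):
    """Exact NN: descend toward q, then backtrack across the splitting
    plane only when the far side could still hold a closer point."""
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or math.dist(p, q) < math.dist(best, q):
        best = p
    near, far = (node["left"], node["right"]) if q[axis] < p[axis] else (node["right"], node["left"])
    best = kdtree_nn(near, q, best)
    # The far subtree matters only if the splitting plane is closer
    # to q than the best distance found so far (this pruning is what
    # degrades toward linear search as d grows).
    if abs(q[axis] - p[axis]) < math.dist(best, q):
        best = kdtree_nn(far, q, best)
    return best

tree = build_kdtree([(3, 5), (2, 1), (2, 2), (1, 2), (5, 5), (4, 6)])
print(kdtree_nn(tree, (2.4, 1.9)))  # → (2, 2)
```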
The Curse of Dimensionality
Although there are many interpretations of the common COD in regards to nearest neighbor searching, here it refers to an exact search method's complexity approaching linear search complexity as d grows large. This is due to the chosen splitting plane having an increasingly lower probability of partitioning the data equally.
Consider the distance between any two points x and y under Euclidean distance: dist(x, y) = √(∑_{i=1}^{d} (x_i − y_i)²). Now if we consider the vectors as being composed of uniformly distributed values then, according to Ullman 5, the range of distances is all very near the average distance between points. Given this, it becomes very difficult to differentiate between near and far points.
5 A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2011
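This concentration of distances can be checked numerically; below is a small stdlib-only sketch (function name and sample sizes are mine, not from the thesis):

```python
import math
import random

def pairwise_distance_spread(d, n_pairs=2000, seed=1):
    """Sample pairs of uniform vectors in [0,1]^d and report how tightly
    their Euclidean distances concentrate around the mean."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        x = [rng.random() for _ in range(d)]
        y = [rng.random() for _ in range(d)]
        dists.append(math.dist(x, y))
    mean = sum(dists) / len(dists)
    spread = (max(dists) - min(dists)) / mean  # relative range
    return mean, spread

for d in (2, 20, 200):
    mean, spread = pairwise_distance_spread(d)
    print(f"d={d:4d}  mean distance={mean:.3f}  relative range={spread:.3f}")
```

As d grows, the relative range shrinks: near and far points look alike.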
COD Continued
COD is sometimes cited as the cause of the distance function losing its usefulness in high dimensions.6 This arises from the ratio of metric space partitioning to hypersphere embedding.
lim_{d→∞} Vol(S_d)/Vol(C_d) = π^(d/2) / (d · 2^(d−1) · Γ(d/2)) → 0
Given a single distribution, the minimum and maximum distances become indiscernible,7 and the relative majority of the space is outside of the sphere.
6 K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?," in Int. Conf. on Database Theory, pp. 217–235, 1999
7 K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?," in Int. Conf. on Database Theory, pp. 217–235, 1999
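The vanishing sphere-to-cube volume ratio can be computed directly from the limit above (a one-function sketch; the function name is mine):

```python
import math

def sphere_to_cube_ratio(d):
    """Vol(S_d)/Vol(C_d) for the unit-radius ball inside the side-2 cube:
    pi^(d/2) / (d * 2^(d-1) * Gamma(d/2))."""
    return math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))

for d in (2, 3, 10, 24):
    print(d, sphere_to_cube_ratio(d))
```

At d = 2 this gives the familiar π/4; by d = 24 the ball occupies a vanishing fraction of the cube.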
Approximate NN Methods
Due to the COD's effects, approximate methods are often used: they not only address the growing computational complexity, but also allow formulations of the nearest neighbor problem that somewhat sidestep the space-embedding issue by casting it as a probabilistic problem. Many variants of kd-trees exist, and they generally apply to approximate linear search methods.
Definition (c-Approximate Nearest Neighbor Search)
Given a set of points P in Rd and a query point q, return a point p ∈ P such that its distance to q is less than c·ρ(q, p*), where p* is the exact solution and ρ is some distance metric.8
*For kd-trees, common heuristic algorithms are best-bin-first and randomized kd-trees.
8 H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006
LSH Outline
Although approximate kd-tree searches offer various solutions to the growing search complexity issue, they either lack guarantees of exact search complexity bounds or fixed approximation values. For this reason, we consider the LSH search method of 9.
Andoni and Indyk state a query and space complexity of O(n) space and O(n^ρ) time, where ρ is a parameter of the hash function that is bounded by 1/4.
9 P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, (New York, NY, USA), pp. 604–613, ACM, 1998
LSH
The general concept of LSH based NN lies in the definition of a locality sensitive hash function.
LSH Hash Families
Definition (Locality Sensitive Hash Function)
A family H = {h : S → U} is (r1, r2, p1, p2)-sensitive if for any u, v ∈ S:
1. if d(u, v) ≤ r1 then PrH[h(u) = h(v)] ≥ p1
2. if d(u, v) > r2 then PrH[h(u) = h(v)] ≤ p2
For this family, ρ = log(p1)/log(p2)
LSH
Using the above family of hash functions, the LSH algorithm can be defined as an iterative convergence of probabilities leading to a solution to the approximate near neighbor problem.
Require: X = {x1, ..., xm}, xk ∈ Rn, a set of points; H, a locality sensitive hash function; D, a set of buckets
for all xk ∈ X do
  add xk to the hash bucket D[H(xk)]
end for
return H, D

Require: H, D, x̂ ∈ Rn
return D[H(x̂)]
If H is defined as splitting the vectors over a particular random plane, the results are very similar to the approximate kd-tree algorithm with depth 1.
To improve this algorithm, we must come up with a better locality sensitive hash function than simply splitting the dataset.
I select multiple planes and then intersect the returned hash buckets
I to increase selectivity of the buckets, we could concatenate hashes to create longer keys and more specific buckets
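These two ideas can be sketched with a toy index (not the thesis's Leech-based implementation): l tables, each keyed by a concatenation of k p-stable hashes of the form h(x) = ⌊(a·x + b)/w⌋. All names here are hypothetical.

```python
import random
from collections import defaultdict

def make_hash(dim, w, rng):
    """One p-stable LSH function h(x) = floor((a.x + b) / w): nearby
    points land in the same width-w slot with higher probability."""
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = rng.uniform(0, w)
    return lambda x: int((sum(ai * xi for ai, xi in zip(a, x)) + b) // w)

class LSHIndex:
    def __init__(self, dim, k=4, l=5, w=4.0, seed=0):
        rng = random.Random(seed)
        # l tables, each keyed by a concatenation of k hashes: longer
        # keys mean more selective buckets; more tables recover recall.
        self.tables = []
        for _ in range(l):
            hashes = [make_hash(dim, w, rng) for _ in range(k)]
            self.tables.append((hashes, defaultdict(list)))

    def _key(self, hashes, x):
        return tuple(h(x) for h in hashes)

    def add(self, x):
        for hashes, buckets in self.tables:
            buckets[self._key(hashes, x)].append(x)

    def query(self, q):
        candidates = []
        for hashes, buckets in self.tables:
            candidates.extend(buckets[self._key(hashes, q)])
        return candidates
```

A stored point always collides with an identical query, and near points collide in at least one table with high probability.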
An Example Hash Family
(Diagram: a projection line r divided into buckets of width w, labeled 0, 1, 2, 3, ...)
Figure: Random Projection of R2 → R1
Voronoi Tiling
An optimal partitioning in 2 dimensions is the Voronoi partitioning, and with a point location search it can be queried with complexity Θ(log(n)) and linear space10.
Figure: Voronoi Partitioning of R2
10 M. de Berg, M. van Kreveld, M. Overmars, et al., Computational geometry: algorithms and applications. Berlin: Springer, 2000
Voronoi
I Voronoi diagrams make for very efficient hash functions in 2d because, by definition, a point within a Voronoi region is nearest to the region's representative point.
I Voronoi regions provide an optimal solution to the NN partitioning problem in 2d space!
I However, for arbitrary dimension d, Voronoi diagrams require Θ(n^(d/2)) space, and no known optimal point location algorithm exists.
Lattices
Instead we will consider lattices, which provide regular space partitioning, scale to arbitrarily large dimensional spaces, and have sub-linear nearest center search algorithms associated with them.
Definition (Lattice in Rn)
Let v1, ..., vn be n linearly independent vectors where vi = (vi,1, vi,2, ..., vi,n). The lattice Λ with basis {v1, ..., vn} is the set of all integer combinations of v1, ..., vn; these integer combinations of the basis vectors are the points of the lattice.
Λ = {z1v1 + z2v2 + ... + znvn | zi ∈ Z, 1 ≤ i ≤ n}
11 W. C. Huffman and V. Pless, Fundamentals of Error-Correcting Codes. Cambridge University Press, 1st ed., 2003
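As a concrete illustration of the definition (an illustrative sketch, not part of the thesis implementation), a finite window of a 2d lattice can be enumerated directly from its basis:

```python
import itertools

def lattice_points(basis, zmax):
    """Enumerate the integer combinations z1*v1 + z2*v2 (|zi| <= zmax)
    of a 2d basis, i.e. a finite window of the lattice."""
    pts = set()
    for z1, z2 in itertools.product(range(-zmax, zmax + 1), repeat=2):
        v1, v2 = basis
        pts.add((z1 * v1[0] + z2 * v2[0], z1 * v1[1] + z2 * v2[1]))
    return pts

# Square lattice: standard basis; hexagonal lattice: skewed basis.
square = lattice_points(((1, 0), (0, 1)), 2)
hexagonal = lattice_points(((1, 0), (0.5, 3 ** 0.5 / 2)), 2)
```

The same definition scales to any dimension; only the basis changes.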
Examples in 2D
Figure: Square (left) and Hexagonal (right) Lattices in R2
Finding The Nearest Representative in Constant Time
I Certain lattices allow us to find the nearest representativepoint in constant time.
I For example, the above square lattice.
I The nearest point can be found by simply rounding our real-valued point to its nearest integer.
I Apart from a few exceptional lattices, more complex lattices have more complex searches (exponential as d increases12).
12 V. Tarokh and I. F. Blake, "Trellis complexity versus the coding gain of lattices. I," Information Theory, IEEE Transactions on, vol. 42, pp. 1796–1807, Nov 1996
Exceptional Higher Dimensional Lattices
The previous lattices work well in R2, but our data spaces are in general of dimension ≫ 2.
I Fortunately, there are some higher dimensional lattices with efficient nearest center search algorithms.
I E8, or Gosset's lattice, is one such lattice in R8.
I It is also the densest lattice packing in R8.13
13
Example of decoding E8
E8 can be formed by gluing two D8 integer lattices together and shifting one by a vector of all 1/2's. This gluing of less dense lattices and shifting by a "glue vector" is a common theme in finding dense lattices.
I Decoding D8 is simple
I E8 = D8 ∪ (D8 + 1/2)
I Both cosets of D8 can be computed in parallel
I D8's decoding algorithm consists of rounding all values to their nearest integer value s.t. they sum to an even number
Example of decoding E8
Define f(x) and g(x) to round the components of x, except that in g(x) we round the value furthest from an integer in the wrong direction. Let
x =< 0.1, 0.1, 0.8, 1.3, 2.2,−0.6,−0.7, 0.9 >
then f(x) = <0, 0, 1, 1, 2, −1, −1, 1>, sum = 3
and g(x) = <0, 0, 1, 1, 2, 0, −1, 1>, sum = 4
Since the sum of g(x) is even, g(x) is the nearest lattice point in D8.
Example of decoding E8 conti.
I However, we want the nearest point in E8, so we also have to include the coset D8 + 1/2.
I We can do this by subtracting 1/2 from all the values of x.
f(x − 1/2) = <0, 0, 0, 1, 2, −1, −1, 0>, sum = 1
g(x − 1/2) = <−1, 0, 0, 1, 2, −1, −1, 0>, sum = 0
Now we find the coset representative that is closest to x using a simple distance metric.
||x − g(x)||² = 0.65
||x − (g(x − 1/2) + 1/2)||² = 0.95
So in this case the nearest E8 point is the first coset representative:
< 0, 0, 1, 1, 2, 0, −1, 1 >
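The worked example above can be sketched in Python (a minimal illustration of the f/g procedure from the slides; function names are mine, and this is not the thesis's CUDA decoder):

```python
import math

def nearest_int(v):
    """Round to the nearest integer."""
    return math.floor(v + 0.5)

def decode_D8(x):
    """Decode D8 per the slides: f(x) rounds every coordinate; if the
    coordinate sum is odd, g(x) re-rounds the coordinate furthest from
    an integer in the wrong direction, flipping the parity to even."""
    f = [nearest_int(v) for v in x]
    if sum(f) % 2 != 0:
        i = max(range(len(x)), key=lambda j: abs(x[j] - f[j]))
        f[i] += 1 if x[i] > f[i] else -1
    return f

def decode_E8(x):
    """Decode E8 = D8 ∪ (D8 + 1/2): decode x in both cosets and keep
    the candidate closer to x in squared Euclidean distance."""
    c0 = decode_D8(x)
    c1 = [v + 0.5 for v in decode_D8([v - 0.5 for v in x])]
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return c0 if d0 <= d1 else c1

x = [0.1, 0.1, 0.8, 1.3, 2.2, -0.6, -0.7, 0.9]
print(decode_E8(x))  # → [0, 0, 1, 1, 2, 0, -1, 1], as in the example
```

The two decode_D8 calls are independent, which is the parallelism the slides exploit.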
Leech Lattice
By gluing sets of E8 together in a way originally conceived in Curtis's MOG (Miracle Octad Generator), we can get an even higher dimensional dense lattice called the Leech lattice.
Leech Lattice
Here we will state some attributes of the Leech lattice, as well as give a comparison to other lattices by way of Eb/N0 and the computational cost of decoding.
Some important attributes:
I Densest Regular Lattice Packing in R24
I Lattice construction can be based on 2 cosets of G24
I Sphere Packing Density: π^12/12! ≈ 0.00192957
I Kmin = 196560
Leech Lattice
(Plot: probability of bit error, log10(Pb), vs. Eb/N0 (dB) over 3,072,000 samples, for Leech hex −0.2 dB and QAM16, E8 2PSK, unencoded QAM64, and unencoded QAM16.)
Figure: Performance of Some Coded and Unencoded Data Transmission Schemes
Leech Lattice Decoding
Some information about the decoding algorithm:
I The decoding of the Leech lattice is based closely on the decoding of the Golay code.
I In general, advances in either Leech decoding or binary Golaydecoding imply an advance in the other.
I The decoding method used in this implementation is based on Amrani and Be'ery's '96 publication14 for decoding the Leech lattice, and consists of around 519 floating point operations and suffers a gain loss of only 0.2 dB.
I In general, decoding complexity scales exponentially with dimension.15
Next is an outline of the decoding process.
14 O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg, "The leech lattice and the golay code: bounded-distance decoding and multilevel constructions," Information Theory, IEEE Transactions on, vol. 40, pp. 1030–1043, Jul 1994
15 V. Tarokh and I. F. Blake, "Trellis complexity versus the coding gain of lattices. I," Information Theory, IEEE Transactions on, vol. 42, pp. 1796–1807, Nov 1996
Leech Decoder
(Block diagram: 24 reals received feed parallel QAM decoders for the A and B points; block confidence values for even and odd representatives are minimized over H6; H-parity and K-parity checks on the even and odd branches produce candidate codewords, which are compared to select the output.)
Figure: Leech Lattice Decoder
CUDA
CUDA is a programming/computational platform for performing general purpose processing on GPU hardware. CUDA is implemented as an extension of C, exposing various parallelization and memory allocation primitives. The advantages of GPGPU computing can be most easily seen in the picture on the next slide.
Advantages of CUDA in a Picture
(Diagram: a CPU multiprocessor devotes much of its die to control logic and cache alongside a few ALUs, while a GPU streaming processor devotes most of its die to many ALUs, each group sharing small control and cache blocks; both attach to DRAM.)
Figure: Transistor Allocation for CPU vs. GPU Architectures
CUDA doesn’t come without a cost
So what do we have to pay for all of these processors and overallhigher ALU density?
I We have to adapt or rewrite programs into CUDA
I Memory Latency, no caching* (feeding lots of little beasts)
I Memory size per core
I Synchronization, communication, standard parallel programming costs
Parallel Decoder Outline
I Naive CUDA Implementation
I Improved CUDA Implementation
I CUDA Pitfalls
I Some Performance Tweaks
Preprocessing
Require: X = {x1, ..., xm}, xk ∈ Rn
1: U(x) is a universal hash function
2: hk(x) ∈ H is (r, cr, p1, p2)-sensitive
3: choose l s.t.
4: G ← i ∈ Zl from [0, n)
5: g(x) = {h1(x), ..., hj(x)}
6: D ← []
7: for all xk ∈ X do
8: D ← U(g(xk))
9: end for
10: return G, D
Query: c-approx k-nearest neighbor
Require: x̂ ∈ Rn, G, H, U(), D
1: for all gj ∈ G do
2: U(gj(x̂))
3: end for
4: K = []
5: for all l ∈ L do
6: d ← dist(x̂, D[l])
7: K ← {d, D[l]}
8: end for
9: sort(K)
10: return K[0 : k]
An addition to the Standard Leech Lattice LSH Algorithm
I As suggested in Panigrahy16, minimize database size (to Θ(n)) by delaying replication to the query phase.
I In addition, we use a suggestion of Jegou17 for E8 exact c-nearest neighbor search, ordering our 2L exact search list by the lattice center weights returned by the Leech lattice algorithm.
16 R. Panigrahy, "Entropy based nearest neighbor search in high dimensions," in Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, SODA '06, (New York, NY, USA), pp. 1186–1195, ACM, 2006
17
SIFT
Scale invariant feature transform (SIFT) is an image transform patented by David Lowe of the University of British Columbia, for identifying local features in an image that are invariant to scale, rotation, affine distortion, and, to some extent, lighting changes.
SIFT maps an image to a set of feature vectors by first finding keypoints and then applying various rotational and scale transformations so that the resulting feature point is scale- and rotation-invariant.
These features will serve as our database vectors; a query image will be SIFT'd and its features searched in the database using the lattice LSH search algorithm.
SIFT Example
Figure: Top image is a scene and the bottom contains the SIFT image matches
Results
Tests were performed on real-world SIFT vectors formed from the Caltech25618 database. The images in the image set were converted to their SIFT vector equivalents, and an LSH database was created from the SIFT vectors relating images. Random images were chosen and searched with our parallel LSH algorithm, as well as with an exhaustive search for nearest neighbors based on the default algorithm provided with the SIFT library.
18 G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," Tech. Rep. 7694, 2007
Query Time as a Ratio of Parallel to Total Time
Average speedup in decoding between parallel and sequential decoding is approximately 41x.
(Plot: ratio of parallel to total compute time for roughly 300 SIFT query samples, ranging from about 0.88 to 0.94; avg = 0.9340, R² = 2.83e−05.)
Figure: Ratio of Parallel to Total Time for Queries on an LSH Search Database using Real World SIFT Sample Data
Extrapolated Lowe's Linear vs Parallel LSH Search
Below is a semilog comparison of linear vs. parallel LSH search. Extrapolation is required due to the rapidly increasing complexity of linear search.
(Semilog plot: extrapolated average query time vs. number of images, up to 100,000, for extrapolated linear search and extrapolated LSH.)
Figure: Extrapolated Average Search Query Times
Extrapolated Lowe's Linear vs Parallel LSH Search
(Plot: number of matches over about 250 sample images for linear search and LSH with l = 1, 5, and 10.)
Figure: Recall Accuracy of Random SIFT Vector Image Query
Effects of our Addition 2 and Selectivity
(Plot: NN recall vs. selectivity, 10⁻⁵ to 10⁻³, for distance-sorted and unsorted LSH.)
Figure: Data Recall Selectivity for Query Adaptive and Unranked LSH
Conclusions
I Our primary conclusion is that parallel decoding is a useful method for decoding the Leech lattice; however, its utility does not scale to arbitrarily large databases in this implementation due to the sequential bottleneck.
I On a more positive note, the query adaptive implementation yields a considerable benefit to selectivity over the original method.
Future Directions
Although the applications of approximate nearest neighbor search are manifold, offering a wide variety of uses similar to those above, we will instead focus on directions that better accommodate this system's strengths and avoid its weaknesses.
I Adaptive Mean Shift
I Networked Distributed KNN Search for Heterogeneous Large Scale Problems
I Finding more rigorous correlations between ECCs and Locality Sensitive Hashing Functions in regards to Θ(n^ρ).
Future Directions: Adaptive Mean Shift
Without explicitly defining the functions involved in mean shift and the variant adaptive mean shift, we will simply show the gradient update step below, and define N(x) as the set of nearest neighbors of x.
x ← ( ∑_{xi∈N(x)} K(x − xi) · xi ) / ( ∑_{xi∈N(x)} K(x − xi) )
A simple extension would be to use LSH-KNN at this step; furthermore, the approximation factor c may even be somewhat beneficial on real-world data, as it may add a useful soft margin to the clustering.
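One update of the step above can be sketched as follows (a minimal illustration assuming a Gaussian kernel; in the proposed extension, `neighbors` would come from the LSH-KNN query rather than an exact search, and all names here are mine):

```python
import math

def mean_shift_step(x, neighbors, h=1.0):
    """One mean-shift update: move x to the kernel-weighted average of
    its neighbors, with Gaussian kernel K(u) = exp(-|u|^2 / (2 h^2))."""
    weights = [math.exp(-math.dist(x, xi) ** 2 / (2 * h * h)) for xi in neighbors]
    total = sum(weights)
    return tuple(
        sum(w * xi[j] for w, xi in zip(weights, neighbors)) / total
        for j in range(len(x))
    )
```

Iterating this step moves x toward a local density mode of its neighborhood.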
Distributed KNN ...
I This extension may very well be used as a jumping-off point into parallel exascale computing for my further research.
I A direction I would like to pursue is to use the universally hashed, locality sensitive hash values for nearest neighbor vectors, and fold them into a search across heterogeneous systems using the same hashing scheme, supporting a query architecture similar to internet routing schemes such as RIP or the more scalable OSPF (open shortest path first) algorithm.
I As an added bonus, there seems to be a variety of tunableattributes
I space-time tradeoff
I actual network architecture versus algorithmic architecture and optimizations
Closing Notes
Thank you everyone for your time! Complete references, as well as slides, source code, and the original thesis, are freely available at homepages.uc.edu/~carrahle
O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg, "The leech lattice and the golay code: bounded-distance decoding and multilevel constructions," Information Theory, IEEE Transactions on, vol. 40, pp. 1030–1043, Jul 1994.
A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pp. 459–468, Oct 2006.
H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.
A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2011.
K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?," in Int. Conf. on Database Theory, pp. 217–235, 1999.
P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, (New York, NY, USA), pp. 604–613, ACM, 1998.
M. de Berg, M. van Kreveld, M. Overmars, et al., Computational geometry: algorithms and applications. Berlin: Springer, 2000.
W. C. Huffman and V. Pless, Fundamentals of Error-Correcting Codes. Cambridge University Press, 1st ed., 2003.
V. Tarokh and I. F. Blake, "Trellis complexity versus the coding gain of lattices. I," Information Theory, IEEE Transactions on, vol. 42, pp. 1796–1807, Nov 1996.
R. Panigrahy, "Entropy based nearest neighbor search in high dimensions," in Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, SODA '06, (New York, NY, USA), pp. 1186–1195, ACM, 2006.
G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," Tech. Rep. 7694, 2007.