GPU-Accelerated Semantic Similarity Search at Scale
Kubilay Atasu, IBM Research - Zurich
in collaboration with Thomas Parnell, Celestine Duenner, Manolis Sifalakis, Haris Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, Abdel Labbi
Outline
§ Introduction
§ Background
§ Our solution
§ Our results
Why scalable similarity (i.e., nearest neighbors) search?
Example: financial news analysis
§ 100k news entries every day
§ 100M entries in three years
§ searching, browsing, clustering
Need for a similarity/distance metric:
§ must be accurate and scalable
Image Source: http://social-dynamics.org/
Word Mover’s Distance (WMD) for Semantic Similarity
M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.
Principle of WMD
Document 1: "The Queen to tour Canada"
Document 2: "Royal visit to Halifax"
[Figure: the words of both documents (Queen, tour, Canada; Royal, visit, Halifax) plotted as nearby points in the word-embedding space]
High Cost of Word Mover’s Distance (WMD)
| | Quality | Time complexity | GPU friendly |
|---|---|---|---|
| WMD | Very high | Cubic | No |
| Relaxed WMD | High | Quadratic | Yes |
| Our solution | High | Linear | Yes |
Word Mover’s Distance: very high quality, but very high complexity!
Dense and Sparse Linear Algebra on Graphics Processing Units (GPUs)
Sub-second query performance on very large data sets!
Outline
§ Introduction
§ Background
§ Our solution
§ Our results
WMD: Earth Mover’s Distance using Word Embeddings
[Figure: the bag-of-words histograms of the two documents (Queen, tour, Canada vs. Royal, visit, Halifax) and the cost matrix of pairwise word distances between Histogram 1 and Histogram 2]
Solves a minimum-cost flow problem!
Cubic time complexity in the size of the histograms!
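For reference, here is a minimal sketch of that minimum-cost flow formulation, written as a small transportation LP in Python. The helper name `wmd` and the use of `scipy.optimize.linprog` are illustrative choices for exposition, not the solver used in the talk.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(f1, f2, C):
    """Word Mover's Distance between two histograms as a transportation LP.

    f1: (h1,) word weights of document 1 (summing to 1)
    f2: (h2,) word weights of document 2 (summing to 1)
    C:  (h1, h2) pairwise word distances (the cost matrix)
    """
    h1, h2 = C.shape
    # Flow variable T[i, j] >= 0, flattened to length h1*h2;
    # equality constraints fix its row sums to f1 and column sums to f2.
    A_eq, b_eq = [], []
    for i in range(h1):                      # row i ships exactly f1[i]
        row = np.zeros((h1, h2)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(f1[i])
    for j in range(h2):                      # column j receives exactly f2[j]
        col = np.zeros((h1, h2)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(f2[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun                           # minimum total transport cost
```

A general-purpose LP makes the high cost easy to see: there are h1·h2 flow variables and h1 + h2 constraints per document pair.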
Relaxed WMD (RWMD): A Lower Bound of WMD
[Figure: Histogram 1 (h1 words) and Histogram 2 (h2 words) combined via ⊙ into an h1 × h2 cost matrix]
M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.
Quadratic time and space!
It is not possible to do better when comparing only two histograms!
Problem Formulation: Input and Output Data Structures
Given two sets of histograms (X1 and X2) and the embedding vectors E: for each histogram in X2, compute the k most similar histograms in X1.
§ Sparse matrix X1 (DATABASE): n1 histograms over a vocabulary of v words; X1[i,w] is the frequency of word w in histogram i
§ Sparse matrix X2 (QUERY): n2 histograms over the same vocabulary of v words
§ Dense matrix E (v × m): E[w] is the embedding vector for word w; m is the size of the embedding vectors
§ Dense matrix R (OUTPUT): n2 rows; row j lists the k most similar histograms in X1
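A minimal sketch of these inputs in Python, assuming SciPy CSR matrices for X1/X2 and a NumPy array for E; the helper `bow_histograms` and the toy sizes and documents are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bow_histograms(docs, v):
    """Build an n x v sparse matrix of L1-normalized word frequencies.

    docs: list of documents, each a list of word ids in [0, v).
    """
    rows, cols, vals = [], [], []
    for i, doc in enumerate(docs):
        ids, counts = np.unique(doc, return_counts=True)
        rows.extend([i] * len(ids))
        cols.extend(ids.tolist())
        vals.extend((counts / counts.sum()).tolist())
    return csr_matrix((vals, (rows, cols)), shape=(len(docs), v))

v, m = 10_000, 300                             # toy vocabulary/embedding sizes
E = np.random.randn(v, m).astype(np.float32)   # stand-in for word2vec vectors
X1 = bow_histograms([[1, 5, 5, 42], [7, 8, 9]], v)  # DATABASE: n1 x v
X2 = bow_histograms([[5, 7, 1000]], v)              # QUERY:    n2 x v
```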
Computing the Cost Matrix: Dense Matrix-Matrix Multiplication
Compute pairwise distances between all words in histogram i and all words in histogram j:
§ Dense matrix T1,i (h1,i × m): embedding vectors of all words in histogram i
§ Dense matrix T2,j (h2,j × m): embedding vectors of all words in histogram j

Ci,j = T1,i ∘ T2,j
Complexity: O(h²m)
Excellent candidate for GPU acceleration!
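A sketch of the cost-matrix computation, assuming Euclidean word distances (the distance choice is an assumption here). It uses the standard expansion ||a − b||² = ||a||² + ||b||² − 2a·b, so that the O(h²m) term becomes a single dense GEMM, which is exactly the part a GPU (cuBLAS) accelerates.

```python
import numpy as np

def pairwise_distances(T1, T2):
    """Euclidean distances between the rows of T1 (h1 x m) and T2 (h2 x m).

    Returns the h1 x h2 cost matrix C[i, j] = ||T1[i] - T2[j]||.
    The matrix product dominates: O(h1 * h2 * m), one dense GEMM.
    """
    sq1 = (T1 ** 2).sum(axis=1)[:, None]   # h1 x 1 squared norms
    sq2 = (T2 ** 2).sum(axis=1)[None, :]   # 1 x h2 squared norms
    d2 = sq1 + sq2 - 2.0 * (T1 @ T2.T)     # squared distances
    return np.sqrt(np.maximum(d2, 0.0))    # clamp tiny negative round-off
```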
Word Mover’s Distance (WMD)
Dense matrix D (n2 × n1); D[j,i]: distance between histogram j and histogram i
Dense vector F1,i (length h1,i): frequencies of the words in histogram i

Given Ci,j, F1,i, and F2,j, compute D[j,i]:
Ci,j = T1,i ∘ T2,j
D[j,i] = WMD(F1,i, F2,j, Ci,j)

Complexity of computing D[j,i]: O(h²m + h³ log h)
Complexity of computing the full row D[j]: O(nh²m + nh³ log h)
Relaxed Word Mover’s Distance – Quadratic Implementation
Dense matrix D (n2 × n1); D[j,i]: distance between histogram j and histogram i
Dense vector F1,i (length h1,i): frequencies of the words in histogram i

Given Ci,j, F1,i, and F2,j, compute D[j,i]:
Ci,j = T1,i ∘ T2,j
D[j,i] = F1,iᵀ min(Ci,j)   (minimum taken along the words of histogram j)

Complexity of computing D[j,i]: O(h²m)
Complexity of computing the full row D[j]: O(nh²m)
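A sketch of the quadratic RWMD for one document pair, reusing the `pairwise_distances` helper from above. Taking the maximum of the two one-sided relaxations (each obtained by dropping one WMD constraint, as in Kusner et al.) gives the tighter lower bound.

```python
def rwmd_quadratic(f1, T1, f2, T2):
    """Quadratic-time RWMD lower bound between two documents.

    f1, f2: word weights; T1, T2: per-document embedding matrices.
    Each one-sided bound ships every word's full weight to the closest
    word of the other document.
    """
    C = pairwise_distances(T1, T2)    # h1 x h2 cost matrix, O(h^2 m)
    d12 = f1 @ C.min(axis=1)          # doc 1 onto closest words of doc 2
    d21 = f2 @ C.min(axis=0)          # doc 2 onto closest words of doc 1
    return max(d12, d21)              # tighter symmetric lower bound
```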
Quadratic Implementation – GPU Mapping
§ T2,j (h2,j × m), 0 ≤ j ≤ n2 − 1: streamed in one document at a time
§ T1,i (h1,i × m), 0 ≤ i ≤ n1 − 1: all n1 matrices resident in GPU memory

Compute one row of D in parallel on a single GPU:
D[j] = F1ᵀ min(T1 ∘ T2,j)

Memory requirement on one GPU: O(nhm)
How far away is RWMD from WMD?
What is the fraction of overlap between top-k results of RWMD and WMD?
[Figure: precision of the top-k overlap between RWMD and WMD; x-axis: RWMD selection (%) from 10 down to 0.001; y-axis: precision from 0 to 1.2; one curve per WMD selection rate (0.001%, 0.01%, 0.1%, 1%, 10%)]
Outline
§ Introduction
§ Background
§ Our solution
§ Our results
Relaxed WMD (RWMD): Redundancy
[Figure: Histogram 1 (h1 words) compared via ⊙ against Histogram 2 and against Histogram 3; each comparison builds its own cost matrix]
Common words are the problem!
Redundancy can be eliminated!
Overview of Linear-Complexity RWMD (LC-RWMD)
Query documents (X2) are distributed across a cluster of GPUs and compared against the database documents (X1):
Phase 1: for each word in the vocabulary, compute the distance to the closest word in one query doc. Store the results in a dense vector Z.
Phase 2: sparse-matrix dense-vector multiplication between X1 and Z to compute the distances between the query and the database docs.
LC-RWMD: First Phase
§ Dense matrix T2,j (h2,j × m): embedding vectors of the query histogram j
§ Dense matrix E (v × m): embedding vectors of the complete vocabulary

Multiply E by the transpose of T2,j (a v × h2,j product, E ∘ T2,j) and compute row-wise minimums:
Z = min(E ∘ T2,j)

Z is a dense vector of length v. Complexity of the first phase: O(vhm)
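A sketch of the first phase, again with the illustrative `pairwise_distances` helper: one tall v × h2,j distance computation against the entire vocabulary, followed by a row-wise minimum.

```python
def lc_rwmd_phase1(E, T2j):
    """Phase 1 of LC-RWMD for one query histogram.

    E: v x m vocabulary embeddings; T2j: h2,j x m query embeddings.
    Returns the dense vector Z (length v): the distance from every
    vocabulary word to its closest query word. Cost: O(v * h * m).
    """
    return pairwise_distances(E, T2j).min(axis=1)
```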
LC-RWMD: Second Phase
§ Dense vector Z (length v): distance to the closest word in the query histogram, for each word in the vocabulary
§ Sparse matrix X1 (n1 × v): weights of the database histograms in compressed sparse row (CSR) format

Sparse matrix-vector multiply to compute the distances:
D[j] = X1 × Z
Complexity: O(nh)

Overall complexity: O(vhm + nh)
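Phase 2 is then a single sparse matrix-vector product (csrmv in cuSPARSE terms; plain `scipy.sparse` here):

```python
def lc_rwmd_phase2(X1, Z):
    """Phase 2 of LC-RWMD: one-sided distances to all database docs.

    X1: n1 x v CSR matrix of database word weights; Z from phase 1.
    Row i of X1 dotted with Z sums weight * distance-to-closest-query-word
    over the words of histogram i. Cost: O(n * h), one pass over nonzeros.
    """
    return X1 @ Z                      # length-n1 vector of lower bounds
```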
Complexity Comparison (Time)
Comparing one query document with n database documents:
§ Complexity of linear RWMD: O(vhm + nh)
§ Complexity of quadratic RWMD: O(nh²m)
§ Improvement vs. quadratic RWMD: O(min(nh/v, hm))

h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary
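To make the gap concrete with the deck's own Set 1 parameters (n = 1M database documents, h = 150, v = 3M, m = 300): nh/v = 10⁶ · 150 / (3 · 10⁶) = 50, while hm = 150 · 300 = 45,000, so the improvement term min(nh/v, hm) evaluates to about 50: LC-RWMD performs roughly 50× fewer operations than quadratic RWMD on this workload, before any GPU parallelism is applied.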
LC-RWMD – GPU Mapping
§ Sparse matrix X1 (n1 × v): resident in GPU memory
§ Dense matrix E (v × m): word embeddings, resident in GPU memory
§ Dense matrix T2,j (h2,j × m), 0 ≤ j ≤ n2 − 1: streamed in one document at a time

Memory requirement on one GPU: O(vm + nh + vh)
Complexity Comparison (Space)
§ Space complexity of linear RWMD: O(vm + nh + vh)
§ Space complexity of quadratic RWMD: O(nhm)
§ Improvement w.r.t. quadratic RWMD: O(min(nh/v, nm/v, m))

n: # database documents, h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary
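With the same Set 1 numbers: nh/v = 50, nm/v = 10⁶ · 300 / (3 · 10⁶) = 100, and m = 300, so the space savings also come out to about 50×: quadratic RWMD keeps O(nhm) embedding data resident, while LC-RWMD stores only E, the sparse X1, and one v × h intermediate.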
LC-RWMD: Dealing with Asymmetric Distances
Run LC-RWMD in both directions:
§ X1 as DATABASE (n1 × v), X2 as QUERY (n2 × v) → distance matrix D1 (n2 × n1)
§ X2 as DATABASE (n2 × v), X1 as QUERY (n1 × v) → distance matrix D2 (n1 × n2)

D = max(D1ᵀ, D2)
Transpose, maximum, and top-k are computed on the CPU.
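Putting both phases and both directions together, a minimal end-to-end sketch under the assumptions above (Euclidean distances, CSR inputs, and the illustrative helpers); the per-query Python loop stands in for the GPU streaming, and `D2.T` plays the role of the slide's transpose.

```python
def lc_rwmd(X1, X2, E):
    """Symmetric LC-RWMD between database docs (X1) and query docs (X2)."""
    def one_sided(Xdb, Xq):
        # Rows of the result are query docs, columns are database docs.
        D = np.empty((Xq.shape[0], Xdb.shape[0]), dtype=np.float32)
        for j in range(Xq.shape[0]):
            words = Xq.indices[Xq.indptr[j]:Xq.indptr[j + 1]]  # CSR row j
            Z = pairwise_distances(E, E[words]).min(axis=1)    # phase 1
            D[j] = Xdb @ Z                                     # phase 2
        return D

    D1 = one_sided(X1, X2)        # n2 x n1: X1 as database, X2 as query
    D2 = one_sided(X2, X1)        # n1 x n2: X2 as database, X1 as query
    # Element-wise max of the two directions (the slide's max(D1ᵀ, D2),
    # transposed here so that rows are queries).
    return np.maximum(D1, D2.T)   # n2 x n1 symmetric lower bound on WMD

# Top-k most similar database docs per query (smallest distances):
D = lc_rwmd(X1, X2, E)            # toy X1, X2, E from the earlier sketch
k = 1                             # toy database has only 2 documents
topk = np.argpartition(D, k, axis=1)[:, :k]
```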
Outline
§ Introduction
§ Background
§ Our solution
§ Our results
Speed-up vs. CPU-based RWMD
§ CPU: Intel® Core® i7-6900K @ 3.20 GHz, 8 cores (SMT2), 64 GB memory, Intel® MKL
§ GPU: NVIDIA® Tesla® P100, 16 GB memory, CUDA 9.0 with CUBLAS and CUSPARSE
Set 1: h=150 words per histogram
Set 2: h=30 words per histogram
Google’s Word2Vec (Google News)
v = 3M words, m = 300 floating-point numbers
All operations in single-precision floating point
[Chart (log scale, 1 to 10,000): speed-up over the CPU baseline. RWMD on GPU: 18× (Set 1) and 19× (Set 2); LC-RWMD on GPU: 2795× (Set 1) and 1330× (Set 2)]
Runtime vs GPU-accelerated WMD
Time to compare one query doc with all database docs using 16 CPU processes + 16 GPUs
Set 1: n=1M docs, h=150 words per hist. Set 2: n=2.8M docs, h=30 words per hist.
[Chart (log scale, runtime in seconds from 0.001 to 1000): WMD (16 GPUs) vs. LC-RWMD (16 GPUs); LC-RWMD is roughly 30,000× faster on Set 1 and 2,000× faster on Set 2]
Comparison with WMD: Precision at Top-K for Set 2
§ small: 300-1000 examples per label
§ medium: 1k-10k examples per label
§ large: 10k-100k examples per label
§ very large: 100k-1M examples per label

[Chart: precision (0.15 to 0.95) vs. K (4 to 128) for WMD and LC-RWMD on the small, medium, large, and very large groups]
Summary
A linear-complexity method for computing the Relaxed Word Mover's Distance
§ The original method proposed by Kusner et al. has quadratic complexity
§ ~30,000-fold improvement in performance w.r.t. GPU-accelerated WMD
§ ~2,800-fold improvement w.r.t. the CPU implementation of quadratic RWMD
Main insight: Big Data offers new ways of dealing with algorithmic complexity!
§ Reduce complexity by eliminating redundant and repetitive operations
§ Exploit the massive parallelism offered by GPUs and clusters of GPUs
Business and Academic Impact
§ Being used by our business developers: ingestion of business news
§ Sub-second execution latency for similarity queries (100k docs per day)
§ Database of 100M documents using 16 NVIDIA® Tesla® P100 GPUs
§ Larger databases or higher ingestion rates? Simply add more GPUs!
§ IEEE Big Data 2017 Conference, ERCIM News, GTC 2018 Conference
Future directions
§ Possible improvements:
§ CUDA streams to overlap CPU/GPU computation, half-precision support
§ Sinkhorn Distance to better approximate WMD (quadratic complexity)
§ Supervised training of word weights and word vectors (supervised WMD)
§ Limitations of bag-of-words: augment syntax trees with word vectors
§ Possible extensions:
§ Use FPGAs with hard floating-point cores and high-bandwidth memories
§ Similarity search in other domains: time series, images, genomics data
Questions?
Many-to-many LC-RWMD: First Phase
§ Dense matrix E (v × m): E[w] is the vector representation of word w
§ Dense matrices T2,j (h2,j × m), 0 ≤ j ≤ n2 − 1: all n2 query histograms resident in GPU memory

Z = min(E ∘ T2)
Z is a dense v × n2 matrix: column j holds the phase-1 result for query histogram j.
Many-to-many LC-RWMD: Second Phase
§ Dense matrix Z (v × n2): Z[w,j] stores the distance to the closest word in histogram j, for each word w in the vocabulary
§ Sparse matrix X1 (n1 × v): # Hists = n1, # Words in vocabulary = v

Sparse-matrix dense-matrix multiply to compute D:
D = X1 × Z
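A sketch of the many-to-many variant: stacking the per-query Z vectors into a dense v × n2 matrix turns the second phase into one sparse-dense matrix multiply (csrmm in cuSPARSE terms), which keeps the GPU busy with a single large operation.

```python
def lc_rwmd_many_to_many(X1, X2, E):
    """One-sided LC-RWMD for all query documents at once.

    Builds Z (v x n2), where Z[w, j] is the distance from vocabulary
    word w to its closest word in query histogram j, then computes all
    n1 x n2 one-sided distances with a single sparse-dense matmul.
    """
    cols = []
    for j in range(X2.shape[0]):
        words = X2.indices[X2.indptr[j]:X2.indptr[j + 1]]
        cols.append(pairwise_distances(E, E[words]).min(axis=1))
    Z = np.stack(cols, axis=1)     # dense v x n2 matrix
    return X1 @ Z                  # n1 x n2 one-sided distance matrix
```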
How far away is Word Centroid Distance from WMD?
What is the fraction of overlap between the top-k results of WCD and WMD?
[Figure: precision of the top-k overlap between WCD and WMD; x-axis: WCD selection (%) from 10 down to 0.001; y-axis: precision from 0 to 1.2; one curve per WMD selection rate (0.001%, 0.01%, 0.1%, 1%, 10%)]