A GPU-based Approximate SVD Algorithm
Blake Foster, Sridhar Mahadevan, Rui Wang
University of Massachusetts Amherst
Introduction
Singular Value Decomposition (SVD): $A = U \Sigma V^T$
• 𝑨: 𝑚 × 𝑛 matrix (𝑚 ≥ 𝑛)
• 𝑼, 𝑽: orthogonal matrices
• 𝜮: diagonal matrix
Exact SVD can be computed in $O(mn^2)$ time.
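As a quick illustration (not from the talk), the definition can be checked numerically with NumPy:

```python
import numpy as np

m, n = 8, 5                                       # m >= n, as assumed above
A = np.random.randn(m, n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
assert np.allclose(A, U @ np.diag(s) @ Vt)        # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(n))            # U has orthonormal columns
```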
Introduction
Approximate SVD
• 𝑨: 𝑚 × 𝑛 matrix, whose intrinsic dimensionality is far less than 𝑛
Usually sufficient to compute the 𝑘 (≪ 𝑛) largest singular values and corresponding vectors.
Approximate SVD can be computed in $O(mnk)$ time (compared to $O(mn^2)$ for the full SVD).
Introduction
Sample-based Approximate SVD
• Construct a basis 𝑽 by directly sampling or taking linear combinations of rows from 𝑨.
• An approximate SVD can be extracted from the projection of 𝑨 onto 𝑽 .
Sampling Methods
• Length-squared [Frieze et al. 2004, Deshpande et al. 2006]
• Random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009]
Introduction
QUIC-SVD [Holmes et al. NIPS 2008]
• Sampled-based approximate SVD.
• Progressively constructs a basis 𝑽 using a cosine tree.
• Results have a controllable L2 error; basis construction adapts to the matrix and the desired error.
Our Goal
Map QUIC-SVD onto the GPU (Graphics Processing Unit)
• Powerful data-parallel computation platform.
• High computation density, high memory bandwidth.
• Relatively low-cost.
NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549
Related Work
OpenGL-based: rasterization, framebuffers, shaders
• [Galoppo et al. 2005]
• [Bondhugula et al. 2006]
CUDA-based:
• Matrix factorizations [Volkov et al. 2008]
• QR decomposition [Kerr et al. 2009]
• SVD [Lahabar et al. 2009]
• Commercial GPU linear algebra package [CULA 2010]
Our Work
A CUDA implementation of QUIC-SVD.
Up to 𝟔~𝟕 times speedup over an optimized CPU version.
A partitioned version that can process large, out-of-core matrices on a single GPU
• The largest we tested were 22,000 × 22,000 matrices (double-precision floats).
Overview of QUIC-SVD
[Holmes et al. 2008]
Overview of QUIC-SVD
Start from a root node that owns the entire matrix (all rows).
Overview of QUIC-SVD
Select a pivot row by importance sampling, with probability proportional to the squared magnitudes of the rows.
Overview of QUIC-SVD
Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).
Overview of QUIC-SVD
Based on the inner products, the rows are partitioned into two subsets, each represented by a child node: rows whose inner products are closer to zero go to one child, and the rest go to the other. A sketch of one split follows.
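As an illustration (hypothetical helper, not the authors' CUDA code), one cosine-tree split might look like this in NumPy, with the "closer to zero" rule approximated by a threshold at half the largest magnitude:

```python
import numpy as np

def split_node(A, rows, rng):
    """One cosine-tree split of the row subset `rows` of A."""
    sq = (A[rows] ** 2).sum(axis=1)                       # squared row magnitudes
    pivot = rows[rng.choice(len(rows), p=sq / sq.sum())]  # importance-sampled pivot
    dots = np.abs(A[rows] @ A[pivot])                     # |inner product| with pivot
    near_zero = dots < 0.5 * dots.max()                   # assumed "closer to zero" rule
    return rows[near_zero], rows[~near_zero]              # the two child subsets

rng = np.random.default_rng(0)
A = np.random.randn(1000, 64)
left, right = split_node(A, np.arange(1000), rng)
```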
Overview of QUIC-SVD
Compute the row mean (average) of each subset and add it to the current basis set (keeping the basis orthogonalized).
Overview of QUIC-SVD
Repeat, until the estimated L2 error is below a threshold. The basis set now provides a good low-rank approximation.
Overview of QUIC-SVD
Monte Carlo Error Estimation:
Input: any subset of rows $A_s$, the basis $V$, and $\delta \in [0,1]$
Output: an estimated error $\hat{\epsilon}$ such that, with probability at least $1-\delta$, the L2 error satisfies $\|A_s - A_s V V^T\|_F^2 \le \hat{\epsilon}$
Termination criterion: $\|A - A V V^T\|_F^2 \le \epsilon \|A\|_F^2$ (with probability at least $1-\delta$)
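For intuition, a minimal Monte Carlo residual estimate might look as follows. This is a hedged sketch: the estimator and confidence bound in Holmes et al. 2008 differ in details, and the uniform row sampling and scale factor here are assumptions.

```python
import numpy as np

def mc_error_estimate(A_s, V, n_samples=64, rng=None):
    """Estimate ||A_s - A_s V V^T||_F^2 from a uniform sample of rows."""
    rng = rng or np.random.default_rng()
    m = A_s.shape[0]
    idx = rng.choice(m, size=n_samples, replace=True)  # uniform row sample
    R = A_s[idx] - (A_s[idx] @ V) @ V.T                # sampled residual rows
    per_row = (R * R).sum(axis=1)                      # squared residual norms
    return m * per_row.mean()                          # unbiased scale-up to m rows
```

A $(1-\delta)$-confidence upper bound can then be formed from the sample variance of `per_row`.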
Overview of QUIC-SVD
5 Steps:
1. Select a leaf node with maximum estimated L2 error;
2. Split the node and create two child nodes;
3. Compute the mean vector of each child and insert it into the basis set 𝑉 (replacing the parent's vector), keeping the basis orthogonalized;
4. Estimate the error of the newly added child nodes;
5. Repeat steps 1-4 until the termination criterion is satisfied. (A sketch of this loop follows.)
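For concreteness, here is a compact, self-contained NumPy sketch of this loop. It is illustrative only: it computes exact leaf errors instead of the Monte Carlo estimate, appends the child means rather than replacing the parent's vector, splits at the median inner product, and runs on the CPU, whereas the real implementation performs these steps in CUDA.

```python
import heapq
import numpy as np

def leaf_error(A, rows, V):
    """Exact squared L2 error of a row subset against the basis V."""
    R = A[rows] - (A[rows] @ V) @ V.T
    return float((R * R).sum())

def split(A, rows, rng):
    """Cosine-tree split: length-squared pivot, then partition at the median."""
    sq = (A[rows] ** 2).sum(axis=1)
    pivot = rows[rng.choice(len(rows), p=sq / sq.sum())]
    order = np.argsort(np.abs(A[rows] @ A[pivot]))
    half = len(rows) // 2
    return rows[order[:half]], rows[order[half:]]  # "closer to zero", the rest

def add_mean(V, A, rows):
    """Append the orthogonalized, normalized subset mean to the basis."""
    v = A[rows].mean(axis=0)
    v = v - V @ (V.T @ v)
    nrm = np.linalg.norm(v)
    return np.hstack([V, (v / nrm)[:, None]]) if nrm > 1e-12 else V

def quic_svd_basis(A, eps, rng):
    m, n = A.shape
    root = np.arange(m)
    V = add_mean(np.empty((n, 0)), A, root)        # start from the root's mean
    heap = [(-leaf_error(A, root, V), 0, root)]    # max-heap via negated errors
    tick = 1                                       # tie-breaker for the heap
    target = eps * (A * A).sum()
    while -sum(e for e, _, _ in heap) > target:
        neg, _, rows = heapq.heappop(heap)         # 1. leaf with largest error
        if len(rows) < 2:                          # a single row cannot split
            heapq.heappush(heap, (neg, tick, rows)); tick += 1
            break
        for child in split(A, rows, rng):          # 2. split into two children
            V = add_mean(V, A, child)              # 3. add child mean to V
            err = leaf_error(A, child, V)          # 4. re-estimate child error
            heapq.heappush(heap, (-err, tick, child)); tick += 1
    return V                                       # 5. criterion satisfied

rng = np.random.default_rng(0)
A = np.random.randn(500, 80) @ np.random.randn(80, 200)  # low-rank test matrix
V = quic_svd_basis(A, 1e-6, rng)
```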
GPU Implementation of QUIC-SVD
The computation cost lies mostly in constructing the tree.
Most expensive steps:
• Computing vector inner products and row means
• Basis orthogonalization
Parallelize Vector Inner Products and Row Means
Compute all inner products and row means in parallel.
Assuming a large matrix, there are enough rows to engage all GPU threads.
An index array is used to point to the physical rows.
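In NumPy terms, the data-parallel step looks like this (a sketch; on the GPU each row would be handled by its own threads, addressed through the index array):

```python
import numpy as np

A = np.random.randn(10000, 512)      # the matrix (all physical rows)
subset = np.arange(0, 10000, 2)      # index array pointing to physical rows
pivot = A[subset[0]]                 # a pivot row

dots = A[subset] @ pivot             # all inner products with the pivot row:
                                     # one independent dot product per row
row_mean = A[subset].mean(axis=0)    # subset row mean: a parallel reduction
```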
Parallelize Gram-Schmidt Orthogonalization

Classical Gram-Schmidt: assume $v_1, v_2, \ldots, v_n$ are in the current basis set (i.e. they are orthonormal vectors). To add a new vector $r$:

$r \leftarrow r - \mathrm{proj}(r, v_1) - \mathrm{proj}(r, v_2) - \cdots - \mathrm{proj}(r, v_n)$, where $\mathrm{proj}(r, v) = (r \cdot v)\, v$, then $v_{n+1} = r / \|r\|$.

This is easy to parallelize (each projection is computed independently), but it has poor numerical stability.
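In matrix form the classical update is a pair of products, which is what makes it parallel (illustrative NumPy):

```python
import numpy as np

V = np.linalg.qr(np.random.randn(512, 32))[0]  # orthonormal v_1 ... v_n as columns
r = np.random.randn(512)

r = r - V @ (V.T @ r)            # subtract all proj(r, v_i) in one shot
v_new = r / np.linalg.norm(r)    # the new basis vector v_{n+1}
```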
Parallelize Gram-Schmidt Orthogonalization

Modified Gram-Schmidt:

$r_1 = r - \mathrm{proj}(r, v_1)$
$r_2 = r_1 - \mathrm{proj}(r_1, v_2)$
$\cdots$
$r_n = r_{n-1} - \mathrm{proj}(r_{n-1}, v_n)$
$v_{n+1} = r_n / \|r_n\|$

This has good numerical stability, but is hard to parallelize, as the computation is sequential!
Parallelize Gram-Schmidt Orthogonalization

Our approach: Blocked Gram-Schmidt. With $n$ basis vectors $u_1, \ldots, u_n$ grouped into blocks of size $m$:

$v^{(1)} = v - \mathrm{proj}(v, u_1) - \mathrm{proj}(v, u_2) - \cdots - \mathrm{proj}(v, u_m)$
$v^{(2)} = v^{(1)} - \mathrm{proj}(v^{(1)}, u_{m+1}) - \cdots - \mathrm{proj}(v^{(1)}, u_{2m})$
$\cdots$
$v^{(n/m)} = v^{(n/m-1)} - \mathrm{proj}(v^{(n/m-1)}, u_{n-m+1}) - \cdots - \mathrm{proj}(v^{(n/m-1)}, u_n)$

The projections within each block are computed in parallel (as in classical Gram-Schmidt), while successive blocks are processed sequentially (as in modified Gram-Schmidt). This hybrid approach combines the advantages of the previous two: it is numerically stable and GPU-friendly.
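A sketch of one blocked step (the block size m is a tuning parameter; within a block the projections are batched into a single product, and blocks are applied in sequence):

```python
import numpy as np

def blocked_gs_step(U, v, m=8):
    """Orthogonalize v against the orthonormal columns of U, m columns at a time."""
    for start in range(0, U.shape[1], m):
        B = U[:, start:start + m]    # one block of m basis vectors
        v = v - B @ (B.T @ v)        # classical-style batched projections
    return v / np.linalg.norm(v)     # blocks processed sequentially, as in MGS

U = np.linalg.qr(np.random.randn(512, 64))[0]  # current orthonormal basis
u_new = blocked_gs_step(U, np.random.randn(512))
```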
Extract the Final Result

Input: matrix $A$, basis set $V$
Output: $U, \Sigma, V$, the best approximate SVD of $A$ within the subspace spanned by $V$.

Algorithm:
• Compute $AV$, then $(AV)^T AV$, and its SVD $U'\Sigma'V'^T = (AV)^T AV$ (this is the SVD of a small $k \times k$ matrix);
• Let $V = V V'$, $\Sigma = (\Sigma')^{1/2}$, $U = (AV)\, V' \Sigma^{-1}$;
• Return $U, \Sigma, V$.
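In NumPy the whole extraction is a few lines (a sketch; assumes V has orthonormal columns and AV has full rank k, so that Sigma is invertible):

```python
import numpy as np

def extract_svd(A, V):
    AV = A @ V                              # m x k projection
    Up, sp, Vpt = np.linalg.svd(AV.T @ AV)  # SVD of the small k x k matrix
    Sigma = np.sqrt(sp)                     # Sigma = (Sigma')^(1/2)
    V_out = V @ Vpt.T                       # V <- V V'
    U_out = AV @ (Vpt.T / Sigma)            # U <- (A V) V' Sigma^(-1)
    return U_out, Sigma, V_out
```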
Partitioned QUIC-SVD
The advantage of the GPU is only apparent for large-scale problems, e.g. matrices larger than 10,000 × 10,000.
For dense matrices, this quickly becomes an out-of-core problem.
We modified our algorithm to use a matrix partitioning scheme, so that each partition fits in GPU memory (sketched below).
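Sketched below with NumPy arrays standing in for host-side partitions: any row-wise computation streams one partition at a time through GPU memory, e.g. computing $AV$ for the extraction step. The function name is hypothetical.

```python
import numpy as np

def partitioned_matmul(blocks, V):
    """Compute A V one row block at a time, so only one partition is resident."""
    return np.vstack([A_i @ V for A_i in blocks])  # A_i: upload, multiply, free

blocks = [np.random.randn(4096, 1024) for _ in range(4)]  # row blocks A_1 ... A_s
V = np.linalg.qr(np.random.randn(1024, 32))[0]
AV = partitioned_matmul(blocks, V)
```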
Partitioned QUIC-SVD
Partition the matrix into fixed-size row blocks $A_1, A_2, \ldots, A_s$.
Partitioned QUIC-SVD
The termination criterion is still guaranteed: with probability at least $1-\delta$, $\|A - A V V^T\|_F^2 \le \epsilon \|A\|_F^2$.
Implementation Details
Main implementation done in CUDA.
Use CULA library to extract the final SVD (this involves processing a 𝑘 × 𝑘 matrix, where 𝑘 ≪ 𝑛).
Selection of the splitting node is done using a priority queue, maintained on the CPU side.
Source code will be made available on our site http://graphics.cs.umass.edu
Performance
Test environment:
• CPU: Intel Core i7 2.66 GHz (8 HTs), 6 GB memory
• GPU: NVIDIA GTX 480
Test data:
• Given a rank 𝑘 and matrix size 𝑛, we generate an 𝑛 × 𝑘 left matrix and a 𝑘 × 𝑛 right matrix, both filled with random numbers; their product gives the test matrix (a sketch follows this slide).
Result accuracy:
• Both CPU and GPU versions use double-precision floats; termination error 𝜖 = 10⁻¹², Monte Carlo estimation 𝛿 = 10⁻¹².
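A sketch of this test-matrix generator:

```python
import numpy as np

def make_test_matrix(n, k, rng):
    L = rng.standard_normal((n, k))  # n x k left factor
    R = rng.standard_normal((k, n))  # k x n right factor
    return L @ R                     # n x n test matrix of rank at most k

A = make_test_matrix(2000, 50, np.random.default_rng(0))
```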
Performance
Comparisons:
1. Matlab's svds;
2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library);
3. GPU-based QUIC-SVD;
4. Tygert SVD (an approximate SVD algorithm based on random projection, implemented in Matlab).
Performance – Running Time
Performance – Speedup Factors
Performance – CPU vs. GPU QUIC-SVD
(charts: running time and speedup)
Performance – GPU QUIC-SVD vs. Tygert SVD
(charts: running time and speedup)
Integration with Manifold Alignment Framework
Conclusion
A fast GPU-based implementation of an approximate SVD algorithm.
Reasonable (up to 6~7×) speedups over an optimized CPU implementation and over Tygert SVD.
Our partitioned version allows processing out-of-core matrices.
Future Work
Support for sparse matrices.
Application in solving large graph Laplacians.
Components from this project can serve as building blocks for other algorithms: random projection trees, diffusion wavelets, etc.
Acknowledgement
Alexander Gray
PPAM reviewers
NSF Grant FODAVA-1025120