Utilizing CUDA for Preconditioned GMRES Solvers
DCABES’09
Shiming Xu 1, Hai Xiang Lin 1, Wei Xue 2, and Ke Wang 3
1 Delft Institute of Applied Mathematics, TU Delft
2 Department of Computer Science & Technology, Tsinghua University
3 Lab. of Parallel Software & Computational Science, Institute of Software, Chinese Academy of Sciences
![Page 2: September 15, 2015 1 Utilizing CUDA for Preconditioned GMRES Solvers DCABES’09 Shiming Xu 1, Hai Xiang Lin 1, Wei Xue 2, and Ke Wang 3 1 Delft Institute](https://reader037.vdocuments.us/reader037/viewer/2022110213/56649e4b5503460f94b3f94c/html5/thumbnails/2.jpg)
Outline
• Introduction to Krylov-subspace linear system solvers & preconditioning techniques
• Introduction to GPGPU & NVIDIA CUDA
• GMRES solver on CUDA
• Approximate inverse preconditioner based on A-biconjugation (AINV) on CUDA
• Experiments & results
• Conclusion
Introduction – Krylov Subspace Solvers
• Iterative linear system solvers[2]
• Krylov subspace-based solvers
• Popular solvers: GMRES, CG, Bi-CG
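The Krylov subspace K_m(A, r) = span{r, Ar, …, A^(m-1)r} underlying these solvers is typically built with the Arnoldi process. A minimal NumPy sketch of that process (illustrative only, not the paper's CUDA implementation; `arnoldi` is a hypothetical helper name):

```python
import numpy as np

def arnoldi(A, b, m):
    """Build an orthonormal basis V of the Krylov subspace
    K_m(A, b) = span{b, Ab, ..., A^(m-1) b} via the Arnoldi process.
    Returns V (n x (m+1)) and upper Hessenberg H ((m+1) x m) with
    A @ V[:, :m] == V @ H."""
    n = b.size
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = A @ V[:, j]             # one matrix-vector product (SpMV) per step
        for i in range(j + 1):      # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```

Note that each iteration costs one matrix-vector product plus orthogonalization, which is why SpMV and orthogonalization are the kernels the paper accelerates.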
![Page 4: September 15, 2015 1 Utilizing CUDA for Preconditioned GMRES Solvers DCABES’09 Shiming Xu 1, Hai Xiang Lin 1, Wei Xue 2, and Ke Wang 3 1 Delft Institute](https://reader037.vdocuments.us/reader037/viewer/2022110213/56649e4b5503460f94b3f94c/html5/thumbnails/4.jpg)
Introduction – Preconditioners (1)
• Iteration count ~ condition of matrix A
• Preconditioners[2,9]:
  • Improve the condition of the ‘actual’ matrix used in the iteration
  • Left & right preconditioning
  • Effective matrix & system: M^-1 A x = M^-1 b (left); A M^-1 u = b with x = M^-1 u (right)
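The effect of preconditioning on the effective matrix can be illustrated with a small NumPy sketch (not from the paper), using an assumed Jacobi preconditioner M = diag(A):

```python
import numpy as np

# Left preconditioning solves  M^-1 A x = M^-1 b;
# right preconditioning solves (A M^-1) u = b, then x = M^-1 u.
# M here is a simple Jacobi (diagonal) preconditioner, used only to
# illustrate how the effective matrix's condition changes.
A = np.array([[1000.0, 1.0],
              [1.0,    1.0]])
b = np.array([1.0, 2.0])
M_inv = np.diag(1.0 / np.diag(A))   # M = diag(A)

A_left = M_inv @ A                  # effective matrix, left preconditioning
A_right = A @ M_inv                 # effective matrix, right preconditioning

print(np.linalg.cond(A))            # badly scaled original system
print(np.linalg.cond(A_left))       # much smaller condition number
```

The preconditioned system has the same solution but a far better-conditioned effective matrix, which is what reduces the iteration count.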
Introduction – Preconditioned GMRES
Introduction – Preconditioners (2)
• Incomplete factorization-based:
  • Incomplete LU/Cholesky factorization[1,2]
  • ILU(0), ILUt, ILUk, ILUtp, etc.
  • Preconditioning: forward/backward elimination
• Approximate inverse-based:
  • A-biconjugation (AINV)[8]
  • Frobenius-norm minimization
  • Preconditioning: matrix-vector product
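The difference in the preconditioning step can be sketched in NumPy (dense for clarity; `ilu_apply` and `ainv_apply` are hypothetical helper names, not the paper's code). Applying an incomplete-factorization preconditioner requires two triangular solves with sequentially dependent steps, while applying an approximate-inverse preconditioner is only matrix-vector products:

```python
import numpy as np

def ilu_apply(L, U, r):
    """Apply an (assumed precomputed) LU-type preconditioner: solve
    L y = r, then U z = y.  Each elimination step depends on the
    previous one, which is hard to parallelize on a GPU."""
    n = r.size
    y = np.zeros(n)
    for i in range(n):                      # forward elimination
        y[i] = (r[i] - L[i, :i] @ y[:i]) / L[i, i]
    z = np.zeros(n)
    for i in range(n - 1, -1, -1):          # backward elimination
        z[i] = (y[i] - U[i, i + 1:] @ z[i + 1:]) / U[i, i]
    return z

def ainv_apply(W, Z, d, r):
    """Apply an AINV-style preconditioner M ~= A^-1 = Z D^-1 W^T:
    two matrix-vector products and a scaling, all data-parallel."""
    return Z @ ((W.T @ r) / d)
```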
A-biconjugate based Preconditioner
Introduction – GPGPU & NVIDIA CUDA
• General-purpose computing on graphics processing units[12]
• NVIDIA CUDA[6]: first (de facto) widely adopted platform for GPGPU
• Characteristics of GPUs:
  • Throughput-oriented architecture, SIMD style
  • High peak FLOPS and memory bandwidth
  • Massively parallel (thousands of concurrent threads)
  • Weak caching/memory model and limited programmability
  • Weak support for branches; no ILP mechanisms
Introduction – GPU/CPU Comparison

                      CPU                           GPU
Sample                Intel i7-920 (Nehalem)        NVIDIA Tesla C1060
Freq.                 2.66 GHz                      1.3 GHz
Peak FLOPS (SP/DP)    85 G / 42.5 G                 624 G / 78 G
Peak bandwidth        ~25 GB/s                      ~100 GB/s
Core configuration    4 physical cores,             30 multiprocessors,
                      8 virtual cores               240 stream processors
Cache system          3-level coherent cache        2-level cache
                      (32 KB ×4, 256 KB ×4, 8 MB)   (24 KB ×10, 256 KB)
SW-managed cache      None                          16 KB ×30
Introduction – NVIDIA CUDA
[Figure: CUDA thread hierarchy and CUDA device abstraction]
Data Formats for Sparse Matrices
• ELLPACK & ELLPACK-based (HYB)[4]
  • Good bandwidth utilization
• CSR/CSC (Compressed Sparse Row/Column)
[Figure: ELLPACK storage layout — non-zero entries padded into a dense array stored column-major, with one thread per matrix row (threads 1–16 shown), so consecutive threads read consecutive memory locations (coalesced loads)]
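The row-per-thread ELLPACK traversal can be sketched in NumPy, with the per-thread work expressed as vectorized operations over all rows at once (a sketch of the storage scheme, not the paper's CUDA kernel):

```python
import numpy as np

def ell_spmv(values, col_idx, x):
    """ELLPACK SpMV: values and col_idx are (n_rows, max_nnz_per_row)
    arrays, with short rows padded by zero values (and any valid column
    index).  On a GPU, one thread handles one row; storing the arrays
    column-major means consecutive threads touch consecutive addresses,
    giving coalesced memory access."""
    n_rows, k = values.shape
    y = np.zeros(n_rows)
    for j in range(k):                      # walk the padded "columns"
        y += values[:, j] * x[col_idx[:, j]]
    return y
```

The padding wastes space when row lengths vary widely, which is why the HYB format[4] stores the overflow of long rows separately.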
GMRES in CUDA – Algorithms
• Orthogonalization[11,13]:
  • Gram-Schmidt (G-S)
  • Modified Gram-Schmidt (modified G-S)
  • Gram-Schmidt with re-orthogonalization
[Figure: the G-S and modified G-S algorithms]
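The trade-off between the two basic schemes can be sketched in NumPy (illustrative, not the paper's CUDA code; `classical_gs` and `modified_gs` are hypothetical helper names):

```python
import numpy as np

def classical_gs(V, w):
    """Classical G-S: all inner products use the ORIGINAL w, so they
    can be computed as one batched matrix-vector product (GPU-friendly),
    but orthogonality degrades faster in floating point."""
    h = V.T @ w            # all projections in one batched operation
    return w - V @ h

def modified_gs(V, w):
    """Modified G-S: each projection uses the UPDATED w, which is more
    stable numerically but serializes the inner products."""
    w = w.copy()
    for i in range(V.shape[1]):
        w -= (V[:, i] @ w) * V[:, i]
    return w
```

Re-orthogonalization (running classical G-S twice) recovers most of the stability of modified G-S while keeping the batched, GPU-friendly structure.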
GMRES in CUDA – Implementation
• Sparse matrix–dense vector products (SpMV)
• Orthogonalization
  • Inner products
  • AXPY operations
• Preconditioner – AINV
  • Close relationship to ILU-based preconditioners
  • High performance and easy parallelization
AINV w/ Predefined Sparsity Pattern
• AINV-0:
  • Wᵀ + Z has the same sparsity pattern as A
  • Similar to ILU(0)
• Preconditioner generation:
  • CSC format for both W and Z
• Preconditioning in GMRES:
  • HYB format
AINV in CUDA
• Parallelization:
  • Inner iterations on lines 4–7 and lines 8–12
• Kernels:
  • Sparse-vector/sparse-vector inner products (lines 5–6)
  • Sparse-vector/sparse-vector updates (lines 9–11)
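The A-biconjugation process[8] that these kernels parallelize can be sketched densely and without dropping in NumPy (in which case it is exact; the AINV preconditioner adds a drop rule to keep W and Z sparse — `ainv_dense` is a hypothetical helper name, not the paper's code):

```python
import numpy as np

def ainv_dense(A):
    """Dense A-biconjugation (Benzi & Tuma style, no dropping): builds
    unit upper triangular W, Z and a diagonal D = diag(d) such that
    W^T A Z = D, so Z D^-1 W^T equals A^-1 exactly.  The inner loops
    over j are the sparse-vector inner products and updates that the
    CUDA kernels implement."""
    n = A.shape[0]
    W = np.eye(n)
    Z = np.eye(n)
    d = np.zeros(n)
    for i in range(n):
        p = A[i, :] @ Z        # p[j] = a_i^T z_j  (row i of A vs columns of Z)
        q = A[:, i] @ W        # q[j] = c_i^T w_j  (column i of A vs columns of W)
        d[i] = p[i]
        for j in range(i + 1, n):      # biconjugate remaining columns
            Z[:, j] -= (p[j] / p[i]) * Z[:, i]
            W[:, j] -= (q[j] / q[i]) * W[:, i]
    return W, Z, d
```

In the sparse setting, p and q become sparse-vector/sparse-vector inner products and the column updates become sparse-vector updates, matching the kernels listed above.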
Experiments – Tests
• GMRES kernels:
  • Krylov subspace generation: SpMV
  • Orthogonalization
• AINV-0 preconditioner generation
• AINV-0 preconditioned GMRES iteration
Experiments – Configurations
Table 1. System configuration

CPU           Intel i7-920 (4-core, 2.66 GHz)
Memory        12 GB (DDR3, 1066 MHz)
GPU           NVIDIA Tesla C1060
GPU memory    4 GB
CUDA version  2.0

Table 2. Test matrices

Matrix   Protein  Cant  WindTunnel  Epidem  Circuit  Petro  OPF   TDS   Cubes  Parabolic
Size     36K      62K   218K        526K    171K     132K   2.1M  25K   101K   526K
NNZ      4.3M     4.0M  11.6M       2.1M    959K     4.2M   8.1M  160K  874K   2.1M
Experiments – SpMV
• 3.7x speedup in SpMV
• Performance factors:
  • Bandwidth utilization
  • Distribution of non-zero element count per row
Experiments – Orthogonalization
• Modified G-S scheme
• Orthogonalization: one vector against 1–64 basis vectors
• Short vectors: CPU faster
• Long vectors: GPU faster
Experiments – AINV-0 Construction
• Average 2x speed-up
• Performance improves with:
  • Lower matrix bandwidth
  • Fewer non-zeros per row
  • Adjacent rows with higher sparsity-pattern similarity
  • Larger matrix size
Experiments – GMRES iterations
• Restarted GMRES(64)
• Components:
  • Orthogonalization (against 1–64 basis vectors)
  • SpMV with A
  • Preconditioning: left, right, & scaling
• ~3x speed-up per iteration
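These components fit together as sketched below in NumPy: a restarted, right-preconditioned GMRES(m) built from SpMV, modified Gram-Schmidt, and preconditioner application (a simplified sketch, not the paper's implementation — it uses right preconditioning only, and solves the small Hessenberg least-squares problem directly rather than with Givens rotations; `gmres_right` and `M_inv` are assumed names):

```python
import numpy as np

def gmres_right(A, b, M_inv, m=20, tol=1e-8, maxrestart=50):
    """Restarted right-preconditioned GMRES(m): solves A M^-1 u = b,
    then recovers x = M^-1 u.  M_inv is a function applying the
    preconditioner to a vector."""
    n = b.size
    x = np.zeros(n)
    for _ in range(maxrestart):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta < tol:
            return x
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        for j in range(m):
            w = A @ M_inv(V[:, j])          # preconditioning + SpMV
            for i in range(j + 1):          # modified Gram-Schmidt
                H[i, j] = V[:, i] @ w
                w -= H[i, j] * V[:, i]
            h = np.linalg.norm(w)
            if h < 1e-14:                   # happy breakdown: solution found
                H = H[: j + 2, : j + 1]
                V = V[:, : j + 2]
                break
            H[j + 1, j] = h
            V[:, j + 1] = w / h
        k = H.shape[1]
        e1 = np.zeros(H.shape[0])
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H, e1, rcond=None)   # small least squares
        x = x + M_inv(V[:, :k] @ y)
    return x
```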
Conclusion
• >3x speed-up for Krylov-subspace method kernels
  • >3.5x speed-up for Krylov subspace generation
  • ~7x speed-up for the orthogonalization process on long vectors
  • 2x speed-up for AINV-0 preconditioner generation
  • ~3x speed-up per GMRES iteration
• Future work:
  • Optimization of both CPU & GPU implementations
  • AINV with dynamic fill-in
References
1. Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
2. Yousef Saad, Iterative Methods for Sparse Linear Systems, 2nd Ed., SIAM, 2003.
3. BLAS – Basic Linear Algebra Subprograms, http://www.netlib.org/blas/.
4. Nathan Bell and Michael Garland, Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC’09.
5. Muthu Manikandan Baskaran and Rajesh Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, Technical Report, 2009.
6. CUDA Zone, http://www.nvidia.com/cuda.
7. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, SC’08.
8. M. Benzi and M. Tuma, A Sparse Approximate Inverse Preconditioner for Nonsymmetric Linear Systems, SIAM J. Sci. Comput. 19 (1998).
9. Michele Benzi, Preconditioning Techniques for Large Linear Systems: A Survey, Journal of Computational Physics 182 (2002), pp. 418–477.
10. Matthias Christen, Olaf Schenk and Helmar Burkhart, General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform, Workshop on GPGPU, 2007.
11. L. Giraud, J. Langou and M. Rozložník, The Loss of Orthogonality in the Gram-Schmidt Orthogonalization Process, Computers & Mathematics with Applications 50 (2005), pp. 1069–1075.
12. GPGPU, http://www.gpgpu.org.
13. W. Hoffmann, Iterative Algorithms for Gram-Schmidt Orthogonalization, Computing 41 (1989), pp. 335–348.
14. Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder, Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, SIGGRAPH 2003.
ANY QUESTIONS?
THANK YOU!