low hanging fruit...write faster code approximate calculations data layout compute flow i/o or...
TRANSCRIPT
![Page 1: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/1.jpg)
![Page 2: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/2.jpg)
Write faster code
• Profile code (80/20 rule)
• I/O or compute bound?
• I/O bound? Process parallel, get SSD
• Low hanging fruit: loop parallel (pthreads, OpenMP)
• Approximate calculations: different quantity, lower precision
• Data layout: refactor
• Compute flow: modify algorithm
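The "profile code (80/20 rule)" step can be sketched with Python's standard-library profiler. This is a minimal illustration, not the workflow used in the talk: `slow_part`, `fast_part`, `workload`, and `profile_top` are hypothetical names invented for the example.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Hypothetical hotspot: a float-heavy loop in pure Python."""
    total = 0.0
    for i in range(n):
        total += (i % 7) ** 0.5
    return total


def fast_part(n):
    """Hypothetical cheap step that is not worth optimizing."""
    return sum(range(n))


def workload():
    """Hypothetical program: one expensive call and one cheap call."""
    return slow_part(200_000) + fast_part(1_000)


def profile_top(func, limit=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


report = profile_top(workload)
print(report)
```

In the printed report, the functions at the top of the cumulative-time column are the candidates for optimization; the rest can usually be left alone.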
![Page 3: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/3.jpg)
![Page 4: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/4.jpg)
// parallel vectors
![Page 5: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/5.jpg)
float metadata[N]; float metadata[N];
![Page 6: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/6.jpg)
Interleaved layout:
x1 y1 f1 … f1 f1 x2 y2 f2 … f2 f2
• Data is not contiguous in memory
• Memory jumps when accessing data lead to slow distance calculations
Parallel vectors:
x1 y1 i2 j2 … xn yn
f1 f1 f1 f1 … f2 f2 f2 f2 … f2 f3 f3 f3
• Data is contiguous
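The difference between the two layouts can be made concrete by measuring the jump, in bytes, between the same field of two consecutive records. The sketch below uses `ctypes` to mimic C structs. The record type `Point`, the value `N = 64`, and `NUM_POINTS` are assumptions made for illustration and do not come from the slides.

```python
import ctypes

N = 64             # hypothetical number of metadata floats per point
NUM_POINTS = 1000  # hypothetical number of points


class Point(ctypes.Structure):
    """Interleaved record: x, y, then N metadata floats."""
    _fields_ = [
        ("x", ctypes.c_float),
        ("y", ctypes.c_float),
        ("metadata", ctypes.c_float * N),
    ]


# Interleaved layout: one array of records.
interleaved = (Point * NUM_POINTS)()

# Parallel vectors: one contiguous array per field.
xs = (ctypes.c_float * NUM_POINTS)()
ys = (ctypes.c_float * NUM_POINTS)()
metadata = (ctypes.c_float * (N * NUM_POINTS))()

# Jump between x of record 0 and x of record 1 in the interleaved layout.
interleaved_jump = (
    ctypes.addressof(interleaved[1]) - ctypes.addressof(interleaved[0])
)

# Jump between xs[0] and xs[1] in the parallel layout is one float.
parallel_jump = ctypes.sizeof(ctypes.c_float)

print("interleaved jump (bytes):", interleaved_jump)
print("parallel jump (bytes):", parallel_jump)
```

With these assumed sizes, walking over all x values touches a new record every `ctypes.sizeof(Point)` bytes in the interleaved layout, while the parallel layout reads adjacent floats.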
![Page 7: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/7.jpg)
Data layout matters
[Plot: slowdown factor (1 to 2.8) vs. jump size in bytes (0 to 8192)]
![Page 8: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/8.jpg)
![Page 9: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/9.jpg)
Language choice
• prototyping
• shipping
• readability
• time
• existing software
• memory
• power
• speed
• security
• hardware
• dependencies
![Page 10: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/10.jpg)
A simple benchmark
An algorithm that is
• well-understood
• not domain-specific
• computationally intensive
Computing the Cholesky decomposition of A,
$A = L L^T$,
simplifies the process of solving Ax = b
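To show why the factorization helps, here is a minimal pure-Python sketch: once $A = L L^T$ is known, $Ax = b$ reduces to a forward substitution ($Ly = b$) followed by a back substitution ($L^T x = y$). The function names `cholesky` and `solve` are illustrative and this is a textbook formulation, not necessarily the code benchmarked in the talk.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T.

    a is a list of lists and must be symmetric positive definite.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve(a, b):
    """Solve a x = b using the Cholesky factor of a."""
    l = cholesky(a)
    n = len(b)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
b = [-20.0, -43.0, 192.0]
print(cholesky(A))
print(solve(A, b))
```

Both substitutions cost O(n²), so once L is available each additional right-hand side b is cheap compared with the O(n³) factorization.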
![Page 11: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/11.jpg)
Python
[Diagram: speed vs. effort for Python options on CPU and GPU, from "any algorithm" to "standard algorithms": Python, Cython, NumPy/SciPy, Numba, BLAS/LAPACK, PyCUDA/Scikit-CUDA, Numba-CUDA]
![Page 12: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/12.jpg)
Python Cholesky implementations
[Plot: execution time (ms, log-scale ticks from 0.0001 to 10000000) vs. matrix size (32 to 4096); series: Python, NumPy, Numba, np.linalg, sp.linalg, sp.linalg.lapack, skcuda]
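A timing harness for the plain "Python" series of such a benchmark might look like the sketch below. It is an assumption about methodology, not the harness used for the slide: `make_spd`, `cholesky`, and `benchmark` are illustrative names, the test matrices are random symmetric positive definite matrices built as $B B^T + nI$, and the best of a few repeats is reported in milliseconds.

```python
import math
import random
import time


def make_spd(n, seed=0):
    """Build a random symmetric positive definite n x n matrix as B B^T + n I."""
    rng = random.Random(seed)
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    a = [[sum(b[i][k] * b[j][k] for k in range(n)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        a[i][i] += n
    return a


def cholesky(a):
    """Pure-Python Cholesky factorization, returning lower-triangular L."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def benchmark(sizes, repeats=3):
    """Return {matrix size: best execution time in ms} for cholesky."""
    results = {}
    for n in sizes:
        a = make_spd(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            cholesky(a)
            best = min(best, time.perf_counter() - start)
        results[n] = best * 1000.0
    return results


for size, ms in benchmark([32, 64]).items():
    print(f"n={size}: {ms:.3f} ms")
```

Matrix construction is kept outside the timed region so that only the factorization is measured; the sizes here are kept small because the pure-Python version scales as O(n³).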
![Page 13: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/13.jpg)
C++
[Diagram: speed vs. effort for C++ options on CPU and GPU, from "any algorithm" to "standard algorithms": C++, compiler options, SIMD, BLAS/LAPACK, CUDA, CUBLAS/CUDNN]
![Page 14: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/14.jpg)
C++ Cholesky implementations ($M = L L^T$)
[Plot: execution time (ms, log-scale ticks from 0.01 to 100000) vs. matrix size (64 to 4096); series: CPP-O3, CPP-Fast-Math, BLAS (n=1), AVX]
![Page 15: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/15.jpg)
C++ Cholesky implementations ($M = L L^T$)
[Plot: execution time (ms, log-scale ticks from 0.01 to 100000) vs. matrix size (64 to 4096); series: CPP-O3, CPP-Fast-Math, BLAS (n=1), AVX, Eigen, LAPACK (n=1), LAPACK, CUDA]
![Page 16: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/16.jpg)
Using CUDA from Python vs C++
[Plot: execution time (ms, log scale, 0.1 to 1000) vs. matrix size (4 to 4096); series: CUDA, CUDA-compute, skcuda, skcuda-compute]
![Page 17: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/17.jpg)
![Page 18: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/18.jpg)
Using domain knowledge
SIMD implementation
• 4x less storage
• 8-12x faster feature computation
• 64x faster feature matching
Image courtesy of scikit-cuda docs
![Page 19: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/19.jpg)
C++ optimization cycle
• Find hotspots (80/20 rule)
• Select candidates for optimization
• Eigen, BLAS, LAPACK, CUDA
• OpenMP
• Loop unrolling (code bloat)
• Correct instructions: AVX/SSE/Arm NEON
• Domain knowledge, approximations
![Page 20: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/20.jpg)
![Page 21: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/21.jpg)
End of general purpose H/W
Images from company web pages/press releases
![Page 22: Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel](https://reader034.vdocuments.us/reader034/viewer/2022042210/5eaef67ebcfcc334203b32ce/html5/thumbnails/22.jpg)
Thank you for listening