Exploiting CUDA Dynamic Parallelism for Low Power ARM Based Prototypes
TRANSCRIPT
Source: on-demand.gputechconf.com/gtc/2015/presentation/S5384-Vishal-Mehta.pdf

Page 1
www.bsc.es
Exploiting CUDA Dynamic Parallelism for low power ARM based prototypes
Vishal Mehta
Engineer, Barcelona Supercomputing Center
Page 2
BSC/UPC CUDA Centre of Excellence (CCOE)
Training
• Build an education program on parallel programming using CUDA, OpenCL and
OmpSs
• PUMPS summer school 2010-2015, courses at BSC and UPC
Research
• Generation, Simulation and Rendering of Large Varied Animated Crowds; attendees can
see a presentation using OmpSs at the current GTC
• HERTA Security GPU-based machine learning for real-time face
recognition, and bio-Marketing, also presented at this GTC.
• Exploring the potential of low-power GPU clusters as high-performance
platforms involved in Mont-Blanc and PRACE prototypes
Page 3
Top500 Power Consumption Evolution
Higher performance, at the expense of higher power
[Chart: power consumption (MW) of Top500 systems from 2008 to 2013, for the TOP10, TOP50 and TOP500; growth of x3.25, x3.13 and x5.04 over 5 years.]
Page 4
Mont-Blanc Project
European approach for energy efficient HPC systems. http://www.montblanc-project.eu
Partners:
Objectives:
• To develop a full energy-efficient HPC prototype using low-power commercially available embedded technology.
• To develop a portfolio of exascale applications to be run on this new generation of HPC systems.
• To explore different alternatives for the compute node (from low-power mobile sockets to special-purpose high-end ARM chips), and their implications on the rest of the system.
Page 5
Euroserver Project
http://www.euroserver-project.eu
European approach for energy efficient data servers.
Partners:
Objectives:
• Reduced energy consumption by: (i) using ARM (64-bit) cores, (ii) drastically reducing the core-to-memory distance, (iii) improving on the "energy proportionality".
• Reduced cost to build and operate each microserver, through: (i) improved manufacturing yield, (ii) reduced physical volume of the packaged interposer module, (iii) an energy-efficient semiconductor process (FDSOI).
Page 6
Mont-Blanc Prototype Ecosystem
Page 7
Outline
1. Pedraforca Prototype Architecture
2. Evaluation application
3. Exploiting Dynamic Parallelism
4. Some benchmarks and results
5. Limitations & Conclusions
Page 8
Pedraforca : Prototype Node Architecture
E4 ARKA single node desktop unit
Page 9
Pedraforca: Cluster
3 ⨉ bullx 1200 rack
78 compute nodes
2 login nodes
4 36-port InfiniBand switches (MPI)
2 50-port GbE switches (storage)
Page 10
Low power ARM
Comparing Power Budgets

ARM based system (Tegra 3, quad core ARM A9 @ 1.3 GHz, Mini-ITX carrier):

| Component | Max power usage (W) |
|-----------|---------------------|
| Tesla K20 | 235 |
| Board | 25 |
| CPU | 5 |
| Total | 265 |

x86_64 based system (quad core Intel i5-3570K @ 3.4 GHz, ASUS P8Z77 V-pro):

| Component | Max power usage (W) |
|-----------|---------------------|
| Tesla K20 | 235 |
| Board | 80 |
| CPU | 90 |
| Total | 405 |
Page 11
Outline
1. Pedraforca Prototype Architecture
2. Evaluation application
3. Exploiting Dynamic Parallelism
4. Some benchmarks and results
5. Limitations & Conclusions
Page 12
Thick restarted Lanczos Algorithm in Lattice QCD
• Each point on the lattice is an SU(3) vector and the links connecting points are SU(3) matrices.
• Using the thick restarted Lanczos algorithm for generating eigenpairs of the lattice.
• 80% cuBLAS routines.
• Average number of cuBLAS calls: 60,000-90,000 depending on lattice configuration.
• Process lattice from multiple time steps in parallel.

SU(3) vector (complex double); SU(3) 3x3 matrix (complex double).
At time 't'

Page 13
Evaluation Example – Lanczos Iteration
• Large number of BLAS operations
• Dominated by the global orthogonalization module, which includes BLAS
• Implemented using cuBLAS; highly modularized and easy to use
• Iterations are not independent of each other
Iteration flow (global orthogonalization, repeated for N iterations):

1. Initial vector (V0)
2. Apply matrix: Vi = A(Vi-1)
3. Compute alpha: αi = dot(Vi, Vi-1)
4. AXPY kernel: Vi = Vi - αi Vi-1 - βi-1 Vi-2
5. Compute beta: βi = Euclidean norm(Vi)
6. New subspace vector: Vi = Vi / βi
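The steps above map almost directly onto cuBLAS level-1 routines. A minimal host-side sketch of one iteration, assuming a `cublasHandle_t` already created with `cublasCreate`; the `Applymatrix` body, variable names and launch configuration are illustrative stand-ins, not the presenter's actual code:

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>

// Illustrative stand-in: the real Applymatrix applies the QCD lattice operator A.
__global__ void Applymatrix(const cuDoubleComplex *in, cuDoubleComplex *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                       // placeholder for out = A * in
}

// One Lanczos step: Vi = A Vi-1; alpha = dot(Vi, Vi-1);
// Vi -= alpha Vi-1 + beta_prev Vi-2; beta = ||Vi||; Vi /= beta.
void lanczos_step(cublasHandle_t h, int n,
                  cuDoubleComplex *v_im2, cuDoubleComplex *v_im1,
                  cuDoubleComplex *v_i, double beta_prev, double *beta_i)
{
    Applymatrix<<<(n + 255) / 256, 256>>>(v_im1, v_i, n);

    cuDoubleComplex alpha;
    cublasZdotc(h, n, v_i, 1, v_im1, 1, &alpha);     // alpha_i

    cuDoubleComplex neg_alpha = make_cuDoubleComplex(-cuCreal(alpha), -cuCimag(alpha));
    cublasZaxpy(h, n, &neg_alpha, v_im1, 1, v_i, 1); // Vi -= alpha_i * Vi-1

    cuDoubleComplex neg_beta = make_cuDoubleComplex(-beta_prev, 0.0);
    cublasZaxpy(h, n, &neg_beta, v_im2, 1, v_i, 1);  // Vi -= beta_{i-1} * Vi-2

    cublasDznrm2(h, n, v_i, 1, beta_i);              // beta_i

    double inv_beta = 1.0 / *beta_i;
    cublasZdscal(h, n, &inv_beta, v_i, 1);           // Vi /= beta_i
}
```

Every call in this loop is a separate kernel launch issued by the CPU, which is exactly why launch overhead dominates on a slow host processor.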
Page 14
Bottlenecks
Algorithm Implementation for the Prototype

• Large number of calls to cuBLAS.
• Overall algorithm is serial.
• Dominated by the CPU's capability of launching cuBLAS kernels.
• The ARM processor is not fast enough to quickly launch kernels on the GPU; the GPU is underutilized.

[Diagram: CPU pipeline vs. GPU pipeline. The CPU works as coordinator (Start, Apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel, End, with a serial dependency between steps), while the GPU acts as a slave that executes the kernels.]
Page 15
Outline
1. Pedraforca Prototype Architecture
2. Evaluation application
3. Exploiting Dynamic Parallelism
4. Some benchmarks and results
5. Limitations & Conclusions
Page 16
Exploiting Dynamic Parallelism
The reason for dynamic parallelism is to make the GPU adapt to data.
Can we exploit it further to solve bottlenecks and save power?
Page 17
Approach for Exploiting Dynamic Parallelism for Low Power Prototype
Before: the CPU works as coordinator and the GPU is a slave that executes kernels. The CPU pipeline runs Start, Apply matrix, cuBLAS dot kernel, cuBLAS AXPY kernel, End, with a serial dependency between steps.

After: the CPU only starts and ends a wrapper kernel with 1 control thread; the GPU wrapper coordinates the tasks, launching Apply matrix, the cuBLAS dot kernel and the cuBLAS AXPY kernel from the device.
Page 18
Example code 1 - Simple Wrapper

Original code:

    __global__ void Applymatrix(..,..);

    int main()
    {
        copytoGPU();
        Applymatrix<<<...,...>>>();
        cublasZdot();
        cublasZAXPY();
        copyfromGPU();
    }

Code with wrapper:

    __global__ void Applymatrix(..,..);

    __global__ void wrapper(..,..)
    {
        Applymatrix<<<...,...>>>();
        cublasZdot();
        cublasZAXPY();
    }

    int main()
    {
        copytoGPU();
        wrapper<<<1,1>>>();
        copyfromGPU();
    }
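A compilable sketch of the same restructuring, with plain child kernels standing in for the cuBLAS calls (the device-side cuBLAS API needs the separate cublas_device library); all names and the launch configuration are illustrative. Dynamic parallelism requires compute capability 3.5+ and relocatable device code:

```cuda
// Compile with: nvcc -arch=sm_35 -rdc=true simple_wrapper.cu -lcudadevrt
#include <cstdio>

__global__ void Applymatrix(double *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0;                    // stand-in for the lattice operator
}

__global__ void axpyKernel(double a, const double *x, double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];               // stand-in for cublasZAXPY
}

// One control thread: child launches issued from the same thread into the
// default device stream run in launch order, so the serial dependency is
// preserved without the CPU issuing each kernel.
__global__ void wrapper(double *x, double *y, int n)
{
    Applymatrix<<<(n + 255) / 256, 256>>>(x, n);
    axpyKernel<<<(n + 255) / 256, 256>>>(0.5, x, y, n);
}

int main()
{
    const int n = 1 << 20;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    // ... copytoGPU() in the slide: fill x and y ...
    wrapper<<<1, 1>>>(x, y, n);                // the CPU issues a single launch
    cudaDeviceSynchronize();
    // ... copyfromGPU() ...
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```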
Page 19
Multiple Threads in Wrapper

The wrapper can be executed with more than one thread to process multiple instances: Wrapper<<<1,2>>>()

[Diagram: a GPU wrapper with 2 CUDA threads, each launching its own Apply matrix, cuBLAS dot kernel and cuBLAS AXPY kernel.]

PROBLEM: threads in the same block launch kernels one after another, so multiple instances are not executed simultaneously.
Page 20
Bottleneck caused by multiple threads in wrapper

[Diagram: with 2 CUDA threads in the wrapper, the two instances of Apply matrix, cuBLAS dot kernel and cuBLAS AXPY kernel are serialized. OUR GOAL is to run the two instances side by side.]

SOLUTION: CUDA streams created on the GPU side.
Page 21
Solution for processing multiple instances by CUDA streams

Each wrapper thread creates its own non-blocking stream on the device, so the child kernels of different threads can run concurrently.

Modification to code:

    __global__ void wrapper(..,..)
    {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        cublasSetStream(...., stream);
        Applymatrix<<<..., ..., 0, stream>>>();
        cublasZdot();
        cublasZAXPY();
        cudaStreamDestroy(stream);
    }
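A self-contained sketch of the per-thread stream pattern with a plain child kernel in place of the cuBLAS calls; kernel names and sizes are illustrative:

```cuda
// Compile with: nvcc -arch=sm_35 -rdc=true stream_wrapper.cu -lcudadevrt

__global__ void Applymatrix(double *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0;                    // stand-in for the lattice operator
}

// Each control thread processes one lattice instance in its own
// non-blocking device-side stream, so the instances can overlap.
__global__ void wrapper(double *data, int n, int instances)
{
    int inst = threadIdx.x;
    if (inst >= instances) return;

    // Streams created in device code must be non-blocking.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    double *v = data + inst * n;               // this thread's instance
    Applymatrix<<<(n + 255) / 256, 256, 0, stream>>>(v, n);
    // ... the dot and AXPY kernels would be enqueued on the same stream ...

    cudaStreamDestroy(stream);                 // released once its work completes
}

// Host side, two instances as on the slide:
//   wrapper<<<1, 2>>>(data, n, 2);
```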
Page 22
Outline
1. Pedraforca Prototype Architecture
2. Evaluation application
3. Exploiting Dynamic Parallelism
4. Some benchmarks and results
5. Limitations & Conclusions
Page 23
cuBLAS kernel launch scaling

| No. of kernel calls | cuBLAS calls by CPU (seconds) | cuBLAS calls by GPU thread (seconds) | Speed up |
|---|---|---|---|
| 1 x 10³ | 1.72 | 1.43 | 1.20x |
| 3 x 10³ | 2.23 | 1.62 | 1.37x |
| 5 x 10³ | 4.7 | 2.9 | 1.62x |
| 10 x 10³ | 7.52 | 3.5 | 2.14x |
| 50 x 10³ | 11.78 | 4.2 | 2.80x |

Mix of cuBLAS level 1 routines: 40% reduction kernel, 30% AXPY kernel, 30% dot product.

[Chart: speed up vs. no. of cuBLAS calls, rising from 1.20x to 2.80x.]
Page 24
Application Performance (High Frequency CPU)
Quad core Intel i5-3570K @ 3.4 GHz

Execution time (seconds) per lattice size:

| Configuration | 24 | 32 | 48 |
|---|---|---|---|
| Kernel calls by CPU | 4.4 | 6.4 | 11.2 |
| Kernel calls by CPU (with streams) | 2.3 | 4.1 | 7.5 |
| Kernel calls by GPU | 5.2 | 7.6 | 12.8 |
| Kernel calls by GPU (with streams) | 2.8 | 5.2 | 8.7 |

Code with wrapper may be slower on a system with a fast CPU.
Page 25
Application Performance (Pedraforca Prototype)
Tegra 3 - quad core ARM A9 @ 1.3 GHz

Execution time (seconds) per lattice size:

| Configuration | 24 | 32 | 48 |
|---|---|---|---|
| Kernel calls by CPU | 13.6 | 20.4 | 36.4 |
| Kernel calls by CPU (with streams) | 15.2 | 23.5 | 40.6 |
| Kernel calls by GPU | 5.3 | 7.5 | 13.1 |
| Kernel calls by GPU (with streams) | 2.7 | 5.2 | 9.0 |

Code with wrapper kernel performs better on the ARM based system.
Page 26
Comparing systems

System A: quad core Intel i5-3570K @ 3.4 GHz + Tesla K20
System B: quad core ARM Cortex-A9 (Tegra 3) @ 1.3 GHz + Tesla K20
Page 27
Comparing power footprint - Without CUDA streams

| QCD lattice size | Execution time (s), A | Execution time (s), B | Average power (W), A | Average power (W), B | Energy (J), A | Energy (J), B |
|---|---|---|---|---|---|---|
| 24 | 4.4 | 5.3 | 367 | 245 | 1614.8 | 1298.5 |
| 32 | 6.4 | 7.5 | 359 | 246 | 2297.6 | 1845 |
| 48 | 11.2 | 13.1 | 365 | 243 | 4088 | 3183.3 |

A: all kernels launched by CPU (quad core Intel i5-3570K @ 3.4 GHz). B: all kernels launched by GPU (Tegra 3, quad core ARM A9 @ 1.3 GHz).

[Chart: energy savings (%) vs. lattice size 24/32/48, in the 16-24% range.]
Page 28
Comparing power footprint - With CUDA streams

| QCD lattice size | Execution time (s), A | Execution time (s), B | Average power (W), A | Average power (W), B | Energy (J), A | Energy (J), B |
|---|---|---|---|---|---|---|
| 24 | 2.3 | 2.7 | 420 | 286 | 966 | 772.2 |
| 32 | 4.1 | 5.2 | 426 | 287 | 1746.6 | 1392.4 |
| 48 | 7.5 | 9.0 | 425 | 282 | 3187.5 | 2538 |

A: all kernels launched by CPU (quad core Intel i5-3570K @ 3.4 GHz). B: all kernels launched by GPU (Tegra 3, quad core ARM A9 @ 1.3 GHz).

[Chart: energy savings (%) vs. lattice size 24/32/48, in the 16-24% range.]
Page 29
Scaling across Cluster

State-of-the-art technologies like GPU Direct and CUDA-aware MPI can significantly improve data transfers among multiple nodes.

The wrapper kernel ensures the low frequency CPU has sufficient time for communication.

[Diagrams: "Without GPU Direct" vs. "With GPU Direct". In both, the CPU pipeline runs from Start to End while the GPU pipeline carries some dynamic CUDA processing load; with GPU Direct, I/O with the network card goes directly to GPU memory.]
Page 30
Pedraforca limitations

• 32 bit SoC
• Driver support for 32 bit

GOOD NEWS!! 64 bit SoC, up to 4 GB support.
Page 31
Conclusions

• With CUDA dynamic parallelism and CUDA streams in action, we are able to save roughly 20% of power on the Pedraforca prototype.
• CUDA Dynamic Parallelism helps reduce GPU-CPU communication, hence a faster CPU is not always necessary.
• More libraries supporting dynamic parallelism have to be developed.
• Embedding ARM cores inside a big accelerator like Tesla could be promising.