TRANSCRIPT
The Rocky Road To Tasking
March 21, 2019 Ivo Kabadshow, Laura Morgenstern Jülich Supercomputing Centre
Member of the Helmholtz Association
HPC ≠ HPC
[Figure: timescale from ns to h. CPU cycles sit at ns, network latency at μs; application domains are placed by their critical walltime: high-frequency trading, MD, game dev, deep learning, astrophysics.]
Requirements for MD:
Strong scalability
Performance portability
Member of the Helmholtz Association March 21, 2019 Slide 1
Our Motivation: Solving the Coulomb Problem for Molecular Dynamics
Task: compute all pairwise interactions of N particles
N-body problem: O(N²) → O(N) with the fast multipole method (FMM)
Why is that an issue?
MD targets < 1 ms runtime per time step
MD runs millions or billions of time steps
not compute-bound, but synchronization-bound
no libraries (like BLAS) to do the heavy lifting
We might have to look under the hood ... and get our hands dirty.
Parallelization Potential
[Chart: algorithmic complexity (high → low) vs. parallelization (easy → hard); the classical O(N²) approach sits at high complexity but easy parallelization.]
Classical Approach
Lots of independent parallelism
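The independence in the classical approach can be made concrete with a small sketch. This is not code from the talk: a direct O(N²) Coulomb potential sum whose outer loop iterations touch disjoint outputs, so the i-range can simply be split across threads (`Particle` and `all_pairs_potential` are illustrative names).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

struct Particle { double x, y, z, q; };

// All-pairs Coulomb potential: every iteration of the outer i-loop is
// independent (each writes only phi[i]), so the range is split statically.
std::vector<double> all_pairs_potential(const std::vector<Particle>& p,
                                        unsigned num_threads) {
    std::vector<double> phi(p.size(), 0.0);
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            for (std::size_t j = 0; j < p.size(); ++j) {
                if (i == j) continue;
                const double dx = p[i].x - p[j].x;
                const double dy = p[i].y - p[j].y;
                const double dz = p[i].z - p[j].z;
                phi[i] += p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
    };
    std::vector<std::thread> threads;
    const std::size_t chunk = (p.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(p.size(), begin + chunk);
        if (begin < end) threads.emplace_back(worker, begin, end);
    }
    for (auto& th : threads) th.join();
    return phi;
}
```

No inter-thread communication is needed until the very end, which is exactly the "lots of independent parallelism" the chart places on the easy side.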
Parallelization Potential
[Chart: same axes; the FMM at O(N) reaches low algorithmic complexity but is hard to parallelize, in contrast to the easy classical O(N²) approach.]
Fast Multipole Method (FMM)
Many dependent phases
Varying amount of parallelism
Coarse-Grained Parallelization
[Diagram: Input → P2M → M2M → M2L → L2L → L2P → P2P → Output, with synchronization points between the phases.]
Different amounts of available loop-level parallelism within each phase
Some phases contain sub-dependencies
Synchronizations might be problematic
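A coarse-grained phase pipeline like this maps naturally onto fork-join execution. A minimal sketch, not from the talk (`run_phases` is an illustrative name): each phase distributes its loop iterations over the threads, and the join at the end of a phase is one of the synchronization points above.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Fork-join execution of a phase pipeline: within a phase, iterations run
// in parallel; between phases, all threads are joined (a full barrier).
void run_phases(
    const std::vector<std::pair<std::string, std::function<void(int)>>>& phases,
    int iterations, unsigned num_threads) {
    for (const auto& phase : phases) {
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < num_threads; ++t)
            pool.emplace_back([&, t] {
                // round-robin distribution of the independent loop iterations
                for (int i = static_cast<int>(t); i < iterations;
                     i += static_cast<int>(num_threads))
                    phase.second(i);
            });
        for (auto& th : pool)
            th.join(); // synchronization point: next phase may depend on this one
    }
}
```

The downside is visible in the code: even if only a few iterations of a phase remain, every thread waits at the join before the next phase may start.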
FMM Algorithmic Flow: Multipole to Multipole (M2M), shifting multipoles upwards
[Figure: tree levels d = 0…4; the multipole moments ω of the boxes on each level are summed into their parent boxes, level by level, up to the root.]
Dataflow – Fine-grained Dependencies
[Diagram: p2m → m2m → m2l → l2l → l2p; each m2m task consumes the ω of its child boxes and produces the ω of the parent box.]
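The upward M2M pass can be sketched on a binary tree (the real method uses an octree and a multipole shift operator; plain addition stands in for the shift here, and `m2m_upward` is an illustrative name, not from the talk):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Upward M2M pass, tree stored level by level: level d holds 2^d boxes,
// and each parent's omega is the (shift-simplified-to-add) sum of its
// two children. Parents on one level depend only on their own children,
// so all tasks within a level are independent; levels depend on each other.
std::vector<std::vector<double>>
m2m_upward(std::vector<double> leaves, int depth) {
    std::vector<std::vector<double>> omega(depth + 1);
    omega[depth] = std::move(leaves); // leaves.size() must be 2^depth
    for (int d = depth; d > 0; --d) {
        omega[d - 1].resize(omega[d].size() / 2);
        for (std::size_t i = 0; i < omega[d - 1].size(); ++i)
            omega[d - 1][i] = omega[d][2 * i] + omega[d][2 * i + 1];
    }
    return omega;
}
```

The dependency structure is the point: a tasking runtime may start an m2m task as soon as its two (or, in 3D, eight) child boxes are done, instead of waiting for the whole level.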
FMM Algorithmic Flow: Multipole to Local (M2L), translating remote multipoles into local Taylor moments
[Figure: tree levels d = 0…4; well-separated remote multipoles ω are accumulated into the local Taylor moments μ of each box.]
Dataflow – Fine-grained Dependencies
[Diagram: p2m → m2m → m2l → l2l → l2p; each m2l task consumes remote ω and accumulates into local μ.]
FMM Algorithmic Flow: Local to Local (L2L), shifting Taylor moments downwards
[Figure: tree levels d = 0…4; the Taylor moments μ of each parent box are shifted down into its child boxes, level by level.]
Dataflow – Fine-grained Dependencies
[Diagram: p2m → m2m → m2l → l2l → l2p; each l2l task consumes the parent's μ and produces the children's μ.]
CPU Tasking Framework
[Architecture diagram: a ThreadingWrapper maps one Thread to each Core; every thread runs a Scheduler with Queues, a Dispatcher, a TaskFactory and a LoadBalancer.]
CPU Tasking Framework: Task Life-Cycle per Thread
[Diagram: the Dispatcher pulls a ready task from the Queues and executes it; new tasks produced during execution go through the TaskFactory and the LoadBalancer back into the Queues.]
Tasks can be prioritized by task type
Only ready-to-execute tasks are stored in the queue
Workstealing from other threads is possible
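The per-thread queues plus workstealing can be sketched as follows. This is not the framework's code: `StealingPool` and its members are illustrative, the queues hold only ready tasks, and mutual exclusion is a plain mutex per queue. The sketch assumes all tasks are enqueued before `run()`; keeping workers alive for dynamically produced tasks would need a termination protocol on top.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class StealingPool {
public:
    explicit StealingPool(unsigned n) : queues_(n) {
        for (unsigned i = 0; i < n; ++i)
            locks_.push_back(std::make_unique<std::mutex>());
    }
    // Only ready-to-execute tasks are pushed; tid picks the owning queue.
    void push(unsigned tid, std::function<void()> task) {
        std::lock_guard<std::mutex> g(*locks_[tid]);
        queues_[tid].push_back(std::move(task));
    }
    void run() {
        std::vector<std::thread> workers;
        for (unsigned tid = 0; tid < queues_.size(); ++tid)
            workers.emplace_back([this, tid] {
                std::function<void()> task;
                while (pop_or_steal(tid, task)) task();
            });
        for (auto& w : workers) w.join();
    }
private:
    // Try the own queue first, then steal: scan every other queue once.
    bool pop_or_steal(unsigned tid, std::function<void()>& task) {
        for (unsigned k = 0; k < queues_.size(); ++k) {
            const unsigned q = (tid + k) % queues_.size();
            std::lock_guard<std::mutex> g(*locks_[q]);
            if (!queues_[q].empty()) {
                task = std::move(queues_[q].front());
                queues_[q].pop_front();
                return true;
            }
        }
        return false; // all queues empty: this worker retires
    }
    std::vector<std::deque<std::function<void()>>> queues_;
    std::vector<std::unique_ptr<std::mutex>> locks_;
};
```

Even if all tasks land in one thread's queue, the other workers drain it via stealing; that is the behaviour the following two measurement slides compare.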
Tasking Without Workstealing: 103 680 Particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down by phase: P2P, P2M, M2M, M2L, L2L, L2P.]
Tasking With Workstealing: 103 680 Particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down by phase: P2P, P2M, M2M, M2L, L2L, L2P.]
GPU Tasking: Goal
Provide the same features as CPU tasking:
Static and dynamic load balancing
Priority queues
Ready-to-execute tasks
GPU Tasking: Uniform Programming Model for CPUs and GPUs
Pitfalls: Performance Portability
Diverse GPU programming approaches:
OpenCL
CUDA
SYCL
Our requirements:
Strong subset of C++11
Portability between GPU vendors
Tasking features
Maturity
Unsatisfying (Intermediate) Solution
Use CUDA for reasons of performance, specific tasking features and maturity. Accept the loss of not being portable out of the box.
Pitfalls: Architectural Differences
Pitfalls for load balancing:
No thread pinning
No cache coherency
Pitfalls for mutual exclusion:
Weak memory consistency
Missing forward-progress guarantees
Pitfalls: Load Balancing
No possibility to pin threads to streaming multiprocessors
No direct access to the shared memory of other streaming multiprocessors
Work stealing requires multi-producer multi-consumer queues → which mechanism for mutual exclusion?
Pitfalls: Mutual Exclusion
Weak memory consistency
Warp-synchronous deadlocks due to lock-step execution
How to prove thread safety?
Pitfalls: Mutex Implementation

class Mutex
{
public:
    __inline__ __device__ void lock()
    {
        // spin until the flag flips from 0 (free) to 1 (taken)
        while (atomicCAS(&mutex, 0, 1) != 0)
            __threadfence();
    }

    __inline__ __device__ void unlock()
    {
        __threadfence(); // make the critical section's writes visible first
        atomicExch(&mutex, 0);
    }

private:
    int mutex = 0;
};
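For readers without a GPU at hand, the same pattern can be written as a host-side analogue, which is not from the talk: `atomicCAS` maps to `compare_exchange`, and the role of `__threadfence` is played by the acquire/release ordering of the atomics (`SpinMutex` is an illustrative name).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// CPU analogue of the CUDA spin mutex above.
class SpinMutex {
public:
    void lock() {
        int expected = 0;
        // spin until we swap 0 -> 1; acquire pairs with the release in unlock()
        while (!state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire)) {
            expected = 0; // compare_exchange overwrote it with the seen value
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }
private:
    std::atomic<int> state_{0};
};
```

The warp-synchronous pitfall has no CPU counterpart: on a CPU, a spinning thread cannot starve the lock holder, whereas pre-Volta warps execute in lock step, so naive per-thread locking within a warp may deadlock.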
Very First Evaluation: Conditions
Tasking with a global queue only
Measurements without workload, to determine the enqueue and dequeue overhead
Measurements on P100 with 56 thread blocks of 1024 threads each
Measurements on V100 with 80 thread blocks of 1024 threads each
First Evaluation: Tasking Overhead on P100 and V100
[Log-log plot: runtime in ms (10⁻¹ to 10⁵) over number of tasks (10⁰ to 10⁶), for P100 and V100.]
GPU Tasking: Conclusion
Fine-grained task parallelism pays off on CPUs
Developed a mapping between CPU and GPU concepts
(Partly) overcame the pitfalls:
Lock-based mutual exclusion
Reusability of the CPU tasking code
Architectural differences between CPU and GPU
Successfully transferred parts of the CPU tasking to GPUs
Next Steps
Analyze and solve performance issues in dependency resolution
Use a memory pool for dynamic allocations
Implement hierarchical queues
Transfer the priority queue to the GPU
Exploit data parallelism through warps
Consider the use of lock-free data structures
Implement the FMM based on GPU tasking
Thank You to Our Sponsor!
The NVIDIA Tesla V100 and NVIDIA Tesla P100 were provided by NVIDIA.