TRANSCRIPT
The Rocky Road To Tasking
March 21, 2019 Ivo Kabadshow, Laura Morgenstern Jülich Supercomputing Centre
Member of the Helmholtz Association
HPC ≠ HPC
[Figure: timescale from ns to h. CPU cycles sit at ns, network latency at μs; application domains are placed by their critical walltime: high-frequency trading, MD, game dev, deep learning, astrophysics.]
Requirements for MD:
Strong scalability
Performance portability
Member of the Helmholtz Association March 21, 2019 Slide 1
Our Motivation: Solving the Coulomb Problem for Molecular Dynamics
Task: compute all pairwise interactions of N particles
N-body problem: O(N²) → O(N) with the fast multipole method (FMM)
Why is that an issue?
MD targets < 1 ms runtime per time step
MD runs millions or billions of time steps
not compute-bound, but synchronization-bound
no libraries (like BLAS) to do the heavy lifting
We might have to look under the hood ... and get our hands dirty.
Parallelization Potential
[Chart: algorithmic complexity (high → low) vs. parallelization (easy → hard); the classical O(N²) approach sits at high complexity but easy parallelization.]
Classical Approach
Lots of independent parallelism
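The independence in the classical approach can be made concrete with a small sketch. This is not code from the talk: a direct O(N²) Coulomb potential sum whose outer loop iterations touch disjoint outputs, so the i-range can simply be split across threads (`Particle` and `all_pairs_potential` are illustrative names).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

struct Particle { double x, y, z, q; };

// All-pairs Coulomb potential: every iteration of the outer i-loop is
// independent (each writes only phi[i]), so the range is split statically.
std::vector<double> all_pairs_potential(const std::vector<Particle>& p,
                                        unsigned num_threads) {
    std::vector<double> phi(p.size(), 0.0);
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            for (std::size_t j = 0; j < p.size(); ++j) {
                if (i == j) continue;
                const double dx = p[i].x - p[j].x;
                const double dy = p[i].y - p[j].y;
                const double dz = p[i].z - p[j].z;
                phi[i] += p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
    };
    std::vector<std::thread> threads;
    const std::size_t chunk = (p.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(p.size(), begin + chunk);
        if (begin < end) threads.emplace_back(worker, begin, end);
    }
    for (auto& th : threads) th.join();
    return phi;
}
```

No inter-thread communication is needed until the very end, which is exactly the "lots of independent parallelism" the chart places on the easy side.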
Parallelization Potential
[Chart: same axes; the FMM at O(N) reaches low algorithmic complexity but is hard to parallelize, in contrast to the easy classical O(N²) approach.]
Fast Multipole Method (FMM)
Many dependent phases
Varying amount of parallelism
Coarse-Grained Parallelization
[Diagram: Input → P2M → M2M → M2L → L2L → L2P → P2P → Output, with synchronization points between the phases.]
Different amounts of available loop-level parallelism within each phase
Some phases contain sub-dependencies
Synchronizations might be problematic
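A coarse-grained phase pipeline like this maps naturally onto fork-join execution. A minimal sketch, not from the talk (`run_phases` is an illustrative name): each phase distributes its loop iterations over the threads, and the join at the end of a phase is one of the synchronization points above.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Fork-join execution of a phase pipeline: within a phase, iterations run
// in parallel; between phases, all threads are joined (a full barrier).
void run_phases(
    const std::vector<std::pair<std::string, std::function<void(int)>>>& phases,
    int iterations, unsigned num_threads) {
    for (const auto& phase : phases) {
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < num_threads; ++t)
            pool.emplace_back([&, t] {
                // round-robin distribution of the independent loop iterations
                for (int i = static_cast<int>(t); i < iterations;
                     i += static_cast<int>(num_threads))
                    phase.second(i);
            });
        for (auto& th : pool)
            th.join(); // synchronization point: next phase may depend on this one
    }
}
```

The downside is visible in the code: even if only a few iterations of a phase remain, every thread waits at the join before the next phase may start.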
FMM Algorithmic Flow: Multipole to Multipole (M2M), shifting multipoles upwards
[Figure: tree levels d = 0…4; the multipole moments ω of the boxes on each level are summed into their parent boxes, level by level, up to the root.]
Dataflow – Fine-grained Dependencies
[Diagram: p2m → m2m → m2l → l2l → l2p; each m2m task consumes the ω of its child boxes and produces the ω of the parent box.]
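The upward M2M pass can be sketched on a binary tree (the real method uses an octree and a multipole shift operator; plain addition stands in for the shift here, and `m2m_upward` is an illustrative name, not from the talk):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Upward M2M pass, tree stored level by level: level d holds 2^d boxes,
// and each parent's omega is the (shift-simplified-to-add) sum of its
// two children. Parents on one level depend only on their own children,
// so all tasks within a level are independent; levels depend on each other.
std::vector<std::vector<double>>
m2m_upward(std::vector<double> leaves, int depth) {
    std::vector<std::vector<double>> omega(depth + 1);
    omega[depth] = std::move(leaves); // leaves.size() must be 2^depth
    for (int d = depth; d > 0; --d) {
        omega[d - 1].resize(omega[d].size() / 2);
        for (std::size_t i = 0; i < omega[d - 1].size(); ++i)
            omega[d - 1][i] = omega[d][2 * i] + omega[d][2 * i + 1];
    }
    return omega;
}
```

The dependency structure is the point: a tasking runtime may start an m2m task as soon as its two (or, in 3D, eight) child boxes are done, instead of waiting for the whole level.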
FMM Algorithmic Flow: Multipole to Local (M2L), translating remote multipoles into local Taylor moments
[Figure: tree levels d = 0…4; well-separated remote multipoles ω are accumulated into the local Taylor moments μ of each box.]
Dataflow – Fine-grained Dependencies
[Diagram: p2m → m2m → m2l → l2l → l2p; each m2l task consumes remote ω and accumulates into local μ.]
FMM Algorithmic Flow: Local to Local (L2L), shifting Taylor moments downwards
[Figure: tree levels d = 0…4; the Taylor moments μ of each parent box are shifted down into its child boxes, level by level.]
Dataflow – Fine-grained Dependencies
[Diagram: p2m → m2m → m2l → l2l → l2p; each l2l task consumes the parent's μ and produces the children's μ.]
CPU Tasking Framework
[Architecture diagram: a ThreadingWrapper maps one Thread to each Core; every thread runs a Scheduler with Queues, a Dispatcher, a TaskFactory and a LoadBalancer.]
CPU Tasking Framework: Task Life-Cycle per Thread
[Diagram: the Dispatcher pulls a ready task from the Queues and executes it; new tasks produced during execution go through the TaskFactory and the LoadBalancer back into the Queues.]
Tasks can be prioritized by task type
Only ready-to-execute tasks are stored in the queue
Workstealing from other threads is possible
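The per-thread queues plus workstealing can be sketched as follows. This is not the framework's code: `StealingPool` and its members are illustrative, the queues hold only ready tasks, and mutual exclusion is a plain mutex per queue. The sketch assumes all tasks are enqueued before `run()`; keeping workers alive for dynamically produced tasks would need a termination protocol on top.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class StealingPool {
public:
    explicit StealingPool(unsigned n) : queues_(n) {
        for (unsigned i = 0; i < n; ++i)
            locks_.push_back(std::make_unique<std::mutex>());
    }
    // Only ready-to-execute tasks are pushed; tid picks the owning queue.
    void push(unsigned tid, std::function<void()> task) {
        std::lock_guard<std::mutex> g(*locks_[tid]);
        queues_[tid].push_back(std::move(task));
    }
    void run() {
        std::vector<std::thread> workers;
        for (unsigned tid = 0; tid < queues_.size(); ++tid)
            workers.emplace_back([this, tid] {
                std::function<void()> task;
                while (pop_or_steal(tid, task)) task();
            });
        for (auto& w : workers) w.join();
    }
private:
    // Try the own queue first, then steal: scan every other queue once.
    bool pop_or_steal(unsigned tid, std::function<void()>& task) {
        for (unsigned k = 0; k < queues_.size(); ++k) {
            const unsigned q = (tid + k) % queues_.size();
            std::lock_guard<std::mutex> g(*locks_[q]);
            if (!queues_[q].empty()) {
                task = std::move(queues_[q].front());
                queues_[q].pop_front();
                return true;
            }
        }
        return false; // all queues empty: this worker retires
    }
    std::vector<std::deque<std::function<void()>>> queues_;
    std::vector<std::unique_ptr<std::mutex>> locks_;
};
```

Even if all tasks land in one thread's queue, the other workers drain it via stealing; that is the behaviour the following two measurement slides compare.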
Tasking Without Workstealing: 103 680 Particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down by phase: P2P, P2M, M2M, M2L, L2L, L2P.]
Tasking With Workstealing: 103 680 Particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)
[Plot: number of active threads (0–24) over runtime (0–0.8 s), broken down by phase: P2P, P2M, M2M, M2L, L2L, L2P.]
GPU Tasking: Goal
Provide the same features as CPU tasking:
Static and dynamic load balancing
Priority queues
Ready-to-execute tasks
GPU Tasking: Uniform Programming Model for CPUs and GPUs
Pitfalls: Performance Portability
Diverse GPU programming approaches:
OpenCL
CUDA
SYCL
Our requirements:
Strong subset of C++11
Portability between GPU vendors
Tasking features
Maturity
Unsatisfying (Intermediate) Solution
Use CUDA for reasons of performance, specific tasking features and maturity. Accept the loss of not being portable out of the box.
Pitfalls: Architectural Differences
Pitfalls for load balancing:
No thread pinning
No cache coherency
Pitfalls for mutual exclusion:
Weak memory consistency
Missing forward-progress guarantees
Pitfalls: Load Balancing
No possibility to pin threads to streaming multiprocessors
No direct access to the shared memory of other streaming multiprocessors
Work stealing requires multi-producer multi-consumer queues → which mechanism for mutual exclusion?
Pitfalls: Mutual Exclusion
Weak memory consistency
Warp-synchronous deadlocks due to lock-step execution
How to prove thread safety?
Pitfalls: Mutex Implementation

class Mutex
{
public:
    __inline__ __device__ void lock()
    {
        // spin until the flag flips from 0 (free) to 1 (taken)
        while (atomicCAS(&mutex, 0, 1) != 0)
            __threadfence();
    }

    __inline__ __device__ void unlock()
    {
        __threadfence(); // make the critical section's writes visible first
        atomicExch(&mutex, 0);
    }

private:
    int mutex = 0;
};
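For readers without a GPU at hand, the same pattern can be written as a host-side analogue, which is not from the talk: `atomicCAS` maps to `compare_exchange`, and the role of `__threadfence` is played by the acquire/release ordering of the atomics (`SpinMutex` is an illustrative name).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// CPU analogue of the CUDA spin mutex above.
class SpinMutex {
public:
    void lock() {
        int expected = 0;
        // spin until we swap 0 -> 1; acquire pairs with the release in unlock()
        while (!state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire)) {
            expected = 0; // compare_exchange overwrote it with the seen value
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }
private:
    std::atomic<int> state_{0};
};
```

The warp-synchronous pitfall has no CPU counterpart: on a CPU, a spinning thread cannot starve the lock holder, whereas pre-Volta warps execute in lock step, so naive per-thread locking within a warp may deadlock.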
Very First Evaluation: Conditions
Tasking with a global queue only
Measurements without workload, to determine the enqueue and dequeue overhead
Measurements on P100 with 56 thread blocks of 1024 threads each
Measurements on V100 with 80 thread blocks of 1024 threads each
First Evaluation: Tasking Overhead on P100 and V100
[Log-log plot: runtime in ms (10⁻¹ to 10⁵) over number of tasks (10⁰ to 10⁶), for P100 and V100.]
GPU Tasking: Conclusion
Fine-grained task parallelism pays off on CPUs
Developed a mapping between CPU and GPU concepts
(Partly) overcame the pitfalls:
Lock-based mutual exclusion
Reusability of the CPU tasking code
Architectural differences between CPU and GPU
Successfully transferred parts of the CPU tasking to GPUs
Next Steps
Analyze and solve performance issues in dependency resolution
Use a memory pool for dynamic allocations
Implement hierarchical queues
Transfer the priority queue to the GPU
Exploit data parallelism through warps
Consider the use of lock-free data structures
Implement the FMM based on GPU tasking
Thank You to Our Sponsor!
The NVIDIA Tesla V100 and NVIDIA Tesla P100 were provided by NVIDIA.