From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions
Denver, Colorado
November 2019
University of Stuttgart, IPVS, SSE
Authors
• Gregor Daiß
• John Biddiscombe
• Parsa Amini
• Patrick Diehl
• Juhan Frank
• Kevin Huck
• Hartmut Kaiser
• Dominic Marcello
• David Pfander
• Dirk Pflüger
From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions Gregor Daiß, John Biddiscombe 1 / 23
Motivation
• Simulation of binary star systems and their mergers
• Octo-Tiger models these star systems using self-gravitating fluids on an AMR grid using HPX
• Large-scale runs on Piz Daint use up to 768 million cells.

Contributions:
• Significant speedup replacing MPI with Libfabric without changing any application code
• Integrating small GPU kernels efficiently into an asynchronous many-task runtime system
Table of Contents
1 Octo-Tiger in a Nutshell
2 HPX and the Libfabric Parcelport in a Nutshell
3 Asynchronous Many Tasks with GPUs
4 Results
Octo-Tiger in a Nutshell
Octo-Tiger simulates self-gravitating fluids on an AMR grid
Gravity Solver
• Uses the Fast Multipole Method (FMM)
• Has to be solved every timestep
• Is the more compute-intensive part

Hydro Solver
• Navier-Stokes equations
• Uses finite volumes
(Figure: the adaptive oct-tree is partitioned into subgrids distributed across compute nodes, with one HPX locality per node: Node 1 = Locality 1, Node 2 = Locality 2, ...)
HPX and the Libfabric Parcelport in a Nutshell
A Distributed Task-Based Runtime
(Figure: localities 0 to N exchange work through actions and async calls within an Active Global Address Space; locality still matters. Components live on localities, a unified C++ syntax covers local and remote operations, and asynchronous tasks synchronize through futures.)
Standards Driven C++ Tasking for Parallelism and Concurrency
Futures for Synchronization
• Continuation Passing Style (CPS) preferred
• Functional approach to programming
• Task synchronization is also data driven

Runtime
• Lightweight threads
• Suspend on get(), resume when ready
• Work stealing when the current task is done or suspended

AGAS
• Manages a handle to a component
• Forwards work to the locality holding the data
Building DAGs from Futures
(Figure: three continuation patterns. f1.then(...) runs Task 2 after Task 1; when_xxx(f1, f2) joins Task 1 and Task 2 before Task 3; shared.then(...) fans one shared future out to N dependent tasks.)
Remote Actions use Active Messages
AMR refinement and redistribution:
• Moving a subgrid from one node to another = calling a (copy) constructor on a remote node with the contents of this subgrid as parameter(s)

Halo exchange:
• Execute a put on the remote node, with a data buffer as parameter
• Execute a get on the remote node, with a (local) buffer address as parameter
Active Messages
Syntax
• Instead of the more traditional
  MPI_Isend(buffer, count, datatype, dest_rank, tag, comm, request)
  HPX messages take the form of a remote function invocation:
  future = hpx::async(dest_locality, function, arg1, arg2...)
  where any C++ data args can be sent (vector/set/list/map/custom)

Implementation
• Data is passed as arguments to a remote function (or object::function)
• Remote function parameters are serialized into a parcel, consisting of:
  - a function identifier (including the object if complex, like a grid::node)
  - a list of parameters

Channels
• HPX uses a Channel abstraction to simplify send/recv for halo regions
Existing MPI-Based Parcelport Implementation
A parcel is represented by a 'chunk list' + data block
• If params are small (eager protocol):
  - index chunk (size/offset), copy into parcel buffer
• For large params (rendezvous):
  - pointer chunk, separate sends
• Message handling of parcels is currently sub-optimal; one-sided put/get can/should be used for rendezvous items

(Figure: the parcel wire format holds a header, eager data, and a typed chunk list. The sender sends the parcel; the receiver decodes it and, for large data, posts receives and acknowledges before the sender sends the data.)
Libfabric as an Alternative to MPI
Downsides of MPI
• MPI_Put/Get is not an asynchronous API
• Copies completions to MPI_Request handles
• Memory management is less flexible

Benefits of libfabric
• API is asynchronous (incl. Put/Get; enqueue many)
• Maps driver/GNI completion queues (without copy)
• Robustly threadsafe
• Vectorized sends: fi_sendv
• Flexible memory pinning

(Figure: MPI and libfabric both sit on the GNI kernel/user interface. MPI exposes MPI_Requests, communicators, memory windows, and epochs; libfabric exposes completions, endpoints, and memory pinning that map directly onto HPX futures.)
Fine Tuning of RDMA-Based Parcelport
Impedance match between HPX and the libfabric API
• Identical asynchronous GNI/driver-level completions for send/recv/get/(put)
• Trigger futures directly from the completion handler

Memory Management
• C++ allocator for pinned memory blocks
• Flow control: we explicitly manage queues (= buffers)
• fi_sendv allows reduced memory copies
• Multi-parcels when send buffers are filling up (fi_sendv)
• RDMA<T> types integrated into our parcelport (channels ongoing)

Threading
• Robust threadsafe libfabric API
• FI_CONTEXT allows us to be 100% lock free in our HPX layer
• Map completions directly to objects (cf. communicators)
Performance/Integration of Libfabric in the Runtime
(Figure: in thread pool 1, each core running Octo-Tiger alternates between polling for and handling completions and executing tasks from its queues; an optional thread pool 2 polls and handles completions even when its own task queues are empty.)

• Every core can poll for completion events during background processing
• Polling can be moved to another thread pool, with or without tasks
• Every microsecond saved in polling/handling = 1 MFLOP on a 1 TFLOP GPU
Asynchronous Many Tasks with GPUs
Example FMM Kernels from Octo-Tiger
• Calculation of the gravity interactions between neighboring cells on the same oct-tree level
• Stencil code
• (3D) Stencil has 1074 elements
• Stencil gets applied for all the 512 cells per subgrid
Running the CUDA Kernel
(Figure: a CPU thread enqueues into a CUDA stream: 1. async memcpy host-to-device, 2. the CUDA kernel, 3. async memcpy device-to-host. When to sync the results?)

Goals:
• Interleave GPU kernels with arbitrary CPU kernels and communication
• Non-blocking synchronization
Integrating with HPX
(Figure: an HPX task enqueues 1. async memcpy HtoD, 2. the CUDA kernel, and 3. async memcpy DtoH into a CUDA stream, then 4. requests a future from the HPX scheduler, which 4.5 inserts a callback into the stream; while the task waits in fut.get(), an arbitrary HPX task runs instead.)

Solution:
• Use HPX tasks instead
• Insert a callback into the CUDA stream
• The scheduler can return a future that becomes ready once this callback gets executed
• The HPX task gets suspended
• The HPX thread can work on other tasks/communication
• The task will be resumed once the GPU kernel has finished
Filling the GPU?
• One kernel calculates 512 * 1074 cell interactions
• Depending on the type, 12 to 455 floating point operations
• Still not enough work to utilize even one GPU
• Leverage CUDA streams for implicit work aggregation
• Avoid on-the-fly allocations
Filling the GPU!
(Figure: HPX tasks hand kernels to a lock-free, thread-local launcher; each launch slot (e.g. slots 1 to 4) owns pinned host memory, a CUDA stream, and a CUDA buffer on the GPU, with a CPU kernel fallback when every slot is busy.)

Solution:
• Launch many small kernels in different streams
• One launcher for each HPX thread
• One kernel launch per slot
• If all slots are busy, execute the kernel on the CPU
• Most kernels are executed on the GPU (99.5%)
• Arbitrary number of slots on multiple GPUs
Asynchronous Many Tasks with GPUs
Advantages:
• Optimized GPU launches via CUDA streams, HPX futures, and a non-blocking launcher
• Overlapping of CPU and GPU tasks, and of computation and communication
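The overlap of computation and communication via futures can be illustrated with a small sketch; the task bodies here are plain `std::async` placeholders, not Octo-Tiger code:

```cpp
#include <chrono>
#include <future>
#include <thread>

// Kick off "communication" and a "GPU task" asynchronously, do CPU work
// in the meantime, then synchronize all three paths via futures.
int overlapped_step() {
    auto comm = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // fake network exchange
        return 1;
    });
    auto gpu = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // fake kernel
        return 2;
    });
    int cpu = 3;  // CPU work proceeds while the tasks above run
    return comm.get() + gpu.get() + cpu;  // futures join the three paths
}
```

In HPX the same pattern composes with `future`s for networking and CUDA alike, so waiting never blocks a worker thread.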
Results
FMM Node-Level Results on Piz Daint
• Ported the most important FMM kernels to the GPU
• Scenario: V1309 contact binary merger
• Setup: single node, 12 HPX threads, 128 launch slots (CUDA streams)
• 10928 sub-grids, resulting in 5595136 cells
| Utilized Hardware | FMM runtime | GFLOP/s | Fraction of peak | Total scenario runtime (FMM + Hydro + Others) |
|---|---|---|---|---|
| One Piz Daint node: Intel Xeon E5-2690v3 | 980 s | 157 GFLOP/s | 31% [CPU] | 2415 s |
| with 1x NVIDIA P100 (PCI-E) | 158 s | 973 GFLOP/s | 21% [GPU] | 1592 s |
Distributed Results
[Plot: speedup w.r.t. sub-grids on one node vs. number of nodes (2 to 8192), log-log axes; refinement levels 14–17, each shown for the MPI and Libfabric parcelports]
• The red lines show results using HPX's MPI parcelport, the blue lines results using HPX's Libfabric parcelport
• The number of sub-grids ranges from 10928 (level 14) to 1.5 million (level 17), depending on the refinement level
• Achieved 68.1% weak-scaling efficiency with 2048 nodes on Piz Daint on level 17
• At 4096 nodes, Libfabric yields a 2.7x speedup over MPI
Summary
• The HPX programming model exposes easy synchronization via futures for:
  • Networking
  • GPU / CUDA
  • CPU
• Reduced overhead to maximize throughput
Thank you for your attention!
Distributed Results
[Plot: ratio of processed sub-grids per second, Libfabric vs. MPI parcelport, over the number of nodes (2 to 8192); refinement levels 14–16]
• Ratio of processed sub-grids per second between HPX's Libfabric and MPI parcelports on Piz Daint
• The switch to Libfabric did not require any changes within Octo-Tiger