StarPU: a runtime system for Scheduling Tasks,
or How to get portable performance on
accelerator-based platforms without the agonizing pain
NVIDIA GPU Technology Conference – San Jose (USA) – September 2010
Cédric Augonnet Samuel Thibault Raymond Namyst
INRIA Bordeaux, LaBRI, University of Bordeaux
Once upon a time in computer architecture ...
Introduction: Programming Accelerator-based machines
• Prehistory (<2007)
  • SMP machines (1960s!)
  • NUMA architectures (1970s)
  • Vector machines (1980s-90s)
  • Multicore chips (2000s)
• GPGPU for the masses? (from 2007)
  • CPUs are now deprecated?
  • Rewrite all codes for accelerators
  • Pure offloading model
Introduction: Programming Accelerator-based machines
• Pure offloading model
  • CUDA 1.x (2007 – 2008)
    – Synchronous cards
  • Ignore CPUs
  • Concentrate on efficient kernels
    – Complex memory accesses
    – CUDA heroes (e.g. V. Volkov)
  • Port standard libraries to CUDA
    – CUBLAS
    – CUFFT
    – GPUCV...
[Figure: CPU/GPU timeline. Pre-process input, upload input, compute on GPU, download result, post-process.]
Introduction: Programming Accelerator-based machines
• Multi-GPU era
  • CUDA 2.x (2008 – 2009)
    – Asynchronous transfers
    – S1070 servers
  • Still ignore CPUs (in general)
  • Suitable for regular applications
    – Massively parallel problems
    – Use previously written kernels
  • New problems
    – Parallel programming for real
    – PCI bus = bottleneck
    – Pre/Post-processing is costly
[Figure: one CPU and several GPUs. The CPU distributes work to the GPUs and gathers the results.]
Introduction: Programming Accelerator-based machines
• GPU computing era
  • CUDA 3.x (2009 – ?)
    – End of GPGPU
    – Hybrid machines
  • Tightly coupled CPUs and GPUs
    – Take advantage of all resources
    – MUCH more complicated
  • Load balancing
    – Who does what?
    – Heterogeneous capabilities
  • Data management
    – Numerous data transfers
    – Fully asynchronous model
[Figure: hybrid machine made of multicore CPUs and GPUs, each with its own memory (M.).]
Introduction: Challenging issues at all stages
• Applications
  • Programming paradigm
  • BLAS kernels, FFT, …
• Compilers
  • Languages
  • Code generation/optimization
• Runtime systems
  • Resources management
  • Task scheduling
• Architecture
  • Memory interconnect
[Figure: software stack. HPC applications; compiling environment and specific libraries; runtime system; operating system; hardware.]
[Figure: the same software stack, now annotated with "Expressive interface" and "Execution feedback" between the layers.]
Outline
• The StarPU runtime system
• Task Scheduling
  • Load balancing
  • Improving data locality
• Evaluation on dense linear algebra algorithms
  • Synthetic “LU” decomposition
  • Mixing PLASMA and MAGMA (Cholesky & QR)
• Scheduling parallel tasks
• Adding support for MPI in StarPU
The StarPU runtime system
The StarPU runtime system: The need for runtime systems
• “do dynamically what can’t be done statically anymore”
• Library that provides
  • Task scheduling
  • Memory management
• Compilers and libraries generate (graphs of) parallel tasks
  • Additional information is welcome!
[Figure: software stack. HPC applications; parallel compilers and parallel libraries; runtime system; operating system; CPU, GPU, …]
The StarPU runtime system: Data management library
• StarPU provides a Virtual Shared Memory (VSM) subsystem
  • Weak consistency
  • Replication
  • Single writer
  • High level API
    – Partitioning filters
• Input & output of tasks = reference to VSM data
[Figure: software stack. HPC applications; parallel compilers and parallel libraries; StarPU; drivers (CUDA, OpenCL); CPU, GPU, …]
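The replication and single-writer properties listed for the Virtual Shared Memory can be illustrated with a toy model. This is a minimal sketch and not StarPU's implementation or API: the class `ReplicatedData`, its `acquire` method, the node names and the invalidate-on-write policy are all assumptions made for illustration.

```python
class ReplicatedData:
    """Toy model of one piece of data replicated across memory nodes.

    Hypothetical illustration only: names and policy are assumptions,
    not taken from StarPU.
    """

    def __init__(self, home_node):
        self.valid = {home_node}  # nodes holding an up-to-date replica
        self.transfers = 0        # number of copies performed so far

    def acquire(self, node, mode):
        """Make the data usable on `node` in mode 'R', 'W' or 'RW'."""
        if mode not in ("R", "W", "RW"):
            raise ValueError("mode must be 'R', 'W' or 'RW'")
        if "R" in mode and node not in self.valid:
            # Replication: copy from any node that holds a valid replica.
            self.transfers += 1
            self.valid.add(node)
        if "W" in mode:
            # Single writer: every other replica becomes stale.
            self.valid = {node}


a = ReplicatedData("ram")
a.acquire("gpu0", "R")    # first read on the GPU triggers one copy
a.acquire("gpu0", "R")    # already valid there, no new copy
a.acquire("gpu0", "RW")   # writing invalidates the replica in RAM
a.acquire("ram", "R")     # RAM must fetch the updated data again
```

In this sketch a task that reads the same data twice on one node pays for a single copy, which is the effect the replication bullet is after.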
The StarPU runtime system: Task scheduling
• Tasks =
  • Data input & output
    – Reference to VSM data
  • Multiple implementations
    – E.g. CUDA + CPU implementation
  • Dependencies with other tasks
  • Scheduling hints
• StarPU provides an Open Scheduling platform
  • Scheduling algorithm = plug-ins
[Figure: a task f with cpu, gpu and spu implementations, accessing data A (read-write), B (read) and C (read).]
The StarPU runtime system: Task scheduling
• Who generates the code?
  • StarPU task = ~function pointers
  • StarPU does not generate code
• Programming heroes?
• Libraries era
  • PLASMA + MAGMA
  • FFTW + CUFFT...
• Rely on compilers
  • PGI accelerators
  • CAPS HMPP...
The StarPU runtime system: Execution model
[Figure, built up over several slides: the application runs on top of StarPU, which contains a scheduling engine, a memory management layer (DSM) and one driver per processing unit (CPU driver #k for CPU #k, GPU driver for the GPU). Data A and B start in RAM.]
• Submit task « A += B »
• Schedule task
• Fetch data (replicates of A and B appear in GPU memory)
• Offload computation
Task Scheduling
Why do we need task scheduling? Blocked matrix multiplication
[Figure: execution on 2 Xeon cores, a Quadro FX5800 and a Quadro FX4600.]
Things can go (really) wrong even on trivial problems!
• Static mapping?
  – Not portable, too hard for real-life problems
• Need Dynamic Task Scheduling
  – Performance models
Predicting task duration: Load balancing
[Figure: Gantt chart over time for cpu #1, cpu #2, cpu #3, gpu #1 and gpu #2.]
• Task completion time estimation
  • History-based
  • User-defined cost function
  • Parametric cost model
• Can be used to improve scheduling
  • E.g. Heterogeneous Earliest Finish Time
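How a completion time estimate feeds a scheduling decision can be sketched as follows. This is a simplified greedy variant of the earliest-finish-time idea for independent tasks (no dependency graph, no task ranking), not StarPU's scheduler: the function name, the worker names and the cost table are hypothetical.

```python
def schedule_earliest_finish(tasks, workers, predict):
    """Assign each task to the worker expected to finish it first.

    `predict(task, worker)` returns the estimated duration of `task`
    on `worker` (from a history, a cost function, ...).
    Returns the placements and the resulting makespan.
    """
    ready_at = {w: 0.0 for w in workers}  # time at which each worker is free
    placements = []
    for task in tasks:
        best = min(workers, key=lambda w: ready_at[w] + predict(task, w))
        start = ready_at[best]
        ready_at[best] = start + predict(task, best)
        placements.append((task, best, start, ready_at[best]))
    return placements, max(ready_at.values())


# Assumed durations (seconds) of one kernel on each kind of worker.
COST = {"gpu": 1.0, "cpu": 4.5}
placements, makespan = schedule_earliest_finish(
    tasks=["t%d" % i for i in range(6)],
    workers=["gpu", "cpu"],
    predict=lambda task, worker: COST[worker],
)
```

With these assumed costs the slow CPU still receives one task, because waiting for the busy GPU would finish later than running it on the idle CPU.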
Predicting data transfer overhead: Motivations
• Hybrid platforms
  • Multicore CPUs and GPUs
  • PCI-e bus is a precious resource
• Data locality vs. load balancing
  • Cannot avoid all data transfers
  • Minimize them
• StarPU keeps track of
  • data replicates
  • on-going data movements
[Figure: hybrid machine (CPUs and GPUs with their memories, M.) holding replicates of data A and B.]
Predicting data transfer overhead: Offline bus benchmarking
• Offline bus benchmarking
  • When StarPU is launched for the first time
  • Measure bandwidth and latency
    – Stored as files
  • Loaded when StarPU is initialized
• Detect CPU/GPU affinity
  • Control a GPU from the closest CPU
  • Significant impact on bus usage
• Straightforward cost prediction
  • Latency + size / bandwidth
  • Could be improved in many ways
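The cost prediction above amounts to a linear model: a fixed latency plus the data size divided by the measured bandwidth. A minimal sketch, assuming made-up calibration values for one RAM to GPU link; the function names and the `LINK` numbers are hypothetical and do not come from StarPU's calibration files.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Predicted time to move `size_bytes` over a link."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Assumed benchmark result for one link: 10 microseconds, 5 GB/s.
LINK = {"latency_s": 10e-6, "bandwidth_bytes_per_s": 5e9}


def predicted_cost(compute_s, missing_bytes, link=LINK):
    """Compute time plus the time to fetch data not yet on the worker."""
    if missing_bytes == 0:
        return compute_s
    return compute_s + transfer_time(missing_bytes, **link)


# 5 MB over the assumed link: 10 us of latency plus 1 ms of streaming.
t = transfer_time(5_000_000, **LINK)
```

A scheduler can add this term to the predicted task duration, so that a worker already holding the data is preferred when the compute times are close.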
Impact of scheduling policy on a synthetic LU decomposition
(without pivoting !)
Scheduling in a hybrid environment: Performance models
• LU without pivoting (16GB input matrix)
• 8 CPUs (Nehalem) + 3 GPUs (FX5800)
[Figure: two bar charts comparing the greedy, task model, prefetch and data model policies. Left: speed (GFlops, axis from 0 to 800). Right: transfers (GB, axis from 0 to 60).]
Mixing PLASMA and MAGMA with StarPU
(in collaboration with UTK)
Cholesky & QR decompositions
Mixing PLASMA and MAGMA with StarPU
• State of the art algorithms
  • PLASMA (Multicore CPUs)
    – Dynamically scheduled with QUARK
  • MAGMA (Multiple GPUs)
    – Hand-coded data transfers
    – Static task mapping
• General SPLAGMA design
  • Use PLASMA algorithm with « magnum tiles »
  • PLASMA kernels on CPUs, MAGMA kernels on GPUs
  • Bypass the QUARK scheduler
• Programmability
  • Cholesky: ~half a week
  • QR: ~2 days of work
  • Quick algorithmic prototyping
Mixing PLASMA and MAGMA with StarPU
• Cholesky decomposition
  • 5 CPUs (Nehalem) + 3 GPUs (FX5800)
  • Efficiency > 100%
Mixing PLASMA and MAGMA with StarPU
• Memory transfers during Cholesky decomposition
  • ~2.5x fewer transfers
Mixing PLASMA and MAGMA with StarPU
• QR decomposition
  • Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)
[Figure: QR performance, with MAGMA shown for comparison. Annotations: "+12 CPUs: ~200 GFlops" and "Peak, 12 cores: ~150 GFlops".]
Mixing PLASMA and MAGMA with StarPU
• « Super-Linear » efficiency in QR?
  • Kernel efficiency
    – sgeqrt: CPU 9 GFlops, GPU 30 GFlops (speedup: ~3)
    – stsqrt: CPU 12 GFlops, GPU 37 GFlops (speedup: ~3)
    – somqr: CPU 8.5 GFlops, GPU 227 GFlops (speedup: ~27)
    – Sssmqr: CPU 10 GFlops, GPU 285 GFlops (speedup: ~28)
  • Task distribution observed on StarPU
    – sgeqrt: 20% of tasks on GPUs
    – Sssmqr: 92.5% of tasks on GPUs
• Taking advantage of heterogeneity!
  – Only do what you are good for
  – Don't do what you are not good for
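The speedups quoted for the four QR kernels follow directly from the per-kernel rates, and ranking the kernels by speedup shows which ones are worth keeping on the GPUs. The sketch below only redoes that arithmetic with the figures from the slide; the dictionary layout and function names are illustrative assumptions.

```python
# GFlop/s figures quoted on the slide: kernel -> (CPU rate, GPU rate).
KERNELS = {
    "sgeqrt": (9.0, 30.0),
    "stsqrt": (12.0, 37.0),
    "somqr": (8.5, 227.0),
    "Sssmqr": (10.0, 285.0),
}


def gpu_speedup(kernel):
    """Ratio of GPU rate to CPU rate for one kernel."""
    cpu_rate, gpu_rate = KERNELS[kernel]
    return gpu_rate / cpu_rate


def rank_by_gpu_affinity():
    """Kernels sorted from most to least GPU-friendly."""
    return sorted(KERNELS, key=gpu_speedup, reverse=True)


for name in rank_by_gpu_affinity():
    print(f"{name}: x{gpu_speedup(name):.1f}")
```

Kernels with a speedup near 3 cost the GPUs relatively more than kernels with a speedup near 28, which is consistent with the observed distribution where most sgeqrt tasks stay on the CPUs.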
Scheduling parallel tasks
Parallel tasks: Why do we need parallel tasks?
• Take advantage of multicore architectures
  • Task parallelism may not be suited at a fine grain
  • Use other parallel paradigms
    – e.g. OpenMP
  • Use existing parallel libraries
    – e.g. do not reimplement parallel BLAS …
• Alleviate granularity concerns
  • Fewer tasks
  • Large enough tasks (suited for the GPU)
Parallel tasks: Scheduling parallel tasks
• StarPU allocates processing units
[Figure: Gantt chart over time for cpu #0, cpu #1, cpu #2, cpu #3, gpu #1 and gpu #2.]
Adding support for MPI in StarPU
Accelerating MPI applications with StarPU
• Keep MPI SPMD style
  • Static distribution of data
  • Scheduling within the node only
    – No load balancing between MPI processes
• Inter-process data dependencies
  • MPI communications triggered by StarPU data availability
• Support from StarPU's memory management
  – Automatically construct MPI datatype
Accelerating MPI applications with StarPU
• Provided API
  • starpu_mpi_{send,recv}
  • starpu_mpi_{isend,irecv}
  • starpu_mpi_{test,wait}
  • starpu_mpi_{send,recv}_detached
  • starpu_mpi_*_array
• Detached calls
  • No need to explicitly test/wait for the request
  • Automatic progression
• Automatic data dependencies
  • MPI transfers ~ StarPU tasks
• Accelerating legacy codes
Accelerating LU/MPI with StarPU
• LU decomposition
  • MPI+multiGPU
• Static MPI distribution
  • 2D block cyclic
  • ~SCALAPACK
  • No pivoting!
• Algorithmic work required
  • Collaboration with UTK
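The static 2D block-cyclic distribution mentioned above maps block (i, j) of the matrix to a process of a P x Q grid by taking the block indices modulo the grid dimensions. A minimal sketch, assuming row-major numbering of the ranks in the grid; the function names are hypothetical and the rank numbering used by the actual code is not given in the slides.

```python
def block_owner(i, j, p_rows, q_cols):
    """Rank owning block (i, j) on a p_rows x q_cols process grid.

    Assumes ranks are numbered row by row in the grid.
    """
    return (i % p_rows) * q_cols + (j % q_cols)


def local_blocks(rank, n_blocks, p_rows, q_cols):
    """Blocks of an n_blocks x n_blocks tiled matrix stored on `rank`."""
    return [
        (i, j)
        for i in range(n_blocks)
        for j in range(n_blocks)
        if block_owner(i, j, p_rows, q_cols) == rank
    ]


# 4 x 4 blocks on a 2 x 2 grid: rank 0 owns every other block in each direction.
mine = local_blocks(0, n_blocks=4, p_rows=2, q_cols=2)
```

Because the mapping is fixed, each process knows without communication which blocks it owns and which neighbour holds the blocks it depends on.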
Conclusion
Conclusion: Summary
• StarPU
  • Freely available under LGPL
  • Available on Linux, OS/X, Windows
  • Open to external contributors!
• Task Scheduling
  • Required on hybrid platforms
  • Auto-tuned performance models
• Combined PLASMA and MAGMA
• Parallel tasks
• MPI extensions
Conclusion: Future work
• Implement more algorithms
  • LU, Hessenberg
  • Communication Avoiding algorithms
  • Hybrid ScaLAPACK
• Provide higher level constructs (e.g. reductions)
• Provide a back-end for compilers
  • StarSs, XcalableMP, HMPP
• Support new architectures
  • Intel SCC, Fermi cards, …
• Dynamically adapt granularity
  • Divisible tasks
Thanks for your attention! Any questions?
Performance Models: Our history-based proposition
• Hypothesis
  • Regular applications
  • Execution time independent from data content
    – Static flow control
• Consequence
  • Data description fully characterizes tasks
  • Example: matrix-vector product
    – Unique signature: ((1024, 512), 1024, 1024)
    – Per-data signature
      – CRC(1024, 512) = 0x951ef83b
    – Task signature
      – CRC(CRC(1024, 512), CRC(1024), CRC(1024)) = 0x79df36e2
[Figure: matrix-vector product with a (1024, 512) matrix and two vectors of size 1024.]
Performance Models: Our history-based proposition
• Generalization is easy
  • Task f(D1, … , Dn)
  • Data
    – Signature(Di) = CRC(p1, p2, … , pk)
  • Task ~ Series of data
    – Signature(D1, ..., Dn) = CRC(sign(D1), ..., sign(Dn))
• Systematic method
  • Problem independent
  • Transparent for the programmer
  • Efficient
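The signature scheme above can be sketched with a standard CRC-32 and a table of past measurements. This is an illustration under stated assumptions, not StarPU's code: the slides do not say which CRC variant or byte encoding is used, so the values computed here will not match the hexadecimal signatures quoted earlier, and the `HistoryModel` class and its method names are hypothetical.

```python
import struct
import zlib
from collections import defaultdict


def data_signature(params):
    """CRC of the parameters describing one piece of data, e.g. (1024, 512)."""
    payload = struct.pack(f"<{len(params)}I", *params)  # assumed encoding
    return zlib.crc32(payload)


def task_signature(data_params):
    """CRC of the per-data signatures, in argument order."""
    return data_signature([data_signature(p) for p in data_params])


class HistoryModel:
    """Predict a task duration as the mean of past runs with the same signature."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, kernel, worker_kind, data_params, duration):
        key = (kernel, worker_kind, task_signature(data_params))
        self._samples[key].append(duration)

    def predict(self, kernel, worker_kind, data_params):
        key = (kernel, worker_kind, task_signature(data_params))
        samples = self._samples.get(key)
        if not samples:
            return None  # not calibrated yet for this signature
        return sum(samples) / len(samples)


# Matrix-vector product from the example: one matrix and two vectors.
gemv_args = [(1024, 512), (1024,), (1024,)]
model = HistoryModel()
model.record("gemv", "cpu", gemv_args, 2.0)  # durations in milliseconds (made up)
model.record("gemv", "cpu", gemv_args, 4.0)
```

Keying the history on the kernel, the kind of worker and the signature keeps the lookup independent of the application, which matches the "problem independent" and "transparent for the programmer" points.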
Evaluation: LU decomposition example
• Faster
• No code change!
• More stable

Speed (GFlop/s)
            (16k x 16k)       (30k x 30k)
ref.        89.98 ± .297      130.64 ± .166
1st iter    48.31             96.63
2nd iter    103.62            130.23
3rd iter    103.11            133.50
≥ 4 iter    103.92 ± .046     135.90 ± .000

• Dynamic calibration
• Simple, but accurate