TRANSCRIPT
Ricardo Fonseca | OSIRIS Workshop 2017
Dynamic load balancing in OSIRIS
R. A. Fonseca¹,²
¹ GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal
² DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal
Maintaining parallel load balance is crucial

[Figure: LWFA simulation parallel partition (particles/node, iz = 12).
94×24×24 ≈ 55k cores; load imbalance (max/avg load): 9.04×; average
performance: ~12% of peak]
• Full scale 3D modeling of relevant scenarios requires scalability to large number of cores
• Code performance must be sustained throughout the entire simulation
• The overall performance will be limited by the slowest node
• Simulation time is dominated by particle calculations
• Some nodes may have more particles than others
• If the distribution of particles remains approximately constant throughout the simulation we could tackle this at initialization
• Static load balancing
• However, this will usually depend on the dynamics of the simulation
• Shared memory parallelism can help mitigate the problem
  • Use a “particle domain” decomposition inside the shared memory region
• Smear out localized computational load peaks
• Dynamic load balancing required for optimal efficiency
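The imbalance figure quoted above (max/avg load) is simple to compute; a minimal Python sketch, with illustrative names (not OSIRIS code, which is Fortran):

```python
def imbalance(particles_per_node):
    """Ratio of the most loaded node to the average load.

    A perfectly balanced run gives 1.0; the LWFA example on this
    slide reports 9.04x, i.e. the slowest node carries ~9x the
    average work and caps overall performance.
    """
    avg = sum(particles_per_node) / len(particles_per_node)
    return max(particles_per_node) / avg

print(imbalance([100, 100, 100, 100]))  # 1.0 (balanced)
print(imbalance([10, 10, 10, 370]))     # 3.7 (one hot node)
```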
Adjusting processor load dynamically
[Figure: node boundaries across 4 processes (process 0–3), unbalanced
vs. balanced. After balancing: even particle load across processes.
Global communication ∝ log₂(# nodes); afterwards, communication only
with overlapping nodes]
• Use minimal number of messages
• Only the first communication depends on # nodes
[Flowchart: Begin → Gather Integral of Load → Unbalance > Threshold?
— No → End; Yes → Get Best Partition → Significant Improvement?
— No → End; Yes → Reshape Simulation → End]
• The code can change node boundaries dynamically to attempt to maintain an even load across nodes:
  • Determine the best possible partition from the current particle distribution
  • Reshape the simulation
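The decision loop above can be sketched serially; the helper logic here is illustrative, not the OSIRIS implementation. A reshape only happens when the imbalance exceeds a threshold AND the best partition found is a significant improvement:

```python
def rebalance_step(loads, threshold=1.2, min_gain=0.05):
    """loads: current computational load per node.
    Returns (new_loads, reshaped)."""
    avg = sum(loads) / len(loads)
    imbalance = max(loads) / avg
    if imbalance <= threshold:                 # Unbalance > Threshold? -> No
        return loads, False
    balanced = [avg] * len(loads)              # stand-in for Get Best Partition
    new_imbalance = max(balanced) / avg        # 1.0 in this idealized model
    if imbalance - new_imbalance <= min_gain:  # Significant Improvement? -> No
        return loads, False
    return balanced, True                      # Reshape Simulation
```

In practice the best partition is limited by cell granularity, so the new imbalance is rarely exactly 1.0, which is why the "significant improvement" check matters.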
Calculate cumulative load along load balance direction
[Diagram: each node gets the load along the load-balancing direction
for its subdomain; the loads are integrated, added across all nodes,
and the result distributed to all nodes with a reduce-all
(MPI_ALLREDUCE, a few kB), so the global integral load is known by
all nodes. This step is the main overhead (~90%) but scales well for
large numbers of nodes, ∝ log₂(# nodes)]
• The algorithm for calculating the best partition requires the cumulative computational load
• Essentially a scan operation
• Implemented as a reduce all operation followed by local calculations
• All nodes know the result
  • Simplifies best partition calculations
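The scan above can be emulated serially, with illustrative names: each node holds the load per slice of its own subdomain, an MPI_ALLREDUCE (sum) makes the per-slice totals known everywhere, and every node then computes the prefix sum locally.

```python
def global_integral_load(local_loads_per_node):
    """local_loads_per_node: one list per node, each with the load of
    the n slices along the balance direction (zeros outside the
    node's own range). Returns the global cumulative load, as every
    node would hold it after the reduce-all + local scan."""
    n_slices = len(local_loads_per_node[0])
    # MPI_ALLREDUCE(MPI_SUM) equivalent: elementwise sum over nodes
    total = [sum(node[i] for node in local_loads_per_node)
             for i in range(n_slices)]
    # Local prefix sum -- cheap, duplicated on every node
    cumulative, running = [], 0
    for x in total:
        running += x
        cumulative.append(running)
    return cumulative

print(global_integral_load([[1, 2, 3], [4, 5, 6]]))  # [5, 12, 21]
```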
Finding the best partition
• Algorithm for 1D partitions
– Only 1D integral load required (less communication)
• Find point where integral load is half between endpoints
• Repeat with two resulting intervals until all node partitions are calculated
• Same calculation is done by all nodes
– No need for extra communication
• Fast algorithm
• Well balanced distribution
• Handles “impossible” scenarios well
– i.e. when no load balanced partition exists, it finds the best available solution
• Can handle extra constraints:
– minimum number of cells per node
– Cells per node must be an integer multiple of some quantity
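The recursive bisection can be sketched as follows, assuming the cumulative load per cell (as computed on the previous slide) and a power-of-two node count. Names are illustrative; the OSIRIS implementation also enforces the extra constraints listed above.

```python
import bisect

def best_partition(cumload, n_nodes):
    """Split cells [0, len(cumload)) into n_nodes contiguous ranges of
    near-equal load; returns the interior boundary indices."""
    bounds = []

    def split(lo, hi, parts):
        if parts == 1:
            return
        left = cumload[lo - 1] if lo > 0 else 0
        half = (left + cumload[hi - 1]) / 2
        # Binary search: first cell where the integral load reaches the
        # midpoint (clamped so both halves stay non-empty)
        cut = min(bisect.bisect_left(cumload, half, lo, hi - 1) + 1, hi - 1)
        bounds.append(cut)
        split(lo, cut, parts // 2)
        split(cut, hi, parts // 2)

    split(0, len(cumload), n_nodes)
    return sorted(bounds)

# Uniform load, 8 cells, 4 nodes -> boundaries at cells 2, 4, 6
print(best_partition([10, 20, 30, 40, 50, 60, 70, 80], 4))  # [2, 4, 6]
```

Because all nodes hold the same cumulative load, every node runs this identical calculation and arrives at the same boundaries with no extra communication.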
[Diagram: binary search algorithm — the cumulative load is bisected
at 40299, then at 20105 and 60467, then at 10035, 30149, 50445 and
70491, until all node boundaries are found]
Partition reshaping
[Diagram: communication pattern for node 1 when reshaping from the
old partition (node 0 | node 1 | node 2) to the new one — node 1
sends data to node 0 and gets data from node 2. All nodes know the
old and new partitions]
• All nodes know previous and new partition
• For each node
– Detect overlap between local node on old partition and other nodes on new partition
– Detect overlap between local node on new partition and other nodes on old partition
– Send/receive data to/from the proper nodes
• For particles
– Remove/add to particle buffer
• For grids
– Send/get data to/from other nodes
– Allocate new grid with new shape
– Copy in values from old grid and data from other nodes
– Free old grid
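The overlap detection above can be sketched like this: both partitions are lists of (start, end) cell ranges along the balance direction, known on every node, so each node derives its send/receive lists with no further communication. Names and ranges are illustrative.

```python
def overlaps(my_range, other_partition):
    """All nodes of `other_partition` whose range intersects `my_range`,
    with the intersecting cell range (an entry for the node itself
    means data that stays local)."""
    a0, a1 = my_range
    out = []
    for node, (b0, b1) in enumerate(other_partition):
        lo, hi = max(a0, b0), min(a1, b1)
        if lo < hi:
            out.append((node, (lo, hi)))
    return out

old = [(0, 40), (40, 70), (70, 100)]
new = [(0, 50), (50, 80), (80, 100)]
# Node 1's sends: parts of its old range owned by others in the new partition
print(overlaps(old[1], new))  # [(0, (40, 50)), (1, (50, 70))]
# Node 1's receives: parts of its new range owned by others in the old partition
print(overlaps(new[1], old))  # [(1, (50, 70)), (2, (70, 80))]
```

In this example node 1 sends cells [40, 50) to node 0 and receives cells [70, 80) from node 2, matching the communication pattern in the figure.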
Including cells in the computational load

• Cell calculations (field solve + grid filtering) can be considered by including an effective cell weight
• In this example each cell is considered to have the same computational load as 1.5 particles (α = 1.5)

[Figure: load estimate per vertical slice and resulting node
boundaries, considering particles only vs. particles + cells]
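The cell-weighted estimate can be sketched as below: each cell counts as α particles worth of work (α = 1.5 in the slide's example). Function and variable names are illustrative.

```python
def slice_load(particles_per_slice, cells_per_slice, alpha=1.5):
    """Computational load per vertical slice, including grid work."""
    return [p + alpha * c
            for p, c in zip(particles_per_slice, cells_per_slice)]

# A near-empty slice (e.g. vacuum ahead of the plasma) still carries
# field-solve cost, which shifts the balanced boundaries:
print(slice_load([0, 100, 400], [64, 64, 64]))  # [96.0, 196.0, 496.0]
```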
Hydrodynamic nano plasma explosion
[Plots: total simulation time [s] (0–800) vs. iterations between load
balance (128, 64, 32, 16, 8), comparing no dynamic load balance,
particles only, and particles and grids; and min./max. number of
particles per node (0–10⁶) vs. iteration (0–3000), comparing fixed
boundaries with dynamic load balance]
Simulation Setup
• 2D Cartesian geometry
• 3000 iterations
• Grid
  – 1024² cells
  – 102.4² (c/ωp)²
• Particles – electrons + positrons (2 species)
  – 8² particles/species/cell
  – Generalized thermal velocity uth = 0.1 c
• Radius: 12.8 c/ωp
Load
• Algorithm maintains the best possible load balance
• Repartitioning the simulation is a time consuming task
  • Don’t do it too often!

Performance
• Particles & grids: ~50% better than just particles
• n_loadbalance ~ 64 iterations yields good results
• Performance boost ≥ 2×
  • Other scenarios can be as high as 4×
Multi-dimensional dynamic load balancing
• Difficult task:
  – A single solution exists only for 1D parallel partitions
  – Improving load balance in one direction might result in worse results in another direction

[Figure: node boundaries of the best partition at t = 0 and t′ > 0]

Multidimensional Partition
• Assume a separable load function (i.e. the load is a product fx(x) × fy(y) × fz(z))
  – Each partition direction becomes independent
• Accounting for transverse directions
  – Use the max / sum value across the perpendicular partition
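The separable-load reduction can be sketched as follows: collapse the 3D load onto each axis with max (highlights hot spots) or sum over the two perpendicular directions, then run the 1D algorithm per axis. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def axis_loads(load3d, reduce="max"):
    """One 1D load profile per partition direction."""
    op = np.max if reduce == "max" else np.sum
    return [op(load3d, axis=tuple(a for a in range(3) if a != k))
            for k in range(3)]

load = np.zeros((4, 4, 4))
load[1, 2, 3] = 10.0                  # a single localized hot spot
fx, fy, fz = axis_loads(load, reduce="max")
print(fx)                             # hot spot visible at x index 1
```

Note the separability assumption is exactly what fails in the LWFA case on the next slide: the transverse distribution changes from slice to slice, so no product fx·fy·fz describes the load well.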
Large scale LWFA run: Close but no cigar
• Best result:
  – Dynamic load balance along x2 and x3
  – Use max load
  – 30% improvement in imbalance but...
  – Led to a 5% overall slowdown! No overall speedup
• The ASCR problems are very difficult to load balance
  – Very small problem size per node
  – When choosing partitions with fewer nodes along the propagation direction, imbalance degrades significantly
• Not enough room to dynamically load balance along the propagation direction
• Dynamic load balancing in the transverse directions does not yield big improvements
  – Very different distributions from slice to slice
  – Dynamic load balance in just 1 direction does not improve imbalance
• Using the max node load helps to highlight the hot spots for the algorithm

[Figure: laser-driven x1–x2 slice at box center; a similar partition along x3 gives > 30% improvement in imbalance]
Alternative: Patch based load balancing
• Partition the space into (10–100×) more domains (patches) than processing elements (PEs)
• Dynamically assign patches to PEs
  • Assign similar load to each PE
  • Attempt to maintain neighboring patches in the same PE
Implemented in PSC / SMILEI
316 K. Germaschewski et al. / Journal of Computational Physics 318 (2016) 305–326
Fig. 6. Load balancing increased density across the diagonal. We compare traditional domain decomposition (top row) to patch-based load balancing (bottom row). Thick black lines demarcate the subdomains assigned to each MPI process, thin grey lines show the underlying patches. (a), (d) plasma density. (b), (e) computational load for each decomposition unit. (c), (f) aggregate computational load for each MPI process. Computational load in this simple example is defined as number of particles.
While having many small patches on a process means increased synchronization work at patch boundaries (in particular handling ghost points and exchanging particles), most of this work is wholly within the local subdomain and so can be handled directly or via shared memory rather than by more expensive MPI communication. As we will show later, dividing the work into smaller patches can even have a positive impact on computational performance due to enhanced data locality which makes better use of processor caches.
4.3.2. Example: Enhanced density across the diagonal
The next example demonstrates the load balancing algorithm in action. We chose a uniform low background density of
nb = 0.1 and enhance it by up to n = 1.1 along the diagonal of the domain, as shown in Fig. 6. This density distribution is chosen to be one where the psc’s former approach to load balancing is not effective at all: After dividing the domain into 4 × 4 subdomains it is not possible to shift the subdomain boundaries in a way that reduces load imbalance; this simulation will always be unbalanced by a factor of more than 7× as shown in Fig. 6(b), (c), where the load is simply defined as the number of particles per patch or subdomain, respectively. Subdomains near the high-density diagonal have an estimated load of 81 200, while away from it the load is as low as 11 300.
Patch-based load balancing, however, works quite well. The resulting decomposition is shown in Fig. 6 (d)–(f) by the thicker black lines. It can be seen that some subdomains contain only a few patches, including some with a large load (high density, i.e. many particles), while other subdomains contain more patches but mostly at a low load.
Fig. 6 (b) shows the load for each patch, which is calculated for each patch as number of particles plus number of cells in that patch – clearly this mirrors the particle density. Fig. 6 (c) plots the aggregate load per process, calculated as the sum of the individual loads for each patch in that process’s subdomain. It is clear that the load is not perfectly balanced, but it is contained within ±7.5% of the average load of 29 300. This is certainly a vast improvement over an imbalance by a factor of more than 7× in the original code.
4.3.3. Load balancing algorithm
The goal of the actual balancing algorithm is quite straightforward: Divide the 1-d space filling curve that enumerates
all patches into Nproc segments, where Nproc is the number of processes such that all processes have approximately equal load. In order to accommodate inhomogeneous machines, we add a “capability” specification. For example, we may want to run 16 MPI processes on a Cray XK7 node. 15 of those processes run on one core each, while the last one is used to drive the GPU on the node. In this case, we would assign a capability of 1 to the first 15 processes, and a capability of 60 to the last process, since the Titan’s GPU (Nvidia Tesla K20X) performance is roughly 60× faster than a single CPU core for our code and we want it to get correspondingly more work.
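The segmentation described in the excerpt can be sketched greedily: walk the patches in space-filling-curve order and cut a new segment once a process has accumulated its (capability-weighted) share of the total load. This is illustrative only, not the PSC/SMILEI implementation.

```python
def assign_patches(patch_loads, capabilities):
    """patch_loads: load per patch, in curve order.
    capabilities: relative speed of each process (e.g. 1 per CPU
    core, 60 for a GPU-driving rank, as in the Titan example above).
    Returns one (first_patch, last_patch + 1) segment per process."""
    total = sum(patch_loads)
    cap_total = sum(capabilities)
    targets = [total * c / cap_total for c in capabilities]
    segments, start, acc, proc = [], 0, 0.0, 0
    for i, load in enumerate(patch_loads):
        acc += load
        if acc >= targets[proc] and proc < len(capabilities) - 1:
            segments.append((start, i + 1))
            start, acc, proc = i + 1, 0.0, proc + 1
    segments.append((start, len(patch_loads)))  # last process takes the rest
    return segments

# 8 equal patches over 3 processes, one twice as capable:
print(assign_patches([1] * 8, [1, 1, 2]))  # [(0, 2), (2, 4), (4, 8)]
```

Because neighboring patches on the curve are usually neighbors in space, contiguous segments also tend to keep neighboring patches on the same process.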
Standard MPI partition
Patch based partitioning
be considered more as a dynamic load limiting algorithm rather than balancing. It is absolutely mandatory for modern simulations since these new kinds of developments are emerging in both laboratory and astrophysical plasmas communities. Details of this approach can be found in [6].
4. Summary and outlook
The importance of 3D simulations has been established. Their cost is still prohibitive for most applications since large scale simulations are so massively parallel that load imbalance becomes a barrier. It cannot be alleviated by brute force parallelism. That is merely a waste of computational resources. Germaschewski et al. suggested a new data structure for PIC codes, adapted to both modern massively parallel hardware and to the associated algorithmic challenges [7]. This structure has been adapted with success in SMILEI [18] and shows tremendous improvement in the case of LWFA, even in 2D.
Recent advances in the physics included in PIC codes in both astrophysical plasmas and laser-plasma communities often lead to prohibitive numbers of SP. Now, the problem is not only compute bound, but also memory bound. This issue may be resolved only through a load-limiting algorithm such as particle merging through k-means clustering [6].
The two methods above could be combined into a coherent dynamic load management strategy without interfering with any of the physics included in the code. It is even compatible with different geometry models like CC itself, which is also subject to load imbalance.
References

[1] A. Beck, Y. Kalmykov, X. Davoine, et al., Nucl. Instrum. Methods Phys. Res. A (2013), http://dx.doi.org/10.1016/j.nima.2013.11.003.
[2] B. Cros, B.S. Paradkar, X. Davoine, et al., Nucl. Instrum. Methods Phys. Res. A (2013), http://dx.doi.org/10.1016/j.nima.2013.11.003.
[3] A. Pukhov, J. Meyer-ter-Vehn, Appl. Phys. B: Lasers Opt. 74 (2002) 355.
[4] B.M. Cowan, S.Y. Kalmykov, A. Beck, et al., J. Plasma Phys. 78 (2012) 469.
[5] J.T. Frederiksen, Astrophys. J. Lett. 680 (1) (2008).
[6] J.T. Frederiksen, G. Lapenta, M.E. Pessah, Particle control in phase space by global k-means clustering (2015), arXiv:1504.03849.
[7] K. Germaschewski, W. Fox, N. Ahmadi, et al., The plasma simulation code: a modern particle-in-cell code with load-balancing and GPU support (2013), arXiv:1310.7866.
[8] T. Haugbølle, J. Trier Frederiksen, Å. Nordlund, et al., Phys. Plasmas 20 (2013) 062904.
[9] D. Hilbert, Math. Ann. 38 (1891) 459.
[10] S.Y. Kalmykov, S.A. Yi, A. Beck, et al., New J. Phys. 12 (2010) 045019.
[11] S.Y. Kalmykov, S.A. Yi, A. Beck, et al., Plasma Phys. Control Fusion 53 (2011) 014006.
[12] S.Y. Kalmykov, A. Beck, S.A. Yi, et al., Phys. Plasmas 18 (2011) 056704.
[13] S.Y. Kalmykov, A. Beck, X. Davoine, E. Lefebvre, B.A. Shadwick, New J. Phys. 14 (2012) 033025.
[14] R. Lehe, et al., Phys. Rev. STAB 16 (2013) 021301.
[15] A.F. Lifschitz, X. Davoine, E. Lefebvre, et al., J. Comput. Phys. 228 (2009) 1803.
[16] J.-L. Vay, Phys. Rev. Lett. 98 (2007) 130405.
[17] P. Mora, T.M. Antonsen Jr., Phys. Plasmas 4 (1997) 217.
[18] SMILEI web page ⟨http://www.maisondelasimulation.fr/projects/Smilei/html/index.html⟩.
Fig. 3. Example of a 32×32 patches decomposition between 7 MPI processes. MPI domains are delimited by different uniform colors. The line shows the Hilbert curve and the dots the center of the patches. The curve goes through all patches. It starts from the patch with coordinates (0, 0) and ends at the patch with coordinates (31, 0).
Fig. 4. Evolution of the time spent for 10 SMILEI cycles along a LWFA run on 24 OCCIGEN nodes. “MPI” is a pure MPI run, 576 MPI processes. “Hybrid” is a hybrid MPI+OpenMP run, 48 MPI processes + 12 OpenMP threads per process. “Hybrid balanced” is the same hybrid run with the dynamic load balancing active.
Germaschewski et al., JCP 318 305 (2016); Beck et al., NIMPR-A 829 418 (2016)
Patch assignment 7 PE example
LWFA simulation
Overview
• Maintaining load balancing is crucial for large scale simulations
  • The lowest performing node dominates
• Particle load per PE evolves dynamically along the simulation
• Shared memory parallelism helps smear out load spikes but does not solve the problem
• Dynamic load balancing is implemented by moving node boundaries
• Works well for many problems
• Difficult to expand to multi-dimensional partitions
• Unable to handle difficult LWFA scenarios
• Patch based dynamic load balancing appears to be a viable solution
  • Requires moving to a tiling scheme to organize the data
• Lots of work, but has other advantages, e.g. improved data locality, similarity with GPU code
• Current implementations rely on MPI_THREAD_MULTIPLE
OSIRIS 4.0