
Dynamic load balancing in OSIRIS

R. A. Fonseca 1,2

1 GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal
2 DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal

Ricardo Fonseca | OSIRIS Workshop 2017


Maintaining parallel load balance is crucial

[Figure: LWFA simulation parallel partition; particles/node at slice iz = 12]

LWFA Simulation Parallel Partition
• 94×24×24 = 55k cores

Load Imbalance (max/avg load)
• 9.04×

Average Performance
• ~12% of peak

• Full-scale 3D modeling of relevant scenarios requires scalability to a large number of cores

• Code performance must be sustained throughout the entire simulation

• The overall performance will be limited by the slowest node

• Simulation time is dominated by particle calculations

• Some nodes may have more particles than others

• If the distribution of particles remained approximately constant throughout the simulation we could tackle this at initialization

• Static load balancing

• However, this distribution will usually depend on the dynamics of the simulation

• Shared-memory parallelism can help mitigate the problem

• Use a "particle domain" decomposition inside the shared-memory region

• Smear out localized computational load peaks

• Dynamic load balancing is required for optimal efficiency


Adjusting processor load dynamically


[Figure: unbalanced vs. balanced node boundaries for a 4-process (2×2) partition, showing an even particle load across processes after rebalancing]

• Global communication ∝ log2(# nodes)

• Communication only with overlapping nodes

• Use a minimal number of messages

• Only the first communication step depends on the number of nodes

[Flowchart: Begin → Gather integral of load → Unbalance > threshold? (No → End) → Get best partition → Significant improvement? (No → End) → Reshape simulation → End]

• The code can change node boundaries dynamically to attempt to maintain an even load across nodes (a minimal sketch of this decision logic follows below):

• Determine the best possible partition from the current particle distribution

• Reshape the simulation
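The decision step above can be summarized as: rebalance only when the measured imbalance (max/avg load) exceeds a threshold and the recomputed partition is a significant improvement. A minimal Python sketch of this logic, with illustrative function names and threshold values (not the OSIRIS implementation):

```python
# Sketch of the rebalance trigger: thresholds and names are illustrative.
import numpy as np

def imbalance(node_loads):
    """Max/average load ratio across nodes; 1.0 means perfectly balanced."""
    loads = np.asarray(node_loads, dtype=float)
    return loads.max() / loads.mean()

def should_reshape(old_loads, new_loads, threshold=1.2, min_gain=0.1):
    """Reshape only if the current partition is unbalanced beyond `threshold`
    AND the candidate partition reduces the imbalance by at least `min_gain`."""
    current, proposed = imbalance(old_loads), imbalance(new_loads)
    return current > threshold and proposed < (1.0 - min_gain) * current

# Example: per-node loads now vs. under the recomputed partition
print(should_reshape([100, 340, 120, 160], [190, 180, 170, 180]))  # True
```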


Calculate cumulative load along load balance direction


[Diagram: each node gets the load along the load-balancing direction for its local cells, integrates it along that direction, and a single MPI_ALLREDUCE (a few kB) adds the load from all nodes and distributes the result, so the global integral load is known by all nodes. This all-reduce is the main overhead (~90%) but scales well for large numbers of nodes, ∝ log2(# nodes).]

• The algorithm for calculating the best partition requires the cumulative computational load

• Essentially a scan operation

• Implemented as an all-reduce operation followed by local calculations (see the sketch below)

• All nodes know the result

• This simplifies the best-partition calculations
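A minimal sketch of this step, assuming mpi4py is available: each rank fills only its own slab of the 1D load profile, a single all-reduce (a few kB) gives every rank the summed profile, and the scan is then a local cumulative sum. Function and parameter names are illustrative:

```python
# Sketch of the cumulative-load gather (illustrative, assumes mpi4py).
import numpy as np
from mpi4py import MPI

def global_cumulative_load(local_load, first_cell, nx_global,
                           comm=MPI.COMM_WORLD):
    """local_load: per-cell load of this rank's slab, starting at first_cell
    along the load-balance direction; nx_global: total cells in that direction."""
    contribution = np.zeros(nx_global)
    contribution[first_cell:first_cell + local_load.size] = local_load
    total = np.empty_like(contribution)
    comm.Allreduce(contribution, total, op=MPI.SUM)  # every rank gets the sum
    return np.cumsum(total)                          # local scan of the load
```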


Finding the best partition

• Algorithm for 1D partitions

– Only 1D integral load required (less communication)

• Find the point where the integral load is halfway between the endpoints

• Repeat with the two resulting intervals until all node partitions are calculated (a sketch of this bisection follows the figure below)

• Same calculation is done by all nodes

– No need for extra communication

• Fast algorithm

• Well balanced distribution

• Handles “impossible” scenarios well

– i.e. when a perfectly load-balanced partition cannot be found, it finds the best available solution

• Can handle extra constraints:

– minimum number of cells per node

– Cells per node must be an integer multiple of some quantity


[Figure: Binary search algorithm, shown as the tree of the recursive bisection over the cumulative load: root 40299; second level 20105, 60467; leaves 10035, 30149, 50445, 70491]
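A minimal sketch of this bisection, operating on the cumulative load per cell. It generalizes the halving step to uneven node counts by splitting at the fraction n_left/n_nodes, and clamps the split point so every node keeps at least one cell; names are illustrative, not the OSIRIS routines:

```python
# Sketch of the recursive best-partition search on the cumulative load.
import numpy as np

def best_partition(cum_load, n_nodes, lo=0, hi=None):
    """Return cell boundaries [lo, ..., hi] splitting cum_load (cumulative
    load per cell) into n_nodes slabs of approximately equal load."""
    if hi is None:
        hi = len(cum_load)
    if n_nodes == 1:
        return [lo, hi]
    n_left = n_nodes // 2
    left_edge = cum_load[lo - 1] if lo > 0 else 0.0
    # cell where the integral load reaches n_left/n_nodes of this interval
    target = left_edge + (cum_load[hi - 1] - left_edge) * n_left / n_nodes
    mid = lo + 1 + int(np.searchsorted(cum_load[lo:hi - 1], target))
    # constraint: keep at least one cell per node on either side
    mid = min(max(mid, lo + n_left), hi - (n_nodes - n_left))
    return (best_partition(cum_load, n_left, lo, mid)[:-1]
            + best_partition(cum_load, n_nodes - n_left, mid, hi))

cum = np.cumsum([1, 1, 8, 1, 1, 1, 1, 2], dtype=float)
print(best_partition(cum, 4))  # [0, 2, 3, 6, 8]: the heavy cell gets its own slab
```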


Partition reshaping


[Diagram: communication pattern for node 1 when moving from the old partition to the new partition (nodes 0-2). Node 1 sends the cells it no longer owns to node 0 and gets the cells it gains from node 2. All nodes know the old and new partition.]

• All nodes know previous and new partition

• For each node (see the overlap sketch below)

– Detect the overlap between the local node in the old partition and the other nodes in the new partition

– Detect the overlap between the local node in the new partition and the other nodes in the old partition

– Send data to / receive data from the corresponding nodes

• For particles

– Remove/add to particle buffer

• For grids

– Send/get data to/from other nodes

– Allocate new grid with new shape

– Copy in values from old grid and data from other nodes

– Free old grid
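A minimal sketch of the overlap detection for a 1D partition: boundaries are cell indices and node i owns cells [bounds[i], bounds[i+1]). The same routine, applied once against the new partition and once against the old one, gives the send and receive lists. The boundary values below are illustrative:

```python
# Sketch of overlap detection between old and new 1D partitions.
def overlaps(my_range, other_bounds):
    """Return {node: (lo, hi)} for every node of `other_bounds` whose slab
    intersects `my_range` (half-open cell index ranges)."""
    lo, hi = my_range
    result = {}
    for node in range(len(other_bounds) - 1):
        o_lo, o_hi = max(lo, other_bounds[node]), min(hi, other_bounds[node + 1])
        if o_lo < o_hi:
            result[node] = (o_lo, o_hi)
    return result

old = [0, 40, 80, 128]   # illustrative old boundaries (3 nodes)
new = [0, 55, 90, 128]   # illustrative new boundaries
# Node 1 owned [40, 80) and will own [55, 90):
print(overlaps((40, 80), new))  # what to send: {0: (40, 55), 1: (55, 80)}
print(overlaps((55, 90), old))  # what to get:  {1: (55, 80), 2: (80, 90)}
```

The overlap of a node with itself is simply the local copy; no message is needed for that slab.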


Including cells in the computational load

• Cell calculations (field solve + grid filtering) can be accounted for by including an effective cell weight (a small sketch follows below)

• In this example each cell is considered to have the same computational load as α = 1.5 particles

• The load estimate considers only particles (and cells) per vertical slice

[Figure: node boundaries obtained with a particles-only load estimate vs. a particles + cells (α = 1.5) estimate]
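A small sketch of this weighting; particles_per_slice and cells_per_slice are illustrative names for the per-slice counts along the load-balance direction:

```python
# Sketch: fold grid work into the load profile via an effective cell weight.
import numpy as np

def load_profile(particles_per_slice, cells_per_slice, alpha=1.5):
    """Per-slice load: particles plus alpha 'particle equivalents' per cell."""
    return np.asarray(particles_per_slice) + alpha * np.asarray(cells_per_slice)
```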


Hydrodynamic nano plasma explosion


[Figure: total simulation time [s] (0 to 800) vs. iterations between load balance (8, 16, 32, 64, 128), comparing no dynamic load balance, particles only, and particles and grids]

[Figure: minimum and maximum number of particles per node (0 to 1×10^6) vs. iteration (0 to 3000), for fixed boundaries vs. dynamic load balance]

Simulation Setup

• 2D Cartesian geometry

• 3000 iterations

• Grid

– 1024² cells

– 102.4² (c/ωp)²

• Particles – electrons + positrons (2 species), radius 12.8 c/ωp

– 8² particles/species/cell

– Generalized thermal velocity uth = 0.1 c

Load

• The algorithm maintains the best possible load balance

• Repartitioning the simulation is a time-consuming task

• Don't do it too often!

Performance

• Particles & grids: ~50% better than particles only

• nloadbalance ~ 64 iterations yields good results

• Performance boost ≥ 2×

• Other scenarios can be as high as 4×


Multi-dimensional dynamic load balancing


• A difficult task:

– A single solution exists only for 1D parallel partitions

– Improving the load balance in one direction might give worse results in another direction

[Figure: best partition and node boundaries at t = 0 and at t' > 0, as the load evolves]

Multidimensional Partition (see the sketch below)

• Assume a separable load function, i.e. the load is a product fx(x) × fy(y) × fz(z)

• Each partition direction then becomes independent

• Accounting for the transverse directions: use the max / sum value across the perpendicular partition
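A minimal sketch of this reduction under the separability assumption: collapse the full load array to one 1D profile per direction (using max or sum over the transverse directions) and feed each profile to the 1D partition algorithm independently. Names are illustrative:

```python
# Sketch: per-direction load profiles for a separable multi-dimensional partition.
import numpy as np

def directional_profiles(load3d, reduce="max"):
    """Return one 1D load profile per axis, reducing over the other axes."""
    op = np.max if reduce == "max" else np.sum
    profiles = []
    for axis in range(load3d.ndim):
        other_axes = tuple(a for a in range(load3d.ndim) if a != axis)
        profiles.append(op(load3d, axis=other_axes))
    return profiles  # run the 1D partitioner (via cumsum) on each profile
```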


Large scale LWFA run: Close but no cigar


• Best result: dynamic load balance along x2 and x3, using the max load

• 30% improvement in imbalance but...

• Led to a 5% overall slowdown! No overall speedup

[Figure: LWFA parallel partition, with the laser position indicated]

• The ASCR problems are very difficult to load balance

• Very small problem size per node

• When choosing partitions with fewer nodes along the propagation direction, imbalance degrades significantly

• Not enough room to dynamically load balance along the propagation direction

• Dynamic load balancing in the transverse directions does not yield big improvements

• Very different distributions from slice to slice

• Dynamic load balancing in just one direction does not improve the imbalance

• Using the max node load helps to highlight the hot spots for the algorithm

[Figure: x1-x2 slice at box center; the partition along x3 is similar. > 30% improvement in imbalance]


Alternative: Patch based load balancing


• Partition the space into many (10-100×) more domains (patches) than processing elements (PEs)

• Dynamically assign patches to PEs

– Assign similar load to each PE

– Attempt to keep neighboring patches on the same PE

Implemented in PSC / SMILEI

Excerpt from Germaschewski et al., J. Comput. Phys. 318 (2016) 305–326:

Fig. 6. Load balancing increased density across the diagonal. We compare traditional domain decomposition (top row) to patch-based load balancing (bottom row). Thick black lines demarcate the subdomains assigned to each MPI process, thin grey lines show the underlying patches. (a), (d) plasma density. (b), (e) computational load for each decomposition unit. (c), (f) aggregate computational load for each MPI process. Computational load in this simple example is defined as number of particles.

While having many small patches on a process means increased synchronization work at patch boundaries (in particular handling ghost points and exchanging particles), most of this work is wholly within the local subdomain and so can be handled directly or via shared memory rather than by more expensive MPI communication. As we will show later, dividing the work into smaller patches can even have a positive impact on computational performance due to enhanced data locality which makes better use of processor caches.

4.3.2. Example: Enhanced density across the diagonal

The next example demonstrates the load balancing algorithm in action. We chose a uniform low background density of nb = 0.1 and enhance it by up to n = 1.1 along the diagonal of the domain, as shown in Fig. 6. This density distribution is chosen to be one where the psc's former approach to load balancing is not effective at all: After dividing the domain into 4 × 4 subdomains it is not possible to shift the subdomain boundaries in a way that reduces load imbalance; this simulation will always be unbalanced by a factor of more than 7× as shown in Fig. 6(b), (c), where the load is simply defined as the number of particles per patch or subdomain, respectively. Subdomains near the high-density diagonal have an estimated load of 81 200, while away from it the load is as low as 11 300.

Patch-based load balancing, however, works quite well. The resulting decomposition is shown in Fig. 6 (d)–(f) by the thicker black lines. It can be seen that some subdomains contain only a few patches, including some with a large load (high density, i.e. many particles), while other subdomains contain more patches but mostly at a low load.

Fig. 6 (b) shows the load for each patch, which is calculated for each patch as number of particles plus number of cells in that patch – clearly this mirrors the particle density. Fig. 6 (c) plots the aggregate load per process, calculated as the sum of the individual loads for each patch in that process’s subdomain. It is clear that the load is not perfectly balanced, but it is contained within ±7.5% of the average load of 29 300. This is certainly a vast improvement over an imbalance by a factor of more than 7× in the original code.

4.3.3. Load balancing algorithm

The goal of the actual balancing algorithm is quite straightforward: Divide the 1-d space filling curve that enumerates all patches into Nproc segments, where Nproc is the number of processes such that all processes have approximately equal load. In order to accommodate inhomogeneous machines, we add a "capability" specification. For example, we may want to run 16 MPI processes on a Cray XK7 node. 15 of those processes run on one core each, while the last one is used to drive the GPU on the node. In this case, we would assign a capability of 1 to the first 15 processes, and a capability of 60 to the last process, since the Titan's GPU (Nvidia Tesla K20X) performance is roughly 60× faster than a single CPU core for our code and we want it to get correspondingly more work.
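The curve splitting described above can be sketched as follows. This is an illustrative reconstruction, not the PSC code; cutting the curve greedily at cumulative-load targets weighted by per-process capability is an assumption about the details:

```python
# Sketch: divide the space-filling-curve patch ordering into per-process segments.
import numpy as np

def assign_patches(patch_loads, capabilities):
    """patch_loads: load of each patch, in space-filling-curve order.
    capabilities: relative speed of each process (e.g. 1 per CPU core,
    ~60 for a GPU-driving rank). Returns segment boundaries: process p
    owns patches [bounds[p], bounds[p+1])."""
    cum = np.cumsum(patch_loads, dtype=float)
    shares = np.cumsum(capabilities, dtype=float) / np.sum(capabilities)
    # cut the curve where the cumulative load reaches each process's share
    cuts = np.searchsorted(cum, shares[:-1] * cum[-1], side="right")
    return np.concatenate(([0], cuts, [len(patch_loads)]))

# 100 equal-load patches split between 15 CPU ranks (capability 1)
# and one GPU-driving rank (capability 60):
print(assign_patches(np.ones(100), [1] * 15 + [60]))
```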


Standard MPI partition

Patch based partitioning


Fig. 3 (Beck et al.): Example of a 32×32 patch decomposition between 7 MPI processes. MPI domains are delimited by different uniform colors. The line shows the Hilbert curve and the dots the centers of the patches. The curve goes through all patches, starting from the patch with coordinates (0, 0) and ending at the patch with coordinates (31, 0).

Fig. 4 (Beck et al.): Evolution of the time spent for 10 SMILEI cycles along a LWFA run on 24 OCCIGEN nodes. "MPI" is a pure MPI run (576 MPI processes). "Hybrid" is a hybrid MPI+OpenMP run (48 MPI processes + 12 OpenMP threads per process). "Hybrid balanced" is the same hybrid run with dynamic load balancing active.


Germaschewski et al., JCP 318 305 (2016); Beck et al., NIMPR-A 829 418 (2016)

Patch assignment 7 PE example

LWFA simulation


Overview

• Maintaining load balance is crucial for large-scale simulations

• The lowest-performing node dominates

• The particle load per PE evolves dynamically along the simulation

• Shared-memory parallelism helps smear out load spikes but does not solve the problem

• Dynamic load balancing is implemented by moving node boundaries

• Works well for many problems

• Difficult to extend to multi-dimensional partitions

• Unable to handle difficult LWFA scenarios

• Patch-based dynamic load balancing appears to be a viable solution

• Requires moving to a tiling scheme to organize the data

• Lots of work, but has other advantages, e.g. improved data locality and similarity with GPU code

• Current implementations rely on MPI_THREAD_MULTIPLE


OSIRIS 4.0