
A Trillion Particles: Studying Large-Scale Structure Formation on the BG/Q

David Daniel, Patricia Fasel, Hal Finkel, Nicholas Frontiere, Salman Habib, Katrin Heitmann, Zarija Lukic, Adrian Pope

July 10, 2012
The SIAM 2012 Annual Meeting

Hal Finkel (Argonne National Laboratory) A Trillion Particles July 10, 2012 1 / 25


1 Introduction

2 Equations

3 Requirements and Design

4 RCB Tree and Force Kernel

5 Conclusion


Cosmology

(graphic by: NASA/WMAP Science Team)


The Big Picture


Structure Evolution


Large-Scale Structure Formation

Why study large-scale structure formation?

Cosmology is the study of the universe on the largest measurable scales.

Dark matter and dark energy compose over 95% of the mass-energy content of the universe.

The statistics of large-scale structure help us constrain models of dark matter and dark energy.

Predictions from simulations can be compared to observations from large-scale surveys.

Observational error bars are now approaching the sub-percent level, and simulations need to match that level of accuracy.


What We See

Image of galaxies covering a patch of the sky roughly equivalent to the size of the full moon (from the Deep Lens Survey)


What We See (cont.)


Vlasov-Poisson Equation

Cosmic structure formation follows the gravitational Vlasov-Poisson equation in an expanding Universe – a 6-D PDE for the Liouville flow (1) of the phase-space PDF, with the Poisson equation (2) imposing self-consistency:

∂_t f(x,p) + ẋ · ∂_x f(x,p) − ∇φ · ∂_p f(x,p) = 0,  (1)

∇²φ(x) = 4πG a²(t) Ω_m δ_m(x) ρ_c.  (2)

The expansion of the Universe is encoded in the time dependence of the scale factor a(t). Parameters:

H = ȧ/a is the Hubble parameter

G is Newton's constant

ρ_c is the critical density

Ω_m is the average mass density as a fraction of ρ_c


Vlasov-Poisson Equation (cont.)

ρ_m(x) is the local mass density

δ_m(x) is the dimensionless density contrast

ρ_c = 3H²/(8πG),  δ_m(x) = (ρ_m(x) − ⟨ρ_m⟩)/⟨ρ_m⟩,  (3)

p = a²(t) ẋ,  ρ_m(x) = a(t)⁻³ m ∫ d³p f(x,p).  (4)

The Vlasov-Poisson equation is very difficult to solve directly

Structure develops on fine scales, driven by the gravitational Jeans instability

N-body methods: use tracer particles to sample f(x,p)


Requirements

The requirements for our simulation campaigns are driven by the state of the art in observational surveys:

Survey depths are of order a few Gpc (1 pc = 3.26 light-years)

To follow typical galaxies, halos with a minimum mass of ∼10¹¹ M☉ (M☉ = 1 solar mass) must be tracked

To properly resolve these halos, the tracer particle mass should be ∼10⁸ M☉

The force resolution should be small compared to the halo size, i.e., ∼kpc.

This implies that the dynamic range of the simulation is a part in 10⁶ (∼Gpc/kpc) everywhere! With ∼10⁵ particles in the smallest resolved halos, we need hundreds of billions to trillions of particles.
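As a sanity check, the headline numbers above can be combined in a few lines (a sketch; the values are the nominal ones quoted on this slide):

```python
# Back-of-envelope check of the survey-driven requirements above.
# All numbers are the nominal values quoted on this slide.
survey_depth_kpc = 1.0e6           # ~1 Gpc expressed in kpc
force_res_kpc = 1.0                # target force resolution, ~1 kpc
dynamic_range = survey_depth_kpc / force_res_kpc

min_halo_mass = 1.0e11             # minimum tracked halo mass [M_sun]
particle_mass = 1.0e8              # tracer particle mass [M_sun]
particles_per_min_halo = min_halo_mass / particle_mass

print(dynamic_range)               # a part in 10^6, everywhere
print(particles_per_min_halo)      # ~10^3 tracers in a minimum-mass halo
```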


HACC

The HACC (Hybrid/Hardware Accelerated Cosmology Code) Framework meets these requirements using a P³M (Particle-Particle Particle-Mesh) algorithm on accelerated systems and a tree-P³M method on CPU-only systems (such as the BG/Q).


Force Splitting

The gravitational force calculation is split into a long-range part and a short-range part

A grid is responsible for the largest 4 orders of magnitude of dynamic range

particle methods handle the critical 2 orders of magnitude at the shortest scales

Complexity:

PM (grid) algorithm: O(N_p) + O(N_g log N_g), where N_p is the total number of particles and N_g is the total number of grid points

tree algorithm: O(N_pl log N_pl), where N_pl is the number of particles in individual spatial domains (N_pl ≪ N_p)

the close-range force computations are O(N_d²), where N_d is the number of particles in a tree leaf node within which all direct interactions are summed
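To get a feel for the scaling regimes above, the three cost terms can be evaluated for some illustrative sizes (the numbers here are made-up examples, not the production configuration):

```python
import math

# Illustrative operation counts for the three complexity regimes above.
N_g = 1024**3    # total grid points: PM cost ~ N_g log N_g
N_pl = 10**7     # particles in one spatial domain: tree cost ~ N_pl log N_pl
N_d = 128        # particles in one leaf: direct cost ~ N_d^2 per leaf

pm_ops = N_g * math.log2(N_g)      # dominated by the FFT-based grid solve
tree_ops = N_pl * math.log2(N_pl)  # per-domain tree walk
leaf_ops = N_d**2                  # direct pairwise sums inside one leaf

print(f"{pm_ops:.2e} {tree_ops:.2e} {leaf_ops}")
```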


Force Splitting (cont.)

Long-Range Algorithm:

The long/medium-range algorithm is based on a fast, spectrally filtered PM method

The density field is generated from the particles using a Cloud-In-Cell (CIC) scheme

The density field is smoothed with the (isotropizing) spectral filter:

exp(−k²σ²/4) [(2/(kΔ)) sin(kΔ/2)]^n_s,  (5)

where the nominal choices are σ = 0.8 and n_s = 3. The noise reduction from this filter allows matching the short- and longer-range forces at a spacing of 3 grid cells.

The Poisson solver uses a sixth-order, periodic influence function (a spectral representation of the inverse Laplacian)

The gradient of the scalar potential is obtained using higher-order spectral differencing (fourth-order Super-Lanczos)
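The long-range pipeline above can be sketched in a few lines of numpy. This is a minimal stand-in, not HACC's solver: the sixth-order influence function and fourth-order Super-Lanczos differencing are replaced by the plain spectral operators (−1/k² and ik), and the filter and source prefactor are omitted:

```python
import numpy as np

# Minimal periodic FFT "Poisson solve": density contrast -> potential via a
# k-space Green's function, then one independent inverse FFT per gradient
# component. Solves grad^2 phi = delta with a unit prefactor; in the real
# code the source carries the 4*pi*G*a^2*Omega_m*rho_c factor of eq. (2).
def pm_potential_gradient(delta, box=1.0):
    n = delta.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=box / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                  # avoid division by zero at k = 0
    green = -1.0 / k2                  # plain inverse Laplacian
    green[0, 0, 0] = 0.0               # zero-mean potential
    phi_k = green * np.fft.fftn(delta)          # the single "Poisson-solve" FFT
    # each component of the potential gradient needs an independent FFT
    return [np.fft.ifftn(1j * kk * phi_k).real for kk in (kx, ky, kz)]
```

For a single plane-wave density mode δ = cos(2πx), the returned x-gradient is the analytic sin(2πx)/(2π), which is a convenient correctness check for any spectral solver of this shape.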


Force Splitting (cont.)

The “Poisson-solve” is the composition of all the kernels above in one single Fourier transform

Each component of the potential-field gradient then requires an independent FFT

Distributed FFTs use a pencil decomposition

To obtain the short-range force, the filtered grid force is subtracted from the Newtonian force

Mixed precision:

single precision is adequate for the short/close-range particle force evaluations and particle time-stepping

double precision is used for the spectral component


Overloading

The spatial domain decomposition is in regular 3-D blocks, but unlike the guard zones of a typical PM method, full particle replication – termed ‘particle overloading’ – is employed across domain boundaries.


Overloading (cont.)

Works because particles cluster and large-scale bulk motion is small

The short-range force contribution is not used for particles near the edge of the overloading region

The typical memory overhead cost for a large run is ∼10%

The point of overloading is to allow sufficiently exact medium/long-range force calculations with no communication of particle information, along with high-accuracy local force calculations

We use relatively sparse refreshes of the overloading zone! This is key to freeing the overall code performance from the weaknesses of the underlying communications infrastructure.
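The idea can be illustrated in one dimension (an assumed geometry for illustration, not HACC's actual decomposition): a rank stores full copies of every particle within a margin of its boundaries, so edge particles have complete short-range neighbor lists without any communication.

```python
import numpy as np

# 1-D sketch of particle overloading: a rank owning [lo, hi) also keeps
# replicas of all particles within `margin` of its boundaries. Short-range
# forces are trusted only for particles well inside [lo, hi); the replicas
# exist so those interior sums are complete without communication.
def overloaded_particles(positions, lo, hi, margin):
    keep = (positions >= lo - margin) & (positions < hi + margin)
    return positions[keep]
```

Because particles move slowly relative to the margin width, the replica set only needs occasional refreshes, which is the communication-sparing property the slide emphasizes.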


Time Stepping

The time-stepping is based on a 2nd-order split-operator symplectic SKS scheme (stream-kick-stream)

Because the characteristic time scale of the short-range force is much smaller than that of the long-range force, we sub-cycle the short-range force operator

The relatively slowly evolving longer-range force is effectively frozen during the shorter-range sub-cycles

M_full(t) = M_lr(t/2) (M_sr(t/n_c))^n_c M_lr(t/2).  (6)

The number of sub-cycles is n_c = 3–5, in most cases.
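The map (6) can be sketched as operator composition on a single phase-space point (a toy version with unit mass and arbitrary force callables, not HACC's update):

```python
# Toy sketch of the sub-cycled map (6): a half "kick" from the (frozen)
# long-range force brackets n_c full stream-kick-stream (SKS) sub-steps
# of the short-range force. Unit particle mass is assumed.
def sks(x, p, dt, force):
    x = x + 0.5 * dt * p           # stream
    p = p + dt * force(x)          # kick
    x = x + 0.5 * dt * p           # stream
    return x, p

def full_step(x, p, dt, nc, f_long, f_short):
    p = p + 0.5 * dt * f_long(x)          # M_lr(dt/2)
    for _ in range(nc):                   # (M_sr(dt/nc))^nc
        x, p = sks(x, p, dt / nc, f_short)
    p = p + 0.5 * dt * f_long(x)          # M_lr(dt/2)
    return x, p
```

Freezing the long-range force across the sub-cycles is what makes the expensive grid solve affordable: it runs once per full step while the cheap short-range operator runs n_c times.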


RCB Tree

The short-range force is computed using a recursive coordinate bisection (RCB) tree in conjunction with a highly tuned short-range polynomial force kernel.

(figure: an RCB tree, levels 0–3, with the resulting leaf nodes numbered 1–15; graphic from Gafton and Rosswog, arXiv:1108.0028)


RCB Tree (cont.)

At each level, the node is split at its center of mass

During each node split, the particles are partitioned into disjoint, adjacent memory buffers

This partitioning ensures a high degree of cache locality during the remainder of the build and during the force evaluation

To limit the depth of the tree, each leaf node holds more than one particle. This makes the build faster but, more importantly, trades time in a slow procedure (a “pointer-chasing” tree walk) for time in a fast procedure (the polynomial force kernel).
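A single split consistent with the bullets above can be sketched as follows (a simplification: equal particle masses are assumed, and the split axis is taken to be the node's longest edge):

```python
import numpy as np

# Sketch of one RCB split: bisect the node along its longest axis at the
# center of mass, partitioning the particles into two contiguous buffers
# (left block first, then right) so children occupy adjacent memory.
def rcb_split(pos):
    extent = pos.max(axis=0) - pos.min(axis=0)
    axis = int(np.argmax(extent))                # longest edge of the node
    com = pos[:, axis].mean()                    # equal-mass center of mass
    right = pos[:, axis] > com
    order = np.argsort(right, kind="stable")     # stable: preserves ordering
    pos = pos[order]
    n_left = int((~right).sum())
    return pos[:n_left], pos[n_left:]            # disjoint, adjacent buffers
```

Applying this recursively until each leaf holds a small bucket of particles gives the level structure in the figure; the contiguous buffers are what deliver the cache locality claimed above.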


RCB Tree (cont.)

Another benefit of using multiple particles per leaf node:


Force Kernel

Due to the compactness of the short-range interaction, the kernel can be represented as

f_SR(s) = (s + ε)^(−3/2) − f_grid(s),  (7)

where s = r · r, f_grid(s) = poly[5](s), and ε is a short-distance cutoff.

An interaction list is constructed during the tree walk for each leaf node

Using OpenMP, the particles in the leaf node are assigned to different threads: all threads share the interaction list (which automatically balances the computation)

The interaction list is processed using a vectorized kernel routine (written using QPX compiler intrinsics)

Filtering for self and out-of-range interactions uses the floating-point select instruction: no branching required

We can use the reciprocal (sqrt) estimate instructions: no library calls
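Kernel (7) and the branch-free filtering can be sketched in vectorized numpy (the polynomial coefficients standing in for f_grid are placeholders, not HACC's fitted values; `np.where` plays the role of the BG/Q floating-point select):

```python
import numpy as np

# Branch-free sketch of kernel (7): per interaction vector dx, compute
# s = r.r and f_SR(s) = (s + eps)^(-3/2) - poly5(s), then zero out self
# (s == 0) and out-of-range (s >= r2_max) pairs with a select, no branches.
def short_range_factor(dx, eps, coeffs, r2_max):
    s = (dx * dx).sum(axis=-1)                       # s = r . r
    f = (s + eps) ** -1.5 - np.polyval(coeffs, s)    # softened Newtonian - grid fit
    return np.where((s > 0.0) & (s < r2_max), f, 0.0)
```

Computing f unconditionally for every slot and masking afterward mirrors what the QPX kernel does per SIMD lane: the select is cheaper than any branch inside a hot vectorized loop.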


Running Configuration

We use either 8 threads per rank with 8 ranks per node, or 4 threads per rank and 16 ranks per node

The code spends 80% of the time in the highly optimized force kernel, 10% in the tree walk, and 5% in the FFT, with all other operations (tree build, CIC deposit) adding up to another 5%.

This code achieves over 50% of the peak FLOP rate on the BG/Q!


Conclusion

HACC is designed to minimize inter-node communication (operator splitting, overloading, etc.)

The RCB tree increases cache locality and trades a slow walk for a fast kernel computation

The code demonstrates highly portable high performance!

By meeting its design requirements, we'll be able to make significant scientific contributions in the coming years


Acknowledgments

DOE and ANL (both HEP and ASCR)

Staff at ALCF, especially Vitali Morozov and Kumar Kumaran (from the performance team)

IBM

LLNL

SIAM

(everyone else I’ve forgotten to put here)
