GPUs for Scientific Applications · 2014-05-07
TRANSCRIPT
GPUs for Scientific Applications
Eduardo M. Bringa ([email protected]) & Emmanuel Millán CONICET / Instituto de Ciencias Básicas, Universidad Nacional de Cuyo, Mendoza
3er Escuela Argentina de GPGPU para Aplicaciones Científicas, CAB, May 2014
Collaborators: D. Tramontina, C. Ruestes, F. Fioretti, C. Garcia Garino (UN Cuyo), L. Forconesi (ITU), R. Isoardi (FUESMEN), S. Manzi, E. Perino, M.F. Piccoli, M. Printista (UNSL), D. Schwen (LANL), A. Higginbotham (Oxford)
Simulations at different scales: HPC allows going from nano to micro
By Greg Odegard, NASA Langley Research Center
University of Virginia, MSE 492/627: Introduction to Atomistic Simulations, Leonid Zhigilei, http://www.people.virginia.edu/~lz2n/mse627
HPC
• N classical particles. Particle i, at position ri, has velocity vi and acceleration ai.
• Particles interact through an empirical potential, V(r1,…, ri,…, rN), which generally includes many-body interactions.
• Particles obey Newton's equations of motion. For particle i with mass mi: Fi = −∇iV(r1,…, ri,…, rN) = mi ai = mi (d²ri/dt²)
• Volume < 0.5 µm³ (~10⁹ atoms)
• Times t < 1 ns, ∆t ~ 1 fs
• Several integrators available
• Electronic effects can be incorporated (Koci et al., PRB 2006).
A very useful tool to study materials: classical Molecular Dynamics = MD
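The equations of motion above are usually integrated with velocity Verlet. A minimal sketch follows (hypothetical toy code: a single particle in a 1D harmonic potential stands in for the many-body V mentioned above):

```python
# Minimal velocity-Verlet MD step (toy 1D harmonic oscillator).
# In real MD codes the force comes from -grad V of a many-body potential.

def force(x, k=1.0):
    """F = -dV/dx for V(x) = 0.5*k*x^2 (stand-in for an empirical potential)."""
    return -k * x

def velocity_verlet(x, v, dt, m=1.0, steps=1000):
    a = force(x) / m
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / m              # new acceleration at new position
        v += 0.5 * (a + a_new) * dt       # velocity update with averaged a
        a = a_new
    return x, v

# Integrate roughly one oscillator period (T = 2*pi for k = m = 1).
x, v = velocity_verlet(x=1.0, v=0.0, dt=0.001, steps=6283)
```

Production codes apply the same two half-updates per step to every particle, with forces from the interatomic potential; the scheme is time-reversible and conserves energy well over long runs.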
[Diagram: pairwise forces Fij, Fji, Fik, Fki, Fjk, Fkj among particles i, j, k]
Interatomic potential (Phys/Eng) or Force Field (Chem/Bio)
http://en.wikipedia.org/wiki/Force_field_chemistry
Example
Golden rule: “garbage in, garbage out”
With MD you can obtain….
“Real” time evolution of your system.
Thermodynamic properties, including T(r,t) temperature profiles that can be used in rate equations.
Mechanical properties, including elastic and plastic behavior.
Surface/bulk/cluster growth and modification.
X-ray and “IR” spectra
Etcetera …
• Can simulate only small samples (L < 1 µm, up to ~10⁹ atoms).
• Can simulate only short times (t < 1 µs, because ∆t ~ 1 fs).
• Computationally expensive (weeks).
• Potential's golden rule: garbage in, garbage out.
• Interaction potentials for alloys, molecular solids, and excited species are not well known.
• Despite its limitations, MD is a very powerful tool to study nanosystems.
Limitations of MD
Atomistic simulations are extremely helpful but … still have multiple limitations
How do we simulate a large number of atoms?
• Integrating the two-body problem is one thing… but integrating the motion of N particles, with N = several million to billions, is a whole different ball game.
• Short-range potentials (not 1/r): use an appropriate cut-off and do spatial decomposition of the domain. This will ensure nearly perfect parallel scaling [O(N)]. Sometimes a VERY long cut-off is used for (1/r) potentials, with varying results.
• Long-range potentials (1/r): old method uses Ewald summation. New methods (PME,PPPM=P3M, etc.) are typically O(NlogN). Even newer methods (variations of multipole expansion) can be O(N), at the price of a large computational overhead. This is the same as the problem of N-body simulations used in astrophysics.
• Have to be careful with boundary conditions (free, periodic, expanding, damping, etc.) and check for system size effects.
Alejandro Strachan, http://nanohub.org/resources/5838#series
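The short-range-cutoff bullet above is what makes O(N) scaling possible: instead of testing all N² pairs, particles are binned into cells of side ≥ rcut and only the neighboring cells are searched. A minimal 2D sketch (hypothetical; real codes like LAMMPS build per-MPI-domain cell and neighbor lists):

```python
# Cell-list sketch: with a cutoff rc, bin particles into cells of side >= rc,
# then search only the 3x3 cell neighborhood instead of all N^2 pairs.
import itertools

def count_pairs_within(positions, box, rc):
    ncell = max(1, int(box // rc))          # cells per side (side >= rc)
    side = box / ncell
    cells = {}
    for i, (x, y) in enumerate(positions):
        cells.setdefault((int(x / side) % ncell, int(y / side) % ncell), []).append(i)
    pairs = 0
    for (cx, cy), members in cells.items():
        for dx, dy in itertools.product((-1, 0, 1), repeat=2):
            neigh = cells.get(((cx + dx) % ncell, (cy + dy) % ncell), [])
            for i in members:
                for j in neigh:
                    if i < j:               # count each pair exactly once
                        ddx = positions[i][0] - positions[j][0]
                        ddy = positions[i][1] - positions[j][1]
                        # minimum-image convention for periodic boundaries
                        ddx -= box * round(ddx / box)
                        ddy -= box * round(ddy / box)
                        if ddx * ddx + ddy * ddy < rc * rc:
                            pairs += 1
    return pairs
```

Production codes add a neighbor "skin" on top of rcut and rebuild the lists only every few steps, which is where the skin values quoted in the benchmarks below come from.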
Many MD codes can now use GPU acceleration. Often used as black boxes without understanding limitations.
AMBER (Assisted Model Building with Energy Refinement): http://ambermd.org/gpus/ Ross Walker. MPI for several GPUs/cores. TIP3P, PME, ~10⁶ atoms max (Tesla C2070).
HOOMD-Blue (Highly Optimized Object-oriented Many-particle Dynamics): http://codeblue.umich.edu/hoomd-blue/index.html OMP for several GPUs in single board.
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator): http://lammps.sandia.gov/ . MPI for several GPUs/cores (LJ: ~1.2×10⁷ atoms max, Tesla C2070). GPULAMMPS: http://code.google.com/p/gpulammps/ CUDA + OpenCL
DL_POLY: http://www.cse.scitech.ac.uk/ccg/software/DL_POLY/ F90+MPI, CUDA+OpenMP port.
GROMACS: http://www.gromacs.org/Downloads/Installation_Instructions/Gromacs_on_GPUs Uses OpenMM libs (https://simtk.org/home/openmm). No parallelization. ~10⁶ atoms max.
NAMD (“Not another” MD): http://www.ks.uiuc.edu/Research/namd/ GPU/CPU clusters. VMD (Visual MD): http://www.ks.uiuc.edu/Research/vmd/
GTC 2010 Archive: videos and pdf’s: http://www.nvidia.com/object/gtc2010-presentation-archive.html#md
1,000,000+ atom Satellite Tobacco Mosaic Virus
Freddolino et al., Structure, 14:437-449, 2006. Many more!
Highly Optimized Object-oriented Many-particle Dynamics – HOOMD-Blue: http://codeblue.umich.edu/hoomd-blue/index.html
• General purpose molecular dynamics simulations fully implemented on graphics processing units. Joshua A. Anderson, Chris D. Lorenz, and Alex Travesset, Journal of Computational Physics 227 (2008) 5342-5359.
• Molecular Dynamics on Graphic Processing Units: HOOMD to the Rescue. Joshua A. Anderson and Alex Travesset, Computing in Science & Engineering 10(6) (2008).
http://www.nvidia.com/object/molecular_dynamics.html
HOOMD-Blue 0.11.2 benchmarks (64K particles, ECC off; TPS = timesteps per second):

Host CPU                     | GPU         | Polymer TPS | LJ liquid TPS
Xeon E5-2670 (SandyBridge-E) | Tesla K20X  | 885         | 1014
Xeon E5-2667 (SandyBridge-E) | Tesla K20m  | 785         | 890
Xeon E5-2670 (SandyBridge-E) | Tesla M2090 | 597         | 682
Xeon X5670 (Westmere)        | Tesla M2070 | 486         | 553
Xeon X5670 (Westmere)        | Tesla M2050 | 490         | 560
NAMD + VMD (MD+Viz for massive CPU-GPU clusters)
NAMD: http://www.ks.uiuc.edu/Research/namd/ VMD (Visual MD): http://www.ks.uiuc.edu/Research/vmd/
1,000,000+ atom simulationSatellite Tobacco Mosaic Virus
Freddolino et al., Structure 14, 437 (2006)Zhao et al., Nature 497, 643 (2013)
64,000,000 atom simulation HIV-1 CAPSID
GPU surface rendering http://www.ks.uiuc.edu/Research/gpu/
Summary: there are many opportunities for MD
• Petascale → Exascale! (USA & EU initiatives). Science 335, 394 (2012).
• New software: novel algorithms and preferably open source [Nature 482, 485 (2012)]. Still need significant advances in visualization (Visual Strategies: A Practical Guide to Graphics for Scientists and Engineers, F. Frankel & A. Depace, Yale University Press, 2012), dataset analysis [Science 334, 1518 (2011)], self-recovery & fault tolerance, etc.
• New hardware : better (faster/greener/cheaper) processors, connectivity, memory and disk access; MD-tailored machines (MD-GRAPE-4, Anton, etc.); GPUs, MICs, hybrid architectures (GPU/CPU); cloud computing, etc.
• Experiments are reaching micro–nano and ns–ps scales, the same as MD.
• Can go micron-size, but still have to connect to the mm–m scale: novel approaches needed, including smart sampling, concurrent coupling, dynamic/adaptive load balancing/refining for heterogeneous systems, asynchronous simulations, etc.
• Need human resources with a mix of hardware, software & science expertise.
LAMMPS (http://lammps.sandia.gov/ )
Some of my personal reasons to use LAMMPS:
1) Free, open source (GNU license).
2) Easy to learn and use:
(a) mailing list in sourceforge.
(b) responsive developers and user community.
(c) extensive docs: http://lammps.sandia.gov/doc/Section_commands.html#3_5
3) It runs efficiently on my laptop (8 cores) and on BlueGene/L (100 K cores), including parallel I/O, with the same input script. Also efficient on GPUs.
4) Very efficient parallel energy minimization, including cg & FIRE.
5) Includes many-body, bond order, & reactive potentials. Can simulate inorganic & bio systems, granular and CG systems.
6) Can do extras like DSMC, TAD, NEB, TTM, semi-classical methods, etc.
7) Extensive set of analysis routines: coordination, centro, cna, etc.
8) Easy to write analysis inside input, using something similar to pseudo-code.
LAMMPS: GPU execution

GPU package: novel, most up-to-date GPU code. GNU GPL v2 license. Main developers: Paul Crozier (Sandia), Mike Brown, Arnold Tharrington, Scott Hampton (Oak Ridge), Axel Kohlmeyer (Temple), Christian Trott, Lars Winterfeld (Ilmenau, Germany), Duncan Poole, Peng Wang (Nvidia), etc.
Multiple MPI processes can execute on the same GPU card. The Geryon library supports OpenCL and CUDA; see
https://github.com/CFDEMproject/LAMMPS/tree/master/lib/gpu/geryon
USER-CUDA package: main developer Christian Trott. Supports only CUDA. For some types of simulations it is faster than the GPU package. Only one MPI process can run on a GPU. Uses less memory than the GPU package. Supports granular simulations. For more information see: http://lammps.sandia.gov/doc/package.html
http://lammps.sandia.gov/doc/Section_accelerate.html#acc_8
GPU LAMMPS Parallel Performance
Benchmarks: http://users.nccs.gov/~wb8/gpu/bench.htm • Also: http://sites.google.com/site/akohlmey/software/lammps-benchmarks
864K atoms LJ liquid, reduced density=0.8442. NVE, rcut= 2.5σ, 5000 steps. Speedup ~ 3-4.
Rhodopsin protein in a solvated lipid bilayer. CHARMM force field, long-range Coulombics via PPPM, SHAKE constraints. Counter-ions and a reduced amount of water. 32K atoms, 1000 timesteps, LJ rcut = 1 nm, neighbor skin of 1.0 σ, NPT. Speedup ~ 1.7-3.
Gay-Berne ellipsoids125 K atoms, NVE, rcut= 7σ, 1000 steps. Speedup ~ 9-11.
Yona cluster: 15 nodes, each with 2x6-core AMD Opteron 2435 (2.6GHz) & 2 Tesla C2050 GPUs (3GB GDDR5 memory, 448 cores at 1.15GHz, memory bandwidth 144GB/s). GPUs connected via PCIe x16 gen 2.0 slots. ECC support enabled. Mellanox MT26428 QDR InfiniBand interconnect.
Benchmarks for load balancing
The performance impact resulting from splitting the force calculation between the host and device will depend on the CPU core to device ratio and the relative rates of force calculation on the host and device.
Processes per node (ppn). Dynamic load balancing (LB). Neighboring performed on the GPU (GPU-N). LJ, N=864000.
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
Benchmarks for different precision modes
256000 atoms LJ liquid, reduced density=0.8442. NVE, rcut= 2.5σ, 5000 steps.
Rhodopsin protein in a solvated lipid bilayer. CHARMM force field, long-range Coulombics via PPPM, SHAKE constraints. Counter-ions and a reduced amount of water to make a 32K atom system, replicated 2x2x2 to create the box. 256,000 atoms, 1000 timesteps, LJ rcut = 1 nm, neighbor skin of 1.0 σ, NPT.
• Single precision OK for many runs, but use at your peril! Colberg & Höfling, Comp. Phys. Comm. 182 (2011) 1120–1129.
• Mixed precision (single for positions and double for forces) nearly as fast as single precision!
• Double precision still cheaper than CPU.
More benchmarks …
Strong scaling benchmark using LJ. cutoff of 2.5 and N=864 K.
LJ. Single node.
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
LAMMPS: OpenCL vs CUDA
Single node. Code compiled with CUDA and OpenCL. N=256K with neighboring performed on the GPU. Time normalized by the time required to complete the simulation loop with CUDA.
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
CPU versus CPU-GPU Speedups
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
LAMMPS Scaling with GPU
GPU: NVIDIA Tesla c2050, 3 GB RAM, 448 processing units. CPU: AMD Phenom x6 1055t 2.8GHz, 12 GB RAM.
[Chart: LJ melt example, 1000 steps, AMD Phenom x6 1055T and NVIDIA Tesla c2050. Wallclock time vs number of atoms (4,000 to 13,720,000) for CPU (2 cores), GPU single precision, and GPU double precision; GPU speedup ~8x.]

[Chart: melt, gpulammps, 1 core + Tesla c2070, 100 steps. Time breakdown vs N (4e6 to ~1.1e8 atoms): total, other, force, neigh, output, comm.]
Interstellar dust plays an important role in astrophysical processes. Grain size matters for evolution, astro-chemistry, etc.
[Diagram: light from a star crosses an interstellar cloud of gas and dust; absorption and scattering shape the observed spectrum of the star. Dust matters for star formation, cooling rates, thermal balance, chemistry/biology, planet formation, meteorites/asteroids, and black holes.]
• Grain diameters: 0.005 µm – 5 µm
• Size distribution: n(a)∝ a -3.5
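Grain sizes following the n(a) ∝ a⁻³·⁵ distribution above can be drawn by inverse-transform sampling. A sketch, assuming the 0.005–5 µm bounds quoted on the slide:

```python
# Inverse-transform sampling of the MRN-like grain size distribution
# n(a) ∝ a^(-3.5) on [a_min, a_max] (bounds from the slide, in microns).
import random

def sample_grain_size(a_min=0.005, a_max=5.0, p=3.5):
    u = random.random()
    # The CDF of a^-p involves a^(1-p); invert it analytically.
    lo, hi = a_min ** (1.0 - p), a_max ** (1.0 - p)
    return (lo + u * (hi - lo)) ** (1.0 / (1.0 - p))

random.seed(0)
sizes = [sample_grain_size() for _ in range(100000)]
# Small grains dominate the number counts: most samples land near a_min.
```

The same draw can seed grain radii for the granular collision ensembles discussed later.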
Supernova 1987A
Modification of materials: phase transitions and damage to space hardware, and due to shrapnel (NIF).
Large-scale MD links nano and microscales in damage induced by nanoprojectiles
Only dislocations + liquid atoms shown, ~300×10⁶ atoms
C. Anders et al., PRL 108, 027601 (2012)
Granular mechanics of nano-grain collisions: Ringl et al., Ap.J. 752 (2012) 151.
New granular friction scheme implemented for GPUs by E. Millán.
Granular mechanics of grain-surface collisions: Ringl et al., PRE 86, 061313 (2012).
KALEIDOSCOPE
COMPLEXITY in cluster collisions
Parameters: velocity (v), impact parameter (x), radius (d/2), structure, orientation of the lattice.
[Snapshots at 0, 1.6, 3.1, 4.6, 6.1 and 16.1 ps]
N. Ohnishi et al., “Numerical analysis of nanograin collision by classical molecular dynamics,” J. Phys. Conf. Series 112 (2008) 042017. Run on 256 cores (LLNL).
COMPLEX CLUSTER STRUCTURE
Stretching separation
Sliding and locking
Droplet. Sticking.
Additional parameters:
Shape Porosity Speed of sound in the material Fragmentation velocity
Ringl, Bringa, Bertoldi, and Urbassek, Astrophysical Journal (2011). Run on 48 cores, 128 GB RAM (Germany).
Kalweit and Drikakis, Phys. Rev. B 74, 235415 (2006).
V = 5 m/s; X = 0.8R
Emmanuel N. Millán et al., XVIII CACIC, http://sedici.unlp.edu.ar/handle/10915/23634
Nanograin collisions – GPU vs CPU performance
150 simulations. The GPU/CPU test runs 5 processes with independent simulations in parallel using six CPU cores: one CPU process with 2 cores and 4 GPU processes.
Results for different impact parameters
b= impact parameter, and R=radius of spherical grains (R=6.7 σ), SP= single precision.
Largest defect content for central collisions, b=0.
Results at different velocities
Final configuration for V (LJ units): (a) V=0.3, (b) V=1, (c) V=3, and (d) V=6.
Only part of the atoms shown, to allow the visualization of SF atoms (grey)
Grains shown in teal and red → lack of mixing at low velocities.
What is the threshold velocity for plasticity? Continuum models typically neglect nanosize and strain rate effects
Plastic Threshold
~2,000 atoms, vthresh=0.55: (a) v=0.54; (b) v=0.56; (e) v=0.54, 18 atoms in SF; (f) v=0.56, 191 atoms in SF.
~20,000 atoms, vthresh=0.37: (c) v=0.35; (d) v=0.41; (g) v=0.35, 28 atoms in SF; (h) v=0.41, 759 atoms in SF.
~200,000 atoms, vthresh=0.225: (i) v=0.2; (j) v=0.25; (m) v=0.2, 34 atoms in SF; (n) v=0.25, 3670 atoms in SF.
~2,000,000 atoms, vthresh=0.11: (k) v=0.1; (l) v=0.15; (o) v=0.1, 248 atoms in SF; (p) v=0.15, 24913 atoms in SF.
Fraction of defective atoms changes with velocity and radius
Plastic threshold inversely proportional to grain radius.
Each point needs 10+ runs over different orientations
Plasticity threshold in grain-grain impacts
Millán, Tramontina, et al., Anales MACI (2013). FCC stacking faults and twins.
Dislocation-based model by Lubarda et al. agrees with MD. Millán, Tramontina, et al., to be submitted (2014).
GPUs + CPUs to run ~10000 independent MD simulations.
Granular models typically assume lack of plasticity.
V. Lubarda et al., Acta Mater. 52 (2004) 1397. S. Jung et al., Aerosol Sci. 50 (2012) 26.
On-going research on grain-grain collisions:
Atoms in planar defects give a reasonably good diagnostic for plasticity initiation. Need to quantify dislocation activity and check for dislocation starvation.
Threshold velocity depends on cluster size, but not as in usual continuum models.
On-going studies: statistics of defects (100-10000 orientations). Large cluster sizes → Hall-Petch (HP) for dislocations/twins. Nano-contacts: beyond JKR/DMT models. Granular model implemented on GPUs and submitted to LAMMPS (with C. Ringl & H. Urbassek, TU Kaiserslautern, Germany).
GRANULAR simulations with LAMMPS
C. RINGL AND H.M. URBASSEK, A LAMMPS implementation of granular mechanics: Inclusion of adhesive and microscopic friction forces, Computer Physics Communications 183 (2012), pp 986-992.
Time series of a cluster–cluster collision, snapshots taken every 5 μs. Color differentiates the original cluster to which a grain belongs. Center-of-mass frame. Velocity v = 5 m/s (v = 29 vfrag) and impact parameter b = 0.8 R were selected.
A model for μm-sized grain–grain interaction which exhibits the essential features necessary to describe collision, agglomeration and fragmentation processes. The model has been efficiently implemented in the LAMMPS code. In addition to existing models, adhesive forces and — gliding, rolling, and torsional — friction processes are implemented.
GRANULAR simulations: benchmarks on GPU (extension of USER-CUDA)
The 7.5e4 curve represents the results obtained in C. Ringl, Comp. Phys. Comm. 183 (2012).
CPU: AMD Phenom x6 1055t 2.8GHz. GPU: NVIDIA Tesla c2050.
AVG speedup: GPU vs 1 CPU core = 7x; GPU vs 6 CPU cores = 2.95x.
GPU version developed by Emmanuel N. Millán. CPU version developed by Christian Ringl (Comp. Phys. Comm. 183, 2012). Code submitted to LAMMPS repository.
Millan et al. A GPU implementation for improved granular simulations with LAMMPS. HPCLatAm 2013, pp. 89-100 (full paper). Session: GPU Architecture and Applications. C. Garcia Garino and M. Printista (Eds.) Mendoza, Argentina, July 29-30, 2013. http://hpc2013.hpclatam.org/papers/HPCLatAm2013-paper-10.pdf
Benchmarks: Clusters
Granular simulation with the GranularEasy pair style, with 4.48e6 grains and 1000 steps, for 1 through 64 processes, on the Mendieta and ICB-ITIC clusters. Various GPUs are tested: C2050, C2075 and M2090.
One Tesla c2050 GPU ≈ 16 CPU cores of the ICB-ITIC cluster.
Mendieta Tesla M2090 GPUs: best performance using 4 GPUs in two cluster nodes, a speedup of ~4.2x against the best CPU result (ICB-ITIC cluster with 16 CPU cores).
Benchmarks: Communication
GPU Granular simulation, 1000 steps with 4.48e6 grains. For six processes, two simulations are shown, one has an elongated, half empty heterogeneous box, and the second has a cubic homogeneous box filled with grains. Each grain has 2-5 neighbors.
GPU Melt simulation, Lennard-Jones potential, 1000 steps with 256e3 atoms; each atom has ~70 neighbors with a 2.5σ cutoff and ~500 neighbors with a 5.0σ cutoff.
Simulation of settlement dynamics in arid environments
Problem: which factors influence the spatial distribution of livestock settlements in the NE of Mendoza? Use Monte Carlo simulation.
E. Millán (ICB), E. Bringa (ICB), C. Garcia Garino (ITIC), L. Forconesi (ITU), S. Goiran & J. Aranibar (ICB, CCT-Mendoza), submitted to Ecological Modelling (2014).
Complexity: large parameter space + neighbor search
Variables:
Distance to road
Distance to river
Settlements distance
Water table depth
Vegetation degradation
(need neighbors!!)
Objective: find the optimal solution, minimizing error.
One is real, the other simulated: which is which?
Millán et al., submitted to Ecological Modelling (2014)
Vegetation degradation to 5th nearest neighbors
Need ~2.5 million independent simulations:
a) use multicores
b) use GPU: speed-up?
“Numerical” experiments using LAMMPS: parameter sweep for cluster collisions
Need to sweep over relative orientation, velocity, R, etc.
Goal: reduce the makespan of multiple jobs executing parallel processes on both the CPU and the GPU.
Ad-hoc strategy: split jobs between CPU & GPU. Could be improved with job-scheduling tools.
Different parallel modes considered: process the parametric study on a multicore CPU workstation using OpenMPI; process the parametric study on the GPU; hybrid studies, with a RUBY script assigning workload to both CPU and GPU according to a predefined strategy, MPI plus dynamic or static load balancing.
Only up to 10 simultaneous jobs in single GPU, due to memory limitations.
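The ad-hoc CPU & GPU split described above behaves like greedy list scheduling: each worker pulls the next job as soon as it is free, so the faster device naturally takes more jobs. A hypothetical serial sketch (toy stand-in; the real study used a RUBY script driving LAMMPS processes):

```python
# Greedy work-queue sketch for splitting independent jobs between CPU and
# GPU workers: the earliest-free worker takes the next job, so faster
# devices absorb proportionally more work and the makespan shrinks.
import heapq

def makespan(n_jobs, worker_speeds):
    """worker_speeds: seconds per job for each worker (GPU entries smaller)."""
    free_at = [(0.0, w) for w in range(len(worker_speeds))]  # (free time, id)
    heapq.heapify(free_at)
    for _ in range(n_jobs):
        t, w = heapq.heappop(free_at)            # earliest-free worker
        heapq.heappush(free_at, (t + worker_speeds[w], w))
    return max(t for t, _ in free_at)

# Toy numbers: 100 identical jobs; one GPU worker 4x faster than each of
# six CPU cores (hypothetical timings, not measured values).
hybrid = makespan(100, [2.5] + [10.0] * 6)       # GPU + 6 CPU cores
cpu_only = makespan(100, [10.0] * 6)             # 6 CPU cores alone
```

In this toy case the greedy queue reaches the balanced makespan (the GPU ends up taking 40 of the 100 jobs), illustrating why dynamic assignment beats a fixed static split when device speeds differ.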
•Discrete model: regular grid of cells (in N dimensions), each in one of a finite number of states, such as "On" and "Off".
•For each cell, define a neighborhood
• Discrete time. An initial state (time t=0) is selected by assigning a state for each cell.
•A new generation is created (advancing t by 1), according to some fixed “local” rule (generally, a mathematical function) that determines the new state of each cell in terms of the current state of the cell and the states of the cells in its neighborhood.
http://en.wikipedia.org/wiki/Cellular_automata
Cellular automaton (pl. cellular automata, CA)
Von Neumann’s neighborhoods
Moore’s neighborhoods
Rule 30 cellular automaton:
current pattern:           111 110 101 100 011 010 001 000
new state for center cell:  0   0   0   1   1   1   1   0
S. Wolfram: A New Kind of Science
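The rule table above maps directly onto bit operations: the 3-cell neighborhood forms a number p from 0 to 7, and bit p of the rule number gives the new state. A minimal sketch (hypothetical toy code; any of the 256 elementary rules works the same way):

```python
# Elementary cellular automaton step: each new cell is determined by the
# 3-cell neighborhood above it; the rule number's bits give the outputs
# for patterns 111, 110, ..., 000.

def ca_step(cells, rule=30):
    n = len(cells)
    out = []
    for i in range(n):
        # periodic boundary: wrap the neighborhood around the row ends
        pattern = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        out.append((rule >> pattern) & 1)
    return out

# Single live cell in the middle; print a few generations of Rule 30.
row = [0] * 15
row[7] = 1
for _ in range(5):
    print("".join(".#"[c] for c in row))
    row = ca_step(row)
```

Checking bit p of rule=30 (binary 00011110) reproduces exactly the table above: only patterns 100, 011, 010 and 001 produce a live cell.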
Biology: pattern formation (seashells, chromatophores in cephalopod skin), leaf stomata, neurons, flocking, etc.
Chemistry: reaction-diffusion equations, self-assembly.
Physics: granular systems.
Mathematics: numerical differential-equation solvers.
Computer Science: Turing machines, cryptography, prime number generation, gaming (Minecraft), etc.
Miscellanea: stock market, car traffic flow, fire propagation (Printista et al., UNSL), crowd behavior, etc.
Some CA applications
PRE 84, 056213 (2011)
HPC for CA in CPU+GPU, using OMP, MPI and CUDA
Flexible CA in parallel arch using: OpenMP for multicore computers MPI for CPU clusters CUDA for GPUs CUDA + MPI for GPU clusters.
Paradigm: Game of Life (GoL). 2 states, N×M grid.
Moore neighborhood
http://golly.sourceforge.net, for Linux, Windows, Mac
• If an alive cell has fewer than 2 alive neighbors, it dies (loneliness).
• If an alive cell has more than 3 alive neighbors, it dies (overcrowding).
• If an alive cell has either 2 or 3 alive neighbors, it goes on living (happiness).
• If a dead cell has exactly 3 alive neighbors, it comes alive (reproduction).
• Otherwise it stays dead.
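The rules above reduce to a single neighbor count over the Moore neighborhood. A minimal sketch with periodic borders (hypothetical toy code, not the CUDA/MPI implementation benchmarked later):

```python
# One Game of Life generation on an N x M grid with periodic borders,
# applying the four rules via the live-neighbor count.

def gol_step(grid):
    n, m = len(grid), len(grid[0])
    new = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            alive = sum(grid[(i + di) % n][(j + dj) % m]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)
                        if (di, dj) != (0, 0))        # Moore neighborhood
            if grid[i][j] == 1:
                new[i][j] = 1 if alive in (2, 3) else 0   # survival
            else:
                new[i][j] = 1 if alive == 3 else 0        # reproduction
    return new

# A "blinker" oscillates between vertical and horizontal with period 2.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0]]
```

The double-grid trick from the baseline implementations below corresponds to building `new` separately and swapping it in, so no cell sees a half-updated neighborhood.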
A puffer-type breeder (red) that leaves glider guns (green) in its wake, which in turn create gliders (blue). Emergence?
Baseline GoL implementations
Serial: 2 grids, main and secondary. For each cell of main, apply the rules and save on secondary; then copy the secondary grid into the main grid.
MPI: domain decomposition; each process solves a block and exchanges borders with its neighbor blocks.
GPU: 2 kernels, K1 (system evolution) and K2 (update state for the next iteration); evolution entirely on the GPU.
GPU+MPI: each MPI process updates its block boundaries.
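The MPI border exchange above can be pictured serially: before every iteration each block is padded with "ghost" copies of its neighbors' boundary rows (a hypothetical stand-in for what MPI_Sendrecv does in the real code):

```python
# Serial stand-in for the MPI halo exchange: the global grid is split into
# row blocks; each block receives ghost copies of the boundary rows of its
# (periodic) neighbors, so it can evolve its interior independently.

def split_with_halos(grid, nblocks):
    n = len(grid)
    rows = n // nblocks                         # assume n divisible by nblocks
    blocks = []
    for b in range(nblocks):
        top_ghost = grid[(b * rows - 1) % n]            # last row of previous block
        bottom_ghost = grid[(b * rows + rows) % n]      # first row of next block
        blocks.append([top_ghost] + grid[b * rows:b * rows + rows] + [bottom_ghost])
    return blocks
```

Each process then applies the GoL rules only to its interior rows and refreshes the ghost rows at every step; the "Halo" timings below measure exactly that communication.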
Hardware: Cluster ICB-ITIC (128 CPU cores -AMD-, 1 Tesla C2050, UN Cuyo). Cluster Mendieta (128 cores -Intel-, 12 Tesla M2090, UN Córdoba).
Results for Mendieta Cluster (1e3 iterations)
Preliminary timing (CACIC2013) Significantly improved by now
Halo: communication time
Results: Speedups
Workstation Phenom, Tesla C2050 Cluster Mendieta, Tesla M2090
Millán et al., CACIC 2013
Can these baseline results be improved?You bet! Wait till Friday for a discussion ….
Results: Code improvements + Scaling
Cluster ICB-ITIC
384e6 cells
MPI: AMD vs Intel issues. GPU+MPI: not very good so far…
GoL animation: 512x512, 5000 steps
Beyond CA: Agent-Based Modeling (ABM). Flexible agent-based simulation for pedestrian modeling on GPUs.
Richmond Paul, Coakley Simon, Romano Daniela (2009), "Cellular Level Agent Based Modelling on the Graphics Processing Unit", Proc. of HiBi09 - High Performance Computational Systems Biology, 14-16 October 2009, Trento, Italy
Several ABMs for CPU clusters, but there are very few multi-GPU ABMs that I know of ….
“Reaction-diffusion equation” to model foams
D. Schwen, A. Caro (LANL), D. Farkas (Va Tech)
Uses the Cahn-Hilliard equation to generate 3D foam. OpenCL code by Schwen needs modifications for future research.
http://en.wikipedia.org/wiki/Spinodal_decomposition
Plasma exposed W-C surface Takamura et al., Plasma and Fusion Research 1, 51 (2006)
Bringa et al, NanoLetters (2012)
Simulated X-Ray diffraction (use cufftw) A. Higginbotham, M. Suggit, J.S. Wark (U. Oxford).
Twin detection in bcc metals: Suggit et al, Phys. Rev. B (2013)
Experimental geometry: 50 × 50 mm film, placed 30 mm in transmission, 8.05 keV (Cu Kα) X-rays, perpendicular to the film.
[Diffraction images: unshocked, elastic, and phase-changed (hcp, fcc, bcc) regions]
Fe phase change: Gunkelmann et al, Phys. Rev. B (2014)
Image processing for medical applications. F. Fioretti, D. Lemma, E. Millán, E. Bringa (ICB, UNCuyo) & R. Isoardi (FUESMEN)
“Implementation of a Bayesian Statistics Segmentation Algorithm in GPUs”, HPCLatAm (2012) conference proceedings.
Speed-up: 60x on a Tesla 2050 with respect to the original CPU code; 15x with respect to a highly optimized CPU-OMP code.
Data reduction on the GPU to classify voxels as white matter, gray matter, or cerebrospinal fluid.
Future perspectives
• Multicore + GPU clusters: challenges in load balancing, communication and synchronization.
• MD often requires more CPU/GPU time for analysis than for time evolution (months versus days) → need parallel processing.
• Need smart algorithms which scale well in parallel archs taking advantage of link-cells.
• Need parallel viz tools for samples with more than a few million atoms (generally not the case in chemistry/biology). Tachyon for GPUs?
• New computers and novel algorithms are allowing comparison of simulation and experimental data.
• GPU processing has bright future!
ICB-ITIC cluster + new requests, for the nano scale
ICB cluster (~$135,000, PICT funds; Aranibar, Bringa, Santos, Del Popolo): 2 nodes with 64 cores, 128 GB RAM, ~10 TB disk, 1 Tesla 2070
ITIC cluster (~$60,000, Agencia funds, Garcia Garino). GPU cluster (~$35,000, SeCTyP, PICT; Aranibar, Bringa, Del Popolo, Garcia Garino):
3 nodes with 24 cores, 48 GB RAM, ~3 TB disk, 6 GeForce GT620. Joined SNCAD (12/2012).
Funds for 4x the computing power, including 4 PICTs and the ICB PME.
PICTe proposal ($850K; Ambrosini, Bringa, Del Popolo, Garcia Garino, Santos): 5 nodes (each with 64 cores, 128 GB RAM, ~4 TB disk, 1 K20x) + Infiniband.
Argentina has "nano"-scale investment and human resources in HPC: we are thinking of O(1e1) GPUs or O(1e2) CPUs, while clusters with O(1e4) GPUs and O(1e6) CPUs exist.
Opportunities for interested students!
https://sites.google.com/site/simafweb/ Web master: M.J. Erquiaga; design: E. Rim
SiMAF: Simulations in Materials Science, Astrophysics, and Physics
Funding: PICT2008-1325, PICT2009-0092, SeCTyP U.N. Cuyo
That’s all folks!