GPUs for Scientific Applications · 2014-05-07
TRANSCRIPT
GPUs for Scientific Applications
Eduardo M. Bringa ([email protected]) & Emmanuel Millán CONICET / Instituto de Ciencias Básicas, Universidad Nacional de Cuyo, Mendoza
3er Escuela Argentina de GPGPU para Aplicaciones Científicas, CAB, May 2014
Collaborators: D. Tramontina, C. Ruestes, F. Fioretti, C. Garcia Garino (UN Cuyo), L. Forconesi (ITU), R. Isoardi (FUESMEN), S. Manzi, E. Perino, M.F. Piccoli, M. Printista (UNSL), D. Schwen (LANL), A. Higginbotham (Oxford)
Simulations at different scales: HPC allows going from nano to micro
By Greg Odegard, NASA Langley Research Center
University of Virginia, MSE 492/627: Introduction to Atomistic Simulations, Leonid Zhigilei, http://www.people.virginia.edu/~lz2n/mse627
HPC
• N classical particles. Particle i, at position ri, has velocity vi and acceleration ai.
• Particles interact through an empirical potential, V(r1,…, ri,…, rN), which generally includes many-body interactions.
• Particles obey Newton's equations of motion. For particle i with mass mi: Fi = −∇iV(r1,…, ri,…, rN) = mi ai = mi (d²ri/dt²)
• Volume < 0.5 µm³ (~10⁹ atoms)
• Times t < 1 ns, ∆t ~ 1 fs
• Several integrators available
• Electronic effects can be incorporated (Koci et al., PRB 2006).
A very useful tool to study materials: classical Molecular Dynamics = MD
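The equations of motion above are usually integrated with velocity Verlet. A minimal sketch follows (hypothetical toy code: a single particle in a 1D harmonic potential stands in for the many-body V mentioned above):

```python
# Minimal velocity-Verlet MD step (toy 1D harmonic oscillator).
# In real MD codes the force comes from -grad V of a many-body potential.

def force(x, k=1.0):
    """F = -dV/dx for V(x) = 0.5*k*x^2 (stand-in for an empirical potential)."""
    return -k * x

def velocity_verlet(x, v, dt, m=1.0, steps=1000):
    a = force(x) / m
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / m              # new acceleration at new position
        v += 0.5 * (a + a_new) * dt       # velocity update with averaged a
        a = a_new
    return x, v

# Integrate roughly one oscillator period (T = 2*pi for k = m = 1).
x, v = velocity_verlet(x=1.0, v=0.0, dt=0.001, steps=6283)
```

Production codes apply the same two half-updates per step to every particle, with forces from the interatomic potential; the scheme is time-reversible and conserves energy well over long runs.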
[Diagram: pairwise forces Fij, Fji, Fik, Fki, Fjk, Fkj among particles i, j, k]
Interatomic potential (Phys/Eng) or Force Field (Chem/Bio)
http://en.wikipedia.org/wiki/Force_field_chemistry
Example
Golden rule: “garbage in, garbage out”
With MD you can obtain….
“Real” time evolution of your system.
Thermodynamic properties, including T(r,t) temperature profiles that can be used in rate equations.
Mechanical properties, including elastic and plastic behavior.
Surface/bulk/cluster growth and modification.
X-ray and “IR” spectra
Etcetera …
• Can simulate only small samples (L < 1 µm, up to ~10⁹ atoms).
• Can simulate only short times (t < 1 µs, because ∆t ~ 1 fs).
• Computationally expensive (weeks).
• Potential's golden rule: garbage in, garbage out.
• Interaction potentials for alloys, molecular solids, and excited species are not well known.
• Despite its limitations, MD is a very powerful tool to study nanosystems.
Limitations of MD
Atomistic simulations are extremely helpful but … still have multiple limitations
How do we simulate a large number of atoms?
• Integrating the two-body problem is one thing… but integrating the motion of N particles, with N = several million to billions, is a whole different ball game.
• Short-range potentials (not 1/r): use an appropriate cut-off and do spatial decomposition of the domain. This will ensure nearly perfect parallel scaling [O(N)]. Sometimes a VERY long cut-off is used for (1/r) potentials, with varying results.
• Long-range potentials (1/r): old method uses Ewald summation. New methods (PME,PPPM=P3M, etc.) are typically O(NlogN). Even newer methods (variations of multipole expansion) can be O(N), at the price of a large computational overhead. This is the same as the problem of N-body simulations used in astrophysics.
• Have to be careful with boundary conditions (free, periodic, expanding, damping, etc.) and check for system size effects.
Alejandro Strachan, http://nanohub.org/resources/5838#series
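The short-range-cutoff bullet above is what makes O(N) scaling possible: instead of testing all N² pairs, particles are binned into cells of side ≥ rcut and only the neighboring cells are searched. A minimal 2D sketch (hypothetical; real codes like LAMMPS build per-MPI-domain cell and neighbor lists):

```python
# Cell-list sketch: with a cutoff rc, bin particles into cells of side >= rc,
# then search only the 3x3 cell neighborhood instead of all N^2 pairs.
import itertools

def count_pairs_within(positions, box, rc):
    ncell = max(1, int(box // rc))          # cells per side (side >= rc)
    side = box / ncell
    cells = {}
    for i, (x, y) in enumerate(positions):
        cells.setdefault((int(x / side) % ncell, int(y / side) % ncell), []).append(i)
    pairs = 0
    for (cx, cy), members in cells.items():
        for dx, dy in itertools.product((-1, 0, 1), repeat=2):
            neigh = cells.get(((cx + dx) % ncell, (cy + dy) % ncell), [])
            for i in members:
                for j in neigh:
                    if i < j:               # count each pair exactly once
                        ddx = positions[i][0] - positions[j][0]
                        ddy = positions[i][1] - positions[j][1]
                        # minimum-image convention for periodic boundaries
                        ddx -= box * round(ddx / box)
                        ddy -= box * round(ddy / box)
                        if ddx * ddx + ddy * ddy < rc * rc:
                            pairs += 1
    return pairs
```

Production codes add a neighbor "skin" on top of rcut and rebuild the lists only every few steps, which is where the skin values quoted in the benchmarks below come from.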
Many MD codes can now use GPU acceleration. Often used as black boxes without understanding limitations.
AMBER (Assisted Model Building with Energy Refinement): http://ambermd.org/gpus/ Ross Walker. MPI for several GPUs/cores. TIP3P, PME, ~10⁶ atoms max (Tesla C2070).
HOOMD-Blue (Highly Optimized Object-oriented Many-particle Dynamics): http://codeblue.umich.edu/hoomd-blue/index.html OMP for several GPUs in single board.
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator): http://lammps.sandia.gov/ . MPI for several GPUs/cores (LJ: ~1.2×10⁷ atoms max, Tesla C2070). GPULAMMPS: http://code.google.com/p/gpulammps/ CUDA + OpenCL
DL_POLY: http://www.cse.scitech.ac.uk/ccg/software/DL_POLY/ F90+MPI, CUDA+OpenMP port.
GROMACS: http://www.gromacs.org/Downloads/Installation_Instructions/Gromacs_on_GPUs Uses OpenMM libs (https://simtk.org/home/openmm). No parallelization. ~10⁶ atoms max.
NAMD (“Not another” MD): http://www.ks.uiuc.edu/Research/namd/ GPU/CPU clusters. VMD (Visual MD): http://www.ks.uiuc.edu/Research/vmd/
GTC 2010 Archive: videos and pdf’s: http://www.nvidia.com/object/gtc2010-presentation-archive.html#md
1,000,000+ atom Satellite Tobacco Mosaic Virus
Freddolino et al., Structure, 14:437-449, 2006. Many more!
Highly Optimized Object-oriented Many-particle Dynamics – HOOMD-Blue: http://codeblue.umich.edu/hoomd-blue/index.html
• General purpose molecular dynamics simulations fully implemented on graphics processing units. Joshua A. Anderson, Chris D. Lorenz, and Alex Travesset, Journal of Computational Physics 227 (2008) 5342-5359.
• Molecular Dynamics on Graphic Processing Units: HOOMD to the Rescue. Joshua A. Anderson and Alex Travesset, Computing in Science & Engineering 10(6) (2008).
http://www.nvidia.com/object/molecular_dynamics.html
HOOMD-Blue 0.11.2 benchmarks (64K particles, ECC off; TPS = timesteps per second):

Host CPU                     | GPU         | Polymer TPS | LJ liquid TPS
Xeon E5-2670 (SandyBridge-E) | Tesla K20X  | 885         | 1014
Xeon E5-2667 (SandyBridge-E) | Tesla K20m  | 785         | 890
Xeon E5-2670 (SandyBridge-E) | Tesla M2090 | 597         | 682
Xeon X5670 (Westmere)        | Tesla M2070 | 486         | 553
Xeon X5670 (Westmere)        | Tesla M2050 | 490         | 560
NAMD + VMD (MD+Viz for massive CPU-GPU clusters)
NAMD: http://www.ks.uiuc.edu/Research/namd/ VMD (Visual MD): http://www.ks.uiuc.edu/Research/vmd/
1,000,000+ atom simulationSatellite Tobacco Mosaic Virus
Freddolino et al., Structure 14, 437 (2006)Zhao et al., Nature 497, 643 (2013)
64,000,000 atom simulation HIV-1 CAPSID
GPU surface rendering http://www.ks.uiuc.edu/Research/gpu/
Summary: there are many opportunities for MD
• Petascale → Exascale! (USA & EU initiatives). Science 335, 394 (2012).
• New software: novel algorithms and preferably open source [Nature 482, 485 (2012)]. Still need significant advances in visualization (Visual Strategies: A Practical Guide to Graphics for Scientists and Engineers, F. Frankel & A. Depace, Yale University Press, 2012), dataset analysis [Science 334, 1518 (2011)], self-recovery & fault tolerance, etc.
• New hardware : better (faster/greener/cheaper) processors, connectivity, memory and disk access; MD-tailored machines (MD-GRAPE-4, Anton, etc.); GPUs, MICs, hybrid architectures (GPU/CPU); cloud computing, etc.
• Experiments are reaching micro–nano and ns–ps scales, the same as MD.
• Can go micron-size, but still have to connect to the mm–m scale: novel approaches needed, including smart sampling, concurrent coupling, dynamic/adaptive load balancing/refining for heterogeneous systems, asynchronous simulations, etc.
• Need human resources with a mix of hardware, software & science expertise.
LAMMPS (http://lammps.sandia.gov/ )
Some of my personal reasons to use LAMMPS:
1) Free, open source (GNU license).
2) Easy to learn and use:
(a) mailing list in sourceforge.
(b) responsive developers and user community.
(c) extensive docs: http://lammps.sandia.gov/doc/Section_commands.html#3_5
3) It runs efficiently on my laptop (8 cores) and on BlueGene/L (100 K cores), including parallel I/O, with the same input script. Also efficient on GPUs.
4) Very efficient parallel energy minimization, including cg & FIRE.
5) Includes many-body, bond order, & reactive potentials. Can simulate inorganic & bio systems, granular and CG systems.
6) Can do extras like DSMC, TAD, NEB, TTM, semi-classical methods, etc.
7) Extensive set of analysis routines: coordination, centro, cna, etc.
8) Easy to write analysis inside input, using something similar to pseudo-code.
LAMMPS: GPU execution

GPU package: novel, most up-to-date GPU code. GNU GPL v2 license. Main developers: Paul Crozier (Sandia), Mike Brown, Arnold Tharrington, Scott Hampton (Oak Ridge), Axel Kohlmeyer (Temple), Christian Trott, Lars Winterfeld (Ilmenau, Germany), Duncan Poole, Peng Wang (Nvidia), etc.
Multiple MPI processes can execute on the same GPU card. The Geryon library supports OpenCL and CUDA; see
https://github.com/CFDEMproject/LAMMPS/tree/master/lib/gpu/geryon
USER-CUDA package: main developer Christian Trott. Supports only CUDA. For some types of simulations it is faster than the GPU package. Only one MPI process can run on a GPU. Uses less memory than the GPU package. Supports granular simulations. For more information see: http://lammps.sandia.gov/doc/package.html
http://lammps.sandia.gov/doc/Section_accelerate.html#acc_8
GPU LAMMPS Parallel Performance
Benchmarks: http://users.nccs.gov/~wb8/gpu/bench.htm • Also: http://sites.google.com/site/akohlmey/software/lammps-benchmarks
864K atoms LJ liquid, reduced density=0.8442. NVE, rcut= 2.5σ, 5000 steps. Speedup ~ 3-4.
Rhodopsin protein in a solvated lipid bilayer. CHARMM force field, long-range Coulombics via PPPM, SHAKE constraints. Counter-ions and a reduced amount of water. 32K atoms, 1000 timesteps, LJ rcut = 1 nm, neighbor skin of 1.0 σ, NPT. Speedup ~ 1.7-3.
Gay-Berne ellipsoids125 K atoms, NVE, rcut= 7σ, 1000 steps. Speedup ~ 9-11.
Yona cluster: 15 nodes, each with 2x6-core AMD Opteron 2435 (2.6GHz) & 2 Tesla C2050 GPUs (3GB GDDR5 memory, 448 cores at 1.15GHz, memory bandwidth 144GB/s). GPUs connected via PCIe x16 gen 2.0 slots. ECC support enabled. Mellanox MT26428 QDR InfiniBand interconnect.
Benchmarks for load balancing
The performance impact resulting from splitting the force calculation between the host and device will depend on the CPU core to device ratio and the relative rates of force calculation on the host and device.
Processes per node (ppn). Dynamic load balancing (LB). Neighboring performed on the GPU (GPU-N). LJ, N=864000.
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
Benchmarks for different precision modes
256000 atoms LJ liquid, reduced density=0.8442. NVE, rcut= 2.5σ, 5000 steps.
Rhodopsin protein in a solvated lipid bilayer. CHARMM force field, long-range Coulombics via PPPM, SHAKE constraints. Counter-ions and a reduced amount of water to make a 32K atom system, replicated 2x2x2 to create the box. 256,000 atoms, 1000 timesteps, LJ rcut = 1 nm, neighbor skin of 1.0 σ, NPT.
• Single precision OK for many runs, but use at your peril! Colberg & Höfling, Comp. Phys. Comm. 182 (2011) 1120–1129.
• Mixed precision (single for positions and double for forces) nearly as fast as single precision!
• Double precision still cheaper than CPU.
More benchmarks …
Strong scaling benchmark using LJ. cutoff of 2.5 and N=864 K.
LJ. Single node.
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
LAMMPS: OpenCL vs CUDA
Single node. Code compiled with CUDA and OpenCL. N=256K with neighboring performed on the GPU. Time normalized by the time required to complete the simulation loop with CUDA.
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
CPU versus CPU-GPU Speedups
Implementing molecular dynamics on hybrid high performance computers – short range forces. W. Michael Brown, Peng Wang, Steven J. Plimpton and Arnold N. Tharrington. Comp. Phys. Comm. 182 (2011) 898–911.
LAMMPS Scaling with GPU
GPU: NVIDIA Tesla c2050, 3 GB RAM, 448 processing units. CPU: AMD Phenom x6 1055t 2.8GHz, 12 GB RAM.
[Chart: LJ melt example, 1000 steps, AMD Phenom x6 1055T and NVIDIA Tesla c2050. Wallclock time vs number of atoms (4,000 to 13,720,000) for CPU (2 cores), GPU single precision, and GPU double precision; GPU speedup ~8x.]

[Chart: melt, gpulammps, 1 core + Tesla c2070, 100 steps. Time breakdown vs N (4e6 to ~1.1e8 atoms): total, other, force, neigh, output, comm.]
Interstellar dust plays an important role in astrophysical processes. Grain size matters for evolution, astro-chemistry, etc.
[Diagram: light from a star crosses an interstellar cloud of gas and dust; absorption and scattering shape the observed spectrum of the star. Dust matters for star formation, cooling rates, thermal balance, chemistry/biology, planet formation, meteorites/asteroids, and black holes.]
• Grain diameters: 0.005 µm – 5 µm
• Size distribution: n(a)∝ a -3.5
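Grain sizes following the n(a) ∝ a⁻³·⁵ distribution above can be drawn by inverse-transform sampling. A sketch, assuming the 0.005–5 µm bounds quoted on the slide:

```python
# Inverse-transform sampling of the MRN-like grain size distribution
# n(a) ∝ a^(-3.5) on [a_min, a_max] (bounds from the slide, in microns).
import random

def sample_grain_size(a_min=0.005, a_max=5.0, p=3.5):
    u = random.random()
    # The CDF of a^-p involves a^(1-p); invert it analytically.
    lo, hi = a_min ** (1.0 - p), a_max ** (1.0 - p)
    return (lo + u * (hi - lo)) ** (1.0 / (1.0 - p))

random.seed(0)
sizes = [sample_grain_size() for _ in range(100000)]
# Small grains dominate the number counts: most samples land near a_min.
```

The same draw can seed grain radii for the granular collision ensembles discussed later.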
Supernova 1987A
Modification of materials: phase transitions and damage to space hardware, and due to shrapnel (NIF).
Large-scale MD links nano and microscales in damage induced by nanoprojectiles
Only dislocations + liquid atoms shown, ~300×10⁶ atoms
C. Anders et al., PRL 108, 027601 (2012)
Granular mechanics of nano-grain collisions: Ringl et al., Ap.J. 752 (2012) 151.
New granular friction scheme implemented for GPUs by E. Millán.
Granular mechanics of grain-surface collisions: Ringl et al., PRE 86, 061313 (2012).
KALEIDOSCOPE
COMPLEXITY in cluster collisions
Parameters: velocity (v), impact parameter (x), radius (d/2), structure, orientation of the lattice.
[Snapshots at 0, 1.6, 3.1, 4.6, 6.1 and 16.1 ps]
N. Ohnishi et al., “Numerical analysis of nanograin collision by classical molecular dynamics,” J. Phys. Conf. Series 112 (2008) 042017. Run on 256 cores (LLNL).
COMPLEX CLUSTER STRUCTURE
Stretching separation
Sliding and locking
Droplet. Sticking.
Additional parameters:
Shape Porosity Speed of sound in the material Fragmentation velocity
Ringl, Bringa, Bertoldi, and Urbassek, Astrophysical Journal (2011). Run on 48 cores, 128 GB RAM (Germany).
Kalweit and Drikakis, Phys. Rev. B 74, 235415 (2006).
V = 5 m/s; X = 0.8R
Emmanuel N. Millán et al., XVIII CACIC, http://sedici.unlp.edu.ar/handle/10915/23634
Nanograin collisions – GPU vs CPU performance
150 simulations. The GPU/CPU test runs 5 processes with independent simulations in parallel using six CPU cores: one CPU process with 2 cores and 4 GPU processes.
Results for different impact parameters
b= impact parameter, and R=radius of spherical grains (R=6.7 σ), SP= single precision.
Largest defect content for central collisions, b=0.
Results at different velocities
Final configuration for V (LJ units): (a) V=0.3, (b) V=1, (c) V=3, and (d) V=6.
Only part of the atoms shown, to allow the visualization of SF atoms (grey)
Grains shown in teal and red → lack of mixing at low velocities.
What is the threshold velocity for plasticity? Continuum models typically neglect nanosize and strain rate effects
Plastic Threshold
~2,000 atoms, vthresh=0.55: (a) v=0.54; (b) v=0.56; (e) v=0.54, 18 atoms in SF; (f) v=0.56, 191 atoms in SF.
~20,000 atoms, vthresh=0.37: (c) v=0.35; (d) v=0.41; (g) v=0.35, 28 atoms in SF; (h) v=0.41, 759 atoms in SF.
~200,000 atoms, vthresh=0.225: (i) v=0.2; (j) v=0.25; (m) v=0.2, 34 atoms in SF; (n) v=0.25, 3670 atoms in SF.
~2,000,000 atoms, vthresh=0.11: (k) v=0.1; (l) v=0.15; (o) v=0.1, 248 atoms in SF; (p) v=0.15, 24913 atoms in SF.
Fraction of defective atoms changes with velocity and radius
Plastic threshold inversely proportional to grain radius.
Each point needs 10+ runs over different orientations
Plasticity threshold in grain-grain impacts
Millán, Tramontina, et al., Anales MACI (2013). FCC stacking faults and twins.
Dislocation-based model by Lubarda et al. agrees with MD. Millán, Tramontina, et al., to be submitted (2014).
GPUs + CPUs to run ~10000 independent MD simulations.
Granular models typically assume lack of plasticity.
V. Lubarda et al., Acta Mater. 52 (2004) 1397. S. Jung et al., Aerosol Sci. 50 (2012) 26.
On-going research on grain-grain collisions:
Atoms in planar defects give a reasonably good diagnostic for plasticity initiation. Need to quantify dislocation activity and check for dislocation starvation.
Threshold velocity depends on cluster size, but not as in usual continuum models.
On-going studies: statistics of defects (100-10000 orientations). Large cluster sizes → Hall-Petch (HP) for dislocations/twins. Nano-contacts: beyond JKR/DMT models. Granular model implemented on GPUs and submitted to LAMMPS (with C. Ringl & H. Urbassek, TU Kaiserslautern, Germany).
GRANULAR simulations with LAMMPS
C. RINGL AND H.M. URBASSEK, A LAMMPS implementation of granular mechanics: Inclusion of adhesive and microscopic friction forces, Computer Physics Communications 183 (2012), pp 986-992.
Time series of a cluster–cluster collision, snapshots taken every 5 μs. Color differentiates the original cluster to which a grain belongs. Center-of-mass frame. Velocity v = 5 m/s (v = 29 vfrag) and impact parameter b = 0.8 R were selected.
A model for μm-sized grain–grain interaction which exhibits the essential features necessary to describe collision, agglomeration and fragmentation processes. The model has been efficiently implemented in the LAMMPS code. In addition to existing models, adhesive forces and — gliding, rolling, and torsional — friction processes are implemented.
GRANULAR simulations: benchmarks on GPU (extension of USER-CUDA)
The 7.5e4 curve represents the results obtained in C. Ringl, Comp. Phys. Comm. 183 (2012).
CPU: AMD Phenom x6 1055t 2.8GHz. GPU: NVIDIA Tesla c2050.
AVG speedup: GPU vs 1 CPU core = 7x; GPU vs 6 CPU cores = 2.95x.
GPU version developed by Emmanuel N. Millán. CPU version developed by Christian Ringl (Comp. Phys. Comm. 183, 2012). Code submitted to LAMMPS repository.
Millan et al. A GPU implementation for improved granular simulations with LAMMPS. HPCLatAm 2013, pp. 89-100 (full paper). Session: GPU Architecture and Applications. C. Garcia Garino and M. Printista (Eds.) Mendoza, Argentina, July 29-30, 2013. http://hpc2013.hpclatam.org/papers/HPCLatAm2013-paper-10.pdf
Benchmarks: Clusters
Granular simulation with the GranularEasy pair style, with 4.48e6 grains and 1000 steps, for 1 through 64 processes, on the Mendieta and ICB-ITIC clusters. Various GPUs are tested: C2050, C2075 and M2090.
One Tesla c2050 GPU ≈ 16 CPU cores of the ICB-ITIC cluster.
Mendieta Tesla M2090 GPUs: best performance using 4 GPUs in two cluster nodes, a speedup of ~4.2x against the best CPU result (ICB-ITIC cluster with 16 CPU cores).
Benchmarks: Communication
GPU Granular simulation, 1000 steps with 4.48e6 grains. For six processes, two simulations are shown, one has an elongated, half empty heterogeneous box, and the second has a cubic homogeneous box filled with grains. Each grain has 2-5 neighbors.
GPU Melt simulation, Lennard-Jones potential, 1000 steps with 256e3 atoms; each atom has ~70 neighbors with a 2.5σ cutoff and ~500 neighbors with a 5.0σ cutoff.
Simulation of settlement dynamics in arid environments
Problem: which factors influence the spatial distribution of livestock settlements in the NE of Mendoza? Use Monte Carlo simulation.
E. Millán (ICB), E. Bringa (ICB), C. Garcia Garino (ITIC), L. Forconesi (ITU), S. Goiran & J. Aranibar (ICB, CCT-Mendoza), submitted to Ecological Modelling (2014).
Complexity: large parameter space + neighbor search
Variables:
Distance to road
Distance to river
Settlements distance
Water table depth
Vegetation degradation
(need neighbors!!)
Objective: find the optimal solution, minimizing error.
One is real, the other simulated: which is which?
Millán et al., submitted to Ecological Modelling (2014)
Vegetation degradation to 5th nearest neighbors
Need ~2.5 million independent simulations:
a) use multicores
b) use GPU: speed-up?
“Numerical” experiments using LAMMPS: parameter sweep for cluster collisions
Need to sweep over relative orientation, velocity, R, etc.
Goal: reduce the makespan of multiple jobs executing parallel processes on both the CPU and the GPU.
Ad-hoc strategy: split jobs between CPU & GPU. Could be improved with job-scheduling tools.
Different parallel modes considered: process the parametric study on a multicore CPU workstation using OpenMPI; process the parametric study on the GPU; hybrid studies, with a RUBY script assigning workload to both CPU and GPU according to a predefined strategy, MPI plus dynamic or static load balancing.
Only up to 10 simultaneous jobs in single GPU, due to memory limitations.
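The ad-hoc CPU & GPU split described above behaves like greedy list scheduling: each worker pulls the next job as soon as it is free, so the faster device naturally takes more jobs. A hypothetical serial sketch (toy stand-in; the real study used a RUBY script driving LAMMPS processes):

```python
# Greedy work-queue sketch for splitting independent jobs between CPU and
# GPU workers: the earliest-free worker takes the next job, so faster
# devices absorb proportionally more work and the makespan shrinks.
import heapq

def makespan(n_jobs, worker_speeds):
    """worker_speeds: seconds per job for each worker (GPU entries smaller)."""
    free_at = [(0.0, w) for w in range(len(worker_speeds))]  # (free time, id)
    heapq.heapify(free_at)
    for _ in range(n_jobs):
        t, w = heapq.heappop(free_at)            # earliest-free worker
        heapq.heappush(free_at, (t + worker_speeds[w], w))
    return max(t for t, _ in free_at)

# Toy numbers: 100 identical jobs; one GPU worker 4x faster than each of
# six CPU cores (hypothetical timings, not measured values).
hybrid = makespan(100, [2.5] + [10.0] * 6)       # GPU + 6 CPU cores
cpu_only = makespan(100, [10.0] * 6)             # 6 CPU cores alone
```

In this toy case the greedy queue reaches the balanced makespan (the GPU ends up taking 40 of the 100 jobs), illustrating why dynamic assignment beats a fixed static split when device speeds differ.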
•Discrete model: regular grid of cells (in N dimensions), each in one of a finite number of states, such as "On" and "Off".
•For each cell, define a neighborhood
• Discrete time. An initial state (time t=0) is selected by assigning a state for each cell.
•A new generation is created (advancing t by 1), according to some fixed “local” rule (generally, a mathematical function) that determines the new state of each cell in terms of the current state of the cell and the states of the cells in its neighborhood.
http://en.wikipedia.org/wiki/Cellular_automata
Cellular automaton (pl. cellular automata, CA)
Von Neumann’s neighborhoods
Moore’s neighborhoods
Rule 30 cellular automaton:
current pattern:           111 110 101 100 011 010 001 000
new state for center cell:  0   0   0   1   1   1   1   0
S. Wolfram: A New Kind of Science
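The rule table above maps directly onto bit operations: the 3-cell neighborhood forms a number p from 0 to 7, and bit p of the rule number gives the new state. A minimal sketch (hypothetical toy code; any of the 256 elementary rules works the same way):

```python
# Elementary cellular automaton step: each new cell is determined by the
# 3-cell neighborhood above it; the rule number's bits give the outputs
# for patterns 111, 110, ..., 000.

def ca_step(cells, rule=30):
    n = len(cells)
    out = []
    for i in range(n):
        # periodic boundary: wrap the neighborhood around the row ends
        pattern = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        out.append((rule >> pattern) & 1)
    return out

# Single live cell in the middle; print a few generations of Rule 30.
row = [0] * 15
row[7] = 1
for _ in range(5):
    print("".join(".#"[c] for c in row))
    row = ca_step(row)
```

Checking bit p of rule=30 (binary 00011110) reproduces exactly the table above: only patterns 100, 011, 010 and 001 produce a live cell.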
Biology: pattern formation (seashells, chromatophores in cephalopod skin), leaf stomata, neurons, flocking, etc.
Chemistry: reaction-diffusion equations, self-assembly.
Physics: granular systems.
Mathematics: numerical differential-equation solvers.
Computer Science: Turing machines, cryptography, prime number generation, gaming (Minecraft), etc.
Miscellanea: stock market, car traffic flow, fire propagation (Printista et al., UNSL), crowd behavior, etc.
Some CA applications
PRE 84, 056213 (2011)
HPC for CA in CPU+GPU, using OMP, MPI and CUDA
Flexible CA in parallel arch using: OpenMP for multicore computers MPI for CPU clusters CUDA for GPUs CUDA + MPI for GPU clusters.
Paradigm: Game of Life (GoL). 2 states, N×M grid.
Moore neighborhood
http://golly.sourceforge.net, for Linux, Windows, Mac
• If an alive cell has fewer than 2 alive neighbors, it dies (loneliness).
• If an alive cell has more than 3 alive neighbors, it dies (overcrowding).
• If an alive cell has either 2 or 3 alive neighbors, it goes on living (happiness).
• If a dead cell has exactly 3 alive neighbors, it comes alive (reproduction).
• Otherwise it stays dead.
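The rules above reduce to a single neighbor count over the Moore neighborhood. A minimal sketch with periodic borders (hypothetical toy code, not the CUDA/MPI implementation benchmarked later):

```python
# One Game of Life generation on an N x M grid with periodic borders,
# applying the four rules via the live-neighbor count.

def gol_step(grid):
    n, m = len(grid), len(grid[0])
    new = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            alive = sum(grid[(i + di) % n][(j + dj) % m]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)
                        if (di, dj) != (0, 0))        # Moore neighborhood
            if grid[i][j] == 1:
                new[i][j] = 1 if alive in (2, 3) else 0   # survival
            else:
                new[i][j] = 1 if alive == 3 else 0        # reproduction
    return new

# A "blinker" oscillates between vertical and horizontal with period 2.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0]]
```

The double-grid trick from the baseline implementations below corresponds to building `new` separately and swapping it in, so no cell sees a half-updated neighborhood.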
A puffer-type breeder (red) that leaves glider guns (green) in its wake, which in turn create gliders (blue). Emergence?
Baseline GoL implementations
Serial: 2 grids, main and secondary. For each cell of main, apply the rules and save on secondary; then copy the secondary grid into the main grid.
MPI: domain decomposition; each process solves a block and exchanges borders with its neighbor blocks.
GPU: 2 kernels, K1 (system evolution) and K2 (update state for the next iteration); evolution entirely on the GPU.
GPU+MPI: each MPI process updates its block boundaries.
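The MPI border exchange above can be pictured serially: before every iteration each block is padded with "ghost" copies of its neighbors' boundary rows (a hypothetical stand-in for what MPI_Sendrecv does in the real code):

```python
# Serial stand-in for the MPI halo exchange: the global grid is split into
# row blocks; each block receives ghost copies of the boundary rows of its
# (periodic) neighbors, so it can evolve its interior independently.

def split_with_halos(grid, nblocks):
    n = len(grid)
    rows = n // nblocks                         # assume n divisible by nblocks
    blocks = []
    for b in range(nblocks):
        top_ghost = grid[(b * rows - 1) % n]            # last row of previous block
        bottom_ghost = grid[(b * rows + rows) % n]      # first row of next block
        blocks.append([top_ghost] + grid[b * rows:b * rows + rows] + [bottom_ghost])
    return blocks
```

Each process then applies the GoL rules only to its interior rows and refreshes the ghost rows at every step; the "Halo" timings below measure exactly that communication.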
Hardware: Cluster ICB-ITIC (128 CPU cores -AMD-, 1 Tesla C2050, UN Cuyo). Cluster Mendieta (128 cores -Intel-, 12 Tesla M2090, UN Córdoba).
Results for Mendieta Cluster (1e3 iterations)
Preliminary timing (CACIC2013) Significantly improved by now
Halo: communication time
Results: Speedups
Workstation Phenom, Tesla C2050 Cluster Mendieta, Tesla M2090
Millán et al., CACIC 2013
Can these baseline results be improved?You bet! Wait till Friday for a discussion ….
Results: Code improvements + Scaling
Cluster ICB-ITIC
384e6 cells
MPI: AMD vs Intel issues. GPU+MPI: not very good so far…
GoL animation: 512x512, 5000 steps
Beyond CA: Agent-Based Modeling (ABM). Flexible agent-based simulation for pedestrian modeling on GPUs.
Richmond Paul, Coakley Simon, Romano Daniela (2009), "Cellular Level Agent Based Modelling on the Graphics Processing Unit", Proc. of HiBi09 - High Performance Computational Systems Biology, 14-16 October 2009, Trento, Italy
Several ABMs for CPU clusters, but there are very few multi-GPU ABMs that I know of ….
“Reaction-diffusion equation” to model foams
D. Schwen, A. Caro (LANL), D. Farkas (Va Tech)
Uses the Cahn-Hilliard equation to generate 3D foam. OpenCL code by Schwen needs modifications for future research.
http://en.wikipedia.org/wiki/Spinodal_decomposition
Plasma exposed W-C surface Takamura et al., Plasma and Fusion Research 1, 51 (2006)
Bringa et al, NanoLetters (2012)
Simulated X-Ray diffraction (use cufftw) A. Higginbotham, M. Suggit, J.S. Wark (U. Oxford).
Twin detection in bcc metals: Suggit et al, Phys. Rev. B (2013)
Experimental geometry: 50 × 50 mm film, placed 30 mm in transmission, 8.05 keV (Cu Kα) X-rays, perpendicular to the film.
[Diffraction images: unshocked, elastic, and phase-changed (hcp, fcc, bcc) regions]
Fe phase change: Gunkelmann et al, Phys. Rev. B (2014)
Image processing for medical applications. F. Fioretti, D. Lemma, E. Millán, E. Bringa (ICB, UNCuyo) & R. Isoardi (FUESMEN)
“Implementation of a Bayesian Statistics Segmentation Algorithm in GPUs”, HPCLatAm (2012) conference proceedings.
Speed-up: 60x on a Tesla 2050 with respect to the original CPU code; 15x with respect to a highly optimized CPU-OMP code.
Data reduction on the GPU to classify voxels as white matter, gray matter, or cerebrospinal fluid.
Future perspectives
• Multicore + GPU clusters: challenges in load balancing, communication and synchronization.
• MD often requires more CPU/GPU time for analysis than for time evolution (months versus days) → need parallel processing.
• Need smart algorithms which scale well in parallel archs taking advantage of link-cells.
• Need parallel viz tools for samples with more than a few million atoms (generally not the case in chemistry/biology). Tachyon for GPUs?
• New computers and novel algorithms are allowing comparison of simulation and experimental data.
• GPU processing has bright future!
ICB-ITIC cluster + new requests, for the nano scale
ICB cluster (~$135,000, PICT funds; Aranibar, Bringa, Santos, Del Popolo): 2 nodes with 64 cores, 128 GB RAM, ~10 TB disk, 1 Tesla 2070
ITIC cluster (~$60,000, Agencia funds, Garcia Garino). GPU cluster (~$35,000, SeCTyP, PICT; Aranibar, Bringa, Del Popolo, Garcia Garino):
3 nodes with 24 cores, 48 GB RAM, ~3 TB disk, 6 GeForce GT620. Joined SNCAD (12/2012).
Funds for 4x the computing power, including 4 PICTs and the ICB PME.
PICTe proposal ($850K; Ambrosini, Bringa, Del Popolo, Garcia Garino, Santos): 5 nodes (each with 64 cores, 128 GB RAM, ~4 TB disk, 1 K20x) + Infiniband.
Argentina has "nano"-scale investment and human resources in HPC: we are thinking of O(1e1) GPUs or O(1e2) CPUs, while clusters with O(1e4) GPUs and O(1e6) CPUs exist.
Opportunities for interested students!
https://sites.google.com/site/simafweb/ Web master: M.J. Erquiaga; design: E. Rim
SiMAF: Simulations in Materials Science, Astrophysics, and Physics
Funding: PICT2008-1325, PICT2009-0092, SeCTyP U.N. Cuyo
That’s all folks!