Parallel Code Choices
Where Do We Stand?
• ShakeOut-D 1-Hz Vs=250m/s benchmark runs on Kraken-XT5 and Ranger at full machine scale; Hercules successfully test-run on 16k Kraken-XT4 cores with Vs=200m/s
• Multiple AWP-Olsen ShakeOut-D 1-Hz runs on NICS Kraken-XT5 using 64k processor cores with wall clock time under 5 hours; SORD runs using 16k Ranger cores
• Milestone: the 100k-core mark passed! Recent successful benchmark runs on DOE ANL BG/P using up to 131,072 cores
SCEC capability runs update
[Figure: AWP-Olsen-Day Performance on Some of the World's Most Powerful Supercomputers. X-axis: number of cores (1,000 to 1,000,000); y-axis: number of mesh points updated/step/sec/core (1E+05 to 1E+07). Series: V2-100m (6000x3000x800), V2-100m on Ranger, V2-100m on Kraken XT4, V2-100m on Kraken XT5, 2048^3-100m on ANL BG/P, V2-150m on TJ Watson BG/L]
SCEC capability runs update
2009: Ranger, Kraken, BG/P
2011: GPU/Cell, Blue Waters, hybrid, NUMA, CAF
2013: future architectures, FPGA, Chapel, cloud computing …
Current parallel programming model: message passing (C, C++, Fortran plus MPI communication), with current compilation technology
Transition model: PGAS (UPC, CAF, Titanium)
High-productivity models: HPCS (X-10, Chapel), with future compilation technology
Tier 0: PFlops-class
Tier 1: TG/DOE supercomputer centers, grid computing
Tier 2: regional medium-range supercomputers
Tier 3: high-performance workstations

HPC initiative:
• Short-term: SO 1-Hz, Vs=200m/s
• Medium-term: adaptation, EGM 2-Hz, Vs=200m/s
• Long-term: pick up new codes, EGM 3-10Hz, contribute to architecture design
• Data integration; Ph.D. programs?
Parallel FD and FE Codes

Code        | Split-node dynamic rupture | Wave propagation | Surface topography | Complex geometry | Material nonlinearity | Absorbing boundaries
FD-Olsen    | ✔ | ✔ |   |   |   | PML
FD-Rob      |   | ✔ |   |   |   | PML
FD-SORD     | ✔ | ✔ | ✔ | ✔ |   | PML
FE-Hercules |   | ✔ |   | ✔ |   | Stacey
FE-MaFE*    | ✔ | ✔ | ✔ | ✔ (arbitrary) | ✔ | PML
FE-DG*      | ✔ | ✔ | ✔ | ✔ (arbitrary) |   | -
Proposed Plan of Work: Automatic End-to-End Approach
• Automated rule-based workflow
• Highly configurable and customizable
• Reliable and robust
• Easy implementation
[Workflow diagram:
INPUT DATA PREPARATION: the original source file and original media file are partitioned (source partitioning, media partitioning) into partitioned source and media files on the file system, and copied to the archival system (source and media files) via GridFTP/SRBCopy; configuration: IN3D.
SIMULATION AND VALIDATION: simulation preparation checks the gates "Source Ready?" and "Media Ready?" (NO loops back, YES proceeds); the ShakeOut simulation then runs from the IN3D configuration and the partitioned source and media files (GridFTP/SRBLink), writing simulation output files that feed simulation validation and simulation visualization (GridFTP/SRBCopy).
DATA ARCHIVAL: simulation output files are copied to the archival system (output files) via GridFTP/SRBCopy.]
Proposed Plan of Work: Single-core Optimization
• Target much higher TeraFlop/s! This is the basic but most important optimization step, because the performance gains accumulate even in multi-core environments
• Application-specific optimization techniques
– Program behavior analysis (source-level or run-time profiling), applying traditional optimizations such as loop unrolling, code reassignment, and register reallocation (a minimal sketch follows this list)
– Optimize the behavior of code hotspots
– Architecture-aware optimization: optimization based on the underlying architecture (computational units, interconnect, cache, and memory)
• Compiler-driven optimization techniques, some already done
– Optimal compiler and optimization flags
– Optimal libraries
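To make the loop-unrolling idea concrete, here is a minimal C sketch; it is illustrative only and not taken from AWP-Olsen (the stencil coefficients, array names, and unroll factor are assumptions):

#include <stddef.h>

/* Baseline: one 3-point stencil update per iteration. */
void stencil_plain(const double *in, double *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

/* Unrolled by 4: fewer loop-condition tests and more independent
 * operations per iteration for the instruction scheduler; a
 * remainder loop handles the leftover tail. */
void stencil_unrolled(const double *in, double *out, size_t n)
{
    size_t i = 1;
    for (; i + 4 < n; i += 4) {
        out[i]     = 0.25 * in[i - 1] + 0.5 * in[i]     + 0.25 * in[i + 1];
        out[i + 1] = 0.25 * in[i]     + 0.5 * in[i + 1] + 0.25 * in[i + 2];
        out[i + 2] = 0.25 * in[i + 1] + 0.5 * in[i + 2] + 0.25 * in[i + 3];
        out[i + 3] = 0.25 * in[i + 2] + 0.5 * in[i + 3] + 0.25 * in[i + 4];
    }
    for (; i + 1 < n; i++)   /* remainder */
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

In practice the compiler flags mentioned above can do this transformation automatically; hand-unrolling matters mainly in hotspots the compiler cannot analyze.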
Proposed Plan of Work: Multi-core Optimization
• Computational pipelining (see the MPI sketch after this slide)
– Asynchronous process communication (isend and irecv)
– Well-defined pipelining of computational jobs to reduce the overhead imposed by MPI synchronization
– Guaranteed correctness of the computation
• Reduction of conflicts on shared resources
– A computational node shares resources: caches (shared L2 or L3) and memory
– Resolve highly biased conflicts on shared resources
– Program-behavioral solutions through temporal or spatial conflict avoidance
[Diagrams: SYNC vs. ASYNC: blocking send/recv stalls sender and receiver at a sync point, while isend/irecv overlaps communication with computation; Core1 and Core2 each with a private L1 cache above a shared L2 cache and shared memory, contrasting frequent & biased vs. infrequent & even access patterns]
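A minimal C/MPI sketch of the isend/irecv overlap described above; the buffer names, message size, and pairing scheme are invented for illustration and are not taken from AWP-Olsen:

#include <mpi.h>

#define N 1024

/* Exchange halo data with a neighbor while updating interior points,
 * so communication overlaps computation instead of stalling at a
 * synchronization point. */
void halo_exchange_overlap(double *halo_in, double *halo_out,
                           double *interior, int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Irecv(halo_in,  N, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* Interior points do not depend on the incoming halo, so this
     * work proceeds while the messages are in flight. */
    for (int i = 0; i < N; i++)
        interior[i] *= 0.5;

    /* Complete both transfers before touching the halo region;
     * this guarantees correctness of the computation. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, size;
    double halo_in[N] = {0}, halo_out[N] = {0}, interior[N] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int neighbor = rank ^ 1;            /* toy pairing: 0-1, 2-3, ... */
    if (neighbor < size)
        halo_exchange_overlap(halo_in, halo_out, interior,
                              neighbor, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}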
Proposed Plan of Work:Fault Tolerance
• Full systems are being designed with 500,000 processors…
– Assuming each processor has a 99.99% chance of continuing to function for one year, the chance of a one-million-core machine remaining up for one week is only about 14% (see the worked estimate below)
• Checkpointing and restarting could take longer than the time to the next failure
– System checkpoint/restart under way
– Last year, our 80+ hour 6k-core run on BG/L succeeded using IBM system checkpointing (an application-assisted infrastructure: the application level is responsible for identifying points at which there are no outstanding messages)
– A new model is needed: checkpoints to disk will be impractical at exascale
• Collaboration with Dr. Zizhong Chen of CSM
– Scalable algorithm-based checkpoint-free techniques to tolerate a small number of process failures, an application-level fault tolerance solution
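The 14% figure can be checked directly; a minimal worked version, assuming independent failures and a 52-week year:

P_{\text{up, 1 week}} = \left(0.9999^{1/52}\right)^{10^{6}} = \exp\!\left(\frac{10^{6}}{52}\,\ln 0.9999\right) \approx e^{-1.92} \approx 0.146,

consistent with the quoted 14%.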
Proposed Plan of Work: Data Management
• Centralized data collection becomes more and more difficult as data sizes increase exponentially
• Automating administrative tasks such as replication, distribution, access controls, and metadata extraction is a huge challenge; data virtualization and grid technology need to be integrated. With iRODS, for example, one can write rules to track administrative functions such as integrity monitoring:
– provide a logical name space so the data can be moved without the access name changing
– provide metadata to support discovery of files and track provenance
– provide rules to automate administrative tasks (authenticity, integrity, distribution, replication checks)
– provide micro-services for parsing data sets (HDF5 routines)
• Potential to use new iRODS interfaces to serve the large SCEC community
– WebDAV (can be accessed even from devices such as an iPhone)
– Windows browser; efficient and fast browser interface
Proposed Plan of Work:Data Visualization
• Visualization integration is of critical interest; Amit has been working with a graduate student to develop new GPU-based techniques for earthquake visualization
Candidates of Non-SCEC Applications
• ADER-DG: an arbitrary high-order discontinuous Galerkin FE method
• Shuo Ma's FE code (MaFE) using a simplified structured grid
AWP-Olsen-Day vs. ADER-DG

Problem domain and settings
– FD AWP-Olsen-Day: 600x300x80 km, 1-Hz, 250 s; 100x60 km region with S-wave velocity 300-500 m/s (down to 1 km); 60x30 km region with S-wave velocity 100-300 m/s (down to 400 m)
– FE ADER-DG: 600x300x80 km, 1-Hz, 250 s; Vol0: bottom to Moho (30 km), 600x300x50 km, Vs=5000 m/s, Vp=8500 m/s; Vol1: 30 km to sediment base, 600x300x30 km, Vs=3500 m/s, Vp=6000 m/s; Vol2: 100x60x1 km, Vs=500 m/s, Vp=1800 m/s; Vol3: 60x30x0.4 km, Vs=200 m/s, Vp=1500 m/s; 3 elements per dominant wavelength, 5th-order accuracy in space and time, i.e. polynomials of degree 4 within each element, which gives (4+3 choose 3) = 35 degrees of freedom

Computational cost
– FD: Vs=200 m/s costs (500/200)^4 = 39x more than Vs=500 m/s (see the scaling note below)
– FE: Vs=200 m/s costs 2.25x more than Vs=500 m/s

Elements (Vs=200 m/s)
– FD: 2.25 x 10^11; FE: 7.69 x 10^7

Time steps (Vs=200 m/s)
– FD: 125,000; FE: 485,000

Total wall clock time (2k / 64k cores)
– FD: 2557 hrs / 80 hrs; FE: 1191 hrs / 37 hrs
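A sketch of where the fourth power comes from, assuming (as is standard for FD) that the grid spacing h scales with the minimum S-wave velocity at a fixed number of points per wavelength, so that three factors come from the 3-D grid and one from the CFL-limited time step:

\frac{\mathrm{cost}(V_s{=}200)}{\mathrm{cost}(V_s{=}500)} = \underbrace{\left(\frac{500}{200}\right)^{3}}_{\text{grid points}} \times \underbrace{\left(\frac{500}{200}\right)}_{\text{time steps (CFL)}} = 2.5^{4} \approx 39.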
ADER-DG Scaling on Ranger
[Scaling plot: y-axis 0 to 12000; x-axis number of cores, 32 to 2048; series: tetra_1000_ranger, tetra_1000_xt4, tetra_500_ranger]
ADER-DG Validation LOH.3
(Source: Martin Kaeser 2009)
ADER-DG Local Time Stepping
(Source: Martin Kaeser 2009)
Each tetrahedral element (m) has its own time step, set by a CFL-type criterion in which lmin is the insphere radius of the tetrahedron and amax is the fastest wave speed. Therefore, the Taylor series in time depends on the local time level t(m).
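The criterion itself appeared as an image on the original slide; reconstructed here from the stated definitions, with the 1/(2N+1) factor being the usual DG stability bound for polynomial degree N (treat the exact prefactor as an assumption):

\Delta t^{(m)} < \frac{1}{2N+1} \cdot \frac{2\, l_{\min}^{(m)}}{a_{\max}^{(m)}}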
ADER-DG Dynamic Rupture Results
(Source: Martin Kaeser 2009)
ADER-DG Effect of mesh coarsening
(Source: Martin Kaeser 2009)
DG Application to Landers branching fault system
(Source: Martin Kaeser 2009)
(J. Wassermann)
• problem adapted mesh generation
• p-adaptive calculations to resolve topography very accurately
• load balancing by grouping subdomains
DG Modeling of Wave Fields in Merapi Volcano
(Source: Martin Kaeser 2009)
(J. Wassermann)
• analysing the strong scattering effect of surface topography
• analysing the limits of standard moment tensor inversion procedures
DG Modeling of Scattered Waves in Merapi Volcano
(Source: Martin Kaeser 2009)
MaFE Scaling
(Source: Shuo Ma 2009)