Parallel Code Choices
Where Do We Stand?
• ShakeOut-D 1-Hz Vs=250m/s benchmark runs on Kraken-XT5 and Ranger at full machine scale; Hercules successfully test-run on 16k Kraken-XT4 cores with Vs=200m/s
• Multiple AWP-Olsen ShakeOut-D 1-Hz runs on NICS Kraken-XT5 using 64k processor cores with wall clock time under 5 hours; SORD runs using 16k Ranger cores
• Milestone: the 100k-core mark passed! Recent successful benchmark runs on DOE ANL BG/P using up to 131,072 cores
SCEC capability runs update
[Figure: AWP-Olsen-Day Performance on Some of the World's Most Powerful Supercomputers. X-axis: number of cores (1,000 to 1,000,000); y-axis: number of mesh points updated/step/sec/core (1E+05 to 1E+07). Series: V2-100m (6000x3000x800), V2-100m on Ranger, V2-100m on Kraken XT4, V2-100m on Kraken XT5, 2048^3-100m on ANL BG/P, V2-150m on TJ Watson BG/L]
SCEC capability runs update
2009: Ranger, Kraken, BG/P
2011: GPU/Cell, Blue Waters, hybrid, NUMA, CAF
2013: future architectures, FPGA, Chapel, cloud computing …
Current parallel programming model: message passing (C, C++, Fortran plus MPI communication), with current compilation technology
Transition model: PGAS (UPC, CAF, Titanium)
High-productivity models: HPCS (X-10, Chapel), with future compilation technology
Tier 0: PFlops-class
Tier 1: TG/DOE supercomputer centers, grid computing
Tier 2: regional medium-range supercomputers
Tier 3: high-performance workstations

HPC initiative:
• Short-term: SO 1-Hz, Vs=200m/s
• Medium-term: adaptation, EGM 2-Hz, Vs=200m/s
• Long-term: pick up new codes, EGM 3-10Hz, contribute to architecture design
• Data integration; Ph.D. programs?
Parallel FD and FE Codes

Code        | Split-node dynamic rupture | Wave propagation | Surface topography | Complex geometry | Material nonlinearity | Absorbing boundaries
FD-Olsen    | ✔ | ✔ |   |   |   | PML
FD-Rob      |   | ✔ |   |   |   | PML
FD-SORD     | ✔ | ✔ | ✔ | ✔ |   | PML
FE-Hercules |   | ✔ |   | ✔ |   | Stacey
FE-MaFE*    | ✔ | ✔ | ✔ | ✔ (arbitrary) | ✔ | PML
FE-DG*      | ✔ | ✔ | ✔ | ✔ (arbitrary) |   | -
Proposed Plan of Work: Automatic End-to-End Approach
• Automated rule-based workflow
• Highly configurable and customizable
• Reliable and robust
• Easy implementation
[Workflow diagram:
INPUT DATA PREPARATION: the original source file and original media file are partitioned (source partitioning, media partitioning) into partitioned source and media files on the file system, and copied to the archival system (source and media files) via GridFTP/SRBCopy; configuration: IN3D.
SIMULATION AND VALIDATION: simulation preparation checks the gates "Source Ready?" and "Media Ready?" (NO loops back, YES proceeds); the ShakeOut simulation then runs from the IN3D configuration and the partitioned source and media files (GridFTP/SRBLink), writing simulation output files that feed simulation validation and simulation visualization (GridFTP/SRBCopy).
DATA ARCHIVAL: simulation output files are copied to the archival system (output files) via GridFTP/SRBCopy.]
Proposed Plan of Work: Single-core Optimization
• Target much higher TeraFlop/s! This is the basic but most important optimization step, because the performance gains accumulate even in multi-core environments
• Application-specific optimization techniques
– Program behavior analysis (source-level or run-time profiling), applying traditional optimizations such as loop unrolling, code reassignment, and register reallocation (a minimal sketch follows this list)
– Optimize the behavior of code hotspots
– Architecture-aware optimization: optimization based on the underlying architecture (computational units, interconnect, cache, and memory)
• Compiler-driven optimization techniques, some already done
– Optimal compiler and optimization flags
– Optimal libraries
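To make the loop-unrolling idea concrete, here is a minimal C sketch; it is illustrative only and not taken from AWP-Olsen (the stencil coefficients, array names, and unroll factor are assumptions):

#include <stddef.h>

/* Baseline: one 3-point stencil update per iteration. */
void stencil_plain(const double *in, double *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

/* Unrolled by 4: fewer loop-condition tests and more independent
 * operations per iteration for the instruction scheduler; a
 * remainder loop handles the leftover tail. */
void stencil_unrolled(const double *in, double *out, size_t n)
{
    size_t i = 1;
    for (; i + 4 < n; i += 4) {
        out[i]     = 0.25 * in[i - 1] + 0.5 * in[i]     + 0.25 * in[i + 1];
        out[i + 1] = 0.25 * in[i]     + 0.5 * in[i + 1] + 0.25 * in[i + 2];
        out[i + 2] = 0.25 * in[i + 1] + 0.5 * in[i + 2] + 0.25 * in[i + 3];
        out[i + 3] = 0.25 * in[i + 2] + 0.5 * in[i + 3] + 0.25 * in[i + 4];
    }
    for (; i + 1 < n; i++)   /* remainder */
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
}

In practice the compiler flags mentioned above can do this transformation automatically; hand-unrolling matters mainly in hotspots the compiler cannot analyze.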
Proposed Plan of Work: Multi-core Optimization
• Computational pipelining (see the MPI sketch after this slide)
– Asynchronous process communication (isend and irecv)
– Well-defined pipelining of computational jobs to reduce the overhead imposed by MPI synchronization
– Guaranteed correctness of the computation
• Reduction of conflicts on shared resources
– A computational node shares resources: caches (shared L2 or L3) and memory
– Resolve highly biased conflicts on shared resources
– Program-behavioral solutions through temporal or spatial conflict avoidance
[Diagrams: SYNC vs. ASYNC: blocking send/recv stalls sender and receiver at a sync point, while isend/irecv overlaps communication with computation; Core1 and Core2 each with a private L1 cache above a shared L2 cache and shared memory, contrasting frequent & biased vs. infrequent & even access patterns]
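A minimal C/MPI sketch of the isend/irecv overlap described above; the buffer names, message size, and pairing scheme are invented for illustration and are not taken from AWP-Olsen:

#include <mpi.h>

#define N 1024

/* Exchange halo data with a neighbor while updating interior points,
 * so communication overlaps computation instead of stalling at a
 * synchronization point. */
void halo_exchange_overlap(double *halo_in, double *halo_out,
                           double *interior, int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Irecv(halo_in,  N, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* Interior points do not depend on the incoming halo, so this
     * work proceeds while the messages are in flight. */
    for (int i = 0; i < N; i++)
        interior[i] *= 0.5;

    /* Complete both transfers before touching the halo region;
     * this guarantees correctness of the computation. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, size;
    double halo_in[N] = {0}, halo_out[N] = {0}, interior[N] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int neighbor = rank ^ 1;            /* toy pairing: 0-1, 2-3, ... */
    if (neighbor < size)
        halo_exchange_overlap(halo_in, halo_out, interior,
                              neighbor, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}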
Proposed Plan of Work:Fault Tolerance
• Full systems are being designed with 500,000 processors…
– Assuming each processor has a 99.99% chance of continuing to function for one year, the chance of a one-million-core machine remaining up for one week is only about 14% (see the worked estimate below)
• Checkpointing and restarting could take longer than the time to the next failure
– System checkpoint/restart under way
– Last year, our 80+ hour 6k-core run on BG/L succeeded using IBM system checkpointing (an application-assisted infrastructure: the application level is responsible for identifying points at which there are no outstanding messages)
– A new model is needed: checkpoints to disk will be impractical at exascale
• Collaboration with Dr. Zizhong Chen of CSM
– Scalable algorithm-based checkpoint-free techniques to tolerate a small number of process failures, an application-level fault tolerance solution
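The 14% figure can be checked directly; a minimal worked version, assuming independent failures and a 52-week year:

P_{\text{up, 1 week}} = \left(0.9999^{1/52}\right)^{10^{6}} = \exp\!\left(\frac{10^{6}}{52}\,\ln 0.9999\right) \approx e^{-1.92} \approx 0.146,

consistent with the quoted 14%.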
Proposed Plan of Work: Data Management
• Centralized data collection becomes more and more difficult as data sizes increase exponentially
• Automating administrative tasks such as replication, distribution, access controls, and metadata extraction is a huge challenge; data virtualization and grid technology need to be integrated. With iRODS, for example, one can write rules to track administrative functions such as integrity monitoring:
– provide a logical name space so the data can be moved without the access name changing
– provide metadata to support discovery of files and track provenance
– provide rules to automate administrative tasks (authenticity, integrity, distribution, replication checks)
– provide micro-services for parsing data sets (HDF5 routines)
• Potential to use new iRODS interfaces to serve the large SCEC community
– WebDAV (can be accessed even from devices such as an iPhone)
– Windows browser; efficient and fast browser interface
Proposed Plan of Work:Data Visualization
• Visualization integration is of critical interest; Amit has been working with a graduate student to develop new GPU-based techniques for earthquake visualization
Candidates of Non-SCEC Applications
• ADER-DG: an arbitrary high-order discontinuous Galerkin FE method
• Shuo Ma's FE code (MaFE) using a simplified structured grid
AWP-Olsen-Day vs. ADER-DG

Problem domain and settings
– FD AWP-Olsen-Day: 600x300x80 km, 1-Hz, 250 s; 100x60 km region with S-wave velocity 300-500 m/s (down to 1 km); 60x30 km region with S-wave velocity 100-300 m/s (down to 400 m)
– FE ADER-DG: 600x300x80 km, 1-Hz, 250 s; Vol0: bottom to Moho (30 km), 600x300x50 km, Vs=5000 m/s, Vp=8500 m/s; Vol1: 30 km to sediment base, 600x300x30 km, Vs=3500 m/s, Vp=6000 m/s; Vol2: 100x60x1 km, Vs=500 m/s, Vp=1800 m/s; Vol3: 60x30x0.4 km, Vs=200 m/s, Vp=1500 m/s; 3 elements per dominant wavelength, 5th-order accuracy in space and time, i.e. polynomials of degree 4 within each element, which gives (4+3 choose 3) = 35 degrees of freedom

Computational cost
– FD: Vs=200 m/s costs (500/200)^4 = 39x more than Vs=500 m/s (see the scaling note below)
– FE: Vs=200 m/s costs 2.25x more than Vs=500 m/s

Elements (Vs=200 m/s)
– FD: 2.25 x 10^11; FE: 7.69 x 10^7

Time steps (Vs=200 m/s)
– FD: 125,000; FE: 485,000

Total wall clock time (2k / 64k cores)
– FD: 2557 hrs / 80 hrs; FE: 1191 hrs / 37 hrs
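A sketch of where the fourth power comes from, assuming (as is standard for FD) that the grid spacing h scales with the minimum S-wave velocity at a fixed number of points per wavelength, so that three factors come from the 3-D grid and one from the CFL-limited time step:

\frac{\mathrm{cost}(V_s{=}200)}{\mathrm{cost}(V_s{=}500)} = \underbrace{\left(\frac{500}{200}\right)^{3}}_{\text{grid points}} \times \underbrace{\left(\frac{500}{200}\right)}_{\text{time steps (CFL)}} = 2.5^{4} \approx 39.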
ADER-DG Scaling on Ranger
[Scaling plot: y-axis 0 to 12000; x-axis number of cores, 32 to 2048; series: tetra_1000_ranger, tetra_1000_xt4, tetra_500_ranger]
ADER-DG Validation LOH.3
(Source: Martin Kaeser 2009)
ADER-DG Local Time Stepping
(Source: Martin Kaeser 2009)
Each tetrahedral element (m) has its own time step, set by a CFL-type criterion in which lmin is the insphere radius of the tetrahedron and amax is the fastest wave speed. Therefore, the Taylor series in time depends on the local time level t(m).
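The criterion itself appeared as an image on the original slide; reconstructed here from the stated definitions, with the 1/(2N+1) factor being the usual DG stability bound for polynomial degree N (treat the exact prefactor as an assumption):

\Delta t^{(m)} < \frac{1}{2N+1} \cdot \frac{2\, l_{\min}^{(m)}}{a_{\max}^{(m)}}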
ADER-DG Dynamic Rupture Results
(Source: Martin Kaeser 2009)
ADER-DG Effect of mesh coarsening
(Source: Martin Kaeser 2009)
DG Application to Landers branching fault system
(Source: Martin Kaeser 2009)
(J. Wassermann)
• problem adapted mesh generation
• p-adaptive calculations to resolve topography very accurately
• load balancing by grouping subdomains
DG Modeling of Wave Fields in Merapi Volcano
(Source: Martin Kaeser 2009)
(J. Wassermann)
• analysing the strong scattering effect of surface topography
• analysing the limits of standard moment tensor inversion procedures
DG Modeling of Scattered Waves in Merapi Volcano
(Source: Martin Kaeser 2009)
MaFE Scaling
(Source: Shuo Ma 2009)