
National Center for Supercomputing Applications

Engineering Breakthroughs at NCSA: XSEDE, Blue Waters, Industry

Seid Koric, Senior Technical Lead, Private Sector Program at NCSA

Adjunct Professor, Mechanical Science and Engineering Dept., University of Illinois

http://www.ncsa.illinois.edu   [email protected]

XSEDE ECSS Project: 3D Study of Elastic-Plastic Transition and Fractal Patterns of a 1-Million-Grain Cube of Grade 316 Steel (2010-2012)

(M. Ostoja-Starzewski, Jun Li, S. Koric, A. Saharan, Philosophical Magazine, 2012)


Largest nonhomogeneous FEA simulations to date

Each of the 1 million elements (grains) has a different material property

The fractal dimension can be used to estimate the level of plasticity for damage assessment of various structures

We are aiming at (much) larger simulations on Blue Waters!


Blue Waters: sustained petascale system

• Cray system & storage cabinets: >300
• Compute nodes: >25,000
• Usable storage bandwidth: >1 TB/s
• System memory: >1.5 petabytes
• Memory per core module: 4 GB
• Gemini interconnect topology: 3D torus
• Usable storage: >25 petabytes
• Peak performance: >11.5 petaflops
• Number of AMD processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA GPUs: >5,000

iForge: Industrial HPC resource at NCSA


Per-platform (Platform 1 / Platform 2):
• x86 cores: 2048 / 576
• CPU type: "Sandy Bridge" / "Abu Dhabi"
• Clock: 3.2 GHz / 3.4 GHz
• Cores per node: 16 / 32
• Memory per node: 128 GB, 1600 MHz / 256 GB, 1600 MHz

System-wide:
• Global RAMdisk: 1.5 terabytes
• Total memory: 21 terabytes
• Storage: 700 terabytes
• File system: GPFS
• Interconnect: 40 Gigabit QDR InfiniBand
• MPI: Platform, Intel, MVAPICH2, OpenMP
• Operating system: Red Hat Enterprise Linux 6.4


Evaluation of Massively Parallel Linear Solvers in Implicit FEA

• An implicit FEA code spends 70-80% of its time solving large systems of linear equations, Ax = b, where A is sparse, i.e., most of its coefficients are zero (see the storage sketch after this list)

• A wide range of applications: finite element solid mechanics, computational fluid dynamics, reservoir simulation, circuit design, linear programming, etc.
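Sparsity is what makes these systems tractable: only the nonzero coefficients are stored and multiplied. Below is a minimal sketch in C of that idea, using compressed sparse row (CSR) storage and a sparse matrix-vector product, the kernel repeated inside CG-type iterative solvers. The 4x4 matrix is purely illustrative and is not one of the benchmark matrices discussed here.

```c
#include <stdio.h>

/* Compressed Sparse Row (CSR) storage: only nonzero coefficients are kept.
   row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i in val[]/col[]. */
typedef struct {
    int           n;       /* matrix dimension             */
    const int    *row_ptr; /* length n+1                   */
    const int    *col;     /* column index of each nonzero */
    const double *val;     /* value of each nonzero        */
} csr_matrix;

/* y = A*x : the sparse matrix-vector product used inside iterative solvers */
static void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* Illustrative 4x4 SPD "stiffness-like" matrix (1-D Laplacian):
         [ 2 -1  0  0 ]
         [-1  2 -1  0 ]
         [ 0 -1  2 -1 ]
         [ 0  0 -1  2 ]   -> only 10 of 16 entries are nonzero.         */
    const int    row_ptr[] = {0, 2, 5, 8, 10};
    const int    col[]     = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3};
    const double val[]     = {2, -1, -1, 2, -1, -1, 2, -1, -1, 2};
    const csr_matrix A     = {4, row_ptr, col, val};

    const double x[4] = {1.0, 1.0, 1.0, 1.0};
    double       y[4];
    csr_spmv(&A, x, y);

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 1 0 0 1 */
    return 0;
}
```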


FE Model and its Global Stiffness Matrix


Problem Specification (matrices)

• Originate either from in-house industrial and academic codes or from a commercial FE code solving real-world engineering problems

• Mostly SPD, with N = 1-20 M and NNZ = 120-500 M

• Condition numbers 10^3-10^12


Problem Specification (solvers)

• WSMP: direct solver developed by IBM Watson, based on a multifrontal algorithm, hybrid (MPI & Pthreads), symmetric and nonsymmetric

• SuperLU: direct solver developed by LBNL, LU decomposition, MPI, nonsymmetric

• MUMPS: direct solver funded by CEC ESPRIT IV, multifrontal algorithm, MPI, symmetric and nonsymmetric

• Hypre: iterative solver, LLNL, Conjugate Gradient with AMG, IC, and SAI (sparse approximate inverse) preconditioners, MPI, symmetric

• PETSc: iterative solver, ANL, Conjugate Gradient (CG), Bi-Conjugate Gradient Stabilized (BCGS), and Conjugate Residual (CR) with Bjacobi (block Jacobi), ASM (Additive Schwarz), and AMG (multigrid) preconditioners, MPI, symmetric and nonsymmetric (a minimal setup is sketched below)

• Commercial FEA codes (NDA)
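To show how one of these solver/preconditioner combinations is driven in practice, here is a minimal PETSc sketch in C for the CG/Bjacobi combination at a 1.0E-5 relative tolerance, as it appears in the charts below. It is only an illustration: the matrix is a small 1-D Laplacian stand-in rather than one of the benchmark stiffness matrices, error checking is omitted, and it assumes a reasonably recent PETSc (3.5 or later, for the two-argument KSPSetOperators).

```c
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PC       pc;
    PetscInt i, n = 1000, Istart, Iend, its;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a small SPD stand-in matrix (1-D Laplacian), distributed
       across the MPI ranks, in place of a real global stiffness matrix. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);                      /* arbitrary right-hand side */

    /* CG with block-Jacobi preconditioning, relative tolerance 1.0e-5:
       the "CG/Bjacobi, PETSc, Rconv=1.E-5" combination in the charts. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);
    KSPSetTolerances(ksp, 1.0e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
    KSPSetFromOptions(ksp);              /* allow -ksp_type/-pc_type overrides */
    KSPSolve(ksp, b, x);

    KSPGetIterationNumber(ksp, &its);
    PetscPrintf(PETSC_COMM_WORLD, "CG/Bjacobi converged in %d iterations\n",
                (int)its);

    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```

The same experiment can be repeated with the other combinations from the list simply by switching the KSP and PC types (or via the command-line options), which is essentially how an iterative-solver comparison of this kind is set up.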

Solver Work in Progress (iForge now)


[Figure: Solution time (sec) vs. core count (16, 32, 64, 128, 256 cores) for Matrix 1M (SPD, N=1.5M, NNZ=63.6M, COND=6.9E4); lower is better. Solvers compared: CG/Bjacobi (PETSc, Rconv=1.E-5), BCGS/Bjacobi (PETSc, Rconv=1.E-5), BCGS/ASM (PETSc, Rconv=1.E-5), CR/Bjacobi (PETSc, Rconv=1.E-5), PCG/ParaSails (Hypre, Rconv=1.E-5), MUMPS (SPD, direct), WSMP (SPD, direct), SuperLU (unsymmetric, direct).]

An order-of-magnitude larger problem


[Figure: Solution time (sec) vs. core count (16, 32, 64, 128, 256, 512 cores) for Matrix 20M (SPD, N=20.05M, NNZ=827.49M, COND≈1.E7); lower is better. Solvers compared: CR/Bjacobi (PETSc, Rconv=1.0E-5), WSMP (SPD, direct), PCG/ParaSails (Hypre, Rconv=1.0E-5), MUMPS (SPD, direct).]

WSMP Performance on iForge (Higher = Better)


[Figure: Sparse factorization performance (TFlop/s) vs. number of threads (128, 256, 512, 768, 960) for the Watson Sparse Matrix Package hybrid (MPI/Pthreads) symmetric solver, N=2.8M, NNZ=107M; X5690/Westmere vs. E5-2670/Sandy Bridge.]

ABAQUS model:
• Number of elements: 2,274,403
• Number of nodes: 12,190,073
• Number of DOFs: >30M

ABAQUS analysis job:
• Cluster: iForge
• Number of cores used: 24-196
• Solver: direct sparse
• Wall clock time: 7 hours -> 1 hour

ISV Implicit FEA Benchmark on iForge

[Figure: Wall clock time (sec) vs. number of cores (up to ~250) for the ABAQUS job described above; lower is better.]

Explicit FEA: LS-Dyna on Blue Waters

NCSA/PSP, the hardware vendor (Cray), the ISV (LSTC), and a PSP partner (NDA) all working together!

Real geometry, loads, and BCs; a highly nonlinear transient dynamic problem with difficult contact conditions

The MPP-Dyna solver is fully ported to and optimized for Cray's Linux Environment, taking full advantage of the Gemini interconnect


LS-Dyna Breakthrough on Blue Waters


[Figure: Wall clock time (hours) vs. CPU cores (512, 1024, 1536, 2048, 3072, 4096, 8192) for a 26.5M-node, 80M-DOF model; lower is better. Series: iForge (MPI), Blue Waters (MPI), Blue Waters (Hybrid).]

Highest known scaling of LS-DYNA to date!

Typical MPP-Dyna Profiling


As the number of cores increases, the communication cost increases rapidly!

[Figure: Computation vs. communication time breakdown at 64 cores and at 512 cores.]

Dyna Work in progress

• Benchmarking even larger real problems

• Memory management is becoming a serious issue for DP (decomposition, distribution, MPMD, etc.)

• The hybrid (MPI/OpenMP) solver uses less memory and less communication (see the sketch after this list)

• Load balance in contact and rigid body algorithms
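To make the hybrid point concrete, here is a minimal MPI+OpenMP sketch in C (a generic illustration, not LS-DYNA code): threads share one copy of the data inside each rank and only the ranks exchange messages, so running fewer ranks with more threads per rank reduces both the memory footprint and the communication volume. It assumes an MPI library and OpenMP support (compile with, e.g., mpicc -fopenmp).

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request threaded MPI: each rank spawns OpenMP threads for the
       compute-heavy loops, and only the (few) ranks exchange messages. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Thread-parallel "element loop" inside each rank (shared memory,
       no MPI traffic): a stand-in for the per-domain work in an
       explicit FEA sweep.                                              */
    const int nelem = 1000000;
    double local_sum = 0.0;
#pragma omp parallel for reduction(+:local_sum)
    for (int e = 0; e < nelem; e++)
        local_sum += 1.0e-6 * e;   /* placeholder for real element work */

    /* Communication happens only between ranks, so fewer ranks (with
       more threads each) means less inter-process traffic.             */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d MPI ranks x %d OpenMP threads, global sum = %g\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```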


Star-CCM+ Breakthrough on Blue Waters

Source: NCSA Private Sector Partner "B" (confidential)
Code/version: Star-CCM+ 7.6.9
Physics: transient, turbulent, single-phase compressible flow
Mesh size: 21.4 million unstructured polyhedral cells
Complexity: very complicated geometry, high-resolution mesh

A complex, real-life production case: highly demanding in terms of both the mesh and the physics involved.

[Figure: CD-adapco Star-CCM+ case from Partner "B": iterations per simulation hour vs. CPU cores (0-2048); higher is better. Series: iForge, Blue Waters. Scaling with InfiniBand levels off at 256 cores.]

Highest known scaling of Star-CCM+ to date…

…and we broke the code!

Future of HPC: GPGPU with OpenACC?


[Figure: Laplace 2D wall clock time (sec) for CPU only (1 OMP thread), CPU only (6 OMP threads), and GPU (OpenACC); lower is better. Systems: Blue Waters XK7 (Interlagos/Kepler) and KIDS (Westmere/Fermi).]

14x speedup!
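The Laplace 2D case behind these numbers is the standard Jacobi-relaxation benchmark. The sketch below is a generic OpenACC version of that kernel in C, a minimal reconstruction rather than the exact source run on Blue Waters or KIDS: the acc data region keeps both grids resident on the accelerator across sweeps, and the two parallel loops offload the stencil update and the copy-back. Compile with an OpenACC-capable compiler (e.g., with -acc); without OpenACC the pragmas are ignored and it runs on the CPU.

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

#define N        1024      /* interior grid is (N-2) x (N-2) */
#define ITER_MAX 1000
#define TOL      1.0e-4

static double A[N][N], Anew[N][N];

int main(void)
{
    /* Boundary conditions: A = 1 on the left edge, 0 elsewhere. */
    memset(A, 0, sizeof(A));
    memset(Anew, 0, sizeof(Anew));
    for (int i = 0; i < N; i++) { A[i][0] = 1.0; Anew[i][0] = 1.0; }

    double err  = 1.0;
    int    iter = 0;

    /* Keep both grids on the accelerator for the whole iteration loop,
       so only the scalar error travels back each sweep.                */
#pragma acc data copy(A) create(Anew)
    while (err > TOL && iter < ITER_MAX) {
        err = 0.0;

        /* Jacobi sweep: each new value is the average of its 4 neighbors. */
#pragma acc parallel loop reduction(max:err)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                Anew[i][j] = 0.25 * (A[i][j + 1] + A[i][j - 1] +
                                     A[i + 1][j] + A[i - 1][j]);
                err = fmax(err, fabs(Anew[i][j] - A[i][j]));
            }

        /* Copy the updated interior back into A for the next sweep. */
#pragma acc parallel loop
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                A[i][j] = Anew[i][j];

        iter++;
    }

    printf("Converged to err=%g after %d iterations\n", err, iter);
    return 0;
}
```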

Inter-Nodal GPU Acceleration on Blue Waters with Abaqus


[Figure: Parallel speedup vs. cores (8, 16, 32, 64, 96) for Abaqus/Standard 6.11 in Cluster Compatibility Mode, S4B benchmark (5.23M DOFs); higher is better. Series: Cray XE6 (CPU only), Cray XK7 (CPU+GPU).]

NDEMC Public-Private Partnership


• US OEMs have gained a competitive edge through the use of high-performance computing (HPC) with modeling, simulation, and analysis (MS&A).

• The US Council on Competitiveness recognized that small and medium-sized enterprises (SMEs) are not able to take advantage of HPC.

• Starting in the fall of 2011, a regional pilot program was launched in the Midwestern supply base.

Objective: Study the fatigue life of a charge air cooler (CAC) under thermal stresses for the NDEMC project.

Description: Three-step sequentially coupled simulation (model size: 15M nodes)

(1) CFD analysis of the turbulent fluid flow through the CAC, coupled with advective heat transfer, provides the thermal BCs for the FEA.

(2) Thermo-mechanical FEA provides the transient thermal stresses in the solid part during the thermal cycle for the fatigue analysis.

(3) The fatigue model uses the history of thermal stresses to estimate the cycle life at critical points.

NDEMC: Multiphysics Simulation of Charge Air Cooler (CAC)

Special Thanks
• Prof. Martin Ostoja-Starzewski (MechSE, UIUC)
• Dr. Ahmed Taha (NCSA)
• CRAY
• 2 PSP partner companies (NDA)
• NDEMC
• LSTC
• IBM/Watson (Dr. Anshul Gupta)
• Simulia, Dassault Systèmes
• Blue Waters Team
