
National Center for Supercomputing Applications

Engineering Breakthroughs at NCSA: XSEDE, Blue Waters, Industry

Seid Koric, Senior Technical Lead, Private Sector Program at NCSA

Adjunct Professor, Mechanical Science and Engineering Dept., University of Illinois

http://www.ncsa.illinois.edu   [email protected]

XSEDE ECSS Project: 3D Study of Elastic-Plastic Transition and Fractal Patterns of a 1-Million-Grain Cube of Grade 316 Steel (2010-2012)

(M. Ostoja-Starzewski, Jun Li, S. Koric, A. Saharan, Philosophical Magazine, 2012)


Largest nonhomogeneous FEA simulations to date

Each of the 1 million elements (grains) has a different material property

The fractal dimension can be used to estimate the level of plasticity for damage assessment of various structures

We are aiming at (much) larger simulations on Blue Waters!


Blue Waters: sustained petascale system

• Cray system & storage cabinets: >300
• Compute nodes: >25,000
• Usable storage bandwidth: >1 TB/s
• System memory: >1.5 petabytes
• Memory per core module: 4 GB
• Gemini interconnect topology: 3D torus
• Usable storage: >25 petabytes
• Peak performance: >11.5 petaflops
• Number of AMD processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA GPUs: >5,000

iForge: Industrial HPC resource at NCSA


Per-platform (Platform 1 / Platform 2):
• x86 cores: 2048 / 576
• CPU type: "Sandy Bridge" / "Abu Dhabi"
• Clock: 3.2 GHz / 3.4 GHz
• Cores per node: 16 / 32
• Memory per node: 128 GB, 1600 MHz / 256 GB, 1600 MHz

System-wide:
• Global RAMdisk: 1.5 terabytes
• Total memory: 21 terabytes
• Storage: 700 terabytes
• File system: GPFS
• Interconnect: 40 Gigabit QDR InfiniBand
• MPI: Platform, Intel, MVAPICH2, OpenMP
• Operating system: Red Hat Enterprise Linux 6.4


Evaluation of Massively Parallel Linear Solvers in Implicit FEA

• An implicit FEA code spends 70-80% of its time solving large systems of linear equations, Ax = b, where A is sparse, i.e., most of its coefficients are zero (see the storage sketch after this list)

• A wide range of applications: finite element solid mechanics, computational fluid dynamics, reservoir simulation, circuit design, linear programming, etc.
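Sparsity is what makes these systems tractable: only the nonzero coefficients are stored and multiplied. Below is a minimal sketch in C of that idea, using compressed sparse row (CSR) storage and a sparse matrix-vector product, the kernel repeated inside CG-type iterative solvers. The 4x4 matrix is purely illustrative and is not one of the benchmark matrices discussed here.

```c
#include <stdio.h>

/* Compressed Sparse Row (CSR) storage: only nonzero coefficients are kept.
   row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i in val[]/col[]. */
typedef struct {
    int           n;       /* matrix dimension             */
    const int    *row_ptr; /* length n+1                   */
    const int    *col;     /* column index of each nonzero */
    const double *val;     /* value of each nonzero        */
} csr_matrix;

/* y = A*x : the sparse matrix-vector product used inside iterative solvers */
static void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* Illustrative 4x4 SPD "stiffness-like" matrix (1-D Laplacian):
         [ 2 -1  0  0 ]
         [-1  2 -1  0 ]
         [ 0 -1  2 -1 ]
         [ 0  0 -1  2 ]   -> only 10 of 16 entries are nonzero.         */
    const int    row_ptr[] = {0, 2, 5, 8, 10};
    const int    col[]     = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3};
    const double val[]     = {2, -1, -1, 2, -1, -1, 2, -1, -1, 2};
    const csr_matrix A     = {4, row_ptr, col, val};

    const double x[4] = {1.0, 1.0, 1.0, 1.0};
    double       y[4];
    csr_spmv(&A, x, y);

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 1 0 0 1 */
    return 0;
}
```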


FE Model and its Global Stiffness Matrix


Problem Specification (matrices)

• Originate either from in-house industrial and academic codes or from a commercial FE code solving real-world engineering problems

• Mostly SPD, with N = 1-20 M and NNZ = 120-500 M

• Condition numbers 10^3-10^12


Problem Specification (solvers)

• WSMP: direct solver developed by IBM Watson, based on a multifrontal algorithm, hybrid (MPI & Pthreads), symmetric and nonsymmetric

• SuperLU: direct solver developed by LBNL, LU decomposition, MPI, nonsymmetric

• MUMPS: direct solver funded by CEC ESPRIT IV, multifrontal algorithm, MPI, symmetric and nonsymmetric

• Hypre: iterative solver, LLNL, Conjugate Gradient with AMG, IC, and SAI (sparse approximate inverse) preconditioners, MPI, symmetric

• PETSc: iterative solver, ANL, Conjugate Gradient (CG), Bi-Conjugate Gradient Stabilized (BCGS), and Conjugate Residual (CR) with Bjacobi (block Jacobi), ASM (Additive Schwarz), and AMG (multigrid) preconditioners, MPI, symmetric and nonsymmetric (a minimal setup is sketched below)

• Commercial FEA codes (NDA)
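To show how one of these solver/preconditioner combinations is driven in practice, here is a minimal PETSc sketch in C for the CG/Bjacobi combination at a 1.0E-5 relative tolerance, as it appears in the charts below. It is only an illustration: the matrix is a small 1-D Laplacian stand-in rather than one of the benchmark stiffness matrices, error checking is omitted, and it assumes a reasonably recent PETSc (3.5 or later, for the two-argument KSPSetOperators).

```c
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PC       pc;
    PetscInt i, n = 1000, Istart, Iend, its;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a small SPD stand-in matrix (1-D Laplacian), distributed
       across the MPI ranks, in place of a real global stiffness matrix. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);                      /* arbitrary right-hand side */

    /* CG with block-Jacobi preconditioning, relative tolerance 1.0e-5:
       the "CG/Bjacobi, PETSc, Rconv=1.E-5" combination in the charts. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);
    KSPSetTolerances(ksp, 1.0e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
    KSPSetFromOptions(ksp);              /* allow -ksp_type/-pc_type overrides */
    KSPSolve(ksp, b, x);

    KSPGetIterationNumber(ksp, &its);
    PetscPrintf(PETSC_COMM_WORLD, "CG/Bjacobi converged in %d iterations\n",
                (int)its);

    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```

The same experiment can be repeated with the other combinations from the list simply by switching the KSP and PC types (or via the command-line options), which is essentially how an iterative-solver comparison of this kind is set up.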

Solver Work in Progress (iForge now)


[Figure: Solution time (sec) vs. core count (16, 32, 64, 128, 256 cores) for Matrix 1M (SPD, N=1.5M, NNZ=63.6M, COND=6.9E4); lower is better. Solvers compared: CG/Bjacobi (PETSc, Rconv=1.E-5), BCGS/Bjacobi (PETSc, Rconv=1.E-5), BCGS/ASM (PETSc, Rconv=1.E-5), CR/Bjacobi (PETSc, Rconv=1.E-5), PCG/ParaSails (Hypre, Rconv=1.E-5), MUMPS (SPD, direct), WSMP (SPD, direct), SuperLU (unsymmetric, direct).]

An order-of-magnitude larger problem


[Figure: Solution time (sec) vs. core count (16, 32, 64, 128, 256, 512 cores) for Matrix 20M (SPD, N=20.05M, NNZ=827.49M, COND≈1.E7); lower is better. Solvers compared: CR/Bjacobi (PETSc, Rconv=1.0E-5), WSMP (SPD, direct), PCG/ParaSails (Hypre, Rconv=1.0E-5), MUMPS (SPD, direct).]

WSMP Performance on iForge (Higher = Better)


[Figure: Sparse factorization performance (TFlop/s) vs. number of threads (128, 256, 512, 768, 960) for the Watson Sparse Matrix Package hybrid (MPI/Pthreads) symmetric solver, N=2.8M, NNZ=107M; X5690/Westmere vs. E5-2670/Sandy Bridge.]

ABAQUS model:
• Number of elements: 2,274,403
• Number of nodes: 12,190,073
• Number of DOFs: >30M

ABAQUS analysis job:
• Cluster: iForge
• Number of cores used: 24-196
• Solver: direct sparse
• Wall clock time: 7 hours -> 1 hour

ISV Implicit FEA Benchmark on iForge

[Figure: Wall clock time (sec) vs. number of cores (up to ~250) for the ABAQUS job described above; lower is better.]

Explicit FEA: LS-Dyna on Blue Waters

NCSA/PSP, the hardware vendor (Cray), the ISV (LSTC), and a PSP partner (NDA) all working together!

Real geometry, loads, and BCs; a highly nonlinear transient dynamic problem with difficult contact conditions

The MPP-Dyna solver is fully ported to and optimized for Cray's Linux Environment, taking full advantage of the Gemini interconnect


LS-Dyna Breakthrough on Blue Waters


[Figure: Wall clock time (hours) vs. CPU cores (512, 1024, 1536, 2048, 3072, 4096, 8192) for a 26.5M-node, 80M-DOF model; lower is better. Series: iForge (MPI), Blue Waters (MPI), Blue Waters (Hybrid).]

Highest known scaling of LS-DYNA to date!

Typical MPP-Dyna Profiling


As the number of cores increases, the communication cost increases rapidly!

[Figure: Computation vs. communication time breakdown at 64 cores and at 512 cores.]

Dyna Work in progress

• Benchmarking even larger real problems

• Memory management is becoming a serious issue for DP (decomposition, distribution, MPMD, etc.)

• The hybrid (MPI/OpenMP) solver uses less memory and less communication (see the sketch after this list)

• Load balance in contact and rigid body algorithms
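To make the hybrid point concrete, here is a minimal MPI+OpenMP sketch in C (a generic illustration, not LS-DYNA code): threads share one copy of the data inside each rank and only the ranks exchange messages, so running fewer ranks with more threads per rank reduces both the memory footprint and the communication volume. It assumes an MPI library and OpenMP support (compile with, e.g., mpicc -fopenmp).

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request threaded MPI: each rank spawns OpenMP threads for the
       compute-heavy loops, and only the (few) ranks exchange messages. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Thread-parallel "element loop" inside each rank (shared memory,
       no MPI traffic): a stand-in for the per-domain work in an
       explicit FEA sweep.                                              */
    const int nelem = 1000000;
    double local_sum = 0.0;
#pragma omp parallel for reduction(+:local_sum)
    for (int e = 0; e < nelem; e++)
        local_sum += 1.0e-6 * e;   /* placeholder for real element work */

    /* Communication happens only between ranks, so fewer ranks (with
       more threads each) means less inter-process traffic.             */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d MPI ranks x %d OpenMP threads, global sum = %g\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```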


Star-CCM+ Breakthrough on Blue Waters

Source: NCSA Private Sector Partner "B" (confidential)
Code/version: Star-CCM+ 7.6.9
Physics: transient, turbulent, single-phase compressible flow
Mesh size: 21.4 million unstructured polyhedral cells
Complexity: very complicated geometry, high-resolution mesh

A complex, real-life production case: highly demanding in terms of both the mesh and the physics involved.

[Figure: CD-adapco Star-CCM+ case from Partner "B": iterations per simulation hour vs. CPU cores (0-2048); higher is better. Series: iForge, Blue Waters. Scaling with InfiniBand levels off at 256 cores.]

Highest known scaling of Star-CCM+ to date…

…and we broke the code!

Future of HPC: GPGPU with OpenACC?


[Figure: Laplace 2D wall clock time (sec) for CPU only (1 OMP thread), CPU only (6 OMP threads), and GPU (OpenACC); lower is better. Systems: Blue Waters XK7 (Interlagos/Kepler) and KIDS (Westmere/Fermi).]

14x speedup!
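The Laplace 2D case behind these numbers is the standard Jacobi-relaxation benchmark. The sketch below is a generic OpenACC version of that kernel in C, a minimal reconstruction rather than the exact source run on Blue Waters or KIDS: the acc data region keeps both grids resident on the accelerator across sweeps, and the two parallel loops offload the stencil update and the copy-back. Compile with an OpenACC-capable compiler (e.g., with -acc); without OpenACC the pragmas are ignored and it runs on the CPU.

```c
#include <math.h>
#include <stdio.h>
#include <string.h>

#define N        1024      /* interior grid is (N-2) x (N-2) */
#define ITER_MAX 1000
#define TOL      1.0e-4

static double A[N][N], Anew[N][N];

int main(void)
{
    /* Boundary conditions: A = 1 on the left edge, 0 elsewhere. */
    memset(A, 0, sizeof(A));
    memset(Anew, 0, sizeof(Anew));
    for (int i = 0; i < N; i++) { A[i][0] = 1.0; Anew[i][0] = 1.0; }

    double err  = 1.0;
    int    iter = 0;

    /* Keep both grids on the accelerator for the whole iteration loop,
       so only the scalar error travels back each sweep.                */
#pragma acc data copy(A) create(Anew)
    while (err > TOL && iter < ITER_MAX) {
        err = 0.0;

        /* Jacobi sweep: each new value is the average of its 4 neighbors. */
#pragma acc parallel loop reduction(max:err)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                Anew[i][j] = 0.25 * (A[i][j + 1] + A[i][j - 1] +
                                     A[i + 1][j] + A[i - 1][j]);
                err = fmax(err, fabs(Anew[i][j] - A[i][j]));
            }

        /* Copy the updated interior back into A for the next sweep. */
#pragma acc parallel loop
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                A[i][j] = Anew[i][j];

        iter++;
    }

    printf("Converged to err=%g after %d iterations\n", err, iter);
    return 0;
}
```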

Inter-Nodal GPU Acceleration on Blue Waters with Abaqus


[Figure: Parallel speedup vs. cores (8, 16, 32, 64, 96) for Abaqus/Standard 6.11 in Cluster Compatibility Mode, S4B benchmark (5.23M DOFs); higher is better. Series: Cray XE6 (CPU only), Cray XK7 (CPU+GPU).]

NDEMC Public-Private Partnership


• US OEMs have gained a competitive edge through the use of high-performance computing (HPC) with modeling, simulation, and analysis (MS&A).

• The US Council on Competitiveness recognized that small and medium-sized enterprises (SMEs) are not able to take advantage of HPC.

• Starting in the fall of 2011, a regional pilot program was launched in the Midwestern supply base.

Objective: Study the fatigue life of a charge air cooler (CAC) under thermal stresses for the NDEMC project.

Description: Three-step sequentially coupled simulation (model size: 15M nodes)

(1) CFD analysis of the turbulent fluid flow through the CAC, coupled with advective heat transfer, provides the thermal BCs for the FEA.

(2) Thermo-mechanical FEA provides the transient thermal stresses in the solid part during the thermal cycle for the fatigue analysis.

(3) The fatigue model uses the history of thermal stresses to estimate the cycle life at critical points.

NDEMC: Multiphysics Simulation of Charge Air Cooler (CAC)

Special Thanks
• Prof. Martin Ostoja-Starzewski (MechSE, UIUC)
• Dr. Ahmed Taha (NCSA)
• CRAY
• 2 PSP partner companies (NDA)
• NDEMC
• LSTC
• IBM/Watson (Dr. Anshul Gupta)
• Simulia, Dassault Systèmes
• Blue Waters Team
