
© 2013 ANSYS, Inc. May 5, 2015

Understanding Hardware Selection to Speedup Your FEA Simulations

Wim Slagter, PhD

ANSYS, Inc.


Agenda

• Why Talk About Hardware

• HPC Terminology

• ANSYS Workflow

• Hardware Considerations

• Additional Resources


Most Users Constrained by Hardware

Source: HPC Usage survey with over 1,800 ANSYS respondents


Problem Statement

I am not achieving the performance and throughput I was expecting from my hardware & software

Image courtesy of Intel Corporation


Building A Balanced System Is The Key To Improving Your Experience

If your system is slow, so are your engineers and analysts.

System components to balance:

• Processors

• Memory

• Storage

• Networks


What Hardware Configuration to Select?

The right combination of hardware and software leads to maximum efficiency

SMP vs. DMP

HDD vs. SSD

Interconnects?

Clusters?

GPUs? CPUs?


HPC Hardware Terminology

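[Diagram: Machine 1 (or Node 1) through Machine N (or Node N). Each machine contains one or more processors (Processor 1, or Socket 1, and so on) and may contain a GPU. The machines are linked by an interconnect (GigE or InfiniBand).]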





Shared Memory Parallel

• Shared memory parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.

• OpenMP is the industry standard for this.



Distributed Memory Parallel

• Distributed memory parallel processing (DMP) assumes that physical memory for each process is separate from all other processes.

• Parallel processing on such a system requires some form of message passing software to exchange data between the cores.

• MPI (Message Passing Interface) is the industry standard for this.
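The difference between the two models can be sketched with a toy example. This is only an illustration using Python threads and a queue from the standard library; it is not OpenMP or MPI, and the function names `smp_sum` and `dmp_sum` are hypothetical. In the shared-memory version every worker reads and writes the same data; in the message-passing version each worker holds a private copy of its slice and reports back only by sending a message.

```python
import queue
import threading


def smp_sum(data, workers):
    """Shared-memory style: every thread reads the same list and writes
    its partial result into one shared, globally addressable array."""
    partial = [0] * workers

    def work(rank):
        partial[rank] = sum(data[rank::workers])

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def dmp_sum(data, workers):
    """Message-passing style: each worker owns a private copy of its slice
    and returns its partial result only by sending a message."""
    inbox = queue.Queue()

    def work(local_chunk):
        inbox.put(sum(local_chunk))

    threads = [
        threading.Thread(target=work, args=(list(data[r::workers]),))
        for r in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Collect exactly one message per worker.
    return sum(inbox.get() for _ in range(workers))


values = list(range(1000))
print(smp_sum(values, 4), dmp_sum(values, 4))
```

Both functions return the same total; what differs is how the partial results travel, which is why the interconnect matters for DMP runs that span several machines.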




Typical HPC Growth Path

• Desktop User

• Workstation and/or Server Users

• Cluster Users

• Cloud Solution


• Ideal for

– remote users submitting jobs from a Windows machine to a Linux cluster or local users submitting jobs to a Linux cluster

– users that do not have enough power (memory or graphics) on their local workstation to build large meshes or view graphics.

• ANSYS 15.0 supports the following remote visualization applications

– NICE Desktop Cloud Visualization (DCV) 2012.2

• Linux server + Linux/Windows client

– OpenText Exceed onDemand 8 SP3


– RealVNC Enterprise Edition 5.0.4 (with VirtualGL)


– (on Windows cluster: Microsoft Remote Desktop)

• Remote visualization servers require:

– GPU capable video cards

– large amounts of RAM accessible for multiple user availability when running ANSYS applications and pre/post processing

Remote Visualization


Desktop Server Cluster (with 3rd party scheduler)

The Remote Solve Manager (RSM) is a GUI-based, job queuing system that distributes simulation tasks to (shared) computing resources

RSM enables tasks to be

• Run in background mode on the local machine

• Sent to a remote compute machine

• Broken into a series of jobs for parallel processing across a variety of computers

RSM as a scheduler

• Submits to RSM itself.

• Unit recognition: jobs (e.g. a run of a solver such as CFX, Fluent or Mechanical)

RSM as a transport mechanism

• Submits through RSM to a high-level scheduler such as LSF, PBS Pro, Windows HPC Server 2008 R2 / 2012, and Univa Grid Engine (at R15.0).

• Unit recognition: cores

ANSYS Remote Solve Manager (RSM)


Submission from a client to a centralized (shared) compute resource, allowing

• background queuing on a centralized machine

• multiple users to share a common, usually large memory/fast machine (compared to client machine)

Submission from a client to a centralized (shared) compute resource with a job scheduler, allowing

• background queuing on a centralized machine that submits to a job scheduler (e.g. LSF)

• multiple users to run multi-node jobs on shared compute resources

Submission from a client to multiple (shared) compute resources, allowing

• background queuing on a centralized machine that submits to other machines (compute servers)

• multiple users to share user workstations (often at night) using the RSM “Limit Times for Job Submission” feature

RSM Usage Scenarios


• Improved robustness and scalability

• Added support for Univa Grid Engine

• Added support for Mechanical/MAPDL restart

• Non-root users on Linux can now use RSM wizard

• Enriched support for RSM customization

• Added component override for design point update

• Improved efficiency of Design Point updates…

Parametric Optimization of Intake Manifold

Design objectives:

• Equal fresh and exhaust gas mass flow distribution to each cylinder

• Minimize the overall pressure drop

Input parameters:

• Radii of 3 fillets near inlet (8 design points)

~5.0x speed-up over sequential execution

[Figure: initial and optimized intake manifold designs.]

RSM Enhancements at R15.0


Guidelines:

• Know your hardware lifecycle.

• Have a goal in mind for what you want to achieve.

• Use licensing productively.

• Use ANSYS-provided processes effectively.


Understanding the effect of clock speed

Generally, ANSYS applications scale with clock frequency

• Cost/performance argues for high clock (but maybe not top bin)

[Figure: Impact of CPU Clock on Application Performance. Improvement due to clock (higher is better), relative to a CPU clock of 2.66 GHz, for the ANSYS/FLUENT models eddy_417K, aircraft_2M, turbo_500K, sedan_4M and truck_14M at 2.66 GHz, 2.93 GHz and 3.47 GHz, plotted next to the clock ratio. Processor: Xeon X5600 Series; Hyper-Threading: OFF; Turbo: ON; active cores: 12/node; memory speed: 1333 MHz.]

ANSYS DMP benchmarks (8 core)

• Clock effect is highest for sparse solver

Using a higher clock speed is always helpful to realize productivity gains
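The "Clock Ratio" reference in the chart is simple arithmetic on the three clock speeds named on the slide. A minimal sketch follows; the function name `clock_ratio` is hypothetical, and the ratio is only the ideal upper bound for a purely clock-bound code, not a measured result.

```python
BASELINE_GHZ = 2.66  # baseline clock used on the slide


def clock_ratio(clock_ghz, baseline_ghz=BASELINE_GHZ):
    """Ideal improvement if runtime scaled perfectly with clock frequency."""
    return clock_ghz / baseline_ghz


for ghz in (2.66, 2.93, 3.47):
    print(f"{ghz:.2f} GHz -> clock ratio {clock_ratio(ghz):.2f}")
```

A measured improvement close to this ratio suggests the application is scaling with clock frequency; a lower one points to other limits such as memory bandwidth or I/O.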


Generation to Generation - ANSYS Mechanical

Current processors are up to 1.98X faster than processors that are 3 years old

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Results are estimated by Intel using the SPEC benchmark software cited and are provided for informational purposes only. Copyright © 2013, Intel Corporation.

Configuration: ANSYS Mechanical: Xeon 1280v3 (16 GB, 4x SSD RAID 0), Xeon 1660v2 (32 GB, 4x SSD RAID 0), Xeon E5-2687W v2 (128 GB, 4x SSD RAID 0). Intel internal measurements as of August 2013. Refer to backup for additional details.

* Other names and brands may be claimed as the property of others.


Which Intel Processor Might Meet Your Needs - ANSYS Mechanical

Intel® Xeon® processor E5-2687W is up to 3.77X faster than an entry-level workstation with a single processor


Understanding the effect of memory bandwidth - Is 24 Cores Equal to 24 Cores?

Core counts are written as nodes x (sockets per node x cores per socket):

• Xeon X5570: 3 x (2 x 4) = 24 cores

• Xeon X5670: 2 x (2 x 6) = 24 cores


Consider memory per core!
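The two 24-core layouts can be compared with a few lines of arithmetic. This is a sketch under a stated assumption: the slide does not give the number of memory channels, so the default of 3 channels per socket is an assumption about these processors, and the class name `Cluster` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    nodes: int
    sockets_per_node: int
    cores_per_socket: int
    channels_per_socket: int = 3  # ASSUMPTION: not stated on the slide

    @property
    def total_cores(self):
        return self.nodes * self.sockets_per_node * self.cores_per_socket

    @property
    def channels_per_core(self):
        # Fewer channels per core means each core gets a smaller share
        # of the socket's memory bandwidth.
        return self.channels_per_socket / self.cores_per_socket


x5570 = Cluster(nodes=3, sockets_per_node=2, cores_per_socket=4)
x5670 = Cluster(nodes=2, sockets_per_node=2, cores_per_socket=6)

for name, c in (("X5570", x5570), ("X5670", x5670)):
    print(f"{name}: {c.total_cores} cores, "
          f"{c.channels_per_core:.2f} memory channels per core")
```

Under that assumption both systems have 24 cores, but the six-core layout shares each socket's memory channels among more cores, which is the point behind the slide's question.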



• Xeon X5570: 2 x (2 x 4) = 16 cores

• Xeon X5670: 2 x (2 x 4) = 16 cores

Using less cores per node can be



Understanding the effect of memory bandwidth- ANSYS Mechanical

Consider memory per core!

http://www.hp.com/go/wsansys



Understanding the effect of memory speed

• We can see here the effect of memory speed.

• This has implications on how you build your hardware.

• Some processor types have slower memory speeds by default.

• On other processors, filling the memory channels non-optimally can slow the memory speed.

Using higher memory speed can be



Turbo Boost (Intel) / Turbo Core (AMD)- ANSYS Mechanical

• Effect of Turbo Boost on the SMP benchmarks using 1, 2, 4 and 8 out of 8 physical cores of 1 node

• Turbo Boost is most effective at the lower core counts

[Figure: Impact of Turbo Boost on speed versus number of cores.]

Using Turbo Boost / Core can be


- particularly for lower core counts


Hyper-threading

Hyper-threading is NOT recommended


Recap: Processor Hardware Tips

• Faster cores mean a faster solution

• Faster memory means a faster solution

• Memory bandwidth is an important factor for (linear) scalability

• Turbo Boost/Turbo Core modes do give some benefit, especially at low core counts per node

• In general, Hyper-Threading should not be used because of licensing implications


• Need fast interconnects to feed fast processors

– Two main characteristics for each interconnect: latency and bandwidth

– Distributed ANSYS is highly bandwidth bound

+--------- D I S T R I B U T E D A N S Y S S T A T I S T I C S ------------+

Release: 14.5 Build: UP20120802 Platform: LINUX x64

Date Run: 08/09/2012 Time: 23:07

Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

Total number of cores available : 32

Number of physical cores available : 32

Number of cores requested : 4 (Distributed Memory Parallel)

MPI Type: INTELMPI

Core Machine Name Working Directory

----------------------------------------------------

0 hpclnxsmc00 /data1/ansyswork




Latency time from master to core 1 = 1.171 microseconds



Communication speed from master to core 1 = 7934.49 MB/sec Same machine

Communication speed from master to core 2 = 3011.09 MB/sec QDR Infiniband

Communication speed from master to core 3 = 3235.00 MB/sec QDR Infiniband
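The communication-speed lines in this statistics block are easy to pull out with a script when comparing machines. The sketch below is hypothetical tooling, not part of ANSYS; the regular expression is written against the line format shown above and may need adjusting for other releases.

```python
import re

# Lines copied from the statistics block above (slide annotations removed).
SAMPLE = """\
Latency time from master to core 1 = 1.171 microseconds
Communication speed from master to core 1 = 7934.49 MB/sec
Communication speed from master to core 2 = 3011.09 MB/sec
Communication speed from master to core 3 = 3235.00 MB/sec
"""

SPEED_RE = re.compile(
    r"Communication speed from master to core\s+(\d+)\s*=\s*([\d.]+)\s*MB/sec"
)


def parse_comm_speeds(text):
    """Return {core number: communication speed in MB/sec}."""
    return {int(core): float(speed) for core, speed in SPEED_RE.findall(text)}


def slowest_link(speeds):
    """Return (core, MB/sec) for the slowest master-to-core link."""
    core = min(speeds, key=speeds.get)
    return core, speeds[core]


speeds = parse_comm_speeds(SAMPLE)
print(speeds)
print("slowest link:", slowest_link(speeds))
```

In this sample the link to a core on the same machine is much faster than the links that cross the interconnect, which is the comparison the slide annotates.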

Understanding the effect of the interconnect


Understanding the effect of the interconnect- ANSYS Mechanical

V13sp-5 model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static nonlinear, one iteration, direct sparse solver, Linux cluster (8 cores per node).

[Figure: Interconnect Performance. Rating (runs/day, 0 to 60) at 8, 16, 32, 64 and 128 cores for Gigabit Ethernet versus DDR InfiniBand.]



3 million DOF using the direct sparse solver; SOLID95 elements, the worst case for a direct solver.

[Figure: TrueScale versus GigE, in-core memory. Wall time (secs, 0 to 6000) at 16, 32, 64 and 128 cores.]

Using faster interconnects can be


- particularly at higher core/node counts


GigE (Gigabit Ethernet)

– 1 Gbit/s (about 100 MB/s)

10 GigE


Myrinet (Myricom, Inc)


– Myri 10G – 10 Gbits/sec (4th generation Myrinet)

Infiniband (many vendors/speeds)

– SDR/DDR/QDR

– 1x, 4x, 12x

– http://en.wikipedia.org/wiki/List_of_device_bandwidths

Not recommended!!

Bare minimum!!


RECOMMENDATION: over 1000 MB/s, especially when running on more than 4 nodes
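Line rates are quoted in Gbit/s while the recommendation is in MB/s, so a quick conversion helps when reading vendor specifications. This is a minimal sketch with hypothetical function names; it computes the theoretical peak in decimal units, and sustained throughput is lower in practice (the slide quotes about 100 MB/s for Gigabit Ethernet).

```python
RECOMMENDED_MIN_MB_S = 1000.0  # threshold taken from the slide


def gbit_to_mbyte_per_s(gbit_per_s):
    """Theoretical peak: 1 Gbit/s = 1000 Mbit/s = 125 MB/s."""
    return gbit_per_s * 1000.0 / 8.0


def meets_recommendation(gbit_per_s):
    """True if the theoretical peak exceeds the recommended 1000 MB/s."""
    return gbit_to_mbyte_per_s(gbit_per_s) > RECOMMENDED_MIN_MB_S


for name, rate in (("GigE", 1), ("10 GigE / Myri 10G", 10)):
    print(f"{name}: {gbit_to_mbyte_per_s(rate):.0f} MB/s peak, "
          f"meets recommendation: {meets_recommendation(rate)}")
```

Because this uses peak figures, an interconnect that only just clears the threshold on paper may still fall short in a real run.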


Recap: Interconnect Tips

10GigE and InfiniBand are recommended for HPC clusters.

• For large clusters, only InfiniBand is currently recommended.

• QDR should be more than adequate for small to medium clusters; use FDR for large clusters.

With more than 1 node, performance decreases when using GigE.

• Mechanical users should not use GigE at all if their jobs span more than one node.


Understanding the effect of I/O - ANSYS Mechanical

SP-5 (in-core) R14.5 Benchmark Results: rating (jobs/day)

#Machine x #Core    1x1     1x2     1x4       1x8       1x16
Memory              29 GB   33 GB   35.6 GB   40.8 GB   47.8 GB
Series 1            89      145     180       301       419
Series 2            89      146     180       275       384
Series 3            88      144     180       283       368
Series 4            88      124     118       95        52

Series are listed in the order their values appear on the slide. Legend entries on the slide: 4x SSD RAID 0 (SATA 3 Gb/s), SSD (SATA 6 Gb/s), HD 7.2K RPM (SATA 6 Gb/s).
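Ratings in jobs/day convert directly into speedup and parallel efficiency. The sketch below uses the first and last rating series from this slide (89 to 419 and 88 to 52 jobs/day); the function name `scaling_table` is hypothetical, and which storage configuration each series belongs to is not asserted here.

```python
def scaling_table(cores, ratings):
    """Return (cores, speedup, efficiency) rows relative to the first entry."""
    base_cores, base_rating = cores[0], ratings[0]
    rows = []
    for n, rating in zip(cores, ratings):
        speedup = rating / base_rating            # higher rating = faster
        efficiency = speedup / (n / base_cores)   # 1.0 = perfect scaling
        rows.append((n, speedup, efficiency))
    return rows


CORES = [1, 2, 4, 8, 16]
FIRST_SERIES = [89, 145, 180, 301, 419]   # jobs/day, first series on the slide
LAST_SERIES = [88, 124, 118, 95, 52]      # jobs/day, last series on the slide

for label, series in (("first", FIRST_SERIES), ("last", LAST_SERIES)):
    for n, speedup, eff in scaling_table(CORES, series):
        print(f"{label} series, {n:2d} cores: "
              f"speedup {speedup:.2f}, efficiency {eff:.2f}")
```

A speedup below 1.0, as in the last series at 16 cores, means adding cores made the job slower, which is the signature of a run starved by something other than compute.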


Is Your Hardware Ready for HPC?- ANSYS Mechanical

[Figure: I/O throughput ("I/O [Mb/s]", 100 to 1200) versus memory ("RAM [Gb]", 8 to 128), with storage levels 1x SAS, 2x SAS, 1x SSD and 2x SSD, for model sizes of 0.2 Mdof, 2 Mdof, 4 Mdof and > 6 Mdof.]


DMP Outperforming SMP

6 million degrees of freedom; plasticity, contact, bolt pretension; 4 load steps


DMP: Good Performance at High Core Counts

[Figures: performance versus number of cores for two models.]

• 10.7 million degrees of freedom; static, linear, structural; 1 load step

• 1 million degrees of freedom; harmonic, linear, structural; 4 frequencies

• Intel Xeon E5-2690 processors (2.9 GHz, 16 cores total)

• 128 GB of RAM


DMP Enabling Scalability at High Core Counts

Minimum time to solution is more important than scaling

[Figure: Solution Scalability, ANSYS Mechanical 14.5. Speedup (0 to 25) versus number of cores (0 to 64).]

V14sp-5 model: turbine geometry, 2.1 million DOF, static nonlinear analysis, 1 load step, 7 substeps, 25 equilibrium iterations; 8-node Linux cluster (with 8 cores per node).


Faster Performance at Higher Core Counts

Improved scaling at 8 cores by an enhanced domain decomposition method

Speedup over R14.5 (values in the order shown on the slide):

Engine (9 MDOF)      1.3x
Stent (520 KDOF)     1.7x
Clutch (160 KDOF)    2.7x
Bracket (45 KDOF)    2.4x

8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)

[Two further charts of speedup over R14.5 (0 to 6). Labeled values on the first: 1.6x, 1.8x, 3.8x, 4.0x. Labeled values on the second: 1.8x, 2.2x, 3.9x, 5.0x.]


Continually improving Core Solver Rating to 80 cores

Courtesy of HP





HPC & Solver Technology Improvements

• Improved Scalability of Distributed solver at higher core counts

• NEW Subspace eigen solver supports Shared and Distributed Parallel technology

• NEW MSUP Harmonic method for unsymmetric systems, e.g. vibro-acoustics

Coupled Acoustic, 1.2 M DOF, Full Harmonic Response

2.09 MDOFs, first 20 modes


GPUs are accelerators and can significantly speed up your simulations

• GPUs work hand in hand with CPUs

Most ANSYS GPU acceleration is user-transparent

• Only requirement is to inform ANSYS of how many GPUs to use

Schematic of a CPU with an attached GPU accelerator

• CPU begins/ends job, GPU manages heavy computations

Some Basics

ANSYS Software on NVIDIA GPUs


GPU Accelerator Capability- ANSYS Mechanical

Supports majority of ANSYS structural mechanics solvers:

• Covers both sparse direct and PCG iterative solvers

• Only a few minor limitations

Ease of use:

• Requires at least one supported GPU card to be installed

• No rebuild, no additional installation steps

Performance:

• Offers significantly faster time to solution

• Should never slow down your simulation

V14sp-5 Model


Influence of GPU Accelerator on Speedup

ANSYS Mechanical Model – Impeller

• Impeller geometry of ~2M DOF, solid FEs

• Normal modes analysis using cyclic symmetry

• ANSYS Mechanical SMP and Block-Lanczos solver

• 4 cores + GPU = 2.4x speedup vs. 4 cores

ANSYS Mechanical Model – Speaker

• Speaker geometry of ~0.7M DOF, solid FEs

• Vibroacoustic harmonic analysis for one frequency

• ANSYS Mechanical distributed sparse solver

• 4 cores + GPU = 2.7x speedup vs. 4 cores

[Figures: speedup charts for both models. Labeled values: 5.9x, 3.7x, 2.4x.]


NVIDIA-GPU Solution Fit for ANSYS Mechanical

GPUs accelerate the solver part of analysis, consequently problems with high solver workloads benefit the most from GPUs

• Characterized by both high DOF and high factorization requirements

• Models with solid elements (such as castings) and more than 500K DOF see good speedups

Performance is better in DMP mode than in SMP mode

GPU and system memories both play important roles in performance

• Sparse solver:

– Bulkier and/or higher-order FE models are good and will be accelerated

– If the model exceeds 5M DOF, then either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB memory (Tesla K40 or Quadro K6000).

• PCG/JCG solver:

– Memory saving (MSAVE) option should be turned off for enabling GPUs

– Models with lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
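The sparse-solver sizing guidance above can be written down as a small lookup. This is only a sketch of the two DOF thresholds named on the slide (500K and 5M); the function name and the returned wording are hypothetical, and real GPU memory needs depend on the model, not on DOF count alone.

```python
GOOD_SPEEDUP_MIN_DOF = 500_000     # ">500K DOF" from the slide
SMALL_GPU_MAX_DOF = 5_000_000      # "exceeds 5M DOF" from the slide

BELOW_RANGE = "below the >500K DOF range cited for good speedups"
GOOD_FIT = "within the range cited for good speedups"
MORE_MEMORY = "add a second 5-6 GB GPU or use a single 12 GB GPU"


def sparse_gpu_hint(dof):
    """Map a solid-element model's DOF count to the slide's rule of thumb."""
    if dof <= GOOD_SPEEDUP_MIN_DOF:
        return BELOW_RANGE
    if dof <= SMALL_GPU_MAX_DOF:
        return GOOD_FIT
    return MORE_MEMORY


for dof in (160_000, 2_100_000, 9_000_000):
    print(f"{dof:>9,d} DOF: {sparse_gpu_hint(dof)}")
```

The three sample sizes are taken from models mentioned elsewhere in this deck (clutch, turbine and engine).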


GPU Achievements: ANSYS Mechanical 15.0 Supporting Newest GPUs

ANSYS Mechanical jobs/day (higher is better)

Simulation productivity (with an HPC license)
2 CPU cores                 93
2 CPU cores + Tesla K20     324   (3.5X)
2 CPU cores + Tesla K40     363   (3.9X)

Simulation productivity (with an HPC Pack)
CPU cores only              275
CPU cores + Tesla K20       576   (2.1X)
CPU cores + Tesla K40       600   (2.2X)

V14sp-5 model: turbine geometry, 2.1 million DOF, SOLID187 elements, one iteration, sparse direct solver.

Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU and a Tesla K40 GPU with boost clocks.

ANSYS Mechanical jobs/day (higher is better)

Simulation productivity (with an HPC license)
CPU cores only              59
CPU cores + Tesla K40       165   (2.8X)

Simulation productivity (with an HPC Pack)
8 CPU cores                 180
CPU cores + Tesla K40       270   (1.5X)

V14sp-6 model: 4.9 million DOF, one iteration.

Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K40 GPU with boost clocks.



Distributed ANSYS Mechanical 15.0 on Windows workstation with 16 Intel Xeon E5-2670 cores @ 2.6 GHz; 128 GB RAM; SSD.


GPUs can offer significantly faster time to solution

• Lower core counts favor a single GPU

• Higher core counts favor multiple GPUs

Courtesy of HP



Intel Xeon Phi coprocessors are now supported

• Use ‘-acc intel’ to activate this capability

• Xeon Phi models 7120, 5110, 3120 are supported

• Multiple cards

Note:

• Supported by sparse solver (symmetric matrices only)

• Linux only (no Windows support yet)

• Only SMP is supported

GPU Achievements: ANSYS Mechanical 15.0 Supporting Xeon Phi


Significant speedups can be achieved with a Xeon Phi card

• Shared Memory Sparse Solver on Linux

[Figure: Xeon Phi Acceleration (SMP). Speedup (0 to 8) for CPU cores only versus CPU cores + Xeon Phi. Labeled values: 3.3x (1 core), 4.3x (2 cores), 5.1x (4 cores), 6.0x (8 cores), 6.8x (16 cores).]

V14sp-5 model: turbine geometry, 2.1 million DOF, SOLID187 elements, one iteration.

Linux workstation (16 Intel Xeon E5-2670 cores @ 2.6 GHz, 1 7120A Xeon Phi, 64 GB RAM).


GPU Achievements: ANSYS 15.0 License Scheme for GPUs – NEW!

Licensing examples:

• 1 x ANSYS HPC Pack: total 8 HPC tasks (4 GPUs max)

• 2 x ANSYS HPC Pack: total 32 HPC tasks (16 GPUs max)

Examples of valid configurations:

• 1 x ANSYS HPC Pack: 6 CPU cores + 2 GPUs, or 4 CPU cores + 4 GPUs

• 2 x ANSYS HPC Pack: 24 CPU cores + 8 GPUs (total use of 2 compute nodes)

(Applies to all schemes: HPC, HPC Pack, HPC Workgroup, HPC Enterprise)


ANSYS 15.0 License Scheme for GPUs - Implication of New HPC Pack Licensing

• With R14.5, you could run up to 8 CPU cores and 1 GPU.

• With R15.0, you can run up to 7 CPU cores and 1 GPU, or 6 CPU cores and 2 GPUs, etc.
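The R15.0 examples quoted in this deck (7 + 1, 6 + 2 and 4 + 4 on one HPC Pack; 24 + 8 on two) are consistent with each CPU core and each GPU consuming one HPC task. The sketch below encodes only the two pack sizes given in the deck; the one-task-per-device rule is an inference from those examples, and the function name is hypothetical.

```python
# (total HPC tasks, max GPUs) per number of HPC Packs, as quoted in the deck.
PACK_LIMITS = {1: (8, 4), 2: (32, 16)}


def is_valid_config(packs, cpu_cores, gpus):
    """Check a CPU/GPU mix against the task and GPU limits.

    ASSUMPTION: every CPU core and every GPU consumes one HPC task.
    """
    if packs not in PACK_LIMITS:
        raise ValueError("only 1 or 2 HPC Packs are covered by this sketch")
    tasks, max_gpus = PACK_LIMITS[packs]
    return cpu_cores >= 1 and 0 <= gpus <= max_gpus and cpu_cores + gpus <= tasks


for packs, cores, gpus in ((1, 7, 1), (1, 6, 2), (1, 4, 4), (2, 24, 8), (1, 8, 1)):
    print(f"{packs} pack(s), {cores} cores + {gpus} GPUs: "
          f"{is_valid_config(packs, cores, gpus)}")
```

The last case, 8 cores plus 1 GPU on a single pack, is the R14.5 combination that no longer fits under the R15.0 scheme as described above.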

Results Courtesy of MicroConsult Engineering, GmbH

Leda

BGA


Maximizing Performance – Putting it Together

The right combination of hardware and software leads to maximum efficiency

HDD vs. SSD

SMP vs. DMP

Interconnects?

Clusters?

GPUs? CPUs?


#1 Rule: avoid waiting for I/O to complete

• Always check if the job is I/O bound or compute bound

– Check the output file for CPU and Elapsed times

• When Elapsed time >> main thread CPU time: the job is I/O bound

– Consider adding more RAM or a faster hard drive configuration

• When Elapsed time ≈ main thread CPU time: the job is compute bound

– Consider moving the simulation to a machine with newer, faster processors

– Consider using Distributed ANSYS (DMP) instead of SMP

– Consider running on more CPU cores or possibly using GPU(s)

Total CPU time for main thread : 159.8 seconds

. . .

. . .

Elapsed Time (sec) = 398.000 Date = 03/21/2013
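The check described on this slide can be automated against the two output-file lines shown above. The sketch below is hypothetical tooling: the regular expressions follow the line formats as printed here, and the 1.5x cutoff standing in for ">>" is an assumption, not an ANSYS-defined threshold.

```python
import re

# The two lines from the solver output shown above.
SAMPLE = """\
Total CPU time for main thread : 159.8 seconds
Elapsed Time (sec) = 398.000 Date = 03/21/2013
"""

CPU_RE = re.compile(r"Total CPU time for main thread\s*:\s*([\d.]+)\s*seconds")
ELAPSED_RE = re.compile(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)")


def classify(output_text, io_bound_ratio=1.5):
    """Return (label, elapsed/CPU ratio).

    ASSUMPTION: a ratio at or above io_bound_ratio counts as
    "Elapsed time >> CPU time"; the cutoff is illustrative only.
    """
    cpu_match = CPU_RE.search(output_text)
    elapsed_match = ELAPSED_RE.search(output_text)
    if cpu_match is None or elapsed_match is None:
        raise ValueError("CPU or elapsed time not found in output")
    ratio = float(elapsed_match.group(1)) / float(cpu_match.group(1))
    label = "I/O bound" if ratio >= io_bound_ratio else "compute bound"
    return label, ratio


label, ratio = classify(SAMPLE)
print(f"{label} (elapsed/CPU = {ratio:.2f})")
```

For the sample, elapsed time is about two and a half times the main-thread CPU time, so under this cutoff the job is classified as I/O bound.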

Maximizing Performance – ANSYS Mechanical



How to improve an I/O bound simulation

– First consider adding more RAM

• Always the best option for optimal performance

• Allows the operating system to cache file data in memory

– Next consider improving the I/O configuration

• Need fast hard drives to feed fast processors

– Consider SSDs: higher bandwidths and extremely low seek times

– Consider RAID configurations

RAID 0 – for speed

RAID 1, 5 – for redundancy

RAID 10 – for speed and redundancy
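The trade-off between these RAID levels shows up in usable capacity. This is a minimal sketch of the standard capacity formulas for identical disks; the function name is hypothetical, RAID 1 is modeled as all disks mirroring one another, and the 256 GB disk size is an example value, not taken from the slides.

```python
def usable_capacity_gb(level, disks, size_gb):
    """Usable capacity for an array of identical disks."""
    if level == "0":                      # striping, no redundancy
        if disks < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        return disks * size_gb
    if level == "1":                      # every disk holds the same data
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return size_gb
    if level == "5":                      # one disk's worth of parity
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * size_gb
    if level == "10":                     # striped mirrored pairs
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, 4 or more")
        return (disks // 2) * size_gb
    raise ValueError(f"unsupported RAID level: {level}")


for level, disks in (("0", 4), ("1", 2), ("5", 4), ("10", 4)):
    print(f"RAID {level}, {disks} x 256 GB: "
          f"{usable_capacity_gb(level, disks, 256)} GB usable")
```

RAID 0 keeps all of the raw capacity and bandwidth but loses the array if any disk fails, which is usually acceptable for solver scratch space and not for results that must be kept.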


Example of an I/O bound simulation

[Figure: Benefits of SSD and RAM. Relative speedup (0 to 7) for 2 cores with HDD, 8 cores with HDD and 8 cores with SSD, each with 16 GB RAM and with 128 GB RAM. Labeled values: 0.8x, 2.9x, 2.7x, 5.9x, 5.9x.]

• Adding RAM gives the biggest gains and allows good scaling

• Lack of RAM and a slow HDD ruin scaling

• A single SSD helps allow some scaling; not as helpful as RAM, but cheaper

Configuration:

• 2.1 million DOF

• Nonlinear static analysis

• Direct sparse solver (DSPARSE)

• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)

• One 10k rpm HDD, one SSD

• Windows 7



How to improve a compute bound simulation

– First consider using newer, faster processors

• New CPU architecture and faster clock speeds always help

– Next consider using parallel processing

• DMP virtually always recommended over SMP

• More computations performed in parallel with DMP

• Significantly faster speedups achieved using DMP

• DMP can take advantage of all resources on a cluster

• Whole new class of problems can be solved!!

– Last consider using GPU acceleration

• Can help accelerate critical, time-consuming computations


Example of a compute bound simulation


[Figure: Benefits of DMP and GPU. Relative speedup (0 to 12) for 2 cores, 8 cores and 8 cores with GPU, on Xeon X5675 and Xeon E5-2670. Labeled values: 1.8x, 4.0x, 11.0x.]

• Using newer Xeons gives a big gain

• Using 8 cores gives faster performance

• Maximum performance found by adding a GPU

Configuration:

• 2.1 million DOF

• Nonlinear static analysis

• Direct sparse solver (DSPARSE)

• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)

• 128 GB RAM

• 1 Tesla K20c

• Windows 7


Balanced System for Overall Optimum Performance


Balanced Performance: relative speedup, I/O bound case

2 cores                  1.0x
8 cores                  2.7x
8 cores + GPU            5.2x
8 cores + GPU + SSD      12.5x

Configuration:

• 2.1 million DOF

• Nonlinear static analysis

• Direct sparse solver (DSPARSE)

• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)

• 16 GB RAM

• SSD and SATA disks

• 1 Tesla K20c

• Windows 7


Balanced System for Overall Optimum Performance


Balanced Performance: relative speedup

                         I/O bound    Compute bound
2 cores                  1.0x         5.7x
8 cores                  2.7x         12.0x
8 cores + GPU            5.2x         24.8x
8 cores + GPU + SSD      12.5x        27.3x

Configuration:

• 2.1 million DOF

• Nonlinear static analysis

• Direct sparse solver (DSPARSE)

• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)

• 128 GB RAM

• SSD and SATA disks

• 1 Tesla K20c

• Windows 7


Additional Resources

Watch recorded webinars by clicking below:

• Understanding Hardware Selection for ANSYS 15.0

• How to Speed Up ANSYS 15.0 with GPUs

• Intel Technologies Enabling Faster, More Effective Simulation

• Why HPC for ANSYS Mechanical and CFD

Click on webinars related to HPC/IT for more and upcoming ones!

http://ansys.com/Resource+Library/Webinars/Understanding+Hardware+Selection+for+ANSYS+15.0

http://www.ansys.com/Resource+Library/Webinars/How+to+Speed+Up+ANSYS+15.0+with+GPUs

http://www.ansys.com/Resource+Library/Webinars/Intel+Technologies:+Enabling+Faster,+More+Effective+Simulation+-+Webinar

http://www.ansys.com/Resource+Library/Webinars/Why+High-Performance+Computing+for+ANSYS+Mechanical+and+CFD

http://www.ansys.com/Support/Platform+Support/IT+Solutions+for+ANSYS+Webcast+Series


Additional Resources - ANSYS IT Webcast Series

On-demand webinars:

• Understanding Hardware Selection for ANSYS 15.0

• How to Speed Up ANSYS 15.0 with GPUs

• Cloud Hosting of ANSYS: Gompute On-Demand Solutions

• Simplified HPC Clusters for ANSYS Users

• Intel Technologies Enabling Faster, More Effective Simulation

• Accelerating Time-to-Results with Parallel I/O

• Extreme Scalability for High-Fidelity CFD Simulations

• Methodology and Tools for Compute Performance at Any Scale

• Understanding Hardware Selection for Structural Mechanics

• Optimizing Remote Access to Simulation

• Scalable Storage and Data Management for Engineering Simulation





ANSYS Platform Support

• http://www.ansys.com/Support/Platform+Support

– Platform Support Policies

– Supported Platforms

– Supported Hardware

– Tested Systems

– ANSYS Benchmarks



ANSYS Partner Solutions

– http://www.ansys.com/About+ANSYS/Partner+Programs/HPC+Partners

• Reference configurations

• Performance data

• White papers

• Sales contact points

Performance Data

– http://www.ansys.com/benchmarks



The Manual

• Sections on best practices and parallel processing for various solvers

• Performance Guide for Mechanical

• Installation walkthroughs for installing the products, parallel processing, licensing and RSM (Remote Solve Manager)

ANSYS Advantage

• Online Magazine


• Connect with Me

– [email protected]

• Connect with ANSYS, Inc.

– LinkedIn ANSYSInc

– Twitter @ANSYS_Inc

– Facebook ANSYSInc

• Follow our Blog

– ansys-blog.com

Thank You!
