TRANSCRIPT
© 2013 ANSYS, Inc. May 5, 2015
Understanding Hardware Selection to Speedup Your FEA Simulations
Wim Slagter, PhD
ANSYS, Inc.
Agenda
• Why Talk About Hardware
• HPC Terminology
• ANSYS Workflow
• Hardware Considerations
• Additional Resources
Most Users Constrained by Hardware
Source: HPC Usage survey with over 1,800 ANSYS respondents
Problem Statement
I am not achieving the performance and throughput I was
expecting from my hardware & software
Image courtesy of Intel Corporation
Building A Balanced System Is The Key To Improving Your Experience
If Your System Is Slow So Are Your Engineers & Analysts
• Processors
• Memory
• Storage
• Networks
What Hardware Configuration to Select?
The right combination of hardware and software
leads to maximum efficiency
SMP vs. DMP
HDD vs. SSD
Interconnects?
Clusters?
GPUs?
CPUs?
HPC Hardware Terminology
Machine 1 (or Node 1)
GPU
Processor 1 (or Socket 1)
Processor 2 (or Socket 2)
Interconnect (GigE or InfiniBand)
Machine N (or Node N)
GPU
Processor 1 (or Socket 1)
Processor 2 (or Socket 2)
Machine 1 (or Node 1)
Shared Memory Parallel
• Shared memory parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable.
• OpenMP is the industry standard.
Processor 1 (or Socket 1)
Distributed Memory Parallel
• Distributed memory parallel processing (DMP) assumes that physical memory for each process is separate from all other processes.
• Parallel processing on such a system requires some form of message passing software to exchange data between the cores.
• MPI (Message Passing Interface) is the industry standard for this.
Machine 1 (or Node 1)
Processor 1 (or Socket 1)
Typical HPC Growth Path
Desktop User
Workstation and/or Server Users
Cluster Users
Cloud Solution
• Ideal for
– remote users submitting jobs from a Windows machine to a Linux cluster or local users submitting jobs to a Linux cluster
– users that do not have enough power (memory or graphics) on their local workstation to build large meshes or view graphics.
• ANSYS 15.0 supports the following remote visualization applications
– NICE Desktop Cloud Visualization (DCV) 2012.2
• Linux server + Linux/Windows client
– OpenText Exceed onDemand 8 SP3
• Linux server + Linux/Windows client
– RealVNC Enterprise Edition 5.0.4 (with VirtualGL)
• Linux server + Linux/Windows client
– (on Windows cluster: Microsoft Remote Desktop)
• Remote visualization servers require:
– GPU-capable video cards
– large amounts of RAM, so that multiple users can run ANSYS applications and pre/post-processing at the same time
Remote Visualization
Desktop Server Cluster (with 3rd party scheduler)
The Remote Solve Manager (RSM) is a GUI-based, job queuing system that distributes simulation tasks to (shared) computing resources
RSM enables tasks to be
• Run in background mode on the local machine
• Sent to a remote compute machine
• Broken into a series of jobs for parallel processing across a variety of computers
RSM as a scheduler
• Submits to RSM itself.
• Unit recognition: jobs (e.g. a run of a solver such as CFX, Fluent or Mechanical)
RSM as a transport mechanism
• Submits through RSM to a high-level scheduler such as LSF, PBS Pro, Windows HPC Server 2008 R2 / 2012, and Univa Grid Engine (at R15.0).
• Unit recognition: cores
ANSYS Remote Solve Manager (RSM)
Submission from a client to a centralized (shared) compute resource, allowing
• background queuing on a centralized machine
• multiple users to share a common machine that usually has more memory and is faster than the client machine
Submission from a client to a centralized (shared) compute resource with a job scheduler, allowing
• background queuing on a centralized machine that submits to a job scheduler (e.g. LSF)
• multiple users to run multi-node jobs on shared compute resources
Submission from a client to multiple (shared) compute resources, allowing
• background queuing on a centralized machine that submits to other machines (compute servers)
• multiple users to share user workstations (often at night) using the RSM “Limit Times for Job Submission” feature
RSM Usage Scenarios
• Improved robustness and scalability
• Added support for Univa Grid Engine
• Added support for Mechanical/MAPDL restart
• Non-root users on Linux can now use RSM wizard
• Enriched support for RSM customization
• Added component override for design point update
• Improved efficiency of Design Point updates…
Parametric Optimization of Intake Manifold
Design objectives:
• Equal fresh and exhaust gas mass flow distribution to each cylinder
• Minimize the overall pressure drop
Input parameters:
• Radii of 3 fillets near inlet (8 design points)
~5.0x speed-up over sequential execution
[Figure: initial and optimized designs]
RSM Enhancements at R15.0
Guidelines:
• Know your hardware lifecycle.
• Have a goal in mind for what you want to achieve.
• Use licensing productively.
• Use ANSYS-provided processes effectively.
Understanding the effect of clock speed
Generally, ANSYS applications scale with clock frequency
• Cost/performance argues for high clock (but maybe not top bin)
Impact of CPU Clock on Application Performance
Processor: Xeon X5600 Series
Hyper Threading: OFF, TURBO: ON
Active cores: 12/node; Memory speed: 1333 MHz (performance measure is improvement relative to CPU clock 2.66 GHz)
[Chart: improvement due to clock (higher is better, axis 0.80 to 1.40) at 2.66 GHz, 2.93 GHz and 3.47 GHz, showing the clock ratio and the ANSYS/FLUENT models eddy_417K, aircraft_2M, turbo_500K, sedan_4M and truck_14M]
ANSYS DMP benchmarks (8 core)
• Clock effect is highest for sparse solver
Using higher clock speed is always
helpful to realize productivity gains
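As a rough sanity check on the clock comparison above, the clock ratios relative to the 2.66 GHz baseline can be computed directly. An application that scaled perfectly with frequency would improve by these ratios. This is a minimal Python sketch; the function name is illustrative and not part of any ANSYS tool.

```python
BASELINE_GHZ = 2.66  # baseline clock used on the slide


def clock_ratio(clock_ghz, baseline_ghz=BASELINE_GHZ):
    """Improvement expected if runtime scaled perfectly with clock frequency."""
    return clock_ghz / baseline_ghz


if __name__ == "__main__":
    # The three clock speeds compared on the slide.
    for ghz in (2.66, 2.93, 3.47):
        print(f"{ghz:.2f} GHz -> clock ratio {clock_ratio(ghz):.2f}")
```

Measured improvements that fall below the clock ratio suggest that something other than the CPU clock (memory, I/O or interconnect) limits that model.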
Generation to Generation - ANSYS Mechanical
Current processors are up to 1.98X faster than processors that are 3 years old
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Results are estimated by Intel using the SPEC benchmark software cited and are provided for informational purposes only. Copyright © 2013, Intel Corporation.
Configuration: ANSYS Mechanical: Xeon 1280v3 (16GB, 4xSSD RAID 0), Xeon 1660v2 (32GB, 4xSSD RAID 0), Xeon E5-2687W v2 (128GB, 4xSSD RAID 0). Intel internal measurements as of August 2013. Refer to backup for additional details.
* Other names and brands may be claimed as the property of others.
Intel® Xeon® processor E5-2687W is up to 3.77X faster
than an entry-level workstation with a single processor
Which Intel Processor Might Meet Your Needs - ANSYS Mechanical
Understanding the effect of memory bandwidth - Is 24 Cores Equal to 24 Cores?
3 x (2 x 4) = 24 cores: three nodes, each with two 4-core X5570 processors
2 x (2 x 6) = 24 cores: two nodes, each with two 6-core X5670 processors
Consider memory per core!
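The two 24-core layouts differ in how many cores share each node's memory. The sketch below makes that arithmetic explicit. The node memory size is a hypothetical placeholder rather than a figure from the slides, and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ClusterLayout:
    nodes: int
    sockets_per_node: int
    cores_per_socket: int

    @property
    def cores_per_node(self):
        return self.sockets_per_node * self.cores_per_socket

    @property
    def total_cores(self):
        return self.nodes * self.cores_per_node

    def per_core_share(self, node_resource):
        """Share of a per-node resource (RAM, bandwidth) available to each core."""
        return node_resource / self.cores_per_node


# Layouts from the slide: 3 x (2 x 4) and 2 x (2 x 6).
x5570_cluster = ClusterLayout(nodes=3, sockets_per_node=2, cores_per_socket=4)
x5670_cluster = ClusterLayout(nodes=2, sockets_per_node=2, cores_per_socket=6)

NODE_RAM_GB = 48.0  # hypothetical value, identical for both node types

if __name__ == "__main__":
    for name, layout in (("X5570", x5570_cluster), ("X5670", x5670_cluster)):
        print(f"{name}: {layout.total_cores} cores in total, "
              f"{layout.per_core_share(NODE_RAM_GB):.1f} GB RAM per core")
```

With equal resources per node, the 8-core nodes give each core a larger share than the 12-core nodes, even though both clusters total 24 cores.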
Understanding the effect of memory bandwidth - Is 16 Cores Equal to 16 Cores?
2 x (2 x 4) = 16 cores (X5570)
2 x (2 x 4) = 16 cores (X5670)
Using fewer cores per node can be helpful to realize productivity gains
Understanding the effect of memory bandwidth- ANSYS Mechanical
Consider memory per core!
Understanding the effect of memory speed
• We can see here the effect of memory speed.
• This has implications on how you build your hardware.
• Some processor types have slower memory speeds by default.
• On other processors, populating the memory channels non-optimally can reduce the memory speed.
Using higher memory speed can be
helpful to realize productivity gains
Turbo Boost (Intel) / Turbo Core (AMD)- ANSYS Mechanical
• Effect of Turbo Boost on the SMP benchmarks using 1, 2, 4 and 8 out of 8 physical cores of 1 node
• Turbo Boost most efficient for the lower core counts
[Chart: impact of Turbo Boost on speed versus # of cores]
Using Turbo Boost / Core can be
helpful to realize productivity gains
- particularly for lower core counts
Hyper-threading
Hyper-threading is NOT
recommended
• Faster cores mean faster solution
• Faster memory means faster solution
• Memory bandwidth is an important factor for (linear) scalability
• Turbo Boost/Turbo Core modes do give some benefit especially at low core counts per node.
• In general hyperthreading should not be used because of licensing implications.
Recap
Processor Hardware Tips
• Need fast interconnects to feed fast processors
– Two main characteristics for each interconnect: latency and bandwidth
– Distributed ANSYS is highly bandwidth bound
+--------- D I S T R I B U T E D A N S Y S S T A T I S T I C S ------------+
Release: 14.5 Build: UP20120802 Platform: LINUX x64
Date Run: 08/09/2012 Time: 23:07
Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Total number of cores available : 32
Number of physical cores available : 32
Number of cores requested : 4 (Distributed Memory Parallel)
MPI Type: INTELMPI
Core Machine Name Working Directory
----------------------------------------------------
0 hpclnxsmc00 /data1/ansyswork
1 hpclnxsmc00 /data1/ansyswork
2 hpclnxsmc01 /data1/ansyswork
3 hpclnxsmc01 /data1/ansyswork
Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds
Communication speed from master to core 1 = 7934.49 MB/sec Same machine
Communication speed from master to core 2 = 3011.09 MB/sec QDR Infiniband
Communication speed from master to core 3 = 3235.00 MB/sec QDR Infiniband
Understanding the effect of the interconnect
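The latency and communication speed lines in the statistics block above can be pulled out of an output file with a short script. This is a minimal sketch: the regular expressions are written against the lines shown on this slide only, other releases may format them differently, and the function name is illustrative.

```python
import re

# Patterns match the two line types shown in the Distributed ANSYS statistics.
LATENCY_RE = re.compile(
    r"Latency time from master to core\s+(\d+)\s*=\s*([\d.]+)\s*microseconds")
SPEED_RE = re.compile(
    r"Communication speed from master to core\s+(\d+)\s*=\s*([\d.]+)\s*MB/sec")

SAMPLE = """\
Latency time from master to core 1 = 1.171 microseconds
Latency time from master to core 2 = 2.251 microseconds
Latency time from master to core 3 = 2.225 microseconds
Communication speed from master to core 1 = 7934.49 MB/sec
Communication speed from master to core 2 = 3011.09 MB/sec
Communication speed from master to core 3 = 3235.00 MB/sec
"""


def parse_interconnect_stats(text):
    """Return {core: {"latency_us": ..., "bandwidth_mb_s": ...}}."""
    stats = {}
    for core, value in LATENCY_RE.findall(text):
        stats.setdefault(int(core), {})["latency_us"] = float(value)
    for core, value in SPEED_RE.findall(text):
        stats.setdefault(int(core), {})["bandwidth_mb_s"] = float(value)
    return stats


if __name__ == "__main__":
    for core, values in sorted(parse_interconnect_stats(SAMPLE).items()):
        print(core, values)
```

In the sample, core 1 sits on the same machine as the master and shows the highest communication speed, while cores 2 and 3 are reached over the interconnect.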
Understanding the effect of the interconnect- ANSYS Mechanical
V13sp-5 Model: turbine geometry, 2,100 K DOF, SOLID187 FEs, static, nonlinear, one iteration, direct sparse solver
Linux cluster (8 cores per node)
[Chart: Interconnect Performance, rating (runs/day, axis 0 to 60) at 8, 16, 32, 64 and 128 cores for Gigabit Ethernet and DDR InfiniBand]
Understanding the effect of the interconnect- ANSYS Mechanical
3 million DOF using direct sparse solver
SOLID95 elements, worst case for a direct solver
[Chart: TrueScale versus GigE, in-core memory; wall time (secs, axis 0 to 6000) at 16, 32, 64 and 128 cores]
Using faster interconnects can be
helpful to realize productivity gains
- particularly at higher core/node counts
GigE (Gigabit Ethernet)
– 1 Gbit/sec (100 MB/sec)
10 GigE
– 10 Gbit/sec (1000 MB/sec)
Myrinet (Myricom, Inc.)
– 2 Gbit/sec (250 MB/sec)
– Myri 10G – 10 Gbit/sec (4th generation Myrinet)
InfiniBand (many vendors/speeds)
– SDR/DDR/QDR
– 1x, 4x, 12x
– http://en.wikipedia.org/wiki/List_of_device_bandwidths
Not recommended!!
Bare minimum!!
Understanding the effect of the interconnect- ANSYS Mechanical
RECOMMENDATION
Over 1000 MB/s, especially when
running on more than 4 nodes
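The MB/sec figures quoted for each interconnect are rounded practical values. The raw conversion from link rate is 8 bits per byte, so 1 Gbit/sec is 125 MB/sec before protocol overhead. The sketch below compares the quoted rates against the "over 1000 MB/s" recommendation; the dictionary and function names are illustrative.

```python
RECOMMENDED_MB_PER_SEC = 1000  # threshold from the recommendation above


def gbits_to_mbytes_per_sec(gbits_per_sec):
    """Raw conversion: 1 Gbit/s = 1000 Mbit/s, 8 bits per byte."""
    return gbits_per_sec * 1000 / 8


def exceeds_recommendation(mb_per_sec):
    """True only when the rate is strictly over the recommended threshold."""
    return mb_per_sec > RECOMMENDED_MB_PER_SEC


# Practical rates as quoted on the slide (MB/sec).
QUOTED_RATES = {"GigE": 100, "10 GigE": 1000, "Myrinet": 250}

if __name__ == "__main__":
    for name, rate in QUOTED_RATES.items():
        verdict = "over" if exceeds_recommendation(rate) else "not over"
        print(f"{name}: {rate} MB/sec, {verdict} {RECOMMENDED_MB_PER_SEC} MB/sec")
```

With the quoted rates, none of the three Ethernet or Myrinet options is strictly over the threshold; 10 GigE only just reaches it.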
10 GigE and InfiniBand are recommended for HPC clusters.
• Currently, only InfiniBand is recommended for large clusters.
• QDR should be more than adequate for small to medium clusters; use FDR for large clusters.
With GigE, performance decreases as soon as a job spans more than one node.
• Mechanical users should not use GigE at all if their jobs span more than one node.
Recap
Interconnect Tips
Understanding the effect of I/O - ANSYS Mechanical
SP-5 (in-core) R14.5 Benchmark Results
Rating (jobs/day) by #Machine X #Core:

Configuration              1X1     1X2     1X4      1X8      1X16
Memory                     29GB    33GB    35.6GB   40.8GB   47.8GB
4XSSD-RAID-0-SATA-3Gb/s    89      145     180      301      419
2XSSD-RAID-0-SATA-3Gb/s    89      146     180      275      384
SSD-SATA-6Gb/s             88      144     180      283      368
HD(7.2K RPM)-SATA-6Gb/s    88      124     118      95       52
Is Your Hardware Ready for HPC?- ANSYS Mechanical
[Chart: I/O [Mb/s] (axis 100 to 1200) versus RAM [Gb] (8, 16, 32, 48, 64, 96, 128) for model sizes 0.2 Mdof, 2 Mdof, 4 Mdof and > 6 Mdof, with storage options 1x SAS, 2x SAS, 1x SSD and 2x SSD]
DMP Outperforming SMP
6 million degrees of freedom
Plasticity, contact
Bolt pretension
4 load steps
DMP: Good Performance at High Core Counts
• Intel Xeon E5-2690 processors (2.9 GHz, 16 cores total)
• 128 GB of RAM
10.7 million degrees of freedom; static, linear, structural; 1 load step
1 million degrees of freedom; harmonic, linear, structural; 4 frequencies
[Charts: one per model, plotted against number of cores]
[Chart: Solution Scalability, speedup (axis 0 to 25) versus number of cores (0 to 64)]
Minimum time to solution more important than scaling
ANSYS Mechanical 14.5
DMP Enabling Scalability at High Core Counts
V14sp-5 Model
Turbine geometry
2.1 million DOF
Static, nonlinear analysis
1 loadstep, 7 substeps,
25 equilibrium iterations
8-node Linux cluster
(with 8 cores per node)
ANSYS Mechanical 15.0
Faster Performance at Higher Core Counts
Improved scaling at 8 cores by an enhanced domain decomposition method
[Chart: speedup over R14.5 (axis 0 to 6) for Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF) and Bracket (45 KDOF); data labels 1.3x, 1.7x, 2.7x, 2.4x]
8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)
ANSYS Mechanical 15.0
Faster Performance at Higher Core Counts
Improved scaling at 16 cores by an enhanced domain decomposition method
[Chart: speedup over R14.5 (axis 0 to 6) for Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF) and Bracket (45 KDOF); data labels 1.6x, 1.8x, 3.8x, 4.0x]
8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)
ANSYS Mechanical 15.0
Faster Performance at Higher Core Counts
Improved scaling at 32 cores by an enhanced domain decomposition method
[Chart: speedup over R14.5 (axis 0 to 6) for Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF) and Bracket (45 KDOF); data labels 1.8x, 2.2x, 3.9x, 5.0x]
8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR)
Continually improving Core Solver Rating to 80 cores
Courtesy of HP
ANSYS Mechanical 15.0
Faster Performance at Higher Core Counts
ANSYS Mechanical 15.0
HPC & Solver Technology Improvements
• Improved Scalability of Distributed solver at higher core counts
• NEW Subspace eigen solver supports Shared and Distributed Parallel technology
• NEW MSUP Harmonic method for unsymmetric systems, e.g. vibro-acoustics
Coupled Acoustic, 1.2 M DOF, Full Harmonic Response
2.09 MDOFs, first 20 modes
GPUs are accelerators and can significantly speed up your simulations
• GPUs work hand in hand with CPUs
Most ANSYS GPU acceleration is user-transparent
• Only requirement is to inform ANSYS of how many GPUs to use
Schematic of a CPU with an attached GPU accelerator
• CPU begins/ends job, GPU manages heavy computations
Some Basics
ANSYS Software on NVIDIA GPUs
GPU Accelerator Capability- ANSYS Mechanical
Supports majority of ANSYS structural mechanics solvers:
• Covers both sparse direct and PCG iterative solvers
• Only a few minor limitations
Ease of use:
• Requires at least one supported GPU card to be installed
• No rebuild, no additional installation steps
Performance:
• Offer significantly faster time to solution
• Should never slow down your simulation
V14sp-5 Model
Influence of GPU Accelerator on Speedup
ANSYS Mechanical Model – Impeller
Impeller geometry of ~2M DOF, solid FEs
Normal modes analysis using cyclic symmetry
ANSYS Mechanical SMP and Block-Lanczos solver
4 cores + GPU = 2.4x speedup vs. 4 cores

ANSYS Mechanical Model – Speaker
Speaker geometry of ~0.7M DOF, solid FEs
Vibroacoustic harmonic analysis for one frequency
ANSYS Mechanical distributed sparse solver
4 cores + GPU = 2.7x speedup vs. 4 cores
[Charts: speedup for each model; data labels 5.9x, 3.7x, 2.4x]
NVIDIA-GPU Solution Fit for ANSYS Mechanical
GPUs accelerate the solver part of analysis, consequently problems with high solver workloads benefit the most from GPUs
• Characterized by both high DOF and high factorization requirements
• Models with solid elements (such as castings) and more than 500K DOF experience good speedups
Performance is better in DMP mode than in SMP mode
GPU and system memories both play important roles in performance
• Sparse solver:
– Bulkier and/or higher-order FE models are good and will be accelerated
– If the model exceeds 5M DOF, then either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB memory (Tesla K40 or Quadro K6000).
• PCG/JCG solver:
– Memory saving (MSAVE) option should be turned off to enable GPUs
– Models with lower Level of Difficulty value (Lev_Diff) are better suited for GPUs
GPU Achievements
ANSYS Mechanical 15.0 Supporting Newest GPUs
V14sp-5 Model: turbine geometry, 2.1 million DOF, SOLID187 elements, static, nonlinear analysis, one iteration, sparse direct solver
Simulation productivity in ANSYS Mechanical jobs/day (higher is better)
With an HPC license:
• 2 CPU cores: 93
• 2 CPU cores + Tesla K20: 324 (3.5X)
• 2 CPU cores + Tesla K40: 363 (3.9X)
With an HPC Pack:
• 8 CPU cores: 275
• 7 CPU cores + Tesla K20: 576 (2.1X)
• 7 CPU cores + Tesla K40: 600 (2.2X)
Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K20 GPU and a Tesla K40 GPU with boost clocks.
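The speedup factors on this slide are the ratio of the GPU-accelerated jobs/day rating to the CPU-only rating. A minimal Python sketch that reproduces them from the ratings quoted above (the function name is illustrative):

```python
def speedup(baseline_jobs_per_day, accelerated_jobs_per_day):
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_jobs_per_day / baseline_jobs_per_day


# (baseline label, baseline rating, accelerated label, accelerated rating)
V14SP5_RESULTS = [
    ("2 CPU cores", 93, "2 CPU cores + Tesla K20", 324),
    ("2 CPU cores", 93, "2 CPU cores + Tesla K40", 363),
    ("8 CPU cores", 275, "7 CPU cores + Tesla K20", 576),
    ("8 CPU cores", 275, "7 CPU cores + Tesla K40", 600),
]

if __name__ == "__main__":
    for base_label, base, acc_label, acc in V14SP5_RESULTS:
        print(f"{acc_label} vs. {base_label}: {speedup(base, acc):.1f}X")
```

The 8-core comparison gives up one CPU core to the GPU so that both runs use the same number of HPC tasks.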
GPU Achievements
ANSYS Mechanical 15.0 Supporting Newest GPUs
V14sp-6 Model: 4.9 million DOF, static, nonlinear analysis, one iteration, sparse direct solver
Simulation productivity in ANSYS Mechanical jobs/day (higher is better)
With an HPC license:
• 2 CPU cores: 59
• 2 CPU cores + Tesla K20: 165 (2.8X)
With an HPC Pack:
• 8 CPU cores: 180
• 7 CPU cores + Tesla K20: 270 (1.5X)
Distributed ANSYS Mechanical 15.0 with Intel Xeon E5-2697 v2 2.7 GHz CPU; Tesla K40 GPU with boost clocks.
GPU Achievements
ANSYS Mechanical 15.0 Supporting Newest GPUs
Distributed ANSYS Mechanical 15.0 on Windows workstation with 16 Intel Xeon E5-2670 cores @ 2.7 GHz; 128 GB RAM; SSD.
GPUs can offer significantly faster time to solution
Lower core counts favor a single GPU
Higher core counts favor multiple GPUs
Courtesy of HP
GPU Achievements
ANSYS Mechanical 15.0 Supporting Newest GPUs
Intel Xeon Phi coprocessors are now supported
• Use ‘-acc intel’ to activate this capability
• Xeon Phi models 7120, 5110, 3120 are supported
• Multiple cards
Note:
• Supported by sparse solver (symmetric matrices only)
• Linux only (no Windows support yet)
• Only SMP is supported
GPU Achievements
ANSYS Mechanical 15.0 Supporting Xeon Phi
Significant speedups can be achieved with Xeon Phi card
• Shared Memory Sparse Solver on Linux
[Chart: Xeon Phi Acceleration (SMP), speedup (axis 0 to 8) at 1, 2, 4, 8 and 16 cores for CPU cores only and CPU cores + Xeon Phi; data labels 3.3x, 4.3x, 5.1x, 6.0x, 6.8x]
V14sp-5 Model
Turbine geometry
2.1 million DOF
SOLID187 elements
Static, nonlinear analysis
One iteration
Sparse direct solver
GPU Achievements
ANSYS Mechanical 15.0 Supporting Xeon Phi
Linux workstation (16 Intel Xeon E5-2670 cores @ 2.6 GHz, 1 7120A Xeon Phi, 64 GB RAM).
GPU Achievements
ANSYS 15.0 License Scheme for GPUs – NEW!
Licensing examples:
• 1 x ANSYS HPC Pack: total 8 HPC tasks (4 GPUs max)
– Examples of valid configurations: 6 CPU cores + 2 GPUs; 4 CPU cores + 4 GPUs
• 2 x ANSYS HPC Pack: total 32 HPC tasks (16 GPUs max)
– Example of a valid configuration: 24 CPU cores + 8 GPUs (total use of 2 compute nodes)
(Applies to all schemes: HPC, HPC Pack, HPC Workgroup, HPC Enterprise)
ANSYS 15.0 License Scheme for GPUs- Implication of New HPC Pack Licensing
• With R14.5, you could run up to 8 CPU cores and 1 GPU.
• With R15.0, you can run up to 7 CPU cores and 1 GPU, or 6 CPU cores and 2 GPUs, etc.
Results Courtesy of MicroConsult Engineering, GmbH
Leda
BGA
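The R15.0 licensing examples can be turned into a quick configuration check. The sketch below encodes only the two pack counts quoted in the slides (8 tasks with at most 4 GPUs, 32 tasks with at most 16 GPUs). The rule that each CPU core and each GPU consumes one HPC task is an assumption inferred from those examples, not an official statement of the license terms, and the names are illustrative.

```python
# packs -> (total HPC tasks, max GPUs), taken from the slide's two examples only
PACK_LIMITS = {1: (8, 4), 2: (32, 16)}


def is_valid_configuration(packs, cpu_cores, gpus):
    """Check a CPU/GPU mix against the quoted task and GPU limits.

    Assumption: every CPU core and every GPU counts as one HPC task.
    """
    if packs not in PACK_LIMITS:
        raise ValueError(f"no limits quoted in the slides for {packs} pack(s)")
    total_tasks, max_gpus = PACK_LIMITS[packs]
    return cpu_cores + gpus <= total_tasks and gpus <= max_gpus


if __name__ == "__main__":
    for packs, cores, gpus in [(1, 6, 2), (1, 4, 4), (2, 24, 8), (1, 8, 1)]:
        ok = is_valid_configuration(packs, cores, gpus)
        print(f"{packs} pack(s), {cores} cores + {gpus} GPUs: "
              f"{'valid' if ok else 'not valid'}")
```

Under this reading, 8 CPU cores plus 1 GPU no longer fits in a single pack at R15.0, which matches the "7 CPU cores and 1 GPU" statement above.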
HDD vs. SSD
Maximizing Performance – Putting it Together
The right combination of hardware and software
leads to maximum efficiency
SMP vs. DMP
Interconnects?
Clusters?
GPUs?
CPUs?
#1 Rule: Avoid waiting for I/O to complete
• Always check if job is I/O bound or compute bound
– Check output file for CPU and Elapsed times
• When Elapsed time >> main thread CPU time: I/O bound
– Consider adding more RAM or faster hard drive configuration
• When Elapsed time ≈ main thread CPU time: compute bound
– Consider moving simulation to a machine with newer, faster processors
– Consider using Distributed ANSYS (DMP) instead of SMP
– Consider running on more CPU cores or possibly using GPU(s)
Total CPU time for main thread : 159.8 seconds
. . .
. . .
Elapsed Time (sec) = 398.000 Date = 03/21/2013
Maximizing Performance – ANSYS Mechanical
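The I/O bound versus compute bound check described above can be scripted against the two output lines shown. This is a minimal sketch: the regular expressions match only the line formats on this slide, and the 1.5 cutoff is an arbitrary assumption, since the slides only say ">>" and "≈".

```python
import re

CPU_RE = re.compile(r"Total CPU time for main thread\s*:\s*([\d.]+)\s*seconds")
ELAPSED_RE = re.compile(r"Elapsed Time \(sec\)\s*=\s*([\d.]+)")

SAMPLE = """\
Total CPU time for main thread : 159.8 seconds
Elapsed Time (sec) = 398.000 Date = 03/21/2013
"""


def classify_run(output_text, io_bound_ratio=1.5):
    """Return (label, elapsed/CPU ratio) for a solver output file.

    io_bound_ratio is a hypothetical cutoff, not a value from the slides.
    """
    cpu = CPU_RE.search(output_text)
    elapsed = ELAPSED_RE.search(output_text)
    if cpu is None or elapsed is None:
        raise ValueError("CPU or elapsed time line not found")
    ratio = float(elapsed.group(1)) / float(cpu.group(1))
    label = "I/O bound" if ratio > io_bound_ratio else "compute bound"
    return label, ratio


if __name__ == "__main__":
    label, ratio = classify_run(SAMPLE)
    print(f"elapsed/CPU = {ratio:.2f} -> {label}")
```

For the sample lines, the elapsed time is about 2.5 times the main thread CPU time, so the run is classified as I/O bound.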
Maximizing Performance – ANSYS Mechanical
How to improve an I/O bound simulation
– First consider adding more RAM
• Always the best option for optimal performance
• Allows the operating system to cache file data in memory
– Next consider improving the I/O configuration
• Need fast hard drives to feed fast processors
– Consider SSDs
– Higher bandwidths and extremely low seek times
– Consider RAID configurations
RAID 0 – for speed
RAID 1,5 – for redundancy
RAID 10 – for speed and redundancy
Maximizing Performance – ANSYS Mechanical
Example of an I/O bound simulation
[Chart: Benefits of SSD and RAM, relative speedup (axis 0 to 7) for 2 cores with HDD, 8 cores with HDD and 8 cores with SSD, at 16 GB RAM and 128 GB RAM; data labels 0.8x, 2.9x, 2.7x, 5.9x, 5.9x]
Adding RAM gives biggest gains & allows good scaling
Lack of RAM and slow HDD ruin scaling
Single SSD helps allow some scaling. Not as helpful as RAM, but cheaper
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• One 10k rpm HDD, one SSD
• Windows 7
Maximizing Performance – ANSYS Mechanical
How to improve a compute bound simulation
– First consider using newer, faster processors
• New CPU architecture and faster clock speeds always help
– Next consider using parallel processing
• DMP virtually always recommended over SMP
• More computations performed in parallel with DMP
• Significantly faster speedups achieved using DMP
• DMP can take advantage of all resources on a cluster
• Whole new class of problems can be solved!!
– Last consider using GPU acceleration
• Can help accelerate critical, time-consuming computations
Maximizing Performance – ANSYS Mechanical
Example of a compute bound simulation
[Chart: Benefits of DMP and GPU, relative speedup (axis 0 to 12) for 2 cores, 8 cores and 8 cores with GPU, on Xeon X5675 and Xeon E5-2670; data labels 1.8x, 4.0x, 11.0x]
Using newer Xeons gives big gain
Using 8 cores gives faster performance
Maximum performance found by adding GPU
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• 128 GB RAM
• 1 Tesla K20c
• Windows 7
Balanced System for Overall Optimum Performance
Maximizing Performance – ANSYS Mechanical
Balanced Performance, relative speedup (I/O bound case):
• 2 cores: 1.0x
• 8 cores: 2.7x
• 8 cores + GPU: 5.2x
• 8 cores + GPU + SSD: 12.5x
Configuration:
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• 16 GB RAM
• SSD and SATA disks
• 1 Tesla K20c
• Windows 7
Balanced System for Overall Optimum Performance
Maximizing Performance – ANSYS Mechanical
Balanced Performance, relative speedup:

Configuration          I/O bound   Compute bound
2 cores                1.0x        5.7x
8 cores                2.7x        12.0x
8 cores + GPU          5.2x        24.8x
8 cores + GPU + SSD    12.5x       27.3x

Configuration:
• 2.1 million DOF
• Nonlinear static analysis
• Direct sparse solver (DSPARSE)
• 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total)
• 128 GB RAM
• SSD and SATA disks
• 1 Tesla K20c
• Windows 7
Recorded webinars:
• Understanding Hardware Selection for ANSYS 15.0
• How to Speed Up ANSYS 15.0 with GPUs
• Intel Technologies Enabling Faster, More Effective Simulation
• Why HPC for ANSYS Mechanical and CFD
Click on webinars related to HPC/IT for more and upcoming ones!
Additional Resources
Additional Resources- ANSYS IT Webcast Series
On-demand webinars:
• Understanding Hardware Selection for ANSYS 15.0
• How to Speed Up ANSYS 15.0 with GPUs
• Cloud Hosting of ANSYS: Gompute On-Demand Solutions
• Simplified HPC Clusters for ANSYS Users
• Intel Technologies Enabling Faster, More Effective Simulation
• Accelerating Time-to-Results with Parallel I/O
• Extreme Scalability for High-Fidelity CFD Simulations
• Methodology and Tools for Compute Performance at Any Scale
• Understanding Hardware Selection for Structural Mechanics
• Optimizing Remote Access to Simulation
• Scalable Storage and Data Management for Engineering Simulation
http://www.ansys.com/Support/Platform+Support/IT+Solutions+for+ANSYS+Webcast+Series
Additional Resources
ANSYS Platform Support
• http://www.ansys.com/Support/Platform+Support
– Platform Support Policies
– Supported Platforms
– Supported Hardware
– Tested Systems
– ANSYS Benchmarks
ANSYS Partner Solutions
– http://www.ansys.com/About+ANSYS/Partner+Programs/HPC+Partners
• Reference configurations
• Performance data
• White papers
• Sales contact points
Performance Data
– http://www.ansys.com/benchmarks
Additional Resources
Additional Resources
The Manual
• Sections on best practices and parallel processing for various solvers
• Performance Guide for Mechanical
• Installation walkthroughs for installing the products, parallel processing, licensing and RSM (remote solve manager)
ANSYS Advantage
• Online Magazine
• Connect with Me
• Connect with ANSYS, Inc.
– LinkedIn ANSYSInc
– Twitter @ANSYS_Inc
– Facebook ANSYSInc
• Follow our Blog
– ansys-blog.com
Thank You!