SGI HPC (29 May 2012)
HPC Milestones – Michal Klimeš
Experts @ HPC
• Structural Mechanics (Implicit)
• Structural Mechanics (Explicit)
• Computational Fluid Dynamics
• Electro-Magnetics
• Computational Chemistry (Quantum Mechanics)
• Computational Chemistry (Molecular Dynamics)
• Reservoir Simulation
• Rendering / Ray Tracing
• Climate / Weather / Ocean Simulation
• Data Analytics
• Computational Biology
• Seismic Processing
Competency = Real HPC + Big Storage
From TOP500
There are no small things
OpenFOAM® Performance with SGI MPI
Performance: SGI MPT vs. OpenMPI
[Chart: speedup vs. #cores, with the MPT/OpenMPI ratio on the secondary axis]

Cores   SGI MPT speedup   OpenMPI speedup   MPT/OpenMPI ratio
 64         1.02              1.00               1.02
128         1.73              1.59               1.08
192         2.29              2.01               1.14
256         2.72              2.01               1.35
Automotive Interior Climate Model, 19M cells
OpenFOAM with SGI MPI delivers up to 35% better performance.
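The 35% figure is just the 256-core ratio from the chart (2.72 / 2.01 = 1.35). A minimal sketch, assuming only the values transcribed above, recomputes the ratios:

    /* Recompute the MPT/OpenMPI ratios from the chart data above
     * (illustrative helper, not SGI code). */
    #include <stdio.h>

    int main(void) {
        const int    cores[]   = {64, 128, 192, 256};
        const double mpt[]     = {1.02, 1.73, 2.29, 2.72};
        const double openmpi[] = {1.00, 1.59, 2.01, 2.01};

        for (int i = 0; i < 4; i++) {
            double ratio = mpt[i] / openmpi[i];
            printf("%3d cores: MPT/OpenMPI = %.2f (+%.0f%%)\n",
                   cores[i], ratio, (ratio - 1.0) * 100.0);
        }
        return 0;
    }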
Workload   Power*     % of Linpack power
Linpack    30.5 kW    100%
Fluent     22.4 kW    73.4%
GUPS       23.3 kW    76.4%
STREAM     22.1 kW    72.5%
Idle       15.9 kW    52.1%
* Measured on ICE 8200 system with 128x 2.66GHz Quad-Core Intel® Xeon® Processor 5300 series (1 Rack)
What is the "average" power consumption?
Average power consumption heavily depends on:
• the application and its data profile
• the level of code optimization (+ libraries + MPI optimization)
• the ability of the job scheduler to utilize the system
• the bottlenecks in the I/O subsystem and in the OS
Where is performance?
Real Memory Bandwidth Requirements: Measurements at LRZ on SGI Altix 4700
[Chart: measured per-application memory bandwidth against the 1 B/s : 1 Flop/s balance line; "Accelerated" region marked]
Source: Matthias Brehm (LRZ), inSiDE, Vol. 4, No. 2
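To make the 1 B/s : 1 Flop/s balance concrete, consider a STREAM-triad-style loop (an illustrative kernel, not from the slide): it needs far more than one byte of memory traffic per flop, which is why memory bandwidth, not peak flops, limits many real codes.

    /* STREAM-triad-like kernel: 2 flops per iteration, but 24 bytes
     * of memory traffic (load b[i], load c[i], store a[i]), i.e.
     * 12 bytes per flop -- well above the 1 B/s : 1 Flop/s line. */
    #include <stddef.h>

    void triad(double *a, const double *b, const double *c,
               double s, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }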
SGI HPC Servers and Supercomputers
• Altix® ICE: blade cluster architecture – scalability leader
• CloudRack™: tray cluster architecture (for Internet datacenters)
• Rackable™: 1U, 2U, 3U, 4U & XE – build-to-order leader
• Altix® UV: shared-memory architecture – virtualization & many-core leader
Scale-out to scale-up
SGI UV2: 4th Generation SMP System
• The most flexible system!
SGI UV Shared Memory Architecture
Commodity Clusters:
• Each node has its own memory and OS
• Nodes communicate over a commodity interconnect (InfiniBand or Gigabit Ethernet)
• Inefficient cross-node communication creates bottlenecks
• Coding required for parallel code execution
[Diagram: many nodes, each with ~64GB of memory and its own system + OS, linked by InfiniBand or Gigabit Ethernet]
SGI UV Platform:
• All nodes operate on one large shared memory space
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy
• Global shared memory up to 16TB
[Diagram: a single system + OS spanning all nodes over the SGI NUMAlink interconnect]
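A minimal sketch of what this means for the programmer, assuming a simple global sum (illustrative code, not from the slide): on shared memory the whole array is directly addressable, while on a cluster each rank owns a slice and must combine results with explicit message passing.

    #include <omp.h>
    #include <mpi.h>

    /* Shared memory (UV-style): every thread sees the whole array;
     * no data distribution is coded. */
    double sum_shared(const double *x, long n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* Cluster-style: each rank sums its local slice, then an explicit
     * message-passing step combines the partial sums. */
    double sum_cluster(const double *x_local, long n_local) {
        double s = 0.0, total = 0.0;
        for (long i = 0; i < n_local; i++) s += x_local[i];
        MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return total;
    }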
The UV2 Advantage
Long 15-year heritage: same principles as the Altix 4700, but with:
– Intel Sandy Bridge Xeon multi-core processors
– Large, scalable shared-memory system
• Up to 4096 cores and 64TB per partition
• Up to 2048 cores, 4096 threads and 32TB per partition
• Multi-partition systems with up to 16384 sockets and 2PB across multiple partitions
• MPI and UPC acceleration by hardware offload
• Cross-partition communication
Without competition in 2012:
– Thanks to the proven SGI ccNUMA architecture
– Reliability, availability and serviceability
• Sandy Bridge EP 4S with better reliability properties than EP 2S
– All-hardware solution
• ccNUMA transparent to end users
SGI UV2 Interconnect with Global Addressing
NUMAlink® routers connect nodes into multi-rack UV systems.
The HUB snoops the socket QPI and accelerates remote access.
The HUB offloads programming models: MPI and UPC (Co-Array Fortran not yet).
[Diagram: four Altix UV blades, each with two CPUs (64GB + 64GB memory) and a HUB, connected over NUMAlink through a high-radix router; 512GB of globally addressable memory]
UV Foundation: GAM + Communications Offload
[Diagram: Intel CPU socket [S] attached to the HUB's PI, MI and NI interfaces, with NUMAlink to other nodes; AMU for physical [P] addressing, GRU with TLB for virtual [V] addressing]
• GAM: Globally Addressable Memory, 8PB (53-bit)
• GSM minus cache coherence = GAM
• GSM: partition memory (OS), max. 2K cores / 16TB
• GAM: PGAS memory (cross-partition)
• Communications offload (GRU + AMU): accelerates PGAS codes and MPI codes (MOE, analogous to a TOE)
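A hedged sketch of the PGAS style these features accelerate, written against the standard OpenSHMEM API (SGI MPT ships a SHMEM library; the exact calls below follow the OpenSHMEM standard and are illustrative only): PE 0 stores directly into PE 1's symmetric memory with no receive posted on the remote side, which is the access pattern the GRU offloads.

    #include <shmem.h>

    int main(void) {
        shmem_init();
        long *buf = shmem_malloc(sizeof(long)); /* symmetric allocation */
        *buf = -1;
        shmem_barrier_all();

        long val = 42;
        if (shmem_my_pe() == 0)
            shmem_long_put(buf, &val, 1, 1);    /* one-sided store into PE 1 */

        shmem_barrier_all();
        shmem_finalize();
        return 0;                               /* run with 2+ PEs */
    }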
UV1 vs. UV2
• Socket: UV1 NHM-EX / WSM-EX; UV2 SNB-EX-B & SNB-EP, IVB-EX-B & IVB-EP
• QPI: UV1 QPI 1.0; UV2 QPI 1.1
• Glue: UV1 H + H + R as 3 separate 90nm chips, with (D) directory DIMM and (S) snoop DRAM; UV2 H + H + R in one 40nm chip, no directory DIMM, no snoop DRAM, better AMOs
• Interconnect: UV1 NL5 at 6.25 GT/s, 8b/10b encoding, 4 x 12 lanes, Cu only, 7m max; UV2 NL6 with higher payload, 16 x 4 lanes, Cu & optical, 20m max
[Diagram: UV1 hub chips (H, D, S) vs. UV2 integrated hub (H) and router (R); UV MPI Barrier comparison]
Additional Performance Acceleration
• Altix UV offers up to 3X improvement in MPI reduction operations
• Barrier latency is dramatically better (80x) than on competing platforms
• HPCC benchmarks show substantial improvement possible with the MPI Offload Engine (MOE)
[Charts: HPCC benchmark results and barrier latency (<1 µsec at 4096 threads), UV with MOE vs. UV with MOE disabled. Source: SGI Engineering projections]
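A minimal microbenchmark sketch of the kind used to produce such barrier-latency numbers (illustrative, not SGI's benchmark code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;
        MPI_Barrier(MPI_COMM_WORLD);            /* warm-up / alignment */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double avg = (MPI_Wtime() - t0) / iters;

        if (rank == 0)
            printf("average barrier latency: %.2f usec\n", avg * 1e6);
        MPI_Finalize();
        return 0;
    }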
UV2000 16 Socket 8 Blade IRU
Notes
• IRU: 10U high by 19" wide by 27" deep
• 8 blades (8 Harps & 16 sockets) per IRU
• 1 or 2 CMCs in rear of IRU
• 3 UV1 12V power supplies
• Nine 12V cooling fans, N+1
[Diagram, front and rear views: two signal backplanes and a power backplane, with the CMC at the rear; 16 NL channels cabled in the air plenum connect the right and left signal backplanes]
SGI UV2 Node Architecture and Numalink 6
[Block diagram: two Sandy Bridge EP or EX sockets connected by QPI 1.1 (8GT/s, 32GB/s) to the UV2 HUB; 4 DDR3 channels per socket, 2 DPC at 1600MHz; 16 x4 NL6 channels at 12.5GT/s from the HUB, feeding the NL0-plane and NL1-plane IRU external links]
•Numalink 6
–12.5GT/s, i.e. 6.7GB/s net bidirectional bandwidth per link
–16 NL6 links aggregate Bandwidth out of blade: 107.2GB/s
–12 NL6 internal links to backplane – aggregate: 80.4GB/s
– 4 NL6 external links to routers – 26.8 GB/s
•Numalink 6 Routers
–16 NL6 ports
•Numalink cable
–Leverages low-cost QSFP 4x QDR InfiniBand cables
•Directory memory part of main memory
–Available memory is reduced by 10 percent
• No memory buffers as in UV1
• Same per-socket performance as in a cluster
• 40 PCIe lanes per socket (2x PCIe Gen3 x16 per blade)
• 12 IRU-internal links
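The aggregate numbers above are simply the per-link figure times the link count; a trivial consistency check (values from this slide):

    #include <stdio.h>

    int main(void) {
        const double per_link = 6.7;  /* GB/s net bidirectional per NL6 link */
        printf("16 links out of blade: %.1f GB/s\n", 16 * per_link); /* 107.2 */
        printf("12 links to backplane: %.1f GB/s\n", 12 * per_link); /*  80.4 */
        printf(" 4 links to routers:   %.1f GB/s\n",  4 * per_link); /*  26.8 */
        return 0;
    }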
System Topology
•1 IRU
–Hypercube
–Max 2 hops between blades
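In a plain hypercube, the minimal hop count between two nodes is the Hamming distance of their binary IDs (each hop flips one address bit); the denser NL6 wiring inside the IRU brings the worst case down to the 2 hops quoted above. A small illustrative sketch (not SGI code):

    #include <stdio.h>

    /* Hop count between hypercube nodes = Hamming distance of IDs. */
    int hops(unsigned a, unsigned b) {
        unsigned x = a ^ b;
        int h = 0;
        while (x) { h += x & 1u; x >>= 1; }  /* popcount */
        return h;
    }

    int main(void) {
        /* blades 0 (000) and 5 (101) differ in two bits: 2 hops */
        printf("blade 0 -> blade 5: %d hops\n", hops(0, 5));
        return 0;
    }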
UV2 Topology
UV2 Feature Advances

Feature           UV1                UV2
System scale      2048c / 4096t      4096c / 4096t
Memory/SSI        16TB               64TB
Interconnect      NUMAlink 5         NUMAlink 6 (2.5X data rate)
NL fabric scale   32K sockets        32K+ sockets
Processor         Nehalem-EX         Sandy Bridge
Sockets/rack      64 (large 24")     64 (standard 19")
Reliability       Enterprise class   Enterprise class

MIC architecture: x86 compatible, 1.3TF/s double-precision peak, 340GB/s bandwidth
SGI ICE X: Fifth Generation ICE System
• The world’s fastest supercomputer just got faster!
• Flexible to fit your workload
SGI® ICE: Firsts and Onlies
• First InfiniBand-connected pure-compute CPU cluster over 1PF peak
• World's fastest distributed-memory system
• World's fastest and most scalable computational fluid dynamics system
• First and only vendor to support multiple fabric-level topologies, plus flexibility at the node, switch and fabric level, plus application benchmarking expertise for same
• First and only vendor capable of live, large-scale compute capacity integration
Dialing Up the Density: SGI ICE 8400 vs. SGI ICE X
• SGI ICE 8400: 64 nodes (128 sockets), 30" width
• ICE X M-Rack: 72 x 2 = 144 nodes (288 sockets), 28" width
• ICE X D-Rack: 72 nodes (144 sockets), 24" width
• Separable power shelf
SGI ICE X Enclosure Design
Building-block increments of two blade enclosures: "one enclosure pair"
[Diagram, rear view: 21U building block in a 19" rack mount; enclosure dimensions 17.7", 16.59" (9.5U), 1.75" (1U)]
Features per enclosure pair:
• 36 blade slots
• Four fabric switch slots
• Integrated management
SGI ICE X Compute Blade: IP-113 (Dakota) for "D-Rack"
FDR mezzanine card options
Main Features:
•Supports single or dual plane FDR InfiniBand
•Supports two future Intel® Xeon® processor E5 family CPUs
•Supports up to eight DDR3 DIMMs per socket @ 1600 MT/s
•Houses up to two 2.5” SATA drives for local swap/scratch usage
•Utilizes traditional heat sinks
SGI ICE X Compute Blade: IP-115 (Gemini Twin) for "M-Rack"
Main Features:
•Supports single plane FDR InfiniBand
•Supports four future Intel® Xeon® processor E5 family CPUs
•Two dual socket nodes
•Supports four DDR3 DIMMs per socket @1600 MT/s
•Houses up to two 2.5" SATA drives for local swap/scratch usage (one per node)
•Utilizes traditional heat sinks and cold sinks (liquid)
On-Socket Water-Cooling Detail
Used for IP-115 Gemini "twin" blades; replaces the traditional air-cooled heat sinks on the CPUs.
•Resides between the pair of node boards in each blade slot ("M-Rack" deployment)
•Enables highest-watt SKU support (e.g., 130W TDPs)
•Utilizes a liquid-to-water heat exchanger that provisions the required flow to the M-Racks for cooling
Notable Features of a "Cell": D-Cell and M-Cell
• “Closed-Loop Airflow” Environment
• Supports Warm Water Cooling
• Contains Large, “Unified” Cooling Racks for Efficiency
[Diagram: one complete cell = one compute rack + one cooling rack]
Common Topologies
• All-to-All
• Fat Tree (CLOS networks)
• Hypercube
• Enhanced Hypercube
• Mesh or Torus (2, 3 or more dimensions) – will be supported when in OFED
Supported on SGI ICE 8400 and SGI ICE X
ICE Differentiation: OS Noise Synchronization
[Diagram: execution timelines for Nodes 1-3. Unsynchronized OS noise: system overhead interrupts each node at different times, so wasted cycles accumulate on every node before the barrier completes. Synchronized OS noise: system overhead runs on all nodes at the same time, so the barrier completes sooner and delivers faster results.]
• OS system noise: CPU cycles stolen from a user application by the OS to do periodic or asynchronous work (monitoring, daemons, garbage collection, etc.)
• The management interface will allow users to select what gets synchronized
• Performance boost on larger-scale systems
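A toy model (author's illustration, not SGI data) shows why unsynchronized noise hurts more as systems grow: a bulk-synchronous step is only as fast as its slowest node, so if any one of N nodes takes a noise hit the whole step stalls, whereas synchronized noise stalls only the fraction of steps in which the noise actually fires.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int N = 1024, steps = 100000;
        const double p = 0.01, d = 1.0;  /* noise probability and cost per step */
        double unsync = 0.0, synced = 0.0;
        srand(1);

        for (int t = 0; t < steps; t++) {
            /* unsynchronized: one noisy node is enough to stall the barrier */
            int hit = 0;
            for (int n = 0; n < N && !hit; n++)
                hit = (rand() / (double)RAND_MAX) < p;
            unsync += 1.0 + (hit ? d : 0.0);

            /* synchronized: all nodes take their noise in the same step */
            synced += 1.0 + ((rand() / (double)RAND_MAX) < p ? d : 0.0);
        }
        printf("unsynchronized: %.0f steps-worth of time\n", unsync);
        printf("synchronized:   %.0f steps-worth of time\n", synced);
        return 0;
    }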
Cool Customers
SGI ICE X
SGI ICE X: Initial Customers
• NASA: increasing their current SGI ICE system, "Pleiades," by 35% with multiple racks based on the future Intel® Xeon® processor E5 family; will reach 1.7 petaflops
– Facilitate new discoveries for Earth Science research projects
– Modeling and simulation to support flight regimes and new designs for aircraft
– Engineering risk assessment of crew risk probabilities to support development of launch and commercial crew vehicles for space exploration missions
• NTNU: 13 SGI ICE X racks @ >275 teraflops; 4 SGI InfiniteStorage 16000 racks @ 1.2 petabytes
– Accelerate numerical weather predictions
– Develop atmospheric and oceanographic models for improved weather forecasting
UN Chief Calls for Urgent Action on Climate Change
NASA Advanced Supercomputing Division, SGI® ICE
Images taken by the Thematic Mapper sensor aboard Landsat 5. Source: USGS Landsat Missions Gallery, U.S. Department of the Interior / U.S. Geological Survey
Cyclone
Cyclone Service Models (SGI Cyclone):
• Software (SaaS): SGI delivers technical application expertise and commercially available open and 3rd-party software via the Internet.
• SGI offers a platform for developers and delivers the system infrastructure.
SGI OpenFOAM® Ready for Cyclone
[Diagram: the user submits a job through the Technical Applications Portal]
Customer: iVEC and Curtin University, Australia
Problem: solving large-scale CFD problems, such as simulating wind flows over the city of Perth.
Solution: OpenFOAM scaled better on SGI Cyclone (to 1024 cores) and was 20x faster than on Amazon EC2.
Source: Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia
Balanced design & architecture
Would you attach a caravan to an F1 car?