SGI HPC (29 May 2012)
HPC Milestones – Michal Klimeš
Experts @ HPC
• Structural Mechanics (Implicit)
• Structural Mechanics (Explicit)
• Computational Fluid Dynamics
• Electro-Magnetics
• Computational Chemistry (Quantum Mechanics)
• Computational Chemistry (Molecular Dynamics)
• Reservoir Simulation
• Rendering / Ray Tracing
• Climate / Weather / Ocean Simulation
• Data Analytics
• Computational Biology
• Seismic Processing
Competency = Real HPC + Big Storage
From TOP500
There are no small things
OpenFOAM® Performance with SGI MPI
Performance: SGI MPT vs. OpenMPI
[Chart: speedup vs. #cores, with the MPT/OpenMPI ratio on the secondary axis]

Cores   SGI MPT speedup   OpenMPI speedup   MPT/OpenMPI ratio
 64         1.02              1.00               1.02
128         1.73              1.59               1.08
192         2.29              2.01               1.14
256         2.72              2.01               1.35
Automotive Interior Climate Model, 19M cells
OpenFOAM with SGI MPI delivers up to 35% better performance.
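The 35% figure is just the 256-core ratio from the chart (2.72 / 2.01 = 1.35). A minimal sketch, assuming only the values transcribed above, recomputes the ratios:

    /* Recompute the MPT/OpenMPI ratios from the chart data above
     * (illustrative helper, not SGI code). */
    #include <stdio.h>

    int main(void) {
        const int    cores[]   = {64, 128, 192, 256};
        const double mpt[]     = {1.02, 1.73, 2.29, 2.72};
        const double openmpi[] = {1.00, 1.59, 2.01, 2.01};

        for (int i = 0; i < 4; i++) {
            double ratio = mpt[i] / openmpi[i];
            printf("%3d cores: MPT/OpenMPI = %.2f (+%.0f%%)\n",
                   cores[i], ratio, (ratio - 1.0) * 100.0);
        }
        return 0;
    }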
Workload   Power*     % of Linpack power
Linpack    30.5 kW    100%
Fluent     22.4 kW    73.4%
GUPS       23.3 kW    76.4%
STREAM     22.1 kW    72.5%
Idle       15.9 kW    52.1%
* Measured on ICE 8200 system with 128x 2.66GHz Quad-Core Intel® Xeon® Processor 5300 series (1 Rack)
What is the "average" power consumption?
Average power consumption heavily depends on:
• the application and its data profile
• the level of code optimization (+ libraries + MPI optimization)
• the ability of the job scheduler to utilize the system
• the bottlenecks in the I/O subsystem and in the OS
Where is performance?
Real Memory Bandwidth Requirements: Measurements at LRZ on SGI Altix 4700
[Chart: measured per-application memory bandwidth against the 1 B/s : 1 Flop/s balance line; "Accelerated" region marked]
Source: Matthias Brehm (LRZ), inSiDE, Vol. 4, No. 2
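To make the 1 B/s : 1 Flop/s balance concrete, consider a STREAM-triad-style loop (an illustrative kernel, not from the slide): it needs far more than one byte of memory traffic per flop, which is why memory bandwidth, not peak flops, limits many real codes.

    /* STREAM-triad-like kernel: 2 flops per iteration, but 24 bytes
     * of memory traffic (load b[i], load c[i], store a[i]), i.e.
     * 12 bytes per flop -- well above the 1 B/s : 1 Flop/s line. */
    #include <stddef.h>

    void triad(double *a, const double *b, const double *c,
               double s, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }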
SGI HPC Servers and Supercomputers
• Altix® ICE: blade cluster architecture – scalability leader
• CloudRack™: tray cluster architecture (for Internet datacenters)
• Rackable™: 1U, 2U, 3U, 4U & XE – build-to-order leader
• Altix® UV: shared-memory architecture – virtualization & many-core leader
Scale-out to scale-up
SGI UV2: 4th Generation SMP System
• The most flexible system!
SGI UV Shared Memory Architecture
Commodity Clusters:
• Each node has its own memory and OS
• Nodes communicate over a commodity interconnect (InfiniBand or Gigabit Ethernet)
• Inefficient cross-node communication creates bottlenecks
• Coding required for parallel code execution
[Diagram: many nodes, each with ~64GB of memory and its own system + OS, linked by InfiniBand or Gigabit Ethernet]
SGI UV Platform:
• All nodes operate on one large shared memory space
• Eliminates data passing between nodes
• Big data sets fit entirely in memory
• Less memory per node required
• Simpler to program
• High performance, low cost, easy to deploy
• Global shared memory up to 16TB
[Diagram: a single system + OS spanning all nodes over the SGI NUMAlink interconnect]
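A minimal sketch of what this means for the programmer, assuming a simple global sum (illustrative code, not from the slide): on shared memory the whole array is directly addressable, while on a cluster each rank owns a slice and must combine results with explicit message passing.

    #include <omp.h>
    #include <mpi.h>

    /* Shared memory (UV-style): every thread sees the whole array;
     * no data distribution is coded. */
    double sum_shared(const double *x, long n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* Cluster-style: each rank sums its local slice, then an explicit
     * message-passing step combines the partial sums. */
    double sum_cluster(const double *x_local, long n_local) {
        double s = 0.0, total = 0.0;
        for (long i = 0; i < n_local; i++) s += x_local[i];
        MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return total;
    }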
The UV2 Advantage
Long 15-year heritage: same principles as the Altix 4700, but with:
– Intel Sandy Bridge Xeon multi-core processors
– Large, scalable shared-memory system
• Up to 4096 cores and 64TB per partition
• Up to 2048 cores, 4096 threads and 32TB per partition
• Multi-partition systems with up to 16384 sockets and 2PB across multiple partitions
• MPI and UPC acceleration by hardware offload
• Cross-partition communication
Without competition in 2012:
– Thanks to the proven SGI ccNUMA architecture
– Reliability, availability and serviceability
• Sandy Bridge EP 4S with better reliability properties than EP 2S
– All-hardware solution
• ccNUMA transparent to end users
SGI UV2 Interconnect with Global Addressing
NUMAlink® routers connect nodes into multi-rack UV systems.
The HUB snoops the socket QPI and accelerates remote access.
The HUB offloads programming models: MPI and UPC (Co-Array Fortran not yet).
[Diagram: four Altix UV blades, each with two CPUs (64GB + 64GB memory) and a HUB, connected over NUMAlink through a high-radix router; 512GB of globally addressable memory]
UV Foundation: GAM + Communications Offload
[Diagram: Intel CPU socket [S] attached to the HUB's PI, MI and NI interfaces, with NUMAlink to other nodes; AMU for physical [P] addressing, GRU with TLB for virtual [V] addressing]
• GAM: Globally Addressable Memory, 8PB (53-bit)
• GSM minus cache coherence = GAM
• GSM: partition memory (OS), max. 2K cores / 16TB
• GAM: PGAS memory (cross-partition)
• Communications offload (GRU + AMU): accelerates PGAS codes and MPI codes (MOE, analogous to a TOE)
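A hedged sketch of the PGAS style these features accelerate, written against the standard OpenSHMEM API (SGI MPT ships a SHMEM library; the exact calls below follow the OpenSHMEM standard and are illustrative only): PE 0 stores directly into PE 1's symmetric memory with no receive posted on the remote side, which is the access pattern the GRU offloads.

    #include <shmem.h>

    int main(void) {
        shmem_init();
        long *buf = shmem_malloc(sizeof(long)); /* symmetric allocation */
        *buf = -1;
        shmem_barrier_all();

        long val = 42;
        if (shmem_my_pe() == 0)
            shmem_long_put(buf, &val, 1, 1);    /* one-sided store into PE 1 */

        shmem_barrier_all();
        shmem_finalize();
        return 0;                               /* run with 2+ PEs */
    }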
UV1 vs. UV2
• Socket: UV1 NHM-EX / WSM-EX; UV2 SNB-EX-B & SNB-EP, IVB-EX-B & IVB-EP
• QPI: UV1 QPI 1.0; UV2 QPI 1.1
• Glue: UV1 H + H + R as 3 separate 90nm chips, with (D) directory DIMM and (S) snoop DRAM; UV2 H + H + R in one 40nm chip, no directory DIMM, no snoop DRAM, better AMOs
• Interconnect: UV1 NL5 at 6.25 GT/s, 8b/10b encoding, 4 x 12 lanes, Cu only, 7m max; UV2 NL6 with higher payload, 16 x 4 lanes, Cu & optical, 20m max
[Diagram: UV1 hub chips (H, D, S) vs. UV2 integrated hub (H) and router (R); UV MPI Barrier comparison]
Additional Performance Acceleration
• Altix UV offers up to 3X improvement in MPI reduction operations
• Barrier latency is dramatically better (80x) than on competing platforms
• HPCC benchmarks show substantial improvement possible with the MPI Offload Engine (MOE)
[Charts: HPCC benchmark results and barrier latency (<1 µsec at 4096 threads), UV with MOE vs. UV with MOE disabled. Source: SGI Engineering projections]
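A minimal microbenchmark sketch of the kind used to produce such barrier-latency numbers (illustrative, not SGI's benchmark code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;
        MPI_Barrier(MPI_COMM_WORLD);            /* warm-up / alignment */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double avg = (MPI_Wtime() - t0) / iters;

        if (rank == 0)
            printf("average barrier latency: %.2f usec\n", avg * 1e6);
        MPI_Finalize();
        return 0;
    }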
UV2000 16 Socket 8 Blade IRU
Notes
• IRU: 10U high by 19" wide by 27" deep
• 8 blades (8 Harps & 16 sockets) per IRU
• 1 or 2 CMCs in rear of IRU
• 3 UV1 12V power supplies
• Nine 12V cooling fans, N+1
[Diagram, front and rear views: two signal backplanes and a power backplane, with the CMC at the rear; 16 NL channels cabled in the air plenum connect the right and left signal backplanes]
SGI UV2 Node Architecture and Numalink 6
[Block diagram: two Sandy Bridge EP or EX sockets connected by QPI 1.1 (8GT/s, 32GB/s) to the UV2 HUB; 4 DDR3 channels per socket, 2 DPC at 1600MHz; 16 x4 NL6 channels at 12.5GT/s from the HUB, feeding the NL0-plane and NL1-plane IRU external links]
•Numalink 6
–12.5GT/s, i.e. 6.7GB/s net bidirectional bandwidth per link
–16 NL6 links aggregate Bandwidth out of blade: 107.2GB/s
–12 NL6 internal links to backplane – aggregate: 80.4GB/s
– 4 NL6 external links to routers – 26.8 GB/s
•Numalink 6 Routers
–16 NL6 ports
•Numalink cable
–Leverages low-cost QSFP 4x QDR InfiniBand cables
•Directory memory part of main memory
–Available memory is reduced by 10 percent
• No memory buffers as in UV1
• Same per-socket performance as in a cluster
• 40 PCIe lanes per socket (2x PCIe Gen3 x16 per blade)
• 12 IRU-internal links
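The aggregate numbers above are simply the per-link figure times the link count; a trivial consistency check (values from this slide):

    #include <stdio.h>

    int main(void) {
        const double per_link = 6.7;  /* GB/s net bidirectional per NL6 link */
        printf("16 links out of blade: %.1f GB/s\n", 16 * per_link); /* 107.2 */
        printf("12 links to backplane: %.1f GB/s\n", 12 * per_link); /*  80.4 */
        printf(" 4 links to routers:   %.1f GB/s\n",  4 * per_link); /*  26.8 */
        return 0;
    }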
System Topology
•1 IRU
–Hypercube
–Max 2 hops between blades
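In a plain hypercube, the minimal hop count between two nodes is the Hamming distance of their binary IDs (each hop flips one address bit); the denser NL6 wiring inside the IRU brings the worst case down to the 2 hops quoted above. A small illustrative sketch (not SGI code):

    #include <stdio.h>

    /* Hop count between hypercube nodes = Hamming distance of IDs. */
    int hops(unsigned a, unsigned b) {
        unsigned x = a ^ b;
        int h = 0;
        while (x) { h += x & 1u; x >>= 1; }  /* popcount */
        return h;
    }

    int main(void) {
        /* blades 0 (000) and 5 (101) differ in two bits: 2 hops */
        printf("blade 0 -> blade 5: %d hops\n", hops(0, 5));
        return 0;
    }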
UV2 Topology
UV2 Feature Advances

Feature           UV1                UV2
System scale      2048c / 4096t      4096c / 4096t
Memory/SSI        16TB               64TB
Interconnect      NUMAlink 5         NUMAlink 6 (2.5X data rate)
NL fabric scale   32K sockets        32K+ sockets
Processor         Nehalem-EX         Sandy Bridge
Sockets/rack      64 (large 24")     64 (standard 19")
Reliability       Enterprise class   Enterprise class

MIC architecture: x86 compatible, 1.3TF/s double-precision peak, 340GB/s bandwidth
SGI ICE X: Fifth Generation ICE System
• The world’s fastest supercomputer just got faster!
• Flexible to fit your workload
SGI® ICE: Firsts and Onlies
• First InfiniBand-connected pure-compute CPU cluster over 1PF peak
• World's fastest distributed-memory system
• World's fastest and most scalable computational fluid dynamics system
• First and only vendor to support multiple fabric-level topologies, plus flexibility at the node, switch and fabric level, plus application benchmarking expertise for same
• First and only vendor capable of live, large-scale compute capacity integration
Dialing Up the Density: SGI ICE 8400 vs. SGI ICE X
• SGI ICE 8400: 64 nodes (128 sockets), 30" width
• ICE X M-Rack: 72 x 2 = 144 nodes (288 sockets), 28" width
• ICE X D-Rack: 72 nodes (144 sockets), 24" width
• Separable power shelf
SGI ICE X Enclosure Design
Building-block increments of two blade enclosures: "one enclosure pair"
[Diagram, rear view: 21U building block in a 19" rack mount; enclosure dimensions 17.7", 16.59" (9.5U), 1.75" (1U)]
Features per enclosure pair:
• 36 blade slots
• Four fabric switch slots
• Integrated management
SGI ICE X Compute Blade: IP-113 (Dakota) for "D-Rack"
FDR mezzanine card options
Main Features:
•Supports single or dual plane FDR InfiniBand
•Supports two future Intel® Xeon® processor E5 family CPUs
•Supports up to eight DDR3 DIMMs per socket @ 1600 MT/s
•Houses up to two 2.5” SATA drives for local swap/scratch usage
•Utilizes traditional heat sinks
SGI ICE X Compute Blade: IP-115 (Gemini Twin) for "M-Rack"
Main Features:
•Supports single plane FDR InfiniBand
•Supports four future Intel® Xeon® processor E5 family CPUs
•Two dual socket nodes
•Supports four DDR3 DIMMs per socket @1600 MT/s
•Houses up to two 2.5" SATA drives for local swap/scratch usage (one per node)
•Utilizes traditional heat sinks and cold sinks (liquid)
On-Socket Water-Cooling Detail
Used for IP-115 Gemini "twin" blades; replaces the traditional air-cooled heat sinks on the CPUs.
•Resides between the pair of node boards in each blade slot ("M-Rack" deployment)
•Enables highest-watt SKU support (e.g., 130W TDPs)
•Utilizes a liquid-to-water heat exchanger that provisions the required flow to the M-Racks for cooling
Notable Features of a "Cell": D-Cell and M-Cell
• “Closed-Loop Airflow” Environment
• Supports Warm Water Cooling
• Contains Large, “Unified” Cooling Racks for Efficiency
[Diagram: one complete cell = one compute rack + one cooling rack]
Common Topologies
• All-to-All
• Fat Tree (CLOS networks)
• Hypercube
• Enhanced Hypercube
• Mesh or Torus (2, 3 or more dimensions) – will be supported when in OFED
Supported on SGI ICE 8400 and SGI ICE X
ICE Differentiation: OS Noise Synchronization
[Diagram: execution timelines for Nodes 1-3. Unsynchronized OS noise: system overhead interrupts each node at different times, so wasted cycles accumulate on every node before the barrier completes. Synchronized OS noise: system overhead runs on all nodes at the same time, so the barrier completes sooner and delivers faster results.]
• OS system noise: CPU cycles stolen from a user application by the OS to do periodic or asynchronous work (monitoring, daemons, garbage collection, etc.)
• The management interface will allow users to select what gets synchronized
• Performance boost on larger-scale systems
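A toy model (author's illustration, not SGI data) shows why unsynchronized noise hurts more as systems grow: a bulk-synchronous step is only as fast as its slowest node, so if any one of N nodes takes a noise hit the whole step stalls, whereas synchronized noise stalls only the fraction of steps in which the noise actually fires.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int N = 1024, steps = 100000;
        const double p = 0.01, d = 1.0;  /* noise probability and cost per step */
        double unsync = 0.0, synced = 0.0;
        srand(1);

        for (int t = 0; t < steps; t++) {
            /* unsynchronized: one noisy node is enough to stall the barrier */
            int hit = 0;
            for (int n = 0; n < N && !hit; n++)
                hit = (rand() / (double)RAND_MAX) < p;
            unsync += 1.0 + (hit ? d : 0.0);

            /* synchronized: all nodes take their noise in the same step */
            synced += 1.0 + ((rand() / (double)RAND_MAX) < p ? d : 0.0);
        }
        printf("unsynchronized: %.0f steps-worth of time\n", unsync);
        printf("synchronized:   %.0f steps-worth of time\n", synced);
        return 0;
    }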
Cool Customers
SGI ICE X
SGI ICE X: Initial Customers
• NASA: increasing their current SGI ICE system, "Pleiades," by 35% with multiple racks based on the future Intel® Xeon® processor E5 family; will reach 1.7 petaflops
– Facilitate new discoveries for Earth Science research projects
– Modeling and simulation to support flight regimes and new designs for aircraft
– Engineering risk assessment of crew risk probabilities to support development of launch and commercial crew vehicles for space exploration missions
• NTNU: 13 SGI ICE X racks @ >275 teraflops; 4 SGI InfiniteStorage 16000 racks @ 1.2 petabytes
– Accelerate numerical weather predictions
– Develop atmospheric and oceanographic models for improved weather forecasting
UN Chief Calls for Urgent Action on Climate Change
NASA Advanced Supercomputing Division, SGI® ICE
Images taken by the Thematic Mapper sensor aboard Landsat 5. Source: USGS Landsat Missions Gallery, U.S. Department of the Interior / U.S. Geological Survey
Cyclone
Cyclone Service Models (SGI Cyclone):
• Software (SaaS): SGI delivers technical application expertise and commercially available open and 3rd-party software via the Internet.
• SGI offers a platform for developers and delivers the system infrastructure.
SGI OpenFOAM® Ready for Cyclone
[Diagram: the user submits a job through the Technical Applications Portal]
Customer: iVEC and Curtin University, Australia
Problem: solving large-scale CFD problems, such as simulating wind flows over the city of Perth.
Solution: OpenFOAM scaled better on SGI Cyclone (to 1024 cores) and was 20x faster than on Amazon EC2.
Source: Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia
Balanced design & architecture
Would you attach a caravan to an F1 car?