numascale product ibm

Copyright 2013. All rights reserved.

NumaConnect

Einar Rustad, Co-Founder & CTO

September 2013

NumaConnect

• Cache Coherent Global Shared Memory and Shared IO

• Single Image Standard Operating System

• All kinds of APIs

• Commodity Servers at Cluster Price

Commodity Servers Tightly Coupled into One Monolithic System with NumaConnect

Shared Everything - One Single Operating System Image

NumaConnect - Share Everything

[Diagram: server nodes, each with CPUs, caches, memory, I/O and a NumaChip with a remote cache, connected by the NumaConnect Fabric (no switch required)]

Technology Background

• Convex Exemplar (Acquired by HP)

– First implementation of the ccNUMA architecture from Dolphin in 1994

• Data General AViiON (Acquired by EMC)

– Designed in 1996 with deliveries from 1997 - 2002

– Used Dolphin's chips with 3 generations of processor/memory buses

• I/O Attached Products for Clustering OEMs

– Sun Microsystems (SunCluster)

– Siemens RM600 Server (IO Expansion)

– Siemens Medical (3D CT)

– Philips Medical (3D Ultra Sound)

– Dassault/Thales Rafale

• HPC Clusters (WulfKit w. Scali)

– First Low Latency Cluster Interconnect

[Images: Dolphin's Low Latency Clustering HW; Dolphin's Cache Chip]

NumaChip

IBM Microelectronics ASIC

FCPBGA1023, 33mm x 33mm, 1mm ball pitch, 4-2-4 package

IBM cu-11 Technology

~ 2 million gates

Chip Size 9x11mm

NumaConnect Card


NumaChip Top Block Diagram

[Block diagram: HyperTransport, ccHT Cave, H2S, SCC, SM, SDRAM Cache, SDRAM Tags, SPI and SPI Init Module, CSR, Microcode and LC Config Data blocks, and a crossbar switch with LCs and SERDES serving the links XA, XB, YA, YB, ZA, ZB]

NumaConnect™ Node Configuration

[Diagram: multi-core CPUs with memory and an I/O bridge, connected over Coherent HyperTransport to a NumaChip with its Numa Cache+Tags memory; the NumaChip provides 6 x4 SERDES links]

NumaConnect™ System Architecture

6 external links - flexible system configurations in multi-dimensional topologies

[Diagram: multi-CPU nodes (multi-core CPUs, memory, I/O bridge, NumaChip with NumaCache) connected as a 2-D Torus and as a 3-D Torus]

Cabling Example

2-D Dataflow

[Diagram: Request and Response paths between the CPUs, caches, memory and NumaChips of different nodes]

Size does Matter

"640K ought to be enough for anybody" – Bill Gates, 1981

"We are looking for systems that can hold 10 – 20 TeraBytes in main Memory" – Trond J. Suul, Statoil, 2010

Scale-up Capacity

• Single System Image or Multiple Partitions

• Limits

– 256 TeraBytes Physical Address Space

– 4096 Nodes

– 196 608 cores

• Largest and Most Cost Effective Coherent Shared Memory

• NO Virtualization SW Layer!
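The limits above imply a few derived figures. The short sketch below works them out, assuming that "TeraBytes" means binary terabytes (2**40 bytes) and that cores and address space are spread evenly over the nodes; neither assumption is stated on the slide.

```python
# Figures quoted on the slide.
PHYSICAL_ADDRESS_SPACE_TB = 256
MAX_NODES = 4096
MAX_CORES = 196_608

# Assumption: 1 TeraByte = 2**40 bytes (binary units).
address_space_bytes = PHYSICAL_ADDRESS_SPACE_TB * 2**40

# Number of physical address bits needed to cover the address space.
address_bits = address_space_bytes.bit_length() - 1

# Assumption: cores and memory are distributed evenly over all nodes.
cores_per_node = MAX_CORES // MAX_NODES
memory_per_node_gb = address_space_bytes // MAX_NODES // 2**30

print(f"physical address bits : {address_bits}")
print(f"cores per node        : {cores_per_node}")
print(f"address space per node: {memory_per_node_gb} GB")
```

Under these assumptions the limits correspond to a 48-bit physical address, 48 cores per node and 64 GB of address space per node.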


Performance

LMbench - Chart

[Chart: Latency - LMbench. Nanoseconds (log scale) versus Array Size (MBytes), from 0.00049 to 4096 MBytes. Data labels: 1.251, 6.421, 16.732, 147.125, 287.01, 308.61, 304.582, 628.583 and 1087.576 ns.]

The Memory Hierarchy

Access Times in the Memory Hierarchy (nanoseconds, log scale)

Level            Access time (ns)        Typical size
L1 Cache         1.3                     100KB
L2 Cache         6.4                     1MB
L3 Cache         16.7                    8MB
Local Memory     90                      240GB
NumaCache        308                     4GB
Remote Memory    1 087                   < 256TB
SSD              100 000                 nx1TB
Rotating Disk    (not legible in source) nx5TB
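To get a feel for what the NumaCache level means for non-local accesses, the sketch below applies a simple weighted-average model to the latencies listed for this slide. The model and the example hit rates are illustrative assumptions, not measurements from the presentation.

```python
# Access times from the memory hierarchy slide, in nanoseconds.
NUMACACHE_NS = 308
REMOTE_MEMORY_NS = 1087


def effective_remote_latency(hit_rate):
    """Average latency of a non-local access for a given NumaCache hit rate.

    Assumption: a non-local access either hits in the NumaCache or goes
    all the way to remote memory; no other effects are modelled.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * NUMACACHE_NS + (1.0 - hit_rate) * REMOTE_MEMORY_NS


# Hypothetical hit rates, chosen only to show the trend.
for hit_rate in (0.0, 0.5, 0.9, 1.0):
    latency = effective_remote_latency(hit_rate)
    print(f"hit rate {hit_rate:.0%}: {latency:.1f} ns")
```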

Memory Bandwidth

University of Oslo:
• 72 nodes - IBM x3755
• 1 728 cores
• 4.6 TBytes Shared Memory

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 180000000000, Offset = 0
Total memory required = 4119873.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 828
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:        1599317.5028     1.9224     1.8008     2.1393
Scale:       1468219.1643     2.0954     1.9616     2.2290
Add:         1664455.1221     2.8375     2.5954     3.0947
Triad:       1492414.0721     3.0478     2.8946     3.3267
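The reported figures can be cross-checked from the array size and the best times. The sketch below assumes the usual STREAM accounting: three arrays in total, two arrays touched by Copy and Scale, three by Add and Triad, memory reported in units of 2**20 bytes and rates in units of 10**6 bytes per second.

```python
# Values taken from the STREAM output on the slide.
ARRAY_SIZE = 180_000_000_000
BYTES_PER_WORD = 8
MIN_TIME_S = {"Copy": 1.8008, "Scale": 1.9616, "Add": 2.5954, "Triad": 2.8946}

# Assumption: standard STREAM accounting of arrays read or written per kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# STREAM allocates three arrays and reports the total in units of 2**20 bytes.
total_mb = 3 * BYTES_PER_WORD * ARRAY_SIZE / 2**20


def rate_mb_per_s(kernel):
    """Bandwidth of one kernel from its best time, in 10**6 bytes per second."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * ARRAY_SIZE
    return bytes_moved / MIN_TIME_S[kernel] / 1e6


print(f"Total memory required = {total_mb:.1f} MB")
for kernel in MIN_TIME_S:
    print(f"{kernel:<6} {rate_mb_per_s(kernel):14.1f} MB/s")
```

The recomputed rates differ slightly from the printed ones because the best times on the slide are rounded to four decimals.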

Memory Allocation and Initialization

[Chart: Time to Allocate and Initialize 15GB Array in Parallel. Seconds and Ratio versus Number of Processes (1, 8, 16, 104); series: ScaleMP, Numascale, Ratio. Data labels that survive extraction (series assignment not recoverable): 423.4, 449.0, 432.5, 450, 211.1, 23.40, 3.77, 2.02, 0.96, 9, 112, 222.]

MPI Latency (Pallas)

[Chart: NumaConnect MPI Latency. Microseconds (0 to 7) versus Message Size (Bytes), from 0 to 1k.]

MPI Barrier

MPI_Barrier on 32 Nodes (Microseconds and Improvement Ratio versus Number of Processes)

Number of Processes   NumaConnect-BTL (µs)   Standard-SM-BTL (µs)   Ratio
2                     1.78                   8.09                   4.5
4                     4.06                   52.67                  13.0
8                     7.43                   102.66                 13.8
16                    8.52                   155.84                 18.3
32                    10.72                  213.38                 19.9
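The improvement ratio plotted for MPI_Barrier on 32 nodes is the Standard-SM-BTL time divided by the NumaConnect-BTL time. The sketch below recomputes it from the data labels of the chart; the pairing of labels with process counts is inferred from the order in which they appear.

```python
# Barrier times in microseconds, read from the chart's data labels.
PROCESSES = [2, 4, 8, 16, 32]
NUMACONNECT_BTL_US = [1.78, 4.06, 7.43, 8.52, 10.72]
STANDARD_SM_BTL_US = [8.09, 52.67, 102.66, 155.84, 213.38]


def improvement_ratios(baseline, improved):
    """Return baseline/improved for each pair of measurements."""
    return [b / i for b, i in zip(baseline, improved)]


ratios = improvement_ratios(STANDARD_SM_BTL_US, NUMACONNECT_BTL_US)
for n, ratio in zip(PROCESSES, ratios):
    print(f"{n:>3} processes: {ratio:.1f}x")
```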

MPI Barrier (Pallas)

MPI_Barrier, 2 - 256 processes (Microseconds and Improvement Ratio versus Number of MPI Processes)

Number of MPI Processes   NumaConnect-BTL (µs)   Standard-SM-BTL (µs)   Ratio
2                         1.3                    0.9                    0.7
4                         3.6                    2.7                    0.7
8                         6.2                    7.2                    1.2
16                        10.2                   132.5                  13.1
32                        14.3                   330.7                  23.1
64                        14.3                   330.7                  23.1
128                       22.1                   510.9                  23.1
256                       48.5                   969.3                  20.0

Scaling Applications with NumaConnect

Atle Vesterkjær, Numascale

[email protected]

September 2013

RWTH TrajSearch code

• OpenMP programs can be made NUMA-aware by decomposing memory and making sure that all memory is "local" to the CPU running the process that uses it. Affinity settings can be used to distribute jobs in a NUMA-aware way. The graph shows that the application achieves the highest speedup on NumaConnect, even though it was originally adapted to ScaleMP.

[Chart: Runtime in Seconds and Speedup versus Number of Threads (8 to 512). Runtime and speedup series for SGI Altix UV: Nehalem EX, SCALEMP: Nehalem EX, Bull BCS: Nehalem EX, SCALEMP: SandyBridge EP and Numascale.]
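The memory decomposition described for the TrajSearch code amounts to giving every thread one contiguous block of the data, so that each thread initializes and later processes the same block. The sketch below shows only that index arithmetic; `block_partition` is a hypothetical helper, and the actual page placement depends on the operating system's allocation policy, which is not modelled here.

```python
def block_partition(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal blocks.

    The first (n_items % n_workers) blocks get one extra item.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_items, n_workers)
    blocks = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Example: 10 array elements shared between 4 threads.
for worker, block in enumerate(block_partition(10, 4)):
    print(f"thread {worker}: indices {block.start}..{block.stop - 1}")
```

If every thread both initializes and processes only its own block, and threads are pinned with affinity settings, the data a thread works on can stay in the memory of its own node.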

Stream on 16 Nodes

Applications that are scalable by design achieve great results on a NumaConnect Shared Memory System.

[Chart: Streams scaling (16 nodes - 32 sockets - 192 cores). MByte/sec (0 to 200 000) versus Number of CPU Sockets (1 to 32); series: Triad – NumaConnect, Triad – 1041A-T2F, Triad – Linear socket, Triad – Linear node.]

Porting MPI applications to NC-OpenMP

• On a shared memory system, OpenMP programs are faster and have less overhead than MPI programs.

• Both NUMA-aware programs and MPI programs care about memory locality. It is therefore often possible to take an MPI program and convert it to a NUMA-aware OpenMP program that will have a shorter runtime.

• To demonstrate this, the NAS Parallel Benchmark SP has been used.

NPB SP D-class Mop/s total

• The SP benchmark exists in both an OpenMP version written for standalone servers and an MPI version written for distributed servers.

• As the computational job is the same we find the code written for distributed servers (MPI) easier to convert to a NumaConnect optimized version since we also need to consider memory locality.

• The graph shows the scalability for the SP benchmark code when converted to a NumaConnect optimized version.

NPB-SP NC-OPENMP, 256 cores, 8 NumaConnect Nodes, CLASS=D

Number of threads   Mop/s total
16                  20700.87
36                  35600.82
64                  54029.36
121                 62988.6
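The scalability in the graph can be expressed as speedup and parallel efficiency relative to the 16-thread run. The sketch below uses the Mop/s data labels of the chart; taking 16 threads as the baseline is a choice made for this example, not something stated on the slide.

```python
# Mop/s total per thread count, read from the chart's data labels.
THREADS = [16, 36, 64, 121]
MOPS_TOTAL = [20700.87, 35600.82, 54029.36, 62988.6]


def speedups(values):
    """Throughput of each run relative to the first run."""
    return [v / values[0] for v in values]


def efficiencies(threads, values):
    """Speedup divided by the increase in thread count over the first run."""
    return [s / (t / threads[0]) for s, t in zip(speedups(values), threads)]


for t, s, e in zip(THREADS, speedups(MOPS_TOTAL), efficiencies(THREADS, MOPS_TOTAL)):
    print(f"{t:>3} threads: speedup {s:.2f}, efficiency {e:.0%}")
```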

NPB SP D-class runtime

• The overhead introduced by MPI is not needed when we are running on a Shared Memory System

NPB-SP NC-OPENMP D-CLASS, 8 NumaConnect Nodes: Runtime [sec]

Number of threads   Runtime [sec]
16                  1426.8
36                  829.64
64                  546.67
121                 468.91

NPB SP D-class runtime affinity effect

• The figure on the left illustrates the great scaling we get from the NPB-SP on the NumaConnect system

• The figure on the right illustrates that using affinity to distribute the job evenly between the different NumaConnect Nodes (and cores in the system) leads to shorter runtime

• These effects will be analyzed more in the next slides

Left: Runtime (sec) where the threads are evenly distributed on all NumaConnect Nodes. Right: Runtime (sec) bound to only the first <n> cores.

Number of processes   Evenly distributed (sec)   Bound to first <n> cores (sec)
16                    1426.8                     5701.24
36                    829.64                     2496.88
64                    546.67                     1512.98
121                   468.91                     967.72
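The size of the affinity effect on this slide can be quantified by dividing the runtime of the bound runs by the runtime of the evenly distributed runs. The sketch below does that with the data labels of the two charts; the pairing of labels with process counts follows their order in the charts.

```python
# Runtimes in seconds, read from the data labels of the two charts.
PROCESSES = [16, 36, 64, 121]
EVENLY_DISTRIBUTED_S = [1426.8, 829.64, 546.67, 468.91]
BOUND_TO_FIRST_CORES_S = [5701.24, 2496.88, 1512.98, 967.72]


def slowdown(bound, distributed):
    """How many times longer the bound run takes than the distributed run."""
    return [b / d for b, d in zip(bound, distributed)]


factors = slowdown(BOUND_TO_FIRST_CORES_S, EVENLY_DISTRIBUTED_S)
for n, factor in zip(PROCESSES, factors):
    print(f"{n:>3} processes: bound run takes {factor:.2f}x as long")
```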


• The runtime using 121 cores is cut to half when running on 8 NumaConnect Nodes, compared to 4 NumaConnect Nodes

• The number of cores per FPU is one when running on 8 NumaConnect Nodes and two when running on 4 NumaConnect Nodes.

• The picture on the right shows that the ... The AMD Opteron™ 6300 series processor is organized as a Multi-Chip Module (two CPU dies in the package). Each die has 8 cores, but 4 FPUs

8 NumaConnect Nodes: Runtime [sec] vs Affinity, 121 cores using SP CLASS D

Affinity   Runtime (sec)
0-127      998.86
0-127:2    475.28


• The runtime using 64 cores is cut to half when running on 8 NumaConnect Nodes, compared to 4 NumaConnect Nodes

• In this test the number of FPUs is constant

• The scaling is good when using more nodes

• The AMD Opteron™ 6300 series memory controller interface allows external connection of up to two memory channels per die. This memory interface is shared among all the CPU core pairs on each die.

Runtime (sec) decreasing when moving from 4 to 8 NumaConnect Nodes using affinity settings and SP CLASS D

Affinity settings   Runtime (sec)
0-127:2             800.69
0-255:4             548.13

NPB SP D-class runtime MPI vs OpenMP

• The runtime when using NPB SP D-class is the same for the MPI version as when using NC-OpenMP when the number of cores is 121 or less.

• The MPI library overhead in the communication part of the runtime is not significant for these small sizes.

[Chart: Runtime (sec) where the threads are evenly distributed on all NumaConnect Nodes, versus Number of processes (16, 36, 64, 121); series: OpenMP, MPI. Data labels: 1426.8, 829.64, 546.67, 468.91.]

NPB SP D-class runtime MPI vs OpenMP

• The runtime when using NPB SP D-class is higher for MPI compared to the NC-OpenMP version when the number of cores is 144 or higher.

• The MPI library overhead in the communication part of the runtime now plays a larger role.

[Chart: Runtime (sec) bound to only the first <n> cores, versus Number of processes (144, 169, 196, 225, 256); series: OpenMP, MPI. Data labels that survive extraction (series assignment not recoverable): 701.58, 953.39, 729.24, 708.68, 599.04, 554.57, 478.14, 415.93, 534.34.]

NPB SP E-class runtime

• The SP benchmark E-class scales perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2).

• General statement: Larger problems scale better.

• E-class problems are more representative for large shared memory systems and clusters.

Runtime [sec]: NPB-SP NC-OPENMP E-CLASS, 8 NumaConnect Nodes

Number of processes   Runtime [sec]
64                    15242.11
121                   7246.13
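The affinity strings on these slides, such as 0-255:4, can be read as "first core - last core : stride". The sketch below expands such a string into the list of selected cores. That reading is an assumption based on the common CPU-list notation, and `expand_affinity` is a hypothetical helper, not a tool named in the presentation.

```python
def expand_affinity(spec):
    """Expand 'first-last[:stride]' (or a single core number) into core IDs.

    Assumption: the range is inclusive and the stride defaults to 1.
    """
    span, _, stride_text = spec.partition(":")
    first_text, _, last_text = span.partition("-")
    first = int(first_text)
    last = int(last_text) if last_text else first
    stride = int(stride_text) if stride_text else 1
    if stride <= 0 or last < first:
        raise ValueError(f"invalid affinity specification: {spec!r}")
    return list(range(first, last + 1, stride))


for spec in ("0-255:4", "0-241:2"):
    cores = expand_affinity(spec)
    print(f"{spec}: {len(cores)} cores, first {cores[:3]}, last {cores[-1]}")
```

With this reading, 0-255:4 selects 64 cores and 0-241:2 selects 121 cores, which matches the process counts quoted for the E-class runs.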

NPB SP E-class runtime

• The SP benchmark E-class scales well from 16 to 121 processes on a 32 node system with a total of 384 cores

• NB: The test with 384 cores has been run on an old system and is presented to show scaling and not absolute values.

Runtime: NPB-SP NC-OPENMP E-CLASS, 32 NumaConnect Nodes: "Time in seconds"

Affinity    Runtime (sec)
0-383:24    53922.05
0-383:10    32481.25
0-383:6     19740.37
0-363:3     11247.03

Big Data with Shared Memory

"Any application requiring a large memory footprint can benefit from a shared memory computing environment."

William W. Thigpen, Chief, Engineering Branch, NASA Advanced Supercomputing (NAS) Division
