numascale product ibm


Upload: ibm-danmark

Post on 11-May-2015


DESCRIPTION

Presentation from the HPC event at IBM Denmark - September 2013, Copenhagen

TRANSCRIPT

Page 1: Numascale Product IBM

Copyright 2013. All rights reserved.

NumaConnect

Einar Rustad, Co-Founder & CTO

September 2013

Page 2: Numascale Product IBM

NumaConnect

• Cache Coherent Global Shared Memory and Shared IO
• Single Image Standard Operating System
• All kinds of APIs

Commodity Servers Tightly Coupled into One Monolithic System with NumaConnect

At Cluster Price

Page 3: Numascale Product IBM

NumaConnect - Share Everything

Shared Everything - One Single Operating System Image

[Diagram: four nodes, each with CPUs, Caches, Memory, I/O and a NumaChip with Remote Cache, joined by the NumaConnect Fabric (no switch required).]

Page 4: Numascale Product IBM

Technology Background

• Convex Exemplar (Acquired by HP)
  – First implementation of the ccNUMA architecture from Dolphin in 1994
• Data General Aviion (Acquired by EMC)
  – Designed in 1996 with deliveries from 1997 - 2002
  – Used Dolphin’s chips with 3 generations of processor/memory buses
• I/O Attached Products for Clustering OEMs
  – Sun Microsystems (SunCluster)
  – Siemens RM600 Server (IO Expansion)
  – Siemens Medical (3D CT)
  – Philips Medical (3D Ultra Sound)
  – Dassault/Thales Rafale
• HPC Clusters (WulfKit w. Scali)
  – First Low Latency Cluster Interconnect

[Images: Dolphin’s Low Latency Clustering HW; Dolphin’s Cache Chip]

Page 5: Numascale Product IBM

NumaChip

IBM Microelectronics ASIC

FCPBGA1023, 33mm x 33mm, 1mm ball pitch, 4-2-4 package

IBM cu-11 Technology

~ 2 million gates

Chip Size 9x11mm

Page 6: Numascale Product IBM

NumaConnect Card

Page 7: Numascale Product IBM

NumaChip Top Block Diagram

[Block diagram. Labeled blocks: SDRAM Cache, SDRAM Tags, HyperTransport, SPI, H2S, ccHT Cave, SM, SPI Init Module, SCC, CSR, LC Config Data, Microcode, and a Crossbar switch with LCs and SERDES serving the links XA, XB, YA, YB, ZA, ZB.]

Page 8: Numascale Product IBM

NumaConnect™ Node Configuration

[Diagram. Labeled blocks: two Multi-Core CPUs, each with Memory modules; I/O Bridge; NumaChip with Memory for NumaCache + Tags. The NumaChip connects through Coherent HyperTransport and has 6 x4 SERDES links.]

Page 9: Numascale Product IBM

NumaConnect™ System Architecture

6 external links - flexible system configurations in multi-dimensional topologies

[Diagram: a Multi-CPU Node (Multi-Core CPUs, Memory, I/O Bridge, NumaChip, NumaCache) shown next to two example topologies, a 2-D Torus and a 3-D Torus.]
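The relation between six external links and a 3-D torus can be sketched as follows. This is a minimal illustration, not Numascale's routing logic: the function name `torus_neighbors` and the coordinate scheme are hypothetical, and it assumes the six links pair up as one "minus" and one "plus" link per dimension.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbours of `node` in a 3-D torus of size `dims`.

    Each of the three dimensions contributes a "minus" and a "plus"
    neighbour, and coordinates wrap around at the edges.
    (Hypothetical helper, not part of any Numascale software.)
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo implements the wrap-around that makes it a torus.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors
```

Calling `torus_neighbors((0, 0, 0), (4, 4, 4))` lists the six nodes one hop away from the corner node of a 4 x 4 x 4 system. A 2-D torus is the special case where one dimension has size 1, so that the two links in that dimension lead back to the node itself.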

Page 10: Numascale Product IBM

Cabling Example

Page 11: Numascale Product IBM

2-D Dataflow

[Diagram: Request and Response paths between nodes, each node consisting of CPUs, Caches, Memory and a NumaChip.]

Page 12: Numascale Product IBM

Size does Matter

• "640K ought to be enough for anybody" – Bill Gates, 1981

• "We are looking for systems that can hold 10 – 20 TeraBytes in main Memory" – Trond J. Suul, Statoil, 2010

Page 13: Numascale Product IBM

Scale-up Capacity

• Single System Image or Multiple Partitions
• Limits
  – 256 TeraBytes Physical Address Space
  – 4096 Nodes
  – 196 608 cores
• Largest and Most Cost Effective Coherent Shared Memory
• NO Virtualization SW Layer!
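The limits on this slide can be related to each other with a little arithmetic. The sketch below assumes that "TeraByte" means the binary unit (2**40 bytes) and that the address space is divided evenly over the nodes. Neither assumption is stated on the slide, and the variable names are illustrative only.

```python
# Limits as printed on the slide.
PHYSICAL_ADDRESS_SPACE_TB = 256
MAX_NODES = 4096
MAX_CORES = 196_608

# Assumption: 1 TeraByte = 2**40 bytes.
address_space_bytes = PHYSICAL_ADDRESS_SPACE_TB * 2**40

# For an exact power of two, bit_length() - 1 gives log2 without rounding.
address_bits = address_space_bytes.bit_length() - 1
node_id_bits = MAX_NODES.bit_length() - 1

# Cores per node at the maximum configuration.
cores_per_node = MAX_CORES // MAX_NODES

# Assumption: the address space is split evenly over all nodes.
address_space_per_node_gb = PHYSICAL_ADDRESS_SPACE_TB * 1024 // MAX_NODES

print(address_bits, node_id_bits, cores_per_node, address_space_per_node_gb)
```

Under these assumptions the figures correspond to a 48-bit physical address, 12 bits of node number, 48 cores per node and 64 GBytes of address space per node.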

Page 14: Numascale Product IBM

Performance

Page 15: Numascale Product IBM

LMbench - Chart

[Chart: Latency - LMbench. Y axis: Nanoseconds (log scale, 1 to 10000). X axis: Array Size (MBytes), from 0,00049 to 4096. Data labels in nanoseconds: 1,251; 6,421; 16,732; 147,125; 287,01; 308,61; 304,582; 628,583; 1087,576.]

Page 16: Numascale Product IBM

The Memory Hierarchy

[Chart: Access Times in the Memory Hierarchy. Y axis: Nanoseconds (log scale, 1 to 10 000 000).]

L1 Cache; 1,3 ns; typical size 100KB
L2 Cache; 6,4 ns; typical size 1MB
L3 Cache; 16,7 ns; typical size 8MB
Local Memory; 90 ns; typical size 240GB
NumaCache; 308 ns; typical size 4GB
Remote Memory; 1 087 ns; typical size < 256TB
SSD; 100 000 ns; typical size nx1TB
Rotating Disk; (value garbled in the transcript); typical size nx5TB
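To put the access times on this slide in proportion, the following sketch expresses each level as a multiple of a baseline level. The values are the data labels from the chart; the dictionary and the `slowdown` helper are illustrative names. The Rotating Disk entry is left out because its value is not legible in the transcript.

```python
# Access times in nanoseconds, taken from the chart's data labels.
ACCESS_TIME_NS = {
    "L1 Cache": 1.3,
    "L2 Cache": 6.4,
    "L3 Cache": 16.7,
    "Local Memory": 90,
    "NumaCache": 308,
    "Remote Memory": 1087,
    "SSD": 100_000,
}


def slowdown(level: str, baseline: str = "Local Memory") -> float:
    """Return how many times slower `level` is than `baseline`."""
    return ACCESS_TIME_NS[level] / ACCESS_TIME_NS[baseline]


for name in ACCESS_TIME_NS:
    print(f"{name}: {slowdown(name):.2f} x local memory")
```

With these numbers a NumaCache hit costs about 3,4 times a local memory access and a remote memory access about 12 times, while an SSD access is roughly 92 times slower than remote memory (`slowdown("SSD", "Remote Memory")`).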

Page 17: Numascale Product IBM

Memory Bandwidth

University of Oslo:
• 72 nodes - IBM x3755
• 1 728 cores
• 4.6 TBytes Shared Memory

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 180000000000, Offset = 0
Total memory required = 4119873.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 828
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:        1599317.5028     1.9224     1.8008     2.1393
Scale:       1468219.1643     2.0954     1.9616     2.2290
Add:         1664455.1221     2.8375     2.5954     3.0947
Triad:       1492414.0721     3.0478     2.8946     3.3267
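The STREAM figures above are internally consistent, which a short calculation can confirm. The sketch relies on the usual STREAM conventions: three arrays in total, Copy and Scale touching two arrays per iteration, Add and Triad touching three, rates reported in units of 10**6 bytes per second, and total memory in units of 2**20 bytes. The constant and function names are illustrative.

```python
ARRAY_ELEMENTS = 180_000_000_000   # "Array size" from the output
BYTES_PER_WORD = 8                 # "8 bytes per DOUBLE PRECISION word"

# Number of arrays read or written per iteration by each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column from the output, in seconds.
MIN_TIME_S = {"Copy": 1.8008, "Scale": 1.9616, "Add": 2.5954, "Triad": 2.8946}


def total_memory_mb() -> float:
    """Memory for the three STREAM arrays, in units of 2**20 bytes."""
    return 3 * BYTES_PER_WORD * ARRAY_ELEMENTS / 2**20


def rate_mb_per_s(kernel: str) -> float:
    """Best rate for `kernel`, in units of 10**6 bytes per second."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * ARRAY_ELEMENTS
    return bytes_moved / MIN_TIME_S[kernel] / 1e6


print(f"Total memory: {total_memory_mb():.1f} MB")
for kernel in ARRAYS_TOUCHED:
    print(f"{kernel}: {rate_mb_per_s(kernel):.0f} MB/s")
```

The recomputed rates differ from the printed ones only in the last digits, because the printed minimum times are rounded to four decimals. In round numbers the system sustains about 1,5 to 1,7 TBytes/s across the four kernels.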

Page 18: Numascale Product IBM

Memory Allocation and Initialization

[Chart: Time to Allocate and Initialize 15GB Array in Parallel. Series: ScaleMP, Numascale, Ratio. Left Y axis: Seconds (0 to 500). Right Y axis: Ratio (0 to 500). X axis: Number of Processes (ticks 1, 8, 16, 104).]

Data labels, grouped by process count (the printed ratios match ScaleMP time divided by Numascale time):

1 process: ScaleMP 211,1 s; Numascale 23,40 s; Ratio 9
8 processes: ScaleMP 423,4 s; Numascale 3,77 s; Ratio 112
16 processes: ScaleMP 449,0 s; Numascale 2,02 s; Ratio 222
104 processes: ScaleMP 432,5 s; Numascale 0,96 s; Ratio 450

Page 19: Numascale Product IBM

MPI Latency (Pallas)

[Chart: NumaConnect MPI Latency. Y axis: Microseconds (0 to 7). X axis: Message Size (Bytes), from 0 to 1k.]

Page 20: Numascale Product IBM

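MPI Barrier

[Chart: MPI_Barrier on 32 Nodes. Series: NumaConnect-BTL, Standard-SM-BTL, Ratio. Left Y axis: Microseconds (0 to 250). Right Y axis: Improvement Ratio (0,0 to 25,0). X axis: Number of Processes.]

2 processes: NumaConnect-BTL 1,78 µs; Standard-SM-BTL 8,09 µs; Ratio 4,5
4 processes: NumaConnect-BTL 4,06 µs; Standard-SM-BTL 52,67 µs; Ratio 13,0
8 processes: NumaConnect-BTL 7,43 µs; Standard-SM-BTL 102,66 µs; Ratio 13,8
16 processes: NumaConnect-BTL 8,52 µs; Standard-SM-BTL 155,84 µs; Ratio 18,3
32 processes: NumaConnect-BTL 10,72 µs; Standard-SM-BTL 213,38 µs; Ratio 19,9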
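The improvement ratio plotted on the MPI_Barrier chart is the Standard-SM-BTL time divided by the NumaConnect-BTL time. The sketch below recomputes it from the data labels on the slide; the dictionary and function names are illustrative.

```python
# Barrier times in microseconds, keyed by number of processes,
# as (NumaConnect-BTL, Standard-SM-BTL) pairs from the chart labels.
BARRIER_US = {
    2: (1.78, 8.09),
    4: (4.06, 52.67),
    8: (7.43, 102.66),
    16: (8.52, 155.84),
    32: (10.72, 213.38),
}


def improvement_ratio(processes: int) -> float:
    """Standard-SM-BTL barrier time divided by NumaConnect-BTL time."""
    numaconnect_us, standard_us = BARRIER_US[processes]
    return standard_us / numaconnect_us


for n in BARRIER_US:
    print(f"{n:>2} processes: ratio {improvement_ratio(n):.1f}")
```

The recomputed ratios agree with the ratio labels printed on the chart (4,5 at 2 processes rising to 19,9 at 32 processes), which confirms how the chart's three series relate to each other.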

Page 21: Numascale Product IBM

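MPI Barrier (Pallas)

[Chart: MPI_Barrier, 2 - 256 processes. Series: NumaConnect-BTL, Standard-SM-BTL, Ratio. Left Y axis: Microseconds (0,0 to 1200,0). Right Y axis: Improvement Ratio (0,0 to 25,0). X axis: Number of MPI Processes.]

2 processes: NumaConnect-BTL 1,3 µs; Standard-SM-BTL 0,9 µs
4 processes: NumaConnect-BTL 3,6 µs; Standard-SM-BTL 2,7 µs
8 processes: NumaConnect-BTL 6,2 µs; Standard-SM-BTL 7,2 µs
16 processes: NumaConnect-BTL 10,2 µs; Standard-SM-BTL 132,5 µs
32 processes: NumaConnect-BTL 14,3 µs; Standard-SM-BTL 330,7 µs
64 processes: NumaConnect-BTL 14,3 µs; Standard-SM-BTL 330,7 µs
128 processes: NumaConnect-BTL 22,1 µs; Standard-SM-BTL 510,9 µs
256 processes: NumaConnect-BTL 48,5 µs; Standard-SM-BTL 969,3 µs

Ratio labels as printed: 0,7; 0,7; 1,2; 13,1; 13,1; 23,1; 20,0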

Page 22: Numascale Product IBM

Scaling Applications with NumaConnect

Atle Vesterkjær, Numascale

[email protected]

September 2013

Page 23: Numascale Product IBM

RWTH TrajSearch code

• OpenMP programs can be made numa-aware by decomposing memory and making sure that all memory is “local” to the CPU running the process that is using the memory. Affinity settings can be used to distribute jobs in a numa-aware way. The graph shows that the application achieves the highest speedup on NumaConnect, even though it was originally adapted to ScaleMP.

[Chart. Left Y axis: Runtime in Seconds (0 to 140000). Right Y axis: Speedup (0 to 300). X axis: Number of Threads (8, 16, 32, 64, 128, 256, 512). Runtime series: SGI Altix UV: Nehalem EX; SCALEMP: Nehalem EX; Bull BCS: Nehalem EX; SCALEMP: SandyBridge EP; NUMASCALE. Speedup series: one for each of the same five systems.]

Page 24: Numascale Product IBM

Stream on 16 Nodes

Applications that are scalable by design achieve great results on a NumaConnect Shared Memory System.

[Chart: Streams scaling (16 nodes - 32 sockets - 192 cores). Series: Triad – NumaConnect; Triad – 1041A-T2F; Triad – Linear socket; Triad – Linear node. Y axis: MByte/sec (0 to 200 000). X axis: Number of CPU Sockets (1 to 32).]

Page 25: Numascale Product IBM

Porting MPI applications to NC-OpenMP

• OpenMP programs are faster and have less overhead than MPI programs.
• Both numa-aware programs and MPI programs care about memory locality. It is therefore often possible to take an MPI program and convert it to a numa-aware OpenMP program that will have a shorter runtime.
• In order to demonstrate this, the NAS Parallel Benchmark SP has been used.

Page 26: Numascale Product IBM

NPB SP D-class Mop/s total

• The SP benchmark exists in both an OpenMP version written for standalone servers and an MPI version written for distributed servers.
• As the computational job is the same we find the code written for distributed servers (MPI) easier to convert to a NumaConnect optimized version since we also need to consider memory locality.
• The graph shows the scalability for the SP benchmark code when converted to a NumaConnect optimized version.

[Chart: NPB-SP NC-OPENMP, 256 cores, 8 NumaConnect Nodes, CLASS=D. Y axis: Mop/s Total (0 to 70000). X axis: Number of threads.]

16 threads: 20700,87 Mop/s
36 threads: 35600,82 Mop/s
64 threads: 54029,36 Mop/s
121 threads: 62988,6 Mop/s

Page 27: Numascale Product IBM

NPB SP D-class runtime

• The overhead introduced by MPI is not needed when we are running on a Shared Memory System

[Chart: NPB-SP NC-OPENMP D-CLASS, 8 NumaConnect Nodes: Runtime [sec]. Y axis: Runtime [sec] (0 to 1600). X axis: Number of threads.]

16 threads: 1426,8 sec
36 threads: 829,64 sec
64 threads: 546,67 sec
121 threads: 468,91 sec
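The runtime labels on this chart (1426,8 s at 16 threads down to 468,91 s at 121 threads) can be turned into speedup and parallel efficiency relative to the 16-thread run. The sketch below uses the standard definitions; the choice of the 16-thread run as baseline and the function names are assumptions made for illustration, not something the slide states.

```python
# Runtime in seconds by number of threads, from the chart's data labels.
RUNTIME_S = {16: 1426.8, 36: 829.64, 64: 546.67, 121: 468.91}

BASELINE_THREADS = 16  # assumption: smallest measured run as reference


def speedup(threads: int) -> float:
    """Runtime of the baseline run divided by runtime at `threads`."""
    return RUNTIME_S[BASELINE_THREADS] / RUNTIME_S[threads]


def efficiency(threads: int) -> float:
    """Speedup divided by the increase in thread count (1.0 = ideal)."""
    return speedup(threads) / (threads / BASELINE_THREADS)


for n in RUNTIME_S:
    print(f"{n:>3} threads: speedup {speedup(n):.2f}, "
          f"efficiency {efficiency(n):.2f}")
```

By this measure the 121-thread run is about 3,0 times faster than the 16-thread run, for an increase in thread count of about 7,6 times.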

Page 28: Numascale Product IBM

NPB SP D-class runtime affinity effect

• The figure on the left illustrates the great scaling we get from the NPB-SP on the NumaConnect system
• The figure on the right illustrates that using affinity to distribute the job evenly between the different NumaConnect Nodes (and cores in the system) leads to shorter runtime
• These effects will be analyzed more in the next slides

[Left chart: NPB-SP NC-OPENMP D-CLASS, Runtime (sec) where the threads are evenly distributed on all NumaConnect Nodes. X axis: Number of processes.]

16 processes: 1426,8 sec
36 processes: 829,64 sec
64 processes: 546,67 sec
121 processes: 468,91 sec

[Right chart: NPB-SP NC-OPENMP D-CLASS, Runtime (sec) bound to only the first <n> cores. X axis: Number of processes.]

16 processes: 5701,24 sec
36 processes: 2496,88 sec
64 processes: 1512,98 sec
121 processes: 967,72 sec

Page 29: Numascale Product IBM

NPB SP D-class runtime

• The runtime using 121 cores is cut to half when running on 8 NumaConnect Nodes, compared to 4 NumaConnect Nodes
• The number of cores per FPU is one when running on 8 NumaConnect Nodes and two when running on 4 NumaConnect Nodes.
• The picture on the right shows that the AMD Opteron™ 6300 series processor is organized as a Multi-Chip Module (two CPU dies in the package). Each die has 8 cores, but 4 FPUs

[Chart: NPB-SP NC-OPENMP D-CLASS, 8 NumaConnect Nodes: Runtime [sec] vs Affinity, 121 cores using SP CLASS D. Y axis: Runtime (sec). X axis: Affinity.]

Affinity 0-127: 998,86 sec
Affinity 0-127:2: 475,28 sec

Page 30: Numascale Product IBM

NPB SP D-class runtime

• The runtime using 64 cores is cut to half when running on 8 NumaConnect Nodes, compared to 4 NumaConnect Nodes
• In this test the number of FPUs is constant
• The scaling is good when using more nodes
• The AMD Opteron™ 6300 series memory controller interface allows external connection of up to two memory channels per die. This memory interface is shared among all the CPU core pairs on each die.

[Chart: Runtime (sec) decreasing when moving from 4 to 8 NumaConnect Nodes using affinity settings and SP CLASS D. Y axis: Runtime (sec). X axis: Affinity settings.]

Affinity 0-127:2: 800,69 sec
Affinity 0-255:4: 548,13 sec
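The affinity settings on these slides (for example 0-127:2 and 0-255:4) appear to use a start-end:stride notation, the same form that CPU-list options of tools such as `taskset` accept. The slides do not say which tool was used, so that reading is an assumption. The hypothetical helper below expands such a string into explicit CPU numbers, which makes it easy to see how many cores a setting selects and how widely they are spread.

```python
from typing import List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a CPU list such as "0-127:2" or "0,4,8-11" into CPU numbers.

    Each comma-separated part is either a single CPU ("5"), a range
    ("8-11") or a range with a stride ("0-127:2").
    (Hypothetical helper written for this illustration.)
    """
    cpus: List[int] = []
    for part in spec.split(","):
        cpu_range, _, stride = part.partition(":")
        start, _, end = cpu_range.partition("-")
        first = int(start)
        last = int(end) if end else first
        step = int(stride) if stride else 1
        # The end of the range is inclusive in this notation.
        cpus.extend(range(first, last + 1, step))
    return cpus


for setting in ("0-127:2", "0-255:4"):
    selected = parse_cpu_list(setting)
    print(f"{setting}: {len(selected)} CPUs, highest CPU {selected[-1]}")
```

Both settings select 64 CPUs, but 0-255:4 spreads them over twice as many core numbers as 0-127:2, which fits the slide's comparison of the same core count on 4 versus 8 nodes. On Linux the resulting list could be applied to the current process with `os.sched_setaffinity(0, set(selected))`.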

Page 31: Numascale Product IBM

NPB SP D-class runtime MPI vs OpenMP

• The runtime when using NPB SP D-class is the same for the MPI version as when using NC-OpenMP when the number of cores is 121 or less.
• The MPI library overhead in the communication part of the runtime is not significant for these small sizes.

[Chart: NPB-SP NC-OPENMP D-CLASS, Runtime (sec) where the threads are evenly distributed on all NumaConnect Nodes. Series: OpenMP, MPI. Y axis: Runtime (sec). X axis: Number of processes.]

16 processes: 1426,8 sec
36 processes: 829,64 sec
64 processes: 546,67 sec
121 processes: 468,91 sec

Page 32: Numascale Product IBM

NPB SP D-class runtime MPI vs OpenMP

• The runtime when using NPB SP D-class is higher for MPI compared to the NC-OpenMP version when the number of cores is 144 or higher.
• The MPI library overhead in the communication part of the runtime now plays a larger role.

[Chart: NPB-SP NC-OPENMP D-CLASS, Runtime (sec) bound to only the first <n> cores. Series: OpenMP, MPI. Y axis: Runtime (sec) (0 to 1200). X axis: Number of processes (144, 169, 196, 225, 256). Data labels as printed: 701,58; 953,39; 729,24; 708,68; 599,04; 554,57; 478,14; 415,93; 534,34.]

Page 33: Numascale Product IBM

NPB SP E-class runtime

• The SP benchmark E-class scales perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2).
• General statement: Larger problems scale better.
• E-class problems are more representative for large shared memory systems and clusters.

[Chart: Runtime [sec]: NPB-SP NC-OPENMP E-CLASS, 8 NumaConnect Nodes. Y axis: Runtime (0 to 18000). X axis: Number of processes.]

64 processes: 15242,11 sec
121 processes: 7246,13 sec

Page 34: Numascale Product IBM

NPB SP E-class runtime

• The SP benchmark E-class scales well from 16 to 121 on a 32 node system with a total of 384 cores
• NB: The test with 384 cores has been run on an old system and is presented to show scaling and not absolute values.

[Chart: Runtime: NPB-SP NC-OPENMP E-CLASS, 32 NumaConnect Nodes: "Time in seconds". Y axis: 0 to 60000. X axis: affinity setting.]

Affinity 0-383:24: 53922,05 sec
Affinity 0-383:10: 32481,25 sec
Affinity 0-383:6: 19740,37 sec
Affinity 0-363:3: 11247,03 sec

Page 35: Numascale Product IBM

Big Data with Shared Memory

“Any application requiring a large memory footprint can benefit from a shared memory computing environment.”

William W. Thigpen, Chief, Engineering Branch, NASA Advanced Supercomputing (NAS) Division