TRANSCRIPT
LLNL-PRES-695482 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Optimizing Network Usage on Sequoia and Sierra
Bronis R. de Supinski, Chief Technology Officer, Livermore Computing
June 23, 2016
[Roadmap figure: Advanced Technology Systems (ATS) and Commodity Technology Systems (CTS), fiscal years '13–'23, with procure & deploy, use, and retire phases and system delivery dates. ATS: Sequoia (LLNL), ATS 1 – Trinity (LANL/SNL), ATS 2 – Sierra (LLNL), ATS 3 – Crossroads (LANL/SNL), ATS 4 (LLNL), ATS 5 (LANL/SNL). CTS: Tri-lab Linux Capacity Cluster II (TLCC II), CTS 1, CTS 2.]
My focus is NNSA ASC ATS platforms at LLNL
Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL
Sequoia provides previously unprecedented levels of capability and concurrency
§ Sequoia statistics
— 20 petaFLOP/s peak
— 17 petaFLOP/s LINPACK
— Memory: 1.5 PB, 4 PB/s bandwidth
— 98,304 nodes
— 1,572,864 cores
— 3 PB/s link bandwidth
— 60 TB/s bi-section bandwidth
— 0.5–1.0 TB/s Lustre bandwidth
— 50 PB disk
§ 9.6 MW power, 4,000 ft²
§ Third-generation IBM Blue Gene
The BG/Q compute chip integrates processors, memory and networking logic into one chip
§ 16 user + 1 OS + 1 redundant cores
— 4-way multi-threaded, 1.6 GHz, 64-bit
— 16 kB / 16 kB L1 I/D caches
— Quad FPUs (4-wide DP SIMD)
— Peak: 204.8 GFLOPS @ 55 W
§ Shared 32 MB eDRAM L2 cache
— Multiversioned cache
§ Dual memory controller
— 16 GB DDR3 memory (1.33 Gb/s)
— 2 × 16-byte-wide interface (+ECC)
§ Chip-to-chip networking
— 5D torus topology + external link
— Each link: 2 GB/s send + 2 GB/s receive
— DMA, put/get, collective operations
Traditional Blue Gene overall system integration results in a small footprint
1. Chip: 16 cores
2. Module: single chip
3. Compute card: one single-chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2, or 4 I/O drawers
7. System: 20 PF/s
Sequoia and Blue Gene/Q reflect lessons from previous Blue Gene generations
§ Communication locality through optimized MPI process placement is critical on 3D torus networks
— Use of a 5D torus reduces the network diameter and reduces the importance of MPI process placement
§ Support for hardware-optimized collectives should apply to subcommunicators as well as global operations (see the sketch below)
— Increased network communication contexts allow more applications to exploit hardware support for collective operations
§ Hardware support for network partitioning minimizes jitter
§ Multiple networks provide many benefits but also increase costs
These examples are network-centric; others reflect lessons throughout the system hardware and software architecture
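To make the subcommunicator point concrete, here is a minimal MPI sketch (not taken from any LLNL code; the split criterion and values are illustrative) of a reduction on a subcommunicator, the kind of operation that hardware collective support should cover in addition to MPI_COMM_WORLD collectives.

```c
/* Minimal sketch (illustrative only): an allreduce on a subcommunicator,
 * which hardware collective support should accelerate as well as
 * MPI_COMM_WORLD collectives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split ranks into two illustrative subcommunicators. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub);

    double local = (double)rank, sum = 0.0;
    /* Allreduce over the subcommunicator, not MPI_COMM_WORLD. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, sub);

    if (rank < 2)
        printf("subcomm %d: sum = %g\n", rank % 2, sum);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```

On Blue Gene/Q, the increased number of network communication contexts is what allows this kind of subcommunicator collective to use the same hardware support as global operations.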
LLNL, with IBM, has developed Cardioid, a state-of-the-art cardiac electrophysiology simulation
§ Mechanisms that lead to arrhythmia are not well understood; contraction of the heart is controlled by the electrical behavior of heart cells
§ Mathematical models reproduce the component ionic currents of the action potential (a structural sketch follows below)
— System of non-linear ODEs
— TT06 includes 6 species and 14 gates
— Negative currents depolarize (activate)
— Positive currents repolarize (return to rest)
§ Ability to run, at high resolution, thousands instead of tens of heartbeats enables detailed study of drug effects
2012 Gordon Bell finalist typifies the effort required to exploit supercomputer capability fully
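A minimal structural sketch of the per-cell reaction update described above. This is not the TT06 formulation: the placeholder ionic model, field names, and explicit-Euler update are assumptions used only to show how gates, species, and the membrane potential are advanced each time step.

```c
/* Illustrative sketch only, not the actual TT06 formulation: each tissue
 * cell carries gating variables and ionic concentrations advanced by a
 * small time step.  Negative total current depolarizes the membrane;
 * positive current repolarizes it. */
#include <stddef.h>

enum { N_GATES = 14, N_SPECIES = 6 };   /* TT06 sizes quoted above */

typedef struct {
    double v;                  /* membrane potential */
    double gate[N_GATES];      /* gating variables */
    double conc[N_SPECIES];    /* ionic concentrations (species) */
} Cell;

/* Placeholder reaction model (NOT TT06): gates relax toward rest and the
 * current is a simple leak, just to keep the sketch self-contained. */
static double reaction_rhs(const Cell *c, double dgate[], double dconc[])
{
    for (int g = 0; g < N_GATES; ++g)   dgate[g] = -c->gate[g];
    for (int s = 0; s < N_SPECIES; ++s) dconc[s] = 0.0;
    return 0.1 * (c->v + 85.0);         /* crude leak toward rest */
}

/* One explicit-Euler reaction step over all cells. */
void advance_cells(Cell *cells, size_t n, double dt)
{
    for (size_t i = 0; i < n; ++i) {
        double dgate[N_GATES], dconc[N_SPECIES];
        double i_ion = reaction_rhs(&cells[i], dgate, dconc);

        cells[i].v -= dt * i_ion;
        for (int g = 0; g < N_GATES; ++g)   cells[i].gate[g] += dt * dgate[g];
        for (int s = 0; s < N_SPECIES; ++s) cells[i].conc[s] += dt * dconc[s];
    }
}
```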
Cardioid achieves outstanding performance that enables nearly real-time heartbeat simulation
§ Measured peak performance: 11.84 PFlop/s (58.8% of peak)
— 0.05 mm resolution heart (3B tissue cells)
— Ten million iterations, dt = 4 μsec
— Performance of the full simulation loop, including I/O, measured with HPM
§ Extreme strong scaling limit:
— 0.10 mm: 236 tissue cells/core
— 0.13 mm: 114 tissue cells/core
Optimized Cardioid is 50x faster than "naive" code
[Chart: 60 beats in 67.2 seconds vs. 60 beats in 197.4 seconds]
Cardioid represents a major advance in the state of the art of human heart simulation
[Chart: 0.1 mm heart (370M tissue cells), one minute of wall time — previous state of the art vs. Cardioid; 18.2 seconds of simulation time]
Cardioid achieves outstanding performance through detailed tuning to Sequoia's architecture
§ Partitioned cells over processes with an upper bound on time (not on equal time)
§ Assigned diffusion work and reaction work to different cores
§ Transformed the potassium equation to remove serialization
§ Expensive 1D functions in the reaction model expressed with rational approximations
§ Single-precision weights to reduce the diffusion stencil's use of L2 bandwidth
§ Hand unrolled to SIMDize loops over cells
§ Sorted by cell type to improve SIMDization (see the sketch below)
§ Sub-sorting of cells to increase sequential/vector loading and storing of data
§ log function from libm replaced with custom inlined functions
§ On-the-fly assembly of code to optimize data movement at runtime
§ Memory layout tuned to improve cache performance
§ Use of vector intrinsics and custom divides
§ Moved integer operations to floating point units to exploit SIMD units
§ No explicit network barrier
§ L2 on-node thread barriers
§ Use low-level SPI for halo data exchange between tasks (DMA)
§ Application-managed threads
§ SIMDized diffusion stencil implementation
§ Zero-flux boundary conditions approximated by a method with no global solve
§ High-performance I/O is aware of BG/Q network topology
§ Low-overhead in-situ performance monitors
§ Assignment of threads to diffusion/reaction dependent on domain characteristics
§ Co-scheduled threads for improved dual issue
§ Multiple diffusion implementations to obtain optimal performance for various domains
§ Remote & local copies separated to improve bandwidth utilization
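One of the listed optimizations, sorting cells by type to improve SIMDization, can be sketched as follows. This is not Cardioid's actual data layout or code; the record fields and the per-type update are placeholders that only illustrate grouping identical cell types into contiguous, vectorizable runs.

```c
/* Illustrative sketch: sort cells by cell type so each inner loop runs
 * over a contiguous, branch-free run of identical cells that the
 * compiler can vectorize.  Fields and the update are placeholders. */
#include <stdlib.h>

typedef struct {
    int    type;     /* cell type (e.g., endo/mid/epi in a real model) */
    double state;    /* stand-in for the per-cell state vector */
} CellRec;

static int by_type(const void *a, const void *b)
{
    const CellRec *x = a, *y = b;
    return (x->type > y->type) - (x->type < y->type);
}

void process_cells(CellRec *cells, size_t n)
{
    /* Group identical cell types together ... */
    qsort(cells, n, sizeof *cells, by_type);

    /* ... then process each run with a simple, SIMD-friendly loop. */
    size_t start = 0;
    while (start < n) {
        size_t end = start;
        while (end < n && cells[end].type == cells[start].type)
            ++end;
        for (size_t i = start; i < end; ++i)
            cells[i].state *= 1.0001;   /* placeholder per-type update */
        start = end;
    }
}
```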
At the largest scales, small software overheads can significantly impact performance
Direct use of message units and L2 atomic operations minimizes overhead (a sketch of the fork/join pattern being avoided follows the chart)
[Chart: time per iteration (μsec, 0–300) for L2 atomic barrier, OMP fork/join, SPI halo exchange, and MPI halo exchange, against the 60 μs per iteration target; 370 million cells, 1.6 million cores, 1600 flops/cell]
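The fork/join cost in the chart can be illustrated with a small OpenMP sketch. Cardioid itself used application-managed threads and BG/Q L2 atomic barriers rather than OpenMP; the example below is only an assumed, generic illustration of why keeping a persistent thread team avoids per-iteration team creation and teardown.

```c
/* Illustrative sketch: fork/join of a thread team every iteration versus
 * one long-lived parallel region that work-shares each iteration.  This
 * is not Cardioid's threading code; it only shows the kind of
 * per-iteration software overhead the bar chart compares. */
#include <omp.h>

void run_iterations_forkjoin(double *x, int n, int iters)
{
    for (int it = 0; it < iters; ++it) {
        #pragma omp parallel for      /* team forked and joined every step */
        for (int i = 0; i < n; ++i)
            x[i] += 1.0;
    }
}

void run_iterations_persistent(double *x, int n, int iters)
{
    #pragma omp parallel               /* team created once */
    {
        for (int it = 0; it < iters; ++it) {
            #pragma omp for            /* work-share inside the region */
            for (int i = 0; i < n; ++i)
                x[i] += 1.0;
            /* The implicit barrier at the end of the omp for synchronizes
             * the iteration; no team teardown occurs per iteration. */
        }
    }
}
```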
The Sierra system that will replace Sequoia features a GPU-accelerated architecture
Components:
— Mellanox® interconnect: dual-rail EDR InfiniBand®
— IBM POWER® with NVLink™
— NVIDIA® Volta™ with HBM and NVLink
— Compute node: POWER architecture processor, NVIDIA Volta, NVMe-compatible PCIe 800 GB SSD, > 512 GB DDR4 + HBM, coherent shared memory
— Compute rack: standard 19", warm water cooling
— Compute system: 2.1–2.7 PB memory, 120–150 PFLOPS, 10 MW
— GPFS™ file system: 120 PB usable storage, 1.2/1.0 TB/s R/W bandwidth
Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system's usability
Projections included code changes that showed a tractable annotation-based approach (i.e., OpenMP) will be competitive
Figure 5: CORAL benchmark projections show GPU-accelerated system is expected to deliver substantially higher performance at the system level compared to CPU-only configuration.
The demonstration of compelling, scalable performance at the system level across a wide range of applications proved to be one of the key factors in the U.S. DoE’s decision to build Summit and Sierra on the GPU-accelerated OpenPOWER platform.
Conclusion Summit and Sierra are historic milestones in HPC’s efforts to reach exascale computing. With these new pre-exascale systems, the U.S. DoE maintains its leadership position, trailblazing the next generation of supercomputers while allowing the nation to stay ahead in scientific discoveries and economic competitiveness.
The future of large-scale systems will inevitably be accelerated with throughput-oriented processors. Latency-optimized CPU-based systems have long hit a power wall that no longer delivers year-on-year performance increase. So while the question of “accelerator or not” is no longer in debate, other questions remain, such as CPU architecture, accelerator architecture, inter-node interconnect, intra-node interconnect, and heterogeneous versus self-hosted computing models.
With those questions in mind, the technological building blocks of these systems were carefully chosen with the focused goal of eventually deploying exascale supercomputers. The key building blocks that allow Summit and Sierra to meet this goal are:
• The heterogeneous computing model
• NVIDIA NVLink high-speed interconnect
• NVIDIA GPU accelerator platform
• IBM OpenPOWER platform
With the unveiling of the Summit and Sierra supercomputers, Oak Ridge National Laboratory and Lawrence Livermore National Laboratory have spoken loud and clear about the technologies that they believe will best carry the industry to exascale.
[Chart: CORAL application performance projections — relative performance of CPU-only vs. CPU + GPU configurations for scalable science benchmarks (QBOX, LSMS, HACC, NEKbone) and throughput benchmarks (CAM-SE, UMT2013, AMG2013, MCB)]
Sierra NRE will provide significant benefit to the final system
§ Center of Excellence
§ Motherboard design
§ Water-cooled compute nodes
§ HW resilience studies/investigation (NVIDIA)
§ Switch-based collectives
§ Hardware tag matching
§ GPUDirect and NVMe
§ Open source compiler infrastructure
§ System diagnostics
§ System scheduling
§ Burst buffer
§ GPFS performance and scalability
§ Cluster management
§ Open source tools
Switch-based support for collectives further improves critical functionality
§ Reliable, scalable, general-purpose primitive, applicable to multiple use cases
— In-network tree-based aggregation mechanism
— Large number of groups
— Multiple simultaneous outstanding operations
§ High-performance collective offload
— Barrier, Reduce, All-Reduce
— Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
— Can overlap communication and computation (see the sketch below)
§ Flexible mechanism reflects lessons learned from Blue Gene systems
[Diagram: SHArP tree with a SHArP tree root, SHArP tree aggregation nodes, and SHArP tree endnodes (processes running on HCAs)]
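A minimal sketch of the overlap pattern that in-network collective offload supports. Nothing below is SHArP-specific, since the offload is transparent at the MPI level; the function and values are illustrative.

```c
/* Sketch of overlapping a collective with computation: a nonblocking
 * allreduce is posted, independent work proceeds, and the result is
 * collected afterwards.  With in-network aggregation, the reduction can
 * progress in the fabric rather than on the host CPUs. */
#include <mpi.h>

double overlap_allreduce(MPI_Comm comm, double local, const double *indep, int n)
{
    double global = 0.0;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* Independent computation while the reduction progresses. */
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += indep[i] * indep[i];

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global + acc;
}
```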
Initial results demonstrate that SHArP collectives improve performance significantly
§ OSU Allreduce, 1 PPN, 128 nodes
As seen with Cardioid, MPI software overhead critically limits realized network performance
§ Realistic applications, particularly in C++, often use small messages
— Realized message rate is often the key performance indicator
— MPI provides little ability to coalesce these messages
§ MPI matching rules heavily impact realized message rate (see the sketch below)
— Message envelopes must match
• Wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) increase envelope matching complexity and, thus, cost
— Posted receives must be matched in order against the in-order posted sends
Hardware message matching support can alleviate software overhead
[Diagram: receiver entries (Tag=A, Communicator=B, Source=C) at times X and X+D matched against sender entries (Tag=A, Communicator=B, Destination=C) at times Y and Y+D']
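A minimal MPI sketch of the matching rules above: envelopes match on (source, tag, communicator), wildcards widen the search, and posted receives must behave as if searched in posted order. Tags, counts, and buffer names are illustrative.

```c
/* Sketch of MPI message matching: one fully specified receive and one
 * wildcard receive.  Every arriving message must be checked against the
 * posted-receive queue in posting order, and wildcard entries make that
 * search more expensive. */
#include <mpi.h>

void post_receives(MPI_Comm comm, double *a, double *b, int n)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* Fully specified envelope: source 0, tag 7 — cheap to match. */
    MPI_Irecv(a, n, MPI_DOUBLE, 0, 7, comm, &reqs[0]);

    /* Wildcard envelope: any source, any tag — every incoming message
     * must be compared against this entry, raising matching cost. */
    MPI_Irecv(b, n, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &reqs[1]);

    /* MPI must behave as if the two posted receives were searched in the
     * order they were posted for each arriving message. */
    MPI_Waitall(2, reqs, stats);
}
```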
MPI tag matching operations must appear to be performed atomically
§ Complexity/serialization of message matching limits the processing that can be performed on GPUs
Mellanox hardware will support efficient MPI tag matching
§ Offloaded to the ConnectX-5 HCA
— Enables more of the hardware bandwidth to translate to realized message rate
— Full MPI tag matching as compute progresses
— Rendezvous offload: large data delivery as compute progresses (see the sketch below)
§ Control can be passed between hardware and software
§ Verbs tag matching support is being upstreamed
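A minimal sketch of the rendezvous-style overlap that hardware tag matching and rendezvous offload are intended to help. The offload itself is invisible at the MPI level; the peer rank, message size, and work loop below are illustrative assumptions.

```c
/* Sketch: a large (rendezvous-protocol-sized) nonblocking exchange posted
 * before independent computation.  With hardware tag matching and
 * rendezvous offload, the bulk data can be delivered while the compute
 * loop runs, without software progress calls. */
#include <mpi.h>

void exchange_halo(MPI_Comm comm, int peer,
                   double *sendbuf, double *recvbuf, int big_n,
                   double *work, int wn)
{
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, big_n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, big_n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    /* Independent computation; the HCA can progress the large transfer. */
    for (int i = 0; i < wn; ++i)
        work[i] = 0.5 * (work[i] + work[wn - 1 - i]);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```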
Deploying multiple networks exacerbates network hardware costs, which are already too high in large-scale systems
§ Sierra uses a commodity cluster network solution
— Separate management (Ethernet) and user (IB) networks
— Single network for user traffic (point-to-point, collectives & file system traffic)
§ A single network for user traffic saves money but has other costs
— Jitter impact of other jobs' file system traffic can be severe
— Burst buffer strategy smooths file system bandwidth demand
• File system traffic of a job now competes with its MPI traffic
§ Different types of network traffic are not equally critical
— File system traffic "only" needs a guarantee of eventual completion
— Collectives often critically limit overall performance
— Other traffic classes also exist
Quality-of-Service (QoS) mechanisms are necessary to achieve a network solution that reduces network hardware costs while providing acceptable, consistent performance for all traffic classes
Preliminary results indicate that IB priority levels compensate for checkpoint traffic
Figure 21: Application (pF3D nearest-neighbor exchange in Z, 16 processes per node) completion time for the various benchmarked configurations, in the presence of checkpointing and in isolation.
Figure 22: Checkpointing rate in the absence and in the presence of the application (pF3D nearest-neighbor exchange in Z, 16 processes per node) for the various benchmarked configurations.
3.3 All-to-All Simulations on Fat Tree
The all-to-all communication phases of pF3D, due to the mapping we have chosen, interact with checkpointing traffic in a smaller portion of the network. They do, however, exhibit behavior similar to that presented in the previous section for the nearest-neighbor exchange, and the conclusions are the same. In the interest of brevity, we include only the summary of the results. The figures present the following.
For the all-to-all exchange in the X direction:
• Figure 23: pF3D completion time in the single-process-per-node case.
• Figure 24: checkpointing rate in the pF3D single-process-per-node case.
Sierra network hardware addresses lessons learned from previous LLNL systems
§ Flexibility of SHArP switch-based collectives will accelerate subcommunicator collectives and will allow jobs to share the network
§ HCA MPI tag matching will reduce software cost on the critical path
— Future systems should further accelerate message passing software
§ QoS mechanisms are essential with burst buffers or systems that share network resources across jobs
— Multiple networks might still provide the best solution in some cases
— Network partitioning could still be valuable on future systems
§ GPUDirect and NVMe reduce within-node messaging impact
§ High-capability nodes lead to a smaller network
— Reduces the importance of network partitioning
— LLNL CTS-1 with a 2-to-1 tapered fat tree still requires careful task mapping
Substantial research questions still remain for systems after Sierra