TRANSCRIPT
LLNL-PRES-695482 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Optimizing Network Usage on Sequoia and Sierra
Bronis R. de Supinski, Chief Technology Officer, Livermore Computing
June 23, 2016
[Roadmap figure: Advanced Technology Systems (ATS) and Commodity Technology Systems (CTS), fiscal years '13–'23, with procure & deploy, use, and retire phases and system delivery dates. ATS: Sequoia (LLNL), ATS 1 – Trinity (LANL/SNL), ATS 2 – Sierra (LLNL), ATS 3 – Crossroads (LANL/SNL), ATS 4 (LLNL), ATS 5 (LANL/SNL). CTS: Tri-lab Linux Capacity Cluster II (TLCC II), CTS 1, CTS 2.]
My focus is NNSA ASC ATS platforms at LLNL
Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL
Sequoia provides previously unprecedented levels of capability and concurrency
§ Sequoia statistics
— 20 petaFLOP/s peak
— 17 petaFLOP/s LINPACK
— Memory: 1.5 PB, 4 PB/s bandwidth
— 98,304 nodes
— 1,572,864 cores
— 3 PB/s link bandwidth
— 60 TB/s bi-section bandwidth
— 0.5–1.0 TB/s Lustre bandwidth
— 50 PB disk
§ 9.6 MW power, 4,000 ft²
§ Third-generation IBM Blue Gene
The BG/Q compute chip integrates processors, memory and networking logic into one chip
§ 16 user + 1 OS + 1 redundant cores
— 4-way multi-threaded, 1.6 GHz, 64-bit
— 16 kB / 16 kB L1 I/D caches
— Quad FPUs (4-wide DP SIMD)
— Peak: 204.8 GFLOPS @ 55 W
§ Shared 32 MB eDRAM L2 cache
— Multiversioned cache
§ Dual memory controller
— 16 GB DDR3 memory (1.33 Gb/s)
— 2 × 16-byte-wide interface (+ECC)
§ Chip-to-chip networking
— 5D torus topology + external link
— Each link: 2 GB/s send + 2 GB/s receive
— DMA, put/get, collective operations
Traditional Blue Gene overall system integration results in a small footprint
1. Chip: 16 cores
2. Module: single chip
3. Compute card: one single-chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2, or 4 I/O drawers
7. System: 20 PF/s
Sequoia and Blue Gene/Q reflect lessons from previous Blue Gene generations
§ Communication locality through optimized MPI process placement is critical on 3D torus networks
— Use of a 5D torus reduces the network diameter and reduces the importance of MPI process placement
§ Support for hardware-optimized collectives should apply to subcommunicators as well as global operations (see the sketch below)
— Increased network communication contexts allow more applications to exploit hardware support for collective operations
§ Hardware support for network partitioning minimizes jitter
§ Multiple networks provide many benefits but also increase costs
These examples are network-centric; others reflect lessons throughout the system hardware and software architecture
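To make the subcommunicator point concrete, here is a minimal MPI sketch (not taken from any LLNL code; the split criterion and values are illustrative) of a reduction on a subcommunicator, the kind of operation that hardware collective support should cover in addition to MPI_COMM_WORLD collectives.

```c
/* Minimal sketch (illustrative only): an allreduce on a subcommunicator,
 * which hardware collective support should accelerate as well as
 * MPI_COMM_WORLD collectives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split ranks into two illustrative subcommunicators. */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub);

    double local = (double)rank, sum = 0.0;
    /* Allreduce over the subcommunicator, not MPI_COMM_WORLD. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, sub);

    if (rank < 2)
        printf("subcomm %d: sum = %g\n", rank % 2, sum);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```

On Blue Gene/Q, the increased number of network communication contexts is what allows this kind of subcommunicator collective to use the same hardware support as global operations.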
LLNL, with IBM, has developed Cardioid, a state-of-the-art cardiac electrophysiology simulation
§ Mechanisms that lead to arrhythmia are not well understood; contraction of the heart is controlled by the electrical behavior of heart cells
§ Mathematical models reproduce the component ionic currents of the action potential (a structural sketch follows below)
— System of non-linear ODEs
— TT06 includes 6 species and 14 gates
— Negative currents depolarize (activate)
— Positive currents repolarize (return to rest)
§ Ability to run, at high resolution, thousands instead of tens of heartbeats enables detailed study of drug effects
2012 Gordon Bell finalist typifies the effort required to exploit supercomputer capability fully
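A minimal structural sketch of the per-cell reaction update described above. This is not the TT06 formulation: the placeholder ionic model, field names, and explicit-Euler update are assumptions used only to show how gates, species, and the membrane potential are advanced each time step.

```c
/* Illustrative sketch only, not the actual TT06 formulation: each tissue
 * cell carries gating variables and ionic concentrations advanced by a
 * small time step.  Negative total current depolarizes the membrane;
 * positive current repolarizes it. */
#include <stddef.h>

enum { N_GATES = 14, N_SPECIES = 6 };   /* TT06 sizes quoted above */

typedef struct {
    double v;                  /* membrane potential */
    double gate[N_GATES];      /* gating variables */
    double conc[N_SPECIES];    /* ionic concentrations (species) */
} Cell;

/* Placeholder reaction model (NOT TT06): gates relax toward rest and the
 * current is a simple leak, just to keep the sketch self-contained. */
static double reaction_rhs(const Cell *c, double dgate[], double dconc[])
{
    for (int g = 0; g < N_GATES; ++g)   dgate[g] = -c->gate[g];
    for (int s = 0; s < N_SPECIES; ++s) dconc[s] = 0.0;
    return 0.1 * (c->v + 85.0);         /* crude leak toward rest */
}

/* One explicit-Euler reaction step over all cells. */
void advance_cells(Cell *cells, size_t n, double dt)
{
    for (size_t i = 0; i < n; ++i) {
        double dgate[N_GATES], dconc[N_SPECIES];
        double i_ion = reaction_rhs(&cells[i], dgate, dconc);

        cells[i].v -= dt * i_ion;
        for (int g = 0; g < N_GATES; ++g)   cells[i].gate[g] += dt * dgate[g];
        for (int s = 0; s < N_SPECIES; ++s) cells[i].conc[s] += dt * dconc[s];
    }
}
```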
Cardioid achieves outstanding performance that enables nearly real-time heartbeat simulation
§ Measured peak performance: 11.84 PFlop/s (58.8% of peak)
— 0.05 mm resolution heart (3B tissue cells)
— Ten million iterations, dt = 4 μsec
— Performance of the full simulation loop, including I/O, measured with HPM
§ Extreme strong scaling limit:
— 0.10 mm: 236 tissue cells/core
— 0.13 mm: 114 tissue cells/core
Optimized Cardioid is 50x faster than "naive" code
[Chart: 60 beats in 67.2 seconds vs. 60 beats in 197.4 seconds]
Cardioid represents a major advance in the state of the art of human heart simulation
[Chart: 0.1 mm heart (370M tissue cells), one minute of wall time — previous state of the art vs. Cardioid; 18.2 seconds of simulation time]
Cardioid achieves outstanding performance through detailed tuning to Sequoia's architecture
§ Partitioned cells over processes with an upper bound on time (not on equal time)
§ Assigned diffusion work and reaction work to different cores
§ Transformed the potassium equation to remove serialization
§ Expensive 1D functions in the reaction model expressed with rational approximations
§ Single-precision weights to reduce the diffusion stencil's use of L2 bandwidth
§ Hand unrolled to SIMDize loops over cells
§ Sorted by cell type to improve SIMDization (see the sketch below)
§ Sub-sorting of cells to increase sequential/vector loading and storing of data
§ log function from libm replaced with custom inlined functions
§ On-the-fly assembly of code to optimize data movement at runtime
§ Memory layout tuned to improve cache performance
§ Use of vector intrinsics and custom divides
§ Moved integer operations to floating point units to exploit SIMD units
§ No explicit network barrier
§ L2 on-node thread barriers
§ Use low-level SPI for halo data exchange between tasks (DMA)
§ Application-managed threads
§ SIMDized diffusion stencil implementation
§ Zero-flux boundary conditions approximated by a method with no global solve
§ High-performance I/O is aware of BG/Q network topology
§ Low-overhead in-situ performance monitors
§ Assignment of threads to diffusion/reaction dependent on domain characteristics
§ Co-scheduled threads for improved dual issue
§ Multiple diffusion implementations to obtain optimal performance for various domains
§ Remote & local copies separated to improve bandwidth utilization
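One of the listed optimizations, sorting cells by type to improve SIMDization, can be sketched as follows. This is not Cardioid's actual data layout or code; the record fields and the per-type update are placeholders that only illustrate grouping identical cell types into contiguous, vectorizable runs.

```c
/* Illustrative sketch: sort cells by cell type so each inner loop runs
 * over a contiguous, branch-free run of identical cells that the
 * compiler can vectorize.  Fields and the update are placeholders. */
#include <stdlib.h>

typedef struct {
    int    type;     /* cell type (e.g., endo/mid/epi in a real model) */
    double state;    /* stand-in for the per-cell state vector */
} CellRec;

static int by_type(const void *a, const void *b)
{
    const CellRec *x = a, *y = b;
    return (x->type > y->type) - (x->type < y->type);
}

void process_cells(CellRec *cells, size_t n)
{
    /* Group identical cell types together ... */
    qsort(cells, n, sizeof *cells, by_type);

    /* ... then process each run with a simple, SIMD-friendly loop. */
    size_t start = 0;
    while (start < n) {
        size_t end = start;
        while (end < n && cells[end].type == cells[start].type)
            ++end;
        for (size_t i = start; i < end; ++i)
            cells[i].state *= 1.0001;   /* placeholder per-type update */
        start = end;
    }
}
```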
At the largest scales, small software overheads can significantly impact performance
Direct use of message units and L2 atomic operations minimizes overhead (a sketch of the fork/join pattern being avoided follows the chart)
[Chart: time per iteration (μsec, 0–300) for L2 atomic barrier, OMP fork/join, SPI halo exchange, and MPI halo exchange, against the 60 μs per iteration target; 370 million cells, 1.6 million cores, 1600 flops/cell]
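The fork/join cost in the chart can be illustrated with a small OpenMP sketch. Cardioid itself used application-managed threads and BG/Q L2 atomic barriers rather than OpenMP; the example below is only an assumed, generic illustration of why keeping a persistent thread team avoids per-iteration team creation and teardown.

```c
/* Illustrative sketch: fork/join of a thread team every iteration versus
 * one long-lived parallel region that work-shares each iteration.  This
 * is not Cardioid's threading code; it only shows the kind of
 * per-iteration software overhead the bar chart compares. */
#include <omp.h>

void run_iterations_forkjoin(double *x, int n, int iters)
{
    for (int it = 0; it < iters; ++it) {
        #pragma omp parallel for      /* team forked and joined every step */
        for (int i = 0; i < n; ++i)
            x[i] += 1.0;
    }
}

void run_iterations_persistent(double *x, int n, int iters)
{
    #pragma omp parallel               /* team created once */
    {
        for (int it = 0; it < iters; ++it) {
            #pragma omp for            /* work-share inside the region */
            for (int i = 0; i < n; ++i)
                x[i] += 1.0;
            /* The implicit barrier at the end of the omp for synchronizes
             * the iteration; no team teardown occurs per iteration. */
        }
    }
}
```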
The Sierra system that will replace Sequoia features a GPU-accelerated architecture
Components:
— Mellanox® interconnect: dual-rail EDR InfiniBand®
— IBM POWER® with NVLink™
— NVIDIA® Volta™ with HBM and NVLink
— Compute node: POWER architecture processor, NVIDIA Volta, NVMe-compatible PCIe 800 GB SSD, > 512 GB DDR4 + HBM, coherent shared memory
— Compute rack: standard 19", warm water cooling
— Compute system: 2.1–2.7 PB memory, 120–150 PFLOPS, 10 MW
— GPFS™ file system: 120 PB usable storage, 1.2/1.0 TB/s R/W bandwidth
Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system's usability
Projections included code changes that showed a tractable annotation-based approach (i.e., OpenMP) will be competitive
Figure 5: CORAL benchmark projections show GPU-accelerated system is expected to deliver substantially higher performance at the system level compared to CPU-only configuration.
The demonstration of compelling, scalable performance at the system level across a wide range of applications proved to be one of the key factors in the U.S. DoE’s decision to build Summit and Sierra on the GPU-accelerated OpenPOWER platform.
Conclusion Summit and Sierra are historic milestones in HPC’s efforts to reach exascale computing. With these new pre-exascale systems, the U.S. DoE maintains its leadership position, trailblazing the next generation of supercomputers while allowing the nation to stay ahead in scientific discoveries and economic competitiveness.
The future of large-scale systems will inevitably be accelerated with throughput-oriented processors. Latency-optimized CPU-based systems have long hit a power wall that no longer delivers year-on-year performance increase. So while the question of “accelerator or not” is no longer in debate, other questions remain, such as CPU architecture, accelerator architecture, inter-node interconnect, intra-node interconnect, and heterogeneous versus self-hosted computing models.
With those questions in mind, the technological building blocks of these systems were carefully chosen with the focused goal of eventually deploying exascale supercomputers. The key building blocks that allow Summit and Sierra to meet this goal are:
• The heterogeneous computing model
• NVIDIA NVLink high-speed interconnect
• NVIDIA GPU accelerator platform
• IBM OpenPOWER platform
With the unveiling of the Summit and Sierra supercomputers, Oak Ridge National Laboratory and Lawrence Livermore National Laboratory have spoken loud and clear about the technologies that they believe will best carry the industry to exascale.
[Chart: CORAL application performance projections — relative performance of CPU-only vs. CPU + GPU configurations for scalable science benchmarks (QBOX, LSMS, HACC, NEKbone) and throughput benchmarks (CAM-SE, UMT2013, AMG2013, MCB)]
Sierra NRE will provide significant benefit to the final system
§ Center of Excellence
§ Motherboard design
§ Water-cooled compute nodes
§ HW resilience studies/investigation (NVIDIA)
§ Switch-based collectives
§ Hardware tag matching
§ GPUDirect and NVMe
§ Open source compiler infrastructure
§ System diagnostics
§ System scheduling
§ Burst buffer
§ GPFS performance and scalability
§ Cluster management
§ Open source tools
Switch-based support for collectives further improves critical functionality
§ Reliable, scalable, general-purpose primitive, applicable to multiple use cases
— In-network tree-based aggregation mechanism
— Large number of groups
— Multiple simultaneous outstanding operations
§ High-performance collective offload
— Barrier, Reduce, All-Reduce
— Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
— Can overlap communication and computation (see the sketch below)
§ Flexible mechanism reflects lessons learned from Blue Gene systems
[Diagram: SHArP tree with a SHArP tree root, SHArP tree aggregation nodes, and SHArP tree endnodes (processes running on HCAs)]
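A minimal sketch of the overlap pattern that in-network collective offload supports. Nothing below is SHArP-specific, since the offload is transparent at the MPI level; the function and values are illustrative.

```c
/* Sketch of overlapping a collective with computation: a nonblocking
 * allreduce is posted, independent work proceeds, and the result is
 * collected afterwards.  With in-network aggregation, the reduction can
 * progress in the fabric rather than on the host CPUs. */
#include <mpi.h>

double overlap_allreduce(MPI_Comm comm, double local, const double *indep, int n)
{
    double global = 0.0;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* Independent computation while the reduction progresses. */
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += indep[i] * indep[i];

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global + acc;
}
```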
Initial results demonstrate that SHArP collectives improve performance significantly
§ OSU Allreduce, 1 PPN, 128 nodes
As seen with Cardioid, MPI software overhead critically limits realized network performance
§ Realistic applications, particularly in C++, often use small messages
— Realized message rate is often the key performance indicator
— MPI provides little ability to coalesce these messages
§ MPI matching rules heavily impact realized message rate (see the sketch below)
— Message envelopes must match
• Wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) increase envelope matching complexity and, thus, cost
— Posted receives must be matched in order against the in-order posted sends
Hardware message matching support can alleviate software overhead
[Diagram: receiver entries (Tag=A, Communicator=B, Source=C) at times X and X+D matched against sender entries (Tag=A, Communicator=B, Destination=C) at times Y and Y+D']
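A minimal MPI sketch of the matching rules above: envelopes match on (source, tag, communicator), wildcards widen the search, and posted receives must behave as if searched in posted order. Tags, counts, and buffer names are illustrative.

```c
/* Sketch of MPI message matching: one fully specified receive and one
 * wildcard receive.  Every arriving message must be checked against the
 * posted-receive queue in posting order, and wildcard entries make that
 * search more expensive. */
#include <mpi.h>

void post_receives(MPI_Comm comm, double *a, double *b, int n)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* Fully specified envelope: source 0, tag 7 — cheap to match. */
    MPI_Irecv(a, n, MPI_DOUBLE, 0, 7, comm, &reqs[0]);

    /* Wildcard envelope: any source, any tag — every incoming message
     * must be compared against this entry, raising matching cost. */
    MPI_Irecv(b, n, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &reqs[1]);

    /* MPI must behave as if the two posted receives were searched in the
     * order they were posted for each arriving message. */
    MPI_Waitall(2, reqs, stats);
}
```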
MPI tag matching operations must appear to be performed atomically
§ Complexity/serialization of message matching limits the processing that can be performed on GPUs
Mellanox hardware will support efficient MPI tag matching
§ Offloaded to the ConnectX-5 HCA
— Enables more of the hardware bandwidth to translate to realized message rate
— Full MPI tag matching as compute progresses
— Rendezvous offload: large data delivery as compute progresses (see the sketch below)
§ Control can be passed between hardware and software
§ Verbs tag matching support is being upstreamed
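A minimal sketch of the rendezvous-style overlap that hardware tag matching and rendezvous offload are intended to help. The offload itself is invisible at the MPI level; the peer rank, message size, and work loop below are illustrative assumptions.

```c
/* Sketch: a large (rendezvous-protocol-sized) nonblocking exchange posted
 * before independent computation.  With hardware tag matching and
 * rendezvous offload, the bulk data can be delivered while the compute
 * loop runs, without software progress calls. */
#include <mpi.h>

void exchange_halo(MPI_Comm comm, int peer,
                   double *sendbuf, double *recvbuf, int big_n,
                   double *work, int wn)
{
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, big_n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, big_n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    /* Independent computation; the HCA can progress the large transfer. */
    for (int i = 0; i < wn; ++i)
        work[i] = 0.5 * (work[i] + work[wn - 1 - i]);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```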
Deploying multiple networks exacerbates network hardware costs, which are already too high in large-scale systems
§ Sierra uses a commodity cluster network solution
— Separate management (Ethernet) and user (IB) networks
— Single network for user traffic (point-to-point, collectives & file system traffic)
§ A single network for user traffic saves money but has other costs
— Jitter impact of other jobs' file system traffic can be severe
— Burst buffer strategy smooths file system bandwidth demand
• File system traffic of a job now competes with its MPI traffic
§ Different types of network traffic are not equally critical
— File system traffic "only" needs a guarantee of eventual completion
— Collectives often critically limit overall performance
— Other traffic classes also exist
Quality-of-Service (QoS) mechanisms are necessary to achieve a network solution that reduces network hardware costs while providing acceptable, consistent performance for all traffic classes
Preliminary results indicate that IB priority levels compensate for checkpoint traffic
Figure 21: Application (pF3D nearest-neighbor exchange in Z, 16 processes per node) completion time for the various benchmarked configurations, in the presence of checkpointing and in isolation.
Figure 22: Checkpointing rate in the absence and in the presence of the application (pF3D nearest-neighbor exchange in Z, 16 processes per node) for the various benchmarked configurations.
3.3 All-to-All Simulations on Fat Tree
The all-to-all communication phases of pF3D, due to the mapping we have chosen, interact with checkpointing traffic in a smaller portion of the network. They do, however, exhibit behavior similar to that presented in the previous section for the nearest-neighbor exchange, and the conclusions are the same. In the interest of brevity, we include only the summary of the results. The figures present the following.
For the all-to-all exchange in the X direction:
• Figure 23: pF3D completion time in the single-process-per-node case.
• Figure 24: checkpointing rate in the pF3D single-process-per-node case.
Sierra network hardware addresses lessons learned from previous LLNL systems
§ Flexibility of SHArP switch-based collectives will accelerate subcommunicator collectives and will allow jobs to share the network
§ HCA MPI tag matching will reduce software cost on the critical path
— Future systems should further accelerate message passing software
§ QoS mechanisms are essential with burst buffers or systems that share network resources across jobs
— Multiple networks might still provide the best solution in some cases
— Network partitioning could still be valuable on future systems
§ GPUDirect and NVMe reduce within-node messaging impact
§ High-capability nodes lead to a smaller network
— Reduces the importance of network partitioning
— LLNL CTS-1 with a 2-to-1 tapered fat tree still requires careful task mapping
Substantial research questions still remain for systems after Sierra