Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications
Ching-Hsiang Chu (1), Khaled Hamidouche (1), Hari Subramoni (1), Akshay Venkatesh (1), Bracy Elton (2), and Dhabaleswar K. (DK) Panda (1)
(1) Department of Computer Science and Engineering, The Ohio State University; (2) Engility Corporation
COMHPC@SC16 · Network-Based Computing Laboratory
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion and Future Work
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand (IB) is very popular in HPC clusters
• Accelerators/coprocessors are becoming common in high-end systems
➠ Pushing the envelope towards Exascale computing
[Figure: multi-core processors; high-performance interconnects (InfiniBand: <1 µs latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
IB and GPU in HPC Systems
• Growth of IB and GPU clusters in the last 3 years
  – IB is the major commodity network adapter used
  – NVIDIA GPUs power 18% of the top 50 of the "Top500" systems as of June 2016

System share in the Top500 (%), data from the Top500 list (http://www.top500.org):

                     June 2013  Nov 2013  June 2014  Nov 2014  June 2015  Nov 2015  June 2016
GPU Cluster              8.4       7.8       9         9.8      10.4      13.8      13.2
InfiniBand Cluster      41        41.4      44.4      44.8      51.8      47.4      40.8
Motivation
• Streaming applications on HPC systems
  1. Communication (MPI)
     • Pipeline of broadcast-type operations
  2. Computation (CUDA)
     • Multiple GPU nodes as workers
• Examples
  – Deep learning frameworks
  – Proton computed tomography (pCT)
[Figure: real-time streaming from a Data Source through a Data Distributor to multiple Worker nodes, each with a CPU and GPUs, i.e., data streaming-like broadcast operations onto HPC resources for real-time analytics]
Communication for Streaming Applications
• High-performance heterogeneous broadcast*
  – Leverages NVIDIA GPUDirect and IB hardware multicast (MCAST) features
  – Eliminates unnecessary data staging through host memory (a usage sketch follows the figure)
[Figure: the source node performs an IB SL step, and the IB switch multicasts the data to the IB HCAs of Node 1 through Node N, which deliver it directly to GPU memory; IB HCA: InfiniBand Host Channel Adapter]
* Ching-Hsiang Chu, Khaled Hamidouche, Hari Subramoni, Akshay Venkatesh, Bracy Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 2016.
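For context, a minimal sketch (not the paper's implementation) of how an application drives such a heterogeneous broadcast through a CUDA-aware MPI library such as MVAPICH2-GDR; the GPUDirect/MCAST machinery sits inside the library, and the buffer size and root rank here are illustrative:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define BUF_SIZE (1 << 20)   /* 1 MB payload (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    void *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc(&d_buf, BUF_SIZE);   /* device memory on every node */
    if (rank == 0) {
        /* source: fill d_buf (e.g., with a kernel or cudaMemcpy) ... */
    }

    /* Pass the device pointer directly; a CUDA-aware MPI can move the
     * data GPU-to-GPU without staging it through host memory. */
    MPI_Bcast(d_buf, BUF_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```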
Limitations of the Existing Scheme
• IB hardware multicast significantly improves performance; however, it is an Unreliable Datagram (UD)-based scheme
  ⇒ Reliability needs to be handled explicitly
• Existing Negative ACKnowledgement (NACK)-based design
  – Sender must stall to check for receipt of NACK packets
    ⇒ Breaks the pipeline of broadcast operations
  – Re-sends MCAST packets even when they are unnecessary for some receivers
    ⇒ Wastes network resources, degrades throughput/bandwidth
Problem Statement
• How to provide reliability support, while leveraging UD-based IB hardware multicast, to achieve high-performance broadcast for GPU-enabled streaming applications?
• Maintains the pipeline of broadcast operations
• Minimizes the consumption of Peripheral Component Interconnect Express (PCIe) resources
Outline
• Introduction
• Proposed Designs
  – Remote Memory Access (RMA)-based Design
• Performance Evaluation
• Conclusion and Future Work
Overview: RMA-based Reliability Design
• Goals of the proposed design
  – Allow receivers to retrieve lost MCAST packets through RMA operations without interrupting the sender
    ⇒ Maintains pipelining of broadcast operations
    ⇒ Minimizes consumption of PCIe resources
• Major benefit of MPI-3 Remote Memory Access (RMA)*
  – Supports one-sided communication → the broadcast sender won't be interrupted
• Major challenge
  – How and where receivers can retrieve the correct MCAST packets through RMA operations
* "MPI Forum", http://mpi-forum.org/
Implementing MPI_Bcast: Sender Side
• Maintains an additional window of a circular backup buffer for MCAST packets
• Exposes this window to other processes in the MCAST group, e.g., performs MPI_Win_create
• Utilizes an additional helper thread to copy MCAST packets to the backup buffer → overlaps with broadcast communication (see the sketch below)
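A minimal sketch of the sender-side setup, assuming a circular buffer of NUM_SLOTS fixed-size slots. SLOT_SIZE, NUM_SLOTS, mcast_group, and the hook wait_for_next_mcast_packet() are illustrative placeholders; the actual MVAPICH2-GDR internals differ:

```c
#include <mpi.h>
#include <pthread.h>
#include <string.h>
#include <stdlib.h>

#define SLOT_SIZE 8192   /* one MCAST packet per slot (illustrative)   */
#define NUM_SLOTS 1024   /* W, chosen by the buffer-sizing rule below  */

static char   *backup_buf;   /* circular backup buffer                 */
static MPI_Win backup_win;   /* window exposed to the MCAST group      */

/* Hypothetical hook: blocks until the next outgoing MCAST packet. */
extern const void *wait_for_next_mcast_packet(void);

/* Helper thread: copy each MCAST packet into the next circular slot,
 * overlapping the copies with the ongoing broadcast pipeline. */
static void *backup_helper(void *arg)
{
    unsigned long seq = 0;
    for (;;) {
        const void *pkt = wait_for_next_mcast_packet();
        memcpy(backup_buf + (seq % NUM_SLOTS) * SLOT_SIZE, pkt, SLOT_SIZE);
        seq++;
    }
    return NULL;
}

void sender_setup(MPI_Comm mcast_group)
{
    pthread_t tid;
    backup_buf = malloc((size_t)NUM_SLOTS * SLOT_SIZE);
    /* Expose the backup buffer to every process in the MCAST group. */
    MPI_Win_create(backup_buf, (MPI_Aint)NUM_SLOTS * SLOT_SIZE,
                   1 /* displacement unit: bytes */, MPI_INFO_NULL,
                   mcast_group, &backup_win);
    pthread_create(&tid, NULL, backup_helper, NULL);
}
```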
Implementing MPI_Bcast: Receiver Side
• When a receiver experiences a timeout (lost MCAST packet)
  – Performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets (see the sketch after the figure)
  – The sender is not interrupted
[Figure: timeline of broadcast sender and receiver (MPI and IB HCA); on a timeout, the receiver fetches the missing packet from the sender's window while the sender continues uninterrupted]
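A matching receiver-side sketch under the same illustrative SLOT_SIZE/NUM_SLOTS layout; MPI_Win_lock/MPI_Get/MPI_Win_unlock form a passive-target epoch, so the retrieval is one-sided and the sender process is never involved:

```c
#include <mpi.h>

#define SLOT_SIZE 8192   /* must match the sender-side layout (illustrative) */
#define NUM_SLOTS 1024

/* Retrieve a lost MCAST packet (sequence number `seq`) from the
 * sender's backup window without interrupting the sender. */
void recover_packet(MPI_Win backup_win, int sender_rank,
                    unsigned long seq, void *dst)
{
    MPI_Aint disp = (MPI_Aint)(seq % NUM_SLOTS) * SLOT_SIZE;

    /* Passive-target epoch: serviced by the target's HCA, not its CPU. */
    MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, backup_win);
    MPI_Get(dst, SLOT_SIZE, MPI_BYTE,
            sender_rank, disp, SLOT_SIZE, MPI_BYTE, backup_win);
    MPI_Win_unlock(sender_rank, backup_win);   /* completes the Get */
}
```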
Backup Buffer Requirements
• Large enough to keep the MCAST packets available when they are needed
• As small as possible to limit the memory footprint

W > B × (K × RTT) / f

where f is the frame size (size of a single MCAST packet), B is the bandwidth, K is a constant, and RTT is the round-trip time between sender and receiver (a worked example follows).
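As a worked example with assumed numbers (none of these values are from the talk): EDR-class bandwidth B = 12.5 GB/s, round-trip time RTT = 4 µs, safety constant K = 10, and frame size f = 8 KB give

```latex
W > \frac{B \times (K \times \mathit{RTT})}{f}
  = \frac{12.5~\mathrm{GB/s} \times (10 \times 4~\mu\mathrm{s})}{8~\mathrm{KB}}
  = \frac{500{,}000~\mathrm{B}}{8{,}192~\mathrm{B}} \approx 61
```

i.e., on the order of 64 packet slots (roughly 0.5 MB), so under these assumptions the backup buffer satisfies both requirements with a small footprint.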
Proposed RMA-based Reliability Design
• Pros:
  – The broadcast sender is not involved in retransmission, i.e., maintains the pipeline of broadcast operations
    ⇒ High throughput, high scalability
  – No extra MCAST operation, i.e., minimizes consumption of PCIe resources
    ⇒ Low overhead, low latency
• Cons:
  – Congestion may occur when multiple RMA Get operations from multiple receivers are issued to retrieve the same data in extremely unreliable networks (highly unlikely for IB clusters)
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – Experimental Environments
  – Streaming Benchmark Level Evaluation
• Conclusion and Future Work
Experimental Environments
1. RI2 cluster @ The Ohio State University*
   – Mellanox EDR InfiniBand HCAs
   – 2 NVIDIA K80 GPUs per node
   – Used up to 16 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre (http://www.cscs.ch/computers/kesch_escha/index.html)
   – Mellanox FDR InfiniBand HCAs
   – Cray CS-Storm system, 8 NVIDIA K80 GPU cards per node
   – Used up to 88 NVIDIA K80 GPU cards over 11 nodes
• Modified Ohio State University (OSU) Micro-Benchmark (OMB)* (http://mvapich.cse.ohio-state.edu/benchmarks/)
  – osu_bcast – MPI_Bcast latency test
  – Modified to support heterogeneous broadcast
• Streaming benchmark
  – Mimics real streaming applications
  – Continuously broadcasts data from a source to GPU-based compute nodes
  – Includes a computation phase that involves host-to-device and device-to-host copies (see the sketch below)
* Results from RI2 and OMB are omitted in this presentation due to time constraints
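A minimal sketch of what such a streaming benchmark's main loop might look like, assuming a CUDA-aware MPI; the message size, iteration count, and dummy compute phase are illustrative, not the actual benchmark code:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define MSG_SIZE  (1 << 17)   /* 128 KB per broadcast (illustrative) */
#define NUM_ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    void *d_buf, *h_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc(&d_buf, MSG_SIZE);
    cudaMallocHost(&h_buf, MSG_SIZE);   /* pinned host staging buffer */

    for (int i = 0; i < NUM_ITERS; i++) {
        /* Communication phase: the source (rank 0) continuously streams
         * data to all GPU workers as back-to-back pipelined broadcasts. */
        MPI_Bcast(d_buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

        if (rank != 0) {
            /* Computation phase with device-to-host and host-to-device
             * copies, standing in for real per-packet processing. */
            cudaMemcpy(h_buf, d_buf, MSG_SIZE, cudaMemcpyDeviceToHost);
            /* ... process h_buf on the CPU / launch kernels ... */
            cudaMemcpy(d_buf, h_buf, MSG_SIZE, cudaMemcpyHostToDevice);
        }
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```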
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR), available since 2014
  – Support for MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,675 organizations in 83 countries
  – More than 400,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June 2016 ranking)
    • 12th-ranked 462,462-core cluster (Stampede) at TACC
    • 15th-ranked 185,344-core cluster (Pleiades) at NASA
    • 31st-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s) ⇒ Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s)
Evaluation: Overhead
• Negligible overhead compared to the existing NACK-based design
• RMA-based design outperforms the NACK-based scheme for large messages
• A helper thread in the background performs backups of MCAST packets
[Figure: two plots of broadcast latency (µs) vs. message size (1 B to 8 KB, and 16 KB to 4 MB) comparing "w/o reliability", NACK, and RMA]
Evaluation: Latency on Streaming Benchmark
• Latency reduction of the proposed RMA-based design compared to the existing NACK-based scheme:

                      Message Size
Modeled Error Rate    8 KB    128 KB    2 MB
0.01%                 16%     31%       11%
0.1%                  21%     36%       19%
1%                    24%     21%       10%

[Figure: normalized latency vs. message size (1 B to 8 KB, and 16 KB to 4 MB) for modeled error rates of 0.01%, 0.1%, and 1%, normalized to the SL-based MCAST with the NACK-based retransmission scheme]
Evaluation: Broadcast Rate (Throughput)
• Equal to or better than the leading NACK-based design across different message sizes and error rates
• Always yields a higher broadcast rate (up to 56% higher) than the existing NACK-based design
[Figure: broadcast rate vs. message size (1 B to 4 MB) for modeled error rates of 0.01%, 0.1%, and 1%, normalized to the SL-based MCAST with the NACK-based retransmission scheme]
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion and Future Work
Conclusion
• Propose an RMA-based reliability design on top of IB hardware-multicast-based broadcast for streaming applications
  – Maintains pipelining of broadcast operations
  – Minimizes consumption of PCIe resources
  – Provides good performance with streaming benchmarks, which is promising for real streaming applications
• Future work
  – Include the proposed design in future releases of the MVAPICH2-GDR library
  – Evaluate effectiveness with real streaming applications
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017 through Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.