Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications
Ching-Hsiang Chu (1), Khaled Hamidouche (1), Hari Subramoni (1), Akshay Venkatesh (1), Bracy Elton (2), and Dhabaleswar K. (DK) Panda (1)
(1) Department of Computer Science and Engineering, The Ohio State University; (2) Engility Corporation
COMHPC@SC16 · Network-Based Computing Laboratory
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion and Future Work
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand (IB) is very popular in HPC clusters
• Accelerators/coprocessors are becoming common in high-end systems
➠ Pushing the envelope towards Exascale computing
[Figure: multi-core processors; high-performance interconnects (InfiniBand: <1 µs latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
IB and GPU in HPC Systems
• Growth of IB and GPU clusters in the last 3 years
  – IB is the major commodity network adapter used
  – NVIDIA GPUs power 18% of the top 50 of the "Top500" systems as of June 2016

System share in the Top500 (%), data from the Top500 list (http://www.top500.org):

                     June 2013  Nov 2013  June 2014  Nov 2014  June 2015  Nov 2015  June 2016
GPU Cluster              8.4       7.8       9         9.8      10.4      13.8      13.2
InfiniBand Cluster      41        41.4      44.4      44.8      51.8      47.4      40.8
Motivation
• Streaming applications on HPC systems
  1. Communication (MPI)
     • Pipeline of broadcast-type operations
  2. Computation (CUDA)
     • Multiple GPU nodes as workers
• Examples
  – Deep learning frameworks
  – Proton computed tomography (pCT)
[Figure: real-time streaming from a Data Source through a Data Distributor to multiple Worker nodes, each with a CPU and GPUs, i.e., data streaming-like broadcast operations onto HPC resources for real-time analytics]
Communication for Streaming Applications
• High-performance heterogeneous broadcast*
  – Leverages NVIDIA GPUDirect and IB hardware multicast (MCAST) features
  – Eliminates unnecessary data staging through host memory (a usage sketch follows the figure)
[Figure: the source node performs an IB SL step, and the IB switch multicasts the data to the IB HCAs of Node 1 through Node N, which deliver it directly to GPU memory; IB HCA: InfiniBand Host Channel Adapter]
* Ching-Hsiang Chu, Khaled Hamidouche, Hari Subramoni, Akshay Venkatesh, Bracy Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 2016.
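For context, a minimal sketch (not the paper's implementation) of how an application drives such a heterogeneous broadcast through a CUDA-aware MPI library such as MVAPICH2-GDR; the GPUDirect/MCAST machinery sits inside the library, and the buffer size and root rank here are illustrative:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define BUF_SIZE (1 << 20)   /* 1 MB payload (illustrative) */

int main(int argc, char **argv)
{
    int rank;
    void *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc(&d_buf, BUF_SIZE);   /* device memory on every node */
    if (rank == 0) {
        /* source: fill d_buf (e.g., with a kernel or cudaMemcpy) ... */
    }

    /* Pass the device pointer directly; a CUDA-aware MPI can move the
     * data GPU-to-GPU without staging it through host memory. */
    MPI_Bcast(d_buf, BUF_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```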
Limitations of the Existing Scheme
• IB hardware multicast significantly improves performance; however, it is an Unreliable Datagram (UD)-based scheme
  ⇒ Reliability needs to be handled explicitly
• Existing Negative ACKnowledgement (NACK)-based design
  – Sender must stall to check for receipt of NACK packets
    ⇒ Breaks the pipeline of broadcast operations
  – Re-sends MCAST packets even when they are unnecessary for some receivers
    ⇒ Wastes network resources, degrades throughput/bandwidth
Problem Statement
• How to provide reliability support, while leveraging UD-based IB hardware multicast, to achieve high-performance broadcast for GPU-enabled streaming applications?
• Maintains the pipeline of broadcast operations
• Minimizes the consumption of Peripheral Component Interconnect Express (PCIe) resources
Outline
• Introduction
• Proposed Designs
  – Remote Memory Access (RMA)-based Design
• Performance Evaluation
• Conclusion and Future Work
Overview: RMA-based Reliability Design
• Goals of the proposed design
  – Allow receivers to retrieve lost MCAST packets through RMA operations without interrupting the sender
    ⇒ Maintains pipelining of broadcast operations
    ⇒ Minimizes consumption of PCIe resources
• Major benefit of MPI-3 Remote Memory Access (RMA)*
  – Supports one-sided communication → the broadcast sender won't be interrupted
• Major challenge
  – How and where receivers can retrieve the correct MCAST packets through RMA operations
* "MPI Forum", http://mpi-forum.org/
Implementing MPI_Bcast: Sender Side
• Maintains an additional window of a circular backup buffer for MCAST packets
• Exposes this window to other processes in the MCAST group, e.g., performs MPI_Win_create
• Utilizes an additional helper thread to copy MCAST packets to the backup buffer → overlaps with broadcast communication (see the sketch below)
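A minimal sketch of the sender-side setup, assuming a circular buffer of NUM_SLOTS fixed-size slots. SLOT_SIZE, NUM_SLOTS, mcast_group, and the hook wait_for_next_mcast_packet() are illustrative placeholders; the actual MVAPICH2-GDR internals differ:

```c
#include <mpi.h>
#include <pthread.h>
#include <string.h>
#include <stdlib.h>

#define SLOT_SIZE 8192   /* one MCAST packet per slot (illustrative)   */
#define NUM_SLOTS 1024   /* W, chosen by the buffer-sizing rule below  */

static char   *backup_buf;   /* circular backup buffer                 */
static MPI_Win backup_win;   /* window exposed to the MCAST group      */

/* Hypothetical hook: blocks until the next outgoing MCAST packet. */
extern const void *wait_for_next_mcast_packet(void);

/* Helper thread: copy each MCAST packet into the next circular slot,
 * overlapping the copies with the ongoing broadcast pipeline. */
static void *backup_helper(void *arg)
{
    unsigned long seq = 0;
    for (;;) {
        const void *pkt = wait_for_next_mcast_packet();
        memcpy(backup_buf + (seq % NUM_SLOTS) * SLOT_SIZE, pkt, SLOT_SIZE);
        seq++;
    }
    return NULL;
}

void sender_setup(MPI_Comm mcast_group)
{
    pthread_t tid;
    backup_buf = malloc((size_t)NUM_SLOTS * SLOT_SIZE);
    /* Expose the backup buffer to every process in the MCAST group. */
    MPI_Win_create(backup_buf, (MPI_Aint)NUM_SLOTS * SLOT_SIZE,
                   1 /* displacement unit: bytes */, MPI_INFO_NULL,
                   mcast_group, &backup_win);
    pthread_create(&tid, NULL, backup_helper, NULL);
}
```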
Implementing MPI_Bcast: Receiver Side
• When a receiver experiences a timeout (lost MCAST packet)
  – Performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets (see the sketch after the figure)
  – The sender is not interrupted
[Figure: timeline of broadcast sender and receiver (MPI and IB HCA); on a timeout, the receiver fetches the missing packet from the sender's window while the sender continues uninterrupted]
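A matching receiver-side sketch under the same illustrative SLOT_SIZE/NUM_SLOTS layout; MPI_Win_lock/MPI_Get/MPI_Win_unlock form a passive-target epoch, so the retrieval is one-sided and the sender process is never involved:

```c
#include <mpi.h>

#define SLOT_SIZE 8192   /* must match the sender-side layout (illustrative) */
#define NUM_SLOTS 1024

/* Retrieve a lost MCAST packet (sequence number `seq`) from the
 * sender's backup window without interrupting the sender. */
void recover_packet(MPI_Win backup_win, int sender_rank,
                    unsigned long seq, void *dst)
{
    MPI_Aint disp = (MPI_Aint)(seq % NUM_SLOTS) * SLOT_SIZE;

    /* Passive-target epoch: serviced by the target's HCA, not its CPU. */
    MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, backup_win);
    MPI_Get(dst, SLOT_SIZE, MPI_BYTE,
            sender_rank, disp, SLOT_SIZE, MPI_BYTE, backup_win);
    MPI_Win_unlock(sender_rank, backup_win);   /* completes the Get */
}
```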
Backup Buffer Requirements
• Large enough to keep the MCAST packets available when they are needed
• As small as possible to limit the memory footprint

W > B × (K × RTT) / f

where f is the frame size (size of a single MCAST packet), B is the bandwidth, K is a constant, and RTT is the round-trip time between sender and receiver (a worked example follows).
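As a worked example with assumed numbers (none of these values are from the talk): EDR-class bandwidth B = 12.5 GB/s, round-trip time RTT = 4 µs, safety constant K = 10, and frame size f = 8 KB give

```latex
W > \frac{B \times (K \times \mathit{RTT})}{f}
  = \frac{12.5~\mathrm{GB/s} \times (10 \times 4~\mu\mathrm{s})}{8~\mathrm{KB}}
  = \frac{500{,}000~\mathrm{B}}{8{,}192~\mathrm{B}} \approx 61
```

i.e., on the order of 64 packet slots (roughly 0.5 MB), so under these assumptions the backup buffer satisfies both requirements with a small footprint.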
Proposed RMA-based Reliability Design
• Pros:
  – The broadcast sender is not involved in retransmission, i.e., maintains the pipeline of broadcast operations
    ⇒ High throughput, high scalability
  – No extra MCAST operation, i.e., minimizes consumption of PCIe resources
    ⇒ Low overhead, low latency
• Cons:
  – Congestion may occur when multiple RMA Get operations from multiple receivers are issued to retrieve the same data in extremely unreliable networks (highly unlikely for IB clusters)
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – Experimental Environments
  – Streaming Benchmark Level Evaluation
• Conclusion and Future Work
Experimental Environments
1. RI2 cluster @ The Ohio State University*
   – Mellanox EDR InfiniBand HCAs
   – 2 NVIDIA K80 GPUs per node
   – Used up to 16 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre (http://www.cscs.ch/computers/kesch_escha/index.html)
   – Mellanox FDR InfiniBand HCAs
   – Cray CS-Storm system, 8 NVIDIA K80 GPU cards per node
   – Used up to 88 NVIDIA K80 GPU cards over 11 nodes
• Modified Ohio State University (OSU) Micro-Benchmark (OMB)* (http://mvapich.cse.ohio-state.edu/benchmarks/)
  – osu_bcast – MPI_Bcast latency test
  – Modified to support heterogeneous broadcast
• Streaming benchmark
  – Mimics real streaming applications
  – Continuously broadcasts data from a source to GPU-based compute nodes
  – Includes a computation phase that involves host-to-device and device-to-host copies (see the sketch below)
* Results from RI2 and OMB are omitted in this presentation due to time constraints
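A minimal sketch of what such a streaming benchmark's main loop might look like, assuming a CUDA-aware MPI; the message size, iteration count, and dummy compute phase are illustrative, not the actual benchmark code:

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define MSG_SIZE  (1 << 17)   /* 128 KB per broadcast (illustrative) */
#define NUM_ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    void *d_buf, *h_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc(&d_buf, MSG_SIZE);
    cudaMallocHost(&h_buf, MSG_SIZE);   /* pinned host staging buffer */

    for (int i = 0; i < NUM_ITERS; i++) {
        /* Communication phase: the source (rank 0) continuously streams
         * data to all GPU workers as back-to-back pipelined broadcasts. */
        MPI_Bcast(d_buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

        if (rank != 0) {
            /* Computation phase with device-to-host and host-to-device
             * copies, standing in for real per-packet processing. */
            cudaMemcpy(h_buf, d_buf, MSG_SIZE, cudaMemcpyDeviceToHost);
            /* ... process h_buf on the CPU / launch kernels ... */
            cudaMemcpy(d_buf, h_buf, MSG_SIZE, cudaMemcpyHostToDevice);
        }
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```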
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR), available since 2014
  – Support for MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,675 organizations in 83 countries
  – More than 400,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June 2016 ranking)
    • 12th-ranked 462,462-core cluster (Stampede) at TACC
    • 15th-ranked 185,344-core cluster (Pleiades) at NASA
    • 31st-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s) ⇒ Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s)
Evaluation: Overhead
• Negligible overhead compared to the existing NACK-based design
• RMA-based design outperforms the NACK-based scheme for large messages
• A helper thread in the background performs backups of MCAST packets
[Figure: two plots of broadcast latency (µs) vs. message size (1 B to 8 KB, and 16 KB to 4 MB) comparing "w/o reliability", NACK, and RMA]
Evaluation: Latency on Streaming Benchmark
• Latency reduction of the proposed RMA-based design compared to the existing NACK-based scheme:

                      Message Size
Modeled Error Rate    8 KB    128 KB    2 MB
0.01%                 16%     31%       11%
0.1%                  21%     36%       19%
1%                    24%     21%       10%

[Figure: normalized latency vs. message size (1 B to 8 KB, and 16 KB to 4 MB) for modeled error rates of 0.01%, 0.1%, and 1%, normalized to the SL-based MCAST with the NACK-based retransmission scheme]
Evaluation: Broadcast Rate (Throughput)
• Equal to or better than the leading NACK-based design across different message sizes and error rates
• Always yields a higher broadcast rate (up to 56% higher) than the existing NACK-based design
[Figure: broadcast rate vs. message size (1 B to 4 MB) for modeled error rates of 0.01%, 0.1%, and 1%, normalized to the SL-based MCAST with the NACK-based retransmission scheme]
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion and Future Work
Conclusion
• Propose an RMA-based reliability design on top of IB hardware-multicast-based broadcast for streaming applications
  – Maintains pipelining of broadcast operations
  – Minimizes consumption of PCIe resources
  – Provides good performance with streaming benchmarks, which is promising for real streaming applications
• Future work
  – Include the proposed design in future releases of the MVAPICH2-GDR library
  – Evaluate effectiveness with real streaming applications
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017 through Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.