TRANSCRIPT
CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda
Speaker: Sourav Chakraborty
Network-Based Computing Laboratory
Department of Computer Science and Engineering, The Ohio State University
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Outline
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure: Multi-core processors; high-performance interconnects (InfiniBand: <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop/s DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
• Growth of accelerator-enabled clusters in the last 3 years
– 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
– From the Top500 list (http://www.top500.org)
Accelerators in HPC Systems
[Chart: count of accelerator-enabled systems (NVIDIA Kepler, NVIDIA Fermi, Intel Xeon Phi) from June 2013 to Nov 2015]
• Scientific parallel applications spend a considerable amount of time in collective communication operations
– E.g., deep learning frameworks such as Caffe
Motivation – Collectives in Applications
[Figure: GPU Node 1 through GPU Node N perform GPU computations between MPI_Bcast/MPI_Scatter and MPI_Gather/MPI_Reduce phases]
• Scientific parallel applications spend a considerable amount of time in collective communication operations
– Pure communication collectives: Broadcast, Gather, Scatter, ...
– Compute-oriented collectives: Reduce, Allreduce, Scan
– The communication part is highly optimized
• Compute-oriented collective operations are not fully optimized for GPU clusters
– The CPU is doing all the work
– GPU resources are not fully utilized
Motivation – Collective Reduction Operations
• Fast computation
– Massive parallelism
• Efficient communication
– NVIDIA GPUDirect RDMA
Motivation – Powerful GPU Resources
• GPU features are not being utilized for all collectives
• Can we leverage these features to further optimize the compute-oriented collectives on GPU clusters?
http://www.nvidia.com/object/gpu-accelerated-computing.html https://developer.nvidia.com/gpudirect
• How to eliminate explicit data movements between host and GPU memories?
– The cudaMemcpy call is expensive!
• How to handle the GPU-to-GPU communication after the computations finish?
• When to use the GPU for compute-oriented collective operations?
– Launching kernels brings overhead; how to minimize it? (see the timing sketch below)
Problem Statement
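To make the kernel-launch overhead in the last point concrete, a minimal timing sketch (an illustrative micro-benchmark, not part of the original slides; the ~10 us figure quoted later in the talk is hardware- and driver-dependent) could look like this:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() { }    // no work: only the launch/scheduling cost remains

int main()
{
    const int iters = 1000;

    empty_kernel<<<1, 1>>>();          // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();      // launches are queued asynchronously
    cudaDeviceSynchronize();           // wait for all of them to drain
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    std::printf("amortized launch + completion cost: %.2f us per kernel\n", us);
    return 0;
}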
• Design a framework that exploits CUDA kernels to efficiently handle compute-oriented collectives
• Propose extensions to the existing collective algorithms to make them GPU-aware compute-oriented algorithms
– MPI_Reduce, MPI_Allreduce and MPI_Scan
• Detailed analysis and evaluation using different GPU systems, including a Cray CS-Storm system
Overview
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Outline
• Existing designs
1. Explicitly copy the data from GPU to host memory
2. Host-to-host communication to remote processes
3. Perform the computation on the CPU
4. Explicitly copy the data from host to GPU memory
• Proposed designs (a sketch follows the figure below)
1. GPU-to-GPU communication
• NVIDIA GPUDirect RDMA (GDR)
• Pipeline through the host for large messages
2. Perform the computation on the GPU
• Efficient CUDA kernels
Design Consideration
[Figure: Node A and Node B, each with a CPU, host memory, a GPU, PCIe, and an IB adapter; the existing design takes steps 1-4 through host memory, while the proposed design takes steps 1-2 directly between the GPUs]
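As a rough illustration of the two data paths above, here is a minimal sketch of a single pairwise reduction step written both ways. It assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) and the vector_addition kernel shown on the backup slide; buffer names and the launch configuration are hypothetical:

#include <mpi.h>
#include <cuda_runtime.h>

// Reduction kernel from the backup slide (declaration only).
template <class T>
__global__ void vector_addition(T *dst, const T *src, size_t count);

// (1) Existing design: stage through host memory, reduce on the CPU (steps 1-4).
void reduce_step_host(float *d_local, float *h_local, float *h_remote,
                      int count, int src_rank, MPI_Comm comm)
{
    cudaMemcpy(h_local, d_local, count * sizeof(float), cudaMemcpyDeviceToHost);   // 1. D->H copy
    MPI_Recv(h_remote, count, MPI_FLOAT, src_rank, 0, comm, MPI_STATUS_IGNORE);    // 2. host-host comm
    for (int i = 0; i < count; ++i) h_local[i] += h_remote[i];                     // 3. CPU reduction
    cudaMemcpy(d_local, h_local, count * sizeof(float), cudaMemcpyHostToDevice);   // 4. H->D copy
}

// (2) Proposed design: a CUDA-aware MPI delivers directly into device memory
//     (GDR for small messages, host-staged pipeline for large ones), and the
//     reduction runs as a CUDA kernel (steps 1-2).
void reduce_step_gpu(float *d_local, float *d_remote,
                     int count, int src_rank, MPI_Comm comm)
{
    MPI_Recv(d_remote, count, MPI_FLOAT, src_rank, 0, comm, MPI_STATUS_IGNORE);    // 1. GPU-GPU comm
    vector_addition<float><<<128, 256>>>(d_local, d_remote, (size_t)count);        // 2. GPU reduction
    cudaDeviceSynchronize();
}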
• Tree-based K-nomial algorithm (a skeleton sketch follows the figure below)
– Only the non-leaf nodes perform the reduction operation
• Pros & Cons
– Load balance, efficient/scalable communication
– Higher average latency
K-nomial MPI_Reduce
[Figure: binomial reduction tree over ranks 0-7; in steps [1]-[3] partial results are combined toward rank 0]
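A minimal skeleton of the binomial (k = 2) case of this tree, reducing toward rank 0, might look as follows. It is a sketch rather than the actual MVAPICH2 code path; reduce_local is a hypothetical helper that performs the local reduction either as a CPU loop or as a CUDA kernel launch:

#include <mpi.h>

// At step s, ranks with bit s set send their partial result to (rank - mask) and
// drop out; the remaining (non-leaf) ranks receive and reduce, so only they compute.
void binomial_reduce(float *buf, float *tmp, int count, MPI_Comm comm,
                     void (*reduce_local)(float *dst, const float *src, int count))
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            MPI_Send(buf, count, MPI_FLOAT, rank - mask, 0, comm);
            return;                                   // this rank's subtree is finished
        }
        if (rank + mask < size) {
            MPI_Recv(tmp, count, MPI_FLOAT, rank + mask, 0, comm, MPI_STATUS_IGNORE);
            reduce_local(buf, tmp, count);            // accumulate the child's partial result
        }
    }
    // After log2(n) steps, rank 0 holds the fully reduced result.
}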
• Host-based Binomial-Reduce (Default):
$\log_2 n \times (\epsilon \times Comm_{Host}(M) + Comp_{Host}(M) + 2 \times Copy(M))$
– Expensive cudaMemcpy before/after the reduction operation
– Relatively slow computation on the CPU; fast host-host communication
• GPU-based Binomial-Reduce (BR-DD):
$\log_2 n \times (\epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M))$
– Fast, highly parallelized computation on the GPU
– Overhead of launching CUDA kernels (~10 us); GDR-based GPU-GPU communication
(M: message size; ε: constant variant of the tree initialization)
Cost Analysis
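Comparing the two models as reconstructed above, the $\log_2 n$ factor cancels, so (as a sketch) the GPU-based BR-DD design is expected to win whenever, per tree level,

$\epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M) < \epsilon \times Comm_{Host}(M) + Comp_{Host}(M) + 2 \times Copy(M)$

that is, whenever the kernel-launch overhead is outweighed by the avoided cudaMemcpy traffic and the faster GPU reduction, which is why the benefit grows with the message size M.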
• Gather-first algorithm (a sketch follows the figure below)
– The root gathers all the data and performs the computation
• Since only the root needs the final result
• Pros & Cons
– Low computation overhead
– Poor scalability
Gather-first MPI_Reduce
[Figure: ranks 0-7 send their data directly to the root, which performs the reduction]
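A minimal sketch of the gather-first idea with GPU-based reduction at the root (GR-DD-style), assuming a CUDA-aware MPI library and the vector_addition kernel from the backup slide; buffer names and the launch configuration are hypothetical:

#include <mpi.h>
#include <cuda_runtime.h>

template <class T>
__global__ void vector_addition(T *dst, const T *src, size_t count);   // backup slide

// The root gathers one chunk of `count` floats from every rank into a device
// buffer of size*count elements, then reduces all remote chunks into d_buf on the GPU.
void gather_first_reduce(float *d_buf, float *d_gathered, int count,
                         int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Gather(d_buf, count, MPI_FLOAT,
               d_gathered, count, MPI_FLOAT, root, comm);    // device buffers with CUDA-aware MPI

    if (rank == root) {
        for (int r = 0; r < size; ++r) {
            if (r == root) continue;                          // root's own chunk is already in d_buf
            vector_addition<float><<<128, 256>>>(
                d_buf, d_gathered + (size_t)r * count, (size_t)count);
        }
        cudaDeviceSynchronize();                              // only the root computes
    }
}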
• Host-based Gather and Reduce (GR-H-HH):
$(n-1) \times (Comm_{Host}(M) + Comp_{Host}(M) + 2 \times Copy(M))$
• Host-based Gather, GPU-based Reduce (GR-HH):
$(n-1) \times (Comm_{Host}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M) + 2 \times Copy(M))$
• GPU-based Gather and Reduce (GR-DD):
$(n-1) \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M)$
– Could suffer from scalability issues → good for small messages or small scale
– Less affected by CUDA kernel launch overhead → good for small messages
Cost Analysis
• Recursive doubling algorithm (a skeleton sketch follows the figure below)
– Every processor needs to perform the computation
• Pros & Cons
– Load balance, efficient/scalable communication
– Higher average latency
GPU-based MPI_Allreduce and MPI_Scan
[Figure: recursive doubling exchange among ranks 0-7 over steps [1]-[3]]
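A minimal recursive-doubling skeleton matching the pattern above, for power-of-two process counts and a commutative operation (a sketch, not the MVAPICH2 implementation; reduce_local is the same hypothetical helper as in the earlier sketch):

#include <mpi.h>

// In step s every rank exchanges its partial result with the rank differing in
// bit s and reduces, so after log2(n) steps all ranks hold the final result
// (every process computes, unlike the reduction tree).
void recursive_doubling_allreduce(float *buf, float *tmp, int count, MPI_Comm comm,
                                  void (*reduce_local)(float *dst, const float *src, int count))
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        MPI_Sendrecv(buf, count, MPI_FLOAT, partner, 0,
                     tmp, count, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        reduce_local(buf, tmp, count);     // both partners accumulate each other's data
    }
}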
• GPU-based Recursive Doubling (RD-DD):
$\log_2 n \times (\epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M))$
– Same as the BR-DD MPI_Reduce model
• GPU-based Binomial-Reduce-Broadcast (BRB-DD):
$\log_2 n \times (2 \times \epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M))$
Cost Analysis
Communication | Computation | Design | Algorithm | Benefit
Host<->Host | CPU | BR-H-HH (Default) | Binomial-Reduce | Large scale, small messages
Host<->Host | CPU | RD-H-HH (Default) | Recursive doubling | Large scale, small messages
Host<->Host | CPU | GR-H-HH | Gather-Reduce | Small scale, small messages
Host<->Host | GPU | GR-HH | Gather-Reduce | Small scale, small messages
Host<->Device (GDR) | GPU | GR-HD / GR-DH | Gather-Reduce | Small scale, small messages
Device<->Device (GDR) | GPU | GR-DD | Gather-Reduce | Small scale, small messages
Device<->Device (GDR) | GPU | BR-DD | Binomial-Reduce | Large messages for any scale
Device<->Device (GDR) | GPU | BRB-DD | Binomial-Reduce-Bcast | Large messages for any scale
Device<->Device (GDR) | GPU | RD-DD | Recursive doubling | Large messages for any scale
Host<->Device (GDR) | GPU | RD-HD / RD-DH | Recursive doubling | Large messages for any scale
Alternative and Extended Designs
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Outline
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Used by more than 2,550 organizations in 79 countries
– More than 360,000 (>0.36 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '15 ranking)
• 10th-ranked 519,640-core cluster (Stampede) at TACC
• 13th-ranked 185,344-core cluster (Pleiades) at NASA
• 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
1. Wilkes cluster @ University of Cambridge
– 2 NVIDIA K20c GPUs per node
– Used for inter-node experiments
• Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
– Cray CS-Storm system
– 8 NVIDIA K80 GPUs per node (= 16 NVIDIA K40 GPUs per node)
– Used for intra-node experiments
• Up to 96 GPUs over 11 nodes
Experimental Environments
Evaluation - MPI_Reduce @ Wilkes (32 GPUs)
[Charts: latency (us) vs. message size (bytes), small and large message ranges, comparing Default, BR-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH]
Gather-first approaches win for small messages
The K-nomial GPU-based approach wins for large messages
Evaluation - MPI_Reduce @ CSCS (96 GPUs)
[Charts: latency (us) vs. message size for small (4 B - 4 KB) and large (16 KB - 4 MB) messages, comparing Default, BR-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH]
Gather-first approaches win for small messages
The K-nomial GPU-based approach wins for large messages
Evaluation - MPI_Allreduce
[Charts: latency (ms) vs. message size (bytes) on 96 GPUs @ CSCS, and latency (ms) vs. system size (2-32 nodes) on 32 GPUs @ Wilkes, comparing Default, RD-DD, and BRB-DD]
Good scalability (32 GPUs @ Wilkes)
Evaluation - MPI_Scan
[Charts: latency (ms) vs. message size (64 KB - 4 MB) on 96 GPUs @ CSCS, and latency (ms) vs. system size (2-32 nodes) for a 2 MB message on 32 GPUs @ Wilkes, comparing Default, RD-DD, and RD-HD]
Good scalability (32 GPUs @ Wilkes)
Prediction
• Use the proposed analytical models to predict the performance for large-scale GPU clusters
[Charts: latency (us) vs. message size (4 B - 4 MB): model-based estimation vs. experimental results on 32 GPUs of the Wilkes cluster, and a prediction for 1024 GPUs comparing Default with RD-DD/BR-DD]
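One possible way to turn the reconstructed models into such a prediction is sketched below; every constant is an illustrative placeholder, not a measured value from the paper:

#include <cstdio>
#include <cmath>

// Evaluate the two reconstructed cost models for n GPUs and message size M (bytes).
// All latency constants are hypothetical placeholders chosen only for illustration.
static double default_model(int n, double M)
{
    const double eps       = 1.0;
    const double comm_host = 1.0 + 0.0005 * M;    // us: host-host transfer
    const double comp_host = 0.0010 * M;          // us: CPU reduction
    const double copy      = 2.0 + 0.0003 * M;    // us: one cudaMemcpy
    return std::log2((double)n) * (eps * comm_host + comp_host + 2.0 * copy);
}

static double br_dd_model(int n, double M)
{
    const double eps      = 1.0;
    const double comm_gdr = 2.0 + 0.0004 * M;     // us: GDR / pipelined GPU-GPU transfer
    const double overhead = 10.0;                 // us: CUDA kernel launch (~10 us per the talk)
    const double comp_gpu = 0.0001 * M;           // us: GPU reduction
    return std::log2((double)n) * (eps * comm_gdr + overhead + comp_gpu);
}

int main()
{
    for (double M = 4; M <= 4 * 1024 * 1024; M *= 4)
        std::printf("n=1024, M=%9.0f B: Default %10.1f us, BR-DD %10.1f us\n",
                    M, default_model(1024, M), br_dd_model(1024, M));
    return 0;
}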
• CUDA kernel based designs significantly improve the performance of compute-oriented collective operations
– MPI_Reduce, MPI_Allreduce and MPI_Scan
– CUDA kernel based reduction operations → fast computation
– GPUDirect feature → efficient GPU-to-GPU communication
• Future work
– Performing application-level evaluation
• Deep learning frameworks such as Caffe
– Will be included in a future release of the MVAPICH2-GDR library
Conclusion
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
• CUDA Kernels
– Example: vector addition for the MPI_SUM operation
Reduction Operations on GPU
// Grid-stride loop: each CUDA thread accumulates several elements of src into dst.
template <class T>
__global__ void vector_addition(T *dst, const T *src, size_t count)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < count; i += blockDim.x * gridDim.x)
        dst[i] += src[i];
}
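A possible host-side launch of this kernel (the block/grid configuration and the stream are illustrative choices, not taken from the slides):

#include <algorithm>
#include <cuda_runtime.h>

// Launch vector_addition with a grid-stride configuration on a caller-provided stream;
// the stream is synchronized so the reduced data is ready before it is communicated.
template <class T>
void launch_vector_addition(T *d_dst, const T *d_src, size_t count, cudaStream_t stream)
{
    const int threads = 256;                                                    // typical block size
    const int blocks  = (int)std::min<size_t>((count + threads - 1) / threads, 1024);
    vector_addition<T><<<blocks, threads, 0, stream>>>(d_dst, d_src, count);
    cudaStreamSynchronize(stream);
}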
More information about optimizing your CUDA kernels:
http://developer.download.nvidia.com/books/cuda-by-example/cuda-by-example-sample.pdf
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html