TRANSCRIPT
CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda
Speaker: Sourav Chakraborty
Network-Based Computing Laboratory
Department of Computer Science and Engineering, The Ohio State University
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Outline
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure: Multi-core processors; high-performance interconnects (InfiniBand: <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop/s DP on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]
• Growth of accelerator-enabled clusters in the last 3 years
– 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
– From the Top500 list (http://www.top500.org)
Accelerators in HPC Systems
[Chart: count of accelerator-enabled systems (NVIDIA Kepler, NVIDIA Fermi, Intel Xeon Phi) from June 2013 to Nov 2015]
• Scientific parallel applications spend a considerable amount of time in collective communication operations
– E.g., deep learning frameworks such as Caffe
Motivation – Collectives in Applications
[Figure: GPU Node 1 through GPU Node N perform GPU computations between MPI_Bcast/MPI_Scatter and MPI_Gather/MPI_Reduce phases]
• Scientific parallel applications spend a considerable amount of time in collective communication operations
– Pure communication collectives: Broadcast, Gather, Scatter, ...
– Compute-oriented collectives: Reduce, Allreduce, Scan
– The communication part is highly optimized
• Compute-oriented collective operations are not fully optimized for GPU clusters
– The CPU is doing all the work
– GPU resources are not fully utilized
Motivation – Collective Reduction Operations
• Fast computation
– Massive parallelism
• Efficient communication
– NVIDIA GPUDirect RDMA
Motivation – Powerful GPU Resources
• GPU features are not being utilized for all collectives
• Can we leverage these features to further optimize the compute-oriented collectives on GPU clusters?
http://www.nvidia.com/object/gpu-accelerated-computing.html https://developer.nvidia.com/gpudirect
• How to eliminate explicit data movements between host and GPU memories?
– The cudaMemcpy call is expensive!
• How to handle the GPU-to-GPU communication after the computations finish?
• When to use the GPU for compute-oriented collective operations?
– Launching kernels brings overhead; how to minimize it? (see the timing sketch below)
Problem Statement
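To make the kernel-launch overhead in the last point concrete, a minimal timing sketch (an illustrative micro-benchmark, not part of the original slides; the ~10 us figure quoted later in the talk is hardware- and driver-dependent) could look like this:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() { }    // no work: only the launch/scheduling cost remains

int main()
{
    const int iters = 1000;

    empty_kernel<<<1, 1>>>();          // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();      // launches are queued asynchronously
    cudaDeviceSynchronize();           // wait for all of them to drain
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    std::printf("amortized launch + completion cost: %.2f us per kernel\n", us);
    return 0;
}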
• Design a framework that exploits CUDA kernels to efficiently handle compute-oriented collectives
• Propose extensions to the existing collective algorithms to make them GPU-aware compute-oriented algorithms
– MPI_Reduce, MPI_Allreduce and MPI_Scan
• Detailed analysis and evaluation using different GPU systems, including a Cray CS-Storm system
Overview
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Outline
• Existing designs
1. Explicitly copy the data from GPU to host memory
2. Host-to-host communication to remote processes
3. Perform the computation on the CPU
4. Explicitly copy the data from host to GPU memory
• Proposed designs (a sketch follows the figure below)
1. GPU-to-GPU communication
• NVIDIA GPUDirect RDMA (GDR)
• Pipeline through the host for large messages
2. Perform the computation on the GPU
• Efficient CUDA kernels
Design Consideration
[Figure: Node A and Node B, each with a CPU, host memory, a GPU, PCIe, and an IB adapter; the existing design takes steps 1-4 through host memory, while the proposed design takes steps 1-2 directly between the GPUs]
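As a rough illustration of the two data paths above, here is a minimal sketch of a single pairwise reduction step written both ways. It assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) and the vector_addition kernel shown on the backup slide; buffer names and the launch configuration are hypothetical:

#include <mpi.h>
#include <cuda_runtime.h>

// Reduction kernel from the backup slide (declaration only).
template <class T>
__global__ void vector_addition(T *dst, const T *src, size_t count);

// (1) Existing design: stage through host memory, reduce on the CPU (steps 1-4).
void reduce_step_host(float *d_local, float *h_local, float *h_remote,
                      int count, int src_rank, MPI_Comm comm)
{
    cudaMemcpy(h_local, d_local, count * sizeof(float), cudaMemcpyDeviceToHost);   // 1. D->H copy
    MPI_Recv(h_remote, count, MPI_FLOAT, src_rank, 0, comm, MPI_STATUS_IGNORE);    // 2. host-host comm
    for (int i = 0; i < count; ++i) h_local[i] += h_remote[i];                     // 3. CPU reduction
    cudaMemcpy(d_local, h_local, count * sizeof(float), cudaMemcpyHostToDevice);   // 4. H->D copy
}

// (2) Proposed design: a CUDA-aware MPI delivers directly into device memory
//     (GDR for small messages, host-staged pipeline for large ones), and the
//     reduction runs as a CUDA kernel (steps 1-2).
void reduce_step_gpu(float *d_local, float *d_remote,
                     int count, int src_rank, MPI_Comm comm)
{
    MPI_Recv(d_remote, count, MPI_FLOAT, src_rank, 0, comm, MPI_STATUS_IGNORE);    // 1. GPU-GPU comm
    vector_addition<float><<<128, 256>>>(d_local, d_remote, (size_t)count);        // 2. GPU reduction
    cudaDeviceSynchronize();
}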
• Tree-based K-nomial algorithm (a skeleton sketch follows the figure below)
– Only the non-leaf nodes perform the reduction operation
• Pros & Cons
– Load balance, efficient/scalable communication
– Higher average latency
K-nomial MPI_Reduce
[Figure: binomial reduction tree over ranks 0-7; in steps [1]-[3] partial results are combined toward rank 0]
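A minimal skeleton of the binomial (k = 2) case of this tree, reducing toward rank 0, might look as follows. It is a sketch rather than the actual MVAPICH2 code path; reduce_local is a hypothetical helper that performs the local reduction either as a CPU loop or as a CUDA kernel launch:

#include <mpi.h>

// At step s, ranks with bit s set send their partial result to (rank - mask) and
// drop out; the remaining (non-leaf) ranks receive and reduce, so only they compute.
void binomial_reduce(float *buf, float *tmp, int count, MPI_Comm comm,
                     void (*reduce_local)(float *dst, const float *src, int count))
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            MPI_Send(buf, count, MPI_FLOAT, rank - mask, 0, comm);
            return;                                   // this rank's subtree is finished
        }
        if (rank + mask < size) {
            MPI_Recv(tmp, count, MPI_FLOAT, rank + mask, 0, comm, MPI_STATUS_IGNORE);
            reduce_local(buf, tmp, count);            // accumulate the child's partial result
        }
    }
    // After log2(n) steps, rank 0 holds the fully reduced result.
}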
• Host-based Binomial-Reduce (Default):
$\log_2 n \times (\epsilon \times Comm_{Host}(M) + Comp_{Host}(M) + 2 \times Copy(M))$
– Expensive cudaMemcpy before/after the reduction operation
– Relatively slow computation on the CPU; fast host-host communication
• GPU-based Binomial-Reduce (BR-DD):
$\log_2 n \times (\epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M))$
– Fast, highly parallelized computation on the GPU
– Overhead of launching CUDA kernels (~10 us); GDR-based GPU-GPU communication
(M: message size; ε: constant variant of the tree initialization)
Cost Analysis
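Comparing the two models as reconstructed above, the $\log_2 n$ factor cancels, so (as a sketch) the GPU-based BR-DD design is expected to win whenever, per tree level,

$\epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M) < \epsilon \times Comm_{Host}(M) + Comp_{Host}(M) + 2 \times Copy(M)$

that is, whenever the kernel-launch overhead is outweighed by the avoided cudaMemcpy traffic and the faster GPU reduction, which is why the benefit grows with the message size M.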
• Gather-first algorithm (a sketch follows the figure below)
– The root gathers all the data and performs the computation
• Since only the root needs the final result
• Pros & Cons
– Low computation overhead
– Poor scalability
Gather-first MPI_Reduce
[Figure: ranks 0-7 send their data directly to the root, which performs the reduction]
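A minimal sketch of the gather-first idea with GPU-based reduction at the root (GR-DD-style), assuming a CUDA-aware MPI library and the vector_addition kernel from the backup slide; buffer names and the launch configuration are hypothetical:

#include <mpi.h>
#include <cuda_runtime.h>

template <class T>
__global__ void vector_addition(T *dst, const T *src, size_t count);   // backup slide

// The root gathers one chunk of `count` floats from every rank into a device
// buffer of size*count elements, then reduces all remote chunks into d_buf on the GPU.
void gather_first_reduce(float *d_buf, float *d_gathered, int count,
                         int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Gather(d_buf, count, MPI_FLOAT,
               d_gathered, count, MPI_FLOAT, root, comm);    // device buffers with CUDA-aware MPI

    if (rank == root) {
        for (int r = 0; r < size; ++r) {
            if (r == root) continue;                          // root's own chunk is already in d_buf
            vector_addition<float><<<128, 256>>>(
                d_buf, d_gathered + (size_t)r * count, (size_t)count);
        }
        cudaDeviceSynchronize();                              // only the root computes
    }
}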
• Host-based Gather and Reduce (GR-H-HH):
$(n-1) \times (Comm_{Host}(M) + Comp_{Host}(M) + 2 \times Copy(M))$
• Host-based Gather, GPU-based Reduce (GR-HH):
$(n-1) \times (Comm_{Host}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M) + 2 \times Copy(M))$
• GPU-based Gather and Reduce (GR-DD):
$(n-1) \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M)$
– Could suffer from scalability issues → good for small messages or small scale
– Less affected by CUDA kernel launch overhead → good for small messages
Cost Analysis
• Recursive doubling algorithm (a skeleton sketch follows the figure below)
– Every processor needs to perform the computation
• Pros & Cons
– Load balance, efficient/scalable communication
– Higher average latency
GPU-based MPI_Allreduce and MPI_Scan
[Figure: recursive doubling exchange among ranks 0-7 over steps [1]-[3]]
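A minimal recursive-doubling skeleton matching the pattern above, for power-of-two process counts and a commutative operation (a sketch, not the MVAPICH2 implementation; reduce_local is the same hypothetical helper as in the earlier sketch):

#include <mpi.h>

// In step s every rank exchanges its partial result with the rank differing in
// bit s and reduces, so after log2(n) steps all ranks hold the final result
// (every process computes, unlike the reduction tree).
void recursive_doubling_allreduce(float *buf, float *tmp, int count, MPI_Comm comm,
                                  void (*reduce_local)(float *dst, const float *src, int count))
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        MPI_Sendrecv(buf, count, MPI_FLOAT, partner, 0,
                     tmp, count, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        reduce_local(buf, tmp, count);     // both partners accumulate each other's data
    }
}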
• GPU-based Recursive Doubling (RD-DD):
$\log_2 n \times (\epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M))$
– Same as the BR-DD MPI_Reduce model
• GPU-based Binomial-Reduce-Broadcast (BRB-DD):
$\log_2 n \times (2 \times \epsilon \times Comm_{GDR}(M) + Overhead_{GPU}(M) + Comp_{GPU}(M))$
Cost Analysis
Communication | Computation | Design | Algorithm | Benefit
Host<->Host | CPU | BR-H-HH (Default) | Binomial-Reduce | Large scale, small messages
Host<->Host | CPU | RD-H-HH (Default) | Recursive doubling | Large scale, small messages
Host<->Host | CPU | GR-H-HH | Gather-Reduce | Small scale, small messages
Host<->Host | GPU | GR-HH | Gather-Reduce | Small scale, small messages
Host<->Device (GDR) | GPU | GR-HD / GR-DH | Gather-Reduce | Small scale, small messages
Device<->Device (GDR) | GPU | GR-DD | Gather-Reduce | Small scale, small messages
Device<->Device (GDR) | GPU | BR-DD | Binomial-Reduce | Large messages for any scale
Device<->Device (GDR) | GPU | BRB-DD | Binomial-Reduce-Bcast | Large messages for any scale
Device<->Device (GDR) | GPU | RD-DD | Recursive doubling | Large messages for any scale
Host<->Device (GDR) | GPU | RD-HD / RD-DH | Recursive doubling | Large messages for any scale
Alternative and Extended Designs
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Outline
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Used by more than 2,550 organizations in 79 countries
– More than 360,000 (>0.36 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '15 ranking)
• 10th-ranked 519,640-core cluster (Stampede) at TACC
• 13th-ranked 185,344-core cluster (Pleiades) at NASA
• 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
1. Wilkes cluster @ University of Cambridge
– 2 NVIDIA K20c GPUs per node
– Used for inter-node experiments
• Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
– Cray CS-Storm system
– 8 NVIDIA K80 GPUs per node (= 16 NVIDIA K40 GPUs per node)
– Used for intra-node experiments
• Up to 96 GPUs over 11 nodes
Experimental Environments
Evaluation - MPI_Reduce @ Wilkes (32 GPUs)
[Charts: latency (us) vs. message size (bytes), small and large message ranges, comparing Default, BR-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH]
Gather-first approaches win for small messages
The K-nomial GPU-based approach wins for large messages
Evaluation - MPI_Reduce @ CSCS (96 GPUs)
[Charts: latency (us) vs. message size for small (4 B - 4 KB) and large (16 KB - 4 MB) messages, comparing Default, BR-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH]
Gather-first approaches win for small messages
The K-nomial GPU-based approach wins for large messages
Evaluation - MPI_Allreduce
[Charts: latency (ms) vs. message size (bytes) on 96 GPUs @ CSCS, and latency (ms) vs. system size (2-32 nodes) on 32 GPUs @ Wilkes, comparing Default, RD-DD, and BRB-DD]
Good scalability (32 GPUs @ Wilkes)
Evaluation - MPI_Scan
[Charts: latency (ms) vs. message size (64 KB - 4 MB) on 96 GPUs @ CSCS, and latency (ms) vs. system size (2-32 nodes) for a 2 MB message on 32 GPUs @ Wilkes, comparing Default, RD-DD, and RD-HD]
Good scalability (32 GPUs @ Wilkes)
Prediction
• Use the proposed analytical models to predict the performance for large-scale GPU clusters
[Charts: latency (us) vs. message size (4 B - 4 MB): model-based estimation vs. experimental results on 32 GPUs of the Wilkes cluster, and a prediction for 1024 GPUs comparing Default with RD-DD/BR-DD]
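One possible way to turn the reconstructed models into such a prediction is sketched below; every constant is an illustrative placeholder, not a measured value from the paper:

#include <cstdio>
#include <cmath>

// Evaluate the two reconstructed cost models for n GPUs and message size M (bytes).
// All latency constants are hypothetical placeholders chosen only for illustration.
static double default_model(int n, double M)
{
    const double eps       = 1.0;
    const double comm_host = 1.0 + 0.0005 * M;    // us: host-host transfer
    const double comp_host = 0.0010 * M;          // us: CPU reduction
    const double copy      = 2.0 + 0.0003 * M;    // us: one cudaMemcpy
    return std::log2((double)n) * (eps * comm_host + comp_host + 2.0 * copy);
}

static double br_dd_model(int n, double M)
{
    const double eps      = 1.0;
    const double comm_gdr = 2.0 + 0.0004 * M;     // us: GDR / pipelined GPU-GPU transfer
    const double overhead = 10.0;                 // us: CUDA kernel launch (~10 us per the talk)
    const double comp_gpu = 0.0001 * M;           // us: GPU reduction
    return std::log2((double)n) * (eps * comm_gdr + overhead + comp_gpu);
}

int main()
{
    for (double M = 4; M <= 4 * 1024 * 1024; M *= 4)
        std::printf("n=1024, M=%9.0f B: Default %10.1f us, BR-DD %10.1f us\n",
                    M, default_model(1024, M), br_dd_model(1024, M));
    return 0;
}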
• CUDA kernel based designs significantly improve the performance of compute-oriented collective operations
– MPI_Reduce, MPI_Allreduce and MPI_Scan
– CUDA kernel based reduction operations → fast computation
– GPUDirect feature → efficient GPU-to-GPU communication
• Future work
– Performing application-level evaluation
• Deep learning frameworks such as Caffe
– Will be included in a future release of the MVAPICH2-GDR library
Conclusion
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
• CUDA Kernels
– Example: vector addition for the MPI_SUM operation
Reduction Operations on GPU
// Grid-stride loop: each CUDA thread accumulates several elements of src into dst.
template <class T>
__global__ void vector_addition(T *dst, const T *src, size_t count)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < count; i += blockDim.x * gridDim.x)
        dst[i] += src[i];
}
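A possible host-side launch of this kernel (the block/grid configuration and the stream are illustrative choices, not taken from the slides):

#include <algorithm>
#include <cuda_runtime.h>

// Launch vector_addition with a grid-stride configuration on a caller-provided stream;
// the stream is synchronized so the reduced data is ready before it is communicated.
template <class T>
void launch_vector_addition(T *d_dst, const T *d_src, size_t count, cudaStream_t stream)
{
    const int threads = 256;                                                    // typical block size
    const int blocks  = (int)std::min<size_t>((count + threads - 1) / threads, 1024);
    vector_addition<T><<<blocks, threads, 0, stream>>>(d_dst, d_src, count);
    cudaStreamSynchronize(stream);
}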
More information about optimizing your CUDA kernels:
http://developer.download.nvidia.com/books/cuda-by-example/cuda-by-example-sample.pdf
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html