OpenSHMEM Non-Blocking Data Movement Operations with MVAPICH2-X: Early Experiences
PGAS Applications Workshop (PAW’16)
Khaled Hamidouche*, Jie Zhang*, Karen Tomko+, D. K. Panda*
*: The Ohio State University (OSU), +: Ohio Supercomputing Center (OSC)
E-mail: hamidouc@cse.ohio-state.edu
PAW’16 2 Network Based Computing Laboratory

Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure: building blocks and example systems]
– Multi-core Processors
– High Performance Interconnects - InfiniBand: <1 usec latency, 200 Gbps bandwidth
– Accelerators/Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
– SSD, NVMe-SSD, NVRAM
– Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

Parallel Programming Models Overview
[Figure: three models, each drawn with processes P1, P2, P3]
– Shared Memory Model (DSM): all processes attached to one shared memory
– Distributed Memory Model (MPI (Message Passing Interface)): one private memory per process
– Partitioned Global Address Space (PGAS) (OpenSHMEM, GA, UPC, Chapel, X10, CAF, …): per-process memories presented as a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• Additionally, OpenMP can be used to parallelize computation within the node
• Each model has strengths and drawbacks and suits different problems or applications

The OpenSHMEM Memory Model
• Symmetric data objects
– Global variables
– Allocated using the collective shmem_malloc, shmem_memalign, shmem_realloc routines
• Globally addressable: objects have the same
– Type
– Size
– Virtual address or offset at all PEs
– Address of a remote object can be calculated based on info of the local object
• OpenSHMEM 1.3 introduces Non-Blocking Data Movement Operations
[Figure: Symmetric Objects. Virtual address spaces of PE 0 and PE 1, each holding object a (global) and object b (allocated)]
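
Because a symmetric object sits at the same offset on every PE, translating a local address into a remote one takes one subtraction and one addition. The following is a minimal sketch, assuming each PE knows the base address of the other PE's symmetric heap. The name remote_address and its parameters are hypothetical and are not part of the OpenSHMEM API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenSHMEM routine): translate the address of a
 * local symmetric object into the address of the same object on a remote PE.
 * Assumes both symmetric heaps share one layout, so only the base differs. */
static uintptr_t remote_address(uintptr_t local_obj,
                                uintptr_t local_heap_base,
                                uintptr_t remote_heap_base)
{
    assert(local_obj >= local_heap_base);   /* object must lie in the heap */
    uintptr_t offset = local_obj - local_heap_base;
    return remote_heap_base + offset;
}
```

For instance, an object at 0x1040 in a heap based at 0x1000 maps to 0x8040 on a PE whose heap starts at 0x8000. When both heaps start at the same virtual address, the result equals the local address.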

Non-Blocking Data Movement Operations
• Blocking operation => Optimize for latency
– Buffer reuse after returning from the call
• Non-Blocking operation => Optimize computation/communication overlap
– Return as soon as we post the request
– Completion is ensured later on a completion/synchronization call
– Buffer reuse after completion
– API extension with _nbi (ex: shmem_putmem_nbi)
– shmem_fence/shmem_barrier: complete all previous operations
[Figure: two timelines. Blocking semantics: shmem_putmem, then computation. Non-blocking semantics: shmem_putmem_nbi, computation overlapping the transfer, then shmem_fence]
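
The post, compute, complete pattern in the non-blocking timeline can be sketched in C as follows. The helper put_with_overlap is hypothetical. The sketch completes the transfer with shmem_quiet, which is the routine the OpenSHMEM 1.3 specification defines for completing outstanding non-blocking operations. The single-process stand-ins under #else are an assumption made only so that the sketch builds without an OpenSHMEM installation; they complete the copy immediately, which a real library does not promise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef USE_OPENSHMEM
#include <shmem.h>   /* real OpenSHMEM 1.3 library */
#else
/* Stand-ins for a single process (assumption, see the text above). */
static void shmem_putmem_nbi(void *dest, const void *source,
                             size_t nelems, int pe)
{
    (void)pe;
    memcpy(dest, source, nelems);   /* a real put may still be in flight */
}
static void shmem_quiet(void) { }
#endif

/* Hypothetical helper: start a put, run unrelated computation while the
 * transfer may still be in flight, then complete the put.
 * With a real library, remote_dst must be a symmetric object. */
static double put_with_overlap(void *remote_dst, const void *src,
                               size_t nbytes, int target_pe,
                               const double *work, size_t nwork)
{
    assert(remote_dst != NULL && src != NULL);
    double sum = 0.0;

    shmem_putmem_nbi(remote_dst, src, nbytes, target_pe); /* post and return */

    for (size_t i = 0; i < nwork; i++)   /* computation that never touches src */
        sum += work[i];

    shmem_quiet();   /* after this call src may be reused */
    return sum;
}
```

The key constraint is that src must not be modified between the _nbi call and the completion call, which is the "buffer reuse after completion" rule from the slide.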

Outline
• Introduction
• Contributions
• Alternative Designs
• Performance Evaluation
• Conclusions and Future Work

Contributions
• Propose high-performance designs and implementations of OpenSHMEM NBI operations on top of the MVAPICH2-X library.
• Extend OMB (OSU Micro-Benchmarks) with new NBI benchmarks for evaluating OpenSHMEM 1.3 NBI operations in a standardized manner.
• Design communication kernels including 3D stencil and all-to-all patterns using OpenSHMEM.
• Demonstrate the benefits and impact of OpenSHMEM NBI operations on both latency and overlap metrics.

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,675 organizations in 83 countries
– More than 399,000 (> 0.39 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jun ‘16 ranking)
  • 12th ranked 519,640-core cluster (Stampede) at TACC
  • 15th ranked 185,344-core cluster (Pleiades) at NASA
  • 31st ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Stampede at TACC (12th in Jun’16, 462,462 cores, 5.168 PFlops)

MVAPICH2-X for Hybrid MPI + PGAS Applications
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
– Available since 2012 (starting with MVAPICH2-X 1.9)
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.3 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
– Scalable inter-node and intra-node communication: point-to-point and collectives

OpenSHMEM Design in MVAPICH2-X
• OpenSHMEM stack based on the OpenSHMEM Reference Implementation
• OpenSHMEM communication over the MVAPICH2-X Runtime
– Uses active messages, atomic and one-sided operations and remote registration cache
[Figure: layered software stack, top to bottom]
– OpenSHMEM API: Data Movement, Collectives, Atomics, Memory Management
– Minimal Set of Internal API: Communication API, Symmetric Memory Management API
– MVAPICH2-X Runtime: Active Messages, One-sided Operations, Remote Atomic Ops, Enhanced Registration Cache
– InfiniBand, RoCE, iWARP
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Intra-node Alternative Designs
• Shared Memory based
– Symmetric memory (heap) in shared memory
– Direct copy from source to destination
– Latency optimized (no overlap)
• CMA based
– Shared memory is not available
– Direct copy (zero-copy)
– Latency optimized (no overlap)
• IB Loopback based
– Offload the copy operations to an external engine (IB)
– Exchange the l_key/r_key during initialization
– Overlap optimized (return as soon as we post the IB operation)
• On-load engine based (work in progress)
– Kernel-based helper threads
– Offload the copy operations to helper threads (similar to CMA)
– Optimized for both latency and overlap

Inter-node Alternative Designs
• List-based design
– On the call: create and enqueue an internal request
  • Associate an IB completion event with the request
– In the progress engine: dequeue and delete a completed request
– During the completion call (fence): poll the progress engine until the list is empty
– Overhead of creating/deleting the request in the critical path
• Counter-based design
– Global integer counter
– On the call: increment the counter
– In the progress engine: decrement the counter
– During the completion call: poll the progress engine until counter == 0
– Minimal overhead in the critical path
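
The counter-based design can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the MVAPICH2-X implementation: the types sim_cq and nbi_tracker and all function names are hypothetical, and the simulated network finishes every operation instantly so that the progress engine always has something to reap.

```c
#include <assert.h>

/* Hypothetical stand-in for an IB completion queue: it only counts how many
 * posted operations the simulated network has already finished. */
typedef struct {
    int ready;
} sim_cq;

typedef struct {
    int outstanding;   /* the single global counter from the slide */
    sim_cq *cq;
} nbi_tracker;

/* On the call: increment the counter and hand the operation to the network.
 * No request object is allocated, unlike the list-based design. */
static void nbi_post(nbi_tracker *t)
{
    t->outstanding++;
    t->cq->ready++;    /* simulation only: completion is available at once */
}

/* Progress engine: reap at most one completion and decrement the counter. */
static int nbi_progress(nbi_tracker *t)
{
    if (t->cq->ready == 0)
        return 0;
    t->cq->ready--;
    t->outstanding--;
    return 1;
}

/* Completion call: poll the progress engine until the counter reaches zero.
 * Returns the number of polls, for illustration. */
static int nbi_complete_all(nbi_tracker *t)
{
    int polls = 0;
    while (t->outstanding != 0) {
        nbi_progress(t);
        polls++;
    }
    assert(t->outstanding == 0);
    return polls;
}
```

The per-operation cost on the posting path is one integer increment, which is what the slide means by minimal overhead in the critical path. A list-based variant would instead allocate, enqueue and later free a request object for every operation.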

Experimental Setup
• Stampede System @ TACC
– 16-core Sandy Bridge nodes
– Mellanox FDR interconnect
• MVAPICH2-X 2.2 RC1
– Unified Communication Runtime (UCR) support
– OpenSHMEM 1.3 git branch of the standard implementation
• Extension of OpenSHMEM OMB with
– NBI pt-pt latency tests (for put and get)
– Message rate benchmarks
– Overlap benchmarks
• Redesigned All-to-All and 3D-Stencil benchmarks with NBI interface
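
The slides do not state how the overlap benchmarks compute their percentage. One common definition, assumed here, compares the communication time that remains exposed when computation is injected with the pure communication time. The function overlap_percent and its parameter names are hypothetical.

```c
#include <assert.h>

/* Hypothetical overlap metric (an assumption, not taken from the slides):
 *   t_comm    - time of the communication alone
 *   t_compute - time of the injected computation alone
 *   t_total   - time of communication and computation issued together
 * Returns the share of t_comm hidden behind computation, clamped to 0..100. */
static double overlap_percent(double t_comm, double t_compute, double t_total)
{
    assert(t_comm > 0.0);
    double exposed = t_total - t_compute;   /* communication not hidden */
    double pct = 100.0 * (1.0 - exposed / t_comm);
    if (pct < 0.0)
        pct = 0.0;
    if (pct > 100.0)
        pct = 100.0;
    return pct;
}
```

Under this definition, a run whose total time equals the compute time alone scores 100%, and a run whose total time is the sum of both parts scores 0%.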

Intra-node Evaluation
[Figure: three panels comparing Blocking, NBI_SHM and NBI_LB. Latency (us) vs. message size from 1 byte to 4 KB; latency (us) vs. message size from 8 KB to 1 MB; message rate vs. message size (bytes)]
• LB-based design has overhead in latency
• SHM-based design achieves very good message rate (3X) improvements compared to the LB design
• Overhead in latency for small messages
– Software overhead due to the synchronization operation

Inter-node Evaluation
[Figure: three panels. Latency (us) vs. message size (bytes), Blocking vs. NBI; overlap (%) vs. message size from 32 KB to 1 MB, Blocking vs. NBI; message rate vs. message size from 2 bytes to 2 MB, Blocking, NBI and Blockin_Opt]
• NBI delivers the same latency performance as Blocking
• NBI design achieves very good message rate (5X) compared to Blocking
• Maximal overlap potential
– Hide the communication overhead with RDMA

3D-Stencil Communication Kernel Evaluation
[Figure: two panels of execution time vs. # of PEs (8, 27, 64, 125), Blocking vs. NBI. Left: Small Input Size (512 Byte). Right: Large Input Size (2 KByte)]
• 50%, 30% and 15% performance improvement on 27, 64 and 125 cores
• Overlap benefits
– Small message: both computation/communication and communication/communication overlap
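
The PE counts in the figure (8, 27, 64, 125) are perfect cubes, which is consistent with arranging the PEs in a side x side x side grid. The sketch below computes the six face neighbors of a PE, assuming that layout, x-fastest rank ordering and periodic boundaries. None of these details are stated in the slides, and stencil_neighbors is a hypothetical helper.

```c
#include <assert.h>

/* Hypothetical helper: fill out[6] with the -x, +x, -y, +y, -z, +z neighbours
 * of PE `rank` in a side x side x side grid with wrap-around boundaries.
 * Ranks are assumed to be numbered as x + side * (y + side * z). */
static void stencil_neighbors(int rank, int side, int out[6])
{
    assert(side > 0 && rank >= 0 && rank < side * side * side);

    int x = rank % side;
    int y = (rank / side) % side;
    int z = rank / (side * side);

    int xm = (x + side - 1) % side, xp = (x + 1) % side;
    int ym = (y + side - 1) % side, yp = (y + 1) % side;
    int zm = (z + side - 1) % side, zp = (z + 1) % side;

    out[0] = xm + side * (y + side * z);
    out[1] = xp + side * (y + side * z);
    out[2] = x + side * (ym + side * z);
    out[3] = x + side * (yp + side * z);
    out[4] = x + side * (y + side * zm);
    out[5] = x + side * (y + side * zp);
}
```

With non-blocking puts, a PE can post the transfers to all six neighbors back to back and complete them with a single synchronization call, which is one way to obtain the communication/communication overlap mentioned on the slide.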

Conclusions
• Highlighted the alternative approaches for designing NBI operations
– Both intra-node and inter-node operations on RDMA networks
• Demonstrated the benefits of OpenSHMEM NBI operations
– Message rate, overlap
• Evaluated the impact of such semantics/designs at application level
• The support will be available with the next release of MVAPICH2-X
• The new benchmarks will be available with the next release of OMB

Thank You!
hamidouc@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH Project: http://mvapich.cse.ohio-state.edu/