rmalloc() and rpipe() – a ugni-based distributed remote memory … · 2018. 6. 14. · – upto...
rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory Allocator and Access Library
for One-sided Messaging
Udayanga Wickramasinghe
Indiana University
Andrew Lumsdaine
Pacific Northwest National Laboratory
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
RDMA Network Communication
Network Op: Kernel + CPU direct
RDMA: Kernel + CPU bypass, Zero Copy
Designed for one-sided communication!!
One-sided Communication
Advantages
§ Great for Random Access + Irregular Data patterns
§ Less Overhead/High Performance
Disadvantages
§ Explicit Synchronization – separate from data-path!!
RDMA Challenges – Communication
[Diagram: Send and Recv sides, each with Pin and NIC register/match steps, followed by exchange and comm]
§ Buffer Pin/Registration
§ Rendezvous
§ Model imposed overheads
RDMA Challenges – Synchronization
[Diagram: register/match, then an Exposure Epoch and an Access Epoch, with comm operations separated by Barrier/Fence]
How to make reads and updates visible? “in-use”/“re-use”
RDMA Challenges – Dynamic Memory Management
Cluster-wide allocations → costly in a dynamic context, i.e. PGAS
RDMA Challenges – Programming
[Diagram labels: register/match; exchange; RDMA PUT 0x1F0000; Load 0x1F0000; Inc 0x1F0000,1; RDMA PUT 0x1F0000; register/match; RDMA PUT 0x1F0000. Annotations: Data Race!!!; Delivery completion; Buffer re-use]
Challenges – Programming
§ Enforcing “in-use”/“re-use” semantics
– Flow Control – Credit based, Counter based, polling (CQ based)
§ Enforcing Completion semantics
– MPI 3.0 Active/Passive – barriers, fence, lock, unlock, flush
– GAS/PGAS based (SHMEM, X10, Titanium) – futures, barriers, locks, actions
– GASNet like (RDMA) Libraries – user has to implement
§ Explicit and Complex to implement for applications!!
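The credit-based flow control named on this slide can be sketched as follows. This is a minimal illustration only: `credit_window_t` and its functions are hypothetical names, not part of rmalloc/rpipe, uGNI, or MPI, and a real implementation would return credits over the network rather than through a local call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credit window: one credit per remote buffer slot. */
typedef struct {
    size_t credits;   /* slots the sender may still fill */
    size_t capacity;  /* total slots advertised by the receiver */
} credit_window_t;

static void credit_init(credit_window_t *w, size_t capacity) {
    w->credits = capacity;
    w->capacity = capacity;
}

/* Sender side: take a credit before issuing a PUT; false means "wait". */
static bool credit_try_send(credit_window_t *w) {
    if (w->credits == 0)
        return false;          /* every remote slot is still "in-use" */
    w->credits--;
    return true;
}

/* The receiver consumed one slot and handed the credit back ("re-use"). */
static void credit_return(credit_window_t *w) {
    assert(w->credits < w->capacity);
    w->credits++;
}
```

The sender never overwrites a slot the receiver has not consumed, which is the “in-use”/“re-use” guarantee; the cost is that the application has to manage the credits explicitly.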
Challenges – Summary
§ Low overhead, high-throughput communication?
– Eliminate unnecessary overheads.
§ Dynamic On-demand RDMA Memory?
– Allocate/de-Allocate with heuristics support.
– Less coherence Traffic and maybe better utilization
§ Scalable Synchronization?
– Completion and Buffer in-use/re-use.
§ RDMA Programming abstractions for applications?
– No explicit synchronization – Let middleware transparently handle it.
– Expose light-weight RDMA ready memory and operations.
How rmalloc()/rpipe() meets these Challenges?

| Problem | Key Idea |
|---|---|
| Low Communication Overhead | Fast Path (MMIO vs Doorbell) Network Operation (in uGNI) with synchronized updates. |
| Dynamic RDMA Memory Mgmt | Per endpoint RDMA Dynamic Heap → Heuristics + Asymmetric Allocation |
| Synchronization | Notification Flags with Polling (NFP) |
| Programmability | A familiar Two-level Abstraction → allocator (rmalloc) + stream like channel (rpipe) → No explicit synchronization |
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
System Overview
System Overview
High Performance RDMA Channel
§ Expose Zero-copy RDMA ops
§ Interface/s
• rread()
• rwrite()
Enable Implicit Synchronization
§ NFP (Notified Flags with Polling)
System Overview
Allocates RDMA memory
§ Returns Network Compatible Memory
§ Dynamic Asymmetric Heap for RDMA
§ Interface/s
• rmalloc()
Allocation policies
§ Next-fit, First-fit
System Overview
Network Backend
§ Cray specific – uGNI
§ MPI 3.0 based (portability layer)
Cray uGNI
§ FMA/BTE Support
§ Memory Registration
§ CQ handling
“rmalloc”
Asymmetric heaps across cluster - 0 or more for each endpoint pair - dynamically created
“rmalloc” Allocation
Next-fit heuristic – return next available RDMA heap segment
rmalloc instance
L - local heap, S - shadow heap, R - remote heap
[Legend: unused / used]
Synchronization → a special bootstrap rpipe
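A next-fit search over heap segments could look like the sketch below. It is an illustration under assumptions, not the rmalloc implementation: `seg_heap_t`, `NSEG` and `next_fit` are hypothetical names, and the heap is modelled as a small fixed array of segments rather than registered RDMA memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSEG 8             /* hypothetical number of heap segments */

typedef struct {
    size_t size[NSEG];     /* size of each segment in bytes */
    bool   used[NSEG];     /* true while a segment is "in-use" */
    size_t cursor;         /* where the previous search stopped */
} seg_heap_t;

/* Next-fit: resume scanning at the cursor and take the first free
 * segment that is large enough. Returns the segment index or -1. */
static int next_fit(seg_heap_t *h, size_t want) {
    assert(h != NULL);
    for (size_t n = 0; n < NSEG; n++) {
        size_t i = (h->cursor + n) % NSEG;
        if (!h->used[i] && h->size[i] >= want) {
            h->used[i] = true;
            h->cursor = (i + 1) % NSEG;
            return (int)i;
        }
    }
    return -1;             /* no free segment fits */
}
```

Because the search resumes at the cursor, a segment freed behind it is not revisited until the scan wraps around, which keeps the common-case allocation short.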
“rmalloc” Allocation
Best-fit heuristic – find smallest possible RDMA heap segment
rmalloc instance
L - local heap, S - shadow heap, R - remote heap
[Legend: unused / used]
“rmalloc” Allocation
Worst-fit heuristic – find largest possible RDMA heap segment
rmalloc instance
L - local heap, S - shadow heap, R - remote heap
[Legend: unused / used]
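The best-fit and worst-fit heuristics differ only in which qualifying segment they keep. A minimal sketch, assuming the heap is described by two parallel arrays (segment sizes and used flags); `best_fit` and `worst_fit` are hypothetical helper names, not the rmalloc API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Best-fit: the smallest free segment that still holds `want` bytes.
 * Returns the segment index or -1 if nothing fits. */
static int best_fit(const size_t *size, const bool *used, size_t n, size_t want) {
    assert(size != NULL && used != NULL);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (used[i] || size[i] < want)
            continue;
        if (best < 0 || size[i] < size[best])
            best = (int)i;
    }
    return best;
}

/* Worst-fit: the largest free segment that holds `want` bytes.
 * Returns the segment index or -1 if nothing fits. */
static int worst_fit(const size_t *size, const bool *used, size_t n, size_t want) {
    assert(size != NULL && used != NULL);
    int worst = -1;
    for (size_t i = 0; i < n; i++) {
        if (used[i] || size[i] < want)
            continue;
        if (worst < 0 || size[i] > size[worst])
            worst = (int)i;
    }
    return worst;
}
```

Both heuristics inspect every segment before deciding, whereas next-fit stops at the first match.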
“rmalloc” Implementation
rmalloc_descriptor → manages local and remote virtual memory
rfree()/rmalloc() synchronization
§ When to synchronize? Buffer “in-use/re-use”
– Two options, use both for different allocation modes
• At allocation time –> latency (i.e. rmalloc())
• At de-allocation time –> throughput (i.e. rfree())
§ Deferred synchronization by rfree() → next-fit
– Coalesce tags from a sorted free list
– rmalloc updates state by RDMA into coalesced tag list in the remote
§ Immediate synchronization by rmalloc() → best-fit OR worst-fit
– Using a special bootstrap rpipe to synchronize at each allocated memory
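The "coalesce tags from a sorted free list" step of the deferred path can be illustrated as below. This is a sketch under assumptions: `free_tag_t` and `coalesce_tags` are hypothetical names, and a tag is modelled as an (offset, length) range inside the heap, which the slides do not specify.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t offset;  /* start of the freed range inside the heap */
    size_t length;  /* length of the freed range in bytes */
} free_tag_t;

/* Merge touching ranges in a list sorted by offset, in place.
 * Returns the number of tags left after merging. */
static size_t coalesce_tags(free_tag_t *tags, size_t n) {
    if (n == 0)
        return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        assert(tags[i].offset >= tags[out].offset);  /* input must be sorted */
        if (tags[out].offset + tags[out].length == tags[i].offset)
            tags[out].length += tags[i].length;      /* adjacent: extend */
        else
            tags[++out] = tags[i];                   /* gap: start a new tag */
    }
    return out + 1;
}
```

Merging before the update means fewer tags have to travel to the remote side, which fits the throughput goal stated for de-allocation time.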
“rpipe”– rwrite()
§ Completion Queue (CQ) (Light weight events by NIC/HCA)
1. Initiate RDMA Write. – Source buffer → “in-use”
2. Probe Local CQ for completion. Zero-copy source data to target.
3. Write to flag just after data.
4. Probe Local CQ success. Source buffer → “re-use”
5. Probe flag success. Target buffer is ready to load/ops.
6. Remote host consumes data. Source yet to know buffer → rfree()
[Diagram labels: Local CQ; Load 0x1F0000]
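The flag-after-data idea in the rwrite() steps can be mimicked in a single address space. The sketch below is an illustration only: `nfp_slot_t` and its functions are hypothetical names, ordinary memory copies and C11 atomics stand in for the RDMA data PUT and flag PUT, and no completion queue is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_BYTES 64      /* hypothetical payload capacity */

/* Target-side slot: payload plus its notification flag. */
typedef struct {
    unsigned char data[SLOT_BYTES];
    atomic_int    ready;   /* 0 = empty, 1 = data visible */
} nfp_slot_t;

static void nfp_init(nfp_slot_t *slot) {
    memset(slot->data, 0, sizeof slot->data);
    atomic_init(&slot->ready, 0);
}

/* Writer: copy the payload, then raise the flag (stands in for the
 * data PUT followed by the flag PUT). */
static void nfp_write(nfp_slot_t *slot, const void *src, size_t len) {
    assert(len <= SLOT_BYTES);
    memcpy(slot->data, src, len);
    atomic_store_explicit(&slot->ready, 1, memory_order_release);
}

/* Reader: poll the flag and consume the payload only once it is set. */
static bool nfp_poll(nfp_slot_t *slot, void *dst, size_t len) {
    assert(len <= SLOT_BYTES);
    if (atomic_load_explicit(&slot->ready, memory_order_acquire) == 0)
        return false;      /* nothing delivered yet */
    memcpy(dst, slot->data, len);
    atomic_store_explicit(&slot->ready, 0, memory_order_relaxed);  /* slot reusable */
    return true;
}
```

The reader needs no barrier or fence call from the application: seeing the flag set is the notification that the data before it has arrived.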
“rpipe”– rread()
1. Store data into target. – Target buffer → “in-use”.
2. Write to source flag. Data is now ready for rread()!!
3. RDMA Zero-Copy to source.
4. Write to flag just after data.
5. Probe Local CQ for completion.
[Diagram labels: Local CQ; Store 0x1F0000, val; rfree()]
Implementing rpipe(), rwrite() and rread()
§ An rpipe is created between two endpoints.
– A uGNI based Control Message (FMA Cmsg) network to lazy initialize rpipe i.e. GNI_CqCreate, GNI_EpCreate, GNI_EpBind
§ Implements rwrite(), rread() in uGNI
– Small/medium messages – FMA (Fast Memory Access)
– Large messages – BTE (Block Transfer Engine)
§ MPI portability Layer
– rpipe with MPI-3.0 windows + passive RMA
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
rpipe programming

    /* rank, peer, iswriter and SEND_VAL are assumed to be defined elsewhere */
    int main() {
    #define PIPE_WIDTH 8
        rpipe_t rp;
        rinit(&rank, NULL);
        // create a Half Duplex RMA pipe
        rpipe(rp, peer, iswriter, PIPE_WIDTH, HD_PIPE);
        raddr_t addr;
        int *ptr;
        if (iswriter) {
            addr = rmalloc(rp, sizeof(int));   // remote allocate
            ptr = rmem(rp, addr);
            *ptr = SEND_VAL;
            rwrite(rp, addr);                  // rpipe op
        } else {
            rread(rp, addr, sizeof(int));      // rpipe op
            ptr = rmem(rp, addr);
            rfree(addr);                       // free remote memory: release immediately after use!!
        }
    }
Experimentation Setup
Cray XC30 [Aries] / DragonFly
Big Red II+: 550 nodes / Rpeak 280 Tflops, 10 GB/s uni-directional, 15 GB/s bi-directional BW
Perf baseline → MPI/OSU Benchmark
Small/Medium Message Latency Comparison
[Figure: latency/operation (us) vs. message size (bytes), 1 to 8192 bytes. Series: MPI_RMA_FENCE, MPI_RMA_PASSIVE (lock_once), MPI_RMA_PSCW, MPI_SEND, RMA_PIPE_WRITE (uGNI_FMA_2PUTS), RMA_PIPE_WRITE (uGNI_FMA_PUTW_SYNC)]
§ Default Alloc = Next-Fit
§ FMA_PUT_W_SYNC – up to 6X speedup over MPI RMA
§ rpipe PUT_W_sync(s) < rpipe 2PUT(s)
Large Message Latency Comparison – rwrite()
[Figure: latency/operation (us) vs. message size (bytes), 1 byte to 4M. Series: MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_WRITE. Annotation: small/medium 0.65us]
§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 4K
– S ≥ 4K → FMA to BTE switch
Large Message Latency Comparison – rread()
[Figure: latency/operation (us) vs. message size (bytes), 1 byte to 4M. Series: MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_READ. Annotation: small/medium 2.14us]
§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 1K
– S < 4b → FMA_FETCH Atomic (AMO)
– S < 1K → FMA_FETCH + PSYNC
– S ≥ 1K → FMA to BTE switch (BTE_FETCH + FMA_PSYNC)
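The size thresholds quoted on the two large-message slides amount to a small dispatch rule. The sketch below is one possible reading, with assumptions stated plainly: `xfer_mode_t`, `pick_write_mode` and `pick_read_mode` are hypothetical names, "4b" is taken to mean 4 bytes, and 1K and 4K are taken as 1024 and 4096 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Transfer mechanisms named on the slides. */
typedef enum { XFER_AMO, XFER_FMA, XFER_BTE } xfer_mode_t;

/* rwrite(): FMA below 4 KiB, BTE from 4 KiB upward (assumed cut-off). */
static xfer_mode_t pick_write_mode(size_t bytes) {
    return bytes >= 4096 ? XFER_BTE : XFER_FMA;
}

/* rread(): atomic fetch for tiny reads, FMA fetch below 1 KiB,
 * BTE fetch from 1 KiB upward (assumed cut-offs). */
static xfer_mode_t pick_read_mode(size_t bytes) {
    if (bytes < 4)
        return XFER_AMO;
    if (bytes < 1024)
        return XFER_FMA;
    return XFER_BTE;
}
```

For example, under these assumptions a 512-byte rread() would use an FMA fetch, while a 512-byte rwrite() stays on FMA until the message reaches 4 KiB.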
Rpipe Scales ...
[Figure: latency/operation (us) vs. nodes (N), 2 to 32 nodes. Series: RPIPE_WRITE(1K)(unbounded), RPIPE_WRITE(64)(unbounded), RPIPE_WRITE(8)(4K), RPIPE_WRITE(8)(64), RPIPE_WRITE(8)(unbounded), RPIPE_WRITE(8K)(unbounded)]
§ “unbounded” → allocator has full rpipe available for all Zero-copy operations
§ Scaling up to 32 nodes – randomized rwrite() – 0.65 – 3.8us avg latency
Allocation Algorithms
[Figure: three panels (Next-fit, Best-fit, Worst-fit), latency/operation (us) vs. message size (bytes), 1 to 512 bytes. Series: MPI3.0_RMA, RPIPE_WRITE(1K), RPIPE_WRITE(256K), RPIPE_WRITE(unbounded)]
§ Zero-copy write vs Heuristics
– Next-fit allocator has better performance
– 1X – 3.5X slowdown for Best/Worst-fit
L = Latency
L[Next-fit] < L[MPI] < L[Worst-fit]
Overview
§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work
Future Work
§ Platform Support/Automated synchronization
§ High performance RMA Kernels
– Active messages/Neighbor/collective communication
§ Aggregated rpipes
– Leverage Zero copy/Eliminate hidden buffers
• i.e. Collectives
• Possible throughput, memory utilization gains
§ Irregular RMA and memory disaggregation
Questions?
Thank You!