TRANSCRIPT
Abdelhalim Amer (Halim), Postdoctoral Researcher, Argonne National Laboratory, IL, USA
Japan-Korea HPC Winter School, University of Tsukuba, Feb 17, 2016
Towards an Extreme Scale Multithreaded MPI
Programming Models and Runtime Systems Group
Myself
§ Affiliation
– Mathematics and Computer Science Division, ANL
§ Group Lead
– Pavan Balaji (computer scientist)
§ Current Staff Members
– Sangmin Seo (assistant scientist)
– Abdelhalim Amer (Halim) (postdoc)
– Yanfei Guo (postdoc)
– Rob Latham (developer)
– Lena Oden (postdoc)
– Ken Raffenetti (developer)
– Min Si (postdoc)
The PMRS Group Research Areas
§ Communication Runtimes (MPICH)
§ Threading Models (OS-level, user-level: Argobots)
§ Data Movement in Heterogeneous and Deep Memory Hierarchies
My focus: communication optimization in threading environments
The MPICH Project
MPICH and its derivatives in the Top 10:
1. Tianhe-2 (China): TH-MPI
2. Titan (US): Cray MPI
3. Sequoia (US): IBM PE MPI
4. K Computer (Japan): Fujitsu MPI
5. Mira (US): IBM PE MPI
6. Trinity (US): Cray MPI
7. Piz Daint (Switzerland): Cray MPI
8. Hazel Hen (Germany): Cray MPI
9. Shaheen II (Saudi Arabia): Cray MPI
10. Stampede (US): Intel MPI and MVAPICH
MPICH derivatives: Intel MPI, IBM MPI, Cray MPI, Microsoft MPI, MVAPICH, Tianhe MPI, ParaStation MPI, FG-MPI
§ MPICH and its derivatives are the world's most widely used MPI implementations
§ Funded by DOE for 23 years (turned 23 this month)
§ Has been a key influencer in the adoption of MPI
§ Award-winning project
– DOE R&D 100 award in 2005
Threading Models
(Sangmin Seo, Assistant Computer Scientist)
§ Argobots
– Lightweight low-level threading and tasking framework
– Targets massive on-node parallelism
– Fine-grained control over execution, scheduling, synchronization, and data movement
– Exposes a rich set of features to higher programming systems and DSLs for efficient implementations
§ BOLT (https://press3.mcs.anl.gov/bolt)
– Based on the LLVM OpenMP runtime and Clang
(A minimal Argobots usage sketch follows.)
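To make the execution-stream/ULT terminology concrete, here is a minimal sketch of spawning user-level threads with the public Argobots API (ABT_init, ABT_thread_create, etc.); error handling is omitted, and the pool setup follows the common single-stream default:

```c
#include <abt.h>
#include <stdio.h>

/* Work to run as a user-level thread (ULT). */
void hello(void *arg)
{
    printf("ULT %d running\n", (int)(size_t)arg);
}

int main(int argc, char *argv[])
{
    ABT_init(argc, argv);

    /* Use the main pool of the primary execution stream (ES). */
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Create a few lightweight ULTs, then join and free them. */
    ABT_thread threads[4];
    for (int i = 0; i < 4; i++)
        ABT_thread_create(pool, hello, (void *)(size_t)i,
                          ABT_THREAD_ATTR_NULL, &threads[i]);
    for (int i = 0; i < 4; i++)
        ABT_thread_free(&threads[i]);

    ABT_finalize();
    return 0;
}
```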
The Argobots Ecosystem
[Diagram: the Argobots ecosystem. Execution streams (ES1 … ESn), each with a scheduler (S), run user-level threads (U/ULT) and tasklets (T) on the Argobots runtime. Layered on top are communication libraries and applications: MPI+Argobots (ULTs per ES driving MPI), Charm++ ("CilkBots": Cilk-style workers mapped to Argobots ESs with random work stealing and fused ULTs 1…N), PaRSEC task graphs (PO/GE/TR/SY kernels), OpenMP (BOLT), Mercury RPC (origin/target RPC processing), and OmpSs. External connections: GridFTP, Kokkos, RAJA, ROSE, TASCEL, XMP, etc.]
Heterogeneous and Deep Memory Hierarchies
(Lena Oden, postdoc; Yanfei Guo, postdoc)
§ Prefetching for deep memory hierarchies
§ Currently focusing on MIC accelerator environments
§ Compiler support
– LLVM + user hints through pragma directives
§ Runtime support
– Prefetching between In-Package Memory (IPM) and NVRAM
– Software-managed cache
(An illustrative allocation sketch follows.)
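The IPM-vs.-slower-tier placement problem can be made concrete with existing tooling. The sketch below is illustrative only — it is not the group's prefetching runtime — and uses the memkind/hbwmalloc API, one established way to place a hot table in KNL in-package memory with a fallback to the default tier:

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory API */
#include <stdlib.h>

double *alloc_table(size_t n)
{
    /* Prefer in-package memory (MCDRAM) when the platform has it... */
    if (hbw_check_available() == 0)
        return hbw_malloc(n * sizeof(double));
    /* ...otherwise fall back to the default (slower) memory tier. */
    return malloc(n * sizeof(double));
}
```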
The PMRS Group Research Areas
§ Communication Runtimes (MPICH)
§ Threading Models (OS-level, user-level: Argobots)
§ Data Movement in Heterogeneous/Hierarchical Memories
My focus (today's talk): communication optimization in threading environments
Outline
§ Introduction to hybrid MPI+threads programming
– Hybrid MPI programming
– Multithreaded-driven MPI communication
§ MPI + OS-threading optimization space
– Optimizing the coarse-grained global locking model
– Moving towards a fine-grained model
§ MPI + user-level threading optimization space
– Optimization opportunities over OS-threading
– Advanced coarse-grained locking in MPI
§ Summary
Introduction to Hybrid MPI Programming
Why Go Hybrid: MPI + Shared-Memory (X) Programming
[Figure: a node with many cores — growth of node resources in the Top500 systems. Peter Kogge: "Reading the Tea-Leaves: How Architecture Has Evolved at the High End", IPDPS 2014 keynote.]
Main Forms of Hybrid MPI Programming
§ E.g. Quantum Monte Carlo simulations
§ Two different models
– MPI + shared memory (X = MPI)
– MPI + threads (e.g. X = OpenMP)
§ Both models use direct load/store operations

MPI + threads:
• Share everything by default
• Privatize data when necessary
[Figure: threads 0 and 1 of one MPI task share a large B-spline table; each core runs walkers (W = walker data).]

MPI + shared memory (MPI 3.0 onwards):
• Everything private by default
• Expose shared data explicitly
[Figure: MPI tasks 0 and 1 place the large B-spline table in a shared-memory window; each core runs walkers (W).]
(A minimal MPI-3 shared-memory window sketch follows.)
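For the MPI + shared-memory model, the standard MPI 3.0 calls are enough to expose such a shared table. A minimal sketch, where the table size is a placeholder:

```c
#include <mpi.h>

/* Expose a node-shared table via an MPI-3 shared-memory window. */
double *alloc_shared_table(MPI_Win *win)
{
    MPI_Comm shmcomm;
    double *table;
    MPI_Aint bytes = (MPI_Aint)1 << 20;  /* placeholder size */
    MPI_Aint qsize;
    int rank, disp_unit;

    /* Group the ranks that can share memory with us. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);
    MPI_Comm_rank(shmcomm, &rank);

    /* Rank 0 allocates the table; the others attach with size 0. */
    MPI_Win_allocate_shared(rank == 0 ? bytes : 0, sizeof(double),
                            MPI_INFO_NULL, shmcomm, &table, win);

    /* Point every rank at rank 0's segment for direct load/store. */
    MPI_Win_shared_query(*win, 0, &qsize, &disp_unit, &table);
    return table;
}
```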
Multithreaded MPI Communication: Current Situation
[Figure: node-to-node message rate on a Haswell + Mellanox FDR fabric. Left: 1-byte message rate vs. number of cores per node, with a single-core reference. Center: message rate (messages/s) vs. message size (bytes) for 1, 2, 4, 8, and 18 processes. Right: the same for 1, 2, 4, 9, and 18 threads, with markedly lower rates.]
§ It is becoming more difficult for a single core to saturate the network
§ Network fabrics are going parallel
– E.g. BG/Q has 16 injection/reception FIFOs
§ Core frequencies are decreasing
– Local message-processing time increases
§ Driving MPI communication through threads backfires
Applications to Exploit Multithreaded MPI Communication
§ Several applications are moving to MPI+threads models
§ Most of them rely on funneled communication!
§ Certain types of applications fit the multithreaded communication model better
– E.g. task-parallel applications
§ Several applications can scale well with current MPI runtimes if:
– Data movements are not too fine-grained
– The frequency of concurrent MPI calls is not too high
§ Scalable multithreaded support is necessary for the other applications

[1] Kim, Jeongnim, et al. "Hybrid algorithms in quantum Monte Carlo." Journal of Physics, 2012.
[Figure: distributed B-spline table in quantum Monte Carlo simulations [1]. When the large B-spline table cannot fit in one node's memory, it is distributed across MPI tasks, and threads (thread 0, thread 1 per task) communicate walker and B-spline data between tasks. The threaded message-rate curves from the previous slide set the achievable communication rates.]
OS-Thread-Level Optimization Space
Coarse-grained critical section around an MPI call; the internal progress loop releases and re-acquires the lock while waiting (only shared state needs the lock; global read-only and local state do not):

```c
MPI_Call(...) {
    CS_ENTER;
    ...
    Progress();
    ...
    CS_EXIT;
}

/* Inside Progress(): yield the critical section while waiting. */
while (!req_complete) {
    poll();
    CS_EXIT;
    CS_YIELD;
    CS_ENTER;
}
```
Challenges and Aspects Being Considered
§ Granularity
– The current coarse-grained lock/work/unlock model is not scalable
– Fully lock-free is not practical
• Rank-wise progress constraints
• Ordering constraints
• Overuse of atomics and memory barriers can backfire
§ Arbitration
– Traditional Pthread mutex locking is biased by the hardware (NUCA) and the OS (slow wakeups)
– This causes waste because of the lack of correlation between resource acquisition and work
§ Hand-off latency
– Slow hand-offs waste the advantages of fine granularity and smart arbitration
– The trade-off must be carefully addressed
[Diagram: threads contending for the critical section; the knobs are critical-section length, arbitration, and hand-off latency.]
Production Thread-Safe MPI Implementations on Linux
§ OS-thread-level synchronization
§ Most implementations rely on coarse-grained critical sections
– Only MPICH on BG/Q is known to implement fine-grained locking
§ Most implementations rely on the Pthread synchronization API (mutex, spinlock, …)
§ Mutex in the Native POSIX Thread Library (NPTL)
– Ships with glibc
– Fast user-space locking attempt, usually with a compare-and-swap (CAS)
– Futex wait/wake in contended cases
[Figure: the pthread_mutex_lock path across user space and kernel space. An uncontended lock is acquired with a user-space CAS on the mutex word in main memory; contended threads fall into the kernel with FUTEX_WAIT (sleep) and are released by FUTEX_WAKE. Acquisition is hardware-biased (the CAS favors cores sharing a last-level cache with the holder) and OS-biased (futex wakeups are slow: measured wakeup latency is roughly 10x the signal latency and grows with the number of threads).]
(A minimal futex-mutex sketch follows.)
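For readers unfamiliar with the NPTL fast/slow path, here is a minimal Linux futex mutex in the spirit of glibc's implementation (following Ulrich Drepper's "Futexes Are Tricky"; states: 0 = unlocked, 1 = locked, 2 = locked with waiters):

```c
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void mutex_lock(atomic_int *m)
{
    int c = 0;
    /* Fast path: user-space CAS 0 -> 1, no kernel involvement. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;
    /* Slow path: mark the lock contended (2) and sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        futex(m, FUTEX_WAIT, 2);        /* sleep while the value is 2 */
        c = atomic_exchange(m, 2);
    }
}

void mutex_unlock(atomic_int *m)
{
    if (atomic_exchange(m, 0) == 2)     /* there were waiters */
        futex(m, FUTEX_WAKE, 1);        /* wake one of them */
}
```

The hardware and OS bias discussed above comes precisely from this design: the CAS fast path favors whoever holds the cache line, and woken waiters race (and usually lose) against threads arriving in user space.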
Analysis of MPICH on a Linux Cluster
§ Thread safety in MPICH
– Full THREAD_MULTIPLE support
– OS-thread-level synchronization
– Single coarse-grained critical section
– Uses a Pthread mutex by default
§ Fairness analysis
– Bias factor: BF = P(same-domain acquisition) / P(random uniform acquisition)
– BF <= 1 → fair arbitration
§ Progress analysis
– Wasted progress polls = unsuccessful progress polls while other threads are waiting at entry

[Flow diagram: (1) a thread acquires the critical section at MPI call entry (CS_ENTER); (2) while the operation is incomplete it exits, yields, and re-enters (CS_EXIT, YIELD, CS_ENTER); on completion it exits the critical section and returns (CS_EXIT, MPI_CALL_EXIT).]

Benchmark kernel (the slide elides the receive arguments; buffers, count, and peer are filled in here for readability):

```c
#pragma omp parallel
{
    MPI_Request reqs[NREQ];       /* per-thread requests */
    for (int i = 0; i < NREQ; i++)
        MPI_Irecv(buf[i], COUNT, MPI_CHAR, peer, /* tag */ i,
                  MPI_COMM_WORLD, &reqs[i]);
    MPI_Waitall(NREQ, reqs, MPI_STATUSES_IGNORE);
}
```

[Figure: fairness and progress analysis with a message-rate benchmark on a Haswell + Mellanox FDR fabric. Left: bias factor vs. number of threads per rank for mutex and ticket locks at core and NUMA-node level; the mutex is strongly biased while the ticket lock stays near 1. Right: wasted progress polls (%) vs. number of threads per rank; the mutex wastes far more polls than the ticket lock.]
Smart Progress and Fast Hand-off
§ FIFO locks overcome the shortcomings of mutexes
§ Polling for progress can be wasteful (waiting does not generate work!)
§ Prioritize issuing operations over waiting for internal progress
• Feed the communication pipeline
• Reduce the chances of wasteful internal progress (e.g. more requests in flight → higher chances of making progress)
§ The same arbitration can be achieved with different locks (e.g. ticket, MCS, and CLH are all FIFO)
§ There is a trade-off between the desired arbitration and hand-off latency
• E.g. among FIFO locks, CLH is more scalable than ticket
§ Going through the kernel (e.g. using futex) is expensive and should not be done frequently
§ A more scalable hierarchical lock is being integrated

[Flow diagram: the yield path of the critical section gets lower priority than first entry, so threads issuing new operations acquire the lock first.]
[Figure: point-to-point message rate (messages/s) vs. message size (bytes) with 36 Haswell cores and Mellanox FDR, for mutex, ticket, priority, and CLH locks. Fairness (FIFO) removes the Pthread mutex's wasted-acquisition time penalty, but hand-off latency must be kept low, and the arbitration should adapt to maximize work.]
(A ticket-lock sketch follows.)
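As a reference point for the FIFO locks named above, here is a minimal ticket lock in C11 atomics. Ticket and CLH/MCS provide the same FIFO arbitration, but CLH/MCS waiters spin on per-thread flags rather than one shared counter, which is why they hand off more scalably:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next;     /* next ticket to hand out */
    atomic_uint serving;  /* ticket currently being served */
} ticket_lock;

void ticket_acquire(ticket_lock *lk)
{
    /* Take a ticket; threads are served strictly in arrival order. */
    unsigned me = atomic_fetch_add(&lk->next, 1);
    while (atomic_load(&lk->serving) != me)
        ;  /* spin until it is our turn */
}

void ticket_release(ticket_lock *lk)
{
    /* Hand off to the next waiter in FIFO order. */
    atomic_fetch_add(&lk->serving, 1);
}
```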
Fairness and Cohort Locking
§ Unfairness itself IS NOT EVIL
§ Unbounded unfairness IS EVIL
§ Unfairness can improve locality of reference
§ Exploit locality of reference through cohorting
– Threads sharing some level of the memory hierarchy form a cohort
– The lock is passed locally within a threshold

Hierarchical MCS lock: Chabbi, Milind, Michael Fagan, and John Mellor-Crummey. "High Performance Locks for Multi-level NUMA Systems." Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). ACM, 2015.
[Figure: message rate (messages/s) vs. message size (bytes) for mutex, ticket, priority, CLH, and HCLH<2> locks; the hierarchical HCLH lock sustains the highest rates.]
(A simplified cohort-lock sketch follows.)
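A minimal sketch of the cohorting idea, built from two ticket locks (a global one and a per-NUMA-node one); THRESHOLD bounds the unfairness by limiting consecutive local hand-offs. This illustrates the principle only, not the HMCS lock of Chabbi et al.:

```c
#include <stdatomic.h>

typedef struct { atomic_uint next, serving; } tkt_t;

static unsigned tkt_acquire(tkt_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->serving) != me) ;   /* FIFO spin */
    return me;
}
static void tkt_release(tkt_t *l) { atomic_fetch_add(&l->serving, 1); }
static int tkt_has_waiters(tkt_t *l, unsigned me) {
    return atomic_load(&l->next) != me + 1;    /* someone queued behind us */
}

#define THRESHOLD 64
typedef struct {
    tkt_t local;        /* per-NUMA-node lock */
    unsigned batch;     /* consecutive local hand-offs so far */
    int have_global;    /* does this cohort hold the global lock? */
} cohort_node_t;

tkt_t global_lock;

unsigned cohort_acquire(cohort_node_t *n) {
    unsigned me = tkt_acquire(&n->local);
    if (!n->have_global) {          /* our cohort must take the global lock */
        tkt_acquire(&global_lock);
        n->have_global = 1;
    }
    return me;
}

void cohort_release(cohort_node_t *n, unsigned me) {
    if (tkt_has_waiters(&n->local, me) && ++n->batch < THRESHOLD) {
        tkt_release(&n->local);     /* pass locally: cheap and cache-friendly */
    } else {
        n->batch = 0;               /* bound the unfairness: go global */
        n->have_global = 0;
        tkt_release(&global_lock);
        tkt_release(&n->local);
    }
}
```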
Granularity Optimizations Being Considered
A combination of several methods achieves fine granularity:
§ Locks and atomics are not always necessary
• Memory barriers can be enough
• E.g. only a read barrier for MPI_Comm_rank
§ Brief-global locking: only protect shared objects (global read-only and local state need no lock)
§ Lock-free wherever possible
• Atomic reference-count updates
• Lock-free queue operations

Moving towards a lightweight enqueue/dequeue model:
§ Reduce locking requirements and move to a mostly lock-free enqueue/dequeue model
§ Reduce unnecessary progress polling for nonblocking operations
§ Enqueue operations are all atomic
§ Dequeue operations are
• Atomic for unordered queues (RMA)
• Fine-grained locking for ordered queues
§ The result is a mostly lock-free model for nonblocking operations

Sketch from the slide (a blocking call enqueues atomically and opportunistically drives progress; a nonblocking call enqueues atomically and returns):

```c
MPI_Call(...) {
    MEM_BARRIER();                /* sometimes a barrier is enough */
    CS_ENTER;                     /* brief-global: shared objects only */
    ...
    CS_EXIT;
    ATOMIC_UPDATE(obj.ref_count); /* lock-free reference counting */
}

MPI_Call(...) {                   /* blocking operation */
    ...
    Atomic_enqueue(...);
    CS_TRY_ENTER;                 /* drive progress if nobody else is */
    Wait_Progress();
    CS_EXIT;
}

MPI_Icall(...) {                  /* nonblocking operation */
    ...
    Atomic_enqueue(...);
    CS_TRY_ENTER;
    /* Return */
}

/* Dequeue: atomic (unordered queues) or locked (ordered queues). */
```
(A lock-free enqueue sketch follows.)
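To illustrate what "enqueue operations all atomic" can look like, here is a Treiber-style lock-free push in C11 atomics (LIFO, which is acceptable for the unordered RMA case; this is a generic textbook sketch, not MPICH's actual queue code):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct op {
    struct op *next;
    /* ... operation payload (request, target, etc.) ... */
} op_t;

typedef struct { _Atomic(op_t *) top; } opstack_t;

void atomic_enqueue(opstack_t *q, op_t *node)
{
    op_t *old = atomic_load(&q->top);
    do {
        node->next = old;  /* link to the current top */
    } while (!atomic_compare_exchange_weak(&q->top, &old, node));
}

op_t *atomic_dequeue_all(opstack_t *q)
{
    /* The progress engine grabs the whole batch in one atomic swap,
     * which also sidesteps the ABA problem of popping single nodes. */
    return atomic_exchange(&q->top, NULL);
}
```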
User-Level Threads Optimization Space
MPI + User-Level Threads: Optimization Opportunities

[Diagram: two software stacks. Left: applications → OpenMP, OmpSs, XMP, MPI → POSIX Threads (glibc). Right: the same stack with a user-level thread library between the programming systems and POSIX Threads.]

With Pthreads, every thread that blocks in MPI_Wait stalls its core:
– Pthread 1: MPI_Isend(); Computation(); MPI_Wait(...) → stalls
– Pthread 2: MPI_Irecv(); Computation(); MPI_Wait(...) → stalls

With user-level threads, one Pthread context-switches among ULTs instead of stalling:
– ULT 0: MPI_Isend(); Computation(); MPI_Wait(...) → lockless yield when leaving the critical section
– ULT 1: MPI_Irecv(); Computation(); MPI_Wait(...) → context switch to the next ULT
– ULT 2: Computation(); MPI_Irecv(); MPI_Wait(...) → yield-to instead of blocking
(A yield-based wait sketch follows.)
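The yield-instead-of-block pattern can be sketched with the public Argobots call ABT_thread_yield(). Assuming a ULT-aware wait loop (shown here at application level for clarity), the waiting ULT lets sibling ULTs on the same execution stream run:

```c
#include <mpi.h>
#include <abt.h>

/* Hypothetical application-level wait: poll the request and yield
 * to other ULTs on this execution stream instead of blocking. */
static void wait_yielding(MPI_Request *req)
{
    int done = 0;
    while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done)
            ABT_thread_yield();  /* run another ULT; no core stall */
    }
}
```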
Advanced Argobots Locking Support

Argobots mutex API:
– ABT_mutex_lock() / ABT_mutex_unlock()
– ABT_mutex_lock_low() / ABT_mutex_unlock_se() (low-priority acquire; unlock-and-schedule-else)

Argobots mutex data structures:
– A status word (LOCKED/UNLOCKED) with high-priority and low-priority waiter queues
– Per-execution-stream (ES 0 … ES N) local status flags
– Waiters push conditionally onto the queues, and unconditionally if a prior ULT is already blocked
– If the waiting ULTs are on my ES, they can be popped without atomics
Mapping Argobots Advanced Locking to MPI

MPI side → Argobots API:
– (1) CS_ENTER at MPI call entry → ABT_mutex_lock()
– (2) In the yield loop while the operation is incomplete: CS_EXIT → ABT_mutex_unlock_se(), then CS_ENTER → ABT_mutex_lock_low()
– CS_EXIT at MPI call exit → ABT_mutex_unlock()

[Figure: point-to-point message rate (msg/s) vs. message size (bytes) with 36 Haswell cores and Mellanox FDR, comparing the Pthread mutex, the Argobots mutex, and a single-threaded run. Average of 4x improvement for fine-grained communication!]
(A reconstruction of the resulting critical-section loop follows.)
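Put together, a hedged sketch of how the MPI critical section could use these primitives. The mapping is my reading of the flow diagram, not MPICH source, and op_complete()/poll_progress() are hypothetical stand-ins for MPICH internals:

```c
#include <abt.h>

extern int  op_complete(void);   /* hypothetical: is my request done? */
extern void poll_progress(void); /* hypothetical: drive MPI progress  */

/* cs is an ABT_mutex protecting the MPI runtime's shared state. */
void mpi_blocking_call_cs(ABT_mutex cs)
{
    ABT_mutex_lock(cs);            /* (1) first entry: normal priority */
    while (!op_complete()) {
        poll_progress();
        ABT_mutex_unlock_se(cs);   /* release and schedule another ULT */
        ABT_mutex_lock_low(cs);    /* (2) re-enter with low priority so
                                    * threads issuing new ops go first */
    }
    ABT_mutex_unlock(cs);          /* final exit */
}
```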
Summary
To Take Out
§ Why would you care about MPI+threads communication?
– Applications, libraries, and languages are moving towards driving both computation and communication with threads because of
• Hardware constraints
• Programmability
§ The current situation
– Multithreaded MPI communication is still not at Exascale level
– We improved upon existing runtimes, but there is still more to go
§ Insight into the future
– Hopefully the fine-grained model will be enough for both OS-level and user-level threading
– Otherwise, a combination of advanced thread synchronization and threading runtimes will be necessary