Page 1: Towards an Extreme Scale Multithreaded MPI

Abdelhalim Amer (Halim), Postdoctoral Researcher
Argonne National Laboratory, IL, USA
Japan-Korea HPC Winter School, University of Tsukuba, Feb 17, 2016

Towards an Extreme Scale Multithreaded MPI

Page 2: Towards an Extreme Scale Multithreaded MPI

Programming Models and Runtime Systems Group

2

Myself: Affiliation
–  Mathematics and Computer Science Division, ANL

Group Lead
–  Pavan Balaji (computer scientist)

Current Staff Members
–  Sangmin Seo (assistant scientist)
–  Abdelhalim Amer (Halim) (postdoc)
–  Yanfei Guo (postdoc)
–  Rob Latham (developer)
–  Lena Oden (postdoc)
–  Ken Raffenetti (developer)
–  Min Si (postdoc)

Page 3: Towards an Extreme Scale Multithreaded MPI

The PMRS Group Research Areas

3

Communication Runtimes (MPICH)

Threading Models (OS-level, user-level: Argobots)

Data Movement in Heterogeneous and Deep Memory Hierarchies

My focus: Communication optimization in threading environments

Page 4: Towards an Extreme Scale Multithreaded MPI

The MPICH Project

4

MPICH and its derivatives in the Top 10:
1.  Tianhe-2 (China): TH-MPI
2.  Titan (US): Cray MPI
3.  Sequoia (US): IBM PE MPI
4.  K Computer (Japan): Fujitsu MPI
5.  Mira (US): IBM PE MPI
6.  Trinity (US): Cray MPI
7.  Piz Daint (Switzerland): Cray MPI
8.  Hazel Hen (Germany): Cray MPI
9.  Shaheen II (Saudi Arabia): Cray MPI
10. Stampede (US): Intel MPI and MVAPICH

[Figure: the MPICH family tree, including MPICH, Intel MPI, IBM MPI, Cray MPI, Microsoft MPI, MVAPICH, Tianhe MPI, ParaStation MPI, and FG-MPI.]

§  MPICH and its derivatives are the world's most widely used MPI implementations

§  Funded by DOE for 23 years (turned 23 this month)

§  Has been a key influencer in the adoption of MPI

§  Award-winning project
–  DOE R&D 100 award in 2005

Page 5: Towards an Extreme Scale Multithreaded MPI

Threading Models

5

Sangmin Seo, Assistant Computer Scientist

§  Argobots (a minimal usage sketch follows this slide)
–  Lightweight low-level threading and tasking framework
–  Targets massive on-node parallelism
–  Fine-grained control over the execution, scheduling, synchronization, and data movements
–  Exposes a rich set of features to higher programming systems and DSLs for efficient implementations

§  BOLT (https://press3.mcs.anl.gov/bolt)
–  Based on the LLVM OpenMP runtime and Clang
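To make the Argobots model above concrete, here is a minimal sketch (not from the talk) that creates a few user-level threads (ULTs) on the main pool of the calling execution stream. It assumes only the public ABT_* entry points of the Argobots API; error checking is omitted.

    #include <abt.h>
    #include <stdio.h>

    #define NUM_ULTS 4

    /* Work executed by each user-level thread (ULT). */
    static void hello(void *arg) {
        printf("hello from ULT %d\n", (int)(size_t)arg);
    }

    int main(int argc, char **argv) {
        ABT_xstream xstream;          /* execution stream (ES) bound to this OS thread */
        ABT_pool pool;                /* pool its scheduler pulls work from */
        ABT_thread ults[NUM_ULTS];

        ABT_init(argc, argv);
        ABT_xstream_self(&xstream);
        ABT_xstream_get_main_pools(xstream, 1, &pool);

        /* ULTs are created and scheduled in user space, not by the OS. */
        for (int i = 0; i < NUM_ULTS; i++)
            ABT_thread_create(pool, hello, (void *)(size_t)i,
                              ABT_THREAD_ATTR_NULL, &ults[i]);

        for (int i = 0; i < NUM_ULTS; i++) {
            ABT_thread_join(ults[i]);
            ABT_thread_free(&ults[i]);
        }

        ABT_finalize();
        return 0;
    }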

Page 6: Towards an Extreme Scale Multithreaded MPI

The Argobots Ecosystem

6

[Figure: the Argobots ecosystem. Argobots execution streams (ES1 ... ESn) run schedulers that execute ULTs (U) and tasklets (T). On top of the Argobots runtime and communication libraries sit applications and higher-level systems: MPI+Argobots (ULTs mapped to ESs), Charm++, CilkBots (a Cilk "worker" mapped to an Argobots ES running fused ULTs and an RWS ULT), PaRSEC task graphs (PO/GE/TR/SY kernels), OpenMP (BOLT), Mercury RPC (origin/target RPC processing), and OmpSs. External connections: GridFTP, Kokkos, RAJA, ROSE, TASCEL, XMP, etc.]

Page 7: Towards an Extreme Scale Multithreaded MPI

Heterogeneous and Deep Memory Hierarchies

7

§  Prefetching for deep memory hierarchies

§  Currently focusing on MIC accelerator environments

§  Compiler support
–  LLVM + user hints through pragma directives

§  Runtime support
–  Prefetching between In-Package Memory (IPM) and NVRAM
–  Software-managed cache

[Figure: data movement between NVRAM and IPM.]

Lena Oden, Postdoc
Yanfei Guo, Postdoc

Page 8: Towards an Extreme Scale Multithreaded MPI

The PMRS Group Research Areas

8

Communication Runtimes (MPICH)

Threading Models (OS-level, user-level: Argobots)

Data Movement in Heterogeneous/Hierarchical Memories

My focus: Communication optimization in threading environments

Today’s Talk

Page 9: Towards an Extreme Scale Multithreaded MPI

Outline

§  Introduction to hybrid MPI+threads programming
–  Hybrid MPI programming
–  Multithreaded-driven MPI communication

§  MPI + OS-Threading Optimization Space
–  Optimizing the coarse-grained global locking model
–  Moving towards a fine-grained model

§  MPI + User-Level Threading Optimization Space
–  Optimization opportunities over OS-threading
–  Advanced coarse-grained locking in MPI

§  Summary

9

Page 10: Towards an Extreme Scale Multithreaded MPI

Introduction to Hybrid MPI Programming

Page 11: Towards an Extreme Scale Multithreaded MPI

Why Go Hybrid MPI + Shared-Memory (X) Programming?

[Figure: multicore nodes with ever more cores per node. Growth of node resources in the Top500 systems. Peter Kogge: "Reading the Tea-Leaves: How Architecture Has Evolved at the High End", IPDPS 2014 Keynote.]

Page 12: Towards an Extreme Scale Multithreaded MPI

Main Forms of Hybrid MPI Programming

§  E.g., Quantum Monte Carlo simulations
§  Two different models
–  MPI + shared memory (X = MPI)
–  MPI + threads (e.g., X = OpenMP)
§  Both models use direct load/store operations

MPI + Threads
•  Share everything by default
•  Privatize data when necessary

MPI + Shared Memory (MPI 3.0 onwards; see the sketch after this slide)
•  Everything is private by default
•  Expose shared data explicitly

[Figure: in the MPI+threads model, one MPI task runs threads 0 and 1 on two cores that share a large B-spline table and per-thread walker data (W). In the MPI+shared-memory model, MPI tasks 0 and 1 each run on a core and access the large B-spline table placed in a shared-memory window.]
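As an illustration of the MPI + shared-memory model described above, the following minimal sketch (not from the talk) splits a per-node communicator and exposes a shared table through an MPI-3 shared-memory window; TABLE_BYTES, table, and the double element type are placeholders.

    #include <mpi.h>
    #include <stddef.h>

    #define TABLE_BYTES (1 << 20)   /* placeholder size for the shared table */

    int main(int argc, char **argv) {
        MPI_Comm shmcomm;
        MPI_Win win;
        double *table;               /* base pointer of the shared table */

        MPI_Init(&argc, &argv);

        /* Group ranks that can share memory (i.e., ranks on the same node). */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);

        int shmrank;
        MPI_Comm_rank(shmcomm, &shmrank);

        /* Rank 0 of the node allocates the table; the others attach with size 0. */
        MPI_Aint bytes = (shmrank == 0) ? TABLE_BYTES : 0;
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                shmcomm, &table, &win);

        /* Every rank queries rank 0's segment and then uses plain loads/stores. */
        MPI_Aint qsize; int qdisp;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &table);

        /* ... direct load/store accesses to table[] go here ... */

        MPI_Win_free(&win);
        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }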

Page 13: Towards an Extreme Scale Multithreaded MPI

13

Multithreaded MPI Communication: Current Situation

[Figures: node-to-node message rate on Haswell + Mellanox FDR fabric. One plot shows the 1-byte message rate vs. the number of cores per node; the others show message rate vs. message size for 1, 2, 4, 8, and 18 processes and for 1, 2, 4, 9, and 18 threads (processes vs. threads).]

§  It is becoming more difficult for a single core to saturate the network

§  Network fabrics are going parallel
–  E.g., BG/Q has 16 injection/reception FIFOs

§  Core frequencies are decreasing
–  Local message processing time increases

§  Driving MPI communication through threads backfires (a benchmark sketch follows)
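To make "driving MPI communication through threads" concrete, here is a minimal sketch of the sending side of such a benchmark (illustrative only; WINDOW, peer, and nmsgs are placeholders, and the library must be initialized with MPI_THREAD_MULTIPLE):

    #include <mpi.h>
    #include <omp.h>

    #define WINDOW 64    /* outstanding sends per thread (placeholder) */

    /* Each OpenMP thread drives its own stream of nonblocking sends, so all
     * threads enter the MPI runtime concurrently. */
    void threaded_send(const void *buf, int count, int peer, int nmsgs) {
        #pragma omp parallel
        {
            MPI_Request reqs[WINDOW];
            int tag = omp_get_thread_num();   /* one message stream per thread */
            for (int i = 0; i < nmsgs; i += WINDOW) {
                for (int w = 0; w < WINDOW; w++)
                    MPI_Isend(buf, count, MPI_CHAR, peer, tag,
                              MPI_COMM_WORLD, &reqs[w]);
                MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            }
        }
    }

    /* Initialization with full thread support, e.g.:
     *   int provided;
     *   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
     */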

Page 14: Towards an Extreme Scale Multithreaded MPI

Applications to Exploit Multithreaded MPI Communication

14

§  Several applications are moving to MPI+threads models

§  Most of them rely on funneled communication!

§  Certain types of applications fit the multithreaded communication model better
–  E.g., task-parallel applications

§  Several applications can scale well with current MPI runtimes if:
–  Data movements are not too fine-grained
–  The frequency of concurrent MPI calls is not too high

§  Scalable multithreaded support is necessary for the other applications

[1] Kim, Jeongnim, et al. "Hybrid algorithms in quantum Monte Carlo." Journal of Physics, 2012.

[Figure: distributed B-spline table if it cannot fit in memory, in Quantum Monte Carlo simulations [1]. MPI tasks 0 and 1 each run threads 0 and 1 on two cores, hold a part of the distributed B-spline table along with per-thread walker data (W), and communicate walker and B-spline data. The thread message-rate plot from the previous slide is shown again for reference.]

Page 15: Towards an Extreme Scale Multithreaded MPI

OS-Thread-Level Optimization Space

Page 16: Towards an Extreme Scale Multithreaded MPI

MPI_Call(...) {
    CS_ENTER;
    ...
    Progress();
    ...
    CS_EXIT;
}

while (!req_complete) {
    poll();
    CS_EXIT;
    CS_YIELD;
    CS_ENTER;
}

[Figure: MPI runtime state split into shared state, global read-only state, and local state.]

Challenges and Aspects Being Considered

§  Granularity
–  The current coarse-grained lock/work/unlock model is not scalable
–  Fully lock-free is not practical
•  Rank-wise progress constraints
•  Ordering constraints
•  Overuse of atomics and memory barriers can backfire

§  Arbitration
–  Traditional Pthread mutex locking is biased by the hardware (NUCA) and the OS (slow wakeups)
–  Causes waste because of the lack of correlation between resource acquisition and work

§  Hand-off latency
–  Slow hand-off latencies waste the advantages of fine granularity and smart arbitration
–  The trade-off must be carefully addressed

[Figure: threads contending for a critical section; the tuning knobs are critical-section length, arbitration, and hand-off latency.]

Page 17: Towards an Extreme Scale Multithreaded MPI

17

Production Thread-Safe MPI Implementations on Linux

§  OS-thread-level synchronization

§  Most implementations rely on coarse-grained critical sections
–  Only MPICH on BG/Q is known to implement fine-grained locking

§  Most implementations rely on the Pthread synchronization API (mutex, spinlock, ...); a schematic of this layering follows

§  Mutex in the Native POSIX Thread Library (NPTL)
–  Ships with glibc
–  Fast user-space locking attempt, usually with a Compare-And-Swap (CAS)
–  Futex wait/wake in contended cases
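A schematic of how such a coarse-grained global critical section can be layered on the Pthread API (illustrative macro and variable names only, not actual MPICH source):

    #include <pthread.h>
    #include <sched.h>

    /* One global mutex guards the whole MPI runtime (coarse-grained model). */
    static pthread_mutex_t mpi_global_cs = PTHREAD_MUTEX_INITIALIZER;

    #define CS_ENTER()  pthread_mutex_lock(&mpi_global_cs)
    #define CS_EXIT()   pthread_mutex_unlock(&mpi_global_cs)
    #define CS_YIELD()  sched_yield()   /* let a waiting thread run while we are outside */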

[Figure: the pthread_mutex_lock path. In user space, a CAS attempt acquires the mutex directly when uncontended; contended threads enter the kernel with FUTEX_WAIT, sleep, are awakened by FUTEX_WAKE, and retry the CAS until the lock is acquired. The mutex resides in main memory shared by cores with private L1 caches and separate LLCs, so lock acquisition is hardware-biased (the CAS favors nearby cores) and OS-biased by slow futex wakeups.]

[Figure: latency (cycles) vs. number of threads for the signal and for the resulting wakeup; the wakeup is roughly 10x more expensive.]
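For illustration, a minimal sketch of the CAS-then-futex pattern behind pthread_mutex_lock (simplified; the real glibc mutex tracks a contended state so it can skip FUTEX_WAKE on uncontended unlocks):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    typedef struct { atomic_int state; } simple_mutex_t;   /* 0 = free, 1 = locked */

    static void simple_lock(simple_mutex_t *m) {
        int expected = 0;
        /* Fast path: uncontended acquisition with a user-space CAS. */
        if (atomic_compare_exchange_strong(&m->state, &expected, 1))
            return;
        /* Slow path: sleep in the kernel until the holder wakes us, then retry. */
        do {
            syscall(SYS_futex, &m->state, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        } while (!atomic_compare_exchange_strong(&m->state, &expected, 1));
    }

    static void simple_unlock(simple_mutex_t *m) {
        atomic_store(&m->state, 0);
        /* This simplified version always wakes one waiter. */
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }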

Page 18: Towards an Extreme Scale Multithreaded MPI

18

Analysis of MPICH on a Linux Cluster

§  Thread safety in MPICH
–  Full THREAD_MULTIPLE support
–  OS-thread-level synchronization
–  Single coarse-grained critical section
–  Uses a Pthread mutex by default

§  Fairness analysis
–  Bias factor: BF = P(same_domain_acquisition) / P(random_uniform)
–  BF <= 1 means fair arbitration

§  Progress analysis
–  Wasted progress polls = unsuccessful progress polls while other threads are waiting at entry

[Figure: critical-section flow of a blocking MPI call: MPI_CALL_ENTER, CS_ENTER, issue the operation (1); if the operation is not complete, CS_EXIT, YIELD, CS_ENTER again (2) and poll; once complete, CS_EXIT and MPI_CALL_EXIT.]

#pragma omp parallel
{
    for (i = 0; i < nreq; i++)
        MPI_Irecv(..., &reqs[i]);
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}

[Figures: bias factor vs. number of threads per rank for Mutex-Core-Level, Mutex-NUMA-Node-Level, Ticket-Core-Level, and Ticket-NUMA-Node-Level; and wasted progress polls (%) vs. number of threads per rank for Mutex and Ticket locks.]

Fairness and progress analysis with a message rate benchmark on Haswell + Mellanox FDR fabric

Page 19: Towards an Extreme Scale Multithreaded MPI

Smart Progress and Fast Hand-off

Prioritizing issuing ops over waiting for internal progress

§  FIFO locks overcome the shortcomings of mutexes

§  Polling for progress can be wasteful (waiting does not generate work!)

§  Prioritizing issuing operations
•  Feeds the communication pipeline
•  Reduces the chance of wasteful internal progress (e.g., more requests on the fly means higher chances of making progress)

§  The same arbitration can be achieved with different locks (e.g., ticket, MCS, CLH, etc. are all FIFO; a ticket-lock sketch follows this list)
§  There is a trade-off between the desired arbitration and the hand-off latency
•  E.g., for FIFO, CLH is more scalable than ticket
§  Going through the kernel (e.g., using futex) is expensive and should not be done frequently
§  A more scalable hierarchical lock is being integrated
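A minimal sketch of a FIFO ticket lock with C11 atomics, for illustration only (not the lock used inside MPICH):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* next ticket to hand out */
        atomic_uint now_serving;   /* ticket currently allowed to enter */
    } ticket_lock_t;

    static void ticket_lock(ticket_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* take a ticket */
        while (atomic_load(&l->now_serving) != my)
            ;   /* spin; a real implementation could back off or yield here */
    }

    static void ticket_unlock(ticket_lock_t *l) {
        atomic_fetch_add(&l->now_serving, 1);   /* FIFO hand-off to the next ticket */
    }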

Pt2Pt message rate with 36 Haswell Cores and Mellanox FDR

[Figure: the modified critical-section flow: the first CS_ENTER issues the operation (1); while the operation is incomplete the thread exits the critical section, yields, and re-enters (2) to poll. Figure: point-to-point message rate vs. message size comparing Mutex, Ticket, Priority, and CLH locks. Annotations: fairness (FIFO) reduces wasted resource acquisitions; the Pthread mutex pays a time penalty relative to a FIFO lock; hand-off latency must be kept low; adapt arbitration to maximize work.]

Page 20: Towards an Extreme Scale Multithreaded MPI

Fairness and Cohort Locking

§  Unfairness itself IS NOT EVIL
§  Unbounded unfairness IS EVIL
§  Unfairness can improve locality of reference
§  Exploit locality of reference through cohorting
–  Threads sharing some level of the memory hierarchy
–  Local passing within a threshold

Hierarchical MCS Lock. Chabbi, Milind, Michael Fagan, and John Mellor-Crummey. "High performance locks for multi-level NUMA systems." Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2015.

[Figure: message rate vs. message size comparing the Mutex, Ticket, Priority, CLH, and HCLH<2> locks.]

Page 21: Towards an Extreme Scale Multithreaded MPI

Granularity Optimizations Being Considered

MPI_Call(...) {
    MEM_BARRIER();
    CS_ENTER;
    ...
    CS_EXIT;
}

ATOMIC_UPDATE(obj.ref_count)

[Figure: MPI runtime state split into shared state, global read-only state, and local state; only the shared state needs a critical section.]

Combination of several methods to achieve fine granularity
§  Locks and atomics are not always necessary
•  Memory barriers can be enough
•  E.g., only a read barrier for MPI_Comm_rank
§  Brief-global locking: only protect shared objects
§  Lock-free wherever possible
•  Atomic reference count updates (see the sketch after this slide)
•  Lock-free queue operations

Moving towards a lightweight enqueue/dequeue model
§  Reduce locking requirements and move to a mostly lock-free enqueue/dequeue model
§  Reduce unnecessary progress polling for nonblocking operations
§  Enqueue operations are all atomic
§  Dequeue operations
•  Atomic for unordered queues (RMA)
•  Fine-grained locking for ordered queues
§  The result is a mostly lock-free model for nonblocking operations

MPI_Call(...) {
    ...
    CS_TRY_ENTER;      /* atomic enqueue */
    Wait_Progress();
    CS_EXIT;
}

MPI_Icall(...) {
    ...
    CS_TRY_ENTER;      /* atomic enqueue; return */
}

[Figure: operations are enqueued atomically; dequeues are locked or atomic depending on the queue's ordering requirements.]
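As one concrete instance of the "lock-free wherever possible" point above, here is a sketch of atomic reference counting on a request object with C11 atomics (the request type and the free_request() cleanup hook are hypothetical):

    #include <stdatomic.h>

    typedef struct {
        atomic_int ref_count;
        /* ... other request fields ... */
    } request_t;

    void free_request(request_t *req);   /* hypothetical cleanup hook */

    /* Take a reference without any lock. */
    static inline void request_retain(request_t *req) {
        atomic_fetch_add_explicit(&req->ref_count, 1, memory_order_relaxed);
    }

    /* Drop a reference; the last owner frees the object. */
    static inline void request_release(request_t *req) {
        if (atomic_fetch_sub_explicit(&req->ref_count, 1, memory_order_acq_rel) == 1)
            free_request(req);
    }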

Page 22: Towards an Extreme Scale Multithreaded MPI

User-Level Threads Optimization Space

Page 23: Towards an Extreme Scale Multithreaded MPI

MPI+User-Level Threads Optimization Opportunities

[Figure: software stacks. Left: Applications on top of OpenMP, OmpSs, XMP, and MPI, all on POSIX Threads (glibc). Right: the same stack with a user-level thread library underneath. With plain Pthreads, two threads that each do MPI_Isend()/MPI_Irecv(), computation, and MPI_Wait() stall inside MPI_Wait(). With ULTs, a thread performs a lockless yield when leaving the critical section, context-switches to another ready ULT (e.g., one whose computation and MPI_Irecv() are still pending), and can yield-to a specific ULT instead of blocking.]
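A sketch of the "yield instead of blocking" idea above using the Argobots ULT yield call; the request type, completion test, and progress hook are hypothetical placeholders:

    #include <abt.h>

    /* Inside an MPI wait loop running on a ULT: instead of blocking the
     * underlying OS thread, give the core to another ready ULT. */
    void wait_on_request(my_request_t *req) {      /* my_request_t is a placeholder type */
        while (!request_complete(req)) {           /* hypothetical completion test */
            make_mpi_progress();                   /* hypothetical progress hook */
            ABT_thread_yield();                    /* switch to another ULT on this ES */
        }
    }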

Page 24: Towards an Extreme Scale Multithreaded MPI

Advanced Argobots Locking Support

Argobots Mutex API:
ABT_mutex_lock()
ABT_mutex_unlock()
ABT_mutex_lock_low()
ABT_mutex_unlock_se()

[Figure: Argobots mutex data structures. The mutex keeps a status flag (LOCKED/UNLOCKED) and per-ES wait queues with high-priority and low-priority ends. On Acquire(table_lock)/Release(table_lock), waiters are pushed conditionally per ES (local status flag); if there are ULTs waiting on my ES they can be popped without atomics; a waiter is pushed unconditionally if a prior ULT is already blocked.]

Page 25: Towards an Extreme Scale Multithreaded MPI

Mapping Argobots Advanced Locking to MPI

[Figure: the blocking-call critical-section flow on the MPI side mapped to the Argobots mutex API: the first CS_ENTER uses ABT_mutex_lock(); the CS_EXIT inside the wait loop uses ABT_mutex_unlock_se(); the re-entry after the yield uses ABT_mutex_lock_low(); the final CS_EXIT uses ABT_mutex_unlock().]

[Figure: point-to-point message rate vs. message size with 36 Haswell cores and Mellanox FDR, comparing the Pthread mutex, the Argobots mutex, and the single-threaded case.]

Average of 4x improvement for fine-grained communication!
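One plausible rendering of this mapping in code, consistent with the figure above (schematic only, not MPICH source; the issue, completion-test, and progress helpers are hypothetical):

    #include <abt.h>

    static ABT_mutex mpi_cs;            /* created once with ABT_mutex_create(&mpi_cs) */

    void mpi_blocking_call(void) {
        ABT_mutex_lock(mpi_cs);         /* first entry: high priority */
        issue_operation();              /* hypothetical */
        while (!operation_complete()) { /* hypothetical */
            ABT_mutex_unlock_se(mpi_cs);/* leave the CS and switch to a waiting ULT */
            ABT_mutex_lock_low(mpi_cs); /* re-enter with low priority */
            make_progress();            /* hypothetical progress poll */
        }
        ABT_mutex_unlock(mpi_cs);       /* final exit */
    }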

Page 26: Towards an Extreme Scale Multithreaded MPI

Summary

Page 27: Towards an Extreme Scale Multithreaded MPI

Takeaways

§  Why would you care about MPI+threads communication?
–  Applications, libraries, and languages are moving towards driving both computation and communication with threads because of
•  Hardware constraints
•  Programmability

§  The current situation
–  Multithreaded MPI communication is still not at Exascale level
–  We improved upon existing runtimes, but there is still more to go

§  Insight into the future
–  Hopefully the fine-grained model will be enough for both OS-level and user-level threading
–  Otherwise, a combination of advanced thread synchronization and threading runtimes will be necessary

27