Slides: htor.inf.ethz.ch/publications/img/fompi-slides.pdf

spcl.inf.ethz.ch

@spcl_eth

ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

MPI-3.0 REMOTE MEMORY ACCESS

MPI-3.0 supports RMA ("MPI One Sided")
  Designed to react to hardware trends
  Majority of HPC networks support RDMA
Communication is "one sided" (no involvement of destination)
RMA decouples communication & synchronization
  Different from message passing

[Figure: two sided (Proc A send, Proc B recv): communication + synchronization in one operation; one sided (Proc A put to Proc B): communication by put, synchronization by a separate sync]

[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

PRESENTATION OVERVIEW

1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation

MPI-3 RMA COMMUNICATION OVERVIEW

Non-atomic communication calls (put, get)
Atomic communication calls (Acc, Get & Acc, CAS, FAO)

[Figure: Process A (passive) and Processes B, C, D, … (active); memory exposed as MPI windows; arrows labelled Put, Get, Atomic]

MPI-3.0 RMA SYNCHRONIZATION OVERVIEW

Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All

[Figure: an active process and a passive process, connected by synchronization and communication arrows]

SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION

Scalable & generic protocols
  Can be used on any RDMA network (e.g., OFED/IB)
  Window creation, communication and synchronization
foMPI, a fully functional MPI-3 RMA implementation
  DMAPP: lowest-level networking API for Cray Gemini/Aries systems
  XPMEM: a portable Linux kernel module

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI

PART 1: SCALABLE WINDOW CREATION
Traditional windows

Backwards compatible (MPI-2)
Time bound: $\mathcal{O}(p)$
Memory bound: $\mathcal{O}(p)$
$p$ = total number of processes

[Figure: Processes A, B, and C expose memory at different base addresses (0x111, 0x123, 0x120)]

PART 1: SCALABLE WINDOW CREATION
Allocated windows

Allows MPI to allocate memory
Time bound: $\mathcal{O}(\log p)$ with high probability (whp)
Memory bound: $\mathcal{O}(1)$
$p$ = total number of processes

[Figure: Processes A, B, and C all expose memory at the same base address (0x123)]

PART 1: SCALABLE WINDOW CREATION
Dynamic windows

Local attach/detach
Most flexible
Time bound: $\mathcal{O}(p)$
Memory bound: $\mathcal{O}(p)$
$p$ = total number of processes

[Figure: Processes A, B, and C with memory regions at 0x111, 0x123, and 0x120, plus attached regions at 0x129]
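
The memory bounds on the window-creation slides can be read as a statement about how a process finds a remote address. The following is a minimal sketch in C, assuming (the slides do not spell this out) that a traditional window keeps one base address per process while an allocated window relies on a single base address that is identical everywhere. All type and function names are hypothetical and are not part of MPI or foMPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Traditional window: one base address per process, so O(p) memory. */
typedef struct {
    int        num_procs;
    uintptr_t *bases;      /* bases[r] = base address exposed by rank r */
} traditional_win;

/* Allocated window: the same base on every process, so O(1) memory. */
typedef struct {
    uintptr_t symmetric_base;
} allocated_win;

/* Copies the p base addresses; returns 0 on success, -1 on allocation failure. */
int traditional_win_init(traditional_win *w, int num_procs,
                         const uintptr_t *bases) {
    w->bases = malloc((size_t)num_procs * sizeof *w->bases);
    if (w->bases == NULL) return -1;
    for (int r = 0; r < num_procs; r++) w->bases[r] = bases[r];
    w->num_procs = num_procs;
    return 0;
}

void traditional_win_free(traditional_win *w) {
    free(w->bases);
    w->bases = NULL;
    w->num_procs = 0;
}

/* Remote address = per-rank base (table lookup) + displacement. */
uintptr_t traditional_addr(const traditional_win *w, int rank, size_t disp) {
    assert(rank >= 0 && rank < w->num_procs);
    return w->bases[rank] + disp;
}

/* Remote address = shared base + displacement; the rank is not needed. */
uintptr_t allocated_addr(const allocated_win *w, int rank, size_t disp) {
    (void)rank;
    return w->symmetric_base + disp;
}
```

With the addresses from the figures, a traditional window needs the table {0x111, 0x123, 0x120}, whereas an allocated window needs only 0x123.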

PART 2: COMMUNICATION

Put and Get:
  Direct DMAPP put and get operations or local (blocking) memcpy (XPMEM)
Accumulate:
  DMAPP atomic operations for 64 bit types
  ...or fall back to remote locking protocol
MPI datatype handling with MPITypes library [1]
  Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE)

[Figure: for contiguous memory, MPI_Put maps to dmapp_put_nbi to the remote process; MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi to the remote process]

[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
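
The two accumulate paths named above can be illustrated with a shared-memory analogue. This is only a sketch: it uses C11 atomics in place of DMAPP atomic operations and a spin lock in place of the remote locking protocol, and it assumes a 32 bit float as the example of a type that has to take the fallback path. All names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Target-side state for this sketch. */
typedef struct {
    _Atomic int64_t int_cell;    /* 64 bit type: atomic fast path        */
    float           float_cell;  /* other type: protected by the lock    */
    atomic_flag     lock;        /* stands in for the remote lock        */
} acc_target;

void acc_target_init(acc_target *t) {
    atomic_init(&t->int_cell, 0);
    t->float_cell = 0.0f;
    atomic_flag_clear(&t->lock);
}

/* Fast path: one atomic fetch-and-add, no lock taken. */
void accumulate_sum_i64(acc_target *t, int64_t value) {
    atomic_fetch_add(&t->int_cell, value);
}

/* Fallback path: acquire the lock, read-modify-write, release the lock. */
void accumulate_sum_f32(acc_target *t, float value) {
    while (atomic_flag_test_and_set(&t->lock)) { /* spin until free */ }
    t->float_cell += value;
    atomic_flag_clear(&t->lock);
}
```

The design point the sketch tries to show is that the fast path costs a single operation on the target, while the fallback needs a lock round trip around every update.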

PERFORMANCE INTER-NODE: LATENCY

[Figure: latency plots, panels "Put Inter-Node" and "Get Inter-Node", annotated "20% faster" and "80% faster". Benchmark: half ping-pong, Proc 0 issues a put to Proc 1 and then syncs memory]

PERFORMANCE INTRA-NODE: LATENCY

[Figure: latency plot "Put/Get Intra-Node", annotated "3x faster". Benchmark: half ping-pong, Proc 0 issues a put to Proc 1 and then syncs memory]

PERFORMANCE: OVERLAP

[Figure: "Inter-Node Overlap in %". Benchmark: Proc 0 issues a put to Proc 1, computes, and then syncs memory]

Useful for, e.g., scientific codes:
  3D FFT
  MILC
  AWM-Olsen seismic

PERFORMANCE: MESSAGE RATE

[Figure: message rate plots, panels "Intra-Node" and "Inter-Node". Benchmark: Proc 0 issues a series of puts to Proc 1 and then syncs memory]

PERFORMANCE: ATOMICS

[Figure: atomics performance for 64 bit integers; annotations: "hardware-accelerated protocol", "fall back protocol", "lower latency", "higher bandwidth", "proprietary"]

PART 3: SYNCHRONIZATION

Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All

[Figure: an active process and a passive process, connected by synchronization and communication arrows]

SCALABLE FENCE IMPLEMENTATION

Collective call
Completes all outstanding memory operations

int MPI_Win_fence(…) {
  asm( mfence );        /* local completion (XPMEM) */
  dmapp_gsync_wait();   /* local completion (DMAPP) */
  MPI_Barrier(...);     /* global completion barrier */
  return MPI_SUCCESS;
}

[Figure: Procs 0 to 3, spread over Node 0 and Node 1, issue puts to one another before the fence]

SCALABLE FENCE PERFORMANCE

Time bound: $\mathcal{O}(\log p)$
Memory bound: $\mathcal{O}(1)$

[Figure: fence performance plot, annotated "90% faster"]

PSCW SYNCHRONIZATION

Posting process: post and wait delimit the exposure epoch, which allows access from other processes
Starting process: start and complete delimit the access epoch, which allows access to other processes
A matching algorithm pairs post with start, and another pairs complete with wait

[Figure: Proc 0 and Proc 1, one posting and one starting, with puts issued between start and complete]

PSCW SYNCHRONIZATION

[Figure: Procs 0 to 5; some processes call post and wait, the others call start and complete]

PSCW SCALABLE POST/START MATCHING

In general, there can be n posting and m starting processes
In this example there is one posting and 4 starting processes
  Posting process i (opens its window)
  Starting processes j1, ..., j4 (access remote window)
Each starting process has a local list
Posting process i adds its rank i to a list at each starting process j1, ..., j4
Each starting process j waits until the rank of the posting process i is present in its local list

[Figure: posting process i surrounded by starting processes j1, ..., j4, each with a local list that receives the entry i]
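
The post/start matching described above can be sketched as a shared-memory analogue in C11. This is an illustration under assumptions, not the foMPI code: the fixed list capacity, the fetch-and-add used to reserve a slot, and all names are hypothetical, and C11 atomics stand in for remote atomic operations.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LIST_CAPACITY 8      /* hypothetical fixed size of a local list */
#define EMPTY_SLOT    (-1)   /* marks a slot no poster has written yet  */

/* The local list kept by one starting process. */
typedef struct {
    atomic_size_t tail;                  /* next free slot */
    atomic_int    ranks[LIST_CAPACITY];  /* posted ranks   */
} post_list;

void post_list_init(post_list *l) {
    atomic_init(&l->tail, 0);
    for (size_t k = 0; k < LIST_CAPACITY; k++)
        atomic_init(&l->ranks[k], EMPTY_SLOT);
}

/* Posting side: append my_rank to the list of one starting process.
 * Returns false if the list is already full. */
bool post_to(post_list *target, int my_rank) {
    size_t slot = atomic_fetch_add(&target->tail, 1);  /* reserve a slot */
    if (slot >= LIST_CAPACITY) return false;
    atomic_store(&target->ranks[slot], my_rank);
    return true;
}

/* Starting side: is the rank of the posting process in my local list? */
bool start_is_matched(post_list *mine, int poster_rank) {
    for (size_t k = 0; k < LIST_CAPACITY; k++)
        if (atomic_load(&mine->ranks[k]) == poster_rank) return true;
    return false;
}

/* Starting side: block until the posting process has posted. */
void start_wait(post_list *mine, int poster_rank) {
    while (!start_is_matched(mine, poster_rank)) { /* spin */ }
}
```

In the slide example, posting process i would call `post_to` once for each of j1, ..., j4, and each starting process would call `start_wait` on its own list only.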

PSCW SCALABLE COMPLETE/WAIT MATCHING

Each starting process increments a counter stored at the posting process
When the counter is equal to the number of starting processes, the posting process returns from wait

[Figure: starting processes j1, ..., j4 around posting process i; the counter at i advances from 0 to 4 as the starting processes complete]
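
The counter-based complete/wait matching can be sketched the same way. Again this is a shared-memory analogue under assumptions: C11 atomics replace the remote increment, the names are hypothetical, and resetting the counter at the end of the epoch is an assumption that the slides do not state.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Counter stored at the posting process. */
typedef struct {
    atomic_int completed;     /* incremented once per starting process */
    int        num_starting;  /* number of starting processes expected */
} wait_counter;

void wait_counter_init(wait_counter *c, int num_starting) {
    atomic_init(&c->completed, 0);
    c->num_starting = num_starting;
}

/* Starting side: called once, from complete. */
void signal_complete(wait_counter *c) {
    atomic_fetch_add(&c->completed, 1);
}

/* Posting side: true once every starting process has completed. */
bool wait_may_return(wait_counter *c) {
    return atomic_load(&c->completed) == c->num_starting;
}

/* Posting side: block until all have completed, then reset the counter
 * so it can be reused (the reset is an assumption of this sketch). */
void wait_epoch(wait_counter *c) {
    while (!wait_may_return(c)) { /* spin */ }
    atomic_store(&c->completed, 0);
}
```

Each starting process performs one increment, and the posting process only polls memory that it owns.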

PSCW PERFORMANCE

Time bound:
  $\mathcal{P}_{start} = \mathcal{P}_{wait} = \mathcal{O}(1)$
  $\mathcal{P}_{post} = \mathcal{P}_{complete} = \mathcal{O}(\log p)$
Memory bound:
  $\mathcal{O}(\log p)$ (for scalable programs)

[Figure: PSCW performance plot, Ring Topology]

SCALABLE LOCK SYNCHRONIZATION

[Figure: an active process synchronizes with a passive process using Lock/Unlock (shared/exclusive) or Lock All (always shared)]

Two-level lock hierarchy:
  local: each process (Process 0, Process 1, …, Process P-1) holds a Shared Counter and an Exclusive Bit in its memory, shown as 00000 0
  global: the Master Process holds a Shared Counter and an Exclusive Counter, shown as 000 000

EXCLUSIVE LOCAL LOCK: TWO PHASES

Proc 2 wants to lock Proc 1 exclusively: MPI_Win_lock( EXCL, 1 )

PHASE 1: increment the global exclusive counter
  (Invariant 1: no global shared lock held concurrently)
  fetch-add on the global counters, shown at Process 0: 000 000 becomes 000 001
Second step in the figure: compare & swap on the local lock of Process 1: 00000 0 becomes 00000 1
  (Invariant 2: no local shared/exclusive lock held concurrently)

SHARED LOCAL LOCK: ONE PHASE

Proc 0 wants to lock Proc 1: MPI_Win_lock( SHRD, 1 )

Increment local shared counter
  (Invariant: no local exclusive lock on this process held concurrently)
  fetch-add on the local lock of Process 1: 00000 0 becomes 00001 0

SHARED GLOBAL LOCK: ONE PHASE

Proc 2 wants to lock the whole window: MPI_Win_lock_all()

Increment global shared counter
  (Invariant: no local exclusive lock is held concurrently)
  fetch-add on the global counters, shown at Process 0: 000 000 becomes 001 000
Constant number of operations for $p$ processes
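
The three lock acquisitions on the preceding slides can be sketched together as a shared-memory analogue in C11. Several things here are assumptions rather than content from the slides: the packing of each counter pair into one 64 bit word, the back-off (undo and report failure) when an invariant does not hold, the try-lock interface, and every name. C11 atomics stand in for the remote fetch-add and compare & swap.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PROCS 3  /* as in the slide example */

/* Local word: bit 0 = exclusive bit, remaining bits = shared counter. */
#define LOCAL_EXCL_BIT    UINT64_C(1)
#define LOCAL_SHARED_ONE  UINT64_C(2)
/* Global word: low 32 bits = exclusive counter, high 32 = shared counter. */
#define GLOBAL_EXCL_ONE   UINT64_C(1)
#define GLOBAL_SHARED_ONE (UINT64_C(1) << 32)
#define GLOBAL_EXCL_MASK  (GLOBAL_SHARED_ONE - 1)

typedef struct {
    _Atomic uint64_t global;            /* lives at the master process */
    _Atomic uint64_t local[NUM_PROCS];  /* one word per process        */
} lock_state;

void lock_state_init(lock_state *s) {
    atomic_init(&s->global, 0);
    for (int r = 0; r < NUM_PROCS; r++) atomic_init(&s->local[r], 0);
}

/* Exclusive local lock: two steps, undone on failure. */
bool try_lock_exclusive(lock_state *s, int target) {
    /* Step 1: increment the global exclusive counter. */
    uint64_t g = atomic_fetch_add(&s->global, GLOBAL_EXCL_ONE);
    if ((g >> 32) != 0) {               /* a global shared lock is held */
        atomic_fetch_sub(&s->global, GLOBAL_EXCL_ONE);
        return false;
    }
    /* Step 2: set the exclusive bit only if the local word is idle. */
    uint64_t idle = 0;
    if (!atomic_compare_exchange_strong(&s->local[target], &idle,
                                        LOCAL_EXCL_BIT)) {
        atomic_fetch_sub(&s->global, GLOBAL_EXCL_ONE);
        return false;
    }
    return true;
}

void unlock_exclusive(lock_state *s, int target) {
    atomic_fetch_sub(&s->local[target], LOCAL_EXCL_BIT);
    atomic_fetch_sub(&s->global, GLOBAL_EXCL_ONE);
}

/* Shared local lock: one fetch-add on the target's local word. */
bool try_lock_shared(lock_state *s, int target) {
    uint64_t l = atomic_fetch_add(&s->local[target], LOCAL_SHARED_ONE);
    if (l & LOCAL_EXCL_BIT) {           /* a local exclusive lock is held */
        atomic_fetch_sub(&s->local[target], LOCAL_SHARED_ONE);
        return false;
    }
    return true;
}

void unlock_shared(lock_state *s, int target) {
    atomic_fetch_sub(&s->local[target], LOCAL_SHARED_ONE);
}

/* Shared global lock (lock all): one fetch-add on the global word. */
bool try_lock_all(lock_state *s) {
    uint64_t g = atomic_fetch_add(&s->global, GLOBAL_SHARED_ONE);
    if ((g & GLOBAL_EXCL_MASK) != 0) {  /* some exclusive lock is held */
        atomic_fetch_sub(&s->global, GLOBAL_SHARED_ONE);
        return false;
    }
    return true;
}

void unlock_all(lock_state *s) {
    atomic_fetch_sub(&s->global, GLOBAL_SHARED_ONE);
}
```

A blocking lock would retry the corresponding try function in a loop. Note that `try_lock_all` touches one word regardless of `NUM_PROCS`, which is the property the last slide line points at.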

FLUSH SYNCHRONIZATION

Time bound: $\mathcal{O}(1)$
Memory bound: $\mathcal{O}(1)$

Guarantees remote completion
Issues a remote bulk synchronization and an x86 mfence
One of the most performance critical functions; we add only 78 x86 CPU instructions to the critical path

[Figure: Process 0 issues three inc(counter) operations on a counter at Process 1; the counter is shown as 0 before the flush and as 3 after the flush]
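
What "guarantees remote completion" means for the caller can be shown with a toy model. It is deliberately pessimistic and is not how foMPI works internally: non-blocking increments are merely recorded, and only the flush applies them. On a real network an operation may complete earlier, and the real flush is a constant-time bulk synchronization rather than a loop. The queue, its capacity, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16  /* hypothetical queue capacity */

/* Operations issued by the origin that are not yet known to be complete. */
typedef struct {
    int64_t *target[MAX_PENDING];  /* remote cell each pending inc touches */
    size_t   num_pending;
} rma_queue;

void rma_queue_init(rma_queue *q) {
    q->num_pending = 0;
}

/* Non-blocking increment: records the operation and returns at once.
 * Returns 0 on success, -1 if the queue is full. */
int inc_nb(rma_queue *q, int64_t *remote_counter) {
    if (q->num_pending == MAX_PENDING) return -1;
    q->target[q->num_pending++] = remote_counter;
    return 0;
}

/* Flush: after it returns, every recorded operation has taken effect. */
void rma_flush(rma_queue *q) {
    for (size_t k = 0; k < q->num_pending; k++) *q->target[k] += 1;
    q->num_pending = 0;
}
```

In this model the counter from the figure is only guaranteed to read 3 once `rma_flush` has returned; before that the caller may not rely on any particular value.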

PERFORMANCE

Evaluation on Blue Waters System
  22,640 computing Cray XE6 nodes
  724,480 schedulable cores
All microbenchmarks
4 applications
One nearly full-scale run

PERFORMANCE: MOTIF APPLICATIONS

[Figure, two panels: "Key/Value Store: Random Inserts per Second" and "Dynamic Sparse Data Exchange (DSDE) with 6 neighbors"]

PERFORMANCE: APPLICATIONS

[Figure, two panels: "NAS 3D FFT [1] Performance" and "MILC [2] Application Execution Time"; annotations: "scale to 512k procs", "scale to 65k procs"]

Annotations represent performance gain of foMPI over Cray MPI-1.

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12

CONCLUSIONS & SUMMARY

1. MPI window creation routines

2. Non-atomic & atomic communication

3. Fence / PSCW

4. Locks

5. foMPI reference implementation

ACKNOWLEDGMENTS

Thanks to:

Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team) and the MPI Forum RMA WG …

… and the institutions: [logos not reproduced]

Thank you for your attention

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI

Backup slides

DYNAMIC WINDOW CREATION

[Diagram sequence: Process A, Process B, and Process C, each with its own memory and a counter]

Step 1: counters 0 0 0; no addresses shown

Step 2: counters 1 0 1; addresses 0x111, 0x120

Step 3: counters 2 1 2; addresses 0x111, 0x120, 0x129, 0x129, 0x123

[Diagram: Process A accesses the window of Process B, two cases]

Case 1: Get(id) returns the cached value (cached: 2, fetched: 2): access the window

Case 2: Get(id) returns a value that differs from the cached one (the values 2 and 1 appear in the diagram): Update(list), then access the window
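The Get(id) / Update(list) check above can be sketched in plain C as a single-process model. This is only an illustration under assumptions: the names `window_state_t`, `attach`, `refresh_if_stale`, and `cached_contains` are hypothetical, the region lengths are made up, and the direct reads of `remote` stand in for what would be one-sided get operations in a real RMA implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REGIONS 8

/* One attached memory region (hypothetical layout). */
typedef struct {
    unsigned long base;
    size_t len;
} region_t;

/* State a target exposes: its region list plus an id that is bumped on
 * every change to the list. The origin keeps a cached copy of this. */
typedef struct {
    unsigned id;
    size_t count;
    region_t regions[MAX_REGIONS];
} window_state_t;

/* Target side: attach a region and advance the id. */
void attach(window_state_t *w, unsigned long base, size_t len) {
    assert(w->count < MAX_REGIONS);
    w->regions[w->count].base = base;
    w->regions[w->count].len = len;
    w->count++;
    w->id++;
}

/* Origin side: compare the cached id with the remote id (the "Get(id)"
 * step). Only when they differ is the whole list copied again (the
 * "Update(list)" step). Returns true if a refresh was needed. */
bool refresh_if_stale(window_state_t *cache, const window_state_t *remote) {
    if (cache->id == remote->id)
        return false;          /* cache is current: access directly */
    *cache = *remote;          /* stale: fetch id and region list */
    return true;
}

/* Origin side: is [addr, addr + len) inside a region known to the cache? */
bool cached_contains(const window_state_t *cache, unsigned long addr, size_t len) {
    for (size_t i = 0; i < cache->count; i++) {
        const region_t *r = &cache->regions[i];
        if (addr >= r->base && addr + len <= r->base + r->len)
            return true;
    }
    return false;
}
```

In this model the common case costs one small read of the id, and the larger list transfer happens only after the target has changed its set of attached regions.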

PERFORMANCE INTER-NODE: LATENCY (SHMEM)

[Plots: Put Inter-Node; Get Inter-Node]

[Diagram: half ping-pong between Proc 0 and Proc 1: put, then sync memory]

PERFORMANCE INTRA-NODE: LATENCY (SHMEM)

[Plot: Put/Get Intra-Node]

[Diagram: half ping-pong between Proc 0 and Proc 1: put, then sync memory]

EXCLUSIVE LOCAL LOCK: TWO PHASES

Proc 2 wants to lock Proc 1 exclusively

BACKOFF: decrement the global exclusive counter ...then retry

[Diagram: Process 0, Process 1, and Process 2 each show a lock word 00000 0; a second word is shown as 000 000 and as 000 001; the back-off is issued as Add(-1)]
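The back-off step can be illustrated with a small single-process model built on C11 atomics. It is a sketch under assumptions, not the foMPI code: the layout (one global lock with a shared and an exclusive counter, one lock word per process), the order of the two phases, and all names (`global_lock_t`, `local_lock_t`, `try_lock_exclusive`, and so on) are hypothetical, and the atomic calls stand in for remote atomic operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_EXCL (-1)

/* Assumed global lock, kept on one designated process. */
typedef struct {
    atomic_int shared_all;  /* holders of a global shared lock */
    atomic_int exclusive;   /* processes holding or attempting an exclusive lock */
} global_lock_t;

/* Assumed per-process lock word: 0 = free, >0 = shared holders,
 * LOCKED_EXCL = held exclusively. */
typedef struct {
    atomic_int word;
} local_lock_t;

void global_lock_init(global_lock_t *g) {
    atomic_init(&g->shared_all, 0);
    atomic_init(&g->exclusive, 0);
}

void local_lock_init(local_lock_t *l) {
    atomic_init(&l->word, 0);
}

/* One attempt. Phase 1 registers at the global exclusive counter, phase 2
 * claims the target's local word. On any failure the global counter is
 * decremented again (the Add(-1) back-off) and the caller may retry. */
bool try_lock_exclusive(global_lock_t *g, local_lock_t *target) {
    atomic_fetch_add(&g->exclusive, 1);
    if (atomic_load(&g->shared_all) != 0) {
        atomic_fetch_add(&g->exclusive, -1);   /* back off */
        return false;
    }
    int expected = 0;
    if (!atomic_compare_exchange_strong(&target->word, &expected, LOCKED_EXCL)) {
        atomic_fetch_add(&g->exclusive, -1);   /* back off */
        return false;
    }
    return true;
}

void unlock_exclusive(global_lock_t *g, local_lock_t *target) {
    assert(atomic_load(&target->word) == LOCKED_EXCL);
    atomic_store(&target->word, 0);
    atomic_fetch_add(&g->exclusive, -1);
}

/* Bounded retry loop: returns the number of attempts used, 0 if all failed. */
unsigned lock_exclusive(global_lock_t *g, local_lock_t *target, unsigned max_attempts) {
    for (unsigned attempt = 1; attempt <= max_attempts; attempt++)
        if (try_lock_exclusive(g, target))
            return attempt;
    return 0;
}
```

A failed attempt leaves both words exactly as it found them, which is what makes a retry safe in this model.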

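PERFORMANCE MODELING

Performance functions for synchronization protocols

Fence: $\mathcal{P}_{fence} = 2.9\,\mu\mathrm{s} \cdot \log_2(p)$

PSCW: $\mathcal{P}_{start} = 0.7\,\mu\mathrm{s}$, $\mathcal{P}_{wait} = 1.8\,\mu\mathrm{s}$, $\mathcal{P}_{post} = \mathcal{P}_{complete} = 350\,\mathrm{ns} \cdot k$

Locks: $\mathcal{P}_{lock,excl} = 5.4\,\mu\mathrm{s}$, $\mathcal{P}_{lock,shrd} = \mathcal{P}_{lock\_all} = 2.7\,\mu\mathrm{s}$, $\mathcal{P}_{unlock} = \mathcal{P}_{unlock\_all} = 0.4\,\mu\mathrm{s}$, $\mathcal{P}_{flush} = 76\,\mathrm{ns}$, $\mathcal{P}_{sync} = 17\,\mathrm{ns}$

Performance functions for communication protocols

Put/get: $\mathcal{P}_{put} = 0.16\,\mathrm{ns} \cdot s + 1\,\mu\mathrm{s}$, $\mathcal{P}_{get} = 0.17\,\mathrm{ns} \cdot s + 1.9\,\mu\mathrm{s}$

Atomics: $\mathcal{P}_{acc,sum} = 28\,\mathrm{ns} \cdot s + 2.4\,\mu\mathrm{s}$, $\mathcal{P}_{acc,min} = 0.8\,\mathrm{ns} \cdot s + 7.3\,\mu\mathrm{s}$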
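To make the performance functions easier to apply, here is a minimal C sketch that evaluates a few of them in microseconds. The constants are copied from the slide. The function names are hypothetical, and the reading of the symbols (p as the number of processes, k as the number of PSCW neighbors, s as the transfer size in bytes) is an assumption, as is rounding log2(p) up to whole steps when p is not a power of two.

```c
#include <assert.h>

/* Number of doubling steps needed to reach p. Equals log2(p) when p is a
 * power of two and rounds up otherwise (the rounding is an assumption). */
static unsigned ceil_log2(unsigned long p) {
    assert(p >= 1);
    unsigned steps = 0;
    unsigned long reach = 1;
    while (reach < p) {
        reach *= 2;
        steps++;
    }
    return steps;
}

/* Fence: 2.9 us * log2(p) */
double model_fence_us(unsigned long p) {
    return 2.9 * ceil_log2(p);
}

/* PSCW post or complete: 350 ns * k */
double model_post_us(unsigned k) {
    return 0.350 * k;
}

/* Put: 0.16 ns * s + 1 us */
double model_put_us(double s) {
    return 0.16e-3 * s + 1.0;
}

/* Get: 0.17 ns * s + 1.9 us */
double model_get_us(double s) {
    return 0.17e-3 * s + 1.9;
}

/* Accumulate (sum): 28 ns * s + 2.4 us */
double model_acc_sum_us(double s) {
    return 28e-3 * s + 2.4;
}

/* Accumulate (min): 0.8 ns * s + 7.3 us */
double model_acc_min_us(double s) {
    return 0.8e-3 * s + 7.3;
}
```

Under these assumptions, `model_fence_us(1024)` evaluates to about 29 µs, and the two accumulate models cross at roughly s = 180: below that the sum variant is cheaper, above it the min variant is.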