checkpointing the un-checkpointable: mana and the split...

49
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach Gene Cooperman * [email protected] Khoury College of Computer Sciences Northeastern University, Boston, USA August 20, 2019 * Partially supported by NSF Grants ACI-1440788 and OAC-1740218, and by grants from Intel Corporation and Mentor Graphics (a division of Siemens). Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 1 / 49

Upload: others

Post on 22-Jan-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Checkpointing the Un-checkpointable: MANA and theSplit-Process Approach

Gene Cooperman∗

[email protected]

Khoury College of Computer SciencesNortheastern University, Boston, USA

August 20, 2019

∗Partially supported by NSF Grants ACI-1440788 and OAC-1740218, and by grants from Intel Corporation and Mentor Graphics (a division ofSiemens).

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 1 / 49

Page 2: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Table of Contents

1 DMTCP — A review

2 DMTCP Plugins — A review

3 Checkpointing with Proxies: Beginnings of a New Paradigm

4 Split Processes: MANA for MPI

5 Collective Communication: A Tricky Problem for Split Processes

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 2 / 49

Page 3: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Outline

1 DMTCP — A review

2 DMTCP Plugins — A review

3 Checkpointing with Proxies: Beginnings of a New Paradigm

4 Split Processes: MANA for MPI

5 Collective Communication: A Tricky Problem for Split Processes

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 3 / 49

Page 4: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

DMTCP History

The DMTCP project and antecedents began almost 15 years ago, and iswidely used today:http://dmtcp.sourceforge.net/publications.html

Typical use case for HPC: 12-hour batch time slot, an application isexpected to finish in 18 hours.(NOTE: A resource manager such as SLURM can send a signal one hourbefore the end of the time slot, giving the application time to checkpoint;DMTCP can then transparently checkpoint the state.)

Ease of use (unprivileged and transparent):dmtcp launch a.out arg1 arg2 ...

dmtcp command --checkpoint # from other terminal or window

dmtcp restart ckpt a.out *.dmtcp

( DMTCP also works for programs that are multi-threaded, distributed,MPI-based, ....)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 4 / 49

Page 5: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

DMTCP Architecture: Coordinated Checkpointing

DMTCP

COORDINATOR

CKPT MSG

CKPT THREAD

USER PROCESS 1

SIG

US

R2

SIG

US

R2

USER THREAD B

USER THREAD A

CKPT MSG

SIG

US

R2

connectionsocket

USER THREAD C

CKPT THREAD

USER PROCESS 2

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 5 / 49

Page 6: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Fundamental Research Question for the DMTCP Team

“What are the limits of checkpointing applicationsin the real-world?”

PROBLEM: An application may store a session id, tty, network peeraddress, TMPDIR for temporary files, process id, thread id, etc.On restart, some or all of these are likely to change.

SOLUTION: Process virtualization: interpose on all calls to such ids; replaceactual id by DMTCP-defined virtual id.PRINCIPLE: “Never let the application see a real id!”

Scalability:Checkpointing HPCG: 32,752 CPU cores), and (NAMD: 16,368 CPU cores)for checkpointing MPI applications natively over InfiniBand at TACC:“System-level Scalable Checkpoint-Restart for Petascale Computing”,Jiajun Cao et al., Int. Conf. on Parallel and Dist. Sys. (ICPADS’16), 2016(To the best of our knowledge, this is 100 times larger than the previouslylargest transparent checkpointing study in the literature.)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 6 / 49

Page 7: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Outline

1 DMTCP — A review

2 DMTCP Plugins — A review

3 Checkpointing with Proxies: Beginnings of a New Paradigm

4 Split Processes: MANA for MPI

5 Collective Communication: A Tricky Problem for Split Processes

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 7 / 49

Page 8: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

DMTCP Plugins

WHY PLUGINS?Processes must talk with the rest of the world!

Process virtualization: virtualize the connections to the rest of the world

In short, a plugin is responsible for modeling an external subsystem, andthen creating a semantically equivalent construct at the time of restart.

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 8 / 49

Page 9: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

A Simple DMTCP Plugin: Virtualizing the Process Id

PRINCIPLE:The user sees only virtual pids; The kernel sees only real pids

User ProcessPID: 4000

User ProcessPID: 4001

Virt. PID Real PID

4000 26524001 3120

Translation Table

getpid()26524000

kill(4001, 9) KERNEL

4001Sending signal 9to pid 31203120

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 9 / 49

Page 10: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Plugins for EDA: A Real-world Example

EDA is “Electronic Design Automation” (circuit design for chips).Part of a four-year collaboration between DMTCP team and Intel:

“Be Kind, Rewind — Checkpoint & Restore Capability for ImprovingReliability of Large-scale Semiconductor Design”, I. Ljubuncic, R. Giri,A. Rozenfeld, and A. Goldis, IEEE HPEC-14, Sept., 2014.(published solely by Intel co-authors)

Fictional scenario with ball-park numbers (no particular vendor):Software circuit simulation: about 1 million times slowdownHardware emulation at back-end: about 1 thousand times slowdownCost of back-end hardware emulator: about $800,000Use case A (for a new CPU design): Boot Microsoft Windows overnightwith emulator, and then test Microsoft Office.Use case B: Boot Microsoft Windows overnight with emulator, andcheckpoint. In later iterations, restart, and then test Microsoft Office.

The above fictional scenario requires a DMTCP plugin to model the back-endemulator. See publications with emulator vendors for details.

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 10 / 49

Page 11: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Outline

1 DMTCP — A review

2 DMTCP Plugins — A review

3 Checkpointing with Proxies: Beginnings of a New Paradigm

4 Split Processes: MANA for MPI

5 Collective Communication: A Tricky Problem for Split Processes

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 11 / 49

Page 12: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Goal of Proxies

GOAL: To convince you of a general proxy paradigm for handlingthe “hard” checkpointing challenges that remain.

We take as testbeds two such “hard” examples:

1 Checkpoint a CUDA application running on a GPU (Problem: how tosave and restore the state of GPU hardware)(joint with Rohan Garg, Apoorve Mohan, Michael Sullivan)To appear, IEEE Cluster’18; see, also, technical report:https://arxiv.org/abs/1808.00117

2 Checkpoint MPI on NERSC/Cori supercomputer (with GNI network; nocheckpoint-restart service to support it)(a long story: coming next in Part Two of this talk)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 12 / 49

Page 13: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Proxies: The Secret Sauce

Process

GPU LIBRARY

CUDAAPPLIC.

CUDA

LIB

CUDA

LIBRARY

APPLIC.

INTERPOSE

GPU

BEFORE: AFTER:

Application Application

Process

Proxy

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 13 / 49

Page 14: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Basic Idea of a Proxy for System Services

PROBLEM: Often, system services are provided with the help of anauxiliary system that cannot be checkpointed.

EXAMPLES: GPU device hardware; MPI auxiliary process or MPIcoordinator; sshd (ssh daemon); VNC server; etc.

SOLUTION:A. Split user process into two: an application process with all of the application state; and a

proxy process communicating with hardware or software for system services.B. System service requests are passed from application process to proxy process through

inter-process communication, and pass result back.C. At checkpoint time, the proxy process is temporarily disconnected, and the application

process is checkpointed as an isolated “vanilla” Linux process. (Note: the proxy processmust be in a quiescent state (no unfinished system service tasks) at checkpoint time.)

D. At restart time, a new proxy process re-connects with the system service and with therestarted application. The application process then replays some old system servicerequests, restoring system to a state that equivalent to pre-checkpoint time.

For more on this, see the thesis of Rohan Garg(NOTE: Proxies have been used before. For example, this is the basis of an old trick for checkpointing

VNC sessions. We propose it here as a general paradigm for checkpoint-restart.)Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 14 / 49

Page 15: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Proxies: The Secret Sauce (again)

Process

GPU LIBRARY

CUDAAPPLIC.

CUDA

LIB

CUDA

LIBRARY

APPLIC.

INTERPOSE

GPU

BEFORE: AFTER:

Application Application

Process

Proxy

∗First tell them what you’re going to say; then say it;and then tell them what you said.

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 15 / 49

Page 16: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

CRUM: “Checkpoint-Restart for Unified Memory” forCUDA

Proxy-based shadow paging for CUDA UVM (Unified Virtual Memory)Shadow UVM page synchronization: Catches memory transfers betweenproxy and application through memory permissions and segfaultdetection. The difficulty for transparent checkpointing withCUDA-managed memory: How to make this efficient?See the CRUM paper (Algorithm 1, shadow-page synchronization) fordetails.

Enables fast forked checkpointing model for UVM memory that overlapswriting a checkpoint image to stable storage, while the applicationcontinues. Almost free benefit of our approach! (This was difficult in thepast due to the need to share memory among the GPU device, and theUVM-based host that was split among parent and forked childprocesses.)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 16 / 49

Page 17: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

CRUM (Checkpoint-Restart for CUDA Unified Memory):Runtime

LUD Hotspot3D Gaussian LavaMD0

20

40

60

80

Runtime (s)

Native With CRUM

a RodiniaBenchmark.

1x8 2x8 4x8Num. of MPI ranks

2

4

6

8

10

12

14

DOF/s

Overh

ead (%

)

Level-1 Level-2 Level-3

b HPGMG-FVBenchmark.

1x8 2x8 4x8Num. of MPI ranks

400

500

600

700

800

900

Runtim

e (s)

Native With CRUM

c HYPREBenchmark.

Figure: Runtime overheads for different benchmarks under CRUM.

Operating environment: Four NVIDIA Tesla P100’s per node; CUDA 8(max config: four nodes with 4 GPUs per node and 8 MPI ranks per node)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 17 / 49

Page 18: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Outline

1 DMTCP — A review

2 DMTCP Plugins — A review

3 Checkpointing with Proxies: Beginnings of a New Paradigm

4 Split Processes: MANA for MPI

5 Collective Communication: A Tricky Problem for Split Processes

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 18 / 49

Page 19: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

MANA for MPI: MPI-Agnostic Network-AgnosticTransparent Checkpointing

Full paper with details at:“MANA for MPI: MPI-Agnostic Network-Agnostic TransparentCheckpointing”, by R. Garg, G. Price, and G. Cooperman,High Performance Distributed Computing (HPDC’19)

Not just an implementation, but a flexible principle with many applications:IDEA: Load two independent programs into a single process, sharing asingle address space. (For a different approach, see “Process-in-process”,HPDC’18 from RIKEN et al., employing multiple link maps/dlmopen.)LOW OVERHEAD! This completely eliminates the overhead of theproxy approach. (No need for shared memory, cross-memory-attach(cma), XPMEN. One program can directly pass an internal pointer to theother program.)Isolate the MPI/network libraries into their own program;completely separate from the program running the MPI application.

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 19 / 49

Page 20: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

MANA for MPI: MPI-Agnostic Network-AgnosticTransparent Checkpointing (cont.)

FEATURES OF THE SPLIT-PROCESS APPROACH:Can dynamically change configuration of underlying MPI at runtime.(Why not? The MPI libraries run in a separate program, unrelated to theMPI application program.)

Can even change the choice of the underlying MPI and network(e.g., InfiniBand vs. Cray GNI) at runtime!(Why not? The MPI and network libraries run in a separate program,unrelated to the MPI application program.)

Can even migrate to a new cluster at runtime, in which the number ofCPU cores per nodes is different!(Why not? The binding of the MPI libraries to the CPU cores is part of aseparate program, unrelated to the MPI application program.)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 20 / 49

Page 21: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

PuzzleCan you solve checkpointing on...

Cray MPI over Infiniband

And restart on…

MPICH over TCP/IP

1 2

3 4

5 6

7 8

9 10

11 12

13 14

15 16

4

8

5

10

6

12

7

14

1

3

2

9

11

13

15

16

4 Nodes, 4 Cores/Ranks per Node 8 Nodes, 2 Cores/Ranks per Node

Shared Memory

Shared Memory

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 21 / 49

Page 22: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Cross-Cluster MigrationIt is now possible to checkpoint on

Cray MPI over Infiniband

And restart on…

MPICH over TCP/IP

1 2

3 4

5 6

7 8

9 10

11 12

13 14

15 16

4

8

5

10

6

12

7

14

1

3

2

9

11

13

15

16

4 Nodes, 4 Cores/Ranks per Node 8 Nodes, 2 Cores/Ranks per Node

Shared Memory

Shared Memory

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 22 / 49

Page 23: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Transparency and Agnosticism

Transparency

1. No re-compilation and no re-linking of application2. No re-compilation of MPI3. No special transport stack or drivers

Agnosticism

1. Works with any libc or Linux kernel2. Works with any MPI implementation (MPICH, CRAY MPI, etc)3. Works with any network stack (Ethernet, Infiniband, Omni-Path, etc).

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 23 / 49

Page 24: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Alas, poor transparency, I knew him Horatio...

Transparent checkpointing could die a slow, painful death.

1. Open MPI Checkpoint-Restart service (Network Agnostic; cf. Hursey et al.)○ MPI implementation provides checkpoint service to the application.

2. BLCR○ Utilizes kernel module to checkpoint local MPI ranks

3. DMTCP (MPI Agnostic)○ External program that wraps MPI for checkpointing.

These, and others, have run up against a wall:

MAINTENANCE

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 24 / 49

Page 25: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

The M x N maintenance penalty

MPI:

● MPICH● OPEN MPI● LAM-MPI● CRAY MPI● HP MPI● IBM MPI● SGI MPI● MPI-BIP● POWER-MPI● ….

Interconnect:

● Ethernet● InfiniBand● InfiniBand + Mellanox● Cray GNI● Intel Omni-path● libfabric● System V Shared Memory● 115200 baud serial● Carrier Pigeon● ….

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 25 / 49

Page 26: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

The M x N maintenance penalty

MPI:

● MPICH● OPEN-MPI● LAM-MPI● CRAY MPI● HP MPI● IBM MPI● SGI MPI● MPI-BIP● POWER-MPI● ….

Interconnect:

● Ethernet● InfiniBand● InfiniBand + Mellanox● Cray GNI● Intel Omni-path● libfabric● System V Shared Memory● 115200 baud serial● Carrier Pigeon● ….

Network Agnostic

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 26 / 49

Page 27: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

The M x N maintenance penalty

MPI:

● MPICH● OPEN-MPI● LAM-MPI● CRAY MPI● HP MPI● IBM MPI● SGI MPI● MPI-BIP● POWER-MPI● ….

Interconnect:

● Ethernet● InfiniBand● InfiniBand + Mellanox● Cray GNI● Intel Omni-path● libfabric● System V Shared Memory● 115200 baud serial● Carrier Pigeon● ….

MPI and Network Agnostic

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 27 / 49

Page 28: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

The problem stems from checkpointing both the MPI coordinator and the MPI lib.

MANA: MPI-Agnostic, Network-Agnostic

MPI Coordinator

MPI Rank

MPI Rank

MPI Rank

MPI Rank

Node 1 Node 2

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 28 / 49

Page 29: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

The problem stems from checkpointing MPI - both the coordinator and the library.

Connections

Groups

Communicators

Link State

MANA: MPI-Agnostic, Network-Agnostic

MPI Coordinator

MPI Rank

MPI Rank

MPI Rank

MPI Rank

Node 1 Node 2

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 29 / 49

Page 30: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 1: Drain the Network

Achieving Agnosticism

MPI Coordinator

MPI Rank

MPI Rank

MPI Rank

MPI Rank

Node 2Node 1

Chandy-LamportAlgorithm

As demonstrated by Hursey et al., abstracting by “MPI Messages” allows for Network Agnosticism.

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 30 / 49

Page 31: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Checkpointing Collective Operations

Solution: Two-phase collectives

1. Preface all collectives with a trivial barrier2. When the trivial barrier is completed, call the original collective

Rank 1

Rank 2

Rank 3

Inside Barrier

Inside Barrier

Straggler

Trivial Barrier Collective

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 31 / 49

Page 32: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Checkpointing Collective Operations

Solution: Two-phase collectives

1. Preface all collectives with a trivial barrier2. When the trivial barrier is completed, call the original collective

Rank 1

Rank 2

Rank 3

Trivial Barrier CollectiveCollectiveComplete

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 32 / 49

Page 33: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 2: Discard the network

Achieving Agnosticism

MPI Coordinator

MPI Rank

MPI Rank

MPI Rank

MPI Rank

Node 2Node 1

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 33 / 49

Page 34: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Problems:

● MPI Implementation Specific● Contains MPI network state

Solution: IsolationCheckpointing the rank is simpler… right?

Checkpointing A Rank

MPI Rank

MPI Application

MPI Library

● Required by MPI and Application● Platform dependant

● Grouping information● Opaque MPI Objects

● Heap AllocationsLIBC

Network Libraries

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 34 / 49

Page 35: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

MPI Application

MPI Library

MPI Proxy LibraryMPI Library

Terminology

Isolation - The “Split-Process” Approach

Upper-Half program Checkpoint and Restore

Lower-Half program Discard and Re-initialize

Single Memory Space

Standard C Calling ConventionsNo RPC involved

LIBC

Network Libraries

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 35 / 49

Page 36: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Upper Half:Persistent Data

Lower HalfEphemeral Data

MPI Agnosticism Achieved

MPI Application

Config and Drain Info

LIBC

Lower half data can be replaced by new and different implementations of MPI and related libraries.

*Special care must be taken when replacing upper half libraries

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 36 / 49

Page 37: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 1: Drain the Network

Checkpoint Process

MPI Coordinator

MPI Rank

MPI Rank

MPI Rank

MPI Rank

Node 2Node 1

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 37 / 49

Page 38: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 1: Drain the NetworkStep 2: Checkpoint Upper-Half

Checkpoint Process

MPI Application

Config and Drain Info

LIBC

MPI Rank

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 38 / 49

Page 39: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 1: Restore Lower-Half

MPI Library

MPI Proxy Library

Restart Process

Lower-half components may be replacedLIBC

Network Libraries

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 39 / 49

Page 40: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 1: Restore Lower-HalfStep 2: Re-initialize MPI

Restart Process

● MPI_INIT● Replay Configuration

Naturally Optimized

MPI Library

MPI Proxy Library

Lower-half components may be replacedLIBC

Network Libraries

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 40 / 49

Page 41: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Step 1: Restore Lower-HalfStep 2: Re-initialize MPIStep 3: Restore Upper-Half

MPI Library

MPI Proxy Library

LIBC

Restart Process

MPI Application

Config and Drain Info

LIBC

MPI Rank

● MPI_INIT● Replay Configuration

Naturally Optimized

MPI Rank # assigned by MPI_Init used to select checkpoint file for restoring the upper half.

This avoids the need to virtualize MPI Rank numbers. Lower-half components may be replaced

Network Libraries

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 41 / 49

Page 42: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

PuzzleCan you solve checkpointing on...

Cray MPI over Infiniband

And restart on…

MPICH over TCP/IP

1 2

3 4

5 6

7 8

9 10

11 12

13 14

15 16

4

8

5

10

6

12

7

14

1

3

2

9

11

13

15

16

4 Nodes, 4 Cores/Ranks per Node 8 Nodes, 2 Cores/Ranks per Node

YESGene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 42 / 49

Page 43: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Checkpoint-Restart Overhead

Checkpoint Data Size

● GROMACS - 64 Ranks over 2 Nodes: 5.9GB (and 0.6% runtime overhead)● HPCG - 2048 ranks over 64 nodes: 4TB (and nearly 0% runtime overhead)● Largely dominated by memory used by benchmark program.

Checkpoint Time

● Largely dominated by disk-write time● “Stragglers” - a single rank takes much longer to checkpoint than others.

Restart Time

● MPI state reconstruction represented < 10% of total restart time.

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 43 / 49

Page 44: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

NEW: Cross-Cluster MPI Application Migration

Traditionally, migration across disparate clusters was not feasible.

● Different MPI packages across clusters● Highly optimized configurations tied to local cluster (Caches, Cores/Node)● Overhead of checkpointing entire MPI state is prohibitive

Overhead of migrating under MANA:

● 1.6% runtime overhead after migration.*

* Linux kernel’s upcoming patch https://lwn.net/Articles/769355/ reduces overhead to 0.6%

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 44 / 49

Page 45: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Outline

1 DMTCP — A review

2 DMTCP Plugins — A review

3 Checkpointing with Proxies: Beginnings of a New Paradigm

4 Split Processes: MANA for MPI

5 Collective Communication: A Tricky Problem for Split Processes

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 45 / 49

Page 46: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Collective Communication: Can it really work?

1 If we haven’t truly begun . . .Maybe some processes have not completed their current computationtask. Maybe we’ll wait a long time before they reach the collectivecommunication. Maybe we should abort the collective communication,and checkpoint immediately.(Aborting the collective communication should be safe. We hope that thelower-half MPI didn’t start writing yet into the upper user buffers thatwere passed as arguments.)

2 But maybe we have begun and we’re in the middle of it . . .Maybe all processes have already reached the collective communication,and some of them have begun write persistent changes into the upperhalf user buffers that were passed in. It’s dangerous to abort if we’vewritten to the user’s buffer. So, maybe we should finish the collectivecommunication if we’ve truly begun it. We can checkpoint after that.

What to do??? (I’m so confused. )Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 46 / 49

Page 47: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Checkpointing Collective Operations

Solution: Two-phase collectives

1. Preface all collectives with a trivial barrier2. When the trivial barrier is completed, call the original collective

Rank 1

Rank 2

Rank 3

Trivial Barrier CollectiveCollectiveComplete

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 47 / 49

Page 48: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Collective Communication: How can it work?

1 Maybe some processes are still in the trivial barrier. But then noprocesses are in the actual barrier invoked by the application.

Solution: At restart time, discard any operations in the trivial barrrierfrom the lower half, and restart the trivial barrier in the new (restarted)lower half MPI library.

2 Maybe some processes are still in the actual barrier invoked by theapplication. But then no processes are in the trivial barrier.

Solution: We know that all processes must have reached the collectivecommunication, since they have passed through the trivial barrier. So, justwait until they finish the actual collective communication invoked by theapplication.There will be no delay. All processes have entered the collectivecommunication!

(But see the paper for the formal details:)“MANA for MPI: MPI-Agnostic Network-Agnostic TransparentCheckpointing”, by R. Garg, G. Price, and G. Cooperman,High Performance Distributed Computing (HPDC’19)

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 48 / 49

Page 49: Checkpointing the Un-checkpointable: MANA and the Split …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2019. 8. 23. · Checkpointing the Un-checkpointable: MANA and the

Questions?

THANKS TO THE MANY STUDENTS AND OTHERSWHO HAVE CONTRIBUTED TO DMTCP OVER THEYEARS:Jason Ansel, Kapil Arya, Alex Brick, Jiajun Cao, Tyler Denniston, Xin Dong,William Enright, Rohan Garg, Paul Grosu, Twinkle Jain, Samaneh Kazemi,Jay Kim, Gregory Kerr, Apoorve Mohan, Mark Mossberg, Gregory Price,Manuel Rodrı́guez Pascual, Artem Y. Polyakov, Michael Rieker,Praveen S. Solanki, Ana-Maria Visan

QUESTIONS?

Gene Cooperman Checkpointing: MANA, Split-Process Approach August 20, 2019 49 / 49