Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner
HP Labs
§: Currently at Meiosys, Inc.
2
Broad Opportunity for Checkpoint-Restart in Server Management
• Fault tolerance (minimize unplanned downtime)
  − Recover by restarting from checkpoint
• Minimize planned downtime
  − Migrate application before hardware/OS maintenance
• Resource management
  − Manage resource allocation in shared computing environments by migrating applications
3
Need for General-Purpose Checkpoint-Restart
• Existing checkpoint-restart methods are too limited:
  − No support for many OS resources that commercial applications use (e.g., sockets)
  − Limited to applications using specific libraries
  − Require application source and recompilation
  − Require use of specialized operating systems
• Need a practical checkpoint-restart mechanism capable of supporting a broad class of applications
4
Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux
• Application-transparent: supports applications without modifications or recompilation
• Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)
  − Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state
• Supported on unmodified Linux base kernel – checkpoint-restart integrated via a kernel module
5
Cruz Overview
Builds on Columbia Univ.'s Zap process migration

Our Key Extensions
• Support for migrating networked applications, transparent to communicating peers
  − Enables a role in managing servers running commercial applications (e.g., databases)
• General method for checkpoint-restart of TCP/IP-based distributed applications
  − Also enables efficiencies compared to library-specific approaches
6
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
7
Zap (Background)
• Process migration mechanism
  − Kernel module implementation
• Virtualization layer groups processes into pods with a private virtual name space
  − Intercepts system calls to expose only virtual identifiers (e.g., vpid) (see the sketch below)
  − Preserves resource names and dependencies across migration
• Mechanism to checkpoint and restart pods
  − User and kernel-level state
  − Primarily uses system call handlers
  − File system not saved or restored (assumes a network file system)
[Figure: Zap architecture – applications run inside pods on top of the Zap virtualization layer, which sits above Linux and intercepts Linux system calls]
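The identifier-virtualization idea can be illustrated with a small, self-contained sketch. The names and data structures below are hypothetical and are not Zap's actual implementation; a real interposition layer would keep such a table per pod and consult it in every intercepted system call that takes or returns a pid.

/*
 * Hypothetical sketch (not Zap's actual code): how a virtualization layer
 * can translate between virtual and real process identifiers so that a pod
 * sees only stable virtual identifiers across migration.
 */
#include <stddef.h>
#include <sys/types.h>

#define MAX_POD_PROCS 1024

struct pid_map {
    pid_t vpid;   /* identifier exposed to processes inside the pod */
    pid_t rpid;   /* real pid assigned by the host kernel           */
};

static struct pid_map pod_pid_table[MAX_POD_PROCS];
static size_t pod_pid_count;

/* Record a mapping when a process is created (or re-created at restart). */
static void pod_map_pid(pid_t vpid, pid_t rpid)
{
    pod_pid_table[pod_pid_count].vpid = vpid;
    pod_pid_table[pod_pid_count].rpid = rpid;
    pod_pid_count++;
}

/* Translate a real pid into the virtual pid the pod expects, e.g., in an
 * intercepted getpid() or wait() return path. */
static pid_t pod_rpid_to_vpid(pid_t rpid)
{
    for (size_t i = 0; i < pod_pid_count; i++)
        if (pod_pid_table[i].rpid == rpid)
            return pod_pid_table[i].vpid;
    return -1; /* not a process in this pod */
}

At restart, the table is repopulated with the pod's new real pids while the virtual pids recorded in the checkpoint are preserved, so relationships such as parent-child dependencies remain valid inside the pod.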
8
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
9
Migrating Networked Applications
• Migration must be transparent to remote peers to be useful in server management scenarios
  − Peers, including unmodified clients, must not perceive any change in the IP address of the application
  − Communication state of live connections must be preserved
• No prior solution for these (including original Zap)
• Our Solution:
  − Provide a unique IP address to each pod that persists across migration
  − Checkpoint and restore the socket control state and socket data buffer state of all live sockets
10
Network Address Migration
• Pod attached to a virtual interface with its own IP & MAC address
  − Implemented using Linux's virtual interfaces (VIFs)
• IP address assigned statically or through a DHCP client running inside the pod (using the pod's MAC address)
• Intercept bind() & connect() to ensure pod processes use the pod's IP address (see the sketch below)
• Migration: delete the VIF on the source host & create it on the new host
  − Migration limited to the subnet
[Figure: DHCP-based pod address assignment – the pod is attached to virtual interface eth0:1 on the host's eth0 (IP-1, MAC-h1); the DHCP client inside the pod (1) queries the pod's MAC address via ioctl(), (2) receives MAC-p1, (3) sends dhcprequest(MAC-p1) to the DHCP server on the network, and (4) receives dhcpack(IP-p1) assigning the pod's IP address]
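The bind()/connect() interposition can be sketched as follows. This is an illustrative user-level wrapper rather than Cruz's kernel-module implementation; pod_ip and the function name are hypothetical. The same idea applies to connect(): the socket can be bound to the pod's IP before the connection is initiated, so peers always see the pod's address.

/*
 * Hypothetical sketch of the bind() interposition idea (not Cruz's actual
 * kernel code): if a pod process binds to the wildcard address, rewrite
 * the request so it binds to the pod's own IP address (the address of the
 * pod's virtual interface). A fuller version would also rewrite binds to
 * the host's address.
 */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

static struct in_addr pod_ip;   /* the pod's persistent IP, e.g., from DHCP */

static int pod_bind(int sockfd, const struct sockaddr *addr, socklen_t len)
{
    struct sockaddr_in rewritten;

    if (addr->sa_family == AF_INET && len >= sizeof(rewritten)) {
        memcpy(&rewritten, addr, sizeof(rewritten));
        /* Force the address seen by peers to be the pod's IP, so that
         * connections remain valid when the pod migrates to another host. */
        if (rewritten.sin_addr.s_addr == htonl(INADDR_ANY))
            rewritten.sin_addr = pod_ip;
        return bind(sockfd, (struct sockaddr *)&rewritten, sizeof(rewritten));
    }
    return bind(sockfd, addr, len);
}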
11
Communication State Checkpoint and Restore
Communication state:
  − Control: socket data structure, TCP connection state
  − Data: contents of send and receive socket buffers
Challenges in communication state checkpoint and restore:
• Network stack will continue to execute even after application processes are stopped
• No system call interface to read or write control state
• No system call interface to read send socket buffers
• No system call interface to write receive socket buffers
• Consistency of control state and socket buffer state
12
Communication State Checkpoint
• Acquire network stack locks to freeze TCP processing
• Save receive buffers using the socket receive system call in peek mode (see the sketch below)
• Save send buffers by walking kernel structures
• Copy control state from kernel structures
• Modify two sequence numbers in the saved state to reflect empty socket buffers
  − Indicate current send buffers not yet written by the application
  − Indicate current receive buffers all consumed by the application
[Figure: checkpoint of the state for one socket – the live communication state holds receive buffers (Rh..Rt, bounded by copied_seq and rcv_nxt) and send buffers (Sh..St, bounded by snd_una and write_seq) plus control state (timers, options, etc.). The receive buffers are saved with receive() in peek mode; the send buffers and control state are saved by direct access to kernel structures. In the saved control state the receive-side sequence numbers are set to Rt+1 (all received data consumed) and the send-side sequence numbers to Sh (no send data written). The checkpoint does not change the live communication state.]
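Of these steps, only the receive-buffer capture maps onto an ordinary system call; the sketch below illustrates it under the assumption that the network stack has already been frozen. The function name is hypothetical, and the send buffers and control state would still require direct access to kernel structures as described above.

/*
 * Hypothetical sketch (not Cruz's kernel code): copy a socket's queued
 * receive data without consuming it, using the receive call in peek mode.
 */
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns a malloc'd copy of the bytes currently queued for receive on
 * 'fd' and stores the length in '*out_len'; returns NULL on failure. */
static char *peek_receive_queue(int fd, size_t *out_len)
{
    size_t cap = 65536;
    char *buf = malloc(cap);
    ssize_t n;

    if (buf == NULL)
        return NULL;
    for (;;) {
        /* MSG_PEEK leaves the data in the queue; MSG_DONTWAIT returns
         * immediately if the queue is empty. */
        n = recv(fd, buf, cap, MSG_PEEK | MSG_DONTWAIT);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            n = 0;
        if (n < 0) {
            free(buf);
            return NULL;
        }
        if ((size_t)n < cap)
            break;                       /* whole queue fit in the buffer */
        /* The queue may hold more than 'cap' bytes: retry with more room. */
        char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }
    *out_len = (size_t)n;
    return buf;
}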
13
Communication State Restore
• Create a new socket
• Copy control state in the checkpoint to the socket structure
• Restore checkpointed send buffer data using the socket write call
• Deliver checkpointed receive buffer data to the application on demand (see the sketch below)
  − Copy checkpointed receive buffer data to a special buffer
  − Intercept the receive system call to deliver data from the special buffer until it is emptied
[Figure: restore of the state for one socket – the checkpointed control state (timers, options, sequence numbers) is copied into the new socket structure by direct update; the checkpointed send buffer data (Sh..St) is rewritten through the socket write() call; the checkpointed receive buffer data (Rh..Rt) is staged in a special buffer and delivered to the application by the intercepted receive system call]
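The on-demand delivery of checkpointed receive data can be sketched as below. The structure and function names are hypothetical; in Cruz the interception happens in the kernel module's system call handlers, but the logic is the same: serve the staged checkpoint data first, then fall through to the live socket.

/*
 * Hypothetical sketch (not Cruz's kernel code) of the restore-side receive
 * interception: data that was queued at checkpoint time is staged in a
 * per-socket buffer, and an intercepted receive drains that buffer before
 * falling through to the real recv().
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

struct staged_recv {
    char   *data;   /* checkpointed receive-queue contents */
    size_t  len;    /* total staged bytes                  */
    size_t  off;    /* next staged byte to deliver         */
};

static ssize_t pod_recv(int fd, struct staged_recv *st,
                        void *buf, size_t len, int flags)
{
    /* First deliver any checkpointed data the application has not yet read. */
    if (st != NULL && st->off < st->len) {
        size_t n = st->len - st->off;
        if (n > len)
            n = len;
        memcpy(buf, st->data + st->off, n);
        st->off += n;
        return (ssize_t)n;
    }
    /* Staging buffer empty: forward to the normal receive path. */
    return recv(fd, buf, len, flags);
}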
14
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
15
Checkpoint-Restart of Distributed Applications
• State of processes and messages in the channel must be checkpointed and restored consistently
• Prior approaches are specific to a particular library – e.g., modify the library to capture and restore messages in the channel
• Cruz preserves the TCP connection state and IP addresses of each pod, implicitly preserving global communication state
  − Transparently supports TCP/IP-based distributed applications
  − Enables efficiencies compared to library-based implementations
[Figure: distributed application – on each node, processes communicate through a library layered over TCP/IP; a consistent checkpoint must capture both the processes on every node and the messages in the communication channel between them]
16
Checkpoint-Restart of Distributed Applications in Cruz
• Global communication state saved and restored by saving and restoring the TCP communication state of each pod
  − Messages in flight need not be saved, since the saved TCP state will trigger retransmission of these messages at restart
  − Eliminates the O(N²) step of flushing channels to capture messages in flight
  − Eliminates the need to re-establish connections at restart
• Preserving the pod's IP address across restart eliminates the need to re-discover process locations in the library at restart
[Figure: the same distributed application with each node's processes running inside a pod – Cruz checkpoints the TCP/IP state of each pod, so the library-level communication state is preserved implicitly]
17
Consistent Checkpoint Algorithm in Cruz (Illustrative)
• Algorithm has O(N) complexity (blocking algorithm shown for simplicity; see the sketch below)
• Can be extended to improve robustness and performance, e.g.:
  − Tolerate Agent & Coordinator failures
  − Overlap computation and checkpointing using copy-on-write
  − Allow nodes to continue without blocking for all nodes to complete the checkpoint
  − Reduce checkpoint size with incremental checkpoints
[Figure: coordinated checkpoint protocol – a Coordinator node and an Agent on each application node (each node runs a pod, with the library layered over TCP/IP):
1. Coordinator sends <checkpoint> to every Agent
2. Each Agent disables pod communication (using netfilter rules in Linux) and saves the pod state
3. Each Agent replies <done>
4. Coordinator sends <continue> to every Agent
5. Each Agent re-enables pod communication and resumes the pod
6. Each Agent replies <continue-done>]
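A minimal sketch of the blocking coordination shown in the figure is given below. It is illustrative only: send_msg() and wait_msg() stand in for whatever reliable transport connects the coordinator to the agents, and the message names simply mirror the protocol above.

/*
 * Hypothetical sketch (not Cruz's actual code) of the blocking O(N)
 * coordinated checkpoint. send_msg()/wait_msg() are assumed transport
 * primitives supplied elsewhere.
 */
#include <stddef.h>

enum msg { MSG_CHECKPOINT, MSG_DONE, MSG_CONTINUE, MSG_CONTINUE_DONE };

void send_msg(int agent, enum msg m);   /* assumed: send one message      */
enum msg wait_msg(int agent);           /* assumed: block for agent reply */

/* Coordinator: one round trip to quiesce and save, one round trip to resume. */
void coordinate_checkpoint(const int *agents, size_t n)
{
    size_t i;

    /* Phase 1: each agent disables pod communication and saves pod state. */
    for (i = 0; i < n; i++)
        send_msg(agents[i], MSG_CHECKPOINT);
    for (i = 0; i < n; i++)
        while (wait_msg(agents[i]) != MSG_DONE)
            ;                            /* block until <done> */

    /* Phase 2: each agent re-enables communication and resumes its pod. */
    for (i = 0; i < n; i++)
        send_msg(agents[i], MSG_CONTINUE);
    for (i = 0; i < n; i++)
        while (wait_msg(agents[i]) != MSG_CONTINUE_DONE)
            ;                            /* block until <continue-done> */
}

Each agent exchanges only a constant number of messages with the coordinator, which is what makes the protocol O(N) in the number of nodes, in contrast to flushing every pairwise channel, which is O(N²).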
18
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
19
Evaluation
• Cruz implemented for Linux 2.4.x on x86
• Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark
• Cruz incurs negligible runtime overhead (less than 0.5%)
• Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable
20
Performance Result – Negligible Coordination Overhead
• Checkpoint behavior for a Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes
• Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests the scheme is scalable
  − Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
21
Related Work
• MetaCluster product from Meiosys
  − Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)
• Berkeley Lab Checkpoint/Restart (BLCR)
  − Kernel-module based checkpoint-restart for a single node
  − No identifier virtualization – restart will fail in the event of an identifier (e.g., pid) conflict
  − No support for handling communication state – relies on application or library changes
• MPVM, CoCheck, LAM-MPI
  − Library-specific implementations of parallel application checkpoint-restart, with the disadvantages described earlier
22
Future Work
Many areas for future work, e.g.:
• Improve portability across kernel versions by minimizing direct access to kernel structures
  − Recommend additional kernel interfaces when advantageous (e.g., for accessing socket attributes)
• Implement performance optimizations to the coordinated checkpoint-restart algorithm
  − Evaluate performance on a wide range of applications and cluster configurations
• Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)
23
Summary
• Cruz, a practical checkpoint-restart system for Linux
  − No change to applications or to the base OS kernel needed
• Novel mechanisms to support checkpoint-restart of a broader class of applications
  − Migrating networked applications transparently to communicating peers
  − Consistent checkpoint-restart of general TCP/IP-based distributed applications
• Cruz's broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
http://www.hpl.hp.com/research/dca