Cruz:Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner
HP Labs
§: Currently at Meiosys, Inc.
2
Broad Opportunity for Checkpoint-Restart in Server Management
• Fault tolerance (minimize unplanned downtime)
  − Recover by restarting from checkpoint
• Minimize planned downtime
  − Migrate application before hardware/OS maintenance
• Resource management
  − Manage resource allocation in shared computing environments by migrating applications
3
Need for General-Purpose Checkpoint-Restart
• Existing checkpoint-restart methods are too limited:
  − No support for many OS resources that commercial applications use (e.g., sockets)
  − Limited to applications using specific libraries
  − Require application source and recompilation
  − Require use of specialized operating systems
• Need a practical checkpoint-restart mechanism capable of supporting a broad class of applications
4
Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux
• Application-transparent: supports applications without modifications or recompilation
• Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)
  − Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state
• Supported on unmodified Linux base kernel – checkpoint-restart integrated via a kernel module
5
Cruz Overview
Builds on Columbia Univ.'s Zap process migration

Our Key Extensions
• Support for migrating networked applications, transparent to communicating peers
  − Enables a role in managing servers running commercial applications (e.g., databases)
• General method for checkpoint-restart of TCP/IP-based distributed applications
  − Also enables efficiencies compared to library-specific approaches
6
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
7
Zap (Background)
• Process migration mechanism
  − Kernel module implementation
• Virtualization layer groups processes into pods with a private virtual name space
  − Intercepts system calls to expose only virtual identifiers (e.g., vpid) (see the sketch below)
  − Preserves resource names and dependencies across migration
• Mechanism to checkpoint and restart pods
  − User and kernel-level state
  − Primarily uses system call handlers
  − File system not saved or restored (assumes a network file system)
[Figure: Zap architecture – applications run inside pods on top of the Zap virtualization layer, which sits above Linux and intercepts Linux system calls]
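The identifier-virtualization idea can be illustrated with a small, self-contained sketch. The names and data structures below are hypothetical and are not Zap's actual implementation; a real interposition layer would keep such a table per pod and consult it in every intercepted system call that takes or returns a pid.

/*
 * Hypothetical sketch (not Zap's actual code): how a virtualization layer
 * can translate between virtual and real process identifiers so that a pod
 * sees only stable virtual identifiers across migration.
 */
#include <stddef.h>
#include <sys/types.h>

#define MAX_POD_PROCS 1024

struct pid_map {
    pid_t vpid;   /* identifier exposed to processes inside the pod */
    pid_t rpid;   /* real pid assigned by the host kernel           */
};

static struct pid_map pod_pid_table[MAX_POD_PROCS];
static size_t pod_pid_count;

/* Record a mapping when a process is created (or re-created at restart). */
static void pod_map_pid(pid_t vpid, pid_t rpid)
{
    pod_pid_table[pod_pid_count].vpid = vpid;
    pod_pid_table[pod_pid_count].rpid = rpid;
    pod_pid_count++;
}

/* Translate a real pid into the virtual pid the pod expects, e.g., in an
 * intercepted getpid() or wait() return path. */
static pid_t pod_rpid_to_vpid(pid_t rpid)
{
    for (size_t i = 0; i < pod_pid_count; i++)
        if (pod_pid_table[i].rpid == rpid)
            return pod_pid_table[i].vpid;
    return -1; /* not a process in this pod */
}

At restart, the table is repopulated with the pod's new real pids while the virtual pids recorded in the checkpoint are preserved, so relationships such as parent-child dependencies remain valid inside the pod.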
8
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
9
Migrating Networked Applications
• Migration must be transparent to remote peers to be useful in server management scenarios
  − Peers, including unmodified clients, must not perceive any change in the IP address of the application
  − Communication state of live connections must be preserved
• No prior solution for these (including original Zap)
• Our Solution:
  − Provide a unique IP address to each pod that persists across migration
  − Checkpoint and restore the socket control state and socket data buffer state of all live sockets
10
Network Address Migration
• Pod attached to a virtual interface with its own IP & MAC address
  − Implemented using Linux's virtual interfaces (VIFs)
• IP address assigned statically or through a DHCP client running inside the pod (using the pod's MAC address)
• Intercept bind() & connect() to ensure pod processes use the pod's IP address (see the sketch below)
• Migration: delete the VIF on the source host & create it on the new host
  − Migration limited to the subnet
[Figure: DHCP-based pod address assignment – the pod is attached to virtual interface eth0:1 on the host's eth0 (IP-1, MAC-h1); the DHCP client inside the pod (1) queries the pod's MAC address via ioctl(), (2) receives MAC-p1, (3) sends dhcprequest(MAC-p1) to the DHCP server on the network, and (4) receives dhcpack(IP-p1) assigning the pod's IP address]
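The bind()/connect() interposition can be sketched as follows. This is an illustrative user-level wrapper rather than Cruz's kernel-module implementation; pod_ip and the function name are hypothetical. The same idea applies to connect(): the socket can be bound to the pod's IP before the connection is initiated, so peers always see the pod's address.

/*
 * Hypothetical sketch of the bind() interposition idea (not Cruz's actual
 * kernel code): if a pod process binds to the wildcard address, rewrite
 * the request so it binds to the pod's own IP address (the address of the
 * pod's virtual interface). A fuller version would also rewrite binds to
 * the host's address.
 */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

static struct in_addr pod_ip;   /* the pod's persistent IP, e.g., from DHCP */

static int pod_bind(int sockfd, const struct sockaddr *addr, socklen_t len)
{
    struct sockaddr_in rewritten;

    if (addr->sa_family == AF_INET && len >= sizeof(rewritten)) {
        memcpy(&rewritten, addr, sizeof(rewritten));
        /* Force the address seen by peers to be the pod's IP, so that
         * connections remain valid when the pod migrates to another host. */
        if (rewritten.sin_addr.s_addr == htonl(INADDR_ANY))
            rewritten.sin_addr = pod_ip;
        return bind(sockfd, (struct sockaddr *)&rewritten, sizeof(rewritten));
    }
    return bind(sockfd, addr, len);
}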
11
Communication State Checkpoint and Restore
Communication state:
  − Control: socket data structure, TCP connection state
  − Data: contents of send and receive socket buffers
Challenges in communication state checkpoint and restore:
• Network stack will continue to execute even after application processes are stopped
• No system call interface to read or write control state
• No system call interface to read send socket buffers
• No system call interface to write receive socket buffers
• Consistency of control state and socket buffer state
12
Communication State Checkpoint
• Acquire network stack locks to freeze TCP processing
• Save receive buffers using the socket receive system call in peek mode (see the sketch below)
• Save send buffers by walking kernel structures
• Copy control state from kernel structures
• Modify two sequence numbers in the saved state to reflect empty socket buffers
  − Indicate current send buffers not yet written by the application
  − Indicate current receive buffers all consumed by the application
[Figure: checkpoint of the state for one socket – the live communication state holds receive buffers (Rh..Rt, bounded by copied_seq and rcv_nxt) and send buffers (Sh..St, bounded by snd_una and write_seq) plus control state (timers, options, etc.). The receive buffers are saved with receive() in peek mode; the send buffers and control state are saved by direct access to kernel structures. In the saved control state the receive-side sequence numbers are set to Rt+1 (all received data consumed) and the send-side sequence numbers to Sh (no send data written). The checkpoint does not change the live communication state.]
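Of these steps, only the receive-buffer capture maps onto an ordinary system call; the sketch below illustrates it under the assumption that the network stack has already been frozen. The function name is hypothetical, and the send buffers and control state would still require direct access to kernel structures as described above.

/*
 * Hypothetical sketch (not Cruz's kernel code): copy a socket's queued
 * receive data without consuming it, using the receive call in peek mode.
 */
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns a malloc'd copy of the bytes currently queued for receive on
 * 'fd' and stores the length in '*out_len'; returns NULL on failure. */
static char *peek_receive_queue(int fd, size_t *out_len)
{
    size_t cap = 65536;
    char *buf = malloc(cap);
    ssize_t n;

    if (buf == NULL)
        return NULL;
    for (;;) {
        /* MSG_PEEK leaves the data in the queue; MSG_DONTWAIT returns
         * immediately if the queue is empty. */
        n = recv(fd, buf, cap, MSG_PEEK | MSG_DONTWAIT);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            n = 0;
        if (n < 0) {
            free(buf);
            return NULL;
        }
        if ((size_t)n < cap)
            break;                       /* whole queue fit in the buffer */
        /* The queue may hold more than 'cap' bytes: retry with more room. */
        char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }
    *out_len = (size_t)n;
    return buf;
}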
13
Communication State Restore
• Create a new socket
• Copy control state in the checkpoint to the socket structure
• Restore checkpointed send buffer data using the socket write call
• Deliver checkpointed receive buffer data to the application on demand (see the sketch below)
  − Copy checkpointed receive buffer data to a special buffer
  − Intercept the receive system call to deliver data from the special buffer until it is emptied
[Figure: restore of the state for one socket – the checkpointed control state (timers, options, sequence numbers) is copied into the new socket structure by direct update; the checkpointed send buffer data (Sh..St) is rewritten through the socket write() call; the checkpointed receive buffer data (Rh..Rt) is staged in a special buffer and delivered to the application by the intercepted receive system call]
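The on-demand delivery of checkpointed receive data can be sketched as below. The structure and function names are hypothetical; in Cruz the interception happens in the kernel module's system call handlers, but the logic is the same: serve the staged checkpoint data first, then fall through to the live socket.

/*
 * Hypothetical sketch (not Cruz's kernel code) of the restore-side receive
 * interception: data that was queued at checkpoint time is staged in a
 * per-socket buffer, and an intercepted receive drains that buffer before
 * falling through to the real recv().
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

struct staged_recv {
    char   *data;   /* checkpointed receive-queue contents */
    size_t  len;    /* total staged bytes                  */
    size_t  off;    /* next staged byte to deliver         */
};

static ssize_t pod_recv(int fd, struct staged_recv *st,
                        void *buf, size_t len, int flags)
{
    /* First deliver any checkpointed data the application has not yet read. */
    if (st != NULL && st->off < st->len) {
        size_t n = st->len - st->off;
        if (n > len)
            n = len;
        memcpy(buf, st->data + st->off, n);
        st->off += n;
        return (ssize_t)n;
    }
    /* Staging buffer empty: forward to the normal receive path. */
    return recv(fd, buf, len, flags);
}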
14
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
15
Checkpoint-Restart of Distributed Applications
• State of processes and messages in the channel must be checkpointed and restored consistently
• Prior approaches are specific to a particular library – e.g., modify the library to capture and restore messages in the channel
• Cruz preserves the TCP connection state and IP addresses of each pod, implicitly preserving global communication state
  − Transparently supports TCP/IP-based distributed applications
  − Enables efficiencies compared to library-based implementations
[Figure: distributed application – on each node, processes communicate through a library layered over TCP/IP; a consistent checkpoint must capture both the processes on every node and the messages in the communication channel between them]
16
Checkpoint-Restart of Distributed Applications in Cruz
• Global communication state saved and restored by saving and restoring the TCP communication state of each pod
  − Messages in flight need not be saved, since the saved TCP state will trigger retransmission of these messages at restart
  − Eliminates the O(N²) step of flushing channels to capture messages in flight
  − Eliminates the need to re-establish connections at restart
• Preserving the pod's IP address across restart eliminates the need to re-discover process locations in the library at restart
[Figure: the same distributed application with each node's processes running inside a pod – Cruz checkpoints the TCP/IP state of each pod, so the library-level communication state is preserved implicitly]
17
Consistent Checkpoint Algorithm in Cruz (Illustrative)
• Algorithm has O(N) complexity (blocking algorithm shown for simplicity; see the sketch below)
• Can be extended to improve robustness and performance, e.g.:
  − Tolerate Agent & Coordinator failures
  − Overlap computation and checkpointing using copy-on-write
  − Allow nodes to continue without blocking for all nodes to complete the checkpoint
  − Reduce checkpoint size with incremental checkpoints
[Figure: coordinated checkpoint protocol – a Coordinator node and an Agent on each application node (each node runs a pod, with the library layered over TCP/IP):
1. Coordinator sends <checkpoint> to every Agent
2. Each Agent disables pod communication (using netfilter rules in Linux) and saves the pod state
3. Each Agent replies <done>
4. Coordinator sends <continue> to every Agent
5. Each Agent re-enables pod communication and resumes the pod
6. Each Agent replies <continue-done>]
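A minimal sketch of the blocking coordination shown in the figure is given below. It is illustrative only: send_msg() and wait_msg() stand in for whatever reliable transport connects the coordinator to the agents, and the message names simply mirror the protocol above.

/*
 * Hypothetical sketch (not Cruz's actual code) of the blocking O(N)
 * coordinated checkpoint. send_msg()/wait_msg() are assumed transport
 * primitives supplied elsewhere.
 */
#include <stddef.h>

enum msg { MSG_CHECKPOINT, MSG_DONE, MSG_CONTINUE, MSG_CONTINUE_DONE };

void send_msg(int agent, enum msg m);   /* assumed: send one message      */
enum msg wait_msg(int agent);           /* assumed: block for agent reply */

/* Coordinator: one round trip to quiesce and save, one round trip to resume. */
void coordinate_checkpoint(const int *agents, size_t n)
{
    size_t i;

    /* Phase 1: each agent disables pod communication and saves pod state. */
    for (i = 0; i < n; i++)
        send_msg(agents[i], MSG_CHECKPOINT);
    for (i = 0; i < n; i++)
        while (wait_msg(agents[i]) != MSG_DONE)
            ;                            /* block until <done> */

    /* Phase 2: each agent re-enables communication and resumes its pod. */
    for (i = 0; i < n; i++)
        send_msg(agents[i], MSG_CONTINUE);
    for (i = 0; i < n; i++)
        while (wait_msg(agents[i]) != MSG_CONTINUE_DONE)
            ;                            /* block until <continue-done> */
}

Each agent exchanges only a constant number of messages with the coordinator, which is what makes the protocol O(N) in the number of nodes, in contrast to flushing every pairwise channel, which is O(N²).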
18
Outline
• Zap (Background)
• Migrating Networked Applications
  − Network Address Migration
  − Communication State Checkpoint and Restore
• Checkpoint-Restart of Distributed Applications
• Evaluation
• Related Work
• Future Work
• Summary
19
Evaluation
• Cruz implemented for Linux 2.4.x on x86
• Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark
• Cruz incurs negligible runtime overhead (less than 0.5%)
• Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable
20
Performance Result – Negligible Coordination Overhead
• Checkpoint behavior for a Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes
• Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests the scheme is scalable
  − Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
21
Related Work
• MetaCluster product from Meiosys
  − Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)
• Berkeley Lab Checkpoint/Restart (BLCR)
  − Kernel-module based checkpoint-restart for a single node
  − No identifier virtualization – restart will fail in the event of an identifier (e.g., pid) conflict
  − No support for handling communication state – relies on application or library changes
• MPVM, CoCheck, LAM-MPI
  − Library-specific implementations of parallel application checkpoint-restart, with the disadvantages described earlier
22
Future Work
Many areas for future work, e.g.:
• Improve portability across kernel versions by minimizing direct access to kernel structures
  − Recommend additional kernel interfaces when advantageous (e.g., for accessing socket attributes)
• Implement performance optimizations to the coordinated checkpoint-restart algorithm
  − Evaluate performance on a wide range of applications and cluster configurations
• Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)
23
Summary
• Cruz, a practical checkpoint-restart system for Linux
  − No change to applications or to the base OS kernel needed
• Novel mechanisms to support checkpoint-restart of a broader class of applications
  − Migrating networked applications transparently to communicating peers
  − Consistent checkpoint-restart of general TCP/IP-based distributed applications
• Cruz's broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management
© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
http://www.hpl.hp.com/research/dca