for real world applications - iti.illinois.edu › sites › default › files › docs › ... ·...

1
Transparent Application Checkpointing and Recovery for Real World Applications Abhishek Garg, Indian Institute of Technology, Delhi ADVISORS: Prof. R. K. Iyer, Prof. Z. T. Kalbarczyk, Nithin Nakka I T I S U M M E R U N D E R G R A D U A T E I N T E R N P R O G R A M, 2 0 0 8 University of Illinois at UrbanaChampaign www.iti.uiuc.edu Information Trust Institute Goals Background Research Plan Fault tolerance techniques enable a system to continue operation in the event of a failure or a fault. Checkpointing consists of storing the snapshot of the current application state and using it for restarting the application in the event of a failure. Rollback Recovery: the process state is reloaded to memory from the last checkpoint and the execution resumes. Reliability Micro Kernel (RMK), a loadable kernel module for providing application‐specific reliability mechanisms, is used to deploy the application checkpointing. Developing a loadable kernel module for LINUX systems: q Checkpoints the application periodically. q Recovers the application on a failure from the last saved checkpoint. Demonstrate the Kernel Module on some real world applications. Learning the Kernel Module Programming and writing small test modules to get familiar with it. Understanding the Reliability Micro Kernel framework (RMK) and the code implementing Checkpointing in the framework. Leveraging the RMK interface and developing the Checkpointing kernel module. Algorithm Research Results What is the essential memory and processor state needed to recover the application from failures? How to detect an application process failure in the kernel mode? How to make the checkpoint to be transparent to the application? Successfully implemented the Checkpointing Module in LINUX. Demonstrated Kernel Module with the real world application, GCC compiler, by: q Running the compilation process in the background. q Issuing a kill signal to terminate the process. q Initiating the recovery. Demonstrated the recovery of the application with the same Process ID. Related Work/Interaction with Other Projects Reliability Micro Kernel: An OS level framework for providing Application‐aware reliability. M‐Checkpointing: Transparent Local and In‐ place checkpointing for propagation‐limited errors. Applications OS Hardware RMK Pin Interface Modules Fundamental Questions/Challenges Checkpointing module searches for the application to be checkpointed in the tasks list. Process’s memory pages including general purpose registers are saved. The process starts execution from the restored state. The saved memory pages are reloaded in the process’s memory data space. Error signal is caught and the recovery process is initiated.

Upload: others

Post on 25-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: for Real World Applications - iti.illinois.edu › sites › default › files › docs › ... · Module in LINUX. Ø Demonstrated Kernel Module with the real world application,

Transparent Application Checkpointing and Recovery for Real World Applications

Abhishek Garg, Indian Institute of Technology, Delhi ADVISORS: Prof. R. K. Iyer, Prof. Z. T. Kalbarczyk, Nithin Nakka

I T I S U M M E R U N D E R G R A D U A T E I N T E R N P R O G R A M, 2 0 0 8

University of Illinois at Urbana­Champaign

www.iti.uiuc.edu Information Trust Institute

Goals

Background Research Plan

Ø Fault tolerance techniques enable a system to continue operation in the event of a failure or a fault.

Ø Checkpointing consists of storing the snapshot of the current application state and using it for restarting the application in the event of a failure.

Ø Rollback Recovery: the process state is reloaded to memory from the last checkpoint and the execution resumes.

Ø Reliability Micro Kernel (RMK), a loadable kernel module for providing application‐specific reliability mechanisms, is used to deploy the application checkpointing.

Ø Developing a loadable kernel module for LINUX systems:

q Checkpoints the application periodically.

q Recovers the application on a failure from the last saved checkpoint.

Ø Demonstrate the Kernel Module on some real world applications.

Ø Learning the Kernel Module Programming and writing small test modules to get familiar with it.

Ø Understanding the Reliability Micro Kernel framework (RMK) and the code implementing Checkpointing in the framework.

Ø Leveraging the RMK interface and developing the Checkpointing kernel module.

Algorithm

Research Results

Ø What is the essential memory and processor state needed to recover the application from failures?

Ø How to detect an application process failure in the kernel mode?

Ø How to make the checkpoint to be transparent to the application?

Ø Successfully implemented the Checkpointing Module in LINUX.

Ø Demonstrated Kernel Module with the real world application, GCC compiler, by:

q Running the compilation process in the background.

q Issuing a kill signal to terminate the process.

q Initiating the recovery.

Ø Demonstrated the recovery of the application with the same Process ID.

Related Work/Interaction with Other Projects

Ø Reliability Micro Kernel: An OS level framework for providing Application‐aware reliability.

Ø M‐Checkpointing: Transparent Local and In‐ place checkpointing for propagation‐limited errors.

Applications

OS

Hardware

RMK

Pin Interface

Modules

Fundamental Questions/Challenges 

Checkpointing module searches for the application to be checkpointed in the 

tasks list. 

Process’s memory pages including general purpose registers are saved. 

The process starts execution from the restored state. 

The saved memory pages are reloaded in the process’s memory data space. 

Error signal is caught and the recovery process is initiated.