linux-cr: transparent application checkpoint-restart in linuxorenl/talks/ksummit-2010.pdf ·...

41
Linux Kernel Summit, November 2010 1 [email protected] Linux Kernel Summit, November 2010 1 [email protected] Linux-CR: Transparent Application Checkpoint-Restart in Linux Linux-CR: Transparent Application Checkpoint-Restart in Linux Oren Laadan Columbia University [email protected]

Upload: others

Post on 30-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 1 [email protected] Kernel Summit, November 2010 1 [email protected]

Linux-CR:

Transparent Application Checkpoint-Restart in Linux

Linux-CR:

Transparent Application Checkpoint-Restart in Linux

Oren LaadanColumbia University

[email protected]

Page 2: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 2 [email protected] Kernel Summit, November 2010 2 [email protected]

Application C/RApplication C/R

◆ Application Checkpoint/Restart

a mechanism to save the state ofrunning application(s) so that they can later resume execution from that point

Page 3: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 3 [email protected] Kernel Summit, November 2010 3 [email protected]

checkpointimage

Application C/RApplication C/R

original restoredhierarchy hierarchy

restartcheckpoint

Page 4: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 4 [email protected] Kernel Summit, November 2010 4 [email protected]

What is it good for ?What is it good for ?

◆ Application roll back to the past◆ Application suspend and resume◆ Application migration

Page 5: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 5 [email protected] Kernel Summit, November 2010 5 [email protected]

Application rollbackApplication rollback

◆ Fault tolerance◆ Effective debugging◆ Fast application start-up◆ Software testing◆ Generic time-machine

Page 6: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 6 [email protected] Kernel Summit, November 2010 6 [email protected]

Application rollbackApplication rollback

◆ Fault tolerance◆ long running applications◆ cloud, HPC, at work, at home

◆ Effective debugging◆ Fast application start-up◆ Software testing◆ Generic time-machine

Page 7: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 7 [email protected] Kernel Summit, November 2010 7 [email protected]

Application rollbackApplication rollback

◆ Fault tolerance◆ Effective debugging

◆ Super-core-dump● more details, multiple tasks

◆ re-run from checkpoint● trace, profile, and instrument

◆ Fast application start-up◆ Software testing◆ Generic time-machine

Page 8: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 8 [email protected] Kernel Summit, November 2010 8 [email protected]

Application rollbackApplication rollback

◆ Fault tolerance◆ Effective debugging◆ Fast application start-up

◆ from default/previous state (ccache...)◆ improve desktop boot time

◆ Software testing◆ Generic time-machine

Page 9: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 9 [email protected] Kernel Summit, November 2010 9 [email protected]

Application rollbackApplication rollback

◆ Fault tolerance◆ Effective debugging◆ Fast application start-up◆ Software testing

◆ repeat from specific point(s)◆ distribute on multiple hosts

◆ Generic time-machine

Page 10: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 10 [email protected] Kernel Summit, November 2010 10 [email protected]

Application rollbackApplication rollback

◆ Fault tolerance◆ Effective debugging◆ Fast application start-up◆ Software testing◆ Generic time-machine

◆ revive old server/desktop state◆ retry a move in a game

Page 11: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 11 [email protected] Kernel Summit, November 2010 11 [email protected]

Application suspend/resumeApplication suspend/resume

◆ Improved OOM handling◆ Better system utilization◆ Suspend/resume a user's session

Page 12: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 12 [email protected] Kernel Summit, November 2010 12 [email protected]

Application suspend/resumeApplication suspend/resume

◆ Improved OOM handling◆ suspend applications, don't kill◆ smart “swap” on embedded

◆ Better system utilization◆ Suspend/resume a user's session

Page 13: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 13 [email protected] Kernel Summit, November 2010 13 [email protected]

Application suspend/resumeApplication suspend/resume

◆ Improved OOM handling◆ Better system utilization

◆ suspend application to reduce load

◆ Suspend/resume a user's session

Page 14: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 14 [email protected] Kernel Summit, November 2010 14 [email protected]

Application suspend/resumeApplication suspend/resume

◆ Improved OOM handling◆ Better system utilization◆ Suspend/resume a user's session

◆ mobile desktop on USB key◆ linux based VPS/VDI systems

Page 15: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 15 [email protected] Kernel Summit, November 2010 15 [email protected]

Application MigrationApplication Migration

◆ Load balancing / resource sharing◆ Zero-downtime maintenance◆ High availability

Page 16: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 16 [email protected] Kernel Summit, November 2010 16 [email protected]

Application MigrationApplication Migration

◆ Load balancing / resource sharing◆ HPC (e.g. BlueWaters project)◆ cloud environments◆ linux-based VPS/VDI

◆ Zero-downtime maintenance◆ High availability

Page 17: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 17 [email protected] Kernel Summit, November 2010 17 [email protected]

Application MigrationApplication Migration

◆ Load balancing / resource sharing◆ Zero-downtime maintenance

◆ live migration of applications

◆ High availability

Page 18: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 18 [email protected] Kernel Summit, November 2010 18 [email protected]

Application MigrationApplication Migration

◆ Load balancing / resource sharing◆ Zero-downtime maintenance◆ High availability

◆ primary/backup in lock-step◆ frequent incremental checkpoints

Page 19: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 19 [email protected] Kernel Summit, November 2010 19 [email protected]

Application vs Virtual-MachineApplication vs Virtual-Machine

Application Virtual C/R Machine

granularity specific operating systemapplications as a whole unit

saved state application entire operatingstate only system state

overhead none visible

flexibility application operating systemawareness is black box

deployment linux only same arch family

Page 20: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 20 [email protected] Kernel Summit, November 2010 20 [email protected]

Some examplesSome examples

◆ HPC environments◆ can extend linux-cr → distributed-cr

◆ Cloud deployments◆ using linux containers

◆ Light-weight clusters of ARMs◆ combine LXC and linux-cr

Page 21: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 21 [email protected] Kernel Summit, November 2010 21 [email protected]

Some concrete examplesSome concrete examples

◆ BlueWaters◆ NCSA's most powerful supercomputer◆ checkpointing based on linux-cr

◆ OpenVZ◆ VPS hosting with migration capabilities

◆ Canonical / Ubuntu◆ add LXC & linux-cr in UEC cluster stack

Page 22: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 22 [email protected] Kernel Summit, November 2010 22 [email protected]

Who and WhoWho and Who

◆ Who is doing ?◆ Matt Helsley, Dan Smith, Serge Hallyn,

Nathan Lynch, Sukadev Bhattiprolu, me

◆ Who is interested ?◆ IBM, Canonical, OpenVZ, HPC industry,

Kerrighed, Google (?), ...

◆ Who else does/did ?◆ OS: AIX, OpenVZ, IRIX, Cray...◆ Systems: Moab, BLCR/Beowolf, Condor...

Page 23: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 23 [email protected] Kernel Summit, November 2010 23 [email protected]

Linux-C/R design goalsLinux-C/R design goals

◆ Transparency◆ Reliability◆ Security/safety◆ Performance◆ Maintainability

Page 24: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 24 [email protected] Kernel Summit, November 2010 24 [email protected]

Linux-C/R design goalsLinux-C/R design goals

◆ Transparency◆ applications oblivious to operation◆ allow notify of checkpoint or restart◆ allow application awareness

◆ Reliability◆ Security/safety◆ Performance◆ Maintainability

Page 25: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 25 [email protected] Kernel Summit, November 2010 25 [email protected]

Linux-C/R design goalsLinux-C/R design goals

◆ Transparency◆ Reliability

◆ checkpoint succeeds → restart succeeds◆ report non-checkpoint-able reasons◆ checkpoint is non-intrusive

◆ Security/safety◆ Performance◆ Maintainability

Page 26: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 26 [email protected] Kernel Summit, November 2010 26 [email protected]

Linux-C/R design goalsLinux-C/R design goals

◆ Transparency◆ Reliability◆ Security/safety

◆ ptrace capabilities to checkpoint◆ reuse kernel code to reconstruct state

◆ Performance◆ Maintainability

Page 27: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 27 [email protected] Kernel Summit, November 2010 27 [email protected]

Linux-C/R design goalsLinux-C/R design goals

◆ Transparency◆ Reliability◆ Security/safety◆ Performance

◆ zero impact on performance◆ reasonable code footprint

◆ Maintainability

Page 28: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 28 [email protected] Kernel Summit, November 2010 28 [email protected]

Linux-C/R design goalsLinux-C/R design goals

◆ Transparency◆ Reliability◆ Security/safety◆ Performance◆ Maintainability

◆ next slide ...

Page 29: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 29 [email protected] Kernel Summit, November 2010 29 [email protected]

MaintainabilityMaintainability

◆ Placement of C/R code◆ Extensive test-suite◆ Positive experience so far◆ Impact on developers

Page 30: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 30 [email protected] Kernel Summit, November 2010 30 [email protected]

MaintainabilityMaintainability

◆ Placement of C/R code◆ generic code in kernel/checkpoint/...◆ most c/r code with or near subsystem

code so subsytem maintainers sees it◆ c/r is well documented

◆ Extensive test-suite◆ Positive experience so far◆ Impact on developers

Page 31: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 31 [email protected] Kernel Summit, November 2010 31 [email protected]

MaintainabilityMaintainability

◆ Placement of C/R code◆ Extensive test-suite

◆ test large list (>120) of scenarios◆ test before/during/after behavior

◆ Positive experience so far◆ Impact on developers

Page 32: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 32 [email protected] Kernel Summit, November 2010 32 [email protected]

MaintainabilityMaintainability

◆ Placement of C/R code◆ Extensive test-suite◆ Positive experience (2.6.27 → today)

◆ can ignore most kernel changes◆ mainly need to add features◆ minor changes to prior c/r code

● e.g. splice/pipe, syscalls #s, mm helpers

◆ Impact on developers

Page 33: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 33 [email protected] Kernel Summit, November 2010 33 [email protected]

MaintainabilityMaintainability

◆ Placement of C/R code◆ Extensive test-suite◆ Positive experience so far◆ Impact on developers

◆ understand what may affect c/r code◆ at least notify c/r people when needed◆ awareness will grow with exposure

Page 34: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 34 [email protected] Kernel Summit, November 2010 34 [email protected]

Design SummaryDesign Summary

◆ Save/restore state in-kernel◆ Checkpoint container/subtree/self◆ Image holds “user-visible” state◆ Userspace image conversion◆ Detailed error reporting

Page 35: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 35 [email protected] Kernel Summit, November 2010 35 [email protected]

CheckpointCheckpoint

(1) Freeze process hierarchy

(2) Save global data

(3) Save process hierarchy

(4) Save state of all tasks

(?) Filesystem snapshot

(5) Thaw/kill process hierarchy

in-kernel

Page 36: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 36 [email protected] Kernel Summit, November 2010 36 [email protected]

RestartRestart

(1) Create container

(?) Restore (stage) filesystem

(3) Create process hierarchy

(4) Restore state of all tasks

(5) Resume execution

in-kernel

Page 37: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 37 [email protected] Kernel Summit, November 2010 37 [email protected]

Current StateCurrent State

◆ Supported subsystems:◆ tasks (threads, signals, credentials, etc)◆ namespaces (all but mounts-ns)◆ sysvipc (shm, msg, sem)◆ files, dirs (regular, fifos/pipes, epoll,

event, simple devices)◆ sockets (unix, ipv4, ipv6)◆ security (smack, selinux labels)

Page 38: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 38 [email protected] Kernel Summit, November 2010 38 [email protected]

Current StateCurrent State

◆ What's missing◆ [reviewed] file locks, leases, owner◆ [reviewed] unlinked files/dirs◆ [wip] fanotify/inotify/dnotify◆ [wip] mounts, mount-ns◆ /proc filesystem◆ ptraced tasks◆ more devices

Page 39: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 39 [email protected] Kernel Summit, November 2010 39 [email protected]

Current StateCurrent State

◆ Supported architectures:◆ x86-32◆ x86-64◆ s390x◆ PowerPC◆ ARM

Page 40: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 40 [email protected] Kernel Summit, November 2010 40 [email protected]

Current StateCurrent State

◆ Code:◆ ~23K lines in total

● ~1200 lines documentation● ~600 lines per arch (x5)● ~8K lines in kernel/checkpoint/* (base)● ~7K lines “near-place” files (*/checkpoint.c)● ~2K lines “in-place” save/restore● ~2K lines for LSM

Page 41: Linux-CR: Transparent Application Checkpoint-Restart in Linuxorenl/talks/ksummit-2010.pdf · 2010-11-02 · Linux Kernel Summit, November 2010Linux Kernel Summit, November 2010 2121

Linux Kernel Summit, November 2010 41 [email protected] Kernel Summit, November 2010 41 [email protected]

Discussion ...Discussion ...

◆ Concrete path to mainline (mm/next?)◆ Exposure to subsystem maintainers ?◆ Image format tied to kernel version

(userspace conversion tools)

Many thanks to those who reviewed, tested, and provided suggestions !

● Web page: http://www.linux-cr.org/● Git tree(s): git://www.linux-cr.org/git/