Exploring Failure Transparency and the Limits of Generic Recovery
Dave Lowell, Compaq Western Research Lab
Subhachandra Chandra and Peter M. Chen, University of Michigan
Introduction
Failure transparency: abstraction of failure-free operation
OS recovers app after hardware, OS, and application failures
– No programmer help
– No slow down
Will explore theory, performance, and limitations
Consistent recovery
Visible output equivalent to failure-free run
– equivalence: allows duplicates
– avoids “exactly once” problem
Failure transparency: consistent recovery with generic techniques
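One way to picture "equivalence allows duplicates" is a check that compares the output of a recovered run against a failure-free run after dropping repeated events. This is a minimal sketch, not the paper's formal definition. It assumes every visible event carries a unique identifier, so a repeated identifier can only be a duplicate re-emitted during recovery. The function name is hypothetical.

```python
def equivalent_allowing_duplicates(recovered, failure_free):
    """Return True if `recovered` equals `failure_free` once repeats are dropped.

    Assumption (not from the slides): each visible event has a unique
    identifier, so seeing one again means it was re-emitted after a failure.
    """
    seen = set()
    deduplicated = []
    for event in recovered:
        if event not in seen:  # keep only the first occurrence of each event
            seen.add(event)
            deduplicated.append(event)
    return deduplicated == list(failure_free)


# A recovered run that repeats "b" is still equivalent to the failure-free run.
ok = equivalent_allowing_duplicates(["a", "b", "b", "c"], ["a", "b", "c"])
# A recovered run that loses "b" is not.
lost = equivalent_allowing_duplicates(["a", "c"], ["a", "b", "c"])
```

Tolerating duplicates is what lets recovery re-execute from the last commit without solving the "exactly once" problem.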
Guaranteeing consistent recovery
Key players: non-deterministic events, visible events, commit events
Save-work invariant (simplified):
– There’s a commit after each non-deterministic event that happens-before a visible event.
– Full theorem handles liveness, distinguishes causality and ordering
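The simplified invariant can be sketched as a check over an event trace. The sketch below assumes a single process, so happens-before reduces to program order. The event labels ("nd", "visible", "commit", "other") and the function name are hypothetical. It does not model liveness or the causality/ordering distinction that the full theorem handles.

```python
def satisfies_save_work(trace):
    """Check the simplified Save-work invariant on a single-process trace.

    `trace` lists event kinds in program order: "nd", "visible", "commit",
    or "other". Assumption: with one process, happens-before is program order.
    """
    uncommitted_nd = False  # True while an ND event has no commit after it
    for kind in trace:
        if kind == "nd":
            uncommitted_nd = True
        elif kind == "commit":
            uncommitted_nd = False
        elif kind == "visible" and uncommitted_nd:
            # A visible event depends on an ND event that was never committed.
            return False
    return True


good = satisfies_save_work(["nd", "commit", "visible"])
bad = satisfies_save_work(["nd", "visible"])
```

In the second trace a failure after the visible event could re-execute the ND event with a different result, leaving output that no failure-free run would produce.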
[Figure: protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Protocols plotted: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG.]
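To make the protocol names concrete, here is a rough single-process sketch of three of them. The slides do not expand the acronyms, so the readings used here are assumptions: CAND as "commit after non-deterministic", CPVS as "commit prior to visible or send", and CBNDVS as "commit between non-deterministic and visible or send". Sends, logging (-LOG) and two-phase commit (-2PC) are left out, and the function and event labels are hypothetical.

```python
def commits_needed(trace, protocol):
    """Count commits on a single-process trace under one protocol.

    `trace` lists event kinds in program order: "nd", "visible", or "other".
    The protocol readings are assumptions, not definitions from the slides.
    """
    if protocol not in ("CAND", "CPVS", "CBNDVS"):
        raise ValueError(f"unknown protocol: {protocol}")
    commits = 0
    nd_since_commit = False  # only consulted by CBNDVS
    for kind in trace:
        if kind == "nd":
            nd_since_commit = True
            if protocol == "CAND":
                commits += 1  # commit right after every ND event
                nd_since_commit = False
        elif kind == "visible":
            if protocol == "CPVS":
                commits += 1  # commit before every visible event
                nd_since_commit = False
            elif protocol == "CBNDVS" and nd_since_commit:
                commits += 1  # commit only if an ND event is still uncommitted
                nd_since_commit = False
    return commits


trace = ["nd", "nd", "nd", "visible", "visible", "visible", "visible"]
counts = {name: commits_needed(trace, name) for name in ("CAND", "CPVS", "CBNDVS")}
```

On this trace the three policies need 3, 4 and 1 commits respectively. Tracking both kinds of event takes more effort but commits least often.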
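[Figure: protocol space with existing systems placed in it. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Protocols plotted: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG. Systems shown: Coord. Checkpointing, Manetho, Optimistic Logging, Targon/32, SBL, Hypervisor.]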
[Figure: trends across the protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Trend labels: increasing recovery time, application failure recovery, increasing simplicity, increasing performance.]
Performance study
Discount Checking: fast checkpoints to reliable memory (Rio)
– Logging and two-phase commit
– Disk version
Mostly interactive applications
– Localized and distributed
Nvi Text Editor
[Figure: two percentages shown for each protocol, placed by effort to identify/convert ND events and effort to commit only visible events.]
CAND: 1%, 43%
CAND-LOG: 0%, 13%
CPVS: 1%, 44%
CBNDVS: 1%, 42%
CBNDVS-LOG: 0%, 12%
TreadMarks Barnes-Hut
[Figure: two percentages shown for each protocol, placed by effort to identify/convert ND events and effort to commit only visible events.]
CAND: 199%, 11499%
CAND-LOG: 126%, 7700%
CPVS: 129%, 7346%
CPV-2PC: 12%, 319%
CBNDVS: 101%, 5743%
CBNDV-2PC: 12%, 252%
CBNDVS-LOG: 73%, 4973%
Have only considered “stop” failures
Committing everything is okay
– Save-work: when we must commit
Some failures affect application state
– Can we commit too much?
Lose-work invariant
To recover from propagation failure, never commit on a “dangerous path”.
Save-work and Lose-work conflict!
– Visible event on dangerous path
– Can’t guarantee consistent recovery from propagation failures
Do we see this conflict in practice?
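The Lose-work side of the conflict can be sketched as a check on a trace that ends in a crash. The slide does not say how a dangerous path is identified, so this sketch takes its starting index as an input supplied by the caller. That parameter, the event labels and the function name are all hypothetical.

```python
def violates_lose_work(trace, dangerous_start):
    """Return True if any commit happens on the dangerous path.

    `trace` lists event kinds in program order and ends in a crash.
    `dangerous_start` is the index where the dangerous path begins; how to
    find it is outside this sketch and is simply assumed to be known.
    """
    return any(kind == "commit" for kind in trace[dangerous_start:])


# An ND event at index 2 starts the dangerous path. A visible event follows it,
# so Save-work demands a commit in between, and that commit sits on the path.
forced = violates_lose_work(
    ["other", "commit", "nd", "commit", "visible", "crash"], dangerous_start=2
)
# Without a visible event on the path, no commit is required there.
clean = violates_lose_work(
    ["other", "commit", "nd", "other", "crash"], dangerous_start=2
)
```

In the first trace, dropping the commit at index 3 would satisfy Lose-work but break Save-work, which is the conflict the slide describes.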
Measuring Lose-work violations
Fault-injection study: OS crashes
– injected faults into running kernel
– induced 350 OS crashes
– recovered nvi and postgres using Discount Checking
Results
– nvi: 15% crashes violate Lose-work
– postgres: 3% crashes violate Lose-work
Application crashes
Fault-injection study: ND bugs
– nvi: 37% violate Lose-work
– postgres: 33% violate Lose-work
Published bug distributions: 85-95% of application bugs are deterministic
– intrinsically violate Lose-work
Perhaps > 90% app crashes violate Lose-work!
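The "> 90%" estimate can be reproduced with a short calculation. This is a back-of-the-envelope sketch, assuming that every deterministic bug violates Lose-work and that the fault-injection rates for ND bugs (33% to 37%) apply to the remaining non-deterministic share. The function name is hypothetical.

```python
def fraction_violating(deterministic_fraction, nd_violation_rate):
    """Estimate the share of application crashes that violate Lose-work.

    Assumptions: all deterministic bugs violate Lose-work, and
    `nd_violation_rate` of the non-deterministic bugs do as well.
    """
    nd_fraction = 1.0 - deterministic_fraction
    return deterministic_fraction + nd_fraction * nd_violation_rate


# Lowest figures from the slide: 85% deterministic, 33% of ND bugs violate.
low = fraction_violating(0.85, 0.33)
# Highest figures from the slide: 95% deterministic, 37% of ND bugs violate.
high = fraction_violating(0.95, 0.37)
```

Under these assumptions the estimate runs from roughly 90% to roughly 97%.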
Conclusions
Save-work and Lose-work invariants
Save-work protocol space
Invariants fundamentally conflict
Failure transparency performance:
– 0-12% overhead on reliable memory
– 13-40% overhead on disk (interactive apps)
> 90% application failures violate Lose-work