stabilization enabling technology
DESCRIPTION
Stabilization Enabling Technology. Shlomi Dolev. Trustworthy Systems: Why is it So Hard?. Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected“ http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ - PowerPoint PPT PresentationTRANSCRIPT
Stabilization Enabling Technology
Shlomi Dolev
Trustworthy Systems: Why is it So Hard?
• Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected“http://larch-www.lcs.mit.edu:8001/~corbato/turing91/
• "You must pay extreme attention to detail here. One wrong bit will make things fail… "http://my.execpc.com/~geezer/os/pm.htm
• From Pentium’s manual:“… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition"
Mars Rover - Spirit• …The Spirit rover has a radiation-hardened R6000
CPU from Lockheed-Martin Federal Systems…The operating system is Wind River Systems' Vx-Works..
• …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended…
• …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot
http://www.eetimes.com/sys/news/OEG20040220S0046
Linux and Windows do not Stabilize
Self-Stabilization
• Self-healing, Self-managing, Self-*• Recovery Oriented Computing
[Berkeley, Stanford] • Autonomic Computing [IBM]
• Self-Stabilization• Self-Stabilizing algorithm for mutual
exclusion in a ring topology [Dijkstra’74]
Well Established Theory !
Self-Stabilization• The combination and type of faults
cannot be totally anticipated in on-going systems
• Any on-going system must be self stabilizing (or manually monitored)
E L
First Self-Stabilizing Algorithm: Token Passing [Dij74]
Token Passing
1 P1: do forever
2 if x1=xn then
3 x1:=(x1+1)mod(n+1)
4 Pi(i ≠ 1):do forever
5 if xi≠xi-1 then
6 xi:=xi-1
Token Passing Cont.
• Surely works when we start in x1 = x2 = … = xn = 0.
• One processor may change a state at a time.
{0; 0; 0; 0; 0};
{1; 0; 0; 0; 0};
{1; 1; 0; 0; 0};
{1; 1; 1; 0; 0};
{1; 1; 1; 1; 0};
{1; 1; 1; 1; 1};
{2; 1; 1; 1; 1};
{2; 2; 1; 1; 1};
{2; 2; 2; 1; 1};
{2; 2; 2; 2; 1};
{2; 2; 2; 2; 2}
…
Token Passing: Faults
• Transient fault, soft errors, wrong CRC, unexpected temporal severe conditions, etc.
• Assigns each processor with an arbitrary state (in the range of its state space).
• For example {3; 4; 4; 1; 0}.
• p2; p4; and p5 have tokens!
• Will the system ever recover?
Token Passing: Automatic Recovery
• p1 changes state infinitely often,• Otherwise, let s1 be the fixed state of p1, • p2 eventually copies s1 from p1, then • p3 eventually copies s1 from p2, then • ... • pn eventually copies s1 from pn-1, then • p1 changes state. • p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...
Token Passing: Automatic Recovery Cont.
• In any initial state at least one state is missing, {4; 4; 1; 0; 2}, 3 and 5 are missing.
• Once p1 reaches the missing state e.g., 5, all the processors must copy 5, before p1 reads 5 from pn and changes state to 0.
Will It Stabilize With mod (n - 2)?
Mod 3
{0,0,2,1,0} p1 {1,0,2,1,0} p5
{1,0,2,1,1} p4 {1,0,2,2,1} p3
{1,0,0,2,1} p2 {1,1,0,2,1}
+1 mod 3 !
Is Self-Stabilization a Toy?
Stabilization Stack
• Self Stabilizing Microprocessor [DH04,DH06]
• Self Stabilizing Operating System [DY04]• Self-Stabilization Preserving
Compiler[DH05]• Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]• Recovery Oriented Programming[BD05]
Implementation Bottleneck
• Ask Intel, AMD, IBM to design a self-stabilizing microprocessor…
• Technology for converting off-the shelf processor to be self-stabilizing [DH06]
• Ask Microsoft, IBM, Red Hat, to convert existing code of OS to be self-stabilizing…
• Stabilizing Virtual Machine [DY07]
Enforcing stabilization by resetting
• Processors behave correctly after reset
• Periodic reset ensures correct behavior
• But damages closure…• Need careful solutions
Periodic Reset Monitor• Find a location P in OS code reached at least
every T time• At P:
Save necessary information to RAM Request a reset and loop forever.
• Stabilizing watchdog accepts request and resets processor
• Upon reset: restore information and continue
• Stabilizing watchdog verifies that a reset is performed at least every T + epsilon time
Implementationusing Intel XScale core
• Used in numerous processors Network, I/O, Handheld, Cellular etc.
• RISC architecture (ARMv5 compatible)• Debug interface
Allows interaction between WD and OS External debug break used for notifying
the upcoming reset
Up to now
• Virtual Self-stabilizing processor on top of commercial quality processor
• Towards repeating the concept in OSs and VMMs (enforcing configuration and protecting critical operations)
Toward Self-Stabilizing Operating System (SOS)
Shlomi Dolev and Reuven Yagel,SAACS’04 Workshop, Zaragoza
Basic Directions
• Black-box Take existing OS (Unix, Windows, RTOS) Add stabilization layer
• Carefully tailoring a tiny kernel Processor scheduling Memory management Device driver Hosting Byzantine processes
Assumptions
• Every configuration (processor/memory) is possible
• At least some program code is hardwired (in ROM) and is correct – Harvard Model
• Processor: Instruction manual (e.g. x86\IA-32)
defines a transition function. Self-stabilizing [DH04]
Black Box
Periodic Reset Re-install and Execute Watchdog timer (self-stabilizing) Periodic processor reset During bootstraps OS reinstall from ROM
Weak self-stabilization E = (ci, ai, ci+1, …., RRE, c1, a1, c2, a2, …., ci,
ai, ci+1, …., RRE, c1, a1, c2, a2, …. Is it always acceptable?
Alternative: Periodic re-install code only, add consistency check and enforcement
Tailored Kernel
• Tiny Scheduler Tiny Memory Manager
• Requirements: Self-stabilizing Fair Process stabilization preserving (e.g.
validity of P.C. value)
Tiny SOS Scheduler
; increase task10 mov word ax, [currentProc]11 and ax, PROC_MASK...
; load task state...;restore ip52 mov ax, [bx+4];validate ip53 and ax, IP_MASK54 mov word [ss:STACK TOP], ax;restore general registers55 mov cx, word [bx+12] 56 mov dx, word [bx+14] 57 mov si, word [bx+16] 58 mov di, word [bx+18]
• ~70 lines of a real machine assembly code
• 16bit Real mode & 32bit Protected mode.
• Standard build and emulation tools (Nasm, ld, Bochs)
• Detailed proof of requirement preservation
Sketch of Proof
• In every execution E, the code of the scheduler is started to be executed and is executed from the first instruction to the last instruction infinitely often
• In every execution E of the scheduler each process is executed infinitely often
• The self-stabilizing scheduler preservers stabilization of processes.
Talk Outline
• Self Stabilizing Microprocessor [DH06]• Self Stabilizing Operating System [DY04]• Self-Stabilization Preserving
Compiler[DH05]• Self-Stabilizing Automatic Recoverer For • Eventual Byzantine Software [BDK03]• Recover Oriented Programming[BD05]
Self-Stabilization Preserving Compiler
Shlomi Dolev, Yinnon A. Haviv,Department of Computer Science
Ben-Gurion University, Israel
Mooly Sagiv,Department of Computer Science
Tel Aviv University, Israel
The Gap.
• Need a transformation between: Input program P written in a high
abstraction language, e.g., (D)ASM. Output program Q in a machine language,
say, JVM.
• Existing compilers? P and Q behaves the same when started
in the initial state. What if Q reaches an unexpected state
due to soft-error experienced by microprocessor?
Trivial Example
• A statement of the form: For each i in {0..9} do f(i)
• May be compiled to • Start with cx=12 inside
the loop…
• Moreover: Any runtime mechanism can get stuck / inconsistent.
mov ax, 10 mov cx, 0loop1: push cx call f inc cx cmp cx,ax jne loop
Stabilization Preserving Compiler – a closer look
State space of P
Ensuring that Q eventually behaves as P:
State space of Q
The Transformation
upon <condition_1> do
<statement_1>
Variable declarations
upon <condition_n> do
<statement_n>
Enforce invariants
Scheduler
condition_1
…
condition_n
Statement_1
Statement_n
Self-Stabilization Preserving Compiler: Summary
• Front end of compiler for ASM.• Self Stabilization preserving compiler.
Language with clear semantics from any state.
New demands for a compiler.
Talk Outline
• Self Stabilizing Microprocessor [DH04]• Self Stabilizing Operating System
[DY04]• Self-Stabilization Preserving
Compiler[DH05]• Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]• Recover Oriented Programming[BD05]
Self-Stabilization and Evolving Systems
• Real world systems cannot be verified exhaustively…
• We enforce safety and live-ness specifications• Contract between the client, project manager
and programmers, that is checked on line!• Make sure that the additional (thin)
monitoring and recovering layer is self-stabilizing
• A change can be made to the • implementation/specification• to support evolving environments
Self-Stabilizing Recoverer for Eventual Byzantine
Software
Olga Brukman, Shlomi DolevDepartment of Computer Science
Ben-Gurion University, Israel
Hillel Kolodner,Haifa Research Labs
IBM, Israel
Software Contains Bugs
• Heisenbugs, corrupt states, leaked resources are common…
• Correct and faultless SW is hard Long-lived running programs, e.g., OS
• Usually software is tested when starting from initial state and considering limited time scenarios.
Fault Model Reflecting Reality
• Software packages can be trusted to work as required after restart.
• Eventual Byzantine software.• System administrators and users use
reboot to deal with faults.
Middleware Architecture
OS
Kern
el
OMR
<Preds,RActs>1
<Preds,RActs>2
…<Preds,RActs>n
<Preds,RActs
>
<Preds,RActs>
<Preds,RActs>
<Pred
s,RActs>
Monitor-Restarter for Process and Subsystem
<Pred,RActs>1
<Pred,RActs>2
…
Restart Actions – Mature Approach
• Subsystem waits for completion of a restart of its components.
• Restart action may vary, depending on component internal state. Reschedule Roll-back Kill & Restart Few restart attempts with more drastic restart actions.
Computational Model: rsf-execution
• An execution E is rsf (restart supporting fair)-execution iff E is a fair execution in which every subsystem subi that is initialised during E respects its specification function ssi.
Requirement: Every rsf-execution E has a suffix in which the system respects its specification function ss.
Tools for Implementation – Black Box Approach
• Software package is a black box.• Package is monitored by recording it’s
IO (e.g., strace in Linux).• Monitors are independent of specific
implementation
Tools for Implementation – Transparent Box Approach
• Software package implementation tool is known.
• Run-Time Reflection tools are used to monitor and restart the package.
• Possible in Java, C++, CORBA, COM.
Practical Experience: Printers Problem
• Corrupted pdf, doc or ps file sent to printing server.
• Printer can’t print the file.• Cause retries by printing server
Printer is “stuck” on one job.
• Predicate for printing server: Restrict number of retries, try format
conversions, send error message to user.
Recovery Oriented Programming
Olga Brukman and Shlomi DolevDepartment of Computer Science
Ben-Gurion University, Israel
Towards Robust Software• Programming
Structural programming, OOD, Design Patterns…• Testing and debugging
Unit testing [JUnit, CppUnit]… Design By Contract (Eiffel) …
• Formal specification languages ASM, IO Automata, NURPL
• Model checking• Online recovery
ROC [PBB02]. Self-Stabilizing Autonomic Recoverer for
Eventual Byzantine Software [BDK03] - black box software packages.
Our Contribution• Program invariants derived from design
specifications. Checked every time invariant variables are
updated.• Automatic code generation for invariant verification
and recovery upon invariant violation.• Invariants are verified during • runtime.
Change of invariant variable is pre-checked in sand-box. Violation is prevented and replaced with a recovery action.
Our Contribution Cont.• Recovery action is chosen depending
on the current state and history.Roll back & resume.Wait.Reschedule.Kill & restart.
External Monitoring
• Monitoring the whole task to avoid transient faults occurrence after
which invariant variables are not changed ( and no invariant checks are done)
liveness problem – monitor over time
Talk Conclusions
• Self-Stabilization as an effective paradigm for creating robust systems.
• Rigorous approach for designing basic system components Microprocessor Virtual machine monitor Operating system Compiler Evolving and Recovery Oriented
Self-Stabilizing Virtual Machine Hypervisor Architecture for Resilient Cloud
Alexander Binun, Shlomi Dolev, Reuven Yagel {binun,dolev,yagel}@cs.bgu.ac.il
Martin Kahil, Mark Bloch, Boaz Menuhin{kahilm,mbloch}@post.bgu.ac.il, [email protected]
Ben-Gurion University of the Negev
Marc Lacoste, Thierry Coupaye, Aurelien Wailly{marc.lacoste,thierry.coupaye,aurelien.wailly}@orange.com
Orange Labs, Paris
INTRODUCTION
Virtualization
Virtual machines (VM) Guest OS runs apps
Hypervisor (HV) Hardware for VMs Assume that HV is in
host OS (e.g. KVM)
Malware (e.g. rootkits) Get into HV from VM
VM
TerminologyTransient failures (TFs): Yield arbitrary changes of the system state SEU (Single Event Update), limitations of
error detection algorithms
We do not want to say exactly, as we risk forgetting a scenarioWe better assume a resulting arbitrary state
Terminology, Cont.
Admissible execution: minimal requirements for a system to be recoverable e.g., less than one third of the processors are
under attack Possible limited faults during the
automatic recovery from the unanticipated events in the past:
CPU failures Message losses
Self-Stabilization
Legal execution: the desired behaviorSafe state: every execution starting from it is legal Self-stabilizing algorithm reaches a safe state in some finite time In every admissible execution From any arbitrary systems’ state Without external intervention
Self-Stabilization, Cont.
The program is stored in a ROM Not a subject for changes can be periodically reloaded
The system state can undergo unpredicted changes… … following which the system should
converge to a safe state
Example: Token Passing Algorithm
66
P1:
do forever X1 = XN =>
X1 := (X1+1) mod (N+1)
Pi, i ≠ 1:do forever
Xi+1≠ Xi =>
Xi+1 := Xi
Atomic step
P1
P3
P2
P4
X1
X2
X3
X4
Code: write-protected
Variables: may be corrupted
• Legal execution: exactly one Pi changes Xi • in infinitely many states
• A safe state: X1 = X2 = … XN
• The only possible execution is: exactly one Pi changes Xi
Token Passing: Self Stabilization
{0; 0; 0; 0; 0};
{1; 0; 0; 0; 0};
…
{1; 1; 1; 1; 1};
{2; 1; 1; 1; 1};
…
{2; 2; 2; 2; 2}
x1 x2 x3 … xN
Failure: start fromArbitrary values
{3; 7; 2; 3; 0};
…
{4; …… };
…
{M; …… };
…
{M; M; M; M; M};
x1 x2 x3 … xN
Safe:
• P1 eventually increments x1
– In N rounds x1 propagates along the chain, reaches PN, then increases
• In a state S x1 gets unique (not encountered) value M• after incrementing x1 several times
• Then M is propagated to other processors, reaching a safe state
Rou
nd N
umbe
r
OUR APPROACH – MAIN PRINCIPLE
Rootkit Activity – Current State of Art
Rootkit Activity – Current State of Art
Rootkit Activity – Current State of Art
Rootkit Activity – Current State of Art
Rootkit Activity – Current State of Art
Our approach: Software watchdog brings the system
into a safe state
PeriodicI’m Alive(frequent)
Reboot
Every software component (host,guests) can be corrupted – And the watchdog as well…
corrupted…
Reload system from ROM upon a hardware
timer signal
Hardware interrupt
Consistency check:Tampered => Reboot
SMM
ROM
• Validator is write-protected by hardware means:– It is the one that guarantees self-stabilization– Runs rarely as it is very time-consuming
• Watchdog quickly detects small problems– Runs frequently and efficiently
User Apps
Guest OS
VM1
User Apps
Guest OS
VM2
T2
(VM2)T1
(VM1)
VM Managercreate / delete VMUser
Existing Infrastructure
CPU4 .Saving CPU state & stop
VM2 VM1
OSScheduler
Pool
T2T1
VMTable
Hypervisor
VMiStatei
1. Schedule VM
3 .Run CPU during some time
2. Activate VM
User
Existing Infrastructure
I/O
driv
ers
Inter-VM traffic
Hardware
CPU
OSSchedulerVMTable
Hypervisor
VMiStatei
I/O
driv
ers
Pool
T2T1
Our Architecture: Watchdog
Hardware
Stabilization Manager
Periodic Interrupt
Watchdog
Check schedulerstate
Check VMstate
Check traffic state
Safe state? I’m Alive
not alive during a while =>reboot
CPU
OSSchedulerVMTable
Hypervisor
VMiStatei
I/O
driv
ers
Pool
T2T1
Hardware
Stabilization Manager
Periodic Interrupt(every second)
Watchdog
Check schedulerstate
Check VMstate
Check traffic state
not alive during a while =>reboot Timer
Integrity checker
Interrupt(every day)
integrity failure=>reboot
Our Architecture: Hardware-facilitated integrity checking
Safe state? I’m Alive
Implementation: Employ external tools for examining
VMsVM
Benign output
I am Alive
Implementation: Employ external tools for examining
VMs (cont)VM
Report malware
Alarm
Kill, suspend etc
Future Work• Test the prototype on real malware
collections– e.g. TechGainer[1]
• Intelligent safety enforcement – if the situation is not severely dangerous :
restart only malfunctioning fragments• reset malfunctioning printer instead of rebooting
the computer– Guarded Commands ([2]) as a basis for the
specification language {(guard,action), … }• Guard safety check, action enforcement
Future Work• Make the architecture support
distributed cloud infrastructures– E.g. OpenStack[3]
• Are there competitors ?– Azure [5] – recovery through replication– Replicas synchronization algorithm may
suffer from transient faults too