fault-tolerant typed assembly language frances perry, lester mackey, george a. reis, jay ligatti,...

1
Fault-tolerant Typed Assembly Language Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker Princeton University TAL FT : An Idealized Fault-tolerant System Goal: Detect all faults that change a program’s observable behavior and prove that well-typed programs always detect a single fault. General Strategy: Two redundant and independent copies of each computation (labeled green and blue) that are compared before stores and control flow transfers. The TAL FT Machine A machine state consists of a code memory C, a value memory M, a register file R, a current instruction ir, and a store queue Q. r 1 ... pc G R ir . . . 0x0393 mov r 1 4 0x0394 mov r 3 4 0x0395 st G r 2 r 1 0x0396 st B r 4 r 3 . . . M C Q A value is observed when it is stored to memory. Q buffers a store from the green computation until the corresponding blue store is performed. Special hardware is used by memory and control flow instructions to detect mismatches between the computations and trigger the special fault state. Performance Results Conclusion For More Information… See our upcoming paper in PLDI ’07 http://www.cs.princeton.edu/~frances/t al_ft-pldi07.pdf Visit the Project ap Homepage http://www.cs.princeton.edu/sip/proj/z ap Transient faults are already a significant concern, and their impact is increasing. TAL FT formalizes idealized hardware for a hybrid hardware- software fault-tolerance scheme. TAL FT ’s type system captures the invariants necessary to reason about the system. We formally proved that all well-typed programs are fault- tolerant relative to the fault model. We simulated the TAL FT hardware using the Velocity Research Compiler. Despite essentially doubling the number of instructions, TAL FT is only 34% slower than the equivalent fault-intolerant code. Formalizing the TAL FT System Operational semantics ! specifies how a program executes. is a machine state (R,C,M,Q,ir). Special operational rules randomly introduce faults by arbitrarily modifying a value in the register file R or the queue Q Judgment ` Z states that machine state is well-typed under zap tag Z. Z is either empty or a color c, representing the color of the computation that has been corrupted. If Z is empty then all standard typing invariants must hold. If Z is a color c, then values colored c do not need to conform to their given types. The Fault-tolerance Theorem A program is fault tolerant when all the faulty executions simulate fault-free executions of the program. The relationship 1 sim c 2 says that 1 and 2 are identical modulo values colored c. (Simplified) Fault-Tolerance Theorem: Either a faulty execution is indistinguishable from the equivalent fault-free execution, or it terminates in the fault state. If ` and sim c 1 and ! then either 1 ! 2 and sim c 2 Transient Faults Transient faults (also known as soft fails) occur when an energetic particle strikes a transistor, causing it to change state. Does not permanently damage hardware. May corrupt computation by altering stored values and signal transfers. 1 + 1 = 18 Sun, HP, and Cypress Semiconductor have admitted that transient faults caused crashes and problems at eBay, AOL, Los Alamos, and other major sites. Faster clock rates, increasing transistor density, decreasing voltages, and smaller feature size all contribute to increasing fault rates of approximately 8% per generation. 2006 2014 (estimated) Transient faults are a well- known problem in the architecture community. There are many existing solutions, but do these solutions actually work? Solutions generally consist of algorithms stated in English, and it is left to the audience to judge correctness. In 2004, a typical laptop with 1GB RAM had 1 soft failure per year. Fault rates increase with altitude. Fault rates on a trans-pacific flight increase 300x to nearly 1 fault per roundtrip. Frances Perry. CRA-W/CDC Programming Languages Summer School. May 9-11, 2007.

Upload: georgina-matthews

Post on 28-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Fault-tolerant Typed Assembly Language Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker Princeton University

Fault-tolerant Typed Assembly Language

Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David WalkerPrinceton University

TALFT: An Idealized Fault-tolerant System

Goal: Detect all faults that change a program’s observable behavior and prove that well-typed programs always detect a single fault.

General Strategy: Two redundant and independent copies of each computation (labeled green and blue) that are compared before stores and control flow transfers.

The TALFT Machine

A machine state consists of a code memory C, a value memory M, a register file R, a current instruction ir, and a store queue Q.

r1 ...pcG

Rir

. . .0x0393 mov r1 40x0394 mov r3 40x0395 stG r2 r1

0x0396 stB r4 r3

. . .

MCQ

A value is observed when it is stored to memory.

Q buffers a store from the green computation until the corresponding blue store is performed.

Special hardware is used by memory and control flow instructions to detect mismatches between the computations and trigger the special fault state.

Performance Results

Conclusion

For More Information…

See our upcoming paper in PLDI ’07 http://www.cs.princeton.edu/~frances/tal_ft-pldi07.pdf

Visit the Project ap Homepage http://www.cs.princeton.edu/sip/proj/zap

Transient faults are already a significant concern, and their impact is increasing.

TALFT formalizes idealized hardware for a hybrid hardware-software fault-tolerance scheme.

TALFT’s type system captures the invariants necessary to reason about the system.

We formally proved that all well-typed programs are fault-tolerant relative to the fault model.

We simulated the TALFT hardware using the Velocity Research Compiler.

Despite essentially doubling the number of instructions, TALFT is only 34% slower than the equivalent fault-intolerant code.

Formalizing the TALFT System

Operational semantics ! ’ specifies how a program executes.• is a machine state (R,C,M,Q,ir).

• Special operational rules randomly introduce faults by arbitrarily modifying a value in the register file R or the queue Q

Judgment `Z states that machine state is well-typed under zap tag Z.

Z is either empty or a color c, representing the color of the computation that has been corrupted.• If Z is empty then all standard

typing invariants must hold.

• If Z is a color c, then values colored c do not need to conform to their given types.

The Fault-tolerance Theorem

A program is fault tolerant when all the faulty executions simulate fault-free executions of the program.

The relationship 1 simc 2 says that 1 and 2 are identical modulo values colored c.

(Simplified) Fault-Tolerance Theorem: Either a faulty execution is indistinguishable from the equivalent fault-free execution, or it terminates in the fault state.

If ` and simc 1 and ! ’

then either 1 ! 2 and ’ simc 2

or 1 ! fault.

Transient Faults

Transient faults (also known as soft fails) occur when an energetic particle strikes a transistor, causing it to change state.• Does not permanently damage

hardware.

• May corrupt computation by altering stored values and signal transfers.

1 + 1 = 18

Sun, HP, and Cypress Semiconductor have admitted that transient faults caused crashes and problems at eBay, AOL, Los Alamos, and other major sites.

Faster clock rates, increasing transistor density, decreasing voltages, and smaller feature size all contribute to increasing fault rates of approximately 8% per generation.

2006

2014(estimated)

Transient faults are a well-known problem in the architecture community.

There are many existing solutions, but do these solutions actually work? • Solutions generally consist of algorithms

stated in English, and it is left to the audience to judge correctness.

In 2004, a typical laptop with 1GB RAM had 1 soft failure per year. • Fault rates increase with altitude.

• Fault rates on a trans-pacific flight increase 300x to nearly 1 fault per roundtrip.

Frances Perry. CRA-W/CDC Programming Languages Summer School. May 9-11, 2007.