ELEC 7770 Advanced VLSI Design, Spring 2014 (Apr 11)
Soft Errors and Fault-Tolerant Design
Vishwani D. Agrawal, James J. Danaher Professor
ECE Department, Auburn University
Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr14

Soft Errors
Soft errors are errors caused by the operating environment.
They are not due to a permanent hardware fault.
Soft errors are intermittent or random, which makes testing for them unreliable.
One way to deal with soft errors is to make hardware robust:
  Capable of detecting soft errors
  Capable of correcting soft errors
  Both measures are probabilistic

Some Early References
J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.

Causes of Soft Errors
Interconnect coupling (crosstalk).
Power supply noise: IR-drop, power droop, ground bounce.
Ignition noise.
Electromagnetic pulse (EMP).
Effects generally attributed to alpha-particles:
  Charged particles: electrons, protons, ions.
  Radiation (photons): X-rays, gamma-rays, ultra-violet light.

Sources of Alpha-Particles
Radioactive contamination in VLSI packaging material.
Ionosphere, magnetosphere and solar radiation.
Other electromagnetic radiation.

Alpha-Particle
Helium nucleus: two protons and two neutrons, mass = $6.65 \times 10^{-27}$ kg, charge = +2e ($e = 1.6 \times 10^{-19}$ C).
Rest-mass energy = 3.73 GeV

Soft Error Rate (SER)
Failures in time (FIT): One FIT is 1 error per billion hours of operation.
Alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
An MTBF of 1 year corresponds to $10^9/(365 \times 24) = 114{,}155$ FIT.
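
The FIT and MTBF units above are reciprocals of each other up to the factor of one billion hours. A minimal sketch of the conversion follows; the function names are illustrative and not from the lecture, and a year is taken as 365 days as in the slide's arithmetic.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored, as on the slide


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a failure rate in FIT (errors per 1e9 hours)."""
    return 1e9 / mtbf_hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a failure rate in FIT back to an MTBF in hours."""
    return 1e9 / fit


one_year_fit = mtbf_hours_to_fit(HOURS_PER_YEAR)
print(round(one_year_fit))  # 114155
```

The same two functions convert in either direction, since FIT × MTBF (in hours) is always one billion.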

Particle Strike
[Figure: an ion or charged particle strikes an n region in a p-substrate, leaving + and - charges along its path.]

Induced Current
[Figure: plot of induced current versus time.]
$I(t) = I_0 \left( e^{-t/a} - e^{-t/b} \right), \quad a \gg b$

Voltage Induced at a Node
$V = Q/C$, where $Q = \int I(t)\,dt$ and $C$ is the node capacitance.
A smaller node capacitance results in a larger voltage swing.
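
As a rough illustration of V = Q/C, the sketch below integrates the double-exponential current pulse and divides the collected charge by a few node capacitances. The pulse amplitude, time constants and capacitances are hypothetical values chosen only for illustration, and the model ignores the clamping of the swing by the supply rails.

```python
import math


def induced_current(t, i0, a, b):
    """Double-exponential pulse I(t) = I0 * (exp(-t/a) - exp(-t/b))."""
    return i0 * (math.exp(-t / a) - math.exp(-t / b))


def collected_charge(i0, a, b):
    """Closed-form integral of I(t) from 0 to infinity: I0 * (a - b)."""
    return i0 * (a - b)


def numeric_charge(i0, a, b, t_end, steps=100_000):
    """Trapezoidal integration of I(t) over [0, t_end], as a cross-check."""
    dt = t_end / steps
    total = 0.5 * (induced_current(0.0, i0, a, b) + induced_current(t_end, i0, a, b))
    for k in range(1, steps):
        total += induced_current(k * dt, i0, a, b)
    return total * dt


# Hypothetical pulse: 100 uA amplitude, a = 200 ps, b = 50 ps (a >> b is only loosely met).
I0, A, B = 100e-6, 200e-12, 50e-12
q = collected_charge(I0, A, B)  # about 15 fC

for c in (100e-15, 10e-15, 1e-15):  # hypothetical node capacitances in farads
    print(f"C = {c:.0e} F -> V = {q / c:.2f} V")
```

The loop shows the trend stated above: the same charge produces a ten times larger swing each time the capacitance drops by a factor of ten.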

Effect on Digital Circuit
[Figure: clocked digital circuit (IN, combinational logic, CK, OUT) with charged particles striking at two points.]

An SRAM Cell
[Figure: SRAM cell with VDD, word line (WL) and bit lines (BL), storing complementary values 0 and 1.]

SRAM Cell Struck by Alpha-Particle: Single-Event Upset (SEU)
[Figure: the same SRAM cell struck by charged particles; the stored values flip, 0→1 and 1→0.]

A Resistor Hardened SRAM Cell
[Figure: resistor-hardened SRAM cell with VDD, WL and bit lines (BL), storing 0 and 1.]

D-Latch
[Figure: D-latch with input D, CK = 0 and output Q, holding internal values 1 and 0.]

SEU in D-Latch
[Figure: the same D-latch with CK = 0 struck by charged particles; the held values flip, 1→0 and 0→1.]

Single Event Transients in Combinational Logic
[Figure: combinational logic between clocked (CK) elements, with signal values 0 and 1 marked, struck by charged particles.]

Effects of Transients
Error correcting effects
  Transient pulse is filtered by gate inertia
  Transient is blocked by an unsensitized path
  Transient is blocked by an inactive clock
Error enhancing effects
  Large number of gates can produce multiple pulses
  Fanouts can multiply error pulses

Typical Soft Error Distribution
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Soft Error Simulation
F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on Quality VLSI Design, January 2009, pp. 459-464.
F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010, pp. 225-230.

SEUs in FPGA
Parts that can be affected
  Look-up table (LUT)
  Configuration memory cell
  Flip-flop
  Block RAM
F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.

LUT
[Figure: look-up table with inputs F1, F2, F3, F4 selecting one of 16 memory cells to drive out. Memory cell contents, in the order shown: 1 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0.]

SEU in LUT
[Figure: the same LUT struck by a charged particle; the fourth memory cell changes from 1 to 0, giving contents 1 0 1 0 0 1 1 0 0 0 0 0 1 1 1 0.]
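
The effect of a single upset memory cell can be sketched by modelling the LUT as a 16-entry list. The cell contents are the ones on the slides; the mapping of F1..F4 to a cell index (F1 as the most significant bit) is an assumption made here, since the figure's wiring is not recoverable from the transcript.

```python
from itertools import product

# Memory cell contents from the LUT slide, in the order shown.
ORIGINAL = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]


def lut_out(cells, f1, f2, f3, f4):
    """Read the cell addressed by the inputs (assumed order: F1 is the MSB)."""
    index = (f1 << 3) | (f2 << 2) | (f3 << 1) | f4
    return cells[index]


def flip_cell(cells, index):
    """Return a copy of the contents with one cell inverted (an SEU)."""
    upset = list(cells)
    upset[index] ^= 1
    return upset


UPSET = flip_cell(ORIGINAL, 3)  # fourth cell: 1 changed to 0

# Input combinations for which the implemented function is now wrong.
affected = [
    inputs
    for inputs in product((0, 1), repeat=4)
    if lut_out(ORIGINAL, *inputs) != lut_out(UPSET, *inputs)
]
print(affected)  # [(0, 0, 1, 1)]
```

Under this model the upset changes the logic function for exactly one input combination, and the error persists until the cell is rewritten.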

Four Types of SEU in FPGA
[Figure: FPGA logic with a LUT (inputs F1, F2, F3, F4), flip-flop (FF), configuration memory cells (M) and block RAM, marking four SEU locations labeled Type 1 to Type 4.]

SEU Detection Methods
Hardware redundancy
Time redundancy
Error detection codes (EDC)
Self-checker techniques

SEU Mitigation Techniques
Triple modular redundancy (TMR)
Multiple redundancy with voting
Error detection and correction codes (EDAC)
Hardened memory cells
FPGA-specific methods
  Reconfiguration
  Partial configuration
  Rerouting design

Hardware Redundancy for Detection
[Figure: inputs feed the combinational logic and a duplicated copy; the two outputs are compared, and logic 1 indicates an error.]
Hardware overhead is high, ~100%.
Performance penalty is negligible.

Time Redundancy for Detection
[Figure: the combinational logic output is captured by two D flip-flops clocked at CK and CK + d; the two samples are compared, and logic 1 indicates an error.]
Hardware overhead is low.
Performance penalty (~d) = maximum detectable pulse width.

Repeat on Error Detection
[Figure: combinational logic with D flip-flops clocked at CK and CK + d, an error signal (logic 1 indicates error), and a C-element (C) driving the output.]
Operation: If an error is detected, then the output retains its previous value. Repeating the computation can produce the correct result.

Muller C-Element
A  B  output
0  0  0
0  1  Old output
1  0  Old output
1  1  1
[Figure: C-element symbol with inputs A and B, and an implementation using an SR latch (S, R, Q).]
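
The truth table of the Muller C-element can be captured in a few lines as a behavioral model. This is a sketch of the non-inverting table only, not of any of the transistor-level circuits; the class and method names are illustrative.

```python
class CElement:
    """Behavioral model of a Muller C-element (non-inverting version)."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        # Output follows the inputs only when they agree; otherwise it holds.
        if a == b:
            self.output = a
        return self.output


c = CElement(initial=0)
for a, b in [(1, 1), (0, 1), (1, 0), (0, 0), (1, 0)]:
    print(a, b, c.update(a, b))
```

Because the output changes only when both inputs agree, a transient on one input alone is held off, which is the property the error-detection and BISER schemes rely on.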

Dynamic CMOS C-Element
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
[Figure: C-element symbol and dynamic CMOS circuit with inputs A and B.]

Pseudostatic CMOS C-Element
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
[Figure: C-element symbol and CMOS circuit with inputs A and B and a weak keeper on the output.]

Built-In Soft Error Resilience (BISER)
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
[Figure: data from combinational logic drives a flip-flop and a duplicate flip-flop, both clocked by Clock; their outputs A and B feed a C-element with a weak keeper, which produces output.]

BISER
Assumptions:
  Most soft errors in combinational logic are eliminated by inertial or logic masking.
  Soft error pulse generated in flip-flop is much shorter than clock period.
  Probability of either a master or slave latch being struck by soft error exactly at clock edge is small.
Flip-flop is duplicated and outputs fed to C-element.
Twenty times reduction of soft error observed.
Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Triple Modular Redundancy (TMR)
[Figure: inputs feed combinational logic copies 1, 2 and 3; a majority voter combines the three results into output.]

TMR Error Reduction
Voter input error probability = E, assumed independent for each input.
Output error probability,
$e = \mathrm{Prob}(\text{two errors or three errors})$
$\;\; = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3$
$\;\; = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
For very small E, $E^3 \ll E^2$, so $e \approx 3E^2$.

TMR Error Probability
Input error probability, E    Output error probability, e
0.0                           0.0
0.001                         0.000002998
0.01                          0.000298
0.1                           0.028
0.2                           0.104
0.3                           0.216
0.4                           0.352
0.5                           0.5
0.6                           0.648
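
The table can be regenerated from the formula for two or three erroneous voter inputs. A minimal sketch, assuming independent input errors and a fault-free voter; the function name is illustrative.

```python
def tmr_output_error(E):
    """Probability that two or three of the three voter inputs are in error."""
    exactly_two = 3 * E**2 * (1 - E)  # C(3,2) ways to choose the two bad inputs
    all_three = E**3                  # C(3,3) = 1
    return exactly_two + all_three


for E in (0.0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"{E:<6} {tmr_output_error(E):.9f}")
```

The values show that voting reduces the error probability only for E below 0.5; at E = 0.5 it has no effect, and above that it makes the output worse than a single input.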

Majority Voter Circuit
A  B  C  output
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  1
[Figure: majority voter symbol and gate-level circuit with inputs A, B, C and output.]

Alternative Implementations of Voter
[Figure: two implementations of the voter with inputs A, B, C: a LUT with contents 00010111, and a transistor-level circuit connected to VDD.]
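
The two voter descriptions, the truth table and the LUT contents 00010111, can be checked against the usual two-level expression AB + BC + AC. In this sketch the LUT is assumed to be addressed by ABC read as a 3-bit number; because the majority function is symmetric, the reverse bit order gives the same result.

```python
from itertools import product

VOTER_LUT = "00010111"  # LUT contents from the slide


def majority_gates(a, b, c):
    """Two-level AND-OR form of the majority function."""
    return (a & b) | (b & c) | (a & c)


def majority_lut(a, b, c):
    """Look up the vote, addressing the LUT with ABC as a 3-bit number (assumed order)."""
    return int(VOTER_LUT[(a << 2) | (b << 1) | c])


agree = all(
    majority_gates(a, b, c) == majority_lut(a, b, c)
    for a, b, c in product((0, 1), repeat=3)
)
print(agree)  # True
```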

Triple Modular Redundancy (TMR)
[Figure: a single copy of combinational logic between inputs and output, with D flip-flops clocked at CK, CK + d, CK + 2d and CK + 3d and a majority voter.]

TMR for Memory Cells
[Figure: the combinational logic output is stored in three D flip-flops clocked by CK; a majority voter combines them into output.]
Problems:
1. Accumulation of errors in flip-flops.
2. Voter is not protected.

FF Refresh and TMR for Memory Cells
[Figure: three D flip-flops clocked by CK, four majority voters, and signals r1, r2, r3 and output.]

Reliability Analysis
Determine how long a system will work without failure.
Find:
  Mean time to failure (MTTF) or mean time between failures (MTBF)
  Mean time to repair (MTTR)
  FIT rate

Reliability Function
Reliability function of a system,
  R(t) = Probability of survival at time t
Determined from failure rates of components,
  λ(t) = Number of failures per unit time
Generally varies with time.

Failure Rate, λ(t)
[Figure: failures per second, λ(t), on a logarithmic axis from $10^{-12}$ to $10^{0}$, versus time t. Three regions: infant mortality; constant failure rate (useful life), λ(t) = λ; wearout or aging.]

Deriving R(t)
R(t) is the probability of no error in the interval [0, t].
Divide the interval into a large number (n) of subintervals of duration t/n. Let x be the probability of error in one subinterval.
Assume that the duration t/n is so small that at most one error can occur in a subinterval. Then, the average number of errors in a subinterval = 0·(1 – x) + 1·x = x = λt/n.
Probability of no error in interval [0, t] is,
$R(t) = (1 - x)^n = (1 - \lambda t/n)^n \to e^{-\lambda t}$ as $n \to \infty$, by the standard limit $\lim_{n \to \infty} (1 - y/n)^n = e^{-y}$.
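
The limiting step can be checked numerically by evaluating the n-subinterval product for growing n. The values of λ and t below are arbitrary illustrations, not taken from the lecture.

```python
import math


def reliability_n_subintervals(lam, t, n):
    """Probability of no error in n independent subintervals of length t/n."""
    return (1.0 - lam * t / n) ** n


lam, t = 0.5, 2.0               # hypothetical failure rate and time (lam * t = 1)
exact = math.exp(-lam * t)      # the claimed limit, exp(-lam * t)

for n in (10, 100, 1_000, 1_000_000):
    approx = reliability_n_subintervals(lam, t, n)
    print(f"n = {n:>9}: {approx:.6f} (error {abs(approx - exact):.2e})")
```

The error shrinks roughly in proportion to 1/n, which is consistent with the product converging to the exponential.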

R(t) and MTBF
$R(t) = e^{-\lambda t}$
Failure rate, λ = failures per unit time
Number of failures in time T = λT
$\mathrm{MTBF} = T/(\lambda T) = 1/\lambda = \int_0^\infty R(t)\,dt$
$R(t) = \exp(-t/\mathrm{MTBF})$
For t = MTBF, $R(\mathrm{MTBF}) = e^{-1} = 0.368$

Reliability and MTBF
[Figure: reliability R(t), from 1.0 down to 0.0, versus time t marked at MTBF, 2 MTBF and 3 MTBF. At t = MTBF, R(t) = 1/e = 0.368.]

Example: First Generation Computer
10,000 electron tubes.
Average burn out rate: 5 tubes per 100,000 hours.
MTBF = 100,000/5 = 20,000 hours = 2.3 years, i.e., 37% chance of survival beyond 2.3 years.
Time for 95% chance of survival:
  R(t) = exp(– t/MTBF) = 0.95, or t = 1.4 months
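
The numbers in this example follow directly from R(t) = exp(– t/MTBF). A short sketch that reproduces them; the function names are illustrative, and a month is taken as one twelfth of a 365-day year.

```python
import math

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12

mtbf_hours = 100_000 / 5  # 5 burn-outs per 100,000 hours


def reliability(t_hours, mtbf):
    """R(t) = exp(-t / MTBF) for a constant failure rate."""
    return math.exp(-t_hours / mtbf)


def time_for_reliability(r, mtbf):
    """Invert R(t): the time at which the survival probability drops to r."""
    return -mtbf * math.log(r)


t95 = time_for_reliability(0.95, mtbf_hours)
print(f"MTBF = {mtbf_hours / HOURS_PER_YEAR:.1f} years")
print(f"R(MTBF) = {reliability(mtbf_hours, mtbf_hours):.3f}")
print(f"95% survival up to {t95 / HOURS_PER_MONTH:.1f} months")
```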

Reliability of TMR
$R_{\mathrm{TMR}} = \mathrm{Prob}(\text{all three modules correct}) + \mathrm{Prob}(\text{exactly two modules correct})$
$\;\; = R^3 + 3R^2 (1 - R)$
$\;\; = 3R^2 - 2R^3$
$\;\; = 3e^{-2\lambda t} - 2e^{-3\lambda t}$

MTBF of TMR
$R_{\mathrm{TMR}}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
$\mathrm{MTBF} = \int_0^\infty R_{\mathrm{TMR}}(t)\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$
This is less than the MTBF = 1/λ for a single system!
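
The lower MTBF does not mean TMR is worse for short missions. Setting 3R² – 2R³ > R gives 1/2 < R < 1, so TMR is more reliable than a single module for t < ln 2/λ. The sketch below checks this crossover and the two MTBF values by numerical integration; λ = 1 and the integration limits are arbitrary choices made for illustration.

```python
import math


def r_single(lam, t):
    """Reliability of one module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """Reliability of TMR with a perfect voter: 3R^2 - 2R^3."""
    r = r_single(lam, t)
    return 3 * r**2 - 2 * r**3


def mtbf_numeric(rel, lam, t_end, steps=200_000):
    """Trapezoidal estimate of the integral of rel(lam, t) over [0, t_end]."""
    dt = t_end / steps
    total = 0.5 * (rel(lam, 0.0) + rel(lam, t_end))
    for k in range(1, steps):
        total += rel(lam, k * dt)
    return total * dt


lam = 1.0
crossover = math.log(2) / lam  # both curves equal 0.5 here

print(mtbf_numeric(r_single, lam, 40.0))  # close to 1/lam
print(mtbf_numeric(r_tmr, lam, 40.0))     # close to 5/(6*lam)
print(r_tmr(lam, 0.1) > r_single(lam, 0.1), r_tmr(lam, 2.0) < r_single(lam, 2.0))
```

TMR therefore pays off when the mission duration is short compared with the single-module MTBF.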

MTBF of TMR
[Figure: reliability R(t), from 1.0 down to 0.0, versus time t for a single module and for TMR, with the mission duration marked.]

Error Detection Code
Errors: Bits can flip due to noise in circuits and in communication.
Extra bits used for error detection.
Example: a parity bit in ASCII code
  Even parity code for A: 01000001 (even number of 1s)
  Odd parity code for A: 11000001 (odd number of 1s)
  The leading bit is the parity bit; the remaining seven bits are the 7-bit ASCII code.
A single-bit error in the 7-bit code of “A” (1000001), e.g., to 1000101, changes the symbol to “E”; an error to 1000000 changes it to “@”. The error is detected in the 8-bit code because it changes the specified parity.
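
The parity example can be reproduced with a few string operations. A minimal sketch, with the parity bit placed in front of the 7-bit code as in the example; the function names are illustrative.

```python
def parity_bit(bits7, even=True):
    """Return the bit that makes the total number of 1s even (or odd)."""
    bit = bits7.count("1") % 2  # 1 if the count is odd, so adding it makes it even
    if not even:
        bit ^= 1
    return str(bit)


def encode(bits7, even=True):
    """Prefix the 7-bit code with its parity bit."""
    return parity_bit(bits7, even) + bits7


def parity_ok(word8, even=True):
    """Check whether an 8-bit word still has the specified parity."""
    ones_are_even = word8.count("1") % 2 == 0
    return ones_are_even if even else not ones_are_even


a7 = format(ord("A"), "07b")       # '1000001'
print(encode(a7, even=True))       # 01000001
print(encode(a7, even=False))      # 11000001
print(parity_ok("01000101"))       # False: single-bit error (A -> E) is detected
print(parity_ok("01000111"))       # True: a double-bit error goes undetected
```

The last line illustrates the limit of a single parity bit: any even number of bit errors leaves the parity unchanged.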

Richard W. Hamming (1915-1998)
Error-correcting codes (ECC).
Also known for Hamming distance
  HD = Number of bits in which two binary vectors differ
  Example: HD(1101, 1010) = 3
Hamming Medal, 1988

The Idea of Hamming Code
Code space contains $2^N$ possible N-bit code words.
[Figure: code word 1010 (“A”) and its neighbors at HD = 1: 1110 (“E”), 1011 (“B”), 1000 (“8”) and 0010 (“2”). A 1-bit error in “A” produces another valid code word.]
Error not correctable. Reason: No redundancy.
Hamming’s idea: Increase HD between valid code words.

Hamming’s Distance ≥ 3 Code
[Figure: 7-bit code words 1010010 (“A”), 0010101 (“2”), 0011110 (“3”), 1000111 (“8”), 1011001 (“B”) and 1110100 (“E”), at mutual distances HD ≥ 3. A 1-bit error in “A” gives 0010010 (“?”), which is at HD = 1 from “A” and at HD ≥ 2 from the other code words, so shortest distance decoding eliminates the error.]

Minimum Distance-3 Hamming Code
Symbol, Original code, Odd-parity code, ECC (HD ≥ 3)
0 0000 10000 0000000
1 0001 00001 0001011
2 0010 00010 0010101
3 0011 10011 0011110
4 0100 00100 0100110
5 0101 10101 0101101
6 0110 10110 0110011
7 0111 00111 0111000
8 1000 01000 1000111
9 1001 11001 1001100
A 1010 11010 1010010
B 1011 01011 1011001
C 1100 11100 1100001
D 1101 01101 1101010
E 1110 01110 1110100
F 1111 11111 1111111
Original code: Symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”.
Reason: Hamming distance between codes is 1. A code with any bit error will map onto another valid code.
Remedy: Design codes with HD ≥ 2. Example: Parity code. Single bit error detected but not correctable.
Remedy: Design codes with HD ≥ 3. For single bit error correction, decode as the valid code at HD = 1.
For more error bit detection or correction, design code with HD ≥ 4.
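
Shortest-distance decoding with the ECC column of the table can be sketched as follows. The code words are copied from the table; the helper names are illustrative, and the decoder is a brute-force search over all 16 code words rather than the syndrome-based decoder a hardware implementation would more likely use.

```python
from itertools import combinations

# ECC column of the table (symbol -> 7-bit code word).
ECC = {
    "0": "0000000", "1": "0001011", "2": "0010101", "3": "0011110",
    "4": "0100110", "5": "0101101", "6": "0110011", "7": "0111000",
    "8": "1000111", "9": "1001100", "A": "1010010", "B": "1011001",
    "C": "1100001", "D": "1101010", "E": "1110100", "F": "1111111",
}


def hamming_distance(x, y):
    """Number of bit positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def min_distance(code):
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(x, y) for x, y in combinations(code.values(), 2))


def decode(word):
    """Return the symbol whose code word is nearest to the received word."""
    return min(ECC, key=lambda symbol: hamming_distance(ECC[symbol], word))


print(hamming_distance("1101", "1010"))  # 3
print(min_distance(ECC))                 # 3
print(decode("0010010"))                 # A (code word 1010010 with its first bit flipped)
```

With a minimum distance of 3, every single-bit error leaves the received word closer to the transmitted code word than to any other, so the nearest-code-word search corrects it.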

A Book on Coding Theory
R. W. Hamming, Coding and Information Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

Byzantine Empire, 527-565
[Figure: Emperor Justinian and General Belisarius.]

Byzantine General’s Problem
In a war a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
For success a perfect agreement is necessary.
Byzantine Fault:
  Subordinates can be unreliable or malicious.
  Communication (messengers) can be unreliable or malicious.

Example 1: Single Fault
General: D; Subordinates: A, B and C
[Figure: D sends the order r to A, B and C; one copy is corrupted, r→a.]

Example 1: Majority Agreement
General: D; Subordinates: A, B and C
[Figure: A, B and C exchange the orders they received (a, a, r, r, r, r); the resulting decisions are Retreat, Retreat, Retreat.]

Example 2: Two Faults
General: D; Subordinates: A, B and C
[Figure: D sends the order a to A, B and C.]

Example 2: Byzantine Failure
General: D; Subordinates: A, B and C
[Figure: the orders exchanged among A, B and C are r, r, r, r, a, a; the resulting decisions are Retreat, Attack, Attack, so agreement is not reached.]

Byzantine Resilient System
A system that can correctly function in presence of Byzantine faults.
Byzantine protocol for n node system:
  Any node can initiate a message broadcast.
  Each node rebroadcasts the received message to all nodes it has not heard from.
  After communications end, nodes take majority decision.
Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.

Byzantine Resilience Conditions
In order to tolerate t failures:
  The system must have at least 3t + 1 nodes.
  There must be at least 2t + 1 disjoint communication paths between nodes.
  A node must exchange messages at least t + 1 times.
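
The three conditions are simple functions of t. The sketch below tabulates them and inverts the node-count condition; the function and key names are illustrative.

```python
def byzantine_requirements(t):
    """Minimum resources needed to tolerate t Byzantine failures."""
    return {
        "min_nodes": 3 * t + 1,
        "min_disjoint_paths": 2 * t + 1,
        "min_exchange_rounds": t + 1,
    }


def max_tolerable_faults(n_nodes):
    """Largest t that satisfies n_nodes >= 3t + 1."""
    return (n_nodes - 1) // 3


for t in (1, 2, 3):
    print(t, byzantine_requirements(t))
print(max_tolerable_faults(4))  # 1
print(max_tolerable_faults(3))  # 0
```

A system of four nodes is therefore the smallest that can tolerate a single Byzantine fault, while three nodes can tolerate none.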

Four-Core Processor System
[Figure: four processor cores A, B, C and D.]

Example 1: C Initiates Message m, Sends n to A and m to B and D
Processor  First round  Second round  Decoded message
A          n            m m           m
B          m            m n           m
D          m            m n           m

Example 2: C Initiates Message m, B Sends p to A and D
Processor  First round  Second round  Decoded message
A          m            m p           m
B          m            m m           m
D          m            m p           m

Example 3: C Initiates Message m, A and B Generate Faulty Message q
Processor  First round  Second round  Decoded message
A          m            m q           m
B          m            m q           m
D          m            q q           q
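
The decoded-message column in the three tables is a majority vote over the messages each processor holds after both rounds. A minimal sketch with the message values copied from the tables; the names `decode_majority` and `EXAMPLES`, and the labels of the three cases, are illustrative.

```python
from collections import Counter


def decode_majority(received):
    """Return the most frequent message; ties are not handled specially."""
    return Counter(received).most_common(1)[0][0]


# Messages per processor: first-round message followed by the second-round ones.
EXAMPLES = {
    "C sends n to A": {"A": ["n", "m", "m"], "B": ["m", "m", "n"], "D": ["m", "m", "n"]},
    "B sends p": {"A": ["m", "m", "p"], "B": ["m", "m", "m"], "D": ["m", "m", "p"]},
    "A and B send q": {"A": ["m", "m", "q"], "B": ["m", "m", "q"], "D": ["m", "q", "q"]},
}

for name, received in EXAMPLES.items():
    decoded = {node: decode_majority(msgs) for node, msgs in received.items()}
    agreed = len(set(decoded.values())) == 1
    print(f"{name}: {decoded} agreement = {agreed}")
```

With one faulty node the three receivers agree, while two faulty nodes in a four-node system leave D with a different decision, as expected when fewer than 3t + 1 nodes are available.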

References
L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan-Kaufmann, 2001.