cs184b: computer architecture (abstractions and optimizations)

60
Caltech CS184 Spring2003 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 21: May 30, 2003 Defect and Fault Tolerance

Upload: vidar

Post on 13-Jan-2016

31 views

Category:

Documents


0 download

DESCRIPTION

CS184b: Computer Architecture (Abstractions and Optimizations). Day 21: May 30, 2003 Defect and Fault Tolerance. Today. Defect and Fault Tolerance Problem Defect Tolerance Fault Tolerance EAS Questionnaire??? (10 min). Probabilities. Given: N objects P yield probability - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon1

CS184b:Computer Architecture

(Abstractions and Optimizations)

Day 21: May 30, 2003

Defect and Fault Tolerance

Page 2: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon2

Today

• Defect and Fault Tolerance– Problem– Defect Tolerance– Fault Tolerance

• EAS Questionnaire??? (10 min)

Page 3: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon3

Probabilities

• Given:– N objects– P yield probability

• What’s the probability for yield of composite system of N items?– Asssume iid faults– P(N items good) = PN

Page 4: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon4

Probabilities

• P(N items good) = PN

• N=106, P=0.999999

• P(all good) ~= 0.37

• N=107, P=0.999999

• P(all good) ~= 0.000045

Page 5: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon5

Simple Implications

• As N gets large– must either increase reliability– …or start tolerating failures

• N– memory bits– disk sectors– wires – transmitted data bits– processors

Page 6: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon6

Increase Reliability?

• Psys = PN

• Want: Psys = constant

• c=ln(Psys)=N ln(P)

• ln(P)=ln(Psys)/N

• P=Nth root of Psys

Page 7: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon7

Problems

Page 8: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon8

Two “problems”

• Shorts, breaks– wire/node X shorted to power, ground,

another node

• Noise– node X value flips

• crosstalk• alpha particle• bad timing

Page 9: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon9

Defects

• Shorts example of defect

• Persistent problem– reliably manifests

• Occurs before computation

• Can test for at fabrication / boot time and then avoid

Page 10: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon10

Faults

• Alpha particle bit flips is an example of a fault

• Fault occurs dynamically during execution

• At any point in time, can fail– (produce the wrong result)

Page 11: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon11

First Step to Recover

Admit you have a problem

(observe that there is a failure)

Page 12: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon12

Detection

• Determine if something wrong?– Some things easy

• ….won’t start

– Others tricky• …one and gate computes F*T=>T

• Observability– can see effect of problem– some way of telling if defect/fault present

Page 13: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon13

Detection• Coding

– space of legal values < space of all values– should only see legal– e.g. parity, redundancy, ECC

• Explicit test (defects, recurring faults)– ATPG = Automatic Test Pattern Generation– Signature/BIST=Built-In Self-Test– POST = Power On Self-Test

• Direct/special access– test ports, scan paths

Page 14: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon14

Coping with defects/faults?

• Key idea: redundancy

• Detection:– Use redundancy to detect error

• Mitigating: use redundant hardware– Use spare elements in place of faulty

elements (defects)– Compute multiple times so can discard faulty

result (faults)

Page 15: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon15

Defect Tolerance

Page 16: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon16

Two Models

• Disk Drives

• Memory Chips

Page 17: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon18

VM/Disk Drive Model

• On startup OS checks physical memory– Notes bad pages– Treat all bad pages as permanently

allocated to the “bad page” process

• Never map user (OS) to a “bad page”– User never sees physical addresses– Doesn’t expect contiguous physical

memory

Page 18: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon19

Page Tables and Defective Memory

Real MemoryTask 1Page Table

Disk

Task 2

Task 3

xxx

xxx

Page 19: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon20

Memory Chips

• Provide model in hardware of perfect chip

• Model of perfect memory at capacity X

• Use redundancy in hardware to provide perfect model

• Yielded capacity fixed– discard part if not achieve

Page 20: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon21

Example: Memory

• Correct memory:– N slots– each slot reliably stores last value written

• Millions, billions, etc. of bits…– have to get them all right?

Page 21: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon22

Memory defect tolerance

• Idea:– few bits may fail– provide more raw bits– configure so yield what looks like a perfect

memory of specified size

Page 22: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon23

Memory Techniques

• Row Redundancy

• Column Redundancy

• Block Redundancy

Page 23: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon24

Row Redundancy

• Provide extra rows

• Mask faults by avoiding bad rows

• Trick:– have address decoder substitute spare

rows in for faulty rows– use fuses to program

Page 24: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon25

Spare Row

Page 25: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon26

Row Redundancy

[diagram from Keeth&Baker 2001]

Page 26: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon27

Column Redundancy

• Provide extra columns

• Program decoder/mux to use subset of columns

Page 27: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon28

Spare Memory Column

• Provide extra columns

• Program output mux to avoid

Page 28: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon29

Column Redundancy

[diagram from Keeth&Baker 2001]

Page 29: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon30

Block Redundancy

• Substitute out entire block– e.g. memory subarray

• include 5 blocks– only need 4 to yield perfect

• (N+1 sparing more typical for larger N)

Page 30: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon31

Spare Block

Page 31: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon32

Yield M of N

• P(M of N) = P(yield N) + (N choose N-1) P(exactly N-1)+ (N choose N-2) P(exactly N-2)…+ (N choose N-M) P(exactly N-M)…

[think binomial coefficients]

Page 32: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon33

M of 5 example• 1*P5 + 5*P4(1-P)1+10P3(1-P)2+10P2(1-

P)3+5P1(1-P)4 + 1*(1-P)5

• Consider P=0.9– 1*P5 0.59 M=5 P(sys)=0.59

– 5*P4(1-P)1 0.33 M=4 P(sys)=0.92

– 10P3(1-P)2 0.07 M=3 P(sys)=0.99

– 10P2(1-P)3 0.008

– 5P1(1-P)4 0.00045

– 1*(1-P)5 0.00001

Page 33: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon34

Repairable Area

• Not all area in a RAM is repairable– memory bits spare-able – io, power, ground, control not redundant

Page 34: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon35

Repairable Area

• P(yield) = P(non-repair) * P(repair)

• P(non-repair) = PN

– N<<Ntotal– Maybe P > Prepair

• e.g. use coarser feature size

• P(repair) ~ P(yield M of N)

Page 35: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon36

Consider HSRA

• Contains– wires– luts – switches

Page 36: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon37

HSRA

• Spare wires– most area in wires and switches– most wires interchangeable

• Simple model– just fix wires

Page 37: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon38

HSRA “domain” model

• Like “memory” model

• spare entire domains by remapping

• still looks like perfect device

Page 38: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon39

HSRA direct model

• Like “disk drive” model

• Route design around known faults– designs become device specific

Page 39: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon40

HSRA: LUT Sparing

• All LUTs are equivalent

• In pure-tree HSRA– placement irrelevant– i.e. which endpoint as

long as match logical tree

• skip faulty LUTs

Page 40: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon41

Simple LUT Sparing

• Promise N-1 LUTs in subtree of some size– e.g. 63 in 64-LUT subtree– shift try to avoid fault LUT– tolerate any one fault in each subtree

Page 41: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon42

More general LUT sparing

• “Disk Drive” Model

• Promise M LUTs in N-LUT subtree– do unique placement around faulty LUTs

Page 42: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon43

SCORE Array

• Has memory and HSRA LUT arrays

Page 43: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon44

SCORE Array

• …but already know how to spare– LUTs– interconnect

• in LUT array• among LUT arrays and memory blocks

– memory blocks

• Example how can spare everything in universal computing block

Page 44: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon45

…but

• Important not to expose adjacency– E.g. physically contiguous real memory– E.g. Physical adjacency in mesh

• How spare mesh?

Page 45: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon46

Transit Multipath

• Butterfly (or Fat-Tree) networks with multiple paths– showed last time

Page 46: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon47

Multiple Paths

• Provide bandwidth• Minimize congestion• Provide redundancy

to tolerate faults

Page 47: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon48

Routers May be faulty(links may be faulty)

• Static– always corrupt

message– not route or misroute

message

• Dynamic – occasionally corrupt

or misroute

Page 48: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon49

Metro: Static Faults

• Turn off – faulty ports (make look busy)– ports connected to faulty channels– ports connected to faulty routers

• As long as paths remain between all communication endpoints– still functions

Page 49: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon50

Multibutterfly Yield

Page 50: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon51

Multibutterfly Performancew/ Faults

Page 51: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon52

Fault Tolerance

Page 52: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon53

Metro: dynamic faults

• Detection: Check success– checksums on packets to see data intact– check destination (arrived at right place)– acknowledgement from receiver

• know someone received correctly

• If fail– resend message

• same as blocked route case

Page 53: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon54

Metro: dynamic faults

• Consequence– may have faulty components– as long as

• detection strong• there is a non-faulty path

– will eventually deliver an intact message• may deliver multiple times if fault in ack• hence earlier concern about idempotence

Page 54: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon55

Memory: Dynamic Faults

• Error Correcting Codes (ECC)

• Provide enough redundancy to – detect most any errors– correct typical errors

• Simple scheme:– row and column parity

• 2 sqrt(N) bits – correct any single bit error

• …better schemes in practice– [Caltech has whole course on this…EE127]

Page 55: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon56

Processing Faults?

• Simplest model detection:– parallel checking

• run N copies in parallel• compare results

Page 56: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon57

Processor/Gate Fault Handling

• What do on fault?– Stop (not do anything wrong)

• maybe just restart – adequate if soft error

• maybe “reconfigure” to substitute out faulty processor

– Vote• if have enough redundancy take majority• Von Neumann shows:

– Can contain error rate of system to error rate of single component

Page 57: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon58

Checkpoint and Rollback

• Commit state of computation at key points– to memory (ECC, RAID protected...)– …reduce to previously solved problem…

• On faults– recover state from last checkpoint– like going to last backup….– …(snapshot)

Page 58: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon59

Together

• Examples of handling faults in– processing– storage– interconnect

• All components of our system

Page 59: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon60

Big Ideas

• Left to itself:– reliability of system << reliability of parts

• Can design– system reliability >> reliability of parts [defects]– system reliability ~= reliability of parts [faults]

• For large systems– must engineer reliability of system

Page 60: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2003 -- DeHon61

Big Ideas

• Detect failures– static: directed test– dynamic: use redundancy to guard

• Repair with Redundancy

• Model– establish and provide model of correctness

• perfect model part (memory model)• visible defects in model (disk drive model)