BulletProof: A Defect-Tolerant CMP Switch Architecture
HPCA, Austin, Texas, February 13, 2006
Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky†
‡Advanced Computer Architecture Lab, University of Michigan
†Department of Electrical and Computer Engineering, University of Texas at Austin
Introduction
• Reliability is a critical aspect of any computer design
• System designers target very small failure rates
• Today, reliability targets are met by using fault-avoidance design techniques
  – use of conservative design margins
• For future process technologies, it will be impossible to avoid system failures by using conservative design margins
  – need defect-tolerant design techniques
[Figure: transistor reliability versus transistor lifetime (years), with curves labeled "Now" and "Future"]
Reliable System Design Space
• Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components – "BulletProof"
[Table: design feature (rows) versus type of defect (columns: manufacturing defect / wear-out defect / transient error)]
• No detection: untestable defects / system fails in unpredictable way / system glitch manifests in unpredictable way
• Detection: testing / component terminates at first error / component terminates, hard-reset restore
• Detection + correction: post-manufacturing recovery / online defect recovery / transient fault recovery
• Detection + correction + repair: post-manufacturing reconfiguration / online repair
Techniques placed on the chart: DMR, ECC (memory), cache-line swap-out, memory-array spares, TMR, Diva, Razor, BulletProof
Legend: mainstream solutions, high-end solutions, specialized solutions, research-stage solutions
CMP Switch Architecture
• Goal: a defect-tolerant CMP switch design
• Baseline switch architecture is provided by Li-Shiuan Peh
• Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
• Wormhole switch pipelined at the flit level (32-bit flits)
• Dimension-order routing
• Specified in Verilog and synthesized to a gate-level netlist
  – ~9K logic gates and 1,700 sequential elements
[Figure: switch block diagram with five input controllers (each containing input buffers, VC state, and routing logic), a switch arbiter, a cross-bar controller, and the cross-bar]
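To make the routing function concrete, here is a minimal Python sketch of dimension-order routing on a 2D torus. The slide only names the routing scheme and the topology; resolving X before Y, taking the shorter wraparound direction, and the names `next_hop` and `route` are all assumptions made for illustration.

```python
def _shortest_step(src, dst, size):
    """Return +1 or -1, whichever direction reaches dst in fewer hops on a ring."""
    forward = (dst - src) % size
    # Ties go forward (an arbitrary choice for this sketch).
    return 1 if forward <= size - forward else -1


def next_hop(cur, dst, width, height):
    """One dimension-order step: fully resolve X first, then Y (assumed order)."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return ((cx + _shortest_step(cx, dx, width)) % width, cy)
    if cy != dy:
        return (cx, (cy + _shortest_step(cy, dy, height)) % height)
    return cur  # already at the destination


def route(src, dst, width, height):
    """List every node visited from src to dst, both inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, width, height))
    return path


if __name__ == "__main__":
    # On a 4x4 torus the X wraparound link makes (0,0) -> (3,0) a single hop.
    print(route((0, 0), (3, 1), 4, 4))
```

Because each dimension is finished before the next one starts, the next hop depends only on the current node and the destination, which keeps the per-port routing logic small.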
Soft Error (SEU) Vulnerability
• In earlier work we studied the vulnerability of the switch architecture to soft errors
  – Only 3.2% of faults eventually cause an error
• Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
• In this work we focus on solutions for in-field silicon defects
• These solutions also provide soft-error tolerance to the design
Self-Repairing Systems
• Defect-tolerant self-repairing systems need to support:
  – Error detection
  – System diagnosis (locate the origin of the error)
  – System repair
  – System recovery
• Key idea:
  – Error detection must be performance efficient
     • It continuously checks execution for errors
  – Diagnosis, repair, and recovery are not performance critical
     • They are invoked only when an error is detected (a rare scenario)
     • Performance can be traded for more cost-efficient techniques
Traditional Defect-Tolerant Techniques
• Traditional techniques for designing defect-tolerant systems:
  – Triple Modular Redundancy (TMR)
     • Forward recovery
     • Applicable to both combinational and sequential logic
     • Cannot tolerate more than one defective module
     • Area and power overhead ~3X
  – Error Correction Codes (ECC)
     • Lower-overhead solution
     • Applicable only to state-holding structures and buses
[Figure: TMR with three module copies (M) feeding a voter (V); ECC word layout R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8, where R1 to R4 are ECC bits and D1 to D8 are data bits]
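Both traditional techniques can be sketched in a few lines of Python. The voter below is the standard bitwise majority function. The ECC part assumes the R/D layout in the figure is a single-error-correcting Hamming code with check bits at positions 1, 2, 4, and 8; the slide shows only the bit layout, so that interpretation and all function names are assumptions.

```python
PARITY_POS = (1, 2, 4, 8)                  # R1..R4 in the figure
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)     # D1..D8 in the figure


def tmr_vote(a, b, c):
    """Bitwise majority of three module outputs; masks one faulty module."""
    return (a & b) | (a & c) | (b & c)


def hamming_encode(data_bits):
    """Place 8 data bits and compute 4 check bits; returns 12 bits (positions 1..12)."""
    word = [0] * 13                        # index 0 unused so indices match positions
    for pos, bit in zip(DATA_POS, data_bits):
        word[pos] = bit
    for p in PARITY_POS:
        # Check bit p covers every position whose index has bit p set.
        for i in range(1, 13):
            if i != p and i & p:
                word[p] ^= word[i]
    return word[1:]


def hamming_correct(codeword):
    """Fix at most one flipped bit and return the 8 data bits."""
    word = [0] + list(codeword)
    syndrome = 0
    for i in range(1, 13):
        if word[i]:
            syndrome ^= i                  # XOR of the positions of all set bits
    if 1 <= syndrome <= 12:
        word[syndrome] ^= 1                # the syndrome names the faulty position
    return [word[pos] for pos in DATA_POS]


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    cw = hamming_encode(data)
    cw[6] ^= 1                             # inject a single-bit fault
    print(hamming_correct(cw) == data)
    print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))
```

The contrast in cost is visible in the sketch: the voter needs three full copies of whatever produces `a`, `b`, and `c`, while the code adds 4 bits to 8 but only protects stored or transmitted state.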
Error Detection: Low-Cost Domain-Specific Technique
• The synthesized netlist of the added components accounts for ~10% of the total switch area
• Provides error detection for both hard and soft errors
[Figure: switch block diagram annotated with the added checkers. Labels: input buffers, buffer checker, header, routing logic, ARB, cross-bar controller, cross-bar, CRC checkers, flit with CRC, error signal]
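As a rough illustration of the CRC checkers in the figure, the sketch below attaches a checksum to a 32-bit flit and verifies it at the output. The slides do not say which CRC is used; the 8-bit polynomial 0x07 and the function names are assumptions chosen only to keep the example small.

```python
CRC8_POLY = 0x07  # assumed polynomial x^8 + x^2 + x + 1; not taken from the slides


def crc8(data):
    """Bitwise CRC-8 over a bytes object, MSB first, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ CRC8_POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit):
    """Return (flit, crc) for a 32-bit flit payload."""
    return flit, crc8(flit.to_bytes(4, "big"))


def check_flit(flit, crc):
    """True if the flit still matches the CRC that travelled with it."""
    return crc8(flit.to_bytes(4, "big")) == crc


if __name__ == "__main__":
    flit, crc = attach_crc(0xDEADBEEF)
    print(check_flit(flit, crc))               # intact flit
    print(check_flit(flit ^ (1 << 13), crc))   # one corrupted bit in the datapath
```

A check like this is end-to-end: it does not care whether the bit was corrupted by a permanent defect or a transient upset, which is why one mechanism covers both hard and soft errors.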
Adding Defect Resiliency With Lower Cost
• Automatic cluster decomposition
• Balanced recursive min-cut heuristic algorithm
  – Input: a) the design's gate-level netlist, b) the number of partitions
  – Output: a partitioned netlist
  – Goals:
     - Balance partition sizes: smaller partitions give higher resilience
     - Minimize cut edges: reduces cost overhead and vulnerable logic
• Partitions can have both combinational and sequential logic
[Figure: example netlist graph with nodes A through J, shown before and after partitioning]
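The slide does not spell out the heuristic, so the following is only a sketch of balanced recursive bisection under stated assumptions: each split starts from an arbitrary balanced cut and is refined by greedy pairwise swaps (a simplified relative of Kernighan-Lin, swapped in here for illustration). The graph, node names, and function names are hypothetical.

```python
def cut_size(edges, side_a):
    """Count edges with exactly one endpoint in side_a."""
    return sum(1 for u, v in edges if (u in side_a) != (v in side_a))


def bisect(nodes, edges, left_size):
    """Split nodes into two sides of fixed sizes, then swap pairs while the cut shrinks."""
    order = sorted(nodes)
    a, b = set(order[:left_size]), set(order[left_size:])
    local = [(u, v) for u, v in edges if u in nodes and v in nodes]
    improved = True
    while improved:
        improved = False
        best = cut_size(local, a)
        for x in sorted(a):
            for y in sorted(b):
                trial = (a - {x}) | {y}          # swapping keeps both sizes unchanged
                if cut_size(local, trial) < best:
                    a, b = trial, (b - {y}) | {x}
                    improved = True
                    break
            if improved:
                break                             # restart the scan after each swap
    return a, b


def partition(nodes, edges, k):
    """Recursively bisect into k parts whose sizes differ as little as possible."""
    nodes = set(nodes)
    if k > len(nodes):
        raise ValueError("more partitions than nodes")
    if k == 1:
        return [nodes]
    k_left = k // 2
    a, b = bisect(nodes, edges, len(nodes) * k_left // k)
    return partition(a, edges, k_left) + partition(b, edges, k - k_left)


if __name__ == "__main__":
    # Two 4-node clusters joined by a single edge (hypothetical netlist).
    c1, c2 = "ACEG", "BDFH"
    edges = [(u, v) for c in (c1, c2) for i, u in enumerate(c) for v in c[i + 1:]]
    edges.append(("A", "B"))
    for part in partition(c1 + c2, edges, 2):
        print(sorted(part))
```

Keeping the side sizes fixed during refinement is what enforces the balance goal, while accepting only swaps that shrink the cut addresses the second goal of fewer cut edges.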
Partition Sparing – Silicon Protection Factor
[Figure: the example netlist split into partitions {A, B, F}, {D, E, H}, and {C, G, I, J}, each paired with an identical spare copy]
• Partition sparing:
  – Only one spare is active for each partition of the switch
  – Replace voting logic with spare-swapping logic
  – Lower power overhead
  – A defect is fatal if it hits the last spare of a partition or the spare-swapping logic
• Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead
  – The number of defects in a design is proportional to the design's area
  – Enables comparison of different defect-tolerant designs
[Figure: defect resiliency versus number of partitions (0 to 900) with 1 extra spare per partition; curves for mean defects to failure and SPF (defect tolerance). Annotations: 15.8X more defects tolerated; 7.6X more defects tolerated per unit area]
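A small Monte Carlo model helps show how mean defects to failure and SPF relate. This is a sketch under simplifying assumptions that are not from the slides: all partition copies have equal area, defects land uniformly at random, the spare-swapping logic is never hit and adds no area, and the design fails once every copy of some partition is defective.

```python
import random


def defects_to_failure(num_partitions, spares, rng):
    """Inject random defects until some partition has no working copy left."""
    copies = spares + 1
    total = num_partitions * copies
    hit = [False] * total              # which physical copies are already defective
    dead = [0] * num_partitions        # defective copies per partition
    defects = 0
    while True:
        defects += 1
        site = rng.randrange(total)    # uniform over area (equal-sized copies assumed)
        if not hit[site]:
            hit[site] = True
            part = site // copies
            dead[part] += 1
            if dead[part] == copies:
                return defects


def mean_defects_to_failure(num_partitions, spares, trials=20000, seed=0):
    rng = random.Random(seed)
    runs = (defects_to_failure(num_partitions, spares, rng) for _ in range(trials))
    return sum(runs) / trials


def spf(num_partitions, spares, trials=20000, seed=0):
    """SPF = mean defects to failure / area overhead (area taken as spares + 1 here)."""
    area = spares + 1
    return mean_defects_to_failure(num_partitions, spares, trials, seed) / area


if __name__ == "__main__":
    for n in (1, 12, 206):
        print(n, round(spf(n, spares=1), 2))
```

An unprotected design fails on its first defect and has an area factor of 1, so its SPF is 1 by construction; that is the baseline against which the protected designs are normalized.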
System Recovery
• Add a recovery pointer to each input buffer
• Recovery pointers advance 4 cycles after the input controller grants the requesting output channel
  – Guarantees that the flit has been CRC checked
• On error detection:
  – All CRC checkers drop outgoing flits
  – Switch pipeline is flushed
  – Head pointers are set to recovery pointers
  – Restart execution
[Figure: interconnect switch with a CRC checker on each routed-flit output; the error detection signal feeds the recovery logic. Input buffer example holding flits a to e with tail, head, and recovery head pointers: a is a correctly routed flit; b and c are in the switch pipeline; d is the next flit to be routed; e is the last flit buffered]
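The recovery-pointer mechanism can be modeled with a small cycle-stepped class. This is a behavioral sketch, not the hardware: the 4-cycle delay comes from the slide, while the class name, the `tick`/`on_error` interface, and the use of list indices as pointers are assumptions.

```python
from collections import deque


class InputBuffer:
    CHECK_DELAY = 4  # cycles after a grant before the flit is known to be CRC checked

    def __init__(self, flits):
        self.flits = list(flits)
        self.head = 0             # next flit to be routed
        self.recovery = 0         # oldest flit not yet confirmed by a CRC checker
        self.in_flight = deque()  # grant cycles of flits still in the pipeline
        self.cycle = 0

    def tick(self, grant=False):
        """Advance one cycle; optionally grant the flit at the head pointer."""
        self.cycle += 1
        # Advance the recovery pointer past flits granted CHECK_DELAY cycles ago.
        while self.in_flight and self.cycle - self.in_flight[0] >= self.CHECK_DELAY:
            self.in_flight.popleft()
            self.recovery += 1
        if grant and self.head < len(self.flits):
            self.in_flight.append(self.cycle)
            self.head += 1

    def on_error(self):
        """Flush in-flight flits and replay from the last confirmed position."""
        self.in_flight.clear()
        self.head = self.recovery


if __name__ == "__main__":
    buf = InputBuffer("abcde")
    for grant in (True, False, False, True, True):
        buf.tick(grant)
    # a is confirmed, b and c are in the pipeline, d is next to be routed.
    print(buf.flits[buf.recovery], buf.flits[buf.head])
    buf.on_error()
    print(buf.flits[buf.head])  # replay restarts at b
```

The example reproduces the buffer state drawn on the slide: after an error, only the unconfirmed flits b and c are sent again, and a is never replayed.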
System Diagnosis and Repair
• Iterative trial-and-error technique (flowchart):
  – Recover to the last correct state of the switch
  – For partition i, swap in the spare for the current copy and restart execution
  – Error detected?
     • No: continue execution
     • Yes: if i < # partitions, increase i and repeat; otherwise the defect is fatal
• Built-In Self-Test (BIST)
  – For each partition, keep automatically generated test vectors in ROM
  – Apply test vectors to each partition through scan chains to locate the defective partition
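The iterative trial-and-error flowchart can be sketched as a loop over partitions. Several details are assumptions, since the slide leaves them open: a failed trial swap is undone before the next partition is tried, partitions with no unused spare are skipped, and the recover-and-replay step is abstracted into a caller-supplied `has_error` predicate. All names are hypothetical.

```python
def iterative_repair(active, spares_left, has_error):
    """Try swapping in a spare for each partition until replay succeeds.

    active[i]      -- index of the copy currently used by partition i
    spares_left[i] -- number of unused spares for partition i
    has_error(cfg) -- replays from the last correct state with configuration cfg
                      and reports whether the error shows up again
    Returns the repaired partition index, or None for a fatal defect.
    """
    for i in range(len(active)):
        if spares_left[i] == 0:
            continue                   # nothing to swap in for this partition
        trial = list(active)
        trial[i] += 1                  # tentatively switch partition i to its next spare
        if not has_error(trial):
            active[i] = trial[i]       # keep the swap; partition i held the defect
            spares_left[i] -= 1
            return i
        # Error persists: drop the trial swap and move on to the next partition.
    return None


if __name__ == "__main__":
    defective = {(2, 0)}               # (partition, copy) pairs hit by a defect

    def has_error(cfg):
        return any((p, c) in defective for p, c in enumerate(cfg))

    active, spares = [0, 0, 0, 0], [1, 1, 1, 1]
    print(iterative_repair(active, spares, has_error), active)
    defective.add((2, 1))              # the spare of partition 2 fails later
    print(iterative_repair(active, spares, has_error))
```

In the worst case the loop replays once per partition, which is acceptable under the deck's premise that diagnosis runs rarely and so can trade speed for lower hardware cost than BIST.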
Exploring Defect-Tolerant CMP Switch Designs
[Figure: scatter plot of 38 numbered design points; x-axis: normalized defect resiliency, Silicon Protection Factor (SPF), 0 to 12; y-axis: area overhead, 0 to 5. Points are marked as Pareto optimal or Pareto sub-optimal designs; arrows indicate the directions of more robust designs and cheaper designs]
Design points with recoverable labels: 1-S_TMR, 2-S+CL_TMR, 3-S+CL_TMR+ECC, 4-C_TMR, 7-G_TMR, 8-S_1SP_IR, 9-S+CL_1SP_IR, 10-S+CL_1SP_BIST, 14-C_1SP_BIST, 15-C+CL_1SP_IR, 18-S+CL_2SP_IR, 19-S+CL_2SP_BIST, 21-C_2SP_IR, 22-C_2SP_BIST, 23-C+CL_2SP_IR, 24-C+CL_2SP_BIST, 27-C_3SH(IC)_IR, 30-C_2SH(IC)+1SP_IR, 31-C_3SH(IC)+1SP_IR, 37-C_5SH(IC)+2SP_IR, 38-S_ECC
Highlighted designs:
• 12 partitions (cmps), TMR: Area = 3.04X, SPF = 1.54
• 12 partitions (cmps), 2/5 spare input controllers, 1 spare per cmp. (rest), iterative replay: Area = 1.76X, SPF = 2.53
• 206 partitions, 1 spare per partition, Built-In Self-Test: Area = 3.16X, SPF = 5.54
• 206 partitions, 1 spare per partition, iterative replay: Area = 2.3X, SPF = 7.6
• 206 partitions, 2 spares per partition, iterative replay: Area = 3.4X, SPF = 11.1
How do these techniques affect the system's lifetime?
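The Pareto classification can be illustrated with the five highlighted designs, using the area and SPF values from the callouts. The short names are hypothetical, and the result covers only these five points, not all 38 designs on the plot, so it need not match the slide's own marking.

```python
# (area overhead, SPF) taken from the slide callouts; the keys are made-up labels.
DESIGNS = {
    "tmr_12": (3.04, 1.54),
    "mixed_spares_12": (1.76, 2.53),
    "bist_1sp_206": (3.16, 5.54),
    "replay_1sp_206": (2.30, 7.60),
    "replay_2sp_206": (3.40, 11.10),
}


def pareto_optimal(designs):
    """Keep designs that no other design beats on both area (lower) and SPF (higher)."""
    front = []
    for name, (area, spf) in designs.items():
        dominated = any(
            a <= area and s >= spf and (a < area or s > spf)
            for other, (a, s) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    print(pareto_optimal(DESIGNS))
```

Within this subset, the TMR and BIST points are dominated: in each case another listed design offers a higher SPF at a lower area overhead.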
"Bathtub Curve": A Model for Semiconductor Hard Failures
• The lifetime failure rate of semiconductor systems follows what is known as the bathtub curve
• Trend for future process technologies:
  – Failure rate during the grace period gets larger
  – Breakdown period begins earlier in the system's lifetime
[Figure: failure rate (FIT) versus time across the infant period, grace period, and breakdown period, with a second curve for future process technologies]
System Lifetime – A Post-65nm Technology Case Scenario
[Figure: defective parts (%) versus time (years, 0 to 12), with failure rate (FIT, 12,000 to 120,000) on a second axis. Curves: TMR (SPF = 1.54); 3/5 spare IC, 1 spare rest (SPF = 3.01); 1 spare (SPF = 7.63); 2 spares (SPF = 11.11). Annotation: 1 defect every two years]
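The "1 defect every two years" annotation can be related to the FIT axis with a short conversion, given that one FIT is one failure per 10^9 device-hours. Connecting the annotation to a specific point on the axis is a reading of the slide rather than something it states, and the function names and the 8,760-hour year are assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored


def fit_from_interval_years(years):
    """Failure rate in FIT when one failure occurs every `years` years on average."""
    return 1e9 / (years * HOURS_PER_YEAR)


def interval_years_from_fit(fit):
    """Average number of years between failures at a given FIT rate."""
    return 1e9 / (fit * HOURS_PER_YEAR)


if __name__ == "__main__":
    print(round(fit_from_interval_years(2)))          # rate for one defect per two years
    print(round(interval_years_from_fit(120000), 2))  # interval at the top of the FIT axis
```

One defect every two years works out to roughly 57,000 FIT, which falls near the middle of the 12,000 to 120,000 FIT axis.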
Conclusions – Future Work
Conclusions
• Traditional mechanisms are insufficient for tolerating moderate numbers of defects
• Domain-specific techniques, along with resource sparing, iterative diagnosis, and reconfiguration, are more effective
• Modest-sized partitions are the most effective granularity at which to apply redundancy
Future Work
• Use of spare components based on component wear-out profiles
• Explore low-cost defect-tolerant techniques for microprocessors
Questions?