DUSD(S&T)
Self-Adaptable and Error-Resilient Design - coping with increasing variability and reliability concerns
Tim Cheng
Univ. of California, Santa Barbara
2
Sources of Component Failures
[Figure: on-die temperature variations; temperature scale 40 to 110 (C)]
[Figure: classification of failure sources (random defects, parametric variations, design errors, soft errors/SEU) by the attributes catastrophic vs. parametric, deterministic vs. probabilistic, permanent vs. transient, and hard vs. soft]
3
Test Challenges
Cost of manufacturing test not scaling
ATE falling further behind device speeds
Burn-in running out of steam
Increasing device integration: digital, analog/mixed signal, memory, software, high-speed buses, etc.
[Figure: integrated device with blocks High-Speed IO, CPU, Memory, Flash, DAC, ADC, Switch Fabric, FPGA, DSP]
Cost of Silicon Mfg and Test
[Figure: cost (cents/transistor, log scale from 10^-7 to 1) vs. year ('82 to '12); curves: Fab capital / transistor (Moore's law), Test capital / transistor (Moore's law for test)]
4
Implications
It is becoming harder and harder to design reliable components
Shorter term: demand for better silicon debug technologies to effectively find bugs that escaped verification and timing failures resulting from variations
Longer term:
One-time factory testing will be too costly and insufficient
Burn-in to catch chip infant mortality will not be practical
HW needs to dynamically self-test, detect errors, reconfigure, and adapt
On-line testing techniques will become necessary to trigger correction/reconfiguration
5
From Test to Recovery/Reconfiguration - Examples
Memory:
"BIST → BISD → BISR" a common practice
Error-Tolerant Cache Architecture [Purdue U.]
Dynamic circuits:
Using programmable keeper and on-chip leakage sensors for tuning performance and robustness
Microprocessor:
DIVA and RAZOR [U. of Michigan] for on-line checking and recovery
Assumption: low failure rate
Analog/RF/High-speed IO components:
Self-calibration: fine-tuning performance; more robust to process, temperature and voltage variations…
6
Fault Statistics in 64K Cache @45nm
[Figure: fault statistics histogram; x-axis: NFaulty-Cells (0 to 1049), y-axis: chip count Nchip (0 to 350); annotation: Conv. Yield ≈ 33.4%]
σVt ≈ 30 mV, using BPTM 45nm technology
NFaulty-Cells = PFault × NCells, where NCells is the total number of cells in a cache
Conventional 64K cache results in only 33.4% yield
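The relation NFaulty-Cells = PFault × NCells can be checked numerically. The following is a minimal sketch, assuming a 64 KB array counted as data bits only and a purely illustrative per-cell failure probability (the slide does not state PFault); the function name is hypothetical.

```python
def expected_faulty_cells(p_fault: float, n_cells: int) -> float:
    """Expected number of faulty cells: N_faulty = P_fault * N_cells."""
    if not 0.0 <= p_fault <= 1.0:
        raise ValueError("p_fault must be a probability in [0, 1]")
    return p_fault * n_cells


# Assumption: "64K cache" read as 64 KB of data bits (no tag or parity bits).
N_CELLS = 64 * 1024 * 8

# Assumption: illustrative P_fault value, not taken from the slides.
print(expected_faulty_cells(1e-3, N_CELLS))
```

Each chip in the histogram corresponds to one draw of PFault under process variation, so the spread of NFaulty-Cells across chips follows the spread of PFault.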
7
Error-Tolerant Cache Architecture
An error-tolerant, dynamically reconfigurable architecture:
Results in 94% yield vs 33% in conventional architecture
Does not affect cache access time
Transparent to the processor
Minimum performance loss (<4%)
[Figure: cache with 4 blocks in a row; components: row decoder (row address), column decoder (column address), column MUX, controller, config storage, faulty block; block select codes shown: "00" "01" "10" "11", "01" "01" "10" "11", "11" "10" "01" "00"]
Main ideas:
Assume BIST implemented
Resize cache to avoid faulty blocks during regular operation
Force column MUX to select a non-faulty block in same row if the accessed block is faulty
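The column-MUX redirection can be sketched in a few lines. This is a toy model, not the Purdue implementation: it assumes a per-row fault map produced by BIST and a "next non-faulty block, wrapping around" selection policy, both of which are assumptions made for illustration. The function name is hypothetical.

```python
from typing import List, Optional


def build_column_remap(faulty: List[bool]) -> Optional[List[int]]:
    """For one cache row, map each block index to a non-faulty block.

    faulty[i] is True when BIST marked block i as faulty.
    Returns None when every block in the row is faulty.
    """
    n = len(faulty)
    remap = []
    for block in range(n):
        for step in range(n):
            candidate = (block + step) % n  # assumed policy: scan forward, wrap
            if not faulty[candidate]:
                remap.append(candidate)
                break
        else:
            return None  # no usable block in this row
    return remap


# Block 0 is faulty, so accesses to it are redirected to block 1.
print(build_column_remap([True, False, False, False]))
```

With block 0 faulty the select codes become 01, 01, 10, 11, which matches one of the code rows shown in the figure.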
8
Error Tolerant Capability
[Figure: chip count (Nchip, 0 to 350) vs. NFaulty-Cells (0 to 1049); series: fault statistics, chips saved by the proposed architecture + redundancy (R=8, r=3), chips saved by ECC + redundancy (R=16); annotations: more chips saved compared to ECC, ECC fails to save any chips]
The proposed architecture can handle more faulty cells than ECC, as many as 890 faulty cells, with marginal performance loss
9
Self-Tuning Using On-Chip Current Monitoring – SRAM
[Figure: self-repairing SRAM block diagram; labels: SRAM array, VDD, bypass switch, online leakage monitor, Vout, comparator, VREF1, VREF2, body-bias generator, Vbody, calibrate signal]
[Figure: yield [%] vs. STD of inter-die Vt variation [V]; curves: 64KB SRAM array with ZBB, 64KB self-repairing SRAM array with ABB; annotation: 8%-40%]
ZBB = Zero Body Bias, ABB = Adaptive Body Bias
Self-repairing SRAM using on-chip current monitoring and adaptive body biasing (ABB)
Effective in achieving high yield in nanometer technologies
Mukhopadhyay et al., ITC'05
Source: K. Roy et al, Purdue
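The monitor-comparator-bias loop can be sketched as a three-way decision. This is a minimal sketch under stated assumptions: the monitor output is assumed to rise with array leakage, VREF1 and VREF2 are assumed to be the lower and upper references, and the policy (reverse bias for leaky dies, forward bias for low-leakage dies) is the usual adaptive-body-bias convention rather than something spelled out on the slide. All names are hypothetical.

```python
def select_body_bias(v_out: float, v_ref_low: float, v_ref_high: float) -> str:
    """Pick a body-bias mode from the leakage monitor output.

    Assumption: v_out increases with array leakage.
    Returns "RBB" (reverse), "ZBB" (zero) or "FBB" (forward body bias).
    """
    if v_ref_low >= v_ref_high:
        raise ValueError("v_ref_low must be below v_ref_high")
    if v_out > v_ref_high:
        return "RBB"  # high-leakage die: raise Vt to cut leakage
    if v_out < v_ref_low:
        return "FBB"  # low-leakage die: lower Vt to recover speed
    return "ZBB"      # leakage within the target window


# Illustrative voltages only.
print(select_body_bias(0.9, 0.4, 0.7))
```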
11
Dynamic Circuit Using Static Keeper
[Figure: dynamic circuit with a conventional static keeper; labels: clk, RS0 ... RS7, D0 ... D7, LBL0, LBL1, N0]
Keeper upsizing degrades average performance
12
Pessimistic Design Hurts Performance
[Figure: histogram of number of dies (0 to 200) vs. normalized IOFF (0 to 6), with the nominal corner and worst-case corner marked; 130nm CMOS measurements, 110°C]
Substantial variation in leakage across dies
4-5X variation between nominal and worst-case leakage
Performance determined at nominal leakage
Robustness determined at worst-case leakage
13
Programmable Keeper for Dynamic Ckts
3-bit programmable keeper
[Figure: dynamic circuit with programmable keeper; labels: clk, RS0 ... RS7, D0 ... D7, LBL0, LBL1, N0, b[2:0], keeper widths W, 2W, 4W, switches s]
Opportunistic speedup via keeper downsizing
C. Kim et al., VLSI Circuits Symp. '03
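The effect of the control bits on keeper strength can be illustrated with a short calculation. This sketch assumes that bit i of b[2:0] enables the keeper leg of width 2^i × W, which is the natural reading of the W/2W/4W labels but is not stated explicitly; the function name is hypothetical.

```python
def keeper_width(code: int, unit_width: float = 1.0) -> float:
    """Effective keeper width for a 3-bit setting b[2:0].

    Assumption: bit 0 enables W, bit 1 enables 2W, bit 2 enables 4W,
    so the total width is simply code * W.
    """
    if not 0 <= code <= 0b111:
        raise ValueError("code must fit in 3 bits")
    total = 0.0
    for bit in range(3):
        if code & (1 << bit):
            total += (1 << bit) * unit_width
    return total


# b[2:0] = 101 enables the 4W and W legs.
print(keeper_width(0b101))
```

Under this assumption a low-leakage die can run with a small code (weaker keeper, faster evaluation), while a leaky die needs a larger code for robustness.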
14
On-Die Leakage Sensor
C. Kim et al., VLSI Circuits Symp. '04
[Figure: sensor layout, 83 μm × 73 μm; blocks: current reference, comparators, current mirrors, VBIAS gen., NMOS device, test interface]
High leakage sensing gain
Compact analog design sharing bias generators
Resolution: 7 levels
VDD: 1.2V
Dimensions: 83 × 73 μm²
Power consumption: 0.66 mW @ 80°C
Technology: 90nm dual Vt CMOS
15
Leakage Binning Results
[Figure: leakage binning results; x-axis: output codes from leakage sensor (001, 010, 011, 100, 101, 110, 111)]
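Binning into the seven output codes amounts to quantizing the measured leakage against six thresholds. The sketch below assumes evenly spaced thresholds on a normalized leakage scale; the actual comparator reference levels are not given in the slides, and the names are hypothetical.

```python
import bisect
from typing import Sequence


def leakage_code(leakage: float, thresholds: Sequence[float]) -> str:
    """Quantize a leakage reading into one of 7 codes, "001" to "111".

    thresholds must hold 6 ascending values; each threshold crossed
    moves the reading up one bin.
    """
    if len(thresholds) != 6 or list(thresholds) != sorted(thresholds):
        raise ValueError("need 6 ascending thresholds for 7 levels")
    level = 1 + bisect.bisect_right(thresholds, leakage)
    return format(level, "03b")


# Assumption: illustrative thresholds on a normalized leakage axis.
THRESHOLDS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(leakage_code(0.5, THRESHOLDS), leakage_code(6.5, THRESHOLDS))
```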
16
Test Process for Self-Calibrating Design
[Figure: test flow with stages Fab, Wafer test, Assembly, Burn in, Package test, Customer; annotations: process detection, leakage measurement, on-die leakage sensor, program using fuses]
18
DIVA: On-Line Checking and Correction for Microprocessor
All core function is validated by checker:
Simple checker detects & corrects faulty results, restarts core
Validates: control, computation, communication, and forward progress
Checker relaxes burden of correctness on core processor:
Tolerates core design errors, electrical faults, defects, and failures
Core only targets high-accuracy prediction; checker alone is 15x slower
Core does the heavy lifting, removes hazards that could slow the simple checker
Source: Todd Austin, Univ. of Michigan
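The check-then-commit idea can be shown with a toy model. This is only a sketch of the principle, not the DIVA microarchitecture: the instruction format, the three operations, and the function names are all assumptions, and pipeline restart is reduced to a counter.

```python
import operator
from typing import Callable, List, Tuple

Instr = Tuple[str, int, int]
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def checker_execute(instr: Instr) -> int:
    """Simple, trusted re-execution of one instruction."""
    op, a, b = instr
    return OPS[op](a, b)


def run_with_checker(program: List[Instr],
                     core_execute: Callable[[Instr], int]) -> Tuple[List[int], int]:
    """Commit only checker-verified results; count core restarts."""
    committed, restarts = [], 0
    for instr in program:
        predicted = core_execute(instr)   # fast, possibly faulty core
        verified = checker_execute(instr)
        if predicted != verified:
            restarts += 1                 # model: flush and restart the core
        committed.append(verified)        # the checker's value is what commits
    return committed, restarts


def faulty_core(instr: Instr) -> int:
    """Hypothetical core with a bug in its multiplier."""
    op, a, b = instr
    return a * b + 1 if op == "mul" else OPS[op](a, b)


print(run_with_checker([("add", 1, 2), ("mul", 3, 4)], faulty_core))
```

The committed results are correct even though the core's multiplier is wrong; the cost is one restart per detected mismatch, which is why the approach assumes a low failure rate.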
19
DIVA - Case Study
Performance impacts minimal:
Without faults, less than ½% slowdown for broad array of applications
At 1 fault/microsecond on a 1GHz processor, only 1% slowdown
Area requirements modest:
Alpha ISA checker less than 6% area of Alpha 21264 processor
Checker lends itself to formal verification
Simple extensions provide excellent SER coverage
[Figure: die area comparison; Alpha 21264, 205 mm2 (in 0.25um), vs. REMORA checker, 12 mm2 (in 0.25um), with 4k data cache, 1/2k inst cache, pipeline and BIST]
Source: Todd Austin, Univ. of Michigan
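The quoted figures can be cross-checked with a little arithmetic. The area ratio and fault interval follow directly from the slide; the per-fault recovery cost is a back-of-envelope inference that assumes the entire 1% slowdown is spent on recovery, which the slide does not claim.

```python
# Figures taken from the slide.
CHECKER_AREA_MM2 = 12
CORE_AREA_MM2 = 205
CLOCK_HZ = 1_000_000_000        # 1 GHz processor
FAULTS_PER_SECOND = 1_000_000   # 1 fault per microsecond
SLOWDOWN = 0.01                 # 1% slowdown at that fault rate

area_ratio = CHECKER_AREA_MM2 / CORE_AREA_MM2
cycles_between_faults = CLOCK_HZ // FAULTS_PER_SECOND

# Assumption: all of the slowdown is attributed to fault recovery,
# so this is an upper bound on the average cycles lost per fault.
implied_cycles_per_fault = SLOWDOWN * cycles_between_faults

print(f"checker/core area: {area_ratio:.1%}")
print(f"cycles between faults: {cycles_between_faults}")
print(f"implied recovery cost: about {implied_cycles_per_fault:.0f} cycles per fault")
```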
20
Razor: Timing Error Detection & Correction
Double-sampling metastability tolerant latches detect timing errors
Second sample is correct-by-design
Micro-architectural support restores state
Timing errors treated like branch mis-predictions
Source: Austin & Blaauw, Michigan
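Double sampling can be modeled behaviorally. This sketch assumes a main flip-flop that samples at the clock edge and a shadow latch that samples a fixed delay later, with the delay chosen so the shadow sample always sees settled data; metastability is ignored, and the names and timing values are hypothetical.

```python
from typing import Tuple


def razor_sample(old: int, new: int, arrival: float,
                 t_clk: float, shadow_delay: float) -> Tuple[int, bool]:
    """Return (value used, timing_error) for one Razor-style register.

    old/new are the data values before and after the transition that
    arrives at time `arrival`.
    """
    main = new if arrival <= t_clk else old            # may miss late data
    shadow = new if arrival <= t_clk + shadow_delay else old
    error = main != shadow                              # comparator flags mismatch
    # On error the shadow value is restored, as after a mis-prediction.
    return (shadow if error else main), error


# Data arrives 0.1 after the clock edge: the main sample is stale.
print(razor_sample(old=0, new=1, arrival=1.1, t_clk=1.0, shadow_delay=0.3))
```

If the late transition does not change the value (old equals new), no error is raised, since the stale sample happens to be correct.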
22
Self-Test and Self-Tuning of High-Speed IO
Jitter in DLL/PLL leads to:
Jitter in transmitted data
Uncertainty in RX's sampling edges
Mismatch & variations in DLL/PLL lead to high BER/yield loss
[Figure: high-speed IO link; labels: TX, RX, DLL, Ref. CLK, parallel data, clock recovery, FF, recovered data]
External measurement of DLL is infeasible:
Multiple and matched access points are required for each delay stage
23
More Examples…
On-chip thermal sensing → Cooling adjustment
On-chip delay sensing → Performance tuning
On-chip leakage sensing → Leakage control
…
24
Self-Tuning in the Factory Is Happening
Hardware support for syndrome collection
Hardware support for self-tuning/self-reconfiguration
In need of design methodology for exploring design trade-offs
[Figure: component with self-test & self-tuning/-configuration capability; labels: tuning knobs, test results/syndrome]
25
Supporting Self-Tuning/-Configuration in the Field
Allow shipping of defective parts!
Require on-line testing/checking capability
Require "self-diagnosis" capability
Diagnosis could be conducted on remote server
[Figure: system containing a component with on-line-test & self-tuning/-configuration capability, connected through a network to a diagnosis server w/ database; labels: tuning knobs, test results/syndrome]
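The remote-diagnosis loop can be sketched as a lookup from syndrome to knob settings. Everything here is hypothetical: the class names, the string syndromes, and the knob names are invented for illustration, and a real system would carry the exchange over a network rather than a direct call.

```python
from typing import Dict, Optional

KnobSettings = Dict[str, int]


class DiagnosisServer:
    """Stand-in for a remote server holding a syndrome database."""

    def __init__(self, database: Dict[str, KnobSettings]) -> None:
        self._database = database

    def diagnose(self, syndrome: str) -> Optional[KnobSettings]:
        return self._database.get(syndrome)


class Component:
    """Component exposing tuning knobs that can be rewritten in the field."""

    def __init__(self) -> None:
        self.knobs: KnobSettings = {}

    def apply(self, settings: KnobSettings) -> None:
        self.knobs.update(settings)


def field_tune(component: Component, server: DiagnosisServer, syndrome: str) -> bool:
    """Send a syndrome for diagnosis and apply any returned settings."""
    settings = server.diagnose(syndrome)
    if settings is None:
        return False  # unknown syndrome: nothing applied
    component.apply(settings)
    return True


server = DiagnosisServer({"cache_row_3_fail": {"disable_row": 3}})
part = Component()
print(field_tune(part, server, "cache_row_3_fail"), part.knobs)
```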
26
Design Challenges
“Self-diagnosis” to support reconfiguration
Low-cost on-line checking, self-repair and self-tuning schemes and design methodologies
Exploration of redundancy and reconfiguration tradeoffs (power, area, performance, reliability)
27
Summary
On-line testing is promising for detecting soft errors, latency failures and marginality failures
Need automatic diagnosis solutions after errors are detected by on-line checker
Need low-cost and low-power recovery or re-configuration schemes
Post-silicon tuning/calibration/reconfiguration is becoming promising, and necessary, for Si nano systems