process variations & process- adaptive design for the nano ... · process variations &...
TRANSCRIPT
Kaushik RoyElectrical & Computer Engineering
Purdue University
Process Variations & Process-Adaptive Design for the Nano-
meter Regime
Exponential Increase in Leakage
Non-Silicon Technology
Silicon Micro- electronics
Silicon Nano- electronics
1 µm 100 nm 10 nm
1970 1980 2000 2010 2020
5 µm
Junctionleakage
GateSource
n+n+
Bulk
Drain
Subthreshold Leakage
Gate Leakage
A. Grove, IEDM 2002
Technology (μ)
Leak
age
Pow
er (%
of T
otal
)
0%
10%
20%
30%
40%
50%
1.5 0.7 0.35 0.18 0.09 0.05
Must stop at 50%
610ON
OFF
II
=310ON
OFF
II
= 2~6~ 10ON
OFF
II
Technology Trend
Buried Oxide (BOX)Substrate
Fully-depleted body
Gate
VG
VS VD
DrainSource
Vback
Buried Oxide (BOX)Substrate
Fully-depleted body
Gate
VG
VS VD
DrainSource
Vback
Bulk-CMOS
FD/SOI
Nano devices Carbon nanotubeIII-V devices nano-wiresSpintronics
Single gate device
20032009
2020
DGMOS
FinFET Trigate
Multi-gate devices
Buried Oxide (BOX)
Substrate
Source Floating Body Drain
GateVS
VG
VD
Buried Oxide (BOX)
Substrate
Source Floating Body Drain
GateVS
VG
VD
PD/SOI
Design methods to exploit the advantages of technology innovations
Variation in Process Parameters
Inter and Intra-die Variations
Device parameters are no longer deterministic
Device 1 Device 2
Channel length
130nm
30 %
5X0.90 .9
1 .01 .0
1 .11 .1
1 .21 .2
1 .31 .3
1 .41 .4
11 22 33 44 55N orm alized Leakage (N orm alized Leakage ( IsbIsb ))
Nor
mal
ized
Fre
quen
cyN
orm
aliz
ed F
requ
ency
Delay and Leakage Spread
Source: Intel
10
100
1000
10000
1000 500 250 130 65 32Technology Node (nm)
# do
pant
ato
ms Source: Intel
Random dopant fluctuation
ReliabilityTemporal degradation of performance -- NBTI
Tech. generation
Failu
re p
roba
bilit
y
Time
Defects Life time degradation
Pessimistic Design Hurts Performance
worst-case corner
(150nm CMOS Measurements, 110°C)
0
50
100
150
200
Normalized IOFF
Num
ber o
f die
s
0 1 2 3 4 5 7
nominal corner
Substantial variation in leakage across dies4X variation between nominal and worst-case leakagePerformance determined at nominal leakageRobustness determined at worst-case leakage
7
Global and Local Variations
− −= Δ +t t GLOBAL t LOCALV V Vδ δinter-die
GLOBALσ
−Δ t GLOBALV
intra-die
LOCALσ
−t LOCALVδ
Random Dopant Fluctuation
8
Process Tolerance: Memories
S. Mukhopadhaya, Mahmoodi, RoyVLSI Circuit Symposium 2006, JSSC 2006, TCAD
9Parametric Failures: Read
Read failure => Flipping of Cell Data while Reading
BL BR
WL
VL=‘1’
VR=‘0’VREAD
+Δ
-Δ
+Δ
-ΔNR
PR
NL
PLAXRAXL
VTRIPRDVR=VREAD
VL
WL
Volta
ge
Time ->
WL
VR
VLVo
ltage
Time ->( )TRIPRDREADRF VVPP >=
10
Parametric Failures in SRAM
BL BR
WL
NR
PR
NL
PL
AXRAXL ‘1’ ‘0’
High-Vt Low-Vt
Parametric failures– Read Failures– Write Failures– Access Failures– Hold Failures
Parametric failures can degrade SRAM yieldFaulty chips Working chips
Test & Repair using Redundancy
11
Process Variations in On-chip SRAM
Parametric failures →Yield degradation
σVt ≈ 30mv, using BPTM 45nm technology
Yield ≈ 33%
0
50
100
150
200
250
300
350
0 52 105
157
210
262
315
367
419
472
524
577
629
682
734
786
839
890
944
996
1049
Chi
p C
ount
Chi
p C
ount Fault statistics
Number of faulty cells (Number of faulty cells (NNFaultyFaulty--CellsCells))
Simulation of an 64KB Cache A. Agarwal, et. al, JSSC, 05
12
Inter-die Variation & Cell Failures
inter-die Vt shift (ΔVth-GLOBAL)
GLOBALσ
“1” “0”
High–Vt Corners− Access failure ↑− Write failure ↑
“1” “0”
Low–Vt Corners− Read failure ↑− Hold failure ↑
13
Inter-die Variation & Memory Failure
Memory failure probabilities are high when inter-die shift in process is high
LVT HVTNom. Vt
BPTM 70nm Devices
14
Self-Repairing SRAM Array
Reduce the dominant failures at different inter-die corners to increase width of low failure region
Region A Region B Region CRegion C
HVT Corner
Access & Write failures dominate
Reduce AF & WF
Region A LVT Corner
Read & Hold failures dominate
Reduce RF & HF
LVT HVTNom. Vt
15
Post-Silicon Repair: Proposed Approach
inter-dieGLOBALσ
intra-die
L O C A Lσ
intra-die
L O C A Lσ
Apply correction to the global variation to reduce number of failures due to local variations
16
Self-Repairing SRAM Array
Reduce the dominant failures at different inter-die corners to increase width of low failure region
Region A Region B Region C
LVT HVTNom. Vt
ZBBRBB FBB
17
How to identify the inter-die Vt corner under a large intra-die variation ?
Effect of inter-die variation can be masked by intra-die variation
Monitor circuit parameters, e.g. leakage current
BLBL BRBR
WLWL
VDDVDD
GNDGND
18
Leakage of entire SRAM array is a reliable indicator of the inter-die Vt corner
1
1NY X
ii Y X
Y XN
σ σμ μ=
= => =∑
Array Leakage Monitoring
• Adding a large number of random variables reduces the effect of intra-die variation
19
Self-Repair using Leakage Monitoring
SRAM ARRAY
VOUT
Entire array leakage is monitored to detect inter-die corner and proper body-bias
is selected
BypassSwitch
On-chipLeakageMonitor
CalibrateSignal
VDD
SRAMArray
Comparator
Bodybias Body-Bias
selection
VoutV
REF1 VREF2
FBB ZBB RBB
20
Yield Enhancement using Self-Repair
Self-Repairing SRAM using body-bias can significantly improve design yield
21
Test-Chip of Self-Repairing SRAMVCO
Technology : IBM 0.13 μm128KB SRAMDual-Vt Triple-well tech. Number of Trans: ~ 7 million Die size: 16mm2VLSI CKT Symp. 2006, ITC 2005
64 KB LVT
Array
16 KB block
VCOIs
olat
ed c
ell
Sensor + Ref. gen. BB gen
Simulation results for 1MB array designed in IBM 0.13μm
22
Continuous vs Quantized Body Bias
Quantized (3 Level: FBB, ZBB, RBB) body bias scheme is a cost effective solution with good
yield enhancement possibility
Process Tolerance: Register Files
Kim et. al. VLSI Circuit Symposium 2004
Process Compensating Dynamic Circuit Technology
clk
. . .RS0 RS7
D0 D7
RS1
D1
LBL0
LBL1
N0
Keeper upsizing degrades average performance
Conventional Static Keeper
Process Compensating Dynamic Circuit Technology
3-bit programmable keeper
clk
. . .RS0 RS7
D0 D7
RS1
D1
LBL0
LBL1
N0
b[2:0]
W 2W 4Ws s s
Opportunistic speedup via keeper downsizing
C. Kim et al. , VLSI Circuits Symp. ‘03
5X reduction in robustness failing dies
0
50
100
150
200
250
0.7 0.8 0.9 1.0 1.1 1.2Normalized DC robustness
Num
ber o
f die
s
ConventionalThis work
Noise floor
saved dies
Robustness Squeeze
10% opportunistic speedup
0
50
100
150
200
250
300
0.8 0.9 1.0 1.1 1.2Normalized delay
Num
ber o
f die
sConventionalThis work
PCD μ = 0.90Conv. μ = 1.00
μ : avg. delay
Delay Squeeze
On-Die Leakage Sensor For Measuring Process Variation
C. Kim et al. , VLSI Circuits Symp. ‘04
83μm
73μ
m
current reference
comparators
current m
irrors
VBIASgen.
NMOS device
test interface
High leakage sensing gain – 90nm dual-Vt, Vdd=1.2V, 7 level resolution, 0.66 mW @80Cº
Output codes from leakage sensor
001 010 011 100 101 110 111
Leakage Binning Results
Process detection
Self-Contained Process CompensationFab
Assembly
Wafer test
Burn inPackage testCustomer
Leakage measurement
On-die leakage sensor
Program PCD using fuses
Self-Repair: Architecture Level
Agarawal, Roy TVLSI 2005
Fault-Tolerant Cache Architecture
BIST detects the faulty blocksConfig Storage stores the fault information
Idea is to resize the cache to avoid faulty blocks during regular operation
Faulty
Mapping Issue
More than one INDEXare mapped to same
block
Include column address bits into
TAG bits
Resizing is transparent to processor → same memory address
Fault Tolerant Capability
0
50
100
150
200
250
300
350
0 105 210 315 419 524 629 734 839 944 1049
Chi
p C
ount
(Nch
ip)
Fault statistics
Chips saved by the proposed + redundancy (R=8, r=3)
Chips saved by ECC + redundancy ( R=16)
NFaulty-Cells
More number of saved chipsas compare to ECC ECC fails to save
any chips
Proposed architecture can handle more number of faulty cells than ECC, as high as 890 faulty cells
Saves more number of chips than ECC for a givenNFaulty-Cells
CPU Performance Loss
0.0
0.5
1.0
1.5
2.0
2.5
0 105 210 315 419 524 629 734 839
% C
PU P
erfo
rman
ce L
oss For a 64K cache
averaged over SPEC 2000 benchmarks
NFaulty-Cells
Increase in miss rate due to downsizing of cache
Average CPU performance loss over all SPEC 2000benchmarks for a cache with 890 faulty cells is ~ 2%
Logic: Process Tolerance
Sizing, Dual-Vt, Synthesis, etc.
• Mostly worst-case design methdology• Guard-banding with larger transistors/larger
library elements• Selective guard-banding to reduce delay in
longer paths– Transistor sizing– Dual-Vt
• Logic optimization and simultaneous sizing/dual-Vt
38
Logic: A New Paradigm for Low-Voltage, Variation Tolerant Circuit Synthesis Using
Critical Path Isolation (CRISTA)-- Better than worst-case design
Ghosh, Bhunia, Roy -- ICCAD 2006
39Vdd Scaling and Process Tolerance: Conventional Solutions
• Low power:– Reduce the supply voltage
• Error rate increases
– Dual-Vt/dual-VDD assignment• Number of critical paths increases
• Robustness:– Increase supply voltage
• Power dissipation increases
– Upsize the gates• Switching capacitance increases
Low power and robustness: conflicting requirements
Low power and robustness: conflicting requirements
40
• Post-Silicon technique for dynamic supply scaling and timing error detection/correction
• Error correction overhead is 1% for a 10% error rate
RAZOR: Dan Ernst et. al., MICRO 2003.
Razor Approach
D
CLK
DelayE
Q
Shadow Latch
Standard Latch
41
CRISTA: Basic Idea
CLK
VDD=1V
VDD<1VVDD<1V
evaluateevaluate based on prediction
non-critical path activationcritical path activation
• Important points:– Design time technique (as opposed to post-Si)– Scale down the supply while making delay failures
predictable– Avoid the failures by adaptive clock stretching– Ensure that critical paths are activated rarely
Tc 2Tc
42
path delay
Num
ber o
f pat
hs
predictable and restricted to a logic section having low activation probability
S2+S3
S1
Design ATc
CLK
S1S2 S3
Design BVDD=1V
Design BVDD<1V
Design Considerations for CRISTA
• Few predictable critical paths• Low activation probability of critical paths• Slack between critical and non-critical paths under variations
Design A: conventional design
Design B: proposed design
VDD=1V
43
Case Study: AdderP0 G1 P G1 P2 G2 P3 G3
Co,3Co,2Co,1Co,0Ci,0
P0 G1 P1 G1 P2 G2 P3 G3Co,3Co,2Co,1Co,0Ci,0 FAFA FAFA FA FA
Crit. path delay=260pslongest non-crit. path delay=165psP = 13uW (1-cycle)
VDD = 1V, TCLK = 260ps
Crit. path delay=330pslongest non-crit. path delay=260psP = 7.4uW (rare 2-cycles, decoder)
VDD = 0.8V, TCLK = 260ps
• Interesting features:
– Single critical path (activated by P0P1P2P3=1 & Ci,0=1)– Low activation probability of critical path
44% power saving by reducing voltage and, operating critical path at 2-cycle and other paths at 1-cycle
44% power saving by reducing voltage and, operating critical path at 2-cycle and other paths at 1-cycle
Can we apply same technique to any random logic?
44
Wallace Tree Multiplier
Vector Merging Adder
Full AddersHalf AdderVector Merging Adder
Partial Products
Stage 1
Stage 2
Stage 3
Final Product
Critical pathLongest off-critical path
• 29% power saving with ~4% area overhead
45
Simulation Results
ISO Yield = 92%
20
25
30
35
40
45
12 bits 16 bits 32 bits
% P
ower
sav
ings
567891011
% A
rea
over
head ISO Yield = 90%
0
5
10
15
20
25
12 bits 16 bits 32 bits
% P
ower
sav
ings
6
7
8
9
10
%A
rea
over
head
ISO Yield = 96%
26
27
28
29
8 bits 12 bits 16 bits
%Po
wer
sav
ings
2.5
3.5
4.5
5.5
6.5
7.5
%A
rea
over
head
02468
101214
6 bits 8 bits 10 bits 12 bits 20 bits
% T
hrou
ghpu
t pen
alty
3.73.83.944.14.24.34.4
% A
rea
Ove
rhea
d% Throughput penalty% Area Overhead
Ripple Carry Adder Carry Save Multiplier
Wallace Tree Multiplier Performance penalty (WTM)
46
1 1 1
1 2
1
'
2
'
1 1
( ,..., ,..., ) ( ,..., 1,..., ) ( ,..., 0,..., )
( ,..., 1,..., ); ( ,..., 0,..., )
n n ni i i
n ni
i i
i
i i
f x x x f x x x f x x xCF CF
CF f x x x CF f
x xx
x xx
x•
= • = + • == + •
= = = =
CF1(xi=1)
CF2(xi=0)
MU
X
f
inputs
f1
f2xi
Activation probability of cofactors can be reducedHow to choose Control Variable ?
Prob =50%
Prob =50%CF11
CF12
MU
X
xj
f1Prob =25%
Prob =25%
Random Logic: Shannon’s Expansion
47
Original Circuit
CF11
CF21
CF32
CF63
CF53
CF42
MU
X N
etw
ork
Inputs
PO
Inputs
Further Isolation and Slack Creation by Sizing
• Slack creation strategy – Lagrangian Relaxation based sizing (B.C. Paul et. al., DAC 2004) is
used– Non-critical paths are selectively made faster– Critical paths are slightly slowed down
48
0
20
40
60
80
cht sct pcle mux decod cm150a x2 alu2 count
% Im
p. in
pow
er
% imp in power with switching activity = 0.2 100% imp in power with switching activity = 0.5
MCNC benchmarks, 70nm Process
Simulation Results
• Average power saving = ~45%• Average area overhead = 18%• Avg performance penalty=5.9% (with 4 control variables) for signal
prob=0.5
0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
cht sct pcle mux decod cm150a x2 alu2 count
Are
a (x
103 )
um^2
Original design
Proposed design
MCNC benchmarks, 70nm Process
Power Area
49
Two-Stage Pipeline with Test Logic
Outputs
Com
para
tor
Pre-decoder
Pre-decoder
gclk
SFFs
SFFs
SFFs
Car
ry-L
ook-
ahea
d A
dder
freeze
Stalling Logic
●
LFSR
TM1TM2
CLK
● ●
VDDm
●●
4:1
mux
Outputs
Com
para
tor
SFFs
SFFs
SFFs
Car
ry-L
ook-
ahea
d A
dder
●
TM1● ●
VDDo ●●
2:1
Mux
●●
Low Power Robust Pipeline
Conventional Pipeline
fixed vectors
fixed vectors
●
TM1TM2
CLK
Clock generator
Test logic
Proposed pipeline
Test logic
Regular pipeline
GDS Layout
Power measurement of proposed pipeline
20% reduction in conventional pipeline
40% extra reduction using CRISTA
~40% power saving with ~13% performance penalty~40% power saving with ~13% performance penalty
CRISTA in Thermal Design
50
51
Temperature Aware Design using CRISTA*
S3
Secondary critical(it will run with two-cycles at VDDL2)
• Scale down the supply at elevated temperatures while making delay failures predictable
• Maintain rated frequency• Avoid the failures by adaptive clock stretching• Ensure that critical paths are activated rarely• Tradeoff between temperature and performance loss
path delay
# pa
ths
S2 S1
1 cycle bound
Critical (it will run with two-cycles at VDDL1 and VDDL2)
path delay
# pa
ths
S
S Ghosh, S Bhunia and K Roy, Low overhead circuit synthesis for temperature adaptation using dynamic voltage scheduling, DATE 2007.
52
Simulation Results
• Similar temperature reduction as conventional technique (i.e. supply scaling with half frequency)
• Avg performance penalty=5.9% (at VDDL1) and 15.2% (at VDDL2) compared to 50% in conventional technique
• Avg area overhead ~14%
MCNC benchmarks, 70nm devices, Nominal supply = 1V, Tamb = 100oC
99.5
100.5
101.5
102.5
cht sct pcle mux cmb alu2 cm150a decod
Tem
pera
ture
(o C)
Org(VDDH), freq=fOrg(VDDL) , freq=f/2CRISTA(VDDL1), freq=f(adaptive)CRISTA(VDDL2) , freq=f(adaptive)
53
Micro-Architectural Implementation
• Applied to execution stage (adders and multipliers) of in-order pipeline
• Can be an add-on to standard DVFS
Tem
p se
nsor
free
ze
ID EX MEM
●
VDDH
VDDL
0
1
TREF2
Sel
CLK
IF WB
●
●●
Pre-decoder
TREF1
54
Simulation Results
DVFS triggered
DVFS triggering avoided in CRISTA
• Avoids requirement of full-chip DVFS
SPEC CINT benchmarks (gcc)
0 0.5 1 1.5 2 2.5 3 3.565
70
75
80
Time (sec)
Max
. Tem
p (C
)
No DTM DVFS
0 0.5 1 1.5 2 2.5 3 3.565
70
75
80
Time (sec)
Max
. Tem
p (C
)
CRISTA
DVFS: Dynamic Supply and Frequency ScalingCRISTA : Occasional 2-Cycle Operation
VV f (GHz)f (GHz)
DVFSDVFS 0.90.9 1.51.5
CRISTACRISTA 0.850.85 33No No
DTMDTM1.21.2
33
55
Simulation Results
0
20
40
60
80
100
120
bzip2
crafty eo
n
gap
gcc gzip mcfpars
erperl
mktw
olfvo
rtex
vpr
T(de
gree
C)
No DTM DVFS CRISTA
• Average reduction in peak temperature ~6.6% with ~3.4% throughput loss and ~4.2% area overhead
Peak Temperature (TREF1=75oC, TREF2=80oC)
VDD Scaling, Process Variation, and Quality Trade-off: DCT
-- Quality knob for tuning
Banerjee, Karakonstantis, RoyDesign Automation and Test in Europe (DATE) 2007
Basic Idea• All computations are “not equally important” for determining outputs
• Identify important and unimportant computations based on output “sensitivity”
• Compute important computations with “higher priority”
• Delay errors due to variations/ Vdd scaling “affect only” non-important computations
• “Gradual degradation” in output with voltage scaling and process variations
DCT Based Image Compression Process
• DCT is used in current international image/video coding standards
- JPEG, MPEG, H.261, H.263
QuantizerFDCT
Source image X
Compressed
Image Data
Z = T• X • T '
Z EntropyEncoder
RoundT• Z • T '
Q
JPEG Encoder Block Diagram
V
8×8 blocks
512×512 image
1D DCT TransposeMemory
1D DCTX Y ZW
High energy components (important outputs 75% energy)Low energy components (less important outputs)
64
1 2
4
3 5
6 7
9
8
10
11
12
13
14
15
19
20
17
18
2816
27
29
4330
3126 42 44
3225 4541
24
54
4033 5546 53
2321 3934 5247 6156
3522 4838 5751 6260
3736 5049 5958 63
Energy Distribution of a 2D-DCT Output
Can important components be computed with higher priority ?
(c)(d)
1D- intermediate DCT outputsInput Block
Transpose Memory
(a) (b)
1D-DCT
1D-DCT
Faster
Slow
er C
ompu
tatio
n
Fast
er C
ompu
tatio
n
Slower
x0
x2
x1
x3
x4
x5
x6
x7
w0
w2
w1
w3
w4
w5
w6
w7
w8 w16 w24 w32 w40 w48 w56
y0
y1
y2
y3
y4
y5
y6
y7
Final DCT outputs
Slow
er C
ompu
tatio
n
Fast
er
Com
puta
tionz0
z1
z2
z3
z4
Computation
Computation
Design Methodology
T.xt
<< (3 adders delay)
(2 adders delay)
52( + )x x d•
4w1 6( + )x x d•
70( + )x x d•
3 4( + )x x d•0w
─
+
Path Delays for 1D-DCT outputs
(4 adders delay)
<< ─+
<<
(3 adders delay)
─
70( + )x x e•
3 4( + )x x e•
70( + )x x f•
1 6( + )x x e•
1 6( + )x x f•
3 4( + )x x f•
52( + )x x f•
3 4( + )x x f•
1 6( + )x x f•
52( + )x x e•
2w
6w
─+
+
++
─
─
(4 adders delay)
>>
1w
+─<<
>> +─
+─<<
<<+─
70( - )x x a•
3 4( - )x x f•
1 6( - )x x a•
1 6( - )x x f•52( - )x x e•
1 6( - )x x e•
70( - )x x f•
3 4( - )x x a•52( - )x x a•52( - )x x e•
52( - )x x f•
(3 adders delay)
7w
>>
3w+70( - )x x a•
(3 adders delay)
<<>>
3 4( - )x x e•
1 6( - )x x a•3 4( - )x x a•
5w
3 4( - )x x f•
52( - )x x f• <<
<<+
52( - )x x a•
70( - )x x e•
70( - )x x f•3 4( - )x x e•
52( - )x x f• +
+
+─
─
──
─
(4 adders delay)
w0
w1
w2
w3
w4
w5
w6
w7
Longer Delays
Important Computations
Paths Not Computed
Delay=D1
@ Vdd1
w0
w1
w2
w3
w4
w5
w6
w7
D2 >D1
@Vdd2
D1
@Vdd2
Proposed DCT under Vdd scalingProposed Design with high/low delay paths Scaled Vdd: Longer paths under Vdd scaling
Extreme Scaled Vdd: Shorter paths affected
Paths Not Computed
w0
w1
w2
w3
w4
w5
w6
w7
D4 >D1
@Vdd3
D3 > D1
@Vdd3
Vdd3 < Vdd2 < Vdd1(nominal)D1 @Vdd3
Only DC component
00.5
11.5
22.5
33.5
4
Pat
h1(w
0)
Pat
h2(w
1)
Pat
h3(w
2)
Pat
h4(w
3)
Pat
h5(w
4)
Pat
h6(w
5)
Pat
h7(w
6)
Pat
h8(w
7)
Computation Paths
Del
ay(n
s)Conventional DCT Proposed DCT
1D-DCT Path Delay Comparisons
Effect of Vdd Scaling
Conventional WTM DCT
1.0 V
0.9 V
0.8 V
FAILS
FAILSFAILS
FAILS
CSHM DCT (2 alphabet)
Proposed DCT 1.0V CSHM DCT
(2 alphabets)DCT
with WTM Proposed
DCT
Power (mW) 25.1 29.8 26Delay (ns) 3.2 3.64 3.57Area (um2) 80490 108738 90337PSNR (dB) 21.97 33.23 33.22
Proposed DCT Vdd=0.9V
Proposed DCTVdd=0.8V
Power (mW) 17.53(41.2%) 11.09(62.8%)PSNR (dB) 29 23.41
Proposed Architecture at Reduced Voltage
Different Architectures at Nominal Voltage
•Graceful degradation of proposed DCT architecture under Vdd scaling ( Vdd can be scaled to 0.75V)
• Conventional architectures fails
65
1. Discrete Cosine Transform (DCT)
• Graceful degradation in quality under Vdd scaling (Vdd can be scaled to from 1V to 0.8V) with ~60% improvement in power
• Conventional architectures fail
• Scale down supply voltage
• Design/algorithm such that failures can only occur in less critical section --minimal quality degradation
CRISTA for DSP SystemsCRISTA for DSP Systems
Conventional WTM DCT
1.0 V
0.9 V
0.8 V
FAILS
FAILSFAILS
FAILS
CSHM DCT (2 alphabet)
Proposed DCT
66
2. Finite Impulse Response (FIR)
less-critical coefficients
critical coefficients
For Other DSP SystemsFor Other DSP SystemsPower at Nominal/Scaled Voltage
020406080
100120
1 0.9 0.8Voltage(V)
Pow
er(m
W) Conv. LCCSE
33.6%
FAILS
FAILS
3. Color Interpolation
1.0 V
0.9 V
0.8 VFAILS
FAILS
Conv Proposed
Ri, j
Gi, j−1Gi+1, jGi, j+1
Gi−1, j
M1Vdd
G’1G’2
>>2
>>2
>>1 ――
M
Vdd
Ri, j+2Ri+2, jRi -2, j
Ri, j -2
G’3
• Bilinear component is critical and gradient component is less-critical
• Design architecture such that failures can only occur in gradient term
Temporal Degradation: BTI
Kang, Roy, et. al. – TCAD, DAC-07
Bias Temperature Instability
• PMOS specific Aging Effect• Generation of (+) traps• Reaction-Diffusion (RD) model*• Time exponent ~ 1/6
x
P+P+
GATE
OXIDE
DRAIN
VBODY = 0V
x
H
Si Si Si Si
HH
H
VGS < 0V
VS = 0VVD = 0V
n-sub
xx x HH H
H2
H2
H2
H2
H2
*M. A. Alam, IEDM’03
After NBTI degradation
( ) ( )10 6
2F
IT HR
k NN t D tk
=
ITT
OX
q NV
C⋅ Δ
Δ =
IDDQ based NBTI Characterization
• Test Circuit Fabricated• 1000 stage INV chain• DC Stress signal @Vin
• IDDQ measurement @GND
Technology CMOS 130nm
Die Size 20 (mm2)
I/O Pin 209
Tox 1.6 (nm)
VDD 1.2 (V)
Inverter Chain
Microphotograph Layout
Vin
VDD
IDDQ Measurement1000 stages
NBTI in Digital Circuits
VL
WL
BLBLB
VR
1
5
o1
o2
3
24
8
7
6
Logic Circuits• fMAX decreases↓• Timing failure with time
Memory Circuits• Static Noise Margin (SNM)↓• Read & Write Stability• Parametric Yield↓
Temporal VTh increase in PMOS affects critical performance factors of digital VLSI circuits
i1
i3
i2
IDDQ based Characterization Technique
Des
ign
phas
ePo
st-s
ilico
n p
hase
• Circuit-level NBTI Reliability Characterization
• IDDQ test is used• Expensive fMAX testing is avoided
(or minimized)• Accurate circuit level performance
degradation can be predicted• IC specific burn-in to qualify the
target produce• Efficient way of field monitoring:
dynamic local signature of produce usage
• Possible usage in other reliability sources; HCI
Initial Characterization•Compute Rleak, RfMAX•Compute KfMAX = Rleak / RfMAX
Reliability Extrapolation•For each IDDQ measurement sample, Rleak is computed•RfMAX = KfMAX X Rleak•Estimate fMAX degradation
Lifetime Projection•Project IDDQ using KfMAX
NBTI Characterization Report
Temp. Dependency
NBTI: Random Logic Circuits
• ISCAS’85 Benchmark Circuits, PTM 65nm • Gate delay: analytical delay model considering NBTI• Circuit delay: NBTI-aware Static Timing Analysis (STA)• Circuit fMAX time exponent n ~ 1/6
PTM 65nmISCAS’85
Benchmarks
3 years Lifetime
8% fMAX decrease
n~1/6
100
102
104
106
108
100
Time (s)f M
AX
redu
ctio
n (%
)
c2670c5315c1908c499c3540c74181
LogicCell fanin
Delay (ps)Δ (%)
t=0 3 years
INV 1 13.77 16.77 21.8
NAND 2 16.86 19.88 17.9
NAND 3 19.57 22.45 14.8
NOR 2 17.26 21.89 26.8
NOR 3 23.80 30.19 26.9
Delay Degrad. STD cells
NBTI: 6T SRAM Cell
104 106 108100
101
Time (s)
Deg
rada
tion
in S
NM
(%)
PTM 60nm, 125°CCR = 1.33PR = 0.67
1/6 trend line
Static Noise Margin (SNM)
SNM degrades by more than 10% in 3 years% SNM Degradation time exponent n ~ 1/6WM improves with time under NBTI
0.43 0.44 0.45 0.46 0.47 0.48 0.490
0.2
0.4
0.6
0.8
1
Write Margin (V)
CD
F
t=0t=108
t=107
t=106
WM improveswith time
Distribution in Write Margin
Design for Reliability under NBTI
580 600 620 640 660 680 700 720
550
600
650
700
750
800
Delay (ps)
Are
a (u
m)
INITTR-BasedCell-Based
Cell-based
Area SavingFrom TR-based
10ps c1908
Gate Sizing applied to guarantee lifetime functionality of design11.7% overhead for Cell-based sizing6.13% overhead for TR-based sizing
45% improvement in area overheadRuntime complexity for TR-based sizing is identical to that of Cell-based sizing
Simulation SetupSynthesized in PTM 65nm1/6 VTh degradation model125°C Stress temperature50% Signal Probability at PI’s
75
Memory Solution #1: Periodic CellMemory Solution #1: Periodic Cell--Flipping*Flipping*
• Balance the signal probability on both side of the cell SL = SR
• Still incurs close to 10% degradation in SNM
ML
WL
BLBLB
MR VL
VR
SL > SR NBTI↑ SL = SR NBTI↓
*S. Kumar et al., ISQED’06
76
Solution #2: Standby VSolution #2: Standby VDDDD ScalingScaling
• NBTI is a strong function of VDD• Lower VDD during standby mode MIN VDD• Effective solution with low design effort
1.0 0.9 0.8 0.7 0.60
2
4
6
8
10
12
14
16
% S
NM
Deg
rada
tion
VDDL (V)
99%50%10%1%
Activity1.0V Nominal VDDVDDL @stand-by mode
65nm PTM@125°C for 3 years
• Normal Mode VDD• Standby Mode VDDL
• Normal Mode NBTI• Standby Mode Min.
NBTI
77
Solution #3: Sensing & CorrectionSolution #3: Sensing & Correction
• Sense reliability degradation adaptive correction using circuit techniques
• *T. Kim et al., VLSI Circuit Symposium 2007• ** K. Kang et al., Design Automation Conf. 2007• Need to properly consider design overhead
RST
RORG
PhaseC
omparator
VDD Stress NBTI_OUT
ADCN7 N-Body Bias
Selection +−
ADCP7 P-Body Bias
Selection
+−
nbody
pbody
P/N Equalizer
VM
GN
GP
Reliability Sensor* Adaptive Body Bias Generator**
Conclusions
• Process Variation and Process Tolerance is becoming important
• There is a need to optimize designs considering power/performance/yield