Kaushik RoyElectrical & Computer Engineering
Purdue University
Process-Tolerant Low-Power Design for the Nano-meter
Regime
Exponential Increase in Leakage
Non-Silicon Technology
Silicon Micro- electronics
Silicon Nano- electronics
1 µm 100 nm 10 nm
1970 1980 2000 2010 2020
5 µm
Junctionleakage
GateSource
n+n+
Bulk
Drain
Subthreshold Leakage
Gate Leakage
A. Grove, IEDM 2002
Technology ( µµµµ)
Leak
age
Pow
er (
% o
f Tot
al)
0%
10%
20%
30%
40%
50%
1.5 0.7 0.35 0.18 0.09 0.05
Must stop at 50%
610ON
OFF
I
I====
310ON
OFF
I
I==== 2~6~ 10ON
OFF
I
I
Technology Trend
Buried Oxide (BOX)
Substrate
Fully-depleted body
Gate
VG
VS VD
DrainSource
Vback
Buried Oxide (BOX)
Substrate
Fully-depleted body
Gate
VG
VS VD
DrainSource
Vback
Bulk-CMOS
FD/SOI
Nano devices Carbon nanotubeIII-V devices nano-wiresSpintronics
Single gate device
20032009
2020
DGMOS
FinFET Trigate
Multi-gate devices
Buried Oxide (BOX)
Substrate
Source Floating Body Drain
GateVS
VG
VD
Buried Oxide (BOX)
Substrate
Source Floating Body Drain
GateVS
VG
VD
PD/SOI
Design methods to exploit the advantages of technology innovations
Variation in Process Parameters
Inter and Intra-die Variations
Device parameters are no longer deterministic
Device 1 Device 2
Channel length
130nm
30 %
5X
0.90 .9
1 .01 .0
1 .11 .1
1 .21 .2
1 .31 .3
1 .41 .4
11 22 33 44 55N orm alized Leakage (N o rm a lized Leakage ( IsbIsb ))
Nor
mal
ized
Fre
quen
cyN
orm
aliz
ed F
requ
ency
Delay and Leakage Spread
Source: Intel
10
100
1000
10000
1000 500 250 130 65 32Technology Node (nm)
# do
pant
ato
ms Source: Intel
Random dopant fluctuation
ReliabilityTemporal degradation of performance -- NBTI
Tech. generation
Fai
lure
pro
babi
lity
Time
Defects Life time degradation
Pessimistic Design Hurts Performance
worst-case corner
(150nm CMOS Measurements, 110 °°°°C)
0
50
100
150
200
Normalized I OFF
Num
ber
of d
ies
0 1 2 3 4 5 7
nominal corner
� Substantial variation in leakage across dies� 4X variation between nominal and worst-case leakage� Performance determined at nominal leakage� Robustness determined at worst-case leakage
7
Global and Local Variations
− −− −− −− −= ∆ += ∆ += ∆ += ∆ +t t GLOBAL t LOCALV V Vδ δδ δδ δδ δinter-die
GLOBALσσσσ
−−−−∆∆∆∆ t GLOBALV
intra-die
LOCALσσσσ
−−−−t LOCALVδδδδ
Random Dopant Fluctuation
8
Process Tolerance: Memories
S. Mukhopadhaya, Mahmoodi, RoyVLSI Circuit Symposium 2006, JSSC 2006, TCAD
9
Parametric Failures: Read Failure
Read failure => Flipping of Cell Data while Reading
BL BR
WL
VL=‘1’
VR=‘0’VREAD
+∆∆∆∆
-∆∆∆∆
+∆∆∆∆
-∆∆∆∆NR
PR
NL
PLAXRAXL
VTRIPRDVR=VREAD
VL
WL
Vol
tage
Time ->
WL
VR
VL
Vol
tage
Time ->( )TRIPRDREADRF VVPP >=
10
Parametric Failures in SRAM
BL BR
WL
NR
PR
NL
PL
AXRAXL‘1’ ‘0’
High-Vt Low-Vt
Parametric failures– Read Failures– Write Failures– Access Failures– Hold Failures
Parametric failures can degrade SRAM yieldFaulty chips Working chips
Test & Repair using Redundancy
11
Process Variations in On-chip SRAM
Parametric failures →→→→Yield degradation
σVt ≈ 30mv, using BPTM 45nm technology
Yield ≈ 33%
0000
50505050
100100100100
150150150150
200200200200
250250250250
300300300300
3503503503500 52 105
157
210
262
315
367
419
472
524
577
629
682
734
786
839
890
944
996
1049
Chi
p C
ount
Chi
p C
ount
Fault statisticsFault statisticsFault statisticsFault statistics
Number of faulty cells (Number of faulty cells ( NNFaultyFaulty --CellsCells ))
Simulation of an 64KB Cache
A. Agarwal, et. al, JSSC, 05
12
Inter-die Variation & Cell Failures
inter-die Vt shift ( ∆∆∆∆Vth-GLOBAL )
GLOBALσσσσ
“1” “0”
High–Vt Corners− Access failure ↑↑↑↑− Write failure ↑↑↑↑
“1” “0”
Low–Vt Corners− Read failure ↑↑↑↑− Hold failure ↑↑↑↑
13
Inter-die Variation & Memory Failure
Memory failure probabilities are high when inter-die shift in process is high
LVT HVTNom. Vt
BPTM 70nm Devices
14
Self-Repairing SRAM Array
Reduce the dominant failures at different inter-die corners to increase width of low failure region
Region A Region B Region CRegion C
HVT Corner
Access & Write failures dominate
Reduce AF & WF
Region A LVT Corner
Read & Hold failures dominate
Reduce RF & HF
LVT HVTNom. Vt
15
Post-Silicon Repair: Proposed Approach
inter-die
GLOBALσσσσ
intra-die
L O C A Lσσσσ
intra-die
L O C A Lσσσσ
Apply correction to the global variation to reduce number of failures due to local variations
16
Self-Repairing SRAM Array
Reduce the dominant failures at different inter-die corners to increase width of low failure region
Region A Region B Region C
LVT HVTNom. Vt
ZBB
RBB FBB
17
How to identify the inter-die Vt corner under a large intra-die variation ?
Effect of inter-die variation can be masked by intra-die variation
Monitor circuit parameters, e.g. leakage current
BLBL BRBR
WLWL
VDDVDD
GNDGND
18
Leakage of entire SRAM array is a reliable indicator of the inter-die Vt corner
1
1NY X
ii Y X
Y XN
σ σσ σσ σσ σµ µµ µµ µµ µ====
= => == => == => == => =∑∑∑∑
Array Leakage Monitoring
• Adding a large number of random variables reduces the effect of intra-die variation
19
Self-Repair using Leakage Monitoring
SRAM ARRAY
VOUT
Entire array leakage is monitored to detect inter-die corner and proper body-bias
is selected
BypassSwitch
On-chipLeakageMonitor
CalibrateSignal
VDD
SRAMArray
Comparator
Bodybias Body-Bias
selection
VoutV
REF1 VREF2
FBB ZBB RBB
20
Yield Enhancement using Self-Repair
Self-Repairing SRAM using body-bias can significantly improve design yield
21
Test-Chip of Self-Repairing SRAMVCO
Technology : IBM 0.13 µµµµm
128KB SRAMDual-Vt Triple-well tech. Number of Trans: ~ 7 million Die size: 16mm2VLSI CKT Symp. 2006, ITC 2005
64 KB LVT
Array
16 KB block
VCO
Isol
ated
cel
l
Sensor + Ref. gen. BB gen
Simulation results for 1MB array designed in IBM 0.13 µµµµm
22
Continuous vs Quantized Body Bias
Quantized (3 Level: FBB, ZBB, RBB) body bias scheme is a cost effective solution with good
yield enhancement possibility
Process Tolerance: Register Files
Kim et. al. VLSI Circuit Symposium 2004
Process Compensating Dynamic Circuit Technology
clk
. . .RS0 RS7
D0 D7
RS1
D1
LBL0
LBL1
N0
� Keeper upsizing degrades average performance
Conventional Static Keeper
Process Compensating Dynamic Circuit Technology
3-bit programmable keeper
clk
. . .RS0 RS7
D0 D7
RS1
D1
LBL0
LBL1
N0
b[2:0]
W 2W 4Ws s s
� Opportunistic speedup via keeper downsizing
C. Kim et al. , VLSI Circuits Symp. ‘03
� 5X reduction in robustness failing dies
0
50
100
150
200
250
0.7 0.8 0.9 1.0 1.1 1.2Normalized DC robustness
Num
ber
of d
ies
ConventionalThis work
Noise floor
saved dies
Robustness Squeeze
� 10% opportunistic speedup
0
50
100
150
200
250
300
0.8 0.9 1.0 1.1 1.2Normalized delay
Num
ber
of d
ies
ConventionalThis work
PCD μμμμ = 0.90Conv. μμμμ = 1.00
μμμμ : avg. delay
Delay Squeeze
On-Die Leakage Sensor For Measuring Process Variation
C. Kim et al. , VLSI Circuits Symp. ‘04
83μμμμm
73μμ μμ
m
current reference
comparators
current m
irrors
VBIASgen.
NMOS device
test interface
� High leakage sensing gain – 90nm dual-Vt, Vdd=1.2V, 7 level resolution, 0.66 mW @80Cº
Output codes from leakage sensor
001 010 011 100 101 110 111
Leakage Binning Results
Process detection
Self-Contained Process Compensation
Fab
Assembly
Wafer test
Burn inPackage testCustomer
Leakage measurement
On-die leakage sensor
Program PCD using fuses
Self-Repair: Architecture Level
Agarawal, Roy TVLSI 2005
Fault-Tolerant Cache Architecture
� BIST detects the faulty blocks� Config Storage stores the fault informationIdea is to resize the cache to avoid faulty blocks
during regular operation
Faulty
Mapping Issue
New TAG
INDEX
Off
OffTAG INDEX
Column Address
More than one INDEXare mapped to same
block
Include column address bits into
TAG bits
Resizing is transparent to processor → same memory address
Fault Tolerant Capability
0
50
100
150
200
250
300
350
0 105 210 315 419 524 629 734 839 944 1049
Chi
p C
ount
(N
chip
) Fault statistics
Chips saved by the proposed + redundancy (R=8, r=3)
Chips saved by ECC + redundancy ( R=16)
NFaulty-Cells
More number of saved chipsas compare to ECC ECC fails to save
any chips
� Proposed architecture can handle more number of faulty cells than ECC, as high as 890 faulty cells
� Saves more number of chips than ECC for a givenNFaulty-Cells
CPU Performance Loss
0.0
0.5
1.0
1.5
2.0
2.5
0 105 210 315 419 524 629 734 839
% C
PU
Per
form
ance
Los
s For a 64K cache averaged over SPEC
2000 benchmarks
NFaulty-Cells
� Increase in miss rate due to downsizing of cache
� Average CPU performance loss over all SPEC 2000benchmarks for a cache with 890 faulty cells is ~ 2 %
Logic: Process Tolerance
37
Logic: A New Paradigm for Low-Voltage, Variation Tolerant Circuit Synthesis Using
Critical Path Isolation (CRISTA)
Ghosh, Bhunia, Roy -- ICCAD 2006
38
• Post-Silicon technique for dynamic supply scaling and timing error detection/correction
• Error correction overhead is 1% for a 10% error rate
RAZOR: Dan Ernst et. al., MICRO 2003.
Razor Approach
D
CLK
DelayE
Q
Shadow Latch
Standard Latch
39
Vdd Scaling and Process Tolerance: Conventional Solutions
• Low power:
– Reduce the supply voltage• Error rate increases
– Dual-Vt/dual-VDD assignment• Number of critical paths increases
• Robustness:
– Increase supply voltage• Power dissipation increases
– Upsize the gates• Switching capacitance increases
Low power and robustness: conflicting requirements
Low power and robustness: conflicting requirements
40
CRISTA: Basic Idea
CLK
VDD=1V
VDD<1VVDD<1V
evaluateevaluate based on prediction
non-critical path activationcritical path activation
• Important points:
– Scale down the supply while making delay failures predictable
– Avoid the failures by adaptive clock stretching– Ensure that critical paths are activated rarely
Tc 2Tc
41
path delay
Num
ber
of p
aths
predictable and restricted to a logic section having low activation probability
S2+S3
S1
Design ATc
CLK
S1
S2 S3
Design B
VDD=1V
Design BVDD<1V
Design Considerations for CRISTA
• Few predictable critical paths• Low activation probability of critical paths• Slack between critical and non-critical paths under variations
Design A: conventional design
Design B: proposed design
VDD=1V
42
Case Study: AdderP0 G1 P G1 P2 G2 P3 G3
Co,3Co,2Co,1Co,0Ci,0
P0 G1 P1 G1 P2 G2 P3 G3
Co,3Co,2Co,1Co,0Ci,0 FAFA FAFA FA FA
Crit. path delay=260pslongest non-crit. path delay=165psP = 13uW (1-cycle)
VDD = 1V, TCLK = 260ps
Crit. path delay=330pslongest non-crit. path delay=260psP = 7.4uW (rare 2-cycles, decoder)
VDD = 0.8V, TCLK = 260ps
• Interesting features:
– Single critical path (activated by P0P1P2P3=1 & Ci,0=1)– Low activation probability of critical path
44% power saving by reducing voltage and, operating critical path at 2-cycle and other paths at 1-cycle
44% power saving by reducing voltage and, operating critical path at 2-cycle and other paths at 1-cycle
Can we apply same technique to any random logic?
43
Carry Select Adder
Ci0=
0
Cin= 0
Cout
Co0
Co1
A2
Long latency path (LLP)
Short latency path (SLP1)
Short latency path (SLP2)
A2A3A4A5A2A2A3A4A5
A2A2A3A4A5A2A2A3A4A5
Ci1=
1
M1M2M3M4M5M6M7M8M9M10
A i : i-bit Adder, M k: k-stage MUX
Stage 1 Stage 10 Stage 6
LATENCYPREDICTOR
BLOCK
• ~20% power saving with ~6% area overhead
44
Carry Save Multiplier
HAHA
FA
HA
HA
FAFA
FAFA FA
FA FA
Vector Merging Adder
Critical Path
Longest off-critical path
LATENCYPREDICTOR
BLOCK
• 25% power saving with ~5% area overhead
45
Wallace Tree Multiplier
Vector Merging Adder
Full Adders
Half Adder
Vector Merging Adder
Partial Products
Stage 1
Stage 2
Stage 3
Final Product
Critical pathLongest off-critical path
• 29% power saving with ~4% area overhead
46
Simulation Results
ISO Yield = 92%
20
25
30
35
40
45
12 bits 16 bits 32 bits
% P
ower
sav
ings
5
6
7
8
9
10
11
% A
rea
over
head
ISO Yield = 90%
0
5
10
15
20
25
12 bits 16 bits 32 bits
% P
ower
sav
ings
6
7
8
9
10
%A
rea
over
head
ISO Yield = 96%
26
27
28
29
8 bits 12 bits 16 bits
%P
ower
sav
ings
2.5
3.5
4.5
5.5
6.5
7.5%
Are
a ov
erhe
ad
0
2
4
6
8
10
12
14
6 bits 8 bits 10 bits 12 bits 20 bits
% T
hrou
ghpu
t pen
alty
3.7
3.8
3.9
4
4.1
4.2
4.3
4.4
% A
rea
Ove
rhea
d
% Throughput penalty% Area Overhead
Ripple Carry Adder Carry Save Multiplier
Wallace Tree Multiplier Performance penalty (WTM)
47
1 1 1
1 2
1
'
2
'
1 1
( ,..., ,..., ) ( ,..., 1,..., ) ( ,..., 0,..., )
( ,..., 1,..., ); ( ,..., 0,..., )
n n ni i i
n ni
i i
i
i i
f x x x f x x x f x x x
CF CF
CF f x x x CF f
x x
x
x x
x
x
•
= • = + • == + •
= = = =
CF1(x i=1)
CF2(x i=0)
MU
X
f
inputs
f1
f2x i
Activation probability of cofactors can be reducedHow to choose Control Variable ?
Prob =50%
Prob =50%CF11
CF12
MU
X
x j
f1Prob =25%
Prob =25%
Random Logic: Shannon’s Expansion
48
Original Circuit
CF11
CF21
CF32
CF63
CF53
CF42
MU
X N
etw
ork
Inputs
PO
Inputs
Further Isolation and Slack Creation by Sizing
• Slack creation strategy – Lagrangian Relaxation based sizing (B.C. Paul et. al., DAC 2004) is
used– Non-critical paths are selectively made faster– Critical paths are slightly slowed down
49
0
20
40
60
80
cht sct pcle mux decod cm150a x2 alu2 count
% Im
p. in
pow
er
% imp in power with switching activity = 0.2 100
% imp in power with switching activity = 0.5
MCNC benchmarks, 70nm Process
Simulation Results
• Average power saving = ~50%• Average area overhead = 18%• Avg performance penalty=5.9% (with 4 control variables) for signal
prob=0.5
0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
cht sct pcle mux decod cm150a x2 alu2 count
Are
a (x
103 )
um^2
Original design
Proposed design
MCNC benchmarks, 70nm Process
Power Area
50
Two-Stage Pipeline with Test Logic
Outputs
Com
para
tor
Pre-decoder
Pre-decoder
gclk
SF
Fs
SF
Fs
SF
Fs
Car
ry-L
ook-
ahea
d A
dder
freeze
Stalling Logic
●
LFSR
TM1TM2
CLK
● ●
VDDm
●●
4:1
mux
Outputs
Com
para
tor
SF
Fs
SF
Fs
SF
Fs
Car
ry-L
ook-
ahea
d A
dder
●
TM1
● ●VDDo ●●
2:1
Mux
●●
Low Power Robust Pipeline
Conventional Pipeline
fixed vectors
fixed vectors
●
TM1
TM2
CLK
Clock generator
Test logic
Proposed pipeline
Test logic
Regular pipeline
GDS Layout
Power measurement of proposed pipeline
20% reduction in conventional pipeline
40% extra reduction using CRISTA
~40% power saving with ~13% performance penalty~40% power saving with ~13% performance penalty
51
VDD Scaling, Process Variation, and Quality Trade-off: DCT
Banerjee, Karakonstantis, RoyDesign Automation and Test in Europe (DATE) 2007
52
Basic Idea• All computations are “ not equally important ” for determining outputs
• Identify important and unimportant computations based on output “ sensitivity ”
• Compute important computations with “ higher priority ”
• Delay errors due to variations/ Vdd scaling “ affect only ” non-important computations
• “ Gradual degradation ” in output with voltage scaling and process variations
53
DCT Based Image Compression Process
• DCT is used in current international image/video co ding standards
- JPEG, MPEG, H.261, H.263
QuantizerFDCT
Source image X
Compressed
Image Data
Z = T• X • T '
Z EntropyEncoder
RoundT• Z • T '
Q
JPEG Encoder Block Diagram
V
8××××8 blocks
512××××512 image
1D DCTTranspose
Memory1D DCTX Y ZW
54
High energy components (important outputs 75% energ y)Low energy components (less important outputs)
64
1 2
4
3 5
6 7
9
8
10
11
12
13
14
15
19
20
17
18
2816
27
29
4330
3126 42 44
3225 4541
24
54
4033 5546 53
2321 3934 5247 6156
3522 4838 5751 6260
3736 5049 5958 63
Energy Distribution of a 2D-DCT Output
Can important components be computed with higher priority ?
55(c)(d)
1D- intermediate DCT outputsInput Block
Transpose Memory
(a) (b)
1D-DCT
1D-DCT
Faster
Slo
wer
Com
puta
tion
Fas
ter
Com
puta
tion
Slower
x0
x2
x1
x3
x4
x5
x6
x7
w0
w2
w1
w3
w4
w5
w6
w7
w8 w16 w24 w32 w40 w48 w56
y0
y1
y2
y3
y4
y5
y6
y7
Final DCT outputs
Slo
wer
Com
puta
tion
Fas
ter
Com
puta
tionz0
z1
z2
z3
z4
Computation
Computation
Design Methodology
T.xt
56
<< (3 adders delay)
(2 adders delay)
52( + )x x d•
4w1 6( + )x x d•
70( + )x x d•
3 4( + )x x d•0w
─
+
Path Delays for 1D-DCT outputs
(4 adders delay)
<< ─+
<<
(3 adders delay)
─
70( + )x x e•
3 4( + )x x e•
70( + )x x f•
1 6( + )x x e•
1 6( + )x x f•
3 4( + )x x f•
52( + )x x f•
3 4( + )x x f•
1 6( + )x x f•
52( + )x x e•
2w
6w
─+
+
+
+
─
─
(4 adders delay)
>>
1w
+─<<
>> +─
+─<<
<<+─
70( - )x x a•
3 4( - )x x f•
1 6( - )x x a•
1 6( - )x x f•52( - )x x e•
1 6( - )x x e•
70( - )x x f•
3 4( - )x x a•52( - )x x a•52( - )x x e•
52( - )x x f•
(3 adders delay)
7w
>>
3w+70( - )x x a•
(3 adders delay)
<<>>
3 4( - )x x e•
1 6( - )x x a•3 4( - )x x a•
5w
3 4( - )x x f•
52( - )x x f• <<
<<+
52( - )x x a•
70( - )x x e•
70( - )x x f•3 4( - )x x e•
52( - )x x f• +
+
+─
─
─
─
─
(4 adders delay)
57
w0
w1
w2
w3
w4
w5
w6
w7
Longer Delays
Important Computations
Paths Not Computed
Delay=D1
@ Vdd1
w0
w1
w2
w3
w4
w5
w6
w7
D2 >D1
@Vdd2
D1
@Vdd2
Proposed DCT under Vdd scalingProposed Design with high/low delay paths Scaled Vdd: Longer paths under Vdd scaling
Extreme Scaled Vdd: Shorter paths affected
Paths Not Computed
w0
w1
w2
w3
w4
w5
w6
w7
D4 >D1
@Vdd3
D3 > D1
@Vdd3
Vdd3 < Vdd2 < Vdd1(nominal)D1 @Vdd3
Only DC component
58
0
0.5
1
1.5
2
2.5
3
3.5
4
Pat
h1(w
0)
Pat
h2(w
1)
Pat
h3(w
2)
Pat
h4(w
3)
Pat
h5(w
4)
Pat
h6(w
5)
Pat
h7(w
6)
Pat
h8(w
7)
Computation Paths
Del
ay(n
s)
Conventional DCT Proposed DCT
1D-DCT Path Delay Comparisons
59
Effect of Vdd Scaling
Conventional WTM DCT
1.0 V
0.9 V
0.8 V
FAILS
FAILSFAILS
FAILS
CSHM DCT (2 alphabet)
Proposed DCT
33.2233.2321.97PSNR (dB)
9033710873880490Area (um 2)
3.573.643.2Delay (ns)
2629.825.1Power (mW)
ProposedDCT
DCT with WTM
CSHM DCT(2 alphabets)
1.0V
23.4129PSNR (dB)
11.09(62.8%)17.53(41.2%)Power (mW)
Proposed DCTVdd=0.8VProposed DCT
Vdd=0.9V
Proposed Architecture at Reduced Voltage
Different Architectures at Nominal Voltage
• Graceful degradation of proposed DCT architecture under Vdd scaling ( Vdd can be scaled to 0.75V)
• Conventional architectures fails
Temporal Degradation: NBTI
Kang, Roy, et. al. – TCAD, DAC-07
Temporal Reliability Issues in CMOS Technology
• HCI – Hot Carrier Injection• NBTI – Negative Bias Temperature Instability
� Increase in V T of PMOS with time� The dominant reliability factors in
scaled tech.• TDDB, etc.
NBTI: Negative Bias Temperature Instability
• Interface trap (N IT) generation at the channel interface due to the Si-H bond breaking, when negative gate bias is applied
• With time , VT increases , subthreshold slope (S) increases, mobility degrades,
• Drive current (I DS) reduces and affect the PMOS speed• Overall reduces the lifetime of PMOS
Interface trap generation due to Si-H bond breaking
NBTI: Experimental Data
• PMOS VT degrades as a power of time due to NBTI• Fixed exponent of 1/6 matches the simulation data*
100
101
102
103
104
10-3
10-2
10-1
Stress time [s]
VT
h[V
]ExperimentalSimulation
~t0.17 trend line
* V. Huard and M. Denais, IRPS 2004
Power-law V T degradation Model
NBTI degrades in time of exponent 1/6
Distance into oxide
y
NH
⋅⋅⋅⋅HD t
NH(0)
0
Reaction rate:
Conservation of hydrogen:
(((( ))))012IT H HN N D t= ⋅= ⋅= ⋅= ⋅
(0)0[ ]F
RIT IT H
kk
N N N N⋅ ⋅⋅ ⋅⋅ ⋅⋅ ⋅− =− =− =− =
(((( )))) (((( )))) (((( ))))0
0
,HD t
IT HN t N y t dy⋅⋅⋅⋅
= ⋅= ⋅= ⋅= ⋅∫∫∫∫
(0)0 0[ ] R
ITF IT IT H
dNk N N k N N
dt≈≈≈≈= − −= − −= − −= − −
(((( )))) (((( ))))10 4
2F
IT HR
k NN t D t
k====IT
TOX
q NV
C
⋅ ∆⋅ ∆⋅ ∆⋅ ∆∆ =∆ =∆ =∆ =
Mobility degradation factor
• Mobility degradation due to NBTI is expressed in an additional V T shift, noted as m
• Overall temporal V T shift model is expressed as,
0 0.25
(1 )
oxE
Eox
TOX
mqχ E e t
VC
⋅⋅⋅⋅
∆ = +∆ = +∆ = +∆ = +
Impact of NBTI on circuit performance
Circuit Performance Degradation
• Performance (delay) degradation also follows the po wertrend with same 0.17 exponent
• In CMOS logic, only the rising (L2H) delay’s are affected
100
105
10-2
10-1
100
101
102
time (s)
% c
hang
e
VtInverter DelayROSC delay
109
Circuit Performance degradation cont.
• Delay degradation in ISCAS c432 • Activity factor (switching activity) does not affec t much on the
delay degradation� In reality, activity factor’s are balanced in the n ormal operations
100
102
104
106
108
10-2
10-1
100
101
102
Time (s)
% C
hang
e
Si = 1 (worst case)
Si < 1 (with activity)
Vth
Design method considering the NBTI degradation
NBTI-aware design method
• Over-design is required to guarantee a lifetime stability of the circuit
• LR sizing is used to optimize the circuit� Size the circuit considering the worst-case V T degradation over the
lifetime
Delay Constraint
Required lifetime of the design
Reduced lifetime due to NBTI degradation
time
max
ckt
. del
ay
NBTI-aware over-design
LR Sizing considering NBTI
* C. Chen et. al., TCAD 1999
1. Delay Constraint (DMAX)
2. Required Lifetime (TLife)
Calibrate switching activity’s (Si) in each node
Compute VT shift in each node
considering Si
LR sizing with delay constraint
DMAX
NBTI-aware Design
Lifetime ConstraintA new design constraint
Power-law V T model
Optimal sizing fromLagrangian Relaxation*
Guarantee a lifetime stabilityunder NBTI degradations
Simulation results
7.609.50115.1148c74L85
6.837.90131.9188c74283
9.6310.3577.292c74182
8.689.89194.6372c74181
7.869.00597.33638c3540
8.539.18513.51582c1908
8.069.203681816c499
7.328.90525590c432
Si < 1Si = 1
% delay degrad. (10 yrs)Nominal delay (ps)
No. ofTrans.
Circuit
* All benchmarks are synthesized in BPTM 70nm techn ology
1. Delay degradation in ISCAS85 benchmark circuits after 10 years
Simulation results cont.
5.85.8542.59120c74L85
10.010.066.71125c74283
11.211.331.180c74182
9.09.45111.1180c74181
3.313.441146.5500c3540
6.687.13489.67470c1908
6.717.82581.47340c499
13.614.8196.7385c432
Si < 1Si = 1
% Area overheadNominal area (um)
Nominal delay (ps)
Circuit
* All benchmarks are synthesized in BPTM 70nm techn ology
2. Area overhead in NBTI-aware sizing
Negative Bias Temperature Instability
• PMOS specific Aging Effect• Generation of (+) traps• Reaction-Diffusion (RD) model*• Time exponent ~ 1/6
x
P+P+
GATE
OXIDE
DRAIN
VBODY = 0V
x
H
Si Si Si Si
HH
H
VGS < 0V
VS = 0VVD = 0V
n-sub
xx x HH H
H2
H2
H2
H2
H2
*M. A. Alam, IEDM’03
After NBTI degradation
(((( )))) (((( ))))10 6
2F
IT HR
k NN t D t
k====
ITT
OX
q NV
C⋅ ∆⋅ ∆⋅ ∆⋅ ∆
∆ =∆ =∆ =∆ =
NBTI in Digital Circuits
VL
WL
BLBLB
VR
1
5
o1
o2
3
24
8
7
6
Logic Circuits• fMAX decreases ↓↓↓↓• Timing failure with time
Memory Circuits• Static Noise Margin (SNM) ↓↓↓↓• Read & Write Stability• Parametric Yield ↓↓↓↓
Temporal VTh increase in PMOS affects critical performance factors of digital VLSI circuits
i1
i3
i2
NBTI: Random Logic Circuits
• ISCAS’85 Benchmark Circuits, PTM 65nm • Gate delay: analytical delay model considering NBTI• Circuit delay: NBTI-aware Static Timing Analysis (STA)
• Circuit fMAX � time exponent n ~ 1/6
PTM 65nmISCAS’85
Benchmarks
3 years Lifetime
8% fMAX decrease
n~1/6
100
102
104
106
108
100
Time (s)f M
AX
redu
ctio
n (%
)
c2670c5315c1908c499c3540c74181
26.930.1923.803NOR
26.821.8917.262NOR
14.822.4519.573NAND
17.919.8816.862NAND
21.816.7713.771INV
3 yearst=0Δ (%)
Delay (ps)fanin
LogicCell
Delay Degrad. STD cells
NBTI: 6T SRAM Cell
104 106 108100
101
Time (s)
Deg
rada
tion
in S
NM
(%
)
PTM 60nm, 125 °°°°CCR = 1.33PR = 0.67
1/6 trend line
Static Noise Margin (SNM)
� SNM degrades by more than 10% in 3 years� % SNM Degradation ���� time exponent n ~ 1/6� WM improves with time under NBTI
0.43 0.44 0.45 0.46 0.47 0.48 0.490
0.2
0.4
0.6
0.8
1
Write Margin (V)
CD
F
t=0t=108
t=107
t=106
WM improveswith time
Distribution in Write Margin
Design for Reliability under NBTI
580 600 620 640 660 680 700 720
550
600
650
700
750
800
Delay (ps)
Are
a (u
m)
INITTR-BasedCell-Based
Cell-based
Area SavingFrom TR-based
10psc1908
� Gate Sizing applied to guarantee lifetime functiona lity of design
� 11.7% overhead for Cell-based sizing� 6.13% overhead for TR-based sizing
� 45% improvement in area overhead� Runtime complexity for TR-based sizing is identical to that of Cell-based sizing
� Simulation Setup� Synthesized in PTM 65nm� 1/6 VTh degradation model� 125°°°°C Stress temperature� 50% Signal Probability at PI’s
IDDQ based NBTI Characterization
• Test Circuit Fabricated• 1000 stage INV chain• DC Stress signal @Vin
• IDDQ measurement @GND
1.6 (nm)Tox
1.2 (V)VDD
209I/O Pin
20 (mm2)Die Size
CMOS 130nmTechnology
Inverter Chain
MicrophotographLayout
Vin
VDD
IDDQ Measurement1000 stages
Correlation between IDDQ & fMAX
• DM < 3ms, Temp=125°C, Vstress=1.7V• IDDQ degradation � n~1/6 during
• Clear signature of NBTI
• Correlation between IDDQ and fMAX can be used to predict circuit performance degradation under NBTI
MAX – MIN < 1.2%
10
100
101
% I D
DQ
(Vin
=0.0
) de
grad
atio
n
Time (s)100 101 102 103 104 105
-2
0
2
4
6
8
% I D
DQ
(Vin
=1.7
) de
grad
atio
nIDDQ (Vin=0.0)IDDQ (Vin=1.7)
Vstress = 1.7V @150°°°°C
n ~ 1/6
n > 1/6
S1
S2
Linear relationship
103
107
105
1075°°°°C, 65nm PTMISCAS Benchmarks
100100
101
102
fMAX reduction (%)
I DD
Qre
duct
ion
(%)
c2670c5315c1908c499c3540c74181
1/6( ) ( )
(0) (0)
( :constant)
leak MAXleak freq
leak M
freq leak
AX
I t f tR t R
I f
R K KR
∆ ∆= ∝
= ×
∝ =
% IDDQ decrease% fMAX increase n~1/6
IDDQ based Characterization Technique
Des
ign
phas
eP
ost-
silic
on p
hase
• Circuit-level NBTI Reliability Characterization
• IDDQ test is used• Expensive fMAX testing is avoided
(or minimized)• Accurate circuit level performance
degradation can be predicted
• IC specific burn-in to qualify the target produce
• Efficient way of field monitoring: dynamic local signature of produce usage
• Possible usage in other reliability sources; HCI
Initial Characterization•Compute Rleak, RfMAX
•Compute KfMAX = Rleak / RfMAX
Reliability Extrapolation•For each IDDQ measurement sample, Rleak is computed
•RfMAX = KfMAX X Rleak
•Estimate fMAX degradation
Lifetime Projection•Project IDDQ using KfMAX
NBTI Characterization Report
Temp. Dependency
82
Conclusions
�Process Variation and Process Tolerance is becoming important
�There is a need to optimize designs considering power/performance/yield