
Kaushik Roy
Electrical & Computer Engineering
Purdue University

Process-Tolerant Low-Power Design for the Nano-meter Regime

Exponential Increase in Leakage

[Figure: technology timeline, 1970 to 2020. Feature size scales from 5 µm through 1 µm and 100 nm to 10 nm, spanning silicon microelectronics, silicon nanoelectronics, and non-silicon technology.]

[Figure: bulk MOSFET cross-section (gate, source, drain, n+ regions, bulk) showing the leakage components: subthreshold leakage, gate leakage, and junction leakage.]

A. Grove, IEDM 2002

[Figure: leakage power (% of total, 0% to 50%) versus technology (µm): 1.5, 0.7, 0.35, 0.18, 0.09, 0.05. Annotation: must stop at 50%.]

[Figure annotations: ION/IOFF = 10^6; ION/IOFF = 10^3; ION/IOFF ~ 10^2 to 10^6.]

Technology Trend

• Single gate device: Bulk-CMOS, PD/SOI, FD/SOI
• Multi-gate devices: DGMOS, FinFET, Trigate
• Nano devices: carbon nanotube, III-V devices, nano-wires, spintronics
• Timeline markers: 2003, 2009, 2020

[Figure: FD/SOI cross-section: gate (VG), source (VS), drain (VD), fully-depleted body, buried oxide (BOX), substrate, Vback.]

[Figure: PD/SOI cross-section: gate (VG), source (VS), drain (VD), floating body, buried oxide (BOX), substrate.]

Design methods to exploit the advantages of technology innovations

Variation in Process Parameters

Inter and Intra-die Variations

Device parameters are no longer deterministic

[Figure: channel length differs between Device 1 and Device 2.]

Delay and Leakage Spread

[Figure: normalized frequency (0.9 to 1.4) versus normalized leakage Isb (1 to 5) at 130nm: 30% frequency spread, 5X leakage spread. Source: Intel.]

Random dopant fluctuation

[Figure: number of dopant atoms (10 to 10000, log scale) versus technology node (nm): 1000, 500, 250, 130, 65, 32. Source: Intel.]

Reliability: temporal degradation of performance (NBTI)

[Figure: failure probability versus time across technology generations, with regions marked "defects" and "lifetime degradation".]

Pessimistic Design Hurts Performance

[Figure: number of dies (0 to 200) versus normalized IOFF (0 to 7), with the nominal corner and the worst-case corner marked. 150nm CMOS measurements, 110 °C.]

• Substantial variation in leakage across dies
• 4X variation between nominal and worst-case leakage
• Performance determined at nominal leakage
• Robustness determined at worst-case leakage

Global and Local Variations

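$\delta V_t = \Delta V_{t,\mathrm{GLOBAL}} + \delta V_{t,\mathrm{LOCAL}}$

• Inter-die: global shift $\Delta V_{t,\mathrm{GLOBAL}}$ with spread $\sigma_{\mathrm{GLOBAL}}$
• Intra-die: local shift $\delta V_{t,\mathrm{LOCAL}}$ with spread $\sigma_{\mathrm{LOCAL}}$ (random dopant fluctuation)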

Process Tolerance: Memories

S. Mukhopadhyay, Mahmoodi, Roy: VLSI Circuit Symposium 2006, JSSC 2006, TCAD

Parametric Failures: Read Failure

Read failure => Flipping of Cell Data while Reading

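[Figure: 6T SRAM cell (PL, PR, NL, NR, AXL, AXR; bitlines BL and BR; wordline WL) storing VL = '1' and VR = '0', with Vt shifts of +∆ and -∆ marked on the transistors. During a read, VR rises to VREAD. Two waveform panels (voltage versus time for WL, VL, VR) show VR = VREAD relative to the trip point VTRIPRD.]

$P_{RF} = P(V_{READ} > V_{TRIPRD})$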

Parametric Failures in SRAM

[Figure: 6T SRAM cell (PL, PR, NL, NR, AXL, AXR; BL, BR, WL) storing '1' and '0', with high-Vt and low-Vt devices marked.]

Parametric failures
– Read failures
– Write failures
– Access failures
– Hold failures

Parametric failures can degrade SRAM yield.

[Figure: faulty chips versus working chips; test and repair using redundancy.]

Process Variations in On-chip SRAM

Parametric failures → yield degradation

σVt ≈ 30 mV, using BPTM 45nm technology

Yield ≈ 33%

[Figure: fault statistics. Chip count (0 to 350) versus number of faulty cells NFaulty-Cells (0 to 1049). Simulation of a 64KB cache.]

A. Agarwal et al., JSSC 2005

Inter-die Variation & Cell Failures

Inter-die Vt shift (∆Vth-GLOBAL), with spread σGLOBAL

• High-Vt corners
  – Access failure ↑
  – Write failure ↑
• Low-Vt corners
  – Read failure ↑
  – Hold failure ↑

Inter-die Variation & Memory Failure

Memory failure probabilities are high when inter-die shift in process is high

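[Figure: memory failure probability versus inter-die Vt shift, from the LVT corner through nominal Vt to the HVT corner. BPTM 70nm devices.]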

Self-Repairing SRAM Array

Reduce the dominant failures at different inter-die corners to increase width of low failure region

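[Figure: failure probability versus inter-die Vt shift (LVT, nominal Vt, HVT), divided into Region A, Region B, and Region C.]

• Region A (LVT corner): read and hold failures dominate → reduce RF and HF
• Region C (HVT corner): access and write failures dominate → reduce AF and WF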

Post-Silicon Repair: Proposed Approach

[Figure: inter-die distribution (σGLOBAL) with intra-die distributions (σLOCAL) at two different inter-die corners.]

Apply correction to the global variation to reduce the number of failures due to local variations

Self-Repairing SRAM Array

Reduce the dominant failures at different inter-die corners to increase width of low failure region

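[Figure: failure probability versus inter-die Vt shift (LVT, nominal Vt, HVT) with the applied body bias per region.]

• Region A (LVT corner): reverse body bias (RBB)
• Region B (nominal Vt): zero body bias (ZBB)
• Region C (HVT corner): forward body bias (FBB)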

How to identify the inter-die Vt corner under large intra-die variation?

Effect of inter-die variation can be masked by intra-die variation

Monitor circuit parameters, e.g. leakage current

[Figure: SRAM cell with BL, BR, WL, VDD, and GND.]

Leakage of entire SRAM array is a reliable indicator of the inter-die Vt corner

$Y = \sum_{i=1}^{N} X_i \;\Rightarrow\; \frac{\sigma_Y}{\mu_Y} = \frac{1}{\sqrt{N}}\,\frac{\sigma_X}{\mu_X}$

Array Leakage Monitoring

• Adding a large number of random variables reduces the effect of intra-die variation
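The averaging argument can be checked numerically. The sketch below is illustrative only: it assumes each cell's leakage is an independent lognormal random variable (the distribution, its parameters, and the function name are assumptions, not taken from the slides) and reports the relative spread σ/μ of the summed leakage of N cells.

```python
import random
import statistics


def relative_spread_of_sum(n_cells, n_trials=2000, sigma_ln=0.5, seed=0):
    """Return sigma/mu of the total leakage of n_cells independent cells.

    Each cell leakage is drawn from a lognormal distribution; this is an
    assumed stand-in for intra-die (local) variation.
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.lognormvariate(0.0, sigma_ln) for _ in range(n_cells))
        for _ in range(n_trials)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


if __name__ == "__main__":
    # The relative spread should shrink roughly as 1/sqrt(N).
    for n in (1, 16, 256):
        print(n, round(relative_spread_of_sum(n), 3))
```

With independent cells, going from 1 cell to 16 cells should cut the relative spread by about a factor of 4, which is why the whole-array leakage tracks the inter-die corner rather than individual cell variation.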

Self-Repair using Leakage Monitoring

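Entire array leakage is monitored to detect the inter-die corner, and the proper body bias is selected.

[Figure: block diagram. The SRAM array is supplied from VDD through a bypass switch and an on-chip leakage monitor controlled by a calibrate signal. The monitor output VOUT is compared against VREF1 and VREF2 by comparators, and the body-bias selection block applies FBB, ZBB, or RBB to the array.]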
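A minimal sketch of the three-level selection, assuming the monitor produces a value that grows with array leakage and that two reference levels split the range into three bins. The function name, the reference values, and the direction of the comparison are illustrative assumptions; the slides only show that the monitor output is compared against two references.

```python
def select_body_bias(array_leakage, low_ref, high_ref):
    """Pick a body bias from the measured array leakage.

    High leakage suggests a low-Vt die, so reverse body bias (RBB) is used.
    Low leakage suggests a high-Vt die, so forward body bias (FBB) is used.
    Otherwise the die is near nominal and zero body bias (ZBB) is kept.
    """
    if low_ref >= high_ref:
        raise ValueError("low_ref must be below high_ref")
    if array_leakage > high_ref:
        return "RBB"
    if array_leakage < low_ref:
        return "FBB"
    return "ZBB"


if __name__ == "__main__":
    # Hypothetical leakage readings in arbitrary units.
    for reading in (0.4, 2.0, 9.0):
        print(reading, select_body_bias(reading, low_ref=1.0, high_ref=5.0))
```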

Yield Enhancement using Self-Repair

Self-Repairing SRAM using body-bias can significantly improve design yield

Test-Chip of Self-Repairing SRAM

• Technology: IBM 0.13 µm
• 128KB SRAM
• Dual-Vt, triple-well technology
• Number of transistors: ~7 million
• Die size: 16 mm²
• VLSI Circuit Symposium 2006, ITC 2005

[Figure: die photo showing a 64 KB LVT array, a 16 KB block, VCO, isolated cell, sensor + reference generator, and body-bias generator.]

Simulation results for a 1MB array designed in IBM 0.13 µm

Continuous vs Quantized Body Bias

Quantized (3-level: FBB, ZBB, RBB) body bias scheme is a cost-effective solution with good yield enhancement possibility

Process Tolerance: Register Files

Kim et al., VLSI Circuit Symposium 2004

Process Compensating Dynamic Circuit Technology

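Conventional Static Keeper

[Figure: dynamic local bitlines LBL0 and LBL1 with clk precharge, read selects RS0 to RS7, data inputs D0 to D7, node N0, and a static keeper.]

• Keeper upsizing degrades average performance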

Process Compensating Dynamic Circuit Technology

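3-bit programmable keeper

[Figure: the same local bitline circuit (clk, RS0 to RS7, D0 to D7, LBL0, LBL1, N0) with a keeper built from three devices of width W, 2W, and 4W, enabled by b[2:0].]

• Opportunistic speedup via keeper downsizing

C. Kim et al., VLSI Circuits Symp. '03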
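The binary-weighted keeper can be modeled in a few lines. This is a sketch under stated assumptions: the keeper legs are taken to be W, 2W, and 4W enabled by bits 0, 1, and 2 of b[2:0], and the "required width" that a die needs for robustness is a hypothetical input that would come from the leakage measurement; neither function is from the cited work.

```python
def keeper_width(code, unit_width=1.0):
    """Effective keeper width for a 3-bit code b[2:0] with legs W, 2W, 4W."""
    if not 0 <= code <= 7:
        raise ValueError("code must fit in 3 bits")
    return sum(
        unit_width * (1 << bit) for bit in range(3) if (code >> bit) & 1
    )


def smallest_safe_code(required_width, unit_width=1.0):
    """Return the smallest code whose keeper meets the required width.

    Returns None when even the largest setting (7W) is too weak.
    """
    for code in range(8):
        if keeper_width(code, unit_width) >= required_width:
            return code
    return None


if __name__ == "__main__":
    # A low-leakage die needs a weaker keeper and gains speed;
    # a leaky die needs a stronger keeper to stay robust.
    print(smallest_safe_code(2.5))  # weaker keeper
    print(smallest_safe_code(6.2))  # stronger keeper
```

Choosing the smallest code that still meets the robustness requirement is what gives the opportunistic speedup on low-leakage dies.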

Robustness Squeeze

• 5X reduction in robustness failing dies

[Figure: number of dies (0 to 250) versus normalized DC robustness (0.7 to 1.2), conventional versus this work, with the noise floor and the saved dies marked.]

Delay Squeeze

• 10% opportunistic speedup

[Figure: number of dies (0 to 300) versus normalized delay (0.8 to 1.2), conventional versus this work. μ is the average delay: PCD μ = 0.90, conventional μ = 1.00.]

On-Die Leakage Sensor For Measuring Process Variation

C. Kim et al., VLSI Circuits Symp. '04

[Figure: sensor layout, 83 µm × 73 µm: current reference, comparators, current mirrors, VBIAS generator, NMOS device, test interface.]

• High leakage sensing gain – 90nm dual-Vt, Vdd = 1.2 V, 7-level resolution, 0.66 mW at 80 °C

Leakage Binning Results

[Figure: process detection. Output codes from the leakage sensor: 001, 010, 011, 100, 101, 110, 111.]

Self-Contained Process Compensation

[Figure: manufacturing flow with stages Fab, Wafer test, Assembly, Burn in, Package test, Customer.]

• Leakage measurement
• On-die leakage sensor
• Program PCD using fuses

Self-Repair: Architecture Level

Agarwal, Roy, TVLSI 2005

Fault-Tolerant Cache Architecture

• BIST detects the faulty blocks
• Config Storage stores the fault information
• Idea is to resize the cache to avoid faulty blocks during regular operation

Mapping Issue

[Figure: address fields TAG, INDEX, and Off, with the column address; the new TAG is wider than the original TAG.]

• More than one INDEX is mapped to the same block
• Include column address bits in the TAG bits
• Resizing is transparent to the processor → same memory address

Fault Tolerant Capability

[Figure: fault statistics. Chip count Nchip (0 to 350) versus NFaulty-Cells (0 to 1049), showing chips saved by the proposed architecture + redundancy (R=8, r=3) and chips saved by ECC + redundancy (R=16). Annotations: more chips saved compared to ECC; ECC fails to save any chips.]

• Proposed architecture can handle more faulty cells than ECC, as many as 890 faulty cells
• Saves more chips than ECC for a given NFaulty-Cells

CPU Performance Loss

[Figure: % CPU performance loss (0.0 to 2.5) versus NFaulty-Cells (0 to 839), for a 64K cache averaged over SPEC 2000 benchmarks.]

• Increase in miss rate due to downsizing of cache
• Average CPU performance loss over all SPEC 2000 benchmarks for a cache with 890 faulty cells is ~2%

Logic: Process Tolerance

Logic: A New Paradigm for Low-Voltage, Variation Tolerant Circuit Synthesis Using Critical Path Isolation (CRISTA)

Ghosh, Bhunia, Roy, ICCAD 2006

Razor Approach

• Post-silicon technique for dynamic supply scaling and timing error detection/correction
• Error correction overhead is 1% for a 10% error rate

[Figure: Razor latch pair: data input D, clock CLK, delayed clock, standard latch, shadow latch, error output E, output Q.]

RAZOR: Dan Ernst et al., MICRO 2003.

Vdd Scaling and Process Tolerance: Conventional Solutions

• Low power:
  – Reduce the supply voltage
    • Error rate increases
  – Dual-Vt/dual-VDD assignment
    • Number of critical paths increases
• Robustness:
  – Increase supply voltage
    • Power dissipation increases
  – Upsize the gates
    • Switching capacitance increases

Low power and robustness: conflicting requirements

CRISTA: Basic Idea

[Figure: timing diagram against CLK. At VDD = 1V, evaluation completes within one clock period Tc. At VDD < 1V, a non-critical path activation still evaluates within Tc, while a critical path activation is evaluated based on prediction and given 2Tc.]

• Important points:
  – Scale down the supply while making delay failures predictable
  – Avoid the failures by adaptive clock stretching
  – Ensure that critical paths are activated rarely

Design Considerations for CRISTA

• Few predictable critical paths
• Low activation probability of critical paths
• Slack between critical and non-critical paths under variations

[Figure: number of paths versus path delay, relative to the clock period Tc. Design A: conventional design at VDD = 1V. Design B: proposed design with path sets S1, S2, and S3, shown at VDD = 1V and at VDD < 1V. In Design B the critical paths are predictable and restricted to a logic section having low activation probability.]

Case Study: Adder

[Figure: 4-bit adder built from full adders (FA) with per-bit propagate and generate signals (P, G), carry-in Ci,0, and carry-outs Co,0 to Co,3, shown for both operating points.]

• VDD = 1V, TCLK = 260ps: critical path delay = 260ps, longest non-critical path delay = 165ps, P = 13uW (1-cycle)
• VDD = 0.8V, TCLK = 260ps: critical path delay = 330ps, longest non-critical path delay = 260ps, P = 7.4uW (rare 2-cycles, decoder)

• Interesting features:
  – Single critical path (activated by P0P1P2P3=1 & Ci,0=1)
  – Low activation probability of critical path

44% power saving by reducing voltage and operating the critical path at 2-cycle and other paths at 1-cycle

Can we apply the same technique to any random logic?
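How rarely the adder's critical path fires can be counted directly. The sketch below enumerates every input of a 4-bit adder and counts the cases where all propagate signals are 1 and the carry-in is 1. It assumes uniformly random, independent inputs and the XOR definition of propagate (Pi = ai XOR bi); both are assumptions for illustration, and real workloads may differ.

```python
from itertools import product


def critical_path_probability(n_bits=4):
    """Fraction of input vectors with P0..P(n-1) all 1 and carry-in 1."""
    hits = 0
    total = 0
    for a_bits in product((0, 1), repeat=n_bits):
        for b_bits in product((0, 1), repeat=n_bits):
            for carry_in in (0, 1):
                total += 1
                all_propagate = all(a ^ b for a, b in zip(a_bits, b_bits))
                if all_propagate and carry_in == 1:
                    hits += 1
    return hits / total


if __name__ == "__main__":
    p = critical_path_probability()
    # Critical activations take 2 cycles, all others take 1 cycle.
    average_cycles = 1 + p
    print(p, average_cycles)
```

Under these assumptions only 16 of the 512 input vectors take the two-cycle path, so the average cost per operation stays close to one cycle.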

Carry Select Adder

[Figure: carry select adder with 10 stages (Stage 1, Stage 6, and Stage 10 labeled). Ai: i-bit adder, Mk: k-stage MUX (M1 to M10). Adder pairs evaluate with Ci0 = 0 and Ci1 = 1, producing Co0 and Co1; Cin = 0; final carry Cout. The long latency path (LLP) and the short latency paths (SLP1, SLP2) are marked, and a latency predictor block is added.]

• ~20% power saving with ~6% area overhead

Carry Save Multiplier

[Figure: carry save multiplier array of half adders (HA) and full adders (FA) feeding a vector merging adder, with the critical path, the longest off-critical path, and the latency predictor block marked.]

• 25% power saving with ~5% area overhead

Wallace Tree Multiplier


[Figure: Wallace tree multiplier. Partial products are reduced through Stage 1, Stage 2, and Stage 3 of full adders and half adders to the final product; the critical path and the longest off-critical path are marked.]

• 29% power saving with ~4% area overhead

Simulation Results

[Figure: four panels titled Ripple Carry Adder, Carry Save Multiplier, Wallace Tree Multiplier, and Performance penalty (WTM). The first three plot % power savings and % area overhead versus operand width, at iso-yield points of 92%, 90%, and 96%. The last plots % throughput penalty (0 to 14) and % area overhead (3.7 to 4.4) for 6, 8, 10, 12, and 20 bits.]

Random Logic: Shannon's Expansion

$f(x_1,\ldots,x_i,\ldots,x_n) = x_i \cdot f(x_1,\ldots,x_i{=}1,\ldots,x_n) + x_i' \cdot f(x_1,\ldots,x_i{=}0,\ldots,x_n) = x_i \cdot CF_1 + x_i' \cdot CF_2$

$CF_1 = f(x_1,\ldots,x_i{=}1,\ldots,x_n); \quad CF_2 = f(x_1,\ldots,x_i{=}0,\ldots,x_n)$

[Figure: f built as a MUX controlled by xi that selects between cofactor blocks CF1 (xi = 1, output f1) and CF2 (xi = 0, output f2), each with activation probability 50%. Expanding f1 again on xj gives CF11 and CF12, each with activation probability 25%.]

• Activation probability of cofactors can be reduced
• How to choose the control variable?
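Shannon's expansion can be exercised on a small Boolean function. The sketch below builds the two cofactors for a chosen control variable, recombines them through a MUX-style expression, and checks the result against the original function on every input. The example function and all helper names are made up for illustration; the activation probability assumes each control variable is 1 with probability 0.5.

```python
from itertools import product


def cofactor(f, index, value):
    """Return f with input number `index` fixed to `value` (0 or 1)."""
    def fixed(*bits):
        forced = list(bits)
        forced[index] = value
        return f(*forced)
    return fixed


def shannon(f, index):
    """Rebuild f as x*CF1 + x'*CF2 using input `index` as control variable."""
    cf1 = cofactor(f, index, 1)
    cf2 = cofactor(f, index, 0)

    def expanded(*bits):
        x = bits[index]
        return (x & cf1(*bits)) | ((1 - x) & cf2(*bits))
    return expanded


def activation_probability(n_control_variables):
    """Chance that one specific cofactor is selected, for unbiased controls."""
    return 0.5 ** n_control_variables


def example(a, b, c):
    """Arbitrary 3-input function used only for the check below."""
    return (a & b) | (b ^ c)


if __name__ == "__main__":
    rebuilt = shannon(example, 1)
    same = all(
        rebuilt(*bits) == example(*bits)
        for bits in product((0, 1), repeat=3)
    )
    print(same, activation_probability(1), activation_probability(2))
```

Each extra control variable halves the chance that a given cofactor block is active, which is the lever CRISTA uses to keep critical paths rarely exercised.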

Further Isolation and Slack Creation by Sizing

[Figure: the original circuit (inputs to primary output PO) is partitioned into cofactor blocks CF11, CF21, CF32, CF42, CF53, and CF63, whose outputs are combined by a MUX network.]

• Slack creation strategy
  – Lagrangian Relaxation based sizing (B.C. Paul et al., DAC 2004) is used
  – Non-critical paths are selectively made faster
  – Critical paths are slightly slowed down

Simulation Results

[Figure (Power): % improvement in power (0 to 100) for MCNC benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2, and count, with switching activity = 0.2 and switching activity = 0.5. 70nm process.]

[Figure (Area): area (×10³ um², 0 to 7.0) of the original design and the proposed design for the same MCNC benchmarks. 70nm process.]

• Average power saving = ~50%
• Average area overhead = 18%
• Average performance penalty = 5.9% (with 4 control variables) for signal probability = 0.5

Two-Stage Pipeline with Test Logic

[Figure: low power robust pipeline (proposed). Labels: LFSR, fixed vectors, test modes TM1 and TM2, CLK, 4:1 mux, pre-decoders, stalling logic with freeze signal, gated clock gclk, flip-flops (SFFs), carry-look-ahead adder, comparator, outputs, supply VDDm.]

[Figure: conventional pipeline. Labels: fixed vectors, test mode TM1, 2:1 mux, flip-flops (SFFs), carry-look-ahead adder, comparator, outputs, supply VDDo.]

[Figure: GDS layout showing the clock generator, test logic, proposed pipeline, and regular pipeline.]

[Figure: power measurement of the proposed pipeline. Annotations: 20% reduction in conventional pipeline; 40% extra reduction using CRISTA.]

~40% power saving with ~13% performance penalty

VDD Scaling, Process Variation, and Quality Trade-off: DCT

Banerjee, Karakonstantis, Roy: Design Automation and Test in Europe (DATE) 2007

Basic Idea

• All computations are "not equally important" for determining outputs
• Identify important and unimportant computations based on output "sensitivity"
• Compute important computations with "higher priority"
• Delay errors due to variations/Vdd scaling "affect only" non-important computations
• "Gradual degradation" in output with voltage scaling and process variations

DCT Based Image Compression Process

• DCT is used in current international image/video coding standards
  – JPEG, MPEG, H.261, H.263

JPEG Encoder Block Diagram

[Figure: source image X (512×512 image split into 8×8 blocks) → FDCT, computing Z = T · X · T' → quantizer (rounding with quantization table Q, output V) → entropy encoder → compressed image data.]

[Figure: 2D DCT as two passes: 1D DCT → transpose memory → 1D DCT, with signals X, W, Y, and Z.]
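The block transform Z = T · X · T' can be sketched in plain Python. The slides do not define T, so this example assumes the orthonormal 8-point DCT-II matrix; the helper names are illustrative. Applying T on the left and T' on the right corresponds to the two 1D DCT passes around the transpose memory.

```python
import math


def dct_matrix(n=8):
    """Orthonormal n-point DCT-II matrix (assumed form of T)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
            for j in range(n)
        ])
    return rows


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def dct2(block):
    """2D DCT of a square block: Z = T * X * T'."""
    t = dct_matrix(len(block))
    return matmul(matmul(t, block), transpose(t))


if __name__ == "__main__":
    flat_block = [[10.0] * 8 for _ in range(8)]
    z = dct2(flat_block)
    # A flat block has all of its energy in the DC coefficient z[0][0].
    print(round(z[0][0], 6), round(max(abs(v) for v in z[0][1:]), 6))
```

A flat block puts all of its energy into the DC term, which is the extreme case of the energy compaction the next slide relies on.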

Energy Distribution of a 2D-DCT Output

[Figure: 8×8 grid of the 64 2D-DCT outputs, numbered 1 to 64, split into high energy components (important outputs, 75% energy) and low energy components (less important outputs).]

Can important components be computed with higher priority?

Design Methodology

[Figure: four panels (a) to (d). Input block x0 to x7; first 1D-DCT (labeled T.xt) producing intermediate outputs w0 to w7, with columns w8, w16, w24, w32, w40, w48, w56; transpose memory feeding y0 to y7; second 1D-DCT producing the final DCT outputs z0 to z4 and beyond. In each pass some outputs are marked "faster computation" and the rest "slower computation".]

Path Delays for 1D-DCT outputs

[Figure: adder and shifter datapaths for the 1D-DCT outputs w0 to w7. Even outputs (w0, w2, w4, w6) combine the sums (x0 + x7), (x1 + x6), (x2 + x5), (x3 + x4) scaled by coefficients d, e, and f. Odd outputs (w1, w3, w5, w7) combine the differences (x0 - x7), (x1 - x6), (x2 - x5), (x3 - x4) scaled by coefficients a, e, and f. Marked path delays range from 2 to 4 adder delays.]

Proposed DCT under Vdd scaling

[Figure: three panels over outputs w0 to w7.
 Proposed design with high/low delay paths: important computations with delay = D1 at Vdd1; the other outputs have longer delays.
 Scaled Vdd (longer paths under Vdd scaling): some paths reach D2 > D1 at Vdd2 and are not computed; the important paths stay at D1 at Vdd2.
 Extreme scaled Vdd (shorter paths affected): paths reach D3 > D1 and D4 > D1 at Vdd3 and are not computed; only the DC component stays at D1 at Vdd3.]

Vdd3 < Vdd2 < Vdd1 (nominal)

1D-DCT Path Delay Comparisons

[Figure: delay (ns, 0 to 4) of computation paths Path1 (w0) through Path8 (w7), conventional DCT versus proposed DCT.]

Effect of Vdd Scaling

[Figure: output images of the conventional WTM DCT, the CSHM DCT (2 alphabets), and the proposed DCT at 1.0 V, 0.9 V, and 0.8 V. The conventional WTM DCT and the CSHM DCT are marked FAILS at 0.9 V and 0.8 V.]

Different Architectures at Nominal Voltage (1.0V)

              CSHM DCT (2 alphabets)   DCT with WTM   Proposed DCT
Power (mW)    25.1                     29.8           26
Delay (ns)    3.2                      3.64           3.57
Area (um²)    80490                    108738         90337
PSNR (dB)     21.97                    33.23          33.22

Proposed Architecture at Reduced Voltage

              Proposed DCT, Vdd=0.9V   Proposed DCT, Vdd=0.8V
Power (mW)    17.53 (41.2%)            11.09 (62.8%)
PSNR (dB)     29                       23.41

• Graceful degradation of the proposed DCT architecture under Vdd scaling (Vdd can be scaled to 0.75V)
• Conventional architectures fail
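The percentages quoted next to the reduced-voltage power numbers can be reproduced with a one-line formula. The sketch assumes the baseline is the 29.8 mW of the DCT with WTM at 1.0 V; the slides do not state the baseline explicitly, but that choice matches both quoted savings.

```python
def saving_percent(power_mw, baseline_mw):
    """Power saving relative to a baseline, in percent."""
    return 100.0 * (1.0 - power_mw / baseline_mw)


if __name__ == "__main__":
    baseline_mw = 29.8  # assumed baseline: DCT with WTM at 1.0 V
    for label, power_mw in (("Vdd=0.9V", 17.53), ("Vdd=0.8V", 11.09)):
        print(label, round(saving_percent(power_mw, baseline_mw), 1))
```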

Temporal Degradation: NBTI

Kang, Roy, et al., TCAD, DAC-07

Temporal Reliability Issues in CMOS Technology

• HCI – Hot Carrier Injection
• NBTI – Negative Bias Temperature Instability
  – Increase in VT of PMOS with time
  – The dominant reliability factor in scaled technologies
• TDDB, etc.

NBTI: Negative Bias Temperature Instability

• Interface trap (NIT) generation at the channel interface due to Si-H bond breaking when a negative gate bias is applied
• With time, VT increases, subthreshold slope (S) increases, and mobility degrades
• Drive current (IDS) reduces, which lowers the PMOS speed
• Overall, this reduces the lifetime of the PMOS

[Figure: interface trap generation due to Si-H bond breaking.]

NBTI: Experimental Data

• PMOS VT degrades as a power of time due to NBTI
• Fixed exponent of 1/6 matches the simulation data*

[Figure: VTh shift (V, 10^-3 to 10^-1) versus stress time (s, 10^0 to 10^4) on log-log axes: experimental data, simulation, and a ~t^0.17 trend line.]

* V. Huard and M. Denais, IRPS 2004
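A power-law exponent like the ~0.17 trend line is usually read off as the slope on log-log axes. The sketch below fits that slope by least squares. The data points are synthetic, generated from an assumed prefactor and an exponent of exactly 1/6; they are not the measurements from the cited work.

```python
import math


def fit_power_law(times, shifts):
    """Fit shift = prefactor * time**exponent by linear regression in log space."""
    xs = [math.log(t) for t in times]
    ys = [math.log(v) for v in shifts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return prefactor, exponent


if __name__ == "__main__":
    # Synthetic stress times (s) and VT shifts (V); illustrative only.
    times = [10 ** k for k in range(5)]
    shifts = [0.004 * t ** (1.0 / 6.0) for t in times]
    prefactor, exponent = fit_power_law(times, shifts)
    print(round(prefactor, 6), round(exponent, 4))
```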

Power-law VT degradation Model

NBTI degrades in time of exponent 1/6

[Figure: hydrogen concentration NH versus distance into the oxide y, falling from NH(0) at y = 0 to zero at y = √(DH·t).]

Reaction rate:

$\frac{dN_{IT}}{dt} = k_F\,[N_0 - N_{IT}] - k_R\,N_{IT}\,N_H^{(0)} \approx 0 \;\Rightarrow\; \frac{k_F}{k_R}\,[N_0 - N_{IT}] = N_{IT}\,N_H^{(0)}$

Conservation of hydrogen:

$N_{IT}(t) = \int_0^{\sqrt{D_H t}} N_H(y,t)\,dy = \frac{1}{2}\,N_H^{(0)}\,\sqrt{D_H t}$

$N_{IT}(t) = \sqrt{\frac{k_F N_0}{2 k_R}}\,(D_H t)^{1/4}$

$\Delta V_T = \frac{q\,\Delta N_{IT}}{C_{OX}}$
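The closed form for NIT(t) on this slide follows from combining the reaction-rate balance with the hydrogen conservation integral. A short worked derivation, assuming N_IT is much smaller than N_0 so that N_0 - N_IT ≈ N_0 (an assumption made here to reach the stated result):

```latex
\begin{aligned}
N_{IT}\,N_H^{(0)} &\approx \frac{k_F}{k_R}\,N_0,
\qquad
N_H^{(0)} = \frac{2\,N_{IT}}{\sqrt{D_H t}} \\
\Rightarrow\quad N_{IT}^2 &= \frac{k_F N_0}{2 k_R}\,\sqrt{D_H t} \\
\Rightarrow\quad N_{IT}(t) &= \sqrt{\frac{k_F N_0}{2 k_R}}\,(D_H t)^{1/4}
\end{aligned}
```

The time exponent comes entirely from how the diffusion front √(D_H t) enters the conservation integral, so a different diffusing species changes the exponent but not the structure of the argument.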

Mobility degradation factor

• Mobility degradation due to NBTI is expressed as an additional VT shift, denoted m
• Overall temporal VT shift model is expressed as

$\Delta V_T = (1 + m)\,\frac{q\,\chi\,E_{ox}\,e^{E_{ox}/E_0}\,t^{0.25}}{C_{OX}}$

Impact of NBTI on circuit performance

Circuit Performance Degradation

• Performance (delay) degradation also follows the power-law trend with the same 0.17 exponent
• In CMOS logic, only the rising (L2H) delays are affected

[Figure: % change (10^-2 to 10^2) versus time (s, 10^0 to 10^9) on log-log axes for Vt, inverter delay, and ROSC delay.]

Circuit Performance degradation cont.

• Delay degradation in ISCAS c432
• Activity factor (switching activity) has little effect on the delay degradation
  – In reality, activity factors are balanced in normal operation

[Figure: % change (10^-2 to 10^2) versus time (s, 10^0 to 10^8) on log-log axes for Vth and for delay with Si = 1 (worst case) and Si < 1 (with activity).]

Design method considering the NBTI degradation

NBTI-aware design method

• Over-design is required to guarantee lifetime stability of the circuit
• LR sizing is used to optimize the circuit
  – Size the circuit considering the worst-case VT degradation over the lifetime

[Figure: maximum circuit delay versus time against the delay constraint, showing the required lifetime of the design, the reduced lifetime due to NBTI degradation, and the NBTI-aware over-design.]

LR Sizing considering NBTI

[Flow chart: NBTI-aware design.]

Inputs:
1. Delay constraint (DMAX)
2. Required lifetime (TLife): the lifetime constraint, a new design constraint

Steps:
• Calibrate switching activities (Si) in each node
• Compute the VT shift in each node considering Si (power-law VT model)
• LR sizing with delay constraint DMAX (optimal sizing from Lagrangian Relaxation*)

Result: NBTI-aware design that guarantees lifetime stability under NBTI degradation

* C. Chen et al., TCAD 1999

Simulation results

1. Delay degradation in ISCAS85 benchmark circuits after 10 years

Circuit   No. of Trans.   Nominal delay (ps)   % delay degrad. (10 yrs)
                                               Si = 1     Si < 1
c432      590             525                  8.90       7.32
c499      1816            368                  9.20       8.06
c1908     1582            513.5                9.18       8.53
c3540     3638            597.3                9.00       7.86
c74181    372             194.6                9.89       8.68
c74182    92              77.2                 10.35      9.63
c74283    188             131.9                7.90       6.83
c74L85    148             115.1                9.50       7.60

* All benchmarks are synthesized in BPTM 70nm technology

Simulation results cont.

2. Area overhead in NBTI-aware sizing

Circuit   Nominal delay (ps)   Nominal area (um)   % Area overhead
                                                   Si = 1     Si < 1
c432      385                  196.7               14.8       13.6
c499      340                  581.47              7.82       6.71
c1908     470                  489.67              7.13       6.68
c3540     500                  1146.5              3.44       3.31
c74181    180                  111.1               9.45       9.0
c74182    80                   31.1                11.3       11.2
c74283    125                  66.71               10.0       10.0
c74L85    120                  42.59               5.85       5.8

* All benchmarks are synthesized in BPTM 70nm technology

Negative Bias Temperature Instability

• PMOS-specific aging effect
• Generation of (+) traps
• Reaction-Diffusion (RD) model*
• Time exponent ~ 1/6

[Figure: PMOS cross-section (p+ regions, gate, oxide, drain, n-substrate) under stress with VGS < 0V, VS = 0V, VD = 0V, VBODY = 0V. Si-H bonds at the interface break (marked x), and the released hydrogen diffuses into the oxide as H2.]

After NBTI degradation:

$N_{IT}(t) = \sqrt{\frac{k_F N_0}{2 k_R}}\,(D_H t)^{1/6}$

$\Delta V_T = \frac{q\,\Delta N_{IT}}{C_{OX}}$

*M. A. Alam, IEDM'03

NBTI in Digital Circuits

[Figure: a logic circuit (inputs i1 to i3, gates 1 to 8, outputs o1 and o2) and a 6T SRAM cell (WL, BL, BLB, storage nodes VL and VR).]

Logic circuits
• fMAX decreases ↓
• Timing failure with time

Memory circuits
• Static Noise Margin (SNM) ↓
• Read & write stability
• Parametric yield ↓

Temporal VTh increase in PMOS affects critical performance factors of digital VLSI circuits

NBTI: Random Logic Circuits

• ISCAS'85 benchmark circuits, PTM 65nm
• Gate delay: analytical delay model considering NBTI
• Circuit delay: NBTI-aware Static Timing Analysis (STA)
• Circuit fMAX: time exponent n ~ 1/6

[Figure: fMAX reduction (%) versus time (s, 10^0 to 10^8) for c2670, c5315, c1908, c499, c3540, and c74181 (PTM 65nm, ISCAS'85 benchmarks). Annotations: n ~ 1/6; 8% fMAX decrease at 3 years lifetime.]

Delay degradation of standard cells

Logic cell   Fan-in   Delay (ps)            Δ (%)
                      t=0       3 years
INV          1        13.77     16.77       21.8
NAND         2        16.86     19.88       17.9
NAND         3        19.57     22.45       14.8
NOR          2        17.26     21.89       26.8
NOR          3        23.80     30.19       26.9
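With a time exponent of n ~ 1/6, a degradation known at one point in time can be scaled to another. The sketch below is a back-of-the-envelope projection under the assumption that the pure power law holds over the whole interval; the function name and the 10-year target are illustrative, and only the "8% at 3 years" figure comes from the slide.

```python
def project_degradation(known_percent, known_time, target_time, exponent=1.0 / 6.0):
    """Scale a degradation figure along a t**exponent power law.

    Times may be in any unit as long as both use the same one.
    """
    if known_time <= 0 or target_time <= 0:
        raise ValueError("times must be positive")
    return known_percent * (target_time / known_time) ** exponent


if __name__ == "__main__":
    # 8% fMAX decrease at 3 years, projected to 10 years.
    print(round(project_degradation(8.0, 3.0, 10.0), 2))
```

Because the exponent is small, most of the degradation happens early: multiplying the stress time by 64 only doubles the projected loss.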

NBTI: 6T SRAM Cell

Static Noise Margin (SNM)

[Figure: degradation in SNM (%, 10^0 to 10^1) versus time (s, 10^4 to 10^8) with a 1/6 trend line. PTM 60nm, 125 °C, CR = 1.33, PR = 0.67.]

Distribution in Write Margin

[Figure: CDF (0 to 1) of write margin (V, 0.43 to 0.49) at t = 0, t = 10^6, t = 10^7, and t = 10^8. WM improves with time.]

• SNM degrades by more than 10% in 3 years
• % SNM degradation: time exponent n ~ 1/6
• WM improves with time under NBTI

Design for Reliability under NBTI

[Figure: area (um, 550 to 800) versus delay (ps, 580 to 720) for c1908: initial design (INIT), TR-based sizing, and cell-based sizing. Annotations: 10ps; area saving of TR-based over cell-based sizing.]

• Gate sizing applied to guarantee lifetime functionality of the design
  – 11.7% overhead for cell-based sizing
  – 6.13% overhead for TR-based sizing
• 45% improvement in area overhead
• Runtime complexity for TR-based sizing is identical to that of cell-based sizing

• Simulation setup
  – Synthesized in PTM 65nm
  – 1/6 VTh degradation model
  – 125 °C stress temperature
  – 50% signal probability at PIs

IDDQ based NBTI Characterization

• Test circuit fabricated
• 1000-stage INV chain
• DC stress signal at Vin
• IDDQ measurement at GND

Technology   CMOS 130nm
Die size     20 mm²
I/O pins     209
VDD          1.2 V
Tox          1.6 nm

[Figure: microphotograph and layout of the test chip; inverter chain of 1000 stages with input Vin, supply VDD, and IDDQ measurement.]

Correlation between IDDQ & fMAX

• DM < 3ms, Temp=125°C, Vstress=1.7V
• IDDQ degradation → n ~ 1/6 during
• Clear signature of NBTI
• Correlation between IDDQ and fMAX can be used to predict circuit performance degradation under NBTI

[Figure: % IDDQ degradation versus time (s, 10^0 to 10^5) for IDDQ (Vin=0.0) and IDDQ (Vin=1.7), Vstress = 1.7V at 150 °C. Segments S1 and S2 are marked, with slopes n ~ 1/6 and n > 1/6. MAX – MIN < 1.2%.]

[Figure: IDDQ reduction (%) versus fMAX reduction (%) on log-log axes for ISCAS benchmarks c2670, c5315, c1908, c499, c3540, and c74181 (65nm PTM), with time markers 10^3, 10^5, and 10^7. Annotations: linear relationship; n ~ 1/6.]

$R_{leak}(t) = \frac{\Delta I_{leak}(t)}{I_{leak}(0)} \propto t^{1/6}, \qquad R_{freq}(t) = \frac{\Delta f_{MAX}(t)}{f_{MAX}(0)} \propto t^{1/6}$

$R_{leak} = K \times R_{freq} \quad (K: \text{constant})$

IDDQ based Characterization Technique

• Circuit-level NBTI reliability characterization
  – IDDQ test is used
  – Expensive fMAX testing is avoided (or minimized)
  – Accurate circuit-level performance degradation can be predicted
• IC-specific burn-in to qualify the target product
• Efficient way of field monitoring: dynamic local signature of product usage
• Possible usage for other reliability sources: HCI

[Flow chart: design phase followed by post-silicon phase.]

Initial Characterization
• Compute Rleak, RfMAX
• Compute KfMAX = Rleak / RfMAX

Reliability Extrapolation
• For each IDDQ measurement sample, Rleak is computed
• RfMAX = Rleak / KfMAX
• Estimate fMAX degradation

Lifetime Projection
• Project IDDQ using KfMAX

NBTI Characterization Report
• Temp. dependency
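The characterization flow can be sketched as two small functions. This is an illustration under stated assumptions: KfMAX is taken as Rleak / RfMAX from the initial characterization step, so the fMAX degradation is recovered by dividing a new Rleak by KfMAX. All function names and the numeric readings are hypothetical.

```python
def relative_change(initial, current):
    """Magnitude of the relative change from the initial value."""
    return abs(current - initial) / initial


def characterize(iddq_initial, iddq_stressed, fmax_initial, fmax_stressed):
    """Initial characterization: return K_fMAX = R_leak / R_fMAX."""
    r_leak = relative_change(iddq_initial, iddq_stressed)
    r_fmax = relative_change(fmax_initial, fmax_stressed)
    return r_leak / r_fmax


def estimate_fmax_degradation(k_fmax, iddq_initial, iddq_sample):
    """Reliability extrapolation: estimate R_fMAX from one IDDQ sample."""
    return relative_change(iddq_initial, iddq_sample) / k_fmax


if __name__ == "__main__":
    # Hypothetical readings: IDDQ in uA, fMAX in MHz.
    k_fmax = characterize(100.0, 92.0, 1000.0, 980.0)
    # Later in the field only IDDQ is measured; fMAX loss is inferred.
    print(k_fmax, estimate_fmax_degradation(k_fmax, 100.0, 90.0))
```

Once KfMAX is known, only the cheap IDDQ measurement has to be repeated in the field, which is what lets the expensive fMAX testing be avoided or minimized.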

Conclusions

• Process variation and process tolerance are becoming important

• There is a need to optimize designs considering power/performance/yield
