process variations & process- adaptive design for the nano ... · process variations &...

Kaushik RoyElectrical & Computer Engineering

Purdue University

Process Variations & Process-Adaptive Design for the Nano-

meter Regime

Exponential Increase in Leakage

Non-Silicon Technology

Silicon Micro- electronics

Silicon Nano- electronics

1 µm 100 nm 10 nm

1970 1980 2000 2010 2020

5 µm

Junctionleakage

GateSource

n+n+

Bulk

Drain

Subthreshold Leakage

Gate Leakage

A. Grove, IEDM 2002

Technology (μ)

Leak

age

Pow

er (%

of T

otal

)

0%

10%

20%

30%

40%

50%

1.5 0.7 0.35 0.18 0.09 0.05

Must stop at 50%

610ON

OFF

II

=310ON

OFF

II

= 2~6~ 10ON

OFF

II

Technology Trend

Buried Oxide (BOX)Substrate

Fully-depleted body

Gate

VG

VS VD

DrainSource

Vback

Buried Oxide (BOX)Substrate

Fully-depleted body

Gate

VG

VS VD

DrainSource

Vback

Bulk-CMOS

FD/SOI

Nano devices Carbon nanotubeIII-V devices nano-wiresSpintronics

Single gate device

20032009

2020

DGMOS

FinFET Trigate

Multi-gate devices

Buried Oxide (BOX)

Substrate

Source Floating Body Drain

GateVS

VG

VD

Buried Oxide (BOX)

Substrate

Source Floating Body Drain

GateVS

VG

VD

PD/SOI

Design methods to exploit the advantages of technology innovations

Variation in Process Parameters

Inter and Intra-die Variations

Device parameters are no longer deterministic

Device 1 Device 2

Channel length

130nm

30 %

5X0.90 .9

1 .01 .0

1 .11 .1

1 .21 .2

1 .31 .3

1 .41 .4

11 22 33 44 55N orm alized Leakage (N orm alized Leakage ( IsbIsb ))

Nor

mal

ized

Fre

quen

cyN

orm

aliz

ed F

requ

ency

Delay and Leakage Spread

Source: Intel

10

100

1000

10000

1000 500 250 130 65 32Technology Node (nm)

# do

pant

ato

ms Source: Intel

Random dopant fluctuation

ReliabilityTemporal degradation of performance -- NBTI

Tech. generation

Failu

re p

roba

bilit

y

Time

Defects Life time degradation

Pessimistic Design Hurts Performance

worst-case corner

(150nm CMOS Measurements, 110°C)

0

50

100

150

200

Normalized IOFF

Num

ber o

f die

s

0 1 2 3 4 5 7

nominal corner

Substantial variation in leakage across dies4X variation between nominal and worst-case leakagePerformance determined at nominal leakageRobustness determined at worst-case leakage

7

Global and Local Variations

− −= Δ +t t GLOBAL t LOCALV V Vδ δinter-die

GLOBALσ

−Δ t GLOBALV

intra-die

LOCALσ

−t LOCALVδ

Random Dopant Fluctuation

8

Process Tolerance: Memories

S. Mukhopadhaya, Mahmoodi, RoyVLSI Circuit Symposium 2006, JSSC 2006, TCAD

9Parametric Failures: Read

Read failure => Flipping of Cell Data while Reading

BL BR

WL

VL=‘1’

VR=‘0’VREAD

+Δ

-Δ

+Δ

-ΔNR

PR

NL

PLAXRAXL

VTRIPRDVR=VREAD

VL

WL

Volta

ge

Time ->

WL

VR

VLVo

ltage

Time ->( )TRIPRDREADRF VVPP >=

10

Parametric Failures in SRAM

BL BR

WL

NR

PR

NL

PL

AXRAXL ‘1’ ‘0’

High-Vt Low-Vt

Parametric failures– Read Failures– Write Failures– Access Failures– Hold Failures

Parametric failures can degrade SRAM yieldFaulty chips Working chips

Test & Repair using Redundancy

11

Process Variations in On-chip SRAM

Parametric failures →Yield degradation

σVt ≈ 30mv, using BPTM 45nm technology

Yield ≈ 33%

0

50

100

150

200

250

300

350

0 52 105

157

210

262

315

367

419

472

524

577

629

682

734

786

839

890

944

996

1049

Chi

p C

ount

Chi

p C

ount Fault statistics

Number of faulty cells (Number of faulty cells (NNFaultyFaulty--CellsCells))

Simulation of an 64KB Cache A. Agarwal, et. al, JSSC, 05

12

Inter-die Variation & Cell Failures

inter-die Vt shift (ΔVth-GLOBAL)

GLOBALσ

“1” “0”

High–Vt Corners− Access failure ↑− Write failure ↑

“1” “0”

Low–Vt Corners− Read failure ↑− Hold failure ↑

13

Inter-die Variation & Memory Failure

Memory failure probabilities are high when inter-die shift in process is high

LVT HVTNom. Vt

BPTM 70nm Devices

14

Self-Repairing SRAM Array

Reduce the dominant failures at different inter-die corners to increase width of low failure region

Region A Region B Region CRegion C

HVT Corner

Access & Write failures dominate

Reduce AF & WF

Region A LVT Corner

Read & Hold failures dominate

Reduce RF & HF

LVT HVTNom. Vt

15

Post-Silicon Repair: Proposed Approach

inter-dieGLOBALσ

intra-die

L O C A Lσ

intra-die

L O C A Lσ

Apply correction to the global variation to reduce number of failures due to local variations

16

Self-Repairing SRAM Array

Reduce the dominant failures at different inter-die corners to increase width of low failure region

Region A Region B Region C

LVT HVTNom. Vt

ZBBRBB FBB

17

How to identify the inter-die Vt corner under a large intra-die variation ?

Effect of inter-die variation can be masked by intra-die variation

Monitor circuit parameters, e.g. leakage current

BLBL BRBR

WLWL

VDDVDD

GNDGND

18

Leakage of entire SRAM array is a reliable indicator of the inter-die Vt corner

1

1NY X

ii Y X

Y XN

σ σμ μ=

= => =∑

Array Leakage Monitoring

• Adding a large number of random variables reduces the effect of intra-die variation

19

Self-Repair using Leakage Monitoring

SRAM ARRAY

VOUT

Entire array leakage is monitored to detect inter-die corner and proper body-bias

is selected

BypassSwitch

On-chipLeakageMonitor

CalibrateSignal

VDD

SRAMArray

Comparator

Bodybias Body-Bias

selection

VoutV

REF1 VREF2

FBB ZBB RBB

20

Yield Enhancement using Self-Repair

Self-Repairing SRAM using body-bias can significantly improve design yield

21

Test-Chip of Self-Repairing SRAMVCO

Technology : IBM 0.13 μm128KB SRAMDual-Vt Triple-well tech. Number of Trans: ~ 7 million Die size: 16mm2VLSI CKT Symp. 2006, ITC 2005

64 KB LVT

Array

16 KB block

VCOIs

olat

ed c

ell

Sensor + Ref. gen. BB gen

Simulation results for 1MB array designed in IBM 0.13μm

22

Continuous vs Quantized Body Bias

Quantized (3 Level: FBB, ZBB, RBB) body bias scheme is a cost effective solution with good

yield enhancement possibility

Process Tolerance: Register Files

Kim et. al. VLSI Circuit Symposium 2004

Process Compensating Dynamic Circuit Technology

clk

. . .RS0 RS7

D0 D7

RS1

D1

LBL0

LBL1

N0

Keeper upsizing degrades average performance

Conventional Static Keeper

Process Compensating Dynamic Circuit Technology

3-bit programmable keeper

clk

. . .RS0 RS7

D0 D7

RS1

D1

LBL0

LBL1

N0

b[2:0]

W 2W 4Ws s s

Opportunistic speedup via keeper downsizing

C. Kim et al. , VLSI Circuits Symp. ‘03

5X reduction in robustness failing dies

0

50

100

150

200

250

0.7 0.8 0.9 1.0 1.1 1.2Normalized DC robustness

Num

ber o

f die

s

ConventionalThis work

Noise floor

saved dies

Robustness Squeeze

10% opportunistic speedup

0

50

100

150

200

250

300

0.8 0.9 1.0 1.1 1.2Normalized delay

Num

ber o

f die

sConventionalThis work

PCD μ = 0.90Conv. μ = 1.00

μ : avg. delay

Delay Squeeze

On-Die Leakage Sensor For Measuring Process Variation

C. Kim et al. , VLSI Circuits Symp. ‘04

83μm

73μ

m

current reference

comparators

current m

irrors

VBIASgen.

NMOS device

test interface

High leakage sensing gain – 90nm dual-Vt, Vdd=1.2V, 7 level resolution, 0.66 mW @80Cº

Output codes from leakage sensor

001 010 011 100 101 110 111

Leakage Binning Results

Process detection

Self-Contained Process CompensationFab

Assembly

Wafer test

Burn inPackage testCustomer

Leakage measurement

On-die leakage sensor

Program PCD using fuses

Self-Repair: Architecture Level

Agarawal, Roy TVLSI 2005

Fault-Tolerant Cache Architecture

BIST detects the faulty blocksConfig Storage stores the fault information

Idea is to resize the cache to avoid faulty blocks during regular operation

Faulty

Mapping Issue

More than one INDEXare mapped to same

block

Include column address bits into

TAG bits

Resizing is transparent to processor → same memory address

Fault Tolerant Capability

0

50

100

150

200

250

300

350

0 105 210 315 419 524 629 734 839 944 1049

Chi

p C

ount

(Nch

ip)

Fault statistics

Chips saved by the proposed + redundancy (R=8, r=3)

Chips saved by ECC + redundancy ( R=16)

NFaulty-Cells

More number of saved chipsas compare to ECC ECC fails to save

any chips

Proposed architecture can handle more number of faulty cells than ECC, as high as 890 faulty cells

Saves more number of chips than ECC for a givenNFaulty-Cells

CPU Performance Loss

0.0

0.5

1.0

1.5

2.0

2.5

0 105 210 315 419 524 629 734 839

% C

PU P

erfo

rman

ce L

oss For a 64K cache

averaged over SPEC 2000 benchmarks

NFaulty-Cells

Increase in miss rate due to downsizing of cache

Average CPU performance loss over all SPEC 2000benchmarks for a cache with 890 faulty cells is ~ 2%

Logic: Process Tolerance

Sizing, Dual-Vt, Synthesis, etc.

• Mostly worst-case design methdology• Guard-banding with larger transistors/larger

library elements• Selective guard-banding to reduce delay in

longer paths– Transistor sizing– Dual-Vt

• Logic optimization and simultaneous sizing/dual-Vt

38

Logic: A New Paradigm for Low-Voltage, Variation Tolerant Circuit Synthesis Using

Critical Path Isolation (CRISTA)-- Better than worst-case design

Ghosh, Bhunia, Roy -- ICCAD 2006

39Vdd Scaling and Process Tolerance: Conventional Solutions

• Low power:– Reduce the supply voltage

• Error rate increases

– Dual-Vt/dual-VDD assignment• Number of critical paths increases

• Robustness:– Increase supply voltage

• Power dissipation increases

– Upsize the gates• Switching capacitance increases

Low power and robustness: conflicting requirements

Low power and robustness: conflicting requirements

40

• Post-Silicon technique for dynamic supply scaling and timing error detection/correction

• Error correction overhead is 1% for a 10% error rate

RAZOR: Dan Ernst et. al., MICRO 2003.

Razor Approach

D

CLK

DelayE

Q

Shadow Latch

Standard Latch

41

CRISTA: Basic Idea

CLK

VDD=1V

VDD<1VVDD<1V

evaluateevaluate based on prediction

non-critical path activationcritical path activation

• Important points:– Design time technique (as opposed to post-Si)– Scale down the supply while making delay failures

predictable– Avoid the failures by adaptive clock stretching– Ensure that critical paths are activated rarely

Tc 2Tc

42

path delay

Num

ber o

f pat

hs

predictable and restricted to a logic section having low activation probability

S2+S3

S1

Design ATc

CLK

S1S2 S3

Design BVDD=1V

Design BVDD<1V

Design Considerations for CRISTA

• Few predictable critical paths• Low activation probability of critical paths• Slack between critical and non-critical paths under variations

Design A: conventional design

Design B: proposed design

VDD=1V

43

Case Study: AdderP0 G1 P G1 P2 G2 P3 G3

Co,3Co,2Co,1Co,0Ci,0

P0 G1 P1 G1 P2 G2 P3 G3Co,3Co,2Co,1Co,0Ci,0 FAFA FAFA FA FA

Crit. path delay=260pslongest non-crit. path delay=165psP = 13uW (1-cycle)

VDD = 1V, TCLK = 260ps

Crit. path delay=330pslongest non-crit. path delay=260psP = 7.4uW (rare 2-cycles, decoder)

VDD = 0.8V, TCLK = 260ps

• Interesting features:

– Single critical path (activated by P0P1P2P3=1 & Ci,0=1)– Low activation probability of critical path

44% power saving by reducing voltage and, operating critical path at 2-cycle and other paths at 1-cycle

44% power saving by reducing voltage and, operating critical path at 2-cycle and other paths at 1-cycle

Can we apply same technique to any random logic?

44

Wallace Tree Multiplier

Vector Merging Adder

Full AddersHalf AdderVector Merging Adder

Partial Products

Stage 1

Stage 2

Stage 3

Final Product

Critical pathLongest off-critical path

• 29% power saving with ~4% area overhead

45

Simulation Results

ISO Yield = 92%

20

25

30

35

40

45

12 bits 16 bits 32 bits

% P

ower

sav

ings

567891011

% A

rea

over

head ISO Yield = 90%

0

5

10

15

20

25


% P

ower

sav

ings

6

7

8

9

10

%A

rea

over

head

ISO Yield = 96%

26

27

28

29


%Po

wer

sav

ings

2.5

3.5

4.5

5.5

6.5

7.5

%A

rea

over

head

02468

101214

6 bits 8 bits 10 bits 12 bits 20 bits

% T

hrou

ghpu

t pen

alty

3.73.83.944.14.24.34.4

% A

rea

Ove

rhea

d% Throughput penalty% Area Overhead

Ripple Carry Adder Carry Save Multiplier

Wallace Tree Multiplier Performance penalty (WTM)

46

1 1 1

1 2

1

'

2

'

1 1

( ,..., ,..., ) ( ,..., 1,..., ) ( ,..., 0,..., )

( ,..., 1,..., ); ( ,..., 0,..., )

n n ni i i

n ni

i i

i

i i

f x x x f x x x f x x xCF CF

CF f x x x CF f

x xx

x xx

x•

= • = + • == + •

= = = =

CF1(xi=1)

CF2(xi=0)

MU

X

f

inputs

f1

f2xi

Activation probability of cofactors can be reducedHow to choose Control Variable ?

Prob =50%

Prob =50%CF11

CF12

MU

X

xj

f1Prob =25%

Prob =25%

Random Logic: Shannon’s Expansion

47

Original Circuit

CF11

CF21

CF32

CF63

CF53

CF42

MU

X N

etw

ork

Inputs

PO

Inputs

Further Isolation and Slack Creation by Sizing

• Slack creation strategy – Lagrangian Relaxation based sizing (B.C. Paul et. al., DAC 2004) is

used– Non-critical paths are selectively made faster– Critical paths are slightly slowed down

48

0

20

40

60

80

cht sct pcle mux decod cm150a x2 alu2 count

% Im

p. in

pow

er

% imp in power with switching activity = 0.2 100% imp in power with switching activity = 0.5

MCNC benchmarks, 70nm Process

Simulation Results

• Average power saving = ~45%• Average area overhead = 18%• Avg performance penalty=5.9% (with 4 control variables) for signal

prob=0.5

0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

cht sct pcle mux decod cm150a x2 alu2 count

Are

a (x

103 )

um^2

Original design

Proposed design

MCNC benchmarks, 70nm Process

Power Area

49

Two-Stage Pipeline with Test Logic

Outputs

Com

para

tor

Pre-decoder

Pre-decoder

gclk

SFFs

SFFs

SFFs

Car

ry-L

ook-

ahea

d A

dder

freeze

Stalling Logic

●

LFSR

TM1TM2

CLK

● ●

VDDm

●●

4:1

mux

Outputs

Com

para

tor

SFFs

SFFs

SFFs

Car

ry-L

ook-

ahea

d A

dder

●

TM1● ●

VDDo ●●

2:1

Mux

●●

Low Power Robust Pipeline

Conventional Pipeline

fixed vectors

fixed vectors

●

TM1TM2

CLK

Clock generator

Test logic

Proposed pipeline

Test logic

Regular pipeline

GDS Layout

Power measurement of proposed pipeline

20% reduction in conventional pipeline

40% extra reduction using CRISTA

~40% power saving with ~13% performance penalty~40% power saving with ~13% performance penalty

CRISTA in Thermal Design

50

51

Temperature Aware Design using CRISTA*

S3

Secondary critical(it will run with two-cycles at VDDL2)

• Scale down the supply at elevated temperatures while making delay failures predictable

• Maintain rated frequency• Avoid the failures by adaptive clock stretching• Ensure that critical paths are activated rarely• Tradeoff between temperature and performance loss

path delay

# pa

ths

S2 S1

1 cycle bound

Critical (it will run with two-cycles at VDDL1 and VDDL2)

path delay

# pa

ths

S

S Ghosh, S Bhunia and K Roy, Low overhead circuit synthesis for temperature adaptation using dynamic voltage scheduling, DATE 2007.

52

Simulation Results

• Similar temperature reduction as conventional technique (i.e. supply scaling with half frequency)

• Avg performance penalty=5.9% (at VDDL1) and 15.2% (at VDDL2) compared to 50% in conventional technique

• Avg area overhead ~14%

MCNC benchmarks, 70nm devices, Nominal supply = 1V, Tamb = 100oC

99.5

100.5

101.5

102.5

cht sct pcle mux cmb alu2 cm150a decod

Tem

pera

ture

(o C)

Org(VDDH), freq=fOrg(VDDL) , freq=f/2CRISTA(VDDL1), freq=f(adaptive)CRISTA(VDDL2) , freq=f(adaptive)

53

Micro-Architectural Implementation

• Applied to execution stage (adders and multipliers) of in-order pipeline

• Can be an add-on to standard DVFS

Tem

p se

nsor

free

ze

ID EX MEM

●

VDDH

VDDL

0

1

TREF2

Sel

CLK

IF WB

●

●●

Pre-decoder

TREF1

54

Simulation Results

DVFS triggered

DVFS triggering avoided in CRISTA

• Avoids requirement of full-chip DVFS

SPEC CINT benchmarks (gcc)

0 0.5 1 1.5 2 2.5 3 3.565

70

75

80

Time (sec)

Max

. Tem

p (C

)

No DTM DVFS

0 0.5 1 1.5 2 2.5 3 3.565

70

75

80

Time (sec)

Max

. Tem

p (C

)

CRISTA

DVFS: Dynamic Supply and Frequency ScalingCRISTA : Occasional 2-Cycle Operation

VV f (GHz)f (GHz)

DVFSDVFS 0.90.9 1.51.5

CRISTACRISTA 0.850.85 33No No

DTMDTM1.21.2

33

55

Simulation Results

0

20

40

60

80

100

120

bzip2

crafty eo

n

gap

gcc gzip mcfpars

erperl

mktw

olfvo

rtex

vpr

T(de

gree

C)

No DTM DVFS CRISTA

• Average reduction in peak temperature ~6.6% with ~3.4% throughput loss and ~4.2% area overhead

Peak Temperature (TREF1=75oC, TREF2=80oC)

VDD Scaling, Process Variation, and Quality Trade-off: DCT

-- Quality knob for tuning

Banerjee, Karakonstantis, RoyDesign Automation and Test in Europe (DATE) 2007

Basic Idea• All computations are “not equally important” for determining outputs

• Identify important and unimportant computations based on output “sensitivity”

• Compute important computations with “higher priority”

• Delay errors due to variations/ Vdd scaling “affect only” non-important computations

• “Gradual degradation” in output with voltage scaling and process variations

DCT Based Image Compression Process

• DCT is used in current international image/video coding standards

- JPEG, MPEG, H.261, H.263

QuantizerFDCT

Source image X

Compressed

Image Data

Z = T• X • T '

Z EntropyEncoder

RoundT• Z • T '

Q

JPEG Encoder Block Diagram

V

8×8 blocks

512×512 image

1D DCT TransposeMemory

1D DCTX Y ZW

High energy components (important outputs 75% energy)Low energy components (less important outputs)

64

1 2

4

3 5

6 7

9

8

10

11

12

13

14

15

19

20

17

18

2816

27

29

4330

3126 42 44

3225 4541

24

54

4033 5546 53

2321 3934 5247 6156

3522 4838 5751 6260

3736 5049 5958 63

Energy Distribution of a 2D-DCT Output

Can important components be computed with higher priority ?

(c)(d)

1D- intermediate DCT outputsInput Block

Transpose Memory

(a) (b)

1D-DCT

1D-DCT

Faster

Slow

er C

ompu

tatio

n

Fast

er C

ompu

tatio

n

Slower

x0

x2

x1

x3

x4

x5

x6

x7

w0

w2

w1

w3

w4

w5

w6

w7

w8 w16 w24 w32 w40 w48 w56

y0

y1

y2

y3

y4

y5

y6

y7

Final DCT outputs

Slow

er C

ompu

tatio

n

Fast

er

Com

puta

tionz0

z1

z2

z3

z4

Computation

Computation

Design Methodology

T.xt

<< (3 adders delay)

(2 adders delay)

52( + )x x d•

4w1 6( + )x x d•

70( + )x x d•

3 4( + )x x d•0w

─

+

Path Delays for 1D-DCT outputs

(4 adders delay)

<< ─+

<<

(3 adders delay)

─

70( + )x x e•

3 4( + )x x e•

70( + )x x f•

1 6( + )x x e•

1 6( + )x x f•

3 4( + )x x f•

52( + )x x f•

3 4( + )x x f•

1 6( + )x x f•

52( + )x x e•

2w

6w

─+

+

++

─

─

(4 adders delay)

>>

1w

+─<<

>> +─

+─<<

<<+─

70( - )x x a•

3 4( - )x x f•

1 6( - )x x a•

1 6( - )x x f•52( - )x x e•

1 6( - )x x e•

70( - )x x f•

3 4( - )x x a•52( - )x x a•52( - )x x e•

52( - )x x f•

(3 adders delay)

7w

>>

3w+70( - )x x a•

(3 adders delay)

<<>>

3 4( - )x x e•

1 6( - )x x a•3 4( - )x x a•

5w

3 4( - )x x f•

52( - )x x f• <<

<<+

52( - )x x a•

70( - )x x e•

70( - )x x f•3 4( - )x x e•

52( - )x x f• +

+

+─

─

──

─

(4 adders delay)

w0

w1

w2

w3

w4

w5

w6

w7

Longer Delays

Important Computations

Paths Not Computed

Delay=D1

@ Vdd1

w0

w1

w2

w3

w4

w5

w6

w7

D2 >D1

@Vdd2

D1

@Vdd2

Proposed DCT under Vdd scalingProposed Design with high/low delay paths Scaled Vdd: Longer paths under Vdd scaling

Extreme Scaled Vdd: Shorter paths affected

Paths Not Computed

w0

w1

w2

w3

w4

w5

w6

w7

D4 >D1

@Vdd3

D3 > D1

@Vdd3

Vdd3 < Vdd2 < Vdd1(nominal)D1 @Vdd3

Only DC component

00.5

11.5

22.5

33.5

4

Pat

h1(w

0)

Pat

h2(w

1)

Pat

h3(w

2)

Pat

h4(w

3)

Pat

h5(w

4)

Pat

h6(w

5)

Pat

h7(w

6)

Pat

h8(w

7)

Computation Paths

Del

ay(n

s)Conventional DCT Proposed DCT

1D-DCT Path Delay Comparisons

Effect of Vdd Scaling

Conventional WTM DCT

1.0 V

0.9 V

0.8 V

FAILS

FAILSFAILS

FAILS

CSHM DCT (2 alphabet)

Proposed DCT 1.0V CSHM DCT

(2 alphabets)DCT

with WTM Proposed

DCT

Power (mW) 25.1 29.8 26Delay (ns) 3.2 3.64 3.57Area (um2) 80490 108738 90337PSNR (dB) 21.97 33.23 33.22

Proposed DCT Vdd=0.9V

Proposed DCTVdd=0.8V

Power (mW) 17.53(41.2%) 11.09(62.8%)PSNR (dB) 29 23.41

Proposed Architecture at Reduced Voltage

Different Architectures at Nominal Voltage

•Graceful degradation of proposed DCT architecture under Vdd scaling ( Vdd can be scaled to 0.75V)

• Conventional architectures fails

65

1. Discrete Cosine Transform (DCT)

• Graceful degradation in quality under Vdd scaling (Vdd can be scaled to from 1V to 0.8V) with ~60% improvement in power

• Conventional architectures fail

• Scale down supply voltage

• Design/algorithm such that failures can only occur in less critical section --minimal quality degradation

CRISTA for DSP SystemsCRISTA for DSP Systems

Conventional WTM DCT

1.0 V

0.9 V

0.8 V

FAILS

FAILSFAILS

FAILS

CSHM DCT (2 alphabet)

Proposed DCT

66

2. Finite Impulse Response (FIR)

less-critical coefficients

critical coefficients

For Other DSP SystemsFor Other DSP SystemsPower at Nominal/Scaled Voltage

020406080

100120

1 0.9 0.8Voltage(V)

Pow

er(m

W) Conv. LCCSE

33.6%

FAILS

FAILS

3. Color Interpolation

1.0 V

0.9 V

0.8 VFAILS

FAILS

Conv Proposed

Ri, j

Gi, j−1Gi+1, jGi, j+1

Gi−1, j

M1Vdd

G’1G’2

>>2

>>2

>>1 ――

M

Vdd

Ri, j+2Ri+2, jRi -2, j

Ri, j -2

G’3

• Bilinear component is critical and gradient component is less-critical

• Design architecture such that failures can only occur in gradient term

Temporal Degradation: BTI

Kang, Roy, et. al. – TCAD, DAC-07

Bias Temperature Instability

• PMOS specific Aging Effect• Generation of (+) traps• Reaction-Diffusion (RD) model*• Time exponent ~ 1/6

x

P+P+

GATE

OXIDE

DRAIN

VBODY = 0V

x

H

Si Si Si Si

HH

H

VGS < 0V

VS = 0VVD = 0V

n-sub

xx x HH H

H2

H2

H2

H2

H2

*M. A. Alam, IEDM’03

After NBTI degradation

( ) ( )10 6

2F

IT HR

k NN t D tk

=

ITT

OX

q NV

C⋅ Δ

Δ =

IDDQ based NBTI Characterization

• Test Circuit Fabricated• 1000 stage INV chain• DC Stress signal @Vin

• IDDQ measurement @GND

Technology CMOS 130nm

Die Size 20 (mm2)

I/O Pin 209

Tox 1.6 (nm)

VDD 1.2 (V)

Inverter Chain

Microphotograph Layout

Vin

VDD

IDDQ Measurement1000 stages

NBTI in Digital Circuits

VL

WL

BLBLB

VR

1

5

o1

o2

3

24

8

7

6

Logic Circuits• fMAX decreases↓• Timing failure with time

Memory Circuits• Static Noise Margin (SNM)↓• Read & Write Stability• Parametric Yield↓

Temporal VTh increase in PMOS affects critical performance factors of digital VLSI circuits

i1

i3

i2

IDDQ based Characterization Technique

Des

ign

phas

ePo

st-s

ilico

n p

hase

• Circuit-level NBTI Reliability Characterization

• IDDQ test is used• Expensive fMAX testing is avoided

(or minimized)• Accurate circuit level performance

degradation can be predicted• IC specific burn-in to qualify the

target produce• Efficient way of field monitoring:

dynamic local signature of produce usage

• Possible usage in other reliability sources; HCI

Initial Characterization•Compute Rleak, RfMAX•Compute KfMAX = Rleak / RfMAX

Reliability Extrapolation•For each IDDQ measurement sample, Rleak is computed•RfMAX = KfMAX X Rleak•Estimate fMAX degradation

Lifetime Projection•Project IDDQ using KfMAX

NBTI Characterization Report

Temp. Dependency

NBTI: Random Logic Circuits

• ISCAS’85 Benchmark Circuits, PTM 65nm • Gate delay: analytical delay model considering NBTI• Circuit delay: NBTI-aware Static Timing Analysis (STA)• Circuit fMAX time exponent n ~ 1/6

PTM 65nmISCAS’85

Benchmarks

3 years Lifetime

8% fMAX decrease

n~1/6

100

102

104

106

108

100

Time (s)f M

AX

redu

ctio

n (%

)

c2670c5315c1908c499c3540c74181

LogicCell fanin

Delay (ps)Δ (%)

t=0 3 years

INV 1 13.77 16.77 21.8

NAND 2 16.86 19.88 17.9

NAND 3 19.57 22.45 14.8

NOR 2 17.26 21.89 26.8

NOR 3 23.80 30.19 26.9

Delay Degrad. STD cells

NBTI: 6T SRAM Cell

104 106 108100

101

Time (s)

Deg

rada

tion

in S

NM

(%)

PTM 60nm, 125°CCR = 1.33PR = 0.67

1/6 trend line

Static Noise Margin (SNM)

SNM degrades by more than 10% in 3 years% SNM Degradation time exponent n ~ 1/6WM improves with time under NBTI

0.43 0.44 0.45 0.46 0.47 0.48 0.490

0.2

0.4

0.6

0.8

1

Write Margin (V)

CD

F

t=0t=108

t=107

t=106

WM improveswith time

Distribution in Write Margin

Design for Reliability under NBTI

580 600 620 640 660 680 700 720

550

600

650

700

750

800

Delay (ps)

Are

a (u

m)

INITTR-BasedCell-Based

Cell-based

Area SavingFrom TR-based

10ps c1908

Gate Sizing applied to guarantee lifetime functionality of design11.7% overhead for Cell-based sizing6.13% overhead for TR-based sizing

45% improvement in area overheadRuntime complexity for TR-based sizing is identical to that of Cell-based sizing

Simulation SetupSynthesized in PTM 65nm1/6 VTh degradation model125°C Stress temperature50% Signal Probability at PI’s

75

Memory Solution #1: Periodic CellMemory Solution #1: Periodic Cell--Flipping*Flipping*

• Balance the signal probability on both side of the cell SL = SR

• Still incurs close to 10% degradation in SNM

ML

WL

BLBLB

MR VL

VR

SL > SR NBTI↑ SL = SR NBTI↓

*S. Kumar et al., ISQED’06

76

Solution #2: Standby VSolution #2: Standby VDDDD ScalingScaling

• NBTI is a strong function of VDD• Lower VDD during standby mode MIN VDD• Effective solution with low design effort

1.0 0.9 0.8 0.7 0.60

2

4

6

8

10

12

14

16

% S

NM

Deg

rada

tion

VDDL (V)

99%50%10%1%

Activity1.0V Nominal VDDVDDL @stand-by mode

65nm PTM@125°C for 3 years

• Normal Mode VDD• Standby Mode VDDL

• Normal Mode NBTI• Standby Mode Min.

NBTI

77

Solution #3: Sensing & CorrectionSolution #3: Sensing & Correction

• Sense reliability degradation adaptive correction using circuit techniques

• *T. Kim et al., VLSI Circuit Symposium 2007• ** K. Kang et al., Design Automation Conf. 2007• Need to properly consider design overhead

RST

RORG

PhaseC

omparator

VDD Stress NBTI_OUT

ADCN7 N-Body Bias

Selection +−

ADCP7 P-Body Bias

Selection

+−

nbody

pbody

P/N Equalizer

VM

GN

GP

Reliability Sensor* Adaptive Body Bias Generator**

Conclusions

• Process Variation and Process Tolerance is becoming important

• There is a need to optimize designs considering power/performance/yield

process variations & process- adaptive design for the nano ... · process variations &...

Documents