Designing Reliable Systems With Unreliable Components


Page 1: Designing Reliable Systems With Unreliable Components

Designing Reliable Systems With Unreliable Components

Shekhar Borkar, Intel Fellow
Director, Microprocessor Research
November 14, 2005

Page 2: Designing Reliable Systems With Unreliable Components

Outline

• Process technology scaling
• Near term challenges
• µArchitecture & design solutions
• Upcoming paradigm shifts
• Long term outlook & challenges
• Summary

Page 3: Designing Reliable Systems With Unreliable Components

Technology Scaling

[Figure: MOS transistor cross-sections showing gate, source, drain, and body, with Tox, Xj, D, and Leff labeled.]

• Dimensions scale down by 30%: doubles transistor density
• Oxide thickness scales down: faster transistor, higher performance
• Vdd & Vt scaling: lower active power

Scaling will continue, but with challenges!
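The link between a 30% shrink and doubled density is simple arithmetic, sketched below. The function names are illustrative, and the only input taken from the slide is the 0.7 linear scale factor.

```python
# Scaling arithmetic: a 30% linear shrink per generation.
LINEAR_SCALE = 0.7  # "dimensions scale down by 30%"


def area_scale(linear: float = LINEAR_SCALE) -> float:
    """Area of a fixed layout shrinks with the square of the linear factor."""
    return linear ** 2


def density_gain(linear: float = LINEAR_SCALE) -> float:
    """Transistors per unit area grow by the inverse of the area factor."""
    return 1.0 / area_scale(linear)


if __name__ == "__main__":
    print(f"area factor:  {area_scale():.2f}")
    print(f"density gain: {density_gain():.2f}x")
```

A linear factor of 0.7 gives an area factor of about 0.49, so the same area holds roughly twice as many transistors.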

Page 4: Designing Reliable Systems With Unreliable Components

Technology Outlook

High Volume Manufacturing    2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)         90     65     45     32     22     16     11     8
Integration Capacity (BT)    2      4      8      16     32     64     128    256
Delay = CV/I scaling         0.7    ~0.7   >0.7   Delay scaling will slow down
Energy/Logic Op scaling      >0.35  >0.5   >0.5   Energy scaling will slow down
Bulk Planar CMOS             High Probability → Low Probability
Alternate, 3G etc            Low Probability → High Probability
Variability                  Medium → High → Very High
ILD (K)                      ~3     <3     Reduce slowly towards 2-2.5
RC Delay                     1      1      1      1      1      1      1      1
Metal Layers                 6-7    7-8    8-9    0.5 to 1 layer per generation
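The node and integration-capacity rows of this table follow a regular pattern. The sketch below regenerates them under three assumptions that are not stated on the slide: one generation every two years, a 0.7 linear shrink per generation, and a doubling of capacity per generation. The function name `project_roadmap` is illustrative.

```python
def project_roadmap(generations=8, start_year=2004, start_node_nm=90.0,
                    start_capacity_bt=2, linear_scale=0.7, years_per_gen=2):
    """Return (year, node in nm, capacity in billions of transistors) per generation.

    Assumes a fixed cadence, a constant linear shrink, and capacity doubling
    each generation; real roadmaps round the node names (65, 45, 32, ...).
    """
    rows = []
    for g in range(generations):
        year = start_year + g * years_per_gen
        node_nm = start_node_nm * linear_scale ** g
        capacity_bt = start_capacity_bt * 2 ** g
        rows.append((year, node_nm, capacity_bt))
    return rows


if __name__ == "__main__":
    for year, node_nm, capacity_bt in project_roadmap():
        print(f"{year}: {node_nm:5.1f} nm, {capacity_bt:3d} BT")
```

The computed nodes (90, 63, 44.1, 30.9, ...) land close to the named nodes in the table, and the capacity column reproduces 2 through 256 BT exactly.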

Page 5: Designing Reliable Systems With Unreliable Components


Page 6: Designing Reliable Systems With Unreliable Components

The Leakage(s)…

[Figure: Ioff (nA/µm, log scale from 1 to 10000) vs. temperature (30 to 130 C), with curves labeled 0.25, 90, and 45.]
[Figure: SD leakage (Watts, log scale from 0.1 to 1000) by technology, 0.25u to 45nm, for 2X and 1.5X transistor growth.]
[Figure: 90nm MOS transistor, labeled Gate, 1.2 nm SiO2, 50nm, Si.]
[Figure: Gate leakage (Watts, log scale from 1.E-03 to 1.E+06) by technology, 0.25u to 45nm, for 1.5X and 2X transistor growth. Annotation: "During Burn-in, 1.4X Vdd".]

Page 7: Designing Reliable Systems With Unreliable Components

Must Fit in Power Envelope

[Figure: Power (W) and power density (W/cm2), scale 0 to 1400, for a 10 mm die from 90nm to 16nm, split into Active, SD Lkg, and SiO2 Lkg.]

Technology, Circuits, and Architecture to constrain the power

Page 8: Designing Reliable Systems With Unreliable Components

Between Now and Then:

• Move away from Frequency alone to deliver performance
• More on-die memory
• Multi-everywhere
  – Multi-threading
  – Chip level multi-processing
• Throughput oriented designs
• Valued performance by higher level of integration
  – Monolithic & Polylithic

Page 9: Designing Reliable Systems With Unreliable Components

Leakage Solutions

[Figures: Planar transistor (gate, 1.2 nm SiO2, silicon substrate); SiGe; gate electrode over 3.0nm High-k on silicon substrate; Tri-gate transistor.]

For a few generations, then what?

Page 10: Designing Reliable Systems With Unreliable Components

Active Power Reduction

Multiple Supply Voltages
[Figure: Slow, Fast, and Slow blocks, assigned low or high supply voltage.]

Throughput Oriented Designs

                 One logic block at Vdd    Two logic blocks at Vdd/2
Freq             1                         0.5
Vdd              1                         0.5
Throughput       1                         1
Power            1                         0.25
Area             1                         2
Pwr Den          1                         0.125
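The throughput-oriented numbers on this slide can be reproduced with a short calculation. The sketch assumes dynamic power proportional to C·Vdd²·f per block, with capacitance and area proportional to the number of blocks. That model and the function name `parallel_design` are assumptions for illustration, not something the slide spells out.

```python
def parallel_design(n_blocks: int, vdd: float, freq: float) -> dict:
    """Relative metrics for n identical logic blocks.

    Everything is normalized to one block at Vdd = 1 and Freq = 1.
    Assumes dynamic power ~ C * Vdd^2 * f per block and area ~ block count.
    """
    power = n_blocks * vdd ** 2 * freq
    area = float(n_blocks)
    return {
        "throughput": n_blocks * freq,
        "power": power,
        "area": area,
        "power_density": power / area,
    }


if __name__ == "__main__":
    print(parallel_design(1, 1.0, 1.0))  # baseline block at Vdd
    print(parallel_design(2, 0.5, 0.5))  # two blocks at Vdd/2, half frequency
```

With two blocks at half the voltage and half the frequency, throughput stays at 1 while power drops to 0.25 and power density to 0.125, matching the slide.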

Page 11: Designing Reliable Systems With Unreliable Components

Leakage Control

• Body Bias (Vdd, Vbp +Ve, Vbn -Ve): 2-10X reduction
• Sleep Transistor (on a logic block): 2-1000X reduction
• Stack Effect (equal loading): 5-10X reduction

Page 12: Designing Reliable Systems With Unreliable Components

Optimum Frequency

Maximum performance with
• Optimum pipeline depth
• Optimum frequency

[Figure: Process Technology. Relative scale 0 to 10 vs. relative frequency 1 to 10; sub-threshold leakage increases exponentially.]
[Figure: Pipeline Depth. Power efficiency vs. relative pipeline depth 1 to 10, with an optimum.]
[Figure: Pipeline & Performance. Performance vs. relative frequency (pipelining) 1 to 10, showing diminishing return.]

Page 13: Designing Reliable Systems With Unreliable Components

µArchitecture Techniques

Increase on-die Memory
[Figure: Cache % of total area (0% to 100%) by technology, 1u to 65nm, for 486, Pentium®, Pentium® III, Pentium® 4, and Pentium® M.]

Multi-threading
[Figure: Single thread (ST) stalls while waiting for memory; with multi-threading (MT1, MT2, MT3) the waits overlap, giving full HW utilization.]
Improved performance, no impact on thermals & power delivery

Chip Multi-processing
[Figure: Four cores (C1 to C4) sharing a cache, compared with one large core.]
[Figure: Relative performance (1 to 3.5) vs. die area, power (1 to 4), for multi core and single core.]

Page 14: Designing Reliable Systems With Unreliable Components

Special Purpose Hardware

TCP/IP Offload Engine
[Figure: Die layout, 2.23 mm X 3.54 mm, 260K transistors; labeled blocks: TCB, Exec Core, PLL, OOO, ROB, ROM, CAM1, CLB, Input seq, Send buffer.]
[Figure: MIPS (log scale from 1.E+02 to 1.E+06) vs. year, 1995 to 2015: GP MIPS @75W and TOE MIPS @~2W.]

Opportunities: Network processing engines, MPEG Encode/Decode engines, Speech engines

[Figure: Si monolithic vs. polylithic integration; labeled components: CPU, Special HW, Memory, CMOS RF, Wireline, RF, Opto-Electronics, Dense Memory; Heterogeneous: Si, SiGe, GaAs.]

Special purpose HW and extreme integration

Page 15: Designing Reliable Systems With Unreliable Components


Page 16: Designing Reliable Systems With Unreliable Components

Sources of Variations

[Figure: Heat flux (W/cm2), scale 0 to 250: Vcc variation.]
[Figure: Temperature (C), scale 40 to 110: temperature variation & hot spots.]
[Figure: Random dopant fluctuations. Mean number of dopant atoms (log scale from 10 to 10000) vs. technology node, 1000nm to 32nm.]
[Figure: Sub-wavelength lithography. Lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) and technology generation (180nm, 130nm, 90nm, 65nm, 45nm, 32nm) vs. year, 1980 to 2020, with the gap between them marked. Source: Mark Bohr, Intel.]

Page 17: Designing Reliable Systems With Unreliable Components

Impact of Static Variations

Today…
• Frequency: ~30%
• Leakage power: ~5-10X

[Figure: 130nm. Normalized frequency (0.9 to 1.4) vs. normalized leakage (Isb, 1 to 5), annotated with 30% and 5X spreads.]

Page 18: Designing Reliable Systems With Unreliable Components

Circuit Design Tradeoffs

Higher probability of target frequency with:
1. Larger transistor sizes
2. Higher Low-Vt usage
But with power penalty

[Figure: Power and target frequency probability (0 to 2) vs. transistor size, small to large.]
[Figure: Power and target frequency probability (0 to 2) vs. Low-Vt usage, low to high.]

Page 19: Designing Reliable Systems With Unreliable Components

Impact of Critical Paths

[Figure: Mean clock frequency (1.1 to 1.4) vs. # of critical paths (1 to 25).]
[Figure: Number of dies (0% to 60%) vs. clock frequency (0.9 to 1.5), for different # of critical paths.]

• With increasing # of critical paths
  – Both σ and µ become smaller
  – Lower mean frequency
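A small Monte Carlo experiment illustrates why more critical paths lower both the mean and the spread of die frequency. It is a sketch, not the analysis behind the slide: it assumes independent Gaussian path delays and that the slowest path sets the clock. The 5% per-path sigma and the function name are made up for illustration.

```python
import random
import statistics


def die_frequency_stats(n_paths, n_dies=2000, path_sigma=0.05, seed=0):
    """Return (mean, standard deviation) of die frequency over simulated dies.

    Each die has n_paths critical paths with independent Gaussian delays
    (mean 1.0, standard deviation path_sigma). The slowest path sets the
    clock, so die frequency = 1 / max(path delays).
    """
    rng = random.Random(seed)
    freqs = []
    for _ in range(n_dies):
        slowest = max(rng.gauss(1.0, path_sigma) for _ in range(n_paths))
        freqs.append(1.0 / slowest)
    return statistics.mean(freqs), statistics.stdev(freqs)


if __name__ == "__main__":
    for n in (1, 9, 17, 25):
        mean, sigma = die_frequency_stats(n)
        print(f"{n:2d} critical paths: mean f = {mean:.3f}, sigma = {sigma:.4f}")
```

Taking the maximum over more paths pushes the expected worst delay up and narrows its distribution, so both µ and σ of the frequency shrink as the path count grows.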

Page 20: Designing Reliable Systems With Unreliable Components

Impact of Logic Depth

[Figure: Number of samples (%) vs. variation (%), from -16% to 16%: distributions of NMOS and PMOS device Ion and of delay. Logic depth: 16.]

NMOS Ion σ/µ    PMOS Ion σ/µ    Delay σ/µ
5.6%            3.0%            4.2%

[Figure: Ratio of delay-σ to Ion-σ (0.0 to 1.0) vs. logic depth (16, 49).]
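One way to see why deeper logic reduces relative delay variation is the averaging of independent per-gate variations along a path. The sketch below covers only that idealized random component, assuming every gate has the same mean delay and an independent variation. It does not try to reproduce the measured values on the slide, and the function name is illustrative.

```python
import math


def path_sigma_over_mu(gate_sigma_over_mu: float, logic_depth: int) -> float:
    """Relative delay spread of a path made of logic_depth identical gates.

    With independent per-gate variation, the path mean grows as n while the
    path sigma grows as sqrt(n), so sigma/mu falls as 1/sqrt(n).
    """
    return gate_sigma_over_mu / math.sqrt(logic_depth)


if __name__ == "__main__":
    for depth in (16, 49):
        ratio = path_sigma_over_mu(1.0, depth)
        print(f"logic depth {depth}: path sigma/mu = {ratio:.3f} x gate sigma/mu")
```

Variation that is correlated across gates does not average out this way, so real paths see less benefit than this bound suggests.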

Page 21: Designing Reliable Systems With Unreliable Components

µArchitecture Tradeoffs

Higher target frequency with:
1. Shallow logic depth
2. Larger number of critical paths
But with lower probability

[Figure: Frequency and target frequency probability (0 to 1.5) vs. logic depth, small to large.]
[Figure: Frequency and target frequency probability (0 to 1.5) vs. # uArch critical paths, less to more.]

Page 22: Designing Reliable Systems With Unreliable Components

Variation-tolerant Design

Balance power & frequency with variation tolerance

[Figure: Frequency and target frequency probability (0 to 1.5) vs. # uArch critical paths, less to more.]
[Figure: Frequency and target frequency probability (0 to 1.5) vs. logic depth, small to large.]
[Figure: Power and target frequency probability (0 to 2) vs. transistor size, small to large.]
[Figure: Power and target frequency probability (0 to 2) vs. Low-Vt usage, low to high.]

Page 23: Designing Reliable Systems With Unreliable Components

Probabilistic Design

[Figure: Frequency vs. leakage power, deterministic and probabilistic; 10X variation, ~50% total power.]
[Figure: Probability vs. path delay; the delay spread is due to variations in Vdd, Vt, and Temp.]
[Figure: # of paths vs. delay, relative to the delay target: deterministic.]
[Figure: # of paths vs. delay, relative to the delay target: probabilistic.]

Deterministic design techniques inadequate in the future
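Treating path delay as a distribution rather than a single number turns "does the path meet the delay target?" into a probability. The sketch below assumes Gaussian path delays and independent critical paths. Both are simplifications chosen for illustration, and the function names are hypothetical.

```python
import math


def prob_meets_target(target: float, mean: float, sigma: float) -> float:
    """Probability that one Gaussian-distributed path delay is <= target."""
    z = (target - mean) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def timing_yield(target: float, mean: float, sigma: float, n_paths: int) -> float:
    """Probability that all n_paths independent critical paths meet the target."""
    return prob_meets_target(target, mean, sigma) ** n_paths


if __name__ == "__main__":
    # Delay target set three sigma above the mean path delay.
    for n in (1, 25, 1000):
        print(f"{n:4d} paths: yield = {timing_yield(1.15, 1.0, 0.05, n):.4f}")
```

A margin that looks safe for a single path erodes as the number of near-critical paths grows, which is one reason to optimize against the whole delay distribution instead of a single worst-case number.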

Page 24: Designing Reliable Systems With Unreliable Components

Shift in Design Paradigm

• Multi-variable design optimization for:
  – Yield and bin splits
  – Parameter variations
  – Active and leakage power
  – Performance

Today: Local Optimization, Single Variable
Tomorrow: Global Optimization, Multi-variate

Page 25: Designing Reliable Systems With Unreliable Components


Page 26: Designing Reliable Systems With Unreliable Components

Today’s Freelance Layout

[Figure: Cell layouts with Vdd and Vss rails and Ip/Op pins.]

No layout restrictions

Page 27: Designing Reliable Systems With Unreliable Components

Transistor Orientation Restrictions

[Figure: Cell layouts with Vdd and Vss rails and Ip/Op pins.]

Transistor orientation restricted to improve manufacturing control

Page 28: Designing Reliable Systems With Unreliable Components

Transistor Width Quantization

[Figure: Cell layouts with Vdd and Vss rails and Ip/Op pins.]

Page 29: Designing Reliable Systems With Unreliable Components

Today’s Unrestricted Routing

Page 30: Designing Reliable Systems With Unreliable Components

Future Metal Restrictions

Page 31: Designing Reliable Systems With Unreliable Components

Today’s Metric: Maximizing Transistor Density

Dense layout causes hot-spots

Page 32: Designing Reliable Systems With Unreliable Components

Tomorrow’s Metric: Optimizing Transistor & Power Density

Balanced Design

Page 33: Designing Reliable Systems With Unreliable Components

Implications to Design

• Design fabric will be Regular
• Will look like Sea-of-transistors interconnected with regular interconnect fabric
• Shift in the design efficiency metric
  – From Transistor Density to Balanced Design

Page 34: Designing Reliable Systems With Unreliable Components


Page 35: Designing Reliable Systems With Unreliable Components

Technology Outlook

High Volume Manufacturing    2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)         90     65     45     32     22     16     11     8
Integration Capacity (BT)    2      4      8      16     32     64     128    256
Delay = CV/I scaling         0.7    ~0.7   >0.7   Delay scaling will slow down
Energy/Logic Op scaling      >0.35  >0.5   >0.5   Energy scaling will slow down
Bulk Planar CMOS             High Probability → Low Probability
Alternate, 3G etc            Low Probability → High Probability
Variability                  Medium → High → Very High
ILD (K)                      ~3     <3     Reduce slowly towards 2-2.5
RC Delay                     1      1      1      1      1      1      1      1
Metal Layers                 6-7    7-8    8-9    0.5 to 1 layer per generation

Page 36: Designing Reliable Systems With Unreliable Components

Reliability

[Figure: Soft error FIT/chip (logic & mem), relative (0 to 150), by technology node, 180nm to 16nm; ~8% degradation/bit/generation.]
[Figure: Time dependent device degradation. Ion (relative, 0 to 1) vs. time (1 to 10).]
[Figure: Burn-in may phase out…? Jox (relative, log scale from 1 to 10000) by technology node, 180nm to 22nm; annotation: "Hi-K?".]
[Figure: Extreme device variations. Relative distribution vs. Vt (mV), 100 to 200, becoming wider.]
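The "~8% degradation/bit/generation" figure compounds with the growth in bits per chip. The sketch below assumes the bit count doubles each generation, borrowed from the Integration Capacity row of the Technology Outlook rather than stated on this slide, so the result is a rough trend and need not match the plotted curve.

```python
def relative_fit_per_chip(generations: int,
                          per_bit_growth: float = 1.08,
                          bits_growth: float = 2.0) -> float:
    """Soft error rate per chip relative to the starting generation.

    per_bit_growth: ~8% worse soft error rate per bit each generation (slide).
    bits_growth: assumed doubling of bits per chip each generation.
    """
    return (per_bit_growth * bits_growth) ** generations


if __name__ == "__main__":
    for g in range(5):
        print(f"generation +{g}: {relative_fit_per_chip(g):6.2f}x FIT/chip")
```

Even a modest per-bit degradation becomes a steep per-chip trend once it is multiplied by the number of bits being integrated.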

Page 37: Designing Reliable Systems With Unreliable Components

Implications to Reliability

• Extreme variations (Static & Dynamic) will result in unreliable components
• Impossible to design reliable system as we know today
  – Transient errors (Soft Errors)
  – Gradual errors (Variations)
  – Time dependent (Degradation)

Reliable systems with unreliable components: Resilient µArchitectures
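The slides do not name a specific resilience mechanism. As one classical illustration of building a reliable result from unreliable parts, the sketch below uses triple modular redundancy (TMR) with majority voting. It assumes independent failures and a perfect voter, and it is not presented as the approach the talk proposes.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of redundant units."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    return value


def tmr_reliability(r: float) -> float:
    """Probability that at least 2 of 3 independent units are correct.

    r is the reliability of a single unit; the voter is assumed perfect.
    """
    return 3 * r ** 2 - 2 * r ** 3


if __name__ == "__main__":
    print(majority_vote([7, 7, 3]))       # one unit disagrees; the vote masks it
    print(f"{tmr_reliability(0.9):.3f}")  # three 90%-reliable units
```

With three units that are each correct 90% of the time, the voted result is correct about 97.2% of the time, at the cost of triple the hardware and power.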

Page 38: Designing Reliable Systems With Unreliable Components

Implications to Test

• One-time-factory testing will be out
• Burn-in to catch chip infant-mortality will not be practical
• Test HW will be part of the design
• Dynamically self-test, detect errors, reconfigure, & adapt
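The self-test, detect, reconfigure loop can be sketched in software terms. Everything here is hypothetical: the `Core` class, the `faulty` flag used to simulate a bad unit, and the known-answer test are illustrative stand-ins for on-die test hardware, not a description of any real design.

```python
from dataclasses import dataclass
from typing import List

# Known-answer test vector: an input and the result a healthy core must produce.
TEST_INPUT, EXPECTED = 12, 144


@dataclass
class Core:
    core_id: int
    faulty: bool = False  # hypothetical flag, used only to simulate a bad core

    def square(self, x: int) -> int:
        result = x * x
        # A faulty core silently corrupts its result.
        return result + 1 if self.faulty else result


def self_test(core: Core) -> bool:
    """Detect errors by comparing the core's output with the known answer."""
    return core.square(TEST_INPUT) == EXPECTED


def reconfigure(cores: List[Core]) -> List[Core]:
    """Keep only the cores that pass self-test; failing cores are mapped out."""
    return [core for core in cores if self_test(core)]


if __name__ == "__main__":
    pool = [Core(0), Core(1, faulty=True), Core(2)]
    healthy = reconfigure(pool)
    print([core.core_id for core in healthy])
```

Running the test periodically during operation, rather than once at the factory, lets the system keep adapting as components degrade over time.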

Page 39: Designing Reliable Systems With Unreliable Components

In a Nut-shell…

100 BT integration capacity

[Figure: 100 Billion Transistors.]

• Billions unusable (variations)
• Some will fail over time
• Intermittent failures

Yet, deliver high performance in the power & cost envelope

Page 40: Designing Reliable Systems With Unreliable Components

Summary

• Near term:
  – Optimum frequency & µArchitecture
  – Lots of memory & Multi-everywhere
  – Valued performance with higher integration
• Paradigm shift:
  – From deterministic to probabilistic design, with multi-variate optimization
  – Evolution of regular design fabric
• Long term:
  – Reliable systems with unreliable components
  – Dynamic self-test, detect, reconfigure, & adapt