Techniques to Mitigate the Effects of Congenital Faults in Processors Smruti R. Sarangi

Upload: latoya

Post on 07-Feb-2016


Page 1: Techniques to Mitigate the Effects of Congenital Faults in Processors

Techniques to Mitigate the Effects of Congenital Faults in Processors

Smruti R. Sarangi

Page 2: Techniques to Mitigate the Effects of Congenital Faults in Processors

Process Variation

Corner rounding, edge shortening (courtesy IBM Microelectronics)

Page 3: Techniques to Mitigate the Effects of Congenital Faults in Processors

Semiconductor fabrication facility (courtesy tabalcoaching.com)

Page 4: Techniques to Mitigate the Effects of Congenital Faults in Processors

Photolithography unit (courtesy UPenn)

Page 5: Techniques to Mitigate the Effects of Congenital Faults in Processors

Basic Lithographic Process

The source of light is typically an argon-fluoride laser
The light passes through an array of lenses to reach the silicon substrate
The resolution limit is given by: $R = k_1 \lambda / \mathrm{NA}$, where $\mathrm{NA} = n \sin\theta$
To decrease the resolution limit R (print finer features) we need to:
  Decrease the wavelength
  Increase the refractive index
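As a rough numerical illustration of R = k1 λ / NA with NA = n sin θ, the sketch below compares a dry exposure with an immersion exposure. The 193 nm wavelength is that of an argon-fluoride laser; the values of k1, n and θ are assumed for illustration only and do not come from the slides.

```python
import math

def resolution_limit(k1: float, wavelength_nm: float, n: float, theta_deg: float) -> float:
    """Return R = k1 * lambda / NA in nm, with NA = n * sin(theta)."""
    numerical_aperture = n * math.sin(math.radians(theta_deg))
    return k1 * wavelength_nm / numerical_aperture

# Assumed illustrative parameters (not taken from the slides).
K1 = 0.4          # process-dependent factor
THETA_DEG = 60.0  # half-angle of the light cone

dry = resolution_limit(K1, 193.0, n=1.0, theta_deg=THETA_DEG)   # air between lens and wafer
wet = resolution_limit(K1, 193.0, n=1.44, theta_deg=THETA_DEG)  # assumed water immersion

print(f"dry: {dry:.1f} nm, immersion: {wet:.1f} nm")
```

A higher refractive index raises NA and therefore lowers R, which is the second lever listed on the slide.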

Page 6: Techniques to Mitigate the Effects of Congenital Faults in Processors

Parameter Variation

Sources of parameter variation (PVT): Process (P), Supply Voltage (V), Temperature (T)
Process parameters: Threshold Voltage – Vt, Transistor Length – Leff

Page 7: Techniques to Mitigate the Effects of Congenital Faults in Processors

Why is Variation a Problem?

Unpredictability of Vt, Leff and T implies lower chip frequency and higher leakage

courtesy Shekhar Borkar, Intel

Page 8: Techniques to Mitigate the Effects of Congenital Faults in Processors

Implications on Design Decisions

Static timing analysis not possible
Overly conservative designs
  Chips too slow
  Performance of a generation lost
Possible solution
  Clock the chip at an unsafe frequency
  Tolerate resulting timing errors
  Reduce timing errors
    Architectural techniques
    Circuit techniques

Page 9: Techniques to Mitigate the Effects of Congenital Faults in Processors

Overview

Model for Process Variation
Model for Timing Errors due to Process Variation
Techniques to Tolerate Timing Errors
Techniques to Reduce Timing Errors
Dynamic Optimization

Page 10: Techniques to Mitigate the Effects of Congenital Faults in Processors

Process Variation

Systematic variation: lens aberrations, mask deformities, thickness variation in CMP, photolithographic effects
Random variation: variable dopant density, line edge roughness

Page 11: Techniques to Mitigate the Effects of Congenital Faults in Processors

Modeling Systematic Variation

Break into a million cells
Figure: variation map

Page 12: Techniques to Mitigate the Effects of Congenital Faults in Processors

Systematic and Random Variation

Superimpose random variation on top of systematic variation
Random component: normal distribution
Systematic components: normal distribution with spatial correlation, i.e. a multivariate normal distribution
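A minimal sketch of how such a map could be generated: a spatially correlated multivariate normal field for the systematic component, with independent normal noise superimposed for the random component. The exponential correlation function, the grid size, and all parameter names and values are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def variation_map(grid=20, sigma_sys=0.05, sigma_rand=0.05, corr_len=5.0, seed=0):
    """Return a grid x grid map of relative parameter deviation (e.g. of Vt).

    Systematic part: multivariate normal with an assumed exponential
    spatial correlation. Random part: independent normal per cell.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:grid, 0:grid]
    pts = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)

    # Pairwise distances between cells, then an assumed correlation model.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    cov = sigma_sys**2 * np.exp(-dist / corr_len)

    systematic = rng.multivariate_normal(np.zeros(len(pts)), cov)
    random_part = rng.normal(0.0, sigma_rand, size=len(pts))
    return (systematic + random_part).reshape(grid, grid)

vmap = variation_map()
print(vmap.shape, float(vmap.std()))
```

A coarse grid keeps the covariance matrix small here; a map with a million cells would need a more scalable sampling method than a dense covariance matrix.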

Page 13: Techniques to Mitigate the Effects of Congenital Faults in Processors

Overview

Model for Process Variation
Model for Timing Errors due to Process Variation (ISQED '07)
Techniques to Tolerate Timing Errors
Techniques to Reduce Timing Errors
Dynamic Optimization

Page 14: Techniques to Mitigate the Effects of Congenital Faults in Processors

Figure: distribution of path delays in a pipe stage, with no variation and with variation, with the timing-error region marked
$P(E) = 1 - \mathrm{cdf}(t_{clk})$

Page 15: Techniques to Mitigate the Effects of Congenital Faults in Processors

Model for Timing Errors

Basic assumptions
  A structure consists of many critical paths
  The critical path depends on the input
  Critical path delay > clock period → timing error
  Clock period = delay of the longest critical path at maximum temperature with no variation
  All pipeline stages are tightly designed → 0 slack

Page 16: Techniques to Mitigate the Effects of Congenital Faults in Processors

Paths in a Pipeline Stage

Error rate: $P_E(t) = 1 - \mathrm{cdf}(t)$
Figure: pdf(t) and cdf(t) of the path delays in a stage plotted against t, with the timing-error region marked
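The relation P_E(t) = 1 - cdf(t) can be evaluated directly once a delay distribution is chosen. The sketch below assumes, purely for illustration, that path delays in a stage are normally distributed; the mean and sigma are hypothetical values normalized to a nominal clock period of 1.

```python
import math

def normal_cdf(t: float, mean: float, sigma: float) -> float:
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sigma * math.sqrt(2.0))))

def error_rate(t_clk: float, mean: float, sigma: float) -> float:
    """P_E(t_clk) = 1 - cdf(t_clk): fraction of paths slower than the clock period."""
    return 1.0 - normal_cdf(t_clk, mean, sigma)

# Hypothetical stage: delays normalized to the nominal clock period.
MEAN, SIGMA = 0.9, 0.05
for t_clk in (1.10, 1.00, 0.95):
    print(f"t_clk = {t_clk:.2f}: P_E = {error_rate(t_clk, MEAN, SIGMA):.3e}")
```

Shortening the clock period (raising the frequency) moves t_clk into the body of the distribution, so the error rate climbs quickly.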

Page 17: Techniques to Mitigate the Effects of Congenital Faults in Processors

Basic Kinds of Structures

Logic: heterogeneous critical paths (ALUs, comparators, sense-amps)
Memory: homogeneous critical paths (SRAMs, CAMs)
Mixed: x% memory and (100-x)% logic; used to model renamer, wakeup/select

Page 18: Techniques to Mitigate the Effects of Congenital Faults in Processors

Logic

Critical path: 35% wiring (Elmore delay model), 65% gates (alpha-power law)
$T_g \propto \dfrac{L_{eff}\, V_{DD}}{\mu(T)\,(V_{DD} - V_{th})^{\alpha}}$
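To see how the alpha-power law turns parameter variation into gate-delay variation, the sketch below evaluates T_g ∝ Leff · VDD / (μ(T) · (VDD − Vth)^α) up to a constant factor. The value of alpha, the nominal Vth, and the treatment of mobility as a fixed relative factor are assumptions for illustration, not values from the slides.

```python
def relative_gate_delay(l_eff: float, v_dd: float, v_th: float,
                        mobility: float = 1.0, alpha: float = 1.3) -> float:
    """Gate delay up to a constant factor, following the alpha-power law.

    l_eff and mobility are relative to nominal; v_dd and v_th are in volts.
    alpha = 1.3 is an assumed value.
    """
    return l_eff * v_dd / (mobility * (v_dd - v_th) ** alpha)

# Assumed nominal point and a slower transistor (longer Leff, higher Vth).
nominal = relative_gate_delay(l_eff=1.00, v_dd=1.0, v_th=0.250)
slow = relative_gate_delay(l_eff=1.05, v_dd=1.0, v_th=0.275)

print(f"delay increase: {100.0 * (slow / nominal - 1.0):.1f}%")
```

Only ratios of the returned values are meaningful, since the proportionality constant is dropped.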

Page 19: Techniques to Mitigate the Effects of Congenital Faults in Processors

Logic Delay

Dvarlogic = (dwire + [relative gate delay] * dgate) * Dlogic + dgate * Dextra

Dlogic: distribution of path delays with no variation; obtain Dlogic using a timing analysis tool
Dvarlogic: distribution of path delays with variation
[relative gate delay]: relative gate delay due to systematic variation in P, V, T
Dextra: delay due to variation in the random and syst. components within a stage
dwire + dgate = 1

Page 20: Techniques to Mitigate the Effects of Congenital Faults in Processors

Memory Delay

Memory cell: use Kirchhoff's equations, long channel transistor equations and a multi-variable Taylor expansion to obtain the delay distribution
Memory line: Delayline = max(Delaycell), i.e. the max. distribution
Extends the analysis done by Roy et al., IEEE TCAD '05

Page 21: Techniques to Mitigate the Effects of Congenital Faults in Processors

Combined Error Model

We have the delay distributions – cdf(t) – for memory and logic with variation
For each structure, per access: $P(E) = 1 - \mathrm{cdf}(t)$
For each structure, per instruction: $\rho \, P(E)$, where $\rho$ = accesses/inst.
Combined error rate per instruction: $P(E)_{total} = \sum \rho \, P(E)$
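As a small worked example of the combined model, the sketch below weights each structure's per-access error probability by its accesses per instruction and sums over structures. The structure names and all numbers are hypothetical.

```python
def combined_error_rate(structures):
    """Sum of (accesses per instruction) * (error probability per access)."""
    return sum(accesses_per_inst * p_error
               for accesses_per_inst, p_error in structures.values())

# Hypothetical structures: (accesses per instruction, P(E) per access).
structures = {
    "issue_queue": (1.0, 1.0e-6),
    "alu":         (0.6, 5.0e-7),
    "dcache":      (0.3, 2.0e-6),
}

total = combined_error_rate(structures)
print(f"P(E) per instruction = {total:.2e}")
```

Summing the terms treats errors in different structures as rare enough that overlaps can be ignored, which is an assumption of this sketch.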

Page 22: Techniques to Mitigate the Effects of Congenital Faults in Processors

Validation – Logic

S. Das et al. '05

Page 23: Techniques to Mitigate the Effects of Congenital Faults in Processors

Overview

Model for Process Variation
Model for Timing Errors due to Process Variation
Techniques to Tolerate Timing Errors
Techniques to Reduce Timing Errors
Dynamic Optimization

Page 24: Techniques to Mitigate the Effects of Congenital Faults in Processors

Variation Aware Timing Speculation (VATS)

Figure labels: multicore chip, processor, core, DIVA checker, L0 cache, L1 cache, Razor latches
Core: unsafe frequency
Checker: error free (lower freq., safe design)

Page 25: Techniques to Mitigate the Effects of Congenital Faults in Processors

Other VATS Checkers

TIMERRTOL – Uht et al.
Razor – Dan Ernst et al., MICRO 2003
X-Checker – X. Vera et al., SELSE 2006
X-Pipe – X. Vera et al., ASGI 2006
Sato and Arita, COSLP 2003

Page 26: Techniques to Mitigate the Effects of Congenital Faults in Processors

Overview

Model for Process Variation
Model for Timing Errors due to Process Variation
Techniques to Tolerate Timing Errors
Techniques to Reduce Timing Errors (submitted to ISCA '07)
Dynamic Optimization

Page 27: Techniques to Mitigate the Effects of Congenital Faults in Processors

Basic Mechanisms – Shift and Tilt

Figure: plots of error rate (PE) versus frequency, before and after, one panel showing a tilt of the curve and one showing a shift

Page 28: Techniques to Mitigate the Effects of Congenital Faults in Processors

Architectural Mechanisms

Resizable issue queue (Albonesi et al.)
  Switch pass transistors off → smaller queue
  Shifts the error rate curve
Figure: SRAM/CAM arrays separated by pass transistors, with sense amps; original and new error rate

Page 29: Techniques to Mitigate the Effects of Congenital Faults in Processors

Gate Sizing

Transistor width – W
Delay scales as A + B/W; power scales as W
Make faster paths slower to save power
Figure: original path delay distribution and the distribution after gate sizing

Page 30: Techniques to Mitigate the Effects of Congenital Faults in Processors

Optimization: Replicate ALUs

Tradeoff is power vs. errors
Idea: switch between the two ALUs
  Use the gate-sized ALU if it is not timing critical, and vice versa
Figure: difference in error rate

Page 31: Techniques to Mitigate the Effects of Congenital Faults in Processors

Fine Grain ABB and ASV

Adaptive Body Bias (ABB) – Vbb
  Vbb ↑ (forward bias): delay ↓, leakage ↑
  Vbb ↓ (reverse bias): delay ↑, leakage ↓
Adaptive Supply Voltage (ASV) – Vdd
  Vdd ↑: delay ↓, leakage ↑, dynamic power ↑
Vary: supply voltage (ASV), body voltage (ABB)
Figure: multicore chip and core; error rate (PE) versus frequency

Page 32: Techniques to Mitigate the Effects of Congenital Faults in Processors

Overview

Model for Process Variation
Model for Timing Errors due to Process Variation
Techniques to Tolerate Timing Errors
Techniques to Reduce Timing Errors
Dynamic Optimization

Page 33: Techniques to Mitigate the Effects of Congenital Faults in Processors

Dynamic Behavior

Temperature

Activity Factors

Page 34: Techniques to Mitigate the Effects of Congenital Faults in Processors

Formulate an Optimization Problem

Constraints
  Temperature – at all points T < TMAX
  Power – total core power < PMAX
  Error – total errors < ErrMAX
Goal – maximize performance
Figure: optimization block with input, constraints and goals producing the output

Page 35: Techniques to Mitigate the Effects of Congenital Faults in Processors

Outputs

15 ABB/ASV regions → 30 values of (Vdd, Vbb)
Outputs: 1 (f) + 30 (Vdd, Vbb) + 1 (ALU) + 1 (issue queue size) = 33
f, Vdd, Vbb can take many values
Very large state space

Page 36: Techniques to Mitigate the Effects of Congenital Faults in Processors

Dimensionality Reduction

Find the max. frequency that each stage can support
Find the slowest stage
This is the core frequency
Minimize power in the rest of the units
Figure: max. frequency per stage (stages 1 to 7); the minimum frequency is the core frequency
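The four steps can be sketched on a toy table. Everything below is hypothetical: the stage names, the (Vdd, Vbb) settings, and the (max frequency, power) entries stand in for values that would come from the models described in the talk.

```python
# Hypothetical data: for each stage, each (Vdd, Vbb) setting maps to
# (max frequency in GHz the stage supports within its constraints, power in W).
STAGES = {
    "fetch": {(1.0, 0.0): (5.0, 2.0), (1.1, 0.0): (5.6, 2.6), (1.1, 0.2): (6.0, 3.1)},
    "issue": {(1.0, 0.0): (4.6, 3.0), (1.1, 0.0): (5.1, 3.8), (1.1, 0.2): (5.4, 4.5)},
    "alu":   {(1.0, 0.0): (5.5, 2.5), (1.1, 0.0): (6.1, 3.2), (1.1, 0.2): (6.5, 3.9)},
}

def stage_fmax(table):
    """Step 1: the highest frequency this stage can support over all settings."""
    return max(freq for freq, _power in table.values())

def choose_operating_point(stages):
    # Steps 2 and 3: the slowest stage sets the core frequency.
    f_core = min(stage_fmax(table) for table in stages.values())
    # Step 4: every stage picks its lowest-power setting that still meets f_core.
    settings = {}
    for name, table in stages.items():
        feasible = {s: power for s, (freq, power) in table.items() if freq >= f_core}
        settings[name] = min(feasible, key=feasible.get)
    return f_core, settings

f_core, settings = choose_operating_point(STAGES)
print(f_core, settings)
```

Splitting the problem this way replaces one search over all outputs jointly with independent per-stage searches, which is the point of the reduction.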

Page 37: Techniques to Mitigate the Effects of Congenital Faults in Processors

Inputs

Inputs: activity factor, accesses/cycle, TH (heat sink temperature), Vt0, Rth (thermal resistance), Kleak (constant in leakage eqn.)
Time scales shown for the inputs: phase, heat sink cycle, forever

Page 38: Techniques to Mitigate the Effects of Congenital Faults in Processors

Optimization Overview

Figure: a Freq. Algorithm per region takes the inputs and produces f(1) ... f(15); fcore = min of these; a Power Algorithm per region takes the inputs and fcore and produces Vdd and Vbb

Page 39: Techniques to Mitigate the Effects of Congenital Faults in Processors

Fuzzy Logic Based Algorithm

Exhaustive search (Freq/Power)
  - Computationally expensive
  - Requires detailed models
  + Accurate results
Fuzzy logic based algorithm
  + Very fast computation times
  + Incorporates detailed models
  - Slight inaccuracy

Page 40: Techniques to Mitigate the Effects of Congenital Faults in Processors

Final Picture

Figure: Fuzzy SubController 1 ... Fuzzy SubController 15 take the inputs and produce f(1) ... f(15); fcore = min of these; a second set of Fuzzy SubControllers 1 ... 15 takes the inputs and fcore and produces Vdd and Vbb

Page 41: Techniques to Mitigate the Effects of Congenital Faults in Processors

Timeline

Phase: 120 ms
Heat sink cycle: 2-3 secs
Retuning cycles, slide labels in order along t: new phase detected; 20 s; measure IPC and i; 0.5 s; 1 step; 2 ms; test configuration; 6 s; STOP; run fuzzy controller algorithm; 10 s; bring to chosen working point; 2 ms

Page 42: Techniques to Mitigate the Effects of Congenital Faults in Processors

Results

Page 43: Techniques to Mitigate the Effects of Congenital Faults in Processors

Evaluation Framework

Processor modeled
  Athlon 64 floorplan
  3-wide processor
  12 stage pipeline
  45 nm, Vdd = 1 V, 6 GHz
4-core, private L2 cache
Sherwood phase detector (ISCA '03)
Variation modeling
  PVT maps for 100 dies
Fuzzy controller
  10,000 training examples
  25 rules
10 SpecInt and 10 SpecFp benchmarks, 1 billion insts.

Page 44: Techniques to Mitigate the Effects of Congenital Faults in Processors

Terminology

Baseline      Proc. with variation effects
TS            Baseline + DIVA checker
TS+FU         TS + FU replication
TS+Queue      TS + issue-queue resizing
TS+ABB+ASV    Both circuit level techniques
TS+Dyn        TS + dynamic optimization
TS+All        TS+FU+Queue+ABB+ASV+dyn
NoVar         Without any variation effects

Page 45: Techniques to Mitigate the Effects of Congenital Faults in Processors

Error Plots

Figure: two panels, "TS only" and "ALL = TS + ABB + ASV", each marking the maximum perf. point and ErrMAX

Page 46: Techniques to Mitigate the Effects of Congenital Faults in Processors

Execution Point

Figure: execution point in the space of power, frequency and log(timing error rate), with constant-error (frequency vs. power), constant-freq. (power vs. errors) and constant-power (frequency vs. errors) views

Page 47: Techniques to Mitigate the Effects of Congenital Faults in Processors

Frequency

Frequency increase: 10-49%
50% of the gains are due to dynamic opts.
Figure: frequency chart comparing Static, Fuzzy and Oracle (callouts: 49%, 23%)

Page 48: Techniques to Mitigate the Effects of Congenital Faults in Processors

Performance

We can nullify the effects of variation and even speed up
The performance loss due to fuzzy logic is minimal
Figure: performance chart (callouts: 19%, 34%; label: Static)

Page 49: Techniques to Mitigate the Effects of Congenital Faults in Processors

Conclusion

Do not design processors for worst case
Need to tolerate variation induced errors
Contributions
  Model for timing errors
  New framework for tradeoffs in P, f and P(E)
  High dimensional dynamic adaptation
  Eval. of arch. techniques to tolerate/mitigate P(E)
10-49% increase in frequency
7-34% increase in performance

Page 50: Techniques to Mitigate the Effects of Congenital Faults in Processors

Conclusion II

CADRE (DSN'06): arch. support to make a board level computer cycle-accurate deterministic
Phoenix (MICRO'06 & Top Picks'07): arch. support to detect and patch processor design bugs

Page 51: Techniques to Mitigate the Effects of Congenital Faults in Processors

BACKUP

Page 52: Techniques to Mitigate the Effects of Congenital Faults in Processors

Algorithm

Figure: flow for a candidate f, Vdd, Vbb: Pdyn and Pleak (from Pleak0, Vt) give T (with Rth, TH); verify T < TMAX; delay (from Vt) feeds the error model to find fmax; verify Err < ErrMAX
Inputs: …, Rth, TH, …, Pleak0, Vt

Page 53: Techniques to Mitigate the Effects of Congenital Faults in Processors

Memory Delay

Solve for Icell using long channel eqns.
Icell = f(VtX, VtY, LX, LY)
VtX, VtY, LX and LY are Gaussian variables, each with a systematic component and a random component
$T_{mem} \propto 1 / I_{cell}$
Figure: memory cell with VDD, bit lines BL and BR, word line WL, transistors X and Y, and cell current Icell

Page 54: Techniques to Mitigate the Effects of Congenital Faults in Processors

Memory Delay - II

Find a distribution for Tmem
  Tmem is a function of four Gaussian variables
  Model Tmem as a normal distribution
  Find the mean (μ) and standard deviation (σ) for Tmem using a multi-variable Taylor expansion
This is the access time dist. for 1 bit
A typical entry has 32-128 bits
  Find the max distribution of 32-128 normal variables
Error probability = 1 – cdf(tmem)
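A minimal sketch of the last two steps, under a simplifying assumption that is not in the slides: the bits of an entry are treated as independent and identically distributed, so the cdf of the slowest bit is the single-bit cdf raised to the number of bits. The mean and sigma are hypothetical, normalized to a clock period of 1.

```python
import math

def normal_cdf(t: float, mean: float, sigma: float) -> float:
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sigma * math.sqrt(2.0))))

def line_error_probability(t: float, mean: float, sigma: float, bits: int) -> float:
    """P(slowest of `bits` cells exceeds t) = 1 - cdf_cell(t) ** bits.

    Assumes independent, identically distributed cell access times.
    """
    return 1.0 - normal_cdf(t, mean, sigma) ** bits

# Hypothetical single-bit access-time distribution.
MEAN, SIGMA, T_CLK = 0.80, 0.05, 1.0
for bits in (1, 32, 128):
    print(f"{bits:3d} bits: P(E) = {line_error_probability(T_CLK, MEAN, SIGMA, bits):.3e}")
```

Wider entries fail more often at the same clock period because only the slowest bit matters. Spatially correlated cells would violate the independence assumed here.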

Page 55: Techniques to Mitigate the Effects of Congenital Faults in Processors

Fuzzy Low Level

$W_{ij} = \exp\left[-\left(\dfrac{X_j - \mu_{ij}}{\sigma_{ij}}\right)^2\right]$
$W_i = \prod_j W_{ij}$
Final output: $y = \dfrac{\sum_i W_i \, y_i}{\sum_i W_i}$
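The evaluation step of such a controller can be sketched as follows, assuming Gaussian membership weights W_ij = exp(-((X_j - mu_ij)/sigma_ij)^2), a rule weight W_i equal to the product of its W_ij, and a final output equal to the W_i-weighted average of the rule outputs y_i. The names mu and sigma, the product form, and the rule values are assumptions for illustration; training of the rules is not shown.

```python
import math

def fuzzy_output(x, rules):
    """Evaluate a fuzzy controller for input vector x.

    rules: list of (mu, sigma, y_i), where mu and sigma hold one entry
    per input X_j and y_i is the output associated with rule i.
    """
    numerator = 0.0
    denominator = 0.0
    for mu, sigma, y_i in rules:
        w_i = 1.0
        for x_j, mu_ij, sigma_ij in zip(x, mu, sigma):
            w_i *= math.exp(-(((x_j - mu_ij) / sigma_ij) ** 2))  # W_ij
        numerator += w_i * y_i
        denominator += w_i
    return numerator / denominator

# Two hypothetical rules over a single input.
RULES = [([0.0], [1.0], 2.0), ([2.0], [1.0], 4.0)]
print(fuzzy_output([1.0], RULES))  # input midway between the two rule centers
print(fuzzy_output([0.0], RULES))  # input at the center of the first rule
```

Evaluation is a fixed number of multiplications and exponentials per rule, which fits the "very fast computation times" claimed for the fuzzy approach.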

Page 56: Techniques to Mitigate the Effects of Congenital Faults in Processors

Recovery Penalty

Page 57: Techniques to Mitigate the Effects of Congenital Faults in Processors

Validation – Memory

Page 58: Techniques to Mitigate the Effects of Congenital Faults in Processors

Power

Max Power Limit
Proc. with no variation – 25 W, PMAX = 30 W