Page 1: Online Timing Variation Tolerance for Digital Integrated Circuits Guihai Yan & Xiaowei Li State Key Laboratory of Computer Architecture, Institute of Computing

Online Timing Variation Tolerance for Digital Integrated Circuits

Guihai Yan & Xiaowei Li

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences

(ICT, CAS)

Page 2

Sources of timing variation

PVT variation
• Dynamic: voltage and temperature fluctuations
• Static: process variation

Aging degradation
• NBTI, PBTI
• TDDB

Soft errors (in non-regular logic)
• SEU and SET

Page 3

Process variation
• Sub-wavelength lithography: “What you get is not what you want” (systematic)
• Random dopant fluctuations: Vth variation (random)

[Figure: lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) and technology generation (180nm, 130nm, 90nm, 65nm, 45nm, 32nm) plotted from 1980 to 2020 on a scale from 1µm down to 10nm, with the gap between the two curves marked.]

Maximum frequency can differ by 20%! [Teodorescu, ISCA’08]

P variation is time-independent, “DC component”

Page 4

Temperature variation

• Application-specific
• Slow-varying: milliseconds
• Typical thermal constant: 2 ms

[Donald, ISCA’06]

T variation is slow-varying, “Low-frequency components”

[Figure: quad-core layout (Core1 to Core4) with four EL Synthesizers and a TM Agent.]

Page 5

Voltage variation
• Fast-changing
• Inductive noise, a.k.a. the L(di/dt) problem
• IR-drop

Why is it harder to keep a constant voltage level?
Example: power budget 100 W, working voltage 1 V, hence current 100 A. To keep the voltage fluctuation within ±5% (50 mV), R_PDN < 0.5 mOhm.

[Figure: PDN hierarchy model]

V variation is fast-changing, “High-frequency components”
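The R_PDN bound in the power-budget example follows from Ohm's law. A minimal sketch of that arithmetic is shown below; the function name `max_pdn_resistance` and its parameters are illustrative assumptions, not part of the original slides.

```python
def max_pdn_resistance(power_w: float, vdd_v: float, tolerance: float) -> float:
    """Largest PDN resistance (ohms) that keeps the IR drop within tolerance * Vdd."""
    current_a = power_w / vdd_v          # I = P / V  (100 W / 1 V = 100 A)
    allowed_drop_v = tolerance * vdd_v   # 5% of 1 V = 50 mV
    return allowed_drop_v / current_a    # R = dV / I


# Numbers taken from the slide: 100 W budget, 1 V supply, +/-5% tolerance.
r_pdn = max_pdn_resistance(power_w=100.0, vdd_v=1.0, tolerance=0.05)
print(f"R_PDN must stay below {r_pdn * 1e3:.2f} mOhm")
```

The sketch covers only the resistive (IR-drop) part of the budget; the inductive L(di/dt) noise is not modeled here.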

Page 6

Aging degradation

Aging mechanisms
• NBTI (PMOS)
• PBTI (NMOS)
• TDDB

20% degradation in 10 years

[Figure: failure rate versus lifetime, with infant mortality, useful time, and aging phases.]

Page 7

Soft errors
• SEU (Single Event Upset): unintentional bit-flip in storage cells
• SET (Single Event Transient): transient voltage pulse propagating in combinational logic

[Figure: a flip-flop (clk) and combinational logic with signals Si and So, marking where an SEU and an SET strike.]

Page 8

Outline

TEA-TM
• Timing emergency-aware thread migration
• PVT variations co-optimization

SVFD
• Stability violation based fault detection
• On-line fault detection via timing sensing
• Delay faults, aging delay, soft errors

MicroFix
• Margin reduction with timing sensing
• Application to DVFS

ReviveNet
• Aging-delay tolerance

Page 9

TEA-TM: Timing Emergency-Aware Thread Migration

Focus on the essential timing issue

The variations do not necessarily aggregate; in some cases they can cancel each other out. Hence, “complementary”.

[Figure: process variation, voltage variation, and temperature variation each contribute a (+, -) term to the timing variation.]

Page 10

Some terms

• Timing emergency (TE)
• Emergency level (EL): the “density” of TEs. Define: EL = number of TEs per 100 million cycles

[Figure: delay versus time; a timing emergency is marked where the delay crosses the threshold.]

          Process        Voltage             Temperature
Violent   Slow corner    Large fluctuation   Hot
Mild      Fast corner    Small fluctuation   Cool
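The EL definition above can be illustrated with a short sketch. It assumes one delay sample per clock cycle and counts every sample above the threshold as one timing emergency; both assumptions, the function name, and the sample values are hypothetical and not taken from the slides.

```python
def emergency_level(delays, threshold):
    """EL = number of timing emergencies per 100 million cycles.

    Assumes one delay sample per cycle; a sample above `threshold`
    counts as one timing emergency (TE).
    """
    total_cycles = len(delays)
    if total_cycles == 0:
        return 0.0
    emergencies = sum(1 for d in delays if d > threshold)
    return emergencies * 100_000_000 / total_cycles


# Hypothetical normalized path delays for eight consecutive cycles.
samples = [0.90, 0.95, 1.02, 0.97, 1.05, 0.93, 0.99, 1.01]
el = emergency_level(samples, threshold=1.0)
print(f"EL = {el:,.0f} TEs per 100 million cycles")
```

Normalizing to a fixed cycle count makes EL values comparable across observation windows of different lengths.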

Page 11

How do PVT variations complement each other?

• Observation in the time domain

Core1: T mild, V mild. Large margin, low EL.
Core2: T violent, V violent. Little margin, high EL.

What if we exchange the threads on Core1 and Core2?

[Figure: delay versus time against the threshold for four combinations: T violent, V violent (emergency); T mild, V mild (excessive headroom); T mild, V violent and T violent, V mild (mild + violent).]
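The thread-exchange observation can be sketched as a toy model. It assumes that the delay contributions simply add, that the temperature component stays with the core while the voltage component follows the thread, and it uses made-up swing and threshold values; none of these numbers or names come from the slides.

```python
from dataclasses import dataclass


@dataclass
class CoreState:
    name: str
    t_swing: float  # delay swing from temperature (slow, stays with the core)
    v_swing: float  # delay swing from voltage noise (fast, follows the thread)


def worst_case_delay(core: CoreState, base_delay: float = 1.0) -> float:
    """Additive toy model: base delay plus both swings."""
    return base_delay + core.t_swing + core.v_swing


def exchange_threads(a: CoreState, b: CoreState):
    """Swap only the V components, which travel with the threads."""
    return (CoreState(a.name, a.t_swing, b.v_swing),
            CoreState(b.name, b.t_swing, a.v_swing))


THRESHOLD = 1.25            # hypothetical delay threshold
MILD, VIOLENT = 0.05, 0.15  # hypothetical swing magnitudes

core1 = CoreState("Core1", MILD, MILD)        # T mild, V mild
core2 = CoreState("Core2", VIOLENT, VIOLENT)  # T violent, V violent

before = [worst_case_delay(c) > THRESHOLD for c in (core1, core2)]
after = [worst_case_delay(c) > THRESHOLD for c in exchange_threads(core1, core2)]
print("emergency before exchange:", before)
print("emergency after exchange: ", after)
```

In this toy setting Core2 exceeds the threshold before the exchange, while both "mild + violent" combinations stay under it afterwards.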

Page 12

Frequency domain analysis

Migrating threads = “grafting” the V component

[Figure: for Core1 and Core2, delay versus time (with threshold DTH) and the frequency spectrum of the delay deviation, split into P, T, and V components, before and after thread migration (TM).]

Page 13

Frequency domain analysis (cont.)

[Figure: relative frequency spectrum deviations on a 2 GHz quad-core processor. P: 0-100 Hz, T: 100 Hz-1 MHz, V: 1 MHz-250 MHz.]

Potential: Core3 and Core4 are mild

Strategy: exchange the threads on Core1 and Core4, and on Core2 and Core3
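The band split and the exchange strategy above can be sketched as follows. The band edges come from the slide; the pairing rule (most violent core with mildest core) and the deviation numbers are assumptions made for illustration, not the heuristic used in TEA-TM.

```python
# Band edges in Hz, as given on the slide.
BANDS = {"P": (0.0, 100.0), "T": (100.0, 1e6), "V": (1e6, 250e6)}


def band_of(freq_hz: float):
    """Return 'P', 'T', or 'V' for a spectral component, or None if out of range."""
    for name, (lo, hi) in BANDS.items():
        if lo <= freq_hz < hi:
            return name
    return None


def pair_cores(deviation: dict):
    """Pair the most violent core with the mildest one, then work inward."""
    ranked = sorted(deviation, key=deviation.get, reverse=True)
    return [(ranked[i], ranked[-1 - i]) for i in range(len(ranked) // 2)]


# Hypothetical relative spectrum deviations: Core1/Core2 violent, Core3/Core4 mild.
deviation = {"Core1": 0.9, "Core2": 0.7, "Core3": 0.3, "Core4": 0.2}
pairs = pair_cores(deviation)
print(pairs)
```

With these illustrative numbers the pairing reproduces the strategy on the slide: Core1 with Core4, and Core2 with Core3.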

Page 14

TEA-TM Summary

[Figure: quad-core processor (Core1 to Core4) with four EL Synthesizers and a TM Agent.]

• Analyzing the complementary effect in both the time and frequency domains
• Presenting a delay-sensor-based scheme (TEA-TM) to exploit the complementary effect
• Simple, cost-efficient, FFT-like heuristic

Guihai Yan, Xiaoyao Liang, Yinhe Han, Xiaowei Li, Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors, In the Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France, pp.485-496, Jun. 2010.

Throughput: 30%

Fairness: 80%

Page 15

Stability Violation

Stable Period vs. Variable Period

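[Figure: timing diagram of signals Si and So over one clock cycle, from (n-1)T to nT, divided into a stable period and a variable period, with time points t1 and t2.]

Stability violation: signal transitions occur in the stable period.

[Figure: flip-flop, combinational logic, flip-flop, both flip-flops clocked by clk, with signals Si and So.]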
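The stability-violation definition can be sketched in software terms. The sketch assumes the stable period is a fixed window inside every clock cycle, given as offsets from the cycle start; the window placement, the picosecond values, and the function name are hypothetical, since the slides do not give them.

```python
def stability_violations(transitions_ps, period_ps, stable_start_ps, stable_end_ps):
    """Return the transitions whose offset within the clock cycle falls in the stable period.

    The stable period is modeled as the half-open window
    [stable_start_ps, stable_end_ps) measured from the start of each cycle.
    """
    violations = []
    for t in transitions_ps:
        offset = t % period_ps  # position of the transition inside its cycle
        if stable_start_ps <= offset < stable_end_ps:
            violations.append(t)
    return violations


# Hypothetical 1000 ps clock whose last 200 ps of each cycle must be stable.
transitions = [150, 620, 1850, 2300, 3990]
found = stability_violations(transitions, period_ps=1000,
                             stable_start_ps=800, stable_end_ps=1000)
print(found)
```

Integer picoseconds are used so that the modulo arithmetic stays exact.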

Page 16

In what situations would SVs occur?

• Delay faults resulting from
  – Delay defects (introduced in manufacturing processes)
  – Aging (wearout) induced performance degradation

[Figure: two clock cycles of period T; due to a delay fault, a transition falls inside the setup time, causing a setup time violation.]

Thus, stability violations caused by delay faults do not differ much from “setup time violations”.

• But can soft errors be modeled by SVs? YES!

Page 17

How do Soft Errors cause SV?

[Figure: a flip-flop (clk) and combinational logic with signals Si and So.]

• SEU: Si violates the stability requirement!
• SET: So violates the stability requirement!

[Figure: two timing diagrams of Si and So from (n-1)T to nT, each divided into a stable period and a variable period, with time points t1 and t2.]

Notice: ONLY the SVs occurring in the “vulnerable window”, within which the flip-flops are updated, can cause failures.

Page 18

Delay faults and soft errors can be modeled as stability violations.

The next problem is: how to detect stability violations?

Low-cost stability checker

[Figure: transistor-level schematic (transistors M1 to M12, internal nodes S1 to S5, clocks CLKS and CLKG) with three blocks: stability checker, compressor, and output latch. Other labels: XOR protection, flip-flop, latches. Outputs: "Soft error/Delay fault detected" and "Aging delay detected".]

Page 19

Some results

Implementation
• SVFD-protected FPU
• Using 65nm PTM, HSPICE simulation

• Guihai Yan, Yinhe Han, Xiaowei Li, A Unified Online Fault Detection Scheme via Checking of Stability Violation, IEEE/ACM Design, Automation and Test in Europe (DATE’09), pp.496-501, 2009.

• Guihai Yan, Yinhe Han, Xiaowei Li, SVFD: A Versatile Online Fault Detection Scheme via Checking of Stability Violation, IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), 19(9), Sep. 2011.

Page 20

Besides fault detection, what else can we do with SVFD?

Dynamic margin reduction
• MicroFix: an application to DVFS

Aging tolerance
• ReviveNet: fine-grained aging delay tolerance

Page 21

Dynamic margin reduction

[Figure: target pipeline with the (K-1)th and Kth stage logic and their flip-flops; timing sensors generate delay error prediction signals for the voltage/frequency control.]

[Figure: timing sensors setup. Clock waveforms CLK, FCLK, and BCLK with T×TH intervals marked, under normal and conservative voltage supply; flip-flop types UAFF, FAFF, GFF, and BAFF.]

Page 22

Operational Principles

(a) Traditional DVFS: transitions between (V, F) operating points
• Reducing power: reduce the frequency, reduce the voltage
• Increasing performance: increase the voltage, increase the frequency

(b) MicroFix enhanced DVFS: the same transitions, plus monitoring at each operating point to restore a tight margin
• Voltage: V → V - v while no error is predicted; V → V + v when an error is predicted
• Frequency: F → F + f while no error is predicted; F → F - f when an error is predicted
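The monitoring rules of the MicroFix enhanced DVFS can be sketched as a simple control step. The step sizes, the millivolt and megahertz units, and the toy predictor below (an error is predicted whenever the supply drops under 930 mV) are assumptions for illustration only; in the slides the prediction comes from the timing sensors.

```python
def next_voltage_mv(v_mv: int, error_predicted: bool, step_mv: int = 10) -> int:
    """V -> V - v while no error is predicted, V -> V + v when one is."""
    return v_mv + step_mv if error_predicted else v_mv - step_mv


def next_frequency_mhz(f_mhz: int, error_predicted: bool, step_mhz: int = 25) -> int:
    """F -> F + f while no error is predicted, F -> F - f when one is."""
    return f_mhz - step_mhz if error_predicted else f_mhz + step_mhz


def toy_predictor(v_mv: int) -> bool:
    """Hypothetical stand-in for the timing sensors."""
    return v_mv < 930


# Start from a conservative 1000 mV supply and let the loop tighten the margin.
v = 1000
trace = []
for _ in range(12):
    v = next_voltage_mv(v, toy_predictor(v))
    trace.append(v)
print(trace)
```

In this toy run the voltage walks down from the conservative level and then hovers around the point where errors start to be predicted, which is the "tight margin" idea in miniature.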

Page 23

Fine-grained margin exploited

[Figure: a critical path and a non-critical path (P1, P2) between the flip-flops of the (K-1)th and Kth stages, each drawn against the cycle period.]

Flip-flop types
• Generous Flip-flop (GFF)
• Forward Adaptable Flip-flop (FAFF)
• Backward Adaptable Flip-flop (BAFF)
• Unadaptable Flip-flop (UAFF)

Localized timing imbalance

Page 24

Case study results

• Applied to an FPU
• 32nm PTM models
• TH = 0.2~0.3 is an optimal choice!
• Efficiency improvement: 35% EDP

Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, MicroFix: Using Timing Interpolation and Delay Sensors for Power Reduction, ACM Transactions on Design Automation of Electronic Systems (TODAES), 16(2), pp.1-21, 2011.

Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED’09), pp.395-400, 2009.

Page 25

Localized Aging Tolerance

[Figure: two timing diagrams over the clock period T, labeled with fresh delay, aging delay, guard band, delay fault, and detection slack.]

• A stability violation in the guard band is NOT a “timing violation”
• A stability violation in the detection slack is a “timing violation”: a delay fault

The chance for aging adaptation: we have the chance to “act before it’s too late”
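The distinction between the guard band and the detection slack can be sketched as a classifier. The sketch assumes the guard band lies just before the capturing clock edge and the detection slack just after it, with times given relative to that edge; this placement, the picosecond values, and the labels are assumptions, not details stated on the slide.

```python
def classify_transition(t_rel_edge_ps: int, guard_band_ps: int, detection_slack_ps: int) -> str:
    """Classify a signal transition by its time relative to the capturing clock edge.

    Negative times are before the edge, non-negative times are at or after it.
    """
    if -guard_band_ps <= t_rel_edge_ps < 0:
        return "aging alarm"     # late but still captured: not a timing violation
    if 0 <= t_rel_edge_ps < detection_slack_ps:
        return "delay fault"     # arrived after the edge: timing violation
    return "outside window"      # not inside either checked interval


# Hypothetical 100 ps guard band and 80 ps detection slack.
labels = [classify_transition(t, 100, 80) for t in (-300, -40, 25, 200)]
print(labels)
```

A transition that lands in the guard band is the early warning that leaves time for adaptation before a real timing violation occurs.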

Page 26

Nudge for timing margin

• Dynamic time borrowing
• Path-grained, NOT stage-grained

[Figure: ReviveNet. The (K-1)th and Kth pipeline stages (flip-flops and logic) are monitored by aging sensors that raise aging alarms to adaptation agents; each agent also connects to the prior and next agents.]

Page 27

Aging sensors setup

Coarse-grained detection

[Figure: sensors 1 to n placed on the logic between the upstream and downstream flip-flops; several stability checkers are ORed into an output latch that raises the aging alarm. Labels include "Timing non-critical signals". Inset: transistor-level stability checker (M1 to M8, nodes S1 to S5, clocked by CLK).]

Page 28

Trial-based adaptation

[Figure: flip-flop types UAFF, FAFF, GFF, and BAFF, with MUXes selecting among CLK, FCLK, and BCLK under control of the agent; TH/2 offsets marked; data-in and data-out.]

Round-Robin Trial Adaptation (K)
The Kth Agent receives an aging emergency
FOR each adaptation state candidate
    Conduct a trial adaptation
    IF the emergency is eliminated
        THEN break (Adaptation succeeded!)
    ELSE
        Recover this trial adaptation
        IF all the adaptation states have been reached
            THEN break (Adaptation failed!)
END FOR

Adaptation latency is non-critical

Trial till success

Fine-grained adaptation
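The Round-Robin Trial Adaptation procedure can be sketched in Python as follows. The callbacks `apply_state`, `revert_state`, and `emergency_active`, as well as the state names in the demo, are hypothetical stand-ins for the agent's hardware controls and aging sensors.

```python
from typing import Callable, Iterable, Optional, TypeVar

S = TypeVar("S")


def round_robin_trial_adaptation(
    candidates: Iterable[S],
    apply_state: Callable[[S], None],
    revert_state: Callable[[S], None],
    emergency_active: Callable[[], bool],
) -> Optional[S]:
    """Try each adaptation state in turn; return the one that clears the emergency."""
    for state in candidates:
        apply_state(state)           # conduct a trial adaptation
        if not emergency_active():   # the emergency is eliminated
            return state             # adaptation succeeded
        revert_state(state)          # recover this trial adaptation
    return None                      # every candidate was tried: adaptation failed


# Toy environment: only the (hypothetical) "borrow_backward" state clears the alarm.
applied = []


def apply_state(s):
    applied.append(s)


def revert_state(s):
    applied.remove(s)


def emergency_active():
    return "borrow_backward" not in applied


result = round_robin_trial_adaptation(
    ["borrow_forward", "borrow_backward"], apply_state, revert_state, emergency_active
)
print(result, applied)
```

Because failed trials are reverted before the next candidate is tried, at most one adaptation is in effect at any time, which fits the note that adaptation latency is non-critical.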

Page 29

Implementation

• False-alarm filter
• Sharing filters to reduce overhead

Guihai Yan, Yinhe Han, Xiaowei Li, ReviveNet: A Self-adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation, IEEE Transactions on Computers (TC), 60(9), Sep. 2011.

Page 30

Conclusion

Dynamic timing variation is increasingly critical

Online timing variation detection and tolerance is a promising approach to dynamic variation

Application-specific timing variation
• MicroFix for DVFS
• ReviveNet for aging tolerance

Holistic solutions can be more cost-effective
• TEA-TM: architectural optimization for circuit symptoms

Page 31

Publications (reverse chronological order)

1. Guihai Yan, Yinhe Han, Xiaowei Li, ReviveNet: A Self-adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation, IEEE Transactions on Computers (TC), Vol.60, No.9, pp.1219-1232, Sep. 2011.

2. Guihai Yan, Yinhe Han, Xiaowei Li, SVFD: A Versatile Online Fault Detection Scheme via Checking of Stability Violation, IEEE Transactions on Very Large Scale Integration Systems (T-VLSI), Vol.19, No.9, pp.1627-1640, Sep. 2011.

3. Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, MicroFix: Using Timing Interpolation and Delay Sensors for Power Reduction, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol.16, No.2, pp.1-21, Mar. 2011.

4. Jianbo Dong, Lei Zhang, Yinhe Han, Guihai Yan, Xiaowei Li, Performance-asymmetry-aware Scheduling for Chip Multiprocessors with Static Core Coupling, Journal of Systems Architecture, Vol.56, pp.534-542, 2010.

5. Guihai Yan, Xiaoyao Liang, Yinhe Han, Xiaowei Li, Leveraging the Core-Level Complementary Effects of PVT Variations to Reduce Timing Emergencies in Multi-Core Processors, In the Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA'10), Saint-Malo, France. pp.485-496, Jun. 2010.

6. Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li, MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'09), pp.395-400, 2009.

7. Song Jin, Yinhe Han, Lei Zhang, Huawei Li, Xiaowei Li, Guihai Yan, M-IVC: Using Multiple Input Vectors to Minimize Aging-induced Delay, Proc. of IEEE Asian Test Symposium (ATS'09), 2009.

8. Guihai Yan, Yinhe Han, Xiaowei Li, A Unified Online Fault Detection Scheme via Checking of Stability Violation, IEEE/ACM Design, Automation and Test in Europe (DATE'09), pp.496-501, 2009.

9. Guihai Yan, Yinhe Han, Xiaowei Li, Hui Liu, BAT: Performance-Driven Crosstalk Mitigation Based on Bus-grouping Asynchronous Transmission, IEICE Transactions on Electronics, Vol.E91-C, No.10, pp.1690-1697, Oct. 2008.

Page 32

Book Chapters

Fault Tolerance Designs for Digital Integrated Circuits: Tolerating defects/faults, parameter variations, and soft errors (in Chinese), Beijing, Science Press, 2011. ISBN 978-7-03-030576-3.

Page 33

When I’ve done a program…