

10th Annual University of Virginia Engineering Research Symposium (UVERS 2014)

[Figure: multicore system with Core 1 to Core 8 around a shared L3 cache; sleeping cores ("Zzzzzz...") are heated]

[Figure: circuit symptoms of aging: Timing Error! High Power! Failure! Slow!]

Transistor Aging (Wearout)

Deterioration of circuit/system performance over time

Increases the required design margin

More significant with extreme technology scaling

Has both a reversible and a permanent part

Bias Temperature Instability (BTI) is the dominant reversible aging mechanism

Previous work

Accept the variations, track and monitor them

Dynamically adapt to the variations

Reduce actual variations during operation

Limitations of previous work

The worst case becomes even worse with technology scaling

Power, performance and area (PPA) overhead

This Work

Our Goal

Reduce aging-induced variations directly without introducing overhead

Relax the design margin

Deeply rejuvenate the chip

Improve PPA metrics

Features

Explore the idea of periodic sleep for electronic systems, not unlike that of biological systems

Postulate that future electronic systems will use sleep time as an active recovery period essential for their overall performance

Deeply rejuvenate the chip during “Sleep Time”

Demonstrate the techniques with both experiments and models

Contributions

Three Accelerated Self-Healing techniques

Control sleep conditions explicitly (e.g. higher temperatures, negative voltages)

Proactive Accelerated Rejuvenation (control the ratio of sleep vs. active time)

A first-order circuit model

Considers both wearout and accelerated recovery periods

Based on the latest device-level NBTI models

Validated using hardware (FPGA) experiments

Exploring On-Chip Solutions

Negative voltage generator

“On-Chip Heater” in other electronic system architectures such as multicore

[Figure labels: aging mechanisms N/PBTI, HCI, TDDB, EM]

Motivation

Inspired by Biology: Sleep vs. Inactivity

Biological view:

During sleep, several processes remain active that are essential for biological systems to recover their full capabilities.

Conventional view in the circuit community:

Sleep for electronic systems means a period of inactivity or idleness (power gating, clock gating, etc.).

Our hypothesis:

Sleep should be used as an active recovery period for future electronics. Electronic systems will benefit from such sleep periods with active rejuvenation, during which some of the effects of wearout (like BTI) can be reversed, thus leading to effective self-healing.

Proactive Accelerated Rejuvenation

Schedule explicit accelerated recovery periods ahead of any sign of stress

Less overhead (no tracking or adaptation circuitry needed)

Easy to implement

Predictable and controllable

Better cumulative metrics

Extends lifetime effectively
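The scheduling idea above can be sketched in a few lines of Python. This is only an illustration, not the authors' scheduler: the function names `split_cycle` and `build_schedule` are hypothetical, and the only values taken from the poster are the active-to-sleep ratio of 4 and the 48 hour wearout / 12 hour recovery split.

```python
from typing import List, Tuple


def split_cycle(period_hours: float, active_to_sleep_ratio: float) -> Tuple[float, float]:
    """Split one cycle into (active_hours, sleep_hours) for a given active:sleep ratio."""
    if period_hours <= 0 or active_to_sleep_ratio <= 0:
        raise ValueError("period and ratio must be positive")
    sleep = period_hours / (1.0 + active_to_sleep_ratio)
    active = period_hours - sleep
    return active, sleep


def build_schedule(total_hours: float, period_hours: float,
                   active_to_sleep_ratio: float) -> List[Tuple[str, float, float]]:
    """Return a fixed list of (phase, start_hour, end_hour) entries.

    The schedule is decided ahead of time, so no aging sensor or
    adaptation circuitry is consulted (hypothetical helper).
    """
    active, sleep = split_cycle(period_hours, active_to_sleep_ratio)
    phases = []
    start = 0.0
    while start < total_hours:
        active_end = min(start + active, total_hours)
        phases.append(("active", start, active_end))
        if active_end < total_hours:
            sleep_end = min(active_end + sleep, total_hours)
            phases.append(("sleep", active_end, sleep_end))
        start += period_hours
    return phases


if __name__ == "__main__":
    # A 60 hour cycle with ratio 4 gives 48 hours active and 12 hours asleep.
    print(split_cycle(60.0, 4.0))
    for phase in build_schedule(120.0, 60.0, 4.0):
        print(phase)
```

Because the schedule depends only on the cycle length and the ratio, it stays predictable and needs no run-time monitoring, which is the property the list above emphasizes.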

Cross-Layer Model

Wearout Model for FPGAs

Based on the Trapping/Detrapping (TD) model

AC vs. DC stress

Recovery is slower than degradation

The unrecovered part accumulates from phase to phase

t1: stress time; t2: recovery time

The total threshold voltage shift: Equations (1) and (2)

The total delay shift: Equation (3)

Accelerated Recovery Model

The delay shift depends strongly on voltage, temperature and the sleep/active ratio.

Delay change in one cycle (T): Equation (4)

Fitting parameters are extracted from measurements

Accelerated Self-Healing

Stress and Recovery “Knobs”

Voltage, time length, temperature, switching activity (AC/DC), and the ratio of active (wearout) to sleep (rejuvenation) time.

Test Configuration

Commercial FPGA chips

Accelerated Testing Methodology

Test Results

Effect of Switching Activity on Wearout

AC stress degrades performance more slowly than DC stress

Recovery is much slower than degradation

Effect of Temperature on Wearout

Negative Voltage

Experimental Setup

High Temperature

Ratio of active vs. sleep time

Summary

Future Work

On-chip Negative Voltages

Combine with on-chip power regulation techniques

Breakdown voltage limitation

Gate-induced drain leakage current (GIDL)

On-chip Heater

Combine the accelerated techniques with existing core scheduling solutions

Utilize “Dark Silicon”

Conclusions

Proposed three accelerated self-healing techniques

Demonstrated several cases that bring stressed chips to within 90% of their original design margin

Discussed on-chip solutions

Limitations: first-order model, other aging mechanisms (EM, TDDB, etc.), chip-to-chip variations

Exploring the extra flexibility offered by circadian rhythms to improve the power, performance and area (PPA) metrics

Acknowledgements

This work was supported in part by NSF under grant No. CCF-1255907, and by SRC through the Global Research Collaboration (GRC) program under task ID 2410.001. We would also like to thank Dr. Wayne Burleson from AMD Research and Mr. Alec Roelke from UVA for discussions.

*Source: http://gladstoneinstitutes.org/node/11312

High Performance Low Power (HPLP) Lab, Computer Engineering Program, University of Virginia

Xinfei Guo Advisor: Mircea R. Stan

Exploring Accelerated Self-Healing Techniques for Electronic Chips and Systems

Biological Clock*

[Figure: threshold voltage shift ∆Vth versus time. During stress (0 to t1) the shift rises to ∆Vth(t1); during recovery (t1 to t1+t2) it falls to ∆Vth(t1+t2).]

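$$\Delta V_{th}(t_1+t_2) = \phi_2\left(A + \log(1 + C t_2)\right) + \Delta V_{th}(t_1)\left(1 - \frac{k + \log(1 + C t_2)}{k + \log\left(1 + C(t_1+t_2)\right)}\right) \qquad (1)$$

$$\phi_2 \sim K_2 \exp\left(-\frac{E_0}{kT}\right)\exp\left(\frac{B V_{ddr}}{kT\,t_{ox}}\right) \qquad (2)$$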
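A minimal Python sketch of the trapping/detrapping behavior may help when reading Equation (1). It assumes the usual TD stress form ∆Vth(t1) = φ1(A + log(1 + C·t1)) and the recovery form ∆Vth(t1+t2) = φ2(A + log(1 + C·t2)) + ∆Vth(t1)(1 − (k + log(1 + C·t2)) / (k + log(1 + C(t1+t2)))). The natural logarithm and every parameter value below are assumptions chosen for illustration; they are not the fitted values from the measurements.

```python
import math


def stress_shift(t1: float, phi1: float, a: float, c: float) -> float:
    """Threshold voltage shift after t1 seconds of stress (assumed TD form)."""
    return phi1 * (a + math.log(1.0 + c * t1))


def shift_after_recovery(t1: float, t2: float, phi1: float, phi2: float,
                         a: float, c: float, k: float) -> float:
    """Threshold voltage shift after t1 seconds of stress and t2 seconds of recovery."""
    dv_stress = stress_shift(t1, phi1, a, c)
    # Fraction of the stress-induced shift that has NOT been recovered after t2.
    unrecovered = 1.0 - (k + math.log(1.0 + c * t2)) / (k + math.log(1.0 + c * (t1 + t2)))
    # First term: shift contributed by the recovery phase's own conditions.
    return phi2 * (a + math.log(1.0 + c * t2)) + dv_stress * unrecovered


if __name__ == "__main__":
    # Hypothetical fitting parameters, for illustration only.
    params = dict(phi1=1e-3, a=1.0, c=1e-2)
    t1, t2 = 24 * 3600.0, 6 * 3600.0  # 24 h stress, 6 h recovery
    before = stress_shift(t1, **params)
    after = shift_after_recovery(t1, t2, phi2=1e-5, k=0.5, **params)
    print(f"shift after stress:   {before:.3e} V")
    print(f"shift after recovery: {after:.3e} V")
```

With these placeholder parameters part of the shift remains after recovery, which matches the statement that the unrecovered part accumulates from phase to phase.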

Stress and Recovery behavior

Pass-transistor based LUT structure

[Figure: pass-transistor based LUT with routing blocks; labels C0 to C3, In0, In1, transistors M1 to M10, and the path of interest]

$$\Delta T_d(t_1) \sim Y \exp\left(-\frac{E_0}{kT}\right)\exp\left(\frac{B V_{dds}}{kT\,t_{ox}}\right)\frac{A + \log(1 + C t_1)}{V_{dds}} \qquad (3)$$
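To get a feel for the temperature and voltage dependence in the stress model, the exponential prefactor can be evaluated on its own. The sketch below assumes the prefactor has the form Y·exp(−E0/kT)·exp(B·Vdds/(kT·tox)), with k the Boltzmann constant. The values of E0, B, tox and Vdds are placeholders rather than extracted fitting parameters, so only the trend is meaningful.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def stress_prefactor(temp_c: float, v_dds: float, e0_ev: float,
                     b: float, t_ox_nm: float, y: float = 1.0) -> float:
    """Assumed prefactor Y * exp(-E0/kT) * exp(B*Vdds / (kT*tox)).

    b is taken in eV*nm/V so that the second exponent is dimensionless
    when t_ox_nm is in nanometres (an assumption made for this sketch).
    """
    kt = BOLTZMANN_EV_PER_K * (temp_c + 273.15)
    return y * math.exp(-e0_ev / kt) * math.exp(b * v_dds / (kt * t_ox_nm))


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    params = dict(v_dds=1.2, e0_ev=0.5, b=0.1, t_ox_nm=2.0)
    ratio = stress_prefactor(110.0, **params) / stress_prefactor(100.0, **params)
    print(f"prefactor ratio, 110 C vs 100 C: {ratio:.3f}")
```

Under these assumptions the prefactor grows with temperature whenever E0 exceeds B·Vdds/tox, which is the direction expected for accelerated wearout at 110 °C relative to 100 °C.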

[Equation (4): delay change ∆Td in one cycle T, expressed in terms of Vdds and the fitting parameters A, C and k; the full expression did not survive text extraction]

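[Figure: test circuit. Labels: Circuit Under Test (CUT) of 75 LUTs with enable (En); 16-b counter with inputs fref, clk, in, En and rst, and 16-bit output Cout]

$$T_d = \frac{1}{2 f_{osc}} = \frac{1}{4\,C_{out}\,f_{ref}}$$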
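The measurement arithmetic can be written out as a short Python sketch. It assumes the relation Td = 1/(2·fosc) = 1/(4·Cout·fref), read here as the counter being enabled for half of each reference period so that fosc = 2·Cout·fref. That interpretation, the function names and the numbers in the example are assumptions for illustration, not measured data.

```python
def path_delay_from_count(c_out: int, f_ref_hz: float) -> float:
    """Delay of the circuit under test from a counter reading.

    Assumes f_osc = 2 * C_out * f_ref (counter enabled for half a
    reference period) and T_d = 1 / (2 * f_osc).
    """
    if c_out <= 0 or f_ref_hz <= 0:
        raise ValueError("count and reference frequency must be positive")
    f_osc = 2.0 * c_out * f_ref_hz
    return 1.0 / (2.0 * f_osc)


def frequency_degradation_percent(f_fresh_hz: float, f_aged_hz: float) -> float:
    """Relative drop in oscillation frequency, in percent."""
    if f_fresh_hz <= 0:
        raise ValueError("fresh frequency must be positive")
    return 100.0 * (f_fresh_hz - f_aged_hz) / f_fresh_hz


if __name__ == "__main__":
    # Hypothetical reading: 1000 counts against a 500 Hz reference.
    print(path_delay_from_count(1000, 500.0))           # delay in seconds
    print(frequency_degradation_percent(26.0e6, 25.48e6))  # percent
```

Counting oscillator edges against a slow reference turns a sub-nanosecond delay change into an integer difference, which is why a 16-bit counter is enough for this kind of test.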

FPGA Board and Mother Test Board

[Figure: test configuration. Labels: FPGA chip, connection to the FPGA programmer, connection to the mother board programmer, connection to the PC]

[Figure: thermal chamber with the chip inside, temperature control, and logic analyzer]

Test Conditions

AC/DC stress test results

[Figure: frequency degradation (%) under AC stress and DC stress after 0, 3, 6, 12 and 24 hours; vertical axis 0 to 2.5%]

[Figure: delay change Td (s), 0 to 1 × 10^-9, versus time (s), 0 to 9 × 10^4; series: 100 °C measurement, 100 °C model, 110 °C measurement, 110 °C model]

Accelerated wearout at 110 °C and 100 °C for 1 day

[Figure: recovered delay (ns) after 0, 0.3, 1, 2, 4 and 6 hours of recovery, in two panels (20 °C and 110 °C); series: 0 V, 0 V model, -0.3 V, -0.3 V model]

Negative voltage-accelerated recovery at 20 °C and 110 °C

[Figure: recovered delay (ns) after 0, 0.3, 1, 2, 4 and 6 hours of recovery, in two panels (0 V and -0.3 V); series: 20 °C, 20 °C model, 110 °C, 110 °C model]

High temperature-accelerated recovery under 0 V and -0.3 V

[Figure: frequency (MHz), axis from 24.5 to 26.1, during wearout for 48 hours followed by accelerated recovery for 12 hours]

Design Margin Relax Parameter (%) for a ratio of active to sleep time of 4

Design Margin Relax Parameter** (%) for all cases

Illustration of wearout vs. recovery

Illustration of Multicore System Self-Healing (sleeping cores recover while asleep; time axis: 24 hours)

Note: AS = accelerated stress, AR = accelerated recovery

**Design Margin Relax Parameter: the percentage of the original margin that the chip recovers.
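One way to compute the Design Margin Relax Parameter from frequency measurements is sketched below in Python. The poster defines the parameter only in words, so the formula here is an assumed reading: the share of the frequency lost during wearout that is regained after accelerated recovery. The function name and the sample frequencies are hypothetical.

```python
def design_margin_relax_percent(f_fresh_mhz: float, f_stressed_mhz: float,
                                f_recovered_mhz: float) -> float:
    """Percentage of the wearout-induced frequency loss that was recovered.

    Assumed definition: 100 * (f_recovered - f_stressed) / (f_fresh - f_stressed).
    """
    lost = f_fresh_mhz - f_stressed_mhz
    if lost <= 0:
        raise ValueError("stressed frequency must be below the fresh frequency")
    return 100.0 * (f_recovered_mhz - f_stressed_mhz) / lost


if __name__ == "__main__":
    # Hypothetical frequencies in MHz, not measured values.
    value = design_margin_relax_percent(26.0, 25.0, 25.9)
    print(f"design margin relax parameter: {value:.1f}%")
```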