thermal chamber towards aging-aware and self …people.virginia.edu/~xg2dt/papers/xinfei...



11th Annual University of Virginia Engineering Research Symposium (UVERS 2015)

Circadian Rhythms

Multiple-Critical-Path Embeddable NBTI Sensors [3]

Small and flexible: embeddable in system-level design and a top-down design flow

Track both aging and accelerated recovery

Can be used as triggers for proactive recovery

Circuit-level: transient simulation, compatible with circuit simulators (e.g., SPICE) [2];

Architecture-level: physically aware, parameterized high-level models that are integrated with simulators like gem5;

System-level: optimized scheduling algorithms that trade off lifetime against other metrics, such as energy efficiency.

Biology-inspired Accelerated Self-Healing Techniques

Control sleep conditions explicitly (e.g., higher temperatures, negative voltages, UV exposure)

PPAR: Proactive Periodical Accelerated Rejuvenation (control the ratio of sleep time to active time)

Potential On-Chip Solutions

Negative voltage generator

“On-Chip Heating” generation

Multiple-Critical-Path Embeddable NBTI Sensors

Cross-Layer Optimization Infrastructure

Introduce device-level accelerated recovery to system design

Lead to new design methodologies, such as design for accelerated recovery (DFAR) or power- and aging-aware co-design

Extend the proposed methods to emerging technologies, such as FinFET and 3D IC

[1] X. Guo, A. Roelke, M. Stan, “Proactive Periodic Accelerated Rejuvenation: A Circadian-Rhythm-Inspired Solution for Resilient Electronic Systems,” submitted.

[2] X. Guo, A. Roelke, M. Stan, “A SPICE-Compatible BTI Transient Model Considering Accelerated Recovery,” ongoing.

[3] X. Guo, M. Stan, “MCPENS: Multiple-Critical-Path Embeddable NBTI Sensors for Dynamic Wearout Management,” IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE-11), April 2015.

[4] M. Stan, X. Guo, A. Roelke, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques in CMOS Circuits,” Proc. of GOMAC Tech, March 2015.

[5] X. Guo, W. Burleson, M. Stan, “Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques,” Proc. of ACM/IEEE Design Automation Conference (DAC), June 2014.

Proactive Periodical Accelerated Rejuvenation [1]

Schedule explicit accelerated recovery periods ahead of any sign of stress in the early lifetime

The irreversible wearout is explicitly “delayed”

A wearout-adaptation strategy only needs to track rapid (reversible) wearout over a short period of time

Achieve optimal average performance

Predictable and controllable

Extend lifetime effectively

Stress and Recovery “Knobs”

Voltage, time length, temperature, switching activity (AC/DC), and the ratio of active (wearout) time to sleep (rejuvenation) time.
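The active-to-sleep ratio knob can be illustrated with a small scheduling sketch. This is only an illustration under stated assumptions: the names `Phase`, `ppar_schedule` and `sleep_fraction` are hypothetical and do not come from the poster, and the 8-hour active / 2-hour sleep split is borrowed from one of the stress-versus-recovery pairs in the poster's plot legends (8 hrs vs. 2 hrs).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    kind: str       # "active" (wearout) or "sleep" (rejuvenation)
    start_h: float  # phase start, in hours
    end_h: float    # phase end, in hours


def ppar_schedule(active_h: float, sleep_h: float, total_h: float) -> List[Phase]:
    """Alternate active and sleep phases until total_h hours are covered.

    Hypothetical helper: a fixed periodic schedule, starting with an active phase.
    """
    if active_h <= 0 or sleep_h <= 0 or total_h <= 0:
        raise ValueError("all durations must be positive")
    phases: List[Phase] = []
    t = 0.0
    kind = "active"
    while t < total_h:
        length = active_h if kind == "active" else sleep_h
        end = min(t + length, total_h)  # truncate the last phase if needed
        phases.append(Phase(kind, t, end))
        t = end
        kind = "sleep" if kind == "active" else "active"
    return phases


def sleep_fraction(phases: List[Phase]) -> float:
    """Fraction of the scheduled time spent in sleep (rejuvenation)."""
    total = sum(p.end_h - p.start_h for p in phases)
    asleep = sum(p.end_h - p.start_h for p in phases if p.kind == "sleep")
    return asleep / total


if __name__ == "__main__":
    for phase in ppar_schedule(active_h=8, sleep_h=2, total_h=20):
        print(phase)
```

Shorter cycles with the same ratio (for example 8 hrs vs. 2 hrs instead of 48 hrs vs. 12 hrs) keep the sleep fraction constant while changing how long wearout accumulates before each recovery period.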

Test Configuration

Commercial FPGA chips (40 nm)

Accelerated Testing Methodology

Combine the accelerated techniques with existing core scheduling solutions

Utilize “Dark Silicon”

Design some on-chip reconfigurable fast-switching elements

[Figure: multicore system with Core 1 to Core 8 and a Shared L3 Cache; labels “Zzzzzz...” and “Heat”.]
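Combining accelerated recovery with core scheduling could look like the rotation sketched here. It is a minimal illustration, not the authors' scheduler: the function name `rotate_sleeping_cores` is hypothetical, the eight cores match the multicore figure, and the choice of two sleeping cores per epoch is an assumption made only for the example.

```python
from typing import List, Set


def rotate_sleeping_cores(n_cores: int, n_sleeping: int, n_epochs: int) -> List[Set[int]]:
    """Return, for each epoch, the set of 1-based core IDs put into sleep/recovery.

    Hypothetical round-robin policy: the sleeping window advances by
    n_sleeping cores every epoch and wraps around.
    """
    if not 0 < n_sleeping < n_cores:
        raise ValueError("need 0 < n_sleeping < n_cores")
    plan: List[Set[int]] = []
    for epoch in range(n_epochs):
        start = (epoch * n_sleeping) % n_cores
        plan.append({(start + k) % n_cores + 1 for k in range(n_sleeping)})
    return plan


if __name__ == "__main__":
    # Assumed configuration: 8 cores, 2 of them recovering in each epoch.
    for epoch, sleeping in enumerate(rotate_sleeping_cores(8, 2, 4)):
        print(f"epoch {epoch}: cores {sorted(sleeping)} sleep and recover")
```

With these assumed numbers every core gets one recovery epoch out of four, so the remaining cores stay available for work while the sleeping ones rejuvenate.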

More significant with extreme technology scaling

One transistor failure might lead to whole-system failure

Increase design margin

Both reversible and permanent parts

Most dominant aging effects

Both are reversible

BTI - Bias Temperature Instability

EM - Electromigration

VLSI - Very Large Scale Integration

Predict aging-induced degradation, add guard bands or design for the worst case

Hard to predict due to uncertain thermal conditions, switching activity, etc.;

The worst case becomes even worse with technology scaling;

Power, performance and area (PPA) overhead.

Track and monitor aging effects, dynamically adapt to the aging

Sensors need to track through the whole lifetime;

The average case is skewed;

Power, performance and area (PPA) overhead.

Reduce the stress during operation, thus alleviating aging

Not applicable to high-performance systems;

Not applicable to all aging effects.

Repair Aging by Reversing the Aging Effects (Accelerated Self-Healing)

Take advantage of the recovery property of aging;

Rejuvenate the chip during “sleep”;

Applicable to all reversible aging;

Reduce the sensing time;

Much less PPA overhead.

Inspired by Biology: Sleep vs. Inactivity [4, 5]

Biological view:

During sleep, several processes remain active that are essential for an organism to recover its full capabilities.

Conventional view in the circuit community:

Sleep for electronic systems means a period of inactivity or idleness (power gating, clock gating, etc.).

Our idea:

Sleep should be used as an active recovery period for future electronics. Electronic systems will benefit from sleep periods with active rejuvenation, during which some of the effects of wearout (like BTI) can be reversed by several techniques (high temperature, negative voltage, UV light, reverse current, etc.), thus leading to effective self-healing.

High-Performance Low-Power Lab, Computer Engineering Program, University of Virginia

Xinfei Guo, Advisor: Mircea R. Stan

Towards Aging-aware and Self-healing VLSI Chips and Systems

[Figure: FPGA Board and Mother Test Board. Labels: FPGA Chip, To FPGA Programmer, To Mother Board Programmer, To PC.]

[Figure: Test configuration. Labels: Circuit Under Test (CUT), 75 LUTs, En, rst, 16-b Counter, fref, clk, in, Cout. Equation symbols: Td, fosc, Cout, fref.]
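The test configuration lists a circuit under test, a 16-b counter and a reference clock fref. One common way to turn a counter reading into a frequency is gated counting, sketched here. This is a generic illustration and not the expression shown on the poster: the function name, the gate-window parameter and the example numbers are all assumptions.

```python
def estimate_frequency_hz(count: int, gate_cycles: int, f_ref_hz: float,
                          counter_bits: int = 16) -> float:
    """Estimate an oscillator frequency from a gated counter reading.

    count:        oscillator edges counted during the gate window
    gate_cycles:  gate window length, in reference-clock cycles (assumed knob)
    f_ref_hz:     reference clock frequency in Hz
    counter_bits: counter width; 16 matches the 16-b counter in the figure
    """
    if not 0 <= count < 2 ** counter_bits:
        raise ValueError("count does not fit in the counter")
    if gate_cycles <= 0 or f_ref_hz <= 0:
        raise ValueError("gate_cycles and f_ref_hz must be positive")
    gate_time_s = gate_cycles / f_ref_hz  # window duration in seconds
    return count / gate_time_s


if __name__ == "__main__":
    # Hypothetical numbers: 1 MHz reference, 1000-cycle (1 ms) gate window.
    f_osc = estimate_frequency_hz(count=25_000, gate_cycles=1_000, f_ref_hz=1e6)
    print(f"estimated oscillator frequency: {f_osc / 1e6:.3f} MHz")
```

In a scheme like this the gate window has to be short enough that the count stays below the counter's maximum value, and a frequency drop between two readings is then a direct measure of wearout.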

[Figure: Illustration of aging vs. accelerated recovery. Axis: Frequency (MHz), 24.5 to 26.1. Phases: wearout for 48 hours, accelerated recovery for 12 hours.]

Illustration of Multicore System Self-Healing

Biological Clock

Sleep but recovery

**Design Margin Relaxed Parameter: the percentage of the original margin that the chip recovered.
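One possible reading of this parameter, sketched under assumptions: take the frequency lost during wearout as the consumed margin and report how much of it the recovery phase wins back. The function name and the sample frequencies are hypothetical; the poster does not give a formula.

```python
def recovered_percentage(f_fresh: float, f_aged: float, f_recovered: float) -> float:
    """Percentage of the wearout-induced frequency loss regained by recovery.

    Assumed definition: 100 * (f_recovered - f_aged) / (f_fresh - f_aged).
    All frequencies must use the same unit (e.g. MHz).
    """
    lost = f_fresh - f_aged
    if lost <= 0:
        raise ValueError("f_aged must be lower than f_fresh")
    return 100.0 * (f_recovered - f_aged) / lost


if __name__ == "__main__":
    # Hypothetical readings in MHz, not measurements from the poster.
    pct = recovered_percentage(f_fresh=26.0, f_aged=25.0, f_recovered=25.8)
    print(f"recovered about {pct:.1f}% of the lost margin")
```

A value near 100 would mean the chip returned almost to its fresh frequency, while a value near 0 would mean the sleep period recovered very little.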

Sleeping Cores

[Figure: logic-gate and flip-flop schematic (labels: &, S, R, SET, CLR, Q) annotated with “Timing Error!”, “High Power!”, “Failure!” and “Slow!”.]

• ~ mm

• Transistors

• Metal Wires

Personal Use (Electronic Devices)

Industry (measuring instruments)

Spaceship

Antenna and Communication systems

Sensing Networks …

VLSI Chips and Systems

Aging/Wearout

BTI & EM

N/PBTI

HCI

TDDB

EM

[Figure: ∆Vth vs. Time under stress and recovery. Axis markers: 0, t1, t1+t2, ∆Vth(t1). Phases: Vstress, Remove Vstress. Curves: Previous Work, This Work.]

Accelerated Self-Healing

Test Conditions

Thermal Chamber

(Chip Inside)

Motherboard

Data Sampling

[Figure: Frequency (MHz) vs. Time (minutes). Frequency axis 18.8 to 19.25, time axis 0 to 3500. Series: 48 hrs vs. 12 hrs, 24 hrs vs. 6 hrs, 12 hrs vs. 3 hrs, 8 hrs vs. 2 hrs.]

[Figure: Frequency (MHz), axis 18.891 to 19.091. Categories: 48 hrs (No Recovery), 24 hrs vs. 6 hrs, 12 hrs vs. 3 hrs, 8 hrs vs. 2 hrs.]

Experimental Setup

On-chip Heating

On-chip Aging Sensors – MCPENS

[Figure: Conventional DWM (labels: DVFS, Body Bias) alongside Proactive Recovery (label: Accelerated Self-healing). Core 1 to Core 6, with six MCPENS blocks labeled Path<N:0>.]

Cross-layer Infrastructure

Key Contributions

Selected Publications

Device Level

Circuit Level

Architecture Level

System Level