Task 2410.001 CLASH - Cross-Layer Accelerated Self-Healing:
Circadian Rhythms for Resilient Electronic SystemsXinfei Guo, Alec Roelke, Mircea R. Stan
ECE Department, University of Virginia, Charlottesville, VA
Previous Solutions Tolerate - Design for the worst case
Compensate - Dynamically adapt to wearout
Slow - Reduce the stress during operation
Passive Recovery
Wearout Issues BTI, HCI, TDDB, EM, etc.
Increase design margin and worsen metrics
Cross-layer Issues
Both Reversible and Irreversible Part
Overview
This Project Repair wearout completely
Accelerated & Active Recovery
Circadian Rhythms for FULL recovery
Cross-Layer Implementations
Wearout
Accelerated Self-Healing [DAC ’14]
Interface Board ChipData Sampling
16-b
frefclk
in
Cout
16
En En
75 LUTs
Circuit Under Test (CUT)
rst
(a) (b)
Thermal Chamber
Counter
Main Idea Sleep → Proactive Recovery
Some of the effects of wearout (e.g. BTI and EM) can be reversed
by applying several techniques (high temperature, negative
voltage, UV light, reverse current, etc.), thus leading to effective
accelerated self-healing fundamentally.
Test Setup Commercial 40nm FPGA chips
Accelerated Testing Methodology
Knobs: V, T, AC/DC, Sleep/Active
Results
“Sleep When Getting Tired” [ASPDAC ’16]
Main Idea The boundary between reversible & irreversible is “soft”
Irreversible wearout can be recovered through acceleration
Frequency dependency of accelerated wearout & recovery
“Sleep when getting tired” to FULLY avoid the irreversible wearout
Negative “turbo” boost at the system level
Results
Recovery under Different “circadian rhythms”
Reduction of Design Margin >60X
Average Performance Improvement –
Close to the fresh state through the whole lifetime
Circuit-level implementationsNegative Voltages A Charge-pump Neg. voltage generator
Embeddable Wearout Sensors [SELSE ’15]
Vdd
clk1
clk2
clk1C1 C2
charging charge redistribution
Vout
+ +
High TemperaturesWearout-aware Power Gating
Core 6
Core 1 Core 2 Core 3 Core 4
Core 5 Core 7
Shared L3 Cache
Core 8
Zzzzzz...
Zzzzzz...Heat Heat
Hea
t
Heat
Heat Heat
Metastable element based
Track both wearout and (accelerated) recovery
Track Multiple Paths simultaneously
Be aware of wearout induced path reranking
Used together with multiple dynamic management
track poll1
poll2
poll3
track
discharg discharg
ref
10%
5%
2%path0track poll1
poll2
poll3
discharg discharg
ref
5%
path1
path2
path3
5%
5%
track
track
poll1
poll2
poll3
outa outb
A CLASH System [JVLSI]
Fresh “Turbo-boost”
Fre
qu
en
cy
Time
Desig
n M
arg
in
Average
Frequency
Negative
Days
Hours
End of life
Years
Wearout
No recoveryActive
Sleep
Utilize Core Redundancy in dark silicon era
Optimal Scheduling/Load balancing
Introduce Accelerated & Active Recovery as a
new design knob for cross-layer resilience
Tradeoff between lifetime, power, performance
FFD
Scan_in
SE
clk
Q
Q
rst
Old Scan Cell
MUX
MUX
FFD
Scan_in
SE
clk
Q
Q
rst
SENSOR
sel
Core
New Scan Cell
Close-loop Open-loop
To user
Accelerated Active Adaptive solutions or
Recovery
MU
X
Embedded in a ASIC design flow
Sleep
VoltageNegative
Logic Blocks
vddvdd_high
Sleep
Boosted Vdd
GeneratorNegative Voltage
vdd
SleepEn
Power Gating Block
2016 SLD Review
2X
1X
Accelerated Recovery
Heating Element
Core
Logic
output
LogicCoreCritical
path
-300.6mV
638ns
-300.6mV
4.36mV
4.33mV
Ripple: 1.45%
Clock frequency = 66.7MHz
Accelerated & Active
Recovery for 12 hours24.5
25
25.5
26
Fre
qu
ency
(MH
z)
Accelerated Stress for 48 hours
72.4%
An example where about 72.4% of wearout is
recovered by accelerated self-healing techniques
in only ¼ of stress time (measured).
Accelerated self-healing space exploration (model)
Core-level
implementations
Active blocks healing inactive elements
(e.g. Dark Silicon & Redundant Resources)
Core
sensors
Sensor
Sensor
Proactive
Accelerated
From
To cores
Apply to
Sleep Cores Active Cores
Accelerated
rou
ter
Scheduler
& Active Recovery
outputs
sleep cores
outputs
Applications
Scheduler
& Active Recovery
Blocks
Core
Allocation
LoadBalancer
Heat for accelerated
recovery
- VActive Recovery
EN
CircuitWearout Sensors
Architecture
Dark SiliconProgramCounters
+1
SystemVirtualSensorsLoad Balancer
Heating Elements
AcceleratedRecovery
EN
Negative Voltage Generator
Proactive Scheduler
RedundantResources
23.2min
57.8min
344.3min
104.5min
Cross-layer Accelerated Self-Healing
Illustration of “Negative Turbo Boost”A potential implementation of Cross-layer Accelerated
Self-Healing in a NoC system
Recovery time after 12-hour constant stress under
regular operation condition (no accelerated stress)