11th Annual University of Virginia Engineering Research Symposium (UVERS 2015)
Circadian Rhythms
Multiple-Critical-Path Embeddable NBTI Sensors [3]
Small & Flexible: embeddable in system level design and a
top-down design flow
Track both aging and accelerated recovery
Can be used as triggers for Proactive Recovery
Circuit-level: transient simulation, compatible
with circuit simulators (e.g., SPICE) [2];
Architecture-level: physically aware,
parameterized high-level models that
are integrated with simulators like gem5;
System-level: optimized scheduling
algorithms that trade off lifetime
against other metrics, like energy efficiency.
Biology-inspired Accelerated Self-Healing Techniques
Control sleep conditions explicitly (e.g. higher temperatures,
negative voltages, UV exposure)
PPAR- Periodical Proactive Accelerated Rejuvenation
(control the ratio of sleep vs. active)
Potential On-Chip Solutions
Negative voltage generator
“On-Chip Heating” generation
Multiple-Critical-Path Embeddable NBTI Sensors
Cross-Layer Optimization Infrastructure
Introduce device level accelerated recovery to system design
Lead to new design methodology, like design for accelerated
recovery (DFAR) or Power- and Aging-aware co-design
Extend the proposed methods to emerging technologies,
such as FinFET and 3DIC
[1] X. Guo, A. Roelke, M. Stan, “Proactive Periodic Accelerated
Rejuvenation: A Circadian-Rhythm-Inspired Solution for Resilient
Electronic Systems,” Submitted.
[2] X. Guo, A. Roelke, M. Stan, “A SPICE-Compatible BTI Transient
Model Considering Accelerated Recovery,” Ongoing.
[3] X. Guo, M. Stan, “MCPENS: Multiple-Critical-Path Embeddable
NBTI Sensors for Dynamic Wearout Management,” IEEE Workshop on
Silicon Errors in Logic–System Effects (SELSE-11), April, 2015.
[4] M. Stan, X. Guo, A. Roelke, “Modeling and Experimental
Demonstration of Accelerated Self-Healing Techniques in CMOS
Circuits,” Proc. of GOMAC Tech, March, 2015.
[5] X. Guo, W. Burleson, M. Stan, “Modeling and Experimental
Demonstration of Accelerated Self-Healing Techniques,” Proc. of
ACM/IEEE Design Automation Conference (DAC), June, 2014.
Proactive Periodical Accelerated Rejuvenation [1]
Schedule explicit accelerated recovery periods ahead of any
sign of stress, early in the lifetime
The irreversible wearout is “delayed” explicitly
A wearout-adaptation strategy only needs to track rapid
(reversible) wearout over a short period of time
Achieve optimal average performance
Predictable and controllable
Extend lifetime effectively
Stress and Recovery “Knobs”
Voltage, time length, temperature, switching activity (AC/DC),
and the ratio of active (wearout) to sleep (rejuvenation) time.
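The active-to-sleep ratio knob can be illustrated with a minimal sketch that lays out alternating active (wearout) and sleep (rejuvenation) periods. The function names `ppar_schedule` and `sleep_fraction` are hypothetical, and the 48-hour and 12-hour period lengths are borrowed from the wearout and recovery durations quoted on the poster; this is not the authors' scheduler.

```python
def ppar_schedule(total_hours, active_hours, sleep_hours):
    """Return (start, end, mode) tuples alternating active and sleep periods.

    Illustrative only: a fixed, periodic schedule chosen ahead of time,
    independent of any measured wearout.
    """
    if total_hours <= 0 or active_hours <= 0 or sleep_hours <= 0:
        raise ValueError("all durations must be positive")
    schedule = []
    t = 0.0
    mode = "active"
    while t < total_hours:
        length = active_hours if mode == "active" else sleep_hours
        end = min(t + length, total_hours)  # clip the last period
        schedule.append((t, end, mode))
        t = end
        mode = "sleep" if mode == "active" else "active"
    return schedule


def sleep_fraction(schedule):
    """Fraction of the scheduled time spent in rejuvenation."""
    total = sum(end - start for start, end, _ in schedule)
    asleep = sum(end - start for start, end, m in schedule if m == "sleep")
    return asleep / total


if __name__ == "__main__":
    plan = ppar_schedule(total_hours=120, active_hours=48, sleep_hours=12)
    for start, end, mode in plan:
        print(f"{start:6.1f} h to {end:6.1f} h: {mode}")
    print("sleep fraction:", sleep_fraction(plan))
```

With a 4:1 ratio the chip spends one fifth of its time in rejuvenation; shortening both periods while keeping the ratio changes how often recovery happens without changing that fraction.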
Test Configuration
Commercial
FPGA chips (40nm)
Accelerated Testing
Methodology
Combine the accelerated
techniques with existing
core scheduling solution
Utilize “Dark Silicon”
Design some on-chip
reconfigurable fast
switching elements
[Figure: multicore layout with Core 1 to Core 8 and a shared L3 cache; sleeping cores ("Zzzzzz...") are labeled "Heat".]
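One way to picture combining accelerated recovery with core scheduling is a round-robin rotation in which a few cores at a time are put to sleep (and heated) while the rest stay active. The sketch below is an assumption for illustration only: `rotate_sleeping_cores` is a hypothetical name, the round-robin policy is not stated on the poster, and the eight-core count is taken from the figure.

```python
def rotate_sleeping_cores(num_cores, num_sleeping, num_epochs):
    """Return, per epoch, the sorted list of core IDs (1-based) that sleep.

    Round-robin rotation: each epoch the window of sleeping cores advances
    by `num_sleeping`, so every core gets a recovery slot in turn.
    """
    if not 0 < num_sleeping < num_cores:
        raise ValueError("need at least one sleeping and one active core")
    schedule = []
    for epoch in range(num_epochs):
        start = (epoch * num_sleeping) % num_cores
        sleeping = {(start + i) % num_cores + 1 for i in range(num_sleeping)}
        schedule.append(sorted(sleeping))
    return schedule


if __name__ == "__main__":
    for epoch, cores in enumerate(rotate_sleeping_cores(8, 2, 4)):
        print(f"epoch {epoch}: sleeping (heated) cores {cores}")
```

A real scheduler would also weigh workload demand and sensor readings; the rotation only shows how otherwise dark cores could be given recovery time in turn.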
More significant with aggressively scaled technology
One transistor failure might lead to the whole system failure
Increase design margin
Both Reversible and Permanent Part
Most dominant aging effects
Both are reversible
BTI - Bias Temperature Instability
EM - Electromigration
VLSI - Very Large Scale Integration
Predict aging-induced degradation, add a guard band, or
design for the worst case
Hard to predict due to uncertain thermal conditions, switching activity, etc.;
The worst case becomes even worse with technology scaling;
Power, performance and area (PPA) overhead.
Track and monitor aging effects, and adapt to them dynamically
Sensors need to track through the whole lifetime;
The average case is skewed;
Power, performance and area (PPA) overhead.
Reduce the stress during operation, thus alleviating aging
Not applicable to high-performance systems;
Not applicable to all aging effects.
Repair Aging by Reversing the Aging effects
(Accelerating Self-Healing)
Take advantage of the recovery property of aging;
Rejuvenate the chip during “sleep”;
Applicable to all reversible aging;
Reduce the sensing time;
Much less PPA overhead.
Inspired by Biology: Sleep vs. Inactivity [4, 5]
Biological View:
During sleep, several processes remain
active that are essential for an organism
to recover its full capabilities
Conventional view in circuit community:
Sleep for electronic systems means a period of inactivity or
idleness. (Power gating/Clock gating, etc.)
Our Idea:
Sleep should be used as an active recovery period for future
electronics. Electronic systems will benefit from such sleep
periods with active rejuvenation during which some of the
effects of wearout (like BTI) can be reversed by several
techniques (high temperature, negative voltage, UV light,
reverse current, etc.), thus leading to effective self-healing.
High-Performance Low-Power Lab, Computer Engineering Program, University of Virginia
Xinfei Guo, Advisor: Mircea R. Stan
Towards Aging-aware and Self-healing VLSI Chips and Systems
[Figure: FPGA board and mother test board, with connections to the FPGA programmer, the mother board programmer, and the PC.]
[Figure: Test configuration on the FPGA chip: a Circuit Under Test (CUT) of 75 LUTs with enable (En) and reset (rst) inputs, and a 16-bit counter with clk, fref, and Cout16 signals.]
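A counter-based frequency readout of this kind can be sketched as follows, assuming the CUT behaves as an oscillator whose output increments the 16-bit counter during a gate window measured in reference-clock cycles. That assumption, the function names, and the example numbers are all illustrative; the poster's own formula did not survive extraction.

```python
def estimate_osc_frequency(count, f_ref_hz, ref_cycles, counter_bits=16):
    """Estimate oscillator frequency from a gated counter reading.

    Assumed relation (not taken from the poster): the counter sees `count`
    oscillator edges during `ref_cycles` periods of the reference clock,
    so f_osc = count * f_ref / ref_cycles.
    """
    if f_ref_hz <= 0 or ref_cycles <= 0:
        raise ValueError("reference frequency and gate length must be positive")
    max_count = 2 ** counter_bits - 1
    if not 0 <= count <= max_count:
        raise ValueError("count does not fit in the counter")
    return count * f_ref_hz / ref_cycles


def max_gate_cycles(f_osc_max_hz, f_ref_hz, counter_bits=16):
    """Longest gate window (in reference cycles) that cannot overflow."""
    max_count = 2 ** counter_bits - 1
    return int(max_count * f_ref_hz / f_osc_max_hz)


if __name__ == "__main__":
    # Hypothetical reading: 19000 counts in 1000 cycles of a 1 MHz reference.
    print(estimate_osc_frequency(19000, 1_000_000, 1000) / 1e6, "MHz")
    print(max_gate_cycles(26e6, 1e6), "reference cycles before overflow")
```

Because aging slows the oscillator, tracking this estimate over time gives a direct view of wearout and recovery; the gate window has to stay short enough that the 16-bit counter does not wrap.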
[Figure: Illustration of aging vs. accelerated recovery. Frequency (MHz), axis range 24.5 to 26.1, during wearout for 48 hours followed by accelerated recovery for 12 hours.]
Illustration of Multicore System Self-Healing
Biological Clock
Sleep but recovery
**Design Margin Relaxed Parameter: Percentage the chip recovered from the original margin.
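One plausible reading of this parameter, stated here as an assumption, is the share of the aging-induced frequency loss that a recovery period wins back. The sketch below uses a hypothetical function name and made-up frequencies; the poster does not give the exact formula.

```python
def recovered_margin_percent(f_fresh, f_aged, f_recovered):
    """Percentage of the lost frequency margin regained by recovery.

    Assumed definition: 100 * (f_recovered - f_aged) / (f_fresh - f_aged),
    with all three frequencies in the same unit.
    """
    lost = f_fresh - f_aged
    if lost <= 0:
        raise ValueError("aged frequency must be below the fresh frequency")
    return 100.0 * (f_recovered - f_aged) / lost


if __name__ == "__main__":
    # Made-up numbers in MHz, for illustration only.
    pct = recovered_margin_percent(f_fresh=26.0, f_aged=25.0, f_recovered=25.8)
    print(f"margin recovered: {pct:.1f}%")
```

Under this reading, 0% means recovery had no effect and 100% means the chip returned to its fresh frequency.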
Sleeping Cores
[Figure: a VLSI chip (~ mm; transistors and metal wires) with aging-induced symptoms: Timing Error!, High Power!, Slow!, Failure!]
Personal Use (Electronic Devices)
Industry (Measuring Instruments)
Spaceship
Antenna and Communication Systems
Sensing Networks …
VLSI Chips and Systems
Aging/Wearout
BTI & EM
N/PBTI
HCI
TDDB
EM
…
[Figure: ∆Vth vs. time. Vstress is applied until t1, where the shift is ∆Vth(t1), and removed from t1 to t1+t2; the apply/remove labels repeat for a second cycle.]
Previous Work
This Work
Accelerated Self-Healing
Test Conditions
Thermal Chamber
(Chip Inside)
Motherboard
Data Sampling
[Figure: Frequency (MHz, 18.8 to 19.25) vs. time (minutes, 0 to 3500) for 48 hrs vs. 12 hrs, 24 hrs vs. 6 hrs, 12 hrs vs. 3 hrs, and 8 hrs vs. 2 hrs.]
[Figure: Frequency (MHz, 18.891 to 19.091) for 48 hrs (No Recovery), 24 hrs vs. 6 hrs, 12 hrs vs. 3 hrs, and 8 hrs vs. 2 hrs.]
Experimental Setup
On-chip Heating
On-chip Aging Sensors – MCPENS
[Figure: six cores (Core 1 to Core 6), each with an MCPENS block monitoring Path<N:0>, feeding DWM; DWM actions are labeled Conventional (DVFS, Body Bias) and Proactive Recovery (Accelerated Self-healing).]
Cross-layer Infrastructure
Key Contributions
Selected Publications
Device Level
Circuit Level
Architecture Level
System Level