-
1 Salishan Conference on High-Speed Computing, 2015
Building reliable chips in future technologies: Fact, fiction or an oxymoron?
Vikas Chandra ARM Research
San Jose, CA
-
Acknowledgments
§ Rob Aitken (ARM)
§ Greg Yeric (ARM)
§ Subhasish Mitra (Stanford)
-
Some things are beyond our control!
Courtesy Takahashi Kaito (SII Nanotechnology Inc.)
-
The eras of computing
[Figure: units (1M to 100 Billion) versus year (1960 to 2030). 1st Era: Mainframe, Mini. 2nd Era: PC, Desktop Internet. Then: Mobile Internet (10 Billion) and The Internet of Things (100 Billion).]
§ Cheap
§ Low power
§ Ubiquitous
-
Need for reliability
-
Lifetime profiles
[Figure: usage versus expected lifetime, each axis running from low to high]
-
Feature size scaling
§ Impact of feature size scaling
  § Charge discretization
  § Random manufacturing defects
  § Increasing electric field
  § Thin gate oxides
  § Interface defects at Si/SiON interface
  § Metal defects
  § Susceptibility to soft errors
[Figure: feature size per generation, from 3 µm down to 0.022 µm (3, 2, 1.5, 1.0, 0.8, 0.5, 0.35, 0.25, 0.18, 0.13, 0.09, 0.065, 0.045, 0.032, 0.028, 0.022). Density doubling ~ every 2.5 years. Annotations: Cu, Low-K ILD, Strain, Hi-K, FinFET.]
§ Beyond 14nm, scaling is more reliant on new materials
§ Will require more reliability analysis
-
CMOS Switches: 19xx - 2015
-
Devices 2015 – 2025
-
Technology scaling overview
[Figure: technology complexity versus year (1970 to 2020), in three tracks. Transistors: NMOS; PMOS; Planar CMOS; Strain; HKMG. Patterning: LE, ~λ; LE. Interconnect: Al wires; Cu wires.]
-
Technology complexity inflection point?
[Figure: technology complexity versus year (1970 to 2020), in three tracks. Transistors: NMOS; PMOS; Planar CMOS; Strain; HKMG; FinFET. Patterning: LE, ~λ; LE; LELE. Interconnect: Al wires; Cu wires.]
-
Future technology
[Figure: technology complexity versus year (2005 to 2025), with node marks at 10nm, 7nm, 5nm and 3nm and a "You are here" marker. Transistors: Planar CMOS; FinFET; µ-enh; HNW; VNW; 2D: C, MoS; 1D: CNT; NEMS; Spintronics. Patterning: LE; LELE; SADP; LELELE; SAQP; EUV; EUV LELE; EUV + DWEB; EUV + DSA. Interconnect: Al / Cu / W wires; W LI; Cu doping; // 3DIC; Graphene wire, CNT via; Opto I/O; Opto int; Seq. 3D. Also shown: eNVM.]
-
Future transistors
[Figure: technology complexity versus year (2005 to 2025), with node marks at 10nm, 7nm, 5nm and 3nm. Transistor track: Planar CMOS; FinFET; µ-enh; HNW; VNW; 2D: C, MoS; 1D: CNT; NEMS; Spintronics. The patterning and interconnect tracks are also shown.]
-
Future technology: patterning
[Figure: technology complexity versus year (2005 to 2025), with node marks at 10nm, 7nm, 5nm and 3nm. Patterning track: LE; LELE; SADP; LELELE; SAQP; EUV; EUV LELE; EUV + DWEB; EUV + DSA. The transistor and interconnect tracks are also shown.]
-
Device lifetime and failure rate
[Figure: failure rate versus time, in three regions. Early life failures: 1 – 20 weeks. Normal lifetime. Wearout: 3 – 10 years.]
§ Increasing manufacturing defects
§ Increasing transient errors
§ Acceleration of aging phenomena
-
Unreliable transistors – 3 phases
Phase 1: Early life failures
§ Gate to source shorts
§ Insulator cracks
§ Thin oxide defects
§ Small opens
§ Poor vias/contacts
Addressed by: Burn-in

Phase 2: Normal lifetime
§ Soft errors: memory/flip-flops, combinational cells
§ Noise: electrical, RTN
Addressed by: Design, Architecture, ECC

Phase 3: Wearout/Aging
§ Transistor wearout: BTI, TDDB
§ Metal wearout: Electromigration
§ ILD wearout
Addressed by: Design margins, Overdesign
-
Soft error: Impact on circuits
§ Storage cells (SRAM bit cells, flip-flops): upset
§ Combinational cells: transient
[Figure: soft error contribution rate of combinational cells, flip-flops and unprotected memory]
-
Big system reliability
§ “Always on” systems are especially vulnerable to soft errors
§ Vulnerability increases substantially at low supply voltage
§ As technology scales down, the number of cores scales up
§ Failure rate increases: from one failure every 2 weeks (current) to one every hour
Source: Draft ICiS 2012 Reliability Workshop
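The jump from 2 weeks to 1 hour follows from simple failure-rate arithmetic: if failures are independent, the system rate is the per-node rate times the node count. A minimal sketch, assuming independent nodes and using hypothetical FIT values and function names (none of these come from the slides):

```python
# 1 FIT = 1 failure per 1e9 device-hours.
HOURS_PER_FIT_UNIT = 1e9

def system_mtbf_hours(fit_per_node, nodes):
    """Mean time between failures of a system of independent, identical nodes."""
    return HOURS_PER_FIT_UNIT / (fit_per_node * nodes)

def rate_increase(mtbf_before_hours, mtbf_after_hours):
    """Factor by which the failure rate grows when MTBF shrinks."""
    return mtbf_before_hours / mtbf_after_hours

# Going from one failure per 2 weeks to one per hour, as quoted on the slide.
factor = rate_increase(14 * 24, 1)

# Hypothetical example: 1000 nodes at 1000 FIT each.
mtbf = system_mtbf_hours(1000, 1000)
```

Under these assumptions the slide's numbers correspond to a 336x higher failure rate, which a system could reach through more nodes, a higher per-node FIT, or both.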
-
SRAM upsets
[Figure: two SRAM bit arrays, one showing a single cell upset (one flipped bit) and one showing a multi cell upset (a cluster of flipped bits); word arrangement in a mux-4 architecture (columns cycling through words 1 2 3 4)]
Feature size scaling: cell area ↓, cell capacitances ↓, Qcoll ↓
-
FIT rate in 28nm: SRAM
-
Soft error mitigation - SRAM
§ ECC with physical interleaving reduces the FIT rate to ~0
§ Physical interleaving: multi cell error → single bit error
§ Temporal scrubbing to correct single bit errors
[Figure: Error Correcting Code trade-off between area, power and error rate. Block diagram: data in, function f, memory, recomputed f, compare, corrector, data out and error signal, with bus widths labelled K and M.]
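Why interleaving turns a multi cell error into single bit errors can be shown with a small mapping. This is a sketch only: it assumes a mux-4 row in which adjacent physical columns belong to different words, and the function names are hypothetical.

```python
MUX = 4  # words interleaved per physical row (mux-4)

def physical_to_logical(column, mux=MUX):
    """Map a physical column to (word index, bit index within that word)."""
    return column % mux, column // mux

def bits_hit_per_word(first_column, width, mux=MUX):
    """Count how many bits of each word a burst of adjacent upsets flips."""
    hits = {}
    for column in range(first_column, first_column + width):
        word, _bit = physical_to_logical(column, mux)
        hits[word] = hits.get(word, 0) + 1
    return hits

# A 4-cell burst touches four different words, one bit each,
# so a single-error-correcting code per word can repair all of them.
burst4 = bits_hit_per_word(5, 4)

# A 5-cell burst wraps around and hits one word twice.
burst5 = bits_hit_per_word(5, 5)
```

In this model a burst no wider than the mux factor never puts two errors in the same word; scrubbing then corrects the single bit errors before a second strike can accumulate in that word.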
-
SoC soft error trends
[Figure, first panel: bitcell SER FIT rate per node (0 to 700), x axis from 200 down to 0; series: SCU Avg/node, MCU Avg/node]
[Figure, second panel: SoC SER FIT rate per node (log scale, 1 to 1000), x axis from 200 down to 0; series: Memory SER, Logic SER]
Even though the SER sensitivity per memory bitcell is decreasing, the overall FIT per SoC is increasing.
-
Are all bits equally vulnerable?
Soft error rate = Intrinsic soft error susceptibility × Architectural Vulnerability Factor × Timing Vulnerability Factor
§ Architectural Vulnerability Factor: functional masking
§ Timing Vulnerability Factor: timing masking
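The derating idea can be written as a one-line product. A minimal sketch, assuming the two vulnerability factors are independent fractions between 0 and 1; the function name and the numbers are hypothetical, not measured values:

```python
def effective_ser(intrinsic_fit, avf, tvf):
    """Derate a raw (intrinsic) soft error rate by functional and timing masking."""
    if not (0.0 <= avf <= 1.0 and 0.0 <= tvf <= 1.0):
        raise ValueError("vulnerability factors must lie in [0, 1]")
    return intrinsic_fit * avf * tvf

# Hypothetical flip-flop: 1000 FIT raw, 25% architecturally vulnerable,
# 50% of the clock cycle timing-vulnerable.
fit = effective_ser(1000.0, 0.25, 0.5)
```

A bit with a low product contributes little to the observed error rate, which is why not all bits need the same protection.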
-
Soft error in logic
§ No “cheap” way to protect: spatially distributed bits
§ Vulnerability factor analysis helps to identify critical ones
[Figure: architecture and workload feed critical flip-flop identification, via fault injection, formal methods, …]
-
Error injection flow
Start application → Pick random injection cycle → Pick random register → Pick random bit → Bit flip → Application outcome → Repeat
Application outcome: Vanished, Output Mismatch, Unexpected Termination, Hang
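The flow above can be sketched on a toy workload. Everything here is an assumption for illustration: the register file size, the workload (a running sum in register 0) and the function names are hypothetical, and only the Vanished and Output Mismatch outcomes are modelled, not Unexpected Termination or Hang.

```python
import random

REGISTERS, BITS, CYCLES = 4, 8, 16

def run(inject=None):
    """Toy workload: accumulate 1..CYCLES into r0; r1..r3 are never read.

    inject is an optional (cycle, register, bit) tuple describing one bit flip.
    """
    regs = [0] * REGISTERS
    for cycle in range(CYCLES):
        if inject is not None and inject[0] == cycle:
            _, reg, bit = inject
            regs[reg] ^= 1 << bit  # the injected bit flip
        regs[0] = (regs[0] + cycle + 1) % (1 << BITS)
    return regs[0]

def campaign(trials, seed=0):
    """Repeat: pick random cycle, register and bit, flip it, classify the outcome."""
    rng = random.Random(seed)
    golden = run()
    outcomes = {"vanished": 0, "output mismatch": 0}
    for _ in range(trials):
        fault = (rng.randrange(CYCLES), rng.randrange(REGISTERS), rng.randrange(BITS))
        key = "vanished" if run(fault) == golden else "output mismatch"
        outcomes[key] += 1
    return outcomes

results = campaign(200)
```

In this toy, flips in the unused registers always vanish while flips in r0 always corrupt the output, which is the kind of per-register difference that critical flip-flop identification looks for.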
-
Fault injection results
The key message here is that even though most faults vanish, we still need to worry about the remaining 4-8% of faults.
-
Cross layer resilience approach
§ Trade-off axes: Reliability, Performance, Power, Area
§ Resilience library across layers: Circuit, Logic, Architecture, SIHFT, Application
§ Fault injection guides cross-layer, physical design aware exploration
§ Inputs: ARM IP (ARM core), technology library, SP&R flow, emulation
[Figure: circuit resilience example. Design and workload → fault injection → critical FF identification → FIT analysis → ECO with SER optimized library (using library SER data) → optimized design, evaluated on FIT, power, area and performance.]
-
FinFET and soft error rate
$\mathrm{SER} = A_{\mathrm{diff}}\, e^{-Q_{\mathrm{crit}}/Q_{\mathrm{coll}}}$
Qcoll ↓ leads to SER ↓
[Figure: collected charge versus LET, planar SRAM versus FinFET SRAM. Source: Fang and Oates, TDMR 2011]
[Figure: soft error rate versus technology node, showing the reduction due to FinFET. Source: S. Ramey, et al, IRPS 2013]
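The exponential dependence on collected charge is easy to check numerically. A minimal sketch of the slide's expression; the function name and the charge values are hypothetical and in arbitrary units:

```python
import math

def relative_ser(a_diff, q_crit, q_coll):
    """Evaluate A_diff * exp(-Q_crit / Q_coll) from the slide's SER expression."""
    return a_diff * math.exp(-q_crit / q_coll)

# Same diffusion area and critical charge; halving the collected charge
# (as a FinFET might, relative to planar) lowers the result.
planar = relative_ser(1.0, 1.0, 1.0)
finfet = relative_ser(1.0, 1.0, 0.5)
```

With these illustrative numbers the exponent goes from -1 to -2, so the computed rate falls by a factor of e.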
-
Random telegraph noise (RTN)
§ Discrete level current fluctuation with time
§ Caused by charge trapping/de-trapping in dielectric
§ Results in 1/f noise for large FETs
[Figure: drain current versus time (s), showing discrete level steps with a 10% marker]
Source: K. Takeuchi, Renesas, VMC 2011
-
RTN scaling
RTN scales faster ($1/LW$) than other variability phenomena, which scale as $1/(LW)^{1/2}$
Source: K. Takeuchi, Renesas, VMC 2011
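The difference between the two scaling laws can be seen by shrinking L and W by the same factor. A minimal sketch; the function names and the shrink factor are hypothetical:

```python
import math

def rtn_growth(shrink):
    """Growth of a quantity that scales as 1/(L*W) when L and W both scale by `shrink`."""
    return 1.0 / (shrink * shrink)

def other_variability_growth(shrink):
    """Growth of a quantity that scales as 1/sqrt(L*W) under the same shrink."""
    return 1.0 / math.sqrt(shrink * shrink)

# Halving both L and W (hypothetical shrink, for round numbers).
rtn = rtn_growth(0.5)
other = other_variability_growth(0.5)
```

Under this halving the 1/LW term grows 4x while the 1/(LW)^(1/2) term grows 2x, so RTN takes a growing share of total variability as devices shrink.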
-
Phase 3: Reliability towards end of life
[Figure: failure rate versus time, with time marks at 1 – 20 weeks and 3 – 10 years]
§ Transistor wearout: BTI, TDDB
§ Metal wearout: Electromigration
-
Wearout/aging
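[Figure: wearout mechanisms mapped to the parameters they degrade. Transistor parameters Vt, gm, Ids, Ioff and Igate are labelled with BTI and TDDB; resistance R is labelled with EM.]
BTI → Bias Temperature Instability
TDDB → Time Dependent Dielectric Breakdown
EM → Electromigration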
-
Bias Temperature Instability (BTI)
§ Reliability concern at Si-SiON interface
§ Gradual shift in transistor parameters with time
[Figure: PMOS cross-section (silicon, gate oxide, poly) with Si-H bonds at the interface. Negative bias: Si-H bond dissociation. Zero bias: Si-H bond recovery.]
[Figure: ΔVt versus time, rising during the stress stage (Si-H bond dissociation) and falling during the recovery stage (Si-H bond recovery)]
-
Wearout: Impact on circuits
Combinational cells
§ Fmax ↓
§ Timing failure as circuits age
State-holding cells (bit cells, flip-flops)
§ Static Noise Margin ↓
§ Read and write stability ↓
§ Parametric yield loss
-
Delay Degradation ≠ Failure
But there could be hold errors due to aging in clock path
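One way to read this slide is through the usual setup and hold slack checks. The sketch below uses a common textbook sign convention (skew = capture clock arrival minus launch clock arrival); the function names and all numbers, in picoseconds, are hypothetical and not taken from the talk.

```python
def setup_slack(period, data_delay_max, setup_time, skew=0):
    """Positive slack means the slowest data path still meets the clock period."""
    return period + skew - data_delay_max - setup_time

def hold_slack(data_delay_min, hold_time, skew=0):
    """Positive slack means the fastest data path does not race through."""
    return data_delay_min - skew - hold_time

# Data path ages by 10% (800 ps to 880 ps): slower, but still passes setup.
fresh_setup = setup_slack(1000, 800, 50)
aged_setup = setup_slack(1000, 880, 50)

# Capture clock path ages by 40 ps: skew grows and the hold check fails.
fresh_hold = hold_slack(60, 30)
aged_hold = hold_slack(60, 30, skew=40)
```

In this example the aged data path keeps 70 ps of setup slack, so degradation alone is not a failure, while unequal aging of the clock paths pushes hold slack to -10 ps.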
-
CPU workload dependent aging: A case study
§ Mid-size ARM CPU
  § > 100K instances, > 10K sequentials
  § Large enough to be interesting
  § Small enough for rapid turnaround time
§ Simulation settings
  § Vdd = 0.9V, Temp = 105 °C, Lifetime = 3 years
  § CP 28LP standard-cell library
  § >1K cell-topologies, >10K timing-arcs
§ NBTI and PBTI aging model
  § RD: Reaction-Diffusion [Gielen 11, Zheng 09]
  § TD: Trapping-Detrapping [Velamala 12]
-
Instance based simulation flow
For more details: Workload Dependent NBTI and PBTI Analysis for a sub-45nm Commercial Microprocessor, Intl. Reliability Physics Symposium (IRPS), 2013.
-
Block and Path Timing Degradation
§ Dhrystone workload
[Figure: path histogram (path count, ×10^4) versus % timing degradation (6 to 18); max = 17.6%, avg = 9.5%, min = 4.9%]
[Figure: % timing degradation by block]
-
Path Rank Analysis
§ Rank paths in fresh and aged design, sorted by slack
§ Non-critical paths can become critical and vice versa

Path rank (Fresh)   Path rank (Aged, Dhrystone)   % Timing Degradation
1                   14084                         7.64
2                   9781                          7.94
3                   9329                          8.02
4                   12345                         7.87
5                   6220                          8.31
6                   36672                         7.16
7                   7771                          8.19
8                   11580                         7.96
9                   28975                         7.40
10                  20054                         7.66

Path rank (Aged, Dhrystone)   Path rank (Fresh)   % Timing Degradation
1                             179394              15.61
2                             145042              15.41
3                             134419              15.18
4                             1413427             17.57
5                             272323              15.67
6                             224034              15.46
7                             331934              15.76
8                             275422              15.56
9                             481425              16.06
10                            208561              15.24
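The re-ranking effect in the tables can be reproduced with two paths. A minimal sketch, assuming slack is simply the clock period minus the path delay; the path names, delays and degradation percentages are hypothetical.

```python
def rank_by_slack(slacks):
    """Return {path: rank}, where rank 1 is the smallest slack (most critical)."""
    ordered = sorted(slacks, key=slacks.get)
    return {path: position + 1 for position, path in enumerate(ordered)}

def aged_slacks(period, fresh_delays, degradation_pct):
    """Apply a per-path % timing degradation and recompute slack."""
    return {
        path: period - delay * (1 + degradation_pct[path] / 100.0)
        for path, delay in fresh_delays.items()
    }

PERIOD = 1000.0
fresh_delays = {"path_a": 950.0, "path_b": 900.0}
degradation = {"path_a": 5.0, "path_b": 15.0}  # path_b ages three times faster

fresh_rank = rank_by_slack({p: PERIOD - d for p, d in fresh_delays.items()})
aged_rank = rank_by_slack(aged_slacks(PERIOD, fresh_delays, degradation))
```

Here path_a is the most critical path when fresh, but path_b overtakes it after aging because it degrades more, so signing off on the fresh critical paths alone would miss it.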
-
Workload Power-State Trace
[Figure: workload power-state (active / sleep) versus time]

Workload        Average active time
mp3             0.03
web-browse      0.07
3D rendering    0.24
Dhrystone       0.40
video H264      0.54
-
Workload-Dependent Processor Aging
% Timing degradation by workload, for three analysis modes

Workload        Switching-activity    Power-state     Switching-activity and power-state
                RD      TD            RD      TD      RD      TD
mp3             10.0    6.3           3.6     2.5     2.3     1.6
web-browse      10.6    7.3           6.1     4.2     4.1     2.8
3D rendering    12.3    8.4           6.9     4.7     5.4     3.7
Dhrystone       11.2    7.7           10.1    6.9     7.3     5.0
video H264      12.5    8.5           11.4    7.8     9.1     6.2
worst-case      15.6    10.7          15.6    10.7    15.6    10.7

RD: Reaction-Diffusion model
TD: Trapping-Detrapping model
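The table can be read as a measure of how pessimistic a worst-case guardband is for each workload. The sketch below uses the table's combined switching-activity and power-state figures; the function name and the "excess margin" framing are assumptions added for illustration.

```python
# % timing degradation from the table (switching-activity and power-state columns).
WORST_CASE = {"RD": 15.6, "TD": 10.7}
COMBINED = {
    "mp3": {"RD": 2.3, "TD": 1.6},
    "web-browse": {"RD": 4.1, "TD": 2.8},
    "3D rendering": {"RD": 5.4, "TD": 3.7},
    "Dhrystone": {"RD": 7.3, "TD": 5.0},
    "video H264": {"RD": 9.1, "TD": 6.2},
}

def excess_margin(workload, model):
    """Percentage points of worst-case degradation that this workload never reaches."""
    return WORST_CASE[model] - COMBINED[workload][model]

margins_rd = {name: excess_margin(name, "RD") for name in COMBINED}
```

Even the heaviest workload in the table (video H264) stays several points below the worst-case figure under both models.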
-
Conclusions
§ Reliability is more critical than ever for RAS-critical designs (servers, HPC, etc.)
§ Errors can be mitigated at the device, circuit, architecture or even software level
  § Lots of opportunities to design reliable systems from the ground up
§ As we scale, some things are becoming worse, some better
  § BTI, TDDB, EM ↑
  § SER ↓
§ Future technology challenges
  § Need to keep an eye on the evolution of future devices
  § Quantification of wearout impact on CPU PPA scaling
  § RTN scaling in the era of new devices (nanowires, compound semiconductors, CNT)
-
Fin