building reliable chips in future technologies: fact, fiction or ...salishan conference on...

45
1 Salishan Conference on High-Speed Computing, 2015 Building reliable chips in future technologies: Fact, fiction or an oxymoron? Vikas Chandra ARM Research San Jose, CA

Upload: others

Post on 15-Feb-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • 1 Salishan Conference on High-Speed Computing, 2015

    Building reliable chips in future technologies: Fact, fiction or an oxymoron?

    Vikas Chandra ARM Research

    San Jose, CA

  • 2 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Acknowledgments

    § Rob Aitken (ARM) § Greg Yeric (ARM) §  Subhasish Mitra (Stanford)

  • 3 Salishan Conference on High-Speed Computing, 2015

    Some things are beyond our control!

    Courtesy Takahashi Kaito (SII Nanotechnology Inc.)

  • 4 Salishan Conference on High-Speed Computing, 2015

    The eras of computing

    1960 1970 1980 1990 2000 2010 2020

    Uni

    ts

    1M

    10M Mainframe

    Mini

    1st Era

    100M

    1 Billion PC

    Desktop Internet

    2nd Era

    100 Billion The Internet of Things

    10 Billion Mobile Internet

    2025

    § Cheap §  Low power § Ubiquitous

    2030

  • 5 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Need for reliability

  • 6 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Lifetime profiles

    Expected Lifetime

    Usa

    ge

    low high

    high

  • 7 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Feature size scaling

    § Impact of feature size scaling § Charge discretization § Random manufacturing defects §  Increasing electric field § Thin gate oxides §  Interface defects at Si/SiON interface § Metal defects §  Susceptibility to soft errors

    Density doubling ~ every 2.5 Years

    0.028

    3 2

    1.5

    0.8 1.0

    0.5 0.35

    0.25 0.18

    0.09 0.13

    0.065 0.045

    0.032

    0.022

    § Beyond 14nm, scaling is more reliant on new materials §  Will require more reliability analysis

    Low-K ILD

    Hi-K

    FinFET

    Cu Strain

  • 8 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    CMOS Switches: 19xx - 2015

  • 9 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Devices 2015 – 2025

  • 10 Salishan Conference on High-Speed Computing, 2015

    Technology scaling overview

    1980 1990 2000 2010 1970

    Com

    plex

    ity

    Transistors

    Patterning

    Interconnect

    Al wires

    NMOS

    2020

    Planar CMOS

    HKMG Strain

    CU wires

    PMOS LE, ~λ

    LE,

  • 11 Salishan Conference on High-Speed Computing, 2015

    Technology scaling overview

    1980 1990 2000 2010 1970

    Com

    plex

    ity

    Transistors

    Patterning

    Interconnect

    Al wires

    NMOS

    2020

    Planar CMOS

    HKMG Strain

    CU wires

    PMOS LE, ~λ

    LE,

  • 12 Salishan Conference on High-Speed Computing, 2015

    FinFET LELE

    Technology complexity inflection point?

    1980 1990 2000 2010 1970

    Com

    plex

    ity

    Transistors

    Patterning

    Interconnect

    Al wires

    NMOS

    2020

    Planar CMOS

    HKMG Strain

    CU wires

    PMOS LE, ~λ

    LE,

  • 13 Salishan Conference on High-Speed Computing, 2015

    Technology complexity inflection point?

    1980 1990 2000 2010 1970

    Com

    plex

    ity

    Transistors

    Patterning

    Interconnect

    Al wires

    NMOS

    2020

    Planar CMOS

    HKMG Strain

    CU wires

    PMOS LE, ~λ

    LE,

  • 14 Salishan Conference on High-Speed Computing, 2015

    LE

    Future technology

    2010 2015 2020 2025 2005

    Com

    plex

    ity Transistors

    Patterning

    Interconnect

    FinFET

    LELE

    SADP

    LELELE

    EUV

    HNW

    µ-enh

    SAQP

    // 3DIC Graphene wire, CNT via

    EUV LELE

    VNW

    Opto I/O

    EUV + DWEB

    2D: C, MoS

    EUV + DSA

    Opto int Spintronics

    Seq. 3D

    Al / Cu / W wires

    NEMS

    Planar CMOS

    W LI

    10nm

    eNVM

    Cu doping

    7nm 5nm 3nm

    1D: CNT

    You are here

  • 15 Salishan Conference on High-Speed Computing, 2015

    LE

    Future transistors

    2010 2015 2020 2025 2005

    Com

    plex

    ity Transistors

    Patterning

    Interconnect

    FinFET

    LELE

    SADP

    LELELE

    EUV

    HNW

    µ-enh

    SAQP

    // 3DIC Graphene wire, CNT via

    EUV LELE

    VNW

    Opto I/O

    EUV + DWEB

    2D: C, MoS

    EUV + DSA

    Opto int Spintronics

    Seq. 3D

    Al / Cu / W wires

    NEMS

    Planar CMOS

    W LI

    10nm

    eNVM

    Cu doping

    7nm 5nm 3nm

    1D: CNT

  • 16 Salishan Conference on High-Speed Computing, 2015

    LE

    Future technology: patterning

    2010 2015 2020 2025 2005

    Com

    plex

    ity Transistors

    Patterning

    Interconnect

    FinFET

    LELE

    SADP

    LELELE

    EUV

    HNW

    µ-enh

    SAQP

    // 3DIC Graphene wire, CNT via

    EUV LELE

    VNW

    Opto I/O

    EUV + DWEB

    2D: C, MoS

    EUV + DSA

    Opto int Spintronics

    Seq. 3D

    Al / Cu / W wires

    NEMS

    Planar CMOS

    W LI

    10nm

    eNVM

    Cu doping

    7nm 5nm 3nm

    1D: CNT

  • 17 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Device lifetime and failure rate

    1 – 20 weeks

    Normal lifetime Early life failures Wearout

    3 – 10 years time

    Failu

    re r

    ate

    Increasing manufacturing

    defects

    Increasing transient errors

    Acceleration of aging

    phenomena

  • 18 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Unreliable transistors – 3 phases

    Gate to source shorts Insulator cracks Thin oxide defects Small opens Poor vias/contacts

    Burn-in

    Soft errors - Memory/Flip-flops - Combinational cells Noise - Electrical - RTN

    Design Architecture

    ECC

    Transistor wearout - BTI, TDDB

    Metal wearout - Electromigration

    ILD wearout

    Design margins Overdesign

    1 2 3 Early life failures Normal lifetime Wearout/Aging

  • 19 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Soft error: Impact on circuits 1 0 0 1

    upset

    transient

    Storage cells - SRAM bit cells, Flip-flops

    Combinational cells

    Flip-flops

    Un protected memory

    Soft error contribution rate

  • 20 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Big system reliability §  “Always on” system are especially vulnerable to soft error

    §  Vulnerability further increases substantially at low supply voltage § As technology scales down, number of cores scales up

    §  Rates of failures increases: From 2 weeks (current) to 1 hour

    Source: Draft ICiS 2012 Reliability Workshop

  • 21 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    SRAM upsets 0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  1  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    Single  cell  upset  

    Mul0  cell  upset  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  0  0  0  0  0  

    0  0  1  

    1  0  0  0  

    0  1  

    1  1  

    1  0  0  

    0  0  

    0  1  

    0  0  0  

    0  0  1  0  0  0  0  

    0  0  

    1  0  0  0  0  

    cell  area  ê  cell  capacitances  ê  

    Qcoll  ê      

    Feature  size  scaling    

    1   2   3   4   1   2   3   4  Word  arrangement  in  a  mux-‐4  arch  

  • 22 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    FIT rate in 28nm: SRAM

  • 23 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Soft error mitigation - SRAM

    §  ECC with physical interleaving reduces the FIT rate to ~0 §  Physical interleaving: multi cell error à single bit error § Temporal scrubbing to correct single bit errors

    Error Correcting

    Code

    Area

    Power Error rate

    f

    f Memory Compare

    Corrector

    Data in

    Data out

    Error signal

    M

    K

    M

    K

    K

    M

  • 24 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    SoC soft error trends Bitcell SER FIT rate per node

    0

    100

    200

    300

    400

    500

    600

    700

    200 150 100 50 0

    SCU Avg/node MCU Avg/node

    SoC SER FIT rate per node

    1

    10

    100

    1000

    200 150 100 50 0

    Memory SER Logic SER

    Even though per memory bitcell SER sensitivity is decreasing, overall FIT per SoC is increasing

  • 25 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Are all bits equally vulnerable?

    Intrinsic soft error

    susceptibility

    Architectural Vulnerability

    Factor

    Timing Vulnerability

    Factor

    Soft error rate

    Functional masking

    Timing masking

  • 26 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Soft error in logic § No “cheap” way to protect - spatially distributed bits § Vulnerability factor analysis helps to identify critical ones

    Architecture Workload

    Critical flip-flop

    identification

    - Fault injection - Formal methods - …

  • 27 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Error injection flow

    Start application

    Application outcome

    Pick random injection cycle

    Bit flip

    Pick random register

    Pick random bit Rep

    eat

    Vanished, Output Mismatch Unexpected Termination, Hang

  • 28 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Fault injection results

    The  Key  message  here  is  that  even  though  most  faults  vanish,  we  s6ll  need  to  worry  about  the  remaining  4-‐8%  of  faults.  

  • 29 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Cross layer resilience approach Reliability Performance Power Area

    Resilience Library

    Circuit Logic

    Architecture

    SIHFT

    Application

    Fault Injection Guides cross-layer

    Physical Design Aware Exploration

    ARM IP ARM Core

    Technology Library SP&R flow

    Emulation

    Cross layer resilience

    Design

    Workload

    Critical FF identification

    FIT analysis

    Fault Injection Optimized

    Design

    ECO

    SER optimized

    library

    FIT, power, area, performance

    Workload

    Workload

    Library SER data

    Circuit resilience example

  • 30 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    FinFET and soft error rate

    SER = Adiff e(-Qcrit/Qcoll)

    Qcoll ê leads to SER ê

    LET

    Col

    lect

    ed c

    harg

    e Planar SRAM FinFET SRAM

    Source: Fang and Oates, TDMR 2011

    Technology node

    Soft

    erro

    r ra

    te

    Reduction due to FinFET

    Source: S. Ramey, et al, IRPS 2013

  • 31 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Random telegraph noise (RTN)

    § Discrete level current fluctuation with time § Caused by charge trapping/de-trapping in dielectric § Behavior is results in 1/f noise for large FETs

    Drain current Dra

    in c

    urre

    nt

    Time (s)

    10%

    Source: K. Takeuchi, Renesas, VMC 2011

  • 32 Salishan Conference on High-Speed Computing, 2015

    RTN scaling

    RTN scales faster (1/LW) as opposed to other variability phenomena which scale as 1/(LW)1/2

    Sour

    ce: K

    . Tak

    euch

    i, R

    enes

    as, V

    MC

    201

    1

  • 33 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Phase 3: Reliability towards end of life

    1 – 20 weeks 3 – 10 years time

    Failu

    re r

    ate

    Transistor wearout - BTI, TDDB

    Metal wearout - Electromigration

  • 34 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Wearout/aging

    Vt gm Ids Ioff

    Transistor

    Igate

    BTI

    BTI TDDB

    BTI TDDB BTI

    TDDB

    R

    EM

    BTI à Bias Temperature Instability

    TDDB à Time Dependent Dielectric Breakdown

    EM à Electromigration

  • 35 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Bias Temperature Instability (BTI)

    § Reliability concern at Si-SiON interface § Gradual shift in transistor parameters with time

    Si Si Si Si Si

    * H

    * H

    H2 H

    H

    Silicon Gate oxide Poly

    * H

    Si-H bond recovery

    ΔVt

    Time

    Si-H bond disassociation

    Stress stage Recovery stage

    PMOS

    Negative Bias: Si-H bond disassociation

    Zero Bias: Si-H bond recovery

    ++

    ++

    H

    HH0

    VDD

    VDD

    -VDD

    +

    +

    H

    H

    VDD or 0

    VDD

    VDD0

  • 36 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Wearout: Impact on circuits

    Combinational State-holding cells (bit cells, flip-flops)

    §  Fmax ê § Timing failure as circuits age

    D Q

    clk

    §  Static Noise Margin ê § Read and write stability ê §  Parametric yield loss

  • 37 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Delay Degradation ≠ Failure

    But there could be hold errors due to aging in clock path

  • 38 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    CPU workload dependent aging: A case study

    § Mid-size ARM CPU §  > 100K instances, > 10K sequentials §  Large enough to be interesting §  Small enough for rapid turnaround time

    §  Simulation settings §  Vdd = 0.9V, Temp = 105 oC, Lifetime = 3 years §  CP 28LP standard-cell library §  >1K cell-topologies, >10K timing-arcs

    §  NBTI and PBTI aging model §  RD: Reaction-Diffusion [Gielen 11, Zheng 09] §  TD: Trapping-Detrapping [Velamala 12]

  • 39 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Instance based simulation flow

    For more details: Workload Dependent NBTI and PBTI Analysis for a sub-45nm Commercial Microprocessor, Intl. Reliability Physics Symposium (IRPS), 2013.

  • 40 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Block and Path Timing Degradation

    § Dhrystone workload x 10

    6 9 12 15 18 0

    5

    10 4

    max = 17.6% avg = 9.5% min = 4.9%

    % Timing Degradation

    Pat

    h H

    isto

    gram

    0 10 20 0

    5

    10

    Block

    % T

    imin

    g D

    egra

    datio

    n

  • 41 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Path Rank Analysis § Rank paths in fresh and aged design, sorted by slack § Non-critical paths can become critical and vice versa

    Path rank % Timing

    Degradation Fresh Aged (Dhrystone)

    1 14084 7.64 2 9781 7.94 3 9329 8.02 4 12345 7.87 5 6220 8.31 6 36672 7.16 7 7771 8.19 8 11580 7.96 9 28975 7.40 10 20054 7.66

    Path rank % Timing

    Degradation Aged (Dhrystone) Fresh

    1 179394 15.61 2 145042 15.41 3 134419 15.18 4 1413427 17.57 5 272323 15.67 6 224034 15.46 7 331934 15.76 8 275422 15.56 9 481425 16.06 10 208561 15.24

  • 42 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Workload Power-State Trace

    Workload Power-State (Active / Sleep)

    vs. Time Average

    Active Time

    mp3 0.03

    web-browse 0.07

    3D rendering 0.24

    Dhrystone 0.40

    video H264 0.54

  • 43 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Workload-Dependent Processor Aging

    Workload

    % Timing Degradation

    Switching-activity Power-state

    Switching-activity and power-state

    RD TD RD TD RD TD

    mp3 10.0 6.3 3.6 2.5 2.3 1.6 web-browse 10.6 7.3 6.1 4.2 4.1 2.8 3D rendering 12.3 8.4 6.9 4.7 5.4 3.7

    Dhrystone 11.2 7.7 10.1 6.9 7.3 5.0 video H264 12.5 8.5 11.4 7.8 9.1 6.2 worst-case 15.6 10.7 15.6 10.7 15.6 10.7

    RD: Reaction-Diffusion model TD: Trapping-Detrapping model

  • 44 Salishan Conference on High-Speed Computing, 2015 Salishan Conference on High-Speed Computing, 2015

    Conclusions

    § Reliability is more critical than ever for RAS critical designs – Servers, HPC etc.

    §  Errors can be mitigated at devices, circuits, architecture or even software level §  Lots of opportunities to design reliable systems from ground up

    § As we scale, some things are becoming worse, some better §  BTI, TDDB, EM é §  SER ê

    §  Future technology challenges §  Need to keep an eye on the evolution of future devices §  Quantification of wearout impact on CPU PPA scaling §  RTN scaling in the era of new devices (nanowires, compound

    semiconductors, CNT)

  • 45 Salishan Conference on High-Speed Computing, 2015

    Fin