Engineering the right accelerated life tests for reliability qualification: customer use conditions vs. industry standards based approaches

Presenter: Sudarshan Rangaraj ([email protected]), Hardware Reliability Manager – Amazon Lab126, Sunnyvale CA

Based largely on papers authored at IRPS, IITC, ECTC and review of literature

Acknowledgements: current and former colleagues at Intel and Amazon Lab126



Motivation and Relevance

• Industry standards, e.g. JEDEC, AEC, MIL, provide qualification criteria, e.g.
  – HAST: 130C/85%RH/96 hours, 0 fails / 45 tested
  – Bake: 150C, 1000 hours

• Blanket qualification criteria without knowledge of product use conditions (UC) can be undesirable:
  – Over-design, extra cost for reliability margin most customers will not use
  – Field failures: negative to user experience and company brand

• Goal of reliability engineering:
  – Start with the customer
  – Use field intelligence to develop UC models, compare them to standards
  – Strive to meet the higher bar; reliability can be a marketing advantage!


Advantages of standards based testing

• Allows suppliers and their customers to speak a common language

• Helps overcome differences in reliability certification methodology, helps clarify expectations

• Guarantees a consistent reliability bar

• Valuable in well established industries


Importance of understanding usage conditions

• A robust reliability qualification process protects the customer, i.e. ensures sufficient reliability, while optimizing cost for the manufacturer

• Three elements of robust reliability engineering:
  1. Quantified understanding of customer usage patterns and use conditions
  2. Well designed accelerated life tests
  3. Acceleration models (of sufficiently high confidence) that link the two

• Pitfalls of not making an accurate link between stress and use conditions:
  – Over-design: added cost and impact to the bottom line
  – Under-design: high customer returns; poor experience erodes brand


Talk outline

• Overview of common failure mechanisms in IC components

• Analysis of field use condition data: review one example

• Contrast use condition knowledge based qualification with standards based qualification using 2 case studies:
  1. Moisture and voltage bias induced failures in IC components
  2. Temperature cycling failures in IC components


IC component – package stack-up

• Silicon substrate
• Devices: front-end
• Metals/via: back-end with ultra low-k ILD
• Metals/via: far-back-end with polymer ILD
• Bumps: C4 with Cu – Pb-free solder
• Package: metals/via

Images from proceedings of IITC 2013


Some common failure modes in IC components and associated extreme use conditions

Reliability failure mechanism | Extreme use condition
1. Front end: transistor gate di-electric reliability | High power states at high voltage, frequency, temperature and current
2. Backend: di-electric breakdown | (same as 1)
3. Backend & bumps: electro-migration | (same as 1)
4. Backend: stress voiding | Sustained operation at high temperatures
5. Moisture ingress: de-lamination, electro-chemical corrosion, metal migration, pop-corning etc. | Low power modes like OFF/Stand-by; high humidity and temperature ambient conditions, e.g. 25C 80% RH
6. Temperature cycling: cracking and de-lamination | Repeated cold temperature exposures when part may be OFF; power cycles when part is ON

• Dominant failure modes for an IC used in a server, cell-phone and a wearable device will be very different because usage is different!


Chip operating states

[Figure: effective RH vs. temperature at the part surface, with three operating states marked: OFF (low T, high RH), STAND-BY (higher T, lower RH), ON (high T, low RH)]

OFF and STAND-BY modes are the critical states for moisture absorption into the chip/package: highest RH at the part surface

• OFF mode: chip and package at ambient T, ambient RH at part surface
• STAND-BY mode: ambient T + self-heating (~10C) from a few “always ON” IO pins
• ON state: chip at high T, low RH at the part surface
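The drop in surface RH with self-heating can be illustrated with a short calculation. The sketch below is not from the talk: it assumes the water vapor partial pressure at the part surface equals that of the ambient air, and it uses the Magnus approximation (one commonly used coefficient set) for saturation vapor pressure. The function names are hypothetical.

```python
import math

def saturation_vapor_pressure_hpa(temp_c):
    """Magnus approximation of saturation vapor pressure over water, in hPa."""
    return 6.1094 * math.exp(17.625 * temp_c / (temp_c + 243.04))

def effective_rh(ambient_rh_pct, ambient_temp_c, surface_temp_c):
    """RH seen at a surface warmer than ambient (assumed model, see lead-in).

    The vapor partial pressure is taken from the ambient air and divided by
    the saturation pressure at the (warmer) surface temperature.
    """
    vapor_pressure = ambient_rh_pct / 100 * saturation_vapor_pressure_hpa(ambient_temp_c)
    return 100 * vapor_pressure / saturation_vapor_pressure_hpa(surface_temp_c)

# OFF: surface at ambient; STAND-BY: ~10C self-heating; ON: assumed 40C rise.
for state, rise_c in (("OFF", 0), ("STAND-BY", 10), ("ON", 40)):
    rh = effective_rh(80, 25, 25 + rise_c)
    print(f"{state}: surface RH ~ {rh:.0f}%")
```

Under these assumptions, a 25C 80% RH ambient corresponds to about 45% RH at a surface that is 10C warmer, which is why the OFF state is the most severe for moisture uptake. The 40C rise used for the ON state is an illustrative value only.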


Use conditions by product segment: risk from moisture

Market segment | ON time as fraction of product lifetime | OFF/STAND-BY events, durations | Ambient environments
Servers, High Performance computing & high end Desktop | Very large | Very few events of short duration | Controlled T, RH in data centers and server farms
Desktop enterprise | Lower | Sizeable | Indoor T, RH
Mobile - laptop | Lower | Sizeable number of longer duration events | Some outdoor T, RH exposure; worse in hot humid GEOs
Ultra-mobile: Tablet, smartphone | Lower | Sizeable number of longer duration events | Often outdoor T, RH; worse in hot humid GEOs
Wearables/IoT | A new set of applications, still being understood?

(Arrow alongside the table: increasing moisture risk, from servers toward mobile devices)


Events leading to moisture exposure

• Packaging/Assembly operations: factory floor

• Customer warehouses during storage

• Customer factories during surface mount

• Usage by end customer especially in hot + humid locations


Failure modes due to moisture and temperature cycling

[Image: package blister]

Package blistering and cracking between copper traces after surface mount on to system motherboard, a.k.a. “pop-corning” [Literature]

Edge de-lamination after temp-cycle B (125 to -55C) on very early 22nm silicon process (Proceedings of ECTC 2013)


Moisture diffusion under a 25C 80% RH ambient exposure

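[Figure: finite element modeling of moisture diffusion; normalized concentration C/CSAT vs. time (days) at 25C 80% RH, for diffusion through the underfill and through the package (PKG), with concentration maps of the chip and package at 7 days and 50 days]

• Under sustained exposure, moisture is confined to the edge 1mm of the chip/package

• Consistent with empirical failure observations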


Mining use conditions: data collection and analysis

• Customer profile data from ~2000 worldwide laptop users for one year

• OFF (shutdown), STAND-BY and HIBERNATE times recorded; data used to generate distributions

Format of user data:

User ID | OFF time | STAND-BY time
1 | {-, -, -, …} | {-, -, -, …}
2 | {-, -, -, …} | {-, -, -, …}
3 | {-, -, -, …} | {-, -, -, …}
…
2123 | {-, -, -, …} | {-, -, -, …}

• Distributions combining all data from all users

• Distribution of Max{OFF time} and Max{STAND-BY time} per user
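The two kinds of distribution described for this data set can be sketched in a few lines. This is a minimal illustration, not the analysis used in the talk: the records are made-up numbers, the nearest-rank percentile is one of several possible definitions, and all names are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical records: user ID -> list of OFF durations in hours (made-up values).
off_hours_by_user = {
    1: [8, 10, 9, 60],
    2: [12, 14, 200],
    3: [7, 8, 8, 9, 30],
}

# Distribution 1: pool every recorded duration from every user.
pooled = [d for durations in off_hours_by_user.values() for d in durations]

# Distribution 2: one value per user, the longest OFF time that user recorded.
max_per_user = [max(durations) for durations in off_hours_by_user.values()]

pooled_p50 = percentile(pooled, 50)
worst_user_p95 = percentile(max_per_user, 95)
print(f"Pooled median OFF time: {pooled_p50} h")
print(f"95th percentile of per-user maximum OFF time: {worst_user_p95} h")
```

The pooled distribution describes a typical event, while the per-user maximum describes the longest exposure each device sees, which is the more conservative basis for a worst-case requirement.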


Moisture exposure in use condition: user data

[Figure: cumulative probability vs. time (hours), 0 to 100 hours: non-S0 duration distribution, all data from 2000 users]

[Figure: cumulative probability vs. time (days), 0 to 225 days: Max{OFF time}, i.e. 100th percentile per user]

Percentile annotations on the plots: 99th percentile = 4 days; 99.5th percentile = 7 days; 95th percentile = 50 days

• Standby/Off times: nominal = 7 days, worst case = 50 days

• Conservative ambient condition: 25C 80% RH; 20% of cities in the world experience this for 5% of the year, i.e. a 95th percentile condition from surveys


Phenomenological Acceleration Model for dominant moisture induced chip – package failure modes

Variable | Range used in study | Acceleration factor
Temperature | 85 – 130C | Ea = 0.71 eV (90% CL lower bound)
RH | 65 – 85% | n = 4 (best estimate)
Voltage (V) | 1.2 – 3.3V | m = 0.5 (best estimate), Vt = 1.4V

Peck’s law fits empirically observed HAST fails

• Temperature – strongest variable
• Relative humidity and voltage – relatively weaker effects
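As a rough illustration of how such parameters are used, the sketch below evaluates the temperature and humidity terms of Peck's law, in which time to failure scales as RH^(-n) · exp(Ea/kT). It is a simplified sketch under stated assumptions: the voltage term is omitted because the slide does not give its functional form, a single activation energy is used over the whole temperature range, and the function name is hypothetical. Its numbers are therefore illustrative and are not a reproduction of the stress times quoted elsewhere in the talk.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K

def peck_acceleration_factor(stress_temp_c, stress_rh_pct, use_temp_c, use_rh_pct,
                             ea_ev=0.71, rh_exponent=4.0):
    """Acceleration factor of a stress condition relative to a use condition.

    Temperature and humidity terms only; the voltage term is deliberately omitted.
    """
    stress_temp_k = stress_temp_c + 273.15
    use_temp_k = use_temp_c + 273.15
    humidity_term = (stress_rh_pct / use_rh_pct) ** rh_exponent
    thermal_term = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1 / use_temp_k - 1 / stress_temp_k)
    )
    return humidity_term * thermal_term

# Compare the three HAST temperatures at 85% RH against a 25C 80% RH use condition.
for temp_c in (85, 110, 130):
    af = peck_acceleration_factor(temp_c, 85, 25, 80)
    print(f"{temp_c}C/85%RH vs 25C/80%RH: AF ~ {af:.0f}")
```

With these inputs the humidity term contributes only a factor of about 1.3, while the thermal term spans orders of magnitude, consistent with temperature being the strongest variable.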


Accelerated life testing: failure rate data for a “typical” failure mode

[Figure: Probability Plot (Fitted Arrhenius, Fitted Ln) for start readout; percent vs. time to failure; lognormal, arbitrary censoring, ML estimates]

Table of statistics:

temp (C) | RH (%) | Loc | Scale | AD*
85 | 85 | 7.11990 | 0.658992 | 25.184
110 | 85 | 5.50666 | 0.658992 | 14.916
130 | 65 | 5.92739 | 0.658992 | 55.509
130 | 85 | 4.36013 | 0.658992 | 10.373

[Figure: relation plot, MTTF (Hr) vs. Temp (C), extrapolated to the use condition (UC: 25C) with Ea = 0.44, 0.71 and 1.1; legend: EA-1, EA-2, EA-AVG]

• Thermal acceleration is different in the 130 – 110C and 110 – 85C ranges
• Epoxy glass transition is ~120C; moisture diffusion is over-accelerated above 120C
• Stressing is recommended below the glass transition of packaging polymers; T < TG is what is relevant for the use condition anyway
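The sensitivity of the extrapolation to the choice of activation energy can be made concrete with the Arrhenius relation alone. The sketch below is illustrative only: it takes the three Ea values shown on the relation plot, assumes a 110C stress temperature (below the ~120C glass transition) and a 25C use condition, and ignores humidity and voltage. The function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev, use_temp_c, stress_temp_c):
    """Thermal acceleration factor between a stress and a use temperature."""
    use_temp_k = use_temp_c + 273.15
    stress_temp_k = stress_temp_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1 / use_temp_k - 1 / stress_temp_k))

# Same stress data, three candidate activation energies (eV).
factors = {
    ea_ev: arrhenius_acceleration_factor(ea_ev, use_temp_c=25, stress_temp_c=110)
    for ea_ev in (0.44, 0.71, 1.1)
}
for ea_ev, af in factors.items():
    print(f"Ea = {ea_ev} eV: AF(110C -> 25C) ~ {af:,.0f}")
```

Under these assumptions the projected acceleration from 110C to 25C differs by more than two orders of magnitude between Ea = 0.44 eV and Ea = 1.1 eV, which is why fitting Ea in a temperature range that shares the use-condition physics matters.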


HAST stress durations: use conditions vs. JEDEC JESD22-A110 standard requirements

Stress condition | Nominal: stress time equivalent to 7 days at 25C 80% RH (hrs) | Worst case: stress time equivalent to 50 days at 25C 80% RH (hrs) | JEDEC Std.: JESD22-A110 equivalent readout (hrs)
130C 85% RH | <1 | 5.7 | 96
110C 85% RH | 2.5 | 18 | 264
85C 85% RH | 17 | 121 | 1000

• Conservative worst case (50 days @ 25C 80% RH): JEDEC requirements are 8+ times longer than use condition based requirements

• Intel uses a “test to fail” approach during process development. These gating readouts go beyond use condition based requirements
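The margin between the JEDEC readouts and the worst-case use condition equivalents follows directly from the table values. The short calculation below only divides the numbers quoted on the slide; the variable names are hypothetical.

```python
# (worst-case use condition equivalent hours, JEDEC readout hours), from the table above
stress_hours = {
    "130C 85% RH": (5.7, 96),
    "110C 85% RH": (18, 264),
    "85C 85% RH": (121, 1000),
}

# Ratio of the JEDEC readout to the 50-day-equivalent stress time at each condition.
margin = {
    condition: jedec_hours / use_hours
    for condition, (use_hours, jedec_hours) in stress_hours.items()
}

for condition, ratio in margin.items():
    print(f"{condition}: JEDEC readout is {ratio:.1f}x the worst-case equivalent")
```

The ratio is smallest at 85C 85% RH, a little over 8, and grows to nearly 17 at 130C 85% RH.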


Some thoughts about temperature cycling

JEDEC standard for temp-cycle
Most common: TCB 125 to -55C, 700 cycles

• Having to demonstrate reliability down to -55 or -65C may need a trade-off between reliability and performance/yield:
  – Di-electric constant (electrical performance) vs. fracture toughness
  – Epoxy flow characteristics vs. fracture toughness


Some examples of cold-side effects: material response

[Figure: normalized crack driving energy (0 to 2) vs. cold side temperature (-60 to 30C)]

Crack driving energy (F.E. modeling) rises sharply below -20C

[Figure: strain to fail (0.05 to 0.12) vs. temperature (-55, -30, 25C)]

Measured strain-to-fail drops 2X from 25C to -55C for passivation polymer

Solder fracture toughness drops precipitously below -25C [Literature]

If T < -25C is not relevant to the use condition of the component, then by using TCB for qualification we might be solving problems that are not relevant to customer usage


Risk of over or under-assessing field reliability

Number of cycles at various operating ΔT equivalent to TCB 700 cycles (JEDEC standard)

A simple temp-cycle model (Coffin-Manson):

$N_{f1} / N_{f2} = (\Delta T_2 / \Delta T_1)^n$

• For an always ON server in a controlled environment, TCB 700 cycles may be over-kill
  – No cold exposures, -55C is not relevant
  – At ΔT of 50C, TCB 700 represents 10 – 50 cycles/day for 5 years

• For a part that may get used in a COMMS application with outdoor exposures in Alaska and a 10 year life requirement, TCB 700 under-assesses field reliability

[Figure: number of cycles equivalent to TCB 700 vs. operating ΔT, [Tmax-Tmin], with regions for desktop & servers and highly mobile devices and an example use condition requirement marked]
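The Coffin-Manson scaling can be checked numerically. In the sketch below the exponent n is an assumption: the slide does not state it, and the values 2.5 and 3.8 are chosen here only because they roughly bracket the quoted 10 to 50 cycles/day figure. The function name is hypothetical.

```python
def equivalent_cycles(stress_cycles, stress_delta_t_c, use_delta_t_c, exponent):
    """Coffin-Manson: N_use / N_stress = (dT_stress / dT_use) ** exponent."""
    return stress_cycles * (stress_delta_t_c / use_delta_t_c) ** exponent

TCB_DELTA_T_C = 125 - (-55)  # temp-cycle B swing: 180C
TCB_CYCLES = 700
LIFE_DAYS = 5 * 365          # 5 year life

def cycles_per_day(exponent, use_delta_t_c=50):
    """Field cycles per day over the life that TCB 700 would represent."""
    cycles = equivalent_cycles(TCB_CYCLES, TCB_DELTA_T_C, use_delta_t_c, exponent)
    return cycles / LIFE_DAYS

for exponent in (2.5, 3.8):  # assumed exponents, not taken from the talk
    print(f"n = {exponent}: TCB 700 ~ {cycles_per_day(exponent):.0f} cycles/day at a 50C swing")
```

For a server that sees only a handful of 50C swings per day at most, either exponent implies far more margin than the application needs. The same function with a larger use-condition swing shows how quickly the equivalent cycle count shrinks for parts exposed to cold outdoor environments.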


Key messages

It is important to pick stress conditions that are relevant to worst case usage, to avoid artifacts not relevant to worst case use, e.g. embrittlement

Standards offer a guideline or starting point. Qualification plans should be based on knowledge of use conditions

Limiting failure modes in the components that comprise a system will likely be very different for various applications; standards don’t directly address that
