Parametric-Degradation-Based Virtual Failures and Reliability Assessment and Monitoring
Feng-Bin Sun
HDD Reliability Engineering
Hitachi Global Storage Technologies
San Jose, California
©2011 ASQ & Presentation Feng-Bin Sun
Presented live on Oct 14th, 2010
http://reliabilitycalendar.org/The_Reliability_Calendar/Webinars_‐_English/Webinars_‐_English.html
ASQ Reliability Division English Webinar Series
One of the monthly webinars on topics of interest to reliability engineers.
Disclaimers & Acknowledgements
• This presentation is based on a joint paper by F. Sun, E. Ou, and S. Zhang published at the ISSAT 15th International Conference on Reliability and Quality in Design, August 6-8, 2009, San Francisco, California.
• Deep appreciation to the speaker’s previous employer, especially its management, for their consistent support in the course of this study.
• This paper does not contain any company-specific or product-specific information.
• This presentation focuses on concepts and philosophies, with minimal mathematics.
Table of Contents
• Introduction
• Parametric Degradation Pattern and Quantification During Product Reliability Test
• Parametric Trigger Limits Determination
• Population Dynamic Behavior Monitoring via Box-Whisker Plotting
• Individual Behavior Examination for the Potential Failure Candidates via Scatter (Line) Plotting
• Virtual Failure Concept and Holistic Health/Reliability
• Holistic Reliability Assessment/Monitoring Based on True and Virtual Failures
• Example and Conclusions
What’s Happening Inside Your Product?
Health Philosophy and Holistic Reliability Assessment
Physical Health + Mental Health = Holistic Health
True Failure (Macro Level) + Virtual Failures (Micro Level) = Holistic Failures (Macro + Micro)
Human Health Parametrics
• Body temperature
• Blood pressure
• Pulse rate, rhythm, and quality
• Heartbeat, or heart sound
• Respiratory rate, effort, and quality
• Cholesterol level
• Blood sugar level
• …
Product Sub-Healthy Condition vs Parametric Monitoring
• As with a human being, a product undergoes an ongoing health-degradation process that can bring it to a critical, or “near-death,” condition.
• Depending on the degree of this sub-healthy condition, a product may not manifest a physical failure at the macro level during a life test, yet it can become a true failure after a short period of field operation.
• To account for the field reliability impact of such “invisible” but sub-healthy products, the concept of “virtual failure” is proposed. Its impact on long-term product reliability is quantified from the correlation between the incidence of each critical parameter (CP) exceeding its trigger limit and the incidence of actual failure in pre-designed reliability testing.
• The objective of this paper is to describe the importance of tracking HDD parametric performance during reliability life testing; the identification of failure-indicative critical parameters; their degradation patterns, measurement algorithms, triggering mechanisms, and limits; and the quantification of their impact on long-term reliability.
Values & Benefits of Parametric Monitoring
• Parametric tracking provides faster feedback to engineering design teams for product and design improvements.
• Parametric tracking accelerates parametric feedback into product test, manufacturing, and field health monitoring.
• Parametric tracking increases product reliability and provides a better ability to proactively identify field excursions.
• Revealing latent defects and improving the manufacturing process and product design will reduce overall field returns.
Critical Parameters – HDD Example
There are more than 100 critical parameters at both the head component level and drive system level that can be tracked and collected during HDD reliability tests:
• G-List: "growth" defect table for sectors that have gone bad after the drive was placed in use
• Error Rate (ER): raw bit error rate
• MR Bias: bias current applied to the magnetoresistive (MR) head
• Magneto-Resistive Asymmetry (MRA): channel amplitude asymmetry compensation
• MR Resistance: reader element resistance of the magnetoresistive head
• Non-Repeatable Run-Out (NRRO): measurement of how much a platter wobbles, or moves off-center
• Spin-Up Time: time required for the disk platters to reach full operational speed from a stationary start
• Variable Gain Amplifier (VGA): channel amplitude gain compensation
• Write Current: current going through the writer element during a write operation
• …
Identification of Failure Indicative Critical Parameters for Parametric Reliability Tracking
• Based on extensive studies of available historical CP data:
  • from reliability life test failures, using degradation analysis, and
  • from field failure returns, using multivariate analysis
• Four parameters, three at the head level and one at the drive level, were identified as the most failure-indicative candidates:
  • Parameter A – Head Level
  • Parameter B – Head Level
  • Parameter C – Head Level
  • Parameter D – Drive Level
Underlying Principle & Justifications
• The failure mechanism of a product is governed by its weakest link.
• The head-level reliability performance of an HDD as a whole is governed by the worst head (if one head fails, the whole drive fails).
• Therefore, the CP values of the worst head should be used to represent the head-level parametric performance of an HDD.
Parametric Degradation Pattern During Life Test - A sample line plot of Parameters A, B, C, and D
[Figure: sample line plots in four panels: (a) Parameter A, (b) Parameter B, (c) Parameter C, (d) Parameter D]
Graphical Illustration of the Maximum Degradation Measurement
Maximum Degradation Measurement
• For monotonic parameters (Parameters A and B), use the maximum % change of the worst head:

$$\text{Max \% Change} = \max\left[\frac{\left|CP_{\text{current day}} - CP_{\text{1st day}}\right|}{CP_{\text{1st day}}}\right] \times 100\%$$

• For fluctuating parameters (Parameters C and D), use the maximum fluctuation (of the worst head, for head-level parameters):

$$\text{Max CP Fluctuation} = \frac{CP_{\text{Max}} - CP_{\text{Min}}}{\text{Constant}}$$
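The two measurements above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the data layout (one list of daily readings per head), the sample readings, and the normalizing constant of 2.0 are all hypothetical.

```python
def max_percent_change(readings):
    """Max % change from the first-day value, for monotonic parameters.

    Assumes the first-day reading is positive and nonzero.
    """
    first = readings[0]
    return max(abs(value - first) / first for value in readings) * 100.0


def max_fluctuation(readings, constant):
    """(max - min) / constant, for fluctuating parameters."""
    return (max(readings) - min(readings)) / constant


def worst_head(per_head_readings, measure):
    """Largest measurement across all heads of one drive (the worst head)."""
    return max(measure(readings) for readings in per_head_readings)


# Hypothetical daily readings for a two-head drive (not data from the study).
par_a = [[100.0, 101.0, 103.0], [100.0, 104.0, 110.0]]
par_c = [[5.0, 7.0, 4.0, 6.0], [5.0, 5.5, 5.0, 4.5]]

drive_par_a = worst_head(par_a, max_percent_change)
drive_par_c = worst_head(par_c, lambda r: max_fluctuation(r, constant=2.0))
print(drive_par_a, drive_par_c)
```

Taking the maximum across heads reflects the weakest-link principle: the drive is represented by whichever head has degraded the most.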
Parametric Trigger Limits Determination
Individual Behavior Examination For The Potential Failure Candidates
Potential Failure and Virtual Failure
• Potential failure candidates: those products whose maximum % change or maximum fluctuation in any failure-indicative parameter exceeds its trigger limit.
• Not every potential failure candidate will fail in reliability test.
• There is a correlation, or likelihood, between the incidence of each CP exceeding its trigger limit and the incidence of actual failure in reliability test.
• Potential Failure x Failure Likelihood => Virtual Failure
• A weighted linear function is introduced to convert potential failures to their equivalent virtual failures:

Virtual Failure Counts =
    Potential Failure Counts Triggered by Parameter A x a%
  + Potential Failure Counts Triggered by Parameter B x b%
  + Potential Failure Counts Triggered by Parameter C x c%
  + Potential Failure Counts Triggered by Parameter D x d%
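As a rough sketch of this conversion, the Python below flags potential failures per parameter and applies the weighted linear function. The trigger limits, failure likelihoods, and drive measurements are made-up illustration values, and the function names are hypothetical. The sketch also assumes that a drive exceeding the limits of several parameters is counted once under each of them, which the presentation does not specify.

```python
def potential_failure_counts(drives, trigger_limits):
    """Count, per parameter, the drives whose measurement exceeds the limit."""
    counts = {name: 0 for name in trigger_limits}
    for drive in drives:
        for name, limit in trigger_limits.items():
            if drive[name] > limit:
                counts[name] += 1
    return counts


def virtual_failure_count(counts, likelihoods):
    """Weighted linear sum: potential failures x failure likelihood."""
    return sum(counts[name] * likelihoods[name] for name in counts)


# Hypothetical trigger limits and likelihoods (the a%, b%, c%, d% weights,
# written as fractions). None of these values come from the study.
trigger_limits = {"A": 20.0, "B": 15.0, "C": 3.0, "D": 2.0}
likelihoods = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.5}

# Hypothetical per-drive maximum degradation measurements.
drives = [
    {"A": 25.0, "B": 10.0, "C": 1.0, "D": 1.0},  # exceeds A
    {"A": 5.0, "B": 20.0, "C": 4.0, "D": 1.0},   # exceeds B and C
    {"A": 30.0, "B": 5.0, "C": 1.0, "D": 3.0},   # exceeds A and D
    {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0},    # healthy
]

counts = potential_failure_counts(drives, trigger_limits)
virtual = virtual_failure_count(counts, likelihoods)
print(counts, virtual)
```

Note that the result is generally fractional: virtual failures are an expected count, not a tally of individual drives.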
Calculation of Failure-Equivalence Factor
The Failure-Equivalence Factor, or Failure Likelihood, is estimated from historical reliability test data as the conditional probability that a drive will fail given that its CP maximum % change exceeds the trigger limit, where ORT denotes the ongoing reliability test:

Failure-Equivalence Factor =
    [ Total # of ORT HDDs whose maximum % changes exceed the specified trigger limits and that fail in ORT ]
  / [ Total # of ORT HDDs whose maximum % changes in ORT exceed the specified trigger limits ]
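This ratio is a simple conditional relative frequency, which the following Python sketch estimates for a single critical parameter. The record layout (one pair of booleans per drive), the function name, and the sample history are assumptions made for illustration only.

```python
def failure_equivalence_factor(records):
    """Estimate P(fail | CP exceeded its trigger limit) for one parameter.

    records: iterable of (exceeded_limit, failed) boolean pairs, one per
    drive in the historical test population (a hypothetical layout).
    Returns None when no drive exceeded the limit (ratio undefined).
    """
    outcomes = [failed for exceeded_limit, failed in records if exceeded_limit]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Hypothetical history: four drives exceeded the limit, two of them failed.
history = [
    (True, True),
    (True, False),
    (True, False),
    (True, True),
    (False, False),
    (False, True),  # failed without triggering; excluded from the ratio
]

factor = failure_equivalence_factor(history)
print(factor)
```

Drives that failed without exceeding the limit do not enter the ratio, since they are already counted as true failures.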
True Failure + Virtual Failure => Holistic Failure and Reliability
True Failure (Macro Level) + Virtual Failures (Micro Level) = Total Adjusted Failures (Holistic: Macro + Micro)
• Traditionally, product reliability during life test, expressed for example as annual failure rate (AFR), is assessed based on true failures.
• Such an approach ignores the sub-healthy condition and failure likelihood of potential failures.
• Combining true failures (at the macro level) and virtual failures (at the micro level) for products with health degradation provides a holistic view of product failures.
Physical Health + Mental Health = Holistic Health
Holistic Reliability Assessment – based on True and Virtual Failures
Enhanced sensitivity due to incorporation of “virtual failures”
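To make the sensitivity gain concrete, the sketch below compares an AFR estimate based on true failures alone with one based on total adjusted failures. The presentation does not give its AFR formula, so the estimator used here (failures per drive-hour scaled to 8760 hours per year) is an assumption, as are the function name and all of the numbers.

```python
HOURS_PER_YEAR = 8760


def annual_failure_rate(failures, drive_hours):
    """Assumed AFR estimate: failures per drive-hour, scaled to one year."""
    return failures / drive_hours * HOURS_PER_YEAR


# Hypothetical test: 1000 drives run for 1000 hours each.
drive_hours = 1000 * 1000
true_failures = 2
virtual_failures = 1.875  # hypothetical weighted count of potential failures

afr_true = annual_failure_rate(true_failures, drive_hours)
afr_holistic = annual_failure_rate(true_failures + virtual_failures, drive_hours)

print(f"AFR, true failures only:      {afr_true:.3%}")
print(f"AFR, true + virtual failures: {afr_holistic:.3%}")
```

Because virtual failures are non-negative, the holistic estimate can never fall below the traditional one; a large gap between the two flags a population with heavy parametric degradation that has not yet produced true failures.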
Conclusions
• Introducing the parametric-degradation-based virtual failure concept adds a micro dimension to the traditional failure domain.
• Incorporating virtual failures into the reliability calculation enhances the detection sensitivity of ongoing reliability tests and can therefore surface a poor vintage with a potentially high future failure rate due to “invisible” high degradation of critical parameters.
• This approach is applicable not only in the HDD industry but also to any other product with health degradation and parameters that can be measured and monitored.
• The accuracy of the “virtual failure” counts depends on how well the critical parameters are identified and how well the failure-equivalence factors are estimated.
• Further analyses should be conducted to better estimate the failure-equivalence factors and to evaluate the linear-function assumption as more life test data accumulate.