© 2013 WESTERN DIGITAL TECHNOLOGIES, INC. ALL RIGHTS RESERVED
Machine Learning and Failure Prediction in Hard Disk Drives
Dr. Amit Chattopadhyay
Director of Technology
Advanced Reliability Engineering
Western Digital, San Jose
HEPiX, October 2015
Acknowledgement:
Work done in collaboration with Sourav Chatterjee at Western Digital, San Jose
What is Machine Learning?
Algorithms that:
Improve their performance P
At some task T
With Experience E
Task: predict an event in the future
Will this hard drive fail?
Training data: provides Experience
Reliability Test
Performance: how many mistakes does it make?
In principle, similar to how a child learns from experience
A simple task: can you identify the object?
Task T: Can you identify this object?
Experience E:
Training Examples
Performance: How many does it get right? Other measures possible
Apple
Pear
Tomato
Cow
Dog
Horse
The attributes are often not simple enough to code manually
Compare to an explicit, parameter-based algorithm:
If (shape == round)
then apple
else
{
if (four legged)
}
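To make the contrast concrete, here is a minimal Python sketch of a hand-coded rule next to a rule learned from training examples. The two features (a roundness score and a leg count), the toy examples, and the function names are all hypothetical, invented for illustration; the slide does not specify them. Nearest-neighbour lookup is used only because it is the simplest learner to show.

```python
# Hypothetical toy features: (roundness in [0, 1], number of legs).
TRAINING_EXAMPLES = [
    ((0.95, 0), "apple"),
    ((0.70, 0), "pear"),
    ((0.90, 0), "tomato"),
    ((0.20, 4), "cow"),
    ((0.25, 4), "dog"),
]


def hand_coded_rule(features):
    """Explicit, parameter-based rule: every case must be written by hand."""
    roundness, legs = features
    if roundness > 0.8:
        return "apple"  # also (wrongly) catches a tomato
    if legs == 4:
        return "cow"    # cannot tell a cow from a dog
    return "unknown"


def nearest_neighbour(features, examples=TRAINING_EXAMPLES):
    """Learned rule: return the label of the closest training example."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(examples, key=lambda ex: sq_dist(ex[0], features))[1]


def accuracy(classifier, labelled):
    """Performance P: fraction of examples the classifier gets right."""
    return sum(classifier(f) == label for f, label in labelled) / len(labelled)
```

On the training examples themselves the learned rule is right every time, while the hand-coded rule only gets the apple and the cow. Adding a new object means adding examples, not rewriting branches.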
Typical Workflow
Training: Training Images + Training Labels → Image Features → Training → Learned model
Testing: Test Image → Image Features → Learned model → Prediction
Where is machine learning used?
Recommendation engines
Spam filters
Self-driving cars
Smart devices such as the Nest thermostat
A more sophisticated example: Self-driving car
Can learn much harder tasks
Not limited to simple, parametric, attribute-based decisions
Training
Images from windshield camera
Labels: driver action
Feature: each pixel
Lesson: Sophisticated algorithms can often solve problems better than human beings
A Training Image
Label: Turn Left
Kinds of Machine Learning problems
Predict: turn LEFT or RIGHT
Predict: How many degrees to left? 30.5 degrees
No training examples
Classification Problems
Function approximation
Define a function that separates the positive examples from the negative ones
Linear: Easier and more robust (e.g. Naïve Bayes, Logistic Regression)
Nonlinear: Harder (e.g. Support Vector Machine, Neural Network)
[Figure panels: linear decision boundary; nonlinear decision boundary]
Training Examples:
• Points in 2D space: <x,y>
• Label: Red/Green
• Example: [<30,3>, Red]
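As a rough illustration of a linear decision boundary, the sketch below fits a logistic regression to labelled 2D points with plain batch gradient descent. The points, the Red = 1 / Green = 0 encoding, the learning rate, and the function names are assumptions made for this example, not values from the talk.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, point):
    """P(label == Red) under the linear model w0 + w1*x + w2*y."""
    x, y = point
    return sigmoid(w[0] + w[1] * x + w[2] * y)


def train_logistic(points, labels, lr=0.1, epochs=2000):
    """Fit the three weights by batch gradient descent on the log loss."""
    w = [0.0, 0.0, 0.0]
    n = len(points)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x, y), t in zip(points, labels):
            err = predict_proba(w, (x, y)) - t
            grad[0] += err
            grad[1] += err * x
            grad[2] += err * y
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


# Hypothetical training examples: Red = 1, Green = 0.
POINTS = [(3.0, 3.0), (3.5, 2.5), (4.0, 3.5), (2.8, 3.8),
          (1.0, 1.0), (0.5, 1.5), (1.5, 0.8), (1.2, 1.6)]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights = train_logistic(POINTS, LABELS)
```

The learned boundary is the line where `w0 + w1*x + w2*y == 0`; points on one side get a probability above 0.5 (Red), points on the other side below 0.5 (Green).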
Issues in categorization
Can we generalize?
A model may show very good accuracy on the training data
but much lower accuracy on the test data (overfitting)
Think of the polynomial regression example
Model complexity: order of the polynomial
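The polynomial regression example can be sketched as follows. The data are hypothetical: a straight line `y = 2x + 1` with alternating ±0.5 noise on six training points. An order-1 fit is compared with an order-5 polynomial that passes through every training point (written in Lagrange form to avoid any library dependency).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (order-1 polynomial)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolant(xs, ys):
    """Order n-1 polynomial through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    """Hypothetical underlying relationship."""
    return 2.0 * x + 1.0


train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [truth(x) + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [truth(x) for x in test_x]

line = fit_line(train_x, train_y)
wiggly = fit_interpolant(train_x, train_y)
```

The high-order model has zero error on the training points because it reproduces the noise exactly, yet its error on the held-out points is larger than that of the simple line.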
Confusion matrix and ROC curves
Confusion Matrix
Example: 98 positives, 2 negatives
98% accuracy is not meaningful here: a model that always predicts "positive" achieves it
Does it predict true negatives?
ROC curves
Tradeoff between false positives and true positives
How many false positives are we willing to accept?
[ROC plot: true positive rate vs. false positive rate]
The area under this curve (AUC) is used as the measure of success
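A small Python sketch of both ideas, using the 98/2 split from the slide. The "always positive" predictor and the function names are illustrative assumptions. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 98 positives and 2 negatives, scored by a model that always says "positive".
y_true = [1] * 98 + [0] * 2
always_positive = [1] * 100

tp, fp, tn, fn = confusion_matrix(y_true, always_positive)
accuracy = (tp + tn) / len(y_true)
true_negative_rate = tn / (tn + fp)
auc_constant = roc_auc(y_true, [1.0] * 100)
```

The constant predictor reaches 98% accuracy but never identifies a negative, and its AUC is 0.5, the same as guessing.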
Disclaimer: This presentation is mostly about “what is possible”
How is Machine Learning relevant to Hard Disk drives?
Let us demonstrate with one specific example on product A
• Every week we produce a large number of drives of this product (several tens of thousands)
• Sample some (200-300) drives from a lot of 50,000 and run a stress test (ORT: Ongoing Reliability Test)
Problem statement:
• ORT shows an ‘excursion’ for 3 weeks of production
• The entire production lot during this time gets ‘stop ship’ped
• Sitting on a large inventory of several $M that needs action
What can we do?
Traditional solution: based on tribal knowledge/engineering insight
• Define “good” and “bad” explicitly, based on specific limit checks (If A < … and B>..., then good)
• Run reliability tests on 300 “good” drives to convince ourselves these are good
• Lose 6 weeks: that is the duration of the reliability test
• Trust that the rest of the “good” drives will behave this way
• Ship and make $
Machine Learning solution:
• Use Supervised Machine Learning on ORT data
• Supervisor: ORT failure rate
  • Pass test: “good” drive
  • Fail test: “bad” drive
• Allow the algorithm to decide and build a predictive model of failure rate
Key benefit: Predict the failure rate without having to run the test!
How is this done?
Do a 50-50 split on the ORT population, into “Training” and “Test”
Each drive goes through detailed characterization (about 2000 variables) before getting to ORT.
On the “Training” population:
Use a feature-selection method (χ², boosted forest, …) to choose the top 20-25 features from this variable list.
Build a Logistic Regression model. We can use others as needed (SVM, Neural Nets, Naïve Bayes, etc.)
Calculate the failure probability of each drive
Based on the failure probabilities, generate a ‘health hierarchy’/grading system of drives before they get shipped
Validate the model we just built on the “Test” set, which already has ORT data: how do the calculated failure probabilities match up?
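The first two steps (the 50-50 split and the feature ranking) could look roughly like the sketch below. Everything in it is an assumption for illustration: the synthetic data (400 drives, 10 variables instead of about 2000), the rule that variable 3 causes failures, thresholding each variable at its median before applying the χ² test, and the function names. The selection procedure actually used in the study is not described in the slides.

```python
import random


def chi2_2x2(feature_flags, labels):
    """Pearson chi-squared statistic for a binary feature vs. a binary label."""
    n = len(labels)
    obs = [[0, 0], [0, 0]]
    for f, y in zip(feature_flags, labels):
        obs[f][y] += 1
    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row_total = obs[f][0] + obs[f][1]
            col_total = obs[0][y] + obs[1][y]
            expected = row_total * col_total / n
            if expected > 0:
                stat += (obs[f][y] - expected) ** 2 / expected
    return stat


def top_k_features(rows, labels, k):
    """Rank variables by chi-squared after thresholding each at its median."""
    scores = []
    for j in range(len(rows[0])):
        column = sorted(r[j] for r in rows)
        median = column[len(column) // 2]
        flags = [1 if r[j] > median else 0 for r in rows]
        scores.append((chi2_2x2(flags, labels), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


# Synthetic ORT population: label 1 = failed the test, 0 = passed.
rng = random.Random(0)
N_DRIVES, N_FEATURES = 400, 10
rows = [[rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
        for _ in range(N_DRIVES)]
labels = [1 if r[3] > 1.0 else 0 for r in rows]  # hypothetical failure cause

# 50-50 split into "Training" and "Test".
indices = list(range(N_DRIVES))
rng.shuffle(indices)
half = N_DRIVES // 2
train_idx, test_idx = indices[:half], indices[half:]
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]

selected = top_k_features(train_rows, train_labels, k=3)
```

Only the training half is used for ranking, so the test half stays untouched for validation. The selected columns would then be fed to the logistic regression model.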
Failure Rate: sensitivity to key features chosen by the classifier
Model building on the 1st 50% (training set)
These drives actually ran the test, so we can see how this holds up against real data
If the learning were ideal, the blue line would be the perfect classifier, with all passers to the left and all failures to the right
Model validation on the 2nd 50% (‘test’ set)
Then:
• Green dotted line: average actual failure rate of this population
• This is all we had before: an average failure rate that looked bad
Now:
• A lot more insight: 10x spread in failure rates
• Grades 1-4: pretty good drives; Grades 8-10: quite bad
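One way to turn predicted failure probabilities into grades and check them against observed ORT outcomes is sketched below. Equal-sized grades (deciles), the toy probabilities, and the function names are assumptions; the slides do not say how the grade boundaries were chosen.

```python
def assign_grades(probabilities, n_grades=10):
    """Grade 1 = lowest predicted failure probability, n_grades = highest."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i])
    grades = [0] * len(probabilities)
    for rank, i in enumerate(order):
        grades[i] = rank * n_grades // len(probabilities) + 1
    return grades


def failure_rate_by_grade(grades, failed, n_grades=10):
    """Observed failure fraction per grade (None if a grade is empty)."""
    rates = {}
    for g in range(1, n_grades + 1):
        outcomes = [f for gr, f in zip(grades, failed) if gr == g]
        rates[g] = sum(outcomes) / len(outcomes) if outcomes else None
    return rates


# Toy "Test" set: 20 drives, predicted probabilities and actual ORT results.
predicted = [i / 20 for i in range(20)]
failed = [0] * 16 + [1] * 4  # the four highest-risk drives actually failed

grades = assign_grades(predicted)
observed = failure_rate_by_grade(grades, failed)
population_average = sum(failed) / len(failed)
```

Comparing `observed` per grade with `population_average` corresponds to comparing the graded failure rates with the single average line that was available before.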
Measures of success – ROC curve and TTF chart
Now we can use this algorithm on the entire production and calculate the failure rate of each drive
Appropriate business actions are taken accordingly
For this analysis, the area under the ROC curve is approximately 85%-87%
The algorithm makes the right business call more than 8 times out of 10.
Summary
Drive failure prediction using initial (factory) conditions has always been tricky.
Conventional methods do not have a high success rate; they mostly under-predict.
Machine Learning allows us to approach the problem in a different way.
Still early, with continuous progress being made.
This analytical technique, coupled with periodic in-field measurements, can provide a robust framework for fleet management.
Thank you!