Data Centre Analytics: Toufic Boubez, Metafor, devopsdays Vancouver, 2013-10-25
DESCRIPTION
Vancouver DevOps Days, 25 October 2013
IT Ops collect a ton of data and produce reams of graphs to monitor systems and applications. Getting the right signal out of all that noise, however, is getting tougher and tougher. The traditional techniques for dealing with such metrics, whether threshold-based or very simple statistical methods that were developed for stable, static manufacturing processes, are failing in today’s dynamic environment. Interest in applying more advanced analytics and machine learning to detect anomalies is gaining steam, but understanding which algorithms should be used to identify and predict anomalies without producing more false positives is not so easy.
This talk will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection, such as:
• Understanding your data’s characteristics
• The two main approaches for analyzing operations data: parametric and non-parametric methods
• Overview of some current simple statistical methods and their weaknesses
• Simple data transformations that can give you powerful results
By the end of this talk, attendees will understand the pros and cons of the key statistical analysis techniques and walk away with examples as well as practical rules of thumb and usage patterns.
TRANSCRIPT
Anomaly Detection Analytics for the Data Centre
devopsdays Vancouver
25 October 2013
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
2
Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
  – Acquired by Computer Associates in 2013
  – I escaped
• Co-Founder/CTO Saffron Technology
• Chief Architect IBM (SOA)
• Building large scale software systems for 20 years (I’m older than I look, I know!)
3
Why this talk?
• April: devopsdays Austin: Open Space talk
  – Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/
• June: devopsdays Silicon Valley presentation:
  – Five major lessons learned
• Explore issues mentioned in June
• Note: real data
• Note: no labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!
4
Wall of Charts™
5
The Wall of Charts side-effects
“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”
- John Vincent, Monitorama, March 2013
Alert Overload Metrics Overload
6
Need mo’ better alerting
– So what if my unicorn usage is at 89-91%, and has been stable?
– I’d much rather know if it’s at 60% and has been rapidly increasing
– Static thresholds and rules won’t help you in this case
– Need some intelligent Anomaly Detection mechanism
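To make the contrast on this slide concrete, here is a minimal sketch comparing a static threshold with a simple rate-of-change check. The function names, the sample series, the threshold of 85 and the slope limit of 1.0 per sample are all illustrative assumptions, not values from the talk.

```python
def static_threshold_alert(series, threshold):
    """Classic rule: alert when the latest value exceeds a fixed threshold."""
    return series[-1] > threshold


def slope_per_step(series):
    """Least-squares slope of the series against its sample index."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_x = sum(series) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in enumerate(series))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return cov / var


def trend_alert(series, max_slope):
    """Alert when the metric is climbing faster than max_slope per sample."""
    return slope_per_step(series) > max_slope


# Hypothetical usage percentages: one high but stable, one lower but rising fast.
stable = [90, 89, 91, 90, 89, 91, 90, 90]
rising = [40, 43, 46, 49, 52, 55, 58, 60]

for name, series in (("stable", stable), ("rising", rising)):
    print(name,
          "threshold alert:", static_threshold_alert(series, 85),
          "trend alert:", trend_alert(series, 1.0))
```

In this toy data the static rule fires only on the stable series, while the trend check fires only on the rising one, which is the case the slide says we would rather know about.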
7
Anomaly Detection for DevOps
• Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]
• For devops: Need to know when one or more of our metrics is going wonky
8
#monitoringsucks vs #iheartmonitoring
• Proper monitoring tools should give us all the information we need to be PROACTIVE
  – But they don’t
• Current monitoring tools assume that the underlying system is relatively static
  – Surround it with static thresholds and rules.
  – Good for detecting catastrophic events but not much else
  – BUT WHY!!??
9
“Traditional” analytics …
• Roots in manufacturing process QC
10
… are based on Gaussian distributions
• Makes assumptions about probability distributions and process behaviour
  – Usually assumes data is normally distributed with a useful and usable mean and standard deviation
• Blah blah blah what does it mean?
11
What’s normal!!??
12
Distribution Schmistribution
13
Three-Sigma Rule
• Three-sigma rule
  – ~68% of the values lie within 1 std deviation of the mean
  – ~95% of the values lie within 2 std deviations
  – 99.73% of the values lie within 3 std deviations
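The percentages above follow from the normal distribution's cumulative function, and the rule translates directly into an outlier check. The sketch below is illustrative only: the function names and the sample data are assumptions, and the check is only meaningful if the data really is roughly Gaussian.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k std deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def three_sigma_outliers(values, k=3.0):
    """Return (index, value) pairs lying more than k std deviations from the mean."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return []  # a constant series has no outliers under this rule
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > k * std]


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.2%}")

# Hypothetical metric: twenty steady readings followed by one spike.
readings = [10.0] * 20 + [100.0]
print(three_sigma_outliers(readings))
```

Note that the spike itself inflates the mean and standard deviation used to judge it, which is one reason this rule gets shaky on small or heavily contaminated samples.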
14
Aaahhhh
• The mysterious red lines explained
15
Moving Averages for detecting outliers
• Big idea:
  – Based on past values, predict most likely next value
  – Alert if actual value “significantly” deviates from predicted value
• Simple Moving Average
  – Average of last N values in your time series
    • S[t] <- sum(X[t-(N-1):t])/N
  – Each value in the window contributes equally to prediction
  – Idea is that your next value should not significantly deviate from the general trend of your data
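A minimal sketch of the simple moving average idea follows. Two assumptions are mine rather than the slide's: the window used to predict X[t] ends at X[t-1], so the point being tested is not part of its own prediction, and "significantly" is taken to mean more than k standard deviations of the window. The names sma_forecast, sma_anomalies and min_std are hypothetical.

```python
import math


def sma_forecast(series, n):
    """Predict the next value as the plain average of the last n values."""
    window = series[-n:]
    return sum(window) / len(window)


def sma_anomalies(series, n, k=3.0, min_std=1e-9):
    """Indices whose value deviates from the SMA prediction by more than
    k standard deviations of the preceding n-value window."""
    flagged = []
    for t in range(n, len(series)):
        window = series[t - n:t]
        predicted = sum(window) / n
        std = math.sqrt(sum((x - predicted) ** 2 for x in window) / n)
        if abs(series[t] - predicted) > k * max(std, min_std):
            flagged.append(t)
    return flagged


# Hypothetical metric oscillating between 10 and 11 with one spike at index 8.
metric = [10, 11, 10, 11, 10, 11, 10, 11, 50, 10, 11]
print(sma_forecast(metric[:8], 4))
print(sma_anomalies(metric, n=4))
```

Once the spike enters the window it drags the average and the spread upward for the next N samples, so the detector is temporarily less sensitive right after an anomaly.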
16
Weighted Moving Average
• Weighted Moving Average
  – Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window
  – Older values contribute less to the prediction
• Neither SMA nor WMA deals well with periodicity in your data
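As a sketch of the weighting described above, the newest value in an N-value window can be given weight N, the one before it N-1, and so on down to 1 for the oldest. The function name and this particular weight scheme are assumptions for illustration; other linear schemes are possible.

```python
def wma_forecast(series, n):
    """Predict the next value as a linearly weighted average of the last n
    values: the oldest value in the window gets weight 1, the newest gets n."""
    window = series[-n:]
    weights = range(1, len(window) + 1)
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)


# The newest value (3.0) pulls the forecast above the plain average of 2.0.
print(wma_forecast([1.0, 2.0, 3.0], 3))
```

Because the weights only depend on how recent a value is, a regular spike every few samples still looks like a surprise each time, which is the periodicity weakness noted on the slide.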
17
Exponential Smoothing
• Exponential Smoothing
  – Similar to weighted average, but with weights that decay exponentially over the whole set of historic samples
    • S[t] = αX[t-1] + (1-α)S[t-1]
  – Is almost as bad as moving averages in dealing with periodicity and trending time series!!
• DES: Holt-Winters
  – In addition to data smoothing factor (α), introduces a trend smoothing factor (β)
  – Better at dealing with periodicity and trending
• ALL assume Gaussian!
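The two smoothing schemes can be sketched in a few lines. The first function follows the slide's recurrence S[t] = αX[t-1] + (1-α)S[t-1]; the second is double exponential smoothing in its common level-plus-trend form. The initial values (S[0] = X[0], initial trend = X[1] - X[0]) and the function names are assumptions, since the slides do not specify them. Seasonal variants add a third smoothing factor, which is not shown here.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead predictions: S[t] = alpha*X[t-1] + (1-alpha)*S[t-1].
    Returns len(series)+1 values; the last one is the forecast for the
    next, not yet observed, sample. S[0] = X[0] is an assumed starting point."""
    preds = [series[0]]
    for x in series:
        preds.append(alpha * x + (1 - alpha) * preds[-1])
    # preds[1] equals X[0] as well, so drop the duplicate seed value.
    return [series[0]] + preds[2:] if len(series) > 1 else preds


def holt_forecast(series, alpha, beta):
    """Double exponential smoothing: track a level and a trend, then
    forecast the next value as level + trend. Needs at least two samples."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        previous_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + trend


ramp = [1.0, 2.0, 3.0, 4.0, 5.0]
print(exponential_smoothing(ramp, 0.5)[-1])  # lags behind a steady ramp
print(holt_forecast(ramp, 0.5, 0.5))         # follows the ramp
```

On a steadily rising series the single-factor forecast always trails the data, while the trend term lets the second forecast keep up. Neither says anything about whether the residuals are Gaussian, which is the caveat in the last bullet.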
18
Gaussian distributions are powerful because:
• Far far in the future, in a galaxy far far away:
  – I can make the same predictions because the statistical properties of the data haven’t changed
  – I can compare different metrics since they have similar statistical properties
• BUT…
• Cue in DRAMATIC MUSIC
19
What’s my distribution?
20
Another common distribution
21
Let’s look at an example
22
Histogram – probability distribution
23
3-sigma rule
24
Holt-Winters predictions
25
Are we doomed?
• There’s A LOT you can do with the data, other than just looking at it and putting thresholds!
  – Adaptive Mixture of Gaussians
  – Non-parametric techniques (http://www.metaforsoftware.com/everything-you-should-know-about-anomaly-detection-know-your-data-parametric-or-non-parametric/)
  – Spectral analysis
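As one example of what a non-parametric technique can look like, the sketch below computes the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distributions of two windows of data. Choosing this particular statistic, the function name and the sample windows are my assumptions for illustration; the slide does not say which non-parametric method was used. It makes no assumption about the shape of the underlying distribution.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical windows of a metric: yesterday vs. the last hour.
reference = [12, 15, 11, 14, 13, 12, 16, 15]
similar = [13, 14, 12, 15, 11, 16, 12, 14]
shifted = [31, 29, 35, 30, 33, 32, 34, 30]

print(ks_statistic(reference, similar))
print(ks_statistic(reference, shifted))
```

A value near 0 means the two windows look like draws from the same distribution; a value near 1 means they barely overlap. Turning the statistic into an alert still needs a cut-off, which depends on the window sizes.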
26
Mixture of Gaussians
27
We’re not doomed, but: Know your data!!
• You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.
• A large amount of data center data is non-Gaussian
  – Gaussian statistics won’t work
  – Use appropriate techniques
28
Pet Peeve #1: How much data do we need?
• Trend towards higher and higher sampling rates in data collection
• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
  – Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
  – We notice patterns BECAUSE we can abstract
The danger of over-abstraction
+
= comfortable?
30
So, how much data DO you need?
• You don’t need more resolution than twice your highest frequency (Nyquist-Shannon sampling theorem)
• Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.
• Watch out for correlated metrics (e.g. used vs. available memory)
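The correlated-metrics point can be checked with a plain Pearson correlation. In the sketch below, the 16 GB total, the sample readings and the function name are hypothetical; the point is only that when one metric is a fixed function of another, the coefficient sits at -1 or +1 and the second metric adds no information.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


total_gb = 16.0  # assumed machine size
used = [4.0, 6.5, 8.0, 7.2, 10.1, 12.3]
available = [total_gb - u for u in used]

print(round(pearson(used, available), 6))
```

Feeding both series into an anomaly detector would count the same event twice, so one of them can usually be dropped before analysis.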
31
Think: Is all data important to collect?
• Two camps:
  – Data is data, let’s collect and analyze everything and figure out the trends.
  – Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.
• Similar to the very public bun fight between Noam Chomsky and Peter Norvig
  – http://norvig.com/chomsky.html
• Unresolved as far as I know
32
Do we need both metrics?
33
More?
• Only scratched the surface
• I want to talk more about analytics, in more depth, but time’s up!!
  – (Actually Jenny won’t let me)
• Come talk to me during the breaks!
• Thank you!