data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25

Anomaly Detection Analytics for the Data Centre. devopsdays Vancouver, 25 October 2013. Toufic Boubez, Ph.D., Co-Founder, CTO, Metafor Software

Upload: tboubez

Post on 26-Jan-2015

DESCRIPTION

Vancouver DevOps Days, 25 October 2013

IT Ops collect a ton of data and produce reams of graphs to monitor systems and applications. Getting the right signal out of all that noise, however, is getting tougher and tougher. The traditional techniques for dealing with such metrics, whether threshold-based or very simple statistical methods that were developed for stable, static manufacturing processes, are failing in today’s dynamic environment. Interest in applying more advanced analytics and machine learning to detect anomalies is gaining steam, but understanding which algorithms should be used to identify and predict anomalies without producing more false positives is not so easy.

This talk will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection, such as:

• Understanding your data’s characteristics
• The two main approaches for analyzing operations data: parametric and non-parametric methods
• Overview of some current simple statistical methods and their weaknesses
• Simple data transformations that can give you powerful results

By the end of this talk, attendees will understand the pros and cons of the key statistical analysis techniques and walk away with examples as well as practical rules of thumb and usage patterns.

TRANSCRIPT

Page 1: Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25

Anomaly Detection Analytics for the Data Centre

devopsdays Vancouver
25 October 2013

Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software

Page 2

Toufic intro – who I am

• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
  – Acquired by Computer Associates in 2013
  – I escaped
• Co-Founder/CTO Saffron Technology
• Chief Architect IBM (SOA)
• Building large scale software systems for 20 years (I’m older than I look, I know!)

Page 3

Why this talk?

• April: devopsdays Austin: Open Space talk
  – Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/
• June: devopsdays Silicon Valley presentation:
  – Five major lessons learned
• Explore issues mentioned in June
• Note: real data
• Note: no labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!

Page 4

Wall of Charts™

Page 5

The Wall of Charts side-effects

“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”

- John Vincent, Monitorama, March 2013

Alert Overload Metrics Overload

Page 6

Need mo’ better alerting

– So what if my unicorn usage is at 89-91%, and has been stable?
– I’d much rather know if it’s at 60% and has been rapidly increasing
– Static thresholds and rules won’t help you in this case
– Need some intelligent Anomaly Detection mechanism
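The contrast between a stable ~90% and a rapidly rising 60% can be sketched in a few lines. This is only an illustration, not the speaker's method: the function names, the 95% static threshold, the slope limit of 2 points per sample and the sample data are all assumptions made up for the example.

```python
def static_threshold_alert(values, threshold=95.0):
    """Classic rule: alert only when the latest value crosses a fixed line."""
    return values[-1] > threshold


def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def trend_alert(values, max_slope=2.0):
    """Alert when the metric is climbing faster than max_slope per sample."""
    return slope(values) > max_slope


# Hypothetical usage percentages, one sample per interval.
stable_high = [90, 89, 91, 90, 89, 91, 90, 90]
rising_fast = [30, 34, 39, 43, 48, 52, 56, 60]

for name, series in (("stable_high", stable_high), ("rising_fast", rising_fast)):
    print(name, static_threshold_alert(series), trend_alert(series))
```

With these made-up numbers the static rule stays silent for both series, while the slope-based rule flags only the rising one, which is the situation the slide cares about.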

Page 7

Anomaly Detection for DevOps

• Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]

• For devops: Need to know when one or more of our metrics is going wonky

Page 8

#monitoringsucks vs #iheartmonitoring

• Proper monitoring tools should give us all the information we need to be PROACTIVE
  – But they don’t
• Current monitoring tools assume that the underlying system is relatively static
  – Surround it with static thresholds and rules.
  – Good for detecting catastrophic events but not much else
  – BUT WHY!!??

Page 9

“Traditional” analytics …

• Roots in manufacturing process QC

Page 10

… are based on Gaussian distributions

• Make assumptions about probability distributions and process behaviour
  – Usually assume data is normally distributed with a useful and usable mean and standard deviation
• Blah blah blah what does it mean?

Page 11

What’s normal!!??

Page 12

Distribution Schmistribution

Page 13

Three-Sigma Rule

• Three-sigma rule (for normally distributed data)
  – ~68% of the values lie within 1 std deviation of the mean
  – ~95% of the values lie within 2 std deviations
  – ~99.73% of the values lie within 3 std deviations
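The percentages above follow from the normal distribution's cumulative function, and the rule itself is easy to turn into an outlier check. The sketch below is illustrative only: the function names and the sample data are assumptions, and the check is only meaningful when the data really is close to Gaussian.

```python
import math
import statistics


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > k * std]


for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
# prints 0.6827, 0.9545 and 0.9973

# Hypothetical metric: nineteen quiet samples and one spike.
samples = [10] * 19 + [100]
print(three_sigma_outliers(samples))  # → [100]
```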

Page 14

Aaahhhh

• The mysterious red lines explained

Page 15

Moving Averages for detecting outliers

• Big idea:
  – Based on past values, predict most likely next value
  – Alert if actual value “significantly” deviates from predicted value
• Simple Moving Average
  – Average of last N values in your time series
    • S[t] <- sum(X[t-(N-1):t])/N
  – Each value in the window contributes equally to prediction
  – Idea is that your next value should not significantly deviate from the general trend of your data
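A minimal sketch of the simple moving average idea: predict each point as the plain average of the N points before it and flag the point when it strays too far from that prediction. The slide leaves “significantly” undefined, so the fixed `tolerance` parameter here is an assumption, as are the function name and the sample series.

```python
def sma_alerts(series, n, tolerance):
    """Indices whose value differs from the SMA of the previous n values by more than tolerance."""
    alerts = []
    for t in range(n, len(series)):
        predicted = sum(series[t - n:t]) / n  # every value in the window counts equally
        if abs(series[t] - predicted) > tolerance:
            alerts.append(t)
    return alerts


# Hypothetical metric with a single spike at index 6.
series = [10, 11, 10, 12, 11, 10, 30, 11, 10]
print(sma_alerts(series, n=3, tolerance=5))  # → [6, 7, 8]
```

Note that the spike at index 6 also drags the next two predictions upward, so the two perfectly ordinary points that follow it are flagged as well. That after-effect is one of the practical weaknesses of a plain moving average.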

Page 16

Weighted Moving Average

• Weighted Moving Average
  – Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window
  – Older values contribute less to the prediction
• Neither SMA nor WMA deals well with periodicity in your data
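The linearly decreasing weights can be sketched as follows. The convention used here (newest value gets weight n, oldest gets weight 1) is the common textbook one and is an assumption about what the slide intends; the function name is made up for the example.

```python
def wma_forecast(window):
    """Linearly weighted average of a window ordered oldest to newest."""
    n = len(window)
    weights = range(1, n + 1)  # oldest value gets weight 1, newest gets weight n
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)


window = [1, 2, 3]
print(sum(window) / len(window))  # plain SMA of the window → 2.0
print(wma_forecast(window))       # (1*1 + 2*2 + 3*3) / 6, pulled toward the newest value
```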

Page 17

Exponential Smoothing

• Exponential Smoothing
  – Similar to weighted average, but with weights that decay exponentially over the whole set of historic samples
    • S[t]=αX[t-1] + (1-α)S[t-1]
  – Is almost as bad as moving averages in dealing with periodicity and trending time series!!
• DES (double exponential smoothing): Holt-Winters
  – In addition to data smoothing factor (α), introduces a trend smoothing factor (β)
  – Better at dealing with trending; handling periodicity needs the seasonal (triple) form of Holt-Winters, which adds a third smoothing factor
• ALL assume Gaussian!
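Both smoothing schemes fit in a few lines. The first function follows the slide's recurrence S[t]=αX[t-1] + (1-α)S[t-1]; the second adds a trend term with factor β in the usual level-plus-trend form. The seeding choices (first value as initial level, first difference as initial trend) and the function names are assumptions, not something the slides specify.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead predictions: s[t] = alpha*series[t-1] + (1-alpha)*s[t-1].

    s[0] is seeded with series[0] (an assumed initialisation).
    """
    s = [series[0]]
    for t in range(1, len(series)):
        s.append(alpha * series[t - 1] + (1 - alpha) * s[t - 1])
    return s


def double_exponential_forecast(series, alpha, beta):
    """Forecast the next value using a smoothed level and a smoothed trend.

    Needs at least two points; level and trend are seeded from the first two.
    """
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        previous_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + trend


ramp = [1, 2, 3, 4, 5]  # a steadily trending series
print(exponential_smoothing(ramp, 0.5)[-1])        # → 3.125, lags well behind the actual 5
print(double_exponential_forecast(ramp, 0.5, 0.5))  # → 6.0, follows the trend
```

On a steady ramp the single-smoothing prediction trails the data, while the version with a trend term keeps up, which is the difference the β factor is meant to buy.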

Page 18

Gaussian distributions are powerful because:

• Far far in the future, in a galaxy far far away:
  – I can make the same predictions because the statistical properties of the data haven’t changed
  – I can compare different metrics since they have similar statistical properties
• BUT…
• Cue in DRAMATIC MUSIC

Page 19

What’s my distribution?

Page 20

Another common distribution

Page 21

Let’s look at an example

Page 22

Histogram – probability distribution

Page 23

3-sigma rule

Page 24

Holt-Winters predictions

Page 25

Are we doomed?

• There’s A LOT you can do with the data, other than just looking at it and putting thresholds!
  – Adaptive Mixture of Gaussians
  – Non-parametric techniques (http://www.metaforsoftware.com/everything-you-should-know-about-anomaly-detection-know-your-data-parametric-or-non-parametric/)
  – Spectral analysis
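As one concrete example of a non-parametric technique, the two-sample Kolmogorov-Smirnov statistic compares a recent window of a metric against a reference window without assuming any particular distribution. The slides do not say which non-parametric method the speaker has in mind, so treat this choice, the function names and the 0.5 cut-off as assumptions made for illustration.

```python
import bisect


def ks_statistic(reference, recent):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    rec_sorted = sorted(recent)
    gap = 0.0
    # The largest gap between two step functions occurs at one of the data points.
    for x in ref_sorted + rec_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_rec = bisect.bisect_right(rec_sorted, x) / len(rec_sorted)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


def distribution_changed(reference, recent, threshold=0.5):
    """Hypothetical decision rule: flag when the two windows look too different."""
    return ks_statistic(reference, recent) > threshold


print(ks_statistic([1, 2, 3], [1, 2, 3]))     # → 0.0, identical windows
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # → 1.0, completely separated windows
```

A production check would normally convert the statistic into a significance level that depends on the window sizes rather than use a fixed cut-off.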

Page 26

Mixture of Gaussians

Page 27

We’re not doomed, but: Know your data!!

• You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.
• A large amount of data center data is non-Gaussian
  – Gaussian statistics won’t work
  – Use appropriate techniques
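One quick, rough way to start getting to know a metric is to check how lopsided it is before trusting mean-and-sigma rules. The sketch below computes sample skewness, which is about 0 for symmetric data such as a Gaussian. It is only a first sanity check under assumed names and made-up data, not a full normality test and not something the slides prescribe.

```python
import statistics


def skewness(values):
    """Sample skewness (third standardised moment); about 0 for symmetric data."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


symmetric = [1, 2, 3, 4, 5]
latencies_ms = [1, 1, 1, 1, 2, 2, 3, 50]  # hypothetical long-tailed response times

print(skewness(symmetric))     # close to 0
print(skewness(latencies_ms))  # strongly positive: a long right tail, clearly not Gaussian
```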

Page 28

Pet Peeve #1: How much data do we need?

• Trend towards higher and higher sampling rates in data collection
• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
  – Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
  – We notice patterns BECAUSE we can abstract

Page 29

The danger of over-abstraction

+

= comfortable?

Page 30

So, how much data DO you need?

• You don’t need a sampling rate higher than twice the highest frequency in your signal (Nyquist-Shannon sampling theorem)
• Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.
• Watch out for correlated metrics (e.g. used vs. available memory)
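The used vs. available memory example can be checked directly: when one metric is just a fixed total minus the other, their correlation is -1 and the second metric adds no new information. The sketch below uses a hand-written Pearson correlation; the function name, the total and the sample readings are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


total_mb = 16384  # hypothetical host memory
used_mb = [4100, 5200, 6100, 8000, 7400, 9050]
available_mb = [total_mb - u for u in used_mb]

print(round(pearson(used_mb, available_mb), 6))  # → -1.0
```

A correlation this close to +1 or -1 is a hint that collecting and analysing both series is redundant.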

Page 31

Think: Is all data important to collect?

• Two camps:
  – Data is data, let’s collect and analyze everything and figure out the trends.
  – Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.
• Similar to the very public bun fight between Noam Chomsky and Peter Norvig
  – http://norvig.com/chomsky.html
• Unresolved as far as I know

Page 32

Do we need both metrics?

Page 33

More?

• Only scratched the surface
• I want to talk more about analytics, in more depth, but time’s up!!
  – (Actually Jenny won’t let me)
• Come talk to me during the breaks!
• Thank you!