data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25

Anomaly Detection Analytics for the Data Centre

devopsdays Vancouver
25 October 2013

Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software

Toufic intro – who I am

• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
  – Acquired by Computer Associates in 2013
  – I escaped
• Co-Founder/CTO Saffron Technology
• Chief Architect IBM (SOA)
• Building large scale software systems for 20 years (I’m older than I look, I know!)

Why this talk?

• April: devopsdays Austin: Open Space talk
  – Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/
• June: devopsdays Silicon Valley presentation:
  – Five major lessons learned
• Explore issues mentioned in June
• Note: real data
• Note: no labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!

Wall of Charts™

The Wall of Charts side-effects

“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”

- John Vincent, Monitorama, March 2013

Alert Overload
Metrics Overload

Need mo’ better alerting

– So what if my unicorn usage is at 89-91%, and has been stable?
– I’d much rather know if it’s at 60% and has been rapidly increasing
– Static thresholds and rules won’t help you in this case
– Need some intelligent Anomaly Detection mechanism
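A minimal sketch of the difference, in Python. The function names, the 85% threshold, the per-sample slope limit and the sample series are all illustrative assumptions, not something from the talk. A static threshold fires on the stable 89-91% series and stays silent on the one climbing towards 60%, while a simple rate-of-change check does the opposite.

```python
def static_threshold_alert(values, threshold=85.0):
    """Fire when the latest value is at or above a fixed threshold."""
    return values[-1] >= threshold


def rate_of_change_alert(values, max_delta_per_sample=2.0):
    """Fire when the average change per sample across the window is too steep."""
    if len(values) < 2:
        return False
    avg_delta = (values[-1] - values[0]) / (len(values) - 1)
    return abs(avg_delta) >= max_delta_per_sample


# Hypothetical usage percentages, one sample per interval.
stable_high = [89, 90, 91, 90, 89, 91, 90]     # high but stable
climbing_low = [30, 35, 41, 46, 52, 57, 60]    # only 60%, but rising fast

for name, series in (("stable_high", stable_high), ("climbing_low", climbing_low)):
    print(name,
          "static:", static_threshold_alert(series),
          "rate:", rate_of_change_alert(series))
```

The rate check here is deliberately naive (a straight average slope over the window); it only illustrates why the shape of the series matters more than its current level.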

Anomaly Detection for DevOps

• Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]

• For devops: Need to know when one or more of our metrics is going wonky

http://www.cs.umn.edu/tech_reports_upload/tr2007/07-017.pdf

#monitoringsucks vs #iheartmonitoring

• Proper monitoring tools should give us all the information we need to be PROACTIVE
  – But they don’t
• Current monitoring tools assume that the underlying system is relatively static
  – Surround it with static thresholds and rules.
  – Good for detecting catastrophic events but not much else
  – BUT WHY!!??

“Traditional” analytics …

• Roots in manufacturing process QC

… are based on Gaussian distributions

• Make assumptions about probability distributions and process behaviour
  – Usually assume data is normally distributed with a useful and usable mean and standard deviation
• Blah blah blah what does it mean?

What’s normal!!??

Distribution Schmistribution

Three-Sigma Rule

• Three-sigma rule
  – ~68% of the values lie within 1 std deviation of the mean
  – ~95% of the values lie within 2 std deviations
  – 99.73% of the values lie within 3 std deviations
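The percentages above follow from the normal distribution's CDF and can be reproduced with the error function. The sketch below also shows the kind of three-sigma outlier check that the rule implies; the function names and the sample data are assumptions made for illustration.

```python
import math
import statistics


def fraction_within(k):
    """Fraction of a normal distribution lying within k std deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def three_sigma_outliers(values, k=3.0):
    """Return the values more than k sample std deviations away from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]


for k in (1, 2, 3):
    print(k, round(fraction_within(k) * 100, 2))   # 68.27, 95.45, 99.73

# Hypothetical metric: 19 quiet samples and one spike.
samples = [10.0] * 19 + [50.0]
print(three_sigma_outliers(samples))
```

Note that the check only means what the rule says it means if the data really is close to normally distributed.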

Aaahhhh

• The mysterious red lines explained

Moving Averages for detecting outliers

• Big idea:
  – Based on past values, predict most likely next value
  – Alert if actual value “significantly” deviates from predicted value
• Simple Moving Average
  – Average of last N values in your time series
    • S[t] <- sum(X[t-(N-1):t])/N
  – Each value in the window contributes equally to prediction
  – Idea is that your next value should not significantly deviate from the general trend of your data
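A rough sketch of a simple-moving-average detector in Python. Two choices here are assumptions, not taken from the slides: the prediction for a point uses the N values before it (so the point being tested does not influence its own prediction), and "significantly" is taken to mean more than k standard deviations of that same window.

```python
import statistics


def sma_anomalies(series, window=5, k=3.0):
    """Flag points that deviate from the simple moving average of the previous
    `window` values by more than k standard deviations of that window."""
    anomalies = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        predicted = sum(past) / window            # every value weighted equally
        spread = statistics.pstdev(past)
        # Guard against a perfectly flat window (spread of zero).
        if abs(series[t] - predicted) > k * max(spread, 1e-9):
            anomalies.append((t, series[t], predicted))
    return anomalies


# Hypothetical metric with a single spike at index 7.
series = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10]
for t, actual, predicted in sma_anomalies(series):
    print(f"t={t}: actual={actual}, predicted={predicted:.1f}")
```

After the spike enters the window it inflates both the average and the spread, so the detector becomes less sensitive for the next few points.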

Weighted Moving Average

• Weighted Moving Average
  – Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window
  – Older values contribute less to the prediction
• Neither SMA nor WMA deals well with periodicity in your data
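For comparison with the simple moving average, a small Python sketch of a linearly weighted average. The weighting scheme 1..N (oldest to newest) is the usual textbook choice and is assumed here; the function names and window values are made up for illustration.

```python
def simple_moving_average(window):
    """Equal weight for every value in the window."""
    return sum(window) / len(window)


def weighted_moving_average(window):
    """Linearly weighted average: the oldest value gets weight 1, the newest weight N."""
    n = len(window)
    weights = range(1, n + 1)
    total_weight = n * (n + 1) / 2
    return sum(w * x for w, x in zip(weights, window)) / total_weight


window = [10, 10, 10, 20]                  # most recent value last
print(simple_moving_average(window))       # 12.5
print(weighted_moving_average(window))     # 14.0
```

The weighted prediction sits closer to the most recent value, which is the whole point of down-weighting older samples.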

Exponential Smoothing

• Exponential Smoothing
  – Similar to weighted average, but with weights that decay exponentially over the whole set of historic samples
    • S[t] = αX[t-1] + (1-α)S[t-1]
  – Is almost as bad as moving averages in dealing with periodicity and trending time series!!
• DES: Holt-Winters
  – In addition to data smoothing factor (α), introduces a trend smoothing factor (β)
  – Better at dealing with periodicity and trending
• ALL assume Gaussian!
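A sketch of both smoothers in Python, run on a steadily rising series. The first function follows the S[t] formula above; the second is the two-parameter (α, β) level-plus-trend form, with no seasonal component. The initialisation choices and the sample series are assumptions made for illustration.

```python
def exponential_smoothing(series, alpha=0.5):
    """One-step-ahead forecasts: S[t] = alpha*X[t-1] + (1-alpha)*S[t-1]."""
    forecasts = [series[0]]          # S[0] seeded with the first observation (assumption)
    for t in range(1, len(series)):
        forecasts.append(alpha * series[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts


def double_exponential_smoothing(series, alpha=0.5, beta=0.5):
    """Level + trend smoothing; returns one-step-ahead forecasts for series[2:]."""
    # Simple initialisation from the first two points (assumption).
    level, trend = series[1], series[1] - series[0]
    forecasts = []
    for x in series[2:]:
        forecasts.append(level + trend)                         # predict before seeing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


rising = [0, 2, 4, 6, 8, 10]                       # hypothetical trending metric
print(exponential_smoothing(rising))               # lags further and further behind
print(double_exponential_smoothing(rising))        # tracks the trend
```

On this series the single smoother predicts 6.125 for the last point (actual value 10), while the trend-aware version predicts 10.0.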

Gaussian distributions are powerful because:

• Far far in the future, in a galaxy far far away:
  – I can make the same predictions because the statistical properties of the data haven’t changed
  – I can compare different metrics since they have similar statistical properties
• BUT…
• Cue in DRAMATIC MUSIC

What’s my distribution?

Another common distribution

Let’s look at an example

Histogram – probability distribution

3-sigma rule

Holt-Winters predictions

Are we doomed?

• There’s A LOT you can do with the data, other than just looking at it and putting thresholds!
  – Adaptive Mixture of Gaussians
  – Non-parametric techniques (http://www.metaforsoftware.com/everything-you-should-know-about-anomaly-detection-know-your-data-parametric-or-non-parametric/)
  – Spectral analysis
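As one possible illustration of a non-parametric approach (the talk does not say which technique it has in mind), the Python sketch below flags values that fall outside an empirical percentile band learned from history. It makes no assumption about the shape of the distribution. The function names, the 1st/99th percentile cut-offs and the long-tailed sample data are all assumptions.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (0 <= p <= 100) of pre-sorted data."""
    if not sorted_values:
        raise ValueError("no data")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def quantile_band_anomalies(history, new_values, lower_p=1.0, upper_p=99.0):
    """Flag new values that fall outside an empirical percentile band of history."""
    ordered = sorted(history)
    low = percentile(ordered, lower_p)
    high = percentile(ordered, upper_p)
    return [v for v in new_values if v < low or v > high]


# Hypothetical long-tailed metric (e.g. latencies): mostly small, a few large.
history = [1] * 50 + [2] * 30 + [5] * 15 + [20] * 5
print(quantile_band_anomalies(history, [1, 18, 45]))
```

Here 18 is unusual but has been seen before in the tail, so only 45 is flagged.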

Mixture of Gaussians

We’re not doomed, but: Know your data!!

• You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.

• A large amount of data center data is non-Gaussian
  – Gaussian statistics won’t work
  – Use appropriate techniques
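One crude way to get a first look at whether a metric is anywhere near Gaussian is to compute its skewness and excess kurtosis, both of which are close to zero for normally distributed data. The Python sketch below is only a quick screen under that assumption, not a formal normality test, and the function name and sample data are made up for illustration.

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis; both are near 0 for Gaussian data."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("a constant series has no shape to measure")
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    return skew, kurt


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
long_tailed = [1] * 50 + [2] * 30 + [5] * 15 + [20] * 5   # hypothetical latency-like data

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(long_tailed))
```

A strongly positive skew, as in the long-tailed series, is a hint that mean and standard deviation will describe the data poorly. Plotting the histogram remains the more informative check.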

Pet Peeve #1: How much data do we need?

• Trend towards higher and higher sampling rates in data collection

• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
  – Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
  – We notice patterns BECAUSE we can abstract

The danger of over-abstraction

[image] + [image] = comfortable?

So, how much data DO you need?

• You don’t need more resolution than twice your highest frequency (Nyquist-Shannon sampling theorem)

• Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.

• Watch out for correlated metrics (e.g. used vs. available memory)
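The used vs. available memory case can be checked directly: when one metric is a fixed total minus the other, their Pearson correlation is -1 and the second metric carries no extra information. A small Python sketch, where the function name, the 16 GB total and the sample values are illustrative assumptions:

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


total_mb = 16384                                    # hypothetical host memory
used_mb = [4100, 5200, 6100, 7900, 7400, 9000]
available_mb = [total_mb - u for u in used_mb]      # fully determined by used_mb

r = pearson_correlation(used_mb, available_mb)
print(round(r, 6))
```

A correlation close to +1 or -1 between two metrics suggests that collecting and alerting on both mostly duplicates work.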

Think: Is all data important to collect?

• Two camps:
  – Data is data, let’s collect and analyze everything and figure out the trends.
  – Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.
• Similar to the very public bun fight between Noam Chomsky and Peter Norvig
  – http://norvig.com/chomsky.html
• Unresolved as far as I know

Do we need both metrics?

More?

• Only scratched the surface
• I want to talk more about analytics, in more depth, but time’s up!!
  – (Actually Jenny won’t let me)
• Come talk to me during the breaks!
• Thank you!
