
Did Something Change?

Using Statistical Techniques to Interpret Service and Resource Metrics

Frank Bereznay

Abstract

Did Something Change?

In a perfect world, one would always know the answer to that question. Unfortunately, nobody works in a perfect world. This paper / presentation will explore statistical techniques used to look for deviations in metrics that are due to assignable causes as opposed to the period to period variation that is normally present. Hypothesis Testing, Statistical Process Control, Multivariate Adaptive Statistical Filtering, and Analysis of Variance will be compared and contrasted. SAS code will be used to perform the analysis. Exploratory analysis techniques will be used to build populations for analysis purposes.

Outline

What is Statistics all about?
  It's the population that counts
  Repeatable processes

Four techniques to review
  Hypothesis Testing
  Statistical Process Control
  Multivariate Adaptive Statistical Filtering (MASF)
  Analysis of Variance (ANOVA)

Example
Summary & Questions

A Note About Bill Mullen

What is Statistics All About?

It is the Population that Counts. Populations have Parameters. Samples have Statistics.

The Science of Statistics is all about estimating Population Parameters by taking Samples and calculating Statistics.

What is Statistics All About?

What is your population? It can be anything you want it to be, but

It must have well defined boundaries. A production cycle of a manufacturing process. A work shift. A bottling run for a particular wine vintage.

It must be randomly sampled.

Hypothesis Testing

Standard topic for first year Stat Class.
Simple and easy to do.
Interpretation of results has been misunderstood.

Hypothesis Testing

Start with a statement you wish to contradict or disprove. Typically this is the status quo. It becomes the null hypothesis.
  The average message rate is 15 per minute.

Create an alternative hypothesis that contradicts the null hypothesis.
  The average message rate is not 15 per minute.

Hypothesis Testing

Standard notation for stating the problem:

$$H_0: \mu = 15$$

$$H_A: \mu \neq 15$$

Hypothesis Testing

What is the population we are working with here? It is a 24 hour period. We must randomly sample across the entire period.

We randomly collect message rates at 10 different points in time: 13, 14, 16, 11, 16, 15, 12, 16, 12, 14

Hypothesis Testing – Population Parameters

Rate
Mean                   15.00
Standard Error          0.10
Median                 15.00
Mode                   15.00
Standard Deviation      3.74
Sample Variance        14.02
Kurtosis               -0.09
Skewness                0.20
Range                  25.00
Minimum                 4.00
Maximum                29.00
Sum                21,605.00
Count                   1440

Hypothesis Testing – Population Distribution

[Figure: Frequency Count of Message Rate. Histogram with x-axis Message Rate (4 to 29) and y-axis Count (0 to 180).]

Hypothesis Testing – Sample Statistics

Rates: 13, 14, 16, 11, 16, 15, 12, 16, 12, 14

Rates
Mean                   13.90
Standard Error          0.59
Median                 14.00
Mode                   16.00
Standard Deviation      1.85
Sample Variance         3.43
Kurtosis               -1.43
Skewness               -0.21
Range                   5.00
Minimum                11.00
Maximum                16.00
Sum                   139.00
Count                  10.00

Hypothesis Testing

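Calculation of the t statistic

$$t = \frac{\text{Sample Mean} - \mu_0}{\text{Sample Standard Deviation} / \sqrt{N}}$$

$$t = \frac{13.9 - 15}{1.85 / \sqrt{10}}$$

$$t \approx -1.86$$ (using the rounded standard error of 0.59)

with 9 (N-1) Degrees of Freedom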

Hypothesis Testing

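[Figure: t distribution from -5 to 5. The central 95% lies between the critical values -2.262 and 2.262, with 2.5% in each tail. The calculated t of -1.86 falls inside the central region.]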

Hypothesis Testing

So, what does this tell us? The official statement is:

At a 95% confidence level, the data is insufficient for us to state the mean of the population is not 15 for the 24 hour period being examined.

Important point: the contrary is not necessarily true. This does not prove in any way the population mean is 15.
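The calculation on the preceding slides can be reproduced in a few lines. The following is a minimal Python sketch (Python is an assumption here; the presentation itself uses SAS), with variable names chosen for illustration. The critical value 2.262 is taken from the slide rather than computed. Carried at full precision the t statistic comes out near -1.88 rather than -1.86, because the slide rounds the standard error to 0.59 first; the conclusion is the same.

```python
import math
import statistics

# Ten message rates sampled at random points in the 24 hour period (from the slide).
rates = [13, 14, 16, 11, 16, 15, 12, 16, 12, 14]
null_mean = 15.0      # H0: the population mean is 15 messages per minute
t_critical = 2.262    # two-sided 5% critical value for 9 degrees of freedom (from the slide)

n = len(rates)
sample_mean = statistics.mean(rates)
sample_sd = statistics.stdev(rates)           # sample standard deviation (N - 1 divisor)
standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - null_mean) / standard_error

# Reject H0 only if the t statistic falls in either 2.5% tail.
reject_null = abs(t_stat) > t_critical

print(f"mean={sample_mean:.2f} sd={sample_sd:.2f} se={standard_error:.2f} t={t_stat:.2f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Because |t| is smaller than 2.262, the sketch reports a failure to reject the null hypothesis, which matches the official statement above.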

Hypothesis Testing

Statistical Assumptions that need to be considered.
  Underlying population does not need to be normally distributed.
  The population must be randomly sampled.

Hypothesis Testing

Some practical uses.
  Validating we have met an SLA.
  Looking to see if something is not what we expect it to be.

Key point
  This technique combines an a priori expectation about a quality metric with sampled data.
  You need to know your data and choose wisely.

Statistical Process Control

Two legends stand out in this area: Walter Shewhart and W. Edwards Deming.

SPC is conceptually similar to Hypothesis Testing, but computationally different. No a priori data point is needed. Data is sub-grouped for calculation purposes.

SPC and Hypothesis Testing can produce different results for the same set of data.


Sample Order Output

                       Hour
Shift    1   2   3   4   5   6   7   8   Mean  Range
1       14  15  15  14  18  14  13  17     15      5
2       10  12  11  15  13  12  10  13     12      5
3       16  19  19  17  18  17  19  19     18      3

Overall Mean = 15    Average Range = 4.33


n*     A2      D3      D4
2     1.880    -      3.268
3     1.023    -      2.574
4     0.729    -      2.282
5     0.577    -      2.114
6     0.483    -      2.004
7     0.419   0.076   1.924
8     0.373   0.136   1.864
9     0.337   0.184   1.816
10    0.308   0.223   1.777
11    0.285   0.256   1.744
12    0.266   0.283   1.717
13    0.249   0.307   1.693
14    0.235   0.328   1.672
15    0.223   0.347   1.653

n* = number of observations per subgroup

Control Limit Derivation

Based on Mean Values
  UCLX = Grand Mean + A2 * Mean Range
  CLX  = Grand Mean
  LCLX = Grand Mean - A2 * Mean Range

Based on Range Values
  UCLR = D4 * Mean Range
  CLR  = Mean Range
  LCLR = D3 * Mean Range

Statistical Process Control

Done the correct way: 15 ± 0.373 * 4.33

[Figure: control chart of the subgroup means for subgroups 1 to 3, with UCL = 16.62, CL = 15 and LCL = 13.38. Y-axis scale 10 to 20.]
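The sub-grouped limits can be checked with a short calculation. This is a minimal Python sketch, not the author's SAS code; the shift data and the constants for n = 8 come from the tables above, and the variable names are illustrative.

```python
import statistics

# Hourly output for three shifts, eight observations per shift (from the slide).
shifts = {
    1: [14, 15, 15, 14, 18, 14, 13, 17],
    2: [10, 12, 11, 15, 13, 12, 10, 13],
    3: [16, 19, 19, 17, 18, 17, 19, 19],
}
# Control chart constants for subgroups of n = 8 (from the constants table).
A2, D3, D4 = 0.373, 0.136, 1.864

subgroup_means = {s: statistics.mean(v) for s, v in shifts.items()}
subgroup_ranges = {s: max(v) - min(v) for s, v in shifts.items()}
grand_mean = statistics.mean(list(subgroup_means.values()))
mean_range = statistics.mean(list(subgroup_ranges.values()))

# Limits for the chart of subgroup means.
ucl_x = grand_mean + A2 * mean_range
lcl_x = grand_mean - A2 * mean_range
# Limits for the chart of subgroup ranges.
ucl_r = D4 * mean_range
lcl_r = D3 * mean_range

# Subgroups whose mean falls outside the limits for the means chart.
out_of_control = [s for s, m in subgroup_means.items() if not lcl_x <= m <= ucl_x]

print(f"X-bar limits: {lcl_x:.2f} to {ucl_x:.2f}")   # 13.38 to 16.62
print(f"R limits: {lcl_r:.2f} to {ucl_r:.2f}")
print("subgroups outside the X-bar limits:", out_of_control)
```

With these limits the means of shifts 2 and 3 (12 and 18) fall outside the control limits, because the limits are built from the within-shift ranges only.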

Statistical Process Control

Without Sub-grouping limit calculation

Shift  Hour  Output        Output
1      1     14
1      2     15            Mean                  15
1      3     15            Standard Error         0.59
1      4     14            Median                15
1      5     18            Mode                  19
1      6     14            Standard Deviation     2.90
1      7     13            Sample Variance        8.43
1      8     17            Kurtosis              -1.09
2      1     10            Skewness              -0.12
2      2     12            Range                  9
2      3     11            Minimum               10
2      4     15            Maximum               19
2      5     13            Sum                  360
2      6     12            Count                 24
2      7     10
2      8     13
3      1     16
3      2     19
3      3     19
3      4     17
3      5     18
3      6     17
3      7     19
3      8     19

Mean ± 3 * Std / Sqrt(Sample Size)

15 ± 3 * 2.9 / 2.83
15 ± 3.08
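For comparison, the limit calculation without sub-grouping can be sketched the same way. This is an illustrative Python sketch under the slide's formula (mean ± 3 * Std / Sqrt(sample size), with a sample size of 8); the variable names are assumptions.

```python
import math
import statistics

# The same 24 observations, treated as one undivided sample (from the slide).
shift_1 = [14, 15, 15, 14, 18, 14, 13, 17]
shift_2 = [10, 12, 11, 15, 13, 12, 10, 13]
shift_3 = [16, 19, 19, 17, 18, 17, 19, 19]
outputs = shift_1 + shift_2 + shift_3
sample_size = 8   # observations per plotted point, as in the slide's Sqrt(Sample Size)

overall_mean = statistics.mean(outputs)
overall_sd = statistics.stdev(outputs)   # includes the shift-to-shift differences
half_width = 3 * overall_sd / math.sqrt(sample_size)
lcl = overall_mean - half_width
ucl = overall_mean + half_width

# Check each shift mean against these wider limits.
shift_means = [statistics.mean(s) for s in (shift_1, shift_2, shift_3)]
flagged = [m for m in shift_means if not lcl <= m <= ucl]

print(f"limits: {lcl:.2f} to {ucl:.2f}")   # 11.92 to 18.08
print("shift means outside the limits:", flagged)
```

Here the overall standard deviation absorbs the differences between shifts, so the limits widen and none of the three shift means is flagged. The same data gives a different answer depending on how the limits are derived.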

Statistical Process Control

Without Sub-grouping limit calculation

[Figure: control chart of the subgroup means for subgroups 1 to 3, with UCL = 18.08, CL = 15 and LCL = 11.92. Y-axis scale 10 to 20.]


Statistical Assumptions that need to be considered.
  The data does not need to be normally distributed.
  Proper sub-grouping of the data is fundamental to the technique.
  Sampling plan must be random and cover the boundaries of the population being examined.


Practical Uses
  Useful for measuring discrete physical objects.
    Things that have physical properties.
    Counts for outputs.
    Dollar volumes for orders / sales.

  Not appropriate for the interval based instrumentation data we frequently use.

Multivariate Adaptive Statistical Filtering (MASF)

Developed by Annie Shum and Jeff Buzen. Subject of 1995 CMG Paper by same name.

Practitioner’s approach to create a statistical detection technique which addresses the unique challenges of the interval driven time series datasets used by Computer Resource Management Professionals.

Why MASF?

Variance based statistical detection techniques are based on repeatable processes. Filling a bottle with wine. Manufacturing a roll of paper.

Commercial computer workloads are generally not repeatable processes (and that is an understatement!).

MASF

A two step process is established:
  A Reference Set is created during a period of normal operation in place of a random sample.
  The Reference Set is used as a set of criteria to examine data from subsequent periods.

MASF – Reference Set

What is a normal period? Workloads vary by time of day, day of week and month of year.

                                                Hour
Day     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23
Mon     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24
Tue    25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48
Wed    49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72
Thur   73   74   75   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96
Fri    97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120
Sat   121  122  123  124  125  126  127  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144
Sun   145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160  161  162  163  164  165  166  167  168

MASF – Aggregation Policies

The collected data can / should be grouped into sets of hours with the same characteristics. This increases the number of samples per collection period.

                     Hour
Day     8    9   10   11   12   13   14   15   16   17
Mon     1    2    2    2    3    4    4    4    5    5
Tue     6    7    7    7    3    8    8    8    9    9
Wed     6    7    7    7    3    8    8    8    9    9
Thur    6    7    7    7    3    8    8    8    9    9
Fri    10   11   11   11   12   13   13   13   14   14

MASF – Aggregation Policies

Response Time Example.

Period   # of Obs per Week   Mean   Std Dev.
1               1            0.04     0.02
2               3            0.85     0.06
3               4            0.25     0.09
4               3            0.81     0.06
5               2            0.73     0.04
6               3            0.35     0.02
7               9            0.66     0.03
8               9            0.62     0.03
9               6            0.55     0.02
10              1            0.34     0.02
11              3            0.58     0.03
12              1            0.28     0.06
13              3            0.45     0.04
14              2            0.35     0.04

Total          50
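An aggregation policy is simply a lookup from (day, hour) to a period number. The Python sketch below encodes the policy grid shown earlier and counts how many hourly observations each period collects per week; the names `policy` and `period_for` are illustrative assumptions, not part of MASF itself.

```python
from collections import Counter

FIRST_HOUR = 8  # the policy grid covers hours 8 through 17

# Period number for each hour 8..17, by day (from the aggregation policy table).
midweek = [6, 7, 7, 7, 3, 8, 8, 8, 9, 9]
policy = {
    "Mon":  [1, 2, 2, 2, 3, 4, 4, 4, 5, 5],
    "Tue":  midweek,
    "Wed":  midweek,
    "Thur": midweek,
    "Fri":  [10, 11, 11, 11, 12, 13, 13, 13, 14, 14],
}

def period_for(day, hour):
    """Return the aggregation period that a given day and hour belongs to."""
    return policy[day][hour - FIRST_HOUR]

# Number of hourly observations that land in each period during one week.
obs_per_week = Counter(p for row in policy.values() for p in row)

for period in sorted(obs_per_week):
    print(period, obs_per_week[period])
print("total", sum(obs_per_week.values()))
```

The counts agree with the "# of Obs per Week" column: for example, periods 7 and 8 each collect 9 observations per week, and the total is 50.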

MASF – Detection Limits

[Figure: three detection limit charts, one each for Monday, Tuesday thru Thursday, and Friday. Each plots LCL, CL and UCL by Hour (8 to 17) on a y-axis from 0 to 1.2.]

MASF - Summary

Very robust statistical detection technique.
  Addresses random sampling issues.
  Addresses volatility in commercial computing workloads.

More of a framework than a specific procedure.
  Reference set is user defined.
  Measurement methodology is user defined.

MASF Summary

Measurement framework is intended to be an N period rolling average.
  Ideally 10 to 20 points per reference set.
  Longer term datasets are subject to Time Series influences, which distort metrics.

This technique should be included in every Resource Management Specialist’s toolkit!

Analysis of Variance (ANOVA)

A comparison of parameters across populations.

Best explained by why it was developed.
  Agricultural work in the early 1900s to improve crop yields.
  A plot of land was divided into multiple areas and subjected to different treatments.
  The test was developed to compare the effects of these different treatments on crop yield.

ANOVA

Example of how this type of experiment would be set up:

  Watering Schedule
  Seed Placement Pattern
  Hybrid Seeds
  Exposure to Sunlight
  Different Fertilizer
  Frequent Tilling of Land

Important Point - We are dealing with six separate populations.

ANOVA

Same ground rules as Hypothesis Testing
  Start off by assuming all population means are equal. (Null Hypothesis)
  Attempt to prove they are not all the same. (Alternative Hypothesis)

However, calculation of the result is very laborious and best done by a computer.
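To give a feel for what the computer is doing, here is a minimal Python sketch of the one-way F statistic: the variance between group means divided by the variance within groups. The function name is an assumption, and the input reuses the three-shift output figures from the Statistical Process Control example purely as illustrative data; this is not an analysis from the presentation. Turning F into the Pr > F value that SAS reports additionally requires the F distribution, which is omitted here.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_values)
    group_means = [statistics.fmean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Illustrative input: hourly output for the three shifts in the SPC example.
shifts = [
    [14, 15, 15, 14, 18, 14, 13, 17],
    [10, 12, 11, 15, 13, 12, 10, 13],
    [16, 19, 19, 17, 18, 17, 19, 19],
]
f_stat, df_between, df_within = one_way_anova_f(shifts)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F means the group means differ by more than the within-group scatter would explain, which is the evidence used to reject the null hypothesis that all means are equal.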

ANOVA

Test stated in similar fashion to Hypothesis Testing.

$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_n$$

$$H_A: \text{Not all } \mu \text{ are equal}$$

ANOVA

Accepting Null Hypothesis has same meaning as Hypothesis Testing.
  Can't prove any mean is different – end of test.

Accepting Alternative Hypothesis has an interesting twist.
  One or more of the means are different – but which one(s) is/are different?

ANOVA

The Tukey test answers the Alternative Hypothesis question.
  John Tukey developed a technique to group the means of an ANOVA test when the Alternative Hypothesis is accepted.

We now have a way to take a set of multiple data populations and segment them into like groups.

ANOVA

SMF Data Volume Example

Date     Day    Count      Date     Day    Count
1-Jun    Wed    120        16-Jun   Thur   112
2-Jun    Thur   118        17-Jun   Fri    112
3-Jun    Fri    104        20-Jun   Mon    137
6-Jun    Mon    124        21-Jun   Tue    118
7-Jun    Tue    116        22-Jun   Wed    114
8-Jun    Wed    113        23-Jun   Thur   116
9-Jun    Thur   119        24-Jun   Fri    112
10-Jun   Fri    118        27-Jun   Mon    134
13-Jun   Mon    138        28-Jun   Tue    115
14-Jun   Tue    119        29-Jun   Wed    114
15-Jun   Wed    118        30-Jun   Thur   118

ANOVA

SAS Proc ANOVA Procedure

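Proc ANOVA;               /* runs against the most recently created dataset */
  Class Day;              /* Day is the classification (treatment) variable */
  Model Count = Day;      /* one-way model: Count explained by Day */
  Means Day / Tukey;      /* Tukey grouping of the Day means */
Run;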

ANOVA

Key Results from Test
  Pr > F = .0424
  We conclude at a 95% confidence level that one or more of the means are different.

Tukey Test
  Monday and Friday are different
  All other days are the same
  A certain degree of ambiguity

ANOVA

Typical way to report or display results of Tukey test.

Mon     Tue     Wed     Thur    Fri

|----------------------------|

        |----------------------------|

ANOVA

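Second test. Compare the day of the week across weeks.

Proc ANOVA;
  Format Date Date8.;     /* display Date values as calendar dates */
  Class Date;
  Model Count = Date;
  By Day;                 /* one test per day of week; data must be sorted by Day */
Run;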

ANOVA

Results from second test.
  All five tests accepted the null hypothesis.
  Pr > F were all in the high 90% range.
  So the 'official' statement is:

The data is insufficient to conclude there is any difference in the mean value for a day of the week across weeks.

ANOVA

Statistical Assumptions that need to be considered.
  Sufficient data is needed to obtain 6 to 10 observations for each treatment.
  Need to be sensitive to correlated data.
  Sampling plan must be random and cover the boundaries of the population being examined.

ANOVA

Practical Uses
  Comparing data from multiple days to see if it is the same or different.
  Use it as a clustering technique to build aggregated data groups for a MASF analysis.
  Multiple factor ANOVAs can look at multiple treatments (factors) at the same time.
    Day of week and hour of day.

A very powerful tool that should be in everybody’s toolkit!

Midrange Server Example

One Month of Prime Shift usage data for an OLTP server.

The MASF technique will be used to look for deviations.
  The first three weeks will be used as the reference set to examine the fourth week's data.

ANOVA will be used to create Aggregation Policies to cluster the hourly data.

Midrange Server Example

Table of Hourly Usage Metrics

                                    Hour
Date   Day     8.0   9.0  10.0  11.0  12.0  13.0  14.0  15.0   Daily Mean
3/1    Wed    34.7  35.1  29.4  27.9  27.9  27.9  27.9  27.8      29.8
3/2    Thur   29.0  28.7  28.1  28.2  27.9  27.9  28.4  27.9      28.2
3/3    Fri    36.0  30.5  29.6  29.4  28.4  27.9  28.4  28.1      29.7
3/6    Mon    29.4  28.9  30.2  28.7  27.9  28.1  28.0  27.9      28.6
3/7    Tue    35.2  29.3  28.2  28.2  28.6  27.9  28.1  28.5      29.2
3/8    Wed    36.5  31.9  30.6  32.4  31.9  36.1  30.3  30.3      32.4
3/9    Thur   37.2  31.5  29.9  31.0  36.6  31.3  28.4  29.7      31.8
3/10   Fri    27.6  27.9  27.9  27.7  28.0  27.7  27.7  27.8      27.8
3/13   Mon    30.9  30.2  29.9  30.3  29.6  29.3  29.0  27.9      29.6
3/14   Tue    34.9  32.2  33.0  33.9  31.9  31.5  33.3  31.5      32.8
3/15   Wed    35.9  33.3  35.8  36.2  34.6  30.6  37.9  28.6      34.0
3/16   Thur   34.6  33.8  28.0  38.9  33.9  30.5  28.6  33.7      32.7
3/17   Fri    31.2  33.3  31.8  31.0  30.7  30.7  29.6  28.3      30.8
3/20   Mon    30.1  28.3  28.4  28.4  28.7  29.4  28.4  28.8      28.8
3/21   Tue    28.1  28.1  28.8  28.2  28.0  30.1  30.6  28.1      28.8
3/22   Wed    37.8  28.9  28.1  28.1  28.8  27.9  28.6  28.0      29.5
3/23   Thur   38.2  28.7  28.7  33.5  42.0  30.6  27.9  28.0      32.1
3/24   Fri    27.9  28.5  28.1  33.3  27.9  28.0  28.1  27.9      28.6
3/27   Mon    30.8  33.4  32.3  28.9  28.8  28.2  28.8  28.3      30.0
3/28   Tue    32.3  29.0  28.2  30.0  34.9  33.6  32.4  43.8      33.2
3/29   Wed    28.5  29.3  28.8  28.7  28.8  28.2  28.3  28.3      28.6
3/30   Thur   33.2  31.4  34.8  31.3  28.2  28.2  28.3  28.3      30.4
3/31   Fri    28.4  28.4  28.2  29.4  28.4  28.2  28.4  28.3      28.4

Hourly Mean   32.4  30.4  29.9  30.6  30.4  29.5  29.3  29.5      30.3

[On the slide, a bracket labeled "Reference Set" marks the rows that make up the reference set.]


ANOVA test was performed on the hours of the day. Two overlapping groups were identified.

CPUAVE 33.1 30.8 30.7 30.5 29.7 29.6 29.4 28.8

Hour 8 11 12 9 10 13 14 15

|----------------------|

|--------------------------------------------|


A second ANOVA test was performed on the day of the week. It identified two non-overlapping groups.
  Group 1: Monday and Friday.
  Group 2: Tuesday, Wednesday and Thursday.


The following aggregation policy was built for this workload.

                 Hour
Day     8    9   10   11   12   13   14   15
Mon     1    3    3    3    3    3    3    3
Tue     2    4    4    4    4    4    4    4
Wed     2    4    4    4    4    4    4    4
Thur    2    4    4    4    4    4    4    4
Fri     1    3    3    3    3    3    3    3


The aggregation policy was used to build the following reference set from the table of hourly usage metrics:

Reference                              Std                Upper Control   Lower Control
Set         Days            Hours      Deviation   Mean   Limit           Limit
1           Mon & Fri       8          2.1         31.1   37.4            24.8
2           Tue,Wed,Thur    8          4.6         34.6   48.5            20.7
3           Mon & Fri       9 to 15    1.5         28.9   33.5            24.3
4           Tue,Wed,Thur    9 to 15    3.3         30.3   40.2            20.4


Plotting this along with the actual data from the fourth week produced the following control chart for Monday:

[Figure: control chart for Monday. Plots LCL, CL, UCL and Actual by Hour (8 to 15) on a y-axis from 0 to 40.]

Midrange Server Example

Exception Table for Rest of Week

Day    Hour   LCL    CPU Mean   UCL        Day    Hour   LCL    CPU Mean   UCL
Tue    8      20.7   32.3       48.5       Thur   8      20.7   33.2       48.5
       9      20.4   29.0       40.2              9      20.4   31.4       40.2
       10     20.4   28.2       40.2              10     20.4   34.8       40.2
       11     20.4   30.0       40.2              11     20.4   31.3       40.2
       12     20.4   34.9       40.2              12     20.4   28.2       40.2
       13     20.4   33.6       40.2              13     20.4   28.2       40.2
       14     20.4   32.4       40.2              14     20.4   28.3       40.2
       15     20.4   43.8       40.2              15     20.4   28.3       40.2
Wed    8      20.7   28.5       48.5       Fri    8      24.8   28.4       37.4
       9      20.4   29.3       40.2              9      24.3   28.4       33.5
       10     20.4   28.8       40.2              10     24.3   28.2       33.5
       11     20.4   28.7       40.2              11     24.3   29.4       33.5
       12     20.4   28.8       40.2              12     24.3   28.4       33.5
       13     20.4   28.2       40.2              13     24.3   28.2       33.5
       14     20.4   28.3       40.2              14     24.3   28.4       33.5
       15     20.4   28.3       40.2              15     24.3   28.3       33.5
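The detection step itself is a lookup and a comparison. The Python sketch below takes the control limits from the reference set table and the fourth-week rows (3/27 to 3/31) from the table of hourly usage metrics, then lists every hour that falls outside its limits. The dictionary layout and helper name are illustrative assumptions. The published limits appear consistent with mean ± 3 standard deviations to within rounding, but the sketch uses the published values directly rather than recomputing them.

```python
# (LCL, UCL) for each reference set, keyed by (day group, hour group), from the table.
limits_by_set = {
    ("MonFri", "8"):    (24.8, 37.4),
    ("TueThu", "8"):    (20.7, 48.5),
    ("MonFri", "9-15"): (24.3, 33.5),
    ("TueThu", "9-15"): (20.4, 40.2),
}

# Fourth-week hourly values for hours 8..15 (from the hourly usage table).
fourth_week = {
    "Mon":  [30.8, 33.4, 32.3, 28.9, 28.8, 28.2, 28.8, 28.3],
    "Tue":  [32.3, 29.0, 28.2, 30.0, 34.9, 33.6, 32.4, 43.8],
    "Wed":  [28.5, 29.3, 28.8, 28.7, 28.8, 28.2, 28.3, 28.3],
    "Thur": [33.2, 31.4, 34.8, 31.3, 28.2, 28.2, 28.3, 28.3],
    "Fri":  [28.4, 28.4, 28.2, 29.4, 28.4, 28.2, 28.4, 28.3],
}

def limits_for(day, hour):
    """Apply the aggregation policy: pick the reference set for this day and hour."""
    day_group = "MonFri" if day in ("Mon", "Fri") else "TueThu"
    hour_group = "8" if hour == 8 else "9-15"
    return limits_by_set[(day_group, hour_group)]

exceptions = []
for day, values in fourth_week.items():
    for hour, value in zip(range(8, 16), values):
        lcl, ucl = limits_for(day, hour)
        if not lcl <= value <= ucl:
            exceptions.append((day, hour, value))

print(exceptions)
```

Under these limits the only exception in the fourth week is Tuesday at hour 15, where 43.8 exceeds the upper control limit of 40.2.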

Summary

So, what is in your toolkit?
  Pick up these tools at your nearest CMG meeting.
  They do take some getting used to, but are worth the learning curve.
    Hypothesis Testing, Statistical Process Control, MASF and ANOVA

Be very wary of your data.
  The Time Series Data we routinely work with is a very complicated multi-dimensional dataset.
  Get to know your data. The better you know the data, the better you know your workload.

Summary

Next Step - Recommended Reading
  I. Trubin's CMG papers on application of MASF and variance based statistical detection techniques.
    2001 – Exception Detection System, Based on Statistical Process Control Concept.
    2002 – Global and Application Levels Exception Detection System, Based on MASF Technique
    2003 – Disk Subsystem Capacity Management, Based on Business Drivers, I/O Performance Metrics and MASF
    2004 – Mainframe Global and Workload Levels Statistical Exception Detection System, Based on MASF
    2005 – Capturing Workload Pathology by Statistical Exception Detection System.

Questions ???
