Managing Software Projects Analysis and Evaluation of Data - Reliable, Accurate, and Valid Data - Distribution of Data - Centrality and Dispersion - Data Smoothing: Moving Averages - Data Correlation - Normalization of Data (Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)


Page 1

Managing Software Projects

Analysis and Evaluation of Data

- Reliable, Accurate, and Valid Data

- Distribution of Data

- Centrality and Dispersion

- Data Smoothing: Moving Averages

- Data Correlation

- Normalization of Data

(Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)

Page 2

Reliable, Accurate, and Valid Data

Page 3

Definitions

• Reliable data: Data that are collected and tabulated according to the defined rules of measurement and metric

• Accurate data: Data that are collected and tabulated according to the defined level of precision of measurement and metric

• Valid data: Data that are collected, tabulated, and applied according to the defined intention of applying the measurement

Page 4

Distribution of Data

Page 5

Definition

• Data distribution: A description of a collection of data that shows the spread of the values and the frequency of occurrences of the values of the data

Page 6

Example #1: Skew of the Distribution

• Severity level 1: 23

• Severity level 2: 46

• Severity level 3: 79

• Severity level 4: 95

• Severity level 5: 110

The number of problems detected at each of five severity levels

(more on next slide)

Page 7

Example #1 (continued)

[Bar chart: Number of Problems Found (y-axis, 0 to 120) by Severity Level (x-axis, 1 to 5)]

Number of problems is skewed towards the higher-numbered severity levels
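A quick way to visualize such a distribution is a bar chart. The sketch below uses Python with matplotlib; both are assumptions, since the slides do not name a tool, and the counts are the Example #1 data.

```python
import matplotlib.pyplot as plt

# Problems detected at each severity level (data from Example #1)
levels = [1, 2, 3, 4, 5]
problems = [23, 46, 79, 95, 110]

plt.bar(levels, problems)
plt.xlabel("Severity Level")
plt.ylabel("Number of Problems Found")
plt.title("Distribution of Problems by Severity Level")
plt.show()
```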

Page 8

Example #2: Range of Data Values

• Functional area 1: 2

• Functional area 2: 7

• Functional area 3: 3

• Functional area 4: 8

• Functional area 5: 0

• Functional area 6: 1

• Functional area 7: 8

The number of severity level 1 problems by functional area

The range is from 0 to 8
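As a minimal sketch (again assuming Python, which the slides do not specify), the range of Example #2 is just the minimum and maximum of the data:

```python
# Severity level 1 problems by functional area (data from Example #2)
problems = [2, 7, 3, 8, 0, 1, 8]

low, high = min(problems), max(problems)
print(f"Range: {low} to {high}")  # Range: 0 to 8
```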

Page 9

Example #3: Data Trends

• Week 1: 20

• Week 2: 23

• Week 3: 45

• Week 4: 67

• Week 5: 35

• Week 6: 15

• Week 7: 10

The total number of problems found in a specific functional area across the test time period in weeks

Page 10

Centrality and Dispersion

Page 11

Definition

• Centrality analysis: An analysis of a data set to find the typical value of that data set

• Approaches:

– Average value

– Median value

– Mode value

– Variance and Standard deviation

– Control chart

Page 12

Average, Median, and Mode

• Average value (or mean): One type of centrality analysis that estimates the typical (or middle) value of a data set by summing all the observed data values and dividing the sum by the number of data points

– This is the most common of the centrality analysis methods

• Median: A value used in centrality analysis to estimate the typical (or middle) value of a data set. After the data values are sorted, the median is the data value that splits the data set into upper and lower halves

– If there is an even number of values, the middle two observations are averaged to obtain the median

• Mode: The most frequently occurring value in a data set

– If the data set contains floating point values, use the highest frequency of values occurring between two consecutive integers (inclusive)

Page 13

Example

Data Set = {2, 7, 3, 8, 0, 1, 8}

Average = xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1

Sorted data: 0, 1, 2, 3, 7, 8, 8

Median = 3 (the middle value of the sorted data)

Mode = 8 (the most frequently occurring value)
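These three values can be reproduced with Python's standard statistics module; a minimal sketch (the choice of Python is an assumption, not the book's):

```python
from statistics import mean, median, mode

data = [2, 7, 3, 8, 0, 1, 8]

print(round(mean(data), 1))  # 4.1
print(median(data))          # 3 (middle of sorted data: 0, 1, 2, 3, 7, 8, 8)
print(mode(data))            # 8 (occurs twice)
```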

Page 14

Variance and Standard Deviation

• Variance: The average of the squared deviations from the average value (the n – 1 divisor below gives the sample variance)

s^2 = SUM[ (xi – xavg)^2 ] / (n – 1)

• Standard deviation: The square root of the variance; a metric used to define and measure the dispersion of data from the average value in a data set

• It is numerically defined as follows:

s = SQRT[ SUM[ (xi – xavg)^2 ] / (n – 1) ]

where SQRT = square root function, SUM = sum function, xi = ith observation, xavg = average of all xi, and n = total number of observations

Page 15

Standard Deviation: Example

Data Set = {2, 7, 3, 8, 0, 1, 8}

xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1

SUM[ (xi – xavg)^2 ] = 4.41 + 8.41 + 1.21 + 15.21 + 16.81 + 9.61 + 15.21 = 70.87

SUM[ (xi – xavg)^2 ] / (n – 1) = 70.87 / 6 = 11.81

STANDARD DEVIATION = s = SQRT(11.81) = 3.44
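A sketch of the same computation in Python (an assumed choice); statistics.variance and statistics.stdev use the same n – 1 divisor as the formula above. Note that the slide rounds xavg to 4.1 before squaring, so its intermediate sum (70.87) differs slightly from the full-precision value (about 70.86), while the standard deviation still rounds to 3.44.

```python
from statistics import mean, stdev, variance

data = [2, 7, 3, 8, 0, 1, 8]

xavg = mean(data)                            # ~4.14 (slide rounds to 4.1)
sum_sq = sum((x - xavg) ** 2 for x in data)  # ~70.86 (slide: 70.87)
print(round(variance(data), 2))              # sum_sq / (n - 1) ~ 11.81
print(round(stdev(data), 2))                 # SQRT of variance ~ 3.44
```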

Page 16

Control Chart

• Control chart: A chart used to assess and control the variability of some process or product characteristic

• It usually involves establishing lower and upper limits (the control limits) of data variations from the data set’s average value

• If an observed data value falls outside the control limits, then it would trigger evaluation of the characteristic

Page 17

Control Chart (continued)

[Control chart: observed data points plotted around the average of 4.1 problems, with control limits at 0.66 problems and 7.54 problems]
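A minimal sketch of the control-limit check (Python assumed; the limits are set at one standard deviation from the average, which matches the 0.66 and 7.54 values shown, i.e., 4.1 minus and plus 3.44):

```python
from statistics import mean, stdev

data = [2, 7, 3, 8, 0, 1, 8]

center = mean(data)      # ~4.1 problems (the average)
spread = stdev(data)     # ~3.44 problems
upper = center + spread  # upper control limit ~ 7.54
lower = center - spread  # lower control limit ~ 0.66

# Flag any observation outside the control limits for evaluation
for i, value in enumerate(data, start=1):
    if not lower <= value <= upper:
        print(f"Observation {i} ({value} problems) is outside the limits")
```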

Page 18

Data Smoothing: Moving Averages

Page 19

Definitions

• Moving average: A technique for expressing data by computing the average of a fixed grouping (e.g., data for a fixed period) of data values; it is often used to suppress the effects of one extreme data point

• Data smoothing: A technique used to decrease the effects of individual, extreme variability in data values

Page 20

Example

Test week Problems found 2-week moving avg 3-week moving avg

1 20 - -

2 33 26.5 -

3 45 39 32.7

4 67 56 48.3

5 35 51 49

6 15 25 39

7 20 17.5 23.3
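A sketch of the k-week moving average used in the table above (Python assumed):

```python
def moving_average(values, k):
    """Average each window of k consecutive values; the first result
    corresponds to week k, since earlier weeks have no full window."""
    return [sum(values[i - k:i]) / k for i in range(k, len(values) + 1)]

problems = [20, 33, 45, 67, 35, 15, 20]  # problems found per test week

print(moving_average(problems, 2))  # [26.5, 39.0, 56.0, 51.0, 25.0, 17.5]
print(moving_average(problems, 3))  # ~ [32.7, 48.3, 49.0, 39.0, 23.3]
```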

Page 21

Data Correlation

Page 22

Definition

• Data correlation: A technique that analyzes the degree of relationship between sets of data

• One sought-after relationship in software is that between some attribute prior to product release and the same attribute after product release

• One popular way to examine data correlation is to analyze whether a linear relationship exists

– Two sets of data are paired together and plotted

– The resulting graph is reviewed to detect any relationship between the data sets

Page 23

Linear Regression

• Linear regression: A technique that estimates the relationship between two sets of data by fitting a straight line to the two sets of data values

• This is a more formal method of doing data correlation

• Linear regression uses the equation of a line: y = mx + b, where m is the slope and b is the y-intercept value

• To calculate the slope, use the following:

m = SUM[ (xi – xavg) × (yi – yavg) ] / SUM[ (xi – xavg)^2 ]

• To calculate the y-intercept, use the following:

b = yavg – (m × xavg)

Page 24

Example

SW Products #Pre-release #Post-release

A 10 24

B 5 13

C 35 71

D 75 155

E 15 34

F 22 50

G 7 16

H 54 112

Pre-release and Post-release Problems

Page 25

Example (continued)

xavg = 27.9
yavg = 59.4

m = 2.0 (slope, approx.)
b = 3.6 (y-intercept, approx.)

y = 2x + 3.6
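The slope and intercept formulas from the previous page translate directly; a minimal sketch (Python assumed) over the pre-/post-release data. Computed at full precision, m ~ 2.02 and b ~ 3.1; the slide's 3.6 follows from rounding the slope to 2.0 before computing the intercept.

```python
pre  = [10, 5, 35, 75, 15, 22, 7, 54]      # x: pre-release problems
post = [24, 13, 71, 155, 34, 50, 16, 112]  # y: post-release problems

xavg = sum(pre) / len(pre)    # 27.875 (slide rounds to 27.9)
yavg = sum(post) / len(post)  # 59.375 (slide rounds to 59.4)

# m = SUM[(xi - xavg)(yi - yavg)] / SUM[(xi - xavg)^2]
m = (sum((x - xavg) * (y - yavg) for x, y in zip(pre, post))
     / sum((x - xavg) ** 2 for x in pre))
b = yavg - m * xavg           # b = yavg - (m * xavg)

print(round(m, 1), round(b, 1))  # approximately 2.0 and 3.1
```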

Page 26

Example (continued)

[Scatter plot: Number of Post-release Problems Found (y-axis, 0 to 200) vs. Number of Pre-release Problems Found (x-axis, 10 to 80), with the eight products' data points showing a near-linear relationship]

Page 27

Normalization of Data

Page 28

Definition

• Normalizing data: A technique used to bring data characterizations to some common or standard level so that comparisons become more meaningful

• This is needed because a pure comparison of raw data sometimes does not provide an accurate comparison

• The number of source lines of code is the most common means of normalizing data

– Function points may also be used
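A minimal sketch of this kind of normalization as a defect density per thousand lines of code (Python assumed; the product names and counts are hypothetical, made up for illustration):

```python
# Hypothetical raw data: defects found and product size in source lines of code
products = {
    "Product A": {"defects": 120, "sloc": 60_000},
    "Product B": {"defects": 45,  "sloc": 15_000},
}

# Raw counts suggest Product A is worse; the normalized per-KLOC
# rates reverse that comparison
for name, d in products.items():
    per_kloc = d["defects"] / (d["sloc"] / 1000)
    print(f"{name}: {per_kloc:.1f} defects per KLOC")
# Product A: 2.0 defects per KLOC
# Product B: 3.0 defects per KLOC
```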

Page 29

Summary

• Reliable, Accurate, and Valid Data
• Distribution of Data
• Centrality and Dispersion
• Data Smoothing: Moving Averages
• Data Correlation
• Normalization of Data