Managing Software Projects Analysis and Evaluation of Data - Reliable, Accurate, and Valid Data - Distribution of Data - Centrality and Dispersion - Data Smoothing: Moving Averages - Data Correlation - Normalization of Data (Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)


Page 1

Managing Software Projects

Analysis and Evaluation of Data

- Reliable, Accurate, and Valid Data

- Distribution of Data

- Centrality and Dispersion

- Data Smoothing: Moving Averages

- Data Correlation

- Normalization of Data

(Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)

Page 2

Reliable, Accurate, and Valid Data

Page 3

Definitions

• Reliable data: Data that are collected and tabulated according to the defined rules of measurement and metric

• Accurate data: Data that are collected and tabulated according to the defined level of precision of measurement and metric

• Valid data: Data that are collected, tabulated, and applied according to the defined intention of applying the measurement

Page 4

Distribution of Data

Page 5

Definition

• Data distribution: A description of a collection of data that shows the spread of the values and the frequency of occurrences of the values of the data

Page 6

Example #1: Skew of the Distribution

• Severity level 1: 23

• Severity level 2: 46

• Severity level 3: 79

• Severity level 4: 95

• Severity level 5: 110

The number of problems detected at each of five severity levels

(more on next slide)

Page 7

Example #1 (continued)

[Bar chart: Number of Problems Found (y-axis, 0 to 120) by Severity Level (x-axis, 1 to 5)]

Number of problems is skewed towards the higher-numbered severity levels
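A quick way to visualize such a distribution is a bar chart. The sketch below uses Python with matplotlib; both are assumptions, since the slides do not name a tool, and the counts are the Example #1 data.

```python
import matplotlib.pyplot as plt

# Problems detected at each severity level (data from Example #1)
levels = [1, 2, 3, 4, 5]
problems = [23, 46, 79, 95, 110]

plt.bar(levels, problems)
plt.xlabel("Severity Level")
plt.ylabel("Number of Problems Found")
plt.title("Distribution of Problems by Severity Level")
plt.show()
```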

Page 8

Example #2: Range of Data Values

• Functional area 1: 2

• Functional area 2: 7

• Functional area 3: 3

• Functional area 4: 8

• Functional area 5: 0

• Functional area 6: 1

• Functional area 7: 8

The number of severity level 1 problems by functional area

The range is from 0 to 8
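As a minimal sketch (again assuming Python, which the slides do not specify), the range of Example #2 is just the minimum and maximum of the data:

```python
# Severity level 1 problems by functional area (data from Example #2)
problems = [2, 7, 3, 8, 0, 1, 8]

low, high = min(problems), max(problems)
print(f"Range: {low} to {high}")  # Range: 0 to 8
```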

Page 9

Example #3: Data Trends

• Week 1: 20

• Week 2: 23

• Week 3: 45

• Week 4: 67

• Week 5: 35

• Week 6: 15

• Week 7: 10

The total number of problems found in a specific functional area across the test time period in weeks

Page 10

Centrality and Dispersion

Page 11

Definition

• Centrality analysis: An analysis of a data set to find the typical value of that data set

• Approaches:

– Average value

– Median value

– Mode value

– Variance and Standard deviation

– Control chart

Page 12

Average, Median, and Mode

• Average value (or mean): One type of centrality analysis that estimates the typical (or middle) value of a data set by summing all the observed data values and dividing the sum by the number of data points

– This is the most common of the centrality analysis methods

• Median: A value used in centrality analysis to estimate the typical (or middle) value of a data set. After the data values are sorted, the median is the data value that splits the data set into upper and lower halves

– If there is an even number of values, the middle two observations are averaged to obtain the median

• Mode: The most frequently occurring value in a data set

– If the data set contains floating point values, use the highest frequency of values occurring between two consecutive integers (inclusive)

Page 13

Example

Data Set = {2, 7, 3, 8, 0, 1, 8}

Average = xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1

Sorted data: 0, 1, 2, 3, 7, 8, 8

Median = 3 (the middle value of the sorted data)

Mode = 8 (the most frequently occurring value)
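These three values can be reproduced with Python's standard statistics module; a minimal sketch (the choice of Python is an assumption, not the book's):

```python
from statistics import mean, median, mode

data = [2, 7, 3, 8, 0, 1, 8]

print(round(mean(data), 1))  # 4.1
print(median(data))          # 3 (middle of sorted data: 0, 1, 2, 3, 7, 8, 8)
print(mode(data))            # 8 (occurs twice)
```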

Page 14

Variance and Standard Deviation

• Variance: The average of the squared deviations from the average value (the n – 1 divisor below gives the sample variance)

s^2 = SUM[ (xi – xavg)^2 ] / (n – 1)

• Standard deviation: The square root of the variance; a metric used to define and measure the dispersion of data from the average value in a data set

• It is numerically defined as follows:

s = SQRT[ SUM[ (xi – xavg)^2 ] / (n – 1) ]

where SQRT = square root function, SUM = sum function, xi = ith observation, xavg = average of all xi, and n = total number of observations

Page 15

Standard Deviation: Example

Data Set = {2, 7, 3, 8, 0, 1, 8}

xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1

SUM[ (xi – xavg)^2 ] = 4.41 + 8.41 + 1.21 + 15.21 + 16.81 + 9.61 + 15.21 = 70.87

SUM[ (xi – xavg)^2 ] / (n – 1) = 70.87 / 6 = 11.81

STANDARD DEVIATION = s = SQRT(11.81) = 3.44
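A sketch of the same computation in Python (an assumed choice); statistics.variance and statistics.stdev use the same n – 1 divisor as the formula above. Note that the slide rounds xavg to 4.1 before squaring, so its intermediate sum (70.87) differs slightly from the full-precision value (about 70.86), while the standard deviation still rounds to 3.44.

```python
from statistics import mean, stdev, variance

data = [2, 7, 3, 8, 0, 1, 8]

xavg = mean(data)                            # ~4.14 (slide rounds to 4.1)
sum_sq = sum((x - xavg) ** 2 for x in data)  # ~70.86 (slide: 70.87)
print(round(variance(data), 2))              # sum_sq / (n - 1) ~ 11.81
print(round(stdev(data), 2))                 # SQRT of variance ~ 3.44
```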

Page 16

Control Chart

• Control chart: A chart used to assess and control the variability of some process or product characteristic

• It usually involves establishing lower and upper limits (the control limits) of data variations from the data set’s average value

• If an observed data value falls outside the control limits, then it would trigger evaluation of the characteristic

Page 17

Control Chart (continued)

[Control chart: observed data points plotted around the average of 4.1 problems, with control limits at 0.66 problems and 7.54 problems]
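A minimal sketch of the control-limit check (Python assumed; the limits are set at one standard deviation from the average, which matches the 0.66 and 7.54 values shown, i.e., 4.1 minus and plus 3.44):

```python
from statistics import mean, stdev

data = [2, 7, 3, 8, 0, 1, 8]

center = mean(data)      # ~4.1 problems (the average)
spread = stdev(data)     # ~3.44 problems
upper = center + spread  # upper control limit ~ 7.54
lower = center - spread  # lower control limit ~ 0.66

# Flag any observation outside the control limits for evaluation
for i, value in enumerate(data, start=1):
    if not lower <= value <= upper:
        print(f"Observation {i} ({value} problems) is outside the limits")
```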

Page 18

Data Smoothing: Moving Averages

Page 19

Definitions

• Moving average: A technique for expressing data by computing the average of a fixed grouping (e.g., data for a fixed period) of data values; it is often used to suppress the effects of one extreme data point

• Data smoothing: A technique used to decrease the effects of individual, extreme variability in data values

Page 20

Example

Test week Problems found 2-week moving avg 3-week moving avg

1 20 - -

2 33 26.5 -

3 45 39 32.7

4 67 56 48.3

5 35 51 49

6 15 25 39

7 20 17.5 23.3
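A sketch of the k-week moving average used in the table above (Python assumed):

```python
def moving_average(values, k):
    """Average each window of k consecutive values; the first result
    corresponds to week k, since earlier weeks have no full window."""
    return [sum(values[i - k:i]) / k for i in range(k, len(values) + 1)]

problems = [20, 33, 45, 67, 35, 15, 20]  # problems found per test week

print(moving_average(problems, 2))  # [26.5, 39.0, 56.0, 51.0, 25.0, 17.5]
print(moving_average(problems, 3))  # ~ [32.7, 48.3, 49.0, 39.0, 23.3]
```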

Page 21

Data Correlation

Page 22

Definition

• Data correlation: A technique that analyzes the degree of relationship between sets of data

• One sought-after relationship in software is that between some attribute prior to product release and the same attribute after product release

• One popular way to examine data correlation is to analyze whether a linear relationship exists

– Two sets of data are paired together and plotted

– The resulting graph is reviewed to detect any relationship between the data sets

Page 23

Linear Regression

• Linear regression: A technique that estimates the relationship between two sets of data by fitting a straight line to the two sets of data values

• This is a more formal method of doing data correlation

• Linear regression uses the equation of a line: y = mx + b, where m is the slope and b is the y-intercept value

• To calculate the slope, use the following:

m = SUM[ (xi – xavg) × (yi – yavg) ] / SUM[ (xi – xavg)^2 ]

• To calculate the y-intercept, use the following:

b = yavg – (m × xavg)

Page 24

Example

SW Products #Pre-release #Post-release

A 10 24

B 5 13

C 35 71

D 75 155

E 15 34

F 22 50

G 7 16

H 54 112

Pre-release and Post-release Problems

Page 25

Example (continued)

xavg = 27.9
yavg = 59.4

m = 2.0 (slope, approx.)
b = 3.6 (y-intercept, approx.)

y = 2x + 3.6
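The slope and intercept formulas from the previous page translate directly; a minimal sketch (Python assumed) over the pre-/post-release data. Computed at full precision, m ~ 2.02 and b ~ 3.1; the slide's 3.6 follows from rounding the slope to 2.0 before computing the intercept.

```python
pre  = [10, 5, 35, 75, 15, 22, 7, 54]      # x: pre-release problems
post = [24, 13, 71, 155, 34, 50, 16, 112]  # y: post-release problems

xavg = sum(pre) / len(pre)    # 27.875 (slide rounds to 27.9)
yavg = sum(post) / len(post)  # 59.375 (slide rounds to 59.4)

# m = SUM[(xi - xavg)(yi - yavg)] / SUM[(xi - xavg)^2]
m = (sum((x - xavg) * (y - yavg) for x, y in zip(pre, post))
     / sum((x - xavg) ** 2 for x in pre))
b = yavg - m * xavg           # b = yavg - (m * xavg)

print(round(m, 1), round(b, 1))  # approximately 2.0 and 3.1
```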

Page 26

Example (continued)

[Scatter plot: Number of Post-release Problems Found (y-axis, 0 to 200) vs. Number of Pre-release Problems Found (x-axis, 10 to 80), with the eight products' data points showing a near-linear relationship]

Page 27

Normalization of Data

Page 28

Definition

• Normalizing data: A technique used to bring data characterizations to some common or standard level so that comparisons become more meaningful

• This is needed because a pure comparison of raw data sometimes does not provide an accurate comparison

• The number of source lines of code is the most common means of normalizing data

– Function points may also be used
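A minimal sketch of this kind of normalization as a defect density per thousand lines of code (Python assumed; the product names and counts are hypothetical, made up for illustration):

```python
# Hypothetical raw data: defects found and product size in source lines of code
products = {
    "Product A": {"defects": 120, "sloc": 60_000},
    "Product B": {"defects": 45,  "sloc": 15_000},
}

# Raw counts suggest Product A is worse; the normalized per-KLOC
# rates reverse that comparison
for name, d in products.items():
    per_kloc = d["defects"] / (d["sloc"] / 1000)
    print(f"{name}: {per_kloc:.1f} defects per KLOC")
# Product A: 2.0 defects per KLOC
# Product B: 3.0 defects per KLOC
```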

Page 29

Summary

• Reliable, Accurate, and Valid Data
• Distribution of Data
• Centrality and Dispersion
• Data Smoothing: Moving Averages
• Data Correlation
• Normalization of Data