
Managing Software Projects

Analysis and Evaluation of Data

- Reliable, Accurate, and Valid Data

- Distribution of Data

- Centrality and Dispersion

- Data Smoothing: Moving Averages

- Data Correlation

- Normalization of Data

(Source: Tsui, F. Managing Software Projects. Jones and Bartlett, 2004)

Reliable, Accurate, and Valid Data


Definitions

• Reliable data: Data that are collected and tabulated according to the defined rules of measurement and metric

• Accurate data: Data that are collected and tabulated according to the defined level of precision of measurement and metric

• Valid data: Data that are collected, tabulated, and applied according to the defined intention of applying the measurement

Distribution of Data


Definition

• Data distribution: A description of a collection of data that shows the spread of the values and the frequency of occurrences of the values of the data


Example #1: Skew of the Distribution

• Severity level 1: 23

• Severity level 2: 46

• Severity level 3: 79

• Severity level 4: 95

• Severity level 5: 110

The number of problems detected at each of five severity levels



Example #1 (continued)

[Bar chart: Number of Problems Found (0 to 120) versus Severity Level (1 to 5)]

Number of problems is skewed towards the higher-numbered severity levels


Example #2: Range of Data Values

• Functional area 1: 2

• Functional area 2: 7

• Functional area 3: 3

• Functional area 4: 8

• Functional area 5: 0

• Functional area 6: 1

• Functional area 7: 8

The number of severity level 1 problems by functional area

The range is from 0 to 8
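A minimal Python sketch of this range computation, using the counts from the example:

```python
# Severity level 1 problem counts for functional areas 1 through 7,
# taken from the example above.
counts = [2, 7, 3, 8, 0, 1, 8]

# The range of the data runs from the smallest to the largest observed value.
print(min(counts), max(counts))  # 0 8
```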


Example #3: Data Trends

• Week 1: 20

• Week 2: 23

• Week 3: 45

• Week 4: 67

• Week 5: 35

• Week 6: 15

• Week 7: 10

The total number of problems found in a specific functional area across the test time period in weeks

Centrality and Dispersion


Definition

• Centrality analysis: An analysis of a data set to find the typical value of that data set

• Approaches:

– Average value

– Median value

– Mode value

– Variance and Standard deviation

– Control chart


Average, Median, and Mode

• Average value (or mean): One type of centrality analysis that estimates the typical (or middle) value of a data set by summing all the observed data values and dividing the sum by the number of data points

– This is the most common of the centrality analysis methods

• Median: A value used in centrality analysis to estimate the typical (or middle) value of a data set. After the data values are sorted, the median is the data value that splits the data set into upper and lower halves

– If there is an even number of values, the values of the middle two observations are averaged to obtain the median

• Mode: The most frequently occurring value in a data set

– If the data set contains floating-point values, use the highest frequency of values occurring between two consecutive integers (inclusive)


Example

Data Set = {2, 7, 3, 8, 0, 1, 8}

Average = xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1

Sorted data: 0, 1, 2, 3, 7, 8, 8

Median = 3 (the middle value of the sorted data)

Mode = 8
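A minimal Python sketch of these three centrality measures, using the standard library's statistics module on the same data set:

```python
from statistics import mean, median, mode

data = [2, 7, 3, 8, 0, 1, 8]

print(mean(data))    # 4.142857... (rounded to 4.1 above)
print(median(data))  # 3, the middle value of the sorted set 0, 1, 2, 3, 7, 8, 8
print(mode(data))    # 8, the most frequently occurring value
```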


Variance and Standard Deviation

• Variance: The average of the squared deviations from the average value

s^2 = SUM [ (xi – xavg)^2 ] / (n – 1)

• Standard deviation: the square root of the variance. A metric used to define and measure the dispersion of data from the average value in a data set

• It is numerically defined as follows:

s = SQRT [ SUM [ (xi – xavg)^2 ] / (n – 1) ]

where SQRT = square root function
SUM = sum function
xi = ith observation
xavg = average of all xi
n = total number of observations


Standard Deviation: Example

Data Set = {2, 7, 3, 8, 0, 1, 8}

xavg = (2 + 7 + 3 + 8 + 0 + 1 + 8) / 7 = 4.1

SUM [ (xi – xavg)^2 ] = 4.41 + 8.41 + 1.21 + 15.21 + 16.81 + 9.61 + 15.21 = 70.87

SUM [ (xi – xavg)^2 ] / (n – 1) = 70.87 / 6 = 11.81

STANDARD DEVIATION = s = SQRT(11.81) = 3.44
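A short Python sketch that mirrors the formula above with its (n – 1) divisor; the standard library's statistics.stdev uses the same divisor and serves as a cross-check:

```python
from math import sqrt
from statistics import stdev

def sample_std_dev(xs):
    """Sample standard deviation: SQRT [ SUM [ (xi - xavg)^2 ] / (n - 1) ]."""
    n = len(xs)
    x_avg = sum(xs) / n
    return sqrt(sum((x - x_avg) ** 2 for x in xs) / (n - 1))

data = [2, 7, 3, 8, 0, 1, 8]
print(sample_std_dev(data))  # ~3.44, matching the result above
print(stdev(data))           # same value from the standard library
```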


Control Chart

• Control chart: A chart used to assess and control the variability of some process or product characteristic

• It usually involves establishing lower and upper limits (the control limits) of data variations from the data set’s average value

• If an observed data value falls outside the control limits, then it would trigger evaluation of the characteristic


Control Chart (continued)

[Control chart: the example data set plotted around its average of 4.1 problems, with lower and upper control limits at 0.66 and 7.54 problems]
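The limits in this chart appear to be the average plus and minus one standard deviation (4.1 ± 3.44). A minimal Python sketch under that assumption; note that classic control charts often place the limits at three standard deviations instead:

```python
from statistics import mean, stdev

def control_limits(xs, k=1.0):
    """Lower and upper control limits at k standard deviations from the average.

    k=1 reproduces the chart's limits of roughly 0.66 and 7.54;
    classic control charts often use k=3.
    """
    avg = mean(xs)
    s = stdev(xs)
    return avg - k * s, avg + k * s

data = [2, 7, 3, 8, 0, 1, 8]
lcl, ucl = control_limits(data)
print(lcl, ucl)  # ~0.71 and ~7.58 (the chart rounds intermediate values)

# Values outside the limits would trigger evaluation of the characteristic.
print([x for x in data if not lcl <= x <= ucl])  # [8, 0, 8]
```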

Data Smoothing: Moving Averages


Definitions

• Moving average: A technique for expressing data by computing the average of a fixed grouping (e.g., data for a fixed period) of data values; it is often used to suppress the effects of one extreme data point

• Data smoothing: A technique used to decrease the effects of individual, extreme variability in data values


Example

Test week   Problems found   2-week moving avg   3-week moving avg
    1             20                 –                   –
    2             33               26.5                  –
    3             45               39                  32.7
    4             67               56                  48.3
    5             35               51                  49
    6             15               25                  39
    7             20               17.5                23.3
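A small Python sketch of the trailing moving average behind this table:

```python
def moving_average(xs, window):
    """Trailing moving average; the first window - 1 periods have no value yet."""
    return [
        sum(xs[i - window + 1 : i + 1]) / window if i >= window - 1 else None
        for i in range(len(xs))
    ]

problems_found = [20, 33, 45, 67, 35, 15, 20]  # weeks 1 through 7
print(moving_average(problems_found, 2))  # [None, 26.5, 39.0, 56.0, 51.0, 25.0, 17.5]
print(moving_average(problems_found, 3))  # [None, None, 32.67, 48.33, 49.0, 39.0, 23.33] (rounded)
```

Note how the week 4 spike of 67 problems is damped to 56 and 48.3 in the smoothed series.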

Data Correlation


Definition

• Data correlation: A technique that analyzes the degree of relationship between sets of data

• One sought-after relationship in software is that between some attribute prior to product release and the same attribute after product release

• One popular way to examine data correlation is to analyze whether a linear relationship exists

– Two sets of data are paired together and plotted

– The resulting graph is reviewed to detect any relationship between the data sets


Linear Regression

• Linear regression: A technique that estimates the relationship between two sets of data by fitting a straight line to the two sets of data values

• This is a more formal method of doing data correlation

• Linear regression uses the equation of a line: y = mx + b, where m is the slope and b is the y-intercept value

• To calculate the slope, use the following:

m = SUM [ (xi – xavg) × (yi – yavg) ] / SUM [ (xi – xavg)^2 ]

• To calculate the y-intercept, use the following:

b = yavg – (m × xavg)


Example

SW Product   # Pre-release   # Post-release
    A              10               24
    B               5               13
    C              35               71
    D              75              155
    E              15               34
    F              22               50
    G               7               16
    H              54              112

Pre-release and Post-release Problems


Example (continued)

xavg = 27.9
yavg = 59.4

m = 2.0 (slope, approx.)
b = 3.6 (y-intercept, approx.)

y = 2x + 3.6
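A Python sketch of the slope and y-intercept formulas applied to the table above:

```python
def linear_regression(xs, ys):
    """Least-squares fit y = mx + b, using the slope and intercept formulas above."""
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    m = (sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
         / sum((x - x_avg) ** 2 for x in xs))
    b = y_avg - m * x_avg
    return m, b

pre_release  = [10, 5, 35, 75, 15, 22, 7, 54]
post_release = [24, 13, 71, 155, 34, 50, 16, 112]

m, b = linear_regression(pre_release, post_release)
print(m, b)  # ~2.02 and ~3.13; rounding m to 2.0 first, as above, gives b ~ 3.6
```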


Example (continued)

[Scatter plot: Number of Post-release Problems Found (0 to 200) versus Number of Pre-release Problems Found (10 to 80)]

Normalization of Data


Definition

• Normalizing data: A technique used to bring data characterizations to some common or standard level so that comparisons become more meaningful

• This is needed because a pure comparison of raw data sometimes does not provide an accurate comparison

• The number of source lines of code is the most common means of normalizing data (a minimal sketch follows below)

– Function points may also be used
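A minimal sketch of size-based normalization; the defect counts and sizes below are hypothetical, invented only to illustrate the computation:

```python
def defects_per_ksloc(defects, sloc):
    """Normalize a raw defect count by size in thousands of source lines of code."""
    return defects / (sloc / 1000)

# Hypothetical products: B has more raw defects than A, but is the
# cleaner product once the counts are normalized by size.
print(defects_per_ksloc(120, 40_000))  # 3.0 defects per KSLOC
print(defects_per_ksloc(150, 90_000))  # ~1.67 defects per KSLOC
```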


Summary

• Reliable, Accurate, and Valid Data

• Distribution of Data

• Centrality and Dispersion

• Data Smoothing: Moving Averages

• Data Correlation

• Normalization of Data
