A Principled Methodology: A Dozen Principles of Software Effort Estimation
PhD Transcript
Ekrem Kocaguneli, 11/07/2012
Agenda
• Introduction
• Publications
• What to Know: 8 Questions
• Answers: 12 Principles
• Validity Issues
• Future Work
Introduction

Software effort estimation (SEE) is the process of estimating the total effort required to complete a software project (Keung2008 [1]).
Among IT projects developed in 2009, only 32% were successfully completed on time and with full functionality [23].
Successful estimation is critical for an organization.
• Over-estimation: killing promising projects.
• Under-estimation: wasting the entire effort! E.g. NASA's launch-control system was cancelled after its initial estimate of $200M was overrun by another $200M [22].
Introduction (cntd.)

We will discuss algorithms, but it would be irresponsible to say that SEE is merely an algorithmic problem. Organizational factors are just as important, e.g. organizations operating in different domains share common experiences of data collection and user interaction.
Introduction (cntd.)
This presentation is not about a single algorithm/answer targeting a single problem.
It brings together critical questions and related solutions.
It is (unfortunately) not everything about SEE, because there is not just one question.
What to know?
1. When do I have perfect data?
2. What is the best effort estimation method?
3. Can I use multiple methods?
4. ABE methods are easy to use. How can I improve them?
5. What if I lack resources for local data?
6. I don't believe in size attributes. What can I do?
7. Are all attributes and all instances necessary?
8. How to experiment, and which sampling method to use?
Publications

Journals
• E. Kocaguneli, T. Menzies, J. Keung, "On the Value of Ensemble Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, "Exploiting the Essential Assumptions of Analogy-based Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, "Kernel Methods for Software Effort Estimation", Empirical Software Engineering Journal, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Journal of Automated Software Engineering, 2012.

Journals under review
• E. Kocaguneli, T. Menzies, J. Keung, "Active Learning for Effort Estimation", third-round review at IEEE Transactions on Software Engineering.
• E. Kocaguneli, T. Menzies, E. Mendes, "Transfer Learning in Effort Estimation", submitted to ACM Transactions on Software Engineering.
• E. Kocaguneli, T. Menzies, "Software Effort Models Should be Assessed Via Leave-One-Out Validation", second-round review at Journal of Systems and Software.
• E. Kocaguneli, T. Menzies, E. Mendes, "Towards Theoretical Maximum Prediction Accuracy Using D-ABE", submitted to IEEE Transactions on Software Engineering.

Conferences
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, "Size Doesn't Matter? On the Value of Software Size Features for Effort Estimation", Predictive Models in Software Engineering (PROMISE), 2012.
• E. Kocaguneli, T. Menzies, "How to Find Relevant Data for Effort Estimation", International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011.
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, "When to Use Data from Other Projects for Effort Estimation", International Conference on Automated Software Engineering (ASE), 2010, short paper.
When do I have the perfect data?

Principle #1: Know your domain
Domain knowledge is important at every step (Fayyad1996 [2]). Yet this knowledge takes time and effort to gain, e.g. percentage commit information.

Principle #2: Let the experts talk
Initial results may be off according to domain experts. Success is to create discussion, interest, and suggestions.

Principle #3: Suspect your data
"Curiosity" to question is a key characteristic (Rauser2011 [3]), e.g. in an SEE project: 200+ test cases, 0 bugs.

Principle #4: Data collection is cyclic
Any step from mining to presentation may be repeated.
What is the best effort estimation method?

There is no agreed-upon best estimation method (Shepperd2001 [4]).
Methods change ranking w.r.t. conditions such as data sets and error measures (Myrtveit2005 [5]).
We experiment with 90 solo-methods, 20 public data sets, and 7 error measures.
The top 13 methods are CART and ABE methods (e.g. 1NN, 5NN).
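For reference, the error measures referred to throughout (MRE, MMRE, MdMRE) are the standard magnitude-of-relative-error family; with y_i the actual and ŷ_i the predicted effort:

$$\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i}, \qquad \mathrm{MMRE} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MRE}_i, \qquad \mathrm{MdMRE} = \operatorname{median}_i\,\mathrm{MRE}_i$$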
How to use a superior subset of methods?

We have a set of superior methods to recommend.
Assembling solo-methods may be a good idea, e.g. the fusion of 3 biometric modalities (Ross2003 [20]).
But the previous evidence of assembling multiple methods in SEE is discouraging: Baker2007 [7], Kocaguneli2009 [8], and Khoshgoftaar2009 [9] failed to outperform solo-methods.
Our approach: combine the top 2, 4, 8, and 13 solo-methods via mean, median, and IRWM (sketched below).
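As an illustration, here is a minimal sketch of such a multi-method combiner; the function name and the exact IRWM weighting are my own reading of "inverse ranked weighted mean", not code from the cited study:

```python
import numpy as np

def assemble(estimates, scheme="median"):
    """Combine the estimates of top-ranked solo-methods for one project.
    `estimates` is ordered from best-ranked method to worst-ranked."""
    e = np.asarray(estimates, dtype=float)
    if scheme == "mean":
        return e.mean()
    if scheme == "median":
        return float(np.median(e))
    if scheme == "irwm":
        # inverse ranked weighted mean: best-ranked method gets the
        # largest weight (n), the worst-ranked method gets weight 1
        weights = np.arange(len(e), 0, -1)
        return float(np.average(e, weights=weights))
    raise ValueError(f"unknown scheme: {scheme}")

# e.g. combining the top-4 solo-method estimates for one test project
print(assemble([420.0, 390.0, 510.0, 450.0], scheme="irwm"))  # 432.0
```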
What is the best effort estimation method? / How to use a superior subset of methods?

Principle #5: Use a ranking stability indicator
A method to identify successful methods using their rank changes

Principle #6: Assemble superior solo-methods
A novel scheme for assembling solo-methods: multi-methods that outperform all solo-methods

This research was published at:
• E. Kocaguneli, T. Menzies, J. Keung, "On the Value of Ensemble Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Journal of Automated Software Engineering, 2012.
How can we improve ABE methods?

Analogy-based effort estimation (ABE) methods make use of similar past projects for estimation. They are very widely used (Walkerden1999 [10]) because they:
• require no model calibration to local data
• can better handle outliers
• can work with one or more attributes
• are easy to explain

Two promising research areas (a minimal ABE sketch follows below):
• weighting the selected analogies (Mendes2003 [11], Mosley2002 [12])
• improving design options (Keung2008 [1])
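To make the ABE baseline concrete, here is a minimal sketch of a classic ABE method (k nearest neighbors over normalized project features); the Euclidean distance and mean adaptation are illustrative defaults, not the exact setup of the cited studies:

```python
import numpy as np

def abe_estimate(train_X, train_y, test_x, k=3):
    """Analogy-based estimation: find the k most similar past projects
    (analogies) and return the mean of their known efforts."""
    dists = np.linalg.norm(train_X - test_x, axis=1)  # similarity measure
    analogies = np.argsort(dists)[:k]                 # k closest projects
    return train_y[analogies].mean()                  # adaptation: mean effort

# toy usage: 4 past projects, 2 normalized features (e.g. size, experience)
X = np.array([[0.2, 0.5], [0.8, 0.1], [0.3, 0.6], [0.9, 0.9]])
y = np.array([100.0, 400.0, 120.0, 700.0])
print(abe_estimate(X, y, np.array([0.25, 0.55]), k=2))  # 110.0
```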
How can we improve ABE methods? (cntd.)

a) Weighting analogies
Building on previous research (Mendes2003 [11], Mosley2002 [12], Keung2008 [1]), we adopted two different strategies. First, we used kernel weighting to weigh the selected analogies, comparing the performance of each k-value with and without weighting. In none of the scenarios did we see a significant improvement.
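A minimal sketch of the kernel-weighting idea, assuming one Gaussian kernel and one bandwidth (the actual study swept 4 kernels and 5 bandwidths; see the detail slides):

```python
import numpy as np

def kernel_weighted_abe(train_X, train_y, test_x, k=5, bandwidth=1.0):
    """ABE variant: closer analogies get larger weights via a kernel,
    instead of the plain mean over the k nearest neighbors."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    analogies = np.argsort(dists)[:k]
    # Gaussian kernel: weight decays with distance, bandwidth sets the decay
    w = np.exp(-(dists[analogies] ** 2) / (2 * bandwidth ** 2))
    return np.average(train_y[analogies], weights=w)
```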
How can we improve ABE methods? (cntd.)

b) Designing ABE methods
Easy-path: remove training instances that violate the method's assumptions. D-ABE is built on theoretical maximum prediction accuracy (TMPA) (Keung2008 [1]); TEAK will be discussed later.

D-ABE:
• Get the best estimates of all training instances.
• Remove all training instances within half of the worst MRE (according to TMPA).
• Return the closest neighbor's estimate for the test instance.

[Figure: a test instance among training instances a–f; instances close to the worst MRE are removed and the closest remaining neighbor's estimate is returned.]
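A sketch of the D-ABE steps above as I read them, using leave-one-out 1NN as the "best estimate" of each training instance; the helper logic and the degenerate-case guard are my own additions:

```python
import numpy as np

def dabe_estimate(train_X, train_y, test_x):
    """D-ABE sketch: prune training instances whose best achievable
    estimate has an MRE close to the worst MRE, then answer with the
    closest surviving neighbor's effort."""
    n = len(train_y)
    mres = np.empty(n)
    for i in range(n):                       # best estimate of instance i:
        d = np.linalg.norm(train_X - train_X[i], axis=1)
        d[i] = np.inf                        # its nearest *other* neighbor
        est = train_y[np.argmin(d)]
        mres[i] = abs(train_y[i] - est) / train_y[i]
    keep = mres < mres.max() / 2             # drop instances within half
    if not keep.any():                       # of the worst MRE; keep all
        keep[:] = True                       # in the degenerate case
    d = np.linalg.norm(train_X[keep] - test_x, axis=1)
    return train_y[keep][np.argmin(d)]       # closest kept neighbor
```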
How can we improve ABE methods? (cntd.)
[Figures: D-ABE compared to static-k ABE w.r.t. MMRE, and w.r.t. win, tie, loss]
How can we improve ABE methods? (cntd.)
Principle #7: Weighting analogies is over-elaboration
An investigation of kernel weighting, an unexplored and promising ABE option
A negative result, published in the Empirical Software Engineering Journal
Principle #8: Use easy-path design
An ABE design option that can be applied to different ABE methods (D-ABE, TEAK)
This research was published at:
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, "Exploiting the Essential Assumptions of Analogy-based Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, "Kernel Methods for Software Effort Estimation", Empirical Software Engineering Journal, 2011.
How to handle lack of local data?

Finding enough local training data is a fundamental problem of SEE (Turhan2009 [13]), and the merits of using cross-data from another company are questionable (Kitchenham2007 [14]).
We use a relevancy filtering method called TEAK on public and proprietary data sets. After TEAK's relevancy filtering, cross-data works as well as within-data for 6 out of 8 proprietary data sets and 19 out of 21 public data sets.
• Similar projects but dissimilar effort values: high variance.
• Similar projects and similar effort values: low variance.
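TEAK itself grows trees of clustered projects and prunes high-variance subtrees; below is a heavily simplified sketch of that intuition (variance-based relevancy filtering over a nearest-neighbor region; all names and thresholds are illustrative, not the published algorithm):

```python
import numpy as np

def relevancy_filter(cross_X, cross_y, test_x, region=8, max_var=0.5):
    """Keep a cross-company training region around the test project only
    if its effort values have low variance; similar projects with
    dissimilar efforts are not trusted."""
    d = np.linalg.norm(cross_X - test_x, axis=1)
    near = np.argsort(d)[:region]            # candidate analogies
    if np.var(np.log(cross_y[near])) > max_var:
        return None                          # high variance: prune region
    return near                              # low variance: use as analogies
```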
How to handle lack of local data? (cntd.)
Principle #9: Use relevancy filtering
A novel method to handle lack of local data
Successful application on public as well as proprietary data
This research was published at:
• E. Kocaguneli, T. Menzies, "How to Find Relevant Data for Effort Estimation", International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011.
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, "When to Use Data from Other Projects for Effort Estimation", International Conference on Automated Software Engineering (ASE), 2010, short paper.
E(k) matrices & Popularity

This concept helps with the next two problems, size features and the essential content, via the pop1NN and QUICK algorithms, respectively.
A similar concept in ML is the reverse nearest neighbor (RNN), used to find the instances whose k-NNs include a specific query (Achtert2006 [26]).
E(k) matrices & Popularity (cntd.)

Outlier pruning (a sketch follows below):
1. Calculate the "popularity" of instances.
2. Sort by popularity.
3. Label one instance at a time.
4. Find the stopping point.
5. Return the estimate from the labeled training data.

Finding the stopping point:
1. All popular instances are exhausted; or
2. there is no MRE improvement for n consecutive instances; or
3. the ∆ between the best and the worst error of the last n instances is very small (∆ = 0.1; n = 3).
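A condensed sketch of the two pieces above, the E(k)-based popularity count and the stopping rules (parameter handling is my own illustration):

```python
import numpy as np

def popularity(X, k=1):
    """E(k) idea: mark each instance's k nearest neighbors; the
    popularity of an instance = how often other instances pick it."""
    n = len(X)
    pop = np.zeros(n, dtype=int)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        pop[np.argsort(d)[:k]] += 1
    return pop

def stop_point(mres, n=3, delta=0.1):
    """Stopping rules 2 and 3: no MRE improvement over the last n labeled
    instances, or best and worst of the last n differ by less than delta."""
    if len(mres) < n + 1:
        return False
    last, best_before = mres[-n:], min(mres[:-n])
    return all(m >= best_before for m in last) or max(last) - min(last) < delta
```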
[Figure: sample steps of the popularity-based labeling process]
E(k) matrices & Popularity (cntd.)

[Figure: error as instances are labeled]
• Picking a random training instance is not a good idea.
• More popular instances in the active pool decrease the error.
• Eventually one of the stopping-point conditions fires.
Do I have to use size attributes?

At the heart of widely accepted SEE methods lie the software size attributes: COCOMO uses LOC (Boehm1981 [15]), whereas FP (Albrecht1983 [16]) uses logical transactions.
Size attributes are beneficial if used properly (Lum2002 [17]); e.g. DoD and NASA use them successfully.
Yet size attributes may not be trusted, or may not be estimable at the early stages; that disrupts the adoption of SEE methods.

"Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs." – B. Gates
"This is a very costly measuring unit because it encourages the writing of insipid code." – E. Dijkstra
Do I have to use size attributes? (cntd.)
[Figure: pop1NN (w/o size) vs. CART and 1NN (w/ size)]
Given enough resources for correct collection and estimation, size features are helpful. If not, then outlier pruning helps.
Do I have to use size attributes? (cntd.)
Principle #10: Use outlier pruning
Promotion of SEE methods that can compensate for the lack of software size features
A method called pop1NN that shows size features are not a "must"
This research was published at:
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, "Size Doesn't Matter? On the Value of Software Size Features for Effort Estimation", Predictive Models in Software Engineering (PROMISE), 2012.
What is the essential content of SEE data?

In a matrix of N instances and F features, the essential content is the reduced matrix of N′ × F′ (with N′ ≤ N and F′ ≤ F).
SEE is populated with overly complex methods for marginal performance increases (Jorgensen2007 [18]). QUICK is an active-learning method that combines outlier removal and synonym pruning.
Synonym pruning (see the sketch below):
1. Transpose the normalized matrix and calculate the popularity of features.
2. Select the non-popular features.
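A minimal sketch of synonym pruning, reusing the popularity idea on the transposed matrix (the threshold is an illustrative choice):

```python
import numpy as np

def synonym_prune(X, k=1, min_pop=1):
    """Feature pruning: transpose the normalized matrix so features
    become rows, compute each feature's popularity, and keep only the
    non-popular features (popular features have close 'synonyms')."""
    F = X.T
    n = len(F)
    pop = np.zeros(n, dtype=int)
    for i in range(n):
        d = np.linalg.norm(F - F[i], axis=1)
        d[i] = np.inf
        pop[np.argsort(d)[:k]] += 1          # each feature votes for its k-NN
    return X[:, pop < min_pop]               # select the non-popular features
```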
Distance-based removal of features previously seemed to be reserved for instances, yet the two tasks are similar: both remove cells in the hypercube of all cases times all columns (Lipowezky1998 [24]), and an ABE method can be viewed as a two-dimensional reduction (Ahn2007 [25]). In our lab, a variance-based feature selector has been used as a row selector.
What is the essential content of SEE data? (cntd.)
The essential content is at most 31% of all the cells, and 10% on median. What about performance?

"There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data are not truly high-dimensional." (Levina & Bickel 2005)
What is the essential content of SEE data? (cntd.)
[Figures: QUICK vs. passiveNN (1NN); QUICK vs. CART]
There is only one data set where QUICK is significantly worse than passiveNN, and 4 such data sets when QUICK is compared to CART.
What is the essential content of SEE data? (cntd.)
Principle #11: Combine outlier and synonym pruning
An unsupervised method to find the essential content of SEE data sets and reduce the data needs
Promoting research to elaborate on the data, not on the algorithm
This research is under third-round review:
• E. Kocaguneli, T. Menzies, J. Keung, "Active Learning for Effort Estimation", IEEE Transactions on Software Engineering.
How should I choose the right sampling method (SM)?

Expectation (Kitchenham2007 [14]) vs. observed:
• No significant difference in bias & variance values among the 90 methods
• Only minutes of run-time difference (<15)
• Leave-one-out (LOO) is not probabilistic, and its results can be easily shared
How should I choose the right SM? (cntd.)
Principle #12: Be aware of sampling method trade-off
The first investigation of the bias-variance trade-off in the SEE domain
Recommendations based on experimental concerns
This research is under second-round review:
• E. Kocaguneli, T. Menzies, "Software Effort Models Should be Assessed Via Leave-One-Out Validation", Journal of Systems and Software.
What to know?
1. When do I have perfect data?
2. What is the best effort estimation method?
3. Can I use multiple methods?
4. ABE methods are easy to use. How can I improve them?
5. What if I lack resources for local data?
6. I don't believe in size attributes. What can I do?
7. Are all attributes and all instances necessary?
8. How to experiment, and which sampling method to use?

The 12 principles:
1. Know your domain
2. Let the experts talk
3. Suspect your data
4. Data collection is cyclic
5. Use a ranking stability indicator
6. Assemble superior solo-methods
7. Weighting analogies is over-elaboration
8. Use easy-path design
9. Use relevancy filtering
10. Use outlier pruning
11. Combine outlier and synonym pruning
12. Be aware of sampling method trade-off
Validity Issues

Construct validity, i.e. do we measure what we intend to measure?
• We use previously recommended estimation methods, error measures, and data sets.

External validity, i.e. can we generalize results outside the current specifications?
• It is difficult to assert that results will definitely hold.
• Yet we use almost all the publicly available SEE data sets: the median number of projects used by the studies reviewed in Kitchenham2007 [14] is 186, whereas our experimentation uses 1000+ projects.
Future Work

• Application to publicly accessible big data sets (e.g. 300K projects and 2M users; 250K open-source projects)
• Smarter, larger-scale algorithms for general conclusions; current methods may face scalability issues, so common ideas need improving for scale, e.g. linear-time NN methods
• Application to different domains, e.g. defect prediction
• Combining intrinsic-dimensionality techniques from ML to lower-bound the dimensions of SEE data sets (Levina2004 [27])
What have we covered?
References
[1] J. W. Keung, "Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation," 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
[2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Commun. ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996.
[3] J. Rauser, "What is a career in big data?" 2011. [Online]. Available: http://strataconf.com/stratany2011/public/schedule/speaker/10070
[4] M. Shepperd and G. Kadoda, "Comparing Software Prediction Techniques Using Simulation," IEEE Trans. Softw. Eng., vol. 27, no. 11, pp. 1014–1022, 2001.
[5] I. Myrtveit, E. Stensrud, and M. Shepperd, "Reliability and validity in comparative studies of software prediction models," IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
[6] E. Alpaydin, "Techniques for combining multiple learners," Proceedings of Engineering of Intelligent Systems, vol. 2, pp. 6–12, 1998.
[7] D. Baker, "A hybrid approach to expert and model-based effort estimation," Master's thesis, Lane Department of Computer Science and Electrical Engineering, West Virginia University, 2007.
[8] E. Kocaguneli, Y. Kultur, and A. Bener, "Combining multiple learners induced on multiple datasets for software effort prediction," International Symposium on Software Reliability Engineering (ISSRE), 2009, student paper.
[9] T. M. Khoshgoftaar, P. Rebours, and N. Seliya, "Software quality analysis by combining multiple projects and learners," Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.
[10] F. Walkerden and R. Jeffery, "An empirical study of analogy-based software effort estimation," Empirical Software Engineering, vol. 4, no. 2, pp. 135–158, 1999.
[11] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, "A comparative study of cost estimation models for web hypermedia applications," Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
[12] E. Mendes and N. Mosley, "Further investigation into the use of CBR and stepwise regression to predict development effort for web hypermedia applications," International Symposium on Empirical Software Engineering, 2002.
[13] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[14] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
[15] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software Cost Estimation with COCOMO II. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2000.
[16] A. Albrecht and J. Gaffney, "Software function, source lines of code and development effort prediction: A software science validation," IEEE Trans. Softw. Eng., vol. 9, pp. 639–648, 1983.
[17] K. Lum, J. Powell, and J. Hihn, "Validation of spacecraft cost estimation models for flight and ground systems," ISPA'02: Conference Proceedings, Software Modeling Track, 2002.
[18] M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007.
[19] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
[20] A. Ross, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, no. 13, pp. 2115–2125, Sep. 2003.
[21] R. P. L. Buse and T. Zimmermann, "Information needs for software development analytics," ICSE 2012, pp. 987–996.
[22] Spaceref.com, "NASA to shut down checkout & launch control system," August 26, 2002. http://www.spaceref.com/news/viewnews.html?id=475
[23] Standish Group, CHAOS Report, West Yarmouth, Massachusetts: Standish Group, 2004.
[24] U. Lipowezky, "Selection of the optimal prototype subset for 1-NN classification," Pattern Recognition Letters, vol. 19, pp. 907–918, 1998.
[25] H. Ahn, K. Kim, and I. Han, "A case-based reasoning system with the two-dimensional reduction technique for customer classification," Expert Systems with Applications, vol. 32, no. 4, pp. 1011–1019, May 2007.
[26] E. Achtert, C. Böhm, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz, "Efficient reverse k-nearest neighbor search in arbitrary metric spaces," Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06), 2006.
[27] E. Levina and P. J. Bickel, "Maximum likelihood estimation of intrinsic dimension," Advances in Neural Information Processing Systems, vol. 17, Cambridge, MA, USA: The MIT Press, 2004.
Detail Slides
Pre-processors and learners
What is the best effort estimation method? (cntd.)
1. Rank methods according to their win, loss, and win−loss values.
2. δr is the maximum rank change.
3. Sort methods according to loss and observe the δr values (a sketch follows below).
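A rough sketch of the δr computation, assuming we already have each method's rank under the three orderings (method names and ranks are illustrative):

```python
def max_rank_change(rank_by_win, rank_by_loss, rank_by_winloss):
    """delta_r per method: the maximum change of its rank across the
    win, loss, and win-loss orderings; stable methods (small delta_r)
    are the safer recommendations."""
    return {m: max(rank_by_win[m], rank_by_loss[m], rank_by_winloss[m]) -
               min(rank_by_win[m], rank_by_loss[m], rank_by_winloss[m])
            for m in rank_by_win}

print(max_rank_change({"CART": 1, "1NN": 2},
                      {"CART": 2, "1NN": 1},
                      {"CART": 1, "1NN": 2}))  # {'CART': 1, '1NN': 1}
```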
What is the best effort estimation method? (cntd.)
What about aggregate results reflecting on specific scenarios? (a reviewer's question)

Sort methods according to increasing MdMRE, then group MRE values that are statistically the same. Note how the superior solo-methods correspond to the best (lowest MRE) groups; highlighted are the cases where superior methods do not occur in the top group.
How can we improve ABE methods? (cntd.)
We used kernel weighting with 4 kernels and 5 bandwidth values, plus IRWM, to weigh the selected analogies (with 5 different k-values).
A total of 2090 settings (95 + 1900 + 95):
• 19 data sets × 5 k-values = 95
• 19 data sets × 5 k-values × 4 kernels × 5 bandwidths = 1900
• IRWM: 19 data sets × 5 k-values = 95
How can we improve ABE methods? (cntd.)
We used kernel weighting to weigh the selected analogies, comparing the performance of each k-value with and without weighting (win/tie/loss sketch below):
• o = tie for 3 or more k-values
• − = loss for 3 or more k-values
• + = win for 3 or more k-values
In none of the scenarios did we see a significant improvement.
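Win/tie/loss counts are typically produced by a rank test followed by a median comparison; a sketch, assuming a Mann-Whitney U test (the exact statistical test used in the thesis may differ):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def win_tie_loss(errors_a, errors_b, alpha=0.05):
    """Compare two methods' error samples: 'tie' if the rank test finds
    no significant difference, else the lower-median method wins."""
    _, p = mannwhitneyu(errors_a, errors_b, alternative="two-sided")
    if p >= alpha:
        return "tie"
    return "win" if np.median(errors_a) < np.median(errors_b) else "loss"
```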
How to handle lack of local data? (cntd.)
[Figures: TEAK on proprietary data; TEAK on public data]
Do I have to use size attributes? (cntd.)
Can standard methods tolerate the lack of size attributes?
[Figures: CART w/o size vs. CART w/ size; CART and 1NN]
How should I choose the right SM?

Only one work (Kitchenham2007 [14]) discusses the implications of the sampling method (SM) on bias and variance. The expectation is:
• LOO: high variance, low bias
• 3-way: low variance, high bias
• 10-way: in between

Does the expectation hold? What about run time and ease of replication? (A sketch of the two samplers follows below.)
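For concreteness, a minimal sketch of the two sampling methods under comparison (hand-rolled index generators, not any particular library's API):

```python
import random

def leave_one_out(n):
    """LOO: each project is the test set exactly once; deterministic,
    so results are trivially repeatable and shareable."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def k_way(n, k=3, seed=1):
    """k-way cross-validation: shuffle, then split into k folds; results
    depend on the seed, which is why LOO is easier to replicate."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for f in range(k):
        yield [i for i in idx if i not in set(folds[f])], folds[f]
```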