A Principled Methodology: A Dozen Principles of Software Effort Estimation
PhD Transcript
Ekrem Kocaguneli, 11/07/2012
Agenda
• Introduction
• Publications
• What to Know: 8 Questions
• Answers: 12 Principles
• Validity Issues
• Future Work
Introduction

Software effort estimation (SEE) is the process of estimating the total effort required to complete a software project (Keung2008 [1]).
Among IT projects developed in 2009, only 32% were successfully completed on time and with full functionality [23].
Successful estimation is critical for an organization.
• Over-estimation: killing promising projects.
• Under-estimation: wasting the entire effort! E.g. NASA's launch-control system was cancelled after its initial estimate of $200M was overrun by another $200M [22].
Introduction (cntd.)

We will discuss algorithms, but it would be irresponsible to say that SEE is merely an algorithmic problem. Organizational factors are just as important, e.g. organizations operating in different domains share common experiences of data collection and user interaction.
Introduction (cntd.)
This presentation is not about a single algorithm/answer targeting a single problem.
It brings together critical questions and related solutions.
It is (unfortunately) not everything about SEE, because there is not just one question.
What to know?
1. When do I have perfect data?
2. What is the best effort estimation method?
3. Can I use multiple methods?
4. ABE methods are easy to use. How can I improve them?
5. What if I lack resources for local data?
6. I don't believe in size attributes. What can I do?
7. Are all attributes and all instances necessary?
8. How to experiment, and which sampling method to use?
Publications

Journals
• E. Kocaguneli, T. Menzies, J. Keung, "On the Value of Ensemble Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, "Exploiting the Essential Assumptions of Analogy-based Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, "Kernel Methods for Software Effort Estimation", Empirical Software Engineering Journal, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Journal of Automated Software Engineering, 2012.

Journals under review
• E. Kocaguneli, T. Menzies, J. Keung, "Active Learning for Effort Estimation", third-round review at IEEE Transactions on Software Engineering.
• E. Kocaguneli, T. Menzies, E. Mendes, "Transfer Learning in Effort Estimation", submitted to ACM Transactions on Software Engineering.
• E. Kocaguneli, T. Menzies, "Software Effort Models Should be Assessed Via Leave-One-Out Validation", second-round review at Journal of Systems and Software.
• E. Kocaguneli, T. Menzies, E. Mendes, "Towards Theoretical Maximum Prediction Accuracy Using D-ABE", submitted to IEEE Transactions on Software Engineering.

Conferences
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, "Size Doesn't Matter? On the Value of Software Size Features for Effort Estimation", Predictive Models in Software Engineering (PROMISE), 2012.
• E. Kocaguneli, T. Menzies, "How to Find Relevant Data for Effort Estimation", International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011.
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, "When to Use Data from Other Projects for Effort Estimation", International Conference on Automated Software Engineering (ASE), 2010, short paper.
When do I have the perfect data?

Principle #1: Know your domain
Domain knowledge is important at every step (Fayyad1996 [2]). Yet this knowledge takes time and effort to gain, e.g. percentage commit information.

Principle #2: Let the experts talk
Initial results may be off according to domain experts. Success is to create discussion, interest, and suggestions.

Principle #3: Suspect your data
"Curiosity" to question is a key characteristic (Rauser2011 [3]), e.g. in an SEE project: 200+ test cases, 0 bugs.

Principle #4: Data collection is cyclic
Any step from mining to presentation may be repeated.
What is the best effort estimation method?

There is no agreed-upon best estimation method (Shepperd2001 [4]).
Methods change ranking w.r.t. conditions such as data sets and error measures (Myrtveit2005 [5]).
We experiment with 90 solo-methods, 20 public data sets, and 7 error measures.
The top 13 methods are CART and ABE methods (e.g. 1NN, 5NN).
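For reference, the error measures referred to throughout (MRE, MMRE, MdMRE) are the standard magnitude-of-relative-error family; with y_i the actual and ŷ_i the predicted effort:

$$\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i}, \qquad \mathrm{MMRE} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MRE}_i, \qquad \mathrm{MdMRE} = \operatorname{median}_i\,\mathrm{MRE}_i$$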
How to use a superior subset of methods?

We have a set of superior methods to recommend.
Assembling solo-methods may be a good idea, e.g. the fusion of 3 biometric modalities (Ross2003 [20]).
But the previous evidence of assembling multiple methods in SEE is discouraging: Baker2007 [7], Kocaguneli2009 [8], and Khoshgoftaar2009 [9] failed to outperform solo-methods.
Our approach: combine the top 2, 4, 8, and 13 solo-methods via mean, median, and IRWM (sketched below).
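As an illustration, here is a minimal sketch of such a multi-method combiner; the function name and the exact IRWM weighting are my own reading of "inverse ranked weighted mean", not code from the cited study:

```python
import numpy as np

def assemble(estimates, scheme="median"):
    """Combine the estimates of top-ranked solo-methods for one project.
    `estimates` is ordered from best-ranked method to worst-ranked."""
    e = np.asarray(estimates, dtype=float)
    if scheme == "mean":
        return e.mean()
    if scheme == "median":
        return float(np.median(e))
    if scheme == "irwm":
        # inverse ranked weighted mean: best-ranked method gets the
        # largest weight (n), the worst-ranked method gets weight 1
        weights = np.arange(len(e), 0, -1)
        return float(np.average(e, weights=weights))
    raise ValueError(f"unknown scheme: {scheme}")

# e.g. combining the top-4 solo-method estimates for one test project
print(assemble([420.0, 390.0, 510.0, 450.0], scheme="irwm"))  # 432.0
```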
What is the best effort estimation method? / How to use a superior subset of methods?

Principle #5: Use a ranking stability indicator
A method to identify successful methods using their rank changes

Principle #6: Assemble superior solo-methods
A novel scheme for assembling solo-methods: multi-methods that outperform all solo-methods

This research was published at:
• E. Kocaguneli, T. Menzies, J. Keung, "On the Value of Ensemble Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• J. Keung, E. Kocaguneli, T. Menzies, "A Ranking Stability Indicator for Selecting the Best Effort Estimator in Software Cost Estimation", Journal of Automated Software Engineering, 2012.
How can we improve ABE methods?

Analogy-based effort estimation (ABE) methods make use of similar past projects for estimation. They are very widely used (Walkerden1999 [10]) because they:
• require no model calibration to local data
• can better handle outliers
• can work with one or more attributes
• are easy to explain

Two promising research areas (a minimal ABE sketch follows below):
• weighting the selected analogies (Mendes2003 [11], Mosley2002 [12])
• improving design options (Keung2008 [1])
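To make the ABE baseline concrete, here is a minimal sketch of a classic ABE method (k nearest neighbors over normalized project features); the Euclidean distance and mean adaptation are illustrative defaults, not the exact setup of the cited studies:

```python
import numpy as np

def abe_estimate(train_X, train_y, test_x, k=3):
    """Analogy-based estimation: find the k most similar past projects
    (analogies) and return the mean of their known efforts."""
    dists = np.linalg.norm(train_X - test_x, axis=1)  # similarity measure
    analogies = np.argsort(dists)[:k]                 # k closest projects
    return train_y[analogies].mean()                  # adaptation: mean effort

# toy usage: 4 past projects, 2 normalized features (e.g. size, experience)
X = np.array([[0.2, 0.5], [0.8, 0.1], [0.3, 0.6], [0.9, 0.9]])
y = np.array([100.0, 400.0, 120.0, 700.0])
print(abe_estimate(X, y, np.array([0.25, 0.55]), k=2))  # 110.0
```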
How can we improve ABE methods? (cntd.)

a) Weighting analogies
Building on previous research (Mendes2003 [11], Mosley2002 [12], Keung2008 [1]), we adopted two different strategies. First, we used kernel weighting to weigh the selected analogies, comparing the performance of each k-value with and without weighting. In none of the scenarios did we see a significant improvement.
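A minimal sketch of the kernel-weighting idea, assuming one Gaussian kernel and one bandwidth (the actual study swept 4 kernels and 5 bandwidths; see the detail slides):

```python
import numpy as np

def kernel_weighted_abe(train_X, train_y, test_x, k=5, bandwidth=1.0):
    """ABE variant: closer analogies get larger weights via a kernel,
    instead of the plain mean over the k nearest neighbors."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    analogies = np.argsort(dists)[:k]
    # Gaussian kernel: weight decays with distance, bandwidth sets the decay
    w = np.exp(-(dists[analogies] ** 2) / (2 * bandwidth ** 2))
    return np.average(train_y[analogies], weights=w)
```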
How can we improve ABE methods? (cntd.)

b) Designing ABE methods
Easy-path: remove training instances that violate the method's assumptions. D-ABE is built on theoretical maximum prediction accuracy (TMPA) (Keung2008 [1]); TEAK will be discussed later.

D-ABE:
• Get the best estimates of all training instances.
• Remove all training instances within half of the worst MRE (according to TMPA).
• Return the closest neighbor's estimate for the test instance.

[Figure: a test instance among training instances a–f; instances close to the worst MRE are removed and the closest remaining neighbor's estimate is returned.]
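A sketch of the D-ABE steps above as I read them, using leave-one-out 1NN as the "best estimate" of each training instance; the helper logic and the degenerate-case guard are my own additions:

```python
import numpy as np

def dabe_estimate(train_X, train_y, test_x):
    """D-ABE sketch: prune training instances whose best achievable
    estimate has an MRE close to the worst MRE, then answer with the
    closest surviving neighbor's effort."""
    n = len(train_y)
    mres = np.empty(n)
    for i in range(n):                       # best estimate of instance i:
        d = np.linalg.norm(train_X - train_X[i], axis=1)
        d[i] = np.inf                        # its nearest *other* neighbor
        est = train_y[np.argmin(d)]
        mres[i] = abs(train_y[i] - est) / train_y[i]
    keep = mres < mres.max() / 2             # drop instances within half
    if not keep.any():                       # of the worst MRE; keep all
        keep[:] = True                       # in the degenerate case
    d = np.linalg.norm(train_X[keep] - test_x, axis=1)
    return train_y[keep][np.argmin(d)]       # closest kept neighbor
```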
How can we improve ABE methods? (cntd.)
[Figures: D-ABE compared to static-k ABE w.r.t. MMRE, and w.r.t. win, tie, loss]
How can we improve ABE methods? (cntd.)
Principle #7: Weighting analogies is over-elaboration
An investigation of kernel weighting, an unexplored and promising ABE option
A negative result, published in the Empirical Software Engineering Journal
Principle #8: Use easy-path design
An ABE design option that can be applied to different ABE methods (D-ABE, TEAK)
This research was published at:
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, "Exploiting the Essential Assumptions of Analogy-based Effort Estimation", IEEE Transactions on Software Engineering, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, "Kernel Methods for Software Effort Estimation", Empirical Software Engineering Journal, 2011.
How to handle lack of local data?

Finding enough local training data is a fundamental problem of SEE (Turhan2009 [13]), and the merits of using cross-data from another company are questionable (Kitchenham2007 [14]).
We use a relevancy filtering method called TEAK on public and proprietary data sets. After TEAK's relevancy filtering, cross-data works as well as within-data for 6 out of 8 proprietary data sets and 19 out of 21 public data sets.
• Similar projects but dissimilar effort values: high variance.
• Similar projects and similar effort values: low variance.
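TEAK itself grows trees of clustered projects and prunes high-variance subtrees; below is a heavily simplified sketch of that intuition (variance-based relevancy filtering over a nearest-neighbor region; all names and thresholds are illustrative, not the published algorithm):

```python
import numpy as np

def relevancy_filter(cross_X, cross_y, test_x, region=8, max_var=0.5):
    """Keep a cross-company training region around the test project only
    if its effort values have low variance; similar projects with
    dissimilar efforts are not trusted."""
    d = np.linalg.norm(cross_X - test_x, axis=1)
    near = np.argsort(d)[:region]            # candidate analogies
    if np.var(np.log(cross_y[near])) > max_var:
        return None                          # high variance: prune region
    return near                              # low variance: use as analogies
```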
How to handle lack of local data? (cntd.)
Principle #9: Use relevancy filtering
A novel method to handle lack of local data
Successful application on public as well as proprietary data
This research was published at:
• E. Kocaguneli, T. Menzies, "How to Find Relevant Data for Effort Estimation", International Symposium on Empirical Software Engineering and Measurement (ESEM), 2011.
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, "When to Use Data from Other Projects for Effort Estimation", International Conference on Automated Software Engineering (ASE), 2010, short paper.
E(k) matrices & Popularity

This concept helps with the next two problems, size features and the essential content, via the pop1NN and QUICK algorithms, respectively.
A similar concept in ML is the reverse nearest neighbor (RNN), used to find the instances whose k-NNs include a specific query (Achtert2006 [26]).
E(k) matrices & Popularity (cntd.)

Outlier pruning (a sketch follows below):
1. Calculate the "popularity" of instances.
2. Sort by popularity.
3. Label one instance at a time.
4. Find the stopping point.
5. Return the estimate from the labeled training data.

Finding the stopping point:
1. All popular instances are exhausted; or
2. there is no MRE improvement for n consecutive instances; or
3. the ∆ between the best and the worst error of the last n instances is very small (∆ = 0.1; n = 3).
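A condensed sketch of the two pieces above, the E(k)-based popularity count and the stopping rules (parameter handling is my own illustration):

```python
import numpy as np

def popularity(X, k=1):
    """E(k) idea: mark each instance's k nearest neighbors; the
    popularity of an instance = how often other instances pick it."""
    n = len(X)
    pop = np.zeros(n, dtype=int)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        pop[np.argsort(d)[:k]] += 1
    return pop

def stop_point(mres, n=3, delta=0.1):
    """Stopping rules 2 and 3: no MRE improvement over the last n labeled
    instances, or best and worst of the last n differ by less than delta."""
    if len(mres) < n + 1:
        return False
    last, best_before = mres[-n:], min(mres[:-n])
    return all(m >= best_before for m in last) or max(last) - min(last) < delta
```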
[Figure: sample steps of the popularity-based labeling process]
E(k) matrices & Popularity (cntd.)

[Figure: error as instances are labeled]
• Picking a random training instance is not a good idea.
• More popular instances in the active pool decrease the error.
• Eventually one of the stopping-point conditions fires.
Do I have to use size attributes?

At the heart of widely accepted SEE methods lie the software size attributes: COCOMO uses LOC (Boehm1981 [15]), whereas FP (Albrecht1983 [16]) uses logical transactions.
Size attributes are beneficial if used properly (Lum2002 [17]); e.g. DoD and NASA use them successfully.
Yet size attributes may not be trusted, or may not be estimable at the early stages; that disrupts the adoption of SEE methods.

"Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs." – B. Gates
"This is a very costly measuring unit because it encourages the writing of insipid code." – E. Dijkstra
Do I have to use size attributes? (cntd.)
[Figure: pop1NN (w/o size) vs. CART and 1NN (w/ size)]
Given enough resources for correct collection and estimation, size features are helpful. If not, then outlier pruning helps.
Do I have to use size attributes? (cntd.)
Principle #10: Use outlier pruning
Promotion of SEE methods that can compensate for the lack of software size features
A method called pop1NN that shows size features are not a "must"
This research was published at:
• E. Kocaguneli, T. Menzies, J. Hihn, Byeong Ho Kang, "Size Doesn't Matter? On the Value of Software Size Features for Effort Estimation", Predictive Models in Software Engineering (PROMISE), 2012.
What is the essential content of SEE data?

In a matrix of N instances and F features, the essential content is the reduced matrix of N′ × F′ (with N′ ≤ N and F′ ≤ F).
SEE is populated with overly complex methods for marginal performance increases (Jorgensen2007 [18]). QUICK is an active-learning method that combines outlier removal and synonym pruning.
Synonym pruning (see the sketch below):
1. Transpose the normalized matrix and calculate the popularity of features.
2. Select the non-popular features.
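A minimal sketch of synonym pruning, reusing the popularity idea on the transposed matrix (the threshold is an illustrative choice):

```python
import numpy as np

def synonym_prune(X, k=1, min_pop=1):
    """Feature pruning: transpose the normalized matrix so features
    become rows, compute each feature's popularity, and keep only the
    non-popular features (popular features have close 'synonyms')."""
    F = X.T
    n = len(F)
    pop = np.zeros(n, dtype=int)
    for i in range(n):
        d = np.linalg.norm(F - F[i], axis=1)
        d[i] = np.inf
        pop[np.argsort(d)[:k]] += 1          # each feature votes for its k-NN
    return X[:, pop < min_pop]               # select the non-popular features
```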
Distance-based removal of features previously seemed to be reserved for instances, yet the two tasks are similar: both remove cells in the hypercube of all cases times all columns (Lipowezky1998 [24]), and an ABE method can be viewed as a two-dimensional reduction (Ahn2007 [25]). In our lab, a variance-based feature selector has been used as a row selector.
What is the essential content of SEE data? (cntd.)
The essential content is at most 31% of all the cells, and 10% on median. What about performance?

"There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data are not truly high-dimensional." (Levina & Bickel 2005)
What is the essential content of SEE data? (cntd.)
[Figures: QUICK vs. passiveNN (1NN); QUICK vs. CART]
There is only one data set where QUICK is significantly worse than passiveNN, and 4 such data sets when QUICK is compared to CART.
What is the essential content of SEE data? (cntd.)
Principle #11: Combine outlier and synonym pruning
An unsupervised method to find the essential content of SEE data sets and reduce the data needs
Promoting research to elaborate on the data, not on the algorithm
This research is under third-round review:
• E. Kocaguneli, T. Menzies, J. Keung, "Active Learning for Effort Estimation", IEEE Transactions on Software Engineering.
How should I choose the right sampling method (SM)?

Expectation (Kitchenham2007 [14]) vs. observed:
• No significant difference in bias & variance values among the 90 methods
• Only minutes of run-time difference (<15)
• Leave-one-out (LOO) is not probabilistic, and its results can be easily shared
How should I choose the right SM? (cntd.)
Principle #12: Be aware of sampling method trade-off
The first investigation of the bias-variance trade-off in the SEE domain
Recommendations based on experimental concerns
This research is under second-round review:
• E. Kocaguneli, T. Menzies, "Software Effort Models Should be Assessed Via Leave-One-Out Validation", Journal of Systems and Software.
What to know?
1. When do I have perfect data?
2. What is the best effort estimation method?
3. Can I use multiple methods?
4. ABE methods are easy to use. How can I improve them?
5. What if I lack resources for local data?
6. I don't believe in size attributes. What can I do?
7. Are all attributes and all instances necessary?
8. How to experiment, and which sampling method to use?

The 12 principles:
1. Know your domain
2. Let the experts talk
3. Suspect your data
4. Data collection is cyclic
5. Use a ranking stability indicator
6. Assemble superior solo-methods
7. Weighting analogies is over-elaboration
8. Use easy-path design
9. Use relevancy filtering
10. Use outlier pruning
11. Combine outlier and synonym pruning
12. Be aware of sampling method trade-off
Validity Issues

Construct validity, i.e. do we measure what we intend to measure?
• We use previously recommended estimation methods, error measures, and data sets.

External validity, i.e. can we generalize results outside the current specifications?
• It is difficult to assert that results will definitely hold.
• Yet we use almost all the publicly available SEE data sets: the median number of projects used by the studies reviewed in Kitchenham2007 [14] is 186, whereas our experimentation uses 1000+ projects.
Future Work

• Application to publicly accessible big data sets (e.g. 300K projects and 2M users; 250K open-source projects)
• Smarter, larger-scale algorithms for general conclusions; current methods may face scalability issues, so common ideas need improving for scale, e.g. linear-time NN methods
• Application to different domains, e.g. defect prediction
• Combining intrinsic-dimensionality techniques from ML to lower-bound the dimensions of SEE data sets (Levina2004 [27])
What have we covered?
References
[1] J. W. Keung, "Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation," 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
[2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Commun. ACM, vol. 39, no. 11, pp. 27–34, Nov. 1996.
[3] J. Rauser, "What is a career in big data?" 2011. [Online]. Available: http://strataconf.com/stratany2011/public/schedule/speaker/10070
[4] M. Shepperd and G. Kadoda, "Comparing Software Prediction Techniques Using Simulation," IEEE Trans. Softw. Eng., vol. 27, no. 11, pp. 1014–1022, 2001.
[5] I. Myrtveit, E. Stensrud, and M. Shepperd, "Reliability and validity in comparative studies of software prediction models," IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
[6] E. Alpaydin, "Techniques for combining multiple learners," Proceedings of Engineering of Intelligent Systems, vol. 2, pp. 6–12, 1998.
[7] D. Baker, "A hybrid approach to expert and model-based effort estimation," Master's thesis, Lane Department of Computer Science and Electrical Engineering, West Virginia University, 2007.
[8] E. Kocaguneli, Y. Kultur, and A. Bener, "Combining multiple learners induced on multiple datasets for software effort prediction," International Symposium on Software Reliability Engineering (ISSRE), 2009, student paper.
[9] T. M. Khoshgoftaar, P. Rebours, and N. Seliya, "Software quality analysis by combining multiple projects and learners," Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.
[10] F. Walkerden and R. Jeffery, "An empirical study of analogy-based software effort estimation," Empirical Software Engineering, vol. 4, no. 2, pp. 135–158, 1999.
[11] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, "A comparative study of cost estimation models for web hypermedia applications," Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
[12] E. Mendes and N. Mosley, "Further investigation into the use of CBR and stepwise regression to predict development effort for web hypermedia applications," International Symposium on Empirical Software Engineering, 2002.
[13] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[14] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
[15] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark, E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software Cost Estimation with COCOMO II. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2000.
[16] A. Albrecht and J. Gaffney, "Software function, source lines of code and development effort prediction: A software science validation," IEEE Trans. Softw. Eng., vol. 9, pp. 639–648, 1983.
[17] K. Lum, J. Powell, and J. Hihn, "Validation of spacecraft cost estimation models for flight and ground systems," ISPA'02: Conference Proceedings, Software Modeling Track, 2002.
[18] M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007.
[19] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
[20] A. Ross, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, no. 13, pp. 2115–2125, Sep. 2003.
[21] R. P. L. Buse and T. Zimmermann, "Information needs for software development analytics," ICSE 2012, pp. 987–996.
[22] Spaceref.com, "NASA to shut down checkout & launch control system," August 26, 2002. http://www.spaceref.com/news/viewnews.html?id=475
[23] Standish Group, CHAOS Report, West Yarmouth, Massachusetts: Standish Group, 2004.
[24] U. Lipowezky, "Selection of the optimal prototype subset for 1-NN classification," Pattern Recognition Letters, vol. 19, pp. 907–918, 1998.
[25] H. Ahn, K. Kim, and I. Han, "A case-based reasoning system with the two-dimensional reduction technique for customer classification," Expert Systems with Applications, vol. 32, no. 4, pp. 1011–1019, May 2007.
[26] E. Achtert, C. Böhm, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz, "Efficient reverse k-nearest neighbor search in arbitrary metric spaces," Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06), 2006.
[27] E. Levina and P. J. Bickel, "Maximum likelihood estimation of intrinsic dimension," Advances in Neural Information Processing Systems, vol. 17, Cambridge, MA, USA: The MIT Press, 2004.
Detail Slides
Pre-processors and learners
What is the best effort estimation method? (cntd.)
1. Rank methods according to their win, loss, and win−loss values.
2. δr is the maximum rank change.
3. Sort methods according to loss and observe the δr values (a sketch follows below).
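A rough sketch of the δr computation, assuming we already have each method's rank under the three orderings (method names and ranks are illustrative):

```python
def max_rank_change(rank_by_win, rank_by_loss, rank_by_winloss):
    """delta_r per method: the maximum change of its rank across the
    win, loss, and win-loss orderings; stable methods (small delta_r)
    are the safer recommendations."""
    return {m: max(rank_by_win[m], rank_by_loss[m], rank_by_winloss[m]) -
               min(rank_by_win[m], rank_by_loss[m], rank_by_winloss[m])
            for m in rank_by_win}

print(max_rank_change({"CART": 1, "1NN": 2},
                      {"CART": 2, "1NN": 1},
                      {"CART": 1, "1NN": 2}))  # {'CART': 1, '1NN': 1}
```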
What is the best effort estimation method? (cntd.)
What about aggregate results reflecting on specific scenarios? (a reviewer's question)

Sort methods according to increasing MdMRE, then group MRE values that are statistically the same. Note how the superior solo-methods correspond to the best (lowest MRE) groups; highlighted are the cases where superior methods do not occur in the top group.
How can we improve ABE methods? (cntd.)
We used kernel weighting with 4 kernels and 5 bandwidth values, plus IRWM, to weigh the selected analogies (with 5 different k-values).
A total of 2090 settings (95 + 1900 + 95):
• 19 data sets × 5 k-values = 95
• 19 data sets × 5 k-values × 4 kernels × 5 bandwidths = 1900
• IRWM: 19 data sets × 5 k-values = 95
How can we improve ABE methods? (cntd.)
We used kernel weighting to weigh the selected analogies, comparing the performance of each k-value with and without weighting (win/tie/loss sketch below):
• o = tie for 3 or more k-values
• − = loss for 3 or more k-values
• + = win for 3 or more k-values
In none of the scenarios did we see a significant improvement.
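Win/tie/loss counts are typically produced by a rank test followed by a median comparison; a sketch, assuming a Mann-Whitney U test (the exact statistical test used in the thesis may differ):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def win_tie_loss(errors_a, errors_b, alpha=0.05):
    """Compare two methods' error samples: 'tie' if the rank test finds
    no significant difference, else the lower-median method wins."""
    _, p = mannwhitneyu(errors_a, errors_b, alternative="two-sided")
    if p >= alpha:
        return "tie"
    return "win" if np.median(errors_a) < np.median(errors_b) else "loss"
```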
How to handle lack of local data? (cntd.)
[Figures: TEAK on proprietary data; TEAK on public data]
Do I have to use size attributes? (cntd.)
Can standard methods tolerate the lack of size attributes?
[Figures: CART w/o size vs. CART w/ size; CART and 1NN]
How should I choose the right SM?

Only one work (Kitchenham2007 [14]) discusses the implications of the sampling method (SM) on bias and variance. The expectation is:
• LOO: high variance, low bias
• 3-way: low variance, high bias
• 10-way: in between

Does the expectation hold? What about run time and ease of replication? (A sketch of the two samplers follows below.)
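For concreteness, a minimal sketch of the two sampling methods under comparison (hand-rolled index generators, not any particular library's API):

```python
import random

def leave_one_out(n):
    """LOO: each project is the test set exactly once; deterministic,
    so results are trivially repeatable and shareable."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def k_way(n, k=3, seed=1):
    """k-way cross-validation: shuffle, then split into k folds; results
    depend on the seed, which is why LOO is easier to replicate."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for f in range(k):
        yield [i for i in idx if i not in set(folds[f])], folds[f]
```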