How to Find Relevant Data for Effort Estimation?
毛可, 2012-03-28
Author
• Ekrem Kocaguneli ( [email protected] )
• Tim Menzies
• Specialties: Data Mining, Effort Estimation
• 11’ TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation (TEAK)
• 11’ TSE: On the Value of Ensemble Effort Estimation
• 11’ ESEM: –
• 10’ ASE: When to Use Data from Other Projects for Effort Estimation (short)
• Pre: Relevancy Filtering for Defect Estimation
Motivation (Why)

The Locality(1) Assumption
• Data divides best on one attribute:
  – 1. project type (e.g. embedded);
  – 2. development centers of the developers;
  – 3. development language;
  – 4. application type (MIS, GNC, etc.);
  – 5. targeted hardware platform;
  – 6. in-house vs. outsourced projects.
• If Locality(1) holds:
  – it is hard to use data across these boundaries;
  – models stay confined, so each site must collect its own local data.
Motivation (Why)
The Locality(N) Assumption
• Data divides best on a combination of attributes
• If Locality(N) holds:
  – it is easier to use data across these boundaries
Work
• Cross-vs-within + “relevancy filtering” for effort estimation:
  – cross is as good as within;
  – companies can use others’ data for their estimates,
  – if they first apply “relevancy filtering”, which makes “cross” perform the same as “local”.
Technology (How)
• How to find relevant training data?
Technology (How)
• Variance Pruning
Technology (How)

• TEAK = ABE0 + instance selection
  – 11’ TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation
• ABE0 = ABE version 0, the most commonly used baseline:
  – numerics normalized to 0–1;
  – Euclidean distance;
  – equal weight to all attributes;
  – returns the median effort of the k nearest neighbors.
• Instance selection
  – a smart way to adjust the training data
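The ABE0 baseline above can be sketched in a few lines (min-max normalization, unweighted Euclidean distance, median effort of the k nearest neighbors). This is an illustrative sketch, not the authors' code; it assumes normalization is applied to training and test rows together beforehand.

```python
# Minimal ABE0 sketch: names and structure are illustrative assumptions.
from statistics import median

def normalize(rows):
    """Min-max normalize each attribute column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def abe0(train_rows, train_efforts, test_row, k=3):
    """Return the median effort of the k nearest training projects."""
    order = sorted(range(len(train_rows)),
                   key=lambda i: euclidean(train_rows[i], test_row))
    return median(train_efforts[i] for i in order[:k])
```

For example, `abe0([[1,1],[2,2],[3,3],[10,10]], [10,20,30,100], [2,2], k=3)` picks the three projects nearest to `[2,2]` and returns the median of their efforts, 20.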
Technology (How)

• TEAK is a variance-based instance selector
• It is built via GAC (greedy agglomerative clustering) trees (binary for an even number of instances)
• TEAK is a two-pass system:
  – the first pass selects low-variance, relevant projects (instance selection);
  – the second pass retrieves projects to estimate from (instance retrieval).
• Variance pruning: prune regions whose effort variance is too high, e.g.
  – > 10% * max(σ²)
  – > (100% + 10%) * max(σ²)?
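The 10%-of-max pruning rule can be sketched as follows, assuming the GAC subtrees have already been enumerated as flat lists of effort values (a simplification of the tree traversal; names are illustrative):

```python
# Hedged sketch of variance pruning: keep only low-variance regions,
# where "low" means at most alpha times the maximum observed variance.
from statistics import pvariance

def prune_high_variance(regions, alpha=0.10):
    """regions: list of effort-value lists (one per GAC subtree).
    Returns the regions whose variance <= alpha * max variance."""
    variances = [pvariance(r) for r in regions]
    cutoff = alpha * max(variances)
    return [r for r, v in zip(regions, variances) if v <= cutoff]
```

With `regions = [[10, 11, 10], [5, 50, 500], [20, 21, 22]]`, the middle region's huge effort variance sets the cutoff and only the two tight regions survive.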
Technology (How)

• TEAK finds local regions important to the estimation of particular cases
• TEAK finds those regions via Locality(N), not Locality(1)
Experiments - Datasets
• Public availability: for reproducibility
• Cross-within divisibility
• 6 out of 20+ datasets from PROMISE
Experiments - Datasets
For dataset X with subsets X1, X2, X3:
• Within
  – run TEAK on X1, X2, X3 separately, with leave-one-out cross-validation (LOOCV)
• Cross
  – X1 as test, X2 + X3 as training, and so on (N-fold CV)
• Repeat 20 times: since TEAK is greedy, results vary with the input data order
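The two validation designs can be sketched as split generators (subset contents and names are illustrative):

```python
# Within: leave-one-out inside a single subset.
# Cross: each subset in turn is the test set; the others form training.
def within_splits(subset):
    """Yield (train, test) pairs: one held-out row per pair."""
    for i in range(len(subset)):
        yield subset[:i] + subset[i + 1:], subset[i]

def cross_splits(subsets):
    """Yield (train, test) pairs: one held-out subset per pair."""
    for i, test in enumerate(subsets):
        train = [row for j, s in enumerate(subsets) if j != i for row in s]
        yield train, test
```

Running either generator 20 times over shuffled input orders would reproduce the repetition scheme described above.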
Experiments
• Win-Loss-Tie:
• Mann-Whitney test (95% confidence)
  – tests whether the distributions of two populations differ significantly
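The win-loss-tie bookkeeping can be sketched as follows: two error samples tie unless a Mann-Whitney U test at 95% confidence separates them. This stdlib-only sketch uses the normal approximation without a tie correction; the function names and the median-based win criterion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of win-loss-tie via a two-sided Mann-Whitney U test.
from statistics import NormalDist, median

def mann_whitney_u(xs, ys):
    """Return (U, two-sided p) using the normal approximation."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):            # assign average ranks to ties
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(xs), len(ys)
    u = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma if sigma else 0.0
    return u, 2 * (1 - NormalDist().cdf(abs(z)))

def win_loss_tie(errs_a, errs_b, alpha=0.05):
    """'tie' unless the test separates the samples; then compare medians."""
    _, p = mann_whitney_u(errs_a, errs_b)
    if p >= alpha:
        return "tie"
    return "win" if median(errs_a) < median(errs_b) else "loss"
```

For instance, clearly separated error samples yield a win/loss, while identical samples always tie.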
Experiment 1 – Performance Comparison

MAR: Mean Absolute Residual
MdMRE: Median Magnitude of Relative Error (MRE)
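Both error measures have standard definitions, sketched below with illustrative variable names:

```python
# MAR: mean of |actual - predicted|.
# MdMRE: median of |actual - predicted| / actual.
from statistics import mean, median

def mar(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def mdmre(actual, predicted):
    return median(abs(a - p) / a for a, p in zip(actual, predicted))
```

For example, actuals `[10, 20]` against predictions `[12, 16]` give MAR = 3.0 and MdMRE = 0.2.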
Experiment 1 – Performance Comparison

Analogy by 1-neighbor (PRED(25) > 0.3 on C81 subsets):

```matlab
for i = 1:numTestCases
    % scale the nearest case's effort by the size ratio
    estimates(i) = effortTrain(nearestCase(i)) * sizeTest(i) / sizeTrain(nearestCase(i));
    % then adjust by each cost-driver ratio
    for k = 1:numTestFactors
        estimates(i) = estimates(i) * cdTestReady(i,k) / cdTrainReady(nearestCase(i),k);
    end
end
```
Analogy by K-neighbor:
Experiment 2 – Retrieval Tendency
Experiment 2 – Retrieval Tendency

• Diagonal (WC, within-company) vs. off-diagonal (CC, cross-company) selection percentages, sorted
• Percentiles of the diagonals and off-diagonals
Conclusion
1. Cross performance is no worse than within performance.
2. The probability that the estimator retrieves a training instance from cross data is the same as from within data.

Implications:
• Companies can learn from each other’s data.
• Locality(N): maybe there are general effects in SE
  – effects that transcend the boundaries of any one company
  – local vs. global models…
Future Work

• Check external validity
  – after instance selection, does cross == within?
• Build more repositories
  – more useful than previously thought for effort estimation
• Synonym discovery
  – cross data can only be used if it shares the same ontology
  – auto-generate lexicons to map terms between data sets (e.g. “LOC” – “size”, “product complexity”)
Thanks! Q & A ?