1
The deviance problem in effort estimation
and defect prediction
and software engineering
2
Variance: confusing prior results?
Software effort estimation
– Jorgensen: most effort estimation is “expert-based”, so “model-based” estimation is a waste of time
– Model- vs expert-based studies: 5 better, 5 even, 5 worse
Software defect prediction
– Shepperd & Ince: static code measures are uninformative for software quality
– “Dumb” LOC vs McCabe studies: 6 better, 6 even, 6 worse
I smell a rat
Selecting Best Practices for Effort Estimation – Menzies, Chen, Hihn, Lum. TSE 200X
Data Mining Static Code Attributes to Learn Defect Predictors – Menzies, Greenwald, Frank. TSE 200X
3
What you never want to hear…
“This isn't right. This isn't even wrong.”– Wolfgang Pauli
4
Standard disclaimer
An excessive focus on empiricism…
– …stunts the development of novel, pre-experimental speculations
But currently:
– there is no danger of an excess of empiricism in SE
– SE = a field flooded by pre-experimental speculations
5
Sample experiments
– Public domain data
– Don’t test using your training data: N-way cross-val
– M * randomize order
– Straw man
– Feature subset selection
– Thou shalt script: you will run it again
– Study mean and variance over M * N (a harness is sketched below)
[Figures: sample results for defect prediction and effort estimation]
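The experimental recipe above can be sketched as a small harness. This is a minimal illustration under assumptions: `rows`, `learner`, and `score` are generic placeholders (e.g. naïve Bayes scored by PD/PF), not the deck's actual scripts.

```python
# A minimal sketch of an M * N-way cross-val harness: randomize the row
# order M times, split into N bins, train on N-1 bins, test on the
# held-out bin, then study the mean and variance of the M * N scores.
import random
import statistics

def m_by_n_crossval(rows, learner, score, m=10, n=10):
    results = []
    for _ in range(m):                       # M randomized orderings
        random.shuffle(rows)
        bins = [rows[i::n] for i in range(n)]
        for i in range(n):                   # N-way cross-validation
            test = bins[i]
            train = [r for j, b in enumerate(bins) if j != i for r in b]
            results.append(score(learner(train), test))
    return statistics.mean(results), statistics.stdev(results)
```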
6
Data summation: K.I.S.S.
– Combine PD/PF
– Compute & sort combined performance deltas, method A vs all others
– Summarize as quartiles (a sketch follows below)
400,000 runs
– nb = naïve Bayes
– J48 = entropy-based decision tree learner
– oneR = straw man
– logNums = log the numerics
Massive FSS
Singletons, including LOC, not enough
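A minimal sketch of that summarization follows. The combination rule here (distance to the ideal point pd=1, pf=0) is one common choice and an assumption; the deck does not say exactly how PD and PF were merged.

```python
# Combine PD and PF into one score, compute deltas of method A against
# every rival, sort them, and report quartiles.
import math

def combine(pd, pf):
    """Higher is better: 1 - normalized distance to the ideal point (pd=1, pf=0)."""
    return 1 - math.sqrt((1 - pd) ** 2 + pf ** 2) / math.sqrt(2)

def quartile_deltas(scores_a, scores_others):
    """Quartiles of the sorted deltas: method A vs all other methods."""
    deltas = sorted(a - b for a in scores_a for b in scores_others)
    pick = lambda p: deltas[int(p * (len(deltas) - 1))]
    return pick(0.25), pick(0.50), pick(0.75)
```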
7
Variance: confusing prior results?
Software effort estimation
– Jorgensen: most effort estimation is “expert-based”, so “model-based” estimation is a waste of time
– Model- vs expert-based studies: 5 better, 5 even, 5 worse
Software defect prediction
– Shepperd & Ince: static code measures are uninformative for software quality
– “Dumb” LOC vs McCabe studies: 6 better, 6 even, 6 worse
I smell a rat
8
Sources of variance
Software effort estimation (target class: continuous):
  30 * { shuffle data; test = data[1..10];
         train = data - test; <a,b> = LC(train);
         MRE = Estimate(a, b, test) }
  (a runnable toy version is sketched below)
Software defect prediction (target class: discrete):
  10 * { randomly select 90% of data;
         score each attribute via “INFOGAIN” }
Numerous candidates for “most informative” attributes
Large deviations confuse comparisons of competing methods
Can be reduced by FSS
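To make the effort-estimation loop concrete, here is a toy version under assumptions: synthetic (size, effort) pairs stand in for real project data, and an ordinary least-squares fit stands in for LC.

```python
# Repeat a shuffle / hold-out split 30 times, fit a simple linear model,
# and record MRE; the spread of the 30 scores is the variance problem.
import random
import statistics

def lc(train):
    """Least-squares fit of effort = a + b * size (a stand-in for LC)."""
    xs, ys = [r[0] for r in train], [r[1] for r in train]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mre(a, b, test):
    """Mean magnitude of relative error on the held-out rows."""
    return statistics.mean(abs(y - (a + b * x)) / y for x, y in test)

def variance_study(data, repeats=30, holdout=10):
    scores = []
    for _ in range(repeats):
        random.shuffle(data)
        test, train = data[:holdout], data[holdout:]
        a, b = lc(train)
        scores.append(mre(a, b, test))
    return statistics.mean(scores), statistics.stdev(scores)

# Synthetic (size, effort) projects; note how large the MRE spread is.
projects = [(x, 3.0 * x + random.gauss(0, 10)) for x in range(20, 120)]
print(variance_study(projects))
```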
9
What is Feature Subset Selection?
PCA: worse (empirically)
INFOGAIN: fastest (scoring sketched below)
– useful for defect detection, e.g. 10,000 modules in defect logs
WRAPPER: slowest
– performs best
– practical for effort estimation, e.g. dozens of past projects in company databases
[Figure annotation: turned blue to green]
a = 10.1 + 0.3x + 0.9y - 1.2z
1) “wiggle” in x, y, z causes “wiggle” in “a”
2) removing x, y, z reduces “wiggle” in “a”
3) but can damage mean performance
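Since INFOGAIN does the attribute scoring on the defect-prediction side, a minimal sketch of information gain may help; the tiny table and column names below are made up for illustration, not real defect-log data.

```python
# INFOGAIN-style scoring: gain = entropy(class) minus the expected
# entropy of the class after splitting on the attribute. Rerunning this
# on random 90% samples (previous slide) shows how unstable the ranking
# of "most informative" attributes can be.
import math
from collections import Counter

def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    rest = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        rest += (len(subset) / len(rows)) * entropy(subset)
    return base - rest

# Illustrative rows only:
rows = [
    {"loc": "hi", "mccabe": "hi", "defect": "yes"},
    {"loc": "hi", "mccabe": "lo", "defect": "yes"},
    {"loc": "lo", "mccabe": "hi", "defect": "no"},
    {"loc": "lo", "mccabe": "lo", "defect": "no"},
]
for attr in ("loc", "mccabe"):
    print(attr, round(info_gain(rows, attr, "defect"), 3))
```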
10
Warning: no single “best” theory
[Figures: defect prediction and effort estimation results]
11
Committee-based learning
Ensemble-based learning
– bagging, boosting, stacking, etc.
Conclusions by voting across a committee (see the sketch below)
– 10 identical experts are a waste of money
– 10 slightly different experts can offer different insights into a problem
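A minimal sketch of committee-based classification by majority vote, assuming a bagging-style committee; the `learner` placeholder stands in for any base learner (J48, naïve Bayes, etc.) and is not a specific library API.

```python
# Train several slightly different "experts", each on a bootstrap sample,
# then classify by majority vote across the committee.
import random
from collections import Counter

def train_committee(rows, learner, n_experts=10):
    """learner(sample) must return a function mapping example -> label."""
    committee = []
    for _ in range(n_experts):
        sample = [random.choice(rows) for _ in rows]   # bagging-style resample
        committee.append(learner(sample))
    return committee

def majority_vote(committee, example):
    votes = Counter(expert(example) for expert in committee)
    return votes.most_common(1)[0][0]
```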
12
Using committees
Classification ensembles:
– “Majority vote does as well as anything else” – Tom Dietterich
Numeric prediction ensembles
– Can use other measures: “heuristic rejection rules”
– Theorists: “gasp, horror”
– Seasoned cost-estimation practitioners: “of course”
Standard statistics failing: t-tests report that none of these are “worse”
13
• For any pair of treatments,
• if one is “worse”,
• vote it off;
• repeat till none are “worse” (sketched below)
survivors
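A minimal sketch of that vote-off loop; the `worse_than` test is an assumption (e.g. a Mann-Whitney U test over two sets of performance scores), since the slide does not name a specific statistic.

```python
# Compare every treatment against every other; any treatment found
# "worse" than some rival is voted off; repeat until no treatment is
# worse than any survivor, then report the survivors.
def vote_off(results, worse_than):
    """results: dict of treatment name -> list of performance scores."""
    survivors = dict(results)
    changed = True
    while changed:
        changed = False
        for a in list(survivors):
            if any(worse_than(survivors[a], survivors[b])
                   for b in survivors if b != a):
                del survivors[a]        # a loses to someone: vote it off
                changed = True
                break                   # re-run comparisons on the rest
    return list(survivors)              # treatments never found "worse"
```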
14
So…
Those M*N-way cross-vals
– time to use them
New research area
– automatic model selection methods are now required
– data fusion in biometrics
The technical problem is not the challenge
– issues with explanation and expectation
15
Why so many unverified ideas in software engineering?
Humans use language to mark territory
– repeated effect: linguistic drift
– villages separated by just a few miles evolve different dialects
Language choice = who you talk to, what tools you buy
– US vs THEM: Sun built Java as a weapon against Microsoft
– result: a never-ending stream of new language systems
Vendors want to sell new tools, not assess them.
16
But the tide is turning
Text mining of NFRs, traceability:
– IEEE RE’06 (Minneapolis, 2006): The Detection and Classification of Non-Functional Requirements; Cleland-Huang, Settimi, Zou, Solc
– IEEE TSE, Jan 2006, pp. 4-19: Advancing Candidate Link Generation for Requirements Tracing; Hayes, Dekhtyar, Sundaram
Software effort estimation
– IEEE TSE 200?: Selecting Best Practices for Effort Estimation; Menzies, Chen, Hihn, Lum
Software defect prediction
– IEEE TSE 200?: Data Mining Static Code Attributes to Learn Defect Predictors; Menzies, Greenwald, Frank
Yes Timmy, senior forums endorse empirical rigor
Best paper