TRANSCRIPT
Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA
The Problem: Why is there not more ML in Software Engineering?
Human-based: 62 to 86% [Jørgensen 2004]
Algorithmic / Machine Learning: 7 to 16%
Key Idea
More ML in SE through a better-defined experimental process.
Agenda
A better-defined process for better (quality) prediction
Experiments: Nearest Neighbor Sampling on PROMISE defect data sets
Extending the approach
Discussion
Conclusions
A Better Defined Process
Emphasis of ML approaches: measuring success
– PRED(X)
– Accuracy
– MARE
Prediction success depends upon the relationship between training and test data.
PROMISE Defect Data (from NASA)

Project  Code  Description
CM1      C     NASA spacecraft instrument
KC1      C++   Storage management for receiving/processing ground data
KC2      C++   Science data processing. No software overlap with KC1.
JM1      C     Real-time predictive ground system
PC1      C     Flight software for earth-orbiting satellite
21 Inputs
– Size (SLOC, Comments)
– Complexity (McCabe Cyclomatic Complexity)
– Vocabulary (Halstead Operators, Operands)
1 Output: Number of Defects
Data Preprocessing
Project  Original Size  Size w/ No Bad, No Dups  0 Defects  1+ Defects  % Defects
CM1      498            441                      393        48          10.9%
JM1      10,885         8911                     6904       2007        22.5%
KC1      2109           1211                     896        315         26.0%
KC2      522            374                      269        105         28.1%
PC1      1109           953                      883        70          7.3%
Reduced to 2 classes
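The preprocessing above collapses the single defect-count output into two classes. A minimal sketch of that step, assuming each record stores its metric values followed by the defect count in the last position (the helper name `binarize` is illustrative):

```python
def binarize(rows):
    """Collapse defect counts into two classes:
    0 defects -> False (not defect-prone), 1+ defects -> True."""
    return [(row[:-1], row[-1] >= 1) for row in rows]

# Toy rows: two metric values followed by a defect count.
modules = [[12.0, 3.0, 0], [44.0, 9.0, 3]]
labeled = binarize(modules)
```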
Experiment 1
[Figure: JM1 after preprocessing — 6904 vectors with 0 defects and 2007 vectors with 1+ defects (~22% defective).]
Training: 40% of the original data
The rest is split into a Nice test set and a Nasty test set
Experiment 1 Continued

The Nice and Nasty test sets are both drawn from the remaining vectors of the data set.
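The slides do not spell out how the remaining vectors are divided into the two test sets. A plausible sketch, assuming a test vector is "nice" when its nearest training neighbor shares its class and "nasty" otherwise (the function name `split_nice_nasty` and the Euclidean distance choice are assumptions):

```python
import math

def split_nice_nasty(train, rest):
    """train, rest: lists of (features, label) pairs.
    'Nice' = the nearest training neighbor has the same label;
    'nasty' = it does not (assumed reading of the slides)."""
    nice, nasty = [], []
    for feats, label in rest:
        _, nn_label = min(train, key=lambda t: math.dist(feats, t[0]))
        (nice if nn_label == label else nasty).append((feats, label))
    return nice, nasty

# Toy example: training vectors near (0,0) are defective, near (10,10) clean.
train = [([0.0, 0.0], True), ([10.0, 10.0], False)]
rest = [([1.0, 1.0], True), ([9.0, 9.0], True)]
nice, nasty = split_nice_nasty(train, rest)
```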
J48 and Naïve Bayes classifiers from WEKA
200 trials (100 Nice test data + 100 Nasty test data)
– CM1
– JM1
– KC1
– KC2
– PC1
Experiment 1 Continued

20 Nice trials + 20 Nasty trials per data set
Results: Accuracy

         Nice Test Set        Nasty Test Set
         J48    Naïve Bayes   J48    Naïve Bayes
CM1      97.4%  88.3%         6.2%   37.4%
JM1      94.6%  94.8%         16.3%  17.7%
KC1      90.9%  87.5%         22.8%  30.9%
KC2      88.3%  94.1%         42.3%  36.0%
PC1      97.8%  91.9%         19.8%  35.8%
Overall  94.4%  93.6%         18.7%  21.2%
Results: Average Confusion Matrix

Average Nice Results (rows = actual class, columns = predicted class):

             J48            Naïve Bayes
             1+      0      1+      0
1+ Defects    2      3       3      2
0 Defects    58   1021      68   1011

Average Nasty Results:

             J48            Naïve Bayes
             1+      0      1+      0
1+ Defects   50    249      60    241
0 Defects     2      7       3      5

Note the distribution: the Nice test sets are dominated by 0-defect instances, the Nasty test sets by 1+ defect instances.
Experiment 2: 60% Train, KNN=3

                                       Accuracy
Neighbor     # of     # of                  Naïve
Description  TRUEs    FALSEs        J48     Bayes
PPP          None     None          NA      NA
PPN          0        354           88      90
PNP          0        5             40      20
NPP          None     None          NA      NA
PNN          3        0             100     0
NPN          13       0             31      100
NNP          110      0             25      28
NNN          None     None          NA      NA
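Each row of the table above groups test vectors by the pattern formed by their three nearest training neighbors. A sketch of computing such a pattern, under the assumption that "P" marks a neighbor sharing the test vector's class and "N" one that does not (the function name is illustrative):

```python
import math

def neighbor_pattern(train, feats, label, k=3):
    """Return a string such as 'PPN': for each of the k nearest
    training neighbors (closest first), 'P' if it shares the test
    vector's label, else 'N'. Assumed reading of the slides."""
    nearest = sorted(train, key=lambda t: math.dist(feats, t[0]))[:k]
    return "".join("P" if nn_label == label else "N" for _, nn_label in nearest)

# Toy example with one-dimensional features.
train = [([0.0], True), ([1.0], True), ([2.0], False), ([10.0], False)]
pattern = neighbor_pattern(train, [0.1], True)  # nearest labels: True, True, False
```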
Assessing Experiment Difficulty
Exp_Difficulty = 1 - Matches / Total_Test_Instances
Match = a test vector’s nearest neighbor in the training set belongs to the same class.
Experimental Difficulty = 1: hard experiment
Experimental Difficulty = 0: easy experiment
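The definition above can be computed directly; a minimal sketch, assuming Euclidean distance over the metric vectors (the function name is illustrative):

```python
import math

def experiment_difficulty(train, test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances, where a
    match means the test vector's nearest neighbor in the training
    set carries the same class label."""
    matches = sum(
        min(train, key=lambda t: math.dist(feats, t[0]))[1] == label
        for feats, label in test
    )
    return 1 - matches / len(test)

train = [([0.0], "defective"), ([10.0], "clean")]
test = [([1.0], "defective"), ([9.0], "defective")]
# One of the two test vectors matches its nearest neighbor's class.
difficulty = experiment_difficulty(train, test)
```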
Assessing Overall Data Difficulty
Overall Data Difficulty = 1 - Matches / Total_Data_Instances
Match = a data vector’s nearest neighbor among the other vectors in the data set belongs to the same class.
Overall Data Difficulty = 1: difficult data
Overall Data Difficulty = 0: easy data
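Overall data difficulty is the same computation run leave-one-out over the whole data set; a minimal sketch under the same Euclidean-distance assumption:

```python
import math

def overall_data_difficulty(data):
    """Overall Data Difficulty = 1 - Matches / Total_Data_Instances,
    where a match means a vector's nearest neighbor among the *other*
    vectors carries the same class label."""
    matches = 0
    for i, (feats, label) in enumerate(data):
        others = data[:i] + data[i + 1:]
        nn_label = min(others, key=lambda t: math.dist(feats, t[0]))[1]
        matches += nn_label == label
    return 1 - matches / len(data)

# Two tight same-class clusters: every nearest neighbor matches.
easy = [([0.0], "a"), ([1.0], "a"), ([10.0], "b"), ([11.0], "b")]
# Alternating classes: no nearest neighbor matches.
hard = [([0.0], "a"), ([1.0], "b")]
```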
Discussion: Anticipated Benefits

– Method for characterizing the difficulty of an experiment
– More realistic models
– Easy to implement
– Can be integrated into N-way cross validation
– Can apply to various types of SE data sets: defect prediction, effort estimation
– Can be extended beyond SE to other domains
Discussion: Potential Problems

– More work needs to be done
– Agreement on how to measure Experimental Difficulty
– Extra overhead
– Implicitly or explicitly data-starved domains
– How to get more ML in SE?
Conclusions
Assess experiments/data for their difficulty
Benefits:
– More credibility to the modeling process
– More reliable predictors
– More realistic models
Acknowledgements

Thanks to the reviewers for their comments!

References

1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.