TRANSCRIPT
Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA
The Problem: Why is there not more ML in Software Engineering?
Human-based: 62 to 86% [Jørgensen 2004]
Algorithmic / Machine Learning: 7 to 16%
Key Idea
More ML in SE through a better-defined experimental process.
Agenda
A better-defined process for better (quality) prediction
Experiments: Nearest Neighbor Sampling on PROMISE defect data sets
Extending the approach
Discussion
Conclusions
A Better Defined Process
Emphasis of ML approaches: measuring success
– PRED(X)
– Accuracy
– MARE
Prediction success depends upon the relationship between training and test data.
PROMISE Defect Data (from NASA)

Project  Code  Description
CM1      C     NASA spacecraft instrument
KC1      C++   Storage management for receiving/processing ground data
KC2      C++   Science data processing. No software overlap with KC1.
JM1      C     Real-time predictive ground system
PC1      C     Flight software for earth-orbiting satellite
21 Inputs
– Size (SLOC, Comments)
– Complexity (McCabe Cyclomatic Complexity)
– Vocabulary (Halstead Operators, Operands)
1 Output: Number of Defects
Data Preprocessing
Project  Original Size  Size w/ No Bad, No Dups  0 Defects  1+ Defects  % Defects
CM1      498            441                      393        48          10.9%
JM1      10,885         8911                     6904       2007        22.5%
KC1      2109           1211                     896        315         26.0%
KC2      522            374                      269        105         28.1%
PC1      1109           953                      883        70          7.3%
Reduced to 2 classes
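The preprocessing above collapses the single defect-count output into two classes. A minimal sketch of that step, assuming each record stores its metric values followed by the defect count in the last position (the helper name `binarize` is illustrative):

```python
def binarize(rows):
    """Collapse defect counts into two classes:
    0 defects -> False (not defect-prone), 1+ defects -> True."""
    return [(row[:-1], row[-1] >= 1) for row in rows]

# Toy rows: two metric values followed by a defect count.
modules = [[12.0, 3.0, 0], [44.0, 9.0, 3]]
labeled = binarize(modules)
```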
Experiment 1
[Figure: JM1 after preprocessing — 6904 vectors with 0 defects and 2007 vectors with 1+ defects (~22% defective).]
Training: 40% of the original data
The rest is split into a Nice test set and a Nasty test set
Experiment 1 Continued

The Nice and Nasty test sets are both drawn from the remaining vectors of the data set.
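The slides do not spell out how the remaining vectors are divided into the two test sets. A plausible sketch, assuming a test vector is "nice" when its nearest training neighbor shares its class and "nasty" otherwise (the function name `split_nice_nasty` and the Euclidean distance choice are assumptions):

```python
import math

def split_nice_nasty(train, rest):
    """train, rest: lists of (features, label) pairs.
    'Nice' = the nearest training neighbor has the same label;
    'nasty' = it does not (assumed reading of the slides)."""
    nice, nasty = [], []
    for feats, label in rest:
        _, nn_label = min(train, key=lambda t: math.dist(feats, t[0]))
        (nice if nn_label == label else nasty).append((feats, label))
    return nice, nasty

# Toy example: training vectors near (0,0) are defective, near (10,10) clean.
train = [([0.0, 0.0], True), ([10.0, 10.0], False)]
rest = [([1.0, 1.0], True), ([9.0, 9.0], True)]
nice, nasty = split_nice_nasty(train, rest)
```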
J48 and Naïve Bayes classifiers from WEKA
200 trials (100 Nice test data + 100 Nasty test data)
– CM1
– JM1
– KC1
– KC2
– PC1
Experiment 1 Continued

20 Nice trials + 20 Nasty trials per data set
Results: Accuracy

         Nice Test Set        Nasty Test Set
         J48    Naïve Bayes   J48    Naïve Bayes
CM1      97.4%  88.3%         6.2%   37.4%
JM1      94.6%  94.8%         16.3%  17.7%
KC1      90.9%  87.5%         22.8%  30.9%
KC2      88.3%  94.1%         42.3%  36.0%
PC1      97.8%  91.9%         19.8%  35.8%
Overall  94.4%  93.6%         18.7%  21.2%
Results: Average Confusion Matrix

Average Nice Results (rows = actual class, columns = predicted class):

             J48            Naïve Bayes
             1+      0      1+      0
1+ Defects    2      3       3      2
0 Defects    58   1021      68   1011

Average Nasty Results:

             J48            Naïve Bayes
             1+      0      1+      0
1+ Defects   50    249      60    241
0 Defects     2      7       3      5

Note the distribution: the Nice test sets are dominated by 0-defect instances, the Nasty test sets by 1+ defect instances.
Experiment 2: 60% Train, KNN=3

                                       Accuracy
Neighbor     # of     # of                  Naïve
Description  TRUEs    FALSEs        J48     Bayes
PPP          None     None          NA      NA
PPN          0        354           88      90
PNP          0        5             40      20
NPP          None     None          NA      NA
PNN          3        0             100     0
NPN          13       0             31      100
NNP          110      0             25      28
NNN          None     None          NA      NA
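Each row of the table above groups test vectors by the pattern formed by their three nearest training neighbors. A sketch of computing such a pattern, under the assumption that "P" marks a neighbor sharing the test vector's class and "N" one that does not (the function name is illustrative):

```python
import math

def neighbor_pattern(train, feats, label, k=3):
    """Return a string such as 'PPN': for each of the k nearest
    training neighbors (closest first), 'P' if it shares the test
    vector's label, else 'N'. Assumed reading of the slides."""
    nearest = sorted(train, key=lambda t: math.dist(feats, t[0]))[:k]
    return "".join("P" if nn_label == label else "N" for _, nn_label in nearest)

# Toy example with one-dimensional features.
train = [([0.0], True), ([1.0], True), ([2.0], False), ([10.0], False)]
pattern = neighbor_pattern(train, [0.1], True)  # nearest labels: True, True, False
```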
Assessing Experiment Difficulty
Exp_Difficulty = 1 - Matches / Total_Test_Instances
Match = a test vector’s nearest neighbor in the training set belongs to the same class.
Experimental Difficulty = 1: hard experiment
Experimental Difficulty = 0: easy experiment
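The definition above can be computed directly; a minimal sketch, assuming Euclidean distance over the metric vectors (the function name is illustrative):

```python
import math

def experiment_difficulty(train, test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances, where a
    match means the test vector's nearest neighbor in the training
    set carries the same class label."""
    matches = sum(
        min(train, key=lambda t: math.dist(feats, t[0]))[1] == label
        for feats, label in test
    )
    return 1 - matches / len(test)

train = [([0.0], "defective"), ([10.0], "clean")]
test = [([1.0], "defective"), ([9.0], "defective")]
# One of the two test vectors matches its nearest neighbor's class.
difficulty = experiment_difficulty(train, test)
```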
Assessing Overall Data Difficulty
Overall Data Difficulty = 1 - Matches / Total_Data_Instances
Match = a data vector’s nearest neighbor among the other vectors in the data set belongs to the same class.
Overall Data Difficulty = 1: difficult data
Overall Data Difficulty = 0: easy data
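Overall data difficulty is the same computation run leave-one-out over the whole data set; a minimal sketch under the same Euclidean-distance assumption:

```python
import math

def overall_data_difficulty(data):
    """Overall Data Difficulty = 1 - Matches / Total_Data_Instances,
    where a match means a vector's nearest neighbor among the *other*
    vectors carries the same class label."""
    matches = 0
    for i, (feats, label) in enumerate(data):
        others = data[:i] + data[i + 1:]
        nn_label = min(others, key=lambda t: math.dist(feats, t[0]))[1]
        matches += nn_label == label
    return 1 - matches / len(data)

# Two tight same-class clusters: every nearest neighbor matches.
easy = [([0.0], "a"), ([1.0], "a"), ([10.0], "b"), ([11.0], "b")]
# Alternating classes: no nearest neighbor matches.
hard = [([0.0], "a"), ([1.0], "b")]
```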
Discussion: Anticipated Benefits

– Method for characterizing the difficulty of an experiment
– More realistic models
– Easy to implement
– Can be integrated into N-way cross validation
– Can apply to various types of SE data sets: defect prediction, effort estimation
– Can be extended beyond SE to other domains
Discussion: Potential Problems

– More work needs to be done
– Agreement on how to measure Experimental Difficulty
– Extra overhead
– Implicitly or explicitly data-starved domains
– How to get more ML in SE?
Conclusions
Assess experiments/data for their difficulty
Benefits:
– More credibility to the modeling process
– More reliable predictors
– More realistic models
Acknowledgements

Thanks to the reviewers for their comments!

References

1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.