PROMISE 2011: "An Iterative Semi-supervised Approach to Software Fault Prediction"



An Iterative Semi-supervised Approach to Software Fault Prediction

Huihua Lu, Bojan Cukic, Mark Culp

Lane Department of Computer Science and Electrical Engineering; Department of Statistics

West Virginia University, Morgantown, WV

September 2011

Presentation Outline

• Introduction

• Semi-supervised Learning

• Methodology

• Experiments

• Results and Discussion

• Conclusion and Future Work

Introduction

• Software Quality Assurance
  – Identify where faults hide, subject to V&V
  – Without automation, costly and time-consuming

• Software Fault Prediction
  – Software metrics: code metrics, complexity metrics, etc.
  – Software fault prediction models identify faulty modules

• Supervised learning algorithms are the norm

• Practical Problem
  – For one-of-a-kind systems or new systems, ground truth data may be sparse

• The Goal of Our Study
  – Evaluate the performance of semi-supervised learning approaches

Goal of the Study

Can we match the performance of supervised learning fault prediction models from a smaller set of labeled modules?

Consequence: if very few modules are labeled (a very real scenario), include unlabeled modules for training. Most published studies use 50% or more of the software modules for model training, which is not practical for new projects.


Semi-Supervised Learning-1

• Supervised Learning
  – Train a model from labeled (training) data only
  – Labeled data can be expensive to create
    • Modules receive labels through detailed V&V

• Semi-Supervised Learning
  – Train a model from both the labeled data and the unlabeled data
    • Include new modules as they become available in a version control system
  – Unlabeled data are the modules with unknown fault content

Semi-Supervised Learning-2

• Traditional semi-supervised learning algorithms
  – Co-training
    • Assumption: features can be separated into two sets
  – Generative learning (EM algorithm)
    • Assumption: requires knowledge of the distribution of the data
  – Self-training
    • Assumption: none

Related Work

• In software fault prediction
  – Khoshgoftaar: inductive semi-supervised learning
    • Data from one project are separated into labeled and unlabeled sets; performance is evaluated on a different project
    • Achieved better performance than a tree-based supervised algorithm (C4.5)
  – Khoshgoftaar: clustering-based semi-supervised learning
    • Extends unsupervised learning into semi-supervised learning
    • Better partitioning than unsupervised learning
    • Assumes that human domain experts participate in classifying modules into fault-prone and not fault-prone
  – Many supervised learning modeling approaches


Methodology-1

• Fitting the Fits (FTF) semi-supervised algorithm
  – A variant of self-training [3]
  – Idea: reduce the semi-supervised problem to some form of a supervised problem
  – The Algorithm:

(φ is the base learner, fit on a data set D and then applied to features X; L and U are the labeled and unlabeled sets)

Initialize:  Ŷ⁰_U = φ(D⁰)(X_U),  with D⁰ = (X_L, Y_L)          [initialize the labels for U]

Repeat the following:
  (1) Ŷᵏ_L = Y_L                                               [reset the labels for L]
  (2) Ŷᵏ⁺¹ = φ(Dᵏ)(X_{L∪U}),  with Dᵏ = (X_{L∪U}, Ŷᵏ)          [fit the labels for U + L]
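The loop above can be sketched in a few lines of Python. This is a minimal illustration only: a toy nearest-class-mean classifier stands in for the Random Forests base learner used in the paper, and the function names and one-dimensional data are invented for the example.

```python
# Sketch of the FTF self-training loop: initialize labels for U from a model
# fit on L, then repeatedly reset L to its true labels and refit on L + U.

def centroid_fit_predict(train_x, train_y, test_x):
    """Fit one centroid per class, then label each test point by the nearest
    centroid (a toy stand-in for the paper's Random Forests learner)."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = sum(members) / len(members)
    return [min(centroids, key=lambda lab: abs(centroids[lab] - x))
            for x in test_x]

def ftf(x_labeled, y_labeled, x_unlabeled, iterations=50):
    # Initialize: fit on the labeled set L and guess labels for U.
    y_hat_u = centroid_fit_predict(x_labeled, y_labeled, x_unlabeled)
    for _ in range(iterations):
        # (1) Reset the labels for L to their known true values.
        # (2) Refit on L + U and update the guessed labels of U.
        x_all = list(x_labeled) + list(x_unlabeled)
        y_all = list(y_labeled) + list(y_hat_u)
        y_hat_u = centroid_fit_predict(x_all, y_all, x_unlabeled)
    return y_hat_u

# Tiny illustration: two clusters of modules on a single metric axis.
print(ftf([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1], [0.5, 8.5, 5.2]))
# → [0, 1, 1]
```

Note that the unlabeled points themselves pull the class centroids around between iterations, which is exactly how the unlabeled data influence the final model.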

Methodology-2

• The Base Learner φ:
  – Initializes the labels for unlabeled data, i.e., Ŷ⁰_U = φ(D_L)(X_U)
  – "Improves" the labels of unlabeled data in iterations, i.e., Ŷᵏ⁺¹ = φ(Dᵏ)(X)
  – May lead to global convergence

• Random Forests
  – A good choice in the domain based on previous work
  – Robust to noise

Software Data Sets

• These are large NASA MDP projects (> 1,000 modules)

Performance Measures

• Labels in the binary classification problem:
  – 1: fault-prone module
  – 0: not fault-prone module
  – For each module, estimate the probability Pr(Y = 1)

• A module is classified as fault-prone when Pr(Y = 1) ≥ c, with thresholds c ∈ {0.1, 0.5, 0.75}

• Area under the ROC curve (AUC) and Probability of Detection (PD) used for performance comparison
  – PD is measured over the unlabeled set U: the fraction of the truly fault-prone modules in U that the model detects
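As a concrete illustration of PD at the three thresholds, assuming the standard definition above (detected fault-prone modules over all truly fault-prone modules in U); the probability estimates and labels below are invented for the example:

```python
# Illustrative PD computation on the unlabeled set U: a module is flagged
# fault-prone when its estimated Pr(Y = 1) meets the threshold c, and PD is
# the fraction of truly fault-prone modules that get flagged.

def detection_probability(probs, truth, c):
    fault_prone = [p for p, y in zip(probs, truth) if y == 1]
    detected = [p for p in fault_prone if p >= c]
    return len(detected) / len(fault_prone)

probs = [0.9, 0.6, 0.2, 0.8, 0.4]  # hypothetical Pr(Y = 1) per module in U
truth = [1, 1, 0, 1, 1]            # hypothetical ground-truth labels

for c in (0.1, 0.5, 0.75):
    print(c, detection_probability(probs, truth, c))
# → 0.1 1.0 / 0.5 0.75 / 0.75 0.5
```

Lowering c raises PD (more modules flagged) at the cost of more false alarms, which is why several thresholds are reported.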


Experiments

• FTF with Random Forests vs. Random Forest

• Does FTF outperform supervised learning with the same size of labeled modules?
  – Size of labeled data: 2%, 5%, 10%, 25%, 50%
  – Stop the FTF algorithm after 50 iterations

• Is the behavior and performance of FTF consistent over different software projects?
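One run of the experimental protocol can be sketched as follows: for each labeled fraction, randomly split a project's modules into a labeled set L and an unlabeled set U, then (not shown here) train FTF on L + U, train the supervised baseline on L alone, and compare AUC/PD on U. The helper name and module count are illustrative, not from the paper.

```python
# Sketch of the labeled/unlabeled splitting step of the experiments.
import random

def labeled_split(n_modules, fraction, seed=0):
    """Pick which module indices are treated as labeled for one run."""
    rng = random.Random(seed)
    idx = list(range(n_modules))
    rng.shuffle(idx)
    n_labeled = max(1, round(n_modules * fraction))
    return sorted(idx[:n_labeled]), sorted(idx[n_labeled:])

for frac in (0.02, 0.05, 0.10, 0.25, 0.50):
    labeled, unlabeled = labeled_split(1000, frac)
    # ...train FTF on (labeled, unlabeled) and the supervised baseline on
    # labeled only, then evaluate both on the unlabeled set...
    print(frac, len(labeled), len(unlabeled))
```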


Results: PC3

Results at threshold 0.5

At threshold 0.1

Overall Comparison


Summary

• Does FTF with Random Forests as the base learner outperform supervised learning with Random Forests?
  – Yes, in most cases
  – The improvement is modest and not statistically significant

• How small can the labeled data set be for FTF to start outperforming supervised learning?
  – When 5% or more of the modules are labeled, the semi-supervised approach seems a promising direction
  – Performance improves in comparison to supervised learning trained on the same number of labeled modules

• Is the behavior and performance of FTF consistent over different data sets?
  – Yes

Future Work

• Try out different base learners with FTF
  – The base learner in FTF has dramatic effects; RF was used because it performs well in software fault modeling
  – RF does not converge, but other base learners might
  – Analyze robustness to noise

• Expand to projects of different sizes or from different domains

• Introduce more sophisticated semi-supervised algorithms

Questions

• Please direct questions to

Bojan Cukic: bojan.cukic@mail.wvu.edu

Huihua Lu: hlu3@mix.wvu.edu
