Software quality analysis by combining multiple projects and learners
Taghi M. Khoshgoftaar · Pierre Rebours · Naeem Seliya
Published online: 8 July 2008
© Springer Science+Business Media, LLC 2008
Abstract When building software quality models, the approach often consists of
training data mining learners on a single fit dataset. Typically, this fit dataset contains
software metrics collected during a past release of the software project that we want to
predict the quality of. In order to improve the predictive accuracy of such quality
models, it is common practice to combine the predictive results of multiple learners to
take advantage of their respective biases. Although multi-learner classifiers have been
proven to be successful in some cases, the improvement is not always significant because
the information in the fit dataset can sometimes be insufficient. We present an innovative
method to build software quality models using majority voting to combine the predic-
tions of multiple learners induced on multiple training datasets. To our knowledge, no
previous study in software quality has attempted to take advantage of multiple software
project data repositories which are generally spread across the organization. In a large
scale empirical study involving seven real-world datasets and seventeen learners, we
show that, on average, combining the predictions of one learner trained on multiple
datasets significantly improves the predictive performance compared to one learner
induced on a single fit dataset. We also demonstrate empirically that combining multiple
learners trained on a single training dataset does not significantly improve the average
predictive accuracy compared to the use of a single learner induced on a single fit
dataset.
T. M. Khoshgoftaar (corresponding author) · P. Rebours
Computer Science and Engineering, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA
e-mail: [email protected]

P. Rebours
e-mail: [email protected]

N. Seliya
Computer and Information Science, University of Michigan-Dearborn, 4901 Evergreen Road, Dearborn, MI 48128, USA
e-mail: [email protected]
Software Qual J (2009) 17:25–49
DOI 10.1007/s11219-008-9058-3
Keywords Multiple software metrics repositories · Software quality classification
model · Multiple learners · Cost of misclassification · Data mining · Majority voting
1 Introduction
In the context of software quality, program modules are typically labeled as either fault-
prone (fp) or not fault-prone (nfp) (Fenton and Pfleeger 1997; Khoshgoftaar et al. 2000).
The need for reliable and high quality products often leads software managers to use
Software Quality Classification Models (SQCMs), which allow them to direct improve-
ment efforts to software modules with higher risk. Such models are designed to identify,
prior to deployment, software modules that are likely to be fault-prone during operations.
Hence, a cost-effective utilization of resources can be implemented for software testing,
inspection, and quality enhancement of these modules. SQCMs are often based on
inductive learning algorithms which generalize the concepts learned from a set of training
instances (i.e., fit dataset) and apply these concepts to the currently under-development
instances (i.e., test dataset). Typically, the fit dataset is made up of modules (i.e., instances)
related to the past release or to a very similar project.
During the early usage of SQCMs, practitioners used to rely on a single learning
algorithm. As for almost any data mining application area, there are a number of classi-
fication algorithms developed from different theories and methodologies, such as case-
based reasoning (Emam et al. 2001), logistic regression (Khoshgoftaar and Allen 1999),
support vector machines (Xing et al. 2005), naive bayes (Menzies et al. 2007), and clas-
sification trees (Khoshgoftaar et al. 2000). Typically, many of these algorithms are trained
and the practitioner chooses the one that performs the best in cross-validation or on a
separate validation set. However, each paradigm dictates a certain model that comes with a
set of assumptions which may lead to a high bias if the assumptions do not hold, especially
if the training dataset is noisy (Khoshgoftaar et al. 2006; Khoshgoftaar and Seliya 2004).
Consequently, it has become common practice to combine the predictions of several
learners to achieve a higher reliability of the SQCMs. This approach is often referred to as
a multi-expert or multi-learner system (Gamberger et al. 1996) because base models
generated by machine learning can be regarded as experts (Witten and Frank 2000).
A multi-learner system can be an effective solution because it combines complementary
classification procedures. Unfortunately, even with appropriate biases, a multi-learner
classifier may remain inefficient if the base learners are trained on a dataset with limited
amount of information. Therefore, we believe that increasing the amount of information
available for knowledge discovery is the key to building adaptable and robust software
quality models (Witten and Frank 2000), and to ultimately improve the generalization
accuracy of such models.
It is very common nowadays for an organization to maintain several software metrics
repositories for each undertaken project (Fenton and Pfleeger 1997; Meulen et al. 2007;
Nikora and Munson 2003). The data in these repositories are likely to follow similar patterns,
especially if the organization enforces the same development life cycle, as well as the same
coding and testing practices. Consequently, in this paper, we propose to use all available past
projects to build SQCMs because we believe that enlarging the set of available software
modules for training will improve the predictive accuracy of the final software quality
model. In other words, we believe that we can detect more accurately which software
modules are faulty by increasing the volume and variety of data to learn from.
We describe a new multi-learner classifier which combines the predictions of mul-
tiple learners successively trained on multiple datasets. The combination function of this
multi-learner multi-dataset classifier is majority voting. This approach is innovative
because it has never been attempted in the context of software quality engineering.
Moreover, this study remains practical because it provides a simple way for the prac-
titioner to create better SQCMs by simply leveraging the existing information spread
across datasets. Four classification scenarios are then evaluated against each other.
These classification scenarios predict the instances in the test dataset based on the
respective predictions of a single learner induced on a single fit dataset (Scenario 1),
multiple learners induced on a single dataset (Scenario 2), a single learner induced on
multiple datasets (Scenario 3), and multiple learners induced on multiple datasets
(Scenario 4).
The performances of these classification scenarios are assessed by a large-scale
empirical study using seven real-world software metrics repositories and seventeen
well-proven learners. To the best of our knowledge, this study is unique in both its
scale and its application domain. We demonstrate that, on average, using one learner
induced on multiple fit datasets (Scenario 3) achieves better cost-wise performance
than using a single learner trained on a single dataset (Scenario 1). We also show that
relying solely on multiple learners (Scenario 2) does not significantly improve the
performance compared to the use of a single classifier induced on a single dataset
(Scenario 1).
The rest of the paper is organized as follows. In Sect. 2, we present related research.
Section 3 describes the implementation of the multi-learner multi-dataset classifier. We
also describe the modeling methodology involved in our empirical investigation. Section 4
presents the details of the experimental study and discusses the results. Finally, we draw
some useful conclusions from the empirical results in Sect. 5.
2 Related work
Voting is a well-known technique to combine the decisions of peer experts. Voting
techniques include majority voting, weighted voting, plurality voting, instance runoff
voting, and threshold voting (Yacoub et al. 2003). These techniques are derived from a
more general technique referred to as weighted k-out-of-n systems. The predictions of the
base classifiers can also be combined by hybrid approaches (Alpaydin 1997; Ho et al.
1994; Yacoub et al. 2003). A hybrid approach usually takes advantage of the strengths of
individual classifiers and avoids their weaknesses (Ho et al. 1994). Stacked generalization
extends voting in the sense that the outputs of the learners are incorporated through a
combiner system which is itself trained (Wolpert 1992).
In software quality modeling, there are often numerous types of features which can be
used to represent and identify fault-prone modules (Fenton and Pfleeger 1997). As a
consequence, it is technically difficult for a single classifier to make use of all the
features. Ho et al. (1994) and Chen et al. (1997) observed that features and classifiers of
different types complement one another in classification performance. However, the
problem of combining potentially conflicting decisions by multiple classifiers still
remains unresolved. As noted by Alpaydin (1998), combination approaches can be
divided into two groups. In a uni-representation, all learners use the same representation.
Therefore, the learners should be different in order to obtain different decisions. In a
multi-representation, a single learner can use different representations of the same input.
For example, in software quality models, software metrics can represent product,
resource, or process attributes of the modules (Fenton and Pfleeger 1997).
Krogh and Vedelsby (1995) defined ambiguity as the variation of the output of voters
averaged over unlabeled data to quantify the disagreement among the voters. In the
context of multiple neural networks, they showed that the ambiguity needs to be
maximized for minimal error. If the voters are strongly biased, the ambiguity will be
small because voters implement very similar functions and will agree on inputs even
outside the training set. They also noted that one way to increase the ambiguity is to
train the voters on different datasets. Hansen and Salamon (1990) mentioned that, by
taking majority voting and selecting independent experts with a success probability
higher than 0.50, success increases as the number of voting classifiers increases. Mani
(1991) argued that variance among voters decreases as the number of independent
voters increases.
Meir (1995) showed that for linear regression, by training experts on disjoint datasets
and using voting, a large decrease in variance can be achieved due to the independence of
experts. Alpaydin (1997) made similar conclusions with a nearest neighbor classifier.
While most studies assume that each estimator is trained on the complete dataset, Meir
envisaged a situation where the dataset is divided into several subsets, with each of them
used to form a different estimator. Similarly, in our previous work (Khoshgoftaar and
Rebours 2004), we initially divided the training set to build a less biased multi-expert
system to detect noisy instances more efficiently.
In a recent study (Khoshgoftaar and Seliya 2004), we have empirically demonstrated
that while using a very large number of diverse classification techniques for building
software quality classification models, classification accuracy does not show a dramatic
improvement. Instead of searching for a classification technique that performs well for a
given software measurement dataset, we concluded that the software process should focus
on improving the quality of the data (Khoshgoftaar and Seliya 2004).
3 Methodology
3.1 Classification scenarios
We present a new multi-expert classifier which combines the predictions of multiple base
learners induced on multiple training datasets. Each base algorithm has a trade-off, in a
sense that it introduces a bias during the learning process. As there is no benefit in
combining multiple learners that always make similar decisions, the aim is to find a set of
learners which differ in their decisions so that they complement each other (Alpaydin
1998).
Suppose that the practitioner selects m different algorithms (learners). Metrics collected
on n past projects are also available. In this study, we exclusively focus on a uni-representation system (Alpaydin 1998), i.e., the collected metrics in the n fit datasets as well as
in the test dataset are of the same type and number. Given a test dataset E, the m learning
algorithms and n fit datasets can be combined to predict E. More specifically, each of the
m algorithms is trained on each of the n fit datasets until m × n base models are built. These
base models are then used to predict for the test dataset E; m × n vectors of base estimates
of instances in E are therefore generated. Finally, the vectors are combined using the voting
scheme described next.
3.1.1 Voting technique
The simplest way to combine multiple experts is by majority voting. If the number of
experts is even, then we are likely to encounter a tie where half of the experts predict the
program module as nfp, and the other half predict it as fp. Due to the nature of software
quality engineering, where misclassifying a fp program module as nfp is much more
costly than misclassifying a nfp program module as fp, a conservative approach is
recommended (Khoshgoftaar and Allen 1999). More specifically, the final decision is fp
rather than nfp in case of a tie.
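As a minimal sketch, the conservative majority vote described above could be implemented as follows (the "fp"/"nfp" label strings are illustrative, not taken from the paper's implementation):

```python
def majority_vote(votes):
    """Combine the class votes of the base experts by majority voting.

    Ties are broken conservatively in favor of 'fp', since misclassifying
    a fault-prone module as nfp is the costlier error.
    """
    fp_count = sum(1 for v in votes if v == "fp")
    # half or more fp votes (including an exact tie) resolve to fp
    return "fp" if fp_count >= len(votes) / 2 else "nfp"
```

With an even number of experts, a 50/50 split thus yields fp, matching the conservative rule above.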
3.1.2 Pre-processing
An induced learner can only predict instances in the test dataset if the relative values and
ranges of the attributes in the training dataset are similar to those in the test dataset. It is
therefore recommended that the attribute values (i.e., the software metrics in our case) be
first scaled so that their relative values and ranges are approximately equal. In this study,
we normalize all the metrics to a zero mean and unit variance. Let I_kl be the measured
value of the l-th attribute of the k-th program module. The scaled value I'_kl corresponding to
I_kl is given by (Jain 1991):

    I'_kl = (I_kl − Ī_l) / S_l    (1)

where Ī_l and S_l are the measured mean and standard deviation of the l-th attribute,
respectively.
The actual values of the normalized metrics are generally not comprehensible by the
typical user of software metrics, in that they range from negative to positive numbers.
Similar to Munson and Khoshgoftaar (1992), the values of the metrics used in this study
represent a scaled version of the normalized metrics, defined as follows:

    I''_kl = 10 I'_kl + 50    (2)
These scaled-relative metrics will have a mean of 50 and a standard deviation of 10.
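Equations (1) and (2) compose into a single linear transformation per attribute; a minimal sketch follows (using the population standard deviation, since the text does not specify the sample variant):

```python
def scale_metric(values):
    """Apply Eq. (1) (zero mean, unit variance) and then Eq. (2)
    (rescale to mean 50, standard deviation 10) to one attribute."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [10 * ((v - mean) / sd) + 50 for v in values]
```

Each attribute (column) of the fit and test datasets would be scaled independently in this way.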
3.1.3 Algorithm
Figure 1 describes the implementation of the multi-learner multi-dataset classifier. L_i (i = 1, …, m)
denotes the i-th base learner. D_j (j = 1, …, n) is the j-th fit dataset. D'_j is the normalized j-th fit dataset.
D''_j is the scaled-relative j-th fit dataset. E and ||E|| represent the test dataset and the number
of instances in the test dataset, respectively. I_k (k = 1, …, ||E||) is the k-th program module of dataset
E. I''_k is the scaled-relative program module of I_k. L_i(I''_k, D''_j) = c_k is the predicted class of
scaled-relative program module I''_k obtained by inducing learning scheme L_i on training set
D''_j. In the case of software classification models, c_k ∈ {nfp, fp}. c = {c_1, …, c_||E||} is the
vector of the final estimates of instances in dataset E. As mentioned previously, the fit
datasets D_j (j = 1, …, n) and the test dataset E use the same representation to describe software
modules.
Initially, before using the output of the m × n experts, the test and fit datasets are
normalized (steps 1 and 4, respectively) and scaled (steps 2 and 5, respectively). The m
selected learners are then fine-tuned on each of the n fit datasets (step 7) based on a model
selection strategy of obtaining a preferred balance of equality between the Type I and Type
II error rates. Given an instance I_k, a counter, S_k, is defined (step 11). For each learner and
for each fit dataset, if learner L_i induced on fit dataset D''_j predicts scaled-relative program
module I''_k as fp, counter S_k is incremented (step 14). Once all m × n experts cast their
votes, the counter associated with a given instance will indicate the final predicted value of
that program module. If half or more of the experts agree on classifying instance I_k as fp,
then the final predicted label of that instance is fp (step 17). Otherwise, instance I_k is
classified as nfp (step 18).
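The procedure of Fig. 1 can be sketched in code. This is a hedged illustration, not the paper's actual implementation: the fit/predict learner interface and the toy ThresholdLearner are assumptions introduced for the example, and the pre-processing and fine-tuning steps are omitted.

```python
class ThresholdLearner:
    """Toy base learner (illustrative only): labels a module 'fp' when
    its first metric exceeds the mean of the training values."""
    def fit(self, X, y):
        self.threshold = sum(row[0] for row in X) / len(X)
    def predict(self, X):
        return ["fp" if row[0] > self.threshold else "nfp" for row in X]

def multi_learner_multi_dataset_predict(learner_factories, fit_datasets, test_X):
    """Train each of the m learners on each of the n fit datasets
    (m * n base models), then majority-vote their predictions on the
    test set, resolving ties conservatively to 'fp'."""
    models = []
    for make_learner in learner_factories:      # m learners
        for X, y in fit_datasets:               # n fit datasets
            model = make_learner()
            model.fit(X, y)                     # induce a base model (step 7)
            models.append(model)
    final = []
    for row in test_X:
        s = sum(1 for mdl in models if mdl.predict([row])[0] == "fp")
        # half or more fp votes -> fp (steps 17-18)
        final.append("fp" if s >= len(models) / 2 else "nfp")
    return final
```

In the paper's setting the base models are additionally fine-tuned for a preferred balance between the Type I and Type II error rates; that tuning step is not shown here.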
3.1.4 Four classification scenarios
The classifier described in Fig. 1 can be specialized by modifying the values of some
parameters such as the number of classifiers (m) or the number of fit datasets (n). The
notation "m × n" represents a classification scheme which combines the predictions of m
learners induced on n fit datasets. In this work, the following four classification scenarios
are investigated:
Scenario 1: single-learner single-dataset classifier: This classifier uses a single learner
(m = 1) induced on a single fit dataset (n = 1) to predict the test dataset. In this study,
such a scenario is labeled as 1 × 1.
Scenario 2: multi-learner single-dataset classifier: This classifier combines m learners
induced on a single fit dataset (n = 1). This technique has been covered extensively by
researchers, as discussed in Sect. 2. It is labeled as m × 1.
Scenario 3: single-learner multi-dataset classifier: The predictions of one learner
(m = 1) induced on n fit datasets are combined. It is labeled as 1 × n.
Scenario 4: multi-learner multi-dataset classifier: This is the most extensive technique
as it uses m learners induced on n fit datasets. It is denoted as m × n.
Fig. 1 Implementation of the multi-learner multi-dataset classifier
3.2 Merging: an alternative classification technique
In order to take advantage of the availability of multiple fit datasets, one may decide to
merge them into one single dataset. A multi-learner classifier can then be built based on m
learners induced on this merged dataset. Of course, such an approach should standardize
the data prior to the merge. Instead of building m × n base learners, the practitioner builds
only m classifiers. Since the newly created dataset now contains more varieties of cases
because it originates from various projects, one may argue that data mining learners
induced on this new dataset would be more accurate than those induced on any dataset
before merging. We compare this merging approach with the other approaches (or sce-
narios) presented in the previous section. While this alternate classification technique is
relatively simpler, there are some notable drawbacks, as explained below.
• As opposed to the multi-dataset approach, this alternative technique makes it impossible
to use datasets with heterogeneous feature sets. Unless the learning algorithms can cope
efficiently with missing values, the attributes of the instances need to be of the same type
before merging.
• Inducing learners on one large fit dataset will create complex hypotheses which are
difficult to interpret. Besides, the fit datasets can have different levels of noise
(Khoshgoftaar et al. 2006; Khoshgoftaar and Rebours 2004) or different sizes. As a
consequence, the training of the algorithms on the merged dataset can be more
intricate.
• The induction of data mining learners on the newly merged fit dataset would be slower
for polynomially bounded algorithms O(a^b), where a represents the number of
instances in the dataset, and b is a given positive constant.
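The merge-with-standardization alternative can be sketched as follows (per-attribute z-scoring before pooling; this assumes numeric attributes with nonzero variance, and is an illustration rather than the paper's implementation):

```python
def merge_standardized(datasets):
    """Standardize each fit dataset to zero mean / unit variance per
    attribute before merging, so that projects measured on different
    scales can be pooled into one training set."""
    merged_X, merged_y = [], []
    for X, y in datasets:
        cols = list(zip(*X))
        means = [sum(c) / len(c) for c in cols]
        sds = [(sum((v - mu) ** 2 for v in c) / len(c)) ** 0.5
               for c, mu in zip(cols, means)]
        for row in X:
            merged_X.append([(v - mu) / sd
                             for v, mu, sd in zip(row, means, sds)])
        merged_y.extend(y)
    return merged_X, merged_y
```

The m learners would then be induced once each on the returned merged dataset, instead of m × n times.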
3.3 Analysis of variance
ANalysis Of VAriance, commonly known as ANOVA, is a statistical technique for
examining whether independent groups or populations are significantly different from one
another. In our study, the one-way ANOVA design is selected to analyze the performances
of the four classification scenarios. In this design, a classification scenario corresponds to a
group. More specifically, the single-learner single-dataset classification scenario (1 9 1),
the multi-learner (single-dataset) scenario (m 9 1), the (single-learner) multi-dataset sce-
nario (1 9 n), and the multi-learner multi-dataset scenario (m 9 n) relate to the first,
second, third, and fourth groups of the ANOVA model, respectively.
Let Y_1j, …, Y_{n_j}j represent a random sample of n_j observations taken from the population
of group j. How many observations can we possibly collect from the population of each
group? In our experiment, m different data mining algorithms are used. In addition, n + 1
datasets are available and the dependent variable (i.e., fp or nfp) is known for all the
datasets. m × (n + 1) base learners can then be fine-tuned, and each of them can generate
base estimates for n datasets. Hence, m × (n + 1) × n vectors of base estimates can be
obtained. For example, a single-learner multi-dataset classifier (Scenario 3) can be created
by combining the predictions of one of the m base learners successively trained on n
datasets. This classifier can then be applied on the test dataset. Therefore, n_3 = m × (n + 1)
combinations are possible for classification Scenario 3. Similarly, n_1 = m × n × (n + 1),
n_2 = n × (n + 1), and n_4 = n + 1.
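With this study's configuration of seventeen learners and seven datasets (m = 17; one dataset serves as test, so n = 6 fit datasets), these group sizes can be checked directly:

```python
m, n = 17, 6                  # 17 learners; 7 datasets, one held out as test

n1 = m * n * (n + 1)          # Scenario 1 (1 x 1) observations
n2 = n * (n + 1)              # Scenario 2 (m x 1)
n3 = m * (n + 1)              # Scenario 3 (1 x n)
n4 = n + 1                    # Scenario 4 (m x n)
```

This yields very unequal sample sizes across the four ANOVA groups, which motivates the homoscedasticity check and the Scheffé comparisons discussed below.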
Y_ij, the i-th observation in group j (where i = 1, …, n_j and j = 1, …, 4), can be represented
by the following model (Berenson et al. 1983):
    Y_ij = μ + A_j + ε_ij    (3)

where μ is the overall effect common to all the observations; A_j = μ_.j − μ is the treatment effect
related to the j-th group; ε_ij = Y_ij − μ_.j is the experimental error associated with the i-th
observation in group j; and μ_.j is the true mean of the j-th group. The F statistic validates whether
the four population means are equal (Berenson et al. 1983; Jain 1991).
It is assumed that the observed responses in each of the groups represent random
samples drawn from four independent normal populations having equal variance (i.e.,
σ²_1 = ⋯ = σ²_4). The Shapiro test validates the assumption of normality (i.e., N(μ_.j, σ_j)).
The property of homoscedasticity (i.e., equal variability) is tested by using Levene's test,
since the four sample sizes are different (Berenson et al. 1983). If we have good reason to
believe this assumption has been violated, a good strategy is to seek appropriate trans-
formations to normalize the data.
If it is concluded that there are significant differences between the means of at least two
of the groups, a posteriori comparison methods determine which of the four groups are
significantly different as well as which of the four groups appear to differ from each other
only by chance. Since the four sample sizes are very different in this study, Scheffé S tests
will be used later to evaluate the C(4, 2) = 6 paired comparisons (Berenson et al. 1983).
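For reference, the F statistic of this one-way design is the between-group mean square divided by the within-group mean square; a minimal sketch for k groups of unequal sizes:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: F = (SSB / (k - 1)) / (SSW / (N - k))
    for k groups of possibly unequal sizes, total N observations."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total_n = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / total_n
    means = [sum(g) / len(g) for g in groups]
    # between-group sum of squares, weighted by group size
    ssb = sum(n_j * (m_j - grand_mean) ** 2
              for n_j, m_j in zip(sizes, means))
    # within-group sum of squares
    ssw = sum(sum((x - m_j) ** 2 for x in g)
              for g, m_j in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (total_n - k))
```

In this study the four groups would hold the NECM values of the four classification scenarios; a large F suggests that at least two scenario means differ.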
3.4 Model-selection and evaluation strategy
Our empirical study is related to two-group classification in the context of software quality.
Software modules are typically labeled as either fault-prone (fp) or not fault-prone (nfp).
Hence, two types of misclassification errors can occur: Type I error (or false positive) and
Type II error (or false negative). The Type I and Type II errors are generally inversely
proportional for a given dataset. Hence, software quality engineers often recommend
selecting a classification model that has a preferred balance between the two error rates.
This is often obtained by building models with different parameter settings, such as tree
depth in decision trees.
The model selection strategy used in our study is based on obtaining a preferred balance
of equality between the Type I and Type II error rate, with the Type II error rate being as
low as possible. We have used this strategy effectively in several of our prior studies on
software quality analysis of high assurance systems (Khoshgoftaar et al. 2000; Khosh-
goftaar and Rebours 2004; Khoshgoftaar and Seliya 2004). In this study, the model
selection strategy is semi-automated such that the respective model parameters are fine-
tuned to obtain the best possible accuracy on the training dataset while maintaining the
preferred balance between the Type I and Type II error rates, with the latter as low as
possible. Low Type I and Type II misclassification rates ensure the detection of a sig-
nificantly large number of fp modules, and at the same time, keeps the number of nfpmodules predicted to be fp (i.e., ineffective testing and inspection) low.
Comparing the performance of different classification methods based on the two mis-
classification rates (Type I and Type II) can be a difficult task (Khoshgoftaar et al. 2006;
Khoshgoftaar and Rebours 2004). In the context of (two-group) software quality classifi-
cation, where there is likely to be a vast disparity between the prior probabilities of the two
classes (fp and nfp) and the cost of the two types of misclassification, the Expected Cost of
Misclassification (ECM) is more appropriate as a practical measure for comparison
(Khoshgoftaar and Allen 1999):
    ECM = C_I Pr(fp|nfp) p_nfp + C_II Pr(nfp|fp) p_fp    (4)

where C_I and C_II are the costs of Type I and Type II misclassification errors respectively, p_fp
and p_nfp are the prior probabilities of fp modules and nfp modules, Pr(fp|nfp) is the probability
that a nfp module would be misclassified as fp, and Pr(nfp|fp) is the probability that a fp
module would be misclassified as nfp.
In practice, it is difficult to quantify the actual costs of misclassification at the time of
modeling. Hence, we define the Normalized Expected Cost of Misclassification (NECM):

    NECM = ECM / C_I = Pr(fp|nfp) p_nfp + (C_II / C_I) Pr(nfp|fp) p_fp    (5)

NECM facilitates the use of the cost ratio C_II / C_I = c, which can be more readily estimated
using software engineering heuristics for a given project (Khoshgoftaar and Allen 1999).
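Given a classifier's confusion counts on a dataset and a practitioner-supplied cost ratio c, Eq. (5) can be estimated as follows (a sketch; the argument names are illustrative):

```python
def necm(n_nfp_as_fp, n_fp_as_nfp, n_nfp, n_fp, c):
    """Normalized Expected Cost of Misclassification (Eq. 5), estimated
    from counts: NECM = Pr(fp|nfp) * p_nfp + c * Pr(nfp|fp) * p_fp,
    where c = C_II / C_I is the cost ratio."""
    n = n_nfp + n_fp
    pr_fp_given_nfp = n_nfp_as_fp / n_nfp   # Type I error rate
    pr_nfp_given_fp = n_fp_as_nfp / n_fp    # Type II error rate
    return pr_fp_given_nfp * (n_nfp / n) + c * pr_nfp_given_fp * (n_fp / n)
```

Note how a large c penalizes Type II errors: the same error rates produce a much higher NECM when missing fp modules is deemed costly.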
4 Empirical evaluation
4.1 Description of the software measurement datasets
The software metrics and quality data used in our study originate from seven NASA
software projects. Obtained through the NASA Metrics Data Program, these datasets
include software measurement data and associated error data collected at the function level
(Khoshgoftaar et al. 2006; Khoshgoftaar and Rebours 2004; Khoshgoftaar and Seliya
2004). Each instance of these datasets is a program module. The quality of a module is
described by its Error Rate, i.e., number of defects in the module, and Defect, whether or
not the module has any defects. Even though these projects are not directly dependent on
each other, they share marked commonalities. All software projects used in our study are
developed based on a general NASA software development process, and all pertain to
mission critical software applications. They are all high assurance and complex real-time
systems. It is practical and relevant to leverage the information spread across these datasets
in order to predict the quality of an ongoing similar project.
We selected thirteen primitive software metrics for our study: three McCabe metrics
(Cyclomatic Complexity, Essential Complexity, and Design Complexity); five Line Count
metrics (Loc Code And Comment, Loc Total, Loc Comment, Loc Blank, Loc Executable);
four basic Halstead metrics (Unique Operators, Unique Operands, Total Operators,
Total Operands); and one metric for Branch Count. Classifiers are
built using the thirteen software metrics as independent variables and the module-class as the
dependent variable (i.e., fp or nfp). It is important to note that the software measurements are
primarily governed by their availability, the internal workings of the respective projects, and
the data collection tools used by the projects. We only use functionally oriented metrics for all
software datasets, solely because of their availability. This is an unfortunate case of a real-
world software engineering situation where one has to work with what is available rather than
the most ideal situation.
The use of specific software metrics in the case study does not advocate their effec-
tiveness—different software projects may collect and consider different sets of software
measurements for analysis (Khoshgoftaar and Rebours 2004; Nikora and Munson 2004).
We note that the selection of a best set of predictors in estimation problems has been an
ongoing subject of study in software engineering. For example, Cuadrado et al. (2006)
consider an approach to improve the selection of cost drivers in parametric models for
software cost estimation. They analyze various factors that affect the importance of a cost
driver, and use empirical evidence to formulate an aggregation mechanism for cost driver
selection. In the context of software quality classification, Menzies et al. (2007) summarize
that instead of selecting a best set of software quality indicators, empirical studies should
focus on building software quality classification models that are useful and practical. They
summarize that the best attributes to use for defect prediction vary from dataset to dataset,
confirming a relatively similar observation made by Shepperd and Kadoda (2001).
The datasets are related to projects of various sizes written with various programming
languages. Table 1 summarizes the seven datasets used in this case study. Those datasets
are referred to as JM1, KC1, KC2, KC3, CM1, MW1, and PC1 respectively. It is worth
mentioning that the KC datasets are written using object-oriented languages; however,
object-oriented metrics provided for those projects were associated with a different set of
modules in the respective projects. Moreover, those modules were not associated with
known defect data. These problems with available object-oriented metrics prevented their
inclusion in our study. Each software system and its dataset is briefly described below.
• JM1 is a real-time C project which has approximately 315 KLOC (Fenton and Pfleeger
1997). There are eight years of error data associated with the metrics. The changes to the
modules are based on the changes reported within the problem reports. We processed
JM1 to eliminate redundancy, obvious noisy instances, and observations with missing
values. The pre-processed dataset contains 8850 modules, and of these instances, 1687
have one or more faults. The maximum number of faults in a software module is 26.
• KC1 is a project comprised of logical groups of computer software components
(CSCs) within a large ground system. KC1 is made up of 43 KLOC of C++ code. The
dataset contains 2107 instances, and of these instances, 325 have one or more faults and
1782 have zero faults. The maximum number of faults in a module is 7.
• KC2 is a C++ program, with metrics collected at the function level. The KC2 project
is the science data processing unit of a storage management system used for receiving
and processing ground data for missions. The dataset includes only those modules that
were developed by NASA software developers. The dataset contains 520 instances, and
of these instances, 106 have one or more faults and 414 have zero faults. The maximum
number of faults in a software module is 13.
• KC3 has been coded in 18 KLOC of Java. This software application collects, processes,
and delivers satellite meta-data. The dataset contains 458 instances, and of these
instances, 43 have one or more faults and 415 have zero faults. The maximum number
of faults in a module is 6.
• CM1 is written in C code, with approximately 20 KLOC. The data available for this
project is from a science instrument. It contains 505 instances, and of these instances,
48 have one or more faults and 457 have zero faults. The maximum number of faults in
a module is 5.
• MW1 is the software from a zero gravity experiment related to combustion. The
experiment is now completed. It is comprised of 8000 lines of C code. The dataset
contains 403 modules, and of these instances, 31 have one or more faults and 372 have
zero faults. The maximum number of faults in a module is 4.
• PC1 is flight software from an earth-orbiting satellite that is no longer operational. It
consists of 40 KLOC of C code. The dataset contains 1107 instances, and of these
instances, 76 have one or more faults and 1031 have zero faults. The maximum number
of faults in a module is 9.
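The class distributions described above can be tabulated directly; a small sketch (counts transcribed from the dataset descriptions) that computes each dataset's fault-prone proportion:

```python
# Counts transcribed from the dataset descriptions above:
# (total modules, modules with one or more faults)
datasets = {
    "JM1": (8850, 1687),
    "KC1": (2107, 325),
    "KC2": (520, 106),
    "KC3": (458, 43),
    "CM1": (505, 48),
    "MW1": (403, 31),
    "PC1": (1107, 76),
}

for name, (total, fp) in datasets.items():
    nfp = total - fp  # not-fault-prone modules
    print(f"{name}: {fp}/{total} fault-prone ({100 * fp / total:.1f}%), {nfp} fault-free")
```

The output makes the class imbalance explicit: the fault-prone minority ranges from roughly 7% (PC1) to 19% (JM1) of the modules.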
34 Software Qual J (2009) 17:25–49
4.2 Presentation of the selected learners
As there is no benefit in combining multiple learners that always make similar decisions,
seventeen learners which differ in their underlying concepts are selected to complement
each other. These classification techniques are summarized in Table 2. Several of these
classifiers are implemented in the WEKA data mining tool (Witten and Frank 2000). Some
additional information regarding these classification techniques is presented in the
Appendix. However, a descriptive discussion of each technique is out of scope for this
paper.
Table 1 Summary of the Software Data Repositories
Dataset nfp fp Total Language
JM1a 7163 1687 8850 C
KC1 1782 325 2107 C++
KC2 414 106 520 C++
KC3 415 43 458 Java
CM1 457 48 505 C
MW1 372 31 403 C
PC1 1031 76 1107 C
a A pre-processed dataset
Table 2 Summary of the selected classifiers
Family Classification technique Acronym
Instance-based learners Locally Weighted Learning (with Decision Stump) (Atkeson et al. 1997) LWLStump
1-Instance Based Learning (Witten and Frank 2000) IB1
k-Instance Based Learning (Emam et al. 2001) IBk
Meta learner Bagging (Breiman 1996) Bagging
Function-based learners Sequential Minimal Optimization (Platt 1998) SMO
Logistic Regression (Khoshgoftaar and Allen 1999) LRa
Rule-based learners Ripple Down Rules (Gaines and Compton 1995) Ridor
One Rule (Holte 1993) OneR
Lines-of-Code LOCa
Decision Table (Kohavi 1995) DecisionTable
Tree-based learners WEKA’s implementation of C4.5 (Quinlan 1993) J48
Partial Decision Tree (Frank and Witten 1998) PART
Tree-Disc Classification Tree (Khoshgoftaar et al. 2000) TDa
Alternate Decision Tree (Freund and Mason 1999) ADTree
Repeated Incremental Reduced Error Pruning (Cohen 1995) JRip
Random Forest (Witten and Frank 2000) RandomForest
Bayesian learner Naive Bayes (Frank et al. 2000) NaiveBayes
a Implemented by data mining tools other than WEKA
4.3 Empirical results
4.3.1 Quality-of-fit results
We present the quality-of-fit of the m = 17 base learners induced on the 7 available
datasets. For most base classifiers, the predictions are obtained using 10-fold cross-vali-
dation. Extensive tests on numerous different datasets, with different learning techniques,
show that 10 is about the right number of folds to obtain the best estimate of error (Witten
and Frank 2000). In the case of LOC, TD, and LR, resubstitution is used due to the
unavailability of a cross-validation feature within the modeling tool used.
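For readers unfamiliar with the procedure, k-fold cross-validation can be sketched generically as follows; this is an illustration, not the WEKA implementation used in the study, and the majority-class learner is only a stand-in for any base classifier:

```python
import random

def k_fold_error(instances, labels, train_fn, k=10, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation:
    each fold is held out once while the learner trains on the rest."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_fn([instances[i] for i in train], [labels[i] for i in train])
        errors += sum(model(instances[i]) != labels[i] for i in fold)
    return errors / len(instances)

# Toy usage with a majority-class learner (a stand-in for any base classifier):
def majority_learner(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = list(range(20))
labels = ["nfp"] * 15 + ["fp"] * 5
print(k_fold_error(data, labels, majority_learner))  # 5/20 = 0.25
```

Every instance is used exactly once for testing and k − 1 times for training, which is why the estimate is less optimistic than resubstitution.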
Tables 3 and 4 present the quality-of-fit in terms of Type I, Type II, and Overall
misclassification rates. The critical ranges for the misclassification rates are also provided
using the t-test (Jain 1991) at significance level a = 0.10. For each of the seventeen
learners listed in Table 3, the results are averaged across the seven datasets. We observe
that Tree Disc (TD) performs the best on average; this result is confirmed by previous
studies (Khoshgoftaar et al. 2000, 2006). It is also worth noting the relatively good
accuracy of Random Forest, Logistic Regression, and IBk. Similarly, for each of the seven
available datasets, the results in Table 4 are averaged across the seventeen learners. The
quality-of-fit greatly varies across the datasets, indicating that some datasets may be more
noisy than others (Khoshgoftaar et al. 2006).
4.3.2 Quality-of-test results
Table 5 presents the total number of classifiers obtained using the four classification sce-
narios (nj). For example, there are 42 possible combinations for classification scenario 2,
Table 3 Quality-of-fit averaged across fit datasets
Base learner Type I Type II Overall
J48 27.0% (±4.4%) 26.1% (±3.9%) 26.9% (±4.4%)
JRip 26.9% (±4.0%) 24.4% (±3.9%) 26.6% (±3.9%)
NaiveBayes 29.0% (±3.4%) 29.4% (±3.3%) 29.0% (±3.3%)
DecisionTable 29.4% (±4.9%) 28.7% (±4.7%) 29.3% (±4.9%)
RandomForest 25.6% (±2.9%) 24.1% (±3.5%) 25.4% (±3.0%)
OneR 26.6% (±3.2%) 25.8% (±3.4%) 26.5% (±3.2%)
PART 26.4% (±3.8%) 25.2% (±4.3%) 26.2% (±3.9%)
IBk 25.4% (±3.6%) 24.8% (±3.9%) 25.3% (±3.5%)
IB1 30.8% (±2.8%) 27.8% (±3.4%) 30.4% (±2.8%)
ADTree 26.9% (±3.1%) 27.0% (±2.9%) 26.9% (±3.1%)
Ridor 36.6% (±15.6%) 24.5% (±6.6%) 35.3% (±13.6%)
LWLStump 26.9% (±3.2%) 26.8% (±3.5%) 26.9% (±3.3%)
SMO 26.2% (±3.8%) 26.7% (±3.7%) 26.3% (±3.8%)
Bagging 26.9% (±3.5%) 24.4% (±2.9%) 26.5% (±3.4%)
LOCa 31.5% (±1.9%) 28.8% (±3.9%) 31.1% (±2.3%)
LRa 25.9% (±3.2%) 25.5% (±3.1%) 25.8% (±3.1%)
TDa 21.2% (±3.5%) 20.5% (±3.5%) 21.0% (±3.4%)
Average 27.9% (±1.1%) 26.2% (±0.8%) 27.7% (±1.0%)
a Quality-of-fit obtained by resubstitution
implying that for each of the 7 datasets used as a fit dataset, the other 6 are used as test
datasets. Similarly, for Scenario 1, there are 714 possible combinations, reflecting 17
learners trained on each of the 7 datasets and applied to each of the remaining 6 datasets.
For Scenario 3, there are 119 possible combinations, i.e., each of the 17 learners trained on
6 fit datasets is applied to the remaining 1 dataset—all 7 datasets are given an opportunity
to belong to the 6 datasets used for training, and each dataset is given an opportunity to be
used as a test dataset for a learner trained on the 6 fit datasets. Finally, for Scenario 4, there
are 7 possible combinations since the 17 learners trained on 6 fit datasets are applied (in
turn) to each of the 7 datasets. In total, 882 vectors of final estimates (test data estimates)
are generated based on the four classification scenarios.
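The combination step shared by Scenarios 2, 3, and 4 is simple majority voting over the base estimates; a minimal sketch (the tie-break toward fp is our assumption, chosen as the conservative option for high-assurance software):

```python
from collections import Counter

def majority_vote(base_predictions):
    """Combine the class estimates of several experts (learner/dataset
    pairs) for one module by simple majority. The tie-break toward fp
    is an assumption, conservative for high-assurance systems."""
    counts = Counter(base_predictions)
    return "fp" if counts["fp"] >= counts["nfp"] else "nfp"

print(majority_vote(["fp", "nfp", "fp"]))   # fp
print(majority_vote(["nfp", "nfp", "fp"]))  # nfp
```

Each module's final estimate is the vote over its m × n base estimates, which is what makes the method applicable unchanged to all three combined scenarios.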
Tables 6–9 summarize the quality-of-test both in terms of misclassification rates (i.e.,
Type I and Type II) and normalized expected costs of misclassification (i.e., NECM) for
classification Scenarios 1, 2, 3, and 4 respectively. For brevity, the results are grouped and
averaged by test datasets. NECMs are produced for cost ratios (c) 15, 20 and 25. This range
of cost ratios was used in similar studies (Khoshgoftaar et al. 2000, 2006) and is considered
appropriate for high-assurance software systems. The averages (Ȳ.j) and the standard
deviations (S.j) for the respective response variables are presented as well.
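Assuming the standard two-group form of NECM (unit cost per Type I error, cost c per Type II error, normalized by the total number of modules), the JM1 row of Table 6 can be reproduced from its reported error rates:

```python
def necm(n_nfp, n_fp, type1_rate, type2_rate, c):
    """Normalized expected cost of misclassification, assuming the
    standard two-group form: unit cost per Type I error (nfp predicted
    as fp), cost c per Type II error (fp predicted as nfp), divided by
    the total number of modules."""
    false_alarms = type1_rate * n_nfp   # nfp modules flagged fp
    misses = type2_rate * n_fp          # fp modules passed as nfp
    return (false_alarms + c * misses) / (n_nfp + n_fp)

# JM1 under Scenario 1 (Table 6): Type I = 31.0%, Type II = 48.6%
for c in (15, 20, 25):
    print(c, round(necm(7163, 1687, 0.310, 0.486, c), 3))
```

This reproduces Table 6's JM1 entries (1.640, 2.103, 2.566) to within rounding of the reported error rates.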
Classification Scenarios 2, 3, and 4 achieve, on average, better performance in
terms of NECM than Scenario 1. In other words, combining multiple base
estimates improves the predictive accuracy of the final classifier. Moreover, the standard
deviation of Scenario 1 is higher than that of any other scenario (S.1 > S.2 > S.4 > S.3). As
noted by Alpaydin (1998) and Mani (1991), the voting method can be thought of as a
regularizer which smoothes the predictive results. In terms of misclassification cost, we
observe that the multi-dataset classification scenarios (Scenarios 3 and 4) perform better on
average than the single-dataset scenarios (Scenarios 1 and 2), that is, Ȳ.3 < Ȳ.4 < Ȳ.2 < Ȳ.1.
Table 4 Quality-of-fit averaged across base learners
Fit dataset Type I Type II Overall
JM1 33.6% (±0.5%) 33.1% (±0.8%) 33.5% (±0.5%)
KC1 27.7% (±0.8%) 27.1% (±0.9%) 27.6% (±0.8%)
KC2 21.7% (±1.2%) 20.1% (±0.8%) 21.3% (±1.0%)
KC3 25.5% (±1.3%) 23.8% (±1.3%) 25.4% (±1.3%)
CM1 30.2% (±6.1%) 24.6% (±2.4%) 29.7% (±5.4%)
MW1 31.2% (±1.9%) 30.2% (±1.6%) 31.2% (±1.9%)
PC1 23.0% (±2.0%) 22.5% (±1.9%) 23.0% (±2.0%)
Average 27.9% (±1.1%) 26.2% (±0.8%) 27.7% (±1.0%)
Table 5 Sample sizes of the four groups
Classification scenario   Number of learners, m   Number of fit datasets, n   Number of experts, n × m   Sample size, nj
1 × 1                     1                       1                           1                          17 × 6 × 7 = 714
m × 1                     17                      1                           17                         6 × 7 = 42
1 × n                     1                       6                           6                          17 × 7 = 119
m × n                     17                      6                           102                        7
Total sample size: n1 + n2 + n3 + n4 = 882
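The sample sizes nj in Table 5 follow directly from the design (17 learners, 7 datasets, leave-one-dataset-out testing); a quick arithmetic check:

```python
n_datasets, m_learners = 7, 17

# Scenario 1 (1 x 1): every learner, every fit dataset, every other test dataset
n1 = m_learners * n_datasets * (n_datasets - 1)   # 17 * 7 * 6 = 714
# Scenario 2 (m x 1): the 17 learners combine into one vote per fit/test pair
n2 = n_datasets * (n_datasets - 1)                # 7 * 6 = 42
# Scenario 3 (1 x n): each learner trained on 6 datasets, tested on the 7th
n3 = m_learners * n_datasets                      # 17 * 7 = 119
# Scenario 4 (m x n): all 102 experts combined, one vote per test dataset
n4 = n_datasets                                   # 7

print(n1, n2, n3, n4, n1 + n2 + n3 + n4)  # 714 42 119 7 882
```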
These conclusions, which are identical for all cost ratios, suggest that using multiple
datasets improves the predictive accuracy of the final classifiers compared to the use of a
single dataset. To determine whether the improvements are significant in terms of pre-
dictive performance, a statistical analysis is considered.
Table 6 Quality-of-test of classification scenario 1 (1 × 1)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.640 2.103 2.566 31.0% 48.6%
KC1 1.111 1.406 1.702 26.7% 38.3%
KC2 1.200 1.520 1.840 30.2% 31.4%
KC3 0.718 0.871 1.025 28.3% 32.7%
CM1 0.819 1.000 1.181 30.6% 38.1%
MW1 0.679 0.802 0.926 33.5% 32.0%
PC1 0.710 0.843 0.976 33.4% 38.7%
Ȳ.1 0.982 1.221 1.459 30.5% 37.1%
S.1 0.396 0.536 0.677 11.7% 13.0%
n1 = 714
Table 7 Quality-of-test of classification scenario 2 (m × 1)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.602 2.072 2.541 23.7% 49.3%
KC1 1.046 1.327 1.609 23.7% 36.5%
KC2 0.973 1.236 1.499 23.2% 25.8%
KC3 0.680 0.834 0.989 23.8% 32.9%
CM1 0.751 0.921 1.091 26.7% 35.8%
MW1 0.634 0.746 0.857 32.4% 29.0%
PC1 0.657 0.787 0.916 28.8% 37.7%
Ȳ.2 0.906 1.132 1.358 26.1% 35.3%
S.2 0.347 0.473 0.600 4.9% 10.0%
n2 = 42
Table 8 Quality-of-test of classification scenario 3 (1 × n)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.441 1.838 2.235 30.9% 41.6%
KC1 0.929 1.159 1.388 28.4% 29.8%
KC2 0.803 0.992 1.180 29.9% 18.5%
KC3 0.592 0.700 0.808 29.6% 23.0%
CM1 0.720 0.863 1.007 32.1% 30.1%
MW1 0.631 0.729 0.827 36.6% 25.4%
PC1 0.629 0.731 0.833 34.6% 29.7%
Ȳ.3 0.821 1.002 1.182 31.7% 28.3%
S.3 0.287 0.390 0.493 5.3% 8.4%
n3 = 119
4.4 Statistical analysis
The one-way ANOVA model is selected to be the underlying model of the experiment,
while the four populations are the four classification scenarios. Table 5 presents the sample
sizes (nj) of the related four population samples. Notice that the sample size of the group
related to the multi-learner multi-dataset classification scenario is only equal to 7. Con-
sequently, a relatively large confidence interval is expected for the 4th group. Similar to
(Khoshgoftaar et al. 2006), three ANOVA models are successively built based on the
response variable NECM at cost ratios 15, 20, and 25.
4.4.1 Preliminary transformation
It is important to validate the assumption of independence, normality, and homogeneity of
variances before building ANOVA models. Figure 2 presents a normal quantile plot of the
error data (Jain 1991) when the cost ratio (c) is equal to 20. The shorter tail on one end and
a longer tail on the other are characteristics of an asymmetric distribution. Several tests
such as Fmax, L-Levene, W-Shapiro (Berenson et al. 1983) indicate that the previously
mentioned assumptions do not hold. The same conclusions are reached for cost ratios 15
and 25.
An inverse transformation is hence used to stabilize the variance. In the remainder of
this paper, a star (*) will indicate that the data is transformed using an inverse transfor-
mation. Figure 3 presents the quantile-quantile plot, after the transformation and when
c = 20. Statistical tests indicate that the assumptions related to the ANOVA model now
hold. This transformation is also successful in stabilizing the variance and normalizing the
samples at cost ratios 15 and 25.
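The inverse (reciprocal) transformation stabilizes variance when a group's spread grows roughly with the square of its mean; a self-contained sketch on synthetic groups (not the study's NECM samples) illustrating the effect:

```python
import random
import statistics

rng = random.Random(7)

def cost_group(mean, n=400):
    """Synthetic cost samples whose spread grows with the mean
    (sd proportional to mean**2), the situation the reciprocal fixes."""
    return [mean + mean ** 2 * rng.gauss(0, 0.08) for _ in range(n)]

groups = [cost_group(m) for m in (0.8, 1.2, 2.0)]

for g in groups:
    transformed = [1.0 / y for y in g]  # the inverse transformation
    print(round(statistics.stdev(g), 3), round(statistics.stdev(transformed), 3))
```

The raw spreads differ by roughly a factor of six across the three groups, while the transformed spreads are approximately equal, which is exactly the homogeneity of variance the ANOVA model requires.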
4.4.2 ANOVA models
Due to space restrictions, only the ANOVA model for cost ratio 20 is
presented in Table 10. DF, SS, and MS refer to the degrees of freedom, the sum of squares,
and the mean squares, respectively. The F-value is selected at 90% confidence level (i.e.,
a = 0.10). The p-value related to the F-test is also provided. Since the p-value is well
below the significance level a, we can conclude that there is a significant difference in the
Table 9 Quality-of-test of classification scenario 4 (m × n)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.477 1.906 2.334 23.5% 45.0%
KC1 0.996 1.257 1.518 25.1% 33.8%
KC2 0.769 0.962 1.154 24.2% 18.9%
KC3 0.624 0.755 0.886 25.5% 27.9%
CM1 0.776 0.954 1.133 26.7% 37.5%
MW1 0.625 0.737 0.849 31.5% 29.0%
PC1 0.584 0.692 0.800 27.7% 31.6%
Ȳ.4 0.836 1.038 1.239 26.3% 32.0%
S.4 0.315 0.429 0.542 2.7% 8.2%
n4 = 7
means of the four groups. Figure 4 plots a visual representation of the ANOVA model
presented in Table 10. We observe that the average predictive performance of the base
classifiers (Scenario 1, Ȳ*.1) is lower than the average of the four combined sample
populations (Ȳ*..).
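The quantities reported in Table 10 come from the standard one-way ANOVA decomposition; a self-contained sketch on small toy groups (the transformed NECM samples themselves are not reproduced here):

```python
import statistics

def one_way_anova(groups):
    """Return (SS_among, SS_within, F) for a one-way ANOVA layout,
    as in Table 10 (DF among = k-1, DF within = N-k)."""
    all_obs = [y for g in groups for y in g]
    grand_mean = statistics.fmean(all_obs)
    group_means = [statistics.fmean(g) for g in groups]
    ss_among = sum(len(g) * (m - grand_mean) ** 2
                   for g, m in zip(groups, group_means))
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)
    df_among = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f_stat = (ss_among / df_among) / (ss_within / df_within)
    return ss_among, ss_within, f_stat

# Toy unbalanced groups, mimicking the unequal sample sizes of Table 5:
g1 = [1.0, 1.2, 1.4, 1.1, 1.3, 1.2]
g2 = [0.9, 1.0, 1.1]
g3 = [0.7, 0.8, 0.9]
ss_a, ss_w, f_stat = one_way_anova([g1, g2, g3])
print(round(ss_a, 3), round(ss_w, 3), round(f_stat, 2))  # 0.33 0.14 10.61
```

A large F (well above the critical value for the chosen α) leads, as in Table 10, to rejecting the hypothesis of equal group means.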
4.4.3 Contrasts
Scheffé's S method is employed to determine which differences among the means are in fact
significant (Chen et al. 1997). The first part of Table 11 (predictive performances) presents
the six possible pairwise comparisons between the four groups at different cost ratios, c.
For example, L1 represents the difference between the means of the first and second
classification scenarios (1 × 1 and m × 1, respectively).¹ The estimated contrast (L̂1 = Ȳ*.1 − Ȳ*.2) is equal to -0.052 at cost ratio 20. A negative number indicates that the sample
average of the first group is lower than the sample average of the second group. Conse-
quently, the average cost of misclassification of group 1 is higher than in group 2, because
of the inverse transformation procedure. Hence, group 2 performs better than group 1 on
average. The p-value represents the probability that the two means are, in fact, not
Fig. 2 Normal quantile-quantile plot for the error data before transformation (c = 20)
Fig. 3 Normal quantile-quantile plot for the error data after transformation (c = 20)
¹ μ.j and μ*.j are used interchangeably throughout the paper.
significantly different (i.e., μ.1 = μ.2). Since the p-value = 0.812 for the first contrast, we conclude that,
on average, the predicted performances of the multi-learner single-dataset classifiers are
not significantly better than those of the base classifiers.
This conclusion corroborates our previous study (Khoshgoftaar and Seliya 2004). The
quality of software measurement data plays a critical role in the accuracy and the use-
fulness of classification models. The practitioner should not expect any significant
improvement in the accuracy of the learners if the information in the training dataset is
limited. A multi-learner system should only be justified if the combined decisions are
better than those of any single classifier in the system (Ho et al. 1994). Majority voting
remains, however, a simplistic combination technique. Future work should implement
more sophisticated combination techniques, as opposed to a simple majority voting, in
order to assess the usefulness of a classification technique based on the output of different
learners induced on the same fit dataset.
Contrasts L3, L5, and L6 indicate that the multi-learner multi-dataset approach is not
significantly different from any of the other classification scenarios. The confidence intervals of
these contrasts are too wide to conclude any significant difference because the sample size
of group 4 is small (n4 = 7). The only significant observation in Table 11 is the com-
parison between the single-learner single-dataset classification scenario and the single-
learner multi-dataset classification scenario. L2 is significant (see p-values in bold) at cost
ratios 15, 20, and 25. This contrast shows that, on average, a single-learner multi-dataset
classifier produces better predictive results (i.e., lower misclassification cost) than a single-
learner single-dataset system. It demonstrates that the use of several training datasets
leverages the amount of information available for knowledge discovery. In other words,
data mining more relevant information leads to better and more robust predictive models.
Table 10 ANOVA table comparing performance among the four scenarios (c = 20)
Variation DF SS MS F F-Critical p-Value
Among groups 3 2.789 0.930 8.204 2.154 0.000
Within groups 878 99.485 0.113
Total 881 102.274
Fig. 4 Visual representation of the ANOVA model (c = 20)
Software Qual J (2009) 17:25–49 41
123
It is also practical to know whether there is a significant difference between the clas-
sification scenarios using multiple datasets and the scenarios using a single dataset. The
contrast is defined as follows:
L7 = (μ.1 + μ.2)/2 − (μ.3 + μ.4)/2    (6)
Similarly, it is worth assessing the contrast between classifiers using multiple learners
and using a single learner:
L8 = (μ.1 + μ.3)/2 − (μ.2 + μ.4)/2    (7)
Neither L7 nor L8 is significant, however, owing to the unbalanced sample sizes.
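Because all contrasts are linear in the group means, the composite contrasts are determined by the pairwise ones; a quick consistency check against the c = 20 column of Table 11:

```python
# Pairwise contrast estimates at cost ratio c = 20, transcribed from Table 11
L1, L2, L3 = -0.052, -0.162, -0.120   # mu.1-mu.2, mu.1-mu.3, mu.1-mu.4
L4, L5, L6 = -0.110, -0.068, 0.043    # mu.2-mu.3, mu.2-mu.4, mu.3-mu.4

# L7 = (mu.1+mu.2)/2 - (mu.3+mu.4)/2 expands to (L2 + L3 + L4 + L5) / 4
L7 = (L2 + L3 + L4 + L5) / 4
# L8 = (mu.1+mu.3)/2 - (mu.2+mu.4)/2 expands to (L1 + L6) / 2
L8 = (L1 + L6) / 2

print(f"L7 = {L7:.3f}, L8 = {L8:.4f}")
```

The recovered values, -0.115 and -0.0045, match Table 11's reported L7 and L8 estimates at c = 20 (the latter rounded to -0.005).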
4.5 Comparison with the merging approach
We also evaluated the alternate merging approach (discussed in Sect. 3.2), in which six of
the seven datasets were successively merged into one fit dataset while the remaining dataset
was used as the test dataset. Prior to building the respective models, two additional independent
variables (or software attributes) were added to the respective fit and test datasets. The first
variable indicated the size of the (one of seven) software project that a program module
belonged to. We categorized the seven datasets into small, medium, and large sizes based
on the number of modules in each dataset. The second variable is a Boolean metric
representing whether or not an instance belonged to a dataset of an object-oriented soft-
ware system.
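Attaching the two auxiliary attributes is mechanical; a sketch using the module counts from Table 1 (the small/medium/large cut-offs below are hypothetical, since the paper does not report its exact thresholds):

```python
# Module counts (Table 1) and implementation language per project
projects = {
    "JM1": (8850, "C"),   "KC1": (2107, "C++"), "KC2": (520, "C++"),
    "KC3": (458, "Java"), "CM1": (505, "C"),    "MW1": (403, "C"),
    "PC1": (1107, "C"),
}

def size_category(n_modules):
    # Hypothetical cut-offs: the paper categorizes projects by module
    # count but does not report its exact small/medium/large thresholds.
    if n_modules < 600:
        return "small"
    if n_modules < 2500:
        return "medium"
    return "large"

def augment(module_row, project):
    """Append the two extra attributes used when merging fit datasets:
    project-size category and an object-oriented (C++/Java) flag."""
    n_modules, language = projects[project]
    return module_row + [size_category(n_modules), language in ("C++", "Java")]

print(augment([42, 3, 1.7], "KC3"))  # [42, 3, 1.7, 'small', True]
```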
The modeling and prediction results for the merging approach are not presented due to
space considerations. However, we do summarize the findings of the comparative study in
this section. When compared to our studies using a single learner, i.e., Scenario 1 and
Scenario 3, it was observed that the multi-dataset approach of Scenario 3 yielded signif-
icantly better results (p-value = 0.0000) than the merging approach. In addition, the
single-learner single-dataset approach of Scenario 1 also performed better than the merging
approach; however, the improvement was not significant at the 5% level. In summary, the merging
approach did not provide better results than the other multi-dataset approach investigated
in our study.
Table 11 Contrasts among the four classification scenarios
Contrast                              Estimate (L̂l)             p-Value
                                      15      20      25        15     20     25
L1 = μ.1 − μ.2                        -0.073  -0.052  -0.039    0.671  0.812  0.890
L2 = μ.1 − μ.3                        -0.177  -0.162  -0.148    0.000  0.000  0.000
L3 = μ.1 − μ.4                        -0.153  -0.120  -0.097    0.759  0.831  0.875
L4 = μ.2 − μ.3                        -0.104  -0.110  -0.109    0.494  0.350  0.277
L5 = μ.2 − μ.4                        -0.079  -0.068  -0.058    0.965  0.970  0.975
L6 = μ.3 − μ.4                        0.024   0.043   0.051     0.999  0.991  0.980
L7 = (μ.1 + μ.2)/2 − (μ.3 + μ.4)/2    -0.128  -0.115  -0.103    0.444  0.454  0.469
L8 = (μ.1 + μ.3)/2 − (μ.2 + μ.4)/2    -0.025  -0.005  0.006     0.992  1.000  1.000
4.6 Threats to empirical validity
Due to the many human factors that affect software development, and consequently
software quality, controlled experiments for evaluating the usefulness of empirical models
are not practical. We adopted a case study approach in the empirical investigations pre-
sented in this study. To be credible, the software engineering community demands that the
subject of an empirical study have the following characteristics (Wohlin et al. 2000):
• Developed by a group, and not by an individual
• Be as large as industry size projects, and not a toy problem
• Developed by professionals, and not by students
• Developed in an industry/government organization setting, and not in a laboratory
We note that our case studies fulfill all of the above criteria. The software systems
investigated in our study were developed by professionals in a government software
development organization. In addition, each system was developed to address a real-world
problem.
Empirical studies that evaluate measurements and models across multiple projects
should take care in assessing the scope and impact of their analyses and conclusions. For
example, C and C++ projects may be considered similar if they are developed using the
procedural paradigm. Combining object-oriented project data (e.g., Java) with non-object-
oriented project data (e.g., C or C++) needs careful consideration. For example,
per-module lines of code of object-oriented software tend to be lower than those of
non-object-oriented software.
The proposed process of combining multiple learners and datasets for software quality
analysis included four C projects, two C++ projects and one Java project. In the
alternate classification approach of merging all datasets, we introduced two additional
metrics: one to capture dataset size variation and another to capture whether a module
belongs to an OO project. The average Loc Total of the C projects was relatively
similar to that of the C++ projects. The average Loc Total for the Java project, while
slightly lower, was still comparable. All datasets were normalized and scaled to
account for variation in dataset size. In addition to commonality of development orga-
nization and application domain, all projects were characterized by the same set of
metrics. Such similarity among the projects was considered sufficient for the primary
scope of our study, i.e., improving software quality analysis by evaluating multiple
software project repositories.
5 Conclusion
This paper presented an empirical study where seventeen classification models were
induced on seven NASA software projects. To our knowledge, this empirical work is one
of the largest in terms of both scale and scope: 119 (17 × 7) base classification models
were built, and more than 700 vectors of base estimates were generated.
Four classification scenarios were investigated. The first scenario applies the more
classical approach: training one classifier with a single fit dataset and predicting the test
dataset. The second approach is a popular method in data mining: a classifier is built based
on the prediction of multiple learners induced on the same dataset. The third approach
consists of using the prediction of the same learner induced on multiple fit datasets (multi-
dataset classifier). Finally, the most generic approach combines the predictions of multiple
learners built on multiple fit datasets and applied on the dataset we want to predict. Such a
technique is referred to as multi-learner multi-dataset classifier.
It is shown that, on average, the single-learner multi-dataset classifiers (Scenario 3)
perform significantly better than the corresponding single-learner single-dataset
classifiers (Scenario 1). We also demonstrated that there is no significant difference between the
predictions of the base classifiers and the multi-learner single-dataset classifiers.
These two conclusions are critical for complex data mining problems such as software
quality classification. The practitioner should not rely solely on sophisticated and/or robust
algorithms to generate accurate predictions. When the information is either noisy or limited,
the practitioner should dedicate resources toward more information gathering. In the case of
software quality engineering and as demonstrated by this study, it is advised to use software
data (software metrics) repositories of existing projects similar to the current project (for
which prediction is needed) in terms of quality requirement and application domain.
Future work can investigate the use of learners trained on different representations of
the same input (i.e., multi-representation). For example, it may be worthwhile to
investigate whether the inclusion of object-oriented metrics and other measurements, such as
software process metrics, improves the performance of the four different classification
scenarios. Another research direction would consist of studying more sophisticated voting
schemes, such as weighted voting, cascading, or stacked generalization. Such
schemes may improve the predictive accuracy of the multi-learner multi-dataset classifier.
Acknowledgments We thank the three reviewers and the Associate Editor for their constructive critique and comments. We also thank the various members of the Empirical Software Engineering Laboratory and Data Mining and Machine Learning Laboratory at Florida Atlantic University for their reviews of this paper. We are grateful to the staff of the NASA Metrics Data Program for making the software measurement data available.
Appendix
This section presents a brief description of the seventeen classifiers trained on the seven
software measurement datasets. The first fifteen are part of the WEKA data mining tool
which is publicly available (Witten and Frank 2000).
J48 is WEKA’s implementation of C4.5, the decision tree algorithm introduced by
Quinlan (1993). The C4.5 algorithm is an inductive supervised learning system which
employs decision trees to represent the underlying structure of the input data. The algo-
rithm is comprised of four principal components for constructing and evaluating the
classification tree models: decision tree generator, production rule generator, decision tree
interpreter, and production rule interpreter. The algorithm is known to be one of the most
robust induction learning algorithms available (Khoshgoftaar et al. 2000).
JRip is WEKA’s implementation of the rule-based learning algorithm, RIPPER
(Repeated Incremental Pruning to Produce Error Reduction)—a modification of the IREP
(Incremental Reduced Error Pruning). RIPPER was proposed by Cohen (1995), and was
shown to compare favorably with C4.5. Both RIPPER and C4.5 rules start with an initial
model and iteratively improve it using heuristic techniques. However, for large noisy
datasets, the former generally seems to start with an initial model that is about the right
size, while the latter starts with an extremely large initial model. This means that RIPPER
is more search-efficient.
Naive Bayes is one of the most simple techniques available for classification. Based on
the Bayes rule of conditional probability, this technique "naively" assumes that attributes are
independent of each other given the class, which may not be completely true in the real world
(Witten and Frank 2000). Despite the over-simplification of the actual relationship between
the attributes, Naive Bayes has been shown to perform fairly well, especially when used
along with feature selection techniques that remove irrelevant attributes (Frank et al. 2000).
Decision Table is one of the simplest methods for learning from input data. The rules
learned from the input data have the same form as the input: a decision table, i.e., a
list of rules in table format. The problem of constructing a decision table involves the
selection of appropriate attributes for inclusion, and getting rid of irrelevant attributes.
When determining the class of a test instance, all one has to do is look up the
appropriate conditions in the list of rules, i.e., the decision table. Decision tables are
appealing in real-time environments, since they provide constant classification time on average
(Kohavi 1995).
Random Forest is a classifier consisting of a collection of tree-structured classifiers
(Witten and Frank 2000). The Random Forest classifies a new object from an input vector
by running the vector down each tree in the forest. Each tree casts a unit vote for the
input vector by giving a classification. The forest selects the classification having the most
votes over all the trees.
OneR, introduced by Holte (1993), is one of the simplest algorithms available in
machine learning. Despite its simplicity, it compares favorably to many state-of-the-art machine learning techniques. It chooses the single most informative attribute and bases
the rule on this attribute alone. In practice, simple rules often achieve surprisingly high
accuracy, which could be attributed to the underlying rudimentary structure of many real-
world datasets (Witten and Frank 2000).
PART is a simple, yet surprisingly effective method for learning decision lists based on
the repeated generation of partial decision trees in a divide-and-conquer manner (Frank and
Witten 1998). It builds a rule, removes the instances covered by the rule, and continues
creating rules recursively for the remaining instances until none are left (Frank and Witten
1998). PART, unlike the two dominant practical implementations of rule learners, C4.5
and Ripper, avoids the time consuming phase of post-processing for global optimization,
maintaining comparable classification accuracy.
Instance-Based Learning (IB1) is a popular classification scheme. The working
hypothesis of the technique is that the program module under examination (i.e., test case)
would belong to the same class as that of other similar instances. Different instance-based
learning algorithms vary in the context of the selected number of nearest neighbors, the
measures used to compute similarity between instances, and the solution algorithm for
predicting the class of a test instance. WEKA’s implementation of one instance-based
classifier, IB1, uses only one nearest neighbor to predict the class of a test instance. In this
study, the similarity measure used is Euclidean distance.
IBk is WEKA’s implementation of an instance-based learning technique with k nearest
neighbors. Selecting only one nearest neighbor to predict the class of a test instance,
especially in the presence of noise, may lead to increased inaccuracy (Witten and Frank
2000). In IBk, the class of the test case is predicted by majority voting of the k nearest
neighbors (Emam et al. 2001). Like IB1, the similarity measure used to determine the
nearest neighbors is Euclidean distance.
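The IBk idea can be rendered in a few lines (Euclidean distance, majority vote among the k nearest fit instances); this is an illustrative sketch, not WEKA's implementation:

```python
import math
from collections import Counter

def ibk_predict(fit_x, fit_y, query, k=3):
    """Classify `query` by majority vote of its k nearest neighbours in
    the fit data under Euclidean distance."""
    ranked = sorted(zip(fit_x, fit_y), key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Two clusters of modules; the query sits in the fault-prone cluster:
fit_x = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
fit_y = ["nfp", "nfp", "fp", "fp", "fp"]
print(ibk_predict(fit_x, fit_y, (7.5, 8.0), k=3))  # fp
```

Setting k = 1 reduces this to the IB1 behaviour described above.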
Alternating Decision Trees (ADTree) is a relatively new machine learning technique
proposed by Freund and Mason (1999). It combines the power of boosting and decision
trees in a very simple manner generalizing decision trees, voted decision trees, and voted
decision stumps. Since ADTree has alternating layers of decision nodes and prediction
nodes in its tree structure, it is called Alternating Decision Tree.
RIpple-DOwn Rule (Ridor) was introduced by Compton and Jansen (1990) as a
methodology for acquisition and maintenance of large rule-based systems. The basic idea
behind the technique is to make incremental changes while constructing and maintaining a
complex knowledge structure in a well-defined and restricted manner such that the effects
of the changes do not propagate globally and are well confined in the structure, unlike
standard production rules.
Locally Weighted Learning (LWL) is a non-adaptive lazy-learning technique that is
gaining popularity in the machine learning community. Local weighting reduces unnec-
essary bias of global function fitting, and gives more flexibility, retaining the desirable
properties such as smoothness and statistical analyzability (Atkeson et al. 1997). There are
mainly three elements for Locally Weighted Learning: distance function, separable cri-
terion, and sufficient data. LWL uses locally weighted training to combine training data,
using a distance function to fit a surface to nearby points. It must be used in conjunction
with another classifier to perform classification. In this study, Decision Stump is combined
with LWL (LWLStump).
Sequential Minimal Optimization (SMO), proposed by Platt (1998), is a conceptually simple but subtle algorithm for training support vector machines. Training a support vector machine requires solving a very large quadratic programming (QP) optimization problem. Instead of the traditional, time-consuming numerical QP optimization, SMO breaks the large QP problem into the smallest possible QP sub-problems, which are solved analytically at each step, using Osuna's theorem to ensure convergence. SMO is highly scalable, as its memory requirement grows linearly with the size of the training dataset.
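The smallest possible sub-problem involves just two Lagrange multipliers, and its closed-form solution is what makes the analytic step possible. In standard SMO notation (a sketch: $E_i$ is the prediction error on example $i$, $K$ the kernel, and $[L, H]$ the bounds implied by $0 \le \alpha \le C$ and the linear equality constraint):

```latex
\begin{aligned}
\eta &= K(x_1, x_1) + K(x_2, x_2) - 2\,K(x_1, x_2), \\
\alpha_2^{\mathrm{new}} &= \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta}, \qquad
\alpha_2^{\mathrm{clipped}} = \min\!\bigl(H,\ \max(L,\ \alpha_2^{\mathrm{new}})\bigr), \\
\alpha_1^{\mathrm{new}} &= \alpha_1 + y_1 y_2\,\bigl(\alpha_2 - \alpha_2^{\mathrm{clipped}}\bigr).
\end{aligned}
```

Each iteration updates only these two multipliers, so no QP library and no large matrix ever need to be held in memory.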
Bagging (BAG) combines classifiers by randomly re-sampling from the original training dataset, building a classifier for each re-sampled dataset, and using the prediction of each classifier in a simple vote to obtain the combined decision on the test data (Breiman 1996). The combined decision, or final hypothesis, for classification is obtained using an unweighted vote. Typically, an unstable or weak learner is used as the base classifier, so that small changes in the training data yield significantly different models.
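A minimal sketch of this procedure, using a decision stump as the weak base learner on one-dimensional data, might look as follows. The dataset, ensemble size, and stump learner are illustrative assumptions, not the study's setup.

```python
# Minimal bagging sketch: bootstrap-resample the training set, fit a
# decision stump per resample, and combine by unweighted majority vote.
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold/orientation with the fewest training errors."""
    labels = sorted({lab for _, lab in sample})
    if len(labels) == 1:  # degenerate bootstrap: constant classifier
        return lambda x, lab=labels[0]: lab
    best_err, best = float("inf"), None
    for t in sorted(x for x, _ in sample):
        for lo, hi in ((labels[0], labels[1]), (labels[1], labels[0])):
            err = sum((lo if x <= t else hi) != lab for x, lab in sample)
            if err < best_err:
                best_err, best = err, (t, lo, hi)
    t, lo, hi = best
    return lambda x: lo if x <= t else hi

def bagging(train, n_estimators=25, seed=0):
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(train) for _ in train])
              for _ in range(n_estimators)]
    def predict(x):  # unweighted majority vote over the ensemble
        return Counter(s(x) for s in stumps).most_common(1)[0][0]
    return predict
```

Because each stump sees a different bootstrap sample, the instability of the base learner produces diverse votes, which is what the unweighted majority exploits.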
Lines-of-Code (LOC) is one of the most commonly used software measures for representing the complexity of software program modules. In our study, the modules were first sorted in ascending order of their LOC. The underlying assumption is that a larger program module is likely to have more software faults than a relatively smaller one. Given a specific threshold value of lines of code, modules with LOC below the threshold are predicted as nfp, and the rest as fp. The threshold value is varied until the preferred balance of equality between the Type I and Type II error rates is obtained.
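The threshold sweep described above can be sketched directly; the module data below are hypothetical, and equality of the two error rates is approximated by minimizing their gap.

```python
# Sketch of the LOC-threshold classifier: sweep candidate thresholds and
# keep the one where Type I (nfp predicted fp) and Type II (fp predicted
# nfp) error rates are most nearly equal.

def pick_loc_threshold(modules):
    """modules: list of (loc, actual) with actual in {'nfp', 'fp'}."""
    nfp = [loc for loc, lab in modules if lab == "nfp"]
    fp = [loc for loc, lab in modules if lab == "fp"]
    best_gap, best_t = float("inf"), None
    for t in sorted(loc for loc, _ in modules):
        # Type I: nfp modules wrongly predicted fp (loc >= t).
        type1 = sum(loc >= t for loc in nfp) / len(nfp)
        # Type II: fp modules wrongly predicted nfp (loc < t).
        type2 = sum(loc < t for loc in fp) / len(fp)
        if abs(type1 - type2) < best_gap:
            best_gap, best_t = abs(type1 - type2), t
    return best_t

def predict(loc, threshold):
    return "nfp" if loc < threshold else "fp"
```

In practice the preferred balance need not be exact equality; a project could weight the two rates differently by replacing the gap criterion with a cost-weighted one.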
Logistic Regression (LR) is a statistical modeling technique that offers good model interpretability. Logistic Regression suits software quality modeling because most software engineering measures have a monotonic relationship with the faults inherent in the underlying software development processes (Fenton and Pfleeger 1997; Khoshgoftaar and Allen 1999).
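As a hedged sketch of the idea, a minimal one-feature logistic regression can be fitted by stochastic gradient ascent on the log-likelihood, mapping a single software metric to the probability that a module is fp. The data, learning rate, and epoch count below are illustrative, not the study's actual model.

```python
# Minimal one-feature logistic regression fitted by stochastic gradient
# ascent: p(fp | x) = sigmoid(w*x + b). Monotonic in the metric x,
# matching the assumption discussed above.
import math

def fit_logistic(data, lr=0.1, epochs=2000):
    """data: list of (x, y) with y = 1 for fp, 0 for nfp."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x  # gradient ascent on log-likelihood
            b += lr * (y - p)
    return w, b

def prob_fp(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The fitted coefficient w is directly interpretable: its sign and magnitude state how the odds of a fault-prone module change per unit of the metric, which is the interpretability advantage noted above.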
Tree Disc (TD) is a SAS macro implementation of the modified CHAID algorithm (Khoshgoftaar et al. 2000). It constructs a decision tree that predicts a specified categorical dependent variable from one or more predictor (independent) variables. The decision tree is computed by recursively partitioning the dataset into two or more subsets of observations, based on the categories of one of the predictor variables, until some stopping criterion is met. At each partition, the predictor variable most significantly associated with the dependent variable is selected; the level of association is measured by a chi-squared test of independence.
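The predictor-selection step can be sketched as computing a chi-squared statistic of independence between each categorical predictor and the class, then splitting on the most strongly associated one. The toy dataset and field names below are hypothetical; for simplicity the sketch compares raw statistics rather than p-values with degrees of freedom, as a full CHAID implementation would.

```python
# Sketch of TD's predictor selection: chi-squared statistic of
# independence between each categorical predictor and the class label;
# the predictor with the largest statistic is chosen for the split.
from collections import Counter

def chi_squared(pairs):
    """pairs: list of (predictor_category, class_label) tuples."""
    n = len(pairs)
    row = Counter(p for p, _ in pairs)   # predictor-category totals
    col = Counter(c for _, c in pairs)   # class-label totals
    cell = Counter(pairs)                # observed contingency counts
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat

def best_predictor(dataset, predictors, target):
    """dataset: list of dicts; returns the most associated predictor."""
    return max(predictors,
               key=lambda p: chi_squared([(row[p], row[target])
                                          for row in dataset]))
```

A perfectly associated predictor maximizes the statistic, while one independent of the class scores zero, so the `max` picks the split variable the paragraph describes.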
References
Alpaydin, E. (1997). Voting over multiple condensed nearest neighbors. Artificial Intelligence Review, 11(1–5), 115–132.
Alpaydin, E. (1998). Techniques for combining multiple learners. In E. Alpaydin (Ed.), Proceedings of engineering of intelligent systems conference (Vol. 2, pp. 6–12). ICSC Press.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5), 11–73.
Berenson, M. L., Levine, D. M., & Goldstein, M. (1983). Intermediate statistical methods and applications. Prentice-Hall.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Chen, K., Wang, L., & Chi, H. (1997). Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification. International Journal of Pattern Recognition and Artificial Intelligence, 11(3), 417–445.
Cohen, W. W. (1995). Fast effective rule induction. In A. Prieditis & S. Russell (Eds.), Proceedings of the 12th international conference on machine learning (pp. 115–123). Tahoe City, CA: Morgan Kaufmann.
Compton, P., & Jansen, R. (1990). Knowledge in context: A strategy for expert system maintenance. In C. J. Barter & M. J. Brooks (Eds.), 2nd Australian joint artificial intelligence conference (pp. 292–306). Adelaide, Australia: Springer-Verlag.
Cuadrado-Gallego, J. J., Fernández-Sanz, L., & Sicilia, M. A. (2006). Enhancing input value selection in parametric software cost estimation models through second level cost drivers. Software Quality Journal, 14(4), 330–357.
Emam, K. E., Benlarbi, S., Goel, N., & Rai, S. N. (2001). Comparing case-based reasoning classifiers for predicting high-risk software components. Journal of Systems and Software, 55(3), 301–320. Elsevier Science Publishing.
Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach (2nd ed.). Boston, MA: PWS Publishing.
Frank, E., Trigg, L., Holmes, G., & Witten, I. H. (2000). Naive Bayes for regression. Machine Learning, 41(1), 5–25.
Frank, E., & Witten, I. H. (1998). Generating accurate rule sets without global optimization. In Proceedings of the 15th international conference on machine learning (pp. 144–151). Morgan Kaufmann.
Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In Proceedings of the 16th international conference on machine learning (pp. 124–133). Bled, Slovenia: Morgan Kaufmann.
Gaines, B. R., & Compton, P. (1995). Induction of ripple-down rules applied to modeling large databases. Journal of Intelligent Information Systems, 5(3), 211–228.
Gamberger, D., Lavrac, N., & Dzeroski, S. (1996). Noise elimination in inductive concept learning: A case study in medical diagnosis. In Algorithmic learning theory: Proceedings of the 7th international workshop (Vol. 1160, pp. 199–212). Sydney, Australia: Springer-Verlag.
Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993–1001.
Ho, T. K., Hull, J. J., & Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1).
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–91.
Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. John Wiley & Sons.
Khoshgoftaar, T. M., & Allen, E. B. (1999). Logistic regression modeling of software quality. International Journal of Reliability, Quality, and Safety Engineering, 6(4), 303–317.
Khoshgoftaar, T. M., Joshi, V., & Seliya, N. (2006). Noise elimination with ensemble-classifier noise filtering: Case studies in software engineering. International Journal of Software Engineering and Knowledge Engineering, 16(1), 1–24.
Khoshgoftaar, T. M., & Rebours, P. (2004). Generating multiple noise elimination filters with the ensemble-partitioning filter. In Proceedings of the 2004 IEEE international conference on information reuse and integration (pp. 369–375). Las Vegas, NV, November 2004.
Khoshgoftaar, T. M., & Seliya, N. (2004). The necessity of assuring quality in software measurement data. In Proceedings of the 10th international symposium on software metrics (pp. 119–130). Chicago, IL: IEEE Computer Society Press.
Khoshgoftaar, T. M., Yuan, X., & Allen, E. B. (2000). Balancing misclassification rates in classification tree models of software quality. Empirical Software Engineering, 5, 313–330.
Kohavi, R. (1995). The power of decision tables. In N. Lavrac & S. Wrobel (Eds.), Proceedings of the European conference on machine learning, Lecture Notes in Artificial Intelligence (pp. 174–189). Springer-Verlag.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation and active learning. In Advances in neural information processing systems (pp. 231–238). Cambridge, MA: MIT Press.
Mani, G. (1991). Lowering variance of decisions by using artificial neural network ensembles. Neural Computation, 3, 484–486.
Meir, R. (1995). Bias, variance and the combination of estimators: The case of linear least squares. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Menzies, T., Greenwald, J., & Frank, A. (2007). Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2–13.
Meulen, M. J., & Revilla, M. A. (2007). Correlations between internal software metrics and software dependability in a large population of small C/C++ programs. In Proceedings of the 18th IEEE international symposium on software reliability engineering, ISSRE 2007 (pp. 203–208). Trollhättan, Sweden, November 2007.
Munson, J. C., & Khoshgoftaar, T. M. (1992). The detection of fault-prone programs. IEEE Transactions on Software Engineering, 18(5).
Nikora, A. P., & Munson, J. C. (2003). Understanding the nature of software evolution. In Proceedings of the 2003 international conference on software maintenance (pp. 83–93), September 2003.
Nikora, A. P., & Munson, J. C. (2004). The effects of fault counting methods on fault model quality. In Proceedings of the 28th international computer software and applications conference, COMPSAC 2004 (Vol. 1, pp. 192–201), September 2004.
Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, Redmond, WA, April 1998.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Shepperd, M., & Kadoda, G. (2001). Comparing software prediction techniques using simulation. IEEE Transactions on Software Engineering, 27(11), 1014–1022.
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco, CA: Morgan Kaufmann.
Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Kluwer international series in software engineering. Boston, MA: Kluwer Academic Publishers.
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5(2), 241–259.
Xing, F., Guo, P., & Lyu, M. R. (2005). A novel method for early software quality prediction based on support vector machine. In Proceedings of the 16th international symposium on software reliability engineering (p. 10), November 2005.
Yacoub, S., Lin, X., Simske, S., & Burns, J. (2003). Automating the analysis of voting systems. In 14th international symposium on software reliability engineering (p. 203). Denver, CO, November 2003.
Author Biographies
Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 350 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively, and is the Program Chair of the 20th International Conference on Software Engineering and Knowledge Engineering (2008). He has served on technical program committees of various international conferences, symposia, and workshops. He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.
Pierre Rebours received the M.S. degree in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in April 2004. His research interests include data quality and data mining.
Naeem Seliya is an Assistant Professor of Computer and Information Science at the University of Michigan-Dearborn. He received his Ph.D. in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in 2005. His research interests include software engineering, data mining and machine learning, software measurement, software reliability and quality engineering, software architecture, computer data security, and network intrusion detection. He is a member of the IEEE and the Association for Computing Machinery.