Software quality analysis by combining multiple projects and learners
Taghi M. Khoshgoftaar · Pierre Rebours · Naeem Seliya
Published online: 8 July 2008
© Springer Science+Business Media, LLC 2008
Abstract When building software quality models, the approach often consists of
training data mining learners on a single fit dataset. Typically, this fit dataset contains
software metrics collected during a past release of the software project that we want to
predict the quality of. In order to improve the predictive accuracy of such quality
models, it is common practice to combine the predictive results of multiple learners to
take advantage of their respective biases. Although multi-learner classifiers have been
proven to be successful in some cases, the improvement is not always significant because
the information in the fit dataset can sometimes be insufficient. We present an innovative
method to build software quality models using majority voting to combine the predic-
tions of multiple learners induced on multiple training datasets. To our knowledge, no
previous study in software quality has attempted to take advantage of multiple software
project data repositories which are generally spread across the organization. In a large
scale empirical study involving seven real-world datasets and seventeen learners, we
show that, on average, combining the predictions of one learner trained on multiple
datasets significantly improves the predictive performance compared to one learner
induced on a single fit dataset. We also demonstrate empirically that combining multiple
learners trained on a single training dataset does not significantly improve the average
predictive accuracy compared to the use of a single learner induced on a single fit
dataset.
T. M. Khoshgoftaar (corresponding author) · P. Rebours
Computer Science and Engineering, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA
e-mail: [email protected]

P. Rebours
e-mail: [email protected]

N. Seliya
Computer and Information Science, University of Michigan-Dearborn, 4901 Evergreen Road, Dearborn, MI 48128, USA
e-mail: [email protected]
Software Qual J (2009) 17:25–49
DOI 10.1007/s11219-008-9058-3
Keywords Multiple software metrics repositories · Software quality classification
model · Multiple learners · Cost of misclassification · Data mining · Majority voting
1 Introduction
In the context of software quality, program modules are typically labeled as either fault-
prone (fp) or not fault-prone (nfp) (Fenton and Pfleeger 1997; Khoshgoftaar et al. 2000).
The need for reliable and high quality products often leads software managers to use
Software Quality Classification Models (SQCMs), which allow them to direct improve-
ment efforts to software modules with higher risk. Such models are designed to identify,
prior to deployment, software modules that are likely to be fault-prone during operations.
Hence, a cost-effective utilization of resources can be implemented for software testing,
inspection, and quality enhancement of these modules. SQCMs are often based on
inductive learning algorithms which generalize the concepts learned from a set of training
instances (i.e., fit dataset) and apply these concepts to the currently under-development
instances (i.e., test dataset). Typically, the fit dataset is made up of modules (i.e., instances)
related to the past release or to a very similar project.
During the early usage of SQCMs, practitioners used to rely on a single learning
algorithm. As for almost any data mining application area, there are a number of classi-
fication algorithms developed from different theories and methodologies, such as case-
based reasoning (Emam et al. 2001), logistic regression (Khoshgoftaar and Allen 1999),
support vector machines (Xing et al. 2005), naive bayes (Menzies et al. 2007), and clas-
sification trees (Khoshgoftaar et al. 2000). Typically, many of these algorithms are trained
and the practitioner chooses the one that performs the best in cross-validation or on a
separate validation set. However, each paradigm dictates a certain model that comes with a
set of assumptions which may lead to a high bias if the assumptions do not hold, especially
if the training dataset is noisy (Khoshgoftaar et al. 2006; Khoshgoftaar and Seliya 2004).
Consequently, it has become common practice to combine the predictions of several
learners to achieve a higher reliability of the SQCMs. This approach is often referred to as
a multi-expert or multi-learner system (Gamberger et al. 1996) because base models
generated by machine learning can be regarded as experts (Witten and Frank 2000).
A multi-learner system can be an effective solution because it combines complementary
classification procedures. Unfortunately, even with appropriate biases, a multi-learner
classifier may remain inefficient if the base learners are trained on a dataset with limited
amount of information. Therefore, we believe that increasing the amount of information
available for knowledge discovery is the key to building adaptable and robust software
quality models (Witten and Frank 2000), and to ultimately improve the generalization
accuracy of such models.
It is very common nowadays for an organization to maintain several software metrics
repositories for each undertaken project (Fenton and Pfleeger 1997; Meulen et al. 2007;
Nikora and Munson 2003). The data in these repositories are likely to follow similar patterns,
especially if the organization enforces the same development life cycle, as well as the same
coding and testing practices. Consequently, in this paper, we propose to use all available past
projects to build SQCMs because we believe that enlarging the set of available software
modules for training will improve the predictive accuracy of the final software quality
model. In other words, we believe that we can detect more accurately which software
modules are faulty by increasing the volume and variety of data to learn from.
We describe a new multi-learner classifier which combines the predictions of mul-
tiple learners successively trained on multiple datasets. The combination function of this
multi-learner multi-dataset classifier is majority voting. This approach is innovative
because it has never been attempted in the context of software quality engineering.
Moreover, this study remains practical because it provides a simple way for the prac-
titioner to create better SQCMs by simply leveraging the existing information spread
across datasets. Four classification scenarios are then evaluated against each other.
These classification scenarios predict the instances in the test dataset based on the
respective predictions of a single learner induced on a single fit dataset (Scenario 1),
multiple learners induced on a single dataset (Scenario 2), a single learner induced on
multiple datasets (Scenario 3), and multiple learners induced on multiple datasets
(Scenario 4).
The performances of these classification scenarios are assessed by a large-scale
empirical study using seven real-world software metrics repositories and seventeen
well-proven learners. To the best of our knowledge, this study is unique in both its
scale and its application domain. We demonstrate that, on average, using one learner
induced on multiple fit datasets (Scenario 3) achieves better cost-wise performance
than using a single learner trained on a single dataset (Scenario 1). We also show that
relying solely on multiple learners (Scenario 2) does not significantly improve the
performance compared to the use of a single classifier induced on a single dataset
(Scenario 1).
The rest of the paper is organized as follows. In Sect. 2, we present related research.
Section 3 describes the implementation of the multi-learner multi-dataset classifier. We
also describe the modeling methodology involved in our empirical investigation. Section 4
presents the details of the experimental study and discusses the results. Finally, we draw
some useful conclusions from the empirical results in Sect. 5.
2 Related work
Voting is a well-known technique to combine the decisions of peer experts. Voting
techniques include majority voting, weighted voting, plurality voting, instance runoff
voting, and threshold voting (Yacoub et al. 2003). These techniques are derived from a
more general technique referred to as weighted k-out-of-n systems. The predictions of the
base classifiers can also be combined by hybrid approaches (Alpaydin 1997; Ho et al.
1994; Yacoub et al. 2003). A hybrid approach usually takes advantage of the strengths of
individual classifiers and avoids their weaknesses (Ho et al. 1994). Stacked generalization
extends voting in the sense that the outputs of the learners are incorporated through a
combiner system which is itself trained (Wolpert 1992).
In software quality modeling, there are often numerous types of features which can be
used to represent and identify fault-prone modules (Fenton and Pfleeger 1997). As a
consequence, it is technically difficult for a single classifier to make use of all the
features. Ho et al. (1994) and Chen et al. (1997) observed that features and classifiers of
different types complement one another in classification performance. However, the
problem of combining potentially conflicting decisions by multiple classifiers still
remains unresolved. As noted by Alpaydin (1998), combination approaches can be
divided into two groups. In a uni-representation, all learners use the same representation.
Therefore, the learners should be different in order to obtain different decisions. In a
multi-representation, a single learner can use different representations of the same input.
For example, in software quality models, software metrics can represent product,
resource, or process attributes of the modules (Fenton and Pfleeger 1997).
Krogh and Vedelsby (1995) defined ambiguity as the variation of the output of voters
averaged over unlabeled data to quantify the disagreement among the voters. In the
context of multiple neural networks, they showed that the ambiguity needs to be
maximized for minimal error. If the voters are strongly biased, the ambiguity will be
small because voters implement very similar functions and will agree on inputs even
outside the training set. They also noted that one way to increase the ambiguity is to
train the voters on different datasets. Hansen and Salamon (1990) mentioned that, by
taking majority voting and selecting independent experts with a success probability
higher than 0.50, success increases as the number of voting classifiers increases. Mani
(1991) argued that variance among voters decreases as the number of independent
voters increases.
Meir (1995) showed that for linear regression, by training experts on disjoint datasets
and using voting, a large decrease in variance can be achieved due to the independence of
experts. Alpaydin (1997) made similar conclusions with a nearest neighbor classifier.
While most studies assume that each estimator is trained on the complete dataset, Meir
envisaged a situation where the dataset is divided into several subsets, with each of them
used to form a different estimator. Similarly, in our previous work (Khoshgoftaar and
Rebours 2004), we initially divided the training set to build a less biased multi-expert
system to detect noisy instances more efficiently.
In a recent study (Khoshgoftaar and Seliya 2004), we have empirically demonstrated
that while using a very large number of diverse classification techniques for building
software quality classification models, classification accuracy does not show a dramatic
improvement. Instead of searching for a classification technique that performs well for a
given software measurement dataset, we concluded that the software process should focus
on improving the quality of the data (Khoshgoftaar and Seliya 2004).
3 Methodology
3.1 Classification scenarios
We present a new multi-expert classifier which combines the predictions of multiple base
learners induced on multiple training datasets. Each base algorithm has a trade-off, in a
sense that it introduces a bias during the learning process. As there is no benefit in
combining multiple learners that always make similar decisions, the aim is to find a set of
learners which differ in their decisions so that they complement each other (Alpaydin
1998).
Suppose that the practitioner selects m different algorithms (learners). Metrics collected
on n past projects are also available. In this study, we exclusively focus on a uni-representation system (Alpaydin 1998), i.e., the collected metrics in the n fit datasets as well as
in the test dataset are of the same type and number. Given a test dataset E, the m learning
algorithms and n fit datasets can be combined to predict E. More specifically, each of the
m algorithms is trained on each of the n fit datasets until m × n base models are built. These
base models are then used to predict for the test dataset E; m × n vectors of base estimates
of instances in E are therefore generated. Finally, the vectors are combined using the voting
scheme described next.
3.1.1 Voting technique
The simplest way to combine multiple experts is by majority voting. If the number of
experts is even, then we are likely to encounter a tie where half of the experts predict the
program module as nfp, and the other half predict it as fp. Due to the nature of software
quality engineering, where misclassifying a fp program module as nfp is much more
costly than misclassifying a nfp program module as fp, a conservative approach is
recommended (Khoshgoftaar and Allen 1999). More specifically, the final decision is fp
rather than nfp in case of a tie.
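As a minimal sketch, the conservative majority vote described above could be implemented as follows (the "fp"/"nfp" label strings are illustrative, not taken from the paper's implementation):

```python
def majority_vote(votes):
    """Combine the class votes of the base experts by majority voting.

    Ties are broken conservatively in favor of 'fp', since misclassifying
    a fault-prone module as nfp is the costlier error.
    """
    fp_count = sum(1 for v in votes if v == "fp")
    # half or more fp votes (including an exact tie) resolve to fp
    return "fp" if fp_count >= len(votes) / 2 else "nfp"
```

With an even number of experts, a 50/50 split thus yields fp, matching the conservative rule above.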
3.1.2 Pre-processing
An induced learner can only predict instances in the test dataset if the relative values and
ranges of the attributes in the training dataset are similar to those in the test dataset. It is
therefore recommended that the attribute values (i.e., the software metrics in our case) be
first scaled so that their relative values and ranges are approximately equal. In this study,
we normalize all the metrics to a zero mean and unit variance. Let I_kl be the measured
value of the l-th attribute of the k-th program module. The scaled value I'_kl corresponding to
I_kl is given by (Jain 1991):

    I'_kl = (I_kl − Ī_l) / S_l    (1)

where Ī_l and S_l are the measured mean and standard deviation of the l-th attribute,
respectively.
The actual values of the normalized metrics are generally not comprehensible by the
typical user of software metrics, in that they range from negative to positive numbers.
Similar to Munson and Khoshgoftaar (1992), the values of the metrics used in this study
represent a scaled version of the normalized metrics, defined as follows:

    I''_kl = 10 I'_kl + 50    (2)
These scaled-relative metrics will have a mean of 50 and a standard deviation of 10.
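Equations (1) and (2) compose into a single linear transformation per attribute; a minimal sketch follows (using the population standard deviation, since the text does not specify the sample variant):

```python
def scale_metric(values):
    """Apply Eq. (1) (zero mean, unit variance) and then Eq. (2)
    (rescale to mean 50, standard deviation 10) to one attribute."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [10 * ((v - mean) / sd) + 50 for v in values]
```

Each attribute (column) of the fit and test datasets would be scaled independently in this way.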
3.1.3 Algorithm
Figure 1 describes the implementation of the multi-learner multi-dataset classifier. L_i (i = 1, …, m)
denotes the i-th base learner. D_j (j = 1, …, n) is the j-th fit dataset. D'_j is the normalized j-th fit dataset.
D''_j is the scaled-relative j-th fit dataset. E and ||E|| represent the test dataset and the number
of instances in the test dataset, respectively. I_k (k = 1, …, ||E||) is the k-th program module of dataset
E. I''_k is the scaled-relative program module of I_k. L_i(I''_k, D''_j) = c_k is the predicted class of
scaled-relative program module I''_k obtained by inducing learning scheme L_i on training set
D''_j. In the case of software classification models, c_k ∈ {nfp, fp}. c = {c_1, …, c_||E||} is the
vector of the final estimates of instances in dataset E. As mentioned previously, the fit
datasets D_j (j = 1, …, n) and the test dataset E use the same representation to describe software
modules.
Initially, before using the output of the m × n experts, the test and fit datasets are
normalized (steps 1 and 4, respectively) and scaled (steps 2 and 5, respectively). The m
selected learners are then fine-tuned on each of the n fit datasets (step 7) based on a model
selection strategy of obtaining a preferred balance of equality between the Type I and Type
II error rates. Given an instance I_k, a counter, S_k, is defined (step 11). For each learner and
for each fit dataset, if learner L_i induced on fit dataset D''_j predicts scaled-relative program
module I''_k as fp, counter S_k is incremented (step 14). Once all m × n experts cast their
votes, the counter associated with a given instance will indicate the final predicted value of
that program module. If half or more of the experts agree on classifying instance I_k as fp,
then the final predicted label of that instance is fp (step 17). Otherwise, instance I_k is
classified as nfp (step 18).
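The procedure of Fig. 1 can be sketched in code. This is a hedged illustration, not the paper's actual implementation: the fit/predict learner interface and the toy ThresholdLearner are assumptions introduced for the example, and the pre-processing and fine-tuning steps are omitted.

```python
class ThresholdLearner:
    """Toy base learner (illustrative only): labels a module 'fp' when
    its first metric exceeds the mean of the training values."""
    def fit(self, X, y):
        self.threshold = sum(row[0] for row in X) / len(X)
    def predict(self, X):
        return ["fp" if row[0] > self.threshold else "nfp" for row in X]

def multi_learner_multi_dataset_predict(learner_factories, fit_datasets, test_X):
    """Train each of the m learners on each of the n fit datasets
    (m * n base models), then majority-vote their predictions on the
    test set, resolving ties conservatively to 'fp'."""
    models = []
    for make_learner in learner_factories:      # m learners
        for X, y in fit_datasets:               # n fit datasets
            model = make_learner()
            model.fit(X, y)                     # induce a base model (step 7)
            models.append(model)
    final = []
    for row in test_X:
        s = sum(1 for mdl in models if mdl.predict([row])[0] == "fp")
        # half or more fp votes -> fp (steps 17-18)
        final.append("fp" if s >= len(models) / 2 else "nfp")
    return final
```

In the paper's setting the base models are additionally fine-tuned for a preferred balance between the Type I and Type II error rates; that tuning step is not shown here.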
3.1.4 Four classification scenarios
The classifier described in Fig. 1 can be specialized by modifying the values of some
parameters such as the number of classifiers (m) or the number of fit datasets (n). The
notation "m × n" represents a classification scheme which combines the predictions of m
learners induced on n fit datasets. In this work, the following four classification scenarios
are investigated:
Scenario 1: single-learner single-dataset classifier: This classifier uses a single learner
(m = 1) induced on a single fit dataset (n = 1) to predict the test dataset. In this study,
such a scenario is labeled as 1 × 1.
Scenario 2: multi-learner single-dataset classifier: This classifier combines m learners
induced on a single fit dataset (n = 1). This technique has been covered extensively by
researchers, as discussed in Sect. 2. It is labeled as m × 1.
Scenario 3: single-learner multi-dataset classifier: The predictions of one learner
(m = 1) induced on n fit datasets are combined. It is labeled as 1 × n.
Scenario 4: multi-learner multi-dataset classifier: This is the most extensive technique
as it uses m learners induced on n fit datasets. It is denoted as m × n.
Fig. 1 Implementation of the multi-learner multi-dataset classifier
3.2 Merging: an alternative classification technique
In order to take advantage of the availability of multiple fit datasets, one may decide to
merge them into one single dataset. A multi-learner classifier can then be built based on m
learners induced on this merged dataset. Of course, such an approach should standardize
the data prior to the merge. Instead of building m × n base learners, the practitioner builds
only m classifiers. Since the newly created dataset now contains more varieties of cases
because it originates from various projects, one may argue that data mining learners
induced on this new dataset would be more accurate than those induced on any dataset
before merging. We compare this merging approach with the other approaches (or sce-
narios) presented in the previous section. While this alternate classification technique is
relatively simpler, there are some notable drawbacks, as explained below.
• As opposed to the multi-dataset approach, this alternative technique makes it impossible
to use datasets with heterogeneous feature sets. Unless the learning algorithms can cope
efficiently with missing values, the attributes of the instances need to be of the same type
before merging.
• Inducing learners on one large fit dataset will create complex hypotheses which are
difficult to interpret. Besides, the fit datasets can have different levels of noise
(Khoshgoftaar et al. 2006; Khoshgoftaar and Rebours 2004) or different sizes. As a
consequence, the training of the algorithms on the merged dataset can be more
intricate.
• The induction of data mining learners on the newly merged fit dataset would be slower
for polynomially bounded algorithms O(a^b), where a represents the number of
instances in the dataset, and b is a given positive constant.
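The merge-with-standardization alternative can be sketched as follows (per-attribute z-scoring before pooling; this assumes numeric attributes with nonzero variance, and is an illustration rather than the paper's implementation):

```python
def merge_standardized(datasets):
    """Standardize each fit dataset to zero mean / unit variance per
    attribute before merging, so that projects measured on different
    scales can be pooled into one training set."""
    merged_X, merged_y = [], []
    for X, y in datasets:
        cols = list(zip(*X))
        means = [sum(c) / len(c) for c in cols]
        sds = [(sum((v - mu) ** 2 for v in c) / len(c)) ** 0.5
               for c, mu in zip(cols, means)]
        for row in X:
            merged_X.append([(v - mu) / sd
                             for v, mu, sd in zip(row, means, sds)])
        merged_y.extend(y)
    return merged_X, merged_y
```

The m learners would then be induced once each on the returned merged dataset, instead of m × n times.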
3.3 Analysis of variance
ANalysis Of VAriance, commonly known as ANOVA, is a statistical technique for
examining whether independent groups or populations are significantly different from one
another. In our study, the one-way ANOVA design is selected to analyze the performances
of the four classification scenarios. In this design, a classification scenario corresponds to a
group. More specifically, the single-learner single-dataset classification scenario (1 9 1),
the multi-learner (single-dataset) scenario (m 9 1), the (single-learner) multi-dataset sce-
nario (1 9 n), and the multi-learner multi-dataset scenario (m 9 n) relate to the first,
second, third, and fourth groups of the ANOVA model, respectively.
Let Y_1j, …, Y_{n_j}j represent a random sample of n_j observations taken from the population
of group j. How many observations can we possibly collect from the population of each
group? In our experiment, m different data mining algorithms are used. In addition, n + 1
datasets are available and the dependent variable (i.e., fp or nfp) is known for all the
datasets. m × (n + 1) base learners can then be fine-tuned, and each of them can generate
base estimates for n datasets. Hence, m × (n + 1) × n vectors of base estimates can be
obtained. For example, a single-learner multi-dataset classifier (Scenario 3) can be created
by combining the predictions of one of the m base learners successively trained on n
datasets. This classifier can then be applied on the test dataset. Therefore, n_3 = m × (n + 1)
combinations are possible for classification Scenario 3. Similarly, n_1 = m × n × (n + 1),
n_2 = n × (n + 1), and n_4 = n + 1.
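With this study's configuration of seventeen learners and seven datasets (m = 17; one dataset serves as test, so n = 6 fit datasets), these group sizes can be checked directly:

```python
m, n = 17, 6                  # 17 learners; 7 datasets, one held out as test

n1 = m * n * (n + 1)          # Scenario 1 (1 x 1) observations
n2 = n * (n + 1)              # Scenario 2 (m x 1)
n3 = m * (n + 1)              # Scenario 3 (1 x n)
n4 = n + 1                    # Scenario 4 (m x n)
```

This yields very unequal sample sizes across the four ANOVA groups, which motivates the homoscedasticity check and the Scheffé comparisons discussed below.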
Y_ij, the i-th observation in group j (where i = 1, …, n_j and j = 1, …, 4), can be represented
by the following model (Berenson et al. 1983):
    Y_ij = μ + A_j + ε_ij    (3)

where μ is the overall effect common to all the observations; A_j = μ_.j − μ is the treatment effect
related to the j-th group; ε_ij = Y_ij − μ_.j is the experimental error associated with the i-th
observation in group j; and μ_.j is the true mean of the j-th group. The F statistic validates whether
the four population means are equal (Berenson et al. 1983; Jain 1991).
It is assumed that the observed responses in each of the groups represent random
samples drawn from four independent normal populations having equal variance (i.e.,
σ²_1 = ⋯ = σ²_4). The Shapiro test validates the assumption of normality (i.e., N(μ_.j, σ_j)).
The property of homoscedasticity (i.e., equal variability) is tested by using Levene's test,
since the four sample sizes are different (Berenson et al. 1983). If we have good reason to
believe this assumption has been violated, a good strategy is to seek appropriate trans-
formations to normalize the data.
If it is concluded that there are significant differences between the means of at least two
of the groups, a posteriori comparison methods determine which of the four groups are
significantly different as well as which of the four groups appear to differ from each other
only by chance. Since the four sample sizes are very different in this study, Scheffé S tests
will be used later to evaluate the C(4, 2) = 6 paired comparisons (Berenson et al. 1983).
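For reference, the F statistic of this one-way design is the between-group mean square divided by the within-group mean square; a minimal sketch for k groups of unequal sizes:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: F = (SSB / (k - 1)) / (SSW / (N - k))
    for k groups of possibly unequal sizes, total N observations."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total_n = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / total_n
    means = [sum(g) / len(g) for g in groups]
    # between-group sum of squares, weighted by group size
    ssb = sum(n_j * (m_j - grand_mean) ** 2
              for n_j, m_j in zip(sizes, means))
    # within-group sum of squares
    ssw = sum(sum((x - m_j) ** 2 for x in g)
              for g, m_j in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (total_n - k))
```

In this study the four groups would hold the NECM values of the four classification scenarios; a large F suggests that at least two scenario means differ.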
3.4 Model-selection and evaluation strategy
Our empirical study is related to two-group classification in the context of software quality.
Software modules are typically labeled as either fault-prone (fp) or not fault-prone (nfp).
Hence, two types of misclassification errors can occur: Type I error (or false positive) and
Type II error (or false negative). The Type I and Type II errors are generally inversely
proportional for a given dataset. Hence, software quality engineers often recommend
selecting a classification model that has a preferred balance between the two error rates.
This is often obtained by building models with different parameter settings, such as tree
depth in decision trees.
The model selection strategy used in our study is based on obtaining a preferred balance
of equality between the Type I and Type II error rate, with the Type II error rate being as
low as possible. We have used this strategy effectively in several of our prior studies on
software quality analysis of high assurance systems (Khoshgoftaar et al. 2000; Khosh-
goftaar and Rebours 2004; Khoshgoftaar and Seliya 2004). In this study, the model
selection strategy is semi-automated such that the respective model parameters are fine-
tuned to obtain the best possible accuracy on the training dataset while maintaining the
preferred balance between the Type I and Type II error rates, with the latter as low as
possible. Low Type I and Type II misclassification rates ensure the detection of a sig-
nificantly large number of fp modules, and at the same time, keeps the number of nfpmodules predicted to be fp (i.e., ineffective testing and inspection) low.
Comparing the performance of different classification methods based on the two mis-
classification rates (Type I and Type II) can be a difficult task (Khoshgoftaar et al. 2006;
Khoshgoftaar and Rebours 2004). In the context of (two-group) software quality classifi-
cation, where there is likely to be a vast disparity between the prior probabilities of the two
classes (fp and nfp) and the cost of the two types of misclassification, the Expected Cost of
Misclassification (ECM) is more appropriate as a practical measure for comparison
(Khoshgoftaar and Allen 1999):
    ECM = C_I Pr(fp|nfp) p_nfp + C_II Pr(nfp|fp) p_fp    (4)

where C_I and C_II are the costs of Type I and Type II misclassification errors respectively, p_fp
and p_nfp are the prior probabilities of fp modules and nfp modules, Pr(fp|nfp) is the probability
that a nfp module would be misclassified as fp, and Pr(nfp|fp) is the probability that a fp
module would be misclassified as nfp.
In practice, it is difficult to quantify the actual costs of misclassification at the time of
modeling. Hence, we define the Normalized Expected Cost of Misclassification (NECM):

    NECM = ECM / C_I = Pr(fp|nfp) p_nfp + (C_II / C_I) Pr(nfp|fp) p_fp    (5)

NECM facilitates the use of the cost ratio C_II / C_I = c, which can be more readily estimated
using software engineering heuristics for a given project (Khoshgoftaar and Allen 1999).
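Given a classifier's confusion counts on a dataset and a practitioner-supplied cost ratio c, Eq. (5) can be estimated as follows (a sketch; the argument names are illustrative):

```python
def necm(n_nfp_as_fp, n_fp_as_nfp, n_nfp, n_fp, c):
    """Normalized Expected Cost of Misclassification (Eq. 5), estimated
    from counts: NECM = Pr(fp|nfp) * p_nfp + c * Pr(nfp|fp) * p_fp,
    where c = C_II / C_I is the cost ratio."""
    n = n_nfp + n_fp
    pr_fp_given_nfp = n_nfp_as_fp / n_nfp   # Type I error rate
    pr_nfp_given_fp = n_fp_as_nfp / n_fp    # Type II error rate
    return pr_fp_given_nfp * (n_nfp / n) + c * pr_nfp_given_fp * (n_fp / n)
```

Note how a large c penalizes Type II errors: the same error rates produce a much higher NECM when missing fp modules is deemed costly.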
4 Empirical evaluation
4.1 Description of the software measurement datasets
The software metrics and quality data used in our study originate from seven NASA
software projects. Obtained through the NASA Metrics Data Program, these datasets
include software measurement data and associated error data collected at the function level
(Khoshgoftaar et al. 2006; Khoshgoftaar and Rebours 2004; Khoshgoftaar and Seliya
2004). Each instance of these datasets is a program module. The quality of a module is
described by its Error Rate, i.e., number of defects in the module, and Defect, whether or
not the module has any defects. Even though these projects are not directly dependent on
each other, they share marked commonalities. All software projects used in our study are
developed based on a general NASA software development process, and all pertain to
mission critical software applications. They are all high assurance and complex real-time
systems. It is practical and relevant to leverage the information spread across these datasets
in order to predict the quality of an ongoing similar project.
We selected thirteen primitive software metrics for our study: three McCabe metrics
(Cyclomatic Complexity, Essential Complexity, and Design Complexity); five Line Count
metrics (Loc Code And Comment, Loc Total, Loc Comment, Loc Blank, Loc Executable);
four basic Halstead metrics (Unique Operators, Unique Operands, Total Operators,
Total Operands); and one metric for Branch Count. Classifiers are
built using the thirteen software metrics as independent variables and the module-class as the
dependent variable (i.e., fp or nfp). It is important to note that the software measurements are
primarily governed by their availability, the internal workings of the respective projects, and
the data collection tools used by the projects. We only use functionally oriented metrics for all
software datasets, solely because of their availability. This is an unfortunate case of a real-
world software engineering situation where one has to work with what is available rather than
the most ideal situation.
The use of specific software metrics in the case study does not advocate their effec-
tiveness—different software projects may collect and consider different sets of software
measurements for analysis (Khoshgoftaar and Rebours 2004; Nikora and Munson 2004).
We note that the selection of a best set of predictors in estimation problems has been an
ongoing subject of study in software engineering. For example, Cuadrado et al. (2006)
consider an approach to improve the selection of cost drivers in parametric models for
software cost estimation. They analyze various factors that affect the importance of a cost
driver, and use empirical evidence to formulate an aggregation mechanism for cost driver
selection. In the context of software quality classification, Menzies et al. (2007) summarize
that instead of selecting a best set of software quality indicators, empirical studies should
focus on building software quality classification models that are useful and practical. They
summarize that the best attributes to use for defect prediction vary from dataset to dataset,
confirming a relatively similar observation made by Shepperd and Kadoda (2001).
The datasets are related to projects of various sizes written with various programming
languages. Table 1 summarizes the seven datasets used in this case study. Those datasets
are referred to as JM1, KC1, KC2, KC3, CM1, MW1, and PC1 respectively. It is worth
mentioning that the KC datasets are written using object-oriented languages; however,
object-oriented metrics provided for those projects were associated with a different set of
modules in the respective projects. Moreover, those modules were not associated with
known defect data. These problems with available object-oriented metrics prevented their
inclusion in our study. Each software system and its dataset is briefly described below.
• JM1 is a real-time C project which has approximately 315 KLOC (Fenton and Pfleeger
1997). There are eight years of error data associated with the metrics. The changes to the
modules are based on the changes reported within the problem reports. We processed
JM1 to eliminate redundancy, obvious noisy instances, and observations with missing
values. The pre-processed dataset contains 8850 modules, and of these instances, 1687
have one or more faults. The maximum number of faults in a software module is 26.
• KC1 is a project comprised of logical groups of computer software components
(CSCs) within a large ground system. KC1 is made up of 43 KLOC of C++ code. The
dataset contains 2107 instances, and of these instances, 325 have one or more faults and
1782 have zero faults. The maximum number of faults in a module is 7.
• KC2 is a C++ program, with metrics collected at the function level. The KC2 project
is the science data processing unit of a storage management system used for receiving
and processing ground data for missions. The dataset includes only those modules that
were developed by NASA software developers. The dataset contains 520 instances, and
of these instances, 106 have one or more faults and 414 have zero faults. The maximum
number of faults in a software module is 13.
• KC3 has been coded in 18 KLOC of Java. This software application collects, processes,
and delivers satellite meta-data. The dataset contains 458 instances, and of these
instances, 43 have one or more faults and 415 have zero faults. The maximum number
of faults in a module is 6.
• CM1 is written in C code, with approximately 20 KLOC. The data available for this
project is from a science instrument. It contains 505 instances, and of these instances,
48 have one or more faults and 457 have zero faults. The maximum number of faults in
a module is 5.
• MW1 is the software from a zero gravity experiment related to combustion. The
experiment is now completed. It is comprised of 8000 lines of C code. The dataset
contains 403 modules, and of these instances, 31 have one or more faults and 372 have
zero faults. The maximum number of faults in a module is 4.
• PC1 is flight software from an earth-orbiting satellite that is no longer operational. It
consists of 40 KLOC of C code. The dataset contains 1107 instances, and of these
instances, 76 have one or more faults and 1031 have zero faults. The maximum number
of faults in a module is 9.
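The class distributions described above can be tabulated directly; a small sketch (counts transcribed from the dataset descriptions) that computes each dataset's fault-prone proportion:

```python
# Counts transcribed from the dataset descriptions above:
# (total modules, modules with one or more faults)
datasets = {
    "JM1": (8850, 1687),
    "KC1": (2107, 325),
    "KC2": (520, 106),
    "KC3": (458, 43),
    "CM1": (505, 48),
    "MW1": (403, 31),
    "PC1": (1107, 76),
}

for name, (total, fp) in datasets.items():
    nfp = total - fp  # not-fault-prone modules
    print(f"{name}: {fp}/{total} fault-prone ({100 * fp / total:.1f}%), {nfp} fault-free")
```

The output makes the class imbalance explicit: the fault-prone minority ranges from roughly 7% (PC1) to 19% (JM1) of the modules.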
34 Software Qual J (2009) 17:25–49
4.2 Presentation of the selected learners
As there is no benefit in combining multiple learners that always make similar decisions,
seventeen learners which differ in their underlying concepts are selected to complement
each other. These classification techniques are summarized in Table 2. Several of these
classifiers are implemented in the WEKA data mining tool (Witten and Frank 2000). Some
additional information regarding these classification techniques is presented in the
Appendix. However, a descriptive discussion of each technique is out of scope for this
paper.
Table 1 Summary of the Software Data Repositories
Dataset nfp fp Total Language
JM1a 7163 1687 8850 C
KC1 1782 325 2107 C++
KC2 414 106 520 C++
KC3 415 43 458 Java
CM1 457 48 505 C
MW1 372 31 403 C
PC1 1031 76 1107 C
a A pre-processed dataset
Table 2 Summary of the selected classifiers
Family Classification technique Acronym
Instance-based learners Locally Weighted Learning (with Decision Stump) (Atkeson et al. 1997) LWLStump
1-Instance Based Learning (Witten and Frank 2000) IB1
k-Instance Based Learning (Emam et al. 2001) IBk
Meta learner Bagging (Breiman 1996) Bagging
Function-based learners Sequential Minimal Optimization (Platt 1998) SMO
Logistic Regression (Khoshgoftaar and Allen 1999) LRa
Rule-based learners Ripple Down Rules (Gaines and Compton 1995) Ridor
One Rule (Holte 1993) OneR
Lines-of-Code LOCa
Decision Table (Kohavi 1995) DecisionTable
Tree-based learners WEKA’s implementation of C4.5 (Quinlan 1993) J48
Partial Decision Tree (Frank and Witten 1998) PART
Tree-Disc Classification Tree (Khoshgoftaar et al. 2000) TDa
Alternate Decision Tree (Freund and Mason 1999) ADTree
Repeated Incremental Reduced Error Pruning (Cohen 1995) JRip
Random Forest (Witten and Frank 2000) RandomForest
Bayesian learner Naive Bayes (Frank et al. 2000) NaiveBayes
a Implemented by data mining tools other than WEKA
4.3 Empirical results
4.3.1 Quality-of-fit results
We present the quality-of-fit of the m = 17 base learners induced on the 7 available
datasets. For most base classifiers, the predictions are obtained using 10-fold cross-vali-
dation. Extensive tests on numerous different datasets, with different learning techniques,
show that 10 is about the right number of folds to obtain the best estimate of error (Witten
and Frank 2000). In the case of LOC, TD, and LR, resubstitution is used due to the
unavailability of a cross-validation feature within the modeling tool used.
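For readers unfamiliar with the procedure, k-fold cross-validation can be sketched generically as follows; this is an illustration, not the WEKA implementation used in the study, and the majority-class learner is only a stand-in for any base classifier:

```python
import random

def k_fold_error(instances, labels, train_fn, k=10, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation:
    each fold is held out once while the learner trains on the rest."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_fn([instances[i] for i in train], [labels[i] for i in train])
        errors += sum(model(instances[i]) != labels[i] for i in fold)
    return errors / len(instances)

# Toy usage with a majority-class learner (a stand-in for any base classifier):
def majority_learner(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = list(range(20))
labels = ["nfp"] * 15 + ["fp"] * 5
print(k_fold_error(data, labels, majority_learner))  # 5/20 = 0.25
```

Every instance is used exactly once for testing and k − 1 times for training, which is why the estimate is less optimistic than resubstitution.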
Tables 3 and 4 present the quality-of-fit in terms of Type I, Type II, and Overall
misclassification rates. The critical ranges for the misclassification rates are also provided
using the t-test (Jain 1991) at significance level a = 0.10. For each of the seventeen
learners listed in Table 3, the results are averaged across the seven datasets. We observe
that Tree Disc (TD) performs the best on average; this result is confirmed by previous
studies (Khoshgoftaar et al. 2000, 2006). It is also worth noting the relatively good
accuracy of Random Forest, Logistic Regression, and IBk. Similarly, for each of the seven
available datasets, the results in Table 4 are averaged across the seventeen learners. The
quality-of-fit greatly varies across the datasets, indicating that some datasets may be more
noisy than others (Khoshgoftaar et al. 2006).
4.3.2 Quality-of-test results
Table 5 presents the total number of classifiers obtained using the four classification sce-
narios (nj). For example, there are 42 possible combinations for classification scenario 2,
Table 3 Quality-of-fit averaged across fit datasets
Base learner Type I Type II Overall
J48 27.0% (±4.4%) 26.1% (±3.9%) 26.9% (±4.4%)
JRip 26.9% (±4.0%) 24.4% (±3.9%) 26.6% (±3.9%)
NaiveBayes 29.0% (±3.4%) 29.4% (±3.3%) 29.0% (±3.3%)
DecisionTable 29.4% (±4.9%) 28.7% (±4.7%) 29.3% (±4.9%)
RandomForest 25.6% (±2.9%) 24.1% (±3.5%) 25.4% (±3.0%)
OneR 26.6% (±3.2%) 25.8% (±3.4%) 26.5% (±3.2%)
PART 26.4% (±3.8%) 25.2% (±4.3%) 26.2% (±3.9%)
IBk 25.4% (±3.6%) 24.8% (±3.9%) 25.3% (±3.5%)
IB1 30.8% (±2.8%) 27.8% (±3.4%) 30.4% (±2.8%)
ADTree 26.9% (±3.1%) 27.0% (±2.9%) 26.9% (±3.1%)
Ridor 36.6% (±15.6%) 24.5% (±6.6%) 35.3% (±13.6%)
LWLStump 26.9% (±3.2%) 26.8% (±3.5%) 26.9% (±3.3%)
SMO 26.2% (±3.8%) 26.7% (±3.7%) 26.3% (±3.8%)
Bagging 26.9% (±3.5%) 24.4% (±2.9%) 26.5% (±3.4%)
LOCa 31.5% (±1.9%) 28.8% (±3.9%) 31.1% (±2.3%)
LRa 25.9% (±3.2%) 25.5% (±3.1%) 25.8% (±3.1%)
TDa 21.2% (±3.5%) 20.5% (±3.5%) 21.0% (±3.4%)
Average 27.9% (±1.1%) 26.2% (±0.8%) 27.7% (±1.0%)
a Quality-of-fit obtained by resubstitution
implying that for each of the 7 datasets used as a fit dataset, the other 6 are used as test
datasets. Similarly, for Scenario 1, there are 714 possible combinations, reflecting 17
learners trained on each of the 7 datasets and applied to each of the remaining 6 datasets.
For Scenario 3, there are 119 possible combinations, i.e., each of the 17 learners trained on
6 fit datasets is applied to the remaining 1 dataset—all 7 datasets are given an opportunity
to belong to the 6 datasets used for training, and each dataset is given an opportunity to be
used as a test dataset for a learner trained on the 6 fit datasets. Finally, for Scenario 4, there
are 7 possible combinations since the 17 learners trained on 6 fit datasets are applied (in
turn) to each of the 7 datasets. In total, 882 vectors of final estimates (test data estimates)
are generated based on the four classification scenarios.
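The combination step shared by Scenarios 2, 3, and 4 is simple majority voting over the base estimates; a minimal sketch (the tie-break toward fp is our assumption, chosen as the conservative option for high-assurance software):

```python
from collections import Counter

def majority_vote(base_predictions):
    """Combine the class estimates of several experts (learner/dataset
    pairs) for one module by simple majority. The tie-break toward fp
    is an assumption, conservative for high-assurance systems."""
    counts = Counter(base_predictions)
    return "fp" if counts["fp"] >= counts["nfp"] else "nfp"

print(majority_vote(["fp", "nfp", "fp"]))   # fp
print(majority_vote(["nfp", "nfp", "fp"]))  # nfp
```

Each module's final estimate is the vote over its m × n base estimates, which is what makes the method applicable unchanged to all three combined scenarios.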
Tables 6–9 summarize the quality-of-test both in terms of misclassification rates (i.e.,
Type I and Type II) and normalized expected costs of misclassification (i.e., NECM) for
classification Scenarios 1, 2, 3, and 4 respectively. For brevity, the results are grouped and
averaged by test datasets. NECMs are produced for cost ratios (c) 15, 20 and 25. This range
of cost ratios was used in similar studies (Khoshgoftaar et al. 2000, 2006) and is considered
appropriate for high-assurance software systems. The averages (Ȳ.j) and the standard
deviations (S.j) for the respective response variables are presented as well.
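Assuming the standard two-group form of NECM (unit cost per Type I error, cost c per Type II error, normalized by the total number of modules), the JM1 row of Table 6 can be reproduced from its reported error rates:

```python
def necm(n_nfp, n_fp, type1_rate, type2_rate, c):
    """Normalized expected cost of misclassification, assuming the
    standard two-group form: unit cost per Type I error (nfp predicted
    as fp), cost c per Type II error (fp predicted as nfp), divided by
    the total number of modules."""
    false_alarms = type1_rate * n_nfp   # nfp modules flagged fp
    misses = type2_rate * n_fp          # fp modules passed as nfp
    return (false_alarms + c * misses) / (n_nfp + n_fp)

# JM1 under Scenario 1 (Table 6): Type I = 31.0%, Type II = 48.6%
for c in (15, 20, 25):
    print(c, round(necm(7163, 1687, 0.310, 0.486, c), 3))
```

This reproduces Table 6's JM1 entries (1.640, 2.103, 2.566) to within rounding of the reported error rates.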
Classification Scenarios 2, 3, and 4 achieve, on average, better performance in
terms of NECM than Scenario 1. In other words, combining multiple base
estimates improves the predictive accuracy of the final classifier. Moreover, the standard
deviation of Scenario 1 is higher than that of any other scenario (S.1 > S.2 > S.4 > S.3). As
noted by Alpaydin (1998) and Mani (1991), the voting method can be thought of as a
regularizer which smoothes the predictive results. In terms of misclassification cost, we
observe that the multi-dataset classification scenarios (Scenarios 3 and 4) perform better on
average than the single-dataset scenarios (Scenarios 1 and 2), that is, Ȳ.3 < Ȳ.4 < Ȳ.2 < Ȳ.1.
Table 4 Quality-of-fit averaged across base learners
Fit dataset Type I Type II Overall
JM1 33.6% (±0.5%) 33.1% (±0.8%) 33.5% (±0.5%)
KC1 27.7% (±0.8%) 27.1% (±0.9%) 27.6% (±0.8%)
KC2 21.7% (±1.2%) 20.1% (±0.8%) 21.3% (±1.0%)
KC3 25.5% (±1.3%) 23.8% (±1.3%) 25.4% (±1.3%)
CM1 30.2% (±6.1%) 24.6% (±2.4%) 29.7% (±5.4%)
MW1 31.2% (±1.9%) 30.2% (±1.6%) 31.2% (±1.9%)
PC1 23.0% (±2.0%) 22.5% (±1.9%) 23.0% (±2.0%)
Average 27.9% (±1.1%) 26.2% (±0.8%) 27.7% (±1.0%)
Table 5 Sample sizes of the four groups
Classification scenario   Number of learners, m   Number of fit datasets, n   Number of experts, n × m   Sample size, nj
1 × 1                     1                       1                           1                          17 × 6 × 7 = 714
m × 1                     17                      1                           17                         6 × 7 = 42
1 × n                     1                       6                           6                          17 × 7 = 119
m × n                     17                      6                           102                        7
Total sample size: n1 + n2 + n3 + n4 = 882
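The sample sizes nj in Table 5 follow directly from the design (17 learners, 7 datasets, leave-one-dataset-out testing); a quick arithmetic check:

```python
n_datasets, m_learners = 7, 17

# Scenario 1 (1 x 1): every learner, every fit dataset, every other test dataset
n1 = m_learners * n_datasets * (n_datasets - 1)   # 17 * 7 * 6 = 714
# Scenario 2 (m x 1): the 17 learners combine into one vote per fit/test pair
n2 = n_datasets * (n_datasets - 1)                # 7 * 6 = 42
# Scenario 3 (1 x n): each learner trained on 6 datasets, tested on the 7th
n3 = m_learners * n_datasets                      # 17 * 7 = 119
# Scenario 4 (m x n): all 102 experts combined, one vote per test dataset
n4 = n_datasets                                   # 7

print(n1, n2, n3, n4, n1 + n2 + n3 + n4)  # 714 42 119 7 882
```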
These conclusions, which are identical for all cost ratios, suggest that using multiple
datasets improves the predictive accuracy of the final classifiers compared to the use of a
single dataset. To determine whether the improvements are significant in terms of pre-
dictive performance, a statistical analysis is considered.
Table 6 Quality-of-test of classification scenario 1 (1 × 1)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.640 2.103 2.566 31.0% 48.6%
KC1 1.111 1.406 1.702 26.7% 38.3%
KC2 1.200 1.520 1.840 30.2% 31.4%
KC3 0.718 0.871 1.025 28.3% 32.7%
CM1 0.819 1.000 1.181 30.6% 38.1%
MW1 0.679 0.802 0.926 33.5% 32.0%
PC1 0.710 0.843 0.976 33.4% 38.7%
Ȳ.1 0.982 1.221 1.459 30.5% 37.1%
S.1 0.396 0.536 0.677 11.7% 13.0%
n1 = 714
Table 7 Quality-of-test of classification scenario 2 (m × 1)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.602 2.072 2.541 23.7% 49.3%
KC1 1.046 1.327 1.609 23.7% 36.5%
KC2 0.973 1.236 1.499 23.2% 25.8%
KC3 0.680 0.834 0.989 23.8% 32.9%
CM1 0.751 0.921 1.091 26.7% 35.8%
MW1 0.634 0.746 0.857 32.4% 29.0%
PC1 0.657 0.787 0.916 28.8% 37.7%
Ȳ.2 0.906 1.132 1.358 26.1% 35.3%
S.2 0.347 0.473 0.600 4.9% 10.0%
n2 = 42
Table 8 Quality-of-test of classification scenario 3 (1 × n)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.441 1.838 2.235 30.9% 41.6%
KC1 0.929 1.159 1.388 28.4% 29.8%
KC2 0.803 0.992 1.180 29.9% 18.5%
KC3 0.592 0.700 0.808 29.6% 23.0%
CM1 0.720 0.863 1.007 32.1% 30.1%
MW1 0.631 0.729 0.827 36.6% 25.4%
PC1 0.629 0.731 0.833 34.6% 29.7%
Ȳ.3 0.821 1.002 1.182 31.7% 28.3%
S.3 0.287 0.390 0.493 5.3% 8.4%
n3 = 119
4.4 Statistical analysis
The one-way ANOVA model is selected to be the underlying model of the experiment,
while the four populations are the four classification scenarios. Table 5 presents the sample
sizes (nj) of the related four population samples. Notice that the sample size of the group
related to the multi-learner multi-dataset classification scenario is only equal to 7. Con-
sequently, a relatively large confidence interval is expected for the 4th group. Similar to
(Khoshgoftaar et al. 2006), three ANOVA models are successively built based on the
response variable NECM at cost ratios 15, 20, and 25.
4.4.1 Preliminary transformation
It is important to validate the assumption of independence, normality, and homogeneity of
variances before building ANOVA models. Figure 2 presents a normal quantile plot of the
error data (Jain 1991) when the cost ratio (c) is equal to 20. The shorter tail on one end and
a longer tail on the other are characteristics of an asymmetric distribution. Several tests
such as Fmax, L-Levene, W-Shapiro (Berenson et al. 1983) indicate that the previously
mentioned assumptions do not hold. The same conclusions are reached for cost ratios 15
and 25.
An inverse transformation is hence used to stabilize the variance. In the remainder of
this paper, a star (*) will indicate that the data is transformed using an inverse transfor-
mation. Figure 3 presents the quantile-quantile plot, after the transformation and when
c = 20. Statistical tests indicate that the assumptions related to the ANOVA model now
hold. This transformation is also successful in stabilizing the variance and normalizing the
samples at cost ratios 15 and 25.
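The inverse (reciprocal) transformation stabilizes variance when a group's spread grows roughly with the square of its mean; a self-contained sketch on synthetic groups (not the study's NECM samples) illustrating the effect:

```python
import random
import statistics

rng = random.Random(7)

def cost_group(mean, n=400):
    """Synthetic cost samples whose spread grows with the mean
    (sd proportional to mean**2), the situation the reciprocal fixes."""
    return [mean + mean ** 2 * rng.gauss(0, 0.08) for _ in range(n)]

groups = [cost_group(m) for m in (0.8, 1.2, 2.0)]

for g in groups:
    transformed = [1.0 / y for y in g]  # the inverse transformation
    print(round(statistics.stdev(g), 3), round(statistics.stdev(transformed), 3))
```

The raw spreads differ by roughly a factor of six across the three groups, while the transformed spreads are approximately equal, which is exactly the homogeneity of variance the ANOVA model requires.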
4.4.2 ANOVA models
Due to space restrictions, only the ANOVA model for cost ratio 20 is
presented in Table 10. DF, SS, and MS refer to the degrees of freedom, the sum of squares,
and the mean squares, respectively. The F-value is selected at 90% confidence level (i.e.,
a = 0.10). The p-value related to the F-test is also provided. Since the p-value is well
below the significance level a, we can conclude that there is a significant difference in the
Table 9 Quality-of-test of classification scenario 4 (m × n)
Test dataset c = 15 c = 20 c = 25 Type I Type II
JM1 1.477 1.906 2.334 23.5% 45.0%
KC1 0.996 1.257 1.518 25.1% 33.8%
KC2 0.769 0.962 1.154 24.2% 18.9%
KC3 0.624 0.755 0.886 25.5% 27.9%
CM1 0.776 0.954 1.133 26.7% 37.5%
MW1 0.625 0.737 0.849 31.5% 29.0%
PC1 0.584 0.692 0.800 27.7% 31.6%
Ȳ.4 0.836 1.038 1.239 26.3% 32.0%
S.4 0.315 0.429 0.542 2.7% 8.2%
n4 = 7
means of the four groups. Figure 4 plots a visual representation of the ANOVA model
presented in Table 10. We observe that the average predictive performance of the base
classifiers (Scenario 1, Ȳ*.1) is lower than the average of the four combined sample
populations (Ȳ*..).
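The quantities reported in Table 10 come from the standard one-way ANOVA decomposition; a self-contained sketch on small toy groups (the transformed NECM samples themselves are not reproduced here):

```python
import statistics

def one_way_anova(groups):
    """Return (SS_among, SS_within, F) for a one-way ANOVA layout,
    as in Table 10 (DF among = k-1, DF within = N-k)."""
    all_obs = [y for g in groups for y in g]
    grand_mean = statistics.fmean(all_obs)
    group_means = [statistics.fmean(g) for g in groups]
    ss_among = sum(len(g) * (m - grand_mean) ** 2
                   for g, m in zip(groups, group_means))
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)
    df_among = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f_stat = (ss_among / df_among) / (ss_within / df_within)
    return ss_among, ss_within, f_stat

# Toy unbalanced groups, mimicking the unequal sample sizes of Table 5:
g1 = [1.0, 1.2, 1.4, 1.1, 1.3, 1.2]
g2 = [0.9, 1.0, 1.1]
g3 = [0.7, 0.8, 0.9]
ss_a, ss_w, f_stat = one_way_anova([g1, g2, g3])
print(round(ss_a, 3), round(ss_w, 3), round(f_stat, 2))  # 0.33 0.14 10.61
```

A large F (well above the critical value for the chosen α) leads, as in Table 10, to rejecting the hypothesis of equal group means.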
4.4.3 Contrasts
Scheffé's S method is employed to determine which differences among the means are in fact
significant (Chen et al. 1997). The first part of Table 11 (predictive performances) presents
the six possible pairwise comparisons between the four groups at different cost ratios, c.
For example, L1 represents the difference between the means of the first and second
classification scenarios (1 × 1 and m × 1, respectively).¹ The estimated contrast (L̂1 = Ȳ*.1 − Ȳ*.2) is equal to -0.052 at cost ratio 20. A negative number indicates that the sample
average of the first group is lower than the sample average of the second group. Conse-
quently, the average cost of misclassification of group 1 is higher than in group 2, because
of the inverse transformation procedure. Hence, group 2 performs better than group 1 on
average. The p-value represents the probability that the two means are, in fact, not
Fig. 2 Normal quantile-quantile plot for the error data before transformation (c = 20)
Fig. 3 Normal quantile-quantile plot for the error data after transformation (c = 20)
¹ μ.j and μ*.j are used interchangeably throughout the paper.
significantly different (i.e., μ.1 = μ.2). Since the p-value = 0.812 for the first contrast, we conclude that,
on average, the predicted performances of the multi-learner single-dataset classifiers are
not significantly better than those of the base classifiers.
This conclusion corroborates our previous study (Khoshgoftaar and Seliya 2004). The
quality of software measurement data plays a critical role in the accuracy and the use-
fulness of classification models. The practitioner should not expect any significant
improvement in the accuracy of the learners if the information in the training dataset is
limited. A multi-learner system should only be justified if the combined decisions are
better than those of any single classifier in the system (Ho et al. 1994). Majority voting
remains, however, a simplistic combination technique. Future work should implement
more sophisticated combination techniques, as opposed to a simple majority voting, in
order to assess the usefulness of a classification technique based on the output of different
learners induced on the same fit dataset.
Contrasts L3, L5, and L6 indicate that the multi-learner multi-dataset approach is not
significantly different from any of the other classification scenarios. The confidence intervals of
these contrasts are too wide to conclude any significant difference because the sample size
of group 4 is small (n4 = 7). The only significant observation in Table 11 is the com-
parison between the single-learner single-dataset classification scenario and the single-
learner multi-dataset classification scenario. L2 is significant (see p-values in bold) at cost
ratios 15, 20, and 25. This contrast shows that, on average, a single-learner multi-dataset
classifier produces better predictive results (i.e., lower misclassification cost) than a single-
learner single-dataset system. It demonstrates that the use of several training datasets
leverages the amount of information available for knowledge discovery. In other words,
data mining more relevant information leads to better and more robust predictive models.
Table 10 ANOVA table comparing performance among the four scenarios (c = 20)
Variation DF SS MS F F-Critical p-Value
Among groups 3 2.789 0.930 8.204 2.154 0.000
Within groups 878 99.485 0.113
Total 881 102.274
Fig. 4 Visual representation of the ANOVA model (c = 20)
Software Qual J (2009) 17:25–49 41
123
It is also practical to know whether there is a significant difference between the clas-
sification scenarios using multiple datasets and the scenarios using a single dataset. The
contrast is defined as follows:
L7 = (μ.1 + μ.2)/2 − (μ.3 + μ.4)/2    (6)
Similarly, it is worth assessing the contrast between classifiers using multiple learners
and using a single learner:
L8 = (μ.1 + μ.3)/2 − (μ.2 + μ.4)/2    (7)
Neither L7 nor L8 is significant, however, owing to the unbalanced sample sizes.
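Because all contrasts are linear in the group means, the composite contrasts are determined by the pairwise ones; a quick consistency check against the c = 20 column of Table 11:

```python
# Pairwise contrast estimates at cost ratio c = 20, transcribed from Table 11
L1, L2, L3 = -0.052, -0.162, -0.120   # mu.1-mu.2, mu.1-mu.3, mu.1-mu.4
L4, L5, L6 = -0.110, -0.068, 0.043    # mu.2-mu.3, mu.2-mu.4, mu.3-mu.4

# L7 = (mu.1+mu.2)/2 - (mu.3+mu.4)/2 expands to (L2 + L3 + L4 + L5) / 4
L7 = (L2 + L3 + L4 + L5) / 4
# L8 = (mu.1+mu.3)/2 - (mu.2+mu.4)/2 expands to (L1 + L6) / 2
L8 = (L1 + L6) / 2

print(f"L7 = {L7:.3f}, L8 = {L8:.4f}")
```

The recovered values, -0.115 and -0.0045, match Table 11's reported L7 and L8 estimates at c = 20 (the latter rounded to -0.005).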
4.5 Comparison with the merging approach
We also evaluated the alternate merging approach (discussed in Sect. 3.2), in which six of
the seven datasets were successively merged into one fit dataset while the remaining dataset
was used as the test dataset. Prior to building the respective models, two additional independent
variables (or software attributes) were added to the respective fit and test datasets. The first
variable indicated the size of the (one of seven) software project that a program module
belonged to. We categorized the seven datasets into small, medium, and large sizes based
on the number of modules in each dataset. The second variable is a Boolean metric
representing whether or not an instance belonged to a dataset of an object-oriented soft-
ware system.
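Attaching the two auxiliary attributes is mechanical; a sketch using the module counts from Table 1 (the small/medium/large cut-offs below are hypothetical, since the paper does not report its exact thresholds):

```python
# Module counts (Table 1) and implementation language per project
projects = {
    "JM1": (8850, "C"),   "KC1": (2107, "C++"), "KC2": (520, "C++"),
    "KC3": (458, "Java"), "CM1": (505, "C"),    "MW1": (403, "C"),
    "PC1": (1107, "C"),
}

def size_category(n_modules):
    # Hypothetical cut-offs: the paper categorizes projects by module
    # count but does not report its exact small/medium/large thresholds.
    if n_modules < 600:
        return "small"
    if n_modules < 2500:
        return "medium"
    return "large"

def augment(module_row, project):
    """Append the two extra attributes used when merging fit datasets:
    project-size category and an object-oriented (C++/Java) flag."""
    n_modules, language = projects[project]
    return module_row + [size_category(n_modules), language in ("C++", "Java")]

print(augment([42, 3, 1.7], "KC3"))  # [42, 3, 1.7, 'small', True]
```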
The modeling and prediction results for the merging approach are not presented due to
space considerations. However, we do summarize the findings of the comparative study in
this section. When compared to our studies using a single learner, i.e., Scenario 1 and
Scenario 3, it was observed that the multi-dataset approach of Scenario 3 yielded signif-
icantly better results (p-value = 0.0000) than the merging approach. In addition, the
single-learner single-dataset approach of Scenario 1 also performed better than the merging
approach; however, the improvement was not significant at the 5% level. In summary, the merging
approach did not provide better results than the other multi-dataset approach investigated
in our study.
Table 11 Contrasts among the four classification scenarios
Contrast                              Estimate (L̂l)             p-Value
                                      15      20      25        15     20     25
L1 = μ.1 − μ.2                        -0.073  -0.052  -0.039    0.671  0.812  0.890
L2 = μ.1 − μ.3                        -0.177  -0.162  -0.148    0.000  0.000  0.000
L3 = μ.1 − μ.4                        -0.153  -0.120  -0.097    0.759  0.831  0.875
L4 = μ.2 − μ.3                        -0.104  -0.110  -0.109    0.494  0.350  0.277
L5 = μ.2 − μ.4                        -0.079  -0.068  -0.058    0.965  0.970  0.975
L6 = μ.3 − μ.4                        0.024   0.043   0.051     0.999  0.991  0.980
L7 = (μ.1 + μ.2)/2 − (μ.3 + μ.4)/2    -0.128  -0.115  -0.103    0.444  0.454  0.469
L8 = (μ.1 + μ.3)/2 − (μ.2 + μ.4)/2    -0.025  -0.005  0.006     0.992  1.000  1.000
4.6 Threats to empirical validity
Due to the many human factors that affect software development, and consequently
software quality, controlled experiments for evaluating the usefulness of empirical models
are not practical. We adopted a case study approach in the empirical investigations pre-
sented in this study. To be credible, the software engineering community demands that the
subject of an empirical study have the following characteristics (Wohlin et al. 2000):
• Developed by a group, and not by an individual
• Be as large as industry size projects, and not a toy problem
• Developed by professionals, and not by students
• Developed in an industry/government organization setting, and not in a laboratory
We note that our case studies fulfill all of the above criteria. The software systems
investigated in our study were developed by professionals in a government software
development organization. In addition, each system was developed to address a real-world
problem.
Empirical studies that evaluate measurements and models across multiple projects
should take care in assessing the scope and impact of their analyses and conclusions. For
example, C and C++ projects may be considered similar if they are developed using the
procedural paradigm. Combining object-oriented project data (e.g., Java) with non-object-
oriented project data (e.g., C or C++) needs careful consideration. For example,
per-module lines of code of object-oriented software tend to be lower than those of
non-object-oriented software.
The proposed process of combining multiple learners and datasets for software quality
analysis included four C projects, two C++ projects and one Java project. In the
alternate classification approach of merging all datasets, we introduced two additional
metrics: one to capture dataset size variation and another to capture whether a module
belongs to an OO project. The average Loc Total of the C projects was relatively
similar to that of the C++ projects. The average Loc Total for the Java project, while
slightly lower, was still comparable. All datasets were normalized and scaled to
account for variation in dataset size. In addition to commonality of development orga-
nization and application domain, all projects were characterized by the same set of
metrics. Such similarity among the projects was considered sufficient for the primary
scope of our study, i.e., improving software quality analysis by evaluating multiple
software project repositories.
5 Conclusion
This paper presented an empirical study where seventeen classification models were
induced on seven NASA software projects. To our knowledge, this empirical work is one
of the largest in terms of both scale and scope: 119 (17 × 7) base classification models
were built, and more than 700 vectors of base estimates were generated.
Four classification scenarios were investigated. The first scenario applies the more
classical approach: training one classifier with a single fit dataset and predicting the test
dataset. The second approach is a popular method in data mining: a classifier is built based
on the prediction of multiple learners induced on the same dataset. The third approach
consists of using the prediction of the same learner induced on multiple fit datasets (multi-
dataset classifier). Finally, the most generic approach combines the predictions of multiple
learners built on multiple fit datasets and applied on the dataset we want to predict. Such a
technique is referred to as multi-learner multi-dataset classifier.
It is shown that, on average, the single-learner multi-dataset classifiers (Scenario 3)
perform significantly better than the corresponding single-learner single-dataset
classifiers (Scenario 1). We also demonstrated that there is no significant difference between the
predictions of the base classifiers and the multi-learner single-dataset classifiers.
These two conclusions are critical for complex data mining problems such as software
quality classification. The practitioner should not rely solely on sophisticated and/or robust
algorithms to generate accurate predictions. When the information is either noisy or limited,
the practitioner should dedicate resources toward more information gathering. In the case of
software quality engineering and as demonstrated by this study, it is advised to use software
data (software metrics) repositories of existing projects similar to the current project (for
which prediction is needed) in terms of quality requirement and application domain.
Future work can investigate the use of learners trained on different representations of
the same input (i.e., multi-representation). For example, it may be worthwhile to
investigate whether the inclusion of object-oriented metrics and other measurements, such as
software process metrics, improves the performance of the four different classification
scenarios. Another research direction would consist of studying more sophisticated voting
schemes, such as weighted voting, cascading, or stacked generalization. Such
schemes may improve the predictive accuracy of the multi-learner multi-dataset classifier.
Acknowledgments We thank the three reviewers and the Associate Editor for their constructive critique and comments. We also thank the various members of the Empirical Software Engineering Laboratory and Data Mining and Machine Learning Laboratory at Florida Atlantic University for their reviews of this paper. We are grateful to the staff of the NASA Metrics Data Program for making the software measurement data available.
Appendix
This section presents a brief description of the seventeen classifiers trained on the seven
software measurement datasets. The first fifteen are part of the WEKA data mining tool
which is publicly available (Witten and Frank 2000).
J48 is WEKA’s implementation of C4.5, the decision tree algorithm introduced by
Quinlan (1993). The C4.5 algorithm is an inductive supervised learning system which
employs decision trees to represent the underlying structure of the input data. The algo-
rithm is comprised of four principal components for constructing and evaluating the
classification tree models: decision tree generator, production rule generator, decision tree
interpreter, and production rule interpreter. The algorithm is known to be one of the most
robust induction learning algorithms available (Khoshgoftaar et al. 2000).
JRip is WEKA’s implementation of the rule-based learning algorithm, RIPPER
(Repeated Incremental Pruning to Produce Error Reduction)—a modification of the IREP
(Incremental Reduced Error Pruning). RIPPER was proposed by Cohen (1995), and was
shown to compare favorably with C4.5. Both RIPPER and C4.5 rules start with an initial
model and iteratively improve it using heuristic techniques. However, for large noisy
datasets, the former generally seems to start with an initial model that is about the right
size, while the latter starts with an extremely large initial model. This means that RIPPER
is more search-efficient.
Naive Bayes is one of the most simple techniques available for classification. Based on
the Bayes rule of conditional probability, this technique "naively" assumes that attributes are
independent of each other given the class, which may not be completely true in the real world
(Witten and Frank 2000). Despite the over-simplification of the actual relationship between
the attributes, Naive Bayes has been shown to perform fairly well, especially when used
along with feature selection techniques that remove irrelevant attributes (Frank et al. 2000).
Decision Table is one of the simplest methods for learning from input data. The rules
learned from the input data have the same form as the input: a decision table, i.e., a
list of rules in table format. The problem of constructing a decision table involves the
selection of appropriate attributes for inclusion, and getting rid of irrelevant attributes.
When determining the class of a test instance, all one has to do is look up the
appropriate conditions in the list of rules, i.e., the decision table. Decision tables are
appealing in real-time environments, since they provide constant classification time on average
(Kohavi 1995).
Random Forest is a classifier consisting of a collection of tree-structured classifiers
(Witten and Frank 2000). The Random Forest classifies a new object from an input vector
by running the vector down each tree in the forest. Each tree casts a unit vote for the
input vector by giving a classification. The forest selects the classification having the most
votes over all the trees.
OneR, introduced by Holte (1993), is one of the simplest algorithms available in
machine learning. Despite its simplicity, it compares favorably to many state-of-the-art machine learning techniques. It chooses the single most informative attribute and bases
the rule on this attribute alone. In practice, simple rules often achieve surprisingly high
accuracy, which could be attributed to the underlying rudimentary structure of many real-
world datasets (Witten and Frank 2000).
PART is a simple, yet surprisingly effective method for learning decision lists based on
the repeated generation of partial decision trees in a divide-and-conquer manner (Frank and
Witten 1998). It builds a rule, removes the instances covered by the rule, and continues
creating rules recursively for the remaining instances until none are left (Frank and Witten
1998). PART, unlike the two dominant practical implementations of rule learners, C4.5
and Ripper, avoids the time consuming phase of post-processing for global optimization,
maintaining comparable classification accuracy.
Instance-Based Learning (IB1) is a popular classification scheme. The working
hypothesis of the technique is that the program module under examination (i.e., test case)
would belong to the same class as that of other similar instances. Different instance-based
learning algorithms vary in the context of the selected number of nearest neighbors, the
measures used to compute similarity between instances, and the solution algorithm for
predicting the class of a test instance. WEKA’s implementation of one instance-based
classifier, IB1, uses only one nearest neighbor to predict the class of a test instance. In this
study, the similarity measure used is Euclidean distance.
IBk is WEKA’s implementation of an instance-based learning technique with k nearest
neighbors. Selecting only one nearest neighbor to predict the class of a test instance,
especially in the presence of noise, may lead to increased inaccuracy (Witten and Frank
2000). In IBk, the class of the test case is predicted by majority voting of the k nearest
neighbors (Emam et al. 2001). Like IB1, the similarity measure used to determine the
nearest neighbors is Euclidean distance.
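The IBk idea can be rendered in a few lines (Euclidean distance, majority vote among the k nearest fit instances); this is an illustrative sketch, not WEKA's implementation:

```python
import math
from collections import Counter

def ibk_predict(fit_x, fit_y, query, k=3):
    """Classify `query` by majority vote of its k nearest neighbours in
    the fit data under Euclidean distance."""
    ranked = sorted(zip(fit_x, fit_y), key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Two clusters of modules; the query sits in the fault-prone cluster:
fit_x = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
fit_y = ["nfp", "nfp", "fp", "fp", "fp"]
print(ibk_predict(fit_x, fit_y, (7.5, 8.0), k=3))  # fp
```

Setting k = 1 reduces this to the IB1 behaviour described above.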
Alternating Decision Trees (ADTree) is a relatively new machine learning technique
proposed by Freund and Mason (1999). It combines the power of boosting and decision
trees in a very simple manner generalizing decision trees, voted decision trees, and voted
decision stumps. Since ADTree has alternating layers of decision nodes and prediction
nodes in its tree structure, it is called Alternating Decision Tree.
RIpple-DOwn Rule (Ridor) was introduced by Compton and Jansen (1990) as a
methodology for acquisition and maintenance of large rule-based systems. The basic idea
behind the technique is to make incremental changes while constructing and maintaining a
complex knowledge structure in a well-defined and restricted manner such that the effects
of the changes do not propagate globally and are well confined in the structure, unlike
standard production rules.
Locally Weighted Learning (LWL) is a non-adaptive lazy-learning technique that is
gaining popularity in the machine learning community. Local weighting reduces unnec-
essary bias of global function fitting, and gives more flexibility, retaining the desirable
properties such as smoothness and statistical analyzability (Atkeson et al. 1997). There are
mainly three elements for Locally Weighted Learning: distance function, separable cri-
terion, and sufficient data. LWL uses locally weighted training to combine training data,
using a distance function to fit a surface to nearby points. It must be used in conjunction
with another classifier to perform classification. In this study, Decision Stump is combined
with LWL (LWLStump).
Sequential Minimal Optimization (SMO), proposed by Platt (1998), is a conceptually simple but subtle algorithm for training support vector machines. Training a support vector machine requires solving a very large quadratic programming (QP) optimization problem. Instead of the traditional, time-consuming numerical QP optimization, SMO breaks the large QP problem into the smallest possible QP sub-problems, which are solved analytically at each step, using Osuna's theorem to ensure convergence. SMO is highly scalable, as its memory requirement grows linearly with the size of the training dataset.
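The smallest possible sub-problem involves just two Lagrange multipliers, and its closed-form solution is what makes the analytic step possible. In standard SMO notation (a sketch: $E_i$ is the prediction error on example $i$, $K$ the kernel, and $[L, H]$ the bounds implied by $0 \le \alpha \le C$ and the linear equality constraint):

```latex
\begin{aligned}
\eta &= K(x_1, x_1) + K(x_2, x_2) - 2\,K(x_1, x_2), \\
\alpha_2^{\mathrm{new}} &= \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta}, \qquad
\alpha_2^{\mathrm{clipped}} = \min\!\bigl(H,\ \max(L,\ \alpha_2^{\mathrm{new}})\bigr), \\
\alpha_1^{\mathrm{new}} &= \alpha_1 + y_1 y_2\,\bigl(\alpha_2 - \alpha_2^{\mathrm{clipped}}\bigr).
\end{aligned}
```

Each iteration updates only these two multipliers, so no QP library and no large matrix ever need to be held in memory.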
Bagging (BAG) combines classifiers by randomly re-sampling from the original training dataset, building a classifier for each re-sampled dataset, and using the prediction of each classifier in a simple vote to obtain the combined decision on the test data (Breiman 1996). The combined decision, or final hypothesis, for classification is obtained using an unweighted vote. Typically, an unstable or weak learner is used as the base classifier, so that small changes in the training data yield significantly different models.
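A minimal sketch of this procedure, using a decision stump as the weak base learner on one-dimensional data, might look as follows. The dataset, ensemble size, and stump learner are illustrative assumptions, not the study's setup.

```python
# Minimal bagging sketch: bootstrap-resample the training set, fit a
# decision stump per resample, and combine by unweighted majority vote.
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold/orientation with the fewest training errors."""
    labels = sorted({lab for _, lab in sample})
    if len(labels) == 1:  # degenerate bootstrap: constant classifier
        return lambda x, lab=labels[0]: lab
    best_err, best = float("inf"), None
    for t in sorted(x for x, _ in sample):
        for lo, hi in ((labels[0], labels[1]), (labels[1], labels[0])):
            err = sum((lo if x <= t else hi) != lab for x, lab in sample)
            if err < best_err:
                best_err, best = err, (t, lo, hi)
    t, lo, hi = best
    return lambda x: lo if x <= t else hi

def bagging(train, n_estimators=25, seed=0):
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(train) for _ in train])
              for _ in range(n_estimators)]
    def predict(x):  # unweighted majority vote over the ensemble
        return Counter(s(x) for s in stumps).most_common(1)[0][0]
    return predict
```

Because each stump sees a different bootstrap sample, the instability of the base learner produces diverse votes, which is what the unweighted majority exploits.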
Lines-of-Code (LOC) is one of the most commonly used software measures for representing the complexity of software program modules. In our study, the modules were first sorted in ascending order of their LOC. The underlying assumption is that a larger program module is likely to have more software faults than a relatively smaller one. Given a specific threshold value of lines of code, modules with LOC below the threshold are predicted as nfp, and the rest as fp. The threshold value is varied until the preferred balance of equality between the Type I and Type II error rates is obtained.
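The threshold sweep described above can be sketched directly; the module data below are hypothetical, and equality of the two error rates is approximated by minimizing their gap.

```python
# Sketch of the LOC-threshold classifier: sweep candidate thresholds and
# keep the one where Type I (nfp predicted fp) and Type II (fp predicted
# nfp) error rates are most nearly equal.

def pick_loc_threshold(modules):
    """modules: list of (loc, actual) with actual in {'nfp', 'fp'}."""
    nfp = [loc for loc, lab in modules if lab == "nfp"]
    fp = [loc for loc, lab in modules if lab == "fp"]
    best_gap, best_t = float("inf"), None
    for t in sorted(loc for loc, _ in modules):
        # Type I: nfp modules wrongly predicted fp (loc >= t).
        type1 = sum(loc >= t for loc in nfp) / len(nfp)
        # Type II: fp modules wrongly predicted nfp (loc < t).
        type2 = sum(loc < t for loc in fp) / len(fp)
        if abs(type1 - type2) < best_gap:
            best_gap, best_t = abs(type1 - type2), t
    return best_t

def predict(loc, threshold):
    return "nfp" if loc < threshold else "fp"
```

In practice the preferred balance need not be exact equality; a project could weight the two rates differently by replacing the gap criterion with a cost-weighted one.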
Logistic Regression (LR) is a statistical modeling technique that offers good model interpretability. Logistic Regression suits software quality modeling because most software engineering measures have a monotonic relationship with the faults inherent in the underlying software development processes (Fenton and Pfleeger 1997; Khoshgoftaar and Allen 1999).
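As a hedged sketch of the idea, a minimal one-feature logistic regression can be fitted by stochastic gradient ascent on the log-likelihood, mapping a single software metric to the probability that a module is fp. The data, learning rate, and epoch count below are illustrative, not the study's actual model.

```python
# Minimal one-feature logistic regression fitted by stochastic gradient
# ascent: p(fp | x) = sigmoid(w*x + b). Monotonic in the metric x,
# matching the assumption discussed above.
import math

def fit_logistic(data, lr=0.1, epochs=2000):
    """data: list of (x, y) with y = 1 for fp, 0 for nfp."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x  # gradient ascent on log-likelihood
            b += lr * (y - p)
    return w, b

def prob_fp(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The fitted coefficient w is directly interpretable: its sign and magnitude state how the odds of a fault-prone module change per unit of the metric, which is the interpretability advantage noted above.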
Tree Disc (TD) is a SAS macro implementation of the modified CHAID algorithm (Khoshgoftaar et al. 2000). It constructs a decision tree that predicts a specified categorical dependent variable from one or more predictor (independent) variables. The decision tree is computed by recursively partitioning the dataset into two or more subsets of observations, based on the categories of one of the predictor variables, until some stopping criterion is met. At each partition, the predictor variable most significantly associated with the dependent variable is selected; the level of association is measured by a chi-squared test of independence.
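The predictor-selection step can be sketched as computing a chi-squared statistic of independence between each categorical predictor and the class, then splitting on the most strongly associated one. The toy dataset and field names below are hypothetical; for simplicity the sketch compares raw statistics rather than p-values with degrees of freedom, as a full CHAID implementation would.

```python
# Sketch of TD's predictor selection: chi-squared statistic of
# independence between each categorical predictor and the class label;
# the predictor with the largest statistic is chosen for the split.
from collections import Counter

def chi_squared(pairs):
    """pairs: list of (predictor_category, class_label) tuples."""
    n = len(pairs)
    row = Counter(p for p, _ in pairs)   # predictor-category totals
    col = Counter(c for _, c in pairs)   # class-label totals
    cell = Counter(pairs)                # observed contingency counts
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat

def best_predictor(dataset, predictors, target):
    """dataset: list of dicts; returns the most associated predictor."""
    return max(predictors,
               key=lambda p: chi_squared([(row[p], row[target])
                                          for row in dataset]))
```

A perfectly associated predictor maximizes the statistic, while one independent of the class scores zero, so the `max` picks the split variable the paragraph describes.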
References
Alpaydin, E. (1997). Voting over multiple condensed nearest neighbors. Artificial Intelligence Review, 11(1–5), 115–132.
Alpaydin, E. (1998). Techniques for combining multiple learners. In E. Alpaydin (Ed.), Proceedings of engineering of intelligent systems conference (Vol. 2, pp. 6–12). ICSC Press.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5), 11–73.
Berenson, M. L., Levine, D. M., & Goldstein, M. (1983). Intermediate statistical methods and applications. Prentice-Hall.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Chen, K., Wang, L., & Chi, H. (1997). Methods of combining multiple classifiers with different features and their applications to text-independent speaker identification. International Journal of Pattern Recognition and Artificial Intelligence, 11(3), 417–445.
Cohen, W. W. (1995). Fast effective rule induction. In A. Prieditis & S. Russell (Eds.), Proceedings of the 12th international conference on machine learning (pp. 115–123). Tahoe City, CA: Morgan Kaufmann.
Compton, P., & Jansen, R. (1990). Knowledge in context: A strategy for expert system maintenance. In C. J. Barter & M. J. Brooks (Eds.), 2nd Australian joint artificial intelligence conference (pp. 292–306). Adelaide, Australia: Springer-Verlag.
Cuadrado-Gallego, J. J., Fernández-Sanz, L., & Sicilia, M. A. (2006). Enhancing input value selection in parametric software cost estimation models through second level cost drivers. Software Quality Journal, 14(4), 330–357.
Emam, K. E., Benlarbi, S., Goel, N., & Rai, S. N. (2001). Comparing case-based reasoning classifiers for predicting high-risk software components. Journal of Systems and Software, 55(3), 301–320. Elsevier Science Publishing.
Fenton, N. E., & Pfleeger, S. L. (1997). Software metrics: A rigorous and practical approach (2nd ed.). Boston, MA: PWS Publishing.
Frank, E., Trigg, L., Holmes, G., & Witten, I. H. (2000). Naive Bayes for regression. Machine Learning, 41(1), 5–25.
Frank, E., & Witten, I. H. (1998). Generating accurate rule sets without global optimization. In Proceedings of the 15th international conference on machine learning (pp. 144–151). Morgan Kaufmann.
Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In Proceedings of the 16th international conference on machine learning (pp. 124–133). Bled, Slovenia: Morgan Kaufmann.
Gaines, B. R., & Compton, P. (1995). Induction of ripple-down rules applied to modeling large databases. Journal of Intelligent Information Systems, 5(3), 211–228.
Gamberger, D., Lavrac, N., & Dzeroski, S. (1996). Noise elimination in inductive concept learning: A case study in medical diagnosis. In Algorithmic learning theory: Proceedings of the 7th international workshop (Vol. 1160, pp. 199–212). Sydney, Australia: Springer-Verlag.
Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993–1001.
Ho, T. K., Hull, J. J., & Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1).
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–91.
Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. John Wiley & Sons.
Khoshgoftaar, T. M., & Allen, E. B. (1999). Logistic regression modeling of software quality. International Journal of Reliability, Quality, and Safety Engineering, 6(4), 303–317.
Khoshgoftaar, T. M., Joshi, V., & Seliya, N. (2006). Noise elimination with ensemble-classifier noise filtering: Case studies in software engineering. International Journal of Software Engineering and Knowledge Engineering, 16(1), 1–24.
Khoshgoftaar, T. M., & Rebours, P. (2004). Generating multiple noise elimination filters with the ensemble-partitioning filter. In Proceedings of the 2004 IEEE international conference on information reuse and integration (pp. 369–375). Las Vegas, NV, November 2004.
Khoshgoftaar, T. M., & Seliya, N. (2004). The necessity of assuring quality in software measurement data. In Proceedings of the 10th international symposium on software metrics (pp. 119–130). Chicago, IL: IEEE Computer Society Press.
Khoshgoftaar, T. M., Yuan, X., & Allen, E. B. (2000). Balancing misclassification rates in classification tree models of software quality. Empirical Software Engineering, 5, 313–330.
Kohavi, R. (1995). The power of decision tables. In N. Lavrac & S. Wrobel (Eds.), Proceedings of the European conference on machine learning, Lecture Notes in Artificial Intelligence (pp. 174–189). Springer-Verlag.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation and active learning. In Advances in neural information processing systems (pp. 231–238). Cambridge, MA: MIT Press.
Mani, G. (1991). Lowering variance of decisions by using artificial neural network ensembles. Neural Computation, 3, 484–486.
Meir, R. (1995). Bias, variance and the combination of estimators: The case of linear least squares. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Menzies, T., Greenwald, J., & Frank, A. (2007). Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2–13.
Meulen, M. J., & Revilla, M. A. (2007). Correlations between internal software metrics and software dependability in a large population of small C/C++ programs. In Proceedings of the 18th IEEE international symposium on software reliability engineering, ISSRE 2007 (pp. 203–208). Trollhättan, Sweden, November 2007.
Munson, J. C., & Khoshgoftaar, T. M. (1992). The detection of fault-prone programs. IEEE Transactions on Software Engineering, 18(5).
Nikora, A. P., & Munson, J. C. (2003). Understanding the nature of software evolution. In Proceedings of the 2003 international conference on software maintenance (pp. 83–93), September 2003.
Nikora, A. P., & Munson, J. C. (2004). The effects of fault counting methods on fault model quality. In Proceedings of the 28th international computer software and applications conference, COMPSAC 2004 (Vol. 1, pp. 192–201), September 2004.
Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, Redmond, WA, April 1998.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Shepperd, M., & Kadoda, G. (2001). Comparing software prediction techniques using simulation. IEEE Transactions on Software Engineering, 27(11), 1014–1022.
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco, CA: Morgan Kaufmann.
Wohlin, C., Runeson, P., Host, M., Ohlsson, M. C., Regnell, B., & Wesslen, A. (2000). Experimentation in software engineering: An introduction. Kluwer international series in software engineering. Boston, MA: Kluwer Academic Publishers.
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5(2), 241–259.
Xing, F., Guo, P., & Lyu, M. R. (2005). A novel method for early software quality prediction based on support vector machine. In Proceedings of the 16th international symposium on software reliability engineering (p. 10), November 2005.
Yacoub, S., Lin, X., Simske, S., & Burns, J. (2003). Automating the analysis of voting systems. In 14th international symposium on software reliability engineering (p. 203). Denver, CO, November 2003.
Author Biographies
Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 350 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively, and is the Program Chair of the 20th International Conference on Software Engineering and Knowledge Engineering (2008). He has served on technical program committees of various international conferences, symposia, and workshops. He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.
Pierre Rebours received the M.S. degree in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in April 2004. His research interests include data quality and data mining.
Naeem Seliya is an Assistant Professor of Computer and Information Science at the University of Michigan-Dearborn. He received his Ph.D. in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in 2005. His research interests include software engineering, data mining and machine learning, software measurement, software reliability and quality engineering, software architecture, computer data security, and network intrusion detection. He is a member of the IEEE and the Association for Computing Machinery.