
Research Article
Statistical and Machine Learning Methods for Software Fault Prediction Using CK Metric Suite: A Comparative Analysis

Yeresime Suresh, Lov Kumar, and Santanu Ku. Rath

Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha 769008, India

Correspondence should be addressed to Yeresime Suresh; sureshvec04@gmail.com

Received 31 August 2013; Accepted 16 January 2014; Published 4 March 2014

Academic Editors: K. Framling, Z. Shen, and S. K. Shukla

Copyright © 2014 Yeresime Suresh et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Experimental validation of software metrics in fault prediction for object-oriented methods using statistical and machine learning methods is necessary. By the process of validation, the quality of a software product in a software organization is ensured. Object-oriented metrics play a crucial role in predicting faults. This paper examines the application of linear regression, logistic regression, and artificial neural network methods for software fault prediction using Chidamber and Kemerer (CK) metrics. Here, fault is considered as the dependent variable and the CK metric suite as the independent variables. Statistical methods such as linear regression and logistic regression, and machine learning methods such as neural networks (and their different forms), are applied for detecting faults associated with the classes. The comparison approach was applied to a case study, that is, Apache integration framework (AIF) version 1.6. The analysis highlights the significance of the weighted method per class (WMC) metric for fault classification, and it also shows that the hybrid approach of radial basis function network obtained a better fault prediction rate when compared with the other three neural network models.

1. Introduction

Present-day software development is mostly based on the object-oriented paradigm. The quality of object-oriented software can be best assessed by the use of software metrics. A number of metrics have been proposed by researchers and practitioners to evaluate the quality of software. These metrics help to verify quality attributes of software such as effort and fault proneness.

The usefulness of these metrics lies in their ability to predict the reliability of the developed software. In practice, software quality mainly refers to reliability, maintainability, and understandability. Reliability is generally measured by the number of faults found in the developed software. Software fault prediction before the software is released is a challenging task for researchers. Hence, accurate fault prediction is one of the major goals, so as to release software with the fewest possible faults.

This paper aims to assess the influence of CK metrics, with a view to predicting faults for an open-source software product. Statistical methods such as linear regression and logistic regression are used for classification of faulty classes. Machine learning algorithms such as artificial neural network (ANN), functional link artificial neural network (FLANN), and radial basis function network (RBFN) are applied for prediction of faults, and probabilistic neural network (PNN) is used for classification of faults. It is observed in the literature that metric suites have mostly been validated for small data sets. In this approach, the results achieved for an input data set of 965 classes were validated by comparing them with the results obtained by Basili et al. [1] for statistical analysis.

The rest of the paper is organized as follows. Section 2 summarizes software metrics and their usage in fault prediction. Section 3 highlights the research background. Section 4 describes the proposed work for fault prediction by applying various statistical and machine learning methods. Section 5 highlights the parameters used for evaluating the performance of each of the applied techniques. Section 6 presents the results and analysis of fault prediction. Section 7 concludes the paper with the scope for future work.



2. Related Work

This section presents a review of the literature on the use of software metrics and their application in fault prediction. The most commonly used metric suites indicating the quality of any software are those of McCabe [2], Halstead [3], Li and Henry [4], the CK metric suite [5], Abreu's MOOD metric suite [6], Lorenz and Kidd [7], Martin's metric suite [8], Tegarden et al. [9], Melo and Abreu [10], Briand et al. [11], Etzkorn et al. [12], and so forth. Out of these, the CK metric suite is observed to be used very often by the authors mentioned in Table 1 for predicting faults at the class level.

Basili et al. [1] experimentally analyzed the impact of the CK metric suite on fault prediction. Briand et al. [13] found the relationship between faults and the metrics using univariate and multivariate logistic regression models. Tang et al. [14] investigated the dependency between the CK metric suite and object-oriented system faults. Emam et al. [15] conducted an empirical validation on a Java application and found that export coupling has a great influence on faults. Khoshgoftaar et al. [16, 17] conducted an experimental analysis on a telecommunication model and found that an ANN model is more accurate than any discriminant model; in their approach, nine software metrics were used for modules developed in the procedural paradigm. Since then, the usage of ANN approaches in prediction modeling has risen.

3. Research Background

The following subsections highlight the data set used for fault prediction. Data are normalized to obtain better accuracy, and then dependent and independent variables are chosen for fault prediction.

3.1. Empirical Data Collection. Metric suites are used and defined for different goals such as fault prediction, effort estimation, reusability, and maintenance. In this paper, the most commonly used metric suite, that is, the CK metric suite [5], is used for fault prediction.

The CK metric suite consists of six metrics, namely, weighted method per class (WMC), depth of inheritance tree (DIT), number of children (NOC), coupling between objects (CBO), response for class (RFC), and lack of cohesion (LCOM) [5]. Table 2 gives a short note on the six CK metrics and the threshold for each of them.

The metric values of the suite are extracted using the Chidamber and Kemerer Java Metrics (CKJM) tool. The CKJM tool extracts object-oriented metrics by processing the byte code of compiled Java classes. This tool is used to extract metric values for three versions of the Apache integration framework (AIF, an open-source framework), available in the Promise data repository [18]. The versions of AIF used from the repository are developed in the Java language. The CK metric values of AIF are used for fault prediction.

3.2. Data Normalization. ANN models accept normalized data which lie in the range of 0 to 1.

Table 1: Fault prediction using CK metrics.

Author | Prediction technique
Basili et al. [1] | Multivariate logistic regression
Briand et al. [13] | Multivariate logistic regression
Kanmani and Rymend [29] | Regression neural network
Nagappan and Laurie [30] | Multiple linear regression
Olague et al. [31] | Multivariate logistic regression
Aggarwal et al. [32] | Statistical regression analysis
Wu [33] | Decision tree analysis
Kapila and Singh [34] | Bayesian inference

In the literature, it is observed that techniques such as Min-Max normalization, Z-Score normalization, and Decimal scaling are used for normalizing data. In this paper, the Min-Max normalization technique [19] is used to normalize the data.

Min-Max normalization performs a linear transformation on the original data. Each actual value $d$ of attribute $p$ is mapped to a normalized value $d'$, which lies in the range of 0 to 1. The Min-Max normalization is calculated using the equation

$$\text{Normalized}(d) = d' = \frac{d - \min(p)}{\max(p) - \min(p)}, \quad (1)$$

where $\min(p)$ and $\max(p)$ represent the minimum and maximum values of the attribute, respectively.
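For illustration, a minimal Python sketch of (1) is given below; the function name and the sample metric column are hypothetical, not values from AIF.

```python
import numpy as np

def min_max_normalize(values):
    """Map each value d of an attribute p to d' = (d - min(p)) / (max(p) - min(p)), as in (1)."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()
    return (v - lo) / (hi - lo)  # assumes max(p) > min(p); a constant column needs special handling

# Example: normalizing a hypothetical column of WMC metric values to the range 0..1
print(min_max_normalize([3, 12, 7, 166, 0]))
```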

3.3. Dependent and Independent Variables. The goal of this study is to explore the relationship between object-oriented metrics and fault proneness at the class level. In this paper, a fault in a class is considered the dependent variable, and each of the CK metrics is an independent variable. The intent is to develop a function between the faults of a class and the CK metrics (WMC, DIT, NOC, CBO, RFC, and LCOM). Fault is a function of WMC, DIT, NOC, CBO, RFC, and LCOM and can be represented as

$$\text{Faults} = f(\text{WMC}, \text{DIT}, \text{NOC}, \text{CBO}, \text{RFC}, \text{LCOM}). \quad (2)$$

4. Proposed Work for Fault Prediction

The following subsections highlight the various statistical and machine learning methods used for fault classification.

4.1. Statistical Methods. This section describes the application of statistical methods for fault prediction. Regression analysis methods such as linear regression and logistic regression analysis are applied. In regression analysis, the value of an unknown variable is predicted based on the value of one or more known variables.

4.1.1. Linear Regression Analysis. Linear regression is a statistical technique that establishes a linear (i.e., straight-line) relationship between variables. This technique is used when faults are distributed over a wide range of classes.


Table 2: CK metric suite.

CK metric | Description | Value
WMC | Sum of the complexities of all class methods | Low
DIT | Maximum length from the node to the root of the tree | < six
NOC | Number of immediate subclasses subordinate to a class in the class hierarchy | Low
CBO | Count of the number of other classes to which it is coupled | Low
RFC | A set of methods that can potentially be executed in response to a message received by an object of that class | Low
LCOM | Measures the dissimilarity of methods in a class via instanced variables | Low

Linear regression analysis is of two types:

(a) univariate linear regression, and
(b) multivariate linear regression.

Univariate linear regression is based on

$$Y = \beta_0 + \beta_1 X, \quad (3)$$

where $Y$ represents the dependent variable (accuracy rate in this case) and $X$ represents the independent variable (a CK metric in this case).

In the case of multivariate linear regression, the regression is based on

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p, \quad (4)$$

where $X_i$ are the independent variables, $\beta_0$ is a constant, and $Y$ is the dependent variable. Table 8 shows the result of linear regression analysis for the three versions of AIF.
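As an illustration of (4), the sketch below fits a multivariate linear model by ordinary least squares with NumPy; the metric matrix and fault counts are invented toy data, not values from AIF.

```python
import numpy as np

# Toy data: one row per class, columns = (WMC, DIT, NOC, CBO, RFC, LCOM)
X = np.array([[12, 2, 0, 7, 14, 4],
              [ 3, 1, 1, 2,  5, 0],
              [25, 3, 0, 9, 30, 60],
              [ 8, 1, 2, 4, 11, 2]], dtype=float)
y = np.array([2, 0, 3, 1], dtype=float)  # faults per class (illustrative)

# Fit Y = b0 + b1*X1 + ... + bp*Xp, i.e., equation (4), by least squares
A = np.column_stack([np.ones(len(X)), X])     # prepend the intercept column for b0
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # [b0, b1, ..., bp]
print("constant:", beta[0], "coefficients:", beta[1:])
```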

4.1.2. Logistic Regression Analysis. Logistic regression analysis is used for predicting the outcome of a dependent variable based on one or more independent variable(s). The dependent variable can take only two values, so the classes are divided into two groups: one group containing zero bugs and the other group having at least one bug.

Logistic regression analysis is of two types:

(a) univariate logistic regression, and
(b) multivariate logistic regression.

(a) Univariate Logistic Regression Analysis. Univariate logistic regression is carried out to find the impact of an individual metric on predicting the faults of a class. The univariate logistic regression is based on

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \quad (5)$$

where $x$ is an independent variable, and $\beta_0$ and $\beta_1$ represent the constant and coefficient values, respectively. The logit function can be developed as follows:

$$\text{logit}[\pi(x)] = \beta_0 + \beta_1 X, \quad (6)$$

where $\pi$ represents the probability of a fault found in the class during the validation phase.

The results of univariate logistic regression for AIF are tabulated in Table 9. The values of the obtained coefficients are the estimated regression coefficients. The probability of faults being detected for a class depends on the coefficient value (positive or negative); a higher coefficient value means a greater probability of a fault being detected. The significance of a coefficient value is determined by its $P$ value, assessed against the significance level ($\alpha$). The $R$ coefficient is the proportion of the total variation in the dependent variable explained by the regression model; a high value of $R$ indicates greater correlation between faults and the CK metrics.

(b) Multivariate Logistic Regression Analysis. Multivariate logistic regression is used to construct a prediction model for the fault proneness of classes. In this method, metrics are used in combination. The multivariate logistic regression model is based on the following equation:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p}}, \quad (7)$$

where $X_i$ are the independent variables, $\pi$ represents the probability of a fault found in the class during the validation phase, and $p$ represents the number of independent variables. The logit function can be formed as follows:

$$\text{logit}[\pi(x)] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p. \quad (8)$$

Equation (8) shows that logistic regression is really just a standard linear regression model in which the dichotomous outcome is transformed by the logit transform. The value of $\pi(x)$ lies in the range $0 < \pi(x) < 1$; after the logit transform, $\text{logit}[\pi(x)]$ lies in the range $-\infty < \text{logit}[\pi(x)] < +\infty$.
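A minimal sketch of the multivariate model (7) with a 0.5 classification threshold is shown below, assuming scikit-learn is available; the data are invented toy values, not the AIF data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: CK metric rows and a binary label (1 = class contains at least one bug)
X = np.array([[12, 2, 0, 7, 14, 4],
              [ 3, 1, 1, 2,  5, 0],
              [25, 3, 0, 9, 30, 60],
              [ 8, 1, 2, 4, 11, 2]], dtype=float)
y = np.array([1, 0, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)  # fits the coefficients of equation (7)
pi = model.predict_proba(X)[:, 1]                    # pi(x): probability that a class is faulty
print(np.where(pi > 0.5, "faulty", "not faulty"))    # 0.5 threshold, as used later in Section 6.4.2
```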

4.2. Machine Learning Methods. Besides the statistical approach, this paper also implements four machine learning techniques. These techniques are used to predict the accuracy rate of fault prediction using the CK metric suite.

This section gives a brief description of the basic structure and working of the machine learning methods applied for fault prediction.

4.2.1. Artificial Neural Network. Figure 1 shows the architecture of an ANN, which contains three layers, namely, the input layer, hidden layer, and output layer. The computational features involved in the ANN architecture can be applied well to fault prediction.


Figure 1: A typical FFNN (input layer, hidden layer, output layer).

In this paper, a linear activation function is used for the input layer; that is, the output of the input layer, $O_i$, equals the input of the input layer, $I_i$:

$$O_i = I_i. \quad (9)$$

For the hidden layer and output layer, the sigmoidal (squashed-S) function is used. The output of the hidden layer, $O_h$, for the input of the hidden layer, $I_h$, is represented as follows:

$$O_h = \frac{1}{1 + e^{-I_h}}. \quad (10)$$

The output of the output layer, $O_o$, for the input of the output layer, $I_o$, is represented as follows:

$$O_o = \frac{1}{1 + e^{-I_o}}. \quad (11)$$

A neural network can be represented as follows:

$$Y' = f(WX), \quad (12)$$

where $X$ is the input vector, $Y'$ is the output vector, and $W$ is the weight vector. The weight vector $W$ is updated in every iteration so as to reduce the mean square error (MSE) value. The MSE is formulated as follows:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y'_i - y_i)^2, \quad (13)$$

where $y$ is the actual output and $y'$ is the expected output. In the literature, different methods are available to update the weight vector $W$, such as the Gradient descent method, Newton's method, the Quasi-Newton method, the Gauss-Newton method, the Conjugate-gradient method, and the Levenberg Marquardt method. In this paper, the Gradient descent and Levenberg Marquardt methods are used for updating the weight vector $W$.

(a) Gradient Descent Method. Gradient descent is one of the methods for updating the weights during the learning phase [20]. The Gradient descent method uses the first-order derivative of the total error to find the minima in error space. Normally, the gradient vector $G$ is defined as the first-order derivative of the error function, where the error function is represented as follows:

$$E_k = \frac{1}{2}(T_k - O_k)^2, \quad (14)$$

and $G$ is given as

$$G = \frac{\partial}{\partial W}(E_k) = \frac{\partial}{\partial W}\left(\frac{1}{2}(T_k - O_k)^2\right). \quad (15)$$

After computing the value of the gradient vector $G$ in each iteration, the weight vector $W$ is updated as follows:

$$W_{k+1} = W_k - \alpha G_k, \quad (16)$$

where $W_{k+1}$ is the updated weight, $W_k$ is the current weight, $G_k$ is the gradient vector, and $\alpha$ is the learning parameter.
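The following sketch applies (14)-(16) to a single linear output neuron; the input pattern, target, and learning parameter are illustrative choices, not values from the experiments.

```python
import numpy as np

x = np.array([0.2, 0.7, 0.1])  # one normalized input pattern (illustrative)
T = 0.9                        # target output T_k
W = np.zeros(3)                # initial weight vector
alpha = 0.5                    # learning parameter

for k in range(200):
    O = W @ x                  # network output O_k
    G = -(T - O) * x           # gradient of E_k = (1/2)(T_k - O_k)^2 w.r.t. W, as in (15)
    W = W - alpha * G          # weight update of (16)
print(W @ x)                   # output approaches the target 0.9
```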

(b) Levenberg Marquardt (LM) Method. The LM method iteratively locates the minimum of a multivariate function that is expressed as the sum of squares of nonlinear real-valued functions [21, 22]. This method is used for updating the weights during the learning phase. The LM method is fast and stable in terms of its execution when compared with the Gradient descent method (the LM method is a combination of the steepest descent and Gauss-Newton methods). In the LM method, the weight vector $W$ is updated as follows:

$$W_{k+1} = W_k - (J_k^T J_k + \mu I)^{-1} J_k^T e_k, \quad (17)$$

where $W_{k+1}$ is the updated weight, $W_k$ is the current weight, $J$ is the Jacobian matrix, and $\mu$ is the combination coefficient; that is, when $\mu$ is very small the method acts as the Gauss-Newton method, and when $\mu$ is very large it acts as the Gradient descent method.

The Jacobian matrix is calculated as follows:

$$J = \begin{bmatrix}
\dfrac{\partial E_{11}}{\partial W_1} & \dfrac{\partial E_{11}}{\partial W_2} & \cdots & \dfrac{\partial E_{11}}{\partial W_N} \\
\dfrac{\partial E_{12}}{\partial W_1} & \dfrac{\partial E_{12}}{\partial W_2} & \cdots & \dfrac{\partial E_{12}}{\partial W_N} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial E_{PM}}{\partial W_1} & \dfrac{\partial E_{PM}}{\partial W_2} & \cdots & \dfrac{\partial E_{PM}}{\partial W_N}
\end{bmatrix}, \quad (18)$$

where $N$ is the number of weights, $P$ is the number of input patterns, and $M$ is the number of output patterns.
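A single LM update step of (17) can be sketched as below; the Jacobian and error vector are random placeholders standing in for the quantities of (18), and the helper name is ours.

```python
import numpy as np

def lm_step(W, J, e, mu):
    """One Levenberg Marquardt update: W - (J^T J + mu*I)^(-1) J^T e, as in (17)."""
    H = J.T @ J + mu * np.eye(W.size)       # damped Gauss-Newton approximation of the Hessian
    return W - np.linalg.solve(H, J.T @ e)

J = np.random.randn(4, 3)  # placeholder Jacobian: 4 error terms, 3 weights
e = np.random.randn(4)     # placeholder error vector e_k
print(lm_step(np.zeros(3), J, e, mu=1e-2))
```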

4.2.2. Functional Link Artificial Neural Network (FLANN). FLANN, initially proposed by Pao [23], is a flat network having a single layer; that is, the hidden layers are omitted. Input variables generated by linear links of the neural network are linearly weighed. Functional links act on elements of the input variables by generating a set of linearly independent functions; these links are evaluated as functions with the variables as the arguments. Figure 2 shows the single-layered architecture of FLANN.


Figure 2: Flat net structure of FLANN (functional expansion of the inputs, weighted summation, and adaptive algorithm driven by the error).

The FLANN architecture offers less computational overhead and higher convergence speed when compared with other ANN techniques.

Using FLANN, the output is calculated as follows:

$$\hat{y} = \sum_{i=1}^{n} W_i X_i, \quad (19)$$

where $\hat{y}$ is the predicted value, $W$ is the weight vector, and $X$ is the functional block, defined as follows:

$$X = [1, x_1, \sin(\pi x_1), \cos(\pi x_1), x_2, \sin(\pi x_2), \cos(\pi x_2), \ldots], \quad (20)$$

and the weights are updated as follows:

$$W_i(k+1) = W_i(k) + \alpha e_i(k) x_i(k), \quad (21)$$

with $\alpha$ as the learning rate and $e_i$ as the error value, formulated as follows:

$$e_i = y_i - \hat{y}_i, \quad (22)$$

where $y$ and $\hat{y}$ represent the actual and the obtained (predicted) values, respectively.
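A compact sketch of (19)-(22) for a two-feature input is given below; the expansion size, training data, and learning rate are illustrative choices.

```python
import numpy as np

def expand(x):
    """Functional expansion of (20) for a 2-feature input."""
    x1, x2 = x
    return np.array([1.0, x1, np.sin(np.pi * x1), np.cos(np.pi * x1),
                          x2, np.sin(np.pi * x2), np.cos(np.pi * x2)])

W = np.zeros(7)
alpha = 0.1                                   # learning rate
samples = [(np.array([0.2, 0.8]), 0.9),
           (np.array([0.6, 0.1]), 0.2)]       # toy (input, target) pairs
for _ in range(200):
    for x, y in samples:
        X = expand(x)
        y_hat = W @ X                         # predicted value, equation (19)
        e = y - y_hat                         # error, equation (22)
        W = W + alpha * e * X                 # weight update, equation (21)
print([round(W @ expand(x), 3) for x, _ in samples])  # close to the targets 0.9 and 0.2
```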

4.2.3. Radial Basis Function Network (RBFN). RBFN is a feed-forward neural network (FFNN) trained using a supervised training algorithm. RBFN is generally configured with a single hidden layer, where the activation function is chosen from a class of functions called basis functions.

RBFN contains three layers, namely, the input, hidden, and output layers. Figure 3 shows the structure of a typical RBFN in its basic form, involving three entirely different layers. The RBFN contains $h$ hidden centers, represented as $C_1, C_2, C_3, \ldots, C_h$.

Figure 3: RBFN network (input layer, hidden layer of radial basis functions, output layer).

The target output is computed as follows:

$$y' = \sum_{i=1}^{n} \phi_i W_i, \quad (23)$$

where $W_i$ is the weight of the $i$th center, $\phi$ is the radial function, and $y'$ is the target output. Table 3 shows the various radial functions available in the literature.

In this paper, the Gaussian function is used as the radial function, and the distance vector $z$ is calculated as follows:

$$z = \|x_j - c_j\|, \quad (24)$$

where $x_j$ is the input vector that lies in the receptive field for center $c_j$. In this paper, gradient descent learning and hybrid learning techniques are used for updating the weight and center, respectively.
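The forward pass of (23) and (24) with a Gaussian radial function can be sketched as follows; the centers, weights, and width are illustrative placeholders.

```python
import numpy as np

def rbfn_output(x, centers, weights, sigma=1.0):
    """Target output of (23), with z of (24) fed through the Gaussian phi of Table 3."""
    z = np.linalg.norm(x - centers, axis=1)        # distance of x to each hidden center
    phi = np.exp(-(z ** 2) / (2.0 * sigma ** 2))   # radial activations
    return phi @ weights                           # weighted sum -> target output y'

centers = np.random.rand(3, 6)                     # 3 hidden centers in a 6-metric input space
weights = np.array([0.5, -0.2, 0.8])
print(rbfn_output(np.random.rand(6), centers, weights))
```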


Table 3: Radial functions.

Radial function | Mathematical expression
Gaussian radial function | $\phi(z) = e^{-z^2/2\sigma^2}$
Thin plate spline | $\phi(z) = z^2 \log z$
Quadratic | $\phi(z) = (z^2 + r^2)^{1/2}$
Inverse quadratic | $\phi(z) = 1/(z^2 + r^2)^{1/2}$

The advantage of using RBFN lies in its training rate, which is faster when compared with propagation networks, and in its lower susceptibility to problems with nonstationary inputs.

(a) Gradient Descent Learning Technique. Gradient descent learning is a technique used for updating the weight $W$ and center $C$. The center $C$ in gradient learning is updated as

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 \frac{\partial E_k}{\partial C_{ij}}, \quad (25)$$

and the weight $W$ is updated as

$$W_i(k+1) = W_i(k) - \eta_2 \frac{\partial E_k}{\partial W_i}, \quad (26)$$

where $\eta_1$ and $\eta_2$ are the learning coefficients for updating the center and weight, respectively.

(b) Hybrid Learning Technique. In the hybrid learning technique, the radial functions relocate their centers in a self-organized manner, while the weights are updated using a learning algorithm. In this paper, the least mean square (LMS) algorithm is used for updating the weights, while a center is updated only when it satisfies the following conditions:

(a) the Euclidean distance between the input pattern and the nearest center is greater than the threshold value, and
(b) the MSE is greater than the desired accuracy.

After satisfying the above conditions, the Euclidean distance is used to find the center closest to $x$, and then the centers are updated as follows:

$$C_i(k+1) = C_i(k) + \alpha(x - C_i(k)). \quad (27)$$

After every update, the center moves closer to $x$.
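Condition (a) and the center update of (27) can be sketched as follows; the threshold and step size are illustrative, and the LMS weight update and the MSE condition (b) are omitted for brevity.

```python
import numpy as np

def hybrid_center_update(x, centers, alpha=0.2, threshold=0.5):
    """Move the center nearest to x toward x, per (27), when condition (a) holds."""
    dists = np.linalg.norm(x - centers, axis=1)
    i = np.argmin(dists)                        # nearest center
    if dists[i] > threshold:                    # condition (a): distance above the threshold
        centers[i] += alpha * (x - centers[i])  # equation (27)
    return centers

centers = np.random.rand(3, 6)                  # placeholder centers
print(hybrid_center_update(np.random.rand(6), centers))
```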

4.2.4. Probabilistic Neural Network (PNN). PNN was introduced by Specht [24]. It is a feed-forward neural network which has been basically derived from the Bayesian network and statistical algorithms.

In PNN, the network is organized as a multilayered feed-forward network with four layers: input, hidden, summation, and output. Figure 4 shows the basic architecture of PNN.

The input layer first computes the distance from the input vector to the training input vectors. The second layer consists of Gaussian functions formed using the given set of data points as centers.

Table 4: Confusion matrix to classify a class as faulty or not-faulty.

 | No (prediction) | Yes (prediction)
No (actual) | True negative (TN) | False positive (FP)
Yes (actual) | False negative (FN) | True positive (TP)

The summation layer sums up the contribution of each class of inputs and produces a net output, which is a vector of probabilities. The fourth layer determines the fault prediction rate.

The PNN technique is faster when compared to multilayer perceptron networks and is also more accurate. The major concern lies in finding an accurate smoothing parameter $\sigma$ to obtain better classification. The following function is used in the hidden layer:

$$\phi(z) = e^{-z^2/\sigma^2}, \quad (28)$$

where $z = \|x - c\|$; here $x$ is the input, $c$ is the center, and $z$ is the Euclidean distance between the center and the input vector.
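The layered computation of a PNN can be sketched as below, using the kernel of (28); the stored training patterns are random placeholders, the helper name is ours, and the default 1.7 mirrors the smoothing parameter found best in Section 6.6.

```python
import numpy as np

def pnn_classify(x, patterns_by_class, sigma=1.7):
    """Kernel of (28) per stored pattern, a per-class summation layer, then an arg-max output."""
    scores = {}
    for label, patterns in patterns_by_class.items():
        z = np.linalg.norm(x - patterns, axis=1)              # Euclidean distances to the patterns
        scores[label] = np.exp(-(z ** 2) / sigma ** 2).sum()  # summation layer
    return max(scores, key=scores.get)                        # output layer: winning class

faulty, clean = np.random.rand(5, 6), np.random.rand(5, 6)    # placeholder training patterns
print(pnn_classify(np.random.rand(6), {"faulty": faulty, "not faulty": clean}))
```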

5. Performance Evaluation Parameters

The following subsections give the basic definitions of the performance parameters used in the statistical and machine learning methods for fault prediction.

5.1. Statistical Analysis. The performance parameters for statistical analysis can be determined based on the confusion matrix [25], as shown in Table 4.

5.1.1. Precision. Precision is defined as the degree to which repeated measurements under unchanged conditions show the same results:

$$\text{Precision} = \frac{\text{TP}}{\text{FP} + \text{TP}}. \quad (29)$$

5.1.2. Correctness. Correctness, as defined by Briand et al. [13], is the ratio of the number of modules correctly classified as fault prone to the total number of modules classified as fault prone:

$$\text{Correctness} = \frac{\text{TP}}{\text{FP} + \text{TP}}. \quad (30)$$

5.1.3. Completeness. According to Briand et al. [13], completeness is the ratio of the number of faults in classes classified as fault prone to the total number of faults in the system:

$$\text{Completeness} = \frac{\text{TP}}{\text{FN} + \text{TP}}. \quad (31)$$


Figure 4: Basic structure of PNN (input layer, pattern layer (training set), summation layer, output layer with the maximum of the class scores).

5.1.4. Accuracy. Accuracy, as defined by Yaun et al. [26], is the proportion of predicted fault-prone modules that are inspected out of all modules:

$$\text{Accuracy} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{FP} + \text{FN} + \text{TP}}. \quad (32)$$

5.1.5. $R^2$ Statistic. $R^2$, also known as the coefficient of multiple determination, is a measure of the power of correlation between the predicted and actual number of faults [25]. The higher the value of this statistic, the more accurate the predicted model:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \quad (33)$$

where $y_i$ is the actual number of faults, $\hat{y}_i$ is the predicted number of faults, and $\bar{y}$ is the average number of faults.
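For illustration, (29)-(33) can be computed directly; the confusion-matrix counts below are those of Table 12 for AIF version 1.6, and the helper names are ours.

```python
import numpy as np

def confusion_metrics(tn, fp, fn, tp):
    """Equations (29)-(32), computed from the confusion matrix of Table 4."""
    precision    = tp / (fp + tp)                   # (29); correctness (30) shares this formula
    completeness = tp / (fn + tp)                   # (31)
    accuracy     = (tn + tp) / (tn + fp + fn + tp)  # (32)
    return precision, completeness, accuracy

def r_squared(y, y_pred):
    """Equation (33): coefficient of multiple determination."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return 1 - ((y - y_pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(confusion_metrics(tn=767, fp=10, fn=172, tp=16))  # accuracy ~ 0.8113, as in Section 6.4.2
```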

5.2. Machine Learning. Fault prediction accuracy for the four applied ANN techniques is determined using performance evaluation parameters such as mean absolute error (MAE), mean absolute relative error (MARE), root mean square error (RMSE), and standard error of the mean (SEM).

5.2.1. Mean Absolute Error (MAE). This performance parameter determines how close the predicted and actual fault (accuracy) rates are:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - y'_i|. \quad (34)$$

5.2.2. Mean Absolute Relative Error (MARE). Consider

$$\text{MARE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - y'_i|}{y_i}. \quad (35)$$

In (35), a numerical value of 0.05 is added to the denominator in order to avoid numerical overflow (division by zero). The modified MARE is formulated as

$$\text{MARE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - y'_i|}{y_i + 0.05}. \quad (36)$$

5.2.3. Root Mean Square Error (RMSE). This performance parameter determines the differences between the predicted and actual fault (accuracy) rates:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2}. \quad (37)$$

In (35), (36), and (37), $y_i$ is the actual value and $y'_i$ is the expected value.

5.2.4. Standard Error of the Mean (SEM). SEM measures the deviation of the predicted value from the actual fault (accuracy) rate:

$$\text{SEM} = \frac{\text{SD}}{\sqrt{n}}, \quad (38)$$

where SD is the sample standard deviation and $n$ is the number of samples.
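The four parameters of (34), (36), (37), and (38) can be sketched together as below; interpreting the SD of (38) as the sample standard deviation of the residuals is our assumption, since the text leaves it implicit.

```python
import numpy as np

def error_metrics(y, y_pred):
    """MAE (34), modified MARE (36), RMSE (37), and SEM (38)."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    n = y.size
    mae  = np.abs(y - y_pred).mean()
    mare = (np.abs(y - y_pred) / (y + 0.05)).mean()  # 0.05 in the denominator avoids division by zero
    rmse = np.sqrt(((y - y_pred) ** 2).mean())
    sem  = (y - y_pred).std(ddof=1) / np.sqrt(n)     # assumption: SD of the residuals
    return mae, mare, rmse, sem

print(error_metrics([1, 0, 2, 1], [0.8, 0.1, 1.7, 1.2]))
```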


Table 5: Distribution of bugs for AIF version 1.6.

Number of classes | Percentage of bugs | Number of associated bugs
777 | 80.5181 | 0
101 | 10.4663 | 1
32 | 3.3161 | 2
16 | 1.6580 | 3
14 | 1.4508 | 4
6 | 0.6218 | 5
2 | 0.2073 | 6
3 | 0.3109 | 7
5 | 0.5181 | 8
1 | 0.1036 | 9
1 | 0.1036 | 10
3 | 0.3109 | 11
1 | 0.1036 | 13
1 | 0.1036 | 17
1 | 0.1036 | 18
1 | 0.1036 | 28
965 | 100.00 | 142

Figure 5: WMC of AIF version 1.6.

6. Results and Analysis

In this section, the relationship between the values of the metrics and the faults found in a class is determined. In this approach, the comparative study involves using the six CK metrics as input nodes, with the achieved fault prediction rate as output. Fault prediction is performed for AIF version 1.6.

6.1. Fault Data. To perform statistical analysis, bugs were collected from the Promise data repository [18]. Table 5 shows the distribution of bugs based on the number of occurrences (in terms of the percentage of classes containing a given number of bugs) for AIF version 1.6.

AIF version 1.6 contains 965 classes, of which 777 classes contain zero bugs (80.5181%); 10.4663% of the classes contain at least one bug; 3.3161% contain a minimum of two bugs; 1.6580% contain three bugs; 1.4508% contain four bugs; 0.6218% contain five bugs; 0.2073% contain six bugs; 0.3109% contain seven or eleven bugs; 0.5181% contain eight bugs; and 0.1036% contain nine, thirteen, seventeen, eighteen, or twenty-eight bugs.

Figure 6: DIT of AIF version 1.6.

Figure 7: NOC of AIF version 1.6.

6.2. Metrics Data. The CK metric values for WMC, DIT, NOC, CBO, RFC, and LCOM for AIF version 1.6 are graphically represented in Figures 5, 6, 7, 8, 9, and 10, respectively.

6.3. Descriptive Statistics and Correlation Analysis. This subsection gives a comparative analysis of the fault data, the descriptive statistics of the classes, and the correlation among the six metrics with those of Basili et al. [1]. Basili et al. studied object-oriented systems written in the C++ language. They carried out an experiment in which they set up eight project groups, each consisting of three students. Each group had the same task of developing a small/medium-sized software system. Since all the necessary documentation (for instance, reports about faults and their fixes) was available, they could search for relationships between fault density and metrics. They used the same CK metric suite, and logistic regression was employed to analyze the relationship between the metrics and the fault proneness of classes.

The obtained CK metric values of AIF version 1.6 are compared with the results of Basili et al. [1]. In comparison with Basili, the total number of classes considered here is much greater: 965 classes were considered (vs. 180). Table 6 shows the comparative statistical analysis results obtained for Basili et al. [1] and AIF version 1.6 for the CK metrics, indicating Max, Min, Median, Mean, and Standard Deviation.


Table 6: Descriptive statistics of classes.

 | WMC | DIT | NOC | CBO | RFC | LCOM
Basili et al. [1]
Max | 99.00 | 9.00 | 105.00 | 13.00 | 30.00 | 426.00
Min | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Median | 9.50 | 0.00 | 19.50 | 0.00 | 5.00 | 0.00
Mean | 13.40 | 1.32 | 33.91 | 0.23 | 6.80 | 9.70
Std. Dev. | 14.90 | 1.99 | 33.37 | 1.54 | 7.56 | 63.77
AIF version 1.6
Max | 166.00 | 6.00 | 39.00 | 448.00 | 322.00 | 13617
Min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Median | 5.00 | 1.00 | 0.00 | 7.00 | 14.00 | 4.00
Mean | 8.57 | 1.95 | 0.052 | 11.10 | 21.42 | 79.33
Std. Dev. | 11.20 | 1.27 | 2.63 | 22.52 | 25.00 | 523.75

Figure 8: CBO of AIF version 1.6.

Figure 9: RFC of AIF version 1.6.

The dependency between the CK metrics is computed using Pearson's correlations ($R^2$: coefficient of determination) and compared with Basili et al. [1] for AIF version 1.6. The coefficient of determination $R^2$ is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable; it is a measure that allows a researcher to determine how certain one can be in making predictions from a certain model/graph. Table 7 shows the Pearson's correlations for the data set used by Basili et al. [1] and the correlation metrics of AIF version 1.6.

Figure 10: LCOM of AIF version 1.6.

From Table 7, with respect to AIF version 1.6, it is observed that the correlation between WMC and RFC is 0.77, which is highly correlated; that is, these two metrics are very much linearly dependent on each other. Similarly, the correlation between WMC and DIT is 0, which indicates that they are loosely correlated; that is, there is no dependency between these two metrics.

6.4. Fault Prediction Using Statistical Methods

6.4.1. Linear Regression Analysis. Table 8 shows the results obtained for linear regression analysis, in which fault is considered the dependent variable and the CK metrics are the independent variables.

$R$ represents the coefficient of correlation and $P$ refers to the significance of the metric value; if $P < 0.001$, the metric is of very great significance in fault prediction.

6.4.2. Logistic Regression Analysis. The logistic regression method helps to indicate whether a class is faulty or not but does not convey anything about the possible number of faults in the class. Univariate and multivariate logistic regression techniques are applied to predict whether a class is faulty or not.


Table 7: Correlations between metrics.

 | WMC | DIT | NOC | CBO | RFC | LCOM
Basili et al. [1]
WMC | 1.00 | 0.02 | 0.24 | 0.00 | 0.13 | 0.38
DIT | | 1.00 | 0.00 | 0.00 | 0.00 | 0.01
NOC | | | 1.00 | 0.00 | 0.00 | 0.00
CBO | | | | 1.00 | 0.31 | 0.01
RFC | | | | | 1.00 | 0.09
LCOM | | | | | | 1.00
AIF version 1.6
WMC | 1.00 | 0.00 | 0.03 | 0.10 | 0.77 | 0.60
DIT | | 1.00 | 0.00 | 0.00 | 0.00 | 0.01
NOC | | | 1.00 | 0.024 | 0.025 | 0.027
CBO | | | | 1.00 | 0.08 | 0.05
RFC | | | | | 1.00 | 0.42
LCOM | | | | | | 1.00

Table 8: Linear regression analysis.

Version | R | P value | Std. error
1.2 | 0.5360 | 0.000 | 0.1114
1.4 | 0.5024 | 0.000 | 0.1450
1.6 | 0.5154 | 0.000 | 0.0834

Figure 11: Logistic graph: $1/(1 + \exp(-q))$.

Univariate regression analysis is used to examine the effect of each individual metric on the faults of a class, while multivariate regression analysis is used to examine the combined effect of the metrics. The results for the three versions of AIF are compared for these two statistical techniques. Figure 11 shows the typical "S" curve (similar to the Sigmoid function) obtained for AIF version 1.6 using multivariate logistic regression. Tables 9 and 10 contain the tabulated values of the results obtained by applying univariate and multivariate regression analysis, respectively.

From Table 9 it can be observed that all the metrics of the CK suite are highly significant except DIT, whose P values for the three versions are 0.335, 0.108, and 0.3257, respectively; higher values of P indicate less significance.

Univariate and multivariate logistic regression were used for classifying a class as faulty or not faulty. Logistic regression was applied with a threshold value of 0.5; that is, $\pi > 0.5$ indicates that a class is classified as "faulty"; otherwise it is categorized as a "not faulty" class.

Tables 11 and 12 show the confusion matrices for the number of classes with faults before and after applying regression analysis, respectively, for AIF version 1.6. From Table 11 it is clear that, before applying logistic regression, 777 classes contained zero bugs and 188 classes contained at least one bug. After applying logistic regression (Table 12), a total of 767 + 16 classes are classified correctly, an accuracy of 81.13%.

The performance parameters of all three versions of AIF, obtained by applying univariate and multivariate logistic regression analysis, are shown in Table 13; here precision, correctness, completeness, and accuracy [1, 13, 27, 28] are taken as the performance parameters. Using multivariate logistic regression, the accuracy of AIF version 1.2 is found to be 64.44%, that of AIF version 1.4 is 83.37%, and that of AIF version 1.6 is 81.13%.

From the results obtained by applying linear and logistic regression analysis, it is found that, out of the six metrics, WMC appears to have the greatest impact in predicting faults.

6.5. Fault Prediction Using Neural Networks

6.5.1. Artificial Neural Network. An ANN is an interconnected group of nodes. In this paper, a three-layer ANN is considered, in which six nodes act as input nodes, nine nodes represent the hidden nodes, and one node acts as the output node.

The ANN workflow has three phases, used for learning, validation, and testing. In this article, 70% of the total input patterns are used for the learning phase, 15% for validation, and the remaining 15% for testing. The regression analysis carried out classifies whether a class is faulty or not faulty; the prediction models of ANN and its forms, such as PNN, RBFN, and FLANN, not only classify a class as faulty or not faulty but also highlight the number of bugs found in the class, and these bugs are fixed in the testing phase of the software development life cycle.
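The 70/15/15 partition described above can be sketched as below; the shuffling seed, helper name, and placeholder data are ours.

```python
import numpy as np

def split_patterns(X, y, seed=0):
    """Shuffle the input patterns and split 70% / 15% / 15% for learning, validation, testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n1, n2 = int(0.70 * len(X)), int(0.85 * len(X))
    learn, valid, test = idx[:n1], idx[n1:n2], idx[n2:]
    return (X[learn], y[learn]), (X[valid], y[valid]), (X[test], y[test])

X, y = np.random.rand(965, 6), np.random.randint(0, 2, 965)  # placeholder data sized like AIF 1.6
train, val, test = split_patterns(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 675, 145, 145
```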


Table 9: Analysis of univariate regression (values for ver 1.2 / ver 1.4 / ver 1.6).

 | Coefficient | Constant | P value | R value
WMC | 0.028 / 0.05 / 0.03 | -0.83 / -2.11 / -1.77 | 0.0013 / 0.0007 / 0.00 | 0.130 / 0.240 / 0.18
DIT | -0.067 / 0.10 / 0.05 | -0.46 / -1.83 / -1.53 | 0.335 / 0.108 / 0.3257 | -0.039 / 0.054 / 0.02
NOC | 0.137 / 0.09 / 0.13 | -0.66 / -1.67 / -1.50 | 0.0007 / 0.00 / 0.00 | 0.136 / 0.13 / 0.16
CBO | 0.011 / 0.01 / 0.02 | -0.71 / -1.80 / -1.66 | 0.017 / 0.00 / 0.00 | 0.096 / 0.15 / 0.17
RFC | 0.012 / 0.02 / 0.01 | -0.86 / -2.15 / -1.79 | 0.0014 / 0.00 / 0.00 | 0.130 / 0.23 / 0.17
LCOM | 0.007 / 0.007 / 0.007 | -0.64 / -1.67 / -1.48 | 0.0349 / 0.0004 / 0.0007 | 0.085 / 0.11 / 0.11

Table 10: Multivariate logistic regression analysis.

Coefficient | AIF version 1.2 | AIF version 1.4 | AIF version 1.6
WMC | 0.0195 | 0.0574 | 0.0320
DIT | -0.041 | 0.000 | 0.000
NOC | 0.1231 | 0.000 | 0.000
CBO | 0.005 | 0.008 | 0.001
RFC | 0.0071 | 0.0081 | 0.0109
LCOM | 0 | -0.001 | 0
Constant | -0.917 | -2.785 | -2.157

Table 11: Before applying regression.

 | Not-faulty | Faulty
Not-faulty | 777 | 0
Faulty | 188 | 0

Table 12: After applying regression.

 | Not-faulty | Faulty
Not-faulty | 767 | 10
Faulty | 172 | 16

In this paper, the six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the Gradient descent method and the Levenberg Marquardt method.

(a) Gradient Descent Method. The Gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics of AIF version 1.6. Figure 12 plots the variation of the mean square error values with respect to the number of epochs (iterations) for AIF version 1.6.

(b) Levenberg Marquardt Method. The Levenberg Marquardt method [21, 22] is a technique for updating weights. In the Gradient descent method the learning rate $\alpha$ is constant, whereas in the Levenberg Marquardt method the learning rate varies in every iteration, so this method needs fewer iterations

Figure 12: MSE versus number of epochs w.r.t. Gradient descent NN.

Figure 13: MSE versus number of epochs w.r.t. Levenberg Marquardt NN.

to train the network. Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg Marquardt method.

Figure 13 plots the variation of the mean square error values with respect to the number of epochs for AIF version 1.6.


Table 13: Precision, correctness, completeness, and accuracy for three versions of AIF (values for ver 1.2 / ver 1.4 / ver 1.6).

 | Precision (%) | Correctness (%) | Completeness (%) | Accuracy (%)
WMC | 61.11 / 41.17 / 57.14 | 61.11 / 41.17 / 57.14 | 5.09 / 4.82 / 4.25 | 66.13 / 84.02 / 81.71
DIT | — / — / — | — / — / — | 0 / 0 / 0 | 64.47 / 83.37 / 80.51
NOC | 75 / 75 / 66.66 | 75 / 75 / 66.66 | 5.55 / 2.06 / 5.31 | 65.78 / 83.6 / 81.03
CBO | 60 / 57.14 / 77.77 | 60 / 57.14 / 77.77 | 2.77 / 2.75 / 3.72 | 64.8 / 83.48 / 81.03
RFC | 66.66 / 36.36 / 50 | 66.66 / 36.36 / 50 | 4.62 / 2.75 / 2.12 | 65.29 / 83.02 / 80.51
LCOM | 66.66 / 50 / 60 | 0.66 / 0.5 / 0.6 | 2.77 / 6.8 / 1.59 | 64.96 / 83.37 / 80.62
MULTI | 68.75 / 50 / 61.53 | 68.75 / 50 / 61.53 | 10.18 / 7.58 / 8.51 | 66.44 / 83.37 / 81.13

Table 14: Accuracy prediction using Gradient descent NN.

MAE | MARE | RMSE | R | P value | Std. error | Accuracy (%)
0.0594 | 1.0930 | 0.0617 | -0.2038 | 0.0044 | 0.0048 | 94.0437

Table 15: Accuracy prediction using Levenberg Marquardt.

MAE | MARE | RMSE | R | P value | Std. error | Accuracy (%)
0.0023 | 1.1203 | 0.0308 | -0.2189 | 0.0022 | 0.0041 | 90.4977

Table 16: Accuracy prediction using FLANN.

MAE | MARE | RMSE | R | P value | Std. error | Accuracy (%)
0.0304 | 0.7097 | 0.0390 | 0.3308 | 2.4601e-06 | 0.0050 | 96.3769

Table 17: Accuracy prediction using basic RBFN.

MAE | MARE | RMSE | R | P value | Std. error | Accuracy (%)
0.0279 | 0.3875 | 0.0573 | 0.1969 | 0.059 | 0.006 | 97.2792

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has less computational cost. In this paper, the adaptive algorithm of (21) has been used for updating the weights. Figure 14 shows the variation of the mean square values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.

6.5.3. Radial Basis Function Network. In this paper, the Gaussian function is used as the radial function, and the gradient descent learning and hybrid learning methods are used for updating the centers and weights, respectively.

A three-layered RBFN is considered, in which the six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplification, (25) is represented as

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 (y' - y) W_i \frac{\phi_i}{\sigma^2}\left(x_j - C_{ij}(k)\right), \quad (39)$$

and the modified (26) is formulated as

$$W_i(k+1) = W_i(k) + \eta_2 (y' - y)\phi_i, \quad (40)$$

where $\sigma$ is the width of the center and $k$ is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE with respect to the number of epochs.

(b) Hybrid Learning Method. In the hybrid learning method, centers are updated using (27), while weights are updated using a supervised learning method. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance matrix for AIF version 1.6. Figure 16 shows the variation of MSE versus the number of epochs.

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: input, hidden, summation, and output.

In PNN, 50% of the faulty and nonfaulty classes are taken as input for the hidden layer. The Gaussian function of (28) is used as the hidden node function.


Table 18: Accuracy prediction using RBFN gradient.

MAE | MARE | RMSE | R | P value | Std. error | Accuracy (%)
0.0207 | 0.2316 | 0.0323 | 0.3041 | 1.6302e-05 | 0.0041 | 97.2475

Table 19: Accuracy prediction using hybrid RBFN.

MAE | MARE | RMSE | R | P value | Std. error | Accuracy (%)
0.0614 | 0.1032 | 0.0316 | 0.9184 | 3.1834e-79 | 0.0013 | 98.4783

Figure 14: MSE versus number of iterations (epochs) w.r.t. FLANN.

Figure 15: MSE versus number of epochs w.r.t. gradient RBFN.

The summation layer sums the contributions of each class of input patterns and produces a net output, which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.

6.6. Comparison. Table 20 tabulates the obtained performance parameter values, number of epochs, and accuracy rate for the applied neural network techniques; this performance table is an indication of the better fault prediction model. In this comparative analysis, the mean square error (MSE) was taken as the criterion for computing the performance parameters (such as MARE, RMSE, number of epochs, and accuracy rate) when the four neural network techniques were applied. During this process, an MSE value of 0.002 was set as the threshold for evaluation. Based on the number of iterations and the accuracy rate obtained by the respective NN techniques, the best prediction model was determined.

Figure 16: MSE versus number of epochs w.r.t. hybrid RBFN.

Figure 17: Accuracy rate versus smoothing parameter.

From Table 20 it is evident that the Gradient descent NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, which is an improvised model of ANN, obtained a 90.4% accuracy rate; this is less than that of the Gradient descent NN, but the LM method took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely, the basic, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained the better prediction rate of 98.47% in only 14 epochs, when compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved; FLANN reached this accuracy in 66 epochs, as it has no hidden layer in its architecture.


Table 20: Performance metrics.

AI technique | Epochs | MAE | MARE | RMSE | Std. error | Accuracy (%)
Gradient descent | 162 | 0.0594 | 1.0930 | 0.0617 | 0.0048 | 94.04
LM | 13 | 0.0023 | 1.1203 | 0.0308 | 0.0041 | 90.49
RBFN basic | — | 0.0279 | 0.3875 | 0.0573 | 0.006 | 97.27
RBFN gradient | 41 | 0.0207 | 0.2316 | 0.0323 | 0.0041 | 97.24
RBFN hybrid | 14 | 0.0614 | 0.1032 | 0.0316 | 0.0013 | 98.47
FLANN | 66 | 0.0304 | 0.7097 | 0.0390 | 0.0050 | 96.37

The performance of PNN is shown in Figure 17; the highest prediction accuracy was obtained for a smoothing parameter value of 1.7, with a classification rate of 86.41%.

RBFN using the hybrid learning model gives the lowest values for MAE, MARE, and RMSE and a high accuracy rate. Hence, from the results obtained using the ANN techniques, it can be concluded that the RBFN hybrid approach obtained the best fault prediction rate in the fewest epochs when compared with the other three ANN techniques.

7. Conclusion

System analysts' use of prediction models to classify fault-prone classes as faulty or not faulty is the need of the day for researchers as well as practitioners, so more reliable approaches for prediction need to be modeled. In this paper, two families of approaches, namely, statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data is only practicable with the help of a good prediction model.

This paper presents a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Machine learning techniques such as artificial neural network (Gradient descent and Levenberg Marquardt methods), functional link artificial neural network, radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and probabilistic neural network were also applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, out of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in fewer epochs (14 iterations) when compared with the other three neural network techniques.

In the future, this work should be replicated on other open-source projects, such as Mozilla, using different AI techniques, to analyze which model performs better in achieving higher accuracy for fault prediction. Fault prediction accuracy should also be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.
[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, 1976.
[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.
[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52–60, 1993.
[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.
[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1–8, McLean, Va, USA, October 1994.
[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.
[8] R. Martin, "OO design quality metrics: an analysis of dependencies," in Proceedings of the Workshop Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.
[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241–262, 1995.
[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90–99, Berlin, Germany, March 1996.
[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the IEEE 19th International Conference on Software Engineering, Association for Computing Machinery, pp. 412–421, May 1997.
[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35–40, 1999.
[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.
[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242–249, November 1999.
[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.
[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902–909, 1997.
[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13–26, November 1997.
[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.
[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45–50, 2011.
[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.
[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.
[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.
[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, 1990.
[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.
[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "Application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSET '00), pp. 85–91, March 2000.
[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.
[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395–417, 2003.
[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1–6, 2004.
[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1–7, St. Louis, Mo, USA, 2005.
[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402–419, 2007.
[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39–62, 2009.
[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168–175, 2011.
[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1–4, 2013.



2. Related Work

This section presents a review of the literature on the use of software metrics and their application in fault prediction. The most commonly used metric suites indicating the quality of any software are those of McCabe [2], Halstead [3], Li and Henry [4], the CK metric suite [5], Abreu's MOOD metric suite [6], Lorenz and Kidd [7], Martin's metric suite [8], Tegarden et al. [9], Melo and Abreu [10], Briand et al. [11], Etzkorn et al. [12], and so forth. Out of these metrics, the CK metric suite is observed to be used most often by the authors listed in Table 1 for predicting faults at the class level.

Basili et al. [1] experimentally analyzed the impact of the CK metric suite on fault prediction. Briand et al. [13] found the relationship between faults and the metrics using univariate and multivariate logistic regression models. Tang et al. [14] investigated the dependency between the CK metric suite and object-oriented system faults. El Emam et al. [15] conducted an empirical validation on a Java application and found that export coupling has a great influence on faults. Khoshgoftaar et al. [16, 17] conducted an experimental analysis on a telecommunication model and found that an ANN model is more accurate than any discriminant model; in their approach, nine software metrics were used for modules developed in the procedural paradigm. Since then, the use of ANN approaches for prediction modeling has risen.

3. Research Background

The following subsections highlight the data set being used for fault prediction. Data are normalized to obtain better accuracy, and then dependent and independent variables are chosen for fault prediction.

3.1. Empirical Data Collection. Metric suites are used and defined for different goals such as fault prediction, effort estimation, reusability, and maintenance. In this paper, the most commonly used metric suite, that is, the CK metric suite [5], is used for fault prediction.

The CK metric suite consists of six metrics, namely, weighted method per class (WMC), depth of inheritance tree (DIT), number of children (NOC), coupling between objects (CBO), response for class (RFC), and lack of cohesion (LCOM) [5]. Table 2 gives a short note on the six CK metrics and the threshold for each of them.

The metric values of the suite are extracted using the Chidamber and Kemerer Java Metrics (CKJM) tool. The CKJM tool extracts object-oriented metrics by processing the byte code of compiled Java classes. This tool is used to extract metric values for three versions of the Apache integration framework (AIF, an open-source framework) available in the Promise data repository [18]. The versions of AIF used from the repository are developed in the Java language. The CK metric values of AIF are used for fault prediction.

3.2. Data Normalization. ANN models accept normalized data, which lie in the range of 0 to 1.

Table 1: Fault prediction using CK metrics.

Author                    | Prediction technique
Basili et al. [1]         | Multivariate logistic regression
Briand et al. [13]        | Multivariate logistic regression
Kanmani and Rymend [29]   | Regression neural network
Nagappan and Laurie [30]  | Multiple linear regression
Olague et al. [31]        | Multivariate logistic regression
Aggarwal et al. [32]      | Statistical regression analysis
Wu [33]                   | Decision tree analysis
Kapila and Singh [34]     | Bayesian inference

In the literature, it is observed that techniques such as Min-Max normalization, Z-Score normalization, and decimal scaling are used for normalizing data. In this paper, the Min-Max normalization technique [19] is used to normalize the data.

Min-Max normalization performs a linear transformation on the original data. Each actual value $d$ of attribute $p$ is mapped to a normalized value $d'$, which lies in the range of 0 to 1. Min-Max normalization is calculated using the equation

$$\text{Normalized}(d) = d' = \frac{d - \min(p)}{\max(p) - \min(p)}, \quad (1)$$

where $\min(p)$ and $\max(p)$ represent the minimum and maximum values of the attribute, respectively.
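As a concrete illustration, a minimal Python sketch of this transformation follows (the study itself was coded in MATLAB, so this is not the authors' implementation; the sample values are hypothetical):

import numpy as np

def min_max_normalize(values):
    # Map each value d of a metric column to d' in [0, 1], as in (1).
    # Assumes the column is not constant (a constant column would give
    # a zero denominator).
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# Hypothetical WMC column (not taken from the AIF data set):
print(min_max_normalize([0, 5, 12, 166]))   # -> [0. 0.0301 0.0723 1.]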

3.3. Dependent and Independent Variables. The goal of this study is to explore the relationship between object-oriented metrics and fault proneness at the class level. In this paper, a fault in a class is considered the dependent variable, and each of the CK metrics is an independent variable. It is intended to develop a function between the fault of a class and the CK metrics (WMC, DIT, NOC, CBO, RFC, and LCOM). Fault is a function of WMC, DIT, NOC, CBO, RFC, and LCOM and can be represented as

$$\text{Faults} = f(\text{WMC}, \text{DIT}, \text{NOC}, \text{CBO}, \text{RFC}, \text{LCOM}). \quad (2)$$

4. Proposed Work for Fault Prediction

The following subsections highlight the various statistical and machine learning methods used for fault classification.

4.1. Statistical Methods. This section describes the application of statistical methods for fault prediction. Regression analysis methods such as linear regression and logistic regression analysis are applied. In regression analysis, the value of an unknown variable is predicted based on the value of one or more known variables.

4.1.1. Linear Regression Analysis. Linear regression is a statistical technique that establishes a linear (i.e., straight-line) relationship between variables. This technique is used when faults are distributed over a wide range of classes.


Table 2: CK metric suite.

CK metric | Description | Value
WMC  | Sum of the complexities of all class methods | Low
DIT  | Maximum length from the node to the root of the tree | < six
NOC  | Number of immediate subclasses subordinate to a class in the class hierarchy | Low
CBO  | Count of the number of other classes to which a class is coupled | Low
RFC  | A set of methods that can potentially be executed in response to a message received by an object of that class | Low
LCOM | Measures the dissimilarity of methods in a class via instance variables | Low

Linear regression analysis is of two types:

(a) univariate linear regression and
(b) multivariate linear regression.

Univariate linear regression is based on

$$Y = \beta_0 + \beta_1 X, \quad (3)$$

where $Y$ represents the dependent variable (the accuracy rate in this case) and $X$ represents the independent variable (a CK metric in this case).

In the case of multivariate linear regression, the model is based on

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p, \quad (4)$$

where the $X_i$ are the independent variables, $\beta_0$ is a constant, and $Y$ is the dependent variable. Table 8 shows the results of linear regression analysis for the three versions of AIF.
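For illustration only, a small Python sketch of fitting the multivariate model (4) by ordinary least squares is given below; the data arrays are random placeholders standing in for the normalized CK metric matrix and the fault data, not the actual AIF measurements:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((965, 6))   # placeholder for normalized WMC, DIT, NOC, CBO, RFC, LCOM
y = rng.random(965)        # placeholder for the fault data

A = np.column_stack([np.ones(len(X)), X])      # prepend a column for beta_0
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # beta = [b0, b1, ..., b6]
y_hat = A @ beta                               # fitted values of the model (4)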

4.1.2. Logistic Regression Analysis. Logistic regression analysis is used for predicting the outcome of a dependent variable based on one or more independent variables. The dependent variable can take only two values, so the classes are divided into two groups: one group containing zero bugs and the other group having at least one bug.

Logistic regression analysis is of two types:

(a) univariate logistic regression and
(b) multivariate logistic regression.

(a) Univariate Logistic Regression Analysis. Univariate logistic regression is carried out to find the impact of an individual metric on predicting the faults of a class. Univariate logistic regression is based on

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \quad (5)$$

where $x$ is an independent variable and $\beta_0$ and $\beta_1$ represent the constant and coefficient values, respectively. The logit function can be developed as follows:

$$\text{logit}[\pi(x)] = \beta_0 + \beta_1 X, \quad (6)$$

where $\pi$ represents the probability of a fault being found in the class during the validation phase.

The results of univariate logistic regression for AIF are tabulated in Table 9. The obtained coefficient values are the estimated regression coefficients. The probability of faults being detected for a class depends on the coefficient value (positive or negative): a higher coefficient value means a greater probability of a fault being detected. The significance of a coefficient value is determined by its $P$ value, assessed against the significance level ($\alpha$). The $R$ coefficient is the proportion of the total variation in the dependent variable explained by the regression model; a high value of $R$ indicates greater correlation between faults and the CK metrics.

(b) Multivariate Logistic Regression Analysis. Multivariate logistic regression is used to construct a prediction model for the fault proneness of classes. In this method, the metrics are used in combination. The multivariate logistic regression model is based on the equation

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p}}, \quad (7)$$

where the $X_i$ are the independent variables, $\pi$ represents the probability of a fault being found in the class during the validation phase, and $p$ represents the number of independent variables. The logit function can be formed as follows:

$$\text{logit}[\pi(x)] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p. \quad (8)$$

Equation (8) shows that logistic regression is really just a standard linear regression model in which the dichotomous outcome is transformed by the logit transform. The value of $\pi(x)$ lies in the range $0 < \pi(x) < 1$; after the logit transform, $\text{logit}[\pi(x)]$ lies in the range $-\infty < \text{logit}[\pi(x)] < +\infty$.
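To make the classification rule concrete, the following Python sketch evaluates the multivariate model (7) using the AIF version 1.6 coefficients reported later in Table 10; the sample metric vector is hypothetical:

import numpy as np

# Coefficients [constant, WMC, DIT, NOC, CBO, RFC, LCOM] for AIF
# version 1.6, taken from Table 10.
beta = np.array([-2.157, 0.0320, 0.0, 0.0, 0.001, 0.0109, 0.0])

def fault_probability(metrics):
    # metrics: [WMC, DIT, NOC, CBO, RFC, LCOM] for one class.
    q = beta[0] + beta[1:] @ np.asarray(metrics, dtype=float)
    return 1.0 / (1.0 + np.exp(-q))                 # pi(x) of (7)

pi = fault_probability([60, 2, 0, 15, 120, 40])     # hypothetical class
print(pi, "faulty" if pi > 0.5 else "not faulty")   # ~0.75 -> faulty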

4.2. Machine Learning Methods. Besides the statistical approach, this paper also implements four other machine learning techniques. Machine learning techniques have been used in this paper to predict the accuracy rate of fault prediction using the CK metric suite.

This section gives a brief description of the basic structure and working of the machine learning methods applied for fault prediction.

4.2.1. Artificial Neural Network. Figure 1 shows the architecture of an ANN, which contains three layers, namely, an input layer, a hidden layer, and an output layer. The computational features involved in the ANN architecture can be well applied for fault prediction.


Figure 1: A typical FFNN, with an input layer, a hidden layer, and an output layer.

In this paper, a linear activation function is used for the input layer; that is, the output $O_i$ of the input layer equals the input $I_i$ of the input layer:

$$O_i = I_i. \quad (9)$$

For the hidden layer and output layer, the sigmoidal (squashed-S) function is used. The output $O_h$ of the hidden layer for hidden-layer input $I_h$ is

$$O_h = \frac{1}{1 + e^{-I_h}}, \quad (10)$$

and the output $O_o$ of the output layer for output-layer input $I_o$ is

$$O_o = \frac{1}{1 + e^{-I_o}}. \quad (11)$$

A neural network can be represented as

$$Y' = f(WX), \quad (12)$$

where $X$ is the input vector, $Y'$ is the output vector, and $W$ is the weight vector. The weight vector $W$ is updated in every iteration so as to reduce the mean square error (MSE) value. MSE is formulated as

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y'_i - y_i)^2, \quad (13)$$

where $y$ is the actual output and $y'$ is the expected output. In the literature, different methods are available to update the weight vector $W$, such as the Gradient descent method, Newton's method, the Quasi-Newton method, the Gauss-Newton method, the Conjugate-gradient method, and the Levenberg Marquardt method. In this paper, the Gradient descent and Levenberg Marquardt methods are used for updating the weight vector $W$.
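A minimal Python sketch of one forward pass through such a 6-9-1 network, together with the MSE of (13), is shown below (an illustrative reconstruction, not the authors' MATLAB code; the weight matrices here are randomly initialized placeholders):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, W2):
    # One pass through a 6-9-1 network: linear input layer (9),
    # sigmoidal hidden layer (10), and sigmoidal output layer (11).
    o_i = np.asarray(x, dtype=float)   # (9): O_i = I_i
    o_h = sigmoid(W1 @ o_i)            # (10)
    return sigmoid(W2 @ o_h)           # (11)

def mse(y_actual, y_expected):
    # Mean square error (13).
    return np.mean((np.asarray(y_expected) - np.asarray(y_actual)) ** 2)

x = np.random.rand(6)                  # one pattern of normalized CK metrics
W1, W2 = np.random.randn(9, 6), np.random.randn(1, 9)
print(forward(x, W1, W2))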

(a) Gradient Descent Method. Gradient descent is one of the methods for updating the weights during the learning phase [20]. The gradient descent method uses the first-order derivative of the total error to find a minimum in error space. Normally, the gradient vector $G$ is defined as the first-order derivative of the error function. The error function is represented as

$$E_k = \frac{1}{2}(T_k - O_k)^2, \quad (14)$$

and $G$ is given as

$$G = \frac{\partial}{\partial W}(E_k) = \frac{\partial}{\partial W}\left(\frac{1}{2}(T_k - O_k)^2\right). \quad (15)$$

After computing the value of the gradient vector $G$ in each iteration, the weight vector $W$ is updated as follows:

$$W_{k+1} = W_k - \alpha G_k, \quad (16)$$

where $W_{k+1}$ is the updated weight, $W_k$ is the current weight, $G_k$ is the gradient vector, and $\alpha$ is the learning parameter.
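The update (16) can be sketched as follows; for brevity, the gradient (15) is estimated here by central finite differences rather than by the analytic derivative that an actual backpropagation implementation would use:

import numpy as np

def gradient_descent_step(W, error_fn, alpha=0.1, h=1e-6):
    # W_{k+1} = W_k - alpha * G_k, as in (16); G_k is estimated by
    # central finite differences instead of the analytic form (15).
    G = np.zeros_like(W)
    for j in range(len(W)):
        dW = np.zeros_like(W)
        dW[j] = h
        G[j] = (error_fn(W + dW) - error_fn(W - dW)) / (2 * h)
    return W - alpha * G

# error_fn would be the squared error (14), e.g.
# error_fn = lambda W: 0.5 * (target - network_output(W)) ** 2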

(b) Levenberg Marquardt (LM) Method. The LM method locates the minimum of a multivariate function, expressed as the sum of squares of nonlinear real-valued functions, in an iterative manner [21, 22]. This method is used for updating the weights during the learning phase. The LM method is fast and stable in terms of its execution when compared with the gradient descent method (the LM method is a combination of the steepest descent and Gauss-Newton methods). In the LM method, the weight vector $W$ is updated as follows:

$$W_{k+1} = W_k - (J_k^T J_k + \mu I)^{-1} J_k e_k, \quad (17)$$

where $W_{k+1}$ is the updated weight, $W_k$ is the current weight, $J$ is the Jacobian matrix, and $\mu$ is the combination coefficient; that is, when $\mu$ is very small the method acts as the Gauss-Newton method, and when $\mu$ is very large it acts as the gradient descent method.

The Jacobian matrix is calculated as

$$J = \begin{bmatrix}
\dfrac{\partial E_{11}}{\partial W_1} & \dfrac{\partial E_{11}}{\partial W_2} & \cdots & \dfrac{\partial E_{11}}{\partial W_N} \\
\dfrac{\partial E_{12}}{\partial W_1} & \dfrac{\partial E_{12}}{\partial W_2} & \cdots & \dfrac{\partial E_{12}}{\partial W_N} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial E_{PM}}{\partial W_1} & \dfrac{\partial E_{PM}}{\partial W_2} & \cdots & \dfrac{\partial E_{PM}}{\partial W_N}
\end{bmatrix}, \quad (18)$$

where $N$ is the number of weights, $P$ is the number of input patterns, and $M$ is the number of output patterns.
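A sketch of a single LM update (17) is given below; solving the damped normal equations is preferred here to forming the explicit inverse, a standard numerical choice rather than anything mandated by the paper:

import numpy as np

def lm_step(W, J, e, mu):
    # One Levenberg Marquardt update (17). J is the (P*M) x N Jacobian of
    # the pattern errors e with respect to the N weights; mu is the
    # combination coefficient.
    N = len(W)
    H = J.T @ J + mu * np.eye(N)             # damped Gauss-Newton matrix
    return W - np.linalg.solve(H, J.T @ e)   # solve instead of inverting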

4.2.2. Functional Link Artificial Neural Network (FLANN). FLANN, initially proposed by Pao [23], is a flat network having a single layer; that is, the hidden layers are omitted. Input variables generated by linear links of the neural network are linearly weighed. Functional links act on elements of the input variables by generating a set of linearly independent functions; these links are evaluated as functions with the variables as the arguments.


Figure 2: Flat net structure of FLANN (inputs expanded by sin(πx_i) and cos(πx_i) terms, a weighted sum, and an adaptive algorithm driven by the error between the actual and predicted outputs).

Figure 2 shows the single-layered architecture of FLANN. The FLANN architecture offers less computational overhead and higher convergence speed when compared with other ANN techniques.

Using FLANN, the output is calculated as

$$\hat{y} = \sum_{i=1}^{n} W_i X_i, \quad (19)$$

where $\hat{y}$ is the predicted value, $W$ is the weight vector, and $X$ is the functional block, defined as

$$X = [1, x_1, \sin(\pi x_1), \cos(\pi x_1), x_2, \sin(\pi x_2), \cos(\pi x_2), \ldots], \quad (20)$$

and the weight is updated as

$$W_i(k+1) = W_i(k) + \alpha e_i(k) x_i(k), \quad (21)$$

with $\alpha$ as the learning rate and $e_i$ as the error value, formulated as

$$e_i = y_i - \hat{y}_i, \quad (22)$$

where $y$ and $\hat{y}$ represent the actual and the obtained (predicted) values, respectively.
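A compact Python sketch of the expansion (20), the prediction (19), and the weight update (21) could read as follows (illustrative only; the learning rate is an assumed value):

import numpy as np

def expand(x):
    # Functional expansion (20) of the input vector.
    terms = [1.0]
    for xi in x:
        terms += [xi, np.sin(np.pi * xi), np.cos(np.pi * xi)]
    return np.array(terms)

def flann_step(W, x, y_actual, alpha=0.1):
    X = expand(x)
    y_pred = W @ X               # prediction (19)
    e = y_actual - y_pred        # error (22)
    return W + alpha * e * X     # LMS-style weight update (21)

W = np.zeros(1 + 3 * 6)          # weights for six expanded CK inputs
W = flann_step(W, np.random.rand(6), y_actual=1.0)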

4.2.3. Radial Basis Function Network (RBFN). RBFN is a feed-forward neural network (FFNN) trained using a supervised training algorithm. RBFN is generally configured with a single hidden layer, where the activation function is chosen from a class of functions called basis functions.

RBFN is an ANN technique that contains three layers, namely, an input, a hidden, and an output layer. Figure 3 shows the structure of a typical RBFN in its basic form, involving three entirely different layers. The RBFN contains $h$ hidden centers, represented as $C_1, C_2, C_3, \ldots, C_h$.

Figure 3: RBFN network, with an input layer ($x_1, \ldots, x_p$), a hidden layer of radial basis functions $\phi_1, \ldots, \phi_h$ centered at $C_1, \ldots, C_h$, and a weighted output $y'$.

The target output is computed as

$$y' = \sum_{i=1}^{n} \phi_i W_i, \quad (23)$$

where $W_i$ is the weight of the $i$th center, $\phi$ is the radial function, and $y'$ is the target output. Table 3 shows the various radial functions available in the literature.

In this paper, the Gaussian function is used as the radial function, and the distance vector $z$ is calculated as

$$z = \|x_j - c_j\|, \quad (24)$$

where $x_j$ is the input vector that lies in the receptive field for center $c_j$. In this paper, gradient descent learning and hybrid learning techniques are used for updating the weight and center, respectively.


Table 3: Radial functions.

Radial function          | Mathematical expression
Gaussian radial function | $\phi(z) = e^{-z^2/(2\sigma^2)}$
Thin plate spline        | $\phi(z) = z^2 \log z$
Quadratic                | $\phi(z) = (z^2 + r^2)^{1/2}$
Inverse quadratic        | $\phi(z) = 1/(z^2 + r^2)^{1/2}$

The advantage of using RBFN lies in its training rate, which is faster than that of back-propagation networks; RBFN is also less susceptible to problems with nonstationary inputs.

(a) Gradient Descent Learning Technique. Gradient descent learning is a technique used for updating the weight $W$ and center $C$. The center $C$ in gradient learning is updated as

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 \frac{\partial E_k}{\partial C_{ij}}, \quad (25)$$

and the weight $W$ is updated as

$$W_i(k+1) = W_i(k) - \eta_2 \frac{\partial E_k}{\partial W_i}, \quad (26)$$

where $\eta_1$ and $\eta_2$ are the learning coefficients for updating the center and weight, respectively.

(b) Hybrid Learning Technique. In the hybrid learning technique, the radial functions relocate their centers in a self-organized manner, while the weights are updated using a learning algorithm. In this paper, the least mean square (LMS) algorithm is used for updating the weights, while a center is updated only when it satisfies the following conditions:

(a) the Euclidean distance between the input pattern and the nearest center is greater than the threshold value, and
(b) the MSE is greater than the desired accuracy.

After satisfying the above conditions, the Euclidean distance is used to find the center closest to $x$, and then the centers are updated as follows:

$$C_i(k+1) = C_i(k) + \alpha (x - C_i(k)). \quad (27)$$

After every update, the center moves closer to $x$.
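The following Python sketch combines the Gaussian forward pass (23)-(24) with an LMS weight update and the center move (27); for brevity it omits the two admission conditions listed above and always nudges the nearest center, so it is a simplification of the hybrid scheme, not a faithful reimplementation:

import numpy as np

def rbfn_output(x, centers, W, sigma):
    z = np.linalg.norm(x - centers, axis=1)      # distances (24)
    phi = np.exp(-z**2 / (2 * sigma**2))         # Gaussian radial function
    return phi @ W, phi                          # output (23)

def hybrid_step(x, y_actual, centers, W, sigma, alpha=0.05, eta=0.1):
    y_pred, phi = rbfn_output(x, centers, W, sigma)
    e = y_actual - y_pred
    W = W + eta * e * phi                        # LMS weight update
    nearest = np.argmin(np.linalg.norm(x - centers, axis=1))
    centers[nearest] += alpha * (x - centers[nearest])   # center move (27)
    return centers, W

centers = np.random.rand(9, 6)   # nine hidden centers over six CK inputs
W = np.zeros(9)
centers, W = hybrid_step(np.random.rand(6), 1.0, centers, W, sigma=0.5)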

4.2.4. Probabilistic Neural Network (PNN). PNN was introduced by Specht [24]. It is a feed-forward neural network that has basically been derived from Bayesian networks and statistical algorithms.

In PNN, the network is organized as a multilayered feed-forward network with four layers: input, hidden, summation, and output layers. Figure 4 shows the basic architecture of PNN.

The input layer first computes the distance from the input vector to the training input vectors. The second layer consists of Gaussian functions formed using the given set of data points as centers.

Table 4: Confusion matrix to classify a class as faulty or not-faulty.

             | No (prediction)     | Yes (prediction)
No (actual)  | True negative (TN)  | False positive (FP)
Yes (actual) | False negative (FN) | True positive (TP)

The summation layer sums up the contribution of each class of input and produces a net output, which is a vector of probabilities. The fourth layer determines the fault prediction rate.

The PNN technique is faster when compared to multilayer perceptron networks and is also more accurate. The major concern lies in finding an accurate smoothing parameter $\sigma$ to obtain better classification. The following function is used in the hidden layer:

$$\phi(z) = e^{-z^2/(2\sigma^2)}, \quad (28)$$

where $z = \|x - c\|$ is the Euclidean distance between the center $c$ and the input vector $x$.
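An illustrative Python sketch of this classification scheme follows; the training patterns are random placeholders, and the smoothing value 1.7 anticipates the best-performing setting reported later in Section 6:

import numpy as np

def pnn_classify(x, patterns_by_class, sigma):
    # Pattern layer: one Gaussian kernel (28) per stored training pattern;
    # summation layer: average kernel response per class;
    # output layer: the class with the larger score, max(g_i(X)).
    scores = {}
    for label, patterns in patterns_by_class.items():
        z = np.linalg.norm(patterns - x, axis=1)
        scores[label] = np.mean(np.exp(-z**2 / (2 * sigma**2)))
    return max(scores, key=scores.get)

train = {"faulty": np.random.rand(50, 6),        # stand-ins for CK vectors
         "not faulty": np.random.rand(50, 6)}
print(pnn_classify(np.random.rand(6), train, sigma=1.7))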

5. Performance Evaluation Parameters

The following subsections give the basic definitions of the performance parameters used in the statistical and machine learning methods for fault prediction.

5.1. Statistical Analysis. The performance parameters for statistical analysis can be determined based on the confusion matrix [25], as shown in Table 4.

5.1.1. Precision. It is defined as the degree to which repeated measurements under unchanged conditions show the same results:

$$\text{Precision} = \frac{\text{TP}}{\text{FP} + \text{TP}}. \quad (29)$$

5.1.2. Correctness. Correctness, as defined by Briand et al. [13], is the ratio of the number of modules correctly classified as fault prone to the total number of modules classified as fault prone:

$$\text{Correctness} = \frac{\text{TP}}{\text{FP} + \text{TP}}. \quad (30)$$

5.1.3. Completeness. According to Briand et al. [13], completeness is the ratio of the number of faults in classes classified as fault prone to the total number of faults in the system:

$$\text{Completeness} = \frac{\text{TP}}{\text{FN} + \text{TP}}. \quad (31)$$


Figure 4: Basic structure of PNN, with an input layer, a pattern layer built from the training set, a summation layer producing the class scores $g_1(X)$, $g_2(X)$, and $g_3(X)$, and an output layer selecting $\max(g_1, g_2, g_3)$.

5.1.4. Accuracy. Accuracy, as defined by Yaun et al. [26], is the proportion of predicted fault-prone modules that are inspected, out of all modules:

$$\text{Accuracy} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{FP} + \text{FN} + \text{TP}}. \quad (32)$$

5.1.5. $R^2$ Statistic. $R^2$, also known as the coefficient of multiple determination, is a measure of the power of correlation between the predicted and actual number of faults [25]. The higher the value of this statistic, the more accurate the predicted model:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \quad (33)$$

where $y_i$ is the actual number of faults, $\hat{y}_i$ is the predicted number of faults, and $\bar{y}$ is the average number of faults.
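Using the confusion matrix counts of Table 12 (AIF version 1.6), formulas (29)-(32) can be checked directly, reproducing the MULTI row of Table 13:

TN, FP, FN, TP = 767, 10, 172, 16             # Table 12, AIF version 1.6

precision = TP / (FP + TP)                    # (29)
correctness = TP / (FP + TP)                  # (30), the same ratio
completeness = TP / (FN + TP)                 # (31)
accuracy = (TN + TP) / (TN + FP + FN + TP)    # (32)

print(round(precision, 4), round(completeness, 4), round(accuracy, 4))
# -> 0.6154 0.0851 0.8114, i.e. 61.53%, 8.51%, and 81.13%, as in the
# MULTI row of Table 13 for version 1.6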

5.2. Machine Learning. Fault prediction accuracy for the four applied ANN techniques is determined by using performance evaluation parameters such as mean absolute error (MAE), mean absolute relative error (MARE), root mean square error (RMSE), and standard error of the mean (SEM).

5.2.1. Mean Absolute Error (MAE). This performance parameter determines how closely the predicted and actual fault (accuracy) rates agree:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - y'_i|. \quad (34)$$

5.2.2. Mean Absolute Relative Error (MARE). Consider

$$\text{MARE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - y'_i|}{y_i}. \quad (35)$$

In (35), a numerical value of 0.05 is added in the denominator in order to avoid numerical overflow (division by zero). The modified MARE is formulated as

$$\text{MARE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - y'_i|}{y_i + 0.05}. \quad (36)$$

5.2.3. Root Mean Square Error (RMSE). This performance parameter determines the differences between the predicted and actual fault (accuracy) rates:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2}. \quad (37)$$

In (35), (36), and (37), $y_i$ is the actual value and $y'_i$ is the expected value.

5.2.4. Standard Error of the Mean (SEM). This is the deviation of the predicted value from the actual fault (accuracy) rate:

$$\text{SEM} = \frac{\text{SD}}{\sqrt{n}}, \quad (38)$$

where SD is the sample standard deviation and $n$ is the number of samples.
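A small Python sketch of (34), (36), (37), and (38) follows; note that (38) is read here as the standard deviation of the residuals divided by the square root of n, which is one plausible interpretation of the paper's wording:

import numpy as np

def error_metrics(y_actual, y_expected):
    y = np.asarray(y_actual, dtype=float)
    yp = np.asarray(y_expected, dtype=float)
    n = len(y)
    mae = np.mean(np.abs(y - yp))                  # (34)
    mare = np.mean(np.abs(y - yp) / (y + 0.05))    # (36), overflow-safe
    rmse = np.sqrt(np.mean((y - yp) ** 2))         # (37)
    sem = np.std(y - yp, ddof=1) / np.sqrt(n)      # (38), SD of residuals
    return mae, mare, rmse, sem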


Table 5: Distribution of bugs for AIF version 1.6.

Number of classes | Percentage of bugs | Number of associated bugs
777               | 80.5181            | 0
101               | 10.4663            | 1
32                | 3.3161             | 2
16                | 1.6580             | 3
14                | 1.4508             | 4
6                 | 0.6218             | 5
2                 | 0.2073             | 6
3                 | 0.3109             | 7
5                 | 0.5181             | 8
1                 | 0.1036             | 9
1                 | 0.1036             | 10
3                 | 0.3109             | 11
1                 | 0.1036             | 13
1                 | 0.1036             | 17
1                 | 0.1036             | 18
1                 | 0.1036             | 28
965               | 100.00             | 142

Figure 5: WMC of AIF version 1.6.

6. Results and Analysis

In this section, the relationship between the values of the metrics and the faults found in a class is determined. In this approach, the comparative study involves using the six CK metrics as input nodes, and the output is the achieved fault prediction rate. Fault prediction is performed for AIF version 1.6.

6.1. Fault Data. To perform statistical analysis, bugs were collected from the Promise data repository [18]. Table 5 shows the distribution of bugs based on the number of occurrences (in terms of the percentage of classes containing that number of bugs) for AIF version 1.6.

AIF version 1.6 contains 965 classes, of which 777 classes contain zero bugs (80.5181%); 10.4663% of classes contain one bug, 3.3161% of classes contain two bugs, 1.6580% of classes contain three bugs, 1.4508% of classes contain four bugs, 0.6218% of classes contain five bugs, and 0.2073% of classes contain six bugs.

Figure 6: DIT of AIF version 1.6.

Figure 7: NOC of AIF version 1.6.

Further, 0.3109% of classes contain seven or eleven bugs, 0.5181% of classes contain eight bugs, and 0.1036% of classes contain nine, ten, thirteen, seventeen, eighteen, or twenty-eight bugs.

6.2. Metrics Data. The CK metric values for WMC, DIT, NOC, CBO, RFC, and LCOM for AIF version 1.6 are graphically represented in Figures 5, 6, 7, 8, 9, and 10, respectively.

6.3. Descriptive Statistics and Correlation Analysis. This subsection gives a comparative analysis of the fault data, the descriptive statistics of the classes, and the correlation among the six metrics, against Basili et al. [1]. Basili et al. studied object-oriented systems written in the C++ language. They carried out an experiment in which they set up eight project groups, each consisting of three students. Each group had the same task of developing a small/medium-sized software system. Since all the necessary documentation (for instance, reports about faults and their fixes) was available, they could search for relationships between fault density and the metrics. They used the same CK metric suite, and logistic regression was employed to analyze the relationship between the metrics and the fault proneness of classes.

The obtained CK metric values of AIF version 1.6 are compared with the results of Basili et al. [1]. In comparison with Basili, the total number of classes considered here is much greater: 965 classes were considered (versus 180).


Table 6: Descriptive statistics of classes.

Basili et al. [1]:
          WMC     DIT    NOC     CBO     RFC     LCOM
Max       99.00   9.00   105.00  13.00   30.00   426.00
Min       1.00    0.00   0.00    0.00    0.00    0.00
Median    9.50    0.00   19.50   0.00    5.00    0.00
Mean      13.40   1.32   33.91   0.23    6.80    9.70
Std Dev   14.90   1.99   33.37   1.54    7.56    63.77

AIF version 1.6:
          WMC     DIT    NOC     CBO     RFC     LCOM
Max       166.00  6.00   39.00   448.00  322.00  13617
Min       0.00    0.00   0.00    0.00    0.00    0.00
Median    5.00    1.00   0.00    7.00    14.00   4.00
Mean      8.57    1.95   0.052   11.10   21.42   79.33
Std Dev   11.20   1.27   2.63    22.52   25.00   523.75

Figure 8: CBO of AIF version 1.6.

Figure 9: RFC of AIF version 1.6.

Table 6 shows the comparative statistical analysis results obtained for Basili et al. [1] and AIF version 1.6 for the CK metrics, indicating the Max, Min, Median, and Standard deviation values.

The dependency between the CK metrics is computed using Pearson's correlations ($R^2$: coefficient of determination) and compared with Basili et al. [1] for AIF version 1.6. The coefficient of determination $R^2$ is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable; it is a measure that allows a researcher to determine how certain one can be in making predictions from a certain model/graph. Table 7 shows the Pearson's correlations for the data set used by Basili et al. [1] and the correlation metrics of AIF version 1.6.

Figure 10: LCOM of AIF version 1.6.

From Table 7, with respect to AIF version 1.6, it is observed that the correlation between WMC and RFC is 0.77, which is high; that is, these two metrics are very much linearly dependent on each other. Similarly, the correlation between WMC and DIT is 0, which indicates that they are loosely correlated; that is, there is no dependency between these two metrics.

6.4. Fault Prediction Using Statistical Methods

6.4.1. Linear Regression Analysis. Table 8 shows the results obtained for linear regression analysis, in which the fault is considered the dependent variable and the CK metrics are the independent variables.

$R$ represents the coefficient of correlation, and $P$ refers to the significance of the metric value. If $P < 0.001$, the metric is of very great significance in fault prediction.

6.4.2. Logistic Regression Analysis. The logistic regression method helps to indicate whether a class is faulty or not, but it does not convey anything about the possible number of faults in the class. Univariate and multivariate logistic regression techniques are applied to predict whether a class is faulty or not.


Table 7: Correlations between metrics.

Basili et al. [1]:
        WMC    DIT    NOC    CBO    RFC    LCOM
WMC     1.00   0.02   0.24   0.00   0.13   0.38
DIT            1.00   0.00   0.00   0.00   0.01
NOC                   1.00   0.00   0.00   0.00
CBO                          1.00   0.31   0.01
RFC                                 1.00   0.09
LCOM                                       1.00

AIF version 1.6:
        WMC    DIT    NOC    CBO    RFC    LCOM
WMC     1.00   0.00   0.03   0.10   0.77   0.60
DIT            1.00   0.00   0.00   0.00   0.01
NOC                   1.00   0.024  0.025  0.027
CBO                          1.00   0.08   0.05
RFC                                 1.00   0.42
LCOM                                       1.00

Table 8: Linear regression analysis.

Version | R      | P value | Std. error
1.2     | 0.5360 | 0.000   | 0.1114
1.4     | 0.5024 | 0.000   | 0.1450
1.6     | 0.5154 | 0.000   | 0.0834

Figure 11: Logistic graph of $1/(1 + \exp(-q))$.

Univariate regression analysis is used to examine the effect of each metric on the faults of the class, while multivariate regression analysis is used to examine the common effectiveness of the metrics on the faults of the class. The results of the three versions of AIF are compared using these two statistical techniques. Figure 11 shows the typical "S" curve (similar to the Sigmoid function) obtained for AIF version 1.6 using multivariate logistic regression. Tables 9 and 10 contain the tabulated values of the results obtained by applying univariate and multivariate regression analysis, respectively.

From Table 9, it can be observed that all the metrics of the CK suite are highly significant except DIT. The $P$ values for the three versions (with respect to DIT) are 0.335, 0.108, and 0.3527, respectively; higher values of $P$ indicate less significance.

Univariate and multivariate logistic regression statistical methods were used for classifying a class as faulty or not faulty. Logistic regression was applied with a threshold value of 0.5; that is, $\pi > 0.5$ indicates that a class is classified as "faulty"; otherwise, it is categorized as a "not faulty" class.

Tables 11 and 12 represent the confusion matrices for the number of classes with faults before and after applying regression analysis, respectively, for AIF version 1.6. From Table 11 it is clear that, before applying logistic regression, a total of 777 classes contained zero bugs and 188 classes contained at least one bug. After applying logistic regression (Table 12), a total of 767 + 16 classes are classified correctly, with an accuracy of 81.13%.

The performance parameters for all three versions of AIF, obtained by applying univariate and multivariate logistic regression analysis, are shown in Table 13. Here, precision, correctness, completeness, and accuracy [1, 13, 27, 28] are taken as the performance parameters. Using multivariate logistic regression, the accuracy of AIF version 1.2 is found to be 64.44%, the accuracy of AIF version 1.4 is 83.37%, and that of AIF version 1.6 is 81.13%.

From the results obtained by applying linear and logistic regression analysis, it is found that, out of the six metrics, WMC appears to have the most impact in predicting faults.

6.5. Fault Prediction Using Neural Networks

6.5.1. Artificial Neural Network. An ANN is an interconnected group of nodes. In this paper, three layers of ANN are considered, in which six nodes act as input nodes, nine nodes represent the hidden nodes, and one node acts as the output node.

ANN is a three-phase network; the phases are used for learning, validation, and testing purposes. In this article, 70% of the total input patterns are considered for the learning phase, 15% for validation, and the remaining 15% for testing. The regression analysis carried out classifies whether a class is faulty or not faulty.


Table 9: Analysis of univariate regression.

        Coefficient              Constant                P value                   R value
        v1.2    v1.4    v1.6     v1.2    v1.4    v1.6    v1.2    v1.4    v1.6      v1.2    v1.4    v1.6
WMC     0.028   0.05    0.03     -0.83   -2.11   -1.77   0.0013  0.0007  0.00      0.130   0.240   0.18
DIT     -0.067  0.10    0.05     -0.46   -1.83   -1.53   0.335   0.108   0.3257    -0.039  0.054   0.02
NOC     0.137   0.09    0.13     -0.66   -1.67   -1.50   0.0007  0.00    0.00      0.136   0.13    0.16
CBO     0.011   0.01    0.02     -0.71   -1.80   -1.66   0.017   0.00    0.00      0.096   0.15    0.17
RFC     0.012   0.02    0.01     -0.86   -2.15   -1.79   0.0014  0.00    0.00      0.130   0.23    0.17
LCOM    0.007   0.007   0.007    -0.64   -1.67   -1.48   0.0349  0.0004  0.0007    0.085   0.11    0.11

Table 10: Multivariate logistic regression analysis.

Coefficient | AIF version 1.2 | AIF version 1.4 | AIF version 1.6
WMC         | 0.0195          | 0.0574          | 0.0320
DIT         | -0.041          | 0.000           | 0.000
NOC         | 0.1231          | 0.000           | 0.000
CBO         | 0.005           | 0.008           | 0.001
RFC         | 0.0071          | 0.0081          | 0.0109
LCOM        | 0               | -0.001          | 0
Constant    | -0.917          | -2.785          | -2.157

Table 11: Before applying regression.

            | Not-faulty | Faulty
Not-faulty  | 777        | 0
Faulty      | 188        | 0

Table 12: After applying regression.

            | Not-faulty | Faulty
Not-faulty  | 767        | 10
Faulty      | 172        | 16

The prediction models of ANN and its forms, such as PNN, RBFN, and FLANN, not only classify a class as faulty or not faulty but also highlight the number of bugs found in the class; these bugs are fixed in the testing phase of the software development life cycle.

In this paper, the six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the Gradient descent method and the Levenberg Marquardt method.

(a) Gradient Descent Method. The gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics for AIF version 1.6. Figure 12 shows the variation of the mean square error values with respect to the number of epochs (iterations) for AIF version 1.6.

(b) Levenberg Marquardt Method. The Levenberg Marquardt method [21, 22] is a technique for updating weights. In the gradient descent method, the learning rate $\alpha$ is constant, but in the Levenberg Marquardt method the learning rate $\alpha$ varies in every iteration, so this method consumes a smaller number of iterations to train the network.

Figure 12: MSE versus number of epochs with respect to the Gradient descent NN.

Figure 13: MSE versus number of epochs with respect to the Levenberg Marquardt NN.

Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg Marquardt method.

Figure 13 shows the variation of the mean square error values with respect to the number of epochs for AIF version 1.6.


Table 13: Precision, correctness, completeness, and accuracy for three versions of AIF.

        Precision (%)           Correctness (%)         Completeness (%)        Accuracy (%)
        v1.2    v1.4    v1.6    v1.2    v1.4    v1.6    v1.2    v1.4    v1.6    v1.2    v1.4    v1.6
WMC     61.11   41.17   57.14   61.11   41.17   57.14   5.09    4.82    4.25    66.13   84.02   81.71
DIT     --      --      --      --      --      --      0       0       0       64.47   83.37   80.51
NOC     75      75      66.66   75      75      66.66   5.55    2.06    5.31    65.78   83.6    81.03
CBO     60      57.14   77.77   60      57.14   77.77   2.77    2.75    3.72    64.8    83.48   81.03
RFC     66.66   36.36   50      66.66   36.36   50      4.62    2.75    2.12    65.29   83.02   80.51
LCOM    66.66   50      60      0.66    0.5     0.6     2.77    6.8     1.59    64.96   83.37   80.62
MULTI   68.75   50      61.53   68.75   50      61.53   10.18   7.58    8.51    66.44   83.37   81.13

Table 14: Accuracy prediction using gradient descent NN.

MAE     | MARE   | RMSE   | R       | P value | Std. error | Accuracy (%)
0.0594  | 1.093  | 0.0617 | -0.2038 | 0.0044  | 0.0048     | 94.0437

Table 15: Accuracy prediction using Levenberg Marquardt.

MAE     | MARE   | RMSE   | R       | P value | Std. error | Accuracy (%)
0.0023  | 1.1203 | 0.0308 | -0.2189 | 0.0022  | 0.0041     | 90.4977

Table 16: Accuracy prediction using FLANN.

MAE     | MARE   | RMSE   | R      | P value    | Std. error | Accuracy (%)
0.0304  | 0.7097 | 0.0390 | 0.3308 | 2.4601e-06 | 0.0050     | 96.3769

Table 17: Accuracy prediction using basic RBFN.

MAE     | MARE   | RMSE   | R      | P value | Std. error | Accuracy (%)
0.0279  | 0.3875 | 0.0573 | 0.1969 | 0.059   | 0.006      | 97.2792

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has less computational cost. In this paper, the adaptive algorithm has been used for updating the weights, as shown in (21). Figure 14 shows the variation of the mean square values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.

6.5.3. Radial Basis Function Network. In this paper, the Gaussian radial function is used as the radial function. Gradient descent learning and hybrid learning methods are used for updating the centers and weights, respectively.

A three-layered RBFN has been considered, in which the six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplification, (25) becomes

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 (y' - y) W_i \frac{\phi_i}{\sigma^2} (x_j - C_{ij}(k)), \quad (39)$$

and the modified (26) is formulated as

$$W_i(k+1) = W_i(k) + \eta_2 (y' - y) \phi_i, \quad (40)$$

where $\sigma$ is the width of the center and $k$ is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE with respect to the number of epochs.

(b) Hybrid Learning Method. In the hybrid learning method, the centers are updated using (27), while the weights are updated using supervised learning methods. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance matrix for AIF version 1.6. Figure 16 shows the graph of variation of MSE versus the number of epochs.

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: input, hidden, summation, and output layers.

In PNN, 50% of the faulty and nonfaulty classes are taken as input for the hidden layers, and the Gaussian function (28) is used as the hidden node function.


Table 18: Accuracy prediction using RBFN gradient.

MAE     | MARE   | RMSE   | R      | P value    | Std. error | Accuracy (%)
0.0207  | 0.2316 | 0.0323 | 0.3041 | 1.6302e-05 | 0.0041     | 97.2475

Table 19: Accuracy prediction using hybrid RBFN.

MAE     | MARE   | RMSE   | R      | P value    | Std. error | Accuracy (%)
0.0614  | 0.1032 | 0.0316 | 0.9184 | 3.1834e-79 | 0.0013     | 98.4783

Figure 14: MSE versus number of iterations (epochs) with respect to FLANN.

Figure 15: MSE versus number of epochs with respect to gradient RBFN.

The summation layer sums up the contribution of each class of input patterns and produces a net output, which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.

6.6. Comparison. Table 20 shows the tabulated results for the obtained performance parameter values, the number of epochs, and the accuracy rate found by applying the neural network techniques. This performance table is an indication of the better fault prediction model. In this comparative analysis, the performance parameter mean square error (MSE) was taken as the criterion to compute the performance parameters (such as MARE, MSE, number of epochs, and accuracy rate) when the four neural network techniques were applied. During this process, an MSE value of 0.002 was set as the threshold for evaluation. Based on the number of iterations and the accuracy rate obtained by the respective NN technique, the best prediction model was determined.

Figure 16: MSE versus number of epochs with respect to hybrid RBFN.

Figure 17: Accuracy rate versus smoothing parameter.

From Table 20 it is evident that the gradient NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, which is an improved model of ANN, obtained an accuracy rate of 90.4%; this accuracy rate is less than that of the gradient NN, but the LM approach took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely, the basic RBFN, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained the better prediction rate of 98.47% in only 14 epochs when compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved; FLANN reached this accuracy in 66 epochs, as it has no hidden layer involved in its architecture.


Table 20: Performance metrics.

AI technique      | Epochs | MAE    | MARE   | RMSE   | Std. error | Accuracy (%)
Gradient descent  | 162    | 0.0594 | 1.0930 | 0.0617 | 0.0048     | 94.04
LM                | 13     | 0.0023 | 1.1203 | 0.0308 | 0.0041     | 90.49
RBFN basic        | --     | 0.0279 | 0.3875 | 0.0573 | 0.006      | 97.27
RBFN gradient     | 41     | 0.0207 | 0.2316 | 0.0323 | 0.0041     | 97.24
RBFN hybrid       | 14     | 0.0614 | 0.1032 | 0.0316 | 0.0013     | 98.47
FLANN             | 66     | 0.0304 | 0.7097 | 0.0390 | 0.0050     | 96.37

The performance of PNN is shown in Figure 17. The highest accuracy in prediction was obtained for a smoothing parameter value of 1.7, where PNN obtained a classification rate of 86.41%.

RBFN using the hybrid learning model gives the lowest values for MAE, MARE, and RMSE and a high accuracy rate. Hence, from the results obtained by using the ANN techniques, it can be concluded that the RBFN hybrid approach obtained the best fault prediction rate in the smallest number of epochs when compared with the other three ANN techniques.

7. Conclusion

The use of prediction models by system analysts to classify fault-prone classes as faulty or not faulty is the need of the day for researchers as well as practitioners, so more reliable approaches for prediction need to be modeled. In this paper, two kinds of approaches, namely, statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data is necessary, with the help of a better prediction model.

This paper proposes a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Also, machine learning techniques such as artificial neural network (gradient descent and Levenberg Marquardt methods), functional link artificial neural network, radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and probabilistic neural network techniques were applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, out of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in a smaller number of epochs (14 iterations) when compared with the other three neural network techniques.

In the future, this work should be replicated on other open-source projects like Mozilla, using different AI techniques, to analyze which model performs better in achieving higher accuracy for fault prediction. Also, fault prediction accuracy should be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751-761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308-320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52-60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, 1994.

[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1-8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics: an analysis of dependencies," in Proceedings of the Workshop on Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241-262, 1995.

[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90-99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the 19th IEEE International Conference on Software Engineering, Association for Computing Machinery, pp. 412-421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35-40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245-273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242-249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63-75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902-909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13-26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45-50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164-168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431-441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109-118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193-206, 2012.

[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "An application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSEST '00), pp. 85-91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395-417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1-6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1-7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402-419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39-62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168-175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1-4, 2013.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

ISRN Software Engineering 3

Table 2: CK metric suite

CK metric   Description                                                                                                      Value
WMC         Sum of the complexities of all class methods                                                                     Low
DIT         Maximum length from the node to the root of the tree                                                             < six
NOC         Number of immediate subclasses subordinate to a class in the class hierarchy                                     Low
CBO         Count of the number of other classes to which it is coupled                                                      Low
RFC         A set of methods that can potentially be executed in response to a message received by an object of that class   Low
LCOM        Measures the dissimilarity of methods in a class via instance variables                                          Low

Linear regression analysis is of two types:

(a) univariate linear regression and
(b) multivariate linear regression.

Univariate linear regression is based on

    Y = β_0 + β_1 X    (3)

where Y represents the dependent variable (the accuracy rate in this case) and X represents the independent variable (a CK metric in this case).

In the case of multivariate linear regression, the regression is based on

    Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ... + β_p X_p    (4)

where X_i are the independent variables, β_0 is a constant, β_i are the regression coefficients, and Y is the dependent variable. Table 8 shows the result of linear regression analysis for the three versions of AIF.
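As an illustration, model (4) can be fitted by ordinary least squares. The following is a minimal Python/NumPy sketch; the metric values and fault counts below are hypothetical placeholders, not the actual AIF data, and a real run would use all 965 classes.

```python
import numpy as np

# Hypothetical placeholder data: each row holds the six CK metrics
# (WMC, DIT, NOC, CBO, RFC, LCOM) of one class; y holds observed faults.
X = np.array([[10, 1, 0, 7, 14, 4],
              [25, 2, 3, 12, 40, 90],
              [5, 1, 0, 3, 8, 0],
              [60, 3, 1, 30, 80, 200]], dtype=float)
y = np.array([0, 2, 0, 5], dtype=float)

# Fit the multivariate model (4) by ordinary least squares:
# prepend a column of ones so that beta[0] plays the role of beta_0.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ beta                    # predicted number of faults
R = np.corrcoef(y, y_hat)[0, 1]     # coefficient of correlation "R"
print("coefficients:", beta)
print("R =", R)
```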

4.1.2. Logistic Regression Analysis. Logistic regression analysis is used for predicting the outcome of a dependent variable based on one or more independent variables. The dependent variable can take only two values, so the classes are divided into two groups: one group containing zero bugs and the other group having at least one bug.

Logistic regression analysis is of two types:

(a) univariate logistic regression and
(b) multivariate logistic regression.

(a) Univariate Logistic Regression Analysis. Univariate logistic regression is carried out to find the impact of an individual metric on predicting the faults of a class. The univariate logistic regression is based on

    π(x) = e^{β_0 + β_1 X} / (1 + e^{β_0 + β_1 X})    (5)

where x is an independent variable and β_0 and β_1 represent the constant and the coefficient value, respectively. The logit function can be developed as follows:

    logit[π(x)] = β_0 + β_1 X    (6)

where π represents the probability of a fault being found in the class during the validation phase.

The results of univariate logistic regression for AIF are tabulated in Table 9. The obtained coefficient values are the estimated regression coefficients. The probability of a fault being detected for a class depends on the coefficient value (positive or negative); a higher coefficient value means a greater probability of a fault being detected. The significance of a coefficient value is determined by its P value, which is assessed against the significance level (α). The R coefficient is the proportion of the total variation in the dependent variable explained by the regression model; a high value of R indicates greater correlation between faults and the CK metrics.

(b) Multivariate Logistic Regression Analysis. Multivariate logistic regression is used to construct a prediction model for the fault proneness of classes. In this method, the metrics are used in combination. The multivariate logistic regression model is based on the following equation:

    π(x) = e^{β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ... + β_p X_p} / (1 + e^{β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ... + β_p X_p})    (7)

where X_i are the independent variables, π represents the probability of a fault being found in the class during the validation phase, and p represents the number of independent variables. The logit function can be formed as follows:

    logit[π(x)] = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ... + β_p X_p    (8)

Equation (8) shows that logistic regression is essentially a standard linear regression model in which the dichotomous outcome is transformed by the logit transform. The value of π(x) lies in the range 0 < π(x) < 1; after the logit transform, logit[π(x)] lies in the range -∞ < logit[π(x)] < +∞.
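A minimal sketch of fitting (7) and applying the logit-based classification follows; the data, the min-max normalization step, the learning rate, and the iteration count are all hypothetical placeholders rather than values from this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical placeholder data: six CK metrics per class; the label is
# 1 if the class contains at least one bug and 0 otherwise.
X = np.array([[10, 1, 0, 7, 14, 4],
              [25, 2, 3, 12, 40, 90],
              [5, 1, 0, 3, 8, 0],
              [60, 3, 1, 30, 80, 200]], dtype=float)
y = np.array([0.0, 1.0, 0.0, 1.0])

# Min-max normalize the metrics to [0, 1] before fitting (an assumed
# preprocessing step for numerical stability).
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
A = np.column_stack([np.ones(len(Xn)), Xn])   # constant column for beta_0

beta = np.zeros(A.shape[1])
for _ in range(10000):                        # gradient ascent on (7)
    pi = sigmoid(A @ beta)
    beta += 0.1 * A.T @ (y - pi) / len(y)

# Threshold of 0.5 as used later in the paper: pi > 0.5 => "faulty".
print(["faulty" if p > 0.5 else "not faulty" for p in sigmoid(A @ beta)])
```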

4.2. Machine Learning Methods. Besides the statistical approach, this paper also implements four machine learning techniques. The machine learning techniques have been used in this paper to predict the accuracy rate of fault prediction using the CK metric suite.

This section gives a brief description of the basic structure and working of the machine learning methods applied for fault prediction.

4.2.1. Artificial Neural Network. Figure 1 shows the architecture of an ANN, which contains three layers, namely, the input layer, the hidden layer, and the output layer. The computational features involved in the ANN architecture can be very well applied for fault prediction.

Figure 1: A typical FFNN (input layer, hidden layer, and output layer).

In this paper, a linear activation function has been used for the input layer; that is, the output of the input layer O_i equals the input of the input layer I_i, which is represented as follows:

    O_i = I_i    (9)

For the hidden layer and the output layer, the sigmoidal (squashed-S) function is used. The output of the hidden layer O_h for the input of the hidden layer I_h is represented as follows:

    O_h = 1 / (1 + e^{-I_h})    (10)

The output of the output layer O_o for the input of the output layer I_o is represented as follows:

    O_o = 1 / (1 + e^{-I_o})    (11)

A neural network can be represented as follows:

    Y' = f(WX)    (12)

where X is the input vector, Y' is the output vector, and W is the weight vector. The weight vector W is updated in every iteration so as to reduce the mean square error (MSE) value. MSE is formulated as follows:

    MSE = (1/n) Σ_{i=1}^{n} (y'_i - y_i)^2    (13)

where y is the actual output and y' is the expected output. In the literature, different methods are available to update the weight vector W, such as the Gradient descent method, Newton's method, the Quasi-Newton method, the Gauss-Newton method, the Conjugate-gradient method, and the Levenberg Marquardt method. In this paper, the Gradient descent and Levenberg Marquardt methods are used for updating the weight vector W.

(a) Gradient Descent Method. Gradient descent is one of the methods for updating the weights during the learning phase [20]. The gradient descent method uses the first-order derivative of the total error to find the minima in error space. Normally, the gradient vector G is defined as the first-order derivative of the error function. The error function is represented as follows:

    E_k = (1/2) (T_k - O_k)^2    (14)

and G is given as

    G = ∂E_k/∂W = ∂/∂W [ (1/2)(T_k - O_k)^2 ]    (15)

After computing the value of the gradient vector G in each iteration, the weight vector W is updated as follows:

    W_{k+1} = W_k - α G_k    (16)

where W_{k+1} is the updated weight, W_k is the current weight, G_k is the gradient vector, and α is the learning parameter.
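As an illustration of this training loop, the following sketch wires the 6-9-1 architecture used in this paper to the forward pass (9)-(11), the MSE (13), and the update rule (16). The data are random placeholders, and bias terms are omitted for brevity; this is a sketch, not the exact implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Random placeholder data: 6 normalized CK-metric inputs per pattern and
# one target output; the 6-9-1 shape mirrors the architecture used here.
X = rng.random((20, 6))
y = rng.random((20, 1))

W1 = rng.normal(scale=0.5, size=(6, 9))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(9, 1))   # hidden -> output weights
alpha = 0.5                               # learning parameter in (16)

for epoch in range(200):
    # Forward pass using the sigmoidal activations of (10) and (11).
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)
    err = O - y                           # error driving the MSE of (13)

    # Backpropagate: first-order error derivatives as in (15);
    # the sigmoid derivative is s(1 - s).
    dO = err * O * (1 - O)
    dH = (dO @ W2.T) * H * (1 - H)

    # Update rule (16): W <- W - alpha * G.
    W2 -= alpha * (H.T @ dO) / len(X)
    W1 -= alpha * (X.T @ dH) / len(X)

print("final MSE:", float(np.mean((O - y) ** 2)))
```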

(b) Levenberg Marquardt (LM) Method. The LM method locates the minimum of a multivariate function in an iterative manner, where the function is expressed as the sum of squares of nonlinear real-valued functions [21, 22]. This method is used for updating the weights during the learning phase. The LM method is fast and stable in terms of its execution when compared with the gradient descent method (the LM method is a combination of the steepest descent and Gauss-Newton methods). In the LM method, the weight vector W is updated as follows:

    W_{k+1} = W_k - (J_k^T J_k + μI)^{-1} J_k^T e_k    (17)

where W_{k+1} is the updated weight, W_k is the current weight, J is the Jacobian matrix, and μ is the combination coefficient; that is, when μ is very small the method acts as the Gauss-Newton method, and when μ is very large it acts as the gradient descent method.

The Jacobian matrix is calculated as follows:

    J = [ ∂E_11/∂W_1   ∂E_11/∂W_2   ...   ∂E_11/∂W_N
          ∂E_12/∂W_1   ∂E_12/∂W_2   ...   ∂E_12/∂W_N
          ...
          ∂E_PM/∂W_1   ∂E_PM/∂W_2   ...   ∂E_PM/∂W_N ]    (18)

where N is the number of weights, P is the number of input patterns, and M is the number of output patterns.
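A minimal sketch of the LM update (17) follows, assuming a hypothetical single-neuron model and a finite-difference Jacobian in place of analytic derivatives; all data and the value of μ are illustrative placeholders.

```python
import numpy as np

def residuals(w, X, y):
    # Hypothetical model: a single sigmoid neuron; e_k = prediction - target.
    return 1.0 / (1.0 + np.exp(-(X @ w))) - y

def numerical_jacobian(w, X, y, h=1e-6):
    # J[p, n] = d e_p / d w_n, estimated by forward differences,
    # standing in for the analytic Jacobian of (18).
    e0 = residuals(w, X, y)
    J = np.zeros((len(e0), len(w)))
    for n in range(len(w)):
        wp = w.copy()
        wp[n] += h
        J[:, n] = (residuals(wp, X, y) - e0) / h
    return J

rng = np.random.default_rng(1)
X, y = rng.random((30, 6)), rng.random(30)
w, mu = np.zeros(6), 0.01          # mu: combination coefficient in (17)

for _ in range(20):
    e = residuals(w, X, y)
    J = numerical_jacobian(w, X, y)
    # LM update (17): W <- W - (J^T J + mu I)^(-1) J^T e.
    w = w - np.linalg.solve(J.T @ J + mu * np.eye(len(w)), J.T @ e)

r = residuals(w, X, y)
print("sum of squared errors:", float(r @ r))
```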

4.2.2. Functional Link Artificial Neural Network (FLANN). FLANN, initially proposed by Pao [23], is a flat network having a single layer; that is, the hidden layers are omitted. Input variables generated by the linear links of the neural network are linearly weighted. Functional links act on elements of the input variables by generating a set of linearly independent functions. These links are evaluated as functions with the variables as the arguments. Figure 2 shows the single-layered architecture of FLANN.

Figure 2: Flat net structure of FLANN.

The FLANN architecture offers less computational overhead and a higher convergence speed when compared with other ANN techniques.

Using FLANN, the output is calculated as follows:

    ŷ = Σ_{i=1}^{n} W_i X_i    (19)

where ŷ is the predicted value, W is the weight vector, and X is the functional block, defined as follows:

    X = [1, x_1, sin(πx_1), cos(πx_1), x_2, sin(πx_2), cos(πx_2), ...]    (20)

The weight is updated as follows:

    W_i(k+1) = W_i(k) + α e_i(k) x_i(k)    (21)

with α as the learning rate and e_i as the error value, formulated as follows:

    e_i = y_i - ŷ_i    (22)

where y and ŷ represent the actual and the obtained (predicted) values, respectively.
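The following sketch implements the functional expansion (20) and the adaptive update (21) with the error (22); the training data and the learning rate are hypothetical placeholders.

```python
import numpy as np

def expand(x):
    # Functional expansion (20): [1, x_i, sin(pi x_i), cos(pi x_i), ...].
    feats = [1.0]
    for xi in x:
        feats += [xi, np.sin(np.pi * xi), np.cos(np.pi * xi)]
    return np.array(feats)

rng = np.random.default_rng(2)
X = rng.random((50, 6))        # placeholder normalized CK metric vectors
y = rng.random(50)             # placeholder fault-rate targets

E = np.array([expand(x) for x in X])
W = np.zeros(E.shape[1])
alpha = 0.05                   # learning rate in (21), an assumed value

for epoch in range(100):
    for xe, yi in zip(E, y):
        y_hat = W @ xe                    # flat single-layer output (19)
        W += alpha * (yi - y_hat) * xe    # adaptive update (21) with (22)

print("MSE:", float(np.mean((E @ W - y) ** 2)))
```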

4.2.3. Radial Basis Function Network (RBFN). RBFN is a feed-forward neural network (FFNN) trained using a supervised training algorithm. RBFN is generally configured with a single hidden layer, where the activation function is chosen from a class of functions called basis functions.

RBFN is one of the ANN techniques which contains three layers, namely, the input, hidden, and output layers. Figure 3 shows the structure of a typical RBFN in its basic form, involving three entirely different layers. RBFN contains h hidden centers, represented as C_1, C_2, C_3, ..., C_h.

Figure 3: RBFN network.

The target output is computed as follows:

    y' = Σ_{i=1}^{n} φ_i W_i    (23)

where W_i is the weight of the ith center, φ is the radial function, and y' is the target output. Table 3 shows the various radial functions available in the literature.

In this paper, the Gaussian function is used as the radial function, and z, the distance vector, is calculated as follows:

    z = ||x_j - c_j||    (24)

where x_j is the input vector that lies in the receptive field for center c_j. In this paper, gradient descent learning and hybrid learning techniques are used for updating the weight and center, respectively.
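A minimal sketch of the RBFN forward pass (23) with the distance vector (24) and the Gaussian radial function of Table 3; the centers, widths, and weights are hypothetical placeholders rather than trained values.

```python
import numpy as np

def rbfn_predict(x, centers, widths, W):
    # Distance vector (24): z_j = ||x - c_j|| for each hidden center;
    # Gaussian radial function (Table 3): phi(z) = exp(-z^2 / (2 sigma^2));
    # target output (23): y' = sum_i phi_i * W_i.
    z = np.linalg.norm(centers - x, axis=1)
    phi = np.exp(-(z ** 2) / (2 * widths ** 2))
    return phi @ W

rng = np.random.default_rng(3)
centers = rng.random((9, 6))   # nine hidden centers over six CK metrics
widths = np.full(9, 0.5)       # assumed fixed widths sigma
W = rng.normal(size=9)         # output-layer weights

x = rng.random(6)              # one (normalized) CK metric vector
print("predicted fault rate:", float(rbfn_predict(x, centers, widths, W)))
```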


Table 3: Radial functions

Radial function            Mathematical expression
Gaussian radial function   φ(z) = e^{-z^2 / (2σ^2)}
Thin plate spline          φ(z) = z^2 log z
Quadratic                  φ(z) = (z^2 + r^2)^{1/2}
Inverse quadratic          φ(z) = 1 / (z^2 + r^2)^{1/2}

The advantage of using RBFN lies in its training rate, which is faster when compared with propagation networks, and it is less susceptible to problems with nonstationary inputs.

(a) Gradient Descent Learning Technique. Gradient descent learning is a technique used for updating the weight W and center C. The center C in gradient learning is updated as

    C_ij(k+1) = C_ij(k) - η_1 ∂E_k/∂C_ij    (25)

and the weight W is updated as

    W_i(k+1) = W_i(k) - η_2 ∂E_k/∂W_i    (26)

where η_1 and η_2 are the learning coefficients for updating the center and weight, respectively.

(b) Hybrid Learning Technique. In the hybrid learning technique, the radial functions relocate their centers in a self-organized manner, while the weights are updated using a supervised learning algorithm. In this paper, the least mean square (LMS) algorithm is used for updating the weights, while a center is updated only when it satisfies the following conditions:

(a) the Euclidean distance between the input pattern and the nearest center is greater than the threshold value, and
(b) the MSE is greater than the desired accuracy.

After satisfying the above conditions, the Euclidean distance is used to find the centers close to x, and then the centers are updated as follows:

    C_i(k+1) = C_i(k) + α (x - C_i(k))    (27)

After every update, the center moves closer to x.
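A sketch of the center-update step (27) under the two conditions above; the distance threshold, learning rate, and MSE values used here are illustrative assumptions, not values reported in this study.

```python
import numpy as np

def hybrid_center_update(x, centers, alpha=0.1, dist_threshold=0.8,
                         mse=1.0, desired_mse=0.002):
    # Move the nearest center toward x via (27), but only when both
    # conditions hold: the nearest center lies farther away than the
    # threshold, and the MSE is still above the desired accuracy.
    d = np.linalg.norm(centers - x, axis=1)
    i = int(np.argmin(d))
    if d[i] > dist_threshold and mse > desired_mse:
        centers[i] += alpha * (x - centers[i])
    return centers

rng = np.random.default_rng(4)
centers = rng.random((9, 6))
x = rng.random(6)
centers = hybrid_center_update(x, centers)
```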

4.2.4. Probabilistic Neural Network (PNN). PNN was introduced by Specht [24]. It is a feed-forward neural network which has basically been derived from the Bayesian network and a statistical algorithm.

In PNN, the network is organized as a multilayered feed-forward network with four layers: the input, hidden, summation, and output layers. Figure 4 shows the basic architecture of PNN.

The input layer first computes the distance from the input vector to the training input vectors. The second layer consists of a Gaussian function which is formed using the given set of data points as centers. The summation layer sums up the contribution of each class of inputs and produces a net output which is a vector of probabilities. The fourth layer determines the fault prediction rate.

The PNN technique is faster when compared to multilayer perceptron networks and is also more accurate. The major concern lies in finding an accurate smoothing parameter σ to obtain better classification. The following function is used in the hidden layer:

    φ(z) = e^{-z^2 / (2σ^2)}    (28)

where z = ||x - c||, x is the input, c is the center, and z is the Euclidean distance between the center and the input vector.
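As a sketch of this classification scheme, the following fragment implements the pattern, summation, and output layers around the kernel (28). The training data are hypothetical placeholders; the smoothing parameter of 1.7 anticipates the best-performing setting reported in Section 6.

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=1.7):
    # Pattern layer: a Gaussian kernel (28) around every training pattern;
    # summation layer: average kernel response per class; output layer:
    # the class with the maximum summed response wins.
    scores = {}
    for cls in np.unique(train_y):
        z = np.linalg.norm(train_X[train_y == cls] - x, axis=1)
        scores[cls] = np.mean(np.exp(-(z ** 2) / (2 * sigma ** 2)))
    return max(scores, key=scores.get)

rng = np.random.default_rng(5)
train_X = rng.random((100, 6))               # placeholder CK metric vectors
train_y = (train_X[:, 0] > 0.5).astype(int)  # placeholder faulty/not labels
print(pnn_classify(rng.random(6), train_X, train_y))
```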

5. Performance Evaluation Parameters

The following subsections give the basic definitions of the performance parameters used in the statistical and machine learning methods for fault prediction.

5.1. Statistical Analysis. The performance parameters for statistical analysis can be determined based on the confusion matrix [25], as shown in Table 4.

Table 4: Confusion matrix to classify a class as faulty and not-faulty

               No (prediction)        Yes (prediction)
No (actual)    True negative (TN)     False positive (FP)
Yes (actual)   False negative (FN)    True positive (TP)

5.1.1. Precision. It is defined as the degree to which repeated measurements under unchanged conditions show the same results:

    Precision = TP / (FP + TP)    (29)

5.1.2. Correctness. Correctness, as defined by Briand et al. [13], is the ratio of the number of modules correctly classified as fault prone to the total number of modules classified as fault prone:

    Correctness = TP / (FP + TP)    (30)

5.1.3. Completeness. According to Briand et al. [13], completeness is the ratio of the number of faults in classes classified as fault prone to the total number of faults in the system:

    Completeness = TP / (FN + TP)    (31)

Figure 4: Basic structure of PNN.

5.1.4. Accuracy. Accuracy, as defined by Yaun et al. [26], is the proportion of predicted fault-prone modules that are inspected, out of all modules:

    Accuracy = (TN + TP) / (TN + FP + FN + TP)    (32)

5.1.5. R^2 Statistic. R^2, also known as the coefficient of multiple determination, is a measure of the power of correlation between the predicted and actual number of faults [25]. The higher the value of this statistic, the greater the accuracy of the predicted model:

    R^2 = 1 - [ Σ_{i=1}^{n} (y_i - ŷ_i)^2 / Σ_{i=1}^{n} (y_i - ȳ)^2 ]    (33)

where y_i is the actual number of faults, ŷ_i is the predicted number of faults, and ȳ is the average number of faults.
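A short sketch of how these parameters (29)-(33) translate into code. The confusion-matrix counts in the usage line are those reported later in Table 12 for AIF version 1.6, so the printed values can be checked against the MULTI row of Table 13.

```python
import numpy as np

def statistical_parameters(tn, fp, fn, tp):
    precision = tp / (fp + tp)                  # (29); same form as (30)
    completeness = tp / (fn + tp)               # (31)
    accuracy = (tn + tp) / (tn + fp + fn + tp)  # (32)
    return precision, completeness, accuracy

def r2_statistic(y, y_pred):
    # (33): coefficient of multiple determination.
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Counts taken from Table 12 (AIF version 1.6, after regression):
# prints roughly (0.6153, 0.0851, 0.8113).
print(statistical_parameters(tn=767, fp=10, fn=172, tp=16))
```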

5.2. Machine Learning. Fault prediction accuracy for the four applied ANN techniques is determined by using performance evaluation parameters such as the mean absolute error (MAE), mean absolute relative error (MARE), root mean square error (RMSE), and standard error of the mean (SEM).

5.2.1. Mean Absolute Error (MAE). This performance parameter determines how close the values of the predicted and actual fault (accuracy) rates are:

    MAE = (1/n) Σ_{i=1}^{n} | y_i - y'_i |    (34)

5.2.2. Mean Absolute Relative Error (MARE). Consider

    MARE = (1/n) Σ_{i=1}^{n} | y_i - y'_i | / y_i    (35)

In (35), a numerical value of 0.05 is added to the denominator in order to avoid numerical overflow (division by zero). The modified MARE is formulated as

    MARE = (1/n) Σ_{i=1}^{n} | y_i - y'_i | / (y_i + 0.05)    (36)

5.2.3. Root Mean Square Error (RMSE). This performance parameter determines the differences between the values of the predicted and actual fault (accuracy) rates:

    RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i - y'_i)^2 )    (37)

In (35), (36), and (37), y_i is the actual value and y'_i is the expected value.

5.2.4. Standard Error of the Mean (SEM). It is the deviation of the predicted value from the actual fault (accuracy) rate:

    SEM = SD / sqrt(n)    (38)

where SD is the sample standard deviation and n is the number of samples.
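The error measures (34)-(38) translate directly into code; the actual/predicted vectors below are hypothetical, and, for SEM, the sample standard deviation is assumed to be taken over the prediction errors.

```python
import numpy as np

def error_metrics(y, y_pred):
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y - y_pred))                    # (34)
    mare = np.mean(np.abs(y - y_pred) / (y + 0.05))      # modified MARE (36)
    rmse = np.sqrt(np.mean((y - y_pred) ** 2))           # (37)
    sem = np.std(y - y_pred, ddof=1) / np.sqrt(len(y))   # (38), SD of errors
    return mae, mare, rmse, sem

# Hypothetical actual versus predicted fault rates for a few classes.
y_actual = [0.0, 1.0, 0.0, 2.0, 1.0]
y_predicted = [0.1, 0.8, 0.0, 1.7, 1.2]
print(error_metrics(y_actual, y_predicted))
```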


Table 5: Distribution of bugs for AIF version 1.6

Number of classes   Percentage of bugs   Number of associated bugs
777                 80.5181              0
101                 10.4663              1
32                  3.3161               2
16                  1.6580               3
14                  1.4508               4
6                   0.6218               5
2                   0.2073               6
3                   0.3109               7
5                   0.5181               8
1                   0.1036               9
1                   0.1036               10
3                   0.3109               11
1                   0.1036               13
1                   0.1036               17
1                   0.1036               18
1                   0.1036               28
965                 100.00               142

Figure 5: WMC of AIF version 1.6.

6. Results and Analysis

In this section, the relationship between the values of the metrics and the faults found in a class is determined. In this approach, the comparative study involves using the six CK metrics as input nodes, and the output is the achieved fault prediction rate. Fault prediction is performed for AIF version 1.6.

6.1. Fault Data. To perform statistical analysis, bugs were collected from the Promise data repository [18]. Table 5 shows the distribution of bugs based on the number of occurrences (in terms of the percentage of classes containing a number of bugs) for AIF version 1.6.

AIF version 1.6 contains 965 classes, of which 777 classes contain zero bugs (80.5181%); 10.4663% of classes contain at least one bug; 3.3161% of classes contain a minimum of two bugs; 1.6580% of classes contain three bugs; 1.4508% of classes contain four bugs; 0.6218% of classes contain five bugs; 0.2073% of the classes contain six bugs; 0.3109% of classes contain seven and eleven bugs; 0.5181% of classes contain eight bugs; and 0.1036% of classes contain nine, ten, thirteen, seventeen, eighteen, and twenty-eight bugs.

Figure 6: DIT of AIF version 1.6.

Figure 7: NOC of AIF version 1.6.

6.2. Metrics Data. The CK metric values for WMC, DIT, NOC, CBO, RFC, and LCOM for AIF version 1.6 are graphically represented in Figures 5, 6, 7, 8, 9, and 10, respectively.

6.3. Descriptive Statistics and Correlation Analysis. This subsection gives a comparative analysis of the fault data, the descriptive statistics of classes, and the correlation among the six metrics with respect to those of Basili et al. [1]. Basili et al. studied object-oriented systems written in C++. They carried out an experiment in which they set up eight project groups, each consisting of three students. Each group had the same task of developing a small/medium-sized software system. Since all the necessary documentation (for instance, reports about faults and their fixes) was available, they could search for relationships between fault density and metrics. They used the same CK metric suite. Logistic regression was employed to analyze the relationship between the metrics and the fault proneness of classes.

The obtained CK metric values of AIF version 1.6 are compared with the results of Basili et al. [1]. In comparison with Basili et al., the total number of classes considered is much greater: 965 classes were considered (vs. 180). Table 6 shows the comparative statistical analysis results obtained for Basili et al. and AIF version 1.6 for the CK metrics, indicating the Max, Min, Median, and Standard deviation values.


Table 6: Descriptive statistics of classes

                    WMC      DIT    NOC      CBO      RFC      LCOM
Basili et al. [1]
  Max               99.00    9.00   105.00   13.00    30.00    426.00
  Min               1.00     0.00   0.00     0.00     0.00     0.00
  Median            9.50     0.00   19.50    0.00     5.00     0.00
  Mean              13.40    1.32   33.91    0.23     6.80     9.70
  Std Dev           14.90    1.99   33.37    1.54     7.56     63.77
AIF version 1.6
  Max               166.00   6.00   39.00    448.00   322.00   13617
  Min               0.00     0.00   0.00     0.00     0.00     0.00
  Median            5.00     1.00   0.00     7.00     14.00    4.00
  Mean              8.57     1.95   0.52     11.10    21.42    79.33
  Std Dev           11.20    1.27   2.63     22.52    25.00    523.75

Figure 8: CBO of AIF version 1.6.

Figure 9: RFC of AIF version 1.6.

The dependency between the CK metrics is computed using Pearson's correlations (R^2: coefficient of determination) and compared with Basili et al. [1] for AIF version 1.6. The coefficient of determination R^2 is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. It is a measure that allows a researcher to determine how certain one can be in making predictions from a certain model/graph. Table 7 shows the Pearson's correlations for the data set used by Basili et al. [1] and the correlation metrics of AIF version 1.6.

Figure 10: LCOM of AIF version 1.6.

From Table 7, with respect to AIF version 1.6, it is observed that the correlation between WMC and RFC is 0.77, which is highly correlated; that is, these two metrics are very much linearly dependent on each other. Similarly, the correlation between WMC and DIT is 0, which indicates that they are loosely correlated; that is, there is no dependency between these two metrics.

6.4. Fault Prediction Using Statistical Methods

6.4.1. Linear Regression Analysis. Table 8 shows the results obtained for linear regression analysis, in which the fault is considered as the dependent variable and the CK metrics are the independent variables.

"R" represents the coefficient of correlation, and "P" refers to the significance of the metric value. If P < 0.001, then the metrics are of very great significance in fault prediction.

6.4.2. Logistic Regression Analysis. The logistic regression method helps to indicate whether a class is faulty or not but does not convey anything about the possible number of faults in the class. Univariate and multivariate logistic regression techniques are applied to predict whether a class is faulty or not.


Table 7: Correlations between metrics

Basili et al. [1]
        WMC    DIT    NOC    CBO     RFC     LCOM
WMC     1.00   0.02   0.24   0.00    0.13    0.38
DIT            1.00   0.00   0.00    0.00    0.01
NOC                   1.00   0.00    0.00    0.00
CBO                          1.00    0.31    0.01
RFC                                  1.00    0.09
LCOM                                         1.00

AIF version 1.6
        WMC    DIT    NOC    CBO     RFC     LCOM
WMC     1.00   0.00   0.03   0.10    0.77    0.60
DIT            1.00   0.00   0.00    0.00    0.01
NOC                   1.00   0.024   0.025   0.027
CBO                          1.00    0.08    0.05
RFC                                  1.00    0.42
LCOM                                         1.00

Table 8: Linear regression analysis

Version   R        P value   Std error
1.2       0.5360   0.000     0.1114
1.4       0.5024   0.000     0.1450
1.6       0.5154   0.000     0.0834

Figure 11: Logistic graph of 1/(1 + exp(-q)).

Univariate regression analysis is used to examine the effect of each individual metric on the faults of a class, while multivariate regression analysis is used to examine the combined effectiveness of the metrics on the faults of a class. The results of the three versions of AIF are compared considering these two statistical techniques. Figure 11 shows the typical "S" curve (similar to the sigmoid function) obtained for AIF version 1.6 using multivariate logistic regression. Tables 9 and 10 contain the tabulated values for the results obtained by applying univariate and multivariate regression analysis, respectively.

From Table 9, it can be observed that all metrics of the CK suite are highly significant except for DIT. The P values for the three versions (with respect to DIT) are 0.335, 0.108, and 0.3527, respectively. Higher values of P indicate lower significance.

Univariate and multivariate logistic regression statistical methods were used for classifying a class as faulty or not faulty. Logistic regression was applied with a threshold value of 0.5; that is, π > 0.5 indicates that a class is classified as "faulty"; otherwise, it is categorized as a "not faulty" class.

Tables 11 and 12 represent the confusion matrices for the number of classes with faults before and after applying regression analysis, respectively, for AIF version 1.6. From Table 11, it is clear that, before applying the logistic regression, a total of 777 classes contained zero bugs and 188 classes contained at least one bug. After applying logistic regression (Table 12), a total of 767 + 16 classes are classified correctly, with an accuracy of 81.13%.

The performance parameters of all three versions of AIF, obtained by applying univariate and multivariate logistic regression analysis, are shown in Table 13. Here, precision, correctness, completeness, and accuracy [1, 13, 27, 28] are taken as the performance parameters. Using multivariate logistic regression, the accuracy of AIF version 1.2 is found to be 64.44%, the accuracy of AIF version 1.4 is 83.37%, and that of AIF version 1.6 is 81.13%.

From the results obtained by applying linear and logistic regression analysis, it is found that, of the six metrics, WMC appears to have the greatest impact in predicting faults.

6.5. Fault Prediction Using Neural Networks

6.5.1. Artificial Neural Network. An ANN is an interconnected group of nodes. In this paper, three layers of ANN are considered, in which six nodes act as input nodes, nine nodes represent the hidden nodes, and one node acts as the output node.

ANN is a three-phase network; the phases are used for learning, validation, and testing purposes. So, in this article, 70% of the total input patterns are considered for the learning phase, 15% for validation, and the remaining 15% for testing. The regression analysis carried out classifies whether a class is faulty or not faulty. The prediction models of ANN and its forms, such as PNN, RBFN, and FLANN, not only classify a class as faulty or not faulty but also highlight the number of bugs found in the class; these bugs are fixed in the testing phase of the software development life cycle.


Table 9: Analysis of univariate regression

        Coefficient                 Constant                   P value                      R value
        v1.2     v1.4    v1.6      v1.2    v1.4    v1.6      v1.2     v1.4     v1.6       v1.2     v1.4    v1.6
WMC     0.028    0.05    0.03      -0.83   -2.11   -1.77     0.0013   0.0007   0.00       0.130    0.240   0.18
DIT     -0.067   0.10    0.05      -0.46   -1.83   -1.53     0.335    0.108    0.3257     -0.039   0.054   0.02
NOC     0.137    0.09    0.13      -0.66   -1.67   -1.50     0.0007   0.00     0.00       0.136    0.13    0.16
CBO     0.011    0.01    0.02      -0.71   -1.80   -1.66     0.017    0.00     0.00       0.096    0.15    0.17
RFC     0.012    0.02    0.01      -0.86   -2.15   -1.79     0.0014   0.00     0.00       0.130    0.23    0.17
LCOM    0.007    0.007   0.007     -0.64   -1.67   -1.48     0.0349   0.0004   0.0007     0.085    0.11    0.11

Table 10: Multivariate logistic regression analysis

Coefficient   AIF version 1.2   AIF version 1.4   AIF version 1.6
WMC           0.0195            0.0574            0.0320
DIT           -0.041            0.000             0.000
NOC           0.1231            0.000             0.000
CBO           0.005             0.008             0.001
RFC           0.0071            0.0081            0.0109
LCOM          0                 -0.001            0
Constant      -0.917            -2.785            -2.157

Table 11: Before applying regression

               Not-faulty   Faulty
Not-faulty     777          0
Faulty         188          0

Table 12: After applying regression

               Not-faulty   Faulty
Not-faulty     767          10
Faulty         172          16


In this paper, the six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the Gradient descent method and the Levenberg Marquardt method.
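A minimal sketch of the 70/15/15 partition described above, assuming the patterns can simply be shuffled and sliced; the data here are random placeholders sized to the 965 classes of AIF version 1.6.

```python
import numpy as np

def split_patterns(X, y, seed=0):
    # Shuffle, then slice the patterns 70/15/15 into learning,
    # validation, and testing sets, as described above.
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(0.70 * len(X))
    n_val = int(0.15 * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

rng = np.random.default_rng(6)
X = rng.random((965, 6))      # 965 classes with six CK metrics each
y = rng.random(965)           # placeholder fault-rate targets
train, val, test = split_patterns(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 675 144 146
```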

(a) Gradient Descent Method. The Gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics of AIF version 1.6. Figure 12 shows the plot of the variation of the mean square error values with respect to the number of epochs (iterations) for AIF version 1.6.

(b) Levenberg Marquardt Method. The Levenberg Marquardt method [21, 22] is a technique for updating the weights. In the case of the Gradient descent method, the learning rate α is constant, but in the Levenberg Marquardt method, the learning rate α varies in every iteration. So this method consumes a smaller number of iterations to train the network.

Figure 12: MSE versus number of epochs w.r.t. Gradient descent NN.

Figure 13: MSE versus number of epochs w.r.t. Levenberg Marquardt NN.

Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg Marquardt method.

Figure 13 shows the plot of the variation of the mean square error values with respect to the number of epochs for AIF version 1.6.


Table 13: Precision, correctness, completeness, and accuracy for three versions of AIF

        Precision (%)            Correctness (%)          Completeness (%)         Accuracy (%)
        v1.2    v1.4    v1.6     v1.2    v1.4    v1.6     v1.2    v1.4   v1.6      v1.2    v1.4    v1.6
WMC     61.11   41.17   57.14    61.11   41.17   57.14    5.09    4.82   4.25      66.13   84.02   81.71
DIT     --      --      --       --      --      --       0       0      0         64.47   83.37   80.51
NOC     75      75      66.66    75      75      66.66    5.55    2.06   5.31      65.78   83.6    81.03
CBO     60      57.14   77.77    60      57.14   77.77    2.77    2.75   3.72      64.8    83.48   81.03
RFC     66.66   36.36   50       66.66   36.36   50       4.62    2.75   2.12      65.29   83.02   80.51
LCOM    66.66   50      60       0.66    0.5     0.6      2.77    6.8    1.59      64.96   83.37   80.62
MULTI   68.75   50      61.53    68.75   50      61.53    10.18   7.58   8.51      66.44   83.37   81.13

Table 14: Accuracy prediction using gradient descent NN

MAE      MARE     RMSE     R         P value   Std error   Accuracy (%)
0.0594   1.0930   0.0617   -0.2038   0.0044    0.0048      94.0437

Table 15: Accuracy prediction using Levenberg Marquardt

MAE      MARE     RMSE     R         P value   Std error   Accuracy (%)
0.0023   1.1203   0.0308   -0.2189   0.0022    0.0041      90.4977

Table 16: Accuracy prediction using FLANN

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0304   0.7097   0.0390   0.3308   2.4601e-06   0.0050      96.3769

Table 17: Accuracy prediction using basic RBFN

MAE      MARE     RMSE     R        P value   Std error   Accuracy (%)
0.0279   0.3875   0.0573   0.1969   0.059     0.006       97.2792

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has less computational cost. In this paper, the adaptive algorithm has been used for updating the weights, as shown in (21). Figure 14 shows the variation of the mean square values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.

6.5.3. Radial Basis Function Network. In this paper, the Gaussian radial function is used as the radial function. Gradient descent learning and hybrid learning methods are used for updating the centers and weights, respectively.

A three-layered RBFN has been considered, in which the six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplifying (25), the equation is represented as

    C_ij(k+1) = C_ij(k) - η_1 (y' - y) W_i (φ_i / σ^2) (x_j - C_ij(k))    (39)

and the modified equation (26) is formulated as

    W_i(k+1) = W_i(k) + η_2 (y' - y) φ_i    (40)

where σ is the width of the center and k is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE with respect to the number of epochs.

(b) Hybrid Learning Method. In the hybrid learning method, the centers are updated using (27), while the weights are updated using a supervised learning method. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance matrix for AIF version 1.6. Figure 16 shows the graph of the variation of MSE versus the number of epochs.

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: input, hidden, summation, and output.

In PNN, 50% of the faulty and nonfaulty classes are taken as input for the hidden layer. The Gaussian function (28) is used as the hidden node function. The summation layer sums the contribution of each class of input patterns and produces a net output which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.


Table 18: Accuracy prediction using RBFN gradient

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0207   0.2316   0.0323   0.3041   1.6302e-05   0.0041      97.2475

Table 19: Accuracy prediction using hybrid RBFN

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0614   0.1032   0.0316   0.9184   3.1834e-79   0.0013      98.4783

Figure 14: Graph plot of MSE versus number of iterations (epochs) w.r.t. FLANN.

Figure 15: MSE versus number of epochs w.r.t. gradient RBFN.

6.6. Comparison. Table 20 shows the tabulated results for the obtained performance parameter values, the number of epochs, and the accuracy rate obtained by applying the neural network techniques. This performance table is an indication of the better fault prediction model. In this comparative analysis, the performance parameter mean square error (MSE) was taken as the criterion for computing the performance parameters (such as MARE, MSE, number of epochs, and accuracy rate) when the four neural network techniques were applied. During this process, an MSE value of 0.002 was set as the threshold for evaluation. Based on the number of iterations and the accuracy rate obtained by the respective NN technique, the best prediction model was determined.

Figure 16: MSE versus number of epochs w.r.t. hybrid RBFN.

Figure 17: Accuracy rate versus smoothing parameter.

From Table 20, it is evident that the gradient NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, which is an improvised model of ANN, obtained a 90.4% accuracy rate. This accuracy rate is lower than that of the gradient NN, but this approach (the LM method) took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely, the basic RBFN, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained a better prediction rate of 98.47% in only 14 epochs when compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved. FLANN reached this accuracy rate in 66 epochs, as it has no hidden layer involved in its architecture.


Table 20: Performance metrics

AI technique       Epoch   MAE      MARE     RMSE     Std error   Accuracy (%)
Gradient descent   162     0.0594   1.0930   0.0617   0.0048      94.04
LM                 13      0.0023   1.1203   0.0308   0.0041      90.49
RBFN basic         --      0.0279   0.3875   0.0573   0.006       97.27
RBFN gradient      41      0.0207   0.2316   0.0323   0.0041      97.24
RBFN hybrid        14      0.0614   0.1032   0.0316   0.0013      98.47
FLANN              66      0.0304   0.7097   0.0390   0.0050      96.37

The performance of PNN is shown in Figure 17. The highest accuracy in prediction was obtained for a smoothing parameter value of 1.7, with which PNN obtained a classification rate of 86.41%.

RBFN using the hybrid learning model gives the least values for MARE and standard error, together with the highest accuracy rate. Hence, from the results obtained by using the ANN techniques, it can be concluded that the RBFN hybrid approach obtained the best fault prediction rate in the smallest number of epochs when compared with the other three ANN techniques.

7. Conclusion

The use of prediction models by system analysts to classify fault-prone classes as faulty or not faulty is the need of the day for researchers as well as practitioners. So, more reliable approaches for prediction need to be modeled. In this paper, two approaches, namely, statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data requires the help of a better prediction model.

This paper presents a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Machine learning techniques such as the artificial neural network (gradient descent and Levenberg Marquardt methods), functional link artificial neural network, radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and probabilistic neural network techniques were also applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in fewer epochs (14 iterations) when compared with the other three neural network techniques.

In the future, this work should be replicated on other open-source projects like Mozilla, using different AI techniques, to analyze which model performs better in achieving higher accuracy for fault prediction. Also, fault prediction accuracy should be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751-761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308-320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52-60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, 1994.

[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1-8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics: an analysis of dependencies," in Proceedings of the Workshop on Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241-262, 1995.

[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90-99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the IEEE 19th International Conference on Software Engineering, pp. 412-421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35-40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245-273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242-249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63-75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902-909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13-26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45-50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164-168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal on Applied Mathematics, vol. 11, no. 2, pp. 431-441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, Mass, USA, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109-118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193-206, 2012.

[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "Application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSET '00), pp. 85-91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395-417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1-6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1-7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402-419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39-62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168-175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1-4, 2013.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

4 ISRN Software Engineering

Input layer

Output layer

Hidden layer

Figure 1 A typical FFNN

In this paper for input layer linear activation function hasbeen used that is the output of the input layer ldquo119874

119894rdquo is input

of the input layer ldquo119868119894rdquo which is represented as follows

119874119894= 119868119894 (9)

For hidden layer and output layer sigmoidal (squashed-S)function is used The output of hidden layer 119874

ℎfor input of

hidden layer 119868ℎis represented as follows

119874ℎ=

1

1 + 119890minus119868ℎ (10)

Output of the output layer ldquo119874119900rdquo for the input of the output

layer ldquo119874119894rdquo is represented as follows

119874119900=

1

1 + 119890minus119874119894 (11)

A neural network can be represented as follows

1198841015840= 119891 (119882119883) (12)

where 119883 is the input vector 1198841015840 is the output vector and 119882is weight vector The weight vector 119882 is updated in everyiteration so as to reduce the mean square error (MSE) valueMSE is formulated as follows

MSE = 1

119899

119899

sum

119894=1

(1199101015840

119894minus 119910119894)2

(13)

where 119910 is the actual output and 1199101015840 is the expected output Inthe literature differentmethods are available to updateweightvector (ldquo119882rdquo) such as Gradient descent method Newtonrsquosmethod Quasi-Newton method Gauss Newton Conjugate-gradient method and Levenberg Marquardt method In thispaper Gradient descent and Levenberg Marquardt methodsare used for updating the weight vector119882

(a) Gradient Descent Method Gradient descent is one of themethods for updating the weight during learning phase [20]Gradient descent method uses first-order derivative of totalerror to find the 119898119894119899119894119898119886 in error space Normally gradient

vector 119866 is defined as the first-order derivative of errorfunction Error function is represented as follows

119864119896=1

2(119879119896minus 119874119896)2 (14)

and 119866 is given as

119866 =120597119889

120597119889119882(119864119896) =

120597119889

120597119889119882(1

2(119879119896minus 119874119896)2

) (15)

After computing the value of gradient vector 119866 in eachiteration weighted vector119882 is updated as follows

119882119896+1

= 119882119896minus 120572119866119896 (16)

where119882119896+1

is the updated weight119882119896is the current weight

119866119896is a gradient vector and 120572 is the learning parameter

(b) Levenberg Marquardt (LM) Method LM method locatestheminimumofmultivariate function in an iterativemannerIt is expressed as the sum of squares of nonlinear real-valued functions [21 22] This method is used for updatingthe weights during learning phase LM method is fast andstable in terms of its execution when compared with gradientdescent method (LM method is a combination of steepestdescent andGauss-Newtonmethods) In LMmethod weightvector119882 is updated as follows

119882119896+1

= 119882119896minus (119869119879

119896119869119896+ 120583119868)minus1

119869119896119890119896 (17)

where119882119896+1

is the updated weight119882119896is the current weight

119869 is Jacobian matrix and 120583 is combination coefficient that iswhen 120583 is very small then it acts as Gauss-Newton methodand if 120583 is very large then it acts as Gradient descent method

Jacobian matrix is calculated as follows

119869 =

[[[[[[[[[[[[[[[

[

120597119889

1205971198891198821

(11986411)

120597119889

1205971198891198822

(11986411) sdot sdot sdot

120597119889

119889119882119873

(11986411)

120597119889

1205971198891198821

(11986412)

120597119889

1205971198891198822

(11986412) sdot sdot sdot

120597119889

120597119889119882119873

(11986412)

120597119889

1205971198891198821

(119864119875119872

)120597119889

1205971198891198822

(119864119875119872

) sdot sdot sdot120597119889

120597119889119882119873

(119864119875119872

)

]]]]]]]]]]]]]]]

]

(18)

where 119873 is number of weights 119875 is the number of inputpatterns and119872 is the number of output patterns

422 Functional Link Artificial Neural Network (FLANN)FLANN initially proposed by Pao [23] is a flat networkhaving a single layer that is the hidden layers are omittedInput variables generated by linear links of neural networkare linearly weighed Functional links act on elements ofinput variables by generating a set of linearly independentfunctions These links are evaluated as functions with thevariables as the arguments Figure 2 shows the single layered

ISRN Software Engineering 5

Adaptivealgorithm

Error

+1

X1

X2

x0

x1

w0

w1

w2

sum

sum

S120588

minus

+

y

y

Cos(120587x1)

Sin(120587x1)

Cos(120587x2)

x2

Sin(120587x2)Fu

nctio

nal

Expa

nsio

n

w

y

x1 middot middot middot x2

Figure 2 Flat net structure of FLANN

architecture of FLANN FLANN architecture offers lesscomputational overhead and higher convergence speed whencompared with other ANN techniques

Using FLANN output is calculated as follows

119910 =

119899

sum

119894=1

119882119894119883119894 (19)

where y is the predicted value, W is the weight vector, and X is the functional block, defined as follows:

$$X = [1, x_1, \sin(\pi x_1), \cos(\pi x_1), x_2, \sin(\pi x_2), \cos(\pi x_2), \ldots] \quad (20)$$

and the weight is updated as follows:

$$W_i(k+1) = W_i(k) + \alpha e_i(k) x_i(k) \quad (21)$$

having α as the learning rate and $e_i$ as the error value, which is formulated as follows:

$$e_i = y_i - \hat{y}_i \quad (22)$$

where $y_i$ and $\hat{y}_i$ represent the actual and the obtained (predicted) values, respectively.
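A minimal sketch of the functional expansion and the LMS-style weight update is given below, assuming two or more raw inputs per pattern as in Figure 2; the expansion order, learning rate, and epoch count are illustrative.

```python
import numpy as np

def functional_expansion(x):
    """Expand a raw input vector per (20): [1, x1, sin(pi*x1), cos(pi*x1), ...]."""
    expanded = [1.0]
    for xi in x:
        expanded += [xi, np.sin(np.pi * xi), np.cos(np.pi * xi)]
    return np.array(expanded)

def flann_train(X_raw, y, alpha=0.01, epochs=50):
    """Single-layer FLANN trained with the update rule (21)."""
    W = np.zeros(1 + 3 * X_raw.shape[1])
    for _ in range(epochs):
        for x, target in zip(X_raw, y):
            phi = functional_expansion(x)
            e = target - W @ phi          # error, per (22)
            W = W + alpha * e * phi       # weight update, per (21)
    return W
```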

4.2.3. Radial Basis Function Network (RBFN). RBFN is a feed-forward neural network (FFNN) trained using a supervised training algorithm. RBFN is generally configured with a single hidden layer, where the activation function is chosen from a class of functions called basis functions.

RBFN contains three layers, namely, the input, hidden, and output layers. Figure 3 shows the structure of a typical RBFN in its basic form, involving three entirely different layers. RBFN contains h hidden centers, represented as $C_1, C_2, C_3, \ldots, C_h$.

Figure 3: RBFN network (input layer x1, ..., xp; hidden layer of radial basis functions φ1, ..., φh with centers C1, ..., Ch; and an output layer producing y′ through weights w1, ..., wn).

The target output is computed as follows:

$$y' = \sum_{i=1}^{n} \phi_i W_i \quad (23)$$

where $W_i$ is the weight of the ith center, φ is the radial function, and y′ is the target output. Table 3 shows the various radial functions available in the literature.

In this paper, the Gaussian function is used as the radial function, and z, the distance vector, is calculated as follows:

$$z = \| x_j - c_j \| \quad (24)$$

where $x_j$ is the input vector that lies in the receptive field for center $c_j$. In this paper, gradient descent learning and hybrid learning techniques are used for updating the weight and center, respectively.
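The following is a minimal sketch of the RBFN forward pass of (23) and (24) with Gaussian basis functions; the width σ and the manner of choosing centers are assumptions for illustration.

```python
import numpy as np

def rbfn_predict(X, centers, W, sigma=1.0):
    """RBFN output y' = sum_i phi_i * W_i, per (23), with Gaussian phi.

    X       : (n_samples, n_features) input patterns
    centers : (h, n_features) hidden centers C_1..C_h
    W       : (h,) output weights
    """
    # Distance z = ||x - c|| for every (pattern, center) pair, per (24)
    z = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    phi = np.exp(-(z ** 2) / (2 * sigma ** 2))   # Gaussian radial function
    return phi @ W
```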

Table 3: Radial functions.

Radial function            Mathematical expression
Gaussian radial function   φ(z) = exp(−z²/2σ²)
Thin plate spline          φ(z) = z² log z
Quadratic                  φ(z) = (z² + r²)^(1/2)
Inverse quadratic          φ(z) = 1/(z² + r²)^(1/2)

The advantage of using RBFN lies in its training rate, which is faster when compared with propagation networks, and it is less susceptible to problems with nonstationary inputs.

(a) Gradient Descent Learning Technique. Gradient descent learning is a technique used for updating the weight W and center C. The center C in gradient learning is updated as:

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 \frac{\partial}{\partial C_{ij}}(E_k) \quad (25)$$

and the weight W is updated as:

$$W_i(k+1) = W_i(k) - \eta_2 \frac{\partial}{\partial W_i}(E_k) \quad (26)$$

where $\eta_1$ and $\eta_2$ are the learning coefficients for updating the center and weight, respectively.

(b) Hybrid Learning Technique. In the hybrid learning technique, the radial functions relocate their centers in a self-organized manner, while the weights are updated using a supervised learning algorithm. In this paper, the least mean square (LMS) algorithm is used for updating the weights, while a center is updated only when it satisfies the following conditions:

(a) the Euclidean distance between the input pattern and the nearest center is greater than the threshold value, and

(b) the MSE is greater than the desired accuracy.

After satisfying the above conditions, the Euclidean distance is used to find the centers close to x, and then the centers are updated as follows:

$$C_i(k+1) = C_i(k) + \alpha (x - C_i(k)) \quad (27)$$

With every update, the center moves closer to x.
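The following is a minimal sketch of one step of this hybrid scheme; the threshold and accuracy parameters (dist_threshold, mse_target) are hypothetical names introduced for illustration, since the paper does not give their values.

```python
import numpy as np

def hybrid_update(x, target, centers, W, sigma=1.0, alpha=0.1,
                  lms_rate=0.01, dist_threshold=1.0, mse_target=1e-3):
    """One hybrid-learning step: LMS for weights, self-organized centers, per (27)."""
    z = np.linalg.norm(centers - x, axis=1)      # distances to all centers
    phi = np.exp(-(z ** 2) / (2 * sigma ** 2))
    e = target - phi @ W
    W = W + lms_rate * e * phi                   # LMS weight update
    nearest = np.argmin(z)
    # Conditions (a) and (b): far from the nearest center AND error still large
    if z[nearest] > dist_threshold and e ** 2 > mse_target:
        centers[nearest] = centers[nearest] + alpha * (x - centers[nearest])  # (27)
    return centers, W
```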

4.2.4. Probabilistic Neural Network (PNN). PNN was introduced by Specht [24]. It is a feed-forward neural network which has been basically derived from the Bayesian network and statistical algorithms.

In PNN, the network is organized as a multilayered feed-forward network with four layers: the input, hidden, summation, and output layers. Figure 4 shows the basic architecture of PNN.

The input layer first computes the distance from the input vector to each of the training input vectors. The second layer consists of a Gaussian function which is formed using the given set of data points as centers.

Table 4: Confusion matrix to classify a class as faulty or not-faulty.

               No (prediction)       Yes (prediction)
No (actual)    True negative (TN)    False positive (FP)
Yes (actual)   False negative (FN)   True positive (TP)

The summation layer sums up the contribution of each class of inputs and produces a net output, which is a vector of probabilities. The fourth layer determines the fault prediction rate.

The PNN technique is faster when compared to multilayer perceptron networks and is also more accurate. The major concern lies in finding an accurate smoothing parameter σ to obtain better classification. The following function is used in the hidden layer:

$$\phi(z) = e^{-z^2/2\sigma^2} \quad (28)$$

where z = ||x − c||, x is the input, c is the center, and z is the Euclidean distance between the center and the input vector.
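A minimal sketch of PNN classification under these definitions is given below; the per-class averaging in the summation layer is a standard choice assumed here, and the default smoothing value merely echoes the best value reported later in Section 6.

```python
import numpy as np

def pnn_classify(x, train_X, train_labels, sigma=1.7):
    """Classify x by the class with the largest summed Gaussian activation, per (28).

    train_X      : (n_samples, n_features) training patterns (pattern layer)
    train_labels : (n_samples,) class labels, e.g., 0 = not faulty, 1 = faulty
    """
    z = np.linalg.norm(train_X - x, axis=1)          # distances to patterns
    phi = np.exp(-(z ** 2) / (2 * sigma ** 2))       # hidden-layer outputs, per (28)
    classes = np.unique(train_labels)
    # Summation layer: average activation per class gives a probability-like score
    g = np.array([phi[train_labels == c].mean() for c in classes])
    return classes[np.argmax(g)]                     # output layer: max(g_i)
```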

5. Performance Evaluation Parameters

The following subsections give the basic definitions of the performance parameters used in statistical and machine learning methods for fault prediction.

5.1. Statistical Analysis. The performance parameters for statistical analysis can be determined based on the confusion matrix [25], as shown in Table 4.

5.1.1. Precision. It is defined as the degree to which repeated measurements under unchanged conditions show the same results:

$$\text{Precision} = \frac{TP}{FP + TP} \quad (29)$$

5.1.2. Correctness. Correctness, as defined by Briand et al. [13], is the ratio of the number of modules correctly classified as fault prone to the total number of modules classified as fault prone:

$$\text{Correctness} = \frac{TP}{FP + TP} \quad (30)$$

5.1.3. Completeness. According to Briand et al. [13], completeness is the ratio of the number of faults in classes classified as fault prone to the total number of faults in the system:

$$\text{Completeness} = \frac{TP}{FN + TP} \quad (31)$$

Figure 4: Basic structure of PNN (input layer; pattern layer built from the training set; summation layer computing class scores g1(X), g2(X), g3(X); and an output layer selecting max(g1, g2, g3)).

5.1.4. Accuracy. Accuracy, as defined by Yaun et al. [26], is the proportion of predicted fault prone modules that are inspected, out of all modules:

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + FN + TP} \quad (32)$$
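A minimal sketch computing (29)-(32) from a confusion matrix follows; the counts in the usage comment are taken from Table 12 for AIF version 1.6.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Precision/correctness (29)-(30), completeness (31), and accuracy (32)."""
    precision = tp / (fp + tp)                 # equals correctness here
    completeness = tp / (fn + tp)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return precision, completeness, accuracy

# Example with the AIF 1.6 logistic regression results (Table 12):
# TN = 767, FP = 10, FN = 172, TP = 16  ->  accuracy of about 81.13%
print(confusion_metrics(767, 10, 172, 16))
```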

5.1.5. R² Statistic. R², also known as the coefficient of multiple determination, is a measure of the strength of correlation between the predicted and actual number of faults [25]. The higher the value of this statistic, the more accurate the predicted model:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (33)$$

where $y_i$ is the actual number of faults, $\hat{y}_i$ is the predicted number of faults, and $\bar{y}$ is the average number of faults.

5.2. Machine Learning. Fault prediction accuracy for the four applied ANN techniques is determined by using performance evaluation parameters such as mean absolute error (MAE), mean absolute relative error (MARE), root mean square error (RMSE), and standard error of the mean (SEM).

5.2.1. Mean Absolute Error (MAE). This performance parameter determines how closely the predicted and actual fault (accuracy) rates agree:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - y'_i| \quad (34)$$

5.2.2. Mean Absolute Relative Error (MARE). Consider

$$\text{MARE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|y_i - y'_i|}{y_i} \quad (35)$$

In (35), a numerical value of 0.05 is added to the denominator in order to avoid numerical overflow (division by zero). The modified MARE is formulated as:

$$\text{MARE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|y_i - y'_i|}{y_i + 0.05} \quad (36)$$

5.2.3. Root Mean Square Error (RMSE). This performance parameter determines the differences between the predicted and actual fault (accuracy) rates:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)^2} \quad (37)$$

In (35), (36), and (37), $y_i$ is the actual value and $y'_i$ is the expected value.

5.2.4. Standard Error of the Mean (SEM). It is the deviation of the predicted value from the actual fault (accuracy) rate:

$$\text{SEM} = \frac{SD}{\sqrt{n}} \quad (38)$$

where SD is the sample standard deviation and n is the number of samples.
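Taken together, (33)-(38) can be computed in a few lines; the following is a minimal sketch, with the assumption (one plausible reading of (38)) that SD is the standard deviation of the prediction residuals.

```python
import numpy as np

def error_metrics(y, y_pred):
    """MAE (34), modified MARE (36), RMSE (37), SEM (38), and R^2 (33)."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    n = len(y)
    mae = np.mean(np.abs(y - y_pred))
    mare = np.mean(np.abs(y - y_pred) / (y + 0.05))   # 0.05 avoids division by zero
    rmse = np.sqrt(np.mean((y - y_pred) ** 2))
    sem = np.std(y - y_pred, ddof=1) / np.sqrt(n)     # assumes SD of residuals
    r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return mae, mare, rmse, sem, r2
```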

Table 5: Distribution of bugs for AIF version 1.6.

Number of classes   Percentage of classes   Number of associated bugs
777                  80.5181                  0
101                  10.4663                  1
 32                   3.3161                  2
 16                   1.6580                  3
 14                   1.4508                  4
  6                   0.6218                  5
  2                   0.2073                  6
  3                   0.3109                  7
  5                   0.5181                  8
  1                   0.1036                  9
  1                   0.1036                 10
  3                   0.3109                 11
  1                   0.1036                 13
  1                   0.1036                 17
  1                   0.1036                 18
  1                   0.1036                 28
965                 100.00                  142 (total)

Figure 5: WMC of AIF version 1.6 (distribution of WMC values across classes).

6. Results and Analysis

In this section, the relationship between the values of the metrics and the faults found in a class is determined. In this approach, the comparative study involves using the six CK metrics as input nodes, and the output is the achieved fault prediction rate. Fault prediction is performed for AIF version 1.6.

6.1. Fault Data. To perform statistical analysis, bugs were collected from the Promise data repository [18]. Table 5 shows the distribution of bugs based on the number of occurrences (in terms of the percentage of classes containing a given number of bugs) for AIF version 1.6.

AIF version 1.6 contains 965 classes, of which 777 classes contain zero bugs (80.5181%); 10.4663% of classes contain at least one bug, 3.3161% of classes contain a minimum of two bugs, 1.6580% of classes contain three bugs, 1.4508% of classes contain four bugs, 0.6218% of classes contain five bugs, 0.2073% of the classes contain six bugs,

Figure 6: DIT of AIF version 1.6 (distribution of DIT values, ranging from 0 to 6, across classes).

Figure 7: NOC of AIF version 1.6 (distribution of NOC values across classes).

0.3109% of classes contain seven and eleven bugs, 0.5181% of classes contain eight bugs, and 0.1036% of classes contain nine, ten, thirteen, seventeen, eighteen, and twenty-eight bugs.

6.2. Metrics Data. CK metric values for WMC, DIT, NOC, CBO, RFC, and LCOM, respectively, for AIF version 1.6 are graphically represented in Figures 5, 6, 7, 8, 9, and 10.

6.3. Descriptive Statistics and Correlation Analysis. This subsection gives the comparative analysis of the fault data, the descriptive statistics of classes, and the correlation among the six metrics with that of Basili et al. [1]. Basili et al. studied object-oriented systems written in the C++ language. They carried out an experiment in which they set up eight project groups, each consisting of three students. Each group had the same task of developing a small/medium-sized software system. Since all the necessary documentation (for instance, reports about faults and their fixes) was available, they could search for relationships between fault density and metrics. They used the same CK metric suite. Logistic regression was employed to analyze the relationship between the metrics and the fault proneness of classes.

The obtained CK metric values of AIF version 1.6 are compared with the results of Basili et al. [1]. In comparison with Basili, the total number of classes considered is much greater; that is, 965 classes were considered (versus 180). Table 6 shows the comparative statistical analysis results obtained for

Table 6: Descriptive statistics of classes.

                   WMC      DIT     NOC      CBO      RFC      LCOM
Basili et al. [1]
Max               99.00     9.00   105.00    13.00    30.00    426.00
Min                1.00     0.00     0.00     0.00     0.00      0.00
Median             9.50     0.00    19.50     0.00     5.00      0.00
Mean              13.40     1.32    33.91     0.23     6.80      9.70
Std Dev           14.90     1.99    33.37     1.54     7.56     63.77

AIF version 1.6
Max              166.00     6.00    39.00   448.00   322.00   13617
Min                0.00     0.00     0.00     0.00     0.00      0.00
Median             5.00     1.00     0.00     7.00    14.00      4.00
Mean               8.57     1.95     0.52    11.10    21.42     79.33
Std Dev           11.20     1.27     2.63    22.52    25.00    523.75

Figure 8: CBO of AIF version 1.6 (distribution of CBO values across classes).

Figure 9: RFC of AIF version 1.6 (distribution of RFC values across classes).

Basili et al. and AIF version 1.6 for the CK metrics, indicating Max, Min, Median, Mean, and Standard Deviation.

The dependency between CK metrics is computed using Pearson's correlations (R²: coefficient of determination) and compared with Basili et al. [1] for AIF version 1.6. The coefficient of determination R² is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. It is a measure that allows a researcher to determine how certain one can be in making predictions from a certain model/graph. Table 7 shows the Pearson's correlations for the data set used by Basili et al. [1] and the correlation metrics of AIF version 1.6.

Figure 10: LCOM of AIF version 1.6 (distribution of LCOM values across classes).

From Table 7, with respect to AIF version 1.6, it is observed that the correlation between WMC and RFC is 0.77, which is highly correlated; that is, these two metrics are very much linearly dependent on each other. Similarly, the correlation between WMC and DIT is 0, which indicates that they are loosely correlated; that is, there is no dependency between these two metrics.

6.4. Fault Prediction Using Statistical Methods

6.4.1. Linear Regression Analysis. Table 8 shows the results obtained for linear regression analysis, in which the fault is considered as the dependent variable and the CK metrics are the independent variables.

R represents the coefficient of correlation, and P refers to the significance of the metric value. If P < 0.001, then the metrics are of very great significance in fault prediction.
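As a minimal sketch of this setup, a least-squares fit of fault counts on the six CK metrics can be obtained as follows; the design-matrix layout and use of Python are assumptions for illustration, since the paper's MATLAB code is not shown.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit of faults on CK metrics, with an intercept column.

    X : (n_classes, 6) matrix of CK metrics [WMC, DIT, NOC, CBO, RFC, LCOM]
    y : (n_classes,) observed fault counts per class
    """
    A = np.column_stack([np.ones(len(X)), X])        # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    r = np.corrcoef(y, y_hat)[0, 1]                  # coefficient of correlation R
    return coef, r
```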

6.4.2. Logistic Regression Analysis. The logistic regression method helps to indicate whether a class is faulty or not but does not convey anything about the possible number of faults in the class. Univariate and multivariate logistic regression techniques are applied to predict whether the

Table 7: Correlations between metrics.

Basili et al. [1]
        WMC    DIT    NOC    CBO    RFC    LCOM
WMC     1.00   0.02   0.24   0.00   0.13   0.38
DIT            1.00   0.00   0.00   0.00   0.01
NOC                   1.00   0.00   0.00   0.00
CBO                          1.00   0.31   0.01
RFC                                 1.00   0.09
LCOM                                       1.00

AIF version 1.6
        WMC    DIT    NOC    CBO    RFC    LCOM
WMC     1.00   0.00   0.03   0.10   0.77   0.60
DIT            1.00   0.00   0.00   0.00   0.01
NOC                   1.00   0.024  0.025  0.027
CBO                          1.00   0.08   0.05
RFC                                 1.00   0.42
LCOM                                       1.00

Table 8: Linear regression analysis.

Version   R        P value   Std error
1.2       0.5360   0.000     0.1114
1.4       0.5024   0.000     0.1450
1.6       0.5154   0.000     0.0834

Figure 11: Logistic graph: 1/(1 + exp(−q)) plotted for q from −4 to 4.

class is faulty or not. Univariate regression analysis is used to examine the effect of each metric on the fault proneness of a class, while multivariate regression analysis is used to examine the combined effectiveness of the metrics. The results of the three versions of AIF are compared considering these two statistical techniques. Figure 11 shows the typical "S" curve (similar to the sigmoid function) obtained for AIF version 1.6 using multivariate logistic regression. Tables 9 and 10 contain the tabulated values for the results obtained by applying univariate and multivariate regression analysis, respectively.

From Table 9, it can be observed that all metrics of the CK suite are highly significant except for DIT. The P value for the three versions (with respect to DIT) is 0.335, 0.108, and 0.3527, respectively. Higher values of P are an indication of less significance.

Univariate and multivariate logistic regression statistical methods were used for classifying a class as faulty or not faulty. Logistic regression was applied with a threshold value of 0.5; that is, π > 0.5 indicates that a class is classified as "faulty"; otherwise, it is categorized as a "not faulty" class.
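A minimal sketch of this classification rule follows, using the AIF version 1.6 multivariate coefficients from Table 10; the ordering of the metric vector is an assumption.

```python
import numpy as np

# Multivariate logistic regression coefficients for AIF version 1.6 (Table 10),
# ordered as [WMC, DIT, NOC, CBO, RFC, LCOM], plus the constant term.
COEF = np.array([0.0320, 0.0000, 0.0000, 0.001, 0.0109, 0.0])
CONST = -2.157

def classify(ck_metrics, threshold=0.5):
    """Label a class 'faulty' when pi = 1/(1 + exp(-q)) exceeds the threshold."""
    q = CONST + COEF @ np.asarray(ck_metrics, float)
    pi = 1.0 / (1.0 + np.exp(-q))
    return "faulty" if pi > threshold else "not faulty"
```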

Tables 11 and 12 represent the confusion matrices for the number of classes with faults before and after applying regression analysis, respectively, for AIF version 1.6. From Table 11, it is clear that, before applying the logistic regression, a total of 777 classes contained zero bugs and 188 classes contained at least one bug. After applying logistic regression (Table 12), a total of 767 + 16 classes are classified correctly, with an accuracy of 81.13%.

The performance parameters of all three versions of AIF, obtained by applying univariate and multivariate logistic regression analysis, are shown in Table 13. Here, precision, correctness, completeness, and accuracy [1, 13, 27, 28] are taken as the performance parameters. By using multivariate logistic regression, the accuracy of AIF version 1.2 is found to be 64.44%, the accuracy of AIF version 1.4 is 83.37%, and that of AIF version 1.6 is 81.13%.

From the results obtained by applying linear and logistic regression analysis, it is found that, out of the six metrics, WMC appears to have the most impact in predicting faults.

6.5. Fault Prediction Using Neural Networks

6.5.1. Artificial Neural Network. An ANN is an interconnected group of nodes. In this paper, three layers of ANN are considered, in which six nodes act as input nodes, nine nodes represent the hidden nodes, and one node acts as the output node.

ANN is a three-phase network; the phases are used for learning, validation, and testing purposes. So, in this article, 70% of the total input patterns are considered for the learning phase, 15% for validation, and the remaining 15% for testing. The regression analysis carried out classifies whether a class is faulty or not faulty. The prediction models of ANN and its forms, such as

Table 9: Analysis of univariate regression.

        Coefficient                  Constant                    P value                      R value
        ver 1.2  ver 1.4  ver 1.6    ver 1.2  ver 1.4  ver 1.6   ver 1.2  ver 1.4  ver 1.6    ver 1.2  ver 1.4  ver 1.6
WMC      0.028    0.05     0.03      −0.83    −2.11    −1.77     0.0013   0.0007   0.00        0.130    0.240    0.18
DIT     −0.067    0.10     0.05      −0.46    −1.83    −1.53     0.335    0.108    0.3257     −0.039    0.054    0.02
NOC      0.137    0.09     0.13      −0.66    −1.67    −1.50     0.0007   0.00     0.00        0.136    0.13     0.16
CBO      0.011    0.01     0.02      −0.71    −1.80    −1.66     0.017    0.00     0.00        0.096    0.15     0.17
RFC      0.012    0.02     0.01      −0.86    −2.15    −1.79     0.0014   0.00     0.00        0.130    0.23     0.17
LCOM     0.007    0.007    0.007     −0.64    −1.67    −1.48     0.0349   0.0004   0.0007      0.085    0.11     0.11

Table 10: Multivariate logistic regression analysis.

Coefficient   AIF version 1.2   AIF version 1.4   AIF version 1.6
WMC            0.0195            0.0574            0.0320
DIT           −0.041             0.000             0.000
NOC            0.1231            0.000             0.000
CBO            0.005             0.008             0.001
RFC            0.0071            0.0081            0.0109
LCOM           0                −0.001             0
Constant      −0.917            −2.785            −2.157

Table 11: Before applying regression.

              Not-faulty   Faulty
Not-faulty    777          0
Faulty        188          0

Table 12: After applying regression.

              Not-faulty   Faulty
Not-faulty    767          10
Faulty        172          16

PNN, RBFN, and FLANN, not only classify the class as faulty or not faulty but also highlight the number of bugs found in the class, and these bugs are fixed in the testing phase of the software development life cycle.

In this paper, six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the gradient descent method and the Levenberg Marquardt method.

(a) Gradient Descent Method. The gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics of AIF version 1.6. Figure 12 shows the graph plot for the variation of mean square error values with respect to the number of epochs (iterations) for AIF version 1.6.

(b) Levenberg Marquardt Method. The Levenberg Marquardt method [21, 22] is a technique for updating weights. In the case of the gradient descent method, the learning rate α is constant, but in the Levenberg Marquardt method, the learning rate α varies in every iteration. So this method requires fewer iterations

Figure 12: MSE versus number of epochs with respect to gradient descent NN.

Figure 13: MSE versus number of epochs with respect to Levenberg-Marquardt NN.

to train the network. Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg Marquardt method.

Figure 13 shows the graph plot for the variation of mean square error values with respect to the number of epochs for AIF version 1.6.

Table 13: Precision, correctness, completeness, and accuracy for three versions of AIF.

        Precision (%)               Correctness (%)             Completeness (%)           Accuracy (%)
        ver 1.2  ver 1.4  ver 1.6   ver 1.2  ver 1.4  ver 1.6   ver 1.2  ver 1.4  ver 1.6  ver 1.2  ver 1.4  ver 1.6
WMC     61.11    41.17    57.14     61.11    41.17    57.14     5.09     4.82     4.25     66.13    84.02    81.71
DIT     —        —        —         —        —        —         0        0        0        64.47    83.37    80.51
NOC     75       75       66.66     75       75       66.66     5.55     2.06     5.31     65.78    83.6     81.03
CBO     60       57.14    77.77     60       57.14    77.77     2.77     2.75     3.72     64.8     83.48    81.03
RFC     66.66    36.36    50        66.66    36.36    50        4.62     2.75     2.12     65.29    83.02    80.51
LCOM    66.66    50       60        0.66     0.5      0.6       2.77     6.8      1.59     64.96    83.37    80.62
MULTI   68.75    50       61.53     68.75    50       61.53     10.18    7.58     8.51     66.44    83.37    81.13

Table 14: Accuracy prediction using gradient descent NN.

MAE      MARE     RMSE     R         P value   Std error   Accuracy (%)
0.0594   1.093    0.0617   −0.2038   0.0044    0.0048      94.0437

Table 15: Accuracy prediction using Levenberg Marquardt.

MAE      MARE     RMSE     R         P value   Std error   Accuracy (%)
0.0023   1.1203   0.0308   −0.2189   0.0022    0.0041      90.4977

Table 16: Accuracy prediction using FLANN.

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0304   0.7097   0.0390   0.3308   2.4601e−06   0.0050      96.3769

Table 17: Accuracy prediction using basic RBFN.

MAE      MARE     RMSE     R        P value   Std error   Accuracy (%)
0.0279   0.3875   0.0573   0.1969   0.059     0.006       97.2792

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has less computational cost. In this paper, an adaptive algorithm has been used for updating the weights, as shown in (21). Figure 14 shows the variation of mean square values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.

6.5.3. Radial Basis Function Network. In this paper, the Gaussian radial function is used as the radial function. Gradient descent learning and hybrid learning methods are used for updating the centers and weights, respectively.

A three-layered RBFN has been considered, in which the six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplification, (25) is represented as:

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 (y' - y) W_i \frac{\phi_i}{\sigma^2} (x_j - C_{ij}(k)) \quad (39)$$

and the modified (26) is formulated as:

$$W_i(k+1) = W_i(k) + \eta_2 (y' - y) \phi_i \quad (40)$$

where σ is the width of the center and k is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE with respect to the number of epochs.

(b) Hybrid Learning Method. In the hybrid learning method, centers are updated using (27), while weights are updated using a supervised learning method. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance metrics for AIF version 1.6. Figure 16 shows the graph for the variation of MSE versus the number of epochs.

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: the input, hidden, summation, and output layers.

In PNN, 50% of the faulty and nonfaulty classes are taken as input for the hidden layer, and the Gaussian function (28) is used as the hidden node function. The summation layer sums the

Table 18: Accuracy prediction using RBFN gradient.

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0207   0.2316   0.0323   0.3041   1.6302e−05   0.0041      97.2475

Table 19: Accuracy prediction using hybrid RBFN.

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0614   0.1032   0.0316   0.9184   3.1834e−79   0.0013      98.4783

Figure 14: Graph plot for MSE versus number of iterations (epochs) with respect to FLANN.

Figure 15: MSE versus number of epochs with respect to gradient RBFN.

contribution of each class of input patterns and produces a net output, which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.

6.6. Comparison. Table 20 shows the tabulated results for the obtained performance parameter values, number of epochs, and accuracy rate achieved by applying the neural network techniques. This performance table is an indication of the better fault prediction model. In this comparative analysis, the mean square error (MSE) was taken as the criterion for computing the performance parameters (such as MARE, MSE, number of epochs, and accuracy rate) when the four neural network techniques were applied. During this process, an MSE value of 0.002 was set as the threshold for evaluation. Based on the number of iterations and the accuracy rate obtained by the respective NN techniques, the best prediction model was determined.

Figure 16: MSE versus number of epochs with respect to hybrid RBFN.

Figure 17: Accuracy rate (%) versus smoothing parameter.

From Table 20, it is evident that the gradient descent NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, which is an improvised model of ANN, obtained a 90.4% accuracy rate. This accuracy rate is less than that of the gradient NN, but this approach (the LM method) took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely, the basic RBFN, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained a better prediction rate of 98.47% in only 14 epochs when compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved. FLANN reached this accuracy rate in 66 epochs, as it has no hidden layer involved in its architecture.

Table 20: Performance metrics.

AI technique       Epoch   MAE      MARE     RMSE     Std error   Accuracy (%)
Gradient descent   162     0.0594   1.0930   0.0617   0.0048      94.04
LM                 13      0.0023   1.1203   0.0308   0.0041      90.49
RBFN basic         —       0.0279   0.3875   0.0573   0.006       97.27
RBFN gradient      41      0.0207   0.2316   0.0323   0.0041      97.24
RBFN hybrid        14      0.0614   0.1032   0.0316   0.0013      98.47
FLANN              66      0.0304   0.7097   0.0390   0.0050      96.37

The performance of PNN is shown in Figure 17. The highest accuracy in prediction was obtained for a smoothing parameter value of 1.7, at which PNN obtained a classification rate of 86.41%.

RBFN using the hybrid learning model gives the least values for MAE, MARE, and RMSE and a high accuracy rate. Hence, from the results obtained by using ANN techniques, it can be concluded that the RBFN hybrid approach obtained the best fault prediction rate in the smallest number of epochs when compared with the other three ANN techniques.

7. Conclusion

System analysts' use of prediction models to classify fault-prone classes as faulty or not faulty is the need of the day for researchers as well as practitioners, so more reliable approaches for prediction need to be modeled. In this paper, two approaches, namely, statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data is necessary, with the help of a better prediction model.

This paper proposes a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Also, machine learning techniques such as artificial neural network (gradient descent and Levenberg Marquardt methods), functional link artificial neural network, radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and probabilistic neural network techniques were applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, out of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in a smaller number of epochs (14 iterations) when compared with the other three neural network techniques.

In the future, this work should be replicated on other open-source projects, such as Mozilla, using different AI techniques, to analyze which model performs better in achieving higher accuracy for fault prediction. Also, fault prediction accuracy should be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52–60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[6] F. B. e Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1–8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics—an analysis of dependencies," in Proceedings of the Workshop Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241–262, 1995.

[10] W. Melo and F. B. e Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90–99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the IEEE 19th International Conference on Software Engineering, Association for Computing Machinery, pp. 412–421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35–40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242–249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902–909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13–26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45–50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.

[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "An application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSET '00), pp. 85–91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395–417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1–6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1–7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402–419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39–62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168–175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1–4, 2013.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

ISRN Software Engineering 5

Adaptivealgorithm

Error

+1

X1

X2

x0

x1

w0

w1

w2

sum

sum

S120588

minus

+

y

y

Cos(120587x1)

Sin(120587x1)

Cos(120587x2)

x2

Sin(120587x2)Fu

nctio

nal

Expa

nsio

n

w

y

x1 middot middot middot x2

Figure 2 Flat net structure of FLANN

architecture of FLANN FLANN architecture offers lesscomputational overhead and higher convergence speed whencompared with other ANN techniques

Using FLANN output is calculated as follows

119910 =

119899

sum

119894=1

119882119894119883119894 (19)

where 119910 is the predicted value119882 is the weight vector and119883is the functional block and is defined as follows

119883

= [1 1199091 sin (120587119909

1) cos (120587119909

1) 1199092 sin (120587119909

2) cos (120587119909

2) ]

(20)

and weight is updated as follows

119882119894(119896 + 1) = 119882

119894(119896) + 120572119890

119894(119896) 119909119894(119896) (21)

having 120572 as the learning rate and 119890119894as the error value ldquo119890

119894rdquo is

formulated as follows

119890119894= 119910119894minus 119910119894 (22)

here 119910 and 119910 represent actual and the obtained (predicted)values respectively

423 Radial Basis Function Network (RBFN) RBFN is afeed-forward neural network (FFNN) trained using super-vised training algorithm RBFN is generally configured by asingle hidden layer where the activation function is chosenfrom a class of functions called basis functions

RBFN is one of the ANN techniques which containsthree layers namely input hidden and output layer Figure 3shows the structure of a typical RBFN in its basic forminvolving three entirely different layers RBFN contains ℎnumber of hidden centers represented as 119862

1 1198622 1198623 119862

Output layer

Input layer

Hidden layer ofradial basis function

x1

x2

x3

x4

xp

C1

C2

Ch

w1

w2

wn

y998400

1206011

1206012

120601h

Figure 3 RBFN network

The target output is computed as follows

1199101015840=

119899

sum

119894=1

120601119894119882119894 (23)

where 119882119894is the weight of the 119894th center 120601 is the radial

function and1199101015840 is the target output Table 3 shows the variousradial functions available in the literature

In this paper Gaussian function is used as a radialfunction and 119911 the distance vector is calculated as follows

119911 =10038171003817100381710038171003817119909119895minus 119888119895

10038171003817100381710038171003817 (24)

where 119909119895is input vector that lies in the receptive field for

center 119888119895 In this paper gradient descent learning and hybrid

learning techniques are used for updating weight and centerrespectively

6 ISRN Software Engineering

Table 3 Radial function

Radial function Mathematical expressionGaussian radial function 120601(119911) = 119890

minus(119911221205902)

Thin plate spline 120601(119911) = 1199112 log 119911

Quadratic 120601(119911) = (1199112+ 1199032)12

Inverse quadratic 120601(119911) =1

(1199112 + 1199032)12

The advantage of using RBFN lies in its training ratewhich is faster when compared with propagation networksand is less susceptible to problem with nonstationary inputs

(a) Gradient Descent Learning Technique Gradient descentlearning is a technique used for updating the weight119882 andcenter 119862 The center 119862 in gradient learning is updated as

119862119894119895(119896 + 1) = 119862

119894119895(119896) minus 120578

1

120597119889

120597119889119862119894119895

(119864119896) (25)

and weight119882 is updated as

119882119894(119896 + 1) = 119882

119894(119896) minus 120578

2

120597119889

120597119889119882119894

(119864119896) (26)

where 1205781and 120578

2are the learning coefficients for updating

center and weight respectively

(b) Hybrid Learning Technique In hybrid learning techniqueradial function relocates their center in self-organized man-ner while the weights are updated using learning algorithmIn this paper least mean square (LMS) algorithm is used forupdating the weights while the center is updated only whenit satisfies the following conditions

(a) Euclidean distance between the input pattern and thenearest center is greater than the threshold value and

(b) MSE is greater than the desired accuracy

After satisfying the above conditions the Euclidean distanceis used to find the centers close to 119909 and then the centers areupdated as follows

119862119894(119896 + 1) = 119862

119894(119896) + 120572 (119909 minus 119862

119894(119896)) (27)

After every updation the center moves closer to 119909

424 Probabilistic Neural Network (PNN) PNN was intro-duced by Specht [24] It is a feed-forward neural networkwhich has been basically derived from Bayesian network andstatistical algorithm

In PNN the network is organized as multilayered feed-forward network with four layers such as input hiddensummation and output layer Figure 4 shows the basicarchitecture of PNN

The input layer first computes the distance from inputvector to the training input vectorsThe second layer consistsof a Gaussian function which is formed using the given setof data points as centers The summation layers sum up the

Table 4 Confusionmatrix to classify a class as faulty and not-faulty

No (prediction) Yes (prediction)No (actual) True negative (TN) False positive (FP)Yes (actual) False negative (FN) True positive (TP)

contribution of each class of input and produce a net outputwhich is vector of probabilities The fourth layer determinesthe fault prediction rate

PNN technique is faster when compared to multilayerperceptron networks and also is more accurate The majorconcern lies in finding an accurate smoothing parameter ldquo120590rdquoto obtain better classification The following function is usedin hidden layer

120601 (119911) = 119890minus(11991121205902) (28)

where 119911 = 119909 minus 119888

119909 is the input

119888 is the center and

119911 is the Euclidean distance between the center and theinput vector

5 Performance Evaluation Parameters

The following subsections give the basic definitions of theperformance parameters used in statistical and machinelearning methods for fault prediction

51 Statistical Analysis The performance parameters forstatistical analysis can be determined based on the confusionmatrix [25] as shown in Table 4

511 Precision It is defined as the degree to which therepeated measurements under unchanged conditions showthe same results

Precision = TPFP + TP

(29)

512 Correctness Correctness as defined by Briand et al [13]is the ratio of the number of modules correctly classified asfault prone to the total number of modules classified as faultprone

Correctness = TPFP + TP

(30)

513 Completeness According toBriand et al [13] complete-ness is the ratio of number of faults in classes classified as faultprone to the total number of faults in the system

Completeness = TPFN + TP

(31)

ISRN Software Engineering 7

Input layer

Pattern layer(Training set)

Summationlayer

Output class

Out putlayer

X1

X3

X11

X13

X12

X21

X22

X23

X31

X32

X33

y11

y12

y13

y21

y22

y23

y31

y32

y33

1

2

3

X2

Max (g1 g2 g3)

Xg2(X)

g1(X)

g3(X)

Figure 4 Basic structure of PNN

514 Accuracy Accuracy as defined by Yaun et al [26] isthe proportion of predicted fault prone modules that areinspected out of all modules

Accuracy = TN + TPTN + FP + FN + TP

(32)

515 1198772 Statistic 1198772 also known as coefficient of multipledetermination is a measure of power of correlation betweenpredicted and actual number of faults [25] The higher thevalue of this statistic themore is the accuracy of the predictedmodel

1198772= 1 minus

sum119899

119894=1(119910119894minus 119910119894)2

sum119899

119894=1(119910119894minus 119910)2 (33)

where 119910119894is the actual number of faults 119910

119894is the predicted

number of faults and 119910 is the average number of faults

52 Machine Learning Fault prediction accuracy for fourof the applied ANN is determined by using performanceevaluation parameters such as mean absolute error (MAE)mean absolute relative error (MARE) rootmean square error(RMSE) and standard error of the mean (SEM)

521 Mean Absolute Error (MAE) This performance param-eter determines how close the values of predicted and actualfault (accuracy) rate differ

MAE = 1

119899

119899

sum

119894=1

10038161003816100381610038161003816119910119894minus 1199101015840

119894

10038161003816100381610038161003816 (34)

522 Mean Absolute Relative Error (MARE) Consider

MARE = 1

119899

119899

sum

119894=1

10038161003816100381610038161003816119910119894minus 1199101015840

119894

10038161003816100381610038161003816

119910119894

(35)

In (35) a numerical value of 005 is added in thedenominator in order to avoid numerical overflow (divisionby zero) The modified MARE is formulated as

MARE = 1

119899

119899

sum

119894=1

10038161003816100381610038161003816119910119894minus 1199101015840

119894

10038161003816100381610038161003816

119910119894+ 005

(36)

523 Root Mean Square Error (RMSE) This performanceparameter determines the differences in the values of pre-dicted and actual fault (accuracy) rate

RMSE = radic 1

119899

119899

sum

119894=1

(119910119894minus 1199101015840

119894)2

(37)

In (35) (36) and (37) 119910119894is actual value and 1199101015840

119894is expected

value

524 Standard Error of the Mean (SEM) It is the deviationof predicted value from the actual fault (accuracy) rate

SEM =SDradic119899

(38)

where SD is sample standard deviation and ldquo119899rdquo is the numberof samples

8 ISRN Software Engineering

Table 5 Distribution of bugs for AIF version 16

Number of classes Percentageof bugs

Number ofassociated bugs

777 805181 0101 104663 132 33161 216 16580 314 14508 46 06218 52 02073 63 03109 75 05181 81 01036 91 01036 103 03109 111 01036 131 01036 171 01036 181 01036 28965 10000 142

Value

o

f cla

ss co

ntai

ning

sam

e val

ue 14

12

10

8

6

4

2

0minus20 0 20 40 60 80 100 120 140 160 180

Figure 5 WMC of AIF version 16

6 Results and Analysis

In this section the relationship between value of metrics andthe fault found in a class is determined In this approachthe comparative study involves using six CK metrics as inputnodes and the output is the achieved fault prediction rateFault prediction is performed for AIF version 16

61 Fault Data To perform statistical analysis bugs werecollected fromPromise data repository [18] Table 5 shows thedistribution of bugs based on the number of occurrences (interms of percentage of class containing number of bugs) forAIF version 16

AIF version 16 contains 965 numbers of classes inwhich 777 classes contain zero bugs (805181) 104663 ofclasses contain at least one bug 33161 of classes containa minimum of two bugs 16580 of classes contain threebugs 14508 of classes contain four bugs 06218 of classescontain five bugs 02073 of the classes contain six bugs

Value

o

f cla

ss co

ntai

ning

sam

e val

ue 50

40

45

30

35

20

25

10

5

15

00 1 2 3 4 5 6

Figure 6 DIT of AIF version 16

Value

o

f cla

ss co

ntai

ning

sam

e val

ue

90

80

70

60

50

40

30

20

10

0minus5 0 5 10 15 20 25 30 35 40

Figure 7 NOC of AIF version 16

03109 of classes contain seven and eleven bugs 05181 ofclasses contain eight bugs and 01036 of the class containnine thirteen seventeen eighteen and twenty-eight bugs

62 Metrics Data CK metric values for WMC DIT NOCCBO RFC and LCOM respectively for AIF version 16 aregraphically represented in Figures 5 6 7 8 9 and 10

63 Descriptive Statistics and Correlation Analysis This sub-section gives the comparative analysis of the fault datadescriptive statistics of classes and the correlation among thesix metrics with that of Basili et al [1] Basili et al studiedobject-oriented systems written in C++ language They car-ried out an experiment in which they set up eight projectgroups each consisting of three students Each group hadthe same task of developing smallmedium-sized softwaresystem Since all the necessary documentation (for instancereports about faults and their fixes) were available they couldsearch for relationships between fault density and metricsThey used the same CK metric suite Logistic regression wasemployed to analyze the relationship betweenmetrics and thefault proneness of classes

The obtained CK metric values of AIF version 16 arecompared with the results of Basili et al [1] In comparisonwith Basili the total number of classes considered is muchgreater that is 965 classes were considered (Vs 180) Table 6shows the comparative statistical analysis results obtained for

ISRN Software Engineering 9

Table 6 Descriptive statistics of classes

WMC DIT NOC CBO RFC LCOMBasili et al [1]

Max 9900 900 10500 1300 3000 42600Min 100 000 000 000 000 000Median 950 000 1950 000 500 000Mean 1340 132 3391 023 680 970Std Dev 1490 199 3337 154 756 6377

AIF version 16Max 16600 600 3900 44800 32200 13617Min 000 000 000 000 000 000Median 500 100 000 700 1400 400Mean 857 195 0052 1110 2142 7933Std Dev 1120 127 263 2252 2500 52375

Value

o

f cla

ss co

ntai

ning

sam

e val

ue

10

9

8

7

6

5

4

3

2

1

00 50 100 150 200 250 300 350 400 450minus50

Figure 8 CBO of AIF version 16

Value

o

f cla

ss co

ntai

ning

sam

e val

ue 7

6

5

4

3

2

1

0minus50 0 50 100 150 200 250 300 350

Figure 9 RFC of AIF version 16

Basili et al andAIF version 16 forCKmetrics indicatingMaxMin Median and Standard deviation

The dependency between CK metrics is computed usingPearsonrsquos correlations (1198772 coefficient of determination) andcompared with Basili et al [1] for AIF version 16 Thecoefficient of determination 1198772 is useful because it gives theproportion of the variance (fluctuation) of one variable that ispredictable from the other variable It is ameasure that allowsa researcher to determine how certain one can be in makingpredictions from a certain modelgraph Table 7 shows thePearsonrsquos correlations for the data set used by Basili et al [1]and the correlation metrics of AIF version 16

Value

o

f cla

ss co

ntai

ning

sam

e val

ue

30

25

20

15

10

5

0minus200 0 2000 4000 6000 8000 10000 12000 14000

Figure 10 LCOM of AIF version 16

From Table 7 wrt AIF version 16 it is observed thatcorrelation between WMC and RFC is 077 which is highlycorrelated that is these two metrics are very much linearlydependent on each other Similarly correlation betweenWMC and DIT is 0 which indicates that they are looselycorrelated that is there is no dependency between these twometrics

64 Fault Prediction Using Statistical Methods

641 Linear Regression Analysis Table 8 shows resultsobtained for linear regression analysis in which the fault isconsidered as the dependent variable and the CK metrics arethe independent variables

ldquo119877rdquo represents the coefficient of correlation ldquo119875rdquo refers tothe significance of the metric value If 119875 lt 0001 then themetrics are of very great significance in fault prediction

642 Logistic Regression Analysis The logistic regressionmethod helps to indicate whether a class is faulty or notbut does not convey anything about the possible numberof faults in the class Univariate and multivariate logisticregression techniques are applied to predict whether the

10 ISRN Software Engineering

Table 7 Correlations between metrics

WMC DIT NOC CBO RFC LCOMBasili et al [1]

WMC 100 002 024 000 013 038DIT 100 000 000 000 001NOC 100 000 000 000CBO 100 031 001RFC 100 009LCOM 100

AIF version 16WMC 100 000 003 010 077 060DIT 100 000 000 000 001NOC 100 0024 0025 0027CBO 100 008 005RFC 100 042LCOM 100

Table 8 Linear regression analysis

Version 119877 119875 value Std error12 05360 0000 0111414 05024 0000 0145016 05154 0000 00834

1

08

06

04

02

0

minus4 minus3 minus2 minus1 0 1 2 3 4

1(1 + exp(minusq))

Figure 11 Logistic graph

class is faulty or not Univariate regression analysis is usedto examine the effect of each metric on fault of the classwhile multivariate regression analysis is used to examine thecommon effectiveness of metrics on fault of the class Theresults of three versions of AIF are compared consideringthese two statistical techniques Figure 11 shows the typicalldquo119878rdquo curve obtained (similar to Sigmoid function) for the AIFversion 16 using multivariate logistic regression Tables 9and 10 contain the tabulated values for the results obtainedby applying univariate and multivariate regression analysisrespectively

From Table 9 it can be observed that all metrics of CKsuite are highly significant except for DIT The 119875 value forthe three versions (wrt DIT) is 0335 0108 and 03527respectively Higher values of ldquo119875rdquo are an indication of lesssignificance

Univariate and multivariate logistic regression statisticalmethods were used for classifying a class as faulty or notfaulty Logistic regression was applied with a threshold value05 that is120587 gt 05 indicates that a class is classified as ldquofaultyrdquootherwise it is categorized as ldquonot faultyrdquo class

Tables 11 and 12 represent the confusion matrix fornumber of classes with faults before and after applyingregression analysis respectively for AIF version 16 FromTable 11 it is clear that before applying the logistic regressiona total number of 777 classes contained zero bugs and 188classes contained at least one bug After applying logisticregression (Table 12) a total of 767 + 16 classes are classifiedcorrectly with accuracy of 8113

The performance parameters of all three versions of theAIF are shown in Table 13 obtained by applying univariateand multivariate logistic regression analysis Here precisioncorrectness completeness and accuracy [1 13 27 28] aretaken as a performance parameters By using multivariatelogistic regression accuracy of AIF version 12 is found to be6444 accuracy of AIF version 14 is 8337 and that of AIFversion 16 is 8113

From the results obtained by applying linear and logisticregression analysis it is found that out of the six metricsWMC appears to have more impact in predicting faults

65 Fault Prediction Using Neural Network

651 Artificial Neural Network ANN is an interconnectedgroup of nodes In this paper three layers of ANN areconsidered in which six nodes act as input nodes nine nodesrepresent the hidden nodes and one node acts as outputnode

ANN is a three-phase network the phases are used forlearning validation and testing purposes So in this article70 of total input pattern is considered for learning phase15 for validation and the rest 15 for testingThe regressionanalysis carried out classifies whether a class is faulty or notfaulty The prediction models of ANN and its forms such as


Table 9: Analysis of univariate regression.

        Coefficient               Constant                  P value                    R value
        ver 1.2  ver 1.4  ver 1.6  ver 1.2  ver 1.4  ver 1.6  ver 1.2  ver 1.4  ver 1.6   ver 1.2  ver 1.4  ver 1.6
WMC     0.028    0.05     0.03    -0.83    -2.11    -1.77    0.0013   0.0007   0.00      0.130    0.240    0.18
DIT     -0.067   0.10     0.05    -0.46    -1.83    -1.53    0.335    0.108    0.3257    -0.039   0.054    0.02
NOC     0.137    0.09     0.13    -0.66    -1.67    -1.50    0.0007   0.00     0.00      0.136    0.13     0.16
CBO     0.011    0.01     0.02    -0.71    -1.80    -1.66    0.017    0.00     0.00      0.096    0.15     0.17
RFC     0.012    0.02     0.01    -0.86    -2.15    -1.79    0.0014   0.00     0.00      0.130    0.23     0.17
LCOM    0.007    0.007    0.007   -0.64    -1.67    -1.48    0.0349   0.0004   0.0007    0.085    0.11     0.11

Table 10: Multivariate logistic regression analysis.

Coefficient   AIF version 1.2   AIF version 1.4   AIF version 1.6
WMC            0.0195            0.0574            0.0320
DIT           -0.041             0.000             0.000
NOC            0.1231            0.000             0.000
CBO            0.005             0.008             0.001
RFC            0.0071            0.0081            0.0109
LCOM           0                 -0.001            0
Constant      -0.917            -2.785            -2.157

Table 11: Before applying regression.

             Not-faulty   Faulty
Not-faulty   777          0
Faulty       188          0

Table 12: After applying regression.

             Not-faulty   Faulty
Not-faulty   767          10
Faulty       172          16

The prediction models of ANN and its forms, such as PNN, RBFN, and FLANN, not only classify a class as faulty or not faulty but also highlight the number of bugs found in the class, and these bugs are fixed in the testing phase of the software development life cycle.

In this paper, six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the gradient descent method and the Levenberg-Marquardt method.

(a) Gradient Descent Method. The gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics of AIF version 1.6. Figure 12 shows the plot of the variation of mean square error values w.r.t. the number of epochs (or iterations) for AIF version 1.6.
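As an illustration of the training scheme, the following minimal Python sketch trains the 6-9-1 sigmoid network by plain batch gradient descent, with ordinary backpropagation standing in for the update rules of (15) and (16). The layer sizes, the 70/15/15 split, and the 0.002 MSE stopping threshold (see Section 6.6) follow the text; the learning rate, initialization, and stand-in data are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in data: rows = classes, columns = the six CK metrics; the target
    # is a normalized fault value per class.
    X = rng.random((965, 6))
    y = rng.random((965, 1))

    # 70% learning, 15% validation, 15% testing, as in the text.
    n_train = int(0.70 * len(X))
    X_tr, y_tr = X[:n_train], y[:n_train]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Six input nodes, nine hidden nodes, one output node.
    W1 = 0.1 * rng.standard_normal((6, 9))
    W2 = 0.1 * rng.standard_normal((9, 1))
    alpha = 0.5  # constant learning rate (assumed value)

    for epoch in range(200):
        h = sigmoid(X_tr @ W1)                   # hidden layer activations
        out = sigmoid(h @ W2)                    # network output
        err = out - y_tr
        if float(np.mean(err ** 2)) <= 0.002:    # MSE stopping threshold
            break
        # Backpropagated gradients for the two weight layers.
        d_out = err * out * (1 - out)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= alpha * (h.T @ d_out) / n_train
        W1 -= alpha * (X_tr.T @ d_hid) / n_train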

(b) Levenberg Marquardt Method. The Levenberg-Marquardt method [21, 22] is a technique for updating weights. In the gradient descent method the learning rate α is constant, but in the Levenberg-Marquardt method the learning rate α varies in every iteration, so this method consumes a smaller number of iterations to train the network.

Figure 12: MSE versus number of epochs w.r.t. gradient descent NN.

Figure 13: MSE versus number of epochs w.r.t. Levenberg-Marquardt NN.

Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg-Marquardt method.

Figure 13 shows the plot of the variation of mean square error values w.r.t. the number of epochs for AIF version 1.6.
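The update equations the method uses are given earlier in the paper; for reference, the standard Levenberg-Marquardt step from [21, 22] adjusts the weight vector w by

    \Delta w = -(J^{T} J + \mu I)^{-1} J^{T} e,

where J is the Jacobian of the network errors with respect to the weights, e is the error vector, and μ is a damping factor that is raised or lowered after each iteration; this adaptive damping is what makes the effective learning rate α vary from iteration to iteration.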


Table 13: Precision, correctness, completeness, and accuracy for three versions of AIF.

        Precision (%)            Correctness (%)          Completeness (%)         Accuracy (%)
        ver 1.2  ver 1.4  ver 1.6  ver 1.2  ver 1.4  ver 1.6  ver 1.2  ver 1.4  ver 1.6  ver 1.2  ver 1.4  ver 1.6
WMC     61.11    41.17    57.14   61.11    41.17    57.14    5.09     4.82     4.25     66.13    84.02    81.71
DIT     —        —        —       —        —        —        0        0        0        64.47    83.37    80.51
NOC     75       75       66.66   75       75       66.66    5.55     2.06     5.31     65.78    83.6     81.03
CBO     60       57.14    77.77   60       57.14    77.77    2.77     2.75     3.72     64.8     83.48    81.03
RFC     66.66    36.36    50      66.66    36.36    50       4.62     2.75     2.12     65.29    83.02    80.51
LCOM    66.66    50       60      0.66     0.5      0.6      2.77     6.8      1.59     64.96    83.37    80.62
MULTI   68.75    50       61.53   68.75    50       61.53    10.18    7.58     8.51     66.44    83.37    81.13

Table 14: Accuracy prediction using gradient descent NN.

MAE      MARE     RMSE     R         P value   Std. error   Accuracy (%)
0.0594   1.093    0.0617   -0.2038   0.0044    0.0048       94.0437

Table 15: Accuracy prediction using Levenberg Marquardt.

MAE      MARE     RMSE     R         P value   Std. error   Accuracy (%)
0.0023   1.1203   0.0308   -0.2189   0.0022    0.0041       90.4977

Table 16: Accuracy prediction using FLANN.

MAE      MARE     RMSE     R        P value      Std. error   Accuracy (%)
0.0304   0.7097   0.0390   0.3308   2.4601e-06   0.0050       96.3769

Table 17: Accuracy prediction using basic RBFN.

MAE      MARE     RMSE     R        P value   Std. error   Accuracy (%)
0.0279   0.3875   0.0573   0.1969   0.059     0.006        97.2792

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has a lower computational cost. In this paper, an adaptive algorithm has been used for updating the weights, as shown in (21). Figure 14 shows the variation of mean square values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.
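As a sketch of the idea (not of the paper's exact update rule in (21)), the following expands each input through the trigonometric functional expansion of Pao [23] and trains the single weight layer with an LMS-style rule; the expansion terms, the learning rate, and the stand-in data are assumptions made for illustration.

    import numpy as np

    def expand(x):
        # Trigonometric functional expansion (after Pao [23]): each metric
        # x_i contributes the terms x_i, sin(pi * x_i), and cos(pi * x_i).
        return np.concatenate([x, np.sin(np.pi * x), np.cos(np.pi * x)])

    rng = np.random.default_rng(0)
    X = rng.random((965, 6))   # six normalized CK metrics per class (stand-in)
    y = rng.random(965)        # normalized fault data (stand-in)

    w = np.zeros(18)           # one weight per expanded term; no hidden layer
    mu = 0.1                   # LMS learning rate (assumed value)

    for epoch in range(66):    # the text reports convergence in 66 epochs
        for xi, ti in zip(X, y):
            z = expand(xi)
            e = ti - w @ z     # output error at the single linear output node
            w += mu * e * z    # adaptive LMS weight update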

6.5.3. Radial Basis Function Network. In this paper, the Gaussian radial function is used as the radial function. Gradient descent learning and hybrid learning methods are used for updating the centers and weights, respectively.

A three-layered RBFN has been considered, in which six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplifying (25), the equation is represented as

    C_{ij}(k + 1) = C_{ij}(k) - \eta_1 (y' - y) W_i \frac{\phi_i}{\sigma^2} (x_j - C_{ij}(k)),    (39)

and the modified form of (26) is formulated as

    W_i(k + 1) = W_i(k) + \eta_2 (y' - y) \phi_i,    (40)

where σ is the width of the center and k is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE w.r.t. the number of epochs.
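For concreteness, a minimal Python sketch of these two update rules for the 6-9-1 RBFN described above follows; the width σ, the learning rates η1 and η2, the epoch count, and the stand-in data are assumed values. The signs follow standard gradient descent on the squared output error, reading y' as the predicted and y as the desired output; note that (40) as printed carries a plus, which corresponds to defining the error with the opposite sign.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((965, 6))   # six normalized CK metrics per class (stand-in)
    t = rng.random(965)        # desired fault prediction rate (stand-in)

    C = rng.random((9, 6))     # nine hidden centers, as in the architecture above
    W = np.zeros(9)            # output weights
    sigma, eta1, eta2 = 0.5, 0.01, 0.05   # width and learning rates (assumed)

    def phi(x):
        # Gaussian radial function: phi_i = exp(-||x - C_i||^2 / (2 sigma^2)).
        return np.exp(-np.sum((C - x) ** 2, axis=1) / (2 * sigma ** 2))

    for epoch in range(41):
        for x, desired in zip(X, t):
            p = phi(x)
            err = (W @ p) - desired    # (y' - y) in (39) and (40)
            # Weight update, cf. (40): W_i changes by a step along eta2*(y'-y)*phi_i.
            W -= eta2 * err * p
            # Center update, cf. (39): each C_i moves by
            # -eta1 * (y' - y) * W_i * (phi_i / sigma^2) * (x - C_i).
            C -= eta1 * err * (W * p / sigma ** 2)[:, None] * (x - C)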

(b) Hybrid Learning Method. In the hybrid learning method, centers are updated using (27), while weights are updated using supervised learning methods. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance metrics for AIF version 1.6. Figure 16 shows the graph of the variation of MSE versus the number of epochs.

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: input, hidden, summation, and output.

In PNN, 50% of the faulty and nonfaulty classes are taken as input for the hidden layer. The Gaussian function (28) is used as the hidden node function.


Table 18: Accuracy prediction using RBFN gradient.

MAE      MARE     RMSE     R        P value      Std. error   Accuracy (%)
0.0207   0.2316   0.0323   0.3041   1.6302e-05   0.0041       97.2475

Table 19: Accuracy prediction using hybrid RBFN.

MAE      MARE     RMSE     R        P value      Std. error   Accuracy (%)
0.0614   0.1032   0.0316   0.9184   3.1834e-79   0.0013       98.4783

Figure 14: MSE versus number of iterations (epochs) w.r.t. FLANN.

Figure 15: MSE versus number of epochs w.r.t. gradient RBFN.

The summation layer sums the contribution of each class of input patterns and produces a net output, which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.
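As an illustration of this classification step, a minimal Python sketch is given below; the stand-in patterns are illustrative (50% of the 188 faulty and 777 nonfaulty classes, as described above), and the smoothing parameter value of 1.7 follows the best value reported in Section 6.6.

    import numpy as np

    def pnn_classify(x, faulty, not_faulty, sigma=1.7):
        # Classify x with a probabilistic neural network (Specht [24]): each
        # training pattern is a Gaussian kernel (28); the summation layer
        # averages the kernels per class, and the larger sum wins.
        def class_sum(patterns):
            d2 = np.sum((patterns - x) ** 2, axis=1)
            return np.mean(np.exp(-d2 / (2 * sigma ** 2)))
        return "faulty" if class_sum(faulty) > class_sum(not_faulty) else "not faulty"

    rng = np.random.default_rng(0)
    faulty = rng.random((94, 6))       # 50% of the faulty classes (stand-in data)
    not_faulty = rng.random((388, 6))  # 50% of the nonfaulty classes (stand-in)
    print(pnn_classify(rng.random(6), faulty, not_faulty))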

6.6. Comparison. Table 20 shows the tabulated results for the obtained performance parameter values, the number of epochs, and the accuracy rate obtained by applying the neural network techniques. This performance table indicates which fault prediction model is better. In this comparative analysis, the mean square error (MSE) was taken as the criterion to compute the performance parameters (such as MARE, RMSE, number of epochs, and accuracy rate) when the four neural network techniques were applied. During this process, an MSE value of 0.002 was set as the threshold for evaluation. Based on the number of iterations and the accuracy rate obtained by the respective NN technique, the best prediction model was determined.

Figure 16: MSE versus number of epochs w.r.t. hybrid RBFN.

Figure 17: Accuracy rate versus smoothing parameter.

From Table 20 it is evident that the gradient descent NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, which is an improved model of ANN, obtained a 90.4% accuracy rate; this accuracy rate is lower than that of the gradient descent NN, but this approach (the LM method) took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely the basic, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained the better prediction rate of 98.47% in only 14 epochs when compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved; FLANN reached this accuracy rate in 66 epochs, as it has no hidden layer involved in its architecture.


Table 20: Performance metrics.

AI technique       Epoch   MAE      MARE     RMSE     Std. error   Accuracy (%)
Gradient descent   162     0.0594   1.0930   0.0617   0.0048       94.04
LM                 13      0.0023   1.1203   0.0308   0.0041       90.49
RBFN basic         —       0.0279   0.3875   0.0573   0.006        97.27
RBFN gradient      41      0.0207   0.2316   0.0323   0.0041       97.24
RBFN hybrid        14      0.0614   0.1032   0.0316   0.0013       98.47
FLANN              66      0.0304   0.7097   0.0390   0.0050       96.37

The performance of PNN is shown in Figure 17. The highest prediction accuracy was obtained for a smoothing parameter value of 1.7, at which PNN obtained a classification rate of 86.41%.

RBFN using the hybrid learning model gives the least values for MAE, MARE, and RMSE and a high accuracy rate. Hence, from the results obtained by using the ANN techniques, it can be concluded that the RBFN hybrid approach obtained the best fault prediction rate in fewer epochs when compared with the other three ANN techniques.

7. Conclusion

The use of prediction models by system analysts to classify fault-prone classes as faulty or not faulty is the need of the day for researchers as well as practitioners, so more reliable approaches for prediction need to be modeled. In this paper, two approaches, namely statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data is necessary with the help of a better prediction model.

This paper presents a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Also, machine learning techniques such as the artificial neural network (gradient descent and Levenberg-Marquardt methods), the functional link artificial neural network, the radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and the probabilistic neural network were applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, out of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in fewer epochs (14 iterations) when compared with the other three neural network techniques.

In future, this work should be replicated on other open-source projects, such as Mozilla, using different AI techniques to analyze which model performs better in achieving higher accuracy for fault prediction. Also, fault prediction accuracy should be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52–60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1–8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics—an analysis of dependencies," in Proceedings of the Workshop on Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241–262, 1995.

[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90–99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the IEEE 19th International Conference on Software Engineering, Association for Computing Machinery, pp. 412–421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35–40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242–249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902–909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13–26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45–50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.

[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "Application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSET '00), pp. 85–91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395–417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1–6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1–7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402–419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39–62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168–175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1–4, 2013.



642 Logistic Regression Analysis The logistic regressionmethod helps to indicate whether a class is faulty or notbut does not convey anything about the possible numberof faults in the class Univariate and multivariate logisticregression techniques are applied to predict whether the

10 ISRN Software Engineering

Table 7 Correlations between metrics

WMC DIT NOC CBO RFC LCOMBasili et al [1]

WMC 100 002 024 000 013 038DIT 100 000 000 000 001NOC 100 000 000 000CBO 100 031 001RFC 100 009LCOM 100

AIF version 16WMC 100 000 003 010 077 060DIT 100 000 000 000 001NOC 100 0024 0025 0027CBO 100 008 005RFC 100 042LCOM 100

Table 8 Linear regression analysis

Version 119877 119875 value Std error12 05360 0000 0111414 05024 0000 0145016 05154 0000 00834

1

08

06

04

02

0

minus4 minus3 minus2 minus1 0 1 2 3 4

1(1 + exp(minusq))

Figure 11 Logistic graph

class is faulty or not Univariate regression analysis is usedto examine the effect of each metric on fault of the classwhile multivariate regression analysis is used to examine thecommon effectiveness of metrics on fault of the class Theresults of three versions of AIF are compared consideringthese two statistical techniques Figure 11 shows the typicalldquo119878rdquo curve obtained (similar to Sigmoid function) for the AIFversion 16 using multivariate logistic regression Tables 9and 10 contain the tabulated values for the results obtainedby applying univariate and multivariate regression analysisrespectively

From Table 9 it can be observed that all metrics of CKsuite are highly significant except for DIT The 119875 value forthe three versions (wrt DIT) is 0335 0108 and 03527respectively Higher values of ldquo119875rdquo are an indication of lesssignificance

Univariate and multivariate logistic regression statisticalmethods were used for classifying a class as faulty or notfaulty Logistic regression was applied with a threshold value05 that is120587 gt 05 indicates that a class is classified as ldquofaultyrdquootherwise it is categorized as ldquonot faultyrdquo class

Tables 11 and 12 represent the confusion matrix fornumber of classes with faults before and after applyingregression analysis respectively for AIF version 16 FromTable 11 it is clear that before applying the logistic regressiona total number of 777 classes contained zero bugs and 188classes contained at least one bug After applying logisticregression (Table 12) a total of 767 + 16 classes are classifiedcorrectly with accuracy of 8113

The performance parameters of all three versions of theAIF are shown in Table 13 obtained by applying univariateand multivariate logistic regression analysis Here precisioncorrectness completeness and accuracy [1 13 27 28] aretaken as a performance parameters By using multivariatelogistic regression accuracy of AIF version 12 is found to be6444 accuracy of AIF version 14 is 8337 and that of AIFversion 16 is 8113

From the results obtained by applying linear and logisticregression analysis it is found that out of the six metricsWMC appears to have more impact in predicting faults

65 Fault Prediction Using Neural Network

651 Artificial Neural Network ANN is an interconnectedgroup of nodes In this paper three layers of ANN areconsidered in which six nodes act as input nodes nine nodesrepresent the hidden nodes and one node acts as outputnode

ANN is a three-phase network the phases are used forlearning validation and testing purposes So in this article70 of total input pattern is considered for learning phase15 for validation and the rest 15 for testingThe regressionanalysis carried out classifies whether a class is faulty or notfaulty The prediction models of ANN and its forms such as

ISRN Software Engineering 11

Table 9 Analysis of univariate regression

Coefficient Constant 119875 value 119877 valuever 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 0028 005 003 minus083 minus211 minus177 00013 00007 000 0130 0240 018DIT minus0067 010 005 minus046 minus183 minus153 0335 0108 03257 minus0039 0054 002NOC 0137 009 013 minus066 minus167 minus150 00007 000 000 0136 013 016CBO 0011 001 002 minus071 minus180 minus166 0017 000 000 0096 015 017RFC 0012 002 001 minus086 minus215 minus179 00014 000 000 0130 023 017LCOM 0007 0007 0007 minus064 minus167 minus148 00349 00004 00007 0085 011 011

Table 10 Multivariate logistic regression analysis

CoefficientAIF version 12 AIF version 14 AIF version 16

WMC 00195 00574 00320DIT minus0041 0000 0000NOC 01231 0000 0000CBO 0005 0008 0001RFC 00071 00081 00109LCOM 0 minus0001 0Constant minus0917 minus2785 minus2157

Table 11 Before applying regression

Not-faulty FaultyNot-Faulty 777 0Faulty 188 0

Table 12 After applying regression

Not-faulty FaultyNot-Faulty 767 10Faulty 172 16

PNN RBFN and FLANN not only classify the class as faultyor not faulty but also highlight the number of bugs foundin the class and these bugs are fixed in the testing phase ofsoftware development life cycle

In this paper six CKmetrics are taken as input and outputis the fault prediction accuracy rate required for developingthe software The network is trained using Gradient descentmethod and Levenberg Marquardt method

(a) Gradient Descent Method Gradient descent method isused for updating the weights using (15) and (16) Table 14shows the performance metrics of AIF version 16 Figure 12shows the graph plot for variation ofmean square error valueswrt no of epoch (or iteration) for AIF version 16

(b) Levenberg Marquardt Method Levenberg Marquardtmethod [21 22] is a technique for updating weights In caseof Gradient descent method learning rate 120572 is constant but inLevenbergMarquardt method learning rate 120572 varies in everyiteration So this method consumes less number of iterations

Mea

n sq

uare

erro

r

07

06

05

04

03

02

01

00 20 40 60 80 100 120 140 160 180

Number of iterations

Figure 12 MSE versus number of epoch wrt Gradient descent NN

Mea

n sq

uare

erro

r

Number of iterations

02

015

01

005

00 2 4 6 8 10 12 14

Figure 13 MSE versus number of epoch wrt Levenberg-marquardtNN

to train the network Table 15 shows the performance metricsfor AIF version 16 using Levenberg Marquardt method

Figure 13 shows the graph plot for variation of meansquare error values wrt number of epoch for AIF version 16

12 ISRN Software Engineering

Table 13 Precision correctness completeness and accuracy for three versions of AIF

Precision () Correctness () Completeness () Accuracy ()ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 6111 4117 5714 6111 4117 5714 509 482 425 6613 8402 8171DIT mdash mdash mdash mdash mdash mdash 0 0 0 6447 8337 8051NOC 75 75 6666 75 75 6666 555 206 531 6578 836 8103CBO 60 5714 7777 60 5714 7777 277 275 372 648 8348 8103RFC 6666 3636 50 6666 3636 50 462 275 212 6529 8302 8051LCOM 6666 50 60 066 05 06 277 68 159 6496 8337 8062MULTI 6875 50 6153 6875 50 6153 1018 758 851 6644 8337 8113

Table 14 Accuracy prediction using gradient descent NN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00594 1093 00617 minus02038 00044 00048 940437

Table 15 Accuracy prediction using Levenberg Marquardt

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00023 11203 00308 minus02189 00022 00041 904977

Table 16 Accuracy prediction using FLANN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00304 07097 00390 03308 24601119890 minus 06 00050 963769

Table 17 Accuracy prediction using basic RBFN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00279 03875 00573 01969 0059 0006 972792

652 Functional Link Artificial Neural Network (FLANN)FLANN architecture for software fault prediction is a singlelayer feed-forward neural network consisting of an input andoutput layer FLANN does not incorporate any hidden layerand hence has less computational cost In this paper adaptivealgorithm has been used for updating the weights as shownin (21) Figure 14 shows the variation of mean square valuesagainst number of epochs for AIF version 16 Table 16 showsthe performance metrics of FLANN

653 Radial Basis Function Network In this paper Gaussianradial function is used as a radial function Gradient descentlearning and hybrid learning methods are used for updatingthe centers and weights respectively

Three layered RBFN has been considered in which sixCK metrics are taken as input nodes nine hidden centers aretaken as hidden nodes and output is the fault prediction rateTable 17 shows the performance metrics for AIF version 16

(a) Gradient Descent Learning Method Equations (25) and(26) are used for updating center and weight during trainingphase After simplifying (25) the equation is represented as

119862119894119895(119896 + 1) = 119862

119894119895(119896) minus 120578

1(1199101015840minus 119910)119882119894

120601119894

1205902(119909119895minus 119862119894119895(119896)) (39)

and the modified Equation (26) is formulated as

119882119894(119896 + 1) = 119882

119894(119896) + 120578

2(1199101015840minus 119910) 120601

119894 (40)

where 120590 is the width of the center and 119896 is the currentiteration number Table 18 shows the performancemetrics forAIF version 16 Figure 15 indicates the variation of MSE wrtnumber of epochs

(b) Hybrid Learning Method In Hybrid learning methodcenters are updated using (27) while weights are updatedusing supervised learning methods In this paper least meansquare error (LMSE) algorithm is used for updating theweights Table 19 shows the performance matrix for AIFversion 16 Figure 16 shows the graph for variation of MSEversus number of epochs

654 Probabilistic Neural Network (PNN) As mentioned inSection 424 PNN is a multilayered feed-forward networkwith four layers such as input hidden summation andoutput layer

In PNN 50 of faulty and nonfaulty classes are takenas input for hidden layers Gaussian elimination (28) isused as a hidden node function The summation layers sum

ISRN Software Engineering 13

Table 18 Accuracy prediction using RBFN gradient

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00207 02316 00323 03041 16302119890 minus 05 00041 972475

Table 19 Accuracy prediction using hybrid RBFN

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00614 01032 00316 09184 31834119890 minus 79 00013 984783

Number of iterations

Mea

n sq

uare

erro

r

0 10 20 30 40 50 60 700

01

02

03

04

05

06

07

08

09

Figure 14 Graph plot for MSE versus number of iterations (epoch)wrt FLANN

Number of iterations

Mea

n sq

uare

erro

r 0015

001

005

00 5 10 15 20 25 30 35 40 45

Figure 15 MSE versus number of epochs wrt gradient RBFN

contribution of each class of input patterns and producea net output which is a vector of probabilities The outputpattern having maximum summation value is classified intorespective class Figure 17 shows the variation of accuracy fordifferent values of smoothing parameter

66 Comparison Table 20 shows the tabulated results forthe obtained performance parameter values number ofepochs and accuracy rate by applying three neural networktechniques This performance table is an indication of betterfault prediction model In this comparative analysis theperformance parameter mean square error (MSE) was takenas a criterion to compute the performance parameters (suchas MARE MSE number of epochs and accuracy rate)when four neural network techniques were applied Duringthis process the MSE value of 0002 was set a thresholdfor evaluation Based on the number of iterations and theaccuracy rate obtained by the respective NN technique bestprediction model was determined

Number of iterationsM

ean

squa

re er

ror

006

005

003

002

001

004

00 2 4 6 8 10 12 14

Figure 16 MSE versus number of epochs wrt hybrid RBFN

Smoothing parameter

Accu

racy

()

865

86

855

85

845

84

835

83

825

820 05 1 15 2 25 3 35 4 45 5

Figure 17 Accuracy rate versus smoothing parameter

From Table 20 it is evident that gradient NN methodobtained an accuracy rate of 9404 in 162 epochs (iter-ations) LM technique which is an improvised model ofANN obtained 904 accuracy rate This accuracy rate isless than gradient NN but this approach (LM method) tookonly 13 epochs PNN method achieved a classification rate of8641

The three types of RBFN namely basic RBFN gradientand hybrid methods obtained a prediction rate of 97279724 and 9847 respectively Considering the number ofepochs RBFN hybridmethod obtained better prediction rateof 9847 in only 14 epochs when compared with gradientmethod (41 epochs) and basic RBFN approaches

FLANN architecture obtained 9637 accuracy rate withless computational cost involved FLANN obtained accuracyrate in 66 epochs as it has no hidden layer involved in itsarchitecture

14 ISRN Software Engineering

Table 20 Performance metrics

Performance parametersAI technique Epoch MAE MARE RMSE Std Error AccuracyGradient descent 162 00594 10930 00617 00048 9404LM 13 00023 11203 00308 00041 9049RBFN basic mdash 00279 03875 00573 006 9727RBFN gradient 41 00207 02316 00323 00041 9724RBFN hybrid 14 00614 01032 00316 00013 9847FLANN 66 00304 07097 00390 00050 9637

The performance of PNN is shown in Figure 17 Highestaccuracy in prediction was obtained for smoothing parame-ter value of 17 PNN obtained a classification rate of 8641

RBFN using hybrid learning model gives the least valuesfor MAE MARE RMSE and high accuracy rate Hencefrom the obtained results by using ANN techniques it can beconcluded that RBFNhybrid approach obtained the best faultprediction rate in less number of epochswhen comparedwithother three ANN techniques

7 Conclusion

System analyst use of prediction models to classify faultprone classes as faulty or not faulty is the need of the dayfor researchers as well as practitioners So more reliableapproaches for prediction need to be modeled In this papertwo approaches namely statistical methods and machinelearning techniques were applied for fault prediction Theapplication of statistical and machine learning methods infault prediction requires enormous amount of data andanalyzing this huge amount of data is necessary with the helpof a better prediction model

This paper proposes a comparative study of differentprediction models for fault prediction for an open-sourceproject Fault prediction using statistical and machine learn-ing methods were carried out for AIF by coding in MATLABenvironment Statistical methods such as linear regressionand logistic regression were applied Also machine learningtechniques such as artificial neural network (gradient descentand Levenberg Marquardt methods) Functional link artifi-cial neural network radial basis function network (RBFNbasic RBFN gradient and RBFN hybrid) and probabilisticneural network techniques were applied for fault predictionanalysis

It can be concluded from the statistical regression analysisthat out of six CK metrics WMC appears to be more usefulin predicting faults Table 20 shows that hybrid approachof RBFN obtained better fault prediction in less number ofepochs (14 iterations) when compared with the other threeneural network techniques

In future work should be replicated to other open-sourceprojects like Mozilla using different AI techniques to analyzewhich model performs better in achieving higher accuracyfor fault prediction Also fault prediction accuracy should bemeasured by combining multiple computational intelligencetechniques

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] V R Basili L C Briand and W L Melo ldquoA validationof object-oriented design metrics as quality indicatorsrdquo IEEETransactions on Software Engineering vol 22 no 10 pp 751ndash761 1996

[2] T J McCabe ldquoA Complexity Measurerdquo IEEE Transactions onSoftware Engineering vol 2 no 4 pp 308ndash320 1976

[3] M H Halstead Elements of Software Science Elsevier ScienceNew York NY USA 1977

[4] W Li and S Henry ldquoMaintenance metrics for the Object-Oriented paradigmrdquo in Proceedings of the 1st InternationalSoftware Metrics Symposium pp 52ndash60 1993

[5] S R Chidamber and C F Kemerer ldquoMetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[6] F B E Abreu andR Carapuca ldquoObject-Oriented software engi-neering measuring and controlling the development processrdquoin Proceedings of the 4th International Conference on SoftwareQuality pp 1ndash8 McLean Va USA October 1994

[7] M Lorenz and J Kidd Object-Oriented Software MetricsPrentice Hall Englewood NJ USA 1994

[8] R Martin ldquoOO design quality metricsmdashan analysis of depen-denciesrdquo in Proceedings of the Workshop Pragmatic and Theo-retical Directions in Object-Oriented Software Metrics (OOPSLArsquo94) 1994

[9] D P Tegarden S D Sheetz and D E Monarchi ldquoA softwarecomplexity model of object-oriented systemsrdquoDecision SupportSystems vol 13 no 3-4 pp 241ndash262 1995

[10] W Melo and F B E Abreu ldquoEvaluating the impact of object-oriented design on software qualityrdquo in Proceedings of the 3rdInternational Software Metrics Symposium pp 90ndash99 BerlinGermany March 1996

[11] L Briand P Devanbu and W Melo ldquoInvestigation intocoupling measures for C++rdquo in Proceedings of the IEEE 19thInternational Conference on Software EngineeringAssociation forComputing Machinery pp 412ndash421 May 1997

[12] L Etzkorn J Bansiya and C Davis ldquoDesign and code com-plexity metrics for OO classesrdquo Journal of Object-OrientedProgramming vol 12 no 1 pp 35ndash40 1999

[13] L C Briand JWust JWDaly andDVictor Porter ldquoExploringthe relationships between designmeasures and software qualityin object-oriented systemsrdquoThe Journal of Systems and Softwarevol 51 no 3 pp 245ndash273 2000

ISRN Software Engineering 15

[14] M-H Tang M-H Kao and M-H Chen ldquoEmpirical study onobject-oriented metricsrdquo in Proceedings of the 6th InternationalSoftware Metrics Symposium pp 242ndash249 November 1999

[15] K El Emam W Melo and J C Machado ldquoThe prediction offaulty classes using object-oriented design metricsrdquo Journal ofSystems and Software vol 56 no 1 pp 63ndash75 2001

[16] T M Khoshgoftaar E B Allen J P Hudepohl and S J AudldquoApplication of neural networks to software quality modeling ofa very large telecommunications systemrdquo IEEE Transactions onNeural Networks vol 8 no 4 pp 902ndash909 1997

[17] R Hochman T M Khoshgoftaar E B Allen and J PHudepohl ldquoEvolutionary neural networks a robust approachto software reliability problemsrdquo in Proceedings of the 8thInternational Symposium on Software Reliability Engineering(ISSRE rsquo97) pp 13ndash26 November 1997

[18] T Menzies B Caglayan E Kocaguneli J Krall F Peters andB Turhan ldquoThe PROMISE Repository of empirical softwareengineering datardquo West Virginia University Department ofComputer Science 2012 httppromisedatagooglecodecom

[19] Y Kumar Jain and S K Bhandare ldquoMin max normalizationbased data perturbation method for privacy protectionrdquo Inter-national Journal of Computer and Communication Technologyvol 2 no 8 pp 45ndash50 2011

[20] R Battiti ldquoFirst and Second-Order Methods for Learning bet-ween steepest descent and newtonrsquos methodrdquo Neural Computa-tion vol 4 no 2 pp 141ndash166 1992

[21] K Levenberg ldquoA method for the solution of certain non-linearproblems in least squaresrdquo Quarterly of Applied Mathematicsvol 2 no 2 pp 164ndash168 1944

[22] D W Marquardt ldquoAn algorithm for the lest-squares estimationof non-linear parametersrdquo SIAM Journal of Applied Mathemat-ics vol 11 no 2 pp 431ndash441 1963

[23] Y H Pao Adaptive Pattern Recognition and Neural NetworksAddison-Wesley Reading UK 1989

[24] D F Specht ldquoProbabilistic neural networksrdquo Neural Networksvol 3 no 1 pp 109ndash118 1990

[25] C Catal ldquoPerformance evaluation metrics for software faultprediction studiesrdquo Acta Polytechnica Hungarica vol 9 no 4pp 193ndash206 2012

[26] X Yaun T M Khoshgoftaar E B Allen and K GanesanldquoApplication of fuzzy clustering to software quality predictionrdquoin Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSESTrsquo00) pp 85ndash91 March 2000


Figure 4: Basic structure of PNN (input layer; pattern layer built from the training set; summation layer computing g1(X), g2(X), g3(X); and output layer selecting the class with max(g1, g2, g3)).

5.1.4. Accuracy. Accuracy, as defined by Yaun et al. [26], is the proportion of modules, faulty or not, that are classified correctly out of all modules:

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + FN + TP}. \quad (32)$$

5.1.5. $R^2$ Statistic. $R^2$, also known as the coefficient of multiple determination, is a measure of the strength of correlation between the predicted and the actual number of faults [25]. The higher the value of this statistic, the more accurate is the predicted model:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \quad (33)$$

where $y_i$ is the actual number of faults, $\hat{y}_i$ is the predicted number of faults, and $\bar{y}$ is the average number of faults.

5.2. Machine Learning. Fault prediction accuracy for the four applied ANN models is determined by using performance evaluation parameters such as mean absolute error (MAE), mean absolute relative error (MARE), root mean square error (RMSE), and standard error of the mean (SEM).

5.2.1. Mean Absolute Error (MAE). This performance parameter measures how closely the predicted and actual fault (accuracy) rates agree:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - y'_i \right|. \quad (34)$$

5.2.2. Mean Absolute Relative Error (MARE). Consider

$$\text{MARE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - y'_i \right|}{y_i}. \quad (35)$$

In (35), a numerical value of 0.05 is added to the denominator in order to avoid numerical overflow (division by zero). The modified MARE is formulated as

$$\text{MARE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - y'_i \right|}{y_i + 0.05}. \quad (36)$$

5.2.3. Root Mean Square Error (RMSE). This performance parameter measures the differences between the predicted and actual fault (accuracy) rates:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - y'_i \right)^2}. \quad (37)$$

In (35), (36), and (37), $y_i$ is the actual value and $y'_i$ is the expected (predicted) value.

5.2.4. Standard Error of the Mean (SEM). This is the deviation of the predicted value from the actual fault (accuracy) rate:

$$\text{SEM} = \frac{\text{SD}}{\sqrt{n}}, \quad (38)$$

where SD is the sample standard deviation and $n$ is the number of samples.
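For illustration, the evaluation measures in (32)-(38) can be computed directly; the following is a minimal NumPy sketch (the paper's own experiments were coded in MATLAB), with `y_true` and `y_pred` as hypothetical vectors of actual and predicted fault counts:

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Equation (32): proportion of correctly classified modules."""
    return (tn + tp) / (tn + fp + fn + tp)

def evaluation_metrics(y_true, y_pred):
    """R^2 (33), MAE (34), modified MARE (36), RMSE (37), and SEM (38)."""
    n = len(y_true)
    residual = y_true - y_pred
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mae = np.mean(np.abs(residual))
    mare = np.mean(np.abs(residual) / (y_true + 0.05))  # 0.05 avoids division by zero
    rmse = np.sqrt(np.mean(residual ** 2))
    sem = np.std(residual, ddof=1) / np.sqrt(n)  # sample SD of deviations over sqrt(n)
    return r2, mae, mare, rmse, sem

# hypothetical actual and predicted fault counts for five classes
y_true = np.array([0.0, 1.0, 2.0, 0.0, 3.0])
y_pred = np.array([0.2, 0.8, 1.7, 0.1, 2.6])
print(evaluation_metrics(y_true, y_pred))
```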


Table 5: Distribution of bugs for AIF version 1.6.

| Number of classes | Percentage of classes | Number of associated bugs |
| 777 | 80.5181 | 0 |
| 101 | 10.4663 | 1 |
| 32 | 3.3161 | 2 |
| 16 | 1.6580 | 3 |
| 14 | 1.4508 | 4 |
| 6 | 0.6218 | 5 |
| 2 | 0.2073 | 6 |
| 3 | 0.3109 | 7 |
| 5 | 0.5181 | 8 |
| 1 | 0.1036 | 9 |
| 1 | 0.1036 | 10 |
| 3 | 0.3109 | 11 |
| 1 | 0.1036 | 13 |
| 1 | 0.1036 | 17 |
| 1 | 0.1036 | 18 |
| 1 | 0.1036 | 28 |
| 965 (total) | 100.00 | 142 |

Figure 5: WMC of AIF version 1.6.

6. Results and Analysis

In this section, the relationship between the values of the metrics and the faults found in a class is determined. In this approach, the comparative study involves using six CK metrics as input nodes, and the output is the achieved fault prediction rate. Fault prediction is performed for AIF version 1.6.

6.1. Fault Data. To perform statistical analysis, bugs were collected from the Promise data repository [18]. Table 5 shows the distribution of bugs based on the number of occurrences (in terms of the percentage of classes containing a given number of bugs) for AIF version 1.6.

AIF version 1.6 contains 965 classes, of which 777 classes contain zero bugs (80.5181%); 10.4663% of classes contain one bug, 3.3161% of classes contain two bugs, 1.6580% of classes contain three bugs, 1.4508% of classes contain four bugs, 0.6218% of classes contain five bugs, and 0.2073% of the classes contain six bugs.

Figure 6: DIT of AIF version 1.6.

Figure 7: NOC of AIF version 1.6.

Further, 0.3109% of classes contain seven and eleven bugs, 0.5181% of classes contain eight bugs, and 0.1036% of classes contain nine, ten, thirteen, seventeen, eighteen, and twenty-eight bugs, respectively.

6.2. Metrics Data. CK metric values for WMC, DIT, NOC, CBO, RFC, and LCOM for AIF version 1.6 are graphically represented in Figures 5, 6, 7, 8, 9, and 10, respectively.

6.3. Descriptive Statistics and Correlation Analysis. This subsection gives a comparative analysis of the fault data, the descriptive statistics of classes, and the correlation among the six metrics with that of Basili et al. [1]. Basili et al. studied object-oriented systems written in the C++ language. They carried out an experiment in which they set up eight project groups, each consisting of three students. Each group had the same task of developing a small/medium-sized software system. Since all the necessary documentation (for instance, reports about faults and their fixes) was available, they could search for relationships between fault density and metrics. They used the same CK metric suite. Logistic regression was employed to analyze the relationship between the metrics and the fault proneness of classes.

The obtained CK metric values of AIF version 1.6 are compared with the results of Basili et al. [1]. In comparison with Basili et al., the total number of classes considered here is much greater: 965 classes versus 180. Table 6 shows the comparative statistical analysis results obtained for Basili et al. and AIF version 1.6 for the CK metrics, indicating Max, Min, Median, Mean, and Standard deviation.


Table 6: Descriptive statistics of classes.

Basili et al. [1]:
|         | WMC   | DIT  | NOC    | CBO   | RFC   | LCOM   |
| Max     | 99.00 | 9.00 | 105.00 | 13.00 | 30.00 | 426.00 |
| Min     | 1.00  | 0.00 | 0.00   | 0.00  | 0.00  | 0.00   |
| Median  | 9.50  | 0.00 | 19.50  | 0.00  | 5.00  | 0.00   |
| Mean    | 13.40 | 1.32 | 33.91  | 0.23  | 6.80  | 9.70   |
| Std Dev | 14.90 | 1.99 | 33.37  | 1.54  | 7.56  | 63.77  |

AIF version 1.6:
|         | WMC    | DIT  | NOC   | CBO    | RFC    | LCOM   |
| Max     | 166.00 | 6.00 | 39.00 | 448.00 | 322.00 | 13617  |
| Min     | 0.00   | 0.00 | 0.00  | 0.00   | 0.00   | 0.00   |
| Median  | 5.00   | 1.00 | 0.00  | 7.00   | 14.00  | 4.00   |
| Mean    | 8.57   | 1.95 | 0.052 | 11.10  | 21.42  | 79.33  |
| Std Dev | 11.20  | 1.27 | 2.63  | 22.52  | 25.00  | 523.75 |

Figure 8: CBO of AIF version 1.6.

Figure 9: RFC of AIF version 1.6.

The dependency between CK metrics is computed using Pearson's correlations ($R^2$: coefficient of determination) and compared with Basili et al. [1] for AIF version 1.6. The coefficient of determination $R^2$ is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable; it is a measure that allows a researcher to determine how confident one can be in making predictions from a given model/graph. Table 7 shows the Pearson's correlations for the data set used by Basili et al. [1] and the correlation metrics of AIF version 1.6.
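As a rough sketch of how the entries of Table 7 can be reproduced, the pairwise Pearson correlations between metric columns may be computed as follows (`metrics` is a hypothetical 965 x 6 array standing in for the collected WMC, ..., LCOM values):

```python
import numpy as np

names = ["WMC", "DIT", "NOC", "CBO", "RFC", "LCOM"]
metrics = np.random.rand(965, 6)  # placeholder for the real per-class metric data

corr = np.corrcoef(metrics, rowvar=False)  # 6 x 6 Pearson correlation matrix
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"r({names[i]}, {names[j]}) = {corr[i, j]:.2f}")
```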

Figure 10: LCOM of AIF version 1.6.

From Table 7, with respect to AIF version 1.6, it is observed that the correlation between WMC and RFC is 0.77, which is high; that is, these two metrics are strongly linearly dependent on each other. Similarly, the correlation between WMC and DIT is 0, which indicates that they are uncorrelated; that is, there is no linear dependency between these two metrics.

6.4. Fault Prediction Using Statistical Methods

6.4.1. Linear Regression Analysis. Table 8 shows the results obtained for linear regression analysis, in which the fault count is considered as the dependent variable and the CK metrics are the independent variables.

Here $R$ represents the coefficient of correlation, and $P$ refers to the significance of the metric value. If $P < 0.001$, the metric is of very high significance in fault prediction.
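A minimal ordinary-least-squares fit of the kind summarized in Table 8 might look as follows (a NumPy sketch under the assumption that `X` holds the six CK metric values per class and `y` the fault counts; the authors' own runs were done in MATLAB):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept; returns coefficients and R."""
    X_aug = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)   # minimize ||X_aug beta - y||
    y_hat = X_aug @ beta
    r = np.corrcoef(y, y_hat)[0, 1]                    # coefficient of correlation R
    return beta, r

# hypothetical data: 965 classes x 6 CK metrics, plus fault counts per class
X = np.random.rand(965, 6)
y = np.random.poisson(0.5, size=965).astype(float)
beta, r = fit_linear_regression(X, y)
print("R =", round(r, 4))
```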

6.4.2. Logistic Regression Analysis. The logistic regression method helps to indicate whether a class is faulty or not but does not convey anything about the possible number of faults in the class. Univariate and multivariate logistic regression techniques are applied to predict whether a class is faulty or not.


Table 7: Correlations between metrics.

Basili et al. [1]:
|      | WMC  | DIT  | NOC  | CBO  | RFC  | LCOM |
| WMC  | 1.00 | 0.02 | 0.24 | 0.00 | 0.13 | 0.38 |
| DIT  |      | 1.00 | 0.00 | 0.00 | 0.00 | 0.01 |
| NOC  |      |      | 1.00 | 0.00 | 0.00 | 0.00 |
| CBO  |      |      |      | 1.00 | 0.31 | 0.01 |
| RFC  |      |      |      |      | 1.00 | 0.09 |
| LCOM |      |      |      |      |      | 1.00 |

AIF version 1.6:
|      | WMC  | DIT  | NOC  | CBO   | RFC   | LCOM  |
| WMC  | 1.00 | 0.00 | 0.03 | 0.10  | 0.77  | 0.60  |
| DIT  |      | 1.00 | 0.00 | 0.00  | 0.00  | 0.01  |
| NOC  |      |      | 1.00 | 0.024 | 0.025 | 0.027 |
| CBO  |      |      |      | 1.00  | 0.08  | 0.05  |
| RFC  |      |      |      |       | 1.00  | 0.42  |
| LCOM |      |      |      |       |       | 1.00  |

Table 8: Linear regression analysis.

| Version | R      | P value | Std error |
| 1.2     | 0.5360 | 0.000   | 0.1114    |
| 1.4     | 0.5024 | 0.000   | 0.1450    |
| 1.6     | 0.5154 | 0.000   | 0.0834    |

Figure 11: Logistic graph of 1/(1 + exp(-q)).

Univariate regression analysis is used to examine the effect of each individual metric on the fault proneness of a class, while multivariate regression analysis is used to examine the combined effect of the metrics. The results for the three versions of AIF are compared under these two statistical techniques. Figure 11 shows the typical "S" curve (similar to the sigmoid function) obtained for AIF version 1.6 using multivariate logistic regression. Tables 9 and 10 contain the tabulated values of the results obtained by applying univariate and multivariate regression analysis, respectively.

From Table 9, it can be observed that all metrics of the CK suite are highly significant except for DIT. The $P$ values for the three versions (with respect to DIT) are 0.335, 0.108, and 0.3257, respectively; higher values of $P$ indicate lower significance.

Univariate and multivariate logistic regression statistical methods were used for classifying a class as faulty or not faulty. Logistic regression was applied with a threshold value of 0.5; that is, $\pi > 0.5$ indicates that a class is classified as "faulty"; otherwise it is categorized as a "not faulty" class.
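Using the multivariate coefficients reported in Table 10 for AIF version 1.6, the $\pi > 0.5$ decision rule can be sketched as below (an illustrative NumPy snippet, not the authors' MATLAB code; the sample CK values are hypothetical):

```python
import numpy as np

# multivariate logistic regression coefficients for AIF version 1.6 (Table 10),
# in the order WMC, DIT, NOC, CBO, RFC, LCOM, plus the constant term
coef = np.array([0.0320, 0.0, 0.0, 0.001, 0.0109, 0.0])
constant = -2.157

def classify(ck_values, threshold=0.5):
    """Classify a class as faulty when the predicted probability exceeds 0.5."""
    q = constant + coef @ ck_values
    pi = 1.0 / (1.0 + np.exp(-q))          # the sigmoid of Figure 11
    return ("faulty" if pi > threshold else "not faulty"), pi

# hypothetical class with WMC=25, DIT=2, NOC=0, CBO=12, RFC=40, LCOM=30
print(classify(np.array([25.0, 2.0, 0.0, 12.0, 40.0, 30.0])))
```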

Tables 11 and 12 present the confusion matrices for the number of classes with faults before and after applying regression analysis, respectively, for AIF version 1.6. From Table 11, it is clear that, before applying the logistic regression, a total of 777 classes contained zero bugs and 188 classes contained at least one bug. After applying logistic regression (Table 12), a total of 767 + 16 = 783 classes are classified correctly, giving an accuracy of 783/965 = 81.13%.

The performance parameters for all three versions of AIF, obtained by applying univariate and multivariate logistic regression analysis, are shown in Table 13. Here precision, correctness, completeness, and accuracy [1, 13, 27, 28] are taken as the performance parameters. Using multivariate logistic regression, the accuracy of AIF version 1.2 is found to be 64.44%, the accuracy of AIF version 1.4 is 83.37%, and that of AIF version 1.6 is 81.13%.

From the results obtained by applying linear and logistic regression analysis, it is found that, out of the six metrics, WMC appears to have the most impact in predicting faults.

6.5. Fault Prediction Using Neural Networks

6.5.1. Artificial Neural Network. An ANN is an interconnected group of nodes. In this paper, three layers of ANN are considered, in which six nodes act as input nodes, nine nodes represent the hidden nodes, and one node acts as the output node.

ANN training is a three-phase process; the phases are used for learning, validation, and testing purposes. In this article, 70% of the total input patterns are considered for the learning phase, 15% for validation, and the remaining 15% for testing, as sketched below. The regression analysis carried out classifies whether a class is faulty or not faulty.
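The 70/15/15 partition can be realized, for instance, by a random permutation of the pattern indices; a small sketch follows (the exact split procedure is not specified in the paper beyond the percentages):

```python
import numpy as np

def split_patterns(n_patterns, seed=0):
    """Shuffle indices and cut them into 70% learning, 15% validation, 15% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_patterns)
    n_train = int(0.70 * n_patterns)
    n_val = int(0.15 * n_patterns)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_patterns(965)
print(len(train_idx), len(val_idx), len(test_idx))  # 675 144 146
```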


Table 9: Analysis of univariate regression (each cell gives ver 1.2 / ver 1.4 / ver 1.6).

| Metric | Coefficient           | Constant               | P value                  | R value                |
| WMC    | 0.028 / 0.05 / 0.03   | -0.83 / -2.11 / -1.77  | 0.0013 / 0.0007 / 0.00   | 0.130 / 0.240 / 0.18   |
| DIT    | -0.067 / 0.10 / 0.05  | -0.46 / -1.83 / -1.53  | 0.335 / 0.108 / 0.3257   | -0.039 / 0.054 / 0.02  |
| NOC    | 0.137 / 0.09 / 0.13   | -0.66 / -1.67 / -1.50  | 0.0007 / 0.00 / 0.00     | 0.136 / 0.13 / 0.16    |
| CBO    | 0.011 / 0.01 / 0.02   | -0.71 / -1.80 / -1.66  | 0.017 / 0.00 / 0.00      | 0.096 / 0.15 / 0.17    |
| RFC    | 0.012 / 0.02 / 0.01   | -0.86 / -2.15 / -1.79  | 0.0014 / 0.00 / 0.00     | 0.130 / 0.23 / 0.17    |
| LCOM   | 0.007 / 0.007 / 0.007 | -0.64 / -1.67 / -1.48  | 0.0349 / 0.0004 / 0.0007 | 0.085 / 0.11 / 0.11    |

Table 10: Multivariate logistic regression analysis.

| Coefficient | AIF version 1.2 | AIF version 1.4 | AIF version 1.6 |
| WMC         | 0.0195          | 0.0574          | 0.0320          |
| DIT         | -0.041          | 0.000           | 0.000           |
| NOC         | 0.1231          | 0.000           | 0.000           |
| CBO         | 0.005           | 0.008           | 0.001           |
| RFC         | 0.0071          | 0.0081          | 0.0109          |
| LCOM        | 0               | -0.001          | 0               |
| Constant    | -0.917          | -2.785          | -2.157          |

Table 11: Before applying regression.

|            | Not-faulty | Faulty |
| Not-faulty | 777        | 0      |
| Faulty     | 188        | 0      |

Table 12: After applying regression.

|            | Not-faulty | Faulty |
| Not-faulty | 767        | 10     |
| Faulty     | 172        | 16     |

The prediction models based on ANN and its variants, namely, PNN, RBFN, and FLANN, not only classify a class as faulty or not faulty but also indicate the number of bugs found in the class; these bugs are then fixed in the testing phase of the software development life cycle.

In this paper, six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the gradient descent method and the Levenberg Marquardt method.

(a) Gradient Descent Method. The gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics for AIF version 1.6. Figure 12 shows the graph plot for the variation of mean square error values with respect to the number of epochs (iterations) for AIF version 1.6.
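A bare-bones rendering of this training loop for the 6-9-1 network is given below; it follows the generic rule w <- w - alpha dE/dw that (15) and (16) instantiate, with sigmoid units assumed (a sketch, not the authors' MATLAB implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_gradient_descent(X, y, hidden=9, alpha=0.1, epochs=200, seed=0):
    """6-9-1 feed-forward ANN trained by batch gradient descent on the MSE.

    X is (n, 6); y is an (n, 1) column of 0/1 fault labels.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, 1))
    for _ in range(epochs):
        H = sigmoid(X @ W1)                 # hidden-layer activations
        out = sigmoid(H @ W2)               # network output
        err = out - y                       # prediction error
        d2 = err * out * (1 - out)          # backpropagated output delta
        d1 = (d2 @ W2.T) * H * (1 - H)      # backpropagated hidden delta
        W2 -= alpha * H.T @ d2 / len(X)     # weight update: w <- w - alpha dE/dw
        W1 -= alpha * X.T @ d1 / len(X)
    return W1, W2
```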

(b) Levenberg Marquardt Method. The Levenberg Marquardt method [21, 22] is a technique for updating the weights. In the gradient descent method the learning rate $\alpha$ is constant, but in the Levenberg Marquardt method the learning rate $\alpha$ varies in every iteration, so this method requires fewer iterations to train the network.
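The Levenberg Marquardt step replaces the fixed learning rate with a damped Gauss-Newton update, solving $(J^T J + \mu I)\,\Delta w = J^T e$ and adapting the damping factor $\mu$ each iteration. A schematic of this standard logic [21, 22] (not the authors' exact implementation):

```python
import numpy as np

def lm_step(w, jacobian, residuals, mu):
    """One Levenberg Marquardt update of the weight vector w."""
    J = jacobian(w)                               # Jacobian of the residuals
    e = residuals(w)                              # current residual vector
    A = J.T @ J + mu * np.eye(len(w))             # damped Gauss-Newton matrix
    dw = np.linalg.solve(A, J.T @ e)
    w_new = w - dw
    # accept the step and soften the damping if the error decreased;
    # otherwise keep w and increase mu, forcing a smaller, safer step
    if np.sum(residuals(w_new) ** 2) < np.sum(e ** 2):
        return w_new, mu / 10.0
    return w, mu * 10.0
```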

Figure 12: MSE versus number of epochs w.r.t. gradient descent NN.

Figure 13: MSE versus number of epochs w.r.t. Levenberg Marquardt NN.

Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg Marquardt method. Figure 13 shows the graph plot for the variation of mean square error values with respect to the number of epochs for AIF version 1.6.


Table 13: Precision, correctness, completeness, and accuracy for the three versions of AIF (each cell gives ver 1.2 / ver 1.4 / ver 1.6).

| Metric | Precision (%)         | Correctness (%)       | Completeness (%)     | Accuracy (%)          |
| WMC    | 61.11 / 41.17 / 57.14 | 61.11 / 41.17 / 57.14 | 5.09 / 4.82 / 4.25   | 66.13 / 84.02 / 81.71 |
| DIT    | -- / -- / --          | -- / -- / --          | 0 / 0 / 0            | 64.47 / 83.37 / 80.51 |
| NOC    | 75 / 75 / 66.66       | 75 / 75 / 66.66       | 5.55 / 2.06 / 5.31   | 65.78 / 83.6 / 81.03  |
| CBO    | 60 / 57.14 / 77.77    | 60 / 57.14 / 77.77    | 2.77 / 2.75 / 3.72   | 64.8 / 83.48 / 81.03  |
| RFC    | 66.66 / 36.36 / 50    | 66.66 / 36.36 / 50    | 4.62 / 2.75 / 2.12   | 65.29 / 83.02 / 80.51 |
| LCOM   | 66.66 / 50 / 60       | 0.66 / 0.5 / 0.6      | 2.77 / 6.8 / 1.59    | 64.96 / 83.37 / 80.62 |
| MULTI  | 68.75 / 50 / 61.53    | 68.75 / 50 / 61.53    | 10.18 / 7.58 / 8.51  | 66.44 / 83.37 / 81.13 |

Table 14: Accuracy prediction using gradient descent NN.

| MAE    | MARE   | RMSE   | R       | P value    | Std error | Accuracy (%) |
| 0.0594 | 1.093  | 0.0617 | -0.2038 | 0.0044     | 0.0048    | 94.0437      |

Table 15: Accuracy prediction using Levenberg Marquardt.

| MAE    | MARE   | RMSE   | R       | P value    | Std error | Accuracy (%) |
| 0.0023 | 1.1203 | 0.0308 | -0.2189 | 0.0022     | 0.0041    | 90.4977      |

Table 16: Accuracy prediction using FLANN.

| MAE    | MARE   | RMSE   | R       | P value    | Std error | Accuracy (%) |
| 0.0304 | 0.7097 | 0.0390 | 0.3308  | 2.4601e-06 | 0.0050    | 96.3769      |

Table 17: Accuracy prediction using basic RBFN.

| MAE    | MARE   | RMSE   | R       | P value    | Std error | Accuracy (%) |
| 0.0279 | 0.3875 | 0.0573 | 0.1969  | 0.059      | 0.006     | 97.2792      |

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has a lower computational cost. In this paper, an adaptive algorithm has been used for updating the weights, as shown in (21). Figure 14 shows the variation of mean square values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.
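A FLANN thus reduces training to a single weight layer over an expanded input. The sketch below uses a trigonometric expansion with an LMS-style adaptive update, a common FLANN choice [23]; the particular basis functions are an assumption, since the paper does not list its expansion explicitly:

```python
import numpy as np

def expand(x):
    """Functional expansion of one pattern: [x, sin(pi x), cos(pi x)]."""
    return np.concatenate([x, np.sin(np.pi * x), np.cos(np.pi * x)])

def train_flann(X, y, alpha=0.05, epochs=66, seed=0):
    """Single-layer FLANN trained with an adaptive (LMS-like) weight update."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=expand(X[0]).shape)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            phi = expand(x_i)
            err = y_i - w @ phi        # linear output; no hidden layer exists
            w += alpha * err * phi     # adaptive weight update, cf. (21)
    return w
```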

6.5.3. Radial Basis Function Network. In this paper, the Gaussian radial function is used as the radial function. Gradient descent learning and hybrid learning methods are used for updating the centers and weights, respectively.

A three-layered RBFN has been considered, in which six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.
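The forward pass of such a 6-9-1 Gaussian RBFN can be sketched as follows (the centers, width, and weights shown are hypothetical placeholders):

```python
import numpy as np

def rbfn_output(x, centers, sigma, weights):
    """Gaussian RBFN: phi_i = exp(-||x - c_i||^2 / (2 sigma^2)), y = w . phi."""
    dists = np.linalg.norm(centers - x, axis=1)
    phi = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
    return weights @ phi, phi

# hypothetical setup: nine centers in the six-dimensional CK metric space
centers = np.random.rand(9, 6)
weights = np.random.rand(9)
y, phi = rbfn_output(np.random.rand(6), centers, sigma=1.0, weights=weights)
```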

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplifying (25), the equation is represented as

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 (y' - y) W_i \frac{\phi_i}{\sigma^2} \left( x_j - C_{ij}(k) \right), \quad (39)$$

and the modified Equation (26) is formulated as

$$W_i(k+1) = W_i(k) + \eta_2 (y' - y) \phi_i, \quad (40)$$

where $\sigma$ is the width of the center and $k$ is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE with respect to the number of epochs.
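Equations (39) and (40) translate almost line for line into code. Below is a sketch of one training step, reading the error term $(y' - y)$ as target minus network output so that both updates descend the squared-error surface:

```python
import numpy as np

def rbfn_gradient_step(x, target, centers, weights, sigma, eta1, eta2):
    """One center update (39) and weight update (40) for a Gaussian RBFN."""
    phi = np.exp(-np.linalg.norm(centers - x, axis=1) ** 2 / (2 * sigma ** 2))
    err = target - weights @ phi                # (y' - y): target minus output
    for i in range(len(centers)):               # center update, cf. (39)
        centers[i] += eta1 * err * weights[i] * (phi[i] / sigma ** 2) * (x - centers[i])
    weights += eta2 * err * phi                 # weight update, cf. (40)
    return centers, weights
```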

(b) Hybrid Learning Method. In the hybrid learning method, the centers are updated using (27), while the weights are updated using a supervised learning method. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance metrics for AIF version 1.6. Figure 16 shows the graph of the variation of MSE versus the number of epochs.
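A compact sketch of the hybrid step follows: the centers move unsupervised toward the input patterns (an assumed competitive form of the update in (27)), while the weights follow the LMSE rule:

```python
import numpy as np

def hybrid_rbfn_step(x, target, centers, weights, sigma, eta_c, eta_w):
    """Hybrid learning: unsupervised center update plus LMS weight update."""
    dists = np.linalg.norm(centers - x, axis=1)
    phi = np.exp(-dists ** 2 / (2 * sigma ** 2))
    win = np.argmin(dists)                      # closest (winning) center
    centers[win] += eta_c * (x - centers[win])  # pull it toward the pattern
    err = target - weights @ phi                # least-mean-square weight update
    weights += eta_w * err * phi
    return centers, weights
```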

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: input, pattern (hidden), summation, and output.

In PNN, 50% each of the faulty and nonfaulty classes are taken as input for the hidden layer, and the Gaussian function in (28) is used as the hidden node function.
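A minimal PNN classifier in the sense of Specht [24] is sketched below: every stored training pattern contributes a Gaussian kernel, the summation layer averages the kernels per class, and the output layer picks the class with the larger sum; the smoothing parameter is the quantity varied in Figure 17:

```python
import numpy as np

def pnn_classify(x, patterns, labels, smoothing=1.7):
    """PNN: Gaussian kernel per stored pattern, class-wise summation, argmax."""
    scores = {}
    for cls in np.unique(labels):
        P = patterns[labels == cls]
        d2 = np.sum((P - x) ** 2, axis=1)
        g = np.exp(-d2 / (2.0 * smoothing ** 2))   # pattern-layer outputs
        scores[cls] = g.mean()                     # summation layer
    return max(scores, key=scores.get)             # output layer: max_k g_k(X)

# hypothetical training data: six CK metrics per class, labels 0/1
patterns = np.random.rand(20, 6)
labels = np.random.randint(0, 2, size=20)
print(pnn_classify(np.random.rand(6), patterns, labels))
```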


Table 18: Accuracy prediction using RBFN gradient.

| MAE    | MARE   | RMSE   | R      | P value    | Std error | Accuracy (%) |
| 0.0207 | 0.2316 | 0.0323 | 0.3041 | 1.6302e-05 | 0.0041    | 97.2475      |

Table 19: Accuracy prediction using hybrid RBFN.

| MAE    | MARE   | RMSE   | R      | P value    | Std error | Accuracy (%) |
| 0.0614 | 0.1032 | 0.0316 | 0.9184 | 3.1834e-79 | 0.0013    | 98.4783      |

Figure 14: MSE versus number of iterations (epochs) w.r.t. FLANN.

Figure 15: MSE versus number of epochs w.r.t. gradient RBFN.

The summation layer sums the contributions of each class of input patterns and produces a net output, which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.

6.6. Comparison. Table 20 tabulates the performance parameter values, the number of epochs, and the accuracy rate obtained by applying the neural network techniques; this performance table serves as an indication of the better fault prediction model. In this comparative analysis, the mean square error (MSE) was taken as the criterion when computing the performance parameters (such as MAE, MARE, RMSE, number of epochs, and accuracy rate) for the four applied neural network techniques; during this process, an MSE value of 0.002 was set as the threshold for evaluation. Based on the number of iterations and the accuracy rate obtained by each NN technique, the best prediction model was determined.

Figure 16: MSE versus number of epochs w.r.t. hybrid RBFN.

Figure 17: Accuracy rate versus smoothing parameter.

From Table 20, it is evident that the gradient descent NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, an improved variant of ANN training, obtained a 90.49% accuracy rate; this accuracy rate is lower than that of the gradient descent NN, but the LM method took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely, the basic RBFN, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained its better prediction rate of 98.47% in only 14 epochs, compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved. FLANN reached this accuracy rate in 66 epochs, as it has no hidden layer involved in its architecture.


Table 20: Performance metrics.

| AI technique     | Epoch | MAE    | MARE   | RMSE   | Std error | Accuracy (%) |
| Gradient descent | 162   | 0.0594 | 1.0930 | 0.0617 | 0.0048    | 94.04        |
| LM               | 13    | 0.0023 | 1.1203 | 0.0308 | 0.0041    | 90.49        |
| RBFN basic       | --    | 0.0279 | 0.3875 | 0.0573 | 0.006     | 97.27        |
| RBFN gradient    | 41    | 0.0207 | 0.2316 | 0.0323 | 0.0041    | 97.24        |
| RBFN hybrid      | 14    | 0.0614 | 0.1032 | 0.0316 | 0.0013    | 98.47        |
| FLANN            | 66    | 0.0304 | 0.7097 | 0.0390 | 0.0050    | 96.37        |

The performance of PNN is shown in Figure 17. The highest prediction accuracy was obtained for a smoothing parameter value of 1.7, at which PNN achieved its classification rate of 86.41%.

RBFN using the hybrid learning model gives the lowest values for MAE, MARE, and RMSE together with a high accuracy rate. Hence, from the results obtained by using ANN techniques, it can be concluded that the RBFN hybrid approach obtained the best fault prediction rate in the fewest epochs when compared with the other three ANN techniques.

7. Conclusion

The use of prediction models by system analysts to classify fault-prone classes as faulty or not faulty is a pressing need for researchers as well as practitioners, so more reliable approaches for prediction need to be modeled. In this paper, two families of approaches, namely, statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data calls for a good prediction model.

This paper presents a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Also, machine learning techniques such as the artificial neural network (gradient descent and Levenberg Marquardt methods), the functional link artificial neural network, the radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and probabilistic neural network techniques were applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, out of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in fewer epochs (14 iterations) when compared with the other three neural network techniques.

In the future, this work should be replicated on other open-source projects, such as Mozilla, using different AI techniques to analyze which model performs better in achieving higher accuracy for fault prediction. Also, fault prediction accuracy should be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52–60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1–8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics: an analysis of dependencies," in Proceedings of the Workshop Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241–262, 1995.

[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90–99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the IEEE 19th International Conference on Software Engineering, Association for Computing Machinery, pp. 412–421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35–40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242–249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902–909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13–26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45–50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.

[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "An application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSEST '00), pp. 85–91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395–417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1–6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1–7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402–419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39–62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168–175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1–4, 2013.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

8 ISRN Software Engineering

Table 5 Distribution of bugs for AIF version 16

Number of classes Percentageof bugs

Number ofassociated bugs

777 805181 0101 104663 132 33161 216 16580 314 14508 46 06218 52 02073 63 03109 75 05181 81 01036 91 01036 103 03109 111 01036 131 01036 171 01036 181 01036 28965 10000 142

Value

o

f cla

ss co

ntai

ning

sam

e val

ue 14

12

10

8

6

4

2

0minus20 0 20 40 60 80 100 120 140 160 180

Figure 5 WMC of AIF version 16

6 Results and Analysis

In this section the relationship between value of metrics andthe fault found in a class is determined In this approachthe comparative study involves using six CK metrics as inputnodes and the output is the achieved fault prediction rateFault prediction is performed for AIF version 16

61 Fault Data To perform statistical analysis bugs werecollected fromPromise data repository [18] Table 5 shows thedistribution of bugs based on the number of occurrences (interms of percentage of class containing number of bugs) forAIF version 16

AIF version 16 contains 965 numbers of classes inwhich 777 classes contain zero bugs (805181) 104663 ofclasses contain at least one bug 33161 of classes containa minimum of two bugs 16580 of classes contain threebugs 14508 of classes contain four bugs 06218 of classescontain five bugs 02073 of the classes contain six bugs

Value

o

f cla

ss co

ntai

ning

sam

e val

ue 50

40

45

30

35

20

25

10

5

15

00 1 2 3 4 5 6

Figure 6 DIT of AIF version 16

Value

o

f cla

ss co

ntai

ning

sam

e val

ue

90

80

70

60

50

40

30

20

10

0minus5 0 5 10 15 20 25 30 35 40

Figure 7 NOC of AIF version 16

03109 of classes contain seven and eleven bugs 05181 ofclasses contain eight bugs and 01036 of the class containnine thirteen seventeen eighteen and twenty-eight bugs

62 Metrics Data CK metric values for WMC DIT NOCCBO RFC and LCOM respectively for AIF version 16 aregraphically represented in Figures 5 6 7 8 9 and 10

63 Descriptive Statistics and Correlation Analysis This sub-section gives the comparative analysis of the fault datadescriptive statistics of classes and the correlation among thesix metrics with that of Basili et al [1] Basili et al studiedobject-oriented systems written in C++ language They car-ried out an experiment in which they set up eight projectgroups each consisting of three students Each group hadthe same task of developing smallmedium-sized softwaresystem Since all the necessary documentation (for instancereports about faults and their fixes) were available they couldsearch for relationships between fault density and metricsThey used the same CK metric suite Logistic regression wasemployed to analyze the relationship betweenmetrics and thefault proneness of classes

The obtained CK metric values of AIF version 16 arecompared with the results of Basili et al [1] In comparisonwith Basili the total number of classes considered is muchgreater that is 965 classes were considered (Vs 180) Table 6shows the comparative statistical analysis results obtained for

ISRN Software Engineering 9

Table 6 Descriptive statistics of classes

WMC DIT NOC CBO RFC LCOMBasili et al [1]

Max 9900 900 10500 1300 3000 42600Min 100 000 000 000 000 000Median 950 000 1950 000 500 000Mean 1340 132 3391 023 680 970Std Dev 1490 199 3337 154 756 6377

AIF version 16Max 16600 600 3900 44800 32200 13617Min 000 000 000 000 000 000Median 500 100 000 700 1400 400Mean 857 195 0052 1110 2142 7933Std Dev 1120 127 263 2252 2500 52375

Value

o

f cla

ss co

ntai

ning

sam

e val

ue

10

9

8

7

6

5

4

3

2

1

00 50 100 150 200 250 300 350 400 450minus50

Figure 8 CBO of AIF version 16

Value

o

f cla

ss co

ntai

ning

sam

e val

ue 7

6

5

4

3

2

1

0minus50 0 50 100 150 200 250 300 350

Figure 9 RFC of AIF version 16

Basili et al andAIF version 16 forCKmetrics indicatingMaxMin Median and Standard deviation

The dependency between CK metrics is computed usingPearsonrsquos correlations (1198772 coefficient of determination) andcompared with Basili et al [1] for AIF version 16 Thecoefficient of determination 1198772 is useful because it gives theproportion of the variance (fluctuation) of one variable that ispredictable from the other variable It is ameasure that allowsa researcher to determine how certain one can be in makingpredictions from a certain modelgraph Table 7 shows thePearsonrsquos correlations for the data set used by Basili et al [1]and the correlation metrics of AIF version 16

Value

o

f cla

ss co

ntai

ning

sam

e val

ue

30

25

20

15

10

5

0minus200 0 2000 4000 6000 8000 10000 12000 14000

Figure 10 LCOM of AIF version 16

From Table 7 wrt AIF version 16 it is observed thatcorrelation between WMC and RFC is 077 which is highlycorrelated that is these two metrics are very much linearlydependent on each other Similarly correlation betweenWMC and DIT is 0 which indicates that they are looselycorrelated that is there is no dependency between these twometrics

64 Fault Prediction Using Statistical Methods

641 Linear Regression Analysis Table 8 shows resultsobtained for linear regression analysis in which the fault isconsidered as the dependent variable and the CK metrics arethe independent variables

ldquo119877rdquo represents the coefficient of correlation ldquo119875rdquo refers tothe significance of the metric value If 119875 lt 0001 then themetrics are of very great significance in fault prediction

642 Logistic Regression Analysis The logistic regressionmethod helps to indicate whether a class is faulty or notbut does not convey anything about the possible numberof faults in the class Univariate and multivariate logisticregression techniques are applied to predict whether the

10 ISRN Software Engineering

Table 7 Correlations between metrics

WMC DIT NOC CBO RFC LCOMBasili et al [1]

WMC 100 002 024 000 013 038DIT 100 000 000 000 001NOC 100 000 000 000CBO 100 031 001RFC 100 009LCOM 100

AIF version 16WMC 100 000 003 010 077 060DIT 100 000 000 000 001NOC 100 0024 0025 0027CBO 100 008 005RFC 100 042LCOM 100

Table 8 Linear regression analysis

Version 119877 119875 value Std error12 05360 0000 0111414 05024 0000 0145016 05154 0000 00834

1

08

06

04

02

0

minus4 minus3 minus2 minus1 0 1 2 3 4

1(1 + exp(minusq))

Figure 11 Logistic graph

class is faulty or not Univariate regression analysis is usedto examine the effect of each metric on fault of the classwhile multivariate regression analysis is used to examine thecommon effectiveness of metrics on fault of the class Theresults of three versions of AIF are compared consideringthese two statistical techniques Figure 11 shows the typicalldquo119878rdquo curve obtained (similar to Sigmoid function) for the AIFversion 16 using multivariate logistic regression Tables 9and 10 contain the tabulated values for the results obtainedby applying univariate and multivariate regression analysisrespectively

From Table 9 it can be observed that all metrics of CKsuite are highly significant except for DIT The 119875 value forthe three versions (wrt DIT) is 0335 0108 and 03527respectively Higher values of ldquo119875rdquo are an indication of lesssignificance

Univariate and multivariate logistic regression statisticalmethods were used for classifying a class as faulty or notfaulty Logistic regression was applied with a threshold value05 that is120587 gt 05 indicates that a class is classified as ldquofaultyrdquootherwise it is categorized as ldquonot faultyrdquo class

Tables 11 and 12 represent the confusion matrix fornumber of classes with faults before and after applyingregression analysis respectively for AIF version 16 FromTable 11 it is clear that before applying the logistic regressiona total number of 777 classes contained zero bugs and 188classes contained at least one bug After applying logisticregression (Table 12) a total of 767 + 16 classes are classifiedcorrectly with accuracy of 8113

The performance parameters of all three versions of theAIF are shown in Table 13 obtained by applying univariateand multivariate logistic regression analysis Here precisioncorrectness completeness and accuracy [1 13 27 28] aretaken as a performance parameters By using multivariatelogistic regression accuracy of AIF version 12 is found to be6444 accuracy of AIF version 14 is 8337 and that of AIFversion 16 is 8113

From the results obtained by applying linear and logisticregression analysis it is found that out of the six metricsWMC appears to have more impact in predicting faults

65 Fault Prediction Using Neural Network

651 Artificial Neural Network ANN is an interconnectedgroup of nodes In this paper three layers of ANN areconsidered in which six nodes act as input nodes nine nodesrepresent the hidden nodes and one node acts as outputnode

ANN is a three-phase network the phases are used forlearning validation and testing purposes So in this article70 of total input pattern is considered for learning phase15 for validation and the rest 15 for testingThe regressionanalysis carried out classifies whether a class is faulty or notfaulty The prediction models of ANN and its forms such as

ISRN Software Engineering 11

Table 9 Analysis of univariate regression

Coefficient Constant 119875 value 119877 valuever 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 0028 005 003 minus083 minus211 minus177 00013 00007 000 0130 0240 018DIT minus0067 010 005 minus046 minus183 minus153 0335 0108 03257 minus0039 0054 002NOC 0137 009 013 minus066 minus167 minus150 00007 000 000 0136 013 016CBO 0011 001 002 minus071 minus180 minus166 0017 000 000 0096 015 017RFC 0012 002 001 minus086 minus215 minus179 00014 000 000 0130 023 017LCOM 0007 0007 0007 minus064 minus167 minus148 00349 00004 00007 0085 011 011

Table 10 Multivariate logistic regression analysis

CoefficientAIF version 12 AIF version 14 AIF version 16

WMC 00195 00574 00320DIT minus0041 0000 0000NOC 01231 0000 0000CBO 0005 0008 0001RFC 00071 00081 00109LCOM 0 minus0001 0Constant minus0917 minus2785 minus2157

Table 11 Before applying regression

Not-faulty FaultyNot-Faulty 777 0Faulty 188 0

Table 12 After applying regression

Not-faulty FaultyNot-Faulty 767 10Faulty 172 16

PNN RBFN and FLANN not only classify the class as faultyor not faulty but also highlight the number of bugs foundin the class and these bugs are fixed in the testing phase ofsoftware development life cycle

In this paper six CKmetrics are taken as input and outputis the fault prediction accuracy rate required for developingthe software The network is trained using Gradient descentmethod and Levenberg Marquardt method

(a) Gradient Descent Method Gradient descent method isused for updating the weights using (15) and (16) Table 14shows the performance metrics of AIF version 16 Figure 12shows the graph plot for variation ofmean square error valueswrt no of epoch (or iteration) for AIF version 16

(b) Levenberg Marquardt Method Levenberg Marquardtmethod [21 22] is a technique for updating weights In caseof Gradient descent method learning rate 120572 is constant but inLevenbergMarquardt method learning rate 120572 varies in everyiteration So this method consumes less number of iterations

Mea

n sq

uare

erro

r

07

06

05

04

03

02

01

00 20 40 60 80 100 120 140 160 180

Number of iterations

Figure 12 MSE versus number of epoch wrt Gradient descent NN

Mea

n sq

uare

erro

r

Number of iterations

02

015

01

005

00 2 4 6 8 10 12 14

Figure 13 MSE versus number of epoch wrt Levenberg-marquardtNN

to train the network Table 15 shows the performance metricsfor AIF version 16 using Levenberg Marquardt method

Figure 13 shows the graph plot for variation of meansquare error values wrt number of epoch for AIF version 16

12 ISRN Software Engineering

Table 13 Precision correctness completeness and accuracy for three versions of AIF

Precision () Correctness () Completeness () Accuracy ()ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 6111 4117 5714 6111 4117 5714 509 482 425 6613 8402 8171DIT mdash mdash mdash mdash mdash mdash 0 0 0 6447 8337 8051NOC 75 75 6666 75 75 6666 555 206 531 6578 836 8103CBO 60 5714 7777 60 5714 7777 277 275 372 648 8348 8103RFC 6666 3636 50 6666 3636 50 462 275 212 6529 8302 8051LCOM 6666 50 60 066 05 06 277 68 159 6496 8337 8062MULTI 6875 50 6153 6875 50 6153 1018 758 851 6644 8337 8113

Table 14 Accuracy prediction using gradient descent NN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00594 1093 00617 minus02038 00044 00048 940437

Table 15 Accuracy prediction using Levenberg Marquardt

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00023 11203 00308 minus02189 00022 00041 904977

Table 16 Accuracy prediction using FLANN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00304 07097 00390 03308 24601119890 minus 06 00050 963769

Table 17 Accuracy prediction using basic RBFN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00279 03875 00573 01969 0059 0006 972792

652 Functional Link Artificial Neural Network (FLANN)FLANN architecture for software fault prediction is a singlelayer feed-forward neural network consisting of an input andoutput layer FLANN does not incorporate any hidden layerand hence has less computational cost In this paper adaptivealgorithm has been used for updating the weights as shownin (21) Figure 14 shows the variation of mean square valuesagainst number of epochs for AIF version 16 Table 16 showsthe performance metrics of FLANN

653 Radial Basis Function Network In this paper Gaussianradial function is used as a radial function Gradient descentlearning and hybrid learning methods are used for updatingthe centers and weights respectively

Three layered RBFN has been considered in which sixCK metrics are taken as input nodes nine hidden centers aretaken as hidden nodes and output is the fault prediction rateTable 17 shows the performance metrics for AIF version 16

(a) Gradient Descent Learning Method Equations (25) and(26) are used for updating center and weight during trainingphase After simplifying (25) the equation is represented as

119862119894119895(119896 + 1) = 119862

119894119895(119896) minus 120578

1(1199101015840minus 119910)119882119894

120601119894

1205902(119909119895minus 119862119894119895(119896)) (39)

and the modified Equation (26) is formulated as

119882119894(119896 + 1) = 119882

119894(119896) + 120578

2(1199101015840minus 119910) 120601

119894 (40)

where 120590 is the width of the center and 119896 is the currentiteration number Table 18 shows the performancemetrics forAIF version 16 Figure 15 indicates the variation of MSE wrtnumber of epochs

(b) Hybrid Learning Method In Hybrid learning methodcenters are updated using (27) while weights are updatedusing supervised learning methods In this paper least meansquare error (LMSE) algorithm is used for updating theweights Table 19 shows the performance matrix for AIFversion 16 Figure 16 shows the graph for variation of MSEversus number of epochs

654 Probabilistic Neural Network (PNN) As mentioned inSection 424 PNN is a multilayered feed-forward networkwith four layers such as input hidden summation andoutput layer

In PNN 50 of faulty and nonfaulty classes are takenas input for hidden layers Gaussian elimination (28) isused as a hidden node function The summation layers sum

ISRN Software Engineering 13

Table 18 Accuracy prediction using RBFN gradient

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00207 02316 00323 03041 16302119890 minus 05 00041 972475

Table 19 Accuracy prediction using hybrid RBFN

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00614 01032 00316 09184 31834119890 minus 79 00013 984783

Number of iterations

Mea

n sq

uare

erro

r

0 10 20 30 40 50 60 700

01

02

03

04

05

06

07

08

09

Figure 14 Graph plot for MSE versus number of iterations (epoch)wrt FLANN

Number of iterations

Mea

n sq

uare

erro

r 0015

001

005

00 5 10 15 20 25 30 35 40 45

Figure 15 MSE versus number of epochs wrt gradient RBFN

contribution of each class of input patterns and producea net output which is a vector of probabilities The outputpattern having maximum summation value is classified intorespective class Figure 17 shows the variation of accuracy fordifferent values of smoothing parameter

66 Comparison Table 20 shows the tabulated results forthe obtained performance parameter values number ofepochs and accuracy rate by applying three neural networktechniques This performance table is an indication of betterfault prediction model In this comparative analysis theperformance parameter mean square error (MSE) was takenas a criterion to compute the performance parameters (suchas MARE MSE number of epochs and accuracy rate)when four neural network techniques were applied Duringthis process the MSE value of 0002 was set a thresholdfor evaluation Based on the number of iterations and theaccuracy rate obtained by the respective NN technique bestprediction model was determined

Number of iterationsM

ean

squa

re er

ror

006

005

003

002

001

004

00 2 4 6 8 10 12 14

Figure 16 MSE versus number of epochs wrt hybrid RBFN

Smoothing parameter

Accu

racy

()

865

86

855

85

845

84

835

83

825

820 05 1 15 2 25 3 35 4 45 5

Figure 17 Accuracy rate versus smoothing parameter

From Table 20 it is evident that gradient NN methodobtained an accuracy rate of 9404 in 162 epochs (iter-ations) LM technique which is an improvised model ofANN obtained 904 accuracy rate This accuracy rate isless than gradient NN but this approach (LM method) tookonly 13 epochs PNN method achieved a classification rate of8641

The three types of RBFN namely basic RBFN gradientand hybrid methods obtained a prediction rate of 97279724 and 9847 respectively Considering the number ofepochs RBFN hybridmethod obtained better prediction rateof 9847 in only 14 epochs when compared with gradientmethod (41 epochs) and basic RBFN approaches

FLANN architecture obtained 9637 accuracy rate withless computational cost involved FLANN obtained accuracyrate in 66 epochs as it has no hidden layer involved in itsarchitecture

14 ISRN Software Engineering

Table 20 Performance metrics

Performance parametersAI technique Epoch MAE MARE RMSE Std Error AccuracyGradient descent 162 00594 10930 00617 00048 9404LM 13 00023 11203 00308 00041 9049RBFN basic mdash 00279 03875 00573 006 9727RBFN gradient 41 00207 02316 00323 00041 9724RBFN hybrid 14 00614 01032 00316 00013 9847FLANN 66 00304 07097 00390 00050 9637

The performance of PNN is shown in Figure 17 Highestaccuracy in prediction was obtained for smoothing parame-ter value of 17 PNN obtained a classification rate of 8641

RBFN using hybrid learning model gives the least valuesfor MAE MARE RMSE and high accuracy rate Hencefrom the obtained results by using ANN techniques it can beconcluded that RBFNhybrid approach obtained the best faultprediction rate in less number of epochswhen comparedwithother three ANN techniques

7 Conclusion

System analyst use of prediction models to classify faultprone classes as faulty or not faulty is the need of the dayfor researchers as well as practitioners So more reliableapproaches for prediction need to be modeled In this papertwo approaches namely statistical methods and machinelearning techniques were applied for fault prediction Theapplication of statistical and machine learning methods infault prediction requires enormous amount of data andanalyzing this huge amount of data is necessary with the helpof a better prediction model

This paper proposes a comparative study of differentprediction models for fault prediction for an open-sourceproject Fault prediction using statistical and machine learn-ing methods were carried out for AIF by coding in MATLABenvironment Statistical methods such as linear regressionand logistic regression were applied Also machine learningtechniques such as artificial neural network (gradient descentand Levenberg Marquardt methods) Functional link artifi-cial neural network radial basis function network (RBFNbasic RBFN gradient and RBFN hybrid) and probabilisticneural network techniques were applied for fault predictionanalysis

It can be concluded from the statistical regression analysisthat out of six CK metrics WMC appears to be more usefulin predicting faults Table 20 shows that hybrid approachof RBFN obtained better fault prediction in less number ofepochs (14 iterations) when compared with the other threeneural network techniques

In future work should be replicated to other open-sourceprojects like Mozilla using different AI techniques to analyzewhich model performs better in achieving higher accuracyfor fault prediction Also fault prediction accuracy should bemeasured by combining multiple computational intelligencetechniques

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751–761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52–60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.

[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1–8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics: an analysis of dependencies," in Proceedings of the Workshop Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241–262, 1995.

[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90–99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the IEEE 19th International Conference on Software Engineering, Association for Computing Machinery, pp. 412–421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35–40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "Empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242–249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902–909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13–26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45–50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.

[26] X. Yaun, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "Application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSEST '00), pp. 85–91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395–417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1–6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1–7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402–419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39–62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168–175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1–4, 2013.



Page 10: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

10 ISRN Software Engineering

Table 7 Correlations between metrics

WMC DIT NOC CBO RFC LCOMBasili et al [1]

WMC 100 002 024 000 013 038DIT 100 000 000 000 001NOC 100 000 000 000CBO 100 031 001RFC 100 009LCOM 100

AIF version 16WMC 100 000 003 010 077 060DIT 100 000 000 000 001NOC 100 0024 0025 0027CBO 100 008 005RFC 100 042LCOM 100

Table 8 Linear regression analysis

Version 119877 119875 value Std error12 05360 0000 0111414 05024 0000 0145016 05154 0000 00834

1

08

06

04

02

0

minus4 minus3 minus2 minus1 0 1 2 3 4

1(1 + exp(minusq))

Figure 11 Logistic graph

class is faulty or not Univariate regression analysis is usedto examine the effect of each metric on fault of the classwhile multivariate regression analysis is used to examine thecommon effectiveness of metrics on fault of the class Theresults of three versions of AIF are compared consideringthese two statistical techniques Figure 11 shows the typicalldquo119878rdquo curve obtained (similar to Sigmoid function) for the AIFversion 16 using multivariate logistic regression Tables 9and 10 contain the tabulated values for the results obtainedby applying univariate and multivariate regression analysisrespectively

From Table 9 it can be observed that all metrics of CKsuite are highly significant except for DIT The 119875 value forthe three versions (wrt DIT) is 0335 0108 and 03527respectively Higher values of ldquo119875rdquo are an indication of lesssignificance

Univariate and multivariate logistic regression statisticalmethods were used for classifying a class as faulty or notfaulty Logistic regression was applied with a threshold value05 that is120587 gt 05 indicates that a class is classified as ldquofaultyrdquootherwise it is categorized as ldquonot faultyrdquo class

Tables 11 and 12 represent the confusion matrix fornumber of classes with faults before and after applyingregression analysis respectively for AIF version 16 FromTable 11 it is clear that before applying the logistic regressiona total number of 777 classes contained zero bugs and 188classes contained at least one bug After applying logisticregression (Table 12) a total of 767 + 16 classes are classifiedcorrectly with accuracy of 8113

The performance parameters of all three versions of theAIF are shown in Table 13 obtained by applying univariateand multivariate logistic regression analysis Here precisioncorrectness completeness and accuracy [1 13 27 28] aretaken as a performance parameters By using multivariatelogistic regression accuracy of AIF version 12 is found to be6444 accuracy of AIF version 14 is 8337 and that of AIFversion 16 is 8113

From the results obtained by applying linear and logisticregression analysis it is found that out of the six metricsWMC appears to have more impact in predicting faults

65 Fault Prediction Using Neural Network

651 Artificial Neural Network ANN is an interconnectedgroup of nodes In this paper three layers of ANN areconsidered in which six nodes act as input nodes nine nodesrepresent the hidden nodes and one node acts as outputnode

ANN is a three-phase network the phases are used forlearning validation and testing purposes So in this article70 of total input pattern is considered for learning phase15 for validation and the rest 15 for testingThe regressionanalysis carried out classifies whether a class is faulty or notfaulty The prediction models of ANN and its forms such as

ISRN Software Engineering 11

Table 9 Analysis of univariate regression

Coefficient Constant 119875 value 119877 valuever 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 0028 005 003 minus083 minus211 minus177 00013 00007 000 0130 0240 018DIT minus0067 010 005 minus046 minus183 minus153 0335 0108 03257 minus0039 0054 002NOC 0137 009 013 minus066 minus167 minus150 00007 000 000 0136 013 016CBO 0011 001 002 minus071 minus180 minus166 0017 000 000 0096 015 017RFC 0012 002 001 minus086 minus215 minus179 00014 000 000 0130 023 017LCOM 0007 0007 0007 minus064 minus167 minus148 00349 00004 00007 0085 011 011

Table 10 Multivariate logistic regression analysis

CoefficientAIF version 12 AIF version 14 AIF version 16

WMC 00195 00574 00320DIT minus0041 0000 0000NOC 01231 0000 0000CBO 0005 0008 0001RFC 00071 00081 00109LCOM 0 minus0001 0Constant minus0917 minus2785 minus2157

Table 11 Before applying regression

Not-faulty FaultyNot-Faulty 777 0Faulty 188 0

Table 12 After applying regression

Not-faulty FaultyNot-Faulty 767 10Faulty 172 16

PNN RBFN and FLANN not only classify the class as faultyor not faulty but also highlight the number of bugs foundin the class and these bugs are fixed in the testing phase ofsoftware development life cycle

In this paper six CKmetrics are taken as input and outputis the fault prediction accuracy rate required for developingthe software The network is trained using Gradient descentmethod and Levenberg Marquardt method

(a) Gradient Descent Method Gradient descent method isused for updating the weights using (15) and (16) Table 14shows the performance metrics of AIF version 16 Figure 12shows the graph plot for variation ofmean square error valueswrt no of epoch (or iteration) for AIF version 16

(b) Levenberg Marquardt Method Levenberg Marquardtmethod [21 22] is a technique for updating weights In caseof Gradient descent method learning rate 120572 is constant but inLevenbergMarquardt method learning rate 120572 varies in everyiteration So this method consumes less number of iterations

Mea

n sq

uare

erro

r

07

06

05

04

03

02

01

00 20 40 60 80 100 120 140 160 180

Number of iterations

Figure 12 MSE versus number of epoch wrt Gradient descent NN

Mea

n sq

uare

erro

r

Number of iterations

02

015

01

005

00 2 4 6 8 10 12 14

Figure 13 MSE versus number of epoch wrt Levenberg-marquardtNN

to train the network Table 15 shows the performance metricsfor AIF version 16 using Levenberg Marquardt method

Figure 13 shows the graph plot for variation of meansquare error values wrt number of epoch for AIF version 16

12 ISRN Software Engineering

Table 13 Precision correctness completeness and accuracy for three versions of AIF

Precision () Correctness () Completeness () Accuracy ()ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 6111 4117 5714 6111 4117 5714 509 482 425 6613 8402 8171DIT mdash mdash mdash mdash mdash mdash 0 0 0 6447 8337 8051NOC 75 75 6666 75 75 6666 555 206 531 6578 836 8103CBO 60 5714 7777 60 5714 7777 277 275 372 648 8348 8103RFC 6666 3636 50 6666 3636 50 462 275 212 6529 8302 8051LCOM 6666 50 60 066 05 06 277 68 159 6496 8337 8062MULTI 6875 50 6153 6875 50 6153 1018 758 851 6644 8337 8113

Table 14 Accuracy prediction using gradient descent NN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00594 1093 00617 minus02038 00044 00048 940437

Table 15 Accuracy prediction using Levenberg Marquardt

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00023 11203 00308 minus02189 00022 00041 904977

Table 16 Accuracy prediction using FLANN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00304 07097 00390 03308 24601119890 minus 06 00050 963769

Table 17 Accuracy prediction using basic RBFN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00279 03875 00573 01969 0059 0006 972792

652 Functional Link Artificial Neural Network (FLANN)FLANN architecture for software fault prediction is a singlelayer feed-forward neural network consisting of an input andoutput layer FLANN does not incorporate any hidden layerand hence has less computational cost In this paper adaptivealgorithm has been used for updating the weights as shownin (21) Figure 14 shows the variation of mean square valuesagainst number of epochs for AIF version 16 Table 16 showsthe performance metrics of FLANN

653 Radial Basis Function Network In this paper Gaussianradial function is used as a radial function Gradient descentlearning and hybrid learning methods are used for updatingthe centers and weights respectively

Three layered RBFN has been considered in which sixCK metrics are taken as input nodes nine hidden centers aretaken as hidden nodes and output is the fault prediction rateTable 17 shows the performance metrics for AIF version 16

(a) Gradient Descent Learning Method Equations (25) and(26) are used for updating center and weight during trainingphase After simplifying (25) the equation is represented as

119862119894119895(119896 + 1) = 119862

119894119895(119896) minus 120578

1(1199101015840minus 119910)119882119894

120601119894

1205902(119909119895minus 119862119894119895(119896)) (39)

and the modified Equation (26) is formulated as

119882119894(119896 + 1) = 119882

119894(119896) + 120578

2(1199101015840minus 119910) 120601

119894 (40)

where 120590 is the width of the center and 119896 is the currentiteration number Table 18 shows the performancemetrics forAIF version 16 Figure 15 indicates the variation of MSE wrtnumber of epochs

(b) Hybrid Learning Method In Hybrid learning methodcenters are updated using (27) while weights are updatedusing supervised learning methods In this paper least meansquare error (LMSE) algorithm is used for updating theweights Table 19 shows the performance matrix for AIFversion 16 Figure 16 shows the graph for variation of MSEversus number of epochs

654 Probabilistic Neural Network (PNN) As mentioned inSection 424 PNN is a multilayered feed-forward networkwith four layers such as input hidden summation andoutput layer

In PNN 50 of faulty and nonfaulty classes are takenas input for hidden layers Gaussian elimination (28) isused as a hidden node function The summation layers sum

ISRN Software Engineering 13

Table 18 Accuracy prediction using RBFN gradient

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00207 02316 00323 03041 16302119890 minus 05 00041 972475

Table 19 Accuracy prediction using hybrid RBFN

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00614 01032 00316 09184 31834119890 minus 79 00013 984783

Number of iterations

Mea

n sq

uare

erro

r

0 10 20 30 40 50 60 700

01

02

03

04

05

06

07

08

09

Figure 14 Graph plot for MSE versus number of iterations (epoch)wrt FLANN

Number of iterations

Mea

n sq

uare

erro

r 0015

001

005

00 5 10 15 20 25 30 35 40 45

Figure 15 MSE versus number of epochs wrt gradient RBFN

contribution of each class of input patterns and producea net output which is a vector of probabilities The outputpattern having maximum summation value is classified intorespective class Figure 17 shows the variation of accuracy fordifferent values of smoothing parameter

66 Comparison Table 20 shows the tabulated results forthe obtained performance parameter values number ofepochs and accuracy rate by applying three neural networktechniques This performance table is an indication of betterfault prediction model In this comparative analysis theperformance parameter mean square error (MSE) was takenas a criterion to compute the performance parameters (suchas MARE MSE number of epochs and accuracy rate)when four neural network techniques were applied Duringthis process the MSE value of 0002 was set a thresholdfor evaluation Based on the number of iterations and theaccuracy rate obtained by the respective NN technique bestprediction model was determined

Number of iterationsM

ean

squa

re er

ror

006

005

003

002

001

004

00 2 4 6 8 10 12 14

Figure 16 MSE versus number of epochs wrt hybrid RBFN

Smoothing parameter

Accu

racy

()

865

86

855

85

845

84

835

83

825

820 05 1 15 2 25 3 35 4 45 5

Figure 17 Accuracy rate versus smoothing parameter

From Table 20 it is evident that gradient NN methodobtained an accuracy rate of 9404 in 162 epochs (iter-ations) LM technique which is an improvised model ofANN obtained 904 accuracy rate This accuracy rate isless than gradient NN but this approach (LM method) tookonly 13 epochs PNN method achieved a classification rate of8641

The three types of RBFN namely basic RBFN gradientand hybrid methods obtained a prediction rate of 97279724 and 9847 respectively Considering the number ofepochs RBFN hybridmethod obtained better prediction rateof 9847 in only 14 epochs when compared with gradientmethod (41 epochs) and basic RBFN approaches

FLANN architecture obtained 9637 accuracy rate withless computational cost involved FLANN obtained accuracyrate in 66 epochs as it has no hidden layer involved in itsarchitecture

14 ISRN Software Engineering

Table 20 Performance metrics

Performance parametersAI technique Epoch MAE MARE RMSE Std Error AccuracyGradient descent 162 00594 10930 00617 00048 9404LM 13 00023 11203 00308 00041 9049RBFN basic mdash 00279 03875 00573 006 9727RBFN gradient 41 00207 02316 00323 00041 9724RBFN hybrid 14 00614 01032 00316 00013 9847FLANN 66 00304 07097 00390 00050 9637

The performance of PNN is shown in Figure 17 Highestaccuracy in prediction was obtained for smoothing parame-ter value of 17 PNN obtained a classification rate of 8641

RBFN using hybrid learning model gives the least valuesfor MAE MARE RMSE and high accuracy rate Hencefrom the obtained results by using ANN techniques it can beconcluded that RBFNhybrid approach obtained the best faultprediction rate in less number of epochswhen comparedwithother three ANN techniques

7 Conclusion

System analyst use of prediction models to classify faultprone classes as faulty or not faulty is the need of the dayfor researchers as well as practitioners So more reliableapproaches for prediction need to be modeled In this papertwo approaches namely statistical methods and machinelearning techniques were applied for fault prediction Theapplication of statistical and machine learning methods infault prediction requires enormous amount of data andanalyzing this huge amount of data is necessary with the helpof a better prediction model

This paper proposes a comparative study of differentprediction models for fault prediction for an open-sourceproject Fault prediction using statistical and machine learn-ing methods were carried out for AIF by coding in MATLABenvironment Statistical methods such as linear regressionand logistic regression were applied Also machine learningtechniques such as artificial neural network (gradient descentand Levenberg Marquardt methods) Functional link artifi-cial neural network radial basis function network (RBFNbasic RBFN gradient and RBFN hybrid) and probabilisticneural network techniques were applied for fault predictionanalysis

It can be concluded from the statistical regression analysisthat out of six CK metrics WMC appears to be more usefulin predicting faults Table 20 shows that hybrid approachof RBFN obtained better fault prediction in less number ofepochs (14 iterations) when compared with the other threeneural network techniques

In future work should be replicated to other open-sourceprojects like Mozilla using different AI techniques to analyzewhich model performs better in achieving higher accuracyfor fault prediction Also fault prediction accuracy should bemeasured by combining multiple computational intelligencetechniques

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] V R Basili L C Briand and W L Melo ldquoA validationof object-oriented design metrics as quality indicatorsrdquo IEEETransactions on Software Engineering vol 22 no 10 pp 751ndash761 1996

[2] T J McCabe ldquoA Complexity Measurerdquo IEEE Transactions onSoftware Engineering vol 2 no 4 pp 308ndash320 1976

[3] M H Halstead Elements of Software Science Elsevier ScienceNew York NY USA 1977

[4] W Li and S Henry ldquoMaintenance metrics for the Object-Oriented paradigmrdquo in Proceedings of the 1st InternationalSoftware Metrics Symposium pp 52ndash60 1993

[5] S R Chidamber and C F Kemerer ldquoMetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[6] F B E Abreu andR Carapuca ldquoObject-Oriented software engi-neering measuring and controlling the development processrdquoin Proceedings of the 4th International Conference on SoftwareQuality pp 1ndash8 McLean Va USA October 1994

[7] M Lorenz and J Kidd Object-Oriented Software MetricsPrentice Hall Englewood NJ USA 1994

[8] R Martin ldquoOO design quality metricsmdashan analysis of depen-denciesrdquo in Proceedings of the Workshop Pragmatic and Theo-retical Directions in Object-Oriented Software Metrics (OOPSLArsquo94) 1994

[9] D P Tegarden S D Sheetz and D E Monarchi ldquoA softwarecomplexity model of object-oriented systemsrdquoDecision SupportSystems vol 13 no 3-4 pp 241ndash262 1995

[10] W Melo and F B E Abreu ldquoEvaluating the impact of object-oriented design on software qualityrdquo in Proceedings of the 3rdInternational Software Metrics Symposium pp 90ndash99 BerlinGermany March 1996

[11] L Briand P Devanbu and W Melo ldquoInvestigation intocoupling measures for C++rdquo in Proceedings of the IEEE 19thInternational Conference on Software EngineeringAssociation forComputing Machinery pp 412ndash421 May 1997

[12] L Etzkorn J Bansiya and C Davis ldquoDesign and code com-plexity metrics for OO classesrdquo Journal of Object-OrientedProgramming vol 12 no 1 pp 35ndash40 1999

[13] L C Briand JWust JWDaly andDVictor Porter ldquoExploringthe relationships between designmeasures and software qualityin object-oriented systemsrdquoThe Journal of Systems and Softwarevol 51 no 3 pp 245ndash273 2000

ISRN Software Engineering 15

[14] M-H Tang M-H Kao and M-H Chen ldquoEmpirical study onobject-oriented metricsrdquo in Proceedings of the 6th InternationalSoftware Metrics Symposium pp 242ndash249 November 1999

[15] K El Emam W Melo and J C Machado ldquoThe prediction offaulty classes using object-oriented design metricsrdquo Journal ofSystems and Software vol 56 no 1 pp 63ndash75 2001

[16] T M Khoshgoftaar E B Allen J P Hudepohl and S J AudldquoApplication of neural networks to software quality modeling ofa very large telecommunications systemrdquo IEEE Transactions onNeural Networks vol 8 no 4 pp 902ndash909 1997

[17] R Hochman T M Khoshgoftaar E B Allen and J PHudepohl ldquoEvolutionary neural networks a robust approachto software reliability problemsrdquo in Proceedings of the 8thInternational Symposium on Software Reliability Engineering(ISSRE rsquo97) pp 13ndash26 November 1997

[18] T Menzies B Caglayan E Kocaguneli J Krall F Peters andB Turhan ldquoThe PROMISE Repository of empirical softwareengineering datardquo West Virginia University Department ofComputer Science 2012 httppromisedatagooglecodecom

[19] Y Kumar Jain and S K Bhandare ldquoMin max normalizationbased data perturbation method for privacy protectionrdquo Inter-national Journal of Computer and Communication Technologyvol 2 no 8 pp 45ndash50 2011

[20] R Battiti ldquoFirst and Second-Order Methods for Learning bet-ween steepest descent and newtonrsquos methodrdquo Neural Computa-tion vol 4 no 2 pp 141ndash166 1992

[21] K Levenberg ldquoA method for the solution of certain non-linearproblems in least squaresrdquo Quarterly of Applied Mathematicsvol 2 no 2 pp 164ndash168 1944

[22] D W Marquardt ldquoAn algorithm for the lest-squares estimationof non-linear parametersrdquo SIAM Journal of Applied Mathemat-ics vol 11 no 2 pp 431ndash441 1963

[23] Y H Pao Adaptive Pattern Recognition and Neural NetworksAddison-Wesley Reading UK 1989

[24] D F Specht ldquoProbabilistic neural networksrdquo Neural Networksvol 3 no 1 pp 109ndash118 1990

[25] C Catal ldquoPerformance evaluation metrics for software faultprediction studiesrdquo Acta Polytechnica Hungarica vol 9 no 4pp 193ndash206 2012

[26] X Yaun T M Khoshgoftaar E B Allen and K GanesanldquoApplication of fuzzy clustering to software quality predictionrdquoin Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSESTrsquo00) pp 85ndash91 March 2000

[27] T Gyimothy R Ferenc and I Siket ldquoEmpirical validation ofobject-oriented metrics on open source software for fault pre-dictionrdquo IEEE Transactions on Software Engineering vol 31 no10 pp 897ndash910 2005

[28] G DenaroM Pezze and SMorasca ldquoTowards industrially rel-evant fault-proneness modelsrdquo International Journal of SoftwareEngineering and Knowledge Engineering vol 13 no 4 pp 395ndash417 2003

[29] S Kanmani and U V Rymend ldquoObject-Oriented softwarequality prediction using general regression neural networksrdquoSIGSOFT Software Engineering Notes vol 29 no 5 pp 1ndash62004

[30] N Nagappan and W Laurie ldquoEarly estimation of softwarequality using in-process testingmetrics a controlled case studyrdquoin Proceedings of the 3rd Workshop on Software Quality pp 1ndash7St Louis Mo USA 2005

[31] H M Olague L H Etzkorn S Gholston and S QuattlebaumldquoEmpirical validation of three software metrics suites to pre-dict fault-proneness of object-oriented classes developed usinghighly Iterative or agile software development processesrdquo IEEETransactions on Software Engineering vol 33 no 6 pp 402ndash4192007

[32] K K Aggarwal Y Singh A Kaur and R Malhotra ldquoEmpiricalanalysis for investigating the effect of object-oriented metricson fault proneness a replicated case studyrdquo Software ProcessImprovement and Practice vol 14 no 1 pp 39ndash62 2009

[33] F Wu ldquoEmpirical validation of object-oriented metrics onNASA for fault predictionrdquo in Proceedings of theInternationalConference on Advances in Information Technology and Educa-tion pp 168ndash175 2011

[34] H Kapila and S Singh ldquoAnalysis of CK metrics to predict soft-ware fault-proneness using bayesian inferencerdquo InternationalJournal of Computer Applications vol 74 no 2 pp 1ndash4 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

ISRN Software Engineering 11

Table 9 Analysis of univariate regression

Coefficient Constant 119875 value 119877 valuever 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 0028 005 003 minus083 minus211 minus177 00013 00007 000 0130 0240 018DIT minus0067 010 005 minus046 minus183 minus153 0335 0108 03257 minus0039 0054 002NOC 0137 009 013 minus066 minus167 minus150 00007 000 000 0136 013 016CBO 0011 001 002 minus071 minus180 minus166 0017 000 000 0096 015 017RFC 0012 002 001 minus086 minus215 minus179 00014 000 000 0130 023 017LCOM 0007 0007 0007 minus064 minus167 minus148 00349 00004 00007 0085 011 011

Table 10 Multivariate logistic regression analysis

CoefficientAIF version 12 AIF version 14 AIF version 16

WMC 00195 00574 00320DIT minus0041 0000 0000NOC 01231 0000 0000CBO 0005 0008 0001RFC 00071 00081 00109LCOM 0 minus0001 0Constant minus0917 minus2785 minus2157

Table 11: Before applying regression.

             Predicted not-faulty   Predicted faulty
Not-faulty   777                    0
Faulty       188                    0

Table 12: After applying regression.

             Predicted not-faulty   Predicted faulty
Not-faulty   767                    10
Faulty       172                    16

PNN, RBFN, and FLANN not only classify a class as faulty or not faulty but also indicate the number of bugs found in the class; these bugs are then fixed in the testing phase of the software development life cycle.

In this paper, the six CK metrics are taken as input, and the output is the fault prediction accuracy rate required for developing the software. The network is trained using the gradient descent method and the Levenberg-Marquardt method.

(a) Gradient Descent Method. The gradient descent method is used for updating the weights using (15) and (16). Table 14 shows the performance metrics of AIF version 1.6. Figure 12 shows the variation of mean square error values with respect to the number of epochs (iterations) for AIF version 1.6.
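Since (15) and (16) are defined earlier in the paper and are not reproduced here, the sketch below illustrates only a generic delta-rule form of this training loop, assuming a single sigmoid output unit; the names X (normalized CK-metric matrix), y (fault labels), and alpha are illustrative placeholders rather than the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal sketch of gradient descent training, assuming a single
# sigmoid output unit over the six CK metrics (not the paper's
# exact equations (15)-(16)).
def train_gradient_descent(X, y, alpha=0.01, epochs=162):
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])   # one weight per CK metric
    b = 0.0
    mse_history = []
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)               # network output
        err = y_hat - y                          # prediction error
        grad = y_hat * (1.0 - y_hat) * err       # sigmoid derivative term
        w -= alpha * X.T @ grad / len(y)         # weight update
        b -= alpha * grad.mean()                 # bias update
        mse_history.append(np.mean(err ** 2))    # the curve plotted in Figure 12
    return w, b, mse_history
```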

(b) Levenberg-Marquardt Method. The Levenberg-Marquardt method [21, 22] is a technique for updating weights. In the gradient descent method, the learning rate α is constant, whereas in the Levenberg-Marquardt method the learning rate α varies in every iteration. Consequently, this method requires fewer iterations to train the network.
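As a hedged illustration of why the learning rate effectively varies per iteration, the sketch below implements one standard Levenberg-Marquardt step with a multiplicatively adapted damping term mu; the helpers residuals() and jacobian() are assumed for the network's error vector and its Jacobian, and are not taken from the paper.

```python
import numpy as np

# One Levenberg-Marquardt step for a least-squares objective:
# w <- w - (J^T J + mu I)^(-1) J^T r, with mu adapted each iteration.
def lm_step(w, residuals, jacobian, mu):
    r = residuals(w)                       # current error vector
    J = jacobian(w)                        # d r / d w
    A = J.T @ J + mu * np.eye(len(w))      # damped Gauss-Newton matrix
    delta = np.linalg.solve(A, J.T @ r)    # LM update direction
    w_new = w - delta
    # accept the step and relax the damping if the error decreased,
    # otherwise reject it and increase the damping
    if np.sum(residuals(w_new) ** 2) < np.sum(r ** 2):
        return w_new, mu * 0.1
    return w, mu * 10.0
```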

Figure 12: MSE versus number of epochs for the gradient descent NN.

Figure 13: MSE versus number of epochs for the Levenberg-Marquardt NN.

Table 15 shows the performance metrics for AIF version 1.6 using the Levenberg-Marquardt method.

Figure 13 shows the variation of mean square error values with respect to the number of epochs for AIF version 1.6.


Table 13: Precision, correctness, completeness, and accuracy for three versions of AIF.

          Precision (%)                Correctness (%)              Completeness (%)             Accuracy (%)
          ver 1.2  ver 1.4  ver 1.6    ver 1.2  ver 1.4  ver 1.6    ver 1.2  ver 1.4  ver 1.6    ver 1.2  ver 1.4  ver 1.6
WMC       61.11    41.17    57.14      61.11    41.17    57.14      5.09     4.82     4.25       66.13    84.02    81.71
DIT       —        —        —          —        —        —          0        0        0          64.47    83.37    80.51
NOC       75       75       66.66      75       75       66.66      5.55     2.06     5.31       65.78    83.6     81.03
CBO       60       57.14    77.77      60       57.14    77.77      2.77     2.75     3.72       64.8     83.48    81.03
RFC       66.66    36.36    50         66.66    36.36    50         4.62     2.75     2.12       65.29    83.02    80.51
LCOM      66.66    50       60         0.66     0.5      0.6        2.77     6.8      1.59       64.96    83.37    80.62
MULTI     68.75    50       61.53      68.75    50       61.53      10.18    7.58     8.51       66.44    83.37    81.13

Table 14: Accuracy prediction using gradient descent NN.

MAE      MARE     RMSE     R         P value   Std error   Accuracy (%)
0.0594   1.093    0.0617   -0.2038   0.0044    0.0048      94.0437

Table 15: Accuracy prediction using Levenberg-Marquardt.

MAE      MARE     RMSE     R         P value   Std error   Accuracy (%)
0.0023   1.1203   0.0308   -0.2189   0.0022    0.0041      90.4977

Table 16: Accuracy prediction using FLANN.

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0304   0.7097   0.0390   0.3308   2.4601e-06   0.0050      96.3769

Table 17: Accuracy prediction using basic RBFN.

MAE      MARE     RMSE     R        P value   Std error   Accuracy (%)
0.0279   0.3875   0.0573   0.1969   0.059     0.006       97.2792

6.5.2. Functional Link Artificial Neural Network (FLANN). The FLANN architecture for software fault prediction is a single-layer feed-forward neural network consisting of an input and an output layer. FLANN does not incorporate any hidden layer and hence has a lower computational cost. In this paper, an adaptive algorithm has been used for updating the weights, as shown in (21). Figure 14 shows the variation of mean square error values against the number of epochs for AIF version 1.6. Table 16 shows the performance metrics of FLANN.
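The sketch below illustrates the FLANN idea under stated assumptions: the six CK-metric inputs are functionally expanded with a trigonometric expansion in the spirit of Pao [23], and a single weight layer is trained with an adaptive LMS-type rule. Equation (21) is not reproduced here, so the exact update is assumed rather than quoted from the paper.

```python
import numpy as np

def expand(x):
    # each input feature x_i becomes [x_i, sin(pi x_i), cos(pi x_i)]
    return np.concatenate([x, np.sin(np.pi * x), np.cos(np.pi * x)])

# FLANN-style training: functional expansion replaces the hidden
# layer, and a single linear layer is adapted with an LMS rule.
def train_flann(X, y, alpha=0.05, epochs=66):
    Phi = np.array([expand(x) for x in X])       # expanded input patterns
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi, target in zip(Phi, y):
            y_hat = phi @ w                      # single-layer output
            w += alpha * (target - y_hat) * phi  # adaptive weight update
    return w
```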

6.5.3. Radial Basis Function Network. In this paper, a Gaussian radial function is used as the radial basis function. Gradient descent learning and hybrid learning methods are used for updating the centers and weights.

A three-layered RBFN has been considered, in which the six CK metrics are taken as input nodes, nine hidden centers are taken as hidden nodes, and the output is the fault prediction rate. Table 17 shows the performance metrics for AIF version 1.6.

(a) Gradient Descent Learning Method. Equations (25) and (26) are used for updating the center and weight during the training phase. After simplifying (25), the update rule is represented as

$$C_{ij}(k+1) = C_{ij}(k) - \eta_1 \left(y' - y\right) W_i \frac{\phi_i}{\sigma^2} \left(x_j - C_{ij}(k)\right), \tag{39}$$

and the modified equation (26) is formulated as

$$W_i(k+1) = W_i(k) + \eta_2 \left(y' - y\right) \phi_i, \tag{40}$$

where $\sigma$ is the width of the center and $k$ is the current iteration number. Table 18 shows the performance metrics for AIF version 1.6. Figure 15 indicates the variation of MSE with respect to the number of epochs.
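A minimal sketch of one training step under (39) and (40) follows, assuming Gaussian basis functions and reading y' as the desired output and y as the network output (so that (40) is error-correcting); shapes and learning rates are illustrative.

```python
import numpy as np

# One RBFN gradient step: centers C (n_centers x n_features) updated
# per (39), output weights W (n_centers,) updated per (40), for a single
# CK-metric pattern x with desired output target.
def rbfn_gradient_step(x, target, C, W, sigma, eta1=0.01, eta2=0.05):
    d = x[None, :] - C                          # (x_j - C_ij), per center
    phi = np.exp(-np.sum(d ** 2, axis=1) / (2.0 * sigma ** 2))  # Gaussian basis
    y = phi @ W                                 # network output
    err = target - y                            # (y' - y) in (39) and (40)
    C -= eta1 * err * (W * phi / sigma ** 2)[:, None] * d   # center update (39)
    W += eta2 * err * phi                       # weight update (40)
    return C, W
```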

(b) Hybrid Learning Method. In the hybrid learning method, centers are updated using (27), while weights are updated using a supervised learning method. In this paper, the least mean square error (LMSE) algorithm is used for updating the weights. Table 19 shows the performance metrics for AIF version 1.6. Figure 16 shows the variation of MSE versus the number of epochs.
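Since (27) is not reproduced here, the following sketch assumes a k-means-style winner-take-all rule as a stand-in for the unsupervised center update, followed by the supervised LMSE weight update on the resulting Gaussian features; center count and step sizes are illustrative.

```python
import numpy as np

def nearest_center(x, C):
    return np.argmin(np.sum((C - x) ** 2, axis=1))

# Hybrid RBFN: unsupervised center placement, then supervised LMS
# training of the output weights on the fixed Gaussian features.
def train_hybrid_rbfn(X, y, n_centers=9, sigma=1.0, alpha=0.05, epochs=14):
    rng = np.random.default_rng(0)
    C = X[rng.choice(len(X), n_centers, replace=False)].copy()
    for x in X:                                 # move winning center toward x
        j = nearest_center(x, C)
        C[j] += 0.1 * (x - C[j])
    Phi = np.exp(-((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
                 / (2.0 * sigma ** 2))          # (patterns, centers)
    w = np.zeros(n_centers)
    for _ in range(epochs):
        for phi, target in zip(Phi, y):
            w += alpha * (target - phi @ w) * phi   # LMS weight update
    return C, w
```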

6.5.4. Probabilistic Neural Network (PNN). As mentioned in Section 4.2.4, PNN is a multilayered feed-forward network with four layers: input, hidden, summation, and output.

In PNN, 50% of the faulty and nonfaulty classes are taken as input for the hidden layers. A Gaussian function (28) is used as the hidden node function.


Table 18: Accuracy prediction using RBFN gradient.

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0207   0.2316   0.0323   0.3041   1.6302e-05   0.0041      97.2475

Table 19: Accuracy prediction using hybrid RBFN.

MAE      MARE     RMSE     R        P value      Std error   Accuracy (%)
0.0614   0.1032   0.0316   0.9184   3.1834e-79   0.0013      98.4783

Figure 14: MSE versus number of iterations (epochs) for FLANN.

Figure 15: MSE versus number of epochs for gradient RBFN.

The summation layer sums the contribution of each class of input patterns and produces a net output, which is a vector of probabilities. The output pattern having the maximum summation value is classified into the respective class. Figure 17 shows the variation of accuracy for different values of the smoothing parameter.
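A minimal PNN classifier in the spirit of Specht [24] is sketched below: Gaussian kernels are placed on the stored training patterns, the kernel responses are summed per class, and the class with the maximum (averaged) summation wins. The default smoothing value of 1.7 mirrors the best-performing value reported with Figure 17; the function and argument names are illustrative.

```python
import numpy as np

# PNN decision for one query pattern x: hidden layer = Gaussian kernels
# on stored patterns, summation layer = class-wise sum, output = argmax.
def pnn_classify(x, patterns, labels, smoothing=1.7):
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    best_cls, best_score = None, -np.inf
    for cls in np.unique(labels):
        P = patterns[labels == cls]
        k = np.exp(-np.sum((P - x) ** 2, axis=1) / (2.0 * smoothing ** 2))
        score = k.sum() / len(P)        # averaged class-wise summation
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```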

6.6. Comparison. Table 20 tabulates the performance parameter values, number of epochs, and accuracy rate obtained by applying the neural network techniques. This performance table is an indication of the better fault prediction model. In this comparative analysis, mean square error (MSE) was taken as the criterion: an MSE value of 0.002 was set as the threshold, and the performance parameters (such as MAE, MARE, number of epochs, and accuracy rate) were computed for each neural network technique. Based on the number of iterations and the accuracy rate obtained by the respective NN technique, the best prediction model was determined.
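For reference, the error measures reported in Tables 14-20 can be computed as sketched below; MARE is taken here as the mean absolute relative error, since the paper's exact normalization is not reproduced in this section, so that form is an assumption.

```python
import numpy as np

# Error measures used to compare the models: MAE, MARE, and RMSE.
def error_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                  # mean absolute error
    mare = np.mean(np.abs(err) /
                   np.maximum(np.abs(y_true), 1e-12))  # mean absolute relative error
    rmse = np.sqrt(np.mean(err ** 2))           # root mean square error
    return mae, mare, rmse
```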

Figure 16: MSE versus number of epochs for hybrid RBFN.

Figure 17: Accuracy rate versus smoothing parameter.

From Table 20 it is evident that the gradient descent NN method obtained an accuracy rate of 94.04% in 162 epochs (iterations). The LM technique, an improved variant of ANN training, obtained a 90.49% accuracy rate. This accuracy rate is lower than that of the gradient descent NN, but the LM method took only 13 epochs. The PNN method achieved a classification rate of 86.41%.

The three types of RBFN, namely, basic RBFN, gradient, and hybrid methods, obtained prediction rates of 97.27%, 97.24%, and 98.47%, respectively. Considering the number of epochs, the RBFN hybrid method obtained the better prediction rate of 98.47% in only 14 epochs when compared with the gradient method (41 epochs) and the basic RBFN approach.

The FLANN architecture obtained a 96.37% accuracy rate with less computational cost involved; FLANN reached this accuracy in 66 epochs, as it has no hidden layer in its architecture.


Table 20: Performance metrics.

AI technique       Epoch   MAE      MARE     RMSE     Std error   Accuracy (%)
Gradient descent   162     0.0594   1.0930   0.0617   0.0048      94.04
LM                 13      0.0023   1.1203   0.0308   0.0041      90.49
RBFN basic         —       0.0279   0.3875   0.0573   0.006       97.27
RBFN gradient      41      0.0207   0.2316   0.0323   0.0041      97.24
RBFN hybrid        14      0.0614   0.1032   0.0316   0.0013      98.47
FLANN              66      0.0304   0.7097   0.0390   0.0050      96.37

The performance of PNN is shown in Figure 17. The highest prediction accuracy was obtained for a smoothing parameter value of 1.7. PNN obtained a classification rate of 86.41%.

RBFN using the hybrid learning model gives the lowest MARE and standard error values together with the highest accuracy rate. Hence, from the results obtained using the ANN techniques, it can be concluded that the RBFN hybrid approach achieved the best fault prediction rate in the fewest epochs when compared with the other three ANN techniques.

7. Conclusion

System analysts' use of prediction models to classify fault-prone classes as faulty or not faulty is the need of the day for researchers as well as practitioners, so more reliable approaches for prediction need to be modeled. In this paper, two approaches, namely, statistical methods and machine learning techniques, were applied for fault prediction. The application of statistical and machine learning methods in fault prediction requires an enormous amount of data, and analyzing this huge amount of data calls for a good prediction model.

This paper presents a comparative study of different prediction models for fault prediction for an open-source project. Fault prediction using statistical and machine learning methods was carried out for AIF by coding in the MATLAB environment. Statistical methods such as linear regression and logistic regression were applied. Machine learning techniques such as artificial neural networks (gradient descent and Levenberg-Marquardt methods), functional link artificial neural network, radial basis function network (basic RBFN, RBFN gradient, and RBFN hybrid), and probabilistic neural network were also applied for fault prediction analysis.

It can be concluded from the statistical regression analysis that, out of the six CK metrics, WMC appears to be the most useful in predicting faults. Table 20 shows that the hybrid approach of RBFN obtained better fault prediction in fewer epochs (14 iterations) when compared with the other three neural network techniques.

In the future, this work should be replicated on other open-source projects, such as Mozilla, using different AI techniques to analyze which model performs better in achieving higher accuracy for fault prediction. Fault prediction accuracy should also be measured by combining multiple computational intelligence techniques.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751-761, 1996.

[2] T. J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308-320, 1976.

[3] M. H. Halstead, Elements of Software Science, Elsevier Science, New York, NY, USA, 1977.

[4] W. Li and S. Henry, "Maintenance metrics for the object-oriented paradigm," in Proceedings of the 1st International Software Metrics Symposium, pp. 52-60, 1993.

[5] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, 1994.

[6] F. B. E. Abreu and R. Carapuca, "Object-oriented software engineering: measuring and controlling the development process," in Proceedings of the 4th International Conference on Software Quality, pp. 1-8, McLean, Va, USA, October 1994.

[7] M. Lorenz and J. Kidd, Object-Oriented Software Metrics, Prentice Hall, Englewood, NJ, USA, 1994.

[8] R. Martin, "OO design quality metrics: an analysis of dependencies," in Proceedings of the Workshop Pragmatic and Theoretical Directions in Object-Oriented Software Metrics (OOPSLA '94), 1994.

[9] D. P. Tegarden, S. D. Sheetz, and D. E. Monarchi, "A software complexity model of object-oriented systems," Decision Support Systems, vol. 13, no. 3-4, pp. 241-262, 1995.

[10] W. Melo and F. B. E. Abreu, "Evaluating the impact of object-oriented design on software quality," in Proceedings of the 3rd International Software Metrics Symposium, pp. 90-99, Berlin, Germany, March 1996.

[11] L. Briand, P. Devanbu, and W. Melo, "An investigation into coupling measures for C++," in Proceedings of the 19th International Conference on Software Engineering, pp. 412-421, May 1997.

[12] L. Etzkorn, J. Bansiya, and C. Davis, "Design and code complexity metrics for OO classes," Journal of Object-Oriented Programming, vol. 12, no. 1, pp. 35-40, 1999.

[13] L. C. Briand, J. Wust, J. W. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," The Journal of Systems and Software, vol. 51, no. 3, pp. 245-273, 2000.

[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242-249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63-75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902-909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13-26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45-50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164-168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal on Applied Mathematics, vol. 11, no. 2, pp. 431-441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, UK, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109-118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193-206, 2012.

[26] X. Yuan, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "An application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSET '00), pp. 85-91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395-417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1-6, 2004.

[30] N. Nagappan and W. Laurie, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1-7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402-419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39-62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168-175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1-4, 2013.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

12 ISRN Software Engineering

Table 13 Precision correctness completeness and accuracy for three versions of AIF

Precision () Correctness () Completeness () Accuracy ()ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16 ver 12 ver 14 ver 16

WMC 6111 4117 5714 6111 4117 5714 509 482 425 6613 8402 8171DIT mdash mdash mdash mdash mdash mdash 0 0 0 6447 8337 8051NOC 75 75 6666 75 75 6666 555 206 531 6578 836 8103CBO 60 5714 7777 60 5714 7777 277 275 372 648 8348 8103RFC 6666 3636 50 6666 3636 50 462 275 212 6529 8302 8051LCOM 6666 50 60 066 05 06 277 68 159 6496 8337 8062MULTI 6875 50 6153 6875 50 6153 1018 758 851 6644 8337 8113

Table 14 Accuracy prediction using gradient descent NN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00594 1093 00617 minus02038 00044 00048 940437

Table 15 Accuracy prediction using Levenberg Marquardt

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00023 11203 00308 minus02189 00022 00041 904977

Table 16 Accuracy prediction using FLANN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00304 07097 00390 03308 24601119890 minus 06 00050 963769

Table 17 Accuracy prediction using basic RBFN

MAE MARE RMSE 119877 119875 value Std error Accuracy ()00279 03875 00573 01969 0059 0006 972792

652 Functional Link Artificial Neural Network (FLANN)FLANN architecture for software fault prediction is a singlelayer feed-forward neural network consisting of an input andoutput layer FLANN does not incorporate any hidden layerand hence has less computational cost In this paper adaptivealgorithm has been used for updating the weights as shownin (21) Figure 14 shows the variation of mean square valuesagainst number of epochs for AIF version 16 Table 16 showsthe performance metrics of FLANN

653 Radial Basis Function Network In this paper Gaussianradial function is used as a radial function Gradient descentlearning and hybrid learning methods are used for updatingthe centers and weights respectively

Three layered RBFN has been considered in which sixCK metrics are taken as input nodes nine hidden centers aretaken as hidden nodes and output is the fault prediction rateTable 17 shows the performance metrics for AIF version 16

(a) Gradient Descent Learning Method Equations (25) and(26) are used for updating center and weight during trainingphase After simplifying (25) the equation is represented as

119862119894119895(119896 + 1) = 119862

119894119895(119896) minus 120578

1(1199101015840minus 119910)119882119894

120601119894

1205902(119909119895minus 119862119894119895(119896)) (39)

and the modified Equation (26) is formulated as

119882119894(119896 + 1) = 119882

119894(119896) + 120578

2(1199101015840minus 119910) 120601

119894 (40)

where 120590 is the width of the center and 119896 is the currentiteration number Table 18 shows the performancemetrics forAIF version 16 Figure 15 indicates the variation of MSE wrtnumber of epochs

(b) Hybrid Learning Method In Hybrid learning methodcenters are updated using (27) while weights are updatedusing supervised learning methods In this paper least meansquare error (LMSE) algorithm is used for updating theweights Table 19 shows the performance matrix for AIFversion 16 Figure 16 shows the graph for variation of MSEversus number of epochs

654 Probabilistic Neural Network (PNN) As mentioned inSection 424 PNN is a multilayered feed-forward networkwith four layers such as input hidden summation andoutput layer

In PNN 50 of faulty and nonfaulty classes are takenas input for hidden layers Gaussian elimination (28) isused as a hidden node function The summation layers sum

ISRN Software Engineering 13

Table 18 Accuracy prediction using RBFN gradient

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00207 02316 00323 03041 16302119890 minus 05 00041 972475

Table 19 Accuracy prediction using hybrid RBFN

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00614 01032 00316 09184 31834119890 minus 79 00013 984783

Number of iterations

Mea

n sq

uare

erro

r

0 10 20 30 40 50 60 700

01

02

03

04

05

06

07

08

09

Figure 14 Graph plot for MSE versus number of iterations (epoch)wrt FLANN

Number of iterations

Mea

n sq

uare

erro

r 0015

001

005

00 5 10 15 20 25 30 35 40 45

Figure 15 MSE versus number of epochs wrt gradient RBFN

contribution of each class of input patterns and producea net output which is a vector of probabilities The outputpattern having maximum summation value is classified intorespective class Figure 17 shows the variation of accuracy fordifferent values of smoothing parameter

66 Comparison Table 20 shows the tabulated results forthe obtained performance parameter values number ofepochs and accuracy rate by applying three neural networktechniques This performance table is an indication of betterfault prediction model In this comparative analysis theperformance parameter mean square error (MSE) was takenas a criterion to compute the performance parameters (suchas MARE MSE number of epochs and accuracy rate)when four neural network techniques were applied Duringthis process the MSE value of 0002 was set a thresholdfor evaluation Based on the number of iterations and theaccuracy rate obtained by the respective NN technique bestprediction model was determined

Number of iterationsM

ean

squa

re er

ror

006

005

003

002

001

004

00 2 4 6 8 10 12 14

Figure 16 MSE versus number of epochs wrt hybrid RBFN

Smoothing parameter

Accu

racy

()

865

86

855

85

845

84

835

83

825

820 05 1 15 2 25 3 35 4 45 5

Figure 17 Accuracy rate versus smoothing parameter

From Table 20 it is evident that gradient NN methodobtained an accuracy rate of 9404 in 162 epochs (iter-ations) LM technique which is an improvised model ofANN obtained 904 accuracy rate This accuracy rate isless than gradient NN but this approach (LM method) tookonly 13 epochs PNN method achieved a classification rate of8641

The three types of RBFN namely basic RBFN gradientand hybrid methods obtained a prediction rate of 97279724 and 9847 respectively Considering the number ofepochs RBFN hybridmethod obtained better prediction rateof 9847 in only 14 epochs when compared with gradientmethod (41 epochs) and basic RBFN approaches

FLANN architecture obtained 9637 accuracy rate withless computational cost involved FLANN obtained accuracyrate in 66 epochs as it has no hidden layer involved in itsarchitecture

14 ISRN Software Engineering

Table 20 Performance metrics

Performance parametersAI technique Epoch MAE MARE RMSE Std Error AccuracyGradient descent 162 00594 10930 00617 00048 9404LM 13 00023 11203 00308 00041 9049RBFN basic mdash 00279 03875 00573 006 9727RBFN gradient 41 00207 02316 00323 00041 9724RBFN hybrid 14 00614 01032 00316 00013 9847FLANN 66 00304 07097 00390 00050 9637

The performance of PNN is shown in Figure 17 Highestaccuracy in prediction was obtained for smoothing parame-ter value of 17 PNN obtained a classification rate of 8641

RBFN using hybrid learning model gives the least valuesfor MAE MARE RMSE and high accuracy rate Hencefrom the obtained results by using ANN techniques it can beconcluded that RBFNhybrid approach obtained the best faultprediction rate in less number of epochswhen comparedwithother three ANN techniques

7 Conclusion

System analyst use of prediction models to classify faultprone classes as faulty or not faulty is the need of the dayfor researchers as well as practitioners So more reliableapproaches for prediction need to be modeled In this papertwo approaches namely statistical methods and machinelearning techniques were applied for fault prediction Theapplication of statistical and machine learning methods infault prediction requires enormous amount of data andanalyzing this huge amount of data is necessary with the helpof a better prediction model

This paper proposes a comparative study of differentprediction models for fault prediction for an open-sourceproject Fault prediction using statistical and machine learn-ing methods were carried out for AIF by coding in MATLABenvironment Statistical methods such as linear regressionand logistic regression were applied Also machine learningtechniques such as artificial neural network (gradient descentand Levenberg Marquardt methods) Functional link artifi-cial neural network radial basis function network (RBFNbasic RBFN gradient and RBFN hybrid) and probabilisticneural network techniques were applied for fault predictionanalysis

It can be concluded from the statistical regression analysisthat out of six CK metrics WMC appears to be more usefulin predicting faults Table 20 shows that hybrid approachof RBFN obtained better fault prediction in less number ofepochs (14 iterations) when compared with the other threeneural network techniques

In future work should be replicated to other open-sourceprojects like Mozilla using different AI techniques to analyzewhich model performs better in achieving higher accuracyfor fault prediction Also fault prediction accuracy should bemeasured by combining multiple computational intelligencetechniques

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] V R Basili L C Briand and W L Melo ldquoA validationof object-oriented design metrics as quality indicatorsrdquo IEEETransactions on Software Engineering vol 22 no 10 pp 751ndash761 1996

[2] T J McCabe ldquoA Complexity Measurerdquo IEEE Transactions onSoftware Engineering vol 2 no 4 pp 308ndash320 1976

[3] M H Halstead Elements of Software Science Elsevier ScienceNew York NY USA 1977

[4] W Li and S Henry ldquoMaintenance metrics for the Object-Oriented paradigmrdquo in Proceedings of the 1st InternationalSoftware Metrics Symposium pp 52ndash60 1993

[5] S R Chidamber and C F Kemerer ldquoMetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[6] F B E Abreu andR Carapuca ldquoObject-Oriented software engi-neering measuring and controlling the development processrdquoin Proceedings of the 4th International Conference on SoftwareQuality pp 1ndash8 McLean Va USA October 1994

[7] M Lorenz and J Kidd Object-Oriented Software MetricsPrentice Hall Englewood NJ USA 1994

[8] R Martin ldquoOO design quality metricsmdashan analysis of depen-denciesrdquo in Proceedings of the Workshop Pragmatic and Theo-retical Directions in Object-Oriented Software Metrics (OOPSLArsquo94) 1994

[9] D P Tegarden S D Sheetz and D E Monarchi ldquoA softwarecomplexity model of object-oriented systemsrdquoDecision SupportSystems vol 13 no 3-4 pp 241ndash262 1995

[10] W Melo and F B E Abreu ldquoEvaluating the impact of object-oriented design on software qualityrdquo in Proceedings of the 3rdInternational Software Metrics Symposium pp 90ndash99 BerlinGermany March 1996

[11] L Briand P Devanbu and W Melo ldquoInvestigation intocoupling measures for C++rdquo in Proceedings of the IEEE 19thInternational Conference on Software EngineeringAssociation forComputing Machinery pp 412ndash421 May 1997

[12] L Etzkorn J Bansiya and C Davis ldquoDesign and code com-plexity metrics for OO classesrdquo Journal of Object-OrientedProgramming vol 12 no 1 pp 35ndash40 1999

[13] L C Briand JWust JWDaly andDVictor Porter ldquoExploringthe relationships between designmeasures and software qualityin object-oriented systemsrdquoThe Journal of Systems and Softwarevol 51 no 3 pp 245ndash273 2000

ISRN Software Engineering 15

[14] M-H Tang M-H Kao and M-H Chen ldquoEmpirical study onobject-oriented metricsrdquo in Proceedings of the 6th InternationalSoftware Metrics Symposium pp 242ndash249 November 1999

[15] K El Emam W Melo and J C Machado ldquoThe prediction offaulty classes using object-oriented design metricsrdquo Journal ofSystems and Software vol 56 no 1 pp 63ndash75 2001

[16] T M Khoshgoftaar E B Allen J P Hudepohl and S J AudldquoApplication of neural networks to software quality modeling ofa very large telecommunications systemrdquo IEEE Transactions onNeural Networks vol 8 no 4 pp 902ndash909 1997

[17] R Hochman T M Khoshgoftaar E B Allen and J PHudepohl ldquoEvolutionary neural networks a robust approachto software reliability problemsrdquo in Proceedings of the 8thInternational Symposium on Software Reliability Engineering(ISSRE rsquo97) pp 13ndash26 November 1997

[18] T Menzies B Caglayan E Kocaguneli J Krall F Peters andB Turhan ldquoThe PROMISE Repository of empirical softwareengineering datardquo West Virginia University Department ofComputer Science 2012 httppromisedatagooglecodecom

[19] Y Kumar Jain and S K Bhandare ldquoMin max normalizationbased data perturbation method for privacy protectionrdquo Inter-national Journal of Computer and Communication Technologyvol 2 no 8 pp 45ndash50 2011

[20] R Battiti ldquoFirst and Second-Order Methods for Learning bet-ween steepest descent and newtonrsquos methodrdquo Neural Computa-tion vol 4 no 2 pp 141ndash166 1992

[21] K Levenberg ldquoA method for the solution of certain non-linearproblems in least squaresrdquo Quarterly of Applied Mathematicsvol 2 no 2 pp 164ndash168 1944

[22] D W Marquardt ldquoAn algorithm for the lest-squares estimationof non-linear parametersrdquo SIAM Journal of Applied Mathemat-ics vol 11 no 2 pp 431ndash441 1963

[23] Y H Pao Adaptive Pattern Recognition and Neural NetworksAddison-Wesley Reading UK 1989

[24] D F Specht ldquoProbabilistic neural networksrdquo Neural Networksvol 3 no 1 pp 109ndash118 1990

[25] C Catal ldquoPerformance evaluation metrics for software faultprediction studiesrdquo Acta Polytechnica Hungarica vol 9 no 4pp 193ndash206 2012

[26] X Yaun T M Khoshgoftaar E B Allen and K GanesanldquoApplication of fuzzy clustering to software quality predictionrdquoin Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSESTrsquo00) pp 85ndash91 March 2000

[27] T Gyimothy R Ferenc and I Siket ldquoEmpirical validation ofobject-oriented metrics on open source software for fault pre-dictionrdquo IEEE Transactions on Software Engineering vol 31 no10 pp 897ndash910 2005

[28] G DenaroM Pezze and SMorasca ldquoTowards industrially rel-evant fault-proneness modelsrdquo International Journal of SoftwareEngineering and Knowledge Engineering vol 13 no 4 pp 395ndash417 2003

[29] S Kanmani and U V Rymend ldquoObject-Oriented softwarequality prediction using general regression neural networksrdquoSIGSOFT Software Engineering Notes vol 29 no 5 pp 1ndash62004

[30] N Nagappan and W Laurie ldquoEarly estimation of softwarequality using in-process testingmetrics a controlled case studyrdquoin Proceedings of the 3rd Workshop on Software Quality pp 1ndash7St Louis Mo USA 2005

[31] H M Olague L H Etzkorn S Gholston and S QuattlebaumldquoEmpirical validation of three software metrics suites to pre-dict fault-proneness of object-oriented classes developed usinghighly Iterative or agile software development processesrdquo IEEETransactions on Software Engineering vol 33 no 6 pp 402ndash4192007

[32] K K Aggarwal Y Singh A Kaur and R Malhotra ldquoEmpiricalanalysis for investigating the effect of object-oriented metricson fault proneness a replicated case studyrdquo Software ProcessImprovement and Practice vol 14 no 1 pp 39ndash62 2009

[33] F Wu ldquoEmpirical validation of object-oriented metrics onNASA for fault predictionrdquo in Proceedings of theInternationalConference on Advances in Information Technology and Educa-tion pp 168ndash175 2011

[34] H Kapila and S Singh ldquoAnalysis of CK metrics to predict soft-ware fault-proneness using bayesian inferencerdquo InternationalJournal of Computer Applications vol 74 no 2 pp 1ndash4 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

ISRN Software Engineering 13

Table 18 Accuracy prediction using RBFN gradient

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00207 02316 00323 03041 16302119890 minus 05 00041 972475

Table 19 Accuracy prediction using hybrid RBFN

MAE MARE RMSE 119877 119875 value Std Error Accuracy ()00614 01032 00316 09184 31834119890 minus 79 00013 984783

Number of iterations

Mea

n sq

uare

erro

r

0 10 20 30 40 50 60 700

01

02

03

04

05

06

07

08

09

Figure 14 Graph plot for MSE versus number of iterations (epoch)wrt FLANN

Number of iterations

Mea

n sq

uare

erro

r 0015

001

005

00 5 10 15 20 25 30 35 40 45

Figure 15 MSE versus number of epochs wrt gradient RBFN

contribution of each class of input patterns and producea net output which is a vector of probabilities The outputpattern having maximum summation value is classified intorespective class Figure 17 shows the variation of accuracy fordifferent values of smoothing parameter

66 Comparison Table 20 shows the tabulated results forthe obtained performance parameter values number ofepochs and accuracy rate by applying three neural networktechniques This performance table is an indication of betterfault prediction model In this comparative analysis theperformance parameter mean square error (MSE) was takenas a criterion to compute the performance parameters (suchas MARE MSE number of epochs and accuracy rate)when four neural network techniques were applied Duringthis process the MSE value of 0002 was set a thresholdfor evaluation Based on the number of iterations and theaccuracy rate obtained by the respective NN technique bestprediction model was determined

Number of iterationsM

ean

squa

re er

ror

006

005

003

002

001

004

00 2 4 6 8 10 12 14

Figure 16 MSE versus number of epochs wrt hybrid RBFN

Smoothing parameter

Accu

racy

()

865

86

855

85

845

84

835

83

825

820 05 1 15 2 25 3 35 4 45 5

Figure 17 Accuracy rate versus smoothing parameter

From Table 20 it is evident that gradient NN methodobtained an accuracy rate of 9404 in 162 epochs (iter-ations) LM technique which is an improvised model ofANN obtained 904 accuracy rate This accuracy rate isless than gradient NN but this approach (LM method) tookonly 13 epochs PNN method achieved a classification rate of8641

The three types of RBFN namely basic RBFN gradientand hybrid methods obtained a prediction rate of 97279724 and 9847 respectively Considering the number ofepochs RBFN hybridmethod obtained better prediction rateof 9847 in only 14 epochs when compared with gradientmethod (41 epochs) and basic RBFN approaches

FLANN architecture obtained 9637 accuracy rate withless computational cost involved FLANN obtained accuracyrate in 66 epochs as it has no hidden layer involved in itsarchitecture

14 ISRN Software Engineering

Table 20 Performance metrics

Performance parametersAI technique Epoch MAE MARE RMSE Std Error AccuracyGradient descent 162 00594 10930 00617 00048 9404LM 13 00023 11203 00308 00041 9049RBFN basic mdash 00279 03875 00573 006 9727RBFN gradient 41 00207 02316 00323 00041 9724RBFN hybrid 14 00614 01032 00316 00013 9847FLANN 66 00304 07097 00390 00050 9637

The performance of PNN is shown in Figure 17 Highestaccuracy in prediction was obtained for smoothing parame-ter value of 17 PNN obtained a classification rate of 8641

RBFN using hybrid learning model gives the least valuesfor MAE MARE RMSE and high accuracy rate Hencefrom the obtained results by using ANN techniques it can beconcluded that RBFNhybrid approach obtained the best faultprediction rate in less number of epochswhen comparedwithother three ANN techniques

7 Conclusion

System analyst use of prediction models to classify faultprone classes as faulty or not faulty is the need of the dayfor researchers as well as practitioners So more reliableapproaches for prediction need to be modeled In this papertwo approaches namely statistical methods and machinelearning techniques were applied for fault prediction Theapplication of statistical and machine learning methods infault prediction requires enormous amount of data andanalyzing this huge amount of data is necessary with the helpof a better prediction model

This paper proposes a comparative study of differentprediction models for fault prediction for an open-sourceproject Fault prediction using statistical and machine learn-ing methods were carried out for AIF by coding in MATLABenvironment Statistical methods such as linear regressionand logistic regression were applied Also machine learningtechniques such as artificial neural network (gradient descentand Levenberg Marquardt methods) Functional link artifi-cial neural network radial basis function network (RBFNbasic RBFN gradient and RBFN hybrid) and probabilisticneural network techniques were applied for fault predictionanalysis

It can be concluded from the statistical regression analysisthat out of six CK metrics WMC appears to be more usefulin predicting faults Table 20 shows that hybrid approachof RBFN obtained better fault prediction in less number ofepochs (14 iterations) when compared with the other threeneural network techniques

In future work should be replicated to other open-sourceprojects like Mozilla using different AI techniques to analyzewhich model performs better in achieving higher accuracyfor fault prediction Also fault prediction accuracy should bemeasured by combining multiple computational intelligencetechniques

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] V R Basili L C Briand and W L Melo ldquoA validationof object-oriented design metrics as quality indicatorsrdquo IEEETransactions on Software Engineering vol 22 no 10 pp 751ndash761 1996

[2] T J McCabe ldquoA Complexity Measurerdquo IEEE Transactions onSoftware Engineering vol 2 no 4 pp 308ndash320 1976

[3] M H Halstead Elements of Software Science Elsevier ScienceNew York NY USA 1977

[4] W Li and S Henry ldquoMaintenance metrics for the Object-Oriented paradigmrdquo in Proceedings of the 1st InternationalSoftware Metrics Symposium pp 52ndash60 1993

[5] S R Chidamber and C F Kemerer ldquoMetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[6] F B E Abreu andR Carapuca ldquoObject-Oriented software engi-neering measuring and controlling the development processrdquoin Proceedings of the 4th International Conference on SoftwareQuality pp 1ndash8 McLean Va USA October 1994

[7] M Lorenz and J Kidd Object-Oriented Software MetricsPrentice Hall Englewood NJ USA 1994

[8] R Martin ldquoOO design quality metricsmdashan analysis of depen-denciesrdquo in Proceedings of the Workshop Pragmatic and Theo-retical Directions in Object-Oriented Software Metrics (OOPSLArsquo94) 1994

[9] D P Tegarden S D Sheetz and D E Monarchi ldquoA softwarecomplexity model of object-oriented systemsrdquoDecision SupportSystems vol 13 no 3-4 pp 241ndash262 1995

[10] W Melo and F B E Abreu ldquoEvaluating the impact of object-oriented design on software qualityrdquo in Proceedings of the 3rdInternational Software Metrics Symposium pp 90ndash99 BerlinGermany March 1996

[11] L Briand P Devanbu and W Melo ldquoInvestigation intocoupling measures for C++rdquo in Proceedings of the IEEE 19thInternational Conference on Software EngineeringAssociation forComputing Machinery pp 412ndash421 May 1997

[12] L Etzkorn J Bansiya and C Davis ldquoDesign and code com-plexity metrics for OO classesrdquo Journal of Object-OrientedProgramming vol 12 no 1 pp 35ndash40 1999

[13] L C Briand JWust JWDaly andDVictor Porter ldquoExploringthe relationships between designmeasures and software qualityin object-oriented systemsrdquoThe Journal of Systems and Softwarevol 51 no 3 pp 245ndash273 2000

ISRN Software Engineering 15

[14] M-H Tang M-H Kao and M-H Chen ldquoEmpirical study onobject-oriented metricsrdquo in Proceedings of the 6th InternationalSoftware Metrics Symposium pp 242ndash249 November 1999

[15] K El Emam W Melo and J C Machado ldquoThe prediction offaulty classes using object-oriented design metricsrdquo Journal ofSystems and Software vol 56 no 1 pp 63ndash75 2001

[16] T M Khoshgoftaar E B Allen J P Hudepohl and S J AudldquoApplication of neural networks to software quality modeling ofa very large telecommunications systemrdquo IEEE Transactions onNeural Networks vol 8 no 4 pp 902ndash909 1997

[17] R Hochman T M Khoshgoftaar E B Allen and J PHudepohl ldquoEvolutionary neural networks a robust approachto software reliability problemsrdquo in Proceedings of the 8thInternational Symposium on Software Reliability Engineering(ISSRE rsquo97) pp 13ndash26 November 1997

[18] T Menzies B Caglayan E Kocaguneli J Krall F Peters andB Turhan ldquoThe PROMISE Repository of empirical softwareengineering datardquo West Virginia University Department ofComputer Science 2012 httppromisedatagooglecodecom

[19] Y Kumar Jain and S K Bhandare ldquoMin max normalizationbased data perturbation method for privacy protectionrdquo Inter-national Journal of Computer and Communication Technologyvol 2 no 8 pp 45ndash50 2011

[20] R Battiti ldquoFirst and Second-Order Methods for Learning bet-ween steepest descent and newtonrsquos methodrdquo Neural Computa-tion vol 4 no 2 pp 141ndash166 1992

[21] K Levenberg ldquoA method for the solution of certain non-linearproblems in least squaresrdquo Quarterly of Applied Mathematicsvol 2 no 2 pp 164ndash168 1944

[22] D W Marquardt ldquoAn algorithm for the lest-squares estimationof non-linear parametersrdquo SIAM Journal of Applied Mathemat-ics vol 11 no 2 pp 431ndash441 1963

[23] Y H Pao Adaptive Pattern Recognition and Neural NetworksAddison-Wesley Reading UK 1989

[24] D F Specht ldquoProbabilistic neural networksrdquo Neural Networksvol 3 no 1 pp 109ndash118 1990

[25] C Catal ldquoPerformance evaluation metrics for software faultprediction studiesrdquo Acta Polytechnica Hungarica vol 9 no 4pp 193ndash206 2012

[26] X Yaun T M Khoshgoftaar E B Allen and K GanesanldquoApplication of fuzzy clustering to software quality predictionrdquoin Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSESTrsquo00) pp 85ndash91 March 2000

[27] T Gyimothy R Ferenc and I Siket ldquoEmpirical validation ofobject-oriented metrics on open source software for fault pre-dictionrdquo IEEE Transactions on Software Engineering vol 31 no10 pp 897ndash910 2005

[28] G DenaroM Pezze and SMorasca ldquoTowards industrially rel-evant fault-proneness modelsrdquo International Journal of SoftwareEngineering and Knowledge Engineering vol 13 no 4 pp 395ndash417 2003

[29] S Kanmani and U V Rymend ldquoObject-Oriented softwarequality prediction using general regression neural networksrdquoSIGSOFT Software Engineering Notes vol 29 no 5 pp 1ndash62004

[30] N Nagappan and W Laurie ldquoEarly estimation of softwarequality using in-process testingmetrics a controlled case studyrdquoin Proceedings of the 3rd Workshop on Software Quality pp 1ndash7St Louis Mo USA 2005

[31] H M Olague L H Etzkorn S Gholston and S QuattlebaumldquoEmpirical validation of three software metrics suites to pre-dict fault-proneness of object-oriented classes developed usinghighly Iterative or agile software development processesrdquo IEEETransactions on Software Engineering vol 33 no 6 pp 402ndash4192007

[32] K K Aggarwal Y Singh A Kaur and R Malhotra ldquoEmpiricalanalysis for investigating the effect of object-oriented metricson fault proneness a replicated case studyrdquo Software ProcessImprovement and Practice vol 14 no 1 pp 39ndash62 2009

[33] F Wu ldquoEmpirical validation of object-oriented metrics onNASA for fault predictionrdquo in Proceedings of theInternationalConference on Advances in Information Technology and Educa-tion pp 168ndash175 2011

[34] H Kapila and S Singh ldquoAnalysis of CK metrics to predict soft-ware fault-proneness using bayesian inferencerdquo InternationalJournal of Computer Applications vol 74 no 2 pp 1ndash4 2013

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article Statistical and Machine Learning …downloads.hindawi.com/archive/2014/251083.pdfchosen for fault prediction... Empirical Data Collection. Metricsuitesareusedand de

14 ISRN Software Engineering

Table 20 Performance metrics

Performance parametersAI technique Epoch MAE MARE RMSE Std Error AccuracyGradient descent 162 00594 10930 00617 00048 9404LM 13 00023 11203 00308 00041 9049RBFN basic mdash 00279 03875 00573 006 9727RBFN gradient 41 00207 02316 00323 00041 9724RBFN hybrid 14 00614 01032 00316 00013 9847FLANN 66 00304 07097 00390 00050 9637

The performance of PNN is shown in Figure 17 Highestaccuracy in prediction was obtained for smoothing parame-ter value of 17 PNN obtained a classification rate of 8641

RBFN using hybrid learning model gives the least valuesfor MAE MARE RMSE and high accuracy rate Hencefrom the obtained results by using ANN techniques it can beconcluded that RBFNhybrid approach obtained the best faultprediction rate in less number of epochswhen comparedwithother three ANN techniques

7 Conclusion

System analyst use of prediction models to classify faultprone classes as faulty or not faulty is the need of the dayfor researchers as well as practitioners So more reliableapproaches for prediction need to be modeled In this papertwo approaches namely statistical methods and machinelearning techniques were applied for fault prediction Theapplication of statistical and machine learning methods infault prediction requires enormous amount of data andanalyzing this huge amount of data is necessary with the helpof a better prediction model

This paper proposes a comparative study of differentprediction models for fault prediction for an open-sourceproject Fault prediction using statistical and machine learn-ing methods were carried out for AIF by coding in MATLABenvironment Statistical methods such as linear regressionand logistic regression were applied Also machine learningtechniques such as artificial neural network (gradient descentand Levenberg Marquardt methods) Functional link artifi-cial neural network radial basis function network (RBFNbasic RBFN gradient and RBFN hybrid) and probabilisticneural network techniques were applied for fault predictionanalysis

It can be concluded from the statistical regression analysisthat out of six CK metrics WMC appears to be more usefulin predicting faults Table 20 shows that hybrid approachof RBFN obtained better fault prediction in less number ofepochs (14 iterations) when compared with the other threeneural network techniques

In future work should be replicated to other open-sourceprojects like Mozilla using different AI techniques to analyzewhich model performs better in achieving higher accuracyfor fault prediction Also fault prediction accuracy should bemeasured by combining multiple computational intelligencetechniques

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] V R Basili L C Briand and W L Melo ldquoA validationof object-oriented design metrics as quality indicatorsrdquo IEEETransactions on Software Engineering vol 22 no 10 pp 751ndash761 1996

[2] T J McCabe ldquoA Complexity Measurerdquo IEEE Transactions onSoftware Engineering vol 2 no 4 pp 308ndash320 1976

[3] M H Halstead Elements of Software Science Elsevier ScienceNew York NY USA 1977

[4] W Li and S Henry ldquoMaintenance metrics for the Object-Oriented paradigmrdquo in Proceedings of the 1st InternationalSoftware Metrics Symposium pp 52ndash60 1993

[5] S R Chidamber and C F Kemerer ldquoMetrics suite for objectoriented designrdquo IEEE Transactions on Software Engineeringvol 20 no 6 pp 476ndash493 1994

[6] F B E Abreu andR Carapuca ldquoObject-Oriented software engi-neering measuring and controlling the development processrdquoin Proceedings of the 4th International Conference on SoftwareQuality pp 1ndash8 McLean Va USA October 1994

[7] M Lorenz and J Kidd Object-Oriented Software MetricsPrentice Hall Englewood NJ USA 1994

[8] R Martin ldquoOO design quality metricsmdashan analysis of depen-denciesrdquo in Proceedings of the Workshop Pragmatic and Theo-retical Directions in Object-Oriented Software Metrics (OOPSLArsquo94) 1994

[9] D P Tegarden S D Sheetz and D E Monarchi ldquoA softwarecomplexity model of object-oriented systemsrdquoDecision SupportSystems vol 13 no 3-4 pp 241ndash262 1995

[10] W Melo and F B E Abreu ldquoEvaluating the impact of object-oriented design on software qualityrdquo in Proceedings of the 3rdInternational Software Metrics Symposium pp 90ndash99 BerlinGermany March 1996

[11] L Briand P Devanbu and W Melo ldquoInvestigation intocoupling measures for C++rdquo in Proceedings of the IEEE 19thInternational Conference on Software EngineeringAssociation forComputing Machinery pp 412ndash421 May 1997

[12] L Etzkorn J Bansiya and C Davis ldquoDesign and code com-plexity metrics for OO classesrdquo Journal of Object-OrientedProgramming vol 12 no 1 pp 35ndash40 1999

[13] L C Briand JWust JWDaly andDVictor Porter ldquoExploringthe relationships between designmeasures and software qualityin object-oriented systemsrdquoThe Journal of Systems and Softwarevol 51 no 3 pp 245ndash273 2000


[14] M.-H. Tang, M.-H. Kao, and M.-H. Chen, "An empirical study on object-oriented metrics," in Proceedings of the 6th International Software Metrics Symposium, pp. 242–249, November 1999.

[15] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, no. 1, pp. 63–75, 2001.

[16] T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl, and S. J. Aud, "Application of neural networks to software quality modeling of a very large telecommunications system," IEEE Transactions on Neural Networks, vol. 8, no. 4, pp. 902–909, 1997.

[17] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P. Hudepohl, "Evolutionary neural networks: a robust approach to software reliability problems," in Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pp. 13–26, November 1997.

[18] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com.

[19] Y. Kumar Jain and S. K. Bhandare, "Min max normalization based data perturbation method for privacy protection," International Journal of Computer and Communication Technology, vol. 2, no. 8, pp. 45–50, 2011.

[20] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.

[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.

[22] D. W. Marquardt, "An algorithm for the least-squares estimation of non-linear parameters," SIAM Journal of Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.

[23] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, Mass, USA, 1989.

[24] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, 1990.

[25] C. Catal, "Performance evaluation metrics for software fault prediction studies," Acta Polytechnica Hungarica, vol. 9, no. 4, pp. 193–206, 2012.

[26] X. Yuan, T. M. Khoshgoftaar, E. B. Allen, and K. Ganesan, "Application of fuzzy clustering to software quality prediction," in Proceedings of the 3rd IEEE Symposium on Application-Specific Systems and Software Engineering Technology (ASSET '00), pp. 85–91, March 2000.

[27] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, 2005.

[28] G. Denaro, M. Pezze, and S. Morasca, "Towards industrially relevant fault-proneness models," International Journal of Software Engineering and Knowledge Engineering, vol. 13, no. 4, pp. 395–417, 2003.

[29] S. Kanmani and U. V. Rymend, "Object-oriented software quality prediction using general regression neural networks," SIGSOFT Software Engineering Notes, vol. 29, no. 5, pp. 1–6, 2004.

[30] N. Nagappan and L. Williams, "Early estimation of software quality using in-process testing metrics: a controlled case study," in Proceedings of the 3rd Workshop on Software Quality, pp. 1–7, St. Louis, Mo, USA, 2005.

[31] H. M. Olague, L. H. Etzkorn, S. Gholston, and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402–419, 2007.

[32] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process Improvement and Practice, vol. 14, no. 1, pp. 39–62, 2009.

[33] F. Wu, "Empirical validation of object-oriented metrics on NASA for fault prediction," in Proceedings of the International Conference on Advances in Information Technology and Education, pp. 168–175, 2011.

[34] H. Kapila and S. Singh, "Analysis of CK metrics to predict software fault-proneness using Bayesian inference," International Journal of Computer Applications, vol. 74, no. 2, pp. 1–4, 2013.
