investigating software detection methods - ieomieomsociety.org/ieom_2016/pdfs/702.pdf ·...

Proceedings of the 2016 International Conference on Industrial Engineering and Operations Management Kuala Lumpur, Malaysia, March 8-10, 2016

Investigating Software Detection Methods

Kiran Khatter School of Engineering & Technology

Ansal University Gurgaon, India

[email protected]

Arvind Kalia Department of Computer Science

Himachal Pradesh University Shimla, India

[email protected]

Abstract— A defective module not only increases the development time and development cost but also increases the maintenance time and maintenance cost. According to the available literature survey, many systems failed due to schedule and time budget overruns. Therefore a software defect detection technique is needed to identify those software modules that are very likely to include defects and thereby improves the software quality by contributing in the efficient removal software defects. The main objective of the paper is to help software developers in identifying the software defects based on the various software metrics using various classification and machine learning techniques. In this paper, we are performing empirical classification comparison on 5 real world datasets.

Keywords— Software defect detection, Clustering Techniques, Machine Learning Techniques, Software Metrics

I. INTRODUCTION

Computer Software have been become an essential part in everyday life and have been used for different applications ranging from business and personal computing applications to real-time applications. The main objective of software development is to develop high quality software with the expected functionalities. In context of software engineering, software quality refers to the satisfaction of functional and non-functional requirements. A functional requirement refers to the operations performed by the software and its components whereas Non-functional requirement refers to the quality attributes such as reliability, maintainability, availability etc. of the system and the degree of satisfaction of non-functional requirements determines the success/failure of software systems [1].There are also various approaches to achieve these attributes with emphasis to improve the software quality by preventing the introduction of faults. These software defect detection approaches help in finding the defective modules in the software. Testing at the various level of software development process is the traditional way to identify the defects, but when the project size increases in terms of lines of codes(LOCs) and complexity, it becomes difficult and expensive to find the defects with the use of traditional testing techniques. If defects or errors are not identified in the initial phase, these may likely to creep in the later phases such as design and implementation phase of the software development process. Repair of such errors and defects increases the time and cost of software development process [2][3][4]. National Institute of Standards and Technology did study on various projects of U.S. and concluded that software errors cost U.S. economy about $59.5 billion yearly. So testing and debugging are persistent during the software development process. [5].Thus sooner we detect defected software components; it lowers down the cost& time to develop the software and hence improves the reliability of the software. Thus Defect prediction is imperative to accomplish the software quality.

In order to perform comparison of different software fault detection classifiers, we are using various performance metrics such as probability of detection, accuracy, precision, G-mean, F-measure. This paper also represents the comparison of various different classification models using ROC curves.

This paper is organized as follows: Section 2 explains the related work reported by various researchers for software fault detection. Section 3 describes the data sets used in the study to perform comparison. Section 4 specifies the different classifiers and a detailed comparison between various classifiers. Section 5 represents the experimental set up and the results interpretation followed by conclusion of the study.

2248© IEOM Society International


II. RELATED WORKS

A software defect is an error, failure or fault which triggers software to diverge from its specified behavior. The main objective of software detection method is to categorize the fault prone and non-fault prone modules on the basis of classifier. It improves the overall quality of software by discovering the software defects in the early phase of software development process and helping developers to focus on defect prone modules [3][6][7]. Identification and removal of software faults is one of the critical activities in the software development process. Though it is not possible to make defect-free software, it is feasible to minimize the software faults and their impact on the software. Anticipation and restoration of software defects not only improves the quality of software, also yields the huge returns.

There are various public repositories such as NASA MDP, PROMISE etc. utilized for building classifiers to identify the fault prone and non-fault prone modules. These repositories specifies the information on various metrics such as LOC, Halstead measure, Mccabe Complexity, branch count etc. There are various classification and machine learning methods such as Associative Classification [8], Logistic Regression [9][10], Fuzzy Subtractive Clustering [11], Support Vector Machine[12] and Artificial Neural Networks [13][14]and Baysian Belief network [15][16]etc. are accessible to build software prediction models.

Fenton et al. [17] used Dynamic Bayesian Network for different software life cycles adopted for several platforms.

Greg Rothermell [18] prioritized the test suits for regression testing by using different types of techniques and checked which technique is best for the prediction of software faults.

Janes et al. [19] used zero-inflated negative binomial regression and Poisson distribution for real-time telecommunication systems.

Catal et al. [20] used a clustering based software fault prediction approach to classify software failures.

Seliya et al. [21] used K-means clustering method to investigate fault prone software modules.

Hall et al. [22] used N-fold cross-validation multiple times to transform the software failure data to the model properly and to represent the system with an acceptable accuracy.

Zhong et al. [23] used K-means and Neural-Gas clustering algorithms to cluster modules to perform software fault prediction.

In this paper, we are using PROMISE Software Engineering Repository [24] to build software defect prediction models based on the Classification Tree, Logistic Regression, Support Vector Machine and Naïve Bayes algorithm. Five following datasets are investigated to perform comparison of different classifiers.

A. JM1 Dataset:

JM1 dataset, written in C language, contains 10,885 modules on software metrics and fault-proneness information of Real-time predictive ground system project.

B. CM1 Dataset

CM1data set contains 498 modules of NASA spacecraft instrument for prediction of fault-prone modules.

C. PC1 Dataset

The PC1 project, written in 40 KLOC of C, has 1,109 modules on the flight software for earth orbiting satellite application.

D. KC1 Dataset

KC1 project, written in C++, consists of 43 KLOC has 2017 modules to process the data.

E. KC2 Dataset

KC2 project is on storage management system to process data and contains 520 modules of C++ code.

III. CLASSIFICATION TECHNIQUES

A. Classification Tree

Classification Tree helps in categorizing the fault prone and not fault prone modules on the basis of certain softwaremetrics or code attributes derived from datasets. These datasets includes the data of previous development projects in terms of number of faults and characteristics code attributes [25] [26]. Voulgaris et al. [27] used classification tree based on the JM1 and CM1 projects of NASA and the model they built had 79.3% accuracy. The advantage of the classification trees is the accuracy of the prediction model, but it cannot evaluate imbalanced trees. The classification tree for JM1 dataset is shown below in the Figure 1:



Figure 1: Classification Tree for JM1 dataset

B. Logistic Regression

Logistic Regression is a non-linear regression technique to measure the relationship between continuous dependent and independent variables. In this technique, maximum-likelihood ratio is used to measure the significance of the variables. It is used for predicting outcomes of class label based on the specific features of an individual case [28][29].

C. Support Vector Machine (SVM)

VM is a machine learning techniqueused for performing data classificationand regression analysis in varied applications such as fingerprint and hand written recognition etc. SVM helps in separating the clusters of vector on the basis optimal hyper plane. Elish et al. [29] [30] compared the performance of SVM against various statistical and machine learning methods and indicated that SVM yields better results in comparison to other methods in the context of NASA datasets.

D. Naive Bayes

Naive Bayes classifier is the probabilistic classifier based on Bayes theorem which represents the networks of probabilities in order to capture the probabilistic relationships between variables. Bayesian network represents joint probability distribution in the form of Direct Acyclic Graph network structure (G) for V variables. The arcs symbolize the Bayesian probabilistic relationships among various nodes representing random variables [2]. It represents the group of conditional probability distributions where each variable Vi in the graph G is represented by a conditional distribution given its parent nodes Par(Vi) [2] [31].It consists of following two components:

a) A directed acyclic graph G whose vertices correspond to random variables V1, V2, V3,…, Vn.

b) Conditional probability distribution of every variableiV given its parents Par(Vi).

Graph G is based on the Markov assumption “Every variable iV given its parents Par(Vi) is conditional independent of non-

descendants nodes” and it is defined as follows [32]:

)(|(),...,,(1

21 i

n

iin VParVPVVVPar ∏

=

=

IV. EVALUATION MEASUREMENTS

Binary Classifiers are evaluated on the basis of Confusion Matrix, which helps to classify fault prone, or not fault prone modules. Figure 2 represents the information about the actual and predicted classification for any two-class classifiers. In this



study, if software module is found to be defective, is categorized as “positive” case otherwise modules without defects are termed as “negative” cases [8].

Defects Predicted?NO YES

Modules with Defect?

fp True Negative (TN)

False Positive (FP)

nfp False Negative (FN)

True Positive (TP)

Figure 2: Defect Prediction Confusion Matrix

In order to assess the performance of classification algorithm, several performance metrics are defined below on the basis of total number of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) cases:

A. Probability of Detection (PD)

It is the proportion of fault-prone modules predicted by the classification algorithm. It is defined as follows[8]:

FN)(TP

TP PD)Detection( ofy Probabilit

+=

B. Accuracy (ACC)

It refers to the correctly predicted modules and calculated as follows [5]:

TP)FNFP(TN

TNTP CC)Accuracy(A

++++=

C. Precision

The precision is percentage of correctly predicted fault-prone modules, calculated as follows [8]:

FP)(TP

TP Precision

+=

D. F-measure

It is the harmonic mean of recall and precision and calculated as follows [8]:

PD) percision*(

PD*percision*1)( measure-F

2

2

++=

ββ

In F-measure, β refers to any non-negative value and controls the weight assigned to PD and precision. If β takes value 1, then PD and precision are weighted equally [8]. Following mean measure consider “imbalance of data” into account while evaluating the classification method [8]:

precision *PD mean -G 1 =

TNR *PD mean -G 2 =Both G-mean’s and F-measure result in0 when all the positive cases are predicted incorrectly.

V. PERFORMANCE COMPARISON

A suitable way to compare the performance of various classification algorithms is by using Receiver Operating Characteristic (ROC) curves. The ROC curve is two-dimensional graph in which PF is plotted against PD to indicate the tradeoff between true positive cases and false positive cases. The higher the PD at low PF values, the better the model [33]. The goal here is to investigate which classification method performs better in the prediction of fault-prone modules of these data sets. We are using RF algorithm at three cutoffs: (0.9, 0.1), (0.8, 0.2) and (0.7, 0.3).



TABLE I

EXPERIMENTAL RESULTS ON DIFFERENT METRICS-JM1PROJECT

Methods Probability of

Detection Accuracy Precision G-mean1 G-mean2 F-measure

Classification Tree 0.605 0.826 0.125 0.275 0.71 0.207

Logistic Regression 0.107 0.824 0.598 0.253 0.325 0.182

Support Vector Machine

0.118 0.835 0.838 0.315 0.343 0.208

Naïve Bayes 0.491 0.816 0.204 0.317 0.699 0.289

Random Forest (Cutoff= (0.9,0.1))

0.969 0.839 0.532 0.718 0.886 0.687


0.959 0.973 0.901 0.938 0.968 0.929


0.956 0.988 0.977 0.967 0.975 0.967

In Table 1, the highest values in both G − mean and F – measure in Random forests at cutoffs (0.9, 0.1), (0.8,0.2) and (0.7, 0.3) indicate that random forests is best classification method in comparison of other classification techniques. Since JM1 data set suffers from noisy software engineering measurements, we can conclude that random forests performs better in comparison to other classifiers on a large and noisy dataset.

TABLE II

EXPERIMENTAL RESULTS ON DIFFERENT METRICS-CM1PROJECT

From the outcome in Table 2, Classification Tree appears to perform better on CM1 project based on F − measure. Random forests at cutoffs (0.8, 0.2) and (0.7, 0.3) returns the highest values for both G − mean and F − measure. It demonstrates that Random Forest Algorithm is in better execution.

Methods Probability

of Detection Accuracy Precision G-mean1 G-mean2 F-measure


Logistic Regression 0.276 0.934 0.800 0.469 0.524 0.410

Support Vector Machine 0.069 0.922 1.000 0.263 0.263 0.129

Naïve Bayes 0.282 0.867 0.379 0.327 0.515 0.324


1.000 0.922 0.518 0.719 0.851 0.518


1.000 0.974 0.723 0.967 0.919 0.722


1.000 0.994 0.935 1.000 0.997 0.967



TABLE III EXPERIMENTAL RESULTS ON DIFFERENT METRICS-PC1 PROJECT

In Table 3, Random Forest algorithms at cutoffs (0.9, 0.1), (0.8, 0.2) and (0.7, 0.3) and Logistic Regression show better preference in comparison to Classification Tree, SVM and Naïve Bayes methods. Classification Tree produces an acceptable value in G-mean, but a low value in F-measure.

TABLE IV

EXPERIMENTAL RESULTS ON DIFFERENT METRICS: KC1 PROJECT

Logistic Regression produces the best result based on G-mean and F−measure. Random Forest Algorithm also gives not too bad performance measures. It appears that aside from Logistic Regression, Random forest algorithm accomplishes higher results at cutoffs (0.9, 0.1), (0.8, 0.2) and (0.7, 0.3).

Methods Probability

of Detection Accuracy Precision G-mean1 G-mean2 F-measure

Classification Tree 0.750 0.934 0.203 0.391 0.448 0.320Logistic

Regression 0.203 0.929 0.600 0.833 0.839 0.304


Naïve Bayes 0.310 0.895 0.305 0.308 0.541 0.308


1.000 0.966 0.694 0.833 0.982 0.819


1.000 0.991 0.894 0.945 0.995 0.944


1.000 0.996 0.952 0.976 0.998 0.975

Methods Probability

of Detection

Accuracy Precision G-mean1 G-mean2 F-measure


Logistic Regression

0.256 0.869 0.695 0.461 0.622 0.595

Support Vector Machine

0.242 0.878 0.844 0.452 0.490 0.373

Naïve Bayes 0.433 0.829 0.390 0.411 0.468 0.391


0.390 0.829 0.433 0.411 0.495 0.410


0.390 0.829 0.411 0.433 0.500 0.410


0.390 0.829 0.433 0.410 0.410 0.419



TABLEV

EXPERIMENTAL RESULTS ON DIFFERENT METRICS: KC2 PROJECT

Table 5 indicates that Classification Tree appears to perform better on KC2 project based on G − mean. Random forests at cutoffs (0.9, 0.1), (0.8, 0.2) and (0.7, 0.3) returns the highest values for G-mean and F – measure. We have used two-dimensional ROC curve to represent the trade-off between True Positive cases and False Positive cases for assessing the performance of various classification algorithms being used in the study: Classification Tree, Logistic Regression, Support Vector Machine, Naïve Bayes and Random Forest Algorithms.

Figure 3: ROC Curve- JM1 Project

Methods Probability

of Detection

Accuracy Precision G-mean1 G-mean2 F-measure

Classification Tree 0.825 0.906 0.712 0.767 0.873 0.765 Logistic

Regression 0.466 0.85 0.739 0.587 0.667 0.571


Naïve Bayes 0.696 0.838 0.438 0.552 0.774 0.538 Random Forest

(Cutoff= (0.9,0.1))

0.932 0.829 0.562 0.734 0.864 0.701


0.917 0.718 0.817 0.938 0.931 0.865


0.917 0.971 0.866 0.944 0.931 0.931



Figure 4: ROC Curve- CM1 Project

Figure 5: ROC Curve- PC1 Project



Figure 6: ROC Curve-KC1 Project

Figure 7: ROC Curve- KC2 Project



Figure 3 represents Random Forest as best classification algorithm in terms of identifying defective software modules. From Figure 4 and Figure 7, it is evident that Classification Tree works in better execution for CM1 and KC2 project. Figure 5 represents Logistic Regression and Random Forest as better classification methods. ROC curve shown in Figure 6 represents the Logistic Regression as best classification algorithm representing the trade-offs between True and False cases. Thus ROC plots shown in Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7 indicates the performance comparison of Classification Tree, Logistic Regression, Support Vector Machine, Naïve Bayes and Random Forest Algorithms. It is found that Random Forest algorithm works well on JM1 project of PROMISE Software Engineering Repository. The performance of Random forest algorithm is good if voting thresholds is proportionate to number of fault-prone modules.

VI. CONCLUSIONS

As software defects are introduced in software development process, modules in the dataset categorized into fault prone and not fault prone modules. The early detection of software defects decreases cost of development and increases the overall quality in software systems. This paper gave several insights in the context of software fault prediction. Different software fault prediction algorithms were evaluated in terms of different prediction performance parameters. It is also reported that performance of classification algorithm may return arbitrary results depending on voting threshold values.

References

[1] Khatter, K. & Kalia, A. (2013). Impact of Non-functional Requirements on Requirements Evolution. Sixth International Conference on EmergingTrends in Engineering and Technology- ICETET 2013, IEEE computer Society, 2013. 61-68. doi: 10.1109/ICETET.2013.15.

[2] Khatter, K. & Kalia, A. (2014). Quantification of Non-Functional Requirements. Sixth International Conference on Contemporary Computing(IC3), 224-229, IEEE computer Society , doi: 1109/IC3.2014.6897177

[3] Zheng, J., Williams, L., Nagappan, N., Snipes, W., Hudepohl, J. and Vouk, M. (2006). On the value of static analysis for fault detection insoftware. IIEEE Trans. Software Eng., 32(4), pp.240-253.

[4] Abaei, G. and Selamat, A. (2013). A survey on software fault detection based on different prediction approaches. Vietnam J Comput Sci, 1(2), pp.79-95.

[5] James S. Peters & Witold Pedrycz, Software Engineering an Engineering Approach, Wiley Publications, 2000

[6] Koren, I. and Krishna, C. (2007). Fault-tolerant systems. Amsterdam: Elsevier/Morgan Kaufmann.

[7] A. G. Koru & H. Liu, Building e®ective defect-prediction models in practice, IEEE Software 22 (2005) 23–29.

[8] Ma B., Zhang H., Chen G., Zhao Y. & Baesens B., Investigating Associative Classification for Software Fault Prediction: An Experimental Perspective, International Journal of Software and Knowledge Engineering, 24, 61 (2014). DOI: 10.1142/S021819401450003X

[9] G. Denaro, M. Pezze and S. Morasca, Towards industrially relevant fault-proneness models, Int. J. Software Engineering & Knowledge Engineering 13(2003) 395–414.

[10] T. Khoshgoftaar, N. Seliya and K. Gao, Assessment of a new three-group software quality classi¯cation technique: An empirical case study, EmpiricalSoftware Engineering 10 (2005) 183–218.

[11] S. Zhong, T. M. Khoshgoftaar and N. Seliya, Unsupervised learning for expert-based software quality estimation, in Eighth IEEE InternationalSymposium on High Assurance Systems Engineering, 2004, pp. 149–155.

[12] F. Xing, P. Guo and M. R. Lyu, A novel method for early software quality prediction based on support vector machine, in Sixteenth IEEE InternationalSymposium onSoftware Reliability Engineering, Chicago, IL, USA, 2005, pp. 213–222.

[13] S. Kanmani, V. R. Uthariaraj, V. Sankaranarayanan and P. Thambidurai, Object oriented software quality prediction using general regression neuralnetworks, SIGSOFT Softw. Eng. Notes 29 (2004) 1–6.

[14] Z. Jun, Cost-sensitive boosting neural networks for software defect prediction, Expert Systems with Applications 37 (2010) 4537–4543.

[15] T. Menzies, J. DiStefano, A. Orrego and R. Chapman, Assessing predictors of software defects, in Predictive Software Models Workshop, 2004, pp. 1–4.

[16] B. Turhan and A. Bener, Analysis of naive Bayes' assumptions on software fault data: An empirical study, Data & Knowledge Engineering 68 (2009)278–290.

[17] Fenton, N., Neil, M., Marsh, W., Hearty, P., Marquez, D., Krause, P., Mishra, R. Predicting Software Defects in Varying Development Lifecyclesusing Bayesian Nets

[18] Gregg Rothermel , Roland H. Untch and Chengyun Chu, and Mary Jean Harrold, “Prioritizing Test Cases For Regression Testing”, IEEE Transactionson Software Engineering, Vol. 27, No. 10, October 2001

[19] Janes, A.; Scotto, M.; Pedrycz, W.; Russo, B.; Stefanovic, M.; Succi, G. Identification of Defect-prone Classes in Telecommunication SoftwareSystems using Design Metrics. Information Sciences, 176, 24(2006), pp. 3711- 3734.

[20] C. Catal, U. Sevim, B. Diri, “Clustering and metrics thresholds based software fault prediction of unlabeled program modules”, 6th Int’l. Conferenceon Information Technology: New Generations, IEEE Computer Society, Las Vegas, Nevada, 2009.

[21] Seliya, N.; Khosh goftaar, T. M. Software Quality Analysis of Unlabeled Program Modules with Semi-supervised Clustering. IEEE Transactions onSystems, Man and Cybernetics-Part A: Systems and Humans, 37, 2(2007), pp. 201-211.

[22] M. Hall, G. Holmes, Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, IEEE Trans. Knowledge and Data Eng. 15 (6)(2003) 1437-1447

[23] Zhong, S.; Khoshgoftaar, T. M.; Seliya, N. Unsupervised Learning for Expert-based Software Quality Estimation. Proc. of the 8th Intl. Symp. on HighAssurance Systems Eng. Tampa, FL, 2004, pp. 149-155.



[24] Sayyad Shirabad, J. and Menzies, T.J. (2005) The PROMISE Repository of Software Engineering Databases. School of Information Technology andEngineering, University of Ottawa, Canada.

[25] Boetticher, G. D. (2005),” Nearest neighbor sampling for better defect prediction”, In ACM SIGSOFT Software Engineering Notes, Vol. 30, No. 4, pp.1-6.

[26] El Emam, K., Melo, W., & Machado, J. C. (2001), “The prediction of faulty classes using object-oriented design metrics”, Journal of Systems andSoftware, 56(1), 63-75.

[27] Voulgaris, Z., &Magoulas, G. D. (2008),”Extensions of the k nearest neighbour methods for classification problems”, In the Proceedings of the 26thIASTED International Conference on “Artificial Intelligence and Applications”,Innsbruck, Austria, February ,Vol. 13, pp. 23-28.

[28] Jiang, Y., Cukic, B., & Menzies, T. (2007),”Fault prediction using early lifecycle data”, In Software Reliability, 2007. ISSRE'07. The 18th IEEEInternational Symposium on pp. 237-246.

[29] Hall, T., Beecham, S., Bowes, D., Gray, D. and Counsell, S. (2012). A Systematic Literature Review on Fault Prediction Performance in SoftwareEngineering. IIEEE Trans. Software Eng., 38(6), pp.1276-1304.

[30] K.O. Elish, M.O. Elish, Predicting defect-prone software modules using support vector machines, Journal of Systems and Software 81 (5) (2008) 649–660.

[31] Adnan D. (2009). Modeling and Reasoning with Bayesian Networks. New York, NY:Cambridge University Press

[32] Friedman N., Linial M, Nachman, I. &Pe'er D. (2000). Using Bayesian networks to analyze expression data. In: Proceedings of the fourth annualinternational conference on Computational molecular biology (RECOMB '00), 127-135. DOI=10.1145/332306.332355 http://doi.acm.org/10. 1145/332306.332355

[33] Hanley, J. A., & McNeil, B. J. (1983). A Method of Comparing the Areas under Receiver Operating Characteristic Curves Derived from the SameCases, Radiology, 148, 839-843.

BIOGRAPHY

Dr. Kiran Khatter is currently working as Assistant Professor in the School of Engineering & Technology at Ansal University, Gurgaon. She is awarded Ph.D. in Computer Science from Himachal Pradesh University, Summer Hill, Shimla. She has done M.Tech (IT) from Punjabi University, Patiala in 2004. She has over 12 years of experience in IT industry and Academics. She has a number of publications in journal & international conference proceedings of repute. Her research interests span over Software Engineering, Image Processing, Data Mining and Big Data analytics. She has keen interest in computer programming. She is an Oracle Certified Java SE 6 Programmer. She has done training in ABAP programming and handled ABAP Projects.

Dr. Arvind Kalia is currently working as a Professor at Himachal Pradesh University, Shimla. He has done Master of Computer Applications from Thapar Institute and Ph.D from Punjabi University, Patiala. He has authored two books with National and International Publishers. He has good number of publications in national and international journals of repute. He has an experience of over 25 years and has been interested in the area of Software Engineering and Data mining. Currently, he is involved in research related to Software Quality, Component Based Systems and Open Software Systems. He is also a life member of Computer Society of India and Indian Science Congress.


investigating software detection methods - ieomieomsociety.org/ieom_2016/pdfs/702.pdf ·...

Documents