
IEICE TRANS. INF. & SYST., VOL.E100–D, NO.12 DECEMBER 2017

LETTER

Deep Learning-Based Fault Localization with Contextual Information

Zhuo ZHANG†a), Nonmember, Yan LEI††,†††,††††b), Member, Qingping TAN†c), Xiaoguang MAO†, Ping ZENG†, and Xi CHANG†, Nonmembers

SUMMARY Fault localization is essential for solving the issue of software faults. Aiming at improving fault localization, this paper proposes a deep learning-based fault localization with contextual information. Specifically, our approach uses a deep neural network to construct a suspiciousness evaluation model to evaluate the suspiciousness of a statement being faulty, and then leverages dynamic backward slicing to extract contextual information. The empirical results show that our approach significantly outperforms the state-of-the-art technique Dstar.
key words: fault localization, dynamic slice, deep learning, contextual information

1. Introduction

Software faults are inevitable with software's increasing scale and complexity. However, debugging is an expensive and time-consuming process. Thus, many efforts have been made to provide assistance in guiding programmers to the locations of faults that cause incorrect outputs [1]. The core work of fault localization is to evaluate which statements are suspiciously faulty. Among existing localization methods, spectrum-based fault localization (SBFL) is the most popular one, using spectrum-based suspiciousness formulas to assign suspiciousness values of being faulty to program statements [2]. Xie et al. [2], [3] have theoretically shown the maximal effectiveness that SBFL can reach, that is, the potential of SBFL has been fully explored. Thus, it is necessary to propose a novel approach different from the view of SBFL to evaluate which statements are suspiciously faulty for further improvement.

We notice that machine learning techniques have been successfully applied in the field of fault localization [4], [5]. However, many models have drawbacks such as restricted generalization ability for intricate problems and inadequate

Manuscript received June 30, 2017.
Manuscript revised August 23, 2017.
Manuscript publicized September 8, 2017.
†The authors are with College of Computer, National University of Defense Technology, Changsha, 410073 China.
††The author is with Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, Chongqing, 400044 China.
†††The author is with School of Software Engineering, Chongqing University, Chongqing, 400044 China.
††††The author is with Logistical Engineering University, Chongqing 401331, China.
a) E-mail: [email protected]
b) E-mail: [email protected] (Corresponding author)
c) E-mail: eric [email protected] (Corresponding author)
DOI: 10.1587/transinf.2017EDL8143

ability in using limited data sets to express complicated functions. With these shortcomings, the faults cannot be analyzed exactly. In recent years, deep learning has witnessed rapid development to tackle these limitations. With its tremendous improvement in robustness and accuracy, deep learning has rapidly evolved into making sense of data such as images, text, and sound [6]. It means that with this technology, we could construct a deep neural network with multiple hidden layers to extract features from the huge amount of data collected from test cases, and then assign different suspiciousness values to statements and rank them. Another drawback of many techniques is that they ignore the contextual information of the relationships among suspicious statements [4]. Hence, program slicing technology could be taken into consideration for its ability to extract data and/or control dependencies of program statements and to select a subset of statements affecting the incorrect output [7], [8]. There are static slicing and dynamic slicing, where dynamic slicing gathers run-time information along the execution path and cuts down the size of the slice noticeably in comparison with static slicing [9]. Hence, we use dynamic slicing to construct the context.

Thus, we propose a fault localization approach using a deep neural network and a dynamic backward slicing technique. Our approach adopts a deep neural network to assign different suspiciousness values to the statements; then we construct a suspicious context with the utilization of dynamic slicing to slice the output of a failed test case [10]. Consequently, our approach provides a useful context with its elements ranked in descending order to guide developers' examination. We conduct an empirical study on 10 representative open-source Java programs. The results show that our approach can reduce from 43.48% up to 88.85% of the average cost of examined code compared with the state-of-the-art technique Dstar [11].

2. Approach

2.1 Overview

A deep neural network is an artificial neural network with multiple hidden layers; each node at the same hidden layer uses the same nonlinear function to map the feature input from the layer below. A deep neural network's structure is very flexible and demonstrates excellent capacity to fit the highly complex nonlinear relationship between inputs and

Copyright © 2017 The Institute of Electronics, Information and Communication Engineers


Fig. 1 The coverage of M executions.

Fig. 2 Virtual test cases.

outputs. The basic idea of our approach is to adopt a deep neural network to quantify the suspiciousness of the statements, and then use dynamic backward slicing to construct a suspicious context with the failed outputs. Since the suspicious context defined in our approach can show useful information of data/control dependencies and present its statements with different suspiciousness in descending order, it can help developers to distinguish which statements are more suspicious and should be assigned priority to be checked.

Evaluating suspiciousness using deep neural network: We adopt a deep neural network to compute the suspiciousness of each statement. First, a deep neural network is constructed with three parts: the input layer, hidden layers, and output layer. Then this model is trained using the test cases' results and the statement coverage data, which indicates whether each statement is executed or not, as input; we further obtain the suspiciousness of each statement by testing the trained model with a virtual test suite. Specifically, given a program P with N executable statements, it is executed by M test cases T which contain at least one failed test case (see Fig. 1). xij = 0 indicates that statement j is not executed under test case i, and xij = 1 otherwise. The error vector e represents the test results. The element ei is equal to 0 if test case i passed, and 1 otherwise. Based on the coverage vectors as input, the network is trained iteratively. The complex nonlinear relationship between the statements' coverage information and the test cases' results can be reflected after training. At last, a set of virtual test cases (see Fig. 2) forming an N-dimensional unit matrix is constructed as the testing input; the output vector is the suspiciousness of the statements.
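As a concrete sketch of this data layout (the coverage matrix, the error vector, and the identity-matrix virtual test suite), the following uses illustrative values and a random placeholder scorer in place of the trained network, just to show the plumbing:

```python
import numpy as np

# Rows of `coverage` are test cases, columns are statements;
# coverage[i][j] = 1 iff statement j runs under test case i.
# Values here are illustrative, not from the paper.
coverage = np.array([
    [1, 1, 0, 1],   # test 1
    [1, 0, 1, 1],   # test 2
    [1, 1, 1, 0],   # test 3
])
errors = np.array([1, 0, 0])    # e_i = 1 iff test case i failed

n_stmts = coverage.shape[1]

# The virtual test suite is the N x N identity matrix: virtual test j
# "executes" only statement j, so the model's output for row j is read
# as the suspiciousness of statement j.
virtual_tests = np.eye(n_stmts)

# Placeholder for the trained model: a random linear scorer stands in
# for the deep network here.
rng = np.random.default_rng(0)
w = rng.random(n_stmts)
suspiciousness = virtual_tests @ w        # one score per statement
ranking = np.argsort(-suspiciousness)     # statements, most suspicious first
```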

After suspiciousness calculation by the deep neural network, a ranking list of all statements is produced in descending order of their suspiciousness. In our model (see Fig. 3), there is one input layer whose number of nodes corresponds to the number of executable statements, one output layer whose number of nodes corresponds to the test cases' scale, and an appropriate number of hidden layers with the number of nodes in each one estimated by the following formula:

number = round(n/30 + 1) ∗ 10 (1)

Fig. 3 Architecture of our model.

where n represents the number of executable statements. The hidden layers extract features from the input layer. The transfer function reflects the complex relationship between input and output; we use the relu function as the nonlinear transfer function in the hidden layers and the sigmoid function to output the result:

relu(x) = max(0, x) (2)

sigmoid(x) = 1/(1 + e^(−x)) (3)

where x is the output vector from the last layer. The learning rate impacts the speed of convergence. In our model, a dynamically adjusted learning rate is adopted: larger learning rate values make large changes at the beginning of the training procedure, and decreasing the learning rate later makes smaller training updates to the weights, so this method can find good weights more quickly. In the following formula, one Epoch means completing all training data once, LR represents the learning rate, DropRate is the amount by which the learning rate is modified each time, and EpochDrop is how often the learning rate is changed.

LR = LR × DropRate^((Epoch+1)/EpochDrop) (4)

The impulse factor is adjusted according to the size of the sample. We use the back propagation algorithm to fine-tune the parameters (weights and biases) of the model, and the goal is to minimize the difference between the error vector e (see Fig. 1) and the training result y.
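The formulas above can be sketched as small helpers. Two caveats: the worked example later in the letter (n = 16 gives 10 nodes per hidden layer) implies that the rounding in Eq. (1) truncates, so floor is assumed here; and flooring the exponent in Eq. (4) follows the common step-decay convention rather than anything stated in the text.

```python
import math

def hidden_nodes(n):
    # Eq. (1): nodes per hidden layer for n executable statements.
    # floor is an assumption inferred from the n = 16 -> 10 example.
    return math.floor(n / 30 + 1) * 10

def relu(x):
    # Eq. (2): nonlinear transfer function in the hidden layers.
    return max(0.0, x)

def sigmoid(x):
    # Eq. (3): squashes the final output into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def decayed_lr(initial_lr, drop_rate, epoch_drop, epoch):
    # Eq. (4): step decay -- large updates early, smaller ones later.
    return initial_lr * drop_rate ** ((epoch + 1) // epoch_drop)
```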

Construct suspicious contexts: The suspicious context is constructed by the dynamic slice of the faulty output of a failed test case. Specifically, a dynamic slice consists of an execution trace and a slicing criterion. We randomly choose a failed test case and use its execution trace as the execution trace for the dynamic slice. Furthermore, we use the output statement whose variable outputs the faulty result of the failed test case as the slicing criterion. Thus, we can obtain a dynamic slice. Since the dynamic slice shows how a set of statements affects the faulty output to cause a failure, we define it as the suspicious context.
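A minimal sketch of this step, assuming the dynamic data/control dependence edges of the failed run have already been recorded (the edge set below is illustrative, not from the paper): the backward slice is simply reverse reachability over the dependence edges, starting from the slicing criterion.

```python
from collections import deque

def backward_slice(deps, criterion):
    """deps maps each statement to the set of statements it depends on
    (dynamic data/control dependencies of one failed execution).
    Returns every statement reachable backwards from the criterion."""
    seen, work = {criterion}, deque([criterion])
    while work:
        for d in deps.get(work.popleft(), ()):
            if d not in seen:
                seen.add(d)
                work.append(d)
    return seen

# Illustrative dependence edges for a failed run whose faulty output
# is produced by statement s14.
deps = {"s14": {"s7", "s3"}, "s7": {"s1"}, "s3": {"s1"}}
context = backward_slice(deps, "s14")
```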

2.2 An Illustrative Example

Figure 4 shows an example illustrating how our approach is applied to a program P with a faulty statement s3. The cells below each statement represent whether the statement


Fig. 4 Example illustrating our approach based on deep learning.

is covered by the test case shown in the left cell (1 for executed and 0 otherwise). The rightmost cells indicate whether the test case failed or not (1 for fail and 0 otherwise). The concrete process is as follows: Firstly, construct the deep learning model with the number of input layer nodes being 16, 3 hidden layers with 10 nodes each according to Eq. (1), and the number of output layer nodes being 1. Secondly, we input the vector t1 (1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0) and its result 1, then vector t2 (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0) and its result 0 into the input layer, until the coverage data and execution results are all input into the network. After that, we train the network iteratively to obtain the final relationship between the coverage information and the test cases' results. Thirdly, we construct the virtual test set, which is a 16-dimensional unit matrix, and put it into the network to get the suspiciousness values. Finally, dynamic slicing is utilized to get the context and rank the statements in the context by using a slicing criterion (t1, s14, x), where t1 is a failed test case and s14 is the statement that outputs the faulty result with its variable x.

Based on this information, Dstar and the deep neural network both output a ranking list of all statements in descending order. It reveals that the faulty statement s3 is ranked 8th by Dstar and 4th by the deep neural network, while it is ranked 1st in the context {s1, s3, s7, s14} by our approach; the statements in the context are all responsible for the failed output of t1. Therefore, our approach obtains a better localization result than the original Dstar.

3. An Experimental Study

3.1 Experimental Setup

The experiment selects 10 Java programs. Table 1 lists the name of these programs, a function description, the number of faulty versions, the lines of code, and the number of test

Table 1 Subject programs.

Program       Description          Versions  LoC    Test
Print token1  Lexical analyzer     5         587    4,071
Print token2  Lexical analyzer     10        571    4,056
Schedule1     Priority scheduler   9         422    2,650
Schedule2     Priority scheduler   23        411    2,710
Tot info      Information measure  37        173    1,052
Jtcas         Collision avoidance  14        198    1,608
NanoXML v1    XML parser           7         5,369  206
NanoXML v2    XML parser           7         5,650  206
NanoXML v3    XML parser           10        8,392  206
NanoXML v4    XML parser           7         8,795  206

cases. The 6 small-scale subject programs and 4 large-scale subject programs are all widely used in the fault localization community.

Our experiments adopt the widely used metrics EXAM [8] and RImp [12]. The former is defined as the percentage of executable statements to be examined before finding the actual faulty statement; the latter compares the total number of statements that need to be examined to find all faults by our approach versus the number that need to be examined by using another fault localization technique. Lower values of both metrics show better improvement. As a reminder, our system uses EMMA† to collect coverage information and JSlice†† to perform slicing.
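Under these definitions, both metrics reduce to simple ratios; the function and parameter names below are ours, not from the cited papers:

```python
def exam(rank_of_fault, n_executable):
    # EXAM: percentage of executable statements examined before
    # reaching the actual faulty statement.
    return 100.0 * rank_of_fault / n_executable

def rimp(examined_by_ours, examined_by_other):
    # RImp: total statements examined by our approach relative to
    # another technique; lower is better.
    return 100.0 * examined_by_ours / examined_by_other
```

For instance, the Print tokens comparison reported below corresponds to rimp(34, 305) ≈ 11.15%.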

3.2 Data Analysis

Since our approach combines the effects of deep learning and context (dynamic slicing), the comparison has four cases: Deep Learning (context), Deep Learning, Dstar (context), and Dstar, among which Deep Learning (context) is our proposed approach. Figure 5 shows the EXAM distribution among the four cases in all faulty versions. As shown in Fig. 5,

†EMMA, http://emma.sourceforge.net/
††JSlice, http://jslice.sourceforge.net/


Fig. 5 EXAM comparison among our approach, Deep Learning, Dstar (context), and Dstar.

Fig. 6 RImp of our approach over Deep Learning, Dstar (context), and Dstar on all faulty versions of each small-scale program.

our approach significantly outperforms the other three techniques. Furthermore, we have the effectiveness relationship:
• Deep Learning (context) > Dstar (context) and Deep Learning > Dstar.
• Deep Learning (context) > Deep Learning and Dstar (context) > Dstar.
The relationship means that both deep learning and context have a beneficial effect on fault localization. Based on the gap between each pair of curves, the effect of context is a bit higher than that of deep learning.

Figures 6 and 7 illustrate the RImp score of our approach over Deep Learning, Dstar (context), and Dstar on all faulty versions of each program†. The data in the cells show the detailed RImp values, and the fraction in the parentheses shows the computation of RImp, in which the numerator is the total number of statements examined by our approach to locate all faults in all faulty versions of a program and the denominator is the total number of

†Complete data of RImp on each fault version are available at https://github.com/ifuleyou/Rimp/blob/master/RImp.txt

Fig. 7 RImp of our approach over Deep Learning, Dstar (context), and Dstar on all faulty versions of each large-scale program.

the statements examined by the other fault localization technique. Take Print tokens in Dstar as an example. Our approach needs to check a total of 34 statements to locate all faults of Print tokens, while Dstar requires examining a total of 305 statements to locate all faults of Print tokens. It means our approach takes up 34/305 = 11.15% of the examined statements of Dstar to locate all faults in Print tokens. We can observe that our approach sharply decreases the number of statements that need to be examined compared with the three techniques, accounting for a range from 26.98% to 92.86% of the examined statements of Dstar (context), 26.14% to 66% of the examined statements of Deep Learning, and 11.15% to 56.52% of the examined statements of Dstar. In conclusion, our approach is more effective than Deep Learning, Dstar (context), and Dstar.

4. Conclusion

In this paper, we propose an approach using a deep neural network to establish an effective suspiciousness model and


using dynamic slicing to construct contexts for fault localization, and the results show a preliminary benefit on fault localization. In the future, we plan to improve the accuracy of our approach and explore more of the potential of deep learning for fault localization.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant No.61502296, No.61602504 and No.61672529, the Scientific Research Fund of Hunan Provincial Education Department (Grant No.15A007), and the Municipal Natural Science Foundation of Shanghai (Grant No.15ZR1417000).

References

[1] W.E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A survey on software fault localization,” IEEE Trans. Softw. Eng., vol.42, no.8, pp.707–740, 2016.

[2] X. Xie, T.Y. Chen, F.-C. Kuo, and B. Xu, “A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization,” ACM Trans. Software Engineering and Methodology, vol.22, no.4, Article No.31, 2013.

[3] X. Xie, F.-C. Kuo, T.Y. Chen, S. Yoo, and M. Harman, “Provably optimal and human-competitive results in SBSE for spectrum based fault localisation,” Proc. 5th Symposium on Search-Based Software Engineering (SSBSE 2013), Lecture Notes in Computer Science, vol.8084, pp.224–238, Springer, 2013.

[4] W. Zheng, D. Hu, and J. Wang, “Fault localization analysis based on deep neural network,” Mathematical Problems in Engineering, vol.2016, Article ID 1820454, 11 pages, 2016.

[5] W.E. Wong, V. Debroy, R. Golden, X. Xu, and B. Thuraisingham, “Effective software fault localization using an RBF neural network,” IEEE Trans. Reliab., vol.61, no.1, pp.149–169, 2012.

[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.521, no.7553, pp.436–444, 2015.

[7] Y. Lei, X. Mao, Z. Dai, and C. Wang, “Effective statistical fault localization using program slices,” Proc. 36th Annual International Computer Software and Applications Conference (COMPSAC 2012), pp.1–10, 2012.

[8] X. Mao, Y. Lei, Z. Dai, Y. Qi, and C. Wang, “Slice-based statistical fault localization,” Journal of Systems and Software, vol.89, no.1, pp.51–62, 2014.

[9] B. Korel and J. Laski, “Dynamic program slicing,” Information Processing Letters, vol.29, no.3, pp.155–163, 1988.

[10] Z. Zhang, X. Mao, Y. Lei, and P. Zhang, “Enriching contextual information for fault localization,” IEICE Trans. Inf. & Syst., vol.E97-D, no.6, pp.1652–1655, June 2014.

[11] W.E. Wong, V. Debroy, R. Gao, and Y. Li, “The Dstar method for effective software fault localization,” IEEE Trans. Reliab., vol.63, no.1, pp.290–308, 2014.

[12] V. Debroy, W.E. Wong, X. Xu, and B. Choi, “A grouping-based strategy to improve the effectiveness of fault localization techniques,” Proc. 10th International Conference on Quality Software (QSIC 2010), pp.13–22, 2010.