


www.elsevier.com/locate/jss

The Journal of Systems and Software 79 (2006) 891–903

Effective program debugging based on execution slices and inter-block data dependency

W. Eric Wong *, Yu Qi

Department of Computer Science, University of Texas at Dallas, MS EC 31, 2601 N. Floyd Road, Richardson, TX 75083, United States

Received 6 December 2004; accepted 20 June 2005. Available online 30 January 2006.

Abstract

Localizing a fault in program debugging is a complex and time-consuming process. In this paper we present a novel approach using execution slices and inter-block data dependency to effectively identify the locations of program faults. An execution slice with respect to a given test case is the set of code executed by this test, and two blocks are data dependent if one block contains a definition that is used by another block or vice versa. Not only can our approach reduce the search domain for program debugging, but it can also prioritize suspicious locations in the reduced domain based on their likelihood of containing faults. More specifically, we use multiple execution slices to prioritize the likelihood of a piece of code containing a specific fault. In addition, the likelihood also depends on whether this piece of code is data dependent on other suspicious code. A debugging tool, DESiD, was developed to support our method. A case study that shows the effectiveness of our method in locating faults on an application developed for the European Space Agency is also reported.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Software testing; Fault localization; Program debugging; Execution slice; Inter-block data dependency; DESiD

1. Introduction

No matter how much effort is spent on testing a program,1 it appears to be a fact of life that software defects are introduced and removed continually during software development processes. To improve the quality of a program, we have to remove as many defects in the program as possible without introducing new bugs at the same time. However, localizing a fault in program debugging can be a complex and time-consuming process. One intuitive way to debug a program when it shows some abnormal behavior is to analyze its memory dump. Another way is to insert print statements around suspicious code to print out the values of some variables. While the former is rarely used now because it might require an analysis of a tremendous amount of data, the latter is still used. However, users need to have a good understanding of how a program is executed with respect to a given test case that causes the trouble, and then insert only the necessary (i.e., neither too many nor too few) print statements at the appropriate locations. As a result, it is also not an ideal debugging technique for identifying the locations of faults.

0164-1212/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jss.2005.06.045

* Corresponding author. Tel.: +1 972 883 6619; fax: +1 972 883 2349. E-mail address: [email protected] (W.E. Wong). URL: http://www.utdallas.edu/~ewong (W.E. Wong).

1 In this paper, we use "bugs", "faults" and "software defects" interchangeably. We also use "program", "application" and "software" interchangeably. In addition, "locating a bug" and "localizing a fault" have the same meaning.

To overcome the problems discussed above, debugging tools such as DBX and Microsoft VC++ debugger have been developed. They allow users to set break points along a program execution and examine values of variables as well as internal states at each break point, if so desired. One can think of this approach as inserting print statements into the program, except that it does not require the physical insertion of any print statements. Break points can be pre-set before the execution or dynamically set on the fly by the user in an interactive way during program execution. Two types of execution are available: a continuous execution from one break point to the next break point, or a stepwise execution starting from a break point. While using these tools, users must first decide where the break points should be. When the execution reaches a break point, they are allowed to examine the current program state and to determine whether the execution is still correct. If not, the execution should be stopped, and some analysis as well as backtracking may be necessary in order to locate the fault(s). Otherwise, the execution goes on either in a continuous mode or in a stepwise mode. In summary, these tools provide a snapshot of the program state at various break points along an execution path. One major disadvantage of this approach is that it requires users to develop their own strategies to avoid examining too much information for nothing. Another significant disadvantage is that it cannot reduce the search domain by prioritizing code based on the likelihood of containing faults on a given execution path.

To develop a more effective debugging strategy, program slicing has been used. In this paper, we propose a method based on execution slices and inter-block data dependency. For a given test case, its execution slice is the set of code (basic blocks2 in our case) executed by this test. An execution slice-based technique as reported in Agrawal et al. (1995) can be effective in locating some program bugs, but not others, especially those in code that is executed by both the failed and the successful tests. Another problem with this technique is that even if a bug is in the dice obtained by subtracting the execution slice of a successful test from that of a failed test, there may still be too much code that needs to be examined. We propose an augmentation method and a refining method to overcome these problems. The former includes additional code in the search domain for inspection based on its inter-block data dependency with the code which is currently being examined, whereas the latter excludes less suspicious code from the search domain using the execution slices of additional successful tests. Stated differently, the augmentation and refining methods are used to help programmers better prioritize code based on its likelihood of containing program bug(s) for effective fault localization. A case study was conducted to demonstrate the effectiveness of our method, and a tool, DESiD (which stands for a Debugging tool based on Execution Slices and inter-block data Dependency), was implemented to automate the slice generation, code prioritization, and data collection.

The rest of the paper is organized as follows. Section 2 describes our method. Section 3 explains our method with an example, and Section 4 reports a case study. Section 5 discusses issues related to our method. Section 6 presents an overview of some related studies. Finally, in Section 7 we offer our conclusions and recommendations for future research.

2 A basic block, also known as a block, is a sequence of consecutive statements or expressions containing no transfers of control except at the end, so that if one element of it is executed, all are. This of course assumes that the underlying hardware does not fail during the execution of a block. Hereafter, we refer to a basic block simply as a block.

2. Methodology

Suppose for a given program bug, we have identified one failed test (say F) and one successful test (say S). Suppose also that their execution slices are denoted by EF and ES, respectively. We can construct a dice D(1) = EF − ES. We present two methods to help programmers effectively locate this bug: an augmentation method to include additional code for inspection if the bug is not in D(1), and a refining method to exclude code from being examined if D(1) contains too much code.

2.1. Augmenting bad D(1)

If the bug is not in D(1), we need to examine additional code from the rest of the failed execution slice (i.e., code from EF − D(1)). Let us use U to represent the set of code that is in EF − D(1). For a block b, the notation b ∈ U implies b is in the failed execution slice EF but not in D(1). We also define a "direct data dependency" relation D such that b D D(1) if and only if b defines a variable x that is used in D(1) or b uses a variable y defined in D(1). We say that b is directly data dependent on D(1).

Instead of examining the code in U all at one time (i.e., giving all code in U the same priority), a better approach is to prioritize such code based on its likelihood of containing the bug. To do so, we use the augmentation procedure outlined as follows:

A1: Construct A(1), the augmented code segment from the first iteration, such that A(1) = {b | b ∈ U ∧ (b D D(1))}.

A2: Set k = 1.

A3: Examine code in A(k) to see whether it contains the bug.

A4: If YES, then STOP because we have located the bug.

A5: Set k = k + 1.

A6: Construct A(k), the augmented code segment from the kth iteration, such that A(k) = A(k−1) ∪ {b | b ∈ U ∧ (b D A(k−1))}.

A7: If A(k) = A(k−1) (i.e., no new code can be included from the (k−1)th iteration to the kth iteration), then STOP. At this point we have A(*), the final augmented code segment, which equals A(k) (and A(k−1) as well).

A8: Go back to step A3.

Stated differently, A(1) contains the code that is directly data dependent on D(1), and A(2) is the union of A(1) and additional code that is directly data dependent on A(1). It is clear that A(1) ⊆ A(2) ⊆ A(3) ⊆ ··· ⊆ A(*).
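For readers who prefer code, the fixed-point iteration A1–A8 can be sketched with plain set operations. This is an illustrative sketch, not the actual DESiD implementation; the integer block identifiers and the dependency predicate `dep` are hypothetical stand-ins for the def-use information a real tool would extract.

```python
def augment(EF, D1, dep):
    """Compute A(1), A(2), ..., A(*) following steps A1-A8.

    EF  -- set of blocks in the failed execution slice
    D1  -- set of blocks in the dice D(1) = EF - ES
    dep -- dep(b, c) is True if b defines a variable used in c,
           or b uses a variable defined in c (the relation D)
    Returns the list [A(1), A(2), ..., A(*)].
    """
    U = EF - D1                      # blocks in EF but not in D(1)
    # A1: code in U directly data dependent on D(1)
    A = {b for b in U if any(dep(b, c) for c in D1)}
    segments = [A]
    while True:
        # A6: add code in U directly data dependent on A(k-1)
        nxt = A | {b for b in U if any(dep(b, c) for c in A)}
        if nxt == A:                 # A7: fixed point reached -> A(*)
            return segments
        A = nxt
        segments.append(A)

# Toy run: D(1) = {1}; block 2 depends on 1, block 3 on 2,
# while blocks 4 and 5 are independent.
edges = {(2, 1), (3, 2)}
dep = lambda a, b: (a, b) in edges or (b, a) in edges
print(augment({1, 2, 3, 4, 5}, {1}, dep))   # [{2}, {2, 3}]
```

Note that blocks 4 and 5 never enter any A(k): the iteration grows only along data dependencies, which is exactly why it prioritizes U instead of examining all of it at once.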

Our incremental approach is that if a program bug is not in D(1), we then try to locate it starting from A(1), followed by A(2) and so on until A(*). For each iteration we include additional code that is data dependent on the previous augmented code segment. One exception is the first iteration, where A(1) is data dependent on D(1). The rationale behind this augmentation method is that even though a program bug is not in the suspicious area (such as D(1)), it is very likely that it has something to do with the code that is directly data dependent on the code in D(1) or some A(k) constructed at step A6. An example of this appears in Section 3. In addition, the results of our case study on a SPACE application (refer to Section 4.3.1) also indicate that our augmentation method is an effective approach to include additional code for locating a program bug when it is not in D(1).

One important point worth noting is that if the procedure stops at step A4, we have successfully located the bug. However, if the procedure stops at A7, we have constructed A(*), which still does not contain the bug. In this case, we need to examine the code that is in the failed execution slice (EF) but neither in A(*) nor in D(1) (i.e., code in EF − D(1) − A(*)). Although in theory this is possible, our conjecture is that in practice it does not seem to occur very often. The data listed in Table 2 show that most A(*)'s are good, i.e., they contain the bug. We understand that more empirical data need to be collected to verify this statement. However, even if the bug is in EF − D(1) − A(*), we at most examine the entire failed execution slice (i.e., every piece of code in EF). This implies that even in the worst case our augmentation method does not require a programmer to examine any extra code that is not in EF.

2.2. Refining good D(1)

Suppose a program bug is in D(1); we may still end up with too much code that needs to be examined. In other words, D(1) still contains too much code. If that is the case, we can further prioritize the code in this good D(1) by using additional successful tests. More specifically, we can use the refining procedure outlined as follows:

R1: Randomly select k additional distinct successful tests (say t1, t2, . . . , tk).

R2: Construct the corresponding execution slices (denoted by H1, H2, . . . , Hk).

R3: Set m = k.

R4: Construct D(m+1) such that D(m+1) = {b | b ∈ D(1) ∧ b ∉ Hn for n = 1, 2, . . . , m}, that is, D(m+1) = D(1) − ∪{Hn for n = 1, 2, . . . , m}.

R5: Examine code in D(m+1) to see whether it contains the bug.

R6: If YES, then STOP because we have located the bug.

R7: Set m = m − 1.

R8: If m = 0, then STOP and examine code in D(1).

R9: Go back to step R4.

Stated differently, D(2) = D(1) − H1 = EF − ES − H1, D(3) = D(1) − H1 − H2 = EF − ES − H1 − H2, etc. This implies D(2) is constructed by subtracting the union of the execution slices of two successful tests (ES and H1 in our case) from the execution slice of the failed test (EF in our case), whereas D(3) is obtained by subtracting the union of the execution slices of three successful tests (namely, ES, H1 and H2) from EF. Based on this definition, we have D(1) ⊇ D(2) ⊇ D(3), etc.

For explanatory purposes, assume k = 2. Our refining method suggests that we can reduce the code to be examined by first inspecting the code in D(3), followed by D(2) and then D(1). The rationale behind this is that if a piece of code is executed by some successful tests, then it is less likely to contain any fault. In this example, if D(3) or D(2) contains the bug, the refining procedure stops at step R6. Otherwise, we have the bug in D(1), but neither in D(2) nor in D(3). If so, the refining procedure stops at step R8, where all the code in D(1) will be examined.
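As a sketch (again with hypothetical block sets, not DESiD's actual data structures), the refining procedure R1–R9 amounts to successive set subtractions, examined from the most-reduced dice outward:

```python
def refine(EF, ES, H):
    """Dices to examine, in order, per steps R1-R9.

    EF -- execution slice of the failed test
    ES -- execution slice of the first successful test
    H  -- execution slices H1..Hk of k additional successful tests
    Returns [D(k+1), ..., D(2), D(1)]: the order of examination.
    """
    D1 = EF - ES
    dices = []
    for m in range(len(H), -1, -1):          # m = k, k-1, ..., 0
        removed = set().union(*H[:m]) if m else set()
        dices.append(D1 - removed)           # D(m+1) = D(1) - (H1 u ... u Hm)
    return dices

# Toy run with k = 2 additional successful tests.
EF, ES = {1, 2, 3, 4}, {1}
H1, H2 = {2}, {3}
print(refine(EF, ES, [H1, H2]))   # [{4}, {3, 4}, {2, 3, 4}]
```

The nesting D(1) ⊇ D(2) ⊇ D(3) falls out directly: each extra successful slice can only remove blocks from the dice, never add them.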

An important point worth noting is that at the beginning of this section, we assumed D(1) contains the bug and then focused on how to prioritize the code in D(1) so that the bug can be located before all the code in D(1) is examined. However, knowing the location of a program bug in advance is not possible except in a controlled experiment such as the one reported in Section 4. We respond to this concern in the next section.

2.3. An incremental debugging strategy

Effectively locating a bug in a program requires good expert knowledge of the program being debugged. If a programmer has a strong feeling about where a bug might be, (s)he should definitely examine that part of the program. However, if this does not reveal the bug, it is better to use a systematic approach, adopting heuristics that are supported by good reasoning and proven effective in case studies, rather than take an ad hoc approach without any intuitive support to find the location of the bug.

For explanatory purposes, let us assume we have one failed test and three successful tests. We suggest an incremental debugging strategy of first examining the code in D(3), followed by D(2) and then D(1). This is based on the assumption that the bug is in the code which is executed by the failed test but not the successful test(s). If this assumption does not hold (i.e., the bug is not in D(1)), then we need to inspect additional code that is in EF (the failed execution slice) but not in D(1). We can follow the augmentation procedure discussed earlier to start with the code in A(1), followed by A(2), . . . , up to A(*). If A(*) does not contain the bug, then we need to examine the last piece of code, namely, EF − D(1) − A(*). In short, we prioritize code in a failed execution slice based on its likelihood of containing the bug. The prioritization is done by first using the refining method in Section 2.2 and then the augmentation method in Section 2.1. In the worst case, we have to examine all the code in the failed execution slice. More discussion can be found in Section 5.
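Putting the two methods together, the incremental strategy can be sketched as a single driver. This is only a sketch: the predicate `contains_bug` stands in for the programmer's manual inspection of a code region, and the block sets and dependency predicate are hypothetical.

```python
def incremental_debug(EF, ES, H, dep, contains_bug):
    """Examine regions in the order of Section 2.3:
    D(k+1), ..., D(2), D(1), then A(1), ..., A(*), then the rest of EF."""
    D1 = EF - ES
    # Refining phase: most-reduced dice first.
    for m in range(len(H), 0, -1):
        D = D1 - set().union(*H[:m])
        if contains_bug(D):
            return D
    if contains_bug(D1):
        return D1
    # Augmentation phase: grow by direct data dependency.
    U = EF - D1
    A = {b for b in U if any(dep(b, c) for c in D1)}
    while True:
        if contains_bug(A):
            return A
        grown = A | {b for b in U if any(dep(b, c) for c in A)}
        if grown == A:               # reached A(*) without finding the bug
            break
        A = grown
    # Worst case: the remaining code in the failed execution slice.
    return EF - D1 - A

# Toy run: the bug is in block 5, which is not in D(1) = {1} but is
# directly data dependent on it, so the augmentation phase finds it.
edges = {(5, 1)}
dep = lambda a, b: (a, b) in edges or (b, a) in edges
found = incremental_debug({1, 2, 3, 4, 5}, {2, 3, 4, 5}, [], dep,
                          lambda region: 5 in region)
print(found)   # {5}
```

Even if every phase fails, the three return points together cover exactly EF, matching the worst-case bound stated above.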


3. An example

In this section, we use an example to explain in detail how our method can be used to prioritize suspicious code in a program for effectively locating a bug. Without resorting to complex semantics, we use a program P which takes two inputs a and b and computes two outputs x and y, as shown in Fig. 1. Suppose P has a bug at S8 and the correct statement should be "y = 2 * c + d" instead of "y = c + d". The execution of P on a test t1 where a = 1 and b = 2 fails because y should be 16 instead of 13. However, the execution of P on another test t2 where a = −0.5 and b = 3 does not cause any problem. This is because when c is zero (computed at S2) the bug is masked (i.e., there is no difference between "2 * c + d" and "c + d" when c is zero). As a result, we have found one failed test t1 and one successful test t2.

Fig. 1. Two execution slices, the corresponding D(1) and A(1) for the example discussed in Section 3. The reader is referred to the web version of this article for a display in color.

The execution slices for t1 (E1) and t2 (E2) are displayed in parts (a) and (b) of Fig. 1. Part (c) gives D(1), obtained by subtracting E2 from E1; it contains only one statement, S3. Our first step is to examine the code in D(1) which, unfortunately, does not contain the bug. Next, rather than trying to locate the bug from the code which is in E1 (the failed execution slice) but not in D(1) (which contains only S3), a better approach is to prioritize such code based on its likelihood of containing the bug. Stated differently, instead of giving S0, S1, S4, S6, S7, S8 and S9 the same priority, we give a higher priority to code that has a data dependency with S3. As shown in part (d), A(1) (the augmented code after the first iteration) contains additional code at S0 and S8. This is because the variable a used at S3 is defined at S0 through a scanf statement, and the variable c defined at S3 is used at S8. This also implies that S0 and S8 should have a higher priority and should be examined next. Since S8 is part of the code to be examined next, we conclude that we have successfully located the bug.

Note that A(2) (the augmented code after the second iteration) contains additional code at S1, S4, S6, S7 and S9, because S1 uses the variable a defined at S0; S4, S6 and S7 use the variable b defined at S0; and S9 uses the variable y defined at S8. Hence, although S1, S4, S6, S7 and S9 are not included in A(1), they will eventually be in A(*) (or, more precisely, A(2) in this case). However, there is no need to go through more than one iteration in this example, i.e., no need to examine S1, S4, S6, S7 and S9.
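The whole example can be replayed with set arithmetic. The slices E1 and E2 below are reconstructed from the description of Fig. 1 (an assumption, since the figure itself is not reproduced here: we take t1 to execute S3 where t2 executes S2), while the dependency pairs are exactly those stated in the text.

```python
E1 = {"S0", "S1", "S3", "S4", "S6", "S7", "S8", "S9"}  # failed test t1
E2 = {"S0", "S1", "S2", "S4", "S6", "S7", "S8", "S9"}  # successful test t2
# Direct def-use pairs stated in the text: a (S0-S3, S0-S1),
# b (S0-S4, S0-S6, S0-S7), c (S3-S8), y (S8-S9).
pairs = {("S0", "S3"), ("S0", "S1"), ("S0", "S4"), ("S0", "S6"),
         ("S0", "S7"), ("S3", "S8"), ("S8", "S9")}
dep = lambda a, b: (a, b) in pairs or (b, a) in pairs

D1 = E1 - E2                                            # {"S3"}
U = E1 - D1
A1 = {b for b in U if any(dep(b, c) for c in D1)}       # {"S0", "S8"}
A2 = A1 | {b for b in U if any(dep(b, c) for c in A1)}  # adds S1,S4,S6,S7,S9
print(sorted(D1), sorted(A1), sorted(A2 - A1))
```

Running this reproduces the narrative: D(1) = {S3}, A(1) = {S0, S8} (where the bug at S8 is found), and the second iteration would add S1, S4, S6, S7 and S9.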

4. A case study

We first provide an overview of SPACE and DESiD, the target program and the tool used in this study, followed by a description of the test case generation, fault set, and erroneous program preparation. Then, we present our results and an analysis based on these results.

4.1. Program and tool

SPACE provides a language-oriented user interface that allows the user to describe the configuration of an array of antennas by using a high level language (Cancellieri and Giorgi, 1994). Its purpose is to prepare a data file in accordance with a predefined format and characteristics from a user, given the array antenna configuration described in a language-like form. The application consists of about 10,000 lines of code.

A fundamental problem with applying the method discussed in Section 2 for debugging C code3 is that we need a tool to support this process. It is impossible to manually collect the execution slices and prioritize code in these slices, even for a small C program, because doing so can be very time consuming and mistake prone. With this in mind, we developed DESiD on top of ATAC, which is part of the Telcordia Software Visualization and Analysis Toolsuite (Horgan and London, 1991; vSuds User's Manual, 1998).4 Given a C program, ATAC instruments it at compile time by inserting probes at appropriate locations and builds an executable based on the instrumented code. For each program execution, ATAC saves the traceability, including how many times each block is executed by a test case, in a trace file. With this information we can use DESiD to compute the execution slice for each test in terms of a set of blocks. In addition, we can use DESiD to compute the corresponding D(1), D(2) and D(3) as well as A(1), A(2), . . . , A(*) for locating program bugs.

3 The method discussed in Section 2 also applies to other programming languages such as C++ and Java. For experiments and the associated tool development, however, we had to select a specific language (C in our case).

4 The tool vSlice (a slicing tool used in our previous study, Agrawal et al., 1995) can be viewed as a preliminary version of DESiD with limited features. More specifically, vSlice can only compute the code in D(1), but not in D(2) and D(3). Neither can vSlice compute the code in A(1), A(2), . . . , A(*). As a result, vSlice does not work if a program bug is not located in D(1).
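The slice computation itself is straightforward once per-test block counts are available. The trace layout below is a hypothetical stand-in for what ATAC records in its trace file; only the idea of a slice as "the blocks executed at least once" comes from the text.

```python
def execution_slice(trace):
    """An execution slice: the set of blocks a test executed at least once."""
    return {block for block, count in trace.items() if count > 0}

# Hypothetical per-test traces: block id -> execution count.
failed_trace     = {"b1": 3, "b2": 0, "b3": 1, "b4": 2}
successful_trace = {"b1": 1, "b2": 5, "b3": 1, "b4": 0}

EF = execution_slice(failed_trace)       # {"b1", "b3", "b4"}
ES = execution_slice(successful_trace)   # {"b1", "b2", "b3"}
print(EF - ES)                           # the dice D(1): {"b4"}
```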

DESiD provides two different user interfaces: a graphical user interface and a command-line text interface. The former allows programmers to select a failed test and up to three different successful tests through a user friendly graphical interface. In addition, it can also highlight the code in the corresponding failed execution slice, and in D(1), D(2), D(3), A(1), A(2), . . . , as well as A(*), with appropriate background colors,5 which allows programmers to easily visualize the part(s) of a program that should be examined first for locating a bug. In Fig. 2, part of the SPACE program is highlighted in red based on whether it is in D(1), D(2) or D(3). Part (a) shows three pieces of code in D(1), part (b) shows two pieces in D(2), and part (c) shows only one line of code in D(3). Clearly, D(1) subsumes D(2), which in turn subsumes D(3). On the other hand, code can be highlighted in blue based on whether it is in A(1), A(2), . . . , or A(*). Parts (b) and (c) of Fig. 3 show such code with respect to A(1) and A(2), respectively. We have A(1) as a subset of A(2). As explained in Section 2, it is up to the programmers to decide whether they should start with D(3), D(2) or D(1). If the bug is not in D(1), then they need to examine additional code in A(1), A(2), etc.

While the graphical user interface helps developers find the location of a bug which they do not know in advance, it is not convenient for a controlled experiment like the one reported here, because code visualization through such an interface cannot be automated. In the latter setting the fault locations are already known before the experiment, and the objective is not to find the location of a bug but to verify whether our method can help programmers effectively locate a bug, that is, whether the bug is in D(1), or D(2), or D(3), or A(1), or A(2), etc. To overcome this problem, DESiD provides a command-line text interface in which the appropriate routines can be invoked directly to compute the various dices and augmented code based on a given failed test and some successful tests. Moreover, such commands, which take as input the failed and successful tests, can be saved in a batch file for automatic execution. It is this approach that makes DESiD very robust for our case study.

4.2. Test case generation, fault set, and erroneous program preparation

An operational profile, as formalized by Musa (1993) and used in our experiment, is a set of the occurrence probabilities of various software functions. To obtain an operational profile for SPACE we identified the possible functions of the program and generated a graph capturing the connectivity of these functions. Each node in the graph represents a function. Two nodes, A and B, are connected if control can flow from function A to function B. There is a unique start node and a unique end node representing functions at which execution begins and terminates, respectively. A path through the graph from the start node to the end node represents one possible program execution. To estimate the occurrence probability of the software functions, each arc is assigned a transition probability, i.e., the probability of control flowing between the nodes connected by the arc. For example, if node A is connected to nodes B, C and D, and the probabilities associated with arcs A–B, A–C and A–D are, respectively, 0.3, 0.6 and 0.1, then after the execution of function A the program will perform functions B, C or D, respectively, with probability 0.3, 0.6 and 0.1. Transition probabilities were determined by running a set of representative test cases. Based on this profile, a pool of 1000 distinct test cases was created.

5 For interpretation of color in all the figures, the reader is referred to the web version of this article.

Fig. 2. Highlighting of code based on D(1), D(2) and D(3).
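Generating a test from such a profile amounts to a random walk over the transition graph. The sketch below encodes only the A–B/A–C/A–D example from the text; the node names, the start/end wiring, and the use of Python's `random.choices` are illustrative assumptions, not the actual SPACE test generator.

```python
import random

# Transition table: node -> list of (successor, probability).
graph = {
    "start": [("A", 1.0)],
    "A": [("B", 0.3), ("C", 0.6), ("D", 0.1)],  # the example in the text
    "B": [("end", 1.0)],
    "C": [("end", 1.0)],
    "D": [("end", 1.0)],
}

def random_path(graph, rng):
    """One possible program execution: a walk from 'start' to 'end'."""
    path, node = ["start"], "start"
    while node != "end":
        succs, probs = zip(*graph[node])
        node = rng.choices(succs, weights=probs)[0]
        path.append(node)
    return path

path = random_path(graph, random.Random(7))
print(path)   # e.g. ['start', 'A', 'C', 'end']
```

Repeating such walks, and keeping only distinct paths, yields a test pool whose function frequencies follow the operational profile.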

The fault set for the program used in this study was obtained from the error log maintained during its testing and integration phase. Ten faults were used in our study. For convenience, each fault is numbered Fk, where k runs from 01 to 10. The faults were injected one at a time, i.e., one erroneous program was created for each fault.


Fig. 3. Highlighting of code based on D(1), A(1) and A(2).

6 Since only two decimal places are used, the product may be rounded up or down to an integer. Interested readers can contact the authors for the exact number of D(1)'s for each fault. The same applies to other computations that use percentages expressed in this format.


4.3. Results and analysis

Table 1 lists a brief classification, based on a scheme recommended by IEEE (IEEE 1044, 1994), of each fault and the number of test cases in the test case pool that detect it. Also listed is the number of pairwise D(1)'s, which is the product of the number of failed tests and the number of successful tests. For example, of the 1000 test cases, 26 of them can detect F01; this implies there are 974 successful tests. It also implies there are 25,324 (i.e., 26 × 974) pairwise D(1)'s. Each pairwise D(1) is examined to see whether it contains the corresponding fault. The percentage of bad D(1)'s is listed in the second column of Table 2, whereas the number of such D(1)'s for each fault can be computed as the product of this percentage and the total number of pairwise D(1)'s from the last column of Table 1. The percentage of good D(1)'s can be obtained by subtracting the percentage of bad D(1)'s from 100; the number of good D(1)'s is the product of the good percentage and the total number of pairwise D(1)'s. Below, we first present the results and analysis for the bad D(1)'s, followed by those for the good D(1)'s.

4.3.1. Augmenting with respect to bad D(1)'s

For each bad D(1), the corresponding A(*) is computed and examined to see whether it contains the corresponding fault. The percentage of good A(*)'s is listed in the third column of Table 2. Let us use F01 as the example again. There are 312 (i.e., 1.23% × 25,324)6 bad D(1)'s, but none of the


Table 1
Classification, number of failed tests, and number of pairwise D(1)'s for each fault used in our study

Fault  Fault type                  Subtype                              # of failed tests  # of pairwise D(1)'s
F01    Logic omitted or incorrect  Missing condition test                26                 25,324
F02    Logic omitted or incorrect  Missing condition test                16                 15,744
F03    Computational problems      Equation insufficient or incorrect    36                 34,704
F04    Data handling problems      Data accessed or stored incorrectly   38                 36,556
F05    Logic omitted or incorrect  Forgotten cases or steps              35                 33,775
F06    Computational problems      Equation insufficient or incorrect    32                 30,976
F07    Computational problems      Equation insufficient or incorrect    32                 30,976
F08    Logic omitted or incorrect  Missing condition test                 6                  5,964
F09    Logic omitted or incorrect  Checking wrong variable               46                 43,884
F10    Logic omitted or incorrect  Checking wrong variable              320                217,600

Table 2
Percentage of bad D(1)'s and good A(*)'s, as well as the size distribution of A(*)'s

Fault  % of bad D(1)'s  % of good A(*)'s  Size distribution of A(*) in terms of % of the corresponding failed execution slice
                                          [0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,100]
F01    1.23             100               1.60    11.21    39.10    46.47    1.60     0        0
F02    0.61             100               30.20   25.00    33.33    11.45    0        0        0
F03    4.98             100               0       0        0        0        83.62    16.37    0
F04    78.69            100               0       1.47     14.40    31.28    12.30    39.24    1.27
F05    52.12            92.76             0.53    14.02    35.05    17.06    18.57    14.74    0
F06    0.62             100               0       31.77    55.20    13.02    0        0        0
F07    0.62             100               0       31.77    55.20    13.02    0        0        0
F08    2.82             100               1.19    24.40    51.78    22.61    0        0        0
F09    2.62             99.39             3.06    18.02    44.88    31.32    2.71     0        0
F10    22.06            86.82             0.53    13.87    41.46    31.33    9.26     3.51     0

898 W.E. Wong, Y. Qi / The Journal of Systems and Software 79 (2006) 891–903

Að�Þ’s are bad. The rest of Table 2 (columns four to ten)shows the size distribution of Að�Þ in terms of the percent-age of the corresponding failed execution slice. For exam-ple, there are 39.1% of all the Að�Þ’s for F01 with a sizebetween 20% and 30% of the corresponding failed executionslice. From this table, we make the following observations:

• Three faults (F04, F05 and F10) have a significant numberof bad Dð1Þ’s, which implies that examining code only inthese Dð1Þ’s cannot locate the bugs. Additional code hasto be included by using the method discussed in Section2.1. The same applies to other faults even though theydo not have as many bad Dð1Þ’s as these three.

• The percentage of good Að�Þ is either 100% or more than99% except for F05 and F10 which have 92.76% and86.82%, respectively. This implies that although the faultis not in Dð1Þ, it will be, with a very high probability, inAð�Þ.

• The majority of Að�Þ (such as those for F02, F06, F07, F08

and F09) have a size less than 30% of the correspondingfailed execution slice. Others, such as those for F01, F04,F05 and F10, have a size less than 40%. The only excep-tion is F03 of which most Að�Þ have a size between 40%and 50% of the corresponding failed execution slice.

Our data indicate that the augmentation method dis-cussed in Section 2.1 is an effective approach to includeadditional code for locating a program bug when it is not

in Dð1Þ. Moreover, many of the resulting Að�Þ only have asize between 10% and 20%, or 20% and 30%, or 30% and40% of the corresponding failed execution slice.

4.3.2. Refining with respect to good D(1)'s

The size distribution of good D(1)'s in terms of percentage of the corresponding failed execution slice is listed in Fig. 4. The notation Fk[a,b] represents the set of all good D(1)'s for Fk with a size between a% and b% of the failed execution slice. From Fig. 4, we find that 31.45% of all the good D(1)'s for F01 have a size between 20% and 30% of the corresponding failed execution slice. Also, from Tables 1 and 2, there are (100 − 1.23)% * 25,324 = 25,012 good D(1)'s. Together, there are 31.45% * 25,012 = 7866 D(1)'s in F01(20,30]. As explained in Footnote 6, since only two decimal places are used for representing these percentages, the exact number of D(1)'s may be rounded up or down to an integer. From this figure, we make the following observations:

• F04 has no good D(1)'s with a size less than 30% of the corresponding failed execution slice, that is, F04[0,20] = 0 and F04(20,30] = 0. Moreover, 96.12% of its good D(1)'s contain more than 40% of the corresponding failed execution slice.

• F05 and F10 have 40.67% and 31.8%, respectively, of all their good D(1)'s with a size greater than 40% of the corresponding failed execution slice.


Fig. 4. Size distribution of good D(1)'s. For example, 31.45% of all the good D(1)'s for F01 have a size between 20% and 30% of the corresponding failed execution slice.


• Other faults have 20% (or a little more) of all the good D(1)'s with a size greater than 40% of the corresponding failed execution slice.

After a careful examination, we find that the above discrepancy is due to program structure and the location of the fault.

Given a good D(1) with respect to a specific fault (i.e., the fault is located in D(1)), if its size is less than 20% of the corresponding failed execution slice, we say this D(1) is small enough^7 and can be used for effectively locating the fault. Otherwise, it should be further refined, as it still contains too much code that needs to be examined. The refinement can be done by using additional successful tests, as explained in Section 2.2.

Every good D(1) in each category (such as F01(20,30], F01(30,40], etc.) is obtained by subtracting the execution slice of a successful test (say S1) from the execution slice of a failed test (say F). Since there are too many such D(1)'s, we randomly select 20 of them from each category. For each selected D(1), we do the following:

^7 If this is not the case (i.e., 20% of the failed execution slice still contains too much code), we can follow the same refining procedure explained in Section 2.2 to further prioritize code in this D(1).

• Randomly select two more successful tests (say S2 and S3) such that S1, S2 and S3 differ from each other.

• Construct the corresponding D(2) by subtracting the execution slices of S1 and S2 from the execution slice of F, and D(3) by subtracting the execution slices of S1, S2 and S3 from the execution slice of F.

• Determine whether the corresponding fault is located in D(2) and D(3).

• Compute the sizes of D(2) and D(3) in terms of percentage of the corresponding failed execution slice. Suppose D(1), D(2) and D(3) have sizes a%, b% and c%, respectively. Compute also the differences a − b and a − c, which are, respectively, the size reductions due to the inclusion of the second and the third successful tests.

There should be 20 size reductions between D(1) and D(2), and 20 between D(1) and D(3). Two averages are computed: one with respect to all the a − b and the other with respect to all the a − c.
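The construction steps above can be sketched as plain set operations, with execution slices modeled as sets of basic-block ids. This is a simplified sketch under our own naming, not the DESiD implementation, and the slices below are hypothetical.

```python
def dice(failed_slice, *successful_slices):
    """D(n) = failed execution slice minus the union of n successful slices."""
    executed_by_success = set().union(*successful_slices)
    return failed_slice - executed_by_success

# Hypothetical block-level slices: F is a failed test; S1, S2, S3 succeed.
F  = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
S1 = {1, 2, 3, 4}
S2 = {1, 2, 5, 6}
S3 = {1, 2, 7}

d1 = dice(F, S1)          # D(1) = F - S1
d2 = dice(F, S1, S2)      # D(2) = F - (S1 | S2)
d3 = dice(F, S1, S2, S3)  # D(3) = F - (S1 | S2 | S3)

# Each refinement step can only shrink the suspicious code: D(3) <= D(2) <= D(1).
assert d3 <= d2 <= d1

# Sizes a%, b%, c% relative to the failed slice, and the reductions a-b, a-c.
a, b, c = (100 * len(d) / len(F) for d in (d1, d2, d3))
print(a - b, a - c)
```

Examining D(3) first, then D(2), then D(1) follows directly from this containment: the smaller dices are the most suspicious code.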

We repeat the above procedures for good D(1)'s selected from each category. Since each D(2) and D(3) contains less code than the corresponding D(1), one question that we need to answer is whether such a size reduction has a negative impact on locating the fault. Fig. 5 shows the percentages of bad D(2)'s and D(3)'s with respect to each category, i.e., those which do not contain the fault. For example, 10% and 20% of the D(2)'s and D(3)'s in the category F08(40,50] do not contain F08. Stated differently, of the 20 randomly selected good D(1)'s from this category, the refinement makes two of the resulting D(2)'s and four of the resulting D(3)'s bad because they do not contain the fault. From this figure, we observe that in many cases, the percentages of bad D(2)'s and D(3)'s are 10% or lower. This implies most D(2)'s and D(3)'s are good (i.e., they still contain the fault).

However, this is not true for F04 and F05, where the majority of D(2)'s and D(3)'s are not good. To explain this, we use F04 as the example. In order to construct a good D(1), D(2) and D(3) for this bug, we need to use one, two and three successful tests, respectively, which do not execute the code containing F04. However, after a careful examination, we notice that most test executions go through the code which contains F04. As a result, not only is the percentage of bad D(1)'s for F04 high (78.69%, as listed in Table 2), but the percentages of the corresponding bad D(2)'s and D(3)'s are high as well. The same reason applies to F05, which has 52.12% of bad D(1)'s. As for F10, we have mixed results. The percentages of bad D(2)'s and D(3)'s are high in some cases but low in others. One reason for this is that 22.06% of its pairwise D(1)'s are bad. This implies that while constructing D(2)'s and D(3)'s, the chance of selecting a successful test which executes the code containing F10 is relatively high. In short, if the percentage of good D(1)'s is low, the chance for the corresponding D(2)'s and D(3)'s to be good is also low.

Fig. 5. Percentages of bad D(2)'s and D(3)'s which do not contain the corresponding fault.

We now discuss the size reduction due to the refinement. Fig. 6 shows the average size reductions between D(1) and D(2) (represented by the lengths of the black bars) and the reductions between D(2) and D(3) (represented by the lengths of the green bars, or the white bars if the paper is printed in black and white). Since the green bars are appended at the right end of the black bars, the total length of the union of these two bars gives the reduction between D(1) and D(3). All these reductions are in terms of percentage of the corresponding failed execution slice. For example, as indicated in the figure, with respect to F08(40,50], the size reductions between D(1) and D(2), and between D(2) and D(3), are 18.45% and 9.54%, respectively. This also implies

Fig. 6. Size reductions between D(1) and D(2), and between D(2) and D(3). The reductions are in terms of percentage of the corresponding failed execution slice.

the reduction between D(1) and D(3) is 27.99%, the sum of 18.45% and 9.54%. From this figure, we make the following observations:

• The size reductions of D(2)'s and D(3)'s are about the same for every fault.

• The reductions between D(1) and D(2) are greater than those between D(2) and D(3), without exception.

• In general, D(1)'s with more code have a larger size reduction than those with less code. For example, the size reductions between D(1) and D(3) in Fk(60,100] and Fk(20,30] are about 40% and 6–10%, respectively, with a few exceptions.

To conclude, our data indicate that in most cases D(2)'s and D(3)'s generated based on good D(1)'s are still good. That is, if D(1) contains a program bug, then D(2) and D(3) are likely to contain the same bug as well. Some exceptions may occur, in particular when the percentage of good D(1)'s is low (i.e., too many pairwise D(1)'s are bad). Our data also indicate that the sizes of D(2) and D(3) can be much smaller than that of the corresponding D(1), which implies that we can significantly reduce the code that needs to be examined for locating a program bug. In short, the refining method discussed in Section 2.2 works for most good D(1)'s.

5. Discussion

We now discuss some important aspects related to the incremental debugging strategy proposed in Section 2.

First, the effectiveness of our debugging strategy depends on which successful tests are used. One good approach is to identify successful tests that are as similar as possible to the failed test (in terms of their execution slices) in order to filter out as much code as possible. In this way, we start the debugging with a very small set of suspicious code and then increase the search domain, if necessary, based on the augmentation and refining methods described earlier. In addition, the effectiveness also depends on the nature of the bug and the structure of the program.

Second, if a program fails not because of the current test but because of another test which fails to set up an appropriate execution environment for the current test, then we should bundle these two test cases together as one single failed test. That is, the corresponding failed execution slice is the union of the execution slices of these two tests.

Third, for a bug introduced by missing code, our debugging strategy can still work because, for example, some unexpected code may be included in the failed execution slice. That is, the missing code has an adverse impact on a decision, causing a branch to be executed even though it should not be. If that is the case, the abnormal execution path with respect to a given test can certainly help us realize that some code required for making the right decision is missing.

Finally, since testing and debugging are very closely related, a common practice is to fix the underlying program and remove the corresponding bug(s) as soon as some abnormal behavior is detected during the testing. Under this situation, programmers have one failed test in conjunction with some successful tests to be used for prioritizing code from the debugging point of view. This is the scenario discussed in this paper. However, a programmer may decide not to fix the bug right away for certain specific reasons. If so, (s)he may observe additional failures caused by the same fault(s) in subsequent executions. That is, for the given fault(s), there exists more than one failed test. If that is the case, instead of defining EF as the execution slice of a single failed test (refer to Section 2), the programmer may define it as the intersection of the execution slices of multiple failed tests. For example, EF can be defined as the intersection of the execution slice of a failed test ta and that of another failed test tb. This is because the code that is executed by both ta and tb is more likely to contain the bug than the code which is executed only by ta or tb but not both (Wong et al., 2005).
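As a sketch of this last variant (the block-level slices below are hypothetical), EF for several failed tests is simply the intersection of their execution slices:

```python
def intersect_failed_slices(*failed_slices):
    """EF as the intersection of the execution slices of all failed tests."""
    ef = set(failed_slices[0])
    for s in failed_slices[1:]:
        ef &= set(s)  # keep only blocks executed by every failing run
    return ef

# Hypothetical block-level slices of two failed tests ta and tb: blocks
# executed by both (2, 3 and 5) are the most suspicious.
ta = {1, 2, 3, 4, 5}
tb = {2, 3, 5, 8}
print(sorted(intersect_failed_slices(ta, tb)))
```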

6. Related studies

Program slicing (Weiser, 1984) is a commonly used technique for debugging (Weiser, 1982). Since a static slice might contain too much code to be debugged, Lyle and Weiser propose constructing a program dice (as the set difference of two groups of static slices) to further reduce the search domain for possible locations of a fault (Lyle and Weiser, 1987). An alternative is to use dynamic slicing (Agrawal and Horgan, 1990; Korel and Laski, 1988) instead of static slicing, as the former can identify the statements that indeed do, rather than just possibly could (as by the latter), affect a particular value observed at a particular location. A third alternative is to use execution slicing and dicing to locate faults in C programs (Agrawal et al., 1995) or in software architecture design in SDL (Wong et al., 2005). The reason for choosing execution slicing in this paper over static slicing or dynamic slicing is that a static slice focuses on finding code that could possibly have an impact on the variables of interest for any input, instead of code that indeed affects those variables for a specific input. This implies a static slice does not make any use of the input values that reveal the fault. It clearly violates a very important concept in debugging, which suggests that programmers analyze the program behavior under the test case that reveals the fault, not under a generic test case. The disadvantage of using dynamic slices is that collecting them may consume excessive time and file space, even though different algorithms have been proposed to address these issues. On the other hand, for a given test case, its execution slice is the set of code (basic blocks in our case) executed by this test. Such a slice can be constructed very easily. This is particularly true if we know the code coverage of a test, because the corresponding execution slice of this test can be obtained simply by converting the coverage data collected during the testing into another format: instead of reporting the coverage percentage, it reports which basic blocks are covered.
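That last conversion can be sketched directly. Assuming block-level coverage is recorded as a map from block id to hit count (a format we invent here for illustration, not a specific tool's output), the execution slice is just the set of blocks with a non-zero count:

```python
def execution_slice(block_coverage):
    """Convert a block-id -> hit-count map into an execution slice."""
    return {block for block, hits in block_coverage.items() if hits > 0}

# Hypothetical per-test coverage report: block id -> execution count.
coverage = {1: 3, 2: 0, 3: 1, 4: 0, 5: 7}
print(sorted(execution_slice(coverage)))  # blocks 1, 3 and 5 were executed
```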

Another debugging technique is to use execution backtracking. The major problem with this approach is that it may require a large amount of space to save the execution history and program states. To solve this problem, Agrawal et al. propose a structured backtracking approach without storing the entire execution history (Agrawal et al., 1993). They present a debugging technique based on dynamic slicing and backtracking, where a dynamic slice contains all the statements that actually affect the value of a variable at a program point for a particular execution of the program, and backtracking is used to restore the program state at a given location without having to re-execute the program from the beginning. More specifically, after a program failure is detected, its external symptoms are translated into the corresponding internal symptoms in terms of data or control problems in the program. Then one of the internal symptoms is selected for creating a dynamic slice. The next step is to guess the possible location of the fault that is responsible for the failure. This can be done by selecting a statement in the slice, restoring the program state to the state when the control last reached that statement, and examining values of some variables in the restored state. If this does not reveal the fault, one has three options: (1) further examining the restored state, (2) making a new guess by selecting another statement in the slice and restoring the program state accordingly, or (3) creating a new dynamic slice based on a different internal symptom. These steps are repeated until the fault is localized. A prototype tool, SPYDER, was implemented to support this technique. One problem with this approach is that translating externally visible symptoms of a failure into corresponding internal program symptoms can be very difficult in some cases. Another problem is that slices do not make explicit the dependences among multiple occurrences of the same statement. The third problem is that execution backtracking has its limitations, in particular if the execution of a statement can have impacts outside the boundaries of the program, e.g., in the operating system. Under this condition, only a partial backtracking is possible.

Korel presents a prototype of a fault localization assistant system (PELAS) to help programmers debug Pascal programs (Korel, 1988). He claims that PELAS is different from others in that it makes use of the knowledge of program structure to guide a programmer's attention to possible erroneous parts of the program. More specifically, such knowledge is derived from the program and its execution trace and is represented by a dependence network. This network is then used to discover probable sources of the fault(s). The rationale of the entire approach is that any program instructions in the execution trace from the beginning to a position of incorrectness could conceivably be responsible for the abnormal behavior of the program being debugged. However, his approach does not include the refining and augmentation methods described in this paper.

Jones et al. describe a visualization technique to assist in fault localization by illuminating the statements of a given program in different colors (Jones et al., 2002). For each statement, its color depends on the relative percentage of the successful tests that execute the statement to the failed tests that execute the statement. If the statement is executed by more successful tests than failed tests, its color appears greener, whereas the other way around (i.e., more failed tests than successful tests) makes the color redder. Two measurements are computed: a false negative in terms of how often a faulty statement is not colored in red, and a false positive in terms of how often a non-faulty statement is colored in red. Their results show that for some faults, this technique works well. However, for other faults, especially those in the initialization code or on the main path of the program, which are executed by most of the test cases, none of the faulty statements are colored in the red range. A possible way to overcome this problem is to use slice and data dependency analysis.
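A simplified sketch of this color mapping follows; the formula is our reconstruction of the ratio described above (the fraction of passed tests covering a statement relative to the combined passed and failed fractions), and the test counts are hypothetical. Values near 0 correspond to red (suspicious) statements, values near 1 to green ones.

```python
def hue(passed_cov, total_passed, failed_cov, total_failed):
    """0.0 = covered only by failed tests (red); 1.0 = only by passed (green)."""
    passed_ratio = passed_cov / total_passed
    failed_ratio = failed_cov / total_failed
    if passed_ratio + failed_ratio == 0:
        return None  # statement never executed; left uncolored
    return passed_ratio / (passed_ratio + failed_ratio)

# A statement covered by every failed test and no passed test is fully red;
# one covered only by passed tests is fully green.
print(hue(0, 95, 5, 5), hue(95, 95, 0, 5))
```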

7. Conclusion and future research

We propose a debugging method based on execution slices and inter-block data dependency for effectively locating program bugs. We start with code that is executed by the failed test but not by any successful test. To do so, we first construct D(1), which is obtained by subtracting the execution slice of a successful test from the execution slice of a failed test. Then, we follow the refining procedure discussed in Section 2.2 to construct, for example, D(2) and D(3) by subtracting the union of the execution slices of two and three successful tests, respectively, from the execution slice of a failed test. Since the likelihood of a piece of code containing a bug decreases if it is executed by some successful tests, we try to locate a bug by first examining D(3), followed by D(2) and then D(1). If we cannot find the bug in D(1), we follow the augmenting procedure discussed in Section 2.1 to first construct A(1), which is directly data dependent on D(1). In other words, A(1) contains code that either defines variables used in D(1) or uses variables defined in D(1). We examine code in A(1) to see whether it contains the bug. If not, we construct A(2), which is the union of A(1) and the set of code that is directly data dependent on A(1). If the bug is in A(2), we have successfully located it. Otherwise, we continue this process until A(*) is constructed and examined. If the bug is still not found, we then examine the last piece of code, namely, code in the failed execution slice but not in A(*) or D(1). In short, we propose an incremental debugging strategy to first examine the most suspicious code identified by using some heuristics supported by good reasoning. If the bug is not located, we then increase the search domain step by step by including additional code that is next most likely to contain the bug.
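The augmentation loop can be sketched as follows. The per-block def/use sets are assumed inputs (in DESiD they would come from dataflow analysis; the data below is hypothetical), and `augment` iterates the direct-data-dependency step until A(*) stops growing:

```python
def directly_dependent(blocks, defs, uses, all_blocks):
    """Blocks outside `blocks` that are data dependent on some block in it."""
    used = set().union(*(uses[b] for b in blocks))
    defined = set().union(*(defs[b] for b in blocks))
    return {b for b in all_blocks - blocks
            if defs[b] & used or uses[b] & defined}

def augment(d1, defs, uses):
    """Iterate A(n) = A(n-1) plus its direct dependents until A(*) is reached."""
    a = set(d1)
    while True:
        extra = directly_dependent(a, defs, uses, set(defs))
        if not extra:
            return a  # fixed point: this is A(*)
        a |= extra

# Hypothetical def/use sets per basic block.
defs = {1: {"x"}, 2: {"y"}, 3: set(), 4: {"z"}}
uses = {1: set(), 2: {"x"}, 3: {"y"}, 4: set()}

# Starting from D(1) = {3}: block 2 defines y (used in 3), then block 1
# defines x (used in 2); block 4 is never pulled in.
print(sorted(augment({3}, defs, uses)))
```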

We conducted a case study on a SPACE application to illustrate the effectiveness of our method in locating program bugs. A tool, DESiD, was implemented to automate the slice generation, code prioritization, and data collection. Our results are encouraging. They indicate that if a program bug is in D(1), then the chance for D(2)'s and D(3)'s generated based on this D(1) to contain the same bug is also high in most cases, with some exceptions as discussed in Section 4. In addition, the sizes of D(2)'s and D(3)'s can be much smaller than that of the corresponding D(1). On the other hand, the method we used for augmenting a bad D(1) (i.e., a D(1) that does not contain the bug) is also effective for locating a program bug, because the chance for the corresponding A(*) to contain the bug is very high and yet its size is relatively small in comparison to the failed execution slice.

Finally, since the execution slices obtained, as well as D(1), D(2), D(3), ..., A(1), A(2), ..., and A(*), depend on the test cases used, different suspicious code will be prioritized differently by different sets of successful and failed tests. An interesting future study would be to develop guidelines (in addition to what has been discussed in Section 5) to help programmers select appropriate successful test(s) with respect to a given failed test so that only a small portion of the program needs to be examined for locating a bug. An intuitive guess is that such selection will depend on the nature of the bug to be located as well as the structure of the program to be debugged. Another important future study is to apply our debugging method to industry projects to examine how much time programmers can save by using our method in locating bugs in comparison to other methods.

References

Agrawal, H., Horgan, J.R., 1990. Dynamic program slicing. In: Proceedings of the ACM SIGPLAN'90 Conference on Programming Language Design and Implementation, White Plains, New York, USA, June, pp. 246–256.

Agrawal, H., DeMillo, R.A., Spafford, E.H., 1993. Debugging with dynamic slicing and backtracking. Software—Practice and Experience 23 (6), 589–616.

Agrawal, H., Horgan, J.R., London, S., Wong, W.E., 1995. Fault localization using execution slices and dataflow tests. In: Proceedings of the 6th IEEE International Symposium on Software Reliability Engineering, Toulouse, France, October, pp. 143–151.

Cancellieri, A., Giorgi, A., 1994. Array Preprocessor User Manual. Technical Report IDS-RT94/052.

Horgan, J.R., London, S.A., 1991. Data flow coverage and the C language. In: Proceedings of the 4th Symposium on Software Testing, Analysis, and Verification, Victoria, British Columbia, Canada, October, pp. 87–97.

IEEE, 1994. IEEE Std 1044—Standard classification for software errors, faults and failures. IEEE Computer Society.

Jones, J.A., Harrold, M.J., Stasko, J., 2002. Visualization of test information to assist fault localization. In: Proceedings of the 24th International Conference on Software Engineering, Florida, USA, May, pp. 467–477.

Korel, B., 1988. PELAS—program error-locating assistant system. IEEE Transactions on Software Engineering 14 (9), 1253–1260.

Korel, B., Laski, J., 1988. Dynamic program slicing. Information Processing Letters 29 (3), 155–163.

Lyle, J.R., Weiser, M., 1987. Automatic program bug location by program slicing. In: Proceedings of the 2nd International Conference on Computer and Applications, Beijing, China, June, pp. 877–883.

Musa, J.D., 1993. Operational profiles in software reliability engineering. IEEE Software 10 (2), 14–32.

vSuds User's Manual, 1998. Telcordia Technologies.

Weiser, M., 1982. Programmers use slices when debugging. Communications of the ACM 25 (7), 446–452.

Weiser, M., 1984. Program slicing. IEEE Transactions on Software Engineering 10 (4), 352–357.

Wong, W.E., Sugeta, T., Qi, Y., Maldonado, J.C., 2005. Smart debugging software architectural design in SDL. Journal of Systems and Software 76 (1), 15–28.

W. Eric Wong received his B.S. in Computer Science from Eastern Michigan University, and his M.S. and Ph.D. in Computer Science from Purdue University. In 1997, he received the Quality Assurance Special Achievement Award from Johnson Space Center, NASA. Currently, he is a tenured associate professor in Computer Science at the University of Texas at Dallas. Before joining UTD in 2002, he was with Telcordia Technologies (formerly Bellcore) as a senior research scientist and a project manager in charge of the initiative for Dependable Telecom Software Development. His research focus is on the development of techniques to help practitioners produce high quality software at low cost. In particular, he is doing research in the areas of software testing, reliability, metrics, and QoS at the application as well as architectural design levels. He served as general chair for ICCCN 2003 and program chair for Mutation 2000, ICCCN 2002, COMPSAC 2004, SEKE 2005, and ISSRE 2005. He is a senior member of IEEE.

Yu Qi received his B.S. in Computer Science from Wuhan University in China and M.S. in Computer Science from Fudan University in China. Currently, he is a Ph.D. candidate in Computer Science at the University of Texas at Dallas. His research interests are in software testing and embedded systems.