
Automated Patch Assessment for Program Repair at Scale

He Ye*, Matias Martinez†, Martin Monperrus*
*KTH Royal Institute of Technology, Sweden; [email protected]; [email protected]

†University of Valenciennes, France [email protected]

Abstract—In this paper, we do automatic correctness assessment for patches generated by program repair techniques. We consider the human patch as the ground truth oracle and randomly generate tests based on it, i.e., Random testing with Ground Truth – RGT. We build a curated dataset of 638 patches for Defects4J generated by 14 state-of-the-art repair systems. We evaluate automated patch assessment on our dataset, which is, to our knowledge, the largest ever. The results of this study are novel and significant: first, we show that 10 patches from previous research classified as correct by their respective authors are actually overfitting; second, we demonstrate that the human patch is not the perfect ground truth; third, we precisely measure the trade-off between the time spent for test generation and the benefits of automated patch assessment at scale.

Index Terms—automatic program repair, patch correctness assessment


1 INTRODUCTION

AUTOMATIC program repair (APR) aims to reduce manual bug-fixing effort by providing automatically generated patches [11], [27]. Most program repair techniques use test suites as a specification of the program, which is what we consider in this paper. One of the key challenges of program repair is that test suites are generally too weak to fully specify the correct behavior of a program. Consequently, a generated patch passing all test cases may still be incorrect [29]. Per the usual terminology, such an incorrect patch is said to be overfitting if it passes all tests but is not able to generalize to other buggy input points not present in the test suite [33]. Previous research (e.g. [20], [22], [25]) has shown that automatic repair systems tend to produce more overfitting patches than correct patches.

Due to the overfitting problem, researchers cannot rely only on test suites to assess the capability of the new repair systems they invent. Thus, a common practice in the program repair research community is to employ manual assessment of generated patches to determine their correctness. Analysts, typically authors of the papers, annotate the patches as 'correct' or 'overfitting' [25] according to their analysis. This assessment is typically done against a human-written patch considered as ground truth. A patch is deemed correct if and only if: 1) it is identical to the human-written patch, or 2) the analysts perceive it as semantically equivalent. Otherwise, the patch is deemed overfitting.

There are three major problems with manual patch assessment: difficulty, bias and scale. First, in some cases, it is hard to understand the semantics of the program under repair. Without expertise on the code base, the analyst may simply be unable to assess correctness [25], [45]. Second, the usual practice is that the analysts of patches are also authors of the program repair system being evaluated. Consequently, there may be an inherent bias towards considering the generated patches as correct. Third, it frequently happens that dozens of patches are generated for the same bug [20], [26], which makes the amount of manual analysis required quickly exceed what is doable in a reasonable amount of time.

To overcome difficulty, bias and scale in manual patch assessment, we need automated patch assessment [17], [40], [41], [46].

In this paper, we consider automated patch assessment given a ground truth reference patch, as done by Xin and Reiss [40], Le et al. [20] and Yu et al. [46]. Notably, there exist other works, such as those by Xiong et al. [41] and Yang et al. [43], based on the opposite premise: the absence of a reference patch. Having a ground truth reference patch is in line with manual assessment based on the human-written patch, and enables us to compare the two.

Using the ground truth reference patch, we present a novel empirical study of automated patch assessment. The key novelty is the scale: we analyze 638 patches (189 in [17]) from 14 repair systems (8 in [17]). Our automated patch assessment is based on test generation [31], [33]: we generate tests using the behavior of the human-written patch as the oracle. If any automatically generated test fails on a machine patch, it is considered to be overfitting. In this paper, we call this procedure RGT, standing for Random testing based on Ground Truth. Our study uses Evosuite [9] and Randoop [28] as test generators, and the 638 collected patches in our dataset are automatically assessed with 4,477,707 generated RGT tests (to our knowledge, the largest number of tests ever reported in this context).

The results of this study are novel and significant. First, we show that 10 patches from previous research classified as correct by their respective authors are actually overfitting. This result confirms the difficulty of manual patch assessment and strongly suggests using automated patch assessment in program repair research. Second, we systematically analyze the false positives of RGT assessment, i.e., cases where the failure of a generated test does not signal an overfitting patch, indicating important research directions for test generation. We demonstrate that the human patch is not the perfect ground truth. Third, we precisely measure the trade-off between the time spent for test generation and the benefits for automated patch assessment at scale.



TABLE 1: Our Major Findings and Their Implications, Based on Our Study of 638 APR Patches and 4,477,707 Generated Tests for Automatic Patch Correctness Assessment. "RGT" Refers to the Patch Assessment Technique Based on Random Testing, Which Was Introduced in [31] and Is Deepened in This Empirical Study.

Findings on Manual Versus RGT Assessment

(1) Finding: The misclassification of patches by manual assessment is a common problem. Our experiment shows it happened for 6/14 repair systems manually assessed in previous research.
    Implication: The community of APR researchers needs better techniques for patch assessment to strengthen scientific validity.

(2) Finding: APR researchers confirm that the inputs sampled by random testing are valuable for assessing patch correctness.
    Implication: It helps APR researchers to have concrete inputs to analyse patch correctness, suggesting more research on the automatic identification of interesting input points (e.g. [32]).

Findings on the Effectiveness of RGT Assessment

(3) Finding: In our experiment, RGT automatically identifies 274/381 (72%) of the patches claimed as overfitting by manual analysis. This is a significant improvement over [17], in which fewer than 20% of overfitting patches could be identified.
    Implication: Our results suggest that the effectiveness of RGT patch assessment was underestimated in [17]. This calls for future research on this topic, with replication studies, in order to strengthen external validity.

(4) Finding: Behavioral differences identified by exception comparison are an important factor behind RGT's effectiveness. DiffTGen, which only considers assertion-based differences between output values, thus performs worse.
    Implication: Future overfitting detection techniques should consider both assertion-related and exception-related behavioral differences.

(5) Finding: For RGT patch assessment, Evosuite outperforms Randoop in sampling inputs that differentiate program behaviors by 210%, but considering the two techniques together maximizes the effectiveness of identifying overfitting patches.
    Implication: Patch assessment techniques that involve automatic test generation can combine different generators to maximize their effectiveness (e.g. PATCH-SIM [41]).

(6) Finding: We found flaky tests both in newly generated RGT tests and in previously generated RGT tests from previous research.
    Implication: Flaky test detection is important for RGT assessment. APR researchers who use RGT tests should pay particular attention to identifying flaky tests.

Findings on the False Positive Ratio of RGT Assessment

(7) Finding: RGT patch assessment sometimes suffers from false positives. In our experiment, the false positive rate of RGT is 6/257 (2.3%).
    Implication: This false positive rate is low; researchers can rely on RGT for providing better assessment results of their program repair contributions.

(8) Finding: RGT causes false positive cases because the used test generation technique is not aware of preconditions or constraints on inputs.
    Implication: Better support for preconditions in test generation would help to increase the reliability of RGT patch assessment. An example of recent work in that direction is [5].

(9) Finding: In our experiments, RGT patch assessment yields three false positives because of optimization or imperfection in the human-written patches.
    Implication: We, as APR researchers, should not blindly consider the human-written patch as a perfect ground truth; this impacts both manual assessment and automatic patch assessment.

Findings on RGT Time Cost

(10) Finding: Over 87% of the time cost of RGT patch assessment is spent in test case generation.
     Implication: We encourage researchers to share the generated tests for behavioral assessment of APR patches. This is a big time saver and improves scientific reproducibility.

(11) Finding: Reusing previously generated RGT tests from [31] identifies 219/381 (57.5%) of overfitting patches without paying any test generation time.
     Implication: Future APR experiments on Defects4J can reuse previously generated RGT tests. When researchers assess the correctness of APR patches with the same dataset of tests, the community has a fair and unbiased comparison of program repair effectiveness.

(12) Finding: There is a trade-off between the time spent generating tests and the effectiveness of discarding overfitting patches.
     Implication: Our experiments provide practical configuration guidelines for future research and experiments using the RGT patch assessment technique.


Our large-scale study enables us to identify 12 major findings that have important implications for future research in the field of automatic program repair; these findings and their implications are summarized in Table 1.

To sum up, the contributions of this paper are:

• A large-scale empirical study of automated patch assessment based on test generation. Our empirical study is comprehensive, from canonicalization of patches, to sanity checking of tests, to careful handling of randomness. Our study is the largest ever to our knowledge, involving 638 patches and 4,477,707 generated tests.

• Our key result shows that 72% of overfitting patches can be discarded with automated patch assessment, which is a significant improvement over [17], in which fewer than 20% of overfitting patches could be identified.

• We estimate the reliability of the RGT technique: using the human-written patch as a ground truth yields a 2.3% false-positive rate. We also conduct a novel performance analysis. These results are based on a novel taxonomy of seven categories of behavioral differences.

• A curated dataset of 638 patches generated by 14 program repair tools for the Defects4J benchmark. The patches are given in canonical format with metadata so that they provide a foundation for future program repair research. All the data presented in this paper are publicly available at [1].

• A curated dataset of 4,477,707 generated tests for Defects4J based on the ground truth human patch. This dataset is valuable for future automated patch assessment, as well as for sister fields such as fault localization and testing.


2 BACKGROUND

This section provides the motivation through an example demonstrating the problem of manual patch assessment, as well as background on the overfitting problem in program repair.

2.1 Motivating Example

Manual patch assessment is an error-prone and subjective task, which can lead to different results depending on the knowledge and experience of the analysts. Listing 1 presents the human-written patch and the APR patch generated by Arja [47], DeepRepair [37], and JGenProg [25] for bug Chart-3 of Defects4J [16]. The APR patch is syntactically different from the human-written patch.

Listing 1: Motivating Example

1056   TimeSeries copy = (TimeSeries) super.clone();
1057 + copy.minY = Double.NaN;
1058 + copy.maxY = Double.NaN;
1059   copy.data = new java.util.ArrayList();
1060   if (this.data.size() > 0) { ...

a: The human-written patch of bug Chart-3 in Defects4J

573   if (item == null) {
574       throw new IllegalArgumentException("...");
575   }
576 + findBoundsByIteration();
577   item = (TimeSeriesDataItem) item.clone();

b: The generated patch by Arja, DeepRepair and JGenProg

Even though these three APR techniques generate the same patch for bug Chart-3, their analysts hold different opinions about the correctness of the generated patch. Table 2 shows the assessment results for this APR patch from previous literature. Originally, the Arja analysts considered it correct, while it was deemed overfitting by the DeepRepair analysts and unknown by the JGenProg analysts. Le et al. [17] employed 3 to 5 external software experts to evaluate the correctness of this patch and the result was overfitting.

We discussed the correctness of this patch with the original authors of DeepRepair and JGenProg via email. Eventually, they reached a consensus and confirmed that this patch is actually correct.

TABLE 2: Manual Analysis Result for Motivating Example

Analysts                              Previous Result
Arja [47]                             Correct
DeepRepair [37]                       Overfitting
JGenProg [25]                         Unknown
3-5 Independent Annotators [17]       Overfitting

The motivating example shows that analysts may hold different opinions about the correctness of the very same patch. If manual patch correctness assessment gives too many erroneous results, it is a significant threat to the validity of the evaluation of program repair research. With unreliable correctness assessment, a technique A claimed to be better than a technique B may actually be worse. Ideally, we need a method that automatically and reliably assesses the correctness of program repair patches.

2.2 Overfitting Patches

Overfitting patches are plausible patches that pass all developer-provided tests but nevertheless fail to be a good general solution to the bug under consideration. As such, overfitting patches can fail on other held-out tests [33]. The essential reason behind the overfitting problem is that the test cases used for guiding patch generation are incomplete.

The overfitting problem has been reported both qualitatively and quantitatively in previous work [22], [25], [29], [33]. For example, in the context of Java code, Yu et al. [46] studied overfitting on Defects4J. In the context of C code, Le et al. [20] measured that 73%-81% of APR patches are overfitting, considering two benchmarks, IntroClass and CodeFlaws. Qi et al. [29] conducted an empirical study on the correctness of three repair techniques; the three considered techniques have an overfitting rate ranging from 92% to 98%. Such a large percentage of overfitting patches motivates us to assess patch correctness in an automatic manner.

2.3 Automated Patch Correctness Assessment

Typically, researchers employ the human-written patch as a ground truth to identify overfitting patches. Xin and Reiss [40] propose DiffTGen to identify overfitting patches with tests generated by Evosuite [9]. Those tests are meant to detect behavioral differences between a machine patch and a human-written patch. If any test case differentiates the output value between a machine patch and the corresponding human-written patch, the machine patch is assessed as overfitting. DiffTGen has been further studied by Le et al. [17], who have confirmed its potential. Opad [43] employs two test oracles (crash and memory-safety) to help APR techniques filter out overfitting patches by enhancing existing test cases. Xiong et al. [41] do not use a ground truth patch to determine the correctness of a machine patch; they consider the similarity of test case execution traces to reason about overfitting.

3 EXPERIMENTAL METHODOLOGY

In this section, we first present an overview of RGT patch assessment (3.1). We then introduce seven categories of program behavioral differences for automated patch assessment (3.2) and present the workflow of the RGT assessment (3.3). After that, we present our research questions (RQs) to comprehensively evaluate the effectiveness and performance of RGT assessment (3.4). Finally, we describe the methodology for each RQ in detail (3.5).

3.1 An Overview of RGT Patch Assessment

RGT patch assessment automatically assesses the correctness of APR patches. It is based on 1) a ground truth patch and 2) a random test generator. The intuition is that random tests can differentiate the behavior of a ground truth patch from that of an APR patch.

With regard to test generation, we consider typical regression test generation techniques [9], [28] for randomly sampling regression oracles based on a ground truth program. In other words, these automatic test case generation techniques use the current behavior of the program itself as an oracle [38].


TABLE 3: RGT Detects 7 Behavioral Differences

Differences   Ground-Truth Behavior              Actual Behavior                    Test Failure Diagnostic
Dassert       expected value V1                  actual value V2                    ComparisonFailure/AssertionError: expected: V1 but was: V2
Dexc1         exception E1                       no exception                       Expecting exception: E1
Dexc2         no exception                       exception E1                       Exception E1 at ...
Dexc_type     exception E1                       exception E2                       Expected exception of type E1
Dexc_loc      exception E1 thrown by function A  exception E1 thrown by function B  Expected exception A.E1 but was B.E1
Dtimeout      execution within timeout T         execution exceeds timeout T        Test timed out after T milliseconds
Derror        no error                           error                              Other failures

Consequently, an "RGT test" in this paper refers to a test generated based on a ground truth patch, containing an oracle that encodes the runtime behavior of the ground truth program.

RGT patch assessment takes RGT tests and an APR-patched program as inputs and outputs the number of test failures that witness a behavioral difference. RGT patch assessment establishes a direct connection between the outputs of random tests and overfitting classification: if any behavioral difference exists between an APR patch and a ground truth patch, the APR patch is assessed as overfitting. More specifically, if a ground truth patch passes all RGT tests but an APR patch fails on any of them, this APR patch is assessed as overfitting. While RGT patch assessment is a known technique, it has not been studied at a large scale.
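To make the notion of an RGT test concrete, the following minimal JUnit sketch illustrates the idea; the class Compound and the asserted value are hypothetical stand-ins, not taken from Defects4J. The generated assertion freezes the output observed on the ground truth program for one sampled input.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Hypothetical subject under test, standing in for a Defects4J class patched
// by the human developer (the ground truth program).
class Compound {
    static double amount(double principal, double rate, int years) {
        double a = principal;
        for (int i = 0; i < years; i++) {
            a *= (1.0 + rate);
        }
        return a;
    }
}

public class CompoundRegressionTest {

    // The generator samples the input (1000.0, 0.05, 2), runs the ground truth
    // program, observes the value 1102.5, and freezes it into the assertion.
    // A machine patch whose amount() returns anything else on this input fails
    // the test and is therefore classified as overfitting.
    @Test
    public void regressionOracleFromGroundTruth() {
        assertEquals(1102.5, Compound.amount(1000.0, 0.05, 2), 1e-9);
    }
}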

3.2 Categorization of Behavioral Differences

Based on our experiment of executing 4,477,707 RGT tests on 638 patches, we empirically define seven program behavioral differences that can be revealed by RGT tests. They are summarized in Table 3. The first column gives the identifier of the difference between the ground truth program behavior (second column) and the actual patched program behavior (third column). The fourth column gives the test failure diagnostic used for mapping each category. In our study, we use regex patterns to match test failure diagnostics, which enables us to automatically classify the behavioral difference categories.

We now explain them as follows:

Dassert: Given the same input, the expected output value from the ground truth program is different from the actual output value from the patched program. In this case, a difference in value comparison reveals an overfitting patch.

Dexc1: Given the same input, an exception is thrown when executed on the ground truth program, but the patched program does not throw any exception when executed with that input. The expected behavior is an exception in this case.

Dexc2: Given the same input, no exception is thrown when executed on the ground truth program, but at least one exception is thrown when executed on the patched program. The expected behavior is no exception in this case.

Dexc_type: Given the same input, an exception E1 is thrown when executed on the ground truth program, but a different exception E2 is thrown when executed on the patched program. The expected behavior is exception E1 in this case.

Dexc_loc: Given the same input, an exception E1 is thrown by function A when executed on the ground truth program, but the same exception E1 is thrown by another function B when executed on the patched program. In this case, we consider the same exception produced by different functions as a behavioral difference.

Dtimeout: Given the same input and a large enough timeout configuration value T, the ground truth program executes within the considered timeout but the execution of the patched program causes a timeout.

Derror: Given the same input, no error occurs when executing the ground truth program but an error occurs when executing the patched program. Derror indicates an unexpected error during test execution, rather than a test failure. The cause of a test error can vary; in this study, we consider failing tests not mapped to any of the six aforementioned categories as belonging to Derror.
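As an illustration of the regex-based mapping described in 3.2, the following sketch (an assumed implementation, not the authors' tooling; the patterns only approximate the diagnostics of Table 3) classifies a failure message into one of the seven categories.

import java.util.regex.Pattern;

// Maps a test failure diagnostic to a behavioral-difference category of Table 3.
public class FailureClassifier {

    enum Category { DASSERT, DEXC1, DEXC2, DEXC_TYPE, DEXC_LOC, DTIMEOUT, DERROR }

    // Patterns approximating the diagnostics listed in Table 3.
    private static final Pattern ASSERT   = Pattern.compile("(ComparisonFailure|AssertionError).*expected.*but was");
    private static final Pattern TIMEOUT  = Pattern.compile("Test timed out after \\d+ milliseconds");
    private static final Pattern EXC_TYPE = Pattern.compile("Expected exception of type");
    private static final Pattern EXC_LOC  = Pattern.compile("Expected exception .* but was .*");
    private static final Pattern EXC1     = Pattern.compile("Expecting exception:");
    private static final Pattern EXC2     = Pattern.compile("Exception .* at ");

    static Category classify(String diagnostic) {
        if (TIMEOUT.matcher(diagnostic).find())  return Category.DTIMEOUT;
        if (ASSERT.matcher(diagnostic).find())   return Category.DASSERT;
        if (EXC_TYPE.matcher(diagnostic).find()) return Category.DEXC_TYPE;
        if (EXC_LOC.matcher(diagnostic).find())  return Category.DEXC_LOC;
        if (EXC1.matcher(diagnostic).find())     return Category.DEXC1;
        if (EXC2.matcher(diagnostic).find())     return Category.DEXC2;
        return Category.DERROR;                  // anything unmatched
    }

    public static void main(String[] args) {
        System.out.println(classify("java.lang.AssertionError: expected:<1> but was:<2>"));
        System.out.println(classify("Test timed out after 300000 milliseconds"));
    }
}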

3.3 The RGT Algorithm

The RGT algorithm was proposed by Shamshiri et al. [31]. It consists of using generated tests to identify a behavioral difference. We use it in the context of a patch assessment process for program repair. Algorithm 1 presents the RGT algorithm. RGT takes as input a machine patch set P, a ground truth patch set G, and the automatically generated RGT test set T. As a result, RGT outputs two diagnoses for each machine patch from P: a) a label, either correct or overfitting, and b) a list of behavioral differences. The assessment process mainly consists of two procedures that we discuss now: a sanity check for T and the automatic assessment of P.

Sanity Check: We first perform a sanity check of the RGT tests in T in order to detect and remove flaky tests, i.e., generated tests that have non-deterministic behavior. For each human-written patched program phi from G, we execute the corresponding RGT tests Ti against phi. If any test in Ti yields a failure against phi, we add it to the flaky test set FLAKYi (line 7). If FLAKYi captures any flaky test, we then remove all tests in FLAKYi from Ti (line 8). We conduct this procedure n consecutive times to maximize the likelihood of detecting flaky tests (n is the cnt variable at line 4; it is set to 3 per previous research [31]).

Assessment: For the considered patch set P and RGT test set T, after the sanity check (line 11), we execute all tests from T against each machine patch in P. If any generated test yields a failure against a machine patch pmi, it is recorded in the failing test set FTi (line 13), signaling a behavioral difference. If FTi captures a failing test, the correctness label of pmi is set to overfitting, otherwise correct. Regarding the patches assessed as overfitting, for each failing test, we analyze the failure and add one of the seven categories of behavioral differences to the set Dpmi according to its failure diagnostic (line 18).


As a result, RGT outputs the correctness label and a set of behavioral differences for each machine patch.

Algorithm 1 RGT Patch Assessment
Input: (1) the machine patch set P = {pm1 ... pmn}, where pmi is a machine patched program for bug i; (2) a ground truth patch set G = {ph1 ... phk}, where phi is a ground truth patch for bug i; (3) the RGT test set T = {T1 ... Tk}, where Ti is a set of tests generated for bug i.
Output: the correctness label: correct/overfitting; a list of behavioral differences

 1: procedure SANITYCHECK(G, T)
 2:   for phi in G do
 3:     for Ti in T do
 4:       cnt <- 3
 5:       while cnt > 0 do
 6:         cnt <- cnt - 1
 7:         FLAKYi <- runTests(phi, Ti)
 8:         Ti <- Ti - FLAKYi
    return T
 9: procedure ASSESSMENT(G, P, T)
10:   Ar <- {}
11:   T <- SanityCheck(G, T)
12:   for pmi in P do
13:     FTi <- runTests(pmi, Ti)
14:     if FTi is not empty then
15:       label(pmi) <- overfitting
16:       Dpmi <- {}
17:       for ti in FTi do
18:         Dpmi <- Dpmi + ti
19:       Ar <- Ar + <pmi, label, Dpmi>
20:     else
21:       label(pmi) <- correct
22:       Ar <- Ar + <pmi, label, null>
    return Ar
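For readers who prefer code over pseudocode, the following compact Java sketch mirrors the ASSESSMENT procedure of Algorithm 1. The helpers runTests and classify are hypothetical stubs standing in for JUnit execution and for the regex classification of Table 3.

import java.util.*;

public class RgtAssessmentSketch {

    static class Verdict {
        final String patchId;
        final String label;                 // "correct" or "overfitting"
        final Set<String> differences;      // categories of Table 3
        Verdict(String patchId, String label, Set<String> differences) {
            this.patchId = patchId;
            this.label = label;
            this.differences = differences;
        }
    }

    // Stub: in the real pipeline this runs JUnit and collects failing tests.
    static List<String> runTests(String patchedProgramId, List<String> tests) {
        return Collections.emptyList();
    }

    // Stub: in the real pipeline this applies the regex classification of Table 3.
    static String classify(String failingTest) {
        return "Derror";
    }

    // machinePatches: bugId -> machine-patched program; rgtTests: bugId -> sanity-checked RGT tests
    static List<Verdict> assess(Map<String, String> machinePatches,
                                Map<String, List<String>> rgtTests) {
        List<Verdict> results = new ArrayList<>();
        for (Map.Entry<String, String> e : machinePatches.entrySet()) {
            List<String> failures = runTests(e.getValue(), rgtTests.get(e.getKey()));
            if (failures.isEmpty()) {
                results.add(new Verdict(e.getValue(), "correct", Collections.emptySet()));
            } else {
                Set<String> diffs = new HashSet<>();
                for (String failure : failures) {
                    diffs.add(classify(failure));   // one Table 3 category per failure
                }
                results.add(new Verdict(e.getValue(), "overfitting", diffs));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> patches = Map.of("Chart-3", "patch1-Chart-3-Arja");
        Map<String, List<String>> tests = Map.of("Chart-3", List.of("t1", "t2"));
        assess(patches, tests).forEach(v ->
            System.out.println(v.patchId + " -> " + v.label + " " + v.differences));
    }
}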

3.4 Research Questions

We intend to comprehensively evaluate the effectiveness of RGT patch assessment. For this, we investigate the following RQs:

• RQ1: To what extent does the RGT patch assessment technique identify misclassified patches in previously reported program repair research? This is key to seeing whether RGT patch assessment is better than manual patch assessment or rather complementary. We also ask researchers from the program repair community about the misclassification cases.

• RQ2: To what extent does RGT patch assessment yield false positives? There are a number of pitfalls with RGT patch assessment which have never been studied in depth.

• RQ3: To what extent is RGT patch assessment good at discarding overfitting patches compared against the state of the art?

• RQ4: What is the time cost of RGT patch assessment? Also, we study whether we could reuse tests generated in previous research projects to speed up the patch assessment process.

• RQ5: What is the trade-off between test generation cost and patch classification effectiveness of RGT?

3.5 Protocols

RQ1. We first collect a set of APR patches for Defects4J that were claimed as correct by their respective authors. This set of patches is denoted Dcorrect. Next, we execute RGT tests over all Dcorrect patches and report the number of patches that make at least one RGT test fail. Such a case means that the RGT patch assessment contradicts the manual analysis previously done by APR researchers. Last, for those cases where the RGT assessment is not in line with the manual assessment from previous work, we send our RGT assessment results and the failing RGT tests to the original authors of the patch and ask them for feedback. In particular, we explore to what extent they agree with the RGT assessment results.

RQ2. We first manually investigate the positive cases of RGT assessment when executing it over Dcorrect, where 'positive' means that a patch is classified as overfitting by RGT assessment. This manual analysis aims at finding false positives of RGT assessment. We record the number of correct patches nevertheless assessed as overfitting by the RGT assessment. This enables us to estimate a false positive rate for RGT assessment. Last, we carefully classify those false positive cases according to their root causes.

RQ3. RQ3 focuses on the effectiveness of RGT assessment in identifying overfitting patches. We first collect a set of APR patches for Defects4J that were manually assessed as overfitting by the corresponding researchers. This set of patches is denoted Doverfitting. We execute RGT tests over all Doverfitting patches and record test failures. A test failure means that RGT succeeds in identifying a patch as overfitting, i.e., that RGT agrees with the manual analysis by researchers. Next, we also execute the state-of-the-art overfitting patch detection technique DiffTGen over the same dataset. We execute DiffTGen in its default mode, which calls EvoSuite for 30 trials with a search timeout of 60 seconds per trial. We do not execute Opad [43] and PATCH-SIM [41] on this dataset for the following reasons: Opad is based on memory safety analysis for C, which is not relevant in the context of the memory-safe language Java. PATCH-SIM is not appropriate for two reasons: (1) PATCH-SIM is a heuristic technique whose goal is to "improve the precision of program repair systems, even at the risk of losing some correct patches" (quote from the introduction of [41]). The goal of RGT is different: it is to assist researchers (and not APR users) in classifying patches with correct labels. (2) PATCH-SIM targets APR users who do not have any ground truth patch available. On the contrary, RGT targets APR researchers who have a ground truth patch at hand.

RQ4. We estimate the performance of RGT from a time cost perspective. We measure the time cost of RGT in three dimensions: the time cost of test case generation, the time cost of sanity checking, and the time cost of executing the test cases over the APR patches. These three durations are respectively denoted TCGen, SC, and EXEC. Next, we collect RGT tests from previous research. Last, we execute previously generated RGT tests over both Dcorrect and Doverfitting in order to compare both SC and EXEC.


TABLE 4: Dataset of Collected Defects4J Patches

Dataset        APR Tool        Chart  Closure  Lang  Math  Time  Total
Dcorrect       ACS                 2        0     3    12     1     18
               Arja                3        0     4    10     1     18
               CapGen              5        0     9    14     0     28
               DeepRepair          0        0     4     1     0      5
               Elixir              4        0     8    12     2     26
               HDRepair            0        0     1     4     1      6
               Jaid                8        9    14    11     0     42
               JGenProg2015        0        0     0     5     0      5
               Nopol2015           1        0     3     1     0      5
               SequenceR           3        4     2     8     0     17
               SimFix              4        6     9    14     1     34
               SketchFix           6        2     2     6     0     16
               SOFix               5        0     3    13     1     22
               ssFix               2        1     5     7     0     15
Sum for Dcorrect                  43       22    67   118     7    257

Doverfitting   ACS                 0        0     1     4     0      5
               Arja               30        0    54    73    15    172
               CapGen              0        0    14    24     0     38
               DeepRepair          4        0     1     4     0      9
               Elixir              3        0     4     7     1     15
               HDRepair            0        0     0     3     0      3
               Jaid                8        4    10    17     0     39
               JGenProg2015        3        0     0     2     1      6
               Nopol2015           0        0     2     3     1      6
               SequenceR           3       32     1    20     0     56
               SimFix              0        0     3     9     0     12
               SketchFix           2        0     2     5     0      9
               SOFix               0        0     0     2     0      2
               ssFix               1        1     1     6     0      9
Sum for Doverfitting              54       37    93   179    18    381

Sum for all                       97       59   160   297    25    638

We evaluate the effectiveness of the previously generated RGT tests by comparing them with the newly generated RGT tests.

RQ5. RQ5 investigates the trade-off between the test generation cost and the effectiveness of overfitting patch classification. We conduct our experiment by executing 30 runs of RGT tests on Doverfitting. First, we record the number of overfitting patches individually identified by each test generation run. Next, to account for randomness, we analyze 1000 random groups, each of which is a random sequence of the 30 test generation runs. Last, we analyze the average number of test generation runs and their effectiveness in identifying overfitting patches.
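The following sketch outlines this analysis under the assumption of a hypothetical list mapping each of the 30 test generation runs to the set of overfitting patches it detects; it averages, over 1000 random orderings, the number of patches detected after 1 to 30 runs.

import java.util.*;

public class Rq5TradeoffSketch {

    // detectedBy.get(g) is the set of overfitting-patch identifiers detected by
    // the g-th test generation run (hypothetical input data).
    static double[] averageDetection(List<Set<String>> detectedBy, int groups, long seed) {
        int runs = detectedBy.size();                       // 30 in the paper
        double[] avg = new double[runs];
        Random random = new Random(seed);
        for (int g = 0; g < groups; g++) {                  // 1000 random groups
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < runs; i++) order.add(i);
            Collections.shuffle(order, random);
            Set<String> union = new HashSet<>();
            for (int k = 0; k < runs; k++) {
                union.addAll(detectedBy.get(order.get(k))); // add the k-th generation run
                avg[k] += union.size();                     // patches detected after k+1 runs
            }
        }
        for (int k = 0; k < runs; k++) avg[k] /= groups;
        return avg;
    }

    public static void main(String[] args) {
        // Toy data: three generation runs detecting overlapping sets of patches.
        List<Set<String>> toy = List.of(
            Set.of("p1", "p2"), Set.of("p2", "p3"), Set.of("p1", "p4"));
        System.out.println(Arrays.toString(averageDetection(toy, 1000, 42L)));
    }
}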

3.6 Curated Patch Dataset

Fourteen repair systems. APR patches for Defects4J form the essential data for our experiment. The criterion for the repair systems considered in this study is that they were previously evaluated on the Defects4J [16] benchmark.

We carefully collect APR patches that are publicly available. We do this by browsing the repositories / appendices / replication packages of the corresponding research papers or by asking the authors directly. As a result, we build our datasets Dcorrect and Doverfitting from the following 14 APR systems: ACS [42]; Arja [47]; CapGen [36]; DeepRepair [37]; Elixir [30]; HDRepair [19]; Jaid [3]; JGenProg [25]; Nopol [25]; SimFix [15]; SketchFix [14]; SOFix [21]; ssFix [39]; SequenceR [4].

Patch Canonization and Verification. In order to fully automate RGT patch assessment, we need to have all patches in the same canonical format; otherwise, applying a patch may fail for spurious reasons. To do so, we manually convert the collected patches from their initial formats, such as XML, plain log files, or patched programs, into a unified DIFF format. After unifying the patch format, we carefully name the patch files according to a systematic naming convention: <PatchNo>-<ProjectID>-<BugID>-<ToolID>.patch. For instance, patch1-Lang-24-ACS.patch refers to the first patch generated by ACS to repair the bug identified as 24 in the Lang project of Defects4J.
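For illustration, a file name following this convention can be validated and parsed with a simple regular expression, as in the sketch below (an assumed helper, not part of our tooling).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Validates and decomposes patch file names of the form
// <PatchNo>-<ProjectID>-<BugID>-<ToolID>.patch, e.g. patch1-Lang-24-ACS.patch
public class PatchNameParser {

    private static final Pattern NAME =
        Pattern.compile("(patch\\d+)-([A-Za-z]+)-(\\d+)-([A-Za-z0-9]+)\\.patch");

    public static void main(String[] args) {
        Matcher m = NAME.matcher("patch1-Lang-24-ACS.patch");
        if (m.matches()) {
            System.out.println("patch no : " + m.group(1));   // patch1
            System.out.println("project  : " + m.group(2));   // Lang
            System.out.println("bug id   : " + m.group(3));   // 24
            System.out.println("tool     : " + m.group(4));   // ACS
        }
    }
}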

Sanity Check. Some shared patches may not be plausible per the definition of test-suite based program repair (passing all test cases). We conduct a rigorous sanity check to keep only applicable and plausible patches. Applicable means that a patch can be applied successfully to the considered Defects4J version1. Plausible means that a patch is test-suite adequate; we check this property by executing the human-written test cases originally provided by Defects4J. Eventually, we discard all patches that are not applicable or not plausible.

3.7 Curated Dataset of Ground Truth Based Random Tests

We now present our curated dataset of RGT tests generated based on ground truth patched programs. We consider both previously generated and newly generated RGT tests in our study.

3.7.1 Previously Generated RGT Tests

We search for and obtain existing generated test cases for Defects4J from previous research:

• EvosuiteASE15: tests generated by Evosuite for the ASE'15 paper [31];
• RandoopASE15: tests generated by Randoop for the ASE'15 paper [31];
• EvosuiteEMSE18: tests generated by Evosuite for the EMSE'18 paper [46].

EvosuiteASE15 and RandoopASE15 were generated for 357 Defects4J bugs, each with 10 runs of test generation (with 10 seeds). EvosuiteEMSE18 was generated for 42 bugs with 30 runs of test generation (with 30 seeds).

3.7.2 Newly Generated RGT Tests

In this paper we decided to generate new RGT tests for two main reasons. First, we execute 30 runs of Evosuite [9] and Randoop [28], using a different random seed for each run, with the goal of generating new test cases (not generated by the 10 executions of [31]) that potentially detect behavioral differences. These test sets are respectively denoted RGTEvosuite2019 and RGTRandoop2019. By using 20 additional executions with new seeds, the new test cases sample other parts of the input space. Second, the test dataset from EvosuiteEMSE18 only partially covers the Defects4J bugs (42 in total).

Parameters. We run both Evosuite and Randoop on the ground truth program with 30 different seeds and a search budget of 100 seconds. We configure a timeout of 300 seconds for each test execution. The test generator versions and configurations are the same as in [31]; the only difference is that we execute more runs with random seeds (30 instead of 10 in [31]). In contrast to [31], we did not consider the test generation tool AgitarOne because of a license issue.2

1. Version 1.2: commit at 486e2b49d806cdd3288a64ee3c10b3a25632e991



3.7.3 Sanity Check

Per the RGT approach described in 3.3, we conduct the sanity check for both the previously generated and the newly generated RGT tests. We execute each RGT test three times consecutively on the ground truth program. If any test yields a failure against the ground truth program, we discard it, until all remaining RGT tests pass three consecutive executions. By doing so, we obtain a set of stable RGT tests for assessing patch correctness.
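The following sketch summarizes this flaky-test filter; the helper failingTests is a hypothetical stand-in for executing the tests with JUnit on the ground truth program.

import java.util.*;

public class FlakyFilterSketch {

    // Stub: the real version runs JUnit and returns the tests that fail.
    static List<String> failingTests(String groundTruthProgram, List<String> tests) {
        return Collections.emptyList();
    }

    // Tests that fail in any of three consecutive executions on the ground
    // truth program are discarded as flaky; the remaining tests are stable.
    static List<String> stableTests(String groundTruthProgram, List<String> rgtTests) {
        List<String> remaining = new ArrayList<>(rgtTests);
        for (int run = 0; run < 3; run++) {
            remaining.removeAll(failingTests(groundTruthProgram, remaining));
        }
        return remaining;
    }

    public static void main(String[] args) {
        System.out.println(stableTests("Chart-3-human", List.of("t1", "t2")));
    }
}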

4 EXPERIMENTAL RESULTS

We now present our experimental results. We first look at the dataset and the RGT tests we have collected.

4.1 Patches

We have collected a total of 638 patches from 14 APR systems. All pass the sanity checks described in subsection 3.6. Table 4 presents this dataset of patches for Defects4J. The first column specifies the dataset category and the second column gives the name of the automatic repair system. The number of patches collected per Defects4J project is given in the third to seventh columns, and they are summed in the last column. There are 257 patches previously claimed as correct; they form Dcorrect. There are 381 patches that were considered overfitting by manual analysis in previous research; they form Doverfitting. We found that 160/257 patches from Dcorrect are syntactically equivalent to the human-written patches: the exact same code modulo formatting and comments. The remaining 97/257 patches are semantically equivalent to human-written patches. Overall, the 638 patches cover 117/357 different bugs of Defects4J.3 To our knowledge, this is the largest ever APR patch dataset with manual analysis labels from the researchers. The most related datasets are the one from [40], containing 89 patches from 4 repair tools, and the one from [41], containing 139 patches from 5 repair tools. Our dataset is 4 times bigger than the latter.

4.2 Tests

Evosuite and Randoop have been invoked 30 times with random seeds for each of the 117 bugs covered by the patch dataset. In total, they have each been invoked 117 bugs × 30 seeds = 3510 times. We discard 2.2% and 2.4% flaky tests from RGTEvosuite2019 and RGTRandoop2019, respectively, with a strict sanity check. As a result, we have obtained a total of 4,477,707 stable RGT tests: 199,871 from RGTEvosuite2019 and 4,277,836 from RGTRandoop2019.

We also collect RGT tests generated by previous research; they amount to 15,136,567 tests: 141,170 in RGTEvosuiteASE15 [31], 14,932,884 in RGTRandoopASE15 [31], and 62,513 in RGTEvosuiteEMSE18 [46]. By conducting a sanity check of those tests, we discard 2.7%, 4.7% and 1.1% flaky tests, respectively. Compared with the newly generated RGT tests, more flaky tests exist in the previously generated tests due to external factors such as version, date and time [31].

2. We are not able to run the AgitarOne tests outside of a licensed infrastructure.

3. Version 1.2: commit at 486e2b49d806cdd3288a64ee3c10b3a25632e991

TABLE 5: Misclassified Patches Found by RGT. The Original Authors Confirmed the Classification Error.

                                 Failing RGT Tests
Patch Name                       Evos2019  Rand2019  Category         Consensus
patch1-Lang-35-ACS                     12       140  Dexc2            confirmed
patch1-Lang-43-CapGen                  10         0  Derror           confirmed
patch2-Lang-43-CapGen                  10         0  Derror           confirmed
patch2-Lang-51-Jaid                    43         0  Dassert          confirmed
patch1-Lang-27-SimFix                  32         0  Dexc1            confirmed
patch1-Lang-41-SimFix                 124         0  Dassert          confirmed
patch1-Chart-5-Nopol2015                1       266  Dexc2            confirmed
patch1-Math-50-Nopol2015                2         0  Dexc1            confirmed
patch1-Lang-58-Nopol2015               21         0  Dassert          confirmed
patch1-Math-73-JGenProg2015            49         0  Dexc1, Dassert   confirmed
Sum (patches identified)               10         2  -                10 confirmed


To our knowledge, this is the largest ever curated dataset of generated tests for Defects4J.

4.3 Result of RQ1: RGT Patch Assessment Contradicts Previously Done Manual Analysis

We have executed 30 runs of RGT tests over the 257 patches from Dcorrect. For the 160 patches syntactically equivalent to the ground truth patches, the results are consistent: no RGT test fails. For the remaining 97 patches, the assessment of 16 patches contradicts the previously reported manual analysis (at least one RGT test fails on a patch considered correct in previous research). Of these, 10/16 are true positives while 6/16 are false positives according to our manual analysis.

The ten true positive cases are presented in Table 5. The first column gives the patch name, followed by the number of failing tests from each RGT test set in the second and third columns. The fourth column shows the category of behavioral difference defined in Table 3. The last column gives the result of the conversation we had with the original authors about the actual correctness of the patch. For instance, the misclassified patch patch1-Lang-35-ACS is identified as overfitting by tests from RGTEvosuite2019, and it is exposed by behavioral difference category Dexc2: no exception is thrown by the ground truth program, but exceptions are thrown during the execution of the patched program. This result has been confirmed by the original authors.

RGTEvosuite2019 and RGTRandoop2019 identify 10 and 2 misclassified patches, respectively. This means that Evosuite is better than Randoop on this task. Looking at the behavioral differences, the 10 misclassified patches are exposed by four different categories, which shows that the diversity of behavioral differences is important for RGT assessment.

Notably, the 10 misclassified patches come from 6/14 repair systems, which shows that misclassification in manual patch assessment is a common problem and highlights the limitation of purely manual analysis of patch correctness. 10.3% (10/97) of the previously claimed correct, semantically equivalent patches were overfitting, which shows that manual assessment of semantic APR patches is hard and error-prone.


Previous research [35] reported that over a quarter of correct APR patches are actually semantic patches, which warns us to pay particular attention when assessing their correctness. All 10 patches have been confirmed as misclassified by the original authors. Five researchers gave us feedback that the inputs sampled by the RGT technique were under-considered or missed in their previous manual assessment. The RGT assessment samples corner-case inputs that assist researchers in manual assessment.

We now present a case study to illustrate how those patches are assessed by RGT tests.

Listing 2: Case Study of Two Misclassified Patches

419   int start = pos.getIndex();
420   char[] c = pattern.toCharArray();
421   if (escapingOn && c[start] == QUOTE) {
422 +     next(pos);

a: The ground truth patch for Lang-43

419   int start = pos.getIndex();
420   char[] c = pattern.toCharArray();
421 + next(pos);
422   if (escapingOn && c[start] == QUOTE) {

b: The generated patch of patch1-Lang-43-CapGen

419   int start = pos.getIndex();
420 + next(pos);
421   char[] c = pattern.toCharArray();
422   if (escapingOn && c[start] == QUOTE) {

c: The generated patch of patch2-Lang-43-CapGen

Case study of Lang-43. The CapGen repair tool generates three patches for bug Lang-43. The three patches all consist of a single inserted statement, next(pos), but the insertion happens at three different positions in the program. Among them, one patch is identical to the ground truth patch (Listing 2a): it inserts the statement inside an if-block. Patches patch1-Lang-43-CapGen (Listing 2b) and patch2-Lang-43-CapGen (Listing 2c) insert the correct statement but at different locations, respectively 1 line and 2 lines before the correct position of the ground truth patch. Both patches are classified as overfitting by RGT, because 10 sampled inputs result in a heap space error. With the same inputs, the ground truth patch executes without error; this corresponds to category Derror in Table 3. The original authors have confirmed the misclassification of these two patches. This case study illustrates the difficulty of APR patch assessment: it is unlikely that a heap memory error can be detected by only reading the source code of the patch.

Answers to RQ1: Among the 257 patches claimed as correct in previous work, 160 are syntactically identical to the human-written patch, and 97 are claimed to be semantically equivalent to the human-written patch. We find that 10/97 are assessed as overfitting by the RGT patch assessment. All 10 patches have been confirmed as actually overfitting by their original authors. This shows that manual analysis of the correctness of semantic APR patches is hard and error-prone. The most closely related experiment is the one performed by [17], which is based on 45 claimed correct patches (as opposed to 257) and where one single patch is identified as misclassified. Our experiment significantly improves external validity as it is performed on a five times larger dataset.

4.4 Result of RQ2: False Positives of RGT Assessment

Per the protocol described in subsection 3.5, we identify false positives of RGT assessment by manual analysis of the patches for which at least one RGT test fails. Over the 257 patches from Dcorrect, RGT patch assessment yields 6 false positives. This means that the false positive rate of RGT assessment is 6/257 = 2.3%.

We now discuss the 6 cases that are falsely classified as overfitting by RGT assessment. They are classified into four categories according to their root causes, described in the first column of Table 6. The second column presents the patch name, the third column shows the category of behavioral difference as defined in Table 3, the fourth column gives the RGT test set that contains the failing test, and the last column gives a short explanation.

PRECOND The patch patch1-Math-73-Arja is falsely identified as overfitting because RGT samples inputs that violate implicit preconditions of the program. Listing 3 gives the ground truth patch, the Arja patch and the RGT test that differentiates the behavior of the two patches. In Listing 3c, we can see that RGT samples a negative number, -1397.1657558041848, to update the variable functionValueAccuracy. However, the value of functionValueAccuracy is used in comparisons against absolute values (see the first three lines of Listing 3a). It is meaningless to compare absolute values with a negative number; an implicit precondition is that functionValueAccuracy must be positive, but there is no way for the test generator to infer this precondition.

This case study illustrates that RGT patch assessment may create false positives because the used test generation technique is not aware of preconditions or constraints on inputs. This confirms the challenge for Evosuite of sampling undesired inputs [8]. On the contrary, human developers are able to guess the range of acceptable values based on variable names and common knowledge. This suggests that better support for handling preconditions in test generation would help to increase the reliability of RGT patch assessment.

Listing 3: The Case Study of Patch1-Math-73-Arja

106   if (Math.abs(yInitial) <= functionValueAccuracy) {...}
107   if (Math.abs(yMin) <= functionValueAccuracy) {...}
108   if (Math.abs(yMax) <= functionValueAccuracy) {...}
109 + if (yMin * yMax > 0) {
110 +     throw MathRuntimeException... }

a: The ground truth patch for Math-73.

136   if (Math.abs(yMax) <= functionValueAccuracy) {...}
137 + verifyBracketing(min, max, f);
138   return solve(f, min, yMin, max, yMax, initial, yInitial);

b: The generated patch of patch1-Math-73-Arja.

665   double double1 = -1397.1657558041848;
666   brentSolver0.setFunctionValueAccuracy(double1);

c: The generated test that fails on the generated patch.
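For illustration, the following hypothetical snippet (not actual Commons Math code) shows how making the implicit precondition explicit would turn such sampled inputs into documented rejections rather than spurious behavioral differences.

public class SolverConfig {
    private double functionValueAccuracy;

    public void setFunctionValueAccuracy(double accuracy) {
        // explicit precondition: absolute values are compared against this
        // threshold, so a negative accuracy is meaningless
        if (accuracy < 0.0) {
            throw new IllegalArgumentException("accuracy must be non-negative: " + accuracy);
        }
        this.functionValueAccuracy = accuracy;
    }

    public static void main(String[] args) {
        SolverConfig config = new SolverConfig();
        config.setFunctionValueAccuracy(1.0e-9);               // accepted
        config.setFunctionValueAccuracy(-1397.1657558041848);  // rejected with an exception
    }
}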

EXCEPTION Both patch1-Lang-7-SimFix and patch1-Lang-7-ACS throw the same exception as the one expected in the ground truth program: fail("Expecting exception: NumberFormatException").


TABLE 6: False Positive Cases by RGT Assessment

Root Cause   Correct Patches             Behavioral Difference   RGT Test Set    Reason in Detail
PRECOND      patch1-Math-73-Arja         Dexc2                   Evosuite2019    RGT samples inputs that violate implicit preconditions of the program
EXCEPTION    patch1-Lang-7-DeepRepair    Dexc_loc                Evosuite2019    Same exception thrown from different functions
             patch1-Lang-7-ACS
OPTIM        patch1-Math-93-ACS          Dassert                 Randoop2019     The ground-truth patch is more precise than the APR patch
IMPERFECT    patch1-Chart-5-Arja         Dexc2                   Evosuite2019    RGT reveals a limitation in the ground-truth patch
             patch1-Math-86-Arja


However, these two patches are still assessed as overfitting because the exceptions are thrown from a different function than in the ground truth program. Per the definition of behavioral difference Dexc_loc in Table 3, exceptions thrown by different functions justify an overfitting assessment. RGT assessment thus yields two false positives when verifying the positions from which exceptions are thrown. This suggests that category Dexc_loc may be skipped for RGT, which is easy to do by configuring the corresponding options in the test generators.

OPTIM The patch patch1-Math-93-ACS is assessed as overfitting by RGTRandoop2019 tests because they detect behavioral differences of category Dassert. Bug Math-93 deals with computing a value based on logarithms. The fix from ACS uses ln(n!), which is mathematically equivalent to the human-written solution, the sum of logarithms ∑ ln n. Their behavior should be semantically equivalent. However, the human-written patch introduces an optimization: when n is less than 20, it returns a precalculated value of ∑ ln n. For instance, one of the sampled inputs is n=10; the expected value from the ground truth patch is 15.104412573075516d (looked up in a list of hard-coded results), while the actual value from patch1-Math-93-ACS is 15.104412573075518d. Thus, an assertion failure is raised and RGT classifies this patch as overfitting because of this behavioral difference in output value. This false positive would have been avoided if no optimization had been introduced in the human-written patch taken as ground truth.
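To make the floating-point aspect of this case concrete, the following self-contained sketch (illustrative values, not the Defects4J code) computes ln(10!) in two mathematically equivalent ways that may disagree in the last bits of a double, which is enough to break an exact-equality assertion in a generated test.

public class LogFactorialDemo {
    public static void main(String[] args) {
        int n = 10;

        // Variant 1: sum of logarithms, ln(2) + ln(3) + ... + ln(n)
        double sumOfLogs = 0.0;
        for (int i = 2; i <= n; i++) {
            sumOfLogs += Math.log(i);
        }

        // Variant 2: logarithm of the product, ln(n!)
        long factorial = 1;
        for (int i = 2; i <= n; i++) {
            factorial *= i;
        }
        double logOfFactorial = Math.log(factorial);

        System.out.println("sum of logs      = " + sumOfLogs);
        System.out.println("log of factorial = " + logOfFactorial);
        // The two values agree mathematically but may differ by one or two ulps,
        // so an assertion comparing them with a delta of zero can fail.
        System.out.println("bitwise equal    = " + (sumOfLogs == logOfFactorial));
    }
}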

Our finding suggests that reproducible bug benchmark work (e.g. [2], [23]) should pay additional attention to distinguishing optimization code from repair code in the human-written (reference) patches.

Listing 4: A Null Pointer Exception Thrown when Assessing patch1-Chart-5-Arja

593   for (int i = 0; i < this.data.size(); i++) {
594       XYDataItem item = (XYDataItem) this.data.get(i);
595       if (item.getX().equals(x)) {

IMPERFECT Two cases are falsely classified as overfitting due to imperfections in the human-written patches. Both exhibit behavioral difference category Dexc2: no exception is expected from the ground truth program while exceptions are thrown from the patched program. patch1-Chart-5-Arja throws a null pointer exception because variable item is null when executing the RGT tests; the code snippet is given at line 595 of Listing 4. The human-written patch returns earlier, before executing the problematic code snippet, while the fix by patch1-Chart-5-Arja comes later in the execution flow. Hence, an exception is thrown by patch1-Chart-5-Arja but not by the human-written patch for the illegal input. The other patch, patch1-Math-86-Arja, can actually be considered better than the human-written patch because it is able to signal the illegal value NaN by throwing an exception, while the human-written patch silently ignores the error.

Is the human-written patch a perfect ground truth? RGT and related techniques are based on the assumption that human-written patches are fully correct. Thus, when a test case differentiates the behavior between an APR patch and a human-written patch, the APR patch is considered overfitting. The experimental results we have presented confirm that human-written patches are not perfect. Our findings confirm that the human patch itself may be problematic [13], [45]. However, we are the first to reveal how the imperfection of human patches impacts automatic patch correctness assessment. Beyond that, as shown in this section, optimizations introduced in the same commit as the bug fix and other limitations influence the identification of overfitting patches by RGT assessment.

Answers to RQ2: According to this experiment, the false positive rate of RGT patch assessment is 6/257 = 2.3%. Considering this false positive rate as reasonable, researchers can rely on this technique for providing better assessment results of their program repair contributions. Our detailed case studies warn that blindly considering the human-written patch as a perfect ground truth is fallacious. To our knowledge, this is the first analysis of false positives for automated patch assessment.

4.5 Result of RQ3: Effectiveness of RGT Assessment Compared to DiffTGen

We have executed 30 runs of DiffTGen over Dcorrect. DiffTGen identifies 2 patches as overfitting, both of which were misclassified as correct (patch2-Lang-51-Jaid and patch1-Math-73-JGenProg2015). Recall that RGT patch assessment identifies in total 10 misclassified patches, including the 2 patches found by DiffTGen. This shows that RGT is more effective than DiffTGen.

Per the core algorithm of DiffTGen and its implementation, DiffTGen can only handle category Dassert of behavioral difference (value difference in an assertion). However, DiffTGen fails to identify another two misclassified patches of the Dassert category that are found by RGT: patch1-Lang-58-Nopol2015 and patch1-Lang-41-SimFix. This is because DiffTGen fails to sample an input that differentiates the instrumented buggy and human-written patched programs, while our RGT assessment does not require those instrumented programs.

Fig. 1: The Effectiveness of RGT and DiffTGen

Further, we have performed 30 executions of RGT tests and DiffTGen over all 381 patches from Doverfitting. RGTEvosuite2019 yields 7,923 test failures and RGTRandoop2019 yields 65,819 test failures. Specifically, RGTEvosuite2019 identifies 248 overfitting patches and RGTRandoop2019 identifies 118 overfitting patches; together they identify 274 overfitting patches. DiffTGen identifies 143/381 overfitting patches. Our experiment provides two implications: (1) RGT patch assessment improves over DiffTGen, and (2) for RGT patch assessment, Evosuite outperforms Randoop in sampling inputs that differentiate program behaviors by 210% (248/118), but considering the two test generators together maximizes the effectiveness of overfitting patch identification.
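Combining the two generators amounts to a union of their verdicts: a patch is labeled overfitting as soon as at least one generated test from either tool fails on it. The tiny sketch below (with placeholder patch identifiers, not our actual result files) shows this combination step:

    import java.util.HashSet;
    import java.util.Set;

    public class CombineVerdicts {
        public static void main(String[] args) {
            // Placeholder ids; in our experiment these sets contain 248 and 118
            // patches, and their union contains 274 patches.
            Set<String> flaggedByEvosuiteTests = new HashSet<>(Set.of("patchA", "patchB"));
            Set<String> flaggedByRandoopTests = new HashSet<>(Set.of("patchB", "patchC"));

            Set<String> overfitting = new HashSet<>(flaggedByEvosuiteTests);
            overfitting.addAll(flaggedByRandoopTests); // union of the two verdicts
            System.out.println(overfitting.size());    // prints 3 distinct patches here
        }
    }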

Figure 1 shows the number of overfitting patches in the Doverfitting dataset identified by RGT assessment and DiffTGen. RGT gives better results than DiffTGen for all Defects4J projects. An outlier case is Closure, where the effectiveness is low both for RGT (9/37) and for DiffTGen (0/37). After analysis, the reason is that Closure relies heavily on private methods and third-party APIs. As a result, the considered automatic test generators are ineffective at sampling good inputs.

Figure 2 shows the proportion of behavioral differences detected by RGT tests and DiffTGen per the taxonomy presented in Table 3. The proportions are computed over the 7,923 test failures of RGTEvosuite2019, the 65,819 test failures of RGTRandoop2019, and the 143 behavioral differences detected by DiffTGen. RGTEvosuite2019 (top horizontal bar) detects six categories of behavioral differences and RGTRandoop2019 detects five categories. DiffTGen is only able to detect behavioral differences due to an assertion failure between expected and actual values. In all cases, we see that assertion failure is the most effective category for detecting behavioral differences of overfitting patches. Moreover, exceptions are also effective for detecting behavioral differences, and this is the key factor behind RGT's advantage over DiffTGen. Notably, the two considered test generators are not equally good at generating exceptional cases, e.g., 31.9% of RGTEvosuite2019 failing tests expose differences of category Dexc1 while only 2.8% of RGTRandoop2019 tests do so. Similarly, we note that Randoop does not support exception assertions based on the thrown location (0% of Dexc loc).

Fig. 2: Categories of Behavioral Differences Detected by RGT and DiffTGen

Answer to RQ3: Out of 381 patches claimed as overfitting by manual analysis, RGT assessment automatically identifies 274/381 (72%) of them. RGT improves over the state-of-the-art technique DiffTGen by 190% (274 versus 143 patches detected as overfitting). RGT is a fully automated technique that can relieve researchers from manually labeling overfitting patches. By using RGT patch assessment, the research community can provide assessment results at a larger scale. The most related experiments are [17] and [40], based on 135 and 79 overfitting patches respectively; our experiment is performed on a 2.8X and 4.8X larger dataset.

4.6 Result of RQ4: Time Cost of RGT Patch Assessment

Table 7 summarizes the time cost of RGT patch assessment. The first column gives the breakdown of time cost as explained in subsection 3.5. The second and third columns give the cost for the RGT tests we have generated for this study, while the fourth to sixth columns correspond to the three categories of RGT tests generated in previous research projects and shared by their respective authors. TCGen time is not available for the previously generated RGT tests: it was reported by their authors and measuring it again is not our goal, so we put a '-' in the corresponding cells. For example, the second column indicates that RGTEvosuite2019 required 136.3 hours for test case generation, 2.9 hours for the sanity check, 6.2 hours for assessing the correctness of patches in the Dcorrect dataset, and 9.1 hours in the Doverfitting dataset.

TABLE 7: Time Cost of RGT Patch Assessment

                                         RGTEvosuite2019  RGTRandoop2019  RGTEvosuiteASE15  RGTRandoopASE15  RGTEvosuiteEMSE18
Test generation (TCGen)                  136.3 hrs        109.7 hrs       -                 -                -
Sanity check (SC)                        2.9 hrs          2.5 hrs         1.3 hrs           2.6 hrs          1.1 hrs
Test execution on Dcorrect (EXEC1)       6.2 hrs          5.2 hrs         1.6 hrs           5.1 hrs          1.7 hrs
Test execution on Doverfitting (EXEC2)   9.1 hrs          7.7 hrs         2.3 hrs           7.6 hrs          2.3 hrs
Sum in hours                             154.5 hrs        125.1 hrs       5.2 hrs           15.3 hrs         5.1 hrs

We observe that TCGen is the dominant time cost of RGT patch assessment. RGTEvosuite2019 and RGTRandoop2019 respectively spend 136.3/154.5 hours (88.2%) and 109.7/125.1 hours (87.7%) on test generation.

The three sets of previously generated RGT tests require 5.2, 15.3 and 5.1 hours for assessing patch correctness over the Dcorrect and Doverfitting datasets. Our experiment shows that reusing tests from previous research is a significant time saver.

Note that the execution time of RGTEvosuiteASE15 is less than that of RGTEvosuite2019. This is because RGTEvosuiteASE15 contains only 10 runs of test generation while RGTEvosuite2019 contains 30 runs. With the same test generation configuration, RGTEvosuiteEMSE18 is faster than RGTEvosuite2019 because it only contains tests for 42 bugs.

Now we take a look at the effectiveness of RGT tests from previous research. These tests identify 9 out of 10 misclassified patches from Dcorrect (the missing one is patch1-Lang-35-ACS). From Doverfitting, a total of 219 overfitting patches are found by the three sets of previously generated RGT tests together. Recall that RGTEvosuite2019 and RGTRandoop2019 together identify 274 overfitting patches for Doverfitting. Despite a smaller number of tests, RGT tests from previous research achieve 80% (219/274) of the effectiveness of our newly generated RGT tests. Therefore, RGT tests generated from previous research are considered effective and efficient for patch correctness assessment.

Answer to RQ4: Over 87% of the time cost of RGT patch assessment is spent in test case generation. However, it is possible to reuse previously generated RGT tests to save time. This also improves scientific reproducibility and coherence because all researchers can assess APR patches with the same generated tests.

4.7 Result of RQ5: Trade-off between Test Generation and Effectiveness of RGT

Figure 3 illustrates how the average number of test generation runs relates to effectiveness, shown respectively on the y-axis and x-axis. The results are based on 1,000 random groups executed over the Doverfitting dataset, each of which contains 30 random sequences of RGT test generations. Please note that effectiveness is measured relative to the full 30 test generation runs reported in RQ3. Recall that RGTEvosuite2019 and RGTRandoop2019 individually identify 248 and 148 overfitting patches from Doverfitting. Thus, for instance, 80% effectiveness means identifying 198 and 118 overfitting patches respectively.
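As a minimal sketch of one plausible implementation of this resampling estimate (the data layout and names are assumed and may differ from our actual scripts), the average number of generation runs needed to reach a target fraction of the 30-run effectiveness can be estimated as follows:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    public class TradeOffSketch {
        // detectedByRun.get(i) = ids of overfitting patches exposed by generation run i.
        static double averageRunsToReach(List<Set<String>> detectedByRun,
                                         double targetFraction, int groups, Random random) {
            Set<String> all = new HashSet<>();
            detectedByRun.forEach(all::addAll);                 // union of all 30 runs = 100%
            int target = (int) Math.ceil(targetFraction * all.size());

            long totalRuns = 0;
            for (int g = 0; g < groups; g++) {                  // e.g. 1,000 random groups
                List<Set<String>> shuffled = new ArrayList<>(detectedByRun);
                Collections.shuffle(shuffled, random);          // random order of the runs
                Set<String> found = new HashSet<>();
                int runs = 0;
                for (Set<String> run : shuffled) {
                    found.addAll(run);
                    runs++;
                    if (found.size() >= target) {
                        break;                                  // target effectiveness reached
                    }
                }
                totalRuns += runs;
            }
            return totalRuns / (double) groups;                 // average over all groups
        }
    }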

Fig. 3: The number of overfitting patches found depending on the number of test generation runs. The x-axis indicates the percentage of effectiveness and the y-axis indicates the number of test generation runs.

For both techniques, the more test generation runs, the better the effectiveness of RGT assessment. Nevertheless, even a small number of test generation runs obtains 80% of the effectiveness, with an average of 4.45 and 2.96 runs respectively. Achieving the last 5% of effectiveness (from 95% to 100%) is the most expensive, requiring an additional 11.2 (27.8 − 16.6) and 7.2 (23 − 15.8) test generation runs for RGTEvosuite2019 and RGTRandoop2019.

For RGTEvosuite2019, the trade-off point is around 9 test generation runs, which achieve 90% effectiveness of overfitting patch identification. The cost of test generation increases significantly to obtain more than 90% effectiveness. With an average of 27.8 test generation runs, all overfitting patches are found. This shows that test generators only occasionally sample the corner-case inputs needed to identify some overfitting patches.

For RGTRandoop2019, the number of test generation runs needed for 90% and 95% effectiveness is quite close, 14.75 and 15.8 on average. In our experiment, the trade-off point is around 16 test generation runs, which is equivalent to 95% effectiveness of overfitting patch classification.

Answer to RQ5: The more test generation runs for RGT assessment, the better the effectiveness of overfitting patch identification. Yet, a trade-off exists between the time spent in test generation and automated patch assessment effectiveness: 9 runs of Evosuite and 16 runs of Randoop. This provides a practical configuration guideline for future research and experiments.


5 ACTIONABLE DATA

Table 1 at the beginning of this paper lists the actionable implications obtained from our original experiments. Furthermore, our work provides actionable data for future research in automatic program repair.

A dataset of 638 APR patches for Defects4J. We have collected and canonicalized 638 original patches from 14 different repair systems; they form our experiment dataset. All patches have gone through strict sanity checks. This is a reusable asset for future research in program repair, in particular to study anti-overfitting techniques and behavioral analysis.

A dataset of 4,477,707 RGT tests for Defects4J. We have curated 4,477,707 generated test cases from two test generation systems. They complement the manual tests written by developers with new assertions and input points sampled from the input space. Overall, they provide a specification for Defects4J bugs; given the magnitude, it is possibly the largest specification ever of the expected behavior of Defects4J bugs. This is essential for program repair research, which heavily relies on Defects4J. We believe it could also be of great value in other research fields such as fault localization, testing and bug clustering.

6 THREATS TO VALIDITY

We now discuss the threats to the validity of our results.

Threats to internal validity. A threat to internal validity relates to the implementation of the methodology. 1) Threats to validity in RGT. The removal of flaky tests from RGT may discard test inputs that could expose behavioral differences. For this reason, the results we report are potentially an under-estimation of RGT's effectiveness. 2) Threats to validity in DiffTGen. DiffTGen requires a mandatory configuration of syntactic deltas, which is not provided by the authors of DiffTGen. Consequently, in our experiment, we improved DiffTGen to automatically generate the delta information. We observe that minor differences in those deltas could produce different results: this poses a threat to the DiffTGen results reported in this paper. We provide the delta information in our public open-science repository [1] so that future research can verify it and build on top of it.

Threats to external validity. The threats to external validity correspond to the generalizability of our findings. In this paper, we perform our experiments on the Defects4J benchmark with 638 patches. We acknowledge that the results may differ if another bug benchmark is used [7], [23]. Future research on other benchmarks will further improve external validity. To the best of our knowledge, our experiment analyzing 638 patches from automatic repair research with 4,477,707 generated tests is the largest ever reported.

7 RELATED WORK

We now discuss the related work on patch correctness assessment and approaches focusing on alleviating overfitting patch generation.

7.1 Patch Assessment

To assess a patch, one must first be able to cover it. Marinescu and Cadar [24] proposed KATCH, which uses symbolic execution to generate test inputs that cover the patched code. In our paper, we consider search-based test generation instead of a symbolic execution approach.

The work most related to our paper is the study by Le et al. [17], which investigates the reliability of manual patch assessment and of automatic patch assessment with DiffTGen and Randoop. There are four major differences between [17] and our experiment: 1) our key result shows that 72% of overfitting patches can be discarded with automated patch assessment, a significant improvement over [17], in which fewer than 20% of overfitting patches could be identified; 2) we provide novel experiments to comprehensively study automatic patch correctness assessment, including false positive measurement, time cost estimation, and trade-off analysis; 3) they consider patches generated by 8 repair systems while we consider 14 repair systems; 4) their dataset is composed of 189 patches while our dataset contains 638 patches.

Ye et al. [44] use RGT tests to assess patch correctness on the QuixBugs benchmark. There are two major differences between [44] and our experiment: (1) their experiment is performed on small buggy programs whose size ranges from 9 to 67 lines of code, while our experiment is performed on real-world bug repositories; (2) their dataset is composed of 64 patches while our dataset contains 638 patches.

There are several works focusing on alleviating overfitting patch generation from the perspective of practical usage, rather than on automatic patch correctness assessment for scientific study.

Xiong et al. [41] propose PATCH-SIM and TEST-SIM to heuristically determine the correctness of generated patches without oracles. They run the tests before and after patching the buggy program and measure the degree of behavior change. TEST-SIM complements PATCH-SIM by determining the test results of new test inputs generated by Randoop. Our experiment shows that Evosuite outperforms Randoop in sampling test inputs that differentiate program behaviors, which suggests that the effectiveness of this approach could be improved by also considering Evosuite for test generation.

Although PATCH-SIM is able to filter out overfitting patches, we consider RGT assessment better suited than PATCH-SIM for scientific study for two reasons: (1) RGT assessment is more effective at identifying overfitting patches (72% for RGT versus 56% for PATCH-SIM as reported in [41]); (2) PATCH-SIM, which does not use a ground truth patch, suffers from a significant false positive rate (8.25%), whereas RGT assessment reduces this false positive rate to 2.3%. PATCH-SIM could be improved for scientific study by comparing the test execution difference against a ground truth program. Nevertheless, due to the high cost of PATCH-SIM's execution trace comparison, this approach is too expensive for scientific patch assessment.

Tan et al. [34] aim to identify overfitting patches with predefined templates that capture typical overfitting behaviors. They propose anti-patterns to assess whether a patch violates specific static structures. Recent work by [12] aims to improve anti-patterns by combining them with machine learning techniques. In contrast, RGT assessment fully relies on run-time behavioral differences to identify overfitting patches. While related, anti-patterns are not designed for assessing patch correctness: being based on static structures, they typically do not discard syntactically different yet semantically equivalent patches, as discussed by the authors.

Yang et al. [43] propose Opad and Gao et al. [10] propose Fix2Fit, two approaches based on implicit oracles for detecting overfitting patches that introduce crashes or memory-safety problems. Using these two approaches for automatic patch correctness assessment would underestimate the number of overfitting patches, and is of limited use for Java, where such memory-safety problems do not occur.

D'Antoni et al. [6] propose Qlose to quantify the changes between the buggy program and the potential patch in terms of syntactic distances and semantic distances. They use program execution traces as a measure to rank patches. With the ground truth patch, this technique can be used to assess the correctness of automatic repair patches.

In S3 [18], the syntactic and semantic distances between a patched and a buggy program are used to drive synthesis towards generating less overfitting patches. This approach could be extended with a ground truth patch, calculating the syntactic and semantic distances between an automatic repair patch and the ground truth patch for automated patch assessment.

Overall, all these techniques identify overfitting patches as part of the repair process; they are not techniques for the scientific evaluation of program repair research.

7.2 Study of Overfitting

Smith et al. [33] find that overfitting patches fix certain program behaviors but tend to break otherwise correct behaviors. They study the impact of test suite coverage on generating correct patches: test suites with higher coverage lead to higher quality patches. Consequently, patches generated with lower coverage test suites are prone to be overfitting. Our study has a different scope: we look at the use of generated tests for automatic correctness assessment, not the impact of coverage.

Long and Rinard [22] conduct an analysis of the search spaces of two APR systems. Their analysis shows that the search space contains more overfitting patches than correct patches: overfitting patches that nevertheless pass all of the test cases are typically orders of magnitude more abundant. This highlights the need for automated patch assessment techniques. Our results on automatic patch correctness are encouraging news for researchers assessing overfitting patches at scale.

Qi et al. [29] and Le et al. [20] perform empirical overfitting studies of automatic program repair. They confirm that automatic program repair indeed produces between 70% and 98% overfitting patches. By using RGT patch assessment, a majority of the manual work of APR patch correctness assessment could be saved.

Yu et al. [46] analyze the overfitting problem in program repair and identify two overfitting issues: incomplete fixing (A-Overfitting) and regression introduction (B-Overfitting). The former means that the generated patches only partially repair the bug, while the latter concerns patches that break already correct behaviors. Their experiments show that automatically generated tests are valuable for identifying B-Overfitting (regression introduction). Our study to some extent confirms and complements their results: RGT tests based on regression oracles are effective at detecting behavioral differences. Their experiment is performed on 42 patches; our study has a much larger scope, with the automatic assessment of 638 patches (15 times bigger).

8 CONCLUSION

We have presented a large-scale empirical study of automated patch correctness assessment. Our study confirms that manual patch correctness analysis is error-prone. Our automated patch assessment technique identifies 10 overfitting patches that were misclassified as correct by previous research; all of them have been confirmed by the original authors (RQ1). However, automated patch assessment is not perfect either: we measured a false positive rate of 2.3% and discussed the false positive cases in detail (RQ2). Overall, automated patch assessment identifies 72% of the overfitting patches, which saves much manual effort for APR researchers (RQ3). Our experiments also show that over 87% of the time cost of RGT assessment is spent in test case generation (RQ4) and that a trade-off exists between the time spent in test generation and automated patch assessment effectiveness (RQ5).

Our results are encouraging news for researchers in the program repair community: automatically generated test cases do help to assess patch correctness in scientific studies. To support the community and encourage automated patch assessment in future program repair experiments on Defects4J bugs, we make the dataset of 638 patches and 4,477,707 generated tests publicly available.

In the future, in addition to the random test generation approaches used in this study, we will also consider goal-directed test generation to cover the changes in patches. One direction is to use symbolic execution, which employs constraint collection and solving to explore the state space of feasible execution paths and to reveal behavioral differences between programs.

REFERENCES

[1] Experiment repository, available at https://anonymous.4open.science/r/cffe573f-61ab-4d99-9e7c-dc769d657e75/.
[2] Samuel Benton, Ali Ghanbari, and Lingming Zhang. Defexts: A curated dataset of reproducible real-world bugs for modern JVM languages. In Proceedings of the 41st International Conference on Software Engineering: Companion Proceedings, ICSE '19, 2019.
[3] L. Chen, Y. Pei, and C. A. Furia. Contract-based program repair without the contracts. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2017.
[4] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noel Pouchet, Denys Poshyvanyk, and Martin Monperrus. SequenceR: Sequence-to-sequence learning for end-to-end program repair. CoRR, abs/1901.01808, 2019.


[5] Patrick Cousot, Radhia Cousot, Manuel Fahndrich, and Francesco Logozzo. Automatic inference of necessary preconditions. In Roberto Giacobazzi, Josh Berdine, and Isabella Mastroeni, editors, Verification, Model Checking, and Abstract Interpretation. Springer Berlin Heidelberg, 2013.
[6] Loris D'Antoni, Roopsha Samanta, and Rishabh Singh. Qlose: Program repair with quantitative objectives. In Computer Aided Verification, 2016.
[7] Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '19), 2019.
[8] G. Fraser and A. Arcuri. Evosuite: On the challenges of test case generation in the real world. In 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, 2013.
[9] Gordon Fraser and Andrea Arcuri. Evosuite: Automatic test suite generation for object-oriented software. In ESEC/FSE '11: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011.
[10] Xiang Gao, Sergey Mechtaev, and Abhik Roychoudhury. Crash-avoiding program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, 2019.
[11] Luca Gazzola, Daniela Micucci, and Leonardo Mariani. Automatic software repair: A survey. IEEE Transactions on Software Engineering, 2017.
[12] Ali Ghanbari. Validation of automatically generated patches: An appetizer. 2019.
[13] Zhongxian Gu, Earl T. Barr, David J. Hamilton, and Zhendong Su. Has the bug really been fixed? In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, New York, NY, USA, 2010. ACM.
[14] Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering, 2018.
[15] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. Shaping program repair space with existing patches and similar code. In ISSTA, 2018.
[16] Rene Just, Darioush Jalali, and Michael D. Ernst. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, 2014.
[17] Xuan-Bach D. Le, Lingfeng Bao, David Lo, Xin Xia, and Shanping Li. On reliability of patch correctness assessment. In Proceedings of the 41st ACM/IEEE International Conference on Software Engineering, 2019.
[18] Xuan-Bach D. Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. S3: Syntax- and semantic-guided repair synthesis via programming by examples. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, 2017.
[19] Xuan Bach D. Le, David Lo, and Claire Le Goues. History driven program repair. In Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, volume 1. IEEE, 2016.
[20] Xuan-Bach D. Le, Ferdian Thung, David Lo, and Claire Le Goues. Overfitting in semantics-based automated program repair. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, New York, NY, USA, 2018. ACM.
[21] X. Liu and H. Zhong. Mining stackoverflow for program repair. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018.
[22] Fan Long and Martin Rinard. An analysis of the search spaces for generate and validate patch generation systems. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, New York, NY, USA, 2016. ACM.
[23] Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER '19), Hangzhou, China, 2019. IEEE.

[24] Paul Dan Marinescu and Cristian Cadar. KATCH: High-coverage testing of software patches. In European Software Engineering Conference / ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2013), August 2013.

[25] Matias Martinez, Thomas Durieux, Romain Sommerard, Jifeng Xuan, and Martin Monperrus. Automatic Repair of Real Bugs in Java: A Large-Scale Experiment on the Defects4J Dataset. Springer Empirical Software Engineering, 2016.
[26] Matias Martinez and Martin Monperrus. Ultra-large repair search space with automatically mined templates: The Cardumen mode of Astor. In SSBSE 2018 - 10th International Symposium on Search-Based Software Engineering.
[27] Martin Monperrus. Automatic software repair: A bibliography. ACM Comput. Surv., 51(1), January 2017.
[28] Carlos Pacheco and Michael D. Ernst. Randoop: Feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion, OOPSLA '07, 2007.
[29] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015, 2015.
[30] Ripon K. Saha, Yingjun Lyu, Hiroaki Yoshida, and Mukul R. Prasad. Elixir: Effective object oriented program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, 2017.
[31] Sina Shamshiri, Rene Just, Jose Miguel Rojas, Gordon Fraser, Phil McMinn, and Andrea Arcuri. Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2015.
[32] D. Shriver, S. Elbaum, and K. T. Stolee. At the end of synthesis: Narrowing program candidates. In 2017 IEEE/ACM 39th International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track (ICSE-NIER), 2017.
[33] Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. Is the cure worse than the disease? Overfitting in automated program repair. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, 2015.
[34] Shin Hwei Tan, Hiroaki Yoshida, Mukul R. Prasad, and Abhik Roychoudhury. Anti-patterns in search-based program repair. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, 2016.
[35] Shangwen Wang, Ming Wen, Liqian Chen, Xin Yi, and Xiaoguang Mao. How different is it between machine-generated and developer-provided patches? An empirical study on the correct patches generated by automated program repair techniques. In International Symposium on Empirical Software Engineering and Measurement, 2019.
[36] Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. Context-aware patch generation for better automated program repair. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, 2018.
[37] Martin White, Michele Tufano, Matias Martinez, Martin Monperrus, and Denys Poshyvanyk. Sorting and transforming program repair ingredients via deep learning code similarities. 2018.
[38] Tao Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Dave Thomas, editor, ECOOP 2006 – Object-Oriented Programming, 2006.
[39] Q. Xin and S. P. Reiss. Leveraging syntax-related code for automated program repair. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2017.
[40] Qi Xin and Steven P. Reiss. Identifying test-suite-overfitted patches through test case generation. In ISSTA, 2017.
[41] Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, and Gang Huang. Identifying patch correctness in test-based program repair. In Proceedings of the 40th International Conference on Software Engineering, 2018.
[42] Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. Precise condition synthesis for program repair. In Proceedings of the 39th International Conference on Software Engineering, 2017.
[43] Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. Better test cases for better automated program repair. In Proceedings of the 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4-8, 2017 (ESEC/FSE '17), 2017.


[44] H. Ye, M. Martinez, T. Durieux, and M. Monperrus. A comprehensive study of automatic program repair on the QuixBugs benchmark. In 2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF), 2019.
[45] Zuoning Yin, Ding Yuan, Yuanyuan Zhou, Shankar Pasupathy, and Lakshmi Bairavasundaram. How do fixes become bugs? In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE '11, 2011.
[46] Zhongxing Yu, Matias Martinez, Benjamin Danglot, Thomas Durieux, and Martin Monperrus. Alleviating patch overfitting with automatic test generation: A study of feasibility and effectiveness for the Nopol repair system. Empirical Software Engineering, 2018.
[47] Yuan Yuan and Wolfgang Banzhaf. Arja: Automated repair of Java programs via multi-objective genetic programming. IEEE Transactions on Software Engineering, 2018.