
PhD Dissertation

International Doctorate School in Information and Communication Technologies

DISI - University of Trento

EVOLUTIONARY TEST CASE GENERATION VIA

MANY OBJECTIVE OPTIMIZATION AND

STOCHASTIC GRAMMARS

Fitsum Meshesha Kifetew

Advisor: Paolo Tonella, PhD

Fondazione Bruno Kessler

Co-Advisor: Roberto Tiella

Fondazione Bruno Kessler

November 2015


To my family.


Acknowledgements

First of all, I would like to thank my advisor Paolo Tonella for giving me the opportunity to pursue this doctoral research and for his constant advice and support. He has always been available, despite his busy schedule, and was keen to provide all the help I needed to carry out the research work. I would also like to extend my gratitude to my co-advisor Roberto Tiella for his interesting comments and discussions as well as his relentless effort to help me tackle challenges I faced while conducting this PhD research.

During the course of this PhD work, I was fortunate to work in collaboration with great researchers from other institutions. Part of the work in this thesis is the result of these collaborative works with Alessandro Orso, Wei Jin, and Annibale Panichella. I would also like to thank Gordon Fraser for his unreserved help with the EvoSuite tool and for his hospitality during my short visit to Sheffield.

I would also like to thank the members of the thesis evaluation committee, Prof. Mauro Pezzè from Università degli Studi di Milano-Bicocca, Prof. Mark Harman from University College London, and Prof. Roberto Sebastiani from Università degli Studi di Trento, for accepting the task of evaluating this thesis, taking time out of their busy schedules, and for their useful feedback.

I am grateful to my wonderful colleagues at FBK: Angelo, Anna, Mariano, Andrea, Mirko, Alberto, Chiara, Cu, Itzel, Matthieu, Biniam, and Gunel, for the interesting discussions and insights not only on research related issues but on a wide variety of topics, as well as for the various social events which made the work environment enjoyable.

My heartfelt thanks goes to my dear friend Surafel, who has been there for me during the PhD, giving me all the help I needed both in my personal and professional life. I would like to extend my heartfelt thanks to my friends Birhanu, Komminist, Biruk, and Ephrem for the good times we had and the various interesting conversations that helped me balance my personal and work life.

I am forever indebted to my parents, to whom I owe everything. I am also grateful for the continuous love and support of my brother Sol and my sister Fasik.

And last, but not least, I would like to specially thank my wife Hirut, who always stood by my side during the ups and downs of the PhD life, and my daughter Blen, whose love and affection keep me going. Thank you!


Abstract

In search based test case generation, most of the research works focus on the single-objective formulation of the test case generation problem. However, there are a wide variety of multi- and many-objective optimization strategies that could offer advantages currently not investigated when addressing the problem of test case generation. Furthermore, existing techniques and available tools mainly handle test generation for programs with primitive inputs, such as numeric or string input. The techniques and tools applicable to such types of programs often do not effectively scale up to large sizes and complex inputs.

In this thesis work, at the unit level, branch coverage is reformulated as a many-objective optimization problem, as opposed to the state of the art single-objective formulation, and a novel algorithm is proposed for the generation of branch adequate test cases.

At the system level, this thesis proposes a test generation approach that combines stochastic grammars with genetic programming for the generation of branch adequate test cases. Furthermore, the combination of stochastic grammars and genetic programming is also investigated in the context of field failure reproduction for programs with highly structured input.

Keywords: evolutionary test case generation, many-objective optimization, failure reproduction, grammar based testing


Contents

1 Introduction
  1.1 The Context
  1.2 The Problem
  1.3 The Solution
  1.4 Contribution
  1.5 Organization of the Thesis

2 Background
  2.1 Context-Free Grammars and Derivation
    2.1.1 Notation and Definitions
    2.1.2 The 80/20 Rule
  2.2 Evolutionary Algorithms
    2.2.1 Genetic Algorithms
    2.2.2 Genetic Programming
  2.3 Multi-Objective Optimization

3 State of the Art
  3.1 Unit Level Test Generation
    3.1.1 Search Based Software Testing
    3.1.2 Dynamic Symbolic Execution
    3.1.3 Hybrid Approaches: SBST + DSE
  3.2 System Level Test Generation
  3.3 Reproducing Field Failures

4 Many-Objective Optimization for Unit Test Generation
  4.1 Problem Formulation
  4.2 Existing Algorithms
  4.3 MOSA
    4.3.1 Graphical Interpretation
  4.4 Empirical Evaluation
    4.4.1 Prototype Tool
    4.4.2 Subjects
    4.4.3 Metrics
    4.4.4 Experiment Protocol and Settings
    4.4.5 Results
    4.4.6 Qualitative Analysis
    4.4.7 Threats to Validity
  4.5 Related Works
  4.6 Conclusion

5 System Level Test Generation for Coverage of Programs with Structured Input
  5.1 Learning Probabilities
  5.2 Annotated Grammars
    5.2.1 Types
    5.2.2 Annotation Syntax
    5.2.3 Supporting Data Structures
    5.2.4 Annotation Example
  5.3 Sentence Generation
    5.3.1 Representation of Individuals
    5.3.2 Genetic Operators
    5.3.3 Fitness Evaluation
  5.4 Empirical Evaluation
    5.4.1 Prototype Tool
    5.4.2 Metrics
    5.4.3 Subjects
    5.4.4 Experiment Protocol and Settings
    5.4.5 Results
    5.4.6 Threats to Validity
  5.5 Related Works
  5.6 Conclusion

6 System Level Test Generation for Reproducing Failures of Programs with Structured Input
  6.1 Terminology
  6.2 SBFR
    6.2.1 Seeding the Search with Representative Inputs
    6.2.2 Input Representation and Genetic Operators
    6.2.3 Fitness Computation and Search Termination
  6.3 Empirical Evaluation
    6.3.1 Prototype Tool
    6.3.2 Subjects
    6.3.3 Experiment Protocol and Settings
    6.3.4 Results
    6.3.5 Discussion
    6.3.6 Threats to Validity
  6.4 Related Works
  6.5 Conclusion

7 Conclusion
  7.1 Summary of Contributions
  7.2 Summary of Future Works

Bibliography

A The Inside-Outside Algorithm

B Own Publications
  B.1 Journal Publications
  B.2 Conference Publications


List of Tables

4.1 Parameter settings

4.2 Coverage achieved by WS and MOSA along with p-values from the Wilcoxon test. Numeric and verbal effect size (A12) values are also shown. A12 > 0.5 means MOSA is better than WS; A12 < 0.5 means WS is better than MOSA, and A12 = 0.5 means they are equal. Significantly better values are shown in boldface.

4.3 Budget consumed by each approach to achieve the best coverage. P-values and effect size statistics are also shown. A12 < 0.5 means MOSA is better than WS, A12 > 0.5 means WS is better than MOSA, and A12 = 0.5 means they are equal. Statistically significant values are printed in boldface.

5.1 Subjects used in our experimental study

5.2 Number of sentences and tokens in the corpus used for each subject during learning. Also shown is the average number of tokens per sentence.

5.3 Unique sentences and proportion of valid sentences produced by the various sentence generation strategies. In the last 3 columns, highest values in boldface differ from the second highest in a statistically significant way according to the Wilcoxon test, at significance level 0.05.

5.4 Branch coverage and p-values obtained from the Wilcoxon test comparing LRN and AN. Statistically significant values (at significance level 0.05) are shown in boldface under columns LRN and AN; the highest values are in gray background. Effect size measures using the Vargha-Delaney statistic (A12) are also shown. A12 < 0.5 means LRN is better than AN, A12 > 0.5 means AN is better than LRN, and A12 = 0.5 means they are equal.

5.5 Branches covered by AN and LRN: intersection, differences and similarity (Jaccard index)

5.6 Mutation scores achieved by RND and GP when using annotations (AN) and learning (LRN). Significantly better values are shown in boldface; the highest values are in gray background.

5.7 Coverage achieved by RND and GP when 5, 10, 15, 20, and 25% of the annotations are dropped. The corresponding loss or gain in coverage is also shown; values above 1pp are in highlighted background.

6.1 Subjects used in the experimental study.

6.2 Failure reproduction probability for RND and SBFR. Statistically significant p-values are shown in boldface.

6.3 Uncompressed (SZ) and compressed (ZSZ) execution trace size.

6.4 Test suite execution time before and after instrumentation.

6.5 Failure reproduction probability (FRP) for SBFR and RND with initialization using the learned stochastic grammar, rather than the 80/20 rule.


List of Figures

2.1 A simple grammar, a derivation for the string “(n)+n” and its syntax tree

2.2 Graphical representation of an EA

2.3 Crossover between two test cases in a GA

2.4 Mutation of a test case in a GA

2.5 Graphical representation of subtree crossover in GGGP, based on the grammar shown in Figure 2.1

2.6 Graphical representation of subtree mutation in GGGP, based on the grammar shown in Figure 2.1

2.7 Solutions in a Pareto front representing trade-offs among contrasting objectives

3.1 Test case generation targeting one branch at a time.

4.1 Graphical comparison between the non-dominated rank assignment obtained by the traditional non-dominated sorting algorithm and the ranking algorithm based on the preference criterion proposed in this thesis.

4.2 Comparison of coverage achieved by WS and MOSA over 100 independent runs on Conversion.java.

4.3 Comparison of the budget consumed by WS and MOSA over 100 independent runs on SchurTransformer.java.

4.4 Example of uncovered branch for MatrixUtils

5.1 Example grammar

5.2 A possible annotation for the grammar in Figure 5.1

5.3 Annotated grammar extracted from the JavaScript grammar

5.4 During fitness evaluation, tree representations are unparsed and wrapped into sequences of Java statements

5.5 Coverage box plots under various configurations (from left to right: RND 8020, RND LRN, RND AN, GP 8020, GP LRN, GP AN) for the experimental subjects: Calc, Basic, Kahlua, and MDSL

5.6 Coverage box plots under various configurations (from left to right: RND 8020, RND LRN, RND AN, GP 8020, GP LRN, GP AN) for the experimental subjects: Javascal and Rhino

6.1 Overview of SBFR

6.2 Prototype tool that implements SBFR

A.1 Graphical representation of a derivation of w which uses non-terminal u


Chapter 1

Introduction

Testing constitutes an important activity in the software development process. A reasonable level of confidence in the software can be gained by testing it before its release. Software testing is nowadays a widely adopted practice in the software industry. Unit testing in particular is intensively used by developers to exercise the functionality implemented in the unit, independent of other units. In a survey among developers at Microsoft, 79% of the respondents intensively used unit tests in the development process [Venolia et al., 2005]. Unit tests are also intensively used in Agile methodologies [Highsmith and Cockburn, 2001].

Software testing accounts for a significant portion of the overall cost of a piece of software, mainly due to the effort spent by test engineers in generating test data that exercise the System Under Test (SUT) in various ways [Pezze and Young, 2007]. For this reason, automating test case generation has been the subject of a large body of research work, resulting in several techniques and tools. In particular, structural test case generation techniques have received a significant amount of attention from the research community. There are several publications in conferences and journals reporting original research results as well as a number of comprehensive surveys [Zhu et al., 1997, Bertolino, 2007, Ali et al., 2010, Harman et al., 2012, Anand et al., 2013].

Software testing techniques can be categorized as structural testing, model-based testing, combinatorial testing, and (adaptive) random testing, based on the methodology and level of abstraction they follow [Anand et al., 2013]. Testing may be applied at the unit, integration, or system level depending on the objective to be achieved. In each of these categories, there are specific techniques that deal with the various aspects of testing.

To decide when to stop testing, various adequacy criteria are defined. These criteria may be based on some measure of coverage (e.g., statement, branch, path, etc.), on the number of artificially seeded faults (mutants) that are exposed, or on some other metric that captures the desired level of testing [Zhu et al., 1997].

Unit level structural test case generation has been the subject of intensive research and has seen a wide range of techniques and tools being proposed to tackle it. The introduction of unit testing frameworks such as JUnit 1 has further facilitated the automated generation and execution of unit tests. Among the prevalent approaches taken by researchers towards automated unit test generation are Dynamic Symbolic Execution (DSE) [Godefroid et al., 2005, Tillmann and de Halleux, 2008, Cadar et al., 2011, Pasareanu et al., 2011] and Search Based Software Testing (SBST) [McMinn, 2004, Harman and McMinn, 2010, Michael et al., 2001].

1 www.junit.org/

Recently, growing attention is being given to test case generation for the purpose of re-creating field failures in house, so as to facilitate the debugging and fixing of reported bugs. Typically, when end users submit bug reports, they often do not include information (test cases, steps to reproduce, etc.) that helps the developers reproduce the problem [Bettenburg et al., 2008]. This could be due to various reasons: users are not willing to expose sensitive information, are unable to remember or know what caused the failure, or simply do not pay attention. Consequently, the developer who is ultimately responsible for fixing the reported bug faces a difficult time in reproducing the failure, without which it is quite difficult to fix the bug, and eventually test whether the fix works or not. Automated support in this process could save the developer a significant amount of effort. Towards this goal, a number of research works have been reported in the literature which produce test cases that reproduce failures occurring in the field [Jin and Orso, 2012, Chilimbi et al., 2009, Clause and Orso, 2007, Jiang and Su, 2007, Artzi et al., 2008, Rossler et al., 2013].

1.1 The Context

While there are several strategies that could be applied to test generation (e.g., DSE), the focus in this thesis is on SBST, in particular on Evolutionary Algorithms. Search-based approaches are particularly effective for test generation due to their ability to scale up to problems of large size as well as their ability to deal with difficult situations that other techniques (e.g., DSE) are not able to handle.

Similarly, while test case generation could target various levels, in this thesis we consider automated test case generation at the unit and system levels, which are extensively used by developers and researchers. At the unit level, we focus on test generation for Object Oriented programs for the purpose of achieving high levels of structural coverage. Object Oriented programs are increasingly popular, and from the automated test generation perspective, they pose greater challenges beyond those faced when testing procedural programs. In particular, the test generation strategy needs to synthesize method call sequences that prepare the desired object state required for achieving the specific testing objective at hand.

At the system level, we focus on test generation for the purpose of achieving system level structural coverage as well as field failure reproduction for programs with highly structured input. By structured input we mean input that adheres to a formal specification described in a context free grammar (CFG). For such types of programs the challenge faced by automated test generation is twofold: (1) enumerating sentences from non-trivial grammars with deeply nested and recursive rules, and (2) generating sentences geared towards a given test objective (e.g., branch coverage). These challenges are not sufficiently addressed by previous works reported in the literature.

Furthermore, when we investigate system-level test generation (either for coverage or field failure reproduction), we limit the scope of our discussion to programs without an interactive graphical user interface (GUI). While test generation for programs with a GUI is essentially similar to that for programs without a GUI, it poses a different problem when it comes to exercising the GUI and the possible interactions and sequences provided by the GUI. For the purpose of limiting the scope of the work in this thesis, for test generation related to the GUI, existing works could be resorted to [Gross et al., 2012]. For similar reasons, this thesis also does not consider programs with non-deterministic and concurrent features [Weeratunge et al., 2010].

1.2 The Problem

A closer look at the literature in test data generation reveals that while there are several research works that have tried to address automated test generation, there is still room for improvement [Harman et al., 2015]. In particular, for structural unit test generation, one of the most widely researched areas in test case generation, search based test case generation approaches have always considered the problem of branch coverage from a single-objective point of view, and consequently applied various forms of single-objective optimization algorithms for tackling it [McMinn, 2004]. Some efforts are reported in the literature in which test generation is seen from a multi-objective perspective; however, in these cases, coverage is still considered as one objective and the additional objectives optimize other aspects not related to coverage (e.g., memory consumption, execution time, etc.) [Lakhotia et al., 2007, Harman et al., 2015].


Application of optimization algorithms in other disciplines shows the suitability of multi-objective optimization techniques for certain classes of complex, single-objective problems [Knowles et al., 2001, Handl et al., 2008]. In this thesis, we investigate branch adequate test generation from a multi-objective perspective and exploit its inherent properties to formulate an appropriate solution which is able to scale to programs with high levels of complexity (a large number of objectives).

Moving from unit-level to system-level test generation, the issues to be addressed change. At the system level, since the only control the test generation process has is on the input, the challenge then becomes searching within the input space of the program and finding those inputs that exercise as much of the program as possible, being guided by the optimization algorithm. While for programs with “regular” input (e.g., numeric data) there are solutions developed in the literature (e.g., [Gross et al., 2012]), for programs that take structured data as input, there is still a need for an approach that scales to real world programs with complex grammars describing their input.

The problem of test case generation at the system level can be viewed from two perspectives. The first is the testing perspective, in which the generated test cases are to be used for achieving a certain level of coverage according to a given adequacy criterion (branch coverage). The second is the point of view of debugging. A survey of the literature on field failure reproduction reveals that while there are several approaches that apply to programs with regular input (e.g., [Jin and Orso, 2012]), there is no approach that deals with programs with highly structured input. Hence, in this thesis we investigate the applicability of grammar-based system-level test generation for the purpose of reproducing field failures.


1.3 The Solution

At the unit level, a reformulation of branch coverage as a many-objective optimization problem is presented, in contrast to the single-objective formulation widely adopted in the SBST literature [McMinn, 2004]. In this new formulation, each branch to be covered is explicitly considered as an objective to be optimized. However, since a program could have hundreds or even thousands of branches, the many-objective reformulation of the branch coverage problem has to deal with a rapidly increasing number of objectives to be optimized. The typical problem faced by many-objective algorithms in the presence of a large number of objectives is the exponential growth of the number of non-dominated solutions. Since in SBST we are also interested in solutions (test cases) that are close to covering a given branch, we introduce a customized preference criterion among non-dominated solutions in such a way that the algorithm favors those solutions in the Pareto front which are closest to any of the branches. Accordingly, we propose a new many-objective optimization algorithm that takes into consideration the aforementioned preference criterion.

At the system level, the challenge in test generation focuses mainly on the program input, especially when the inputs are complex in structure. In this thesis, we present an SBST approach that combines evolutionary algorithms with stochastic grammars for the generation of system level test input. When dealing with grammar based input generation, problems arise because of the inevitable presence of recursive grammar rules as well as due to hierarchically nested grammar definitions. Such definitions lead to non-terminating derivations and/or very short sentences. To cope with these challenges, we resort to stochastic grammars, obtained either via heuristics or learning, which are effective at controlling highly recursive derivations. Furthermore, a grammar annotation scheme is introduced as an alternative when it is not possible to learn stochastic grammars due to lack of an appropriate corpus of human written sentences.

While stochastic grammars and annotations are effective at controlling recursion, they do not target any specific criterion (e.g., branch coverage). In this thesis, we present a mechanism for generating test inputs geared towards a given coverage criterion by combining the potential of stochastic grammars for generating input from a grammar and the potential of genetic programming for evolving tree structures, which are fit for the test generation goal (e.g., coverage) being guided by a suitable fitness function. Specifically, we present two applications of system level test generation for programs with structured input: (1) branch coverage at the system level, and (2) in-house reproduction of program failures observed in the field.

1.4 Contribution

The work in this thesis advances the state of the art in automated test case generation by introducing novel contributions along two directions: (1) test generation at the unit level, and (2) test generation at the system level for programs with highly structured input. Specifically, the following novel contributions are introduced:

At the unit level: A novel re-formulation of branch coverage as a many-objective optimization problem, in contrast to the single-objective formulation widely adopted in the literature, is presented. Based on the re-formulation, a suitable algorithm (Many-Objective Sorting Algorithm - MOSA) is introduced so as to exploit the features of branch coverage testing which favor the application of many-objective optimization techniques. Unlike other many-objective algorithms in the literature, MOSA is able to scale easily to problems with hundreds of objectives.

At the system level: The challenges in test case generation when moving from unit level test generation to system level test generation, and from programs with “regular” input to programs with highly structured input, are investigated. A novel combination of stochastic grammars and genetic programming is proposed to tackle the problem of system level test case generation for programs with highly structured input.

The proposed strategy for system level test case generation is then applied to two practical problems concerning programs with highly structured input: (1) for achieving high system level branch coverage; and (2) for reproducing field failures that help developers debug and fix reported failures.

Prototype tools: All the techniques proposed in this thesis are implemented in prototype tools which are made publicly available 2. Specifically, the following tools have been developed:

• a unit test generation tool for Java programs that implements the MOSA algorithm by extending the EvoSuite test generation framework.

• a system level test generation tool for Java programs that implements the proposed approach which combines stochastic grammars and genetic programming. This tool is also developed by extending the EvoSuite test generation framework.

• a system level failure reproduction tool which implements the failure reproduction strategy proposed in this thesis. The tool works for programs developed in the C language as well as those developed in the Java language.

2 Tools developed as part of this thesis work and related resources can be found here: http://selab.fbk.eu/kifetew/tools.html

1.5 Organization of the Thesis

Chapter 2 introduces concepts extensively used throughout this thesis. It presents a formal definition of context free grammars and the process of sentence derivation. It also discusses stochastic grammars and a simple heuristic rule for determining the probabilities of stochastic grammars so as to limit the application of recursive grammar rules. The chapter also provides basic background on evolutionary algorithms, in particular genetic algorithms and genetic programming. It concludes by briefly introducing the notion of multi-objective optimization.

Chapter 3 presents an overview of the literature in software testing and failure reproduction. It starts by discussing relevant literature in unit test generation and the popular techniques used by researchers. These include symbolic execution, search based software testing, and a combination of the two. It then moves on to discuss test generation at the system level, focusing on test generation for grammar based programs. Finally, it presents an overview of the literature in field failure reproduction.

Chapter 4 presents the approach proposed in this thesis for unit test generation using many-objective optimization. It presents a formalization of branch coverage as a many-objective problem, an overview of existing many-objective algorithms, and their limitations with respect to test case generation. It then introduces the proposed many-objective algorithm for unit test generation. An empirical evaluation of the proposed algorithm with respect to the state of the art in unit test generation and a discussion of the results are also presented.

Chapter 5 presents the approach proposed in this thesis for system level test generation that combines stochastic grammars and genetic programming for the generation of branch adequate test data for programs with structured input. The techniques proposed in this thesis for controlling recursion during sentence derivation, together with the genetic programming based approach for sentence generation, are described. Results of the empirical evaluation are also presented and discussed.


Chapter 6 presents the framework proposed in this thesis for system level failure reproduction, based on the system level test generation technique introduced in the previous chapter. It then discusses the empirical evaluation and the results obtained.

Chapter 7 concludes the thesis by summarizing its contributions and outlining possible directions for future work.


Chapter 2

Background

This chapter provides basic background on topics that are extensively used in this thesis: (1) Grammars and Derivation, with a focus on stochastic context-free grammars and heuristics for controlling highly recursive derivations, (2) Evolutionary Algorithms, with a focus on genetic algorithms and genetic programming, and (3) Multi-Objective Optimization. The reader who is already familiar with these topics can safely skip this chapter.

2.1 Context-Free Grammars and Derivation

2.1.1 Notation and Definitions

Given a set Σ of symbols or characters, Σ∗ indicates the Kleene closure of Σ, i.e., the set of all finite words obtained by concatenating symbols from Σ, and Σ+ the Kleene closure without the empty string ε, i.e., Σ+ = Σ∗ \ {ε}. A subset L ⊆ Σ∗ is called a language on Σ.

Definition 1 (Grammar) Given a finite set of symbols Σ, a Grammar G is defined by means of a 4-tuple (T, N, P, s) with:

1. T ⊆ Σ, T finite, the set of terminals;

2. N a finite set such that T ∩ N = ∅, the set of non-terminals;

3. P ⊆ (N ∪ T)+ × (N ∪ T)∗, P finite, the set of production rules;

4. an element s ∈ N, called the start symbol.

Rules in P are represented by α → β, where α = u1u2...um and β = v1v2...vn are words on N ∪ T. An element of (N ∪ T)∗ is called a sentential form and an element of T∗ is called a sentence.

Definition 2 (Context-free Grammar) A context-free grammar (CFG) is a grammar G = (T, N, P, s) where rules in P have a left hand side consisting of just a single non-terminal symbol, i.e. they have the form:

u → v1v2...vn

with u ∈ N and vi ∈ N ∪ T.

Given a non-terminal u ∈ N, Pu ⊆ P denotes the set of production rules with u as LHS, namely Pu = {u → β ∈ P}.

From a sentential form α = β1uβ2, a new sentential form β = β1γβ2 can be produced by means of a rule π = u → γ, substituting the left-most occurrence of u in α with γ; in symbols, α ⇒π β. A finite chain of left-most productions s ⇒π1 · · · ⇒πn γ, beginning with the start symbol s, is called a left-most derivation (just derivation in what follows) for γ. α ⇒∗ β denotes the existence of a derivation for β starting from α. Derivations are associated with syntax (or parse) trees.

Figure 2.1 shows a simple context-free grammar G = (T, N, P, s), with four terminal symbols (contained in set T), one non-terminal symbol (set N), three production rules (π1, π2, π3) and start symbol s. L(G) denotes the language generated by the grammar, namely the set of words that can be derived from the start symbol, L(G) = {w | s ⇒∗ w}. A derivation for the sentence “(n)+n” and the associated parse tree are also shown in Figure 2.1.

T = {n, (, ), +}    N = {E}    s = E
π1 : E → E + E
π2 : E → (E)
π3 : E → n

Derivation of “(n)+n”:  E ⇒π1 E + E ⇒π2 (E) + E ⇒π3 (n) + E ⇒π3 (n) + n

[The syntax tree associated with this derivation is not reproducible in plain text.]

Figure 2.1: A simple grammar, a derivation for the string “(n)+n” and its syntax tree

A grammar G is ambiguous if there exists a word w ∈ L(G) that admits more than one derivation. For example, there are two distinct derivations for the sentence “n+n+n” in the grammar shown in Figure 2.1, namely π1π3π1π3π3 and π1π1π3π3π3.

Algorithm 1: Generation of a string using a CFG

S ← s
k ← 1
while k < MAX_ITER and S has the form α · u · β, where α ∈ T∗ and u ∈ N do
    π ← choose(Pu)
    S ← α · π(u) · β
    k ← k + 1
end while
if k < MAX_ITER then
    return S
else
    return TIMEOUT
end if

A CFG can be used as a tool to randomly generate strings that belong to the language L(G), expressed by grammar G, by means of the process described in Algorithm 1. The algorithm begins by setting the start symbol s as the working sentential form S. It then applies a production rule, randomly chosen from the subset of applicable rules Pu (by means of the function choose), to the left-most non-terminal u of the working sentential form S, so obtaining a new sentential form that is assigned to S. The algorithm iterates until there are no more non-terminal symbols to substitute (i.e., S ∈ T∗, since it does not have the form α · u · β with u ∈ N) or a maximum number of iterations is reached. The behavior of Algorithm 1 can be analyzed by resorting to the notion of Stochastic Context-free Grammars [Booth and Thompson, 1973].

Definition 3 (Stochastic Context-free Grammar) A Stochastic Context-free Grammar S is defined by a pair (G, p) where G is a CFG, called the core CFG of S, and p is a function from the set of rules P to the interval [0, 1] ⊆ R, namely p : P → [0, 1], satisfying the following condition:

∑_{u→β ∈ Pu} p(u → β) = 1,  for all u ∈ N        (2.1)

Condition (2.1) ensures that p is a (discrete) probability distribution on each subset Pu ⊆ P of rules that have the same non-terminal u as left hand side.

An invocation of Algorithm 1 can be seen as realizing a derivation in a stochastic grammar based on G where probabilities are defined by the function choose. The number of iterations that Algorithm 1 requires to produce a sentence depends on the structure of the grammar G and on the probabilities assigned to rules. As a matter of fact, interesting grammars contain (mutually) recursive rules. If recursive rules have a high selection probability p, the number of iterations needed to derive a sentence from the grammar using Algorithm 1 can be very large, in some cases even infinite, and quite likely beyond the timeout limit MAX_ITER.

Let us consider the grammar in Figure 2.1, with p(π3) = q, p(π2) = 0 and p(π1) = 1 − q. The probability that the generation algorithm terminates (assuming MAX_ITER = ∞) depends on q. If q < 1/2 the probability that the algorithm terminates is less than 1, and it decreases at lower values of q, reaching 0 when q = 0.

This example shows that when Algorithm 1 is used in practice, with a finite value of MAX_ITER, the timeout could be reached frequently with some choices of probabilities p, resulting in a waste of computational resources and in a small number of sentences being generated. A method to control how often recursive rules are applied is hence needed. We discuss below a practical heuristic for dealing with the problem of recursive rules, which we refer to as the 80/20 rule.

2.1.2 The 80/20 Rule

Given a CFG G = (T, N, P, s), for every non-terminal u ∈ N, Pu is split into two disjoint subsets Pu^r and Pu^n, where Pu^r (respectively Pu^n) is the subset of rules in Pu which are (mutually) recursive (respectively non-recursive). Probabilities of rules are then defined as follows:

p(α → β) = q / |Pu^n|         if α → β ∈ Pu^n
p(α → β) = (1 − q) / |Pu^r|   if α → β ∈ Pu^r

so as to assign a total probability q to the non-recursive rules and 1 − q to the recursive ones. A commonly used rule of thumb consists of assigning 80% probability to the non-recursive rules (q = 0.80) and 20% to the recursive rules. In practice, with these values the sentence derivation process has been shown empirically to generate non-trivial sentences in most cases, while keeping the number of times the timeout limit is reached reasonably low.
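To make the derivation process concrete, the sketch below implements Algorithm 1 with rule probabilities assigned according to the 80/20 heuristic. It is a minimal illustration, not the prototype described later in this thesis: the class and method names, the explicit recursion flag on each rule, and the constants MAX_ITER and Q are assumptions made only for this example.

import java.util.*;

class StochasticDerivation {
    // A production maps a non-terminal to a sequence of symbols (terminals or non-terminals).
    record Rule(String lhs, List<String> rhs, boolean recursive) {}

    static final int MAX_ITER = 1000;
    static final double Q = 0.80;            // total probability mass for non-recursive rules
    static final Random RND = new Random();

    // choose(Pu): pick the non-recursive group with probability Q, then a rule uniformly within it.
    static Rule choose(List<Rule> alternatives) {
        List<Rule> rec = alternatives.stream().filter(Rule::recursive).toList();
        List<Rule> non = alternatives.stream().filter(r -> !r.recursive()).toList();
        if (rec.isEmpty()) return non.get(RND.nextInt(non.size()));   // all mass to the other group
        if (non.isEmpty()) return rec.get(RND.nextInt(rec.size()));
        List<Rule> group = RND.nextDouble() < Q ? non : rec;
        return group.get(RND.nextInt(group.size()));
    }

    // Left-most random derivation (Algorithm 1); returns null on timeout.
    static String derive(Map<String, List<Rule>> grammar, String start) {
        LinkedList<String> sentential = new LinkedList<>(List.of(start));
        for (int k = 1; k < MAX_ITER; k++) {
            int i = indexOfLeftmostNonTerminal(sentential, grammar);
            if (i < 0) return String.join("", sentential);            // only terminals left
            Rule r = choose(grammar.get(sentential.get(i)));
            sentential.remove(i);
            sentential.addAll(i, r.rhs());
        }
        return null;                                                  // TIMEOUT
    }

    static int indexOfLeftmostNonTerminal(List<String> s, Map<String, List<Rule>> g) {
        for (int i = 0; i < s.size(); i++) if (g.containsKey(s.get(i))) return i;
        return -1;
    }

    public static void main(String[] args) {
        // The grammar of Figure 2.1: E -> E + E | ( E ) | n
        Map<String, List<Rule>> g = Map.of("E", List.of(
                new Rule("E", List.of("E", "+", "E"), true),
                new Rule("E", List.of("(", "E", ")"), true),
                new Rule("E", List.of("n"), false)));
        System.out.println(derive(g, "E"));
    }
}

With q = 0.80, repeated runs of main typically print short sentences such as n or (n)+n, whereas giving more probability mass to the recursive rules quickly makes the iteration limit dominate, in line with the analysis above.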

2.2 Evolutionary Algorithms

Evolutionary Algorithms (EAs) are a class of metaheuristic search algorithms inspired by the process of natural evolution, in which a population of individuals interact with each other and evolve through generations following the principle of survival of the fittest [Eiben and Smith, 2003]. EAs are widely used to solve practical optimization problems for which exact solutions could not be found in reasonable time. The core of an EA involves the creation of a population of individuals, suitably encoded according to the problem for which a solution is sought, and evolving them from generation to generation by applying operators that mimic natural evolution (recombination, mutation, and selection). An overview of a typical EA is shown in Figure 2.2.

[Figure 2.2 is a flowchart, from start to stop, comprising the stages Initialize Population, Fitness Evaluation, Stopping Condition, Parent Selection, Recombination, Mutation, and Survivor Selection.]

Figure 2.2: Graphical representation of an EA

As shown in Figure 2.2, a typical EA starts by creating an initial population of individuals. It then evaluates each individual in the population by means of an appropriate fitness function and assigns it a fitness value. Once individuals are assigned fitness values, the EA proceeds by selecting ‘fitter’ individuals (to become parents) from the current population and subjects them to the process of recombination or crossover, resulting in offspring. The selection process could be performed in a variety of ways based on the fitness value. Commonly used selection procedures include fitness proportional selection, tournament selection, and rank based selection [Eiben and Smith, 2003]. Once offspring are produced via recombination, they could further be subjected to a process of mutation with the aim of introducing diversity into the population. The EA then selects, from the combined pool of parents and offspring, the individuals that form the new population in the next generation (survivors). Survivor selection could also be performed in a number of ways (e.g., based on the age of individuals or based on fitness) [Eiben and Smith, 2003]. This process continues to iterate until some stopping condition is reached, in which case the EA terminates.
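The loop just described can be summarized in code. The skeleton below is a bare-bones sketch rather than the implementation of any particular tool; the choice of binary tournament selection, a fixed mutation probability, elitist fitness-based survivor selection, and minimization of the fitness value are illustrative assumptions.

import java.util.*;
import java.util.function.*;

class EvolutionaryLoop<T> {
    final Random rnd = new Random();
    final Supplier<T> randomIndividual;          // used to create the initial population
    final Function<T, Double> fitness;           // lower is better in this sketch
    final BinaryOperator<T> crossover;
    final UnaryOperator<T> mutation;

    EvolutionaryLoop(Supplier<T> init, Function<T, Double> fit,
                     BinaryOperator<T> xo, UnaryOperator<T> mut) {
        randomIndividual = init; fitness = fit; crossover = xo; mutation = mut;
    }

    T run(int populationSize, int generations) {
        List<T> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) population.add(randomIndividual.get());
        for (int g = 0; g < generations; g++) {                           // stopping condition
            List<T> offspring = new ArrayList<>();
            while (offspring.size() < populationSize) {
                T p1 = tournament(population), p2 = tournament(population); // parent selection
                T child = crossover.apply(p1, p2);                          // recombination
                if (rnd.nextDouble() < 0.1) child = mutation.apply(child);  // mutation
                offspring.add(child);
            }
            population.addAll(offspring);                                 // survivor selection:
            population.sort(Comparator.comparing(fitness));               // keep the fittest
            population = new ArrayList<>(population.subList(0, populationSize));
        }
        return population.get(0);                                         // best individual found
    }

    T tournament(List<T> pop) {                  // binary tournament on fitness values
        T a = pop.get(rnd.nextInt(pop.size())), b = pop.get(rnd.nextInt(pop.size()));
        return fitness.apply(a) <= fitness.apply(b) ? a : b;
    }
}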

Genetic Algorithms (GAs) and Genetic Programming (GP) are two variants of EAs commonly applied to solve various problems in Software Engineering. Other variants of EAs, such as Evolutionary Strategies and Evolutionary Programming, also exist. In the remainder of this chapter, we briefly discuss GAs and GP. A comprehensive discussion of EAs in general can be found in the book by Eiben and Smith [Eiben and Smith, 2003].

2.2.1 Genetic Algorithms

Genetic Algorithms (GAs), first introduced by John Holland, essentially model the process of evolution and survival of the fittest in the context of digital optimization problems [Holland, 1975]. In a GA, candidate solutions are encoded into individuals following various encoding schemes depending on the problem being solved. Binary coded individuals (chromosomes) are commonly used in the literature, where each individual is represented as a sequence of binary digits (i.e., 0 and 1). More complicated and high level encodings are also used. For instance, in test case generation for object oriented programs, individuals are typically represented as a sequence of program statements.

Search operators (crossover and mutation) are defined based on the type of encoding used to represent individuals. Crossover is typically performed by exchanging parts of the encodings of the parent individuals. Figure 2.3 depicts a typical crossover operation between two test cases represented as sequences of statements. Similarly, mutation is performed by slightly modifying the encoding of the individual being mutated. A typical mutation operation for binary encoded individuals is bit-flip mutation, in which bits are randomly changed from 0 to 1 or vice versa (i.e., flipped). For individuals encoded as sequences of program statements, mutation operators could be implemented by randomly changing one or more statements in the sequence (e.g., changing the parameter values for a method call statement). Figure 2.4 illustrates such a mutation operator on an individual represented as a sequence of statements.

[Figure 2.3, content recovered from the PDF: the parent test cases

B b0 = new B(); int i0 = 1; A a0 = new A(); a0.m1(b0, i0);
double d0 = 0.5; B b0 = new B(d0); b0.m1(); b0.m2(null);

are recombined into the offspring

B b0 = new B(); b0.m1(); b0.m2(null);
double d0 = 0.5; B b0 = new B(d0); int i0 = 1; A a0 = new A(); a0.m1(b0, i0);]

Figure 2.3: Crossover between two test cases in a GA

[Figure 2.4, content recovered from the PDF: the test case B b0 = new B(); int i0 = 1; A a0 = new A(); a0.m1(b0, i0); is mutated into B b0 = new B(); int i0 = -5; A a0 = new A(); a0.m1(b0, i0); by changing the value assigned to i0.]

Figure 2.4: Mutation of a test case in a GA
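As an illustration of how these operators might look for test cases encoded as statement sequences, consider the sketch below. It is not the EvoSuite implementation: statements are modelled as plain strings, the crossover cut points are chosen independently in the two parents, and the mutation simply rewrites a numeric literal; real tools manipulate typed statement objects and repair the data dependencies that crossover and mutation can break.

import java.util.*;

class SequenceOperators {
    static final Random RND = new Random();

    // Single-point crossover: exchange the tails of the two parent statement sequences.
    static List<List<String>> crossover(List<String> p1, List<String> p2) {
        int cut1 = RND.nextInt(p1.size() + 1);
        int cut2 = RND.nextInt(p2.size() + 1);
        List<String> o1 = new ArrayList<>(p1.subList(0, cut1));
        o1.addAll(p2.subList(cut2, p2.size()));
        List<String> o2 = new ArrayList<>(p2.subList(0, cut2));
        o2.addAll(p1.subList(cut1, p1.size()));
        return List.of(o1, o2);
    }

    // Mutation: pick one statement and replace the numeric literal it assigns, if any.
    static List<String> mutate(List<String> test) {
        List<String> copy = new ArrayList<>(test);
        int i = RND.nextInt(copy.size());
        copy.set(i, copy.get(i).replaceFirst("= -?\\d+", "= " + (RND.nextInt(21) - 10)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> p1 = List.of("B b0 = new B();", "int i0 = 1;", "A a0 = new A();", "a0.m1(b0, i0);");
        List<String> p2 = List.of("double d0 = 0.5;", "B b0 = new B(d0);", "b0.m1();", "b0.m2(null);");
        System.out.println(crossover(p1, p2));   // offspring as in Figure 2.3, for suitable cut points
        System.out.println(mutate(p1));          // e.g. the i0 = 1 statement becomes i0 = -5
    }
}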

Fitness functions are defined in such a way that they assign a numeric value (fitness value) to every individual with respect to the problem at hand. For instance, in test case generation for object oriented programs, an individual (i.e., a test case) is evaluated by executing it against the SUT and assigning it a fitness value based on some form of coverage measure (statement, branch, etc.). The commonly used fitness function for branch coverage measures how close an individual gets to covering a particular branch, designated as a target. A typical measure of such “closeness” of an individual to covering a target is a combination of approach level and branch distance [McMinn, 2004].
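To make these two ingredients concrete, the sketch below follows the distance functions commonly used in the SBST literature; the constant K, the normalization function, and the way the approach level and branch distance are combined are standard textbook choices shown for illustration, not necessarily the exact functions used by the tools developed in this thesis.

class BranchDistance {
    static final double K = 1.0;   // penalty added when the desired branch outcome is not met

    // Distance to making "a == b" true: 0 when already true, |a - b| + K otherwise.
    static double equals(double a, double b)   { return a == b ? 0 : Math.abs(a - b) + K; }

    // Distance to making "a < b" true.
    static double lessThan(double a, double b) { return a < b ? 0 : (a - b) + K; }

    // Distance to making "a <= b" true.
    static double lessEq(double a, double b)   { return a <= b ? 0 : (a - b) + K; }

    // Normalize a raw distance into [0, 1) so it can be combined with the approach level.
    static double normalize(double d)          { return d / (d + 1); }

    // Fitness for one target branch: approach level (number of control-dependent branching
    // nodes the execution missed) plus the normalized distance at the point of divergence.
    static double fitness(int approachLevel, double branchDistance) {
        return approachLevel + normalize(branchDistance);
    }

    public static void main(String[] args) {
        // Example: the target requires x < 10 at a node two control dependencies away.
        System.out.println(fitness(2, lessThan(42, 10)));  // 2 + 33/34, roughly 2.97
    }
}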

2.2.2 Genetic Programming

Genetic programming (GP) [Koza, 1994] is a variation of GAs where the individuals manipulated by the search algorithm are tree-structured data (programs, in the GP terminology). While there are a number of variants of GP in the literature, the work in this thesis focuses on Grammar Guided GP (GGGP) [McKay et al., 2010]. In GGGP, individuals are sentences generated according to the formal rules prescribed by a CFG. Specifically, initial sentences are generated from an SCFG, and new individuals produced by the GP search operators (crossover and mutation) are constrained to be valid under the associated CFG.

An individual (a sentence from the grammar) in the population is represented by its parse tree. Evolutionary operators (crossover and mutation) play a crucial role in the GP search process. Subtree crossover and subtree mutation are commonly used operators in GP [McKay et al., 2010]. Subtree crossover is performed in a similar way as in GAs, by exchanging parts of the encodings of parent individuals. However, in this case the exchanged parts are subtrees of the same type, i.e., rooted at the same non-terminal, from the respective trees representing the parent individuals. Figure 2.5 illustrates an example of subtree crossover based on the grammar in Figure 2.1. In Figure 2.5 crossover is performed between parent individuals, representing the sentences “(n)/n” and “nxn+n”, in which the subtree representing the string “(n)” in the first parent is replaced by the subtree representing the string “n+n” from the second parent. Similarly, the subtree representing “n+n” in the second parent is replaced by the subtree representing “(n)” from the first parent. Ultimately the crossover results in offspring representing the sentences “n+n/n” and “nx(n)”. Notice that the subtrees exchanged are both rooted at the same non-terminal E, hence the resulting trees remain well formed with respect to the grammar. The choice of which subtrees to select for the exchange could be done either randomly or based on other heuristics depending on the specific characteristics of the problem being solved.

[Figure 2.5: tree diagrams not reproducible in plain text. The parse trees of the parents “(n)/n” and “nxn+n” exchange the subtrees deriving “(n)” and “n+n”, yielding the offspring trees for “n+n/n” and “nx(n)”.]

Figure 2.5: Graphical representation of subtree crossover in GGGP, based on the grammar shown in Figure 2.1

On the other hand, subtree mutation is performed by removing a subtree from an individual's tree and replacing it with a newly generated subtree of the same type. Figure 2.6 depicts subtree mutation applied to an individual representing the sentence “(n)/n”, based on the grammar in Figure 2.1. As shown in Figure 2.6, a subtree rooted at the non-terminal E, representing the string “n”, is removed and, in its place, a new subtree is derived from the grammar starting from the non-terminal E and inserted into the individual's tree representation. As a result, the original individual representing the sentence “(n)/n” is mutated to a new individual representing the sentence “(n)/n+n”. The choice of which subtree to remove and which subtree to generate from the grammar could be performed at random or based on some other heuristic. Furthermore, when stochastic grammars are used, the generation of the new subtree would be performed using the probabilities defined by the stochastic grammar.

Both subtree crossover and mutation operators ensure that structurally valid individuals, according to the underlying CFG, result in structurally valid offspring or mutated individuals.


[Figure 2.6: tree diagrams not reproducible in plain text. In the parse tree of “(n)/n”, the subtree deriving the trailing “n” is replaced by a newly derived subtree for “n+n”, yielding the tree for “(n)/n+n”.]

Figure 2.6: Graphical representation of subtree mutation in GGGP, based on the grammar shown in Figure 2.1
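A minimal sketch of grammar-constrained subtree crossover is shown below, assuming parse trees are stored as simple nodes labelled with the grammar symbol they derive; depth limits and the stochastic re-derivation of subtrees used by mutation are omitted, and the class is an illustration of the general idea rather than the representation used in this thesis.

import java.util.*;

class SubtreeCrossover {
    static final Random RND = new Random();

    static class Node {
        final String symbol;                         // non-terminal or terminal label, e.g. "E" or "n"
        final List<Node> children = new ArrayList<>();
        Node(String symbol) { this.symbol = symbol; }
    }

    // Collect all nodes of a tree labelled with the given grammar symbol.
    static void collect(Node t, String symbol, List<Node> out) {
        if (t.symbol.equals(symbol)) out.add(t);
        for (Node c : t.children) collect(c, symbol, out);
    }

    // Swap the derivations below one randomly chosen pair of nodes rooted at the same
    // non-terminal, so that both offspring stay valid with respect to the grammar.
    static void crossover(Node parent1, Node parent2, String nonTerminal) {
        List<Node> c1 = new ArrayList<>(), c2 = new ArrayList<>();
        collect(parent1, nonTerminal, c1);
        collect(parent2, nonTerminal, c2);
        if (c1.isEmpty() || c2.isEmpty()) return;    // no compatible crossover point
        Node a = c1.get(RND.nextInt(c1.size()));
        Node b = c2.get(RND.nextInt(c2.size()));
        List<Node> tmp = new ArrayList<>(a.children);
        a.children.clear(); a.children.addAll(b.children);
        b.children.clear(); b.children.addAll(tmp);
    }
}

Because the exchanged subtrees are rooted at the same non-terminal, both offspring remain derivable from the underlying grammar, which is the property the GGGP operators described above rely on.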

2.3 Multi-Objective Optimization

In optimization problems where there is only one objective to be optimized, there is typically an optimal solution that fulfills the desired objective. However, there are optimization problems for which there is no single best solution. In fact, there could be a wide range of equivalent solutions, each of which is a potential solution. In the latter case, there is more than one objective to be optimized simultaneously and each solution presents a trade-off in the solution space (optimizing one objective at the expense of another). Ultimately, such a trade-off among objectives is represented by a Pareto front of non-dominated solutions, rather than just one optimal solution.

Figure 2.7 shows an example of a Pareto front for a problem with two objectives to be optimized: obj1 and obj2. The points in the figure represent solutions which are equivalent to one another (i.e., are non-dominated). This type of optimization is referred to as Multi-objective Optimization [Deb, 2014].

While multi-objective optimization has been shown to be effective at solving problems with up to four objectives, it runs into difficulty as the number of objectives to be optimized increases [Li et al., 2015]. This is mainly because an increasing number of objectives results in an exponential increase in the number of non-dominated solutions.


[Figure 2.7 here: scatter plot of non-dominated solutions over the two objectives obj1 and obj2.]

Figure 2.7: Solutions in a Pareto front representing trade-offs among contrasting objectives

Hence, (parent) selection pressure will be quite low and insignificant. For problems with more than four objectives, referred to as many-objective optimization problems, there are algorithms designed to scale beyond the number of objectives handled by classic multi-objective optimization algorithms [Li et al., 2015]. Many-objective optimization is revisited with respect to the work in this thesis in Chapter 4.


Chapter 3

State of the Art

This chapter presents an overview of the state of the art in the area of test case generation. The chapter is organized into three sections, each focusing on a different issue in automated test case generation. We first discuss test case generation at the unit level. We then present an overview of the techniques employed for testing programs whose system-level inputs are highly structured. We conclude the chapter by discussing test case generation for the purpose of re-creating post-deployment failures in-house.

3.1 Unit Level Test Generation

Automated test case generation has been the subject of intensive research for several decades now. In particular, unit test generation has received a significant amount of attention from researchers and, as a result, a wide variety of tools and techniques have been proposed [Anand et al., 2013]. In the context of this thesis work, we limit the discussion of the literature to the most widely used techniques that have been devised to tackle the problem of automated unit test generation: search based techniques [McMinn, 2004] and (dynamic) symbolic execution [King, 1976, Godefroid et al., 2005]. In addition, adaptive random search techniques (e.g., [Chen et al., 2005, Ciupa et al., 2008]) have also been proposed by researchers.


As a baseline, advanced search techniques are usually compared with simple random search based techniques for evaluating their performance [Arcuri et al., 2010a].

3.1.1 Search Based Software Testing

Search Based Software Testing (SBST) [McMinn, 2004, Harman and McMinn, 2010, Michael et al., 2001] is an approach to the problem of automated testing that falls under the general umbrella of Search Based Software Engineering (SBSE) [Harman and Jones, 2001, Harman et al., 2012]. While SBST has been applied to a wide range of testing problems, the most widely studied area is test generation for structural coverage [Harman et al., 2015]. SBST formulates test data generation as an optimization problem and employs metaheuristic algorithms for finding optimal solutions. The focus is mostly on white-box testing strategies in which some form of structural metric is used as the objective function that guides the metaheuristic search, with the ultimate goal of generating test suites according to a given adequacy criterion. While in principle any kind of adequacy criterion could be handled by applying SBST, research in the area has mainly focused on branch coverage.

The actual search could be performed using any of several optimization algorithms: local search algorithms (e.g., hill climbing), simulated annealing, evolutionary algorithms (e.g., GAs), etc. [McMinn, 2004]. In this section, we limit the discussion to evolutionary algorithms, in particular to GAs.

As discussed in Chapter 2, when GAs are used for test data generation, individuals are encodings of test cases. In particular, for Object Oriented programs, a test case is encoded as a sequence of statements such as object instantiations and method calls. An ideal test case for a class under test would contain: an instantiation of the class with the appropriate parameter values, a sequence of method calls to bring the newly created object into the desired state, a call to the method under consideration with the appropriate parameter values, and finally an assertion to check return values or object state.


In order for the GA to evolve such an individual (test case), it needs appropriate genetic operators. The crossover operator exchanges statements between two test cases. The mutation operator modifies parts of a test case, i.e., it adds a new statement, removes an existing statement, modifies a statement, etc. Furthermore, the GA also needs a selection operator so that it is able to select individuals, i.e., parents, for crossover and for survival into the next generation.
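As an illustration of this encoding and of the operators, the following sketch represents a test case simply as a list of statement strings; the class name, the sample statements, and the pool of candidate statements are hypothetical, and a real tool would also repair data dependencies broken by crossover and mutation, which is omitted here.

import java.util.*;

// Illustrative encoding of a unit-level test case as a sequence of statements
// (constructor calls, method calls, assertions); not the encoding of any specific tool.
public class TestCaseEncoding {
    static final Random RND = new Random();

    static class TestCase {
        List<String> statements = new ArrayList<>();   // e.g. "Stack s = new Stack();"
    }

    // Single-point crossover: the offspring takes a prefix of one parent and a
    // suffix of the other (repair of broken data dependencies is omitted).
    static TestCase crossover(TestCase a, TestCase b) {
        TestCase child = new TestCase();
        int cutA = RND.nextInt(a.statements.size() + 1);
        int cutB = RND.nextInt(b.statements.size() + 1);
        child.statements.addAll(a.statements.subList(0, cutA));
        child.statements.addAll(b.statements.subList(cutB, b.statements.size()));
        return child;
    }

    // Mutation: insert, delete, or replace one statement.
    static void mutate(TestCase t, List<String> statementPool) {
        int choice = RND.nextInt(3);
        int pos = t.statements.isEmpty() ? 0 : RND.nextInt(t.statements.size());
        switch (choice) {
            case 0:  // insert a new statement
                t.statements.add(pos, statementPool.get(RND.nextInt(statementPool.size())));
                break;
            case 1:  // delete an existing statement
                if (!t.statements.isEmpty()) t.statements.remove(pos);
                break;
            default: // replace an existing statement
                if (!t.statements.isEmpty())
                    t.statements.set(pos, statementPool.get(RND.nextInt(statementPool.size())));
        }
    }

    public static void main(String[] args) {
        TestCase t = new TestCase();
        t.statements.add("Stack s = new Stack();");
        t.statements.add("s.push(42);");
        t.statements.add("assertEquals(42, s.pop());");
        mutate(t, Arrays.asList("s.push(7);", "s.clear();", "assertTrue(s.isEmpty());"));
        System.out.println(t.statements);
    }
}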

The fitness function is based on a distance measure that quantifies the closeness of an individual with respect to the target branch. The state of the art approach for quantifying how close a test case is to covering a particular branch computes a combination of approach level and branch distance [McMinn, 2004]. The approach level measures how many control dependencies need to be traversed before reaching the target branch, while the branch distance quantifies the distance from satisfying the branch condition of the last branch where the execution diverged from the intended path. The overall objective function then becomes a combination of these two values, typically the sum of the two numeric values. Branch distance values are usually normalized to a value between 0 and 1 in order to put them on the same scale and to reduce their importance with respect to approach level values, which range over integers. While there are several normalization functions that could be used, the one proposed by Arcuri [Arcuri, 2010] has been shown to be simple and effective.
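The sketch below shows one way such a fitness could be computed for a single target branch, assuming the raw branch distance has already been measured at the point where execution diverged; the normalization d / (d + 1) used here is one simple instance of the kind of normalization function discussed above, not necessarily the exact function used by any particular tool.

// Branch-level fitness combining approach level and branch distance;
// a minimal sketch of the standard formulation discussed above.
public class BranchFitness {

    // Normalize a raw branch distance into [0, 1); d / (d + 1) is one
    // simple normalization in this spirit.
    static double normalize(double rawDistance) {
        return rawDistance / (rawDistance + 1.0);
    }

    // Fitness of a test execution with respect to one target branch:
    // approach level (number of control dependencies still to traverse)
    // plus the normalized distance at the point of diversion.
    static double fitness(int approachLevel, double rawBranchDistance) {
        return approachLevel + normalize(rawBranchDistance);
    }

    public static void main(String[] args) {
        // Example: target branch is "a == b" with a = 10, b = 3, and the
        // execution already reached the node holding the condition
        // (approach level 0); the raw distance for an equality is |a - b|.
        double f = fitness(0, Math.abs(10 - 3));
        System.out.println(f);   // 0.875: the closer to 0, the closer to coverage
    }
}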

Traditionally, search based test data generation focused on an approach in which test goals (branches) are targeted by the search one at a time, i.e., for every branch in the SUT, a (GA) search is performed to find a test case that covers it. A high level algorithm of this approach is shown in Figure 3.1. In practical implementations of the algorithm in Figure 3.1, there are a couple of issues that need to be addressed: (1) collateral coverage - the set of uncovered test goals should be updated by taking into account test goals covered ‘by accident’, i.e., test goals covered while searching for another test goal; (2) search budget reallocation - if the search for a particular target succeeds without completely consuming the allocated search budget, the remaining search budget could be redistributed so that the search for the remaining targets has a larger search budget.



Figure 3.1: Test case generation targeting one branch at a time.

suite <- {}

For each target t in SUT

Begin

tc <- Generate Test Case for t

suite <- suite U {tc}

End

return suite

At the end of the overall process, all test cases that cover one or more goals are collected into the final test suite for the SUT. In this approach, the fitness of tests is measured with respect to the coverage target under consideration [Tonella, 2004, McMinn, 2004, Michael et al., 2001].

The overall search budget available for generating a test suite for the SUT is partitioned in such a way that each individual search (for a particular target) is allocated a fair share of the total search budget. However, achieving a fair distribution of the search budget is a non-trivial task. Since it is impossible to know a priori whether a given target (branch) is reachable or not, the approach allocates an (equal) amount of budget to each target, including those which are unreachable. As a result, all the search budget allocated to unreachable targets is simply wasted.
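The following sketch expands the high level algorithm of Figure 3.1 with the practical concerns just discussed: collateral coverage and the redistribution of leftover budget. The Search interface, the Result record, and the string representation of targets and tests are illustrative placeholders, not part of any existing tool.

import java.util.*;

// Sketch of the one-target-at-a-time strategy of Figure 3.1, extended with
// collateral-coverage bookkeeping and redistribution of leftover budget.
public class TargetedGeneration {

    interface Search {
        // Runs a GA for the given target within the given budget and returns the
        // best test case found together with everything its execution covered.
        Result searchFor(String target, long budget);
    }

    static class Result {
        String testCase;          // representation of the generated test
        Set<String> covered;      // all branches covered by its execution
        long consumedBudget;
        Result(String t, Set<String> c, long used) { testCase = t; covered = c; consumedBudget = used; }
    }

    static List<String> generateSuite(Set<String> targets, long totalBudget, Search ga) {
        List<String> suite = new ArrayList<>();
        Deque<String> uncovered = new ArrayDeque<>(targets);
        long remainingBudget = totalBudget;
        while (!uncovered.isEmpty() && remainingBudget > 0) {
            // fair share of what is left, recomputed each iteration so that budget
            // saved on easy targets flows to the remaining ones
            long share = remainingBudget / uncovered.size();
            String target = uncovered.removeFirst();       // each target attempted once
            Result r = ga.searchFor(target, share);
            remainingBudget -= r.consumedBudget;
            if (r.covered.contains(target)) suite.add(r.testCase);
            // collateral coverage: drop any other target hit 'by accident'
            uncovered.removeAll(r.covered);
        }
        return suite;
    }

    public static void main(String[] args) {
        // dummy search that always covers its target using half of its share
        Search dummy = (target, budget) ->
                new Result("test_for_" + target, Collections.singleton(target), budget / 2);
        System.out.println(generateSuite(new HashSet<>(Arrays.asList("b1", "b2")), 1000, dummy));
    }
}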

Whole test suite generation [Fraser and Arcuri, 2013] is a recent development in the area of evolutionary testing, where a population of test suites is evolved towards satisfying all coverage targets at once. In this approach, the individuals in the search are test suites (not test cases).


Consequently, the details of the search operators (crossover and mutation) are also changed to reflect the change in the encoding of the individuals. Unlike the traditional approach in which targets are optimized one by one, whole test suite generation is not affected by the presence of infeasible targets. The fitness of each test suite is measured with respect to all coverage targets. That is, when a test suite is executed for fitness evaluation, its performance is measured with respect to all test targets, and the fitness value is based on the sum of all the branch distances. This approach has been implemented in the EvoSuite tool [Fraser and Arcuri, 2011].
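A much simplified sketch of this kind of suite-level fitness is given below: for every branch of the SUT, the best (minimum) branch distance reached by any test in the suite is normalized and summed. The actual whole test suite fitness contains further refinements, so this is only meant to illustrate the aggregation into a single scalar value.

import java.util.*;

// Simplified sketch of a suite-level fitness: the individual is a test suite
// and its fitness aggregates, over all branches of the SUT, the best distance
// reached by any test in the suite (0 means the branch is covered).
public class WholeSuiteFitness {

    static double suiteFitness(Map<String, Double> bestDistancePerBranch) {
        double sum = 0.0;
        for (double d : bestDistancePerBranch.values()) {
            sum += d / (d + 1.0);          // normalized into [0, 1)
        }
        return sum;                        // to be minimized: 0 = full coverage
    }

    public static void main(String[] args) {
        Map<String, Double> distances = new HashMap<>();
        distances.put("b1", 0.0);          // covered by some test in the suite
        distances.put("b2", 4.0);          // still 4 away from flipping the predicate
        distances.put("b3", 1.0);
        System.out.println(suiteFitness(distances));   // 0.0 + 0.8 + 0.5 = 1.3
    }
}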

Harman et al. have carried out theoretical and empirical analyses in which they formalize the theory behind the application of EAs to test data generation [Harman and McMinn, 2010]. The authors also discuss the suitability of particular search techniques for particular problems, in particular when global search (GA) is successful and when local search (hill climbing) is successful. They also present a hybrid approach (memetic algorithm) which combines the strengths of both, accompanied by empirical results.

In all of the approaches discussed thus far, the fitness function is a single objective function, i.e., individuals are evolved towards optimizing (minimizing or maximizing) a single objective.

In the literature, there are research works that apply multi-objective optimization techniques. However, all such works consider branch coverage as a single objective, while additional domain-specific goals the tester would like to achieve, such as memory consumption, execution time, or test suite size, are considered as further objectives [Harman et al., 2015].

For example, Lakhotia et al. [Lakhotia et al., 2007] experimented with multi-objective approaches by considering branch coverage and dynamic memory consumption as objectives, for both real and synthetic programs. The authors compared two variants of GAs, one based on Pareto dominance and another in which the two objectives are combined into a single objective by weighting them.


However, even though they call this formulation multi-objective branch coverage, they still represent branch coverage with a single objective function, by considering one branch at a time.

Harman et al. [Harman et al., 2010] proposed a search-based multi-objective approach in which the first objective is branch coverage (each goal is targeted individually) and the second objective is the number of collateral targets that are accidentally covered.

Ferrer et al. [Ferrer et al., 2012] proposed a multi-objective approach that considers two conflicting objectives: the coverage (to maximize) and the oracle cost (to minimize). They also used the one-branch-at-a-time approach for maximizing the branch coverage criterion, i.e., their approach selects one branch at a time and then runs a GA to find the test case with minimum oracle cost that covers that branch.

Pinto and Vergilio [Pinto and Vergilio, 2010] considered three different objectives when generating test cases: structural coverage criteria (targeting one branch at a time), ability to reveal faults, and execution time. Oster and Saglietti [Oster and Saglietti, 2006] considered two other objectives to optimize: branch coverage (to maximize) and the number of test cases required to reach the maximum coverage (to minimize).

In all the aforementioned research works that apply multi-objective optimization to test case generation, coverage is treated as one objective to be optimized together with other, non-coverage objectives. That is, coverage by itself has not been considered as a multi-objective optimization problem.

3.1.2 Dynamic Symbolic Execution

Another approach widely used for automated test case generation is Symbolic Execution (SE) [King, 1976]. SE is a systematic approach for generating test inputs that traverse as many different control flow paths as possible (all paths, asymptotically).


The dramatic growth in the computational power of today’s computers, together with the availability of increasingly powerful constraint solvers (e.g., Z3 [de Moura and Bjørner, 2008], Yices [Dutertre and de Moura, 2006], STP [Ganesh and Dill, 2007]), has resulted in a renewed interest in using symbolic execution for test input generation [Cadar et al., 2011, Visser et al., 2004, Sen et al., 2005, Godefroid et al., 2005]. Despite these recent advances, however, symbolic execution is still an inherently limited technique, mostly due to the path explosion problem (i.e., the virtually infinite number of paths in the code), the environment problem (i.e., the challenges involved in handling interactions between the code and its environment, such as external libraries), and the limitations of constraint solvers in handling complex constraints and theories [Anand et al., 2013]. Some of these limitations are partially addressed by integrating information obtained via dynamic (concrete) execution of the SUT into the test generation process, giving rise to what is now known as Dynamic Symbolic Execution (DSE) [Godefroid et al., 2005, Tillmann and de Halleux, 2008, Cadar et al., 2011, Pasareanu et al., 2011].

Even with the advancements made by DSE, the technique is still limited in several aspects when dealing with certain classes of programs. For instance, DSE falls short when faced with programs whose inputs have complex structures (e.g., inputs conforming to a formal grammar), programs with dependencies on libraries, and large and complex programs which result in constraints that the employed solver cannot handle [Anand et al., 2013]. On the other hand, SBST is more robust with respect to those issues that pose difficulty for DSE. However, the success of SBST depends heavily on the guidance it gets from the underlying objective function, which may not always be effective. In particular, SBST faces problems in cases where the fitness landscape contains regions which do not offer enough guidance to the search (e.g., flat regions). Consequently, recent research efforts have been directed towards combining the best features of DSE and SBST [Anand et al., 2013].


3.1.3 Hybrid Approaches: SBST + DSE

Hybrid approaches to test case generation that aim to exploit the strengths of both SBST and DSE have also been proposed by the research community. One attempt in this direction introduces DSE as an additional genetic operator which is invoked, based on given criteria, during the progress of the search [Malburg and Fraser, 2011, Galeotti et al., 2013]. Another direction taken by researchers builds an alternation between DSE and SBST, where the result of one is given as input to the other for further improvement [Inkumsah and Xie, 2008]. Guidance from a fitness function has also been proposed to help the algorithm choose which path to explore in DSE [Xie et al., 2009]. Symbolic execution based fitness in SBST has also been proposed [Baars et al., 2011]. Active research in this area is still being carried out to find a suitable integration between the two approaches.

3.2 System Level Test Generation for Programs with Structured Input

The idea of exploiting formal specifications, such as grammars, for test data generation has been the subject of research for several decades now. In the 70s, Purdom proposed an algorithm for the generation of short programs from a CFG, making sure that each grammar rule is used at least once [Purdom, 1972]. The algorithm ensures a high level of coverage of the grammar rules. There are a number of works in the literature that improve, reformulate or extend Purdom’s algorithm (e.g., [Zheng and Wu, 2009]). However, rule coverage does not necessarily imply code coverage nor fault exposure [Hennessy and Power, 2005].

Maurer [Maurer, 1990] reports on using extended CFGs for test case generation.


In his work, he introduces constructs that improve uniform random sentence generation by specifying rule-selection probabilities. No indication is given on how to choose them. In the proposed notation, terminal and non-terminal symbols can be replaced by actions and variables in production rules, to specify dynamic rule-selection strategies: during sentence generation, selection rules can change adaptively. While this is an extremely powerful mechanism, only a few simple usage examples are presented in this work. No attempt is made to foresee how complex it could be to write such actions to constrain, for example, the generation of samples from a statically typed language such as Pascal.

In a recent work by Poulding et al. [Poulding et al., 2013], the authors propose to automatically optimize the distribution of weights for production rules in stochastic CFGs using a metaheuristic technique. Weights and dependencies are optimized by a local search algorithm with the objective of finding a weight distribution that ensures a certain level of branch coverage. In our experience, weight optimization is only part of the problem, which we address by means of learning. To increase coverage, dynamic recombination of sentences, as supported by GP, proves to be another essential ingredient.

Guo and Qiu [Guo and Qiu, 2014] propose an approach to generate sentences from a Stochastic CFG (SCFG), where initially uniform probabilities are updated during the generation process. The approach avoids the possible exponential explosion, and the corresponding non-termination, that can be experienced in uniform random generation algorithms. Their approach generates structurally different test cases whose size increases as the generation algorithm proceeds. The approach produces test suites that contain very different tests. Starting from the simplest sentences, the grammar can generate samples that are more and more complex. While the approach seems very promising from the point of view of termination and avoidance of exponential explosion, coverage of the SUT is not a direct goal of the approach.

Symbolic Execution (SE) has been applied to the generation of grammar-based data by Godefroid et al. [Godefroid et al., 2008] and Majumdar and Xu [Majumdar and Xu, 2007].


Both approaches reason on symbolic tokens and manipulate them via SE. The work of Godefroid et al. focuses on grammar-based fuzzing to find well formed, but erroneous, inputs that exercise the system under test with the intention of exposing security bugs. The work of Majumdar and Xu focuses on string generation via concolic execution with the intention of maximizing path exploration. As both works employ SE, they are affected by its inherent limitations, for instance scalability. Furthermore, the success of these approaches depends on the accuracy of the symbolic tokens that summarize several input sequences into one.

Boltzmann samplers, presented by Duchon et al. [Duchon et al., 2004], are an alternative means for generating instances of structured data, e.g., trees, graphs, and sentences. The work focuses on the problem of the uniform generation of instances of a given size and on the efficiency of the generation algorithm, in terms of time and space complexity, when instances of large size are drawn. While, in principle, Boltzmann generators could be used as an alternative generation technique to the ones used in our work, it is not clear how this technique scales when applied to rather big grammars, such as the one describing JavaScript programs, which is by far bigger, in terms of number of terminals, non-terminals and rules, than the ones used in their work.

Other approaches, such as QuickCheck presented by Claessen and Hughes [Claessen and Hughes, 2011] and GodelTest devised by Feldt and Poulding [Feldt and Poulding, 2013], employ full-fledged programming languages (Haskell in QuickCheck and Ruby in GodelTest) enriched with non-deterministic constructs to develop test data generators. These approaches are able to express powerful constraints on the generation of sentences. However, to follow these approaches, testers have to manually develop generators from scratch, a rather demanding and error-prone activity when huge grammars are involved.


Moreover, neither work discusses the efficacy of the proposed generation methods in terms of code coverage or bug revealing ability.

In the context of generating test data from grammars for code coverage, a recent work closely related to the work in this thesis is that of Beyene and Andrews [Beyene and Andrews, 2012]. Their approach involves generating Java classes from the symbols (terminals and non-terminals) in the grammar. The invocation of a sequence of methods on instances of these classes results in the generation of strings compliant with the grammar. In their work, they apply various strategies for generating method sequences, including metaheuristic algorithms and deterministic approaches, such as depth-first search, with the ultimate objective of finding a test suite that maximizes statement coverage of the system under test.

3.3 Reproducing Field Failures

The primary objective of the research in this area is to find test cases that lead program executions into failures which are similar to those observed while the program was being used in the field.

A number of works reported in the literature apply a record-and-replay strategy. These techniques capture/record program behavior by monitoring or sampling field executions and later (deterministically) replay them for either recreating observed field failures and/or debugging them (e.g., [Chen et al., 2001, King et al., 2005, Narayanasamy et al., 2005, Netzer and Weaver, 1994, Srinivasan et al., 2004, Clause and Orso, 2007, Csallner et al., 2008, Ronsse and De Bosschere, 1999]). These techniques, however, tend to either record too much information to be practical or too little information to be effective. Furthermore, some of the works in this category rely on specialized support from the operating system or hardware (e.g., [Narayanasamy et al., 2005, King et al., 2005, Srinivasan et al., 2004]), which limits their applicability in a wider context.


There are also research works in the record-and-replay category that focus on recording/replaying executions of subsystems relying on specific Java features (e.g., [Elbaum et al., 2006, Orso et al., 2006, Orso and Kennedy, 2005, Saff et al., 2005]).

Researchers have also investigated more sophisticated approaches that reproduce field failures by using limited information, rather than extensively recording executions. Some debugging techniques, for instance, leverage weakest-precondition computation to generate inputs that can trigger certain types of exceptions in Java programs (e.g., [Chandra et al., 2009, Nanda and Sinha, 2009, Flanagan et al., 2002]). Although potentially promising, these approaches tend to handle limited types of exceptions and operate mostly at the module level. SherLog [Yuan et al., 2010] and its follow-up work LogEnhancer [Yuan et al., 2011] use runtime logs to reconstruct and infer paths close to logging statements to help developers identify bugs. These techniques have been shown to be effective, but they aim to highlight potentially faulty code, rather than synthesize failing executions.

ReCrash [Artzi et al., 2008] dynamically records partial object states at the method level to recreate an observed crash. It inspects the call stack (collected upon crash) at different levels of stack depth and tries to call each method in the stack with parameters capable of reproducing the failure. Although this approach can help reproduce a field failure, it either captures large amounts of program state, which makes it impractical, or reproduces the crash in a shallow way, at the module or even method level, which has limited usefulness (e.g., making a method fail by calling it with a null parameter does not provide useful information for the developer, who is rather interested in knowing why a null value reached the method).

Both ESD [Zamfir and Candea, 2010] and CBZ [Crameri et al., 2011] leverage symbolic execution to generate program inputs that reproduce an observed field failure.


Specifically, ESD aims at reaching the point of failure (POF), whereas CBZ improves on ESD by reproducing executions that follow partial branch traces, where the relevant branches are identified by performing different static and dynamic analyses. However, POFs and partial traces may not be enough for successfully reproducing some failures [Jin and Orso, 2012].

Similar to ESD and CBZ, BugRedux [Jin and Orso, 2012] is a general approach for synthesizing, in-house, an execution that mimics an observed field failure. BugRedux implements a guided symbolic execution algorithm that aims at reaching a sequence of intermediate points in the execution. Although the empirical evaluation of BugRedux has shown that it can reproduce real world field failures effectively and efficiently, given a suitable set of field execution data, the approach is based on symbolic execution and suffers from the inherent problems of this kind of technique.

Another approach, RECORE [Rossler et al., 2013], applies genetic algorithms to synthesize executions from crash call stacks. However, the current empirical evaluation of RECORE focuses on unit-level, partial executions (i.e., executions of standalone library classes), so it is unclear whether the approach would be able to reproduce complete, system-level executions. Failures in library classes usually result in shallow crash stacks, and in our experience execution synthesis approaches based on symbolic execution also work quite well in these cases. At the system level, a potential fundamental limitation of RECORE is the limited information available in crash stacks. Jin et al. have observed in their empirical study on BugRedux [Jin and Orso, 2012] that it is difficult to reproduce system-level field failures using only the information provided by crash stacks.

Some techniques specifically focus on reproducing concurrency-related failures. Among those are ESD [Zamfir and Candea, 2010] (discussed above), PRES [Park et al., 2009], and the technique by Weeratunge et al. based on multi-core dumps [Weeratunge et al., 2010]. The approach proposed in this thesis focuses on failures of sequential programs and leaves concurrency-related failures for future work.


In this sense, we may be able to leverage some aspects of ESD and PRES when extending our technique.

In summary, our careful study of state of the art techniques confirmed that test input generators based on symbolic analysis are effective but have a number of practical limitations, and are thus best suited for reproducing failures in programs of limited size and input complexity.

Conversely, approaches based on search based algorithms are often not as powerful as approaches based on symbolic analysis, but they can scale to larger and more complex programs that symbolic analysis cannot handle, and are thus amenable to failure reproduction for these kinds of programs. Moreover, none of these approaches relies on a grammar as input, which makes them ineffective on programs with complex, structured, grammar-based input.

This thesis advances the state of the art in evolutionary test case generation by introducing the following novel contributions:

• a many-objective formulation of branch coverage in which each branch is explicitly considered as an objective to be optimized, along with a custom many-objective algorithm for the generation of branch adequate test cases.

• a system-level test generation scheme for programs with highly structured input that combines stochastic grammars with genetic programming, with the ultimate goal of achieving high system-level branch coverage.

• a system-level test data generation scheme for reproducing field failures of programs with highly structured input.


Chapter 4

Many-Objective Optimization for Unit Test Generation

Search Based Software Testing (SBST) approaches in the literature have employed a single-objective search in which individuals are measured, with respect to coverage, from the point of view of a single objective. Even though the whole suite generation approach proposed by Fraser and Arcuri [Fraser and Arcuri, 2013] uses a fitness function that considers all testing goals simultaneously, from the optimization point of view it applies the sum scalarization approach, which combines multiple target goals (i.e., multiple branch distances) into a single, scalar objective function [Deb, 2014], thus allowing for the application of single-objective metaheuristics such as standard GAs.

Previous works on numerical optimization have shown that the sum scalarization approach to many-objective optimization has a number of drawbacks, the main one being that it is not efficient for some kinds of problems (e.g., problems with non-convex regions in the search space) [Deb, 2014]. Further studies have also demonstrated that many-objective approaches can be more efficient than single-objective approaches when solving the same complex problem [Knowles et al., 2001, Handl et al., 2008], i.e., a many-objective reformulation of complex problems reduces the probability of being trapped in local optima, also leading to a better convergence rate as compared to a single-objective formulation [Knowles et al., 2001].


This remains true even when the multiple objectives are eventually aggregated into a single objective for the purpose of selecting a specific solution at the end of the (many-objective) search [Knowles et al., 2001].

In this thesis work we propose to explicitly reformulate the branch coverage criterion as a many-objective optimization problem, where different branches are considered as different objectives to be optimized. In this new formulation, a candidate solution is a test case, while its fitness is measured according to all (yet uncovered) branches at the same time, adopting the multi-objective notion of optimality. As noted by Arcuri and Fraser [Arcuri and Fraser, 2014], the branch coverage criterion lends itself to a many-objective problem, but it poses scalability problems to traditional many-objective algorithms, since a typical class can have hundreds if not thousands of objectives (e.g., branches). However, we observe that branch coverage presents some peculiarities with respect to traditional (numerical) problems, and we exploit them to overcome the scalability issues associated with traditional algorithms.

In this thesis, we introduce a novel, highly scalable many-objective GA, named MOSA (Many-Objective Sorting Algorithm), that modifies the selection scheme used by existing many-objective GAs.

4.1 Problem Formulation

In this section, we present a formalization of the problem addressed.

Problem 1 Let B = {b1, . . . , bm} be the set of branches of a program. Find a set of non-dominated test cases T = {t1, . . . , tn} that minimize the fitness functions for all branches b1, . . . , bm, i.e., that minimize the following m objectives:

    min f1(t) = al(b1, t) + d(b1, t)
    ...
    min fm(t) = al(bm, t) + d(bm, t)                    (4.1)


where each d(bi, t) denotes the normalized branch distance of test case t for branch bi, while al(bi, t) is the corresponding approach level (i.e., the minimum number of control dependencies between the statements in the test case trace and the branch). The vector 〈f1, . . . , fm〉 is also named the fitness vector.

In this formulation, a candidate solution is a test case, not a test suite, and it is scored by branch distance + approach level, computed for all branches in the program. Hence, its fitness is a vector of m values, instead of a single aggregate score. In many-objective optimization, candidate solutions are evaluated in terms of Pareto dominance and Pareto optimality [Deb, 2014]:

Definition 4 A test case x dominates another test case y (also written x ≺ y) if and only if the values of the objective functions satisfy the following conditions:

    ∀i ∈ {1, . . . ,m}  fi(x) ≤ fi(y)
    and
    ∃j ∈ {1, . . . ,m} such that fj(x) < fj(y)

Conceptually, the definition above indicates that x is preferred to (dominates) y if and only if x is better on one or more objectives (i.e., it has a lower branch distance + approach level for one or more branches) and it is not worse for the remaining objectives. Among all possible test cases, the optimal test cases are those non-dominated by any other possible test case:

Definition 5 A test case x∗ is Pareto optimal if and only if it is not dominated by any other test case in the space of all possible test cases (feasible region).

Single-objective optimization problems typically have one solution (or multiple solutions with the same optimal fitness value). On the other hand, solving a multi-objective problem may lead to a set of Pareto-optimal test cases (with different fitness vectors), which, when evaluated, correspond to trade-offs in the objective space.


While in many-objective optimization it may be useful to consider all the trade-offs in the objective space, especially if the number of objectives is small, in the context of coverage testing we are interested in finding only the test cases that contribute to maximizing the total coverage by covering previously uncovered branches, i.e., test cases having one or more objective scores equal to zero, i.e., fi(t) = 0. These are the test cases that intersect any of the m Cartesian axes of the vector space where fitness vectors are defined. Such test cases are candidates for inclusion in the final test suite and represent a specific subset of the Pareto optimal solutions.
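The two notions used above, dominance between fitness vectors (Definition 4) and a test case touching one of the axes (fi(t) = 0), can be checked directly on the vector of objective scores, as in the following sketch (the array-based representation is only an illustration).

// Pareto dominance over fitness vectors (Definition 4) and the check for
// test cases that cover some branch, i.e., have an objective equal to zero.
public class FitnessVectors {

    // returns true if x dominates y: x is no worse on every objective and
    // strictly better on at least one
    static boolean dominates(double[] x, double[] y) {
        boolean strictlyBetterSomewhere = false;
        for (int i = 0; i < x.length; i++) {
            if (x[i] > y[i]) return false;
            if (x[i] < y[i]) strictlyBetterSomewhere = true;
        }
        return strictlyBetterSomewhere;
    }

    // a test case is directly useful for coverage if it reaches zero on at
    // least one objective, i.e., it intersects one of the Cartesian axes
    static boolean coversSomeBranch(double[] fitness) {
        for (double f : fitness) {
            if (f == 0.0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        double[] a = {0.0, 3.5};
        double[] b = {1.0, 3.5};
        System.out.println(dominates(a, b));      // true
        System.out.println(coversSomeBranch(a));  // true: a covers the first branch
    }
}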

4.2 Existing Many-Objective Algorithms

This section briefly summarizes the main previous works on many-objective optimization and highlights their limitations in the context of many-objective branch coverage. Multi-objective algorithms have been successfully applied within the Software Engineering community to solve problems with two or three objectives, such as software refactoring, test case prioritization, etc. However, it has been demonstrated that classical multi-objective evolutionary algorithms, such as the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [Deb et al., 2000] and the improved Strength Pareto Evolutionary Algorithm (SPEA2) [Zitzler et al., 2001], are not effective in solving optimization problems with more than three objectives [Li et al., 2015].

To overcome their limitations, new algorithms have recently been proposed that modify the Pareto dominance relation to increase the selection pressure. For example, Laumanns et al. [Laumanns et al., 2002] proposed the usage of an ε-dominance relation (ε-MOEA) instead of the classical one. Although this approach has been shown to be helpful in obtaining a good approximation of an exponentially large Pareto front in polynomial time, it presents drawbacks and in some cases it can slow down the optimization process significantly [Horoba and Neumann, 2008].


Zitzler and Kunzli [Zitzler and Kunzli, 2004] proposed the usage of the hypervolume indicator instead of Pareto dominance when selecting the best solutions to form the next generation. Even if the new algorithm, named IBEA (Indicator Based Evolutionary Algorithm), was able to outperform NSGA-II and SPEA2, the computational cost associated with the exact calculation of the hypervolume indicator in a high-dimensional space (i.e., with more than five objectives) is too high, making it infeasible with hundreds of objectives as in the case of branch coverage. Yang et al. [Yang et al., 2013] introduced a Grid-based Evolutionary Algorithm (GrEA) that divides the search space into hyperboxes of a given size and uses the concepts of grid dominance and grid difference to determine the mutual relationship of individuals in a grid environment. Di Pierro et al. [di Pierro et al., 2007] used a preference order approach (POGA) as an optimality criterion in the ranking stage of NSGA-II. This criterion considers the concept of efficiency of order in subsets of objectives and provides a higher selection pressure towards the Pareto front than Pareto dominance-based algorithms. Yuan et al. [Yuan et al., 2014] proposed θ-NSGA-III, an improved version of the classical NSGA-II, where the non-dominated sorting scheme is based on the concept of θ-dominance to rank solutions in the environmental selection phase, which ensures both convergence and diversity.

All the many-objective algorithms mentioned above have been investigated mostly for numerical optimization problems with fewer than 15 objectives. Moreover, they are designed to produce a rich set of optimal trade-offs between different optimization goals, by considering both proximity to the real Pareto optimal set and diversity between the obtained trade-offs [Laumanns et al., 2002]. As explained in Section 4.1, this is not the case for branch coverage, since we are interested in finding only the test cases having one or more objective scores equal to zero (i.e., fi(t) = 0), while the trade-offs are useful just for maintaining diversity during the optimization process.


Hence, there are two main peculiarities that have to be considered in many-objective branch coverage as compared to more traditional many-objective problems: (i) not all Pareto optimal test cases (trade-offs between objectives) have a practical utility, hence the search has to focus on a specific subset of the optimal solutions (those intersecting the m axes); (ii) for a given level of branch coverage, shorter test cases (i.e., test cases with a lower number of statements) are preferred.

4.3 Many-Objective Sorting Algorithm

Algorithm 2: MOSA
Input: B = {b1, . . . , bm}, the set of branches of a program; population size M
Result: A test suite T

 1 begin
 2     t ← 0                                        // current generation
 3     Pt ← RANDOM-POPULATION(M)
 4     archive ← UPDATE-ARCHIVE(Pt)
 5     while not (search budget consumed) do
 6         Qt ← GENERATE-OFFSPRING(Pt)
 7         Rt ← Pt ∪ Qt
 8         F ← PREFERENCE-SORTING(Rt)
 9         Pt+1 ← ∅
10         d ← 0
11         while |Pt+1| + |Fd| ≤ M do
12             CROWDING-DISTANCE-ASSIGNMENT(Fd)
13             Pt+1 ← Pt+1 ∪ Fd
14             d ← d + 1
15         Sort(Fd)                                 // according to the crowding distance
16         Pt+1 ← Pt+1 ∪ Fd[1 : (M − |Pt+1|)]
17         archive ← UPDATE-ARCHIVE(archive ∪ Pt+1)
18         t ← t + 1
19     T ← archive

Previous research in many-objective optimization [von Lucken et al., 2014] has shown that many-objective problems are particularly challenging because the proportion of non-dominated solutions increases exponentially with the number of objectives, i.e., all or most of the individuals are non-dominated. As a consequence, it is not possible to assign a preference among individuals for selection purposes, and the search process becomes equivalent to a random one [von Lucken et al., 2014].


Thus, problem/domain specific knowledge is needed to impose an order of preference over test cases that are non-dominated according to the traditional non-dominance relation. For branch coverage, this means focusing the search effort on the test cases that are closer to one or more uncovered branches of a program.

To this aim, we propose the following preference criterion in order to impose an order of preference among non-dominated test cases:

Definition 6 Given a branch bi, a test case x is preferred over another test case y (also written x ≺bi y) if and only if the values of the objective function for bi satisfy the following condition:

    fi(x) < fi(y)

where fi(x) denotes the objective score of test case x for branch bi (see Section 4.1). The best test case for a given branch bi is the one preferred over all the others for that branch (xbest ≺bi y, ∀y ∈ T). The set of best test cases across all uncovered branches ({x | ∃ i : x ≺bi y, ∀y ∈ T}) defines a subset of the Pareto front that is given priority over the other non-dominated test cases in our algorithm. When there are multiple test cases with the same minimum fitness value for a given branch bi, we use the test case length (number of statements) as a secondary preference criterion.

Our preference criterion provides a way to distinguish between test cases in a set of non-dominated ones, i.e., in a set where test cases are equivalent according to the traditional non-dominance relation, and it increases the selection pressure by giving higher priority to the best test cases across uncovered branches. Since none of the existing many-objective algorithms considers this preference ordering, which is a peculiarity of the branch coverage criterion, in this thesis we present a novel many-objective genetic algorithm, which we name MOSA (Many-Objective Sorting Algorithm), incorporating the proposed preference criterion during the selection process.
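A minimal sketch of this preference criterion is shown below: for every uncovered branch, the test case with the lowest objective score is selected, with the test case length as a tie breaker, and the selected test cases form the first front F0. The TestCase representation is illustrative, and the subsequent non-dominated sorting of the remaining test cases is omitted here.

import java.util.*;

// Sketch of the preference criterion: for every uncovered branch, the test
// case with the lowest objective score goes into the first front F0; ties on
// the score are broken in favour of the shorter test case.
public class PreferenceCriterion {

    static class TestCase {
        double[] fitness;    // objective scores, one per branch
        int length;          // number of statements, secondary criterion
        TestCase(double[] f, int len) { fitness = f; length = len; }
    }

    static Set<TestCase> firstFront(List<TestCase> population, int[] uncoveredBranches) {
        Set<TestCase> f0 = new LinkedHashSet<>();
        for (int b : uncoveredBranches) {
            TestCase best = null;
            for (TestCase t : population) {
                if (best == null
                        || t.fitness[b] < best.fitness[b]
                        || (t.fitness[b] == best.fitness[b] && t.length < best.length)) {
                    best = t;
                }
            }
            if (best != null) f0.add(best);   // the same test may be best for several branches
        }
        return f0;
    }

    public static void main(String[] args) {
        TestCase a = new TestCase(new double[]{0.2, 7.0}, 5);
        TestCase b = new TestCase(new double[]{6.0, 0.4}, 3);
        TestCase c = new TestCase(new double[]{3.0, 3.0}, 4);
        Set<TestCase> f0 = firstFront(Arrays.asList(a, b, c), new int[]{0, 1});
        System.out.println(f0.size());   // 2: a is best for branch 0, b for branch 1
    }
}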


As shown in Algorithm 2, MOSA starts with an initial set of randomly generated test cases that forms the initial population (line 3 of Algorithm 2). The population then evolves toward nearby better test cases through subsequent iterations, called generations. To produce the next generation, MOSA first creates new test cases, called offspring, by combining parts from two selected test cases (parents) in the current generation using the crossover operator and randomly modifying test cases using the mutation operator (function GENERATE-OFFSPRING, at line 6 of Algorithm 2).

Algorithm 3: PREFERENCE-SORTING
Input: A set of candidate test cases T
Result: Non-dominated ranking assignment F

 1 begin
 2     F0 ← ∅                                       // first non-dominated front
 3     for bi ∈ B and bi is uncovered do
 4         // for each uncovered branch, select the best test case according to the preference criterion
 5         tbest ← test case in T with minimum objective score for bi
 6         F0 ← F0 ∪ {tbest}
 7         T ← T − {tbest}
 8     if T is not empty then
 9         G ← FAST-NONDOMINATED-SORT(T, {b ∈ B | b is uncovered})
10         d ← 0                                    // first front in G
11         for all non-dominated fronts in G do
12             Fd+1 ← Gd

A new population is generated using a selection operator, which selects from parents and offspring according to the values of the objective scores. Such a selection is performed by considering both the non-dominance relation and the proposed preference criterion (function PREFERENCE-SORTING, at line 8 of Algorithm 2). In particular, the PREFERENCE-SORTING function, whose pseudo-code is provided in Algorithm 3, determines the test case with the lowest objective score (branch distance + approach level) for each uncovered branch bi, i.e., the test case that is closest to covering bi (lines 2-7 of Algorithm 3). All these test cases are assigned rank 0 (i.e., they are inserted into the first non-dominated front F0), so as to give them a higher chance of surviving into the next generation (elitism).


The remaining test cases (those not assigned to the first rank) are ranked according to the traditional non-dominated sorting algorithm used by NSGA-II [Deb et al., 2000], starting with a rank equal to 1 and so on (lines 8-12 of Algorithm 3). It is important to notice that the routine FAST-NONDOMINATED-SORT assigns ranks to the remaining test cases by considering only the non-dominance relation for the uncovered branches, i.e., by focusing the search toward the interesting sub-region of the search space.

Once a rank is assigned to all candidate test cases, the crowding distance is used in order to make a decision about which test cases to select: test cases having a higher distance from the rest of the population are given a higher probability of being selected. Specifically, the loop at line 11 in Algorithm 2 and the following lines 15 and 16 add as many test cases as possible to the next generation, according to their assigned ranks, until the population size is reached. The algorithm first selects the non-dominated test cases from the first front (F0); if the number of selected test cases is lower than the population size M, the loop selects more test cases from the second front (F1), and so on. The loop stops when adding test cases from the current front Fd would exceed the population size M. At the end of the loop (lines 15-16), when the number of selected test cases is lower than the population size M, the algorithm selects the remaining test cases from the current front Fd according to the descending order of crowding distance.

As a further peculiarity with respect to other many-objective algorithms, MOSA uses a second population, called the archive, to keep track of the best test cases that cover branches of the program under test. Specifically, after each generation MOSA stores every test case that covers previously uncovered branches in the archive, as a candidate test case to form the final test suite (lines 4 and 17 of Algorithm 2). To this aim, at the end of each generation, the function UPDATE-ARCHIVE (reported in Algorithm 4 for completeness) updates the set of test cases stored in the archive with the new test cases forming the last generation.


Algorithm 4: UPDATE-ARCHIVE
Input: A set of candidate test cases T
Result: An archive A

 1 begin
 2     A ← ∅
 3     for bi ∈ B do
 4         best length ← ∞
 5         tbest ← ∅
 6         for tj ∈ T do
 7             score ← objective score of tj for branch bi
 8             length ← number of statements in tj
 9             if score == 0 and length ≤ best length then
10                 tbest ← {tj}
11                 best length ← length
12         if tbest ≠ ∅ then
13             A ← A ∪ tbest

This function considers both the covered branches and the length of the test cases when updating the archive: for each covered branch bi, it stores the shortest test case covering bi in the archive.
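The archive update can be sketched as follows, again with an illustrative TestCase representation: a test case whose objective score for a branch is zero covers that branch, and for every covered branch only the shortest covering test case seen so far is retained.

import java.util.*;

// Sketch of the archive update: for every covered branch, keep the shortest
// test case known to cover it (objective score equal to zero).
public class ArchiveUpdate {

    static class TestCase {
        double[] fitness;   // one objective score per branch
        int length;         // number of statements
        TestCase(double[] f, int len) { fitness = f; length = len; }
    }

    // the archive maps a branch index to the shortest covering test found so far
    static void updateArchive(Map<Integer, TestCase> archive, Collection<TestCase> candidates,
                              int numBranches) {
        for (int b = 0; b < numBranches; b++) {
            for (TestCase t : candidates) {
                if (t.fitness[b] == 0.0) {                  // t covers branch b
                    TestCase current = archive.get(b);
                    if (current == null || t.length < current.length) {
                        archive.put(b, t);                  // keep the shorter one
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, TestCase> archive = new HashMap<>();
        TestCase t1 = new TestCase(new double[]{0.0, 2.0}, 7);
        TestCase t2 = new TestCase(new double[]{0.0, 0.0}, 4);
        updateArchive(archive, Arrays.asList(t1, t2), 2);
        System.out.println(archive.get(0).length);  // 4: the shorter test covering branch 0
    }
}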

In summary, generation by generation MOSA focuses the search towards the uncovered branches of the program under test (both the PREFERENCE-SORTING and FAST-NONDOMINATED-SORT routines analyze the objective scores of the candidate test cases considering the uncovered branches only); it also stores the shortest covering test cases in an external data structure (i.e., the archive) to form the final test suite. Finally, since MOSA uses the crowding distance when selecting test cases, it promotes diversity, which represents a key factor to avoid premature convergence toward suboptimal regions of the search space [Kifetew et al., 2013].

4.3.1 Graphical Interpretation of Preference Criterion

Let us consider the simple program shown in Figure 4.1-a and let us assume that the uncovered goals are the true branches of nodes 1 and 3, whose branch predicates are (a == b) and (b == c), respectively.


    int example(int a, int b, int c)
    {
1       if (a == b)
2           return 1;
3       if (b == c)
4           return -1;
5       return 0;
    }

(a) Example program

[Panel (b): objective space (f1, f2) with fronts F0, F1, F2 under the traditional non-dominance relation; test cases A and B lie in F0 together with other non-dominated points.]

(b) Ranking based on the traditional non-dominance relation

[Panel (c): objective space (f1, f2) with fronts F0, F1, F2, F3 under the proposed preference criterion; only A and B belong to F0.]

(c) Ranking based on the proposed preference criterion

Figure 4.1: Graphical comparison between the non-dominated rank assignment obtained by the traditional non-dominated sorting algorithm and the ranking algorithm based on the preference criterion proposed in this thesis.


According to the proposed many-objective formulation, the corresponding problem has two residual optimization goals, which are f1 = al(b1) + d(b1) = abs(a − b) and f2 = al(b2) + d(b2) = abs(b − c). Hence, any test case produced at a given generation t corresponds to some point in a two-dimensional objective space, as shown in Figures 4.1-b and 4.1-c. Unless a and b are equal, the objective function f1, computed using the combination of approach level and branch distance, is greater than zero. Similarly, the function f2 is greater than zero unless b and c are equal.
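For instance, under the assumption that both conditions are reached (approach level 0), the two objective scores reduce to the raw distances of the corresponding equalities, as in the following small sketch; the concrete input values are only illustrative.

// Objective scores for the example program of Figure 4.1, assuming both
// conditions are reached (approach level 0), so that each score reduces to
// the raw branch distance of the corresponding equality.
public class ExampleObjectives {

    static double f1(int a, int b) { return Math.abs(a - b); }  // distance to (a == b) being true
    static double f2(int b, int c) { return Math.abs(b - c); }  // distance to (b == c) being true

    public static void main(String[] args) {
        // a test case close to covering branch b2 but far from b1
        System.out.println(f1(20, 1) + " " + f2(1, 2));   // 19.0 1.0
        // a test case close to covering branch b1 but far from b2
        System.out.println(f1(5, 4) + " " + f2(4, 30));   // 1.0 26.0
    }
}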

Let us consider the scenario reported in Figure 4.1-b, where no test case is able to cover the two uncovered branches (i.e., in all cases f1 > 0 and f2 > 0). If we use the traditional non-dominance relation between test cases, all test cases corresponding to the black points in Figure 4.1-b are non-dominated and form the first non-dominated front F0. Therefore, all such test cases have the same probability of being selected to form the next generation, even if test case A is the closest to the Cartesian axis f2 (i.e., closest to covering branch b2) and test case B is the closest to the Cartesian axis f1 (branch b1). Since there is no preference among the test cases in F0, it might happen that A and/or B are not kept for the next generation, while other, less useful test cases in F0 are preserved. This scenario is quite common in many-objective optimization, where the number of non-dominated solutions increases exponentially with the number of objectives [von Lucken et al., 2014]. However, from the branch coverage point of view, the two test cases A and B are better (fitter) than all other test cases, because they are the closest to covering each of the two uncovered branches.

Our novel preference criterion gives a higher priority to test cases A and B with respect to all other test cases, guaranteeing their survival in the next generation. In particular, using the new ranking algorithm proposed in this thesis, the first non-dominated front F0 will contain only test cases A and B (see Figure 4.1-c), while all other test cases will be assigned to other, successive fronts.


4.4 Empirical Evaluation

The goal of the empirical evaluation is to assess the effectiveness and efficiency of MOSA, in comparison with state of the art single-objective approaches, and in particular the whole test suite optimization (WS) approach implemented by EvoSuite [Fraser and Arcuri, 2013]. Specifically, we investigated the following research questions:

• RQ1 (effectiveness): What is the coverage achieved by MOSA vs. WS?

• RQ2 (efficiency): What is the rate of convergence of MOSA vs. WS?

RQ1 aims at evaluating the benefits introduced by the many-objective reformulation of branch coverage, and in particular to what extent the proposed MOSA algorithm is able to cover more branches compared to an alternative, state-of-the-art whole suite optimization approach.

With RQ2 we are interested in analyzing to what extent the proposed approach is able to reduce the cost required for reaching the highest coverage.

4.4.1 Prototype Tool

We have implemented MOSA in a prototype tool by extending the EvoSuite test data generation framework [Fraser and Arcuri, 2011]. In particular, we implemented an extended many-objective GA as described in Section 4.3. All other details (e.g., the test case encoding schema, genetic operators, etc.) are those implemented in EvoSuite [Fraser and Arcuri, 2013].

4.4.2 Subjects

In our empirical evaluation we used 64 Java classes from 16 widely used open source projects, many of which were used to evaluate the whole suite approach [Fraser and Arcuri, 2013].


We tried to incorporate a diverse set of classes with varying levels of complexity and functionality. Table 4.2 (columns 2, 3, and 4) summarizes the details of the subjects. As running experiments on all classes from all the projects is computationally expensive, we selected classes randomly from the respective projects, with the only restriction that the total number of branches in the class should be at least 50. As can be seen from Table 4.2, the total number of branches ranges from 50 to 1213 (on average around 215 branches). In our proposed many-objective formulation of the test data generation problem, each branch represents an objective to be optimized. Hence the set of classes used in our experiments presents a significant challenge for our algorithm, in particular with respect to scaling to a large number of objectives.

4.4.3 Metrics

For comparing the two techniques, we use coverage as a measure of effectiveness and consumption of search budget as a measure of efficiency. Coverage (branch coverage) of a technique for a class is computed as the number of branches covered by the technique divided by the total number of branches in the class. Efficiency (search budget) is measured as the number of executed statements. Efficiency is used as a secondary metric; hence we compare efficiency only in cases where there is no statistically significant difference in effectiveness (i.e., coverage).

4.4.4 Experiment Protocol and Settings

For each class, each search strategy (WS or MOSA) is run and the effectiveness (coverage) and efficiency (budget consumption) metrics are collected. On each execution, an overall time limit is imposed so that the run of an algorithm on a class is bounded with respect to time. Hence, the search stops when either full branch coverage is reached, the search budget is exhausted, or the total allocated time has elapsed.


To allow reliable detection of statistical differences between the two strategies, each run is repeated 100 times. Consequently, we performed a total of 2 (search strategies) × 64 (classes) × 100 (repetitions) = 12,800 experiments.

Statistical significance is measured with the non-parametric Wilcoxon Rank Sum test [Conover, 1998] with a p-value threshold of 0.05. Significant p-values indicate that the null hypothesis can be rejected in favor of the alternative hypothesis, i.e., that one of the algorithms reaches a higher branch coverage (RQ1) or a lower number of executed statements (RQ2). In addition to testing the null hypothesis, we used the Vargha-Delaney (A12) statistic [Vargha and Delaney, 2000] to measure the effect size, i.e., the magnitude of the difference between the coverage levels (or the number of executed statements for RQ2) achieved with different algorithms. The Vargha-Delaney (A12) statistic also classifies the obtained effect size values into four levels (negligible, small, medium, and large) that are easier to interpret.
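For reference, the A12 statistic itself is simple to compute: it is the probability that a value drawn from the first sample exceeds a value drawn from the second, with ties counting one half. A minimal sketch (not the analysis scripts actually used for these experiments):

// Vargha-Delaney A12 effect size for two samples of coverage values
// (or numbers of executed statements). A12 > 0.5 favors x, A12 < 0.5 favors y.
class EffectSize {
    static double a12(double[] x, double[] y) {
        double wins = 0.0;
        for (double xi : x)
            for (double yj : y) {
                if (xi > yj) wins += 1.0;
                else if (xi == yj) wins += 0.5;   // ties count one half
            }
        return wins / (x.length * (double) y.length);
    }
}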

There are several parameters that control the performance of the algorithms being evaluated. We adopted the default parameter values used by EvoSuite [Fraser and Arcuri, 2013], as it has been empirically shown [Arcuri and Fraser, 2013] that the default values, which are also commonly used in the literature, give reasonably acceptable results. The values of the important search parameters are listed in Table 4.1.

Table 4.1: Parameter settings

Parameter  Value
Population size  50
Crossover rate  0.75
Mutation rate  1/size
Search budget (statements executed)  1,000,000
Timeout (seconds)  600


4.4.5 Results

Table 4.2 summarizes the results (coverage) of the experiment. The coverage achieved by each strategy, averaged over 100 runs, is shown along with the p-values obtained from the Wilcoxon test. Effect size metrics are also shown, indicating the magnitude of the difference. We can see from the table that, out of the 64 classes, WS was statistically significantly better in 9 cases while MOSA was better in 42 cases. In the remaining 13 cases, no statistically significant difference was observed.

In particular, from the analysis of the results reported in Table 4.2 we can notice that with MOSA coverage increases by between 2% and 53%. Across all the 64 classes, WS achieved an average coverage of 78.86% while MOSA achieved 83.08%. It is worth noticing that such a result indicates a notable improvement if we consider the actual number of branches in our subjects. For example, if we consider class Conversion, extracted from the Apache commons library, we can observe that WS is able to cover 584 branches (averaged over 100 runs) while MOSA covers on average 712 branches using the same search budget. In other cases the improvements are even larger. For example, if we consider class ExpressionParser, extracted from library JSci, we can notice that WS covers on average 53 branches against 237 branches covered on average by MOSA.

This notable improvement is also highlighted by the effect size values obtained when applying the Vargha-Delaney (A12) statistic (see Table 4.2). In the majority of cases where MOSA outperforms WS (i.e., in 34 out of 42 cases), the magnitude of the difference in terms of coverage is large. In the few cases where WS is statistically significantly better than MOSA, we observe that the effect size is small or negligible (in 7 cases out of 9).

Figure 4.2 provides a graphical comparison of the coverage values produced by WS and MOSA over 100 runs for one of the subjects: Conversion.


It can be seen from the boxplot that the distribution of coverage values obtained by MOSA over all independent runs is substantially higher than the distribution achieved by WS. Specifically, in all the runs MOSA reached a coverage of 92.58%, while WS yielded a lower coverage ranging between 74% and 78%.

[Boxplot of % branch coverage for WS and MOSA on org.apache.commons.lang3.Conversion]

Figure 4.2: Comparison of coverage achieved by WS and MOSA over 100 independent runs on Conversion.java.

For the 13 classes on which there was no significant difference in coverage between WS and MOSA, we compared the amount of search budget consumed by each search strategy (RQ2). Since for these 13 classes neither of the two strategies achieved 100% coverage, both consume the entire search budget. For this reason, we recorded the amount of search budget consumed to achieve the best coverage. This serves as an indicator of the convergence rate of the search strategy. Table 4.3 summarizes the result of the comparison of the budget consumed to achieve the final (highest) coverage. Out of 13 cases, WS consumed a significantly lower budget in only one case, while MOSA consumed a significantly lower budget in 8 cases. In the remaining 4 cases, there was no statistically significant difference in budget consumption either.


Table 4.2: Coverage achieved by WS and MOSA along with p-values from the Wilcoxon test. Numeric and verbal effect size (A12) values are also shown. A12 > 0.5 means MOSA is better than WS; A12 < 0.5 means WS is better than MOSA; and A12 = 0.5 means they are equal. Significantly better values are shown in boldface.

No  Subject  Class  Branches  WS  MOSA  p-value  A12  Magnitude
1  Guava  Utf8  63  85.16%  90.24%  0.00000  0.73  Medium
2  Guava  CacheBuilderSpec  139  94.22%  98.95%  0.00000  1.00  Large
3  Guava  BigIntegerMath  133  91.35%  90.97%  0.02562  0.41  Small
4  Guava  Monitor  191  10.47%  10.47%  NaN  0.50  Negligible
5  Tullibee  EReader  306  29.78%  46.16%  0.00000  0.95  Large
6  Tullibee  EWrapperMsgGenerator  67  88.34%  95.90%  0.00000  0.86  Large
7  Trove  TDoubleShortMapDecorator  59  88.31%  87.81%  0.88950  0.51  Negligible
8  Trove  TShortByteMapDecorator  59  87.88%  87.56%  0.56929  0.52  Negligible
9  Trove  TByteIntHash  87  92.91%  94.26%  0.00813  0.61  Small
10  Trove  TCharHash  60  89.52%  90.72%  0.00000  0.68  Medium
11  Trove  TFloatCharHash  87  90.05%  86.69%  0.00000  0.29  Medium
12  Trove  TFloatDoubleHash  87  91.38%  86.92%  0.00000  0.23  Large
13  Trove  TShortHash  60  89.87%  89.37%  0.11116  0.44  Negligible
14  Trove  TDoubleLinkedList  277  87.14%  93.17%  0.00000  0.95  Large
15  Trove  TByteFloatHashMap  293  85.63%  90.98%  0.00000  0.95  Large
16  Trove  TByteObjectHashMap  242  87.80%  93.45%  0.00000  0.88  Large
17  Trove  TFloatObjectHashMap  242  88.55%  93.65%  0.00000  0.91  Large
18  JSci  LinearMath  262  66.26%  68.01%  0.46728  0.53  Negligible
19  JSci  SpecialMath  196  85.94%  87.98%  0.00000  0.94  Large
20  JSci  ExpressionParser  435  12.23%  54.47%  0.00000  0.96  Large
21  JSci  SimpleCharStream  82  33.67%  70.48%  0.00000  0.99  Large
22  NanoXML  XMLElement  304  65.80%  76.36%  0.00000  0.97  Large
23  CommonsCli  HelpFormatter  142  86.86%  85.34%  0.00062  0.36  Small
24  CommonsCli  Option  96  95.64%  95.62%  0.84901  0.50  Negligible
25  CommonsCodec  DoubleMetaphone  498  84.95%  92.32%  0.00000  1.00  Large
26  CommonsPrimitives  RandomAccessByteList  81  95.33%  96.09%  0.00000  0.74  Medium
27  CommonsCollections  TreeList  215  94.00%  94.02%  0.81068  0.49  Negligible
28  CommonsCollections  SequencesComparator  89  96.16%  96.63%  0.01327  0.53  Negligible
29  CommonsLang  ArrayUtils  1119  67.08%  71.64%  0.00000  1.00  Large
30  CommonsLang  BooleanUtils  271  84.89%  93.34%  0.00000  1.00  Large
31  CommonsLang  CompareToBuilder  249  87.01%  89.70%  0.00000  0.96  Large
32  CommonsLang  HashCodeBuilder  116  85.57%  90.27%  0.00000  1.00  Large
33  CommonsLang  Conversion  766  76.24%  92.95%  0.00000  1.00  Large
34  CommonsLang  NumberUtils  383  81.52%  89.59%  0.00000  1.00  Large
35  CommonsLang  StrBuilder  567  95.47%  98.18%  0.00000  1.00  Large
36  CommonsLang  DateUtils  314  91.78%  95.48%  0.00000  1.00  Large
37  CommonsLang  Validate  98  90.67%  96.31%  0.00000  1.00  Large
38  CommonsMath  FunctionUtils  64  65.09%  65.39%  0.05707  0.55  Negligible
39  CommonsMath  TricubicSplineInterpolatingFunction  80  75.05%  79.20%  0.00000  0.80  Large
40  CommonsMath  DfpDec  138  67.78%  65.46%  0.00512  0.61  Small
41  CommonsMath  MultivariateNormalMixtureExpectationMaximization  66  46.97%  46.70%  0.04442  0.48  Negligible
42  CommonsMath  IntervalsSet  50  86.10%  85.64%  0.04371  0.42  Small
43  CommonsMath  MatrixUtils  143  72.94%  78.57%  0.00000  0.87  Large
44  CommonsMath  SchurTransformer  92  91.25%  89.02%  0.62217  0.52  Negligible
45  CommonsMath  AbstractSimplex  59  67.71%  65.39%  0.00276  0.37  Small
46  CommonsMath  BrentOptimizer  65  96.89%  96.91%  0.56565  0.51  Negligible
47  Javex  Expression  173  82.47%  89.64%  0.00000  1.00  Large
48  JDom  AttributeList  133  76.03%  78.32%  0.00000  0.90  Large
49  JDom  SAXOutputter  89  97.09%  97.33%  0.09856  0.56  Negligible
50  JDom  XMLOutputter  62  95.05%  95.16%  0.01327  0.53  Negligible
51  JDom  JDOMResult  50  58.82%  58.00%  0.08275  0.49  Negligible
52  JDom  NamespaceStack  80  76.90%  77.03%  0.30896  0.54  Negligible
53  JDom  Verifier  277  82.81%  89.24%  0.00000  1.00  Large
54  JodaTime  BasePeriod  79  93.41%  95.39%  0.00000  0.82  Large
55  JodaTime  BasicMonthOfYearDateTimeField  63  94.89%  95.51%  0.00090  0.63  Small
56  JodaTime  LimitChronology  112  76.46%  75.04%  0.00187  0.37  Small
57  JodaTime  PeriodFormatterBuilder  579  79.85%  95.68%  0.00000  0.99  Large
58  JodaTime  MutablePeriod  76  86.80%  100.00%  0.00000  1.00  Large
59  JodaTime  Partial  134  84.83%  88.07%  0.00000  0.77  Large
60  Tartarus  englishStemmer  290  81.45%  83.11%  0.00000  0.95  Large
61  Tartarus  italianStemmer  228  67.22%  70.71%  0.00000  0.91  Large
62  Tartarus  turkishStemmer  514  63.67%  69.01%  0.00000  0.98  Large
63  XMLEnc  XMLChecker  1213  35.66%  34.79%  0.01285  0.40  Small
64  XMLEnc  XMLEncoder  138  88.26%  90.70%  0.00000  0.82  Large

Overall Average  78.86%  83.08%
No. cases significantly better  9/64  42/64


Table 4.3: Budget consumed by each approach to achieve the best coverage. P-values and effect size statistics are also shown. A12 < 0.5 means MOSA is better than WS, A12 > 0.5 means WS is better than MOSA, and A12 = 0.5 means they are equal. Statistically significant values are printed in boldface.

No  Class  WS  MOSA  p-value  A12  Magnitude
1  Monitor  294  906  0.000  0.94  Large
2  TDoubleShortMapDecorator  193701  158788  0.003  0.38  Small
3  TShortByteMapDecorator  206375  157995  0.000  0.32  Medium
4  TShortHash  203317  179764  0.078  0.43  Negligible
5  LinearMath  99049  58050  0.000  0.26  Large
6  Option  311052  347356  0.436  0.53  Negligible
7  TreeList  425663  286717  0.000  0.27  Medium
8  FunctionUtils  408573  416136  0.719  0.49  Negligible
9  SchurTransformer  42033  31865  0.029  0.41  Small
10  BrentOptimizer  89938  78094  0.000  0.33  Medium
11  SAXOutputter  353790  195672  0.000  0.07  Large
12  JDOMResult  29544  20135  0.691  0.48  Negligible
13  NamespaceStack  273263  137880  0.000  0.20  Large

Overall Average  202814  159181
No. cases significantly better  1/13  8/13

Furthermore, we can see from Table 4.3 that MOSA reached the best overall mean efficiency across the 13 classes, with an overall reduction in budget consumption of 22%, the minimum reduction of 18% yielded for TDoubleShortMapDecorator and the maximum one of 50% achieved for NamespaceStack. The A12 effect size values shown in Table 4.3 confirm this analysis: in the majority of cases where MOSA outperforms WS, the magnitude of the difference in terms of efficiency is either large (3 out of 8 cases) or medium (3 out of 8 cases).

To provide a graphical comparison between WS and MOSA with respect to budget consumption, Figure 4.3 shows a boxplot of the search budget consumed by each method for one of the subjects: the class SchurTransformer.


The boxplot shows that the distribution of cost values obtained by MOSA over all the independent runs is lower than the distribution achieved by WS.

[Boxplot of consumed budget (statements executed) for WS and MOSA on org.apache.commons.math3.linear.SchurTransformer]

Figure 4.3: Comparison of the budget consumed by WS and MOSA over 100 independent runs on SchurTransformer.java.

According to the results and the analyses reported above, we can answer the research questions considered in this experiment as follows:

In summary, we can conclude that MOSA achieves higher branch coverage than WS (RQ1) and that, when both achieve the same coverage, MOSA converges to that coverage more quickly than WS (RQ2).

4.4.6 Qualitative analysis

Figure 4.4 shows an example of a branch covered by MOSA but not by WS for the class MatrixUtils, extracted from the Apache commons math library,


public static <...> createFieldMatrix(T[] data)
      throws NullArgumentException {
162   if (data == null) {
163     throw new NullArgumentException();
164   }
165   return (data.length * data[0].length <= 4096) ?
166     new Array2DRowFieldMatrix<T>(data) :
167     new BlockFieldMatrix<T>(data);
168 }

public static <T extends FieldElement<T>> FieldMatrix<T>
    createFieldDiagonalMatrix(final T[] diagonal) {
  final FieldMatrix<T> m =
      createFieldMatrix(diagonal[0].getField());
  for (int i = 0; i < diagonal.length; ++i) {
    m.setEntry(i, i, diagonal[i]);
  }
  return m;
}

(a) Target branch

Fraction[] fractionArray0 = null;
FieldMatrix<Fraction> fieldMatrix0 =
    MatrixUtils.createFieldDiagonalMatrix(fractionArray0);

(b) TC1 with branch distance d = 1.0

FractionField fractionField0 = FractionField.getInstance();
Fraction[] fractionArray0 = new Fraction[1];
Fraction fraction0 = fractionField0.getZero();
fractionArray0[0] = fraction0;
FieldMatrix<Fraction> fieldMatrix1 =
    MatrixUtils.createFieldDiagonalMatrix(fractionArray0);

(c) TC2 with branch distance d = 0.9997

Figure 4.4: Example of uncovered branch for MatrixUtils


i.e., the false branch of line 165 of method createFieldMatrix. The related branch condition checks the size of the input matrix data and returns an object of class Array2DRowFieldMatrix or of class BlockFieldMatrix depending on the outcome of the branch condition. At the end of the search process, the final test suite obtained by WS has a whole suite fitness f = 34.17. Within the final test suite, the test case closest to covering the considered goal is shown in Figure 4.4-b, i.e., a test case with maximum branch distance, d = 1.0, for the branch under analysis. This test case executes method createFieldMatrix indirectly, by calling a second method createFieldDiagonalMatrix, as reported in Figure 4.4-a. However, by analyzing all the test cases generated by WS during the search process, we found that TC1 is not the closest test case to the false branch of line 165 across all generations. For example, at some generation WS generated a test case TC2 with a lower branch distance, d = 0.9997, which is reported in Figure 4.4-c. As we can see, TC1 executes line 162 because the input data is null, while test case TC2 executes the branch condition in line 165 and the corresponding true branch in line 166. However, TC2 was generated within a candidate test suite with a poor whole suite fitness, f = 170.0, which was also the worst candidate test suite in its generation. Thus, this test suite (with the promising TC2) is not selected to form the next generation, and the promising test case is lost. By manual investigation we verified that this scenario is quite common, especially for classes with a large number of branches to cover. As we can see from this example, the whole suite fitness is effective at increasing the global number of covered goals, but when the branch distances of the uncovered branches are aggregated, the individual contribution of single promising test cases may remain unexploited.

Unlike WS, MOSA selects the best test case (and not the best test suite) within the current population for each uncovered branch.


Therefore, in a similar scenario it would place TC2 in the first non-dominated front F0 according to the proposed preference criterion. Thus, generation after generation, test case TC2 will be assigned to front F0 until it is replaced by a new test case that is closer to covering the target branch. Eventually, MOSA covers the false branch of line 165, while WS does not.

4.4.7 Threats to Validity

The main threats to validity for our results are construct, internal, conclusion, and external validity threats [Wohlin et al., 2000].

Threats to construct validity regard the relation between theory and experimentation. For measuring the performance of the compared techniques, we used metrics that are widely adopted in the literature: branch coverage and number of statements executed. In the context of test data generation, these metrics give reasonable estimates of the effectiveness and efficiency of the test data generation techniques.

Threats to internal validity regard factors that could influence our results. To deal with the inherent randomness of GAs, we repeated each execution 100 times and reported average performance together with sound statistical evidence. Another potential threat arises from the GA parameters. While different parameter settings could potentially produce different results, determining the best configuration is an extremely difficult and resource-intensive task. Furthermore, such attempts to find the best configuration may not always pay off in practice, compared to using the default configurations widely adopted in the literature [Arcuri and Fraser, 2013]. Hence, we used the default values suggested in the related literature.

Threats to conclusion validity stem from the relationship between the treatment and the outcome. In analyzing the results of our experiments, we have used appropriate statistical tests coupled with enough repetitions of the experiments to enable the statistical tests. In particular, we have used the Wilcoxon test for testing the significance of the differences and the Vargha-Delaney effect size statistic for estimating the magnitude of the observed difference.


We drew conclusions only when results were statistically significant according to these tests.

There is a potential threat to external validity with respect to the generalization of our results. We carried out experiments on 64 Java classes taken from 16 widely used open source projects, with a total number of branches ranging from 50 to 1213. While these classes exhibit a reasonable degree of diversity, further experiments on a larger set of subjects would increase the confidence in the generalization of our results. We also evaluated the performance of the compared techniques on subjects having fewer than 50 branches, in which case MOSA and WS are comparably effective.

4.5 Related Works

The application of search algorithms to test data generation has been the subject of increasing research effort. As a result, several techniques and tools have been proposed. Existing works on search based test data generation rely on the single objective formulation of the problem, as discussed in Section 1.2. In the literature, two variants of the single objective formulation can be found: (i) targeting one branch at a time [McMinn, 2004], and (ii) targeting all branches at once (the whole suite approach [Fraser and Arcuri, 2013]). The first variant (i.e., targeting one branch at a time) has been shown to be inferior to the whole suite approach [Fraser and Arcuri, 2013, Arcuri and Fraser, 2014], mainly because it is significantly affected by the inevitable presence of unreachable or difficult targets. Consequently, we focused on the whole suite formulation as a state-of-the-art representative of the single objective approach.

In the related literature, previous works applying multi-objective approaches to evolutionary test data generation have been reported. However, all such works considered branch coverage as a unique objective.


Additional domain-specific goals have instead been considered as further objectives the tester would like to achieve, such as memory consumption, execution time, test suite size, etc. [Harman et al., 2010, Ferrer et al., 2012, Pinto and Vergilio, 2010, Oster and Saglietti, 2006, Lakhotia et al., 2007] (see Section 3.1.1 for details).

It is important to notice that all previous multi-objective approaches to evolutionary test data generation used the one-branch-at-a-time approach [McMinn, 2004]: the branch distance of the single targeted branch is one objective, considered together with additional non-coverage goals. Across all these studies, there is no evidence that the use of additional (non-coverage) objectives provides benefits in terms of coverage with respect to the traditional single-objective approach based on branch coverage alone [Ferrer et al., 2012]. Moreover, the number of objectives considered in these studies remains limited to a relatively small number.

Unlike previous multi-objective approaches to evolutionary test data generation, in this thesis we propose to consider branch coverage itself as a many-objective problem, where the goal is to minimize simultaneously all the distances between the test cases and the uncovered branches in the class under test.

4.6 Conclusion

We have reformulated branch coverage as a many-objective problem, where different branches are considered as different objectives to be optimized. The novel many-objective genetic algorithm, MOSA, described in this thesis exploits the peculiarities of branch coverage with respect to traditional many-objective problems to overcome scalability issues when dealing with hundreds of objectives (branches).

An empirical study conducted on 64 Java classes extracted from widely used open source libraries demonstrated that the proposed algorithm, MOSA, (i) yielded strong, statistically significant improvements (i.e., either higher coverage or faster convergence) with respect to the whole suite approach, and (ii) is highly scalable to programs with more than one thousand branches.


Specifically, the improvements can be summarized as follows: coverage was significantly higher in 66% of the subjects, and the search budget consumed was significantly lower in 62% of the subjects for which coverage was the same.


Chapter 5

System Level Test Generation for Coverage of Programs with Structured Input

Programs whose inputs exhibit complex structures, which are often governed by a specification such as a grammar, pose special challenges to automated test data generation. An example of such systems, which we refer to as grammar based systems, is Rhino, a compiler/interpreter for the JavaScript language. Test cases for this system are JavaScript programs that must respect the rules of the underlying JavaScript grammar specification. The challenge in generating test cases for grammar based systems lies in choosing a set of input sentences, out of those that can potentially be derived from the given grammar, in such a way that the desired test adequacy criterion is met. In practice, the grammars that govern the structure of the input are far from trivial. For instance, the JavaScript grammar, which defines the structure of the input for Rhino, contains 331 rules, and many of these rules are deeply nested and recursive. Hence, an appropriate mechanism for generating (deriving) sentences from such grammars is needed.

Despite some efforts made in recent years in this direction [Beyene and Andrews, 2012, Poulding et al., 2013, Majumdar and Xu, 2007], there is still a need for a solution that is effective at achieving the desired level of coverage and is able to scale up to reasonably large/complex grammars.



Towards achieving this goal, we introduce a testing approach that combines GP with SCFGs for generating system level test suites for programs with highly structured (grammar based) input. To account for difficulties during sentence generation (derivation) due to recursive grammar rules, we propose two alternative strategies: learning probabilities and grammar annotations. Each strategy is empirically evaluated on 6 grammar based systems, using random search as a baseline.

We first introduce our proposed approach for learning production probabilities from a corpus of manually written sentences (Section 5.1). We then present our proposed language for grammar annotation (Section 5.2), followed by a description of our evolutionary algorithm for sentence derivation from (possibly annotated) grammars (Section 5.3).

5.1 Learning Probabilities from Samples

As discussed in Chapter 2, generating sentences from a CFG can run into problems in the presence of recursive grammar rules. The 80/20 rule (also discussed in Chapter 2) tries to address this problem by limiting the application of (mutually) recursive rules with respect to the non-recursive ones, assigning them an overall aggregate probability (20%) which is lower than that assigned to the non-recursive rules (80%). While this strategy may work relatively well for grammars with limited structural complexity, it faces difficulties when presented with grammars that define deeply nested structures (e.g., the grammar productions that define the syntax of expressions in the JavaScript grammar). For such situations, we propose to learn the production probabilities of the associated CFG from an available corpus of human written sentences (e.g., JavaScript programs).


If the grammar is not ambiguous, every sentence has exactly one parse tree, and probabilities can be easily assigned to rules by observing how many times a rule is used in the parse tree of each sentence in the corpus. In the presence of ambiguity, learning can take advantage of the inside-outside algorithm [Lari and Young, 1990]. The inside-outside algorithm is an iterative algorithm based on expectation-maximization: starting from randomly chosen probability values, it repeatedly refines the rule probabilities so as to maximize the corpus likelihood. A detailed description of the algorithm is given in Appendix A.
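In the unambiguous case, the estimated probability of a production A ::= α is simply the number of times that production is applied in the corpus parse trees, divided by the number of times A is expanded. The following sketch (with a hypothetical parse-tree representation, not the thesis prototype) illustrates this counting:

// Maximum-likelihood estimation of production probabilities from parse trees.
import java.util.*;

class RuleProbabilityLearner {
    static class Node {                       // one node of a parse tree
        String lhs;                           // the expanded non-terminal
        String production;                    // identity of the applied production
        List<Node> children = new ArrayList<>();
    }

    static Map<String, Double> learn(List<Node> corpusTrees) {
        Map<String, Integer> ruleCount = new HashMap<>();
        Map<String, Integer> lhsCount = new HashMap<>();
        Map<String, String> lhsOf = new HashMap<>();
        Deque<Node> work = new ArrayDeque<>(corpusTrees);
        while (!work.isEmpty()) {
            Node n = work.pop();
            ruleCount.merge(n.production, 1, Integer::sum);
            lhsCount.merge(n.lhs, 1, Integer::sum);
            lhsOf.put(n.production, n.lhs);
            work.addAll(n.children);
        }
        Map<String, Double> probability = new HashMap<>();
        for (Map.Entry<String, Integer> e : ruleCount.entrySet())
            probability.put(e.getKey(),
                e.getValue() / (double) lhsCount.get(lhsOf.get(e.getKey())));
        return probability;                   // P(A ::= α) = count(A ::= α) / count(A)
    }
}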

It is worth noticing that, even for the formal languages used in programs, non-ambiguity cannot be taken for granted. Although the parsers used with such formal languages are typically deterministic, the CFG extracted from the parser can still be ambiguous. In practice, parsers typically resolve the ambiguous cases in the CFG by means of disambiguation predicates [Parr and Quong, 1994], type/context information, or ad-hoc heuristics.

5.2 Annotated Grammars

Annotations allow the developer to specify constraints and relations among grammar elements which are very difficult, if not impossible, to express with CFGs. Such annotations ensure that semantically invalid sentences are never or seldom generated from the annotated grammar. We introduce an annotation scheme that allows the developer to annotate grammar rules with semantic rules in such a way that they can be processed automatically during sentence generation. Consequently, annotations ensure that the generated sentences, besides being structurally well-formed, also respect the semantic constraints of the language defined by the CFG.


5.2.1 Types

Semantic rules are defined based on types. Types are tuples of the form (t1, t2, ...; t′1, t′2, ...; t′′1, t′′2, ...), where the ti are either base types or * (which indicates the absence of any type restriction, i.e., any type is allowed).

Types are grouped, and type groups are separated by semicolons. Different type groups can be introduced for different aspects of the program semantics. For instance, the first group may specify data types (types such as number, string, etc.). Such types will be applied only to constructs for which a data type restriction makes sense (e.g., expressions). The following groups in the type tuple might be used for structural type constraints (e.g., break constructs can only appear nested inside loop or switch statements), or for different kinds of data type constraints (e.g., applicable to specific constructs).

The types used to annotate a given grammar are determined by the developer, depending on the nature of the language represented by the grammar. The annotation scheme does not prescribe any predefined types (except for *, which stands for “any type”).

The base types determined by the developer are specified at the beginning of the grammar file, so that the sentence generator can process and use them when enforcing the annotation rules. The specification may include initial default values, as well as whether a matching declaration for a construct having such a type is mandatory in the language or not (e.g., a variable declaration for languages where declaration of variables is mandatory). When declarations are mandatory for some types, a ‘template’ for the syntax of declarations is to be provided. Base types consisting of simple labels, without initial default values and matching declarations, are specified in a simplified syntax.


Example:

#begintypes

#int,0,false

#bool,false,false

#date,2015.01.01,false

#qstring,’’,false

#fun,nil,true

#endtypes

#declarator:def @id = @init ;

#[type;inloop;infunc;inswitch]

The first type group in this example is a data type group, consisting of 5 possible instances (int, bool, etc.), for which initial values are provided. Except for the fun type, none of the other data types requires a matching declaration. The template for the declarator (required by the fun type) consists of the def keyword, followed by the identifier to be matched (specified by means of the special meta-variable @id). The equal sign and the semicolon are part of the syntax of the declarator, while the meta-variable @init is compatible with any expansion of a non-terminal that can appear in the meta-variable position according to the given grammar. A simplified type specification terminates this example. It introduces three type labels (inloop, infunc, inswitch) for a new structural type group. These type instances have no default initializers and no matching declarators.

5.2.2 Annotation Syntax

The proposed annotation scheme includes two types of annotations, for: (1) selecting productions based on types; and (2) propagating type information. The syntax of each is described below.


Production Selection

Annotations for the type-based selection of a production (among the list of productions in a grammar rule) are intended to match the type information propagated along the derivation process up to the current point. These annotations are indicated by preceding each production with selector types enclosed in square brackets ([]).

Given a grammar rule of the form LHS ::= A1 A2 (LHS stands for Left Hand Side), selection annotations indicate which productions can be selected during derivation without invalidating the type correctness of the sentence being derived. The syntax of such an annotation is as follows:

LHS ::= [t1,t2,...;t′1,t′2,...;t′′1,t′′2,...] A1 A2

this production is selected if the tuple (ta; tb; tc) propagated for the type of LHS matches the given annotation. That is, ta is any of {t1, t2, ...} or *; tb is any of {t′1, t′2, ...} or *; and tc is any of {t′′1, t′′2, ...} or *.

LHS ::= [$t=t1,t2,...;t′1,t′2,...;t′′1,t′′2,...] A1 A2

this production is also selected if the tuple propagated for the type of LHS matches the given annotation. In this case, the actually propagated LHS type ta (taken from the first type group) is assigned to the variable $t for later use in the rest of the rule.

LHS ::= [(t1,t2,...);t′1,t′2,...;t′′1,t′′2,...] A1 A2

this production is selected if the tuple propagated for the type of the LHS matches the given annotation. The parts of the annotation within round brackets require strict type matching, i.e., matching with the implicit any-type * selector is not allowed, and the match occurs only and exclusively if the propagated LHS type is any of t1, t2, etc.

To simplify the process of annotation, if only the first type group is provided in the tuple, the others are assumed to be *.


Hence, the annotation [int,float] is equivalent to [int,float;*;*]. Let us consider the following annotated excerpt, taken from a grammar for expressions:

<expr> ::= [int] <expr> % <expr>

| [float] <expr> * <expr>

| ( <expr> )

| [int, float] <const>

According to these annotations, the production <expr> ::= <expr> % <expr> is eligible for selection during sentence derivation if the propagated type for the LHS non-terminal <expr> is either int or *. Similarly, the production <expr> ::= <const> is eligible for selection if the propagated LHS type of <expr> is either int, float, or *, while the production <expr> ::= ( <expr> ) is always eligible for selection.
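The eligibility check for a single type group can be sketched as follows (hypothetical representation of the annotations; the prototype's internals may differ), including the parenthesized strict-match variant described earlier:

// Does the type propagated for one group satisfy the selection annotation of
// that group? selector is empty when the production carries no annotation.
import java.util.Set;

class SelectionMatch {
    static boolean groupMatches(Set<String> selector, boolean strict, String propagated) {
        if (selector.isEmpty()) return true;          // unannotated: always eligible
        if ("*".equals(propagated)) return !strict;   // (...) forbids the any-type match
        return selector.contains(propagated);
    }
}

A production is eligible when groupMatches holds for every group of its selection annotation; for the excerpt above, for instance, a propagated type float makes <expr> % <expr> ineligible but leaves <expr> * <expr> and ( <expr> ) eligible.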

Type Propagation

The default, implicit rule is that types propagate from the LHS non-terminal to the right hand side (RHS) non-terminals during sentence derivation. However, type propagation annotations can be used to direct and control the types being propagated from the LHS to the RHS non-terminals. In fact, the type to be propagated to a grammar symbol can be specified explicitly by putting it into curly braces {}. If such an annotation is not present, the default rule applies (i.e., the LHS type propagates to the RHS).

Given a grammar rule of the form LHS ::= A1 A2, the syntax of a type propagation annotation is as follows:

LHS ::= [t1,t2,...;t′1,t′2,...;t′′1,t′′2,...] A1 A2

whichever type is selected from the list is propagated, by default, to both A1 and A2.


LHS ::= [t1,t2,...;t′1,t′2,...;t′′1,t′′2,...] A1 {t3;t5;*} A2

whichever type is selected from the list is propagated to A1, whereas A2 is assigned the type tuple (t3;t5;*).

LHS ::= [$t=t1,t2,...;t′1,t′2,...;t′′1,t′′2,...] {t3} A1 {$t} A2

A1 is assigned the type (t3;*;*), whereas A2 is assigned ($t;*;*), where $t holds whichever type is instantiated from the selection annotation.

Example:

<expr> ::= [int,float] {int} <expr> % {int} <expr>

| [(float)] <expr> * <expr>

In this example, if the production <expr> ::= <expr> % <expr> is selected, type int is propagated to the two <expr>s on the right hand side. On the other hand, if the production <expr> ::= <expr> * <expr> is selected, the type of the left hand side <expr> (in this case, necessarily float) is propagated to the two <expr>s on the right hand side. Figure 5.1 and Figure 5.2 show, respectively, an example grammar and a possible annotation for it.

The example in Figure 5.2 shows a useful feature provided by our annotation language: suffix string concatenation, expressed through the syntax ##. Sometimes it is useful to distinguish different instances of the same type, for example the type of a newly declared variable as opposed to the type of a previously declared variable. By adding a suffix to the variable type (e.g., N in $t##N), we can easily separate productions to be selected only for newly declared variables (e.g., the <new> production for the LHS <id>) from those to be selected when previously declared variables are involved (e.g., the <pool> production). These two cases respectively correspond to the declaration and the use of a variable.


Figure 5.1: Example grammar

<prog> ::= <stats>

<stats> ::= <stat> <stats> | <stat>

<stat> ::= <def> ; | <expr> ;

<def> ::= def <id> = <const>

| def <id> [ <const> ] = { <expr> }

<const> ::= <int>

| <float>

<int> ::= 1 | 2 | 3

<float> ::= 0.5 | 1.5

<expr> ::= <expr> % <expr>

| <expr> * <expr>

| (<expr>)

| <id>

| <const>

The last two productions shown in Figure 5.2 are special productions that are introduced to facilitate the automated processing of the annotations. The production <new> ::= <{new()}> @id triggers the generation of a new identifier of a given type, while the production <pool> ::= <{pool()}> @id triggers the use of an existing identifier from the pool of already declared identifiers.

5.2.3 Supporting Data Structures

The implementation of the type propagation mechanism described above requires the following data structures:

Identifier Pool: We maintain a pool of identifiers to be used for naming variables and functions.


Figure 5.2: A possible annotation for the grammar in Figure 5.1

#begintypes

#int,0,false

#bool,false,false

#date,2015.01.01,false

#qstring,’’,false

#fun,nil,true

#int[],[],false

#endtypes

#declarator:def @id = @init ;

#[type;inloop;infunc;inswitch]

<prog> ::= <stats>

<stats> ::= <stat> <stats>

| <stat>

<stat> ::= <def>;

| <expr>;

<def> ::= [$t=int,float,int[]] def {$t##N} <id> = <const>

| [$t=int[],float[]] def {$t##N} <id> [{basetype($t)} <const>] = {<expr>}

<const> ::= [int] <int>

| [float] <float>

| [date] <date>

| [bool] <bool>

| [qstring] <string>

<int> ::= 1 | 2 | 3 ...

<float> ::= 0.5 | ...

<date> ::= 2015.01.01 | ...

<bool> ::= true | false

<string> ::= ’’

<expr> ::= [int] <expr> % <expr>

| [float] <expr> * <expr>

| ( <expr> )

| [int,float,int[],float[],bool,date,qstring] <id>

| [int,float,bool,date,qstring] <const>

| [int,float] {int[]} <expr> [ {int} <expr> ]

<id> ::= [intN,floatN,int[]N,float[]N,boolN,dateN,qstringN] <new>

| [int,float,int[],float[],bool,date,qstring] <pool>

<new> ::= <{new()}> @id // generate a new identifier name

<pool> ::= <{pool()}> @id // choose an identifier name

// already declared so far


At the beginning of each sentence derivation, the pool is initialized with a predefined set of identifiers of predefined types. Whenever a new identifier is created in the derivation process, the pool is updated with the new identifier. Whenever a name is needed for either a variable or a function (of a certain type), the pool is consulted and an appropriate identifier is chosen.

Symbol Table: We maintain a symbol table in which we keep track of the identifiers used so far in the derivation, along with their types. At the end of the derivation, some of these identifiers may need to be declared, if their declaration is indicated as mandatory in the type specification preamble and no declaration was already derived for them. Hence, after the derivation of a sentence is completed successfully, a repair phase follows, in which we refer back to the symbol table and insert declarations for those identifiers which are used in the sentence without an accompanying declaration, in cases where such a declaration has been specified as mandatory.
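The repair phase can be pictured as follows (hypothetical symbol-table structures and a hard-coded declarator template; the actual prototype instantiates the template given in the grammar preamble):

// Prepend declarations for identifiers whose type requires a mandatory
// declaration and for which none was derived during sentence generation.
import java.util.*;

class DeclarationRepair {
    static String repair(String sentence,
                         Map<String, String> symbolTable,    // identifier -> type
                         Set<String> alreadyDeclared,        // identifiers declared in the sentence
                         Set<String> mandatoryTypes,         // types whose declaration is mandatory
                         Map<String, String> defaultValue) { // type -> initial default value
        StringBuilder declarations = new StringBuilder();
        for (Map.Entry<String, String> entry : symbolTable.entrySet()) {
            String id = entry.getKey(), type = entry.getValue();
            if (mandatoryTypes.contains(type) && !alreadyDeclared.contains(id)) {
                // instantiate the declarator template "def @id = @init ;"
                declarations.append("def ").append(id).append(" = ")
                            .append(defaultValue.get(type)).append(" ;\n");
            }
        }
        return declarations.toString() + sentence;
    }
}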

5.2.4 Annotation Example

To better illustrate the annotation scheme, we apply it to an excerpt from the JavaScript grammar. The annotated grammar is shown in Figure 5.3. Missing portions of the grammar are represented by ellipses (...). The excerpt represents the part of the grammar that defines the structure of statements in JavaScript. With the annotation scheme, we would like to ensure that the sentences generated from the grammar respect certain (structural and semantic) rules of the language. For instance, expressions used as conditions of while and if statements are required to be of boolean type. Hence, in the annotated grammar we specify this constraint by annotating the <Expression> non-terminals with {bool}, so that during sentence derivation this type is propagated in the syntax tree, ultimately constraining the derivation to the generation of boolean expressions.



Figure 5.3: Annotated grammar extracted from the JavaScript grammar

#begintypes

#int,0,false

#bool,false,false

...

#endtypes

#declarator:var @id = @init ;

#[type;inloop;infunc;inswitch]

<Program> ::= <Sourceelement_star>

...

<Statementtail> ::= <Withstatement>

| <Iterationstatement>

| <Ifstatement>

| [*;(inloop)] <Continuestatement>

| [*;*;*;(inswitch)] <Breakstatement>

| ...

...

<Whilestatement> ::= while ( {bool} <Expression> )

{*;inloop} <Statement>

<Forstatement> ::= for ( <Forcontrol> ) {*;inloop} <Statement>

<Dostatement> ::= do {*;inloop} <Statement>

while ( {bool} <Expression> ) ;

<Ifstatement> ::= if ( {bool} <Expression> )

<Statement> <Alt_248_opt>

| if ( {bool} <Expression> ) <Statement>

...


Similarly, there are structural rules that constrain the placement of certain types of language constructs in the program. For example, statements such as break and continue are allowed only in certain contexts (e.g., loops). Consequently, in the annotated grammar we guard the usage of such constructs by introducing type tuples that limit the generation of these types of statements to the appropriate contexts. As shown in Figure 5.3, <Continuestatement> is annotated with the type tuple [*;(inloop)], which ensures that the production is eligible for selection during derivation only if the type tuple propagated down to it from the containing statement matches the specified type tuple. In order for this production to be selected during sentence derivation, the type tuple associated with <Statementtail> must have the type inloop (brackets enforce a strict match) in the second position of the type tuple, while it could have any type in the first position of the tuple. Correspondingly, the productions for the looping constructs (<Whilestatement>, <Forstatement>, and <Dostatement>) are annotated in such a way that they propagate down a type tuple permissive of the statements allowed within loops. This is specified by annotating the body of the loops (represented by the non-terminal <Statement> in the grammar) with the type tuple {*;inloop}, which is propagated down the syntax tree during sentence derivation.

5.3 Evolutionary Sentence Generation

Our evolutionary approach to sentence generation combines stochastic grammars with genetic programming, using a suitable fitness function, so as to evolve test suites for system-level branch coverage of the SUT. Furthermore, we take advantage of grammar annotations (when available) in order to promote the generation of structurally and semantically well-formed sentences from a given grammar.


5.3.1 Representation of Individuals

A test case is a single input to the SUT. In other words, a test case is a well-formed sentence derived from the grammar of the SUT. Hence, a test suite is a set of sentences, represented by their parse trees. Furthermore, when grammar annotation is activated, each individual additionally stores the annotation types associated with the grammar symbols in the derivation tree. Such information is later used by the genetic operators when evolving the individual.

The initial population of test suites is obtained by generating input sentences according to the stochastic process described in Chapter 2 (Algorithm 1) and by grouping them randomly into test suites. Stochastic sentence generation uses either heuristically fixed probabilities (the 80/20 rule discussed in Chapter 2) or learned probabilities (discussed in Section 5.1 above).

When grammar annotation is activated, the initialization of the population follows a process similar to the one described in Algorithm 1. However, with grammar annotation, the function choose in Algorithm 1 is applied considering not only the probabilities of the productions for the current non-terminal, but also the types specified in the grammar annotations. Algorithm 5 shows the modified version of Algorithm 1 which is applied when grammar annotation is activated.

Hence, when choosing a production to expand the current non-terminal, we first collect all the eligible productions for the non-terminal according to the matching between the propagated type and the selection types, as described in Section 5.2. This gives the set of eligible productions Peligible (line 4 of Algorithm 5). The probabilities of the productions in Peligible are then re-computed so as to redistribute the probabilities of the non-eligible productions to the eligible ones (line 5 of Algorithm 5). This is achieved by proportionally re-normalizing the aggregate probability in Peligible to 1. Finally, the function choose is applied to the productions in Peligible (line 6 of Algorithm 5).


Algorithm 5: Generation of a string from a CFG, in the presence of grammar annotations

1:  S ← s
2:  k = 1
3:  while k < MAX_ITER and S has the form α · u · β, where α ∈ T* and u ∈ N do
4:    Peligible ← collectEligible(Pu, TypePu)
5:    normalizeProbabilities(Peligible)
6:    π ← choose(Peligible)
7:    S ← α · π(u) · β
8:    k = k + 1
9:  end while
10: if k < MAX_ITER then
11:   return S
12: else
13:   return TIMEOUT
14: end if
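Lines 4–6 of Algorithm 5 can be sketched in Java as follows (illustrative interfaces only; the prototype integrates these steps with EvoSuite's data structures):

// Filter the productions of the current non-terminal by the propagated type
// tuple, re-normalize the probabilities of the eligible ones to 1, and sample.
import java.util.*;

class AnnotatedChoice {
    interface Production {
        double probability();                       // learned or 80/20 probability
        boolean matches(String[] propagatedTypes);  // selection-annotation match
    }

    static Production choose(List<? extends Production> rule,
                             String[] propagatedTypes, Random rnd) {
        List<Production> eligible = new ArrayList<>();
        double total = 0.0;
        for (Production p : rule)
            if (p.matches(propagatedTypes)) { eligible.add(p); total += p.probability(); }
        // Sampling proportionally to probability()/total is equivalent to
        // re-normalizing the eligible probabilities so that they sum to 1.
        double r = rnd.nextDouble() * total;
        double cumulative = 0.0;
        for (Production p : eligible) {
            cumulative += p.probability();
            if (r <= cumulative) return p;
        }
        return eligible.get(eligible.size() - 1);   // rounding guard; assumes at least
                                                    // one eligible production exists
    }
}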

5.3.2 Genetic Operators

In our approach, genetic operators work at two levels: at the upper level, GA operators are used to evolve test suites (TS); at the lower level, GP operators are used to evolve the parse trees that represent the input sentences of the test cases contained in a test suite. Evolution at the lower level is regarded as a special kind of mutation (namely, parse tree mutation) at the upper level. Hence, GP operators are activated according to the probability of parse tree mutation set at the upper GA level. In particular, the GP operator subtree mutation is applied to a test case that belongs to test suite T with probability 1/|T|. The GP operator subtree crossover is applied with probability α (see the sketch after the operator descriptions below).

Subtree mutation: Subtree mutation is performed by replacing a subtree in the tree representation of the individual with a new subtree (see Chapter 2), generated from the underlying stochastic grammar by means of Algorithm 1.


If grammar annotation is activated, when a subtree is selected for mutation and when a replacement subtree is generated, the GP operator must ensure that the type annotations are not violated:

• The type annotations associated with the root of the subtree being replaced need to be propagated to the new subtree that will be generated to replace it.

• The pool of declared identifier names in the tree being mutated needs to be available to the new subtree.

• Subtrees containing declarations of identifiers are never replaced, to avoid the situation in which any later uses of those identifiers are invalidated.

Subtree crossover: Two subtrees rooted at the same non-terminal are selected in the parent trees and swapped, so as to originate two new offspring trees (see Chapter 2).

If grammar annotation is activated, in addition to being associated with the same non-terminal, the two subtree nodes chosen as points of exchange/crossover must have the same type annotation.
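One possible arrangement of the two-level operator application described at the beginning of this subsection is sketched below (hypothetical types; in particular, the pairing of test cases for subtree crossover is one plausible reading and is not a detail fixed by the text above):

// Upper-level "parse tree mutation" of a test suite: each test case is mutated
// with probability 1/|T|, and subtree crossover is attempted with probability α.
import java.util.*;

class SuiteLevelOperators {
    interface ParseTree { }

    interface GpOperators {
        ParseTree subtreeMutation(ParseTree t);           // replaces a random subtree
        void subtreeCrossover(ParseTree a, ParseTree b);  // swaps compatible subtrees
    }

    static void parseTreeMutation(List<ParseTree> suite, double alpha,
                                  Random rnd, GpOperators gp) {
        for (int i = 0; i < suite.size(); i++)
            if (rnd.nextDouble() < 1.0 / suite.size())
                suite.set(i, gp.subtreeMutation(suite.get(i)));
        if (suite.size() >= 2 && rnd.nextDouble() < alpha) {
            int i = rnd.nextInt(suite.size());
            int j = rnd.nextInt(suite.size());
            if (i != j) gp.subtreeCrossover(suite.get(i), suite.get(j));
        }
    }
}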

5.3.3 Fitness Evaluation

The GA evaluates each test suite by computing its fitness value. For this purpose, the tree representation of each test case in the suite is unparsed to a string, which is passed to the SUT as input. The GA determines the fitness value by running the SUT with all the unparsed trees from the suite and by measuring the number of branches that are covered, as well as the distance from covering the uncovered branches.

During fitness evaluation, branch distances [McMinn, 2004] are computed for all possible branches in the SUT, spanning multiple classes.


The fitness of the suite is the sum of all such branch distances. This fitness function is an extended form of the one employed by Fraser et al. [Fraser and Arcuri, 2013] for the unit testing of classes. The GA uses Equation 5.1 to compute the fitness value of a test suite T, where |M| is the total number of methods in the SUT; |MT| is the number of methods executed by T (hence |M − MT| accounts for the entry branches of the methods that are never executed); and d(bk, T) is the minimum branch distance computed for branch bk, where a value of 0 means the branch is covered.

fitness(T) = |M| − |MT| + Σ_{bk ∈ B} d(bk, T)    (5.1)
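As an illustration, Equation 5.1 can be sketched as follows; the method and its parameters are hypothetical names, while the actual implementation extends the test-suite fitness function of EvoSuite.

import java.util.Map;
import java.util.Set;

class SuiteFitness {
    /**
     * totalMethods       : |M|, the number of methods in the SUT
     * executedMethods    : the methods entered by suite T, i.e., M_T
     * minBranchDistances : d(b_k, T), the minimum branch distance observed for
     *                      every branch b_k in B (0 means the branch is covered)
     */
    double fitness(int totalMethods, Set<String> executedMethods,
                   Map<String, Double> minBranchDistances) {
        // |M| - |M_T| accounts for the entry branches of methods never executed.
        double value = totalMethods - executedMethods.size();
        for (double d : minBranchDistances.values()) {
            value += d;  // covered branches add 0, uncovered ones add their distance
        }
        return value;    // to be minimized; 0 means full branch coverage
    }
}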

5.4 Empirical Evaluation

To evaluate the effectiveness of the proposed strategies, we have implemented them in a prototype tool. We then carried out experiments using the prototype tool on six open source grammar-based systems with varying levels of complexity. Effectiveness was assessed in terms of generation of semantically valid inputs, coverage, and fault detection. Specifically, we formulated the following research questions:

RQ1 (valid sentences): Does the use of annotations or learning increase the number of semantically valid sentences that are generated by a stochastic grammar?

RQ2 (combination): Does the combination of GP with annotations or learning increase the level of coverage achieved during test case generation?

RQ3 (comparison): Which technique between annotations and learning provides the most effective combination with GP in terms of coverage? Are the two compared techniques complementary or largely overlapping?


RQ4 (fault detection): What is the fault detection capability of RND/GP with either annotations or learning?

RQ5 (annotator bias): What is the sensitivity of the stochastic grammar to changes in the annotations introduced by the annotator?

5.4.1 Prototype Tool

We implemented the proposed approach in a prototype by extending the EvoSuite test generation framework [Fraser and Arcuri, 2011]. In particular, we extended EvoSuite with: (1) a new parse-tree based representation of individuals; (2) a new initialization method, which resorts to stochastic grammar based sentence derivation; (3) new GP operators which manipulate parse tree representations of individuals; (4) support for handling the proposed type annotation of grammars. Moreover, the top-level algorithm has been modified to accommodate the two levels (GA and GP) required by our approach. For each SUT, we assume that there is a system level entry point through which it can be invoked. In cases where such an entry point is missing, we define one, acting as a test driver for invoking the core functionalities of the SUT.

To learn rule probabilities from a corpus, we used an existing implementation of the inside-outside algorithm. Given a grammar and a set of sentences, this implementation produces as output a probability for each rule in the grammar. We also implemented a tool that performs all necessary transformations on the input grammar [Grune and Jacobs, 1990], so as to turn it into a format acceptable to the inside-outside implementation, as well as a tool for tokenizing the sentences in the corpora that we used for learning. Since the inside-outside implementation expects a stream of tokens, rather than a stream of characters, the sentences must be tokenized into streams of tokens before the learning algorithm can make use of them.

Inside-outside implementation: http://web.science.mq.edu.au/~mjohnson/Software.htm


During fitness evaluation, the tree representation of each individual test case is unparsed to a string, which is then wrapped into a sequence of Java statements. These sequences of Java statements are then executed against the instrumented SUT. Figure 5.4 shows a simplified example of this process.

Figure 5.4: During fitness evaluation, tree representations are unparsed and wrapped into sequences of Java statements. In the example shown in the figure, the parse tree of the expression "(3)/6" is unparsed to the string "(3)/6" and wrapped as: try { Driver driver = new Driver(); String input = "(3)/6"; driver.entryMethod(input); } catch (...) { ... }

As recommended by Arcuri et al. [Arcuri et al., 2010b], our implementation takes advantage of accidental coverage. If the execution of a test case covers a coverage target which was not covered so far, such a test case is kept as a solution, regardless of the survival of the test suite it belongs to. At the end of the search, such test cases are merged with the best suite evolved by the search. In this way, test cases that exercise uncovered targets but are not part of the final "best" test suite are not lost.
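A minimal sketch of this archiving idea is given below, under hypothetical names; the real implementation stores test cases inside our EvoSuite extension rather than raw input strings.

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

class CoverageArchive {
    // One representative input per coverage target that has been hit at least once.
    private final Map<String, String> inputPerTarget = new HashMap<>();

    /** Record the input the first time a target is covered ("accidental coverage"). */
    void update(String targetId, String input) {
        inputPerTarget.putIfAbsent(targetId, input);
    }

    /** At the end of the search, these inputs are merged into the best evolved suite. */
    Collection<String> archivedInputs() {
        return inputPerTarget.values();
    }
}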

We implemented a random generation technique (RND hereafter) as a baseline for comparing the performance of the proposed approach. RND is implemented as follows: generate a random test case (either using stochastic grammars or annotated grammars), execute it against the SUT, and collect all covered branches [Arcuri et al., 2010b]. RND stops either when full coverage is reached or when the search budget is exhausted.
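The RND baseline can be sketched as follows; StochasticGrammar and Sut are hypothetical interfaces standing in for the grammar-based sentence generator and the instrumented SUT, and this is not the tool's actual code.

import java.util.HashSet;
import java.util.Set;

class RandomBaseline {
    Set<String> run(StochasticGrammar grammar, Sut sut, int totalBranches, int budget) {
        Set<String> covered = new HashSet<>();
        // Keep generating random sentences until full coverage or budget exhaustion.
        for (int i = 0; i < budget && covered.size() < totalBranches; i++) {
            String input = grammar.deriveRandomSentence();        // stochastic derivation
            covered.addAll(sut.executeAndGetCoveredBranches(input));
        }
        return covered;
    }

    // Hypothetical collaborators, shown only to make the sketch self-contained.
    interface StochasticGrammar { String deriveRandomSentence(); }
    interface Sut { Set<String> executeAndGetCoveredBranches(String input); }
}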


5.4.2 Metrics

To answer RQ1, we compute the proportion of semantically valid sentences generated by a stochastic grammar using annotations or learning, in comparison with a baseline consisting of a stochastic grammar that implements the 80/20 rule described in Chapter 2. For RQ2 and RQ3, the metric used to measure the effectiveness of the techniques being compared is branch coverage at the system level, computed as the number of branches covered out of the total number of branches in the SUT. To answer the second part of RQ3, we compute the set intersection and set difference between the branches covered by GP with annotations vs. GP with learning. For RQ4 we consider artificial faults (mutants) injected into the SUT by the mutation tool PIT. We measure the mutation score, i.e., the proportion of mutants that are killed by the generated test cases. A mutant is considered killed if the original and mutated programs produce different outputs when the generated test cases are executed. To answer RQ5, we measure the coverage reached at different levels of annotation, by dropping annotations from the annotated grammar. We then compute the delta coverage with respect to the baseline (no annotations dropped), when an increasing proportion of annotations is dropped.

5.4.3 Subjects

The subjects used in our experiments are six open source Java systems that accept structured input based on a grammar. Calc is an expression evaluator that accepts an input language including variable declarations and arbitrary expressions. MDSL is an interpreter for the Minimalistic Domain Specific Language (MDSL), a language including programming constructs such as functions, loops, conditionals, etc. Rhino is a JavaScript compiler/interpreter. Basic is an interpreter for a BASIC-like language called COCOA. Kahlua is a compiler for the Lua language. Javascal is a Pascal to Java compiler.

PIT: http://www.pitest.org
Calc: https://github.com/cmhulett/ANTLR-java-calculator/
MDSL: http://mdsl.sourceforge.net/



Table 5.1: Subjects used in our experimental study

Subject     Size (LOC)   # Prods   # Annot Prods   # Types   # Selectors   # Propagators
Calc             1,736        38      5 (13.16%)         1             3               5
Basic            4,866       108     50 (46.30%)         3            48              16
Kahlua           8,093       132     51 (38.64%)         4            22              43
MDSL            10,008       161     50 (31.06%)        12            27              48
Javascal        13,486       270     86 (31.85%)        11            65              34
Rhino           56,849       331     82 (24.77%)        11            39              60

Considering the complexity of the input structure (specifically, the associated grammar) they accept, these subjects are representative of a wide range of grammar-based systems. The smallest is Calc and the largest Rhino, while the others are in between. Table 5.1 reports the size in LOC (Lines Of Code) of the source code (excluding comments) and the number of productions in the respective grammars. Terminal productions, accounting for the lexical structure of the tokens, are excluded. These grammars are far more complex than those typically found in the GP literature and contain several nested and recursive definitions. Hence, they represent a significant challenge for the automated generation of test data.

Also shown in Table 5.1 are details related to the annotations we performed on the grammars. Between 13% (Calc) and 46% (Basic) of the productions contain some form of annotation, i.e., selector annotations or propagator annotations (see Section 5.2.2). Table 5.1 further reports the number of annotations of each specific type.

Rhino: http://www.mozilla.org/rhino (version 1.7R4)
Basic: http://www.mcmanis.com/chuck/java/cocoa/
Kahlua: https://code.google.com/p/kahlua/
Javascal: http://javascal.sourceforge.net/
LOC counted by CLOC: http://cloc.sourceforge.net/



The annotation of the subject grammars was performed by the author of this thesis. The time spent performing the annotations ranged between 2 and 8 hours, depending on the grammar. As the annotator is neither a developer of any of the subject programs nor had detailed prior knowledge of the grammars, most of the time was spent in understanding the grammar and the intended semantics of the underlying language. Presumably, the developers/testers of the respective programs know the grammar and the corresponding language very well. Hence, in practice, the quality of the annotations could potentially be much higher than in our experiments, and the time it takes to perform the annotation would also be lower. Consequently, the experiments and the results reported in this section are not necessarily indicative of the best case scenario. Rather, they show the potential of the annotation scheme introduced.

Table 5.2: Number of sentences and tokens in the corpus used for each subject during learning. Also shown is the average number of tokens per sentence.

Subject     # Sentences   # Tokens   Tokens per sentence
Calc                 19        189                  9.95
Basic                19       1186                 62.42
Kahlua                9       1583                175.89
MDSL                 65       5285                 81.31
Javascal              5        618                123.60
Rhino               990      32752                 33.08

The corpus used for learning the probabilities of stochastic grammars is composed of sentences we selected from the test suites distributed with each SUT (for Calc, MDSL, Kahlua), from the V8 JavaScript Engine benchmark (for Rhino), and from code examples freely available on the Internet (for Basic and Javascal). Detailed information about the size of the corpus used for probability learning is shown in Table 5.2.

V8 benchmark: https://code.google.com/p/v8/



5.4.4 Experiment Protocol and Settings

Since the approaches being compared (GP and RND, with or without annotations) are based on stochastic grammars, they heavily rely on non-deterministic choices. Therefore, we repeated each experiment 10 times and measured the statistical significance of the differences using the Wilcoxon non-parametric test.

Based on some preliminary sensitivity experiments, we assigned the following values to the main parameters of our algorithm: population size = 50, crossover rate = 0.75, subtree crossover rate α = 0.1, new test insertion rate β = 0.1. For the other parameters we kept the default values set by the EvoSuite tool. Since the subjects used in our experiments differ significantly in size and complexity, giving the same search budget to all would not be fair. Hence, we resorted to the following heuristic rule for budget assignment: we give each SUT a budget of n * |branches|, where |branches| is the number of branches in the SUT. Based on a few preliminary experiments, we chose the value n = 5.

To measure the proportion of semantically valid sentences, we generated s = n * |branches| sentences (with n = 1) when applying each approach and counted how many sentences are semantically valid. By semantically valid sentences we mean sentences that do not make the SUT emit a custom error message (either related to the input language syntax or semantics). Since deciding whether an input is semantically valid or not in an automatic manner is a non-trivial task, we adopted the following approximation. If for a given input sentence the SUT throws a custom exception (i.e., an exception class defined in the SUT itself), the input is considered invalid (i.e., the SUT recognizes that the input does not respect the specifications of the language). Otherwise it is considered valid (i.e., no exception is thrown or the thrown exception is a standard Java exception, such as a NullPointerException).
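The approximation can be sketched as follows; the package prefix used to recognize custom exception classes is a placeholder, not the actual SUT packages used in our experiments.

class ValidityOracle {
    private static final String SUT_PACKAGE = "com.example.sut"; // hypothetical placeholder

    boolean isSemanticallyValid(Throwable thrown) {
        if (thrown == null) {
            return true;  // no exception: the input was accepted
        }
        // Standard Java exceptions (e.g., NullPointerException) do not count as the SUT
        // rejecting the input, so the sentence is still considered valid.
        return !thrown.getClass().getName().startsWith(SUT_PACKAGE);
    }
}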


To measure the sensitivity of the stochastic grammar to changes in the annotations, we performed simulated experiments in which we randomly dropped a fraction of the annotations from the annotated grammar. Since the type of annotation that determines whether a production is selected or not during sentence derivation is the selector annotation, we decided to drop selector annotations for this experiment. By dropping the selector annotation for a given production, we intend to simulate the situation in which the developer did not annotate the production, either intentionally or unintentionally. The number of each type of annotation for each subject in the experiment is presented in Table 5.1. As can be seen from the table, the selector annotations for the smallest subject (Calc) are very few (3), and hence not suitable for this analysis. For this reason, we excluded Calc and carried out the experiment on the remaining 5 subjects. We experimented with five different configurations in which we dropped 5%, 10%, 15%, 20%, and 25% of the annotations. For each configuration, we executed both GP and RND 10 times on each subject, every time changing the set of annotations dropped.
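A sketch of this simulation is shown below, under hypothetical names for the annotated productions; in each of the 10 runs a fresh random subset of selector annotations is removed.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class AnnotationDropper {
    /** Removes the selector annotation from a random fraction of the annotated productions. */
    void dropSelectors(List<AnnotatedProduction> productions, double fraction, Random rnd) {
        List<AnnotatedProduction> withSelector = new ArrayList<>();
        for (AnnotatedProduction p : productions) {
            if (p.hasSelectorAnnotation()) {
                withSelector.add(p);
            }
        }
        Collections.shuffle(withSelector, rnd);
        int toDrop = (int) Math.round(fraction * withSelector.size());
        for (int i = 0; i < toDrop; i++) {
            withSelector.get(i).removeSelectorAnnotation(); // simulate a missing annotation
        }
    }

    // Hypothetical collaborator, shown only to make the sketch self-contained.
    interface AnnotatedProduction {
        boolean hasSelectorAnnotation();
        void removeSelectorAnnotation();
    }
}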

5.4.5 Results

Table 5.3 reports the proportion of well-formed sentences generated by the various strategies (columns 5-6-7). In addition, the number of unique sentences is also reported in columns 2-3-4, accompanied (in brackets) by the percentage of unique sentences over the total number of sentences that have been generated. Such a percentage is indicative of the proportion of duplicate sentences generated by the various strategies (in fact, the complement of the reported percentages is the percentage of duplicate sentences). The proportion of well-formed sentences is computed with respect to the unique sentences generated by each technique.



Table 5.3: Unique sentences and proportion of valid sentences produced by the various sentence generation strategies. In the last 3 columns, highest values in boldface differ from the second highest in a statistically significant way according to the Wilcoxon test, at significance level 0.05

                     Unique Sentences Generated                          Valid Unique Sentences
Subject        8020            AN              LRN              8020        AN        LRN
Calc            212 (78.83%)    212 (81.57%)    212 (76.29%)   100.00%   100.00%   100.00%
Basic           907 (78.29%)    907 (80.48%)    907 (72.90%)    22.01%    51.58%    56.12%
Kahlua         1191 (56.15%)   1191 (48.27%)   1191 (90.70%)    64.53%    86.26%    71.94%
MDSL           2630 (60.97%)   2630 (43.77%)   2630 (86.90%)    25.46%    97.86%    75.13%
Javascal        730 (95.74%)    730 (57.34%)    730 (97.41%)    14.38%    95.07%    41.04%
Rhino          6138 (43.80%)   6138 (10.29%)   6138 (60.26%)    84.70%    97.29%    55.52%

From the results, we can notice that annotation (AN hereafter) produces the largest proportion of well-formed sentences in four subjects (Kahlua, MDSL, Javascal, and Rhino). For subject Basic, learning (LRN hereafter) produces the highest proportion of valid sentences, while for Calc they are equal. In fact, for Calc all three strategies generate 100% valid sentences. This is because the approximation we applied to determine whether a sentence is valid or not relies on the SUT throwing custom exceptions, while in the case of Calc no custom exceptions are defined. Furthermore, the language defined by the grammar has a relatively simple structure and malformed sentences generated by the considered techniques are quite rare; whenever they are found, they manifest themselves as standard Java exceptions (e.g., NullPointerException). We manually verified that the proportion of such exceptions is negligible and does not affect the results in any substantial way.

We can notice that AN is effective in generating the highest proportion of valid sentences in the larger subjects (both in terms of grammar size and LOC). AN also generates the largest proportion of duplicates for such subjects (see Table 5.3, columns 2-3-4). We argue that annotations introduce constraints that, on the one hand, almost always ensure the generation of valid sentences, but, on the other hand, limit the possibility of exploring alternative generation paths, hence resulting in repeated generation of the same sentences.



We can answer RQ1 positively. The annotation scheme significantly increases the proportion of semantically valid sentences as compared to learning in four out of five subjects (in one subject all techniques behave the same).

Table 5.4: Branch coverage and p-values obtained from the Wilcoxon test comparing LRN and AN. Statistically significant values (at significance level 0.05) are shown in boldface under columns LRN and AN; the highest values are in gray background. Effect size measures using the Vargha-Delaney statistics (A12) are also shown. A12 < 0.5 means LRN is better than AN, A12 > 0.5 means AN is better than LRN, and A12 = 0.5 means they are equal.

                                 RANDOM                                                   GP
Subject      8020      LRN       AN        p      A12   Magnitude    8020      LRN       AN        p      A12   Magnitude
Calc       80.57%   80.24%   80.57%   0.0054   0.80   Large        80.57%   80.38%   80.67%   0.0587   0.69   Medium
Basic      66.24%   67.96%   56.94%   0.0002   0.00   Large        67.38%   69.14%   67.90%   0.0098   0.16   Large
Kahlua     77.92%   78.85%   75.57%   0.0002   0.00   Large        78.95%   79.50%   79.83%   0.2253   0.67   Medium
MDSL       73.32%   72.58%   60.47%   0.0002   0.00   Large        79.99%   78.62%   79.33%   0.0028   0.90   Large
Javascal    8.49%   18.42%   19.92%   0.1730   0.69   Medium       12.09%   19.59%   22.21%   0.1770   0.69   Medium
Rhino      25.86%   23.03%   25.76%   0.0002   1.00   Large        48.81%   46.35%   49.16%   0.0001   0.96   Large

The level of branch coverage achieved by each technique is shown in Table 5.4; the corresponding box plots are shown in Figure 5.5 and Figure 5.6. Results of the Wilcoxon test of significance (p-values) comparing LRN and AN are also shown (under column p). Effect size measures computed using the Vargha-Delaney statistics are reported under column A12, and their textual descriptions in column Magnitude.

Let us consider GP-based test case generation, which results in the highest coverage levels reached on all subjects. With the exception of MDSL, on which 80/20 is marginally but not significantly better, on all the other subjects either learning (GP LRN) or annotations (GP AN) give the highest coverage. Specifically, for subject Basic, LRN gives significantly higher coverage than AN. For the other subjects, AN gives either significantly better coverage (this is the case of Rhino and MDSL) or better coverage without statistical significance (this is the case of Calc, Kahlua, and Javascal) as compared to LRN.


Figure 5.5: Coverage box plots under various configurations (from left to right: RND 8020, RND LRN, RND AN, GP 8020, GP LRN, GP AN) for the experimental subjects: Calc, Basic, Kahlua, and MDSL


Figure 5.6: Coverage box plots under various configurations (from left to right: RND 8020, RND LRN, RND AN, GP 8020, GP LRN, GP AN) for the experimental subjects: Javascal and Rhino


When combined with random test case generation, all strategies achieve coverage scores which are lower than the same strategy combined with GP. It can be noticed that, within the random generation strategy, LRN achieves better coverage than AN on three subjects (Basic, Kahlua, and MDSL) while AN achieves better coverage on the remaining three, although in the case of Javascal the difference is not statistically significant. The baseline (80/20) achieves a (relatively) high coverage on Calc, MDSL, and Rhino.

Furthermore, from Table 5.4 we can observe that whenever there are significant differences, the magnitude of those differences is large, as computed by the Vargha-Delaney effect size measure.

We can notice that on Calc all techniques achieve similar levels of coverage. Since this subject is the smallest one, both in terms of source code and grammar, it is relatively easy for all techniques to quickly reach maximum coverage.



Regarding RQ2, GP search combined with any of the three strategies (80/20, LRN, AN) outperforms random search. Within GP, using either LRN or AN gives higher coverage in five out of the six subjects, with just a marginal difference on the sixth subject. Hence we can answer RQ2 positively, i.e., the combination of GP with either annotations (AN) or learning (LRN) increases coverage.

Table 5.5: Branches covered by AN and LRN: intersection, differences and similarity (Jaccard index)

Subject     Branches     AN     LRN   AN ∩ LRN   AN∖LRN   LRN∖AN   Similarity
Calc             211    172     170        170        2        0         0.99
Basic            906    653     657        639       14       18         0.95
Kahlua          1190    977     968        964       13        4         0.98
MDSL            2629   2162    2160       2112       50       48         0.96
Javascal         729    218     219        158       60       61         0.57
Rhino           6138   3249    2984       2953      296       31         0.90

In Table 5.5 we report the overlaps in the branches covered by GP when combined with annotations and learning. Specifically, column AN shows the number of branches covered by the sentences generated by GP with annotations, and column LRN shows the number of branches covered by the sentences generated by GP with learning. Column AN ∩ LRN shows the number of branches covered by both (intersection). Columns AN∖LRN and LRN∖AN show the differences, i.e., branches covered by one but not by the other. Column Similarity shows the Jaccard similarity between the sets of branches covered by the two strategies, computed as |AN ∩ LRN| / |AN ∪ LRN|.

From Table 5.5, we can see that the similarity between the sets of branches covered by GP with either annotations or learning is quite high in the majority of the subjects. In some of the subjects, in particular Rhino and Kahlua, the number of branches covered by annotations but not by learning (column AN∖LRN) is relatively high (in comparison with column LRN∖AN). For the rest of the subjects, the differences (AN∖LRN, LRN∖AN) are small and comparable, with the exception of Javascal, for which LRN and AN are largely complementary (with a Jaccard index of only 0.57 and set differences of size 60 and 61).



Regarding RQ3, AN achieves the highest coverage (either statistically significantly or not) in four out of six subjects, while LRN gives the highest coverage in one case. Hence, in terms of coverage AN is slightly superior to LRN. However, since the number of statistically significant improvements is low (2 in favor of AN and 1 in favor of LRN), the gap may not be meaningful. Hence we can answer RQ3 as follows:

Regarding RQ3, the statistical evidence is not strong enough to conclude that either combination (i.e., GP AN or GP LRN) is better than the other. Both are however better than the baseline (GP 80/20). Furthermore, the overlap between the sets of branches covered by GP AN and GP LRN is high, with just a few cases in which GP AN covers a large number of branches not covered by GP LRN.

To cope with the resource-intensive nature of mutation analysis, we carried out experiments on a selected subset of classes from each SUT. In particular, we selected classes that are involved in deep computations inside the SUT. This means that to reach these classes the input must be well formed and meaningful. For instance, in Rhino the input JavaScript program needs to pass lexical and syntax checking before reaching the Interpreter or CodeGenerator.

Table 5.6 reports the mutation scores. The mutation score for a given test suite is computed as the proportion of mutants killed by the suite out of the total number of mutants. We generated a maximum of 100 mutants per class. For each SUT we used the test suites resulting from the 10 executions carried out to measure coverage. The values reported in Table 5.6 are averages over the 10 test suites. We do not report mutation scores for the subject Javascal because the mutation tool we used was not able to measure the mutation score for this particular subject.


Table 5.6: Mutation scores achieved by RND and GP when using annotations (AN) and learning (LRN). Significantly better values are shown in boldface; the highest values are in gray background.

Subject   Class            RND LRN   RND AN        p   GP LRN    GP AN        p
Calc      CalcLexer         79.98%   86.12%   0.0040   77.40%   75.12%   0.0492
Calc      CalcParser        65.60%   66.40%   0.9093   62.90%   56.90%   0.9698
Basic     Expression         9.18%    7.81%   0.0281    9.59%   12.05%   0.0044
Basic     Program            8.00%   10.10%   0.1241    8.60%    9.00%   0.5783
Kahlua    LexState          59.70%   58.30%   0.0589   58.30%   58.90%   0.9382
Kahlua    LuaState          36.90%   38.30%   0.1463   32.70%   36.70%   0.0133
MDSL      Dispatcher        57.83%   44.29%   0.0002   56.42%   56.37%   1.0000
MDSL      MiniLexer         48.95%   49.64%   0.4454   51.75%   51.59%   0.8498
MDSL      MiniParser        57.98%   40.54%   0.0003   58.50%   54.85%   0.0398
Rhino     CodeGenerator     22.40%   29.90%   0.0001   54.60%   59.30%   0.0006
Rhino     Interpreter       12.20%   16.20%   0.0001   23.60%   27.20%   0.0083


From the results in Table 5.6 we can see that there is a statistically significant difference between AN and LRN only in a few cases. Specifically, in 7 cases AN is superior (see boldface values under RND AN or GP AN in Table 5.6), while in 5 cases LRN has a significantly higher mutation score. In the majority of the cases (the 10 remaining cases), the two strategies (AN vs. LRN) achieve comparable levels of mutation scores. Regarding the benefits coming from GP, we can notice that they become visible, in terms of mutation score, as the size and grammar complexity of the subjects increase (see the highest mutation scores, shown in gray background in Table 5.6, where subjects are sorted by increasing size/grammar complexity).


Regarding RQ4, mutation analysis shows no significant difference between annotations and learning. There is no statistically significant difference in most cases, while the remaining cases are split almost evenly between AN and LRN (7 vs. 5).

Table 5.7: Coverage achieved by RND and GP when 5, 10, 15, 20, and 25% of the annotations are dropped. The corresponding loss or gain in coverage is also shown; values above 1pp are in highlighted background.

Subject     Config      0%      5%   ∆(pp)     10%   ∆(pp)
Basic       GP       67.90   67.79   -0.11   67.37   -0.53
            RND      56.94   56.47   -0.47   58.25    1.30
Kahlua      GP       79.83   79.75   -0.08   79.34   -0.49
            RND      75.57   75.21   -0.36   75.21   -0.36
MDSL        GP       79.33   79.75    0.41   79.99    0.65
            RND      60.47   60.72    0.25   60.26   -0.21
Javascal    GP       22.21   18.61   -3.60   22.01   -0.20
            RND      19.92   21.19    1.27   20.10    0.19
Rhino       GP       49.16   49.36    0.20   49.60    0.43
            RND      25.76   25.82    0.06   25.98    0.22
Average     GP                       -0.64           -0.03
            RND                       0.15            0.23

Subject     Config      0%     15%   ∆(pp)     20%   ∆(pp)     25%   ∆(pp)
Basic       GP       67.90   67.69   -0.21   68.12    0.22   67.49   -0.41
            RND      56.94   59.08    2.14   60.30    3.36   59.90    2.96
Kahlua      GP       79.83   79.82   -0.01   79.76   -0.08   79.54   -0.29
            RND      75.57   75.56   -0.01   74.96   -0.61   75.82    0.25
MDSL        GP       79.33   79.86    0.52   79.97    0.64   79.94    0.61
            RND      60.47   60.79    0.32   60.54    0.07   60.97    0.50
Javascal    GP       22.21   17.11   -5.09   16.88   -5.33   18.94   -3.26
            RND      19.92   20.88    0.96   21.23    1.32   19.19   -0.73
Rhino       GP       49.16   48.47   -0.69   48.66   -0.51   48.16   -1.01
            RND      25.76   25.99    0.23   25.99    0.23   25.92    0.15
Average     GP                       -1.10           -1.01           -0.87
            RND                       0.73            0.87            0.63


In Table 5.7, we report the average coverage achieved by RND and GP for each subject when a given percentage of annotations is dropped. Table 5.7 also shows the delta in pp (percentage points, i.e., the difference between two percentage values) for each configuration, as compared to the baseline in which no annotations are dropped. In the majority of the cases, the variability in coverage (either positive or negative) is very low, mostly below 1pp. However, there are a few cases where dropping annotations has a noticeable effect on coverage. In particular, for the subject Javascal, we observe a decrease in coverage as high as 5.33pp for GP, while for RND we observe an increase (up to 1.32pp) in coverage. This could be due to the strict structural requirements imposed by the underlying grammar (for the Pascal language). For the subject Rhino, we observe small decreases in coverage (up to 1pp) as the percentage of dropped annotations increases. For the subject Basic, RND showed an increase in coverage (up to 3.36pp) while variations in the coverage of GP remained quite low.

We can observe in Table 5.7 that, quite surprisingly, some of the deltas are positive. This may indicate that the set of annotations used in the experiments may at times be too restrictive, hence preventing the techniques from exploring the space of sentences that could be derived from the grammar. Dropping annotations in these cases thus led to increased coverage.

Regarding RQ5, we can conclude that for the majority of the subjects on which we experimented, the performance of the stochastic grammar is robust with respect to annotation changes.

5.4.6 Threats to Validity

The main threats to the validity of our results are related to internal, conclusion, and external validity.

Internal validity threats concern factors that may affect a dependent variable and were not considered in the study. In our case, different grammar-based test data generation techniques could be used, with potentially varying effectiveness. We chose stochastic random generation as a baseline as it is representative of state-of-the-art techniques for random grammar-based test generation.



Another potential threat could arise from the fact that the author of this thesis performed the annotation of the grammars, simulating the developer. This may result in under-utilization of the capabilities of the annotation scheme due to the lack of a deeper understanding of the grammars (while the developer is expected to know very well the grammar and the characteristics of the language defined by it). On the other hand, the intended semantics of the annotation scheme was perfectly clear and familiar to the author, while it might not be so for a developer who has not been adequately trained for the job. Consequently, the performance of GP combined with our own annotations might have been marginally penalized as compared to the performance achievable by a well-trained developer who is very familiar with the grammar.

Threats to conclusion validity concern the relationship between the treatments and outcomes. In addition to reporting the results of our experiments, whenever applicable, we have also tested them for statistical significance using the Wilcoxon non-parametric test, and have drawn conclusions in accordance with the results of the test.

External validity threats are related to the generalizability of results. We have chosen six subjects representative of grammar-based systems of various sizes (both in terms of code and grammar size). Even though these subjects are quite diverse, generalization to other subjects should be done with care. Replication of the experiment on more subjects would further increase our confidence in the generalizability of the results.

5.5 Related Work

Though there are a number of research works in the literature that deal with sentence generation from grammars (see Section 3.2), the majority of them do not directly address the problem of generating sentences for achieving code coverage.


In recent years, however, there have been some attempts towards this objective. One such work closely related to ours is that of Beyene and Andrews [Beyene and Andrews, 2012]. The authors propose an approach in which they split the generation of sentences into two parts. In the first part, they generate Java classes that represent each category in the grammar (terminals and non-terminals), called Grammatical Category Objects (GCOs). In the second part, they use various testing techniques to derive instances of the GCOs, ultimately producing sentences from the grammar. The SUT is then executed with the generated sentences and (line) coverage is measured.

In another closely related work, Poulding et al. [Poulding et al., 2013] propose to automatically find an optimal distribution of weights for the productions of a stochastic CFG using metaheuristic search. The metaheuristic search evolves profiles, i.e., production weights and dependencies among the productions. The fitness of a given profile is measured by sampling the set of sentences that could potentially be generated from the distribution of weights defined by the profile and measuring the coverage achieved by the sampled sentences on the SUT. The (branch) coverage achieved is then used as feedback (fitness metric) to guide the metaheuristic search towards a profile that achieves high coverage.

Our approach differs from the aforementioned works in three aspects. First, we introduce practical and scalable alternative mechanisms (80/20, learning, annotations) that address issues related to sentence generation from grammars (i.e., unbounded recursion and semantic constraints). Second, we use GP operators (crossover/mutation) to directly evolve the syntax trees during the search, hence introducing further flexibility so that more suitable individuals can be evolved. Third, we use a fine-grained fitness function based on the branch distances (from uncovered branches), so the search receives more informative feedback that helps it effectively navigate the search space towards a set of sentences that achieve high coverage in the SUT.



5.6 Conclusion

We have introduced and compared two approaches for grammar-based test data generation. One approach combines GP with stochastic grammars learned from a corpus of manually written sentences. Learned stochastic grammars are effective in controlling recursion during sentence derivation and in promoting sentence structures similar to the valid sentences used for learning. However, learning comes with added costs, mainly associated with finding an appropriate corpus and adapting the grammar to the learning tool. The second approach combines GP with semantically annotated grammars. Annotations constrain the generation process so that the proportion of semantically valid sentences is increased.

Experimental results obtained on six grammar-based systems of varying grammar complexities show that the two approaches are comparably effective, both in terms of coverage and fault exposure capability. Moreover, both approaches achieve maximum effectiveness when combined with GP. Depending on the difficulty of finding an appropriate corpus of valid sentences, as compared to the effort necessary for grammar annotation, developers may decide to adopt one or the other alternative approach. According to our empirical data, such a choice has minimal impact on the final coverage and fault detection capability of the resulting test suites, provided the test suite generation process is based on the GP evolutionary scheme.

In our future work, we will investigate mechanisms to promote diversity during sentence derivation from annotated grammars. In fact, preliminary observations (see, for instance, the results on unique vs. duplicate sentences) seem to indicate that annotations tend to over-constrain the generation process, reducing the exploration of alternative sentence structures. We intend to introduce some form of history, recording the derivation choices made in the past, so as to promote different choices whenever possible. We will also extend the empirical benchmark by including additional subjects, so as to further increase the external validity of the findings reported.




Chapter 6

System Level Test Generation for Reproducing Failures of Programs with Structured Input

Software systems are increasingly complex and can be used in unpredictable (and untested) ways. Hence, users experience field failures (failures occurring after deployment), which should be reproduced and fixed as quickly as possible. Field failure reproduction, however, is not an easy task for developers [Bettenburg et al., 2008], as bug reports provide limited information on how the failure can be reproduced in the testing lab.

Research in the area has resulted in approaches based on Symbolic Execution (SE) (e.g., BugRedux [Jin and Orso, 2012]) which are able to reproduce field failures from minimal runtime information (e.g., call sequences) collected via instrumentation. While these approaches are effective in reproducing field failures for a certain class of programs, they have limitations that arise from the fact that they rely on symbolic execution, which is known to be ineffective on (1) programs with highly structured inputs, such as inputs that must adhere to a (non-trivial) grammar, (2) programs that interact with external libraries, and (3) large complex programs in general (e.g., programs that generate constraints that the constraint solver cannot handle).


In this thesis we present a failure reproduction technique, called SBFR (Search-Based Failure Reproduction), that is specifically designed to handle complex programs with highly structured inputs (e.g., compilers and interpreters), hence going beyond the features of the programs used in previous failure reproduction studies [Jin and Orso, 2012, Chilimbi et al., 2009, Clause and Orso, 2007, Jiang and Su, 2007, Artzi et al., 2008, Rossler et al., 2013]. SBFR takes as input the failing program, a grammar describing the program input, and the (partial) call sequence for the failing execution, and uses GGGP for generating inputs that trigger the observed failure.

6.1 Terminology

A field failure is a failure of a deployed program while it executes on a user machine. We use the term execution data to refer to any runtime information collected from a program executing on a user machine. In particular, a call sequence is a specific type of execution data that consists of a (sub)sequence of the functions (or methods) invoked during the execution of a program on a user machine.

We define a field failure reproduction technique as a technique that can synthesize an in-house failure reproducing execution E′, given a program P, a field execution E of P that results in a failure F, and a set of execution data D for E, as follows. First, E′ should result in a failure F′ that is analogous to F, that is, F′ has the same observable behavior as F. If F is the violation of an assertion at a given location in P, for instance, F′ should violate the same assertion at the same point. Second, E′ should be an actual execution of P, that is, the approach should be sound and generate an actual input that, when provided to P, results in E′. Finally, the approach should be able to generate E′ using only P and D, without the need for any additional information.


6.2 SBFR - Search-Based Failure Reproduction

SBFR's goal is to reproduce an observed (field) failure based on an (ideally minimal) set of runtime information about the failure, or execution data. As commonly done in these cases, SBFR collects execution data by instrumenting the software under test (SUT) before deploying it to users. Upon failure, the execution data for the failing execution are used by SBFR to perform a GP-based search for inputs that can trigger the failure observed while the SUT was running on the user machine. This part of the approach would ideally be performed in the field (by leveraging free cycles on the user's machine), so as to send back only the generated inputs instead of the actual execution data.

Figure 6.1: Overview of SBFR

An overview of SBFR is shown in Figure 6.1. The Instrumentation component takes as input a program P and generates an instrumented version P'. P' is then deployed and produces execution data. Upon failure, the execution data T and P' are used by the GP Search component to generate a test case TC that reproduces the observed failure. The test case TC is then used by the developers to reproduce and debug the failure.

Execution data are used to provide guidance to the evolutionary search in the form of a fitness function (discussed in Section 6.2.3). The fitness function assigns a fitness value to an individual based on the 'distance' between the execution data obtained from its execution and the execution data obtained from the actual failure. The aim of the GP search is to minimize this distance.



The individuals in the GP search are candidate test inputs for the SUT, that is, structured input strings that adhere to a formal grammar. The search maintains a population of individuals, evaluates them by measuring how close they get to the desired solution using a fitness function, and evolves them via genetic operators (see Chapter 2). If at least one candidate is able to trigger the desired failure, a solution is found and the search terminates. If no candidate solution is found after consuming the whole search budget, the search is deemed unsuccessful.

6.2.1 Seeding the Search with Representative Inputs

The initial population in SBFR is generated by deriving sentences from the grammar of the SUT following two strategies: the 80/20 rule and grammar learning. As discussed in Chapter 2, the 80/20 rule is a simple and practical heuristic in which, during sentence derivation, for a given rule an overall probability of 0.8 is assigned to the non-recursive productions while a probability of 0.2 is distributed among the recursive productions. This helps control the derivation process by limiting the application of recursive rules, increasing the likelihood that the derivation terminates.
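As an illustration, the 80/20 weight assignment for the productions of a single non-terminal could be sketched as follows; Production is a hypothetical interface, not part of the actual tool.

import java.util.List;

class EightyTwentyWeights {
    void assign(List<Production> productions) {
        long recursive = productions.stream().filter(Production::isRecursive).count();
        long nonRecursive = productions.size() - recursive;
        for (Production p : productions) {
            if (recursive == 0 || nonRecursive == 0) {
                p.setProbability(1.0 / productions.size()); // degenerate case: uniform
            } else if (p.isRecursive()) {
                p.setProbability(0.2 / recursive);          // recursive rules share 0.2
            } else {
                p.setProbability(0.8 / nonRecursive);       // non-recursive rules share 0.8
            }
        }
    }

    // Hypothetical collaborator, shown only to make the sketch self-contained.
    interface Production {
        boolean isRecursive();
        void setProbability(double p);
    }
}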

For complex grammars, evolving meaningful and representative sentences may be quite difficult for GP if the initial population is not chosen carefully. One of the subject programs used in our experimental study, PicoC, is a good example of such a case. PicoC is an interpreter for a subset of the C programming language. The grammar used to generate inputs for this program is fairly large (194 production rules) with complex and highly recursive structures. Sentences obtained using the 80/20 rule would still be very different from real C programs, being mostly limited to shallow structures, such as paired braces. Common programming constructs, such as assignment statements, would be very difficult to generate from such a complex grammar using random techniques.



Instead of applying a set of production probabilities pre-determined by the 80/20 rule, in these cases probabilities are learned from examples of existing, human-written inputs. Namely, the stochastic grammar used for the generation of the initial population is learned from a corpus of well-formed, human-written sentences for the SUT. In our experiments, for subjects with very complex grammars, we applied learning of production probabilities from a corpus of input sentences. Learning of rule probabilities from a corpus is discussed in Section 5.1 of the previous chapter.

6.2.2 Input Representation and Genetic Operators

Once the initial population is generated, either using the 80/20 rule or by learning probabilities as discussed in the previous subsection, the individuals are represented as syntax trees. The genetic operators (subtree crossover and mutation) manipulate these tree representations of the individuals.

We apply subtree crossover and mutation operations on the individuals (see Chapter 2). We chose to use tree-based operations because they ensure the creation of well-formed individuals: if both parents are well formed (according to the grammar), the offspring produced by subtree crossover are also going to be well formed. Similarly, subtree mutation of a well-formed individual results in a well-formed individual. The probabilities used for the generation of new subtrees during subtree mutation are the same as those used to generate the initial population (i.e., they are either determined by the 80/20 rule or learned from a corpus).

6.2.3 Fitness Computation and Search Termination

SBFR evaluates candidate solutions based on the trace obtained when executing them against the instrumented SUT. During fitness evaluation of an individual, its tree representation is "unparsed" into a string representation and passed to the instrumented SUT as input. The execution of the instrumented SUT results in execution data (a trace). In this thesis, we consider execution data that consist of call sequences and refer to a call sequence using the term trajectory. More formally, we define a trajectory as a sequence T = ⟨c1, ..., cn⟩, where each ci is a function/method call.



Call sequences have been shown to provide the best trade-off in terms of cost-benefit for synthesizing in-house executions [Jin and Orso, 2012]. Furthermore, based on anecdotal evidence obtained by manually checking the collected call sequences in our empirical study, call sequences are unlikely to reveal sensitive or confidential information about the original execution.

SBFR's fitness function compares how "similar" the trajectory of a candidate individual is to the trajectory of the failing execution obtained from the field. This comparison is implemented using sequence alignment between the two trajectories. That is, we propose a fitness function based on the distance between the trajectory of the failing execution and the trajectory of a candidate individual. Hence, our GP approach tries to minimize this distance with the objective of finding individuals that generate trajectories identical to that of the failing execution. The distance between two trajectories T1 and T2 can be defined as:

distance(T1, T2) = |T1| + |T2| − 2 · |LCS(T1, T2)|    (6.1)

where LCS stands for Longest Common Subsequence [Cormen et al., 1990], and |T| is the length of trajectory T.

For instance, T1 = ⟨f, g, g, h, m, n⟩ and T2 = ⟨f, g, h, m, m⟩ have LCS = ⟨f, g, h, m⟩. Hence, their distance is 6 + 5 − 8 = 3, which corresponds to the number of calls that appear only in T1 (the second g, and n) or only in T2 (the second m).
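A straightforward sketch of Equation 6.1, using the standard dynamic-programming computation of the LCS length, is shown below; on the example trajectories above it returns 3. The class and method names are illustrative, not the tool's actual code.

import java.util.List;

class TrajectoryDistance {
    /** Equation 6.1: |T1| + |T2| - 2 * |LCS(T1, T2)|. */
    static int distance(List<String> t1, List<String> t2) {
        return t1.size() + t2.size() - 2 * lcsLength(t1, t2);
    }

    private static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    // distance(List.of("f","g","g","h","m","n"), List.of("f","g","h","m","m")) == 3
}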

The fitness value of an individual in the GP is hence computed as the distance between its trajectory and the target trajectory using Equation 6.1. The fitness value is then minimized by the search, with the ultimate objective of producing individuals that reproduce the desired failure. The search stops when a desired solution is found or the search budget, expressed as the maximum number of fitness evaluations, is exhausted. If successful, the search produces an individual (i.e., an input) that causes the program to follow a trajectory similar to that of the observed failure, reaching the point of failure, and failing at that point with the same observable behavior as the original failure. How our current implementation actually assesses whether the reproduced failing behavior matches the observed one is discussed in Section 6.3.1 below.



6.3 Empirical Evaluation

The main goal of our empirical evaluation is to assess the effectiveness and practical applicability of our SBFR approach for programs with structured and complex input. To achieve this goal we developed a prototype tool that implements the proposed approach and performed a study on several real-world programs and real failures for these programs. Specifically, we investigated the following research questions:

• RQ1 (effectiveness): How effective is SBFR in reproducing real field failures for programs with structured input?

• RQ2 (instrumentation overhead): What is the performance overhead imposed by the instrumentation required by SBFR?

• RQ3 (input seeding): What is the role of input seeding in SBFR?

In the rest of this section, we present the details of the prototype tool and of the subject programs and failures that we used for our experiments, illustrate our experiment protocol, and discuss our results and the possible threats to their validity.


6.3.1 Prototype Tool

Figure 6.2 shows an overall view of the prototype tool that implements SBFR. The tool consists of three main modules: TC Generator, Instrumenter, and Learner. The TC Generator, via the GP Search component inside it, performs the evolutionary search guided by the trajectory, eventually producing a test case. To evaluate individuals in the search, it uses the SUT Runner component, which is a wrapper that executes the individual with a given timeout and returns the execution trace together with the exit status of the execution (including possible error messages). In cases where learning is employed, the Learner component produces a stochastic CFG (see Chapter 2) starting from the SUT's input grammar and a corpus.

Figure 6.2: Prototype tool that implements SBFR

We have implemented the core GP Search component of our prototype tool by extending GEVA [O'Neill et al., 2008], a general-purpose grammatical evolution tool written in Java. GEVA provides the necessary infrastructure, such as the representation of individuals, basic GP operators (e.g., subtree crossover and mutation), and the general functionality to manage the overall search process. On top of this infrastructure, we have implemented customized operators for stochastic initialization and fitness evaluation, which are central to our proposed failure reproduction scheme.



Fitness evaluation is performed by executing the instrumented version of the SUT externally using the SUT Runner, with the string representation of an individual as input. When the execution terminates, its trajectory is returned to the search component, which computes the distance between the trajectory of the individual and that of the target. Since the generated sentences may contain constructs that lead the SUT to non-terminating executions, the SUT Runner executes the SUT with a timeout.

The major computational cost associated with SBFR is the cost of fitness evaluation. To reduce this cost, our tool uses caching, which minimizes the fitness evaluation cost by avoiding the re-execution of previously evaluated inputs. Consequently, the search budget is computed as the total number of unique fitness evaluations.
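A minimal sketch of such a cache is shown below; the names are hypothetical and the actual tool integrates caching with GEVA's fitness evaluation.

import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class FitnessCache {
    private final Map<String, Double> cache = new HashMap<>();
    private int uniqueEvaluations = 0; // this counter is the consumed search budget

    double fitness(String unparsedInput, ToDoubleFunction<String> runSutAndComputeDistance) {
        Double cached = cache.get(unparsedInput);
        if (cached != null) {
            return cached; // duplicate input: no new SUT execution, no budget consumed
        }
        double value = runSutAndComputeDistance.applyAsDouble(unparsedInput);
        cache.put(unparsedInput, value);
        uniqueEvaluations++;
        return value;
    }

    int consumedBudget() {
        return uniqueEvaluations;
    }
}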

To determine whether an individual triggers a failure analogous to the one observed in the field, our tool proceeds as follows. For each candidate input, the error/exception possibly generated while executing the SUT is compared with that of the reported failure. This comparison is performed by the SUT Runner by comparing the error messages and the locations where the errors manifest themselves. It then returns an Exit Status indicating success if the two failures match.
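The following sketch illustrates one way such a check could be encoded, assuming the field failure is described by its error message and crash location; the data structure and names are illustrative, and the tool's actual comparison may differ in detail.

// Sketch of matching a candidate failure against the reported field failure
// (illustrative failure record; not the actual tool's data structure).
public class FailureMatcher {
    public static class Failure {
        final String errorMessage;   // e.g., exception message or signal name
        final String location;       // e.g., "Class.method" or "file:function"
        public Failure(String errorMessage, String location) {
            this.errorMessage = errorMessage;
            this.location = location;
        }
    }

    // The Exit Status is "success" only if both message and location match.
    public static boolean matches(Failure observed, Failure reported) {
        return observed != null
                && observed.errorMessage.equals(reported.errorMessage)
                && observed.location.equals(reported.location);
    }
}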

The Instrumenter module adds software probes to the SUT at compile time for collecting call sequences when the SUT is executed. We implemented two versions of this module, one for C programs (based on the LLVM compiler infrastructure, http://llvm.org) and the other for Java programs (based on the Javassist bytecode manipulation library, http://www.csg.is.titech.ac.jp/~chiba/javassist/). As a result, the instrumented version of the SUT outputs the dynamic call sequence for the given input.
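To give a concrete flavor of the Java instrumenter, the sketch below uses Javassist to insert a call-logging probe at the entry of every declared method of a class. It is a simplified illustration, not the exact instrumentation performed by our tool; in particular, the Tracer class is hypothetical and must be resolvable by Javassist at instrumentation time.

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;

// Sketch of Javassist-based probe insertion (Tracer is a hypothetical logging class).
public class InstrumenterSketch {
    public static byte[] instrument(String className) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass cc = pool.get(className);
        for (CtMethod m : cc.getDeclaredMethods()) {
            // Log the fully qualified method name on entry, building the call sequence.
            m.insertBefore("Tracer.log(\"" + cc.getName() + "." + m.getName() + "\");");
        }
        return cc.toBytecode();   // instrumented bytecode, ready to be written or loaded
    }
}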


Note that the search is completely programming language agnostic. Handling SUTs developed in another language L would simply amount to developing an instrumentation tool for L and giving the SUT Runner the ability to run programs written in L, so that call sequences can be collected during execution. The core components of SBFR would remain unchanged.

6.3.2 Subjects

In our empirical evaluation of SBFR, we consider eleven failures from five grammar-based programs. We selected these programs because they are representative of the kind of programs we target and because the required artifacts (such as grammars, reproducible failures, and test suites) are available. As our approach deals with grammar-based programs, the corresponding grammars are generally available with the program itself. Even so, some work may still be necessary, for example to convert the available grammar into the format (BNF) accepted by our tool (a task that is usually easy to automate).

Table 6.1: Subjects used in the experimental study.

Name    Language   Size (KLOC)   # Productions   # Faults
Calc    Java       2             38              2
bc      C          12            80              1
MDSL    Java       13            140             5
PicoC   C          11            194             1
Lua     C          17            106             2

Table 6.1 presents a summary of the subject programs used in our experimental study. Calc and MDSL were also used for the experiments in the previous chapter. bc (http://www.gnu.org/software/bc/) is a command-line calculator commonly found in Linux/Unix systems. PicoC (https://code.google.com/p/picoc/) is an interpreter for a subset of the C language. Lua (http://www.lua.org) is an interpreter for the Lua scripting language. Calc and MDSL are developed in Java using the ANTLR parser generator. bc is developed in C using the Lex/Yacc parser generator tools. PicoC and Lua are developed in C, but do not rely on a parser generator.

We defined a BNF grammar for PicoC based on an existing C grammar, suitably reduced to the subset of C accepted by PicoC. We also defined a BNF grammar for Lua based on the semi-formal specification of the language provided on the official website. For the other subjects, we either extracted the grammar used by the respective parsers or used freely available grammars from the Internet.

Table 6.1 reports the number of productions in the grammar of each application, ranging from 38 for Calc to 194 for PicoC. These grammars are fairly large and complex, and bigger than those typically found in the GP literature. Even if the subject programs are not necessarily large in terms of LOC, they are challenging for input generation techniques. We tried to apply BugRedux [Jin and Orso, 2012] to reproduce the same field failures used in our experiments. However, BugRedux failed to generate any input for any of the faults considered in this study within 72 hours. The reason why BugRedux is ineffective is that its current implementation does not leverage the grammar information (as done, e.g., with "symbolic tokens" [Majumdar and Xu, 2007, Godefroid et al., 2008]), so the guided symbolic execution search gets stuck in the lexical analysis functions, which usually contain a huge number of paths.

Table 6.1 also reports the number of faults (equal to the number of failures) considered for each subject. The faults in bc, PicoC, and Lua have been selected from their respective bug tracking systems and affect the latest versions of the programs. For instance, the bc bug crashes the bc program deployed with most modern Linux systems. The bugs for Calc and MDSL were discovered by us while investigating the programs in a different experiment (presented in Chapter 5). Each fault causes a crashing failure, that is, a failure that results in the unexpected termination of the program execution. The execution data used to guide the search in SBFR are generated by test cases that expose these failures, and thus simulate the occurrence of a field failure.

6.3.3 Experiment Protocol and Settings

We evaluated the effectiveness of SBFR using random grammar-based test case generation as a baseline. For this purpose, we implemented a random generation technique (RND hereafter) that generates a new input from the grammar and executes the SUT with that input. If the input triggers the desired failure, a solution is found and RND stops. If the input does not trigger the failure, another input is generated and evaluated. This process continues until either a solution is found or the search budget is exhausted.
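A minimal sketch of the RND loop, assuming a sentence generator derived from the grammar and a Boolean failure check (both with illustrative signatures):

import java.util.function.Predicate;
import java.util.function.Supplier;

// Sketch of the random grammar-based baseline (RND); names are illustrative.
public class RndBaseline {
    // Returns the failure-triggering input, or null if the budget is exhausted.
    public static String search(Supplier<String> sentenceGenerator,
                                Predicate<String> triggersFailure,
                                int budget) {
        for (int i = 0; i < budget; i++) {
            String input = sentenceGenerator.get();   // derive a sentence from the grammar
            if (triggersFailure.test(input)) {
                return input;                          // solution found, RND stops
            }
        }
        return null;                                   // budget exhausted without success
    }
}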

When seeding is employed (related to RQ3), for each of the considered subjects we used the stochastic grammar learned from a corpus of human-written tests (see Section 5.1) to generate inputs. Hence, in SBFR the initial population is generated from the stochastic grammar, rather than using the 80/20 rule. Similarly, in the case of RND, inputs are generated from the stochastic grammar.
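As an illustration of how the learned probabilities come into play during seeding, the following sketch chooses among the alternative productions of a nonterminal according to their learned weights (a standard roulette-wheel choice; the data structures are illustrative, not the tool's internal representation):

import java.util.List;
import java.util.Random;

// Sketch of probability-weighted production choice in a stochastic grammar.
public class StochasticChoice {
    // weights.get(i) is the learned probability of the i-th alternative of a nonterminal.
    public static int chooseProduction(List<Double> weights, Random rnd) {
        double total = 0.0;
        for (double w : weights) total += w;
        double r = rnd.nextDouble() * total;
        double acc = 0.0;
        for (int i = 0; i < weights.size(); i++) {
            acc += weights.get(i);
            if (r < acc) return i;
        }
        return weights.size() - 1;   // numerical safety for r == total
    }
}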

Since both SBFR and RND involve non-deterministic actions, we ran each technique 10 times for every failure considered in the experiment. For each such run, we recorded whether or not the failure was reproduced and, if reproduced, how much of the search budget was consumed to reproduce it. If the failure was not reproduced after consuming the entire budget, the search was deemed unsuccessful. We calculated the failure reproduction probability (FRP) as the number of runs that reproduced the failure divided by the total number of runs (i.e., 10) for each subject. For example, using SBFR with the 80/20 rule, we reproduced Calc Bug 1 in 6 runs out of 10, hence FRP = 0.6.


When there was no statistically significant difference in FRP, we measured a secondary effectiveness indicator, which accounts for the computational cost incurred by each technique to achieve the measured FRP: the number of fitness evaluations (FIT). Fitness evaluation represents the main computational cost for both SBFR and RND, and largely dominates all other computational costs. Therefore, we used FIT as an indicator of the cost of failure reproduction and measured it to assess whether SBFR offers any cost saving compared to RND when both achieve the same FRP. In our experiments, this happened only for one of the bugs (discussed in Section 6.3.4).

We also measured the execution time of the SUT before and after instrumentation, so as to determine the time overhead imposed on the end user by the instrumentation. Specifically, we ran all test cases available for each subject used in our experimental study and measured the associated execution time with (ET′) and without (ET) instrumentation. The percentage increment of the test suite execution time is used to quantify the overhead introduced by the instrumentation. We also measured the size (SZ) of the trace files used to store the call sequences associated with failing executions, so as to assess the space overhead imposed by SBFR. We consider the size of the trace files both before and after compressing them (ZSZ), as in practice such data can be stored (and transferred over networks) in compressed format.
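For clarity, the time overhead reported later in Table 6.4 is the relative increment of the test suite execution time; in our reading of the protocol described above, it corresponds to:

\[ \Delta ET\% = \frac{ET' - ET}{ET} \times 100 \]

For example, for Calc this gives (4.47 − 4.28)/4.28 × 100 ≈ 4.4%, matching the value reported in the table.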

As there are several parameters that control the search process, we performed a sensitivity analysis to determine appropriate values for the dominant search parameters in our experiments. The values we used are: population size of 500; crossover probability of 0.8; mutation probability of 0.2; three-way tournament selection, preserving the elite; total search budget of 10,000 unique fitness evaluations.


6.3.4 Results

Table 6.2 presents the results of our empirical evaluation. For each bug, the table reports FRP for both SBFR and RND, together with the results of the Wilcoxon statistical test of significance and the effect size computed using the Vargha-Delaney (A12) statistic. For Bug3 of MDSL, both SBFR and RND are able to reproduce the failure, so we further compared the search budget consumed by each technique to reproduce the failure (the FIT metric discussed above). However, a Wilcoxon test (p-value 0.4813) shows that there is no significant difference in the consumption of the search budget either.

Table 6.2: Failure reproduction probability for RND and SBFR. Statistically significant p-values are shown in boldface.

Subject:Bug    FRP (RND)   FRP (SBFR)   p-value   A12    Magnitude
Calc Bug1      0.0         0.6          0.00502   0.80   Large
Calc Bug2      0.0         0.8          0.00044   0.90   Large
bc             0.0         1.0          0.00002   1.00   Large
MDSL Bug1      0.0         1.0          0.00002   1.00   Large
MDSL Bug2      0.0         1.0          0.00002   1.00   Large
MDSL Bug3      1.0         1.0          NA        0.50   -
MDSL Bug4      0.0         1.0          0.00002   1.00   Large
MDSL Bug5      0.0         1.0          0.00002   1.00   Large
PicoC          0.0         0.0          NA        0.50   -
Lua Bug 1      0.0         0.0          NA        0.50   -
Lua Bug 2      0.0         0.0          NA        0.50   -

Table 6.3 shows the size of the execution trace collected for each crash, before and after compression with the zip utility. As can be seen from the table, the size of the traces, especially after compression, is almost negligible.

Table 6.4 reports execution times with and without instrumentation. The execution time overhead ranges between 2.8% and 16.4%. The instrumentation relies on buffering to minimize the number of disk writes.


Table 6.3: Uncompressed (SZ) and compressed (ZSZ) execution trace size.

Subject      SZ (KB)   ZSZ (KB)
Calc Bug1    3.50      0.49
Calc Bug2    1.60      0.40
bc           12.00     0.46
MDSL Bug1    1.20      0.56
MDSL Bug2    1.30      0.56
MDSL Bug3    2.90      0.60
MDSL Bug4    3.30      0.68
MDSL Bug5    0.66      0.46
PicoC        8.40      0.55
Lua Bug 1    75.92     2.37
Lua Bug 2    62.40     1.77

Table 6.4: Test suite execution time before and after instrumentation.

Subject   ET (sec)   ET′ (sec)   ∆ET%
Calc      4.28       4.47        4.4%
bc        7.57       8.81        16.4%
MDSL      15.97      16.64       4.2%
PicoC     1.00       1.11        11%
Lua       1.38       1.42        2.8%

In practice, the size of the execution data, consisting only of call sequences, is usually small enough to be kept entirely in memory during a program execution. Since a trace is dumped to file only upon a crash, for normal (non-failing) executions the entire trace can be kept in memory, and no disk write operation is required.
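A minimal sketch of this buffering scheme, which keeps the call sequence in memory and writes it to disk only when an abnormal termination is detected (illustrative names; in our tool the probes are inserted by the Instrumenter):

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of in-memory call-sequence buffering with a dump only on crash.
public class TraceBuffer {
    private static final List<String> calls = new ArrayList<>();

    public static void log(String methodName) {
        calls.add(methodName);                    // no disk write during normal execution
    }

    // Invoked only when an abnormal termination is detected, e.g., from an
    // uncaught-exception handler installed in the instrumented program.
    public static void dump(String traceFile) {
        try (FileWriter out = new FileWriter(traceFile)) {
            for (String call : calls) {
                out.write(call);
                out.write('\n');
            }
        } catch (IOException e) {
            // best effort: losing the trace must not mask the original failure
        }
    }
}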

Table 6.5 presents the results of using stochastic grammar learning for the generation of test cases in RND and for seeding the initial population of SBFR. We consider only the bugs that could not be reproduced by SBFR when the 80/20 rule is used for initialization (see Table 6.2). As the table shows, the FRP for SBFR is significantly higher than that of RND for two of the three bugs considered. For the third bug (Lua Bug 1), both SBFR and RND are unable to reproduce the failure.

Table 6.5: Failure reproduction probability (FRP) for SBFR and RND with initialization using the learned stochastic grammar, rather than the 80/20 rule.

Subject:Bug   FRP (RND)   FRP (SBFR)   p-value   A12    Magnitude
PicoC         0.1         0.8          0.00250   0.85   Large
Lua Bug 1     0.0         0.0          NA        0.50   -
Lua Bug 2     0.0         0.5          0.01365   0.75   Large

6.3.5 Discussion

As Table 6.2 shows, SBFR with 80/20 seeding was able to reproduce all failures but three, namely the failures of PicoC and Lua. Grammar-based test case generation using RND (80/20), conversely, was able to reproduce only one of the eleven failures, namely Bug3 of MDSL, which both SBFR and RND reproduce with probability 1. After further investigation of the results, we discovered that this failure is relatively easy to reproduce, as it is triggered by an input MDSL program that calls a method on an undeclared object, which is automatically initialized to null.

Let us consider a specific example in which SBFR is successful and RND fails. The failure in bc is a segmentation fault that happens when performing memory allocation under very specific circumstances [Lu et al., 2005]; the failure is triggered by an instruction sequence that allocates at least 32 arrays and declares a number of variables higher than the number of allocated arrays. SBFR successfully recreates the input sequence that leads to this failure, while RND is unable to reproduce the failure.

Furthermore, from Tables 6.2 and 6.5 it can be seen that, whenever a significant difference is reported, the magnitude of the effect size as computed by the Vargha-Delaney (A12) statistic is always large (shown in the last column of the respective tables). This means that the statistical evidence in support of the effectiveness of SBFR as compared to the baseline (RND) is strong.

Based on the results we obtained, we can answer RQ1 and state that SBFR is effective in reproducing real field failures for programs with structured, grammar-based input, while RND is not.

With respect to the overhead imposed by SBFR's instrumentation, which is the topic of RQ2, Table 6.3 shows that the size of the collected execution data (call sequences, in this case) is very small. In the worst case, for Lua Bug 1, the size of the uncompressed trace is 75.92 KB (2.37 KB compressed). Overall, the average uncompressed trace size is 15.74 KB, while the average compressed trace size is 0.8 KB.

As Table 6.4 shows, the execution time overhead imposed by SBFR's instrumentation is also acceptable for all five subjects considered, with an average overhead of about 8%. Moreover, we expect these results to represent worst-case scenarios, for several reasons. First, all of these applications are processing-intensive and have no interaction with the user; the overhead would typically decrease dramatically for interactive applications, in which idle time is dominant. Second, these are for the most part short executions, in which the fixed cost of the instrumentation's initialization is not amortized over time. Third, it is always possible to sample and collect only partial execution data [Jin and Orso, 2012]. Finally, we use an unoptimized instrumentation; more advanced implementations could considerably reduce the time overhead imposed on the instrumented programs.

In summary, we can answer RQ2 in a positive way: according to our results, SBFR imposes almost negligible space overhead and acceptable time overhead in all cases considered.

With respect to the role of input seeding (RQ3), as can be seen from Table 6.5, for two of the three bugs the FRP of SBFR improved significantly with the aid of the learned grammar, while RND is still not able to reproduce any of the three bugs using the stochastic grammar. As shown in Table 6.2, these three failures are particularly difficult to reproduce using initialization with the 80/20 rule. For instance, the failure in PicoC is a segmentation fault caused by an incorrect use of pointers. The test case that triggers the failure, from the original bug report, contains the following statements: int n =5; int *k; k = &n; **k = &n; In particular, the failure is caused by the last assignment statement. Assignment statements, especially those associated with complex expressions, involve deeply nested and recursive grammar definitions. As a result, generating such types of statements from a grammar using randomized techniques is quite difficult, and the derivation process for such constructs either stops prematurely or goes into infinite recursion. With learning, these kinds of constructs can easily be generated in the initial population. GP operators can then make use of these basic constructs, by manipulating and exchanging them, to evolve trees with the constructs necessary for reproducing the failure at hand.

Lua Bug 1 is reproduced by neither SBFR nor RND, even with learning. This bug involves specific invocations of built-in functions of the language (in particular, calls to print and load). Such types of input are very difficult to generate from the grammar alone, because they depend on the identifiers instantiating the grammar tokens, in addition to the input structure. Unless the grammar is augmented with the names of built-in functions and library names of the language, it would be extremely difficult for the search to evolve inputs with the desired structure and containing the right names. In our experiments, we used a simple token instantiation strategy in SBFR, where a pool of random token instances is first generated; later, during test case generation and evolution, only token instances from the pool are used for newly created tokens. While this strategy works well for bugs that involve only user-defined functions, it fails when built-in or library functions must be called.
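The token instantiation strategy described above can be sketched as follows; the code is illustrative, and, for instance, seeding the pool with built-in names such as print and load would be one way to address cases like Lua Bug 1.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the token-instance pool used during sentence generation (illustrative).
public class TokenPool {
    private final List<String> identifiers = new ArrayList<>();
    private final Random rnd = new Random();

    // Fill the pool once with random identifiers; it could optionally be seeded
    // with built-in or library names to cover failures that require them.
    public TokenPool(int size) {
        for (int i = 0; i < size; i++) {
            identifiers.add("id" + rnd.nextInt(1000));
        }
    }

    // Newly created identifier tokens are always drawn from the pool.
    public String freshIdentifier() {
        return identifiers.get(rnd.nextInt(identifiers.size()));
    }
}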


When considering RQ3, based on our results we can conclude that, for cases where the grammar of the SUT contains complex structures, learning a stochastic grammar from a corpus of existing inputs can substantially improve the effectiveness of SBFR.

6.3.6 Threats to Validity

The main threats to the validity of our results are internal, construct, conclusion, and external validity threats.

Internal validity threats concern factors that may affect a dependent variable and were not considered in the study. In our case, different grammar-based test case generators may have different failure reproduction performance. We have chosen RND (with the 80/20 rule or the learned stochastic grammar), since it is representative of state-of-the-art tools for random, grammar-based test case generation. Further experiments using other generators are necessary to increase our confidence in the results.

Construct validity threats concern the relationship between theory and observation. We have carefully chosen the experimental measures used to answer RQ1, RQ2, and RQ3. In particular, the metrics used in the evaluation (FRP, FIT, SZ, ZSZ, ET) are direct measures of the effects under investigation. Moreover, all these metrics have been measured objectively, using tools.

Conclusion validity threats concern the relationship between the treatment (SBFR vs. grammar-based test generation) and the measured performance. To address this threat, we have drawn conclusions only when performance differences were reported to be statistically significant at level 0.05 by the Wilcoxon test.

External validity threats are related to the generalizability of the results. We considered five subjects and eleven failures, with three subjects involving a moderately complex grammar and two subjects involving a fairly complex grammar. Generalization to other subjects should be done with care, especially if the associated grammar is highly complex. We plan to replicate our experiment on more subjects to increase our confidence in the external validity and generalizability of our findings.

6.4 Related Work

SBFR's aim is to recreate, in the developer's environment, failures observed in the field. In this regard, a closely related work is BugRedux [Jin and Orso, 2012], which applies a similar overall approach for reproducing field failures. There are two major differences between SBFR and BugRedux: (1) SBFR deals with programs whose inputs are highly structured (defined by CFGs), while BugRedux deals with programs whose inputs are not structured (e.g., numeric, string, etc.); (2) SBFR uses evolutionary search algorithms to search for the desired inputs, while BugRedux uses a guided symbolic execution approach to perform the search. As a result, BugRedux is limited in handling programs that can generate complex constraints that cannot be solved by the underlying constraint solver.

Another closely related work is RECORE [Rossler et al., 2013], which uses GAs to synthesize executions from crash call stacks for reproducing unit-level failures in standalone libraries. The fitness function employed by RECORE relies on aligning the stack traces and minimizing the object distance computed from the values in the stack. For system-level failure reproduction, the guidance gained from the distances between the objects (which is normalized to a value in [0,1) and averaged over all variables) is, in our experience, minimal, while the exposure of sensitive user data from these variables would generally be unacceptable. While value distances could easily be integrated into SBFR's fitness function (e.g., by collecting parameter values as part of the call sequences [Kifetew, 2012]), in our experience this does not yield a gain in failure reproduction power significant enough to justify the potential exposure of sensitive data.

6.5 Conclusion

We have presented SBFR, a technique that leverages genetic programming to generate complex and structured test inputs capable of reproducing failures observed in the field. SBFR evolves a population of candidate failure-inducing inputs by means of genetic operators that manipulate parse-tree representations of the inputs. Evolution is guided by a fitness function that measures the distance between the execution trace (call sequence) of the observed failure and those of the generated test cases.

In our empirical evaluation, SBFR widely outperformed random grammar-based test case generation, as well as BugRedux, a field failure reproduction technique based on guided symbolic execution. For subjects with moderately complex grammars describing the structured input, no stochastic grammar learning is needed to produce the initial population evolved by SBFR. For subjects involving more complex grammars (e.g., a program that accepts as input a large subset of the C language), our results show that the learning component of our approach can dramatically improve the effectiveness of SBFR. Overall, SBFR was able to successfully reproduce 10 out of the 11 failures considered, while a purely random technique was able to reproduce only 1 of the failures, and BugRedux none of them.

Future work could investigate (1) additional empirical studies on programs with highly complex input grammars (e.g., JavaScript), (2) selective instrumentation techniques to further reduce the overhead imposed by SBFR without degrading its performance, and (3) hybrid approaches to field failure reproduction that combine the strengths of symbolic execution and genetic programming and can handle a broader class of programs than the two approaches in isolation.


Chapter 7

Conclusion

Software testing is an important activity in the software development process and, as such, it accounts for a significant portion of the overall budget, mainly because it relies heavily on test engineers writing test cases, which is costly. For this reason, research has focused on automating the process of generating test cases for several decades now. However, a survey of the software testing literature still reveals gaps that call for further work. During unit testing, state-of-the-art approaches (such as whole suite optimization [Fraser and Arcuri, 2013]) consider test generation from the perspective of single-objective optimization. As a result, they are not able to exploit the full potential of optimization algorithms, in particular many-objective optimization. At the system level, test case generation, especially for grammar-based programs, is in need of approaches that are able to scale to realistic programs and realistic grammars.

In this thesis, we have introduced and evaluated novel approaches for test case generation both at the unit and at the system level, for the purpose of coverage testing as well as field failure reproduction. At the unit level, we have reformulated branch coverage as a many-objective optimization problem and proposed a novel algorithm that is able to scale quite easily to programs with more than a thousand branches. We have shown that our algorithm outperforms the state-of-the-art techniques for unit test generation.


At the system level, we have suitably combined stochastic grammars and genetic programming for generating test inputs from grammars. The approach is able to handle large and complex grammars (such as the JavaScript grammar) that define deeply nested and recursive structures. Furthermore, the proposed sentence generation procedure is able to handle subject programs of varying size (programs as big as Rhino, a JavaScript interpreter). Results of empirical experiments show that the proposed approach is able to achieve significant levels of system-level branch coverage as well as mutation score on real programs.

We have also shown how system-level test generation can be exploited for reproducing field failures for programs with grammar-based inputs. We have formulated a framework that is able to synthesize system-level program inputs that make the program fail in a manner similar to that reported by end users in the field. The approach is guided by a limited amount of runtime information collected during the execution of the program in the field via lightweight instrumentation. Evaluating the approach on eleven real field failures of real programs resulted in the reproduction of ten of the eleven failures.

7.1 Summary of Contributions

Overall, the contributions of this thesis are:

• a highly scalable many-objective algorithm (MOSA), custom-made for tackling test case generation where the number of test goals to satisfy (e.g., branches to cover) is considerably large. Results of empirical experiments show that MOSA is able to effectively scale up to programs with a thousand branches.

• a system-level test case generation scheme that combines stochastic grammars and genetic programming for achieving high system-level branch coverage on programs with grammar-based input. Results show that the scheme is able to scale up to large programs (e.g., Rhino) with inputs derived from complex grammars (e.g., JavaScript).

• a search-based failure reproduction scheme (SBFR) for the reproduction of field failures for programs with grammar-based input. Results show SBFR's effectiveness in reproducing real program failures (e.g., in the bc calculator) with reasonable overhead.

• a number of tools that implement the approaches proposed in the thesis:

– a tool that implements many-objective optimization for branch-adequate test case generation for Java programs.

– a tool that implements system-level test generation for branch coverage of Java programs with grammar-based input.

– a tool that implements the SBFR failure reproduction framework for grammar-based programs developed in either C or Java.

7.2 Summary of Future Work

This thesis addressed important open problems in test case generation, both at the unit and at the system level, and presented innovative solutions whose effectiveness was shown through empirical experiments. At the unit level, a highly scalable test generation algorithm (MOSA) has been presented for the generation of branch-adequate test cases. While in principle MOSA could also be applied to other adequacy criteria (e.g., mutation killing), this has to be investigated and empirically evaluated in the future, taking into account any peculiarities the criterion might have (such as the high computational expense of mutation testing).

The effectiveness of a many-objective algorithm for test case generation could also be investigated at the system level. While MOSA is able to scale to a high number of branches at the unit level, at the system level the number of branches is expected to be significantly higher (e.g., Rhino has more than six thousand branches). In such a scenario, the applicability of MOSA could be investigated, and potential countermeasures could be explored should limitations emerge. One possible direction could be to integrate a sampling mechanism in which MOSA targets a subset of the branches at a time, rather than all of them at once.

The failure reproduction scheme SBFR has been shown to be effective at reproducing real field failures with reasonable storage and computational overhead. Its overhead is due to the instrumentation used for collecting vital execution data from the field execution. In our implementation of SBFR, the focus was mainly on the test generation aspect rather than on the data collection. Future work could investigate alternative mechanisms for collecting execution data in a smarter way, so as to minimize the overhead imposed on the client.


Bibliography

[Ali et al., 2010] Ali, S., Briand, L., Hemmati, H., and Panesar-Walawege, R. (2010). A systematic review of the application and empirical investigation of search-based test case generation. Software Engineering, IEEE Transactions on, 36(6):742–762.

[Anand et al., 2013] Anand, S., Burke, E. K., Chen, T. Y., Clark, J. A., Cohen, M. B., Grieskamp, W., Harman, M., Harrold, M. J., and McMinn, P. (2013). An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software, 86(8):1978–2001.

[Arcuri, 2010] Arcuri, A. (2010). It Does Matter How You Normalise the Branch Distance in Search Based Software Testing. In Software Testing, Verification and Validation (ICST), 2010 Third International Conference on, pages 205–214. IEEE.

[Arcuri and Fraser, 2013] Arcuri, A. and Fraser, G. (2013). Parameter tuning or default values? An empirical investigation in search-based software engineering. Empirical Software Engineering, 18(3):594–623.

[Arcuri and Fraser, 2014] Arcuri, A. and Fraser, G. (2014). On the effectiveness of whole test suite generation. In Search-Based Software Engineering, volume 8636 of Lecture Notes in Computer Science, pages 1–15. Springer International Publishing.

[Arcuri et al., 2010a] Arcuri, A., Iqbal, M. Z., and Briand, L. (2010a). Formal analysis of the effectiveness and predictability of random testing. In Proceedings of the 19th International Symposium on Software Testing and Analysis, ISSTA '10, pages 219–230. ACM.

[Arcuri et al., 2010b] Arcuri, A., Iqbal, M. Z., and Briand, L. (2010b). Formal analysis of the effectiveness and predictability of random testing. In Proceedings of the 19th International Symposium on Software Testing and Analysis, ISSTA '10, pages 219–230, New York, NY, USA. ACM.

[Artzi et al., 2008] Artzi, S., Kim, S., and Ernst, M. D. (2008). ReCrash: Making Software Failures Reproducible by Preserving Object States. In Proceedings of the 22nd European Conference on Object-Oriented Programming, pages 542–565.

[Baars et al., 2011] Baars, A., Harman, M., Hassoun, Y., Lakhotia, K., McMinn, P., Tonella, P., and Vos, T. (2011). Symbolic search-based testing. In Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on, pages 53–62.

[Bertolino, 2007] Bertolino, A. (2007). Software testing research: Achievements, challenges, dreams. In 2007 Future of Software Engineering, FOSE '07, pages 85–103, Washington, DC, USA. IEEE Computer Society.

[Bettenburg et al., 2008] Bettenburg, N., Just, S., Schroter, A., Weiss, C., Premraj, R., and Zimmermann, T. (2008). What makes a good bug report? In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '08/FSE-16, pages 308–318. ACM.

[Beyene and Andrews, 2012] Beyene, M. and Andrews, J. H. (2012). Generating string test data for code coverage. In Proceedings of the International Conference on Software Testing, Verification, and Validation (ICST), pages 270–279.

[Booth and Thompson, 1973] Booth, T. L. and Thompson, R. A. (1973). Applying probability measures to abstract languages. Computers, IEEE Transactions on, 100(5):442–450.

[Cadar et al., 2011] Cadar, C., Godefroid, P., Khurshid, S., Pasareanu, C. S., Sen, K., Tillmann, N., and Visser, W. (2011). Symbolic execution for software testing in practice: Preliminary assessment. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 1066–1071, New York, NY, USA. ACM.

[Chandra et al., 2009] Chandra, S., Fink, S. J., and Sridharan, M. (2009). Snugglebug: A Powerful Approach to Weakest Preconditions. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 363–374.

[Chen et al., 2001] Chen, S.-K., Fuchs, W. K., and Chung, J.-Y. (2001). Reversible debugging using program instrumentation. IEEE Transactions on Software Engineering, 27(8):715–727.

[Chen et al., 2005] Chen, T., Leung, H., and Mak, I. (2005). Adaptive random testing. In Maher, M., editor, Advances in Computer Science - ASIAN 2004. Higher-Level Decision Making, volume 3321 of Lecture Notes in Computer Science, pages 320–329. Springer Berlin Heidelberg.

[Chilimbi et al., 2009] Chilimbi, T. M., Liblit, B., Mehra, K., Nori, A. V., and Vaswani, K. (2009). HOLMES: Effective Statistical Debugging via Efficient Path Profiling. In ICSE 2009, pages 34–44.

[Ciupa et al., 2008] Ciupa, I., Leitner, A., Oriol, M., and Meyer, B. (2008). Artoo. In Software Engineering, 2008. ICSE '08. ACM/IEEE 30th International Conference on, pages 71–80.

[Claessen and Hughes, 2011] Claessen, K. and Hughes, J. (2011). QuickCheck: a lightweight tool for random testing of Haskell programs. ACM SIGPLAN Notices, 46(4):53–64.

[Clause and Orso, 2007] Clause, J. and Orso, A. (2007). A Technique for Enabling and Supporting Debugging of Field Failures. In ICSE 2007, pages 261–270.

[Conover, 1998] Conover, W. J. (1998). Practical Nonparametric Statistics. Wiley, 3rd edition.

[Cormen et al., 1990] Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1990). Introduction to Algorithms. MIT Press.

[Crameri et al., 2011] Crameri, O., Bianchini, R., and Zwaenepoel, W. (2011). Striking a New Balance Between Program Instrumentation and Debugging Time. In Proceedings of the 6th European Conference on Computer Systems, pages 199–214.

[Csallner et al., 2008] Csallner, C., Smaragdakis, Y., and Xie, T. (2008). DSD-Crasher: A hybrid analysis tool for bug finding. ACM Trans. Softw. Eng. Methodol., 17(2):1–37.

[de Moura and Bjørner, 2008] de Moura, L. and Bjørner, N. (2008). Z3: An efficient SMT solver. In Ramakrishnan, C. and Rehof, J., editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer Berlin Heidelberg.

[Deb, 2014] Deb, K. (2014). Multi-objective optimization. In Search Methodologies, pages 403–449. Springer US.

[Deb et al., 2000] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2000). A fast elitist multi-objective genetic algorithm: NSGA-II. IEEE Trans. on Evolutionary Computation, 6:182–197.

[di Pierro et al., 2007] di Pierro, F., Khu, S.-T., and Savic, D. (2007). An investigation on preference order ranking scheme for multiobjective evolutionary optimization. IEEE Trans. on Evolutionary Computation, 11(1):17–45.

[Duchon et al., 2004] Duchon, P., Flajolet, P., Louchard, G., and Schaeffer, G. (2004). Boltzmann samplers for the random generation of combinatorial structures. Combinatorics, Probability and Computing, 13(4-5):577–625.

[Dutertre and de Moura, 2006] Dutertre, B. and de Moura, L. (2006). A fast linear-arithmetic solver for DPLL(T). In Ball, T. and Jones, R., editors, Computer Aided Verification, volume 4144 of Lecture Notes in Computer Science, pages 81–94. Springer Berlin Heidelberg.

[Eiben and Smith, 2003] Eiben, A. E. and Smith, J. E. (2003). Introduction to Evolutionary Computing. Springer Science & Business Media.

[Elbaum et al., 2006] Elbaum, S., Chin, H. N., Dwyer, M. B., and Dokulil, J. (2006). Carving differential unit test cases from system test cases. In Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT '06/FSE-14, pages 253–264, New York, NY, USA. ACM.

[Feldt and Poulding, 2013] Feldt, R. and Poulding, S. (2013). Finding test data with specific properties via metaheuristic search. In Software Reliability Engineering (ISSRE), 2013 IEEE 24th International Symposium on, pages 350–359. IEEE.

[Ferrer et al., 2012] Ferrer, J., Chicano, F., and Alba, E. (2012). Evolutionary algorithms for the multi-objective test data generation problem. Software Practice & Experience, 42(11):1331–1362.

[Flanagan et al., 2002] Flanagan, C., Leino, K. R. M., Lillibridge, M., Nelson, G., Saxe, J. B., and Stata, R. (2002). Extended Static Checking for Java. In Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 234–245.

[Fraser and Arcuri, 2011] Fraser, G. and Arcuri, A. (2011). EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE '11, pages 416–419, Szeged, Hungary.

[Fraser and Arcuri, 2013] Fraser, G. and Arcuri, A. (2013). Whole test suite generation. IEEE Transactions on Software Engineering, 39(2):276–291.

[Galeotti et al., 2013] Galeotti, J., Fraser, G., and Arcuri, A. (2013). Improving search-based test suite generation with dynamic symbolic execution. In Software Reliability Engineering (ISSRE), 2013 IEEE 24th International Symposium on, pages 360–369.

[Ganesh and Dill, 2007] Ganesh, V. and Dill, D. (2007). A decision procedure for bit-vectors and arrays. In Damm, W. and Hermanns, H., editors, Computer Aided Verification, volume 4590 of Lecture Notes in Computer Science, pages 519–531. Springer Berlin Heidelberg.

[Godefroid et al., 2008] Godefroid, P., Kiezun, A., and Levin, M. Y. (2008). Grammar-based whitebox fuzzing. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 206–215.

[Godefroid et al., 2005] Godefroid, P., Klarlund, N., and Sen, K. (2005). DART: directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, volume 40 of PLDI '05, pages 213–223. ACM.

[Gross et al., 2012] Gross, F., Fraser, G., and Zeller, A. (2012). Search-based system testing: High coverage, no false alarms. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA 2012, pages 67–77, New York, NY, USA. ACM.

[Grune and Jacobs, 1990] Grune, D. and Jacobs, C. J. H. (1990). Parsing Techniques: A Practical Guide. Ellis Horwood Limited, Chichester, England.

[Guo and Qiu, 2014] Guo, H.-F. and Qiu, Z. (2014). A dynamic stochastic model for automatic grammar-based test generation. Software: Practice and Experience.

[Handl et al., 2008] Handl, J., Lovell, S. C., and Knowles, J. (2008). Multiobjectivization by decomposition of scalar cost functions. In Parallel Problem Solving from Nature, volume 5199, pages 31–40. Springer Berlin Heidelberg.

[Harman et al., 2015] Harman, M., Jia, Y., and Zhang, Y. (2015). Achievements, open problems and challenges for search based software testing. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on, pages 1–12.

[Harman and Jones, 2001] Harman, M. and Jones, B. F. (2001). Search-based software engineering. Information and Software Technology, 43(14):833–839.

[Harman et al., 2010] Harman, M., Kim, S. G., Lakhotia, K., McMinn, P., and Yoo, S. (2010). Optimizing for the number of tests generated in search based test data generation with an application to the oracle cost problem. In 3rd International Conference on Software Testing, Verification, and Validation Workshops (ICSTW), pages 182–191.

[Harman et al., 2012] Harman, M., Mansouri, S. A., and Zhang, Y. (2012). Search-based software engineering: Trends, techniques and applications. ACM Comput. Surv., 45(1):11:1–11:61.

[Harman and McMinn, 2010] Harman, M. and McMinn, P. (2010). A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search. IEEE Transactions on Software Engineering, 36(2):226–247.

[Hennessy and Power, 2005] Hennessy, M. and Power, J. F. (2005). An analysis of rule coverage as a criterion in generating minimal test suites for grammar-based software. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE '05, pages 104–113, New York, NY, USA. ACM.

[Highsmith and Cockburn, 2001] Highsmith, J. and Cockburn, A. (2001). Agile software development: the business of innovation. Computer, 34(9):120–127.

[Holland, 1975] Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

[Horoba and Neumann, 2008] Horoba, C. and Neumann, F. (2008). Benefits and drawbacks for the use of epsilon-dominance in evolutionary multi-objective optimization. In 10th Conference on Genetic and Evolutionary Computation, GECCO '08, pages 641–648, New York, NY, USA. ACM.

[Inkumsah and Xie, 2008] Inkumsah, K. and Xie, T. (2008). Improving structural testing of object-oriented programs via integrating evolutionary testing and symbolic execution. In Automated Software Engineering, 2008. ASE 2008. 23rd IEEE/ACM International Conference on, pages 297–306.

[Jiang and Su, 2007] Jiang, L. and Su, Z. (2007). Context-aware Statistical Debugging: From Bug Predictors to Faulty Control Flow Paths. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 184–193.

[Jin and Orso, 2012] Jin, W. and Orso, A. (2012). BugRedux: Reproducing field failures for in-house debugging. In Proc. of the 34th International Conference on Software Engineering (ICSE), pages 474–484.

[Kifetew, 2012] Kifetew, F. (2012). A search-based framework for failure reproduction. In Fraser, G. and Teixeira de Souza, J., editors, Search Based Software Engineering, volume 7515 of Lecture Notes in Computer Science, pages 279–284. Springer Berlin Heidelberg.

[Kifetew et al., 2013] Kifetew, F. M., Panichella, A., Lucia, A. D., Oliveto, R., and Tonella, P. (2013). Orthogonal exploration of the search space in evolutionary test case generation. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 257–267.

[King, 1976] King, J. C. (1976). Symbolic Execution and Program Testing. Communications of the ACM, 19(7):385–394.

[King et al., 2005] King, S. T., Dunlap, G. W., and Chen, P. M. (2005). Debugging operating systems with time-traveling virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 1–1.

[Knowles et al., 2001] Knowles, J., Watson, R. A., and Corne, D. (2001). Reducing local optima in single-objective problems by multi-objectivization. In Evolutionary Multi-Criterion Optimization, volume 1993 of Lecture Notes in Computer Science, pages 269–283. Springer Berlin Heidelberg.

[Koza, 1994] Koza, J. (1994). Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2).

[Lakhotia et al., 2007] Lakhotia, K., Harman, M., and McMinn, P. (2007). A multi-objective approach to search-based test data generation. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO '07, pages 1098–1105. ACM.

[Lari and Young, 1990] Lari, K. and Young, S. J. (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35–56.

[Laumanns et al., 2002] Laumanns, M., Thiele, L., Deb, K., and Zitzler, E. (2002). Combining convergence and diversity in evolutionary multiobjective optimization. Evolutionary Computation, 10(3):263–282.

[Li et al., 2015] Li, B., Li, J., Tang, K., and Yao, X. (2015). Many-objective evolutionary algorithms: A survey. ACM Comput. Surv., 48(1):13:1–13:35.

[Lu et al., 2005] Lu, S., Li, Z., Qin, F., Tan, L., Zhou, P., and Zhou, Y. (2005). BugBench: Benchmarks for Evaluating Bug Detection Tools. In Workshop on the Evaluation of Software Defect Detection Tools.

[Majumdar and Xu, 2007] Majumdar, R. and Xu, R.-G. (2007). Directed test generation using symbolic grammars. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 134–143.

[Malburg and Fraser, 2011] Malburg, J. and Fraser, G. (2011). Combining search-based and constraint-based testing. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, ASE '11, pages 436–439, Washington, DC, USA. IEEE Computer Society.

[Maurer, 1990] Maurer, P. M. (1990). Generating test data with enhanced context-free grammars. Software, IEEE, 7(4):50–55.

[McKay et al., 2010] McKay, R. I., Hoai, N. X., Whigham, P. A., Shan, Y., and O'Neill, M. (2010). Grammar-based genetic programming: a survey. Genetic Programming and Evolvable Machines, 11(3-4):365–396.

[McMinn, 2004] McMinn, P. (2004). Search-based software test data generation: a survey. Softw. Test. Verif. Reliab., 14(2):105–156.

[Michael et al., 2001] Michael, C., McGraw, G., and Schatz, M. (2001). Generating software test data by evolution. Software Engineering, IEEE Transactions on, 27(12):1085–1110.

[Nanda and Sinha, 2009] Nanda, M. G. and Sinha, S. (2009). Accurate Interprocedural Null-Dereference Analysis for Java. In Proceedings of the 31st International Conference on Software Engineering, pages 133–143.

[Narayanasamy et al., 2005] Narayanasamy, S., Pokam, G., and Calder, B. (2005). BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging. SIGARCH Comput. Archit. News, 33(2):284–295.

[Netzer and Weaver, 1994] Netzer, R. H. B. and Weaver, M. H. (1994). Optimal tracing and incremental reexecution for debugging long-running programs. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, PLDI '94, pages 313–325, New York, NY, USA. ACM.

[O'Neill et al., 2008] O'Neill, M., Hemberg, E., Gilligan, C., Bartley, E., McDermott, J., and Brabazon, A. (2008). GEVA: grammatical evolution in Java. ACM SIGEVOlution, 3(2):17–22.

[Orso et al., 2006] Orso, A., Joshi, S., Burger, M., and Zeller, A. (2006). Isolating relevant component interactions with JINSI. In Proceedings of the 2006 International Workshop on Dynamic Systems Analysis, WODA '06, pages 3–10, New York, NY, USA. ACM.

[Orso and Kennedy, 2005] Orso, A. and Kennedy, B. (2005). Selective capture and replay of program executions. In Proceedings of the Third International Workshop on Dynamic Analysis, WODA '05, pages 1–7, New York, NY, USA. ACM.

[Oster and Saglietti, 2006] Oster, N. and Saglietti, F. (2006). Automatic test data generation by multi-objective optimisation. In Computer Safety, Reliability, and Security, volume 4166 of Lecture Notes in Computer Science, pages 426–438. Springer Berlin Heidelberg.

[Park et al., 2009] Park, S., Zhou, Y., Xiong, W., Yin, Z., Kaushik, R., Lee, K. H., and Lu, S. (2009). PRES: Probabilistic Replay with Execution Sketching on Multiprocessors. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 177–192.

[Parr and Quong, 1994] Parr, T. J. and Quong, R. W. (1994). Adding semantic and syntactic predicates to LL(k): pred-LL(k). In Compiler Construction, pages 263–277. Springer.

[Pezze and Young, 2007] Pezze, M. and Young, M. (2007). Software Testing and Analysis: Process, Principles and Techniques. John Wiley and Sons.

[Pinto and Vergilio, 2010] Pinto, G. and Vergilio, S. (2010). A multi-objective genetic algorithm to test data generation. In 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), volume 1, pages 129–134.

[Poulding et al., 2013] Poulding, S., Alexander, R., Clark, J. A., and Hadley, M. J. (2013). The optimisation of stochastic grammars to enable cost-effective probabilistic structural testing. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO '13, pages 1477–1484, New York, NY, USA. ACM.

[Pasareanu et al., 2011] Pasareanu, C. S., Rungta, N., and Visser, W. (2011). Symbolic execution with mixed concrete-symbolic solving. In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11, pages 34–44, New York, NY, USA. ACM.

[Purdom, 1972] Purdom, P. (1972). A sentence generator for testing parsers. BIT Numerical Mathematics, 12:366–375. doi:10.1007/BF01932308.

[Ronsse and De Bosschere, 1999] Ronsse, M. and De Bosschere, K. (1999). RecPlay: a fully integrated practical record/replay system. ACM Trans. Comput. Syst., 17(2):133–152.

[Rossler et al., 2013] Rossler, J., Zeller, A., Fraser, G., Zamfir, C., and Candea, G. (2013). Reconstructing core dumps. In Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, ICST '13, pages 114–123, Washington, DC, USA. IEEE Computer Society.

[Saff et al., 2005] Saff, D., Artzi, S., Perkins, J. H., and Ernst, M. D. (2005). Automatic test factoring for Java. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE '05, pages 114–123, New York, NY, USA. ACM.

[Sen et al., 2005] Sen, K., Marinov, D., and Agha, G. (2005). CUTE: A Concolic Unit Testing Engine for C. In Proceedings of the 10th European Software Engineering Conference and 13th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 263–272.

[Srinivasan et al., 2004] Srinivasan, S. M., Kandula, S., Andrews, C. R., and Zhou, Y. (2004). Flashback: A lightweight extension for rollback and deterministic replay for software debugging. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC '04, pages 3–3, Berkeley, CA, USA. USENIX Association.

[Tillmann and de Halleux, 2008] Tillmann, N. and de Halleux, J. (2008). Pex – white box test generation for .NET. In Beckert, B. and Hähnle, R., editors, Tests and Proofs, volume 4966 of Lecture Notes in Computer Science, pages 134–153. Springer Berlin Heidelberg.

[Tonella, 2004] Tonella, P. (2004). Evolutionary testing of classes. In Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’04, pages 119–128. ACM.

[Vargha and Delaney, 2000] Vargha, A. and Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132.

[Venolia et al., 2005] Venolia, G. D., DeLine, R., and LaToza, T. (2005). Software development at Microsoft observed. Technical Report MSR-TR-2005-140, Microsoft Research.

[Visser et al., 2004] Visser, W., Pasareanu, C. S., and Khurshid, S. (2004). Test Input Generation with Java PathFinder. SIGSOFT Software Engineering Notes, 29(4):97–107.

[von Lucken et al., 2014] von Lucken, C., Baran, B., and Brizuela, C. (2014). A survey on multi-objective evolutionary algorithms for many-objective problems. Computational Optimization and Applications, 58(3):707–756.

[Weeratunge et al., 2010] Weeratunge, D., Zhang, X., and Jagannathan, S. (2010). Analyzing multicore dumps to facilitate concurrency bug reproduction. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, pages 155–166.


[Wohlin et al., 2000] Wohlin, C., Runeson, P., Höst, M., Ohlsson, M., Regnell, B., and Wesslén, A. (2000). Experimentation in Software Engineering - An Introduction. Kluwer Academic Publishers.

[Xie et al., 2009] Xie, T., Tillmann, N., de Halleux, J., and Schulte, W. (2009). Fitness-guided path exploration in dynamic symbolic execution. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’09), pages 359–368.

[Yang et al., 2013] Yang, S., Li, M., Liu, X., and Zheng, J. (2013). A grid-based evolutionary algorithm for many-objective optimization. IEEE Trans. on Evolutionary Computation, 17(5):721–736.

[Yuan et al., 2010] Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., and Pasupathy, S. (2010). SherLog: Error Diagnosis by Connecting Clues from Runtime Logs. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 143–154.

[Yuan et al., 2011] Yuan, D., Zheng, J., Park, S., Zhou, Y., and Savage, S. (2011). Improving Software Diagnosability via Log Enhancement. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 3–14.

[Yuan et al., 2014] Yuan, Y., Xu, H., and Wang, B. (2014). An improved NSGA-III procedure for evolutionary many-objective optimization. In 14th Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 661–668. ACM.

[Zamfir and Candea, 2010] Zamfir, C. and Candea, G. (2010). Execution Synthesis: A Technique for Automated Software Debugging. In Proceedings of the 5th European Conference on Computer Systems, pages 321–334.


[Zheng and Wu, 2009] Zheng, L. and Wu, D. (2009). A sentence generation algorithm for testing grammars. In Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference - Volume 01, COMPSAC ’09, pages 130–135, Washington, DC, USA. IEEE Computer Society.

[Zhu et al., 1997] Zhu, H., Hall, P. A. V., and May, J. H. R. (1997). Software unit test coverage and adequacy. ACM Comput. Surv., 29(4):366–427.

[Zitzler and Kunzli, 2004] Zitzler, E. and Kunzli, S. (2004). Indicator-based selection in multiobjective search. In 8th International Conference on Parallel Problem Solving from Nature (PPSN VIII), pages 832–842. Springer.

[Zitzler et al., 2001] Zitzler, E., Laumanns, M., and Thiele, L. (2001). SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Technical report, ETH Zurich.


Appendix A

The Inside-Outside Algorithm

Let us assume that grammars are in Chomsky Normal Form (CNF), i.e., they comprise only rules of the type (a) u → t or (b) u → ab, where u, a, b are non-terminal symbols and t is a terminal symbol. As every context-free grammar can be put in CNF, this does not constitute a restriction.

Let us consider an SCFG (G, p). If G is ambiguous, a sentence w from a corpus W ⊆ L(G) may be the frontier of more than one derivation tree. Thus, we cannot know which rules among the given alternatives were actually used to generate w. However, knowing the probabilities p associated to the rules of G, we can derive a probability of rule usage in the generation of sentence w.
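To make the setting concrete, the following minimal sketch (in Python; the toy grammar and all names are invented for illustration and are not part of the thesis' tooling) represents an SCFG in CNF as a map from rules to probabilities and computes the probability of one particular derivation tree as the product of the probabilities of the rules it uses.

# --- begin sketch: a toy SCFG in CNF (illustrative only) ---
# Rules are keyed by (lhs, rhs): rhs is a 1-tuple (terminal rule u -> t)
# or a 2-tuple of non-terminals (binary rule u -> a b).
p = {
    ("S", ("A", "B")): 1.0,
    ("A", ("a",)): 0.7,
    ("A", ("A", "A")): 0.3,   # makes the grammar ambiguous on longer sentences
    ("B", ("b",)): 0.6,
    ("B", ("B", "B")): 0.4,
}

def tree_probability(tree):
    """Probability of one derivation tree: the product of the probabilities of
    the rules it uses. A tree is (lhs, terminal) for u -> t, or
    (lhs, left_subtree, right_subtree) for u -> a b."""
    if len(tree) == 2:                       # terminal rule u -> t
        lhs, terminal = tree
        return p[(lhs, (terminal,))]
    lhs, left, right = tree                  # binary rule u -> a b
    return p[(lhs, (left[0], right[0]))] * tree_probability(left) * tree_probability(right)

# One derivation tree of the sentence "a b":
t = ("S", ("A", "a"), ("B", "b"))
print(round(tree_probability(t), 4))         # 1.0 * 0.7 * 0.6 = 0.42
# --- end sketch ---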

[Figure A.1 omitted: graphical representation of a derivation of w which uses non-terminal u; the start symbol s derives t1 ... ti−1 u tj+1 ... tl, with u covering ti ... tj.]

Given a derivation for the sentence w = t1...tl from the start symbol s that exhibits a non-terminal u at a certain point of the derivation (the situation is depicted in Figure A.1), we first compute two probability functions, namely the outer probability:

$$f(u, i, j) = P(s \stackrel{*}{\Rightarrow} t_1 \ldots t_{i-1}\, u\, t_{j+1} \ldots t_l),$$

i.e., the probability that the sentential form $t_1 \ldots t_{i-1}\, u\, t_{j+1} \ldots t_l$ is generated from the start symbol s, and the inner probability:

$$e(u, i, j) = P(u \stackrel{*}{\Rightarrow} t_i \ldots t_j),$$

i.e., the probability that the subsequence $t_i \ldots t_j$ of w is derived from the non-terminal u.

Probability functions e and f can be recursively computed from the probabilities p assigned to rules. For example, e(u, i, j) is defined by the following two equations:

$$e(u, i, i) = p(u \to t_i)$$
$$e(u, i, j) = \sum_{a,b} \sum_{k=i}^{j-1} p(u \to ab)\, e(a, i, k)\, e(b, k+1, j), \quad \text{if } i < j$$

f is defined by similar expressions (the interested reader may refer to the work by Lari and Young [Lari and Young, 1990] for an exhaustive explanation of the algorithm's details). The implementation in the tool we used for our experiments, io¹, performs this computation by means of a modified version of the CYK bottom-up parser [Grune and Jacobs, 1990].
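As an illustration of this recursion, the following sketch (Python, with the same toy grammar as above; it is not the io tool mentioned in the footnote) fills the inside probabilities e(u, i, j) bottom-up over increasing span lengths, in the style of a CYK chart.

# --- begin sketch: inside probabilities, bottom-up (illustrative only) ---
from collections import defaultdict

p = {  # toy SCFG in CNF: rule (lhs, rhs) -> probability
    ("S", ("A", "B")): 1.0,
    ("A", ("a",)): 0.7,
    ("A", ("A", "A")): 0.3,
    ("B", ("b",)): 0.6,
    ("B", ("B", "B")): 0.4,
}

def inside_probabilities(w):
    """e[(u, i, j)] = P(u =>* t_i ... t_j), with 1-based inclusive indices as in the text."""
    l = len(w)
    e = defaultdict(float)
    # Base case: e(u, i, i) = p(u -> t_i)
    for i, t in enumerate(w, start=1):
        for (u, rhs), prob in p.items():
            if rhs == (t,):
                e[(u, i, i)] += prob
    # Recursive case: sum over binary rules u -> a b and split points k
    for span in range(2, l + 1):
        for i in range(1, l - span + 2):
            j = i + span - 1
            for (u, rhs), prob in p.items():
                if len(rhs) == 2:
                    a, b = rhs
                    for k in range(i, j):
                        e[(u, i, j)] += prob * e[(a, i, k)] * e[(b, k + 1, j)]
    return e

e = inside_probabilities(["a", "a", "b"])
print(round(e[("S", 1, 3)], 4))   # P(S =>* "a a b") = 1.0 * (0.3 * 0.7 * 0.7) * 0.6 = 0.0882
# --- end sketch ---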

Once e and f are known, the probability P(u → ab | u used) that a rule u → ab was used, given that the non-terminal u is produced during the derivation of w, can be expressed in terms of e, f and p as follows:

¹ io was written by Mark Johnson, see http://web.science.mq.edu.au/~mjohnson/Software.htm


$$P(u \to ab \mid u \text{ used}) = \frac{\sum_{i=1}^{l-1} \sum_{j=i+1}^{l} \sum_{k=i}^{j-1} p(u \to ab)\, e(a, i, k)\, e(b, k+1, j)\, f(u, i, j)}{\sum_{i=1}^{l} \sum_{j=i}^{l} e(u, i, j)\, f(u, i, j)}$$

Since P(u → ab | u used) on the left-hand side is the same quantity as p(u → ab) on the right-hand side, we can apply an expectation-maximization algorithm. We initially guess such probabilities, indicated as $p^{(0)}$. Then we compute e and f using $p^{(0)}$ and use the formula above to obtain a new, better estimate $p^{(1)}$ for the rule probabilities. We then continue with $p^{(2)}, \ldots, p^{(k)}$, until convergence. The actual algorithm computes new estimates using the whole corpus W instead of a single sentence w, but the computations are the same as those presented here for a single sentence w. At each iteration k, the likelihood Λ(W) of the corpus W:

$$\Lambda(W) = \prod_{w \in W} P(s \stackrel{*}{\Rightarrow} w)$$

is computed. The algorithm has been proven to provide estimates for p that never decrease Λ(W), hence converging to a (possibly local) maximum. In practice, the computation continues until no appreciable increase in likelihood is observed. The complete algorithm is shown as Algorithm 6.


Algorithm 6: The Inside-Outside Algorithm
Require: A CFG G, a corpus of sentences W, a small real number δ
Ensure: p are probabilities associated to the rules of G that (locally) maximize the likelihood of the corpus W

  Randomly assign probabilities p to the rules of G and compute the likelihood L′ of W
  repeat
    L ← L′
    parse each sentence w ∈ W using CYK
    compute inside probability e(u, i, j) and outside probability f(u, i, j) using the available values for p
    compute a new estimate for p using e(u, i, j), f(u, i, j) and the available values for p
    compute the new likelihood L′ of W
  until L′ − L < δ
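The following self-contained sketch puts the pieces of Algorithm 6 together in Python on the same toy grammar, with an invented corpus (all names, the grammar and the corpus are illustrative; this is not the io tool). In the re-estimation step, each sentence's contributions are normalized by P(s ⇒* w), a factor that cancels out in the single-sentence formula shown earlier.

# --- begin sketch: the inside-outside EM loop of Algorithm 6 (illustrative only) ---
from collections import defaultdict

START = "S"
rules = [                 # toy SCFG in CNF, same as in the previous sketches
    ("S", ("A", "B")),
    ("A", ("a",)),
    ("A", ("A", "A")),
    ("B", ("b",)),
    ("B", ("B", "B")),
]

def inside(w, p):
    # e[(u, i, j)] = P(u =>* t_i ... t_j), filled bottom-up as in the text
    l, e = len(w), defaultdict(float)
    for i, t in enumerate(w, start=1):
        for u, rhs in rules:
            if rhs == (t,):
                e[(u, i, i)] += p[(u, rhs)]
    for span in range(2, l + 1):
        for i in range(1, l - span + 2):
            j = i + span - 1
            for u, rhs in rules:
                if len(rhs) == 2:
                    a, b = rhs
                    for k in range(i, j):
                        e[(u, i, j)] += p[(u, rhs)] * e[(a, i, k)] * e[(b, k + 1, j)]
    return e

def outside(w, p, e):
    # f[(u, i, j)] = P(s =>* t_1 ... t_{i-1} u t_{j+1} ... t_l); mass is pushed from parents to children
    l, f = len(w), defaultdict(float)
    f[(START, 1, l)] = 1.0
    for span in range(l, 1, -1):
        for i in range(1, l - span + 2):
            j = i + span - 1
            for u, rhs in rules:
                if len(rhs) == 2 and f[(u, i, j)] > 0.0:
                    a, b = rhs
                    for k in range(i, j):
                        f[(a, i, k)] += p[(u, rhs)] * e[(b, k + 1, j)] * f[(u, i, j)]
                        f[(b, k + 1, j)] += p[(u, rhs)] * e[(a, i, k)] * f[(u, i, j)]
    return f

def reestimate(corpus, p):
    # E-step: expected rule counts (num) and expected non-terminal usages (den); M-step: their ratio
    num, den = defaultdict(float), defaultdict(float)
    for w in corpus:
        l, e = len(w), inside(w, p)
        f = outside(w, p, e)
        pw = e[(START, 1, l)]            # P(s =>* w)
        for u, rhs in rules:
            if len(rhs) == 2:
                a, b = rhs
                for i in range(1, l):
                    for j in range(i + 1, l + 1):
                        for k in range(i, j):
                            num[(u, rhs)] += p[(u, rhs)] * e[(a, i, k)] * e[(b, k + 1, j)] * f[(u, i, j)] / pw
            else:
                for i in range(1, l + 1):
                    if w[i - 1] == rhs[0]:
                        num[(u, rhs)] += p[(u, rhs)] * f[(u, i, i)] / pw
        for u in {lhs for lhs, _ in rules}:
            for i in range(1, l + 1):
                for j in range(i, l + 1):
                    den[u] += e[(u, i, j)] * f[(u, i, j)] / pw
    return {(u, rhs): num[(u, rhs)] / den[u] if den[u] > 0 else p[(u, rhs)] for u, rhs in rules}

def likelihood(corpus, p):
    result = 1.0
    for w in corpus:
        result *= inside(w, p)[(START, 1, len(w))]
    return result

# Main loop of Algorithm 6: re-estimate p until the likelihood gain drops below delta
corpus = [["a", "b"], ["a", "a", "b"], ["a", "b", "b", "b"]]
p = {(u, rhs): 1.0 / sum(1 for lhs, _ in rules if lhs == u) for u, rhs in rules}   # uniform p(0)
delta, L_new = 1e-6, likelihood(corpus, p)
while True:
    L_old, p = L_new, reestimate(corpus, p)
    L_new = likelihood(corpus, p)
    if L_new - L_old < delta:
        break
for rule, prob in sorted(p.items()):
    print(rule, round(prob, 3))
# --- end sketch ---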


Appendix B

Own Publications

B.1 JOURNAL PUBLICATIONS

1. Fitsum Meshesha Kifetew, Roberto Tiella, and Paolo Tonella. Generating Valid Grammar-based Test Inputs by means of Genetic Programming and Annotated Grammars. In Journal of Empirical Software Engineering (EMSE). (Under Review)

B.2 CONFERENCE PUBLICATIONS

2. Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. Results for EvoSuite-MOSA at the Third Unit Testing Tool Competition. In International Symposium on Search Based Software Testing (SBST), IEEE, 2015.

3. Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. Reformulating Branch Coverage as a Many-Objective Optimization Problem. In International Conference on Software Testing, Verification and Validation (ICST), IEEE, 2015.

4. Fitsum Meshesha Kifetew, Roberto Tiella, and Paolo Tonella. Combining Stochastic Grammars and Genetic Programming for Coverage Testing at the System Level. In Search Based Software Engineering (SSBSE), Springer, 2014.

5. Fitsum Meshesha Kifetew, Wei Jin, Roberto Tiella, Alessandro Orso, and Paolo Tonella. Reproducing Field Failures for Programs with Complex Grammar Based Input. In International Conference on Software Testing, Verification and Validation (ICST), IEEE, 2014.

6. Fitsum Meshesha Kifetew, Wei Jin, Roberto Tiella, Alessandro Orso, and Paolo Tonella. SBFR: A Search Based Approach for Reproducing Failures of Programs with Grammar Based Input. In International Conference on Automated Software Engineering (ASE), pp 604-609. ACM, 2013.

7. Fitsum Meshesha Kifetew, Annibale Panichella, Andrea De Lucia, Rocco Oliveto, and Paolo Tonella. Orthogonal Exploration of the Search Space in Evolutionary Test Case Generation. In International Symposium on Software Testing and Analysis (ISSTA), pp 257-267. ACM, 2013.

8. Fitsum Meshesha Kifetew. A search-based framework for failure reproduction. In Search Based Software Engineering (SSBSE), pp 279-284. Springer Berlin Heidelberg, 2012. [Best Graduate Paper]
