
The Oracle Problem in Software Testing: A Survey

Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz and Shin Yoo

Abstract—Testing involves examining the behaviour of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behaviour is called the “test oracle problem”. Test oracle automation is important to remove a current bottleneck that inhibits greater overall test automation. Without test oracle automation, the human has to determine whether observed behaviour is correct. The literature on test oracles has introduced techniques for oracle automation, including modelling, specifications, contract-driven development and metamorphic testing. When none of these is completely adequate, the final source of test oracle information remains the human, who may be aware of informal specifications, expectations, norms and domain specific information that provide informal oracle guidance. All forms of test oracles, even the humble human, involve challenges of reducing cost and increasing benefit. This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.

Index Terms—Test oracle; Automatic testing; Testing formalism.


1 INTRODUCTION

Much work on software testing seeks to automate as much of the test process as practical and desirable, to make testing faster, cheaper, and more reliable. To this end, we need a test oracle, a procedure that distinguishes between the correct and incorrect behaviors of the System Under Test (SUT).

However, compared to many aspects of test automation, the problem of automating the test oracle has received significantly less attention, and remains comparatively less well-solved. This current open problem represents a significant bottleneck that inhibits greater test automation and uptake of automated testing methods and tools more widely. For instance, the problem of automatically generating test inputs has been the subject of research interest for nearly four decades [46], [108]. It involves finding inputs that cause execution to reveal faults, if they are present, and to give confidence in their absence, if none are found. Automated test input generation has been the subject of many significant advances in both Search-Based Testing [3], [5], [83], [127], [129] and Dynamic Symbolic Execution [75], [109], [162]; yet none of these advances address the issue of checking generated inputs with respect to expected behaviours—that is, providing an automated solution to the test oracle problem.

Of course, one might hope that the SUT has been developed under excellent design-for-test principles, so that there might be a detailed, and possibly formal, specification of intended behaviour. One might also hope that the code itself contains pre- and post-conditions that implement well-understood contract-driven development approaches [136]. In these situations, the test oracle cost problem is ameliorated by the presence of an automatable test oracle to which a testing tool can refer to check outputs, free from the need for costly human intervention.

Where no full specification of the properties of the SUT exists, one may hope to construct a partial test oracle that can answer questions for some inputs. Such partial test oracles can be constructed using metamorphic testing (built from known relationships between desired behaviour) or by deriving oracular information from execution or documentation.

For many systems and most testing as currently practiced in industry, however, the tester does not have the luxury of formal specifications or assertions, or automated partial test oracles [91], [92]. The tester therefore faces the daunting task of manually checking the system’s behaviour for all test cases. In such cases, automated software testing approaches must address the human oracle cost problem [1], [82], [131].

To achieve greater test automation and wider uptake of automated testing, we therefore need a concerted effort to find ways to address the test oracle problem and to integrate automated and partially automated test oracle solutions into testing techniques. This paper seeks to help address this challenge by providing a comprehensive review and analysis of the existing literature on the test oracle problem.

Four partial surveys of topics relating to test oracles precede this one. However, none has provided a comprehensive survey of trends and results. In 2001, Baresi and Young [17] presented a partial survey that covered four topics prevalent at the time the paper was published: assertions, specifications, state-based conformance testing, and log file analysis. While these topics remain important, they capture only a part of the overall landscape of research in test oracles, which the present paper covers. Another early work was the initial motivation for considering the test oracle problem contained in Binder’s textbook on software testing [23], published in 2000. More recently, in 2009, Shahamiri et al. [165] compared six techniques from the specific category of derived test oracles. In 2011, Staats et al. [174] proposed a theoretical analysis that included test oracles in a revisitation of the fundamentals of testing. Most recently, in 2014, Pezzè et al. focused on automated test oracles for functional properties [151].

Despite this work, research into the test oracle problem remains an activity undertaken in a fragmented community of researchers and practitioners. The role of the present paper is to overcome this fragmentation in this important area of software testing by providing the first comprehensive analysis and review of work on the test oracle problem.

The rest of the paper is organised as follows: Section 2 sets out the definitions relating to test oracles that we use to compare and contrast the techniques in the literature. Section 3 relates a historical analysis of developments in the area. Here we identify key milestones and track the volume of past publications. Based on this data, we plot growth trends for four broad categories of solution to the test oracle problem, which we survey in Sections 4–7. These four categories comprise approaches to the oracle problem where:

• test oracles can be specified (Section 4);
• test oracles can be derived (Section 5);
• test oracles can be built from implicit information (Section 6); and
• no automatable oracle is available, yet it is still possible to reduce human effort (Section 7).

Finally, Section 8 concludes with closing remarks.

2 DEFINITIONS

This section presents definitions to establish a lingua franca in which to examine the literature on oracles. These definitions are formalised to avoid ambiguity, but the reader should find that it is also possible to read the paper using only the informal descriptions that accompany these formal definitions. We use the theory to clarify the relationship between algebraic specification, pseudo-oracles, and metamorphic relations in Section 5.

To begin, we define a test activity as a stimulus or response, then test activity sequences that incorporate constraints over stimuli and responses. Test oracles accept or reject test activity sequences, first deterministically, then probabilistically. We then define notions of soundness and completeness of test oracles.

Fig. 1. Stimulus and observations: S is anything that can change the observable behavior of the SUT f; R is anything that can be observed about the system’s behavior; I includes f’s explicit inputs; O is its explicit outputs; everything not in S ∪ R neither affects nor is affected by f.

2.1 Test Activities

To test is to stimulate a system and observe its response. A stimulus and a response both have values, which may coincide, as when the stimulus value and the response are both reals. A system has a set of components C. A stimulus and its response target a subset of components. For instance, a common pattern for constructing test oracles is to compare the output of distinct components on the same stimulus value. Thus, stimuli and responses are values that target components. Collectively, stimuli and responses are test activities:

Definition 2.1 (Test Activities). For the SUT p, S is the set of stimuli that trigger or constrain p’s computation and R is the set of observable responses to a stimulus of p. S and R are disjoint. Test activities form the set A = S ⊎ R.

The use of disjoint union implicitly labels the elements of A, which we can flatten to the tuple L × C × V, where L = {stimulus, response} is the set of activity labels, C is the set of components, and V is an arbitrary set of values. To model those aspects of the world that are independent of any component, like a clock, we set an activity’s target to the empty set.

We use the terms “stimulus” and “observation” in the broadest sense possible to cater to various testing scenarios, functional and nonfunctional. As shown in Figure 1, a stimulus can be either an explicit test input from the tester, I ⊂ S, or an environmental factor that can affect the testing, S \ I. Similarly, an observation ranges from an output of the SUT, O ⊂ R, to a nonfunctional execution profile, like execution time, in R \ O.

For example, stimuli include the configuration and platform settings, database table contents, device states, resource constraints, preconditions, typed values at an input device, inputs on a channel from another system, sensor inputs and so on. Notably, resetting a SUT to an initial state is a stimulus, and stimulating the SUT with an input runs it. Observations include anything that can be discerned and ascribed a meaning significant to the purpose of testing — including values that appear on an output device, database state, temporal properties of the execution, heat dissipated during execution, power consumed, or any other measurable attributes of its execution. Stimuli and observations are members of disjoint sets, but together they make up the set of test activities.

2.2 Test Activity Sequence

Testing is a sequence of stimuli and response observations. The relationship between stimuli and responses can often be captured formally; consider a simple SUT that squares its input. To compactly represent infinite relations between stimulus and response values, such as (i, o = i^2), we introduce a compact notation for set comprehensions:

x: [φ] = {x | φ},

where x is a dummy variable over an arbitrary set.

Definition 2.2 (Test Activity Sequence). A test activity sequence is an element of T_A = {w | T →* w} over the grammar

T ::= A ': [' φ ']' T | A T | ε

where A is the test activity alphabet.

Under Definition 2.2, the test activity sequence i o: [o = i^2] denotes the stimulus of invoking f on i, then observing the response output. It further specifies valid responses obeying o = i^2. Thus, it compactly represents the infinite set of test activity sequences i_1 o_1, i_2 o_2, · · · where o_k = i_k^2.

For practical purposes, a test activity sequence will almost always have to satisfy constraints in order to be useful. Under our formalism, these constraints differentiate the approaches to test oracles we survey. As an initial illustration, we constrain a test activity sequence to obtain a practical test sequence:

Definition 2.3 (Practical Test Sequence). A practical test sequence is any test activity sequence w that satisfies

w = T s T r T, for s ∈ S, r ∈ R.

Thus, the test activity sequence w is practical iff it contains at least one stimulus followed by at least one observation.

This notion of a test sequence is nothing more than a very general notion of what it means to test; we must do something to the system (the stimulus) and subsequently observe some behaviour of the system (the observation) so that we have something to check (the observation) and something upon which this observed behaviour depends (the stimulus).

A reliable reset (p, r) ∈ S is a special stimulus that returns the SUT’s component p to its start state. The test activity sequence (stimulus, p, r)(stimulus, p, i) is therefore equivalent to the conventional application notation p(i). To extract the value of an activity, we write v(a); to extract its target component, we write c(a). To specify two invocations of a single component on different values, we must write r_1 i_1 r_2 i_2 : [r_1, i_1, r_2, i_2 ∈ S, c(r_1) = c(i_1) = c(r_2) = c(i_2) ∧ v(i_1) ≠ v(i_2)]. In the sequel, we often compare different executions of a single SUT or compare the output of independently implemented components of the SUT on the same input value. For clarity, we introduce syntactic sugar to express constraints on stimulus values and components. We let f(x) denote r i : [c(i) = f ∧ v(i) = x], for f ∈ C.

A test oracle is a predicate that determines whether a given test activity sequence is an acceptable behaviour of the SUT or not. We first define a “test oracle”, and then relax this definition to “probabilistic test oracle”.

Definition 2.4 (Test Oracle). A test oracle D : T_A ↦ B is a partial¹ function from a test activity sequence to true or false.

When a test oracle is defined for a test activity, it either accepts the test activity or not. Concatenation in a test activity sequence denotes sequential activities; the test oracle D permits parallel activities when it accepts different permutations of the same stimuli and response observations. We use D to distinguish a deterministic test oracle from probabilistic ones. Test oracles are typically computationally expensive, so probabilistic approaches to the provision of oracle information may be desirable even where a deterministic test oracle is possible [125].
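To make Definition 2.4 concrete, the following sketch (ours, not part of the survey’s formalism; all names are illustrative) implements a deterministic test oracle for the squaring SUT introduced above: it accepts a test activity sequence only if every response obeys o = i^2.

# A minimal sketch of a deterministic test oracle (Definition 2.4) for a
# SUT that is supposed to square its input. Names are illustrative only.
def squaring_oracle(activity_sequence):
    # activity_sequence: list of ("stimulus", value) and ("response", value) pairs
    for (kind, value), (next_kind, next_value) in zip(activity_sequence, activity_sequence[1:]):
        if kind == "stimulus" and next_kind == "response" and next_value != value ** 2:
            return False            # reject: the response violates o = i^2
    return True                     # accept the test activity sequence

squaring_oracle([("stimulus", 3), ("response", 9)])    # True
squaring_oracle([("stimulus", 3), ("response", 10)])   # False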

Definition 2.5 (Probabilistic Test Oracle). A probabilistic test oracle D̃ : T_A ↦ [0, 1] maps a test activity sequence into the closed interval [0, 1] ⊂ R.

A probabilistic test oracle returns a real number in the closed interval [0, 1]. As with test oracles, we do not require a probabilistic test oracle to be a total function. A probabilistic test oracle can model the case where the test oracle is only able to efficiently offer a probability that the test case is acceptable, or other situations where some degree of imprecision can be tolerated in the test oracle’s response.

1. Recall that a function is implicitly total: it maps every element of its domain to a single element of its range. The partial function f : X ↦ Y is the total function f′ : X′ → Y, where X′ ⊆ X.

Our formalism combines a language-theoretic view of stimulus and response activities with constraints over those activities; these constraints explicitly capture specifications. The high-level language view imposes a temporal order on the activities. Thus, our formalism is inherently temporal. The formalism of Staats et al. captures any temporal exercising of the SUT’s behavior in tests, which are atomic black boxes for them [174]. Indeed, practitioners write test plans and activities; they do not often write specifications at all, let alone formal ones. This fact, together with the expressivity of our formalism, as evident in our capture of existing test oracle approaches, is evidence that our formalism is a good fit with practice.

2.3 Soundness and Completeness

We conclude this section by defining soundness and completeness of test oracles.

In order to define soundness and completeness of a test oracle, we need to define a concept of the “ground truth”, G. The ground truth is another form of oracle, a conceptual oracle, that always gives the “right answer”. Of course, it cannot be known in all but the most trivial cases, but it is a useful definition that bounds test oracle behaviour.

Definition 2.6 (Ground Truth). The ground truth oracle, G, is a total test oracle that always gives the “right answer”.

We can now define soundness and completeness of a test oracle with respect to G.

Definition 2.7 (Soundness). The test oracle D is sound iff

D(a) ⇒ G(a)

Definition 2.8 (Completeness). The test oracle D is complete iff

G(a) ⇒ D(a)

While test oracles cannot, in general, be both sound and complete, we can, nevertheless, define and use partially correct test oracles. Further, one could argue, from a purely philosophical point of view, that human oracles can be sound and complete, or correct. In this view, correctness becomes a subjective human assessment. The foregoing definitions allow for this case.

We relax our definition of soundness to cater for probabilistic test oracles:

Definition 2.9 (Probabilistic Soundness and Completeness). A probabilistic test oracle D̃ is probabilistically sound iff

P(D̃(w) = 1) > 1/2 + ε ⇒ G(w)

and D̃ is probabilistically complete iff

G(w) ⇒ P(D̃(w) = 1) > 1/2 + ε

where ε is non-negligible.

The non-negligible advantage ε requires D̃ to do sufficiently better than flipping a fair coin (which, for a binary classifier, maximizes entropy) that we can achieve arbitrary confidence in whether the test sequence w is valid by repeatedly sampling D̃ on w.
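To make the amplification argument concrete, the following sketch (ours; the noisy oracle and all names are illustrative) repeatedly samples a probabilistic test oracle and takes a majority vote, so that a non-negligible advantage ε yields arbitrarily high confidence as the number of trials grows.

import random

def amplify(prob_oracle, activity_sequence, trials=101):
    # Repeatedly sample a probabilistic test oracle (Definition 2.5) and take a
    # majority vote; with advantage eps > 0 the vote is wrong with probability
    # that shrinks exponentially in the number of trials.
    votes = sum(prob_oracle(activity_sequence) for _ in range(trials))
    return votes > trials / 2

def noisy_squaring_oracle(seq):
    # Illustrative probabilistic oracle for the squaring SUT: returns the
    # correct verdict (1 or 0) with probability 0.6, i.e. eps = 0.1.
    (_, i), (_, o) = seq
    verdict = 1 if o == i ** 2 else 0
    return verdict if random.random() < 0.6 else 1 - verdict

amplify(noisy_squaring_oracle, [("stimulus", 3), ("response", 9)])   # True with high probability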

3 TEST ORACLE RESEARCH TRENDS

The term “test oracle” first appeared in William Howden’s seminal work in 1978 [99]. In this section, we analyze the research on test oracles and related areas conducted since 1978. We begin with a synopsis of the volume of publications, classified into specified, derived, implicit, and lack of automated test oracles. We then discuss when key concepts in test oracles were first introduced.

[Figure 2 comprises four plots, one per category of test oracle (specified, derived, implicit, and handling the lack of test oracles), each showing the cumulative number of publications per year against a fitted power-regression trend line; for example, the plot for handling the lack of oracles is fitted by y = 0.2997x^1.4727 with R² = 0.8871.]

Fig. 2. Cumulative number of publications from 1978 to 2012 and research trend analysis for each type of test oracle. The x-axis represents years and the y-axis the cumulative number of publications. We use a power regression model to perform the trend analysis. The regression equation and the coefficient of determination (R²) indicate an upward future trend, a sign of a healthy research community.

3.1 Volume of Publications

We constructed a repository of 694 publications on test oracles and related areas from 1978 to 2012 by conducting web searches for research articles on Google Scholar and Microsoft Academic Search using the queries “software + test + oracle” and “software + test oracle”², for each year. Although some of the queries generated in this fashion may be similar, different responses are obtained, with particular differences around the more lowly-ranked results.

We classify work on test oracles into four categories: specified test oracles (317), derived test oracles (245), implicit test oracles (76), and no test oracle (56), which handles the lack of a test oracle.

2. We use + to separate the keywords in a query; a phrase not internally separated by +, like “test oracle”, is a compound keyword, quoted when given to the search engine.

Specified test oracles, discussed in detail in Section 4, judge all behavioural aspects of a system with respect to a given formal specification. For specified test oracles we searched for related articles using the queries “formal + specification”, “state-based specification”, “model-based languages”, “transition-based languages”, “assertion-based languages”, “algebraic specification” and “formal + conformance testing”. For all queries, we appended the keyword “test oracle” to filter the results for test oracles.

Derived test oracles (see Section 5) involve artefacts from which a test oracle may be derived — for instance, a previous version of the system. For derived test oracles, we searched for additional articles using the queries “specification inference”, “specification mining”, “API mining”, “metamorphic testing”, “regression testing” and “program documentation”.

An implicit oracle (see Section 6) refers to the detection of “obvious” faults such as a program crash. For implicit test oracles we applied the queries “implicit oracle”, “null pointer + detection”, “null reference + detection”, “deadlock + livelock + race + detection”, “memory leaks + detection”, “crash + detection”, “performance + load testing”, “non-functional + error detection”, “fuzzing + test oracle” and “anomaly detection”.

There have also been papers researching strategies for handling the lack of an automated test oracle (see Section 7). Here, we applied the queries “human oracle”, “test minimization”, “test suite reduction” and “test data + generation + realistic + valid”.

Each of the above queries was appended with the keywords “software testing”. The results were filtered, removing articles that were found to have no relation to software testing and test oracles. Figure 2 shows the cumulative number of publications on each type of test oracle from 1978 onwards. We analyzed the research trend on this data by applying different regression models. The trend line, shown in Figure 2, is fitted using a power model. The high values for the four coefficients of determination (R²), one for each of the four types of test oracle, confirm that our models are good fits to the trend data. The observed trends suggest continued healthy growth in research volumes on topics related to the test oracle problem.

3.2 The Advent of Test Oracle Techniques

We classified the collected publications by the techniques or concepts they proposed to (partially) solve a test oracle problem; for example, Model Checking [35] and Metamorphic Testing [36] fall into the derived test oracle category, and DAISTS [69] is an algebraic specification system that addresses the specified test oracle problem.

For each type of test oracle and the advent of a technique or a concept, we plotted a timeline in chronological order of publications to study research trends. Figure 3 shows the timeline starting from 1978, when the term “test oracle” was first coined. Each vertical bar presents the technique or concept used to solve the problem, labeled with the year of its first publication.

The timeline shows only the work that is explicit on the issue of test oracles. For example, the work on test generation using finite state machines (FSMs) can be traced back to as early as the 1950s. But the explicit use of finite state machines to generate test oracles can be traced back to Jard and Bochmann [103] and Howden in 1986 [98]. We record, in the timeline, the earliest available publication for a given technique or concept. We consider only published work in journals, the proceedings of conferences and workshops, or magazines. We excluded all other types of documentation, such as technical reports and manuals.

Figure 3 shows a few techniques and concepts that predate 1978. Although not explicitly on test oracles, they identify and address issues for which test oracles were later developed. For example, work on detecting concurrency issues (deadlock, livelock, and races) can be traced back to the 1960s. Since these issues require no specification, implicit test oracles can and have been built that detect them on arbitrary systems. Similarly, Regression Testing detects problems in the functionality a new version of a system shares with its predecessors and is a precursor of derived test oracles.

The trend analysis suggests that proposals for new techniques and concepts for the formal specification of test oracles peaked in the 1990s and have gradually diminished in the last decade. However, it remains an area of much research activity, as can be judged from the number of publications for each year in Figure 2. For derived test oracles, many solutions have been proposed throughout this period. Initially, these solutions were primarily theoretical, such as Partial/Pseudo-Oracles [196] and Specification Inference [194]; empirical studies, however, followed in the late 1990s.

[Figure 3 is a timeline, from pre-1978 to 2012, marking the first publication for each technique or concept in the four categories: specified test oracles (e.g. Design-by-Contract, and specification languages and systems such as ANNA, LARCH, SDL, VDM, Z, Lustre, Statecharts, Object-Z, UML, OCL, AsmL, Alloy, RESOLVE, JML, DAISTS, TTCN-3, CASL and CASCAT); derived test oracles (e.g. N-versions, pseudo/partial oracles, regression testing, log file analysis, metamorphic testing, specification mining, invariant detection, and oracles from code comments, semi-formal and API documentation); implicit test oracles (e.g. detection of concurrency issues, exceptions, memory leaks, anomaly detection, robustness checking, fuzzing and load testing); and handling the lack of test oracles (e.g. partition testing, realistic test data, test size reduction, machine learning, input classification and usage mining).]

Fig. 3. Chronological introduction of test oracle techniques and concepts.

For implicit test oracles, research into the solutions established before 1978 has continued, but at a slower pace than for the other types of test oracles. For handling the lack of an automated test oracle, Partition Testing is a well-known technique that helps a human test oracle select tests. The trend line suggests that only recently have new techniques and concepts for tackling this problem started to emerge, with an explicit focus on the human oracle cost problem.

4 SPECIFIED TEST ORACLES

Specification is fundamental to computer science, so it is not surprising that a vast body of research has explored its use as a source of test oracle information. This topic could merit an entire survey in its own right. In this section, we provide an overview of this work. We also include here partial specifications of system behaviour, such as assertions and models.

A specification defines, if possible using mathematical logic, the test oracle for a particular domain. Thus, a specification language is a notation for defining a specified test oracle D, which judges whether the behaviour of a system conforms to a formal specification. Our formalism, defined in Section 2, is, itself, a specification language for specifying test oracles.

Over the last 30 years, many methods and formalisms for testing based on formal specification have been developed. They fall into four broad categories: model-based specification languages, state transition systems, assertions and contracts, and algebraic specifications. Model-based languages define models and a syntax that defines desired behavior in terms of its effect on the model. State transition systems focus on modeling the reaction of a system to stimuli, referred to as “transitions” in this particular formalism. Assertions and contracts are fragments of a specification language that are interleaved with statements of the implementation language and checked at runtime. Algebraic specifications define equations over a program’s operations that hold when the program is correct.

4.1 Specification Languages

Specification languages define a mathematical model of a system’s behaviour, and are equipped with a formal semantics that defines the meaning of each language construct in terms of the model. When used for testing, models do not usually fully specify the system, but seek to capture salient properties of a system so that test cases can be generated from or checked against them.

4.1.1 Model-Based Specification Languages

Model-based specification languages model a system as a collection of states and operations to alter these states, and are therefore also referred to as “state-based specifications” in the literature [101], [110], [182], [183]. Preconditions and postconditions constrain the system’s operations. An operation’s precondition imposes a necessary condition over the input states that must hold in a correct application of the operation; a postcondition defines the (usually strongest) effect the operation has on program state [110].

A variety of model-based specification languages exist, including Z [172], B [111], UML/OCL [31], VDM/VDM-SL [62], Alloy [102], and the LARCH family [71], which includes an algebraic specification sub-language. Broadly, these languages have evolved toward being more concrete, closer to the implementation languages programmers use to solve problems. Two reasons explain this phenomenon: the first is the effort to increase their adoption in industry by making them more familiar to practitioners, and the second is to establish synergies between specification and implementation that facilitate development as iterative refinement. For instance, Z models disparate entities, like predicates, sets, state properties, and operations, through a single structuring mechanism, its schema construct; the B method, Z’s successor, provides a richer array of less abstract language constructs.

Börger discusses how to use the abstract state machine formalism, a very general set-theoretic specification language geared toward the definition of functions, to define high-level test oracles [29]. The models underlying specification languages can be very abstract, quite far from concrete execution output. For instance, it may be difficult to compute whether a model’s postcondition for a function permits an observed concrete output. If this impedance mismatch can be overcome, by abstracting a system’s concrete output or by concretizing a specification model’s output, and if a specification’s postconditions can be evaluated in finite time, they can serve as a test oracle [4].

Model-based specification languages, such as VDM, Z, and B, can express invariants, which can drive testing. Any test case that causes a program to violate an invariant has discovered an incorrect behavior; therefore, these invariants are partial test oracles.

In search of a model-based specification language accessible to domain experts, Parnas et al. proposed TOG (Test Oracles Generator) from program documentation [143], [146], [149]. In their method, the documentation is written in fully formal tabular expressions in which the method signature, the external variables, and the relation between its start and end states are specified [105]. Thus, test oracles can be automatically generated to check the outputs against the specified states of a program. The work by Parnas et al. has been developed over a considerable period of more than two decades [48], [59], [60], [145], [150], [190], [191].

4.1.2 State Transition Systems

State transition systems often present a graphical syntax, and focus on transitions between different states of the system. Here, states typically abstract sets of concrete states of the modeled system. State transition systems have been referred to as visual languages in the literature [197]. A wide variety of state transition systems exist, including Finite State Machines [112], Mealy/Moore machines [112], I/O Automata [118], Labeled Transition Systems [180], SDL [54], Harel Statecharts [81], UML state machines [28], X-Machines [95], [96], Simulink/Stateflow [179] and PROMELA [97]. Mouchawrab et al. conducted a rigorous empirical evaluation of test oracle construction techniques using state transition systems [70], [138].

An important class of state transition systems has a finite set of states and is therefore particularly well-suited for automated reasoning about systems whose behaviour can be abstracted into states defined by a finite set of values [93]. State transition systems capture the behavior of a system under test as a set of states³, with transitions representing stimuli that cause the system to change state. State transition systems model the output of a system they abstract either as a property of the states (the final state in the case of Moore machines) or of the transitions traversed (as with Mealy machines).

3. Unfortunately, the term ‘state’ has different interpretations in the context of test oracles. Often, it refers to a ‘snapshot’ of the configuration of a system at some point during its execution; in the context of state transition systems, however, ‘state’ typically refers to an abstraction of a set of configurations, as noted above.

Models approximate a SUT, so behavioral differences between the two are inevitable. Some divergences, however, are spurious and falsely report testing failure. State-transition models are especially susceptible to this problem when modeling embedded systems, for which time of occurrence is critical. Recent work tolerates spurious differences in time by “steering” the model’s evaluation: when the SUT and its model differ, the model is backtracked, and a steering action, like modifying a timer value or changing inputs, is applied to reduce the distance, under a similarity measure [74].

Protocol conformance testing [72] and, later, model-based testing [183] motivated much of the work applying state transition systems to testing. Given a specification F as a state transition system, e.g. a finite state machine, a test case can be extracted from sequences of transitions in F. The transition labels of such a sequence define an input. A test oracle can then be constructed from F as follows: if F accepts the sequence and outputs some value, then so should the system under test; if F does not accept the input, then neither should the system under test.
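As an illustration of this construction (our sketch, not a tool from the literature; the machine and all names are hypothetical), an oracle derived from a Mealy-style finite state machine specification F replays a stimulus sequence through F and compares the prescribed outputs with those observed from the SUT.

# Transition table of a hypothetical Mealy machine F:
# (state, input) -> (next state, output). All names are illustrative.
SPEC = {
    ("locked", "coin"): ("unlocked", "unlock"),
    ("unlocked", "push"): ("locked", "lock"),
}

def fsm_oracle(inputs, observed_outputs, start="locked"):
    state, expected = start, []
    for symbol in inputs:
        if (state, symbol) not in SPEC:
            return not observed_outputs   # F rejects the input; the SUT should produce no output
        state, output = SPEC[(state, symbol)]
        expected.append(output)
    return expected == observed_outputs   # accept iff the SUT's outputs match those F prescribes

fsm_oracle(["coin", "push"], ["unlock", "lock"])   # True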

Challenges remain, however, as the definition of conformity comes in different flavours, depending on whether the model is deterministic or non-deterministic and whether the behaviour of the system under test on a given test case is observable and can be interpreted at the same level of abstraction as the model’s. The resulting flavours of conformity have been captured in alternate notions, in terms of whether the system under test is isomorphic to, equivalent to, or quasi-equivalent to F. These notions of conformity were defined in the mid-1990s in the famous survey paper by Lee and Yannakakis [112] among other notable papers, including those by Bochmann et al. [26] and Tretmans [180].

4.2 Assertions and Contracts

An assertion is a boolean expression that is placed at a certain point in a program to check its behaviour at runtime. When an assertion evaluates to true, the program’s behaviour is regarded “as intended” at the point of the assertion, for that particular execution; when an assertion evaluates to false, an error has been found in the program for that particular execution. It is easy to see how assertions can be used as a test oracle.

The fact that assertions are embedded in an implementation language has two implications that differentiate them from specification languages. First, assertions can directly reference and define relations over program variables, reducing the impedance mismatch between specification and implementation, for the properties an assertion can express and check. In this sense, assertions are a natural consequence of the evolution of specification languages toward supporting development through iterative refinement. Second, they are typically written along with the code whose runtime behavior they check, as opposed to preceding implementation as specification languages tend to do.

Assertions have a long pedigree dating back to Turing [181], who first identified the need to separate the tester from the developer and suggested that they should communicate by means of assertions: the developer writing them and the tester checking them. Assertions gained significant attention as a means of capturing language semantics in the seminal work of Floyd [64] and Hoare [94] and subsequently were championed as a means of increasing code quality in the development of the contract-based programming approach, notably in the language Eiffel [136].

Widely used programming languages now routinely provide assertion constructs; for instance, C, C++, and Java provide a construct called assert and C# provides a Debug.Assert method. Moreover, a variety of systems have been independently developed for embedding assertions into a host programming language, such as Anna [117] for Ada, and APP [156] and Nana [120] for C.
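To make the link to test oracles concrete, the following sketch (ours; the routine and its postcondition are illustrative) shows a runtime assertion acting as a partial test oracle for an integer square-root function: any execution that violates the assertion has exposed a fault for that execution.

import math

def integer_sqrt(n):
    r = math.isqrt(n)                       # implementation under test (illustrative)
    # Assertion as a partial test oracle: this postcondition must hold on every execution.
    assert 0 <= r and r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r

integer_sqrt(17)   # returns 4; the embedded assertion checks the result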

In practice, assertion approaches can check only a limited set of properties at a certain point in a program [49]. Languages based on design by contract principles extend the expressivity of assertions by providing means to check contracts between client and supplier objects in the form of method pre- and postconditions and class invariants. Eiffel was the first language to offer design by contract [136], a language feature that has since found its way into other languages, such as Java in the form of the Java Modeling Language (JML) [140].

Cheon and Leavens showed how to construct an assertion-based test oracle on top of JML [45]. For more on assertion-based test oracles, see Coppit and Haddox-Schatz’s evaluation [49], and, later, a method proposed by Cheon [44]. Both assertions and contracts are enforced observation activities embedded in the code. Araujo et al. provide a systematic evaluation of design by contract on a large industrial system [9] and using JML in particular [8]; Briand et al. showed how to support testing by instrumenting contracts [33].

4.3 Algebraic Specification Languages

Algebraic specification languages define a software module in terms of its interface, a signature consisting of sorts and operation symbols. Equational axioms specify the required properties of the operations; their equivalence is often computed using term rewriting [15]. Structuring facilities, which group sorts and operations, allow the composition of interfaces. Typically, these languages employ first-order logic to prove properties of the specification, like the correctness of refinements. Abstract data types (ADTs), which combine data and operations over that data, are well-suited to algebraic specification.


One of the earliest algebraic specification systems, for implementing, specifying and testing ADTs, is DAISTS [69]. In this system, equational axioms generally equate a term-rewriting expression in a restricted dialect of ALGOL 60 against a function composition in the implementation language. For example, consider this axiom used in DAISTS:

Pop2(Stack S, EltType I):
    Pop(Push(S, I)) = if Depth(S) = Limit
                      then Pop(S)
                      else S;

This axiom is taken from a specification that differentiates the accessor Top, which returns the top element of a stack without modifying the stack, and the mutator Pop, which returns a new stack lacking the previous top element. A test oracle simply executes both this axiom and its corresponding composition of implemented functions against a test suite: if they disagree, a failure has been found in the implementation or in the axiom; if they agree, we gain some assurance of their correctness.
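The oracle idea can be sketched in a few lines (ours; the bounded-stack implementation below is hypothetical and stands in for DAISTS’s implementation-language side): evaluate both sides of the axiom on test data and report any disagreement.

# Sketch, in the spirit of DAISTS, of executing an algebraic axiom against an
# implementation. Stacks are lists; LIMIT bounds their depth (all illustrative).
LIMIT = 3

def push(s, i):
    return s if len(s) == LIMIT else s + [i]   # push is a no-op on a full stack

def pop(s):
    return s[:-1]                              # new stack lacking the previous top element

def depth(s):
    return len(s)

def axiom_pop2(s, i):
    lhs = pop(push(s, i))                      # left-hand side of the axiom
    rhs = pop(s) if depth(s) == LIMIT else s   # right-hand side of the axiom
    return lhs == rhs

# Oracle: any disagreement over the test suite is a failure in the
# implementation or in the axiom; agreement gives some assurance.
all(axiom_pop2(s, i) for s, i in [([], 1), ([1, 2], 7), ([1, 2, 3], 9)])   # True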

Gaudel and her colleagues [19], [20], [72], [73] were the first to provide a general testing theory founded on algebraic specification. Their idea is that an exhaustive test suite composed only of ground terms, i.e., terms with no free variables, would be sufficient to judge program correctness. This approach faces an immediate problem: the domain of each variable in a ground term might be infinite and generate an infinite number of test cases. Test suites, however, must be finite, a practical limitation to which all forms of testing are subject. The workaround is, of course, to abandon exhaustive coverage of all bindings of values to ground terms and select a finite subset of test cases [20].

Gaudel’s theory focuses on observational equivalence. Observational inequivalence is, however, equally important [210]. For this reason, Frankl and Doong extended Gaudel’s theory to express inequality as well as equality [52]. They proposed a notation that is suitable for object-oriented programs and developed an algebraic specification language called LOBAS and a test harness called ASTOOT. In addition to handling object-orientation, Frankl and Doong require classes to implement the testing method EQN that ASTOOT uses to check the equivalence or inequivalence of two instances of a given class. From the vantage point of an observer, an object has observable and unobservable, or hidden, state. Typically, the observable state of an object is its public fields and method return values. EQN enhances the testability of code and enables ASTOOT to approximate the observational equivalence of two objects on a sequence of messages, or method calls. When ASTOOT checks the equivalence of an object and a specification in LOBAS, it realizes a specified test oracle.

Expanding upon ASTOOT, Chen et al. [40], [41] built TACCLE, a tool that employs a white-box heuristic to generate a relevant, finite number of test cases. Their heuristic builds a data relevance graph that connects two fields of a class if one affects the other. They use this graph to consider only the fields that can affect an observable attribute of a class when considering the (in)equivalence of two instances. Algebraic specification has been a fruitful line of research; many algebraic specification languages and tools exist, including Daistish [100], LOFT [123], CASL [11], and CASCAT [205]. The projects have been evolving toward testing a wider array of entities, from ADTs to classes and, most recently, components; they also differ in their degree of automation of test case generation and test harness creation. Bochmann et al. used LOTOS to realise test oracle functions from algebraic specifications [184]; most recently, Zhu also considered the use of algebraic specifications as test oracles [210].

4.3.1 Specified Test Oracle Challenges

Three challenges must be overcome to build specified test oracles. The first is the lack of a formal specification. Indeed, the other classes of test oracles discussed in this survey all address the problem of test oracle construction in the absence of a formal specification. Formal specification models necessarily rely on abstraction, which can lead to the second problem, imprecision: models that include infeasible behavior or that do not capture all the behavior relevant to checking a specification [68]. Finally, one must contend with the problem of interpreting model output and equating it to concrete program output.

Specified results are usually quite abstract, and the concrete test results of a program’s executions may not be represented in a form that makes checking their equivalence to the specified result straightforward. Moreover, specified results can be partially represented or oversimplified. This is why Gaudel remarked that the existence of a formal specification does not guarantee the existence of a successful test driver [72]. Formulating concrete equivalence functions may be necessary to correctly interpret results [119]. In short, solutions to this problem of equivalence across abstraction levels depend largely on the degree of abstraction and, to a lesser extent, on the implementation of the system under test.

5 DERIVED TEST ORACLES

A derived test oracle distinguishes a system’s correct from incorrect behavior based on information derived from various artefacts (e.g. documentation, system executions), properties of the system under test, or other versions of it. Testers resort to derived test oracles when specified test oracles are unavailable, which is often the case, since specifications rapidly fall out of date when they exist at all. Of course, the derived test oracle might become a partial “specified test oracle”, so that test oracles derived by the methods discussed in this section could migrate, over time, to become those considered to be the “specified test oracles” of the previous section. For example, JWalk incrementally learns algebraic properties of the class under test [170]. It allows interactive confirmation from the tester, ensuring that the human is in the “learning loop”.

The following sections discuss research on deriving test oracles from development artefacts, beginning in Section 5.1 with pseudo-oracles and N-version programming, which focus on agreement among independent implementations. Section 5.2 then introduces metamorphic relations, which must hold among distinct executions of a single implementation. Regression testing, Section 5.3, focuses on relations that should hold across different versions of the SUT. Approaches for inferring models from system executions, including invariant inference and specification mining, are described in Section 5.4. Section 5.5 closes with a discussion of research into extracting test oracle information from textual documentation, like comments, specifications, and requirements.

5.1 Pseudo-Oracles

One of the earliest versions of a derived test oracle is the concept of a pseudo-oracle, introduced by Davis and Weyuker [50], as a means of addressing so-called non-testable programs:

“Programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known.” [196].

A pseudo-oracle is an alternative version of the program produced independently, e.g. by a different programming team or written in an entirely different programming language. In our formalism (Section 2), a pseudo-oracle is a test oracle D that accepts test activity sequences of the form

f_1(x) o_1 f_2(x) o_2 : [f_1 ≠ f_2 ∧ o_1 = o_2],    (1)

where f_1, f_2 ∈ C, the components of the SUT (Section 2), are alternative, independently produced versions of the SUT applied to the same value. We draw the reader’s attention to the similarity between pseudo-oracles and algebraic specification systems (Section 4.3), like DAISTS, where the function composition expression in the implementation language and the term-rewriting expression are distinct implementations whose output must agree and form a pseudo-oracle.

A similar idea exists in fault-tolerant computing, referred to as multi- or N-version programming [13], [14], where the software is implemented in multiple ways and executed in parallel. Where results differ at run-time, a “voting” mechanism decides which output to use. In our formalism, an N-version test oracle accepts test activities of the following form:

f_1(x) o_1 f_2(x) o_2 · · · f_k(x) o_k : [∀i, j ∈ [1..k], i ≠ j ⇒ f_i ≠ f_j ∧ m(arg max_{o_i} m(o_i)) ≥ t]    (2)

In Equation 2, the outputs form a multiset and m is the multiplicity, or number of repetitions, of an element in the multiset. The arg max operator finds the argument that maximizes a function’s output, here an output with greatest multiplicity. Finally, the maximum multiplicity is compared against the threshold t. We can now define an N-version test oracle as D_nv(w, x), where w obeys Equation 2 with t bound to x. Then D_maj(w) = D_nv(w, ⌈k/2⌉) is an N-version oracle that requires a majority of the outputs to agree, and D_pso(w) = D_nv(w, k) generalizes pseudo-oracles to agreement across k implementations.
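The voting scheme can be sketched as follows (ours; the implementations and threshold handling are illustrative): run k independently produced versions on the same value and accept only if the most frequent output reaches the threshold t.

from collections import Counter

def n_version_oracle(implementations, x, t):
    # Run independently produced versions on the same value x and check whether
    # the most frequent output has multiplicity at least t.
    outputs = [impl(x) for impl in implementations]
    _, multiplicity = Counter(outputs).most_common(1)[0]
    return multiplicity >= t

impls = [lambda x: x * x, lambda x: x ** 2, lambda x: sum(x for _ in range(x))]
k = len(impls)
n_version_oracle(impls, 5, k // 2 + 1)   # majority vote, as in D_maj
n_version_oracle(impls, 5, k)            # all must agree, as in D_pso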

More recently, Feldt [58] investigated the possibility of automatically producing different versions using genetic programming, and McMinn [128] explored the idea of producing different software versions for testing through program transformation and the swapping of different software elements with those of a similar specification.

5.2 Metamorphic Relations

For the SUT p that implements the function f, a metamorphic relation is a relation over applications of f that we expect to hold across multiple executions of p. Suppose f(x) = e^x; then e^a · e^(-a) = 1 is a metamorphic relation. Under this metamorphic relation, p(0.3) * p(-0.3) = 1 will hold if p is correct [43]. The key idea is that reasoning about the properties of f will lead us to relations that its implementation p must obey.
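For instance (our sketch; math.exp stands in for the implementation p), this relation can be checked mechanically by executing the SUT twice on related inputs and testing the constraint on the two outputs, up to floating-point tolerance.

import math

def sut(x):
    return math.exp(x)     # stands in for the implementation p of f(x) = e^x

def metamorphic_check(a, tolerance=1e-9):
    # Metamorphic relation e^a * e^(-a) = 1: two executions of the same SUT on
    # related inputs, with a constraint relating their observed outputs.
    return abs(sut(a) * sut(-a) - 1.0) <= tolerance

metamorphic_check(0.3)   # True if the implementation respects the relation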

Metamorphic testing is a process of exploiting metamorphic relations to generate partial test oracles for follow-up test cases: it checks important properties of the SUT after certain test cases are executed [36]. Metamorphic relations are properties of the ground truth, the correct phenomenon (f in the example above) that a SUT seeks to implement, and so could be considered a mechanism for creating specified test oracles. We have placed them with derived test oracles because, in practice, metamorphic relations are usually manually inferred from a white-box inspection of a SUT.

Metamorphic relations differ from algebraic specifications in that a metamorphic relation relates different executions, not necessarily on the same input, of the same implementation relative to its specification, while an algebraic specification equates two distinct implementations of the specification, one written in an implementation language and the other written in a formalism free of implementation details, usually term rewriting [15].

Under the formalism of Section 2, a metamorphic relation is

f(x_1) o_1 f(x_2) o_2 · · · f(x_k) o_k : [expr ∧ k ≥ 2],

where expr is a constraint, usually arithmetic, over the inputs x_i and outputs o_i. This definition makes clear that a metamorphic relation is a constraint over the values obtained by stimulating the single SUT f at least twice, observing the responses, and imposing a constraint on how they interrelate. In contrast, algebraic specification is a type of pseudo-oracle, as specified in Equation 1, which stimulates two distinct implementations on the same value, requiring their output to be equivalent.


It is often thought that metamorphic relations need to concern numerical properties that can be captured by arithmetic equations, but metamorphic testing is, in fact, more general. For example, Zhou et al. [209] used metamorphic testing to test search engines such as Google and Yahoo!, where the relations considered are clearly non-numeric. Zhou et al. build metamorphic relations in terms of the consistency of search results. A motivating example they give is of searching for a paper in the ACM digital library: two attempts using advanced search, the second quoted, fail, but a general search identical to the first succeeds. Using this insight, the authors build metamorphic relations, like R_OR : A_1 = (A_2 ∪ A_3) ⇒ |A_2| ≤ |A_1|, where the A_i are sets of web pages returned by queries. Metamorphic testing is also a means of testing Weyuker’s “non-testable programs”, introduced in the previous section.

When the SUT is nondeterministic, such as a classifier whose exact output varies from run to run, defining metamorphic relations solely in terms of output equality is usually insufficient during metamorphic testing. Murphy et al. [139], [140] investigate relations other than equality, like set intersection, to relate the output of stochastic machine learning algorithms, such as classifiers. Guderlei and Mayer introduced statistical metamorphic testing, where the relations for test output are checked using statistical analysis [80], a technique later exploited to apply metamorphic testing to stochastic optimisation algorithms [203].

The biggest challenge in metamorphic testing is automating the discovery of metamorphic relations. Some of those in the literature are mathematical [36], [37], [42] or combinatorial [139], [140], [161], [203]. Work on the discovery of algebraic specifications [88] and JWalk's lazy systematic unit testing, in which the specification is lazily and incrementally learned through interactions between JWalk and the developer [170], might be suitable for adaptation to the discovery of metamorphic relations. For instance, the programmer's development environment might track relationships among the output of test cases run during development, and propose ones that hold across many runs to the developer as possible metamorphic relations. Work has already begun that exploits domain knowledge to formulate metamorphic relations [38], but it is still at an early stage and not yet automated.

5.3 Regression Test Suites

Regression testing aims to detect whether the modifications made to the new version of a SUT have disrupted existing functionality [204]. It rests on the implicit assumption that the previous version can serve as an oracle for existing functionality.

For corrective modifications, desired functionality remains the same, so the test oracle for version i, Di, can serve as the next version's test oracle, Di+1. Corrective modifications may fail to correct the problem they seek to address, or may disrupt existing functionality; test oracles may be constructed for these issues by symbolically comparing the execution of the faulty version against the newer, allegedly fixed version [79]. Orstra generates assertion-based test oracles by observing the program states of the previous version while executing the regression test suite [199]. The regression test suite, now augmented with assertions, is then applied to the newer version. Similarly, spectra-based approaches use the program and value spectra obtained from the original version to detect regression faults in the newer versions [86], [200].
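The following sketch illustrates the general idea of deriving a regression oracle from a previous version by recording its outputs and asserting them against the new version; it is our simplification of the assertion-generating approach, not the Orstra algorithm itself, and the two "versions" are hypothetical.

```python
def record_expected_outputs(previous_version, test_inputs):
    """Run the trusted previous version to capture expected outputs."""
    return {x: previous_version(x) for x in test_inputs}

def regression_oracle(new_version, expected):
    """Replay the suite on the new version; report inputs whose behaviour regressed."""
    return [x for x, y in expected.items() if new_version(x) != y]

# Hypothetical versions of the same function (names are illustrative).
version_i  = lambda x: x * 2
version_i1 = lambda x: x * 2 if x >= 0 else x      # faulty corrective change
expected = record_expected_outputs(version_i, [-2, 0, 3])
print(regression_oracle(version_i1, expected))      # [-2]
```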

For perfective modifications, those that add new features to the SUT, Di must be modified to cater for newly added behaviours, i.e. Di+1 = Di ∪ ∆D. Test suite augmentation techniques specialise in identifying and generating ∆D [6], [132], [202]. However, more work is required to develop these augmentation techniques so that they augment not merely the test input, but also the expected output. In this way, test suite augmentation could be extended to augment the existing oracles as well as the test data.

Changes in the specification, which is deemed to fail to meet requirements, perhaps because the requirements have themselves changed, drive another class of modifications. These changes are generally regarded as "perfective" maintenance in the literature, but no distinction is made between perfections that add new functionality to code (without changing requirements) and those changes that arise due to changed requirements (or incorrect specifications).

Our formalisation of test oracles in Section 2 forces a distinction between these two categories of perfective maintenance, since the two have profoundly different consequences for test oracles. We therefore refer to this new category of perfective maintenance as "changed requirements". Recall that, for the function f : X → Y, dom(f) = X. For changed requirements:

∃α · Di+1(α) ≠ Di(α),

which implies, of course, dom(Di+1) ∩ dom(Di) ≠ ∅, and the new test oracle cannot simply union the new behavior with the old test oracle. Instead, we have

Di+1(α) = ∆D(α) if α ∈ dom(∆D), and Di(α) otherwise.
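A small sketch of this composition (ours, not from the survey; ΔD is represented here as a dictionary over its own domain):

```python
def next_oracle(d_i, delta_d):
    """Compose the oracle for version i+1: the changed-requirements
    fragment ΔD (a dict from stimuli to expected behaviours) takes
    precedence on its own domain, D_i applies everywhere else."""
    def d_i_plus_1(alpha):
        return delta_d[alpha] if alpha in delta_d else d_i(alpha)
    return d_i_plus_1

# Hypothetical: the behaviour for input "login" was respecified in version i+1.
d_i = lambda alpha: f"old behaviour for {alpha}"
d_i1 = next_oracle(d_i, {"login": "redirect to single sign-on"})
print(d_i1("login"), "|", d_i1("logout"))
```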

5.4 System Executions

A system execution trace can be exploited to derive test oracles or to reduce the cost of a human test oracle by aligning an incorrect execution against the expected execution, as expressed in temporal logic [51]. This section discusses the two main techniques for deriving test oracles from traces: invariant detection and specification mining. Derived test oracles can be built on both techniques to automatically check expected behaviour, similar to assertion-based specification, discussed in Section 4.2.

5.4.1 Invariant Detection

Program behaviours can be automatically checked against invariants. Thus, invariants can serve as test oracles to help determine the correct and incorrect outputs.

When invariants are not available for a program in advance, they can be learned from the program (semi-)automatically. A well-known technique proposed by Ernst et al. [56], implemented in the Daikon tool [55], is to execute a program on a collection of inputs (test cases) against a collection of potential invariants. The invariants are instantiated by binding their variables to the program's variables. Daikon then dynamically infers likely invariants from those invariants not violated during the program executions over the inputs. The inferred invariants capture program behaviours, and thus can be used to check program correctness. For example, in regression testing, invariants inferred from the previous version can be checked as to whether they still hold in the new version.

In our formalism, Daikon invariant detection can define an unsound test oracle that gathers likely invariants from the prefix of a testing activity sequence, then enforces those invariants over its suffix. Let Ij be the set of likely invariants at observation j; I0 are the initial invariants; for the test activity sequence r1 r2 · · · rn, In = {x ∈ I | ∀i ∈ [1..n], ri ⊨ x}, where ⊨ is logical entailment. Thus, we take an observation to define a binding of the variables in the world under which a likely invariant either holds or does not: only those likely invariants remain that no observation invalidates. In the suffix rn+1 rn+2 · · · rm, the test oracle then changes gear and accepts only those activities whose response observations obey In, i.e. ri : [ri ⊨ In], i > n.
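The following sketch (ours, not Daikon's implementation) mirrors this prefix/suffix scheme: candidate invariants are predicates over a single observation, the prefix filters them, and the surviving set In is enforced over later observations; the candidate predicates and variable names are illustrative.

```python
def filter_likely_invariants(candidate_invariants, observations):
    """Keep only candidates that no observation violates,
    mirroring I_n = {x in I | forall i, r_i |= x}."""
    return [inv for inv in candidate_invariants
            if all(inv(obs) for obs in observations)]

def make_oracle(surviving_invariants):
    """Oracle for the suffix: accept an observation only if it
    satisfies every surviving likely invariant."""
    return lambda obs: all(inv(obs) for inv in surviving_invariants)

# Hypothetical candidate invariants over variables x and y.
candidates = [lambda o: o["x"] >= 0, lambda o: o["x"] < o["y"], lambda o: o["y"] == 10]
prefix = [{"x": 1, "y": 10}, {"x": 3, "y": 10}]
oracle = make_oracle(filter_likely_invariants(candidates, prefix))
print(oracle({"x": 4, "y": 10}), oracle({"x": -1, "y": 10}))   # True False
```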

Invariant detection can be computationally expensive, so incremental [22], [171] and lightweight static analyses [39], [63] have been brought to bear. A technical report summarises various dynamic analysis techniques [158]. Model inference [90], [187] could also be regarded as a form of invariant generation in which the invariant is expressed as a model (typically as an FSM). Ratcliff et al. used Search-Based Software Engineering (SBSE) [84] to search for invariants, guided by mutation testing [154].

The accuracy of inferred invariants depends in part on the quality and completeness of the test cases; additional test cases might provide new data from which more accurate invariants can be inferred [56]. Nevertheless, inferring "perfect" invariants is almost impossible with the current state of the art, which tends to frequently infer incorrect or irrelevant invariants [152]. Wei et al. recently leveraged existing contracts in Eiffel code to infer postconditions on commands (as opposed to queries) involving quantification or implications whose premises are conjunctions of formulae [192], [193].

Human intervention can, of course, be used to filter the resulting invariants, i.e., retaining the correct ones and discarding the rest. However, manual filtering is error-prone and the misclassification of invariants is frequent. In a recent empirical study, Staats et al. found that half of the incorrect invariants Daikon inferred from a set of Java programs were misclassified [175]. Despite these issues, research on the dynamic inference of program invariants has exhibited strong momentum in the recent past, with the primary focus on its application to test generation [10], [142], [207].

5.4.2 Specification Mining

Specification mining or inference infers a formal model of program behaviour from a set of observations. In terms of our formalism, a test oracle can enforce these formal models over test activities. In her seminal work on using inference to assess test data adequacy, Weyuker connected inference and testing as inverse processes [194]. The testing process starts with a program and looks for I/O pairs that characterise every aspect of both the intended and actual behaviours, while inference starts with a set of I/O pairs and derives a program to fit the given behaviour. Weyuker defined this relation for assessing test adequacy, which can be stated informally as follows.

A set of I/O pairs T is an inference-adequate test set for the program P intended to fulfil specification S iff the program IT inferred from T (using some inference procedure) is equivalent to both P and S. Any difference would imply that the inferred program is not equivalent to the actual program and, therefore, that the test set T used to infer the program is inadequate.

This inference procedure mainly depends upon the set of I/O pairs used to infer behaviours. These pairs can be obtained from system executions either passively, e.g., by runtime monitoring, or actively, e.g., by querying the system [106]. However, equivalence checking is undecidable in general, and therefore inference is only possible for programs in a restricted class, such as those whose behaviour can be modelled by finite state machines [194]. With this, equivalence can be accomplished by experiment [89]. Nevertheless, serious practical limitations are associated with such experiments (see the survey by Lee and Yannakakis [112] for a complete discussion).

The marriage between inference and testing has produced a wealth of techniques, especially in the context of "black-box" systems, where source code and behavioural models are unavailable. Most work has applied L∗, a well-known learning algorithm, to learn a black-box system B as a finite state machine (FSM) with n states [7]. The algorithm infers an FSM by iteratively querying B and observing the corresponding outputs. A string distinguishes two FSMs when only one of the two machines ends in a final state upon consuming the string. At each iteration, an inferred model Mi with i < n states is given. Then, the model is refined with the help of a string that distinguishes B and Mi to produce a new model, until the number of states reaches n.
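The notion of a distinguishing string can be made concrete with a small sketch (ours); the tuple-based FSM encoding is an assumption chosen for brevity and is not the representation used by L∗ implementations.

```python
def accepts(fsm, string):
    """fsm = (initial_state, transitions, final_states), where
    transitions maps (state, symbol) -> next state."""
    initial, transitions, finals = fsm
    state = initial
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:
            return False            # undefined transition: treat as rejection
    return state in finals

def distinguishes(string, fsm_a, fsm_b):
    """A string distinguishes two FSMs when exactly one of them
    ends in a final (accepting) state after consuming it."""
    return accepts(fsm_a, string) != accepts(fsm_b, string)

# Two toy machines over {a}: one accepts even numbers of 'a', the other odd.
even = (0, {(0, "a"): 1, (1, "a"): 0}, {0})
odd  = (0, {(0, "a"): 1, (1, "a"): 0}, {1})
print(distinguishes("aa", even, odd))   # True
```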

Lee and Yannakakis [112] showed how to use L∗ for conformance testing of B with a specification S. Suppose L∗ starts by inferring a model Mi; then we compute a string that distinguishes Mi from S and refine Mi through the algorithm. If, for i = n, Mn is S, then we declare B to be correct, otherwise faulty.

Apart from conformance testing, inference techniques have been used to guide test generation to focus on particular system behavior and to reduce the scope of analysis. For example, Li et al. applied L∗ to the integration testing of a system of black-box components [114]. Their analysis architecture derives a test oracle from a test suite by using L∗ to infer a model of the system from dynamic observation of the system's behavior; this model is then searched to find incorrect behaviors, such as deadlocks, and used to verify the system's behaviour under fuzz testing (Section 6).

To find concurrency issues in asynchronous black-box systems, Groz et al. proposed an approach that extracts behavioural models from systems through active learning techniques [78] and then performs reachability analysis on the models [27] to detect issues, notably races.

Further work in this context has been compiled by Shahbaz [166], with industrial applications. Similar applications of inference can be found in system analysis [21], [78], [135], [188], [189], component interaction testing [115], [122], regression testing [200], security testing [168] and verification [53], [77], [148].

Zheng et al. [208] extract item sets from web search queries and their results, then apply association rule mining to infer rules. From these rules, they construct derived test oracles for web search engines, which had been thought to be untestable. Image segmentation delineates objects of interest in an image; implementing segmentation programs is a tedious, iterative process. Frouchni et al. successfully apply semi-supervised machine learning to create test oracles for image segmentation programs [67]. Memon et al. [133], [134], [198] introduced and developed the GUITAR tool, which has been evaluated by treating the current version of the SUT as correct, inferring the specification, and then executing the generated test inputs. Artificial Neural Networks have also been applied to learn system behaviour and detect deviations from it [163], [164].

The majority of specification mining techniques adopt Finite State Machines as the output format to capture the functional behaviour of the SUT [21], [27], [53], [77], [78], [89], [112], [114], [135], [148], [166], [168], [189], sometimes extended with temporal constraints [188] or data constraints [115], [122], which are, in turn, inferred by Daikon [56]. Büchi automata have been used to check properties against black-box systems [148]. Annotated call trees have been used to represent the program behaviour of different versions in the regression testing context [200]. GUI widgets have been directly modelled with objects and properties for testing [133], [134], [198]. Artificial Neural Nets and machine learning classifiers have been used to learn the expected behaviour of the SUT [67], [163], [164]. For dynamic and fuzzy behaviours, such as the result of web search engine queries, association rules between input (query) and output (search result strings) have been used as the format of an inferred oracle [208].

5.5 Textual Documentation

Textual documentation ranges from natural language descriptions of requirements to structured documents detailing the functionalities of APIs. These documents describe the functionalities expected from the SUT to varying degrees, and can therefore serve as a basis for generating test oracles. They are usually informal, intended for other humans, not to support formal logical or mathematical reasoning. Thus, they are often partial and ambiguous, in contrast to specification languages. Their importance for test oracle construction rests on the fact that developers are more likely to write them than formal specifications. In other words, the documentation defines the constraints that the test oracle D, as defined in Section 2, enforces over testing activities.

At first sight, it may seem impossible to derive test oracles automatically, because natural languages are inherently ambiguous and textual documentation is often imprecise and inconsistent. The use of textual documentation has often been limited to humans in practical testing applications [144]. However, some partial automation can assist the human in testing using documentation as a source of test oracle information.

Two approaches have been explored. The first builds techniques to construct a formal specification out of an informal, textual artefact, such as an informal textual specification, user and developer documentation, or even source code comments. The second restricts a natural language to a semi-formal fragment amenable to automatic processing. Next, we present representative examples of each approach.

5.5.1 Converting Text into Specifications

Prowell and Poore [153] introduced a sequential enumeration method for developing a formal specification from an informal one. The method systematically enumerates all sequences from the input domain and maps the corresponding outputs to produce an arguably complete, consistent, and correct specification. However, it can suffer from an exponential explosion in the number of input/output sequences. Prowell and Poore employ abstraction techniques to control this explosion. The end result is a formal specification that can be transferred into a number of notations, e.g., state transition systems. A notable benefit of this approach is that it tends to discover many inconsistent and missing requirements, making the specification more complete and precise.

5.5.2 Restricting Natural Language

Restrictions on a natural language reduce complexities in its grammar and lexicon and allow the expression of requirements in a concise vocabulary with minimal ambiguity. This, in turn, eases the interpretation of documents and makes the automatic derivation of test oracles possible. The researchers who have proposed specification languages based on (semi-)formal subsets of a natural language are motivated by the fact that model-based specification languages have not seen wide-spread adoption, and believe the reason to be the inaccessibility of their formalism and set-theoretic underpinnings to the average programmer.

Schwitter introduced a computer-processable, restricted natural language called PENG [160]. It covers a strict subset of standard English with a restricted grammar and a domain-specific lexicon for content words and predefined function words. Documents written in PENG can be translated deterministically into first-order predicate logic. Schwitter et al. [30] provided guidelines for writing test scenarios in PENG that can automatically judge the correctness of program behaviours.

6 IMPLICIT TEST ORACLES

An implicit test oracle is one that relies on general, implicit knowledge to distinguish between a system's correct and incorrect behaviour. This generally true implicit knowledge includes such facts as "buffer overflows and segfaults are nearly always errors". The critical aspect of an implicit test oracle is that it requires neither domain knowledge nor a formal specification to implement, and it applies to nearly all programs.

An implicit test oracle can be built on any procedure that detects anomalies such as abnormal termination due to a crash or an execution failure [34], [167]. This is because such anomalies are blatant faults; that is, no more information is required to ascertain whether the program behaved correctly or not. Under our formalism, an implicit oracle defines a subset of stimulus and response relations as guaranteed failures, in some context.


Implicit test oracles are not universal. Behaviours abnormal for one system in one context may be normal for that system in a different context, or normal for a different system. Even crashing may be considered acceptable, or even desired, behaviour, as in systems designed to find crashes.

Research on implicit oracles is evident from early work in software engineering. The very first work in this context was related to deadlock, livelock and race detection to counter system concurrency issues [24], [107], [185], [16], [169]. Similarly, research on testing non-functional attributes has garnered much attention since the advent of the object-oriented paradigm. In performance testing, system throughput metrics can highlight degradation errors [121], [124], as when a server fails to respond when a number of requests are sent simultaneously. A case study by Weyuker and Vokolos showed how a process with excessive CPU usage caused service delays and disruptions [195]. Similarly, test oracles for memory leaks can be built on a profiling technique that detects dangling references during the run of a program [12], [57], [87], [211]. For example, Xie and Aiken proposed a boolean constraint system to represent the dynamically allocated objects in a program [201]. Their system raises an alarm when an object becomes unreachable but has not yet been deallocated.

Fuzzing is an effective way to find implicit anomalies, such as crashes [137]. The main idea is to generate random, or "fuzz", inputs and feed them to the system to find anomalies. This works because the implicit specification usually holds over all inputs, unlike explicit specifications, which tend to relate subsets of inputs to outputs. If an anomaly is detected, the fuzz tester reports it along with the input that triggers it. Fuzzing is commonly used to detect security vulnerabilities, such as buffer overflows, memory leaks, unhandled exceptions, and denial of service [18], [177].
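A minimal fuzzing loop with an implicit oracle might look like the following sketch (ours); `parse` is a hypothetical SUT, and an uncaught Python exception stands in for a crash.

```python
import random
import string

def fuzz(sut, trials=1000, max_len=100):
    """Generate random inputs and apply the implicit oracle: any
    uncaught exception is a failure, reported with its triggering input."""
    failures = []
    for _ in range(trials):
        data = "".join(random.choices(string.printable, k=random.randint(0, max_len)))
        try:
            sut(data)
        except Exception as exc:        # the implicit specification: no crashes
            failures.append((data, repr(exc)))
    return failures

# Hypothetical SUT that crashes on inputs containing a '{'.
def parse(text):
    if "{" in text:
        raise ValueError("unbalanced brace")
    return text

print(len(fuzz(parse)) > 0)    # very likely True
```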

Other work has focused on developing patterns to detect anomalies. For instance, Ricca and Tonella [155] considered a subset of the anomalies that Web applications can harbor, such as navigation problems and hyperlink inconsistencies. In their empirical study, 60% of the Web applications exhibited anomalies and execution failures.

7 THE HUMAN ORACLE PROBLEM

The above sections give solutions to the test oracle problem when some artefact exists that can serve as the foundation for either a full or partial test oracle. In many cases, however, no such artefact exists, so a human tester must verify whether software behaviour is correct given some stimuli. Despite the lack of an automated test oracle, software engineering research can still play a key role: finding ways to reduce the effort that the human tester has to expend in directly creating, or in being, the test oracle.

This effort is referred to as the Human Oracle Cost [126]. Research in this area aims to reduce the cost of human involvement along two dimensions: 1) writing test oracles and 2) evaluating test outcomes. Concerning the first dimension, the work of Staats et al. is representative. They seek to reduce the human oracle cost by guiding human testers to those parts of the code they need to focus on when writing test oracles [173]. This reduces the cost of test oracle construction, rather than reducing the cost of human involvement in testing in the absence of an automated test oracle. Additional recent work on test oracle construction includes Dodona, a tool that suggests oracle data to a human, who then decides whether to use it to define a test oracle realized as a Java unit test [116]. Dodona infers relations among program variables during execution, using network centrality analysis and data flow.

Research that seeks to reduce the human oracle cost broadly focuses on finding a quantitative reduction in the amount of work the tester has to do for the same amount of test coverage, or finding a qualitative reduction in the work needed to understand and evaluate test cases.


7.1 Quantitative Human Oracle Cost

Test suites can be unnecessarily large, covering few test goals in each individual test case. Additionally, the test cases themselves may be unnecessarily long, for example containing large numbers of method calls, many of which do not contribute to the overall test case. The goal of quantitative human oracle cost reduction is to reduce test suite and test case size so as to maximise the benefit of each test case and each component of that test case. This consequently reduces the amount of manual checking effort required of a human tester performing the role of a test oracle. Cast in terms of our formalism, quantitative reduction aims to partition the set of test activity sequences so the human need only consider representative sequences, while test case reduction aims to shorten test activity sequences.

7.1.1 Test Suite Reduction

Traditionally, test suite reduction has been applied as a post-processing step to an existing test suite, e.g. the work of Harrold et al. [85], Offutt et al. [141] and Rothermel et al. [157]. Recent work in the search-based testing literature has sought to combine test input generation and test suite reduction into one phase to produce smaller test suites.

Harman et al. proposed a technique for generating test cases that penetrate the deepest levels of the control dependence graph for the program, in order to create test cases that exercise as many elements of the program as possible [82]. Ferrer et al. [61] attack a multi-objective version of the problem in which they sought to simultaneously maximize branch coverage and minimize test suite size; their focus was not this problem per se, but its use to compare a number of multi-objective optimisation algorithms, including the well-known Non-dominated Sorting Genetic Algorithm II (NSGA-II), Strength Pareto EA 2 (SPEA2), and MOCell. On a series of randomly generated programs and small benchmarks, they found MOCell performed best.

Taylor et al. [178] use an inferred model as a semantic test oracle to shrink a test suite. Fraser and Arcuri [65] generate test suites for Java using their EvoSuite tool. By generating the entire suite at once, they are able to simultaneously maximize coverage and minimize test suite size, thereby aiding human oracles and alleviating the human oracle cost problem.

7.1.2 Test Case Reduction

When using randomised algorithms for generating test cases for object-oriented systems, individual test cases can generate very long traces very quickly, consisting of a large number of method calls that do not actually contribute to a specific test goal (e.g. the coverage of a particular branch). Such method calls unnecessarily increase test oracle cost, so Leitner et al. remove such calls [113] using Zeller's and Hildebrandt's Delta Debugging [206]. JWalk simplifies test sequences by removing side-effect-free functions from them, thereby reducing test oracle costs where the human is the test oracle [170]. Quick tests seek to efficiently spend a small test budget by building test suites whose execution is fast enough to be run after each compilation [76]. These quick tests must be likely to trigger bugs and therefore generate short traces, which, as a result, are easier for humans to comprehend.
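The flavour of such reduction can be seen in the sketch below, a simplified one-call-at-a-time variant of our own rather than Zeller and Hildebrandt's full ddmin; `still_fails` and the call names are hypothetical.

```python
def reduce_test_case(calls, still_fails):
    """Greedy simplification in the spirit of delta debugging: repeatedly
    drop any call whose removal preserves the failure. `calls` is a list
    of test-case steps and `still_fails` replays a candidate list and
    reports whether the failure is still observed."""
    reduced = list(calls)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            candidate = reduced[:i] + reduced[i + 1:]
            if still_fails(candidate):
                reduced = candidate
                changed = True
                break
    return reduced

# Hypothetical failure that only needs the calls "open" and "write".
fails = lambda seq: "open" in seq and "write" in seq
print(reduce_test_case(["open", "seek", "read", "write", "close"], fails))
# ['open', 'write']
```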

7.2 Qualitative Human Oracle Cost

Human oracle costs may also be minimised from a qualitative perspective, that is, the extent to which test cases, and more generally testing activities, may be easily understood and processed by a human. The input profile of a SUT is the distribution of inputs it actually processes when running in its operational environment. Learning an input profile requires domain knowledge. If such domain knowledge is not built into the test data generation process, machine-generated test data tend to be drawn from a different distribution over the SUT's inputs than its input profile. While this may be beneficial for trapping certain types of faults, the utility of the approach decreases when test oracle costs are taken into account, since the tester must invest time comprehending the scenario represented by test data in order to correctly evaluate the corresponding program output. Arbitrary inputs are much harder to understand than recognisable pieces of data, thus adding time to the checking process.

All approaches to qualitatively alleviating the human oracle cost problem incorporate human knowledge to improve the understandability of test cases. The three approaches we cover are 1) augmenting test suites designed by the developers; 2) computing "realistic" inputs from web pages, web services, and natural language; and 3) mining usage patterns to replicate them in the test cases.

In order to improve the readability of automatically generated test cases, McMinn et al. propose the incorporation of human knowledge into the test data generation process [126]. With search-based approaches, they proposed injecting this knowledge by "seeding" the algorithm with test cases that may have originated from a human source, such as a "sanity check" performed by the programmer, an already existing partial test suite, or input–output examples generated by programming paradigms that involve the developer in computation, like prorogued programming [2].

The generation of string test data is particularly problematic for automatic test data generators, which tend to generate nonsensical strings. These nonsensical strings are, of course, a form of fuzz testing (Section 6) and are good for exploring uncommon, shallow code paths and finding corner cases, but they are unlikely to exercise functionality deeper in a program's control flow. This is because string comparisons in control expressions are usually stronger than numerical comparisons, making one of a control point's branches much less likely to be traversed via uniform fuzzing. We see here the seminal computer science trade-off between breadth-first and depth-first search in the choice between fuzz testing with nonsensical inputs and testing with realistic inputs.

Bozkurt and Harman introduced the idea of mining web services for realistic test inputs, using the outputs of known and trusted test services as more realistic inputs to the service under test [32]. The idea is that realistic test cases are more likely to reveal faults that developers care about and to yield test cases that are more readily understood. McMinn et al. also mine the web for realistic test cases: they proposed mining strings from the web to assist in the test generation process [130]. Since web page content is generally the result of human effort, the strings contained therein tend to be real words or phrases with high degrees of semantic and domain-relevant context that can thus be used as sources of realistic test data.

Afshan et al. [1] combine a natural language model and metaheuristics, strategies that guide a search process [25], to help generate readable strings. The language model scores how likely a string is to belong to a language based on its character combinations. Incorporating this probability score into a fitness function, a metaheuristic search can not only cover test goals, but also generate string inputs that are more comprehensible than the arbitrary strings generated by the previous state of the art. Over a number of case studies, Afshan et al. found that human oracles more accurately and more quickly evaluated their test strings.
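The sketch below gives the flavour of such a combination: a character-bigram model, a much cruder stand-in for the language model used by Afshan et al., scores readability, and that score is folded into a fitness value that the search would minimise. The weighting, the toy coverage objective and the training text are illustrative assumptions, not the published setup.

```python
import math
from collections import Counter

def bigram_model(corpus):
    """Character-bigram model trained on example text; returns a function
    scoring how 'language-like' a string is (mean log-probability)."""
    counts = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus)) or 1
    def score(s):
        if len(s) < 2:
            return -10.0
        logp = sum(math.log((counts[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(s, s[1:]))
        return logp / (len(s) - 1)
    return score

def fitness(candidate, branch_distance, readability, weight=0.5):
    """Minimised by the search: the usual coverage objective (branch
    distance) minus a reward for readable strings."""
    return branch_distance(candidate) - weight * readability(candidate)

readability = bigram_model("the quick brown fox jumps over the lazy dog " * 20)
target = lambda s: abs(len(s) - 8)            # toy coverage objective
# True: the gibberish string gets the worse (higher) fitness value.
print(fitness("xq9#kk!z", target, readability) > fitness("lazy fox", target, readability))
```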

Fraser and Zeller [66] improve the familiarity of test cases by mining the software under test for common usage patterns of APIs. They then seek to replicate these patterns in generated test cases. In this way, the scenarios generated are more likely to be realistic and to represent actual usages of the software under test.

7.3 Crowdsourcing the Test Oracle

A recent approach to handling the lack of a test oracle is to outsource the problem to an online service to which large numbers of people can provide answers, i.e., through crowdsourcing. Pastore et al. [147] demonstrated the feasibility of the approach, but noted problems in presenting the test problem to the crowd such that it could be easily understood, and the need to provide sufficient code documentation so that the crowd could determine correct outputs from incorrect ones. In these experiments, crowdsourcing was performed by submitting tasks to a generic crowdsourcing platform, Amazon's Mechanical Turk (http://www.mturk.com). However, some dedicated crowdsourcing services now exist for the testing of mobile applications. They specifically address the problem of the exploding number of devices on which a mobile application may run, and which the developer or tester may not own, but which may be possessed by the crowd at large. Examples of these services include Mob4Hire (http://www.mob4hire.com), MobTest (http://www.mobtest.com) and uTest (http://www.utest.com/).

8 FUTURE DIRECTIONS AND CONCLUSION

This paper has provided a comprehensive survey of test oracles, covering specified, derived and implicit oracles, and techniques that cater for the absence of test oracles. The paper has also analyzed publication trends in the test oracle domain. This paper has necessarily focused on the traditional approaches to the test oracle problem. Much work on test oracles remains to be done. In addition to research deepening and interconnecting these approaches, the test oracle problem is open to new research directions. We close with a discussion of two of these that we find noteworthy and promising: test oracle reuse and test oracle metrics.

As this survey has shown, test oracles are difficult to construct. Oracle reuse is therefore an important problem that merits attention.


Two promising approaches to oracle reuse are generalizations of reliable reset and the sharing of oracular data across software product lines (SPLs). Generalizing reliable reset to arbitrary states allows the interconnection of different versions of a program, so we can build test oracles based on older versions of a program, using generalized reliable reset to ignore or handle new inputs and functionality. SPLs are sets of related versions of a system [47]. A product line can be thought of as a tree of related software products in which branches contain new alternative versions of the system, each of which shares some core functionality enjoyed by a base version. Research on test oracles should seek to leverage these SPL trees to define trees of test oracles that share oracular data where possible.

Work has already begun on using the test oracle as the measure of how well the program has been tested (a kind of test oracle coverage) [104], [176], [186], and on measures of oracles themselves, such as assessing the quality of assertions [159]. More work is needed. "Oracle metrics" is a challenge to, and an opportunity for, the "software metrics" community. In a world in which test oracles become more prevalent, it will be important for testers to be able to assess the features offered by alternative test oracles.

A repository of papers on test oracles accompanies this paper at http://crestweb.cs.ucl.ac.uk/resources/oracle repository.

9 ACKNOWLEDGEMENTS

We would like to thank Bob Binder for helpful information and discussions when we began work on this paper. We would also like to thank all who attended the CREST Open Workshop on the Test Oracle Problem (21–22 May 2012) at University College London, and gave feedback on an early presentation of the work. We are further indebted to the very many responses to our emails from authors cited in this survey, who provided several useful comments on an earlier draft of our paper.


REFERENCES

[1] Sheeva Afshan, Phil McMinn, and Mark Stevenson.Evolving readable string test inputs using a naturallanguage model to reduce human oracle cost. In In-ternational Conference on Software Testing, Verificationand Validation (ICST 2013). IEEE, March 2013.

[2] Mehrdad Afshari, Earl T. Barr, and Zhendong Su.Liberating the programmer with prorogued pro-gramming. In Proceedings of the ACM internationalsymposium on New ideas, new paradigms, and reflectionson programming and software, Onward! ’12, pages 11–26, New York, NY, USA, 2012. ACM. Track atOOPSLA/SPLASH’12.

[3] Wasif Afzal, Richard Torkar, and Robert Feldt. Asystematic review of search-based testing for non-functional system properties. Information and Soft-ware Technology, 51(6):957–976, 2009.

[4] Bernhard K. Aichernig. Automated black-box testingwith abstract VDM oracles. In SAFECOMP, pages250–259. Springer-Verlag, 1999.

[5] Shaukat Ali, Lionel C. Briand, Hadi Hemmati, andRajwinder Kaur Panesar-Walawege. A systematicreview of the application and empirical investigationof search-based test-case generation. IEEE Transac-tions on Software Engineering, pages 742–762, 2010.

[6] Nadia Alshahwan and Mark Harman. Automatedsession data repair for web application regressiontesting. In Proceedings of 2008 International Conferenceon Software Testing, Verification, and Validation, pages298–307. IEEE Computer Society, 2008.

[7] Dana Angluin. Learning regular sets from queriesand counterexamples. Inf. Comput., 75(2):87–106,1987.

[8] W. Araujo, L.C. Briand, and Y. Labiche. Enablingthe runtime assertion checking of concurrent con-tracts for the java modeling language. In SoftwareEngineering (ICSE), 2011 33rd International Conferenceon, pages 786–795, 2011.

[9] W. Araujo, L.C. Briand, and Y. Labiche. On theeffectiveness of contracts as test oracles in the detec-tion and diagnosis of race conditions and deadlocksin concurrent object-oriented software. In EmpiricalSoftware Engineering and Measurement (ESEM), 2011International Symposium on, pages 10–19, 2011.

[10] Shay Artzi, Michael D. Ernst, Adam Kiezun, CarlosPacheco, and Jeff H. Perkins. Finding the needlesin the haystack: Generating legal test inputs forobject-oriented programs. In 1st Workshop on Model-Based Testing and Object-Oriented Systems (M-TOOS),Portland, OR, October 23, 2006.

[11] Egidio Astesiano, Michel Bidoit, Helene Kirchner,Bernd Krieg-Bruckner, Peter D. Mosses, Donald San-nella, and Andrzej Tarlecki. CASL: the commonalgebraic specification language. Theor. Comput. Sci.,286(2):153–196, 2002.

[12] Todd M. Austin, Scott E. Breach, and Gurindar S.Sohi. Efficient detection of all pointer and arrayaccess errors. In PLDI, pages 290–301. ACM, 1994.

[13] A. Avizienis. The N-version approach to fault-

tolerant software. IEEE Transactions on Software En-gineering, 11:1491–1501, 1985.

[14] A. Avizienis and L. Chen. On the implementation ofN-version programming for software fault-toleranceduring execution. In Proceedings of the First Inter-national Computer Software and Application Conference(COMPSAC ’77), pages 149–155, 1977.

[15] Franz Baader and Tobias Nipkow. Term Rewritingand All That. Cambridge University Press, New York,NY, USA, 1998.

[16] A. F. Babich. Proving total correctness of parallelprograms. IEEE Trans. Softw. Eng., 5(6):558–574,November 1979.

[17] Luciano Baresi and Michal Young. Test oracles. Tech-nical Report CIS-TR-01-02, University of Oregon,Dept. of Computer and Information Science, August2001. http://www.cs.uoregon.edu/∼michal/pubs/oracles.html.

[18] Sofia Bekrar, Chaouki Bekrar, Roland Groz, andLaurent Mounier. Finding software vulnerabilitiesby smart fuzzing. In ICST, pages 427–430, 2011.

[19] Gilles Bernot. Testing against formal specifications:a theoretical view. In Proceedings of the InternationalJoint Conference on Theory and Practice of SoftwareDevelopment on Advances in Distributed Computing(ADC) and Colloquium on Combining Paradigms forSoftware Development (CCPSD): Vol. 2, TAPSOFT ’91,pages 99–119, New York, NY, USA, 1991. Springer-Verlag New York, Inc.

[20] Gilles Bernot, Marie Claude Gaudel, and BrunoMarre. Software testing based on formal specifica-tions: a theory and a tool. Softw. Eng. J., 6(6):387–405,November 1991.

[21] Antonia Bertolino, Paola Inverardi, Patrizio Pellic-cione, and Massimo Tivoli. Automatic synthesis ofbehavior protocols for composable web-services. InProceedings of ESEC/SIGSOFT FSE, ESEC/FSE 2009,pages 141–150, 2009.

[22] D. Beyer, T. Henzinger, R. Jhala, and R. Majumdar.Checking memory safety with Blast. In M. Cerioli,editor, Fundamental Approaches to Software Engineer-ing, 8th International Conference, FASE 2005, Held asPart of the Joint European Conferences on Theory andPractice of Software, ETAPS 2005, Edinburgh, UK, April4-8, 2005, Proceedings, volume 3442 of Lecture Notesin Computer Science, pages 2–18. Springer, 2005.

[23] R. Binder. Testing Object-Oriented Systems: Models,Patterns, and Tools. Addison-Wesley, 2000.

[24] A. Blikle. Proving programs by sets of compu-tations. In Mathematical Foundations of ComputerScience, pages 333–358. Springer, 1975.

[25] Christian Blum and Andrea Roli. Metaheuristics incombinatorial optimization: Overview and concep-tual comparison. ACM Computing Surveys (CSUR),35(3):268–308, 2003.

[26] Gregor V. Bochmann and Alexandre Petrenko. Pro-tocol testing: review of methods and relevance forsoftware testing. In Proceedings of the 1994 ACMSIGSOFT international symposium on Software testingand analysis, ISSTA ’94, pages 109–124. ACM, 1994.

[27] G.V. Bochmann. Finite state description of communication protocols. Computer Networks, 2(4):361–372, 1978.

[28] E. Borger, A. Cavarra, and E. Riccobene. Modelingthe dynamics of UML state machines. In AbstractState Machines-Theory and Applications, pages 167–186. Springer, 2000.

[29] Egon Borger. High level system design and analysisusing abstract state machines. In Proceedings of theInternational Workshop on Current Trends in AppliedFormal Method: Applied Formal Methods, FM-Trends98, pages 1–43, London, UK, UK, 1999. Springer-Verlag.

[30] Kathrin Bottger, Rolf Schwitter, Diego Molla, andDebbie Richards. Towards reconciling use cases viacontrolled language and graphical models. In INAP,pages 115–128, Berlin, Heidelberg, 2003. Springer-Verlag.

[31] F. Bouquet, C. Grandpierre, B. Legeard, F. Peureux,N. Vacelet, and M. Utting. A subset of preciseUML for model-based testing. In Proceedings of the3rd International Workshop on Advances in Model-BasedTesting, A-MOST ’07, pages 95–104, New York, NY,USA, 2007. ACM.

[32] Mustafa Bozkurt and Mark Harman. Automaticallygenerating realistic test input from web services. InIEEE 6th International Symposium on Service OrientedSystem Engineering (SOSE), pages 13–24, 2011.

[33] L. C. Briand, Y. Labiche, and H. Sun. Investigatingthe use of analysis contracts to improve the testa-bility of object-oriented code. Softw. Pract. Exper.,33(7):637–672, June 2003.

[34] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski,David L. Dill, and Dawson R. Engler. EXE: Auto-matically generating inputs of death. ACM Trans.Inf. Syst. Secur., 12(2):10:1–10:38, December 2008.

[35] J. Callahan, F. Schneider, S. Easterbrook, et al. Au-tomated software testing using model-checking. InProceedings 1996 SPIN workshop, volume 353. Cite-seer, 1996.

[36] F. T. Chan, T. Y. Chen, S. C. Cheung, M. F. Lau,and S. M. Yiu. Application of metamorphic testingin numerical analysis. In Proceedings of the IASTEDInternational Conference on Software Engineering, pages191–197, 1998.

[37] W.K. Chan, S.C. Cheung, and Karl R.P.H. Leung.A metamorphic testing approach for online testing ofservice-oriented software applications, chapter 7, pages2894–2914. IGI Global, 2009.

[38] W.K. Chan, S.C. Cheung, and K.R.P.H. Leung.Towards a metamorphic testing methodology forservice-oriented software applications. In QSIC,pages 470–476, September 2005.

[39] F. Chen, N. Tillmann, and W. Schulte. Discoveringspecifications. Technical Report MSR-TR-2005-146,Microsoft Research, October 2005.

[40] Huo Yan Chen, T. H. Tse, F. T. Chan, and T. Y.Chen. In black and white: an integrated approach toclass-level testing of object-oriented programs. ACMTrans. Softw. Eng. Methodol., 7:250–295, July 1998.

[41] Huo Yan Chen, T. H. Tse, and T. Y. Chen. TACCLE: amethodology for object-oriented software testing at

the class and cluster levels. ACM Trans. Softw. Eng.Methodol., 10(1):56–109, January 2001.

[42] T. Y. Chen, F.-C. Kuo, T. H. Tse, and Zhi Quan Zhou.Metamorphic testing and beyond. In Proceedingsof the International Workshop on Software Technologyand Engineering Practice (STEP 2003), pages 94–100,September 2004.

[43] Tsong Chen, Dehao Huang, Haito Huang, Tsun-Him Tse, Zong Yang, and Zhi Zhou. Metamorphictesting and its applications. In Proceedings of the 8thInternational Symposium on Future Software Technology,ISFST 2004, pages 310–319, 2004.

[44] Yoonsik Cheon. Abstraction in assertion-based testoracles. In Proceedings of the Seventh InternationalConference on Quality Software, pages 410–414, Wash-ington, DC, USA, 2007. IEEE Computer Society.

[45] Yoonsik Cheon and Gary T. Leavens. A simple andpractical approach to unit testing: The JML and JUnitway. In Proceedings of the 16th European Conference onObject-Oriented Programming, ECOOP ’02, pages 231–255, London, UK, 2002. Springer-Verlag.

[46] L.A. Clarke. A system to generate test data andsymbolically execute programs. Software Engineering,IEEE Transactions on, SE-2(3):215 – 222, sept. 1976.

[47] Paul C. Clements. Managing variability for softwareproduct lines: Working with variability mechanisms.In 10th International Conference on Software ProductLines (SPLC 2006), pages 207–208, Baltimore, Mary-land, USA, 2006. IEEE Computer Society.

[48] Markus Clermont and David Parnas. Using informa-tion about functions in selecting test cases. In Pro-ceedings of the 1st international workshop on Advancesin model-based testing, A-MOST ’05, pages 1–7, NewYork, NY, USA, 2005. ACM.

[49] David Coppit and Jennifer M. Haddox-Schatz. Onthe use of specification-based assertions as test ora-cles. In Proceedings of the 29th Annual IEEE/NASAon Software Engineering Workshop, pages 305–314,Washington, DC, USA, 2005. IEEE Computer Society.

[50] M. Davies and E. Weyuker. Pseudo-oracles for non-testable programs. In Proceedings of the ACM ’81Conference, pages 254–257, 1981.

[51] Laura K. Dillon. Automated support for testing anddebugging of real-time programs using oracles. SIG-SOFT Softw. Eng. Notes, 25(1):45–46, January 2000.

[52] Roong-Ko Doong and Phyllis G. Frankl. The AS-TOOT approach to testing object-oriented programs.ACM Trans. Softw. Eng. Methodol., 3:101–130, April1994.

[53] Edith Elkind, Blaise Genest, Doron Peled, andHongyang Qu. Grey-box checking. In FORTE, pages420–435, 2006.

[54] J. Ellsberger, D. Hogrefe, and A. Sarma. SDL: for-mal object-oriented language for communicating systems.Prentice Hall, 1997.

[55] M.D. Ernst, J.H. Perkins, P.J. Guo, S. McCamant,C. Pacheco, M.S. Tschantz, and C. Xiao. The Daikonsystem for dynamic detection of likely invariants.Science of Computer Programming, 69(1):35–45, 2007.

[56] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. Software Eng., 27(2):99–123, 2001.

[57] R.A. Eyre-Todd. The detection of dangling referencesin C++ programs. ACM Letters on Programming Lan-guages and Systems (LOPLAS), 2(1-4):127–134, 1993.

[58] R. Feldt. Generating diverse software versions withgenetic programming: an experimental study. Soft-ware, IEE Proceedings, 145, December 1998.

[59] Xin Feng, David Lorge Parnas, T. H. Tse, and TonyO’Callaghan. A comparison of tabular expression-based testing strategies. IEEE Trans. Softw. Eng.,37(5):616–634, September 2011.

[60] Xin Feng, David Lorge Parnas, and T.H. Tse. Tabularexpression-based testing strategies: A comparison.Testing: Academic and Industrial Conference Practice andResearch Techniques - MUTATION, 0:134, 2007.

[61] J. Ferrer, F. Chicano, and E. Alba. Evolutionaryalgorithms for the multi-objective test data gen-eration problem. Software: Practice and Experience,42(11):1331–1362, 2011.

[62] John S. Fitzgerald and Peter Gorm Larsen. ModellingSystems - Practical Tools and Techniques in SoftwareDevelopment (2. ed.). Cambridge University Press,2009.

[63] C. Flanagan and K. R. M. Leino. Houdini, anannotation assistant for ESC/Java. Lecture Notes inComputer Science, 2021:500–517, 2001.

[64] Robert W. Floyd. Assigning meanings to programs.In J. T. Schwartz, editor, Mathematical Aspects ofComputer Science, volume 19 of Symposia in AppliedMathematics, pages 19–32. American MathematicalSociety, Providence, RI, 1967.

[65] Gordon Fraser and Andrea Arcuri. Whole test suitegeneration. IEEE Transactions on Software Engineering,39(2):276–291, 2013.

[66] Gordon Fraser and Andreas Zeller. Exploiting com-mon object usage in test case generation. In Proceed-ings of the 2011 Fourth IEEE International Conferenceon Software Testing, Verification and Validation, ICST’11, pages 80–89. IEEE Computer Society, 2011.

[67] Kambiz Frounchi, Lionel C. Briand, Leo Grady, YvanLabiche, and Rajesh Subramanyan. Automatingimage segmentation verification and validation bylearning test oracles. Information and Software Tech-nology, 53(12):1337–1348, 2011.

[68] Pascale Gall and Agns Arnould. Formal specifica-tions and test: Correctness and oracle. In MagneHaveraaen, Olaf Owe, and Ole-Johan Dahl, editors,Recent Trends in Data Type Specification, volume 1130of Lecture Notes in Computer Science, pages 342–358.Springer Berlin Heidelberg, 1996.

[69] J. Gannon, P. McMullin, and R. Hamlet. Data ab-straction, implementation, specification, and testing.ACM Transactions on Programming Languages and Sys-tems (TOPLAS), 3(3):211–223, 1981.

[70] Angelo Gargantini and Elvinia Riccobene. ASM-based testing: Coverage criteria and automatic testsequence. Journal of Universal Computer Science,7(11):1050–1067, nov 2001.

[71] Stephen J. Garland, John V. Guttag, and James J.Horning. An overview of Larch. In Functional

Programming, Concurrency, Simulation and AutomatedReasoning, pages 329–348, 1993.

[72] Marie-Claude Gaudel. Testing from formal specifi-cations, a generic approach. In Proceedings of the 6thAde-Europe International Conference Leuven on ReliableSoftware Technologies, Ada Europe ’01, pages 35–48,London, UK, 2001. Springer-Verlag.

[73] Marie-Claude Gaudel and Perry R. James. Testingalgebraic data types and processes: A unifying the-ory. Formal Asp. Comput., 10(5-6):436–451, 1998.

[74] Gregory Gay, Sanjai Rayadurgam, and Mats Heim-dah. Improving the accuracy of oracle verdictsthrough automated model steering. In AutomatedSoftware Engineering (ASE 2014). ACM Press, 2014.

[75] Patrice Godefroid, Nils Klarlund, and Koushik Sen.DART: directed automated random testing. In PLDI,pages 213–223. ACM, 2005.

[76] Alex Groce, Mohammad Amin Alipour, ChaoqiangZhang, Yang Chen, and John Regehr. Cause reduc-tion for quick testing. In ICST, pages 243–252, 2014.

[77] Alex Groce, Doron Peled, and Mihalis Yannakakis.Amc: An adaptive model checker. In CAV, pages521–525, 2002.

[78] Roland Groz, Keqin Li, Alexandre Petrenko, andMuzammil Shahbaz. Modular system verificationby inference, testing and reachability analysis. InTestCom/FATES, pages 216–233, 2008.

[79] Zhongxian Gu, Earl T. Barr, David J. Hamilton, andZhendong Su. Has the bug really been fixed? In Pro-ceedings of the 2010 International Conference on SoftwareEngineering (ICSE’10). IEEE Computer Society, 2010.

[80] R. Guderlei and J. Mayer. Statistical metamorphictesting testing programs with random output bymeans of statistical hypothesis tests and metamor-phic testing. In QSIC, pages 404–409, October 2007.

[81] D. Harel. Statecharts: A visual formalism forcomplex systems. Science of computer programming,8(3):231–274, 1987.

[82] Mark Harman, Sung Gon Kim, Kiran Lakhotia, PhilMcMinn, and Shin Yoo. Optimizing for the num-ber of tests generated in search based test datageneration with an application to the oracle costproblem. In International Workshop on Search-BasedSoftware Testing (SBST 2010), pages 182–191. IEEE, 6April 2010.

[83] Mark Harman, Afshin Mansouri, and YuanyuanZhang. Search based software engineering: A com-prehensive analysis and review of trends techniquesand applications. Technical Report TR-09-03, Depart-ment of Computer Science, King’s College London,April 2009.

[84] Mark Harman, Phil McMinn, Jerffeson Teixeira deSouza, and Shin Yoo. Search based software engi-neering: Techniques, taxonomy, tutorial. In BertrandMeyer and Martin Nordio, editors, Empirical softwareengineering and verification: LASER 2009-2010, pages1–59. Springer, 2012. LNCS 7007.

[85] M. Jean Harrold, Rajiv Gupta, and Mary Lou Soffa.A methodology for controlling the size of a test suite.ACM Trans. Softw. Eng. Methodol., 2(3):270–285, July1993.


[86] Mary Jean Harrold, Gregg Rothermel, Kent Sayre,Rui Wu, and Liu Yi. An empirical investigationof the relationship between spectra differences andregression faults. Software Testing, Verification, andReliability, 10(3):171–194, 2000.

[87] David L. Heine and Monica S. Lam. A practical flow-sensitive and context-sensitive C and C++ memoryleak detector. In PLDI, pages 168–181. ACM, 2003.

[88] J. Henkel and A. Diwan. Discovering algebraicspecifications from Java classes. Lecture Notes inComputer Science, 2743:431–456, 2003.

[89] F.C. Hennie. Finite-state models for logical machines.Wiley, 1968.

[90] MarijnJ.H. Heule and Sicco Verwer. Software modelsynthesis using satisfiability solvers. Empirical Soft-ware Engineering, pages 1–32, 2012.

[91] R.M. Hierons. Oracles for distributed testing. Soft-ware Engineering, IEEE Transactions on, 38(3):629–641,2012.

[92] Robert M. Hierons. Verdict functions in testing witha fault domain or test hypotheses. ACM Trans. Softw.Eng. Methodol., 18(4):14:1–14:19, July 2009.

[93] Robert M. Hierons, Kirill Bogdanov, Jonathan P.Bowen, Rance Cleaveland, John Derrick, JeremyDick, Marian Gheorghe, Mark Harman, KalpeshKapoor, Paul Krause, Gerald Luttgen, Anthony J. H.Simons, Sergiy Vilkomir, Martin R. Woodward, andHussein Zedan. Using formal specifications tosupport testing. ACM Comput. Surv., 41:9:1–9:76,February 2009.

[94] Charles Anthony Richard Hoare. An AxiomaticBasis of Computer Programming. Communicationsof the ACM, 12:576–580, 1969.

[95] M. Holcombe. X-machines as a basis for dynamicsystem specification. Software Engineering Journal,3(2):69–76, 1988.

[96] Mike Holcombe and Florentin Ipate. Correct sys-tems: building a business process solution. SoftwareTesting Verification and Reliability, 9(1):76–77, 1999.

[97] Gerard J. Holzmann. The model checker SPIN. IEEETrans. Softw. Eng., 23(5):279–295, May 1997.

[98] W E Howden. A functional approach to pro-gram testing and analysis. IEEE Trans. Softw. Eng.,12(10):997–1005, October 1986.

[99] W.E. Howden. Theoretical and empirical studiesof program testing. IEEE Transactions on SoftwareEngineering, 4(4):293–298, July 1978.

[100] Merlin Hughes and David Stotts. Daistish: sys-tematic algebraic testing for OO programs in thepresence of side-effects. SIGSOFT Softw. Eng. Notes,21(3):53–61, May 1996.

[101] Daniel Jackson. Software Abstractions: Logic, Language,and Analysis. The MIT Press, 2006.

[102] Daniel Jackson. Software Abstractions: Logic, Language,and Analysis. The MIT Press, 2006.

[103] Claude Jard and Gregor v. Bochmann. An approachto testing specifications. Journal of Systems and Soft-ware, 3(4):315 – 323, 1983.

[104] D. Jeffrey and R. Gupta. Test case prioritizationusing relevant slices. In Computer Software and Appli-

cations Conference, 2006. COMPSAC ’06. 30th AnnualInternational, volume 1, pages 411–420, 2006.

[105] Ying Jin and David Lorge Parnas. Defining the meaning of tabular mathematical expressions. Sci. Comput. Program., 75(11):980–1000, November 2010.

[106] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[107] R. M. Keller. Formal verification of parallel programs. Communications of the ACM, 19(7):371–384, 1976.

[108] James C. King. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, July 1976.

[109] Kiran Lakhotia, Phil McMinn, and Mark Harman. An empirical investigation into branch coverage for C programs using CUTE and AUSTIN. Journal of Systems and Software, 83(12):2379–2391, 2010.

[110] Axel van Lamsweerde. Formal specification: a roadmap. In Proceedings of the Conference on The Future of Software Engineering, ICSE '00, pages 147–159, New York, NY, USA, 2000. ACM.

[111] K. Lano and H. Haughton. Specification in B: An Introduction Using the B Toolkit. Imperial College Press, 1996.

[112] D. Lee and M. Yannakakis. Principles and methods of testing finite state machines—a survey. Proceedings of the IEEE, 84(8):1090–1123, August 1996.

[113] A. Leitner, M. Oriol, A. Zeller, I. Ciupa, and B. Meyer. Efficient unit test case minimization. In Automated Software Engineering (ASE 2007), pages 417–420, Atlanta, Georgia, USA, 2007. ACM Press.

[114] Keqin Li, Roland Groz, and Muzammil Shahbaz. Integration testing of components guided by incremental state machine learning. In TAIC PART, pages 59–70, 2006.

[115] D. Lorenzoli, L. Mariani, and M. Pezze. Automatic generation of software behavioral models. In Proceedings of the 30th International Conference on Software Engineering (ICSE), 2008.

[116] Pablo Loyola, Matthew Staats, In-Young Ko, and Gregg Rothermel. Dodona: Automated oracle data set selection. In International Symposium on Software Testing and Analysis 2014, ISSTA 2014, 2014.

[117] D. Luckham and F. W. Henke. An overview of ANNA - a specification language for ADA. Technical report, Stanford University, 1984.

[118] Nancy A. Lynch and Mark R. Tuttle. An introduction to input/output automata. CWI Quarterly, 2:219–246, 1989.

[119] Patrícia D. L. Machado. On oracles for interpreting test results against algebraic specifications. In AMAST, pages 502–518. Springer-Verlag, 1999.

[120] Phil Maker. GNU Nana: improved support for assertion checking and logging in GNU C/C++, 1998. http://gnu.cs.pu.edu.tw/software/nana/.

[121] Haroon Malik, Hadi Hemmati, and Ahmed E. Hassan. Automatic detection of performance deviations in the load testing of large scale systems. In Proceedings of International Conference on Software Engineering - Software Engineering in Practice Track, ICSE 2013, page to appear, 2013.

[122] L. Mariani, F. Pastore, and M. Pezze. Dynamic analysis for diagnosing integration faults. IEEE Transactions on Software Engineering, 37(4):486–508, 2011.

[123] Bruno Marre. Loft: A tool for assisting selection of test data sets from algebraic specifications. In TAPSOFT, pages 799–800. Springer-Verlag, 1995.

[124] A. P. Mathur. Performance, effectiveness, and reliability issues in software testing. In COMPSAC, pages 604–605. IEEE, 1991.

[125] Johannes Mayer and Ralph Guderlei. Test oracles using statistical methods. In Proceedings of the First International Workshop on Software Quality, Lecture Notes in Informatics P-58, Köllen Druck+Verlag GmbH, pages 179–189. Springer, 2004.

[126] P. McMinn, M. Stevenson, and M. Harman. Reducing qualitative human oracle costs associated with automatically generated test data. In 1st International Workshop on Software Test Output Validation (STOV 2010), Trento, Italy, 13th July 2010, pages 1–4, 2010.

[127] Phil McMinn. Search-based software test data generation: a survey. Softw. Test. Verif. Reliab., 14(2):105–156, June 2004.

[128] Phil McMinn. Search-based failure discovery using testability transformations to generate pseudo-oracles. In Genetic and Evolutionary Computation Conference (GECCO 2009), pages 1689–1696. ACM Press, 8-12 July 2009.

[129] Phil McMinn. Search-based software testing: Past, present and future. In International Workshop on Search-Based Software Testing (SBST 2011), pages 153–163. IEEE, 21 March 2011.

[130] Phil McMinn, Muzammil Shahbaz, and Mark Stevenson. Search-based test input generation for string data types using the results of web queries. In ICST, pages 141–150, 2012.

[131] Phil McMinn, Mark Stevenson, and Mark Harman. Reducing qualitative human oracle costs associated with automatically generated test data. In International Workshop on Software Test Output Validation (STOV 2010), pages 1–4. ACM, 13 July 2010.

[132] Atif M. Memon. Automatically repairing event sequence-based GUI test suites for regression testing. ACM Transactions on Software Engineering and Methodology, 18(2):1–36, 2008.

[133] Atif M. Memon, Martha E. Pollack, and Mary Lou Soffa. Automated test oracles for GUIs. In SIGSOFT '00/FSE-8: Proceedings of the 8th ACM SIGSOFT international symposium on Foundations of software engineering, pages 30–39, New York, NY, USA, 2000. ACM Press.

[134] Atif M. Memon and Qing Xie. Using transient/persistent errors to develop automated test oracles for event-driven software. In ASE '04: Proceedings of the 19th IEEE international conference on Automated software engineering, pages 186–195, Washington, DC, USA, 2004. IEEE Computer Society.

[135] Maik Merten, Falk Howar, Bernhard Steffen, Patrizio Pelliccione, and Massimo Tivoli. Automated inference of models for black box systems based on interface descriptions. In Proceedings of the 5th international conference on Leveraging Applications of Formal Methods, Verification and Validation: technologies for mastering change - Volume Part I, ISoLA'12, pages 79–96. Springer-Verlag, 2012.

[136] Bertrand Meyer. Eiffel: A language and environment for software engineering. The Journal of Systems and Software, 8(3):199–246, June 1988.

[137] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12):32–44, 1990.

[138] S. Mouchawrab, L. C. Briand, Y. Labiche, and M. Di Penta. Assessing, comparing, and combining state machine-based testing and structural testing: A series of experiments. Software Engineering, IEEE Transactions on, 37(2):161–187, 2011.

[139] Christian Murphy, Kuang Shen, and Gail Kaiser. Automatic system testing of programs without test oracles. In ISSTA, pages 189–200. ACM Press, 2009.

[140] Christian Murphy, Kuang Shen, and Gail Kaiser. Using JML runtime assertion checking to automate metamorphic testing in applications without test oracles. 2009 International Conference on Software Testing, Verification and Validation, pages 436–445, 2009.

[141] A. J. Offutt, J. Pan, and J. M. Voas. Procedures for reducing the size of coverage-based test sets. In International Conference on Testing Computer Software, pages 111–123, 1995.

[142] C. Pacheco and M. Ernst. Eclat: Automatic generation and classification of test inputs. ECOOP 2005 - Object-Oriented Programming, pages 734–734, 2005.

[143] D. L. Parnas, J. Madey, and M. Iglewski. Precise documentation of well-structured programs. IEEE Transactions on Software Engineering, 20(12):948–976, 1994.

[144] David Lorge Parnas. Document based rational software development. Journal of Knowledge Based Systems, 22:132–141, April 2009.

[145] David Lorge Parnas. Precise documentation: The key to better software. In Sebastian Nanz, editor, The Future of Software Engineering, pages 125–148. Springer Berlin Heidelberg, 2011.

[146] David Lorge Parnas and Jan Madey. Functional documents for computer systems. Sci. Comput. Program., 25(1):41–61, October 1995.

[147] F. Pastore, L. Mariani, and G. Fraser. Crowdoracles: Can the crowd solve the oracle problem? In Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), 2013.

[148] Doron Peled, Moshe Y. Vardi, and Mihalis Yannakakis. Black box checking. Journal of Automata, Languages and Combinatorics, 7(2), 2002.

[149] D. K. Peters and D. L. Parnas. Using test oracles generated from program documentation. IEEE Transactions on Software Engineering, 24(3):161–173, 1998.

[150] Dennis K. Peters and David Lorge Parnas. Requirements-based monitors for real-time systems. IEEE Trans. Softw. Eng., 28(2):146–158, February 2002.

[151] Mauro Pezze and Cheng Zhang. Automated test oracles: A survey. In Ali Hurson and Atif Memon, editors, Advances in Computers, volume 95, pages 1–48. Elsevier Ltd., 2014.

[152] Nadia Polikarpova, Ilinca Ciupa, and Bertrand Meyer. A comparative study of programmer-written and automatically inferred contracts. In ISSTA, pages 93–104. ACM, 2009.

[153] S. J. Prowell and J. H. Poore. Foundations of sequence-based software specification. Software Engineering, IEEE Transactions on, 29(5):417–429, 2003.

[154] Sam Ratcliff, David R. White, and John A. Clark. Searching for invariants using genetic programming and mutation testing. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, GECCO '11, pages 1907–1914, New York, NY, USA, 2011. ACM.

[155] F. Ricca and P. Tonella. Detecting anomaly and failure in web applications. MultiMedia, IEEE, 13(2):44–51, April-June 2006.

[156] D. S. Rosenblum. A practical approach to programming with assertions. Software Engineering, IEEE Transactions on, 21(1):19–31, 1995.

[157] G. Rothermel, M. J. Harrold, J. von Ronne, and C. Hong. Empirical studies of test-suite reduction. Software Testing, Verification and Reliability, 12:219–249, 2002.

[158] N. Walkinshaw, S. Ali, and K. Bogdanov. A comparative study of methods for dynamic reverse-engineering of state models. Technical Report CS-07-16, The University of Sheffield, Department of Computer Science, October 2007. http://www.dcs.shef.ac.uk/intranet/research/resmes/CS0716.pdf.

[159] David Schuler and Andreas Zeller. Assessing oracle quality with checked coverage. In Proceedings of the 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, ICST '11, pages 90–99, Washington, DC, USA, 2011. IEEE Computer Society.

[160] R. Schwitter. English as a formal specification language. In Database and Expert Systems Applications, 2002. Proceedings. 13th International Workshop on, pages 228–232, September 2002.

[161] Sergio Segura, Robert M. Hierons, David Benavides, and Antonio Ruiz-Cortes. Automated metamorphic testing on the analyses of feature models. Information and Software Technology, 53(3):245–258, 2011.

[162] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: a concolic unit testing engine for C. In ESEC/FSE, pages 263–272. ACM, 2005.

[163] Seyed Shahamiri, Wan Wan-Kadir, Suhaimi Ibrahim, and Siti Hashim. Artificial neural networks as multi-networks automated test oracle. Automated Software Engineering, 19(3):303–334, 2012.

[164] Seyed Reza Shahamiri, Wan Mohd Nasir Wan Kadir, Suhaimi Ibrahim, and Siti Zaiton Mohd Hashim. An automated framework for software test oracle. Information and Software Technology, 53(7):774–788, 2011.

[165] Seyed Reza Shahamiri, Wan Mohd Nasir Wan-Kadir, and Siti Zaiton Mohd Hashim. A comparative study on automated software test oracle methods. In ICSEA, pages 140–145, 2009.

[166] M. Shahbaz. Reverse Engineering and Testing of Black-Box Software Components. LAP Lambert Academic Publishing, 2012.

[167] K. Shrestha and M. J. Rutherford. An empirical evaluation of assertions as oracles. In ICST, pages 110–119. IEEE, 2011.

[168] Guoqiang Shu, Yating Hsu, and David Lee. Detecting communication protocol security flaws by formal fuzz testing and machine learning. In FORTE, pages 299–304, 2008.

[169] J. Sifakis. Deadlocks and livelocks in transition systems. Mathematical Foundations of Computer Science 1980, pages 587–600, 1980.

[170] A. J. H. Simons. JWalk: a tool for lazy systematic testing of Java classes by design introspection and user interaction. Automated Software Engineering, 14(4):369–418, December 2007.

[171] Rishabh Singh, Dimitra Giannakopoulou, and Corina S. Pasareanu. Learning component interfaces with may and must abstractions. In Tayssir Touili, Byron Cook, and Paul Jackson, editors, Computer Aided Verification, 22nd International Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010. Proceedings, volume 6174 of Lecture Notes in Computer Science, pages 527–542. Springer, 2010.

[172] J. Michael Spivey. Z Notation - a reference manual (2. ed.). Prentice Hall International Series in Computer Science. Prentice Hall, 1992.

[173] M. Staats, G. Gay, and M. P. E. Heimdahl. Automated oracle creation support, or: How I learned to stop worrying about fault propagation and love mutation testing. In Proceedings of the 34th International Conference on Software Engineering, ICSE 2012, pages 870–880, 2012.

[174] M. Staats, M. W. Whalen, and M. P. E. Heimdahl. Programs, tests, and oracles: the foundations of testing revisited. In ICSE, pages 391–400. IEEE, 2011.

[175] Matt Staats, Shin Hong, Moonzoo Kim, and Gregg Rothermel. Understanding user understanding: determining correctness of generated program invariants. In ISSTA, pages 188–198. ACM, 2012.

[176] Matt Staats, Pablo Loyola, and Gregg Rothermel. Oracle-centric test case prioritization. In Proceedings of the 23rd International Symposium on Software Reliability Engineering, ISSRE 2012, pages 311–320, 2012.

[177] Ari Takanen, Jared DeMott, and Charlie Miller. Fuzzing for Software Security Testing and Quality Assurance. Artech House, Inc., Norwood, MA, USA, 1 edition, 2008.

[178] Ramsay Taylor, Mathew Hall, Kirill Bogdanov, and John Derrick. Using behaviour inference to optimise regression test sets. In Brian Nielsen and Carsten Weise, editors, Testing Software and Systems - 24th IFIP WG 6.1 International Conference, ICTSS 2012, Aalborg, Denmark, November 19-21, 2012. Proceedings, volume 7641 of Lecture Notes in Computer Science, pages 184–199. Springer, 2012.

[179] A. Tiwari. Formal semantics and analysis methods for Simulink Stateflow models. Technical report, SRI International, 2002. http://www.csl.sri.com/users/tiwari/html/stateflow.html.

[180] Jan Tretmans. Test generation with inputs, outputs and repetitive quiescence. Software - Concepts and Tools, 17(3):103–120, 1996.

[181] Alan M. Turing. Checking a large routine. In Report of a Conference on High Speed Automatic Calculating Machines, pages 67–69, Cambridge, England, June 1949. University Mathematical Laboratory.

[182] Mark Utting and Bruno Legeard. Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.

[183] Mark Utting, Alexander Pretschner, and Bruno Legeard. A taxonomy of model-based testing approaches. Softw. Test. Verif. Reliab., 22(5):297–312, August 2012.

[184] G. v. Bochmann, C. He, and D. Ouimet. Protocol testing using automatic trace analysis. In Proceedings of Canadian Conference on Electrical and Computer Engineering, pages 814–820, 1989.

[185] A. van Lamsweerde and M. Sintzoff. Formal derivation of strongly correct concurrent programs. Acta Informatica, 12(1):1–31, 1979.

[186] J. M. Voas. PIE: a dynamic failure-based technique. Software Engineering, IEEE Transactions on, 18(8):717–727, 1992.

[187] N. Walkinshaw, K. Bogdanov, J. Derrick, and J. Paris. Increasing functional coverage by inductive testing: A case study. In Alexandre Petrenko, Adenilso da Silva Simao, and Jose Carlos Maldonado, editors, Testing Software and Systems - 22nd IFIP WG 6.1 International Conference, ICTSS 2010, Natal, Brazil, November 8-10, 2010. Proceedings, volume 6435 of Lecture Notes in Computer Science, pages 126–141. Springer, 2010.

[188] Neil Walkinshaw and Kirill Bogdanov. Inferring finite-state models with temporal constraints. In ASE, pages 248–257, 2008.

[189] Neil Walkinshaw, John Derrick, and Qiang Guo. Iterative refinement of reverse-engineered models by model-based testing. In FM, pages 305–320, 2009.

[190] Yabo Wang and David Lorge Parnas. Trace rewriting systems. In Michael Rusinowitch and Jean-Luc Remy, editors, Conditional Term Rewriting Systems, volume 656 of Lecture Notes in Computer Science, pages 343–356. Springer Berlin Heidelberg, 1993.

[191] Yabo Wang and D. L. Parnas. Simulating the behaviour of software modules by trace rewriting. In Software Engineering, 1993. Proceedings., 15th International Conference on, pages 14–23, 1993.

[192] Yi Wei, Carlo A. Furia, Nikolay Kazmin, and Bertrand Meyer. Inferring better contracts. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 191–200, New York, NY, USA, 2011. ACM.

[193] Yi Wei, H. Roth, C. A. Furia, Yu Pei, A. Horton, M. Steindorfer, M. Nordio, and B. Meyer. Stateful testing: Finding more errors in code and contracts. In Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on, pages 440–443, 2011.

[194] E. J. Weyuker. Assessing test data adequacy through program inference. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(4):641–655, 1983.

[195] E. J. Weyuker and F. I. Vokolos. Experience with performance testing of software systems: issues, an approach, and case study. Software Engineering, IEEE Transactions on, 26(12):1147–1156, 2000.

[196] Elaine J. Weyuker. On testing non-testable programs. The Computer Journal, 25(4):465–470, November 1982.

[197] Jeannette M. Wing. A specifier's introduction to formal methods. IEEE Computer, 23(9):8–24, 1990.

[198] Qing Xie and Atif M. Memon. Designing and comparing automated test oracles for GUI-based software applications. ACM Transactions on Software Engineering and Methodology, 16(1):4, 2007.

[199] Tao Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proc. 20th European Conference on Object-Oriented Programming (ECOOP 2006), pages 380–403, July 2006.

[200] Tao Xie and David Notkin. Checking inside the black box: Regression testing by comparing value spectra. IEEE Transactions on Software Engineering, 31(10):869–883, October 2005.

[201] Yichen Xie and Alex Aiken. Context- and path-sensitive memory leak detection. ACM SIGSOFT Software Engineering Notes, 30(5):115–125, 2005.

[202] Zhihong Xu, Yunho Kim, Moonzoo Kim, Gregg Rothermel, and Myra B. Cohen. Directed test suite augmentation: techniques and tradeoffs. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, FSE '10, pages 257–266, New York, NY, USA, 2010. ACM.

[203] Shin Yoo. Metamorphic testing of stochastic optimisation. In Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW '10, pages 192–201. IEEE Computer Society, 2010.

[204] Shin Yoo and Mark Harman. Regression testing minimisation, selection and prioritisation: A survey. Software Testing, Verification, and Reliability, 22(2):67–120, March 2012.

[205] Bo Yu, Liang Kong, Yufeng Zhang, and Hong Zhu. Testing Java components based on algebraic specifications. In Software Testing, Verification, and Validation, 2008 International Conference on, pages 190–199, 2008.

[206] A. Zeller and R. Hildebrandt. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering, 28(2):183–200, 2002.

[207] S. Zhang, D. Saff, Y. Bu, and M. D. Ernst. Combined static and dynamic automated test generation. In ISSTA, volume 11, pages 353–363, 2011.

[208] Wujie Zheng, Hao Ma, Michael R. Lyu, Tao Xie, and Irwin King. Mining test oracles of web search engines. In Proc. 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), Short Paper, ASE 2011, pages 408–411, 2011.

[209] Zhi Quan Zhou, ShuJia Zhang, Markus Hagenbuchner, T. H. Tse, Fei-Ching Kuo, and T. Y. Chen. Automated functional testing of online search services. Software Testing, Verification and Reliability, 22(4):221–243, 2012.

[210] Hong Zhu. A note on test oracles and semantics of algebraic specifications. In Proceedings of the 3rd International Conference on Quality Software, QSIC 2003, pages 91–98, 2003.

[211] B. Zorn and P. Hilfinger. A memory allocation profiler for C and Lisp programs. In Proceedings of the Summer USENIX Conference, pages 223–237, 1988.