
From Daikon to Agitator: Lessons and Challenges in Building a Commercial Tool for Developer Testing

Marat Boshernitsan
[email protected]

Roongko Doong
[email protected]

Alberto Savoia
[email protected]

Agitar Software, Inc.
1350 Villa Street
Mountain View, CA 94041, USA

ABSTRACT

Developer testing is one of the most effective strategies for improving the quality of software, reducing its cost, and accelerating its development. Despite its widely recognized benefits, developer testing is practiced by only a minority of developers. The slow adoption of developer testing is primarily due to the lack of tools that automate some of the more tedious and time-consuming aspects of this practice. Motivated by the need for a solution, and helped and inspired by the research in software test automation, we created a developer testing tool based on software agitation. Software agitation is a testing technique that combines the results of research in test-input generation and dynamic invariant detection. We implemented software agitation in a commercial testing tool called Agitator. This paper gives a high-level overview of software agitation and its implementation in Agitator, focusing on the lessons and challenges of leveraging and applying the results of research to the implementation of a commercial product.

Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debugging - testing tools.

General Terms

Algorithms, Performance, Reliability.

Keywords

Developer testing, unit testing, automated testing tools, software agitation, dynamic invariant detection, test-input generation, technology transfer.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISSTA'06, July 17-20, 2006, Portland, Maine, USA.
Copyright 2006 ACM 1-59593-263-1/06/0007 ...$5.00.

1. INTRODUCTION

Researchers and practitioners alike agree that developer testing is one of the most effective strategies for improving the quality of software and ensuring timely completion of software projects [22]. Developer testing requires developers to commit to testing their own code prior to integration with the rest of the system. Typically this type of testing is accomplished by creating and running small unit tests that focus on isolated behavior and do not require running the entire software system. Developer testing is not intended to replace system testing or integration testing that are typically practiced by the Quality Assurance (QA) organization. Rather, developer testing is designed to make it possible for the QA organization to focus on identifying system- and integration-level defects. This is critical for two reasons. The first reason is that the time and the cost of detecting and fixing a unit-level defect is orders of magnitude greater during system or integration testing than during development [34]. The second reason is that any time spent by the QA organization on unit-level defects either precludes or limits system- and integration-level testing.

The evidence suggests that the practice of developer testing finds limited and inconsistent adoption among developers. We believe that this stems both from a cultural problem—historically, developers have not been held responsible for testing their own code—and from a lack of adequate developer testing tools. Fortunately, there is reason for optimism. Agile software development practices, which have been growing in popularity in the last few years, consider developer testing an integral and important part of the development process [2]. A new breed of unit-testing frameworks (such as JUnit [15] and NUnit [17]) has helped to standardize the way unit tests are written and executed. But even with these frameworks, the cost of developing unit tests is high, both for individual developers and for development organizations. Tests have to be written by hand, and the combinatorial nature of software testing requires a significant amount of test code to achieve adequate application code coverage. It is difficult for developers to switch modes from development activities—mostly constructive and focused—to testing activities—mostly destructive and exploratory. Thinking about all the ways their freshly-written code could be wrong, broken, or incomplete is not natural for most developers. This is one of the most cited reasons for letting developers focus on development and QA engineers on testing.


A proper developer testing tool can address and minimize many obstacles to a more widespread adoption of developer testing. To achieve this, a developer testing tool must be effective in three areas. First, it must automate most of the tasks that don't require human intelligence, creativity, or insight. Second, it should facilitate and accelerate the developer's mental transition from development mode to testing mode. The tool has to present interesting input conditions that the developer may not have considered, and help the developer explore and discover the code's actual behavior under those conditions. Third, it should automatically generate a reusable set of tests to prevent regressions, once the developer is satisfied with the breadth and depth of exploration, has achieved a sufficient level of insight into the code, and has corrected any undesired behavior.

Various aspects of software test automation and the necessary static and dynamic code analyses have been popular research topics. The body of work in these areas already contains the core concepts and ideas necessary for building highly-effective developer testing tools. Yet, the benefits of much of this research have not reached the broader development community in a form that could be used in their day-to-day work. At Agitar Software, we leveraged several key test automation ideas from research, integrated them in a technique called software agitation, and built a novel developer testing product. This product, called Agitator, makes developer testing of Java programs practical and effective.

Software agitation brings two research ideas—dynamic invariant detection and test-input generation—to bear on the task of automating developer testing. Dynamic detection of likely program invariants was pioneered by Ernst in the Daikon invariant detector [6]. We adapted Ernst's work to the problem of developer testing. Automated generation, selection, and optimization of test-input values has been an active area in testing research. We combined some of the well-known ideas from that body of work with human-guided test-input creation to provide rich test inputs that enable automatic detection of invariants.

In moving these research ideas into practice, we faced many challenges. Dealing with these challenges involved understanding the trade-offs between different design issues such as completeness, correctness, usability, and performance. This paper presents some of these issues, summarizes our solutions, and discusses the lessons we learned while applying research ideas in the industrial setting.

The rest of this paper is organized as follows. In Section 2 we describe software agitation in more detail. We outline its research origins and summarize a number of contributing technologies that inspired the development of Agitator. Section 3 discusses some of the challenges that must be addressed in applying this research technology to "industrial-grade" software systems. Section 4 presents a high-level discussion of software agitation in Agitator. We describe the tool both at the architectural level and from the user's (developer's) perspective. We also present several of the lessons learned while building and deploying Agitator, including some of the surprising discoveries of how the use of Agitator affects software development habits. Section 5 summarizes our experience in using Agitator on itself. Section 6 presents some future directions and discusses a number of areas that warrant further research exploration.

Figure 1: Overview of software agitation workflow.

2. SOFTWARE AGITATION

Software agitation is a unit-testing technique for automatically exercising individual units of source code in isolation from the rest of the system. Figure 1 presents a high-level view of the agitation workflow. Having finished a coding task, the developer invokes software agitation. The result of the agitation is a set of observations about the code's behavior. The developer checks observations to see if they reveal any bugs in the code and, if so, fixes those bugs and repeats the process. If an observation represents desired behavior, it can be "promoted" to an assertion. These assertions are checked during subsequent agitations to detect regressions from the desired behavior. Thus, software agitation does not entail test generation. Rather, it is a technique for exploring the implemented behavior of the code under test. When this behavior is desirable, it is the developer's responsibility to create an assertion (a test) that expresses that behavior. Software agitation reduces the grunt work, but it does not solve the test-oracle problem—this responsibility is delegated to the developer.

Broadly, software agitation comprises three phases: (1) creating instances of the classes being exercised, (2) calling all of the methods of those classes with a wide variety of input data, and (3) recording the results for subsequent analysis. The resulting observations represent the observed behavior of the unit under test. Observations take the form of relationships between various values in the source code that were determined to hold under a variety of different inputs. For example, for a method computing the maximum of two values a and b, agitation might determine that max(a, b) ≥ a ∧ max(a, b) ≥ b.
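To make this concrete, here is a small illustrative sketch (the class and method below are ours, not code shipped with Agitator) of a unit under test and the kind of observations agitation might report for it, written as boolean Java expressions in the style of Section 4.1:

    // Hypothetical unit under test, used only to illustrate the kind of
    // observations that agitation produces.
    public class MathUtil {
        public static int max(int a, int b) {
            return (a >= b) ? a : b;
        }
    }

    // Observations that agitation might report for max(a, b):
    //   max(a, b) >= a
    //   max(a, b) >= b
    //   max(a, b) == a || max(a, b) == b
    // Promoting any of these turns it into an assertion that is re-checked
    // during subsequent agitations.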

The development of software agitation was inspired by Ernst's work on dynamic invariant detection [6]. Dynamic invariant detection discovers program invariants from execution traces that are extracted by executing a system on a variety of valid test inputs. In their ICSE'99 paper [7], Ernst et al. demonstrate that Daikon dynamic invariant detection can infer many of the same invariants that a diligent programmer would write as part of a program's specification.

We extended the Daikon work to the domain of testing by noticing that dynamic invariant detection can be used for developer testing in two ways. First, the inferred invariants represent the observable behavior of the unit under test. Developers can check these invariants to determine whether the observed behavior coincides with the desired behavior. This facilitates early detection of defects. Second, developers can capture the invariants as part of the specification for the unit under test. We call this process "promotion of observations to assertions." Subsequent agitations can check these assertions, creating the basis for automated regression testing. Developers can also add their own assertions to complement those discovered during agitation. These assertions become part of the testing suite.

One of Daikon's limitations is that it requires a broad range of input data values to generate a statistically significant sample of observed invariants. Automatic generation of test-input data has been a well-studied area of research. We combined several well-known techniques, such as symbolic execution, constraint solving, heuristic- and feedback-driven random input generation, and human input specification.

A similar application of dynamic invariant detection to testing was also independently proposed by Xie and Notkin in their ASE'03 paper [37]. (By that time Agitator was already in early beta testing, but not yet available to the public.) Xie and Notkin use an existing unit-test suite to exercise the program under test using Daikon, which was modified to generate design-by-contract (DbC) annotations for a commercial test generation tool. The DbC annotations represent pre- and post-conditions on individual methods in the program. The test generator attempts to generate additional unit tests that violate DbC invariants. Then, the developers examine the generated tests for potential inclusion in the unit-test suite.

Like software agitation, Xie and Notkin's tool is intended for interactive use. As in our approach, one of their major goals when generating test data is to challenge and violate assertions. Yet, unlike Xie and Notkin's approach, software agitation does not require any pre-existing tests. Software agitation also delegates more control to developers: it is up to the developer to decide which of the inferred invariants contribute to the specification.

Software agitation is a testing technique with a rich research heritage. It combines a number of well-researched ideas and applies them to the developer testing problem. In bringing software agitation to market we are indebted to the many researchers who have influenced and inspired our work.

3. CHALLENGES AND REQUIREMENTS

This section presents several design and implementation challenges that we faced while developing a commercial testing tool based on software agitation.

3.1 Usability

Software agitation involves several new concepts. Because the developer plays an essential role in the software agitation workflow, a software agitation tool must provide an interface that makes it easy to understand these new concepts and that fosters communication between a developer and the tool. This realization leads to several important conclusions.

Speak the developer's language. While it may be tempting to present discovered invariants to the developers in the form of axioms in first-order logic, it is likely that developers will not be able to connect the invariants to the behavior of their code. Likewise, we cannot expect developers to learn a new specification language or expend any significant effort on figuring out the tool.

These conclusions are partly due to Doong and Savoia's earlier work on the Assertion Definition Language (ADL) [26].

Despite its many innovations, the final report on the first four years of ADL research concludes that "developer productivity in producing [a test suite] was lower than would be expected from manual development [...] due to a very steep learning curve for ADL." [4]

Make no assumptions about the developer's expertise. Some developers may have experience programming with invariants and pre- and post-conditions; others may not. A software agitation tool must provide scaffolding for those developers that need it, while not inhibiting the work of more experienced developers.

Integrate into the established workflow. Software agitation requires developers to adjust some of their routines and habits. For example, when using software agitation the developers switch from a traditional edit/compile/debug cycle to an edit/compile/agitate/review/debug cycle. We discovered that it is essential for developers to be able to perform the agitate phase entirely within their preferred development environment. This implies that, in order to maximize adoption, a software agitation tool must be a fully integrated extension of the development environment and it must use the idioms and conventions of that environment.

Support refactoring. Helped by the automated support in most modern IDEs, code refactoring has become a very common practice. A software agitation tool must be able to track and accommodate refactoring operations to preserve developers' investment in assertions and test data.

Support mock objects. When the code under test relies on complex objects or layers of infrastructure, it is often impractical and time-consuming to set up an isolated environment for testing that code. Mock objects that conform to the interfaces of real objects but only implement partial functionality are an important component of a unit-testing strategy [32]. A software agitation tool must support and automatically generate mock objects when necessary.
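As a small illustration of the idea (the interface and class below are ours, not code generated by Agitator), a hand-written mock of a mail-sending collaborator might look like this:

    // Hypothetical collaborator interface and a minimal hand-written mock.
    // A mock conforms to the real interface but only records interactions
    // instead of performing real work.
    interface MailService {
        void send(String recipient, String body);
    }

    class MockMailService implements MailService {
        int sendCount = 0;           // how many times send() was invoked
        String lastRecipient = null; // the most recent recipient seen

        public void send(String recipient, String body) {
            sendCount++;             // no real mail is sent
            lastRecipient = recipient;
        }
    }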

3.2 Applicability

A software agitation tool must work with a variety of code bases. In contrast to the relatively small and self-contained examples described in published research, real-world applications are large collections of often nasty code. These applications may have started with an elegant design but, over time, have organically evolved into systems that make testing a considerable challenge, even for a human. It is not unusual to find poorly-structured code where a single function or a method stretches for hundreds of lines. A software system may depend on a variety of support libraries for which the source code is not available, inhibiting source-level analysis. A Java application may include native methods implemented in a low-level language (typically, C or C++), inhibiting any analysis of those methods. Unfortunately, much of the existing research in automated testing makes simplifying assumptions about the amenability of a software system to analysis and about the size of these systems. A commercial software agitation tool must be ready to face these challenges.

Additional complications result from the continuous evolution of commercial software systems. From the day the first line of source code is written, the entire system undergoes constant modification. Any tool that collects and stores information about a software system must be able to participate in this continuous change process. In particular, a developer testing tool must facilitate evolution of tests together with the system under test without requiring a developer to manually propagate source code changes to the tests.

3.3 Scalability and Performance

Software systems grow increasingly more complex and interdependent. It is not uncommon for modern software to consist of millions of lines of code written by a multitude of software developers over the course of many years. This complexity can affect the performance of any software development tool. A software agitation tool is no exception. While only a small part of source code is exercised during the agitation of a unit, many of the analyses that take place are global and non-linear with respect to the size of the source code base. Because software agitation exercises user code, bad performance of that code may reflect badly on the overall performance of the tool. At the same time, tight integration with the development workflow dictates that the response time of a software agitation tool must be sufficiently short to avoid distracting developers from their work.

These limitations present a serious dilemma. On the one hand, many of the recently developed program analysis techniques can be applied to software agitation to provide better test input data and to generate better invariants. On the other hand, less precise approaches can result in better performance on the real-life software systems. This choice, however, is not binary. Most of the analysis technologies can be outfitted with various shortcuts to tolerate some degree of imprecision and ambiguity. Finding a sweet spot in this continuum represents one of the most significant challenges in building a commercial tool for developer testing.

4. AGITATOR: THEORY INTO PRACTICE

This section briefly describes the implementation of software agitation in Agitar's Agitator. We make no attempt to provide a complete description of the product. Rather, we concentrate only on those aspects of the implementation that illustrate how we addressed the challenges in moving Agitator's research foundation into practice.

4.1 Agitator from a User's Perspective

In order to provide tight integration with the development workflow, we implemented Agitator's interface on top of Eclipse [5], a popular IDE for the Java programming language. While users can invoke agitation from the command-line, this mode of use is mostly reserved for build automation and is not meant for day-to-day development. Users invoke Agitator by selecting a unit to test (a class, a package, or even an entire project) and clicking the "Agitate" button. Starting an agitation is akin to invoking a compiler in a traditional development workflow.1 Upon completion, Agitator presents several information panes (called "views" in Eclipse). Four of these views—the Outcomes view, the Observations view, the Snapshots view, and the Coverage view—are particularly illustrative of how Agitator meets the usability requirements put forth in Section 3. These views, shown in Figure 2, are discussed below.

1 The compiler metaphor is not entirely accurate in the case of Eclipse, where incremental compilation (when enabled) eliminates a separate compilation step.

The Outcomes View (Figure 2a). This view shows the outcomes that Agitator identified for a class or a method together with any additional outcome partitions that the user has defined. Outcomes represent possible ways in which a method's execution can terminate. Outcomes are determined statically—by considering all possible exit points from a method—and dynamically by observing the preceding agitation. A "normal" outcome occurs when no exceptions were thrown; other outcomes represent possible exceptions. The notion of method outcomes originated in ADL and represents a mechanism that enables users to reason about the flow of execution through a method. Users can further subdivide an outcome into outcome partitions that represent various post-conditions on that outcome.

Outcomes can fall into three categories: expected, unexpected, and unspecified. Expected outcomes include the normal outcome, any user-defined partitions of the normal outcome, and any explicitly thrown exception in the body of a method. Unspecified outcomes represent outcomes where the user does not care about tracking the results. Unexpected outcomes include Java runtime exceptions that are not explicitly declared in the method signature.
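A brief illustration (the method below is ours) of how a single method gives rise to outcomes in each category:

    // Hypothetical method under test, annotated with the outcome categories
    // described above.
    public int perItemCost(int totalCost, int itemCount) {
        if (totalCost < 0) {
            // Explicitly thrown in the method body: an expected outcome.
            throw new IllegalArgumentException("negative cost");
        }
        // Returning without an exception is the "normal" (expected) outcome;
        // the user could further partition it, e.g., into itemCount == 1 and
        // itemCount > 1. If agitation supplies itemCount == 0, the resulting
        // ArithmeticException is an unexpected outcome, since it is neither
        // declared nor thrown explicitly.
        return totalCost / itemCount;
    }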

The Observations View (Figure 2b). This view displays observations and assertions for the selected class, method, or outcome. Observations represent the likely invariants inferred by Agitator during preceding agitation. Agitator displays separate sets of observations for each outcome (and for each outcome partition) of a method. When an observation represents expected behavior, users can click the check box next to the observation to promote it to an assertion. Assertions can also represent explicit conditions manually entered by the user. Each assertion gets a pass/fail score after each agitation run.

Agitator displays observations and assertions as boolean Java expressions. This choice of representation reflects our desire to present information to users in a language they can clearly understand. Using the expression notation from the underlying programming language raises the level of discourse between the user and the tool without overburdening the user. Similarly, the choice of terminology (e.g., "observation" vs. "likely invariant") is a conscious attempt to translate traditional notions into a more readily understandable form.

The Snapshots View (Figure 2c). This view shows sample input values used when exercising the method, for each outcome of a method. Each snapshot captures the state of the method's class and its parameters before and after a method call. The snapshots provide a mechanism for the user to examine the input data values used during agitation and to set debugger breakpoints based on these values.

The Coverage View (Figure 2d). Agitator presents coverage information on a side panel inside the Eclipse source code editor. Coverage reflects Agitator's success in finding data values that direct execution to every statement and to every possible exception in the code. The coverage metrics used by Agitator represent line coverage, augmented with expression coverage for every conditional in a branch. Users use the coverage view to assess the quality of the agitation and to decide whether they need to assist Agitator in finding suitable input values to improve coverage.

Figure 2: Four information views available to Agitator users upon completion of agitation: (a) the Outcomes View, (b) the Observations View, (c) the Snapshots View, and (d) the Coverage View.

The coverage view represents one of the mechanisms that Agitator uses to provide immediate feedback about agitation to users. By reviewing the sections of source code that were not covered during agitation, users can identify input values that Agitator was not able to construct, and detect possible unreachable paths in the source code.

4.1.1 Helping Agitator Generate Input Data

Using auto-generation of input data, Agitator can generate many relevant test-input values with no user intervention. For some methods, however, users need to refine the generated values by specifying a valid range of input data or by providing a set of test values that improve the quality of observations. In some cases, it may be necessary to supply more information so that Agitator can construct complex object states. (This situation is further considered in Section 4.3.2, where we discuss the analysis algorithms used to construct test-input data.) In all of these cases, Agitator allows users to define test-input factories to accomplish their goals.

A factory tells Agitator how to construct interesting and relevant test objects of a given type. Users can define factories by configuring one of the predefined factories, or by using the Factory API to define their own. Agitator includes a rich set of pre-defined factories. These factories range from the very simple, such as those that return values in a given numerical range, to the fairly sophisticated, such as those that use reflection to create complex objects from constructors or methods. Factories are composable. For example, users can configure an integer factory to generate values within a specific range, and then use that factory in a collection factory to generate collections of integers in that range. The factories defined by one user become durable assets for the rest of the development team to use and share.
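The following sketch illustrates the composition idea only; the interface and class names (ValueFactory, IntRangeFactory, ListFactory) are invented for this example and are not the actual signatures of Agitator's Factory API:

    // Illustrative sketch of composable test-input factories (hypothetical API).
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    interface ValueFactory<T> {
        T next(); // produce the next test-input value
    }

    // A factory that returns integers in a configured range.
    class IntRangeFactory implements ValueFactory<Integer> {
        private final int lo, hi;
        private final Random random = new Random();
        IntRangeFactory(int lo, int hi) { this.lo = lo; this.hi = hi; }
        public Integer next() { return lo + random.nextInt(hi - lo + 1); }
    }

    // A factory that builds lists whose elements come from another factory.
    class ListFactory<T> implements ValueFactory<List<T>> {
        private final ValueFactory<T> elements;
        private final int size;
        ListFactory(ValueFactory<T> elements, int size) {
            this.elements = elements;
            this.size = size;
        }
        public List<T> next() {
            List<T> list = new ArrayList<T>();
            for (int i = 0; i < size; i++) {
                list.add(elements.next());
            }
            return list;
        }
    }

Composing new ListFactory<Integer>(new IntRangeFactory(1, 100), 5) would then feed five-element lists of small positive integers into agitation.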

4.1.2 Domain Knowledge for Testing

Getting the most out of Agitator on industrial-scale source code bases requires some amount of "tweaking." This is especially true when an application utilizes a third-party framework (such as J2EE [29] or Struts [30]) that requires some configuration to enable independent agitation of units. In such situations, many reusable test assets such as factory configurations, outcome partitions, and assertions are common across all classes and methods that rely on these frameworks.

Configuring Agitator for every such application requires domain knowledge and understanding. To facilitate and accelerate this task, Agitator enables packaging of common domain knowledge in independent modules that can be applied in various application contexts. These modules are called domain experts. Agitar provides experts that address some common patterns, such as those arising in J2EE [29], Struts [30], Hibernate [13], and Spring [28].

Agitator recognizes code patterns and uses experts to enable better agitation results by applying appropriate boilerplate configuration that provides factories, assertions, and outcome partitions. Some experts are applied automatically; others are configured by the user. Using experts eliminates many of the repetitive tasks that users would otherwise have to perform when agitating source code that uses a common framework.

New experts can be created using the Expert API. The Expert API is not intended for everyday use. Rather, this facility exists for the benefit of framework and component providers, who wish to supply an Agitator expert with their library. We envision the Expert API as evolving into a generic plug-in mechanism for Agitator, enabling much more extensive use of the experts within the internals of the agitation process. This would enable the use of Agitator as a platform for research into novel techniques and methods for automated testing. We present this vision in more detail in Section 6.

4.2 Agitator-Driven Refactoring

Developers often modify their work habits to incorporate a new tool into their workflow. For example, Saff and Ernst [25] report that the users of their continuous testing tool changed their working style to ensure that they "got a small part of [their] code working before moving on to the next section." Similarly, users of Agitator found that following certain coding patterns not only leads to better designed code, but also enables Agitator to provide better results because the code is more testable and the tests are more thorough and more readable. One example of this is what we call Agitator-driven refactoring.

The concept of refactoring was introduced by Opdyke [18] and involves modifying an object-oriented system to improve some aspect of its internal structure without affecting its external behavior. Refactoring is an object-oriented formulation of restructuring, a general paradigm for describing behavior-preserving changes to source code [1, 10]. Fowler's book [8] catalogs many refactoring transformations that have since become standard practice. Many of these transformations have the beneficial side effect of helping Agitator users to affect some aspect of agitation. For example:

Replace Data Value with Object: Introduce a new class that wraps a simple data type, such as an integer or a string, to provide additional data or behavior (Fowler [8], p. 175).

Agitator works best when it has the most information about the format of the data that it is trying to use as input to agitation. Unfortunately, it is a common practice to conflate various types of values by using primitive or insufficiently specialized types. For example, a salary value may be represented as an integer, carrying no indication that the salary may not be negative. Moreover, nothing in such a representation distinguishes a salary value from an employee number, permitting Agitator to reuse a value obtained by calling getSalary() as an input to setEmployeeNumber(). This refactoring alleviates this problem by encapsulating each value in the corresponding class, such as Salary or EmployeeNumber.
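A minimal sketch of the resulting wrapper type (the class below is ours):

    // After the refactoring: the salary is no longer a bare int, so Agitator
    // cannot confuse it with an employee number, and the non-negativity
    // constraint is enforced in one place.
    public final class Salary {
        private final int amount;

        public Salary(int amount) {
            if (amount < 0) {
                throw new IllegalArgumentException("salary may not be negative");
            }
            this.amount = amount;
        }

        public int getAmount() {
            return amount;
        }
    }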

Figure 3: Inside agitation (diagram labels: User Input; Factories, Test classes, Setup methods; Agitation; Assertions (tests)).

Replace Error Code with Exception: Replace the return of an error code, such as return -1, with the throw of an exception, such as throw new IllegalArgumentException (Fowler [8], p. 310).

Agitator uses exceptions thrown from the source code to create distinct outcome partitions for each exception. This permits easier tracking and debugging of the exception outcomes. Similar results can be achieved by manually constructing an outcome partition that spells out the conditions under which the error code is returned. Having Agitator create an exception partition automatically, however, is much easier for the user.
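A before/after sketch of this refactoring (the methods below are ours):

    // Before: the failure is signalled with a magic return value, so agitation
    // sees only the "normal" outcome.
    int indexOf(int[] values, int key) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == key) {
                return i;
            }
        }
        return -1; // error code: "not found"
    }

    // After: the failure becomes an explicitly thrown exception, which Agitator
    // turns into its own expected outcome partition.
    int indexOfOrThrow(int[] values, int key) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == key) {
                return i;
            }
        }
        throw new IllegalArgumentException("key not found: " + key);
    }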

Extract Method: Extract a fragment of an existing method into a new method, giving a descriptive name to the new method and its parameters (Fowler [8], p. 110).

Agitator has a better chance of achieving coverage and generating good input data values for methods with simpler control flow. While it works on long "spaghetti-code" methods, the list of observations generated from smaller methods is usually shorter and more relevant.
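For example (the code below is ours), extracting a validation check from a longer method gives Agitator a small unit whose observations stay short and focused:

    // Extracted from a longer order-processing method; the helper can now be
    // agitated on its own, with simple control flow and a clear outcome set.
    boolean isValidZipCode(String zip) {
        return zip != null && zip.matches("\\d{5}");
    }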

4.3 What Happens During Agitation

The agitation process includes both automated processing and human intervention. This section focuses on the internals of the agitation process and on the individual steps that constitute agitation, as presented in Figure 3.

4.3.1 Analysis and Input Data Generation

Software agitation is an iterative process, driven by the objective of reaching maximum code coverage while achieving every possible expected outcome. Software agitation begins with static and dynamic analysis of Java code (at the bytecode level), followed by dynamic generation of test data based on identified behavior of the code. The dynamic analysis utilizes data values and coverage information from the preceding iteration to direct execution along a particular code path. The static analysis uses heuristic-driven path analysis to collect and solve input value constraints that can cause execution to take the selected path.

In order to satisfy the scalability and performance requirements described in Section 3, Agitator takes several shortcuts and makes a number of approximations in the analysis algorithms. For instance, rather than trying to solve constraints for the entire execution path, Agitator may consider only a part of that path. While such an approach may not always direct execution down the expected path, we found that it works well in practice and cuts down analysis time. Similarly, Agitator does not perform a receiver-class analysis when considering paths spanning multiple methods. Instead, it uses heuristics to "guess" the potential candidates among all possible implementations of a method. The heuristics used by Agitator were determined mostly via profiling and experimentation. For example, Agitator unrolls loops and recursive calls until the control-flow graph reaches a particular size. We determined (experimentally) that this size gives a good trade-off between the completeness of the results and Agitator's performance.

Solving of path constraints is implemented in Agitator using a wide array of generic and specialized constraint solvers. For instance, Agitator includes a string solver for java.lang.String objects, producing strings that satisfy constraints imposed by the String API (e.g., a string that .matches(...) a regular expression). Similarly, Agitator includes several solvers for the Java Collections classes.
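For instance, in code such as the following sketch (the method is ours), covering the guarded branch requires a string that satisfies a String-API constraint, which a specialized string solver can produce directly rather than hoping that random text happens to match:

    // Hypothetical code under test: reaching the body of the if-statement
    // requires an input such as "ABC-1234" that satisfies the .matches(...)
    // constraint.
    public boolean isPartNumber(String s) {
        if (s != null && s.matches("[A-Z]{3}-\\d{4}")) {
            return true;
        }
        return false;
    }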

In addition to using constraint solving and other sophisticated methods for generating test inputs, Agitator uses a few heuristics that often allow it to explore otherwise unreachable paths. Some of these include (a simplified sketch of the integer heuristic follows the list):

• For integers: use 0, 1, -1, any constants found in the source code, the negation of each such constant, and each constant incremented and decremented by 1.

• For strings: generate random text of varying lengths, varying combinations of alpha, numeric, and alphanumeric values, empty strings, null strings, any specific string constants occurring in the source code.

• For more complex objects: achieve interesting object states by calling mutator methods with various auto-generated arguments.

• For all objects: use various objects of the appropriate type created during execution. Agitator accumulates some of the objects that it creates at various points during agitation for possible reuse in place of constructing new ones.
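The integer heuristic from the list above can be sketched as follows (a simplified illustration under our own naming, not Agitator's implementation):

    // Simplified sketch of the integer heuristic: a few seed values plus the
    // constants mined from the code under test, their negations, and their
    // +1/-1 neighbours.
    import java.util.LinkedHashSet;
    import java.util.Set;

    class IntHeuristic {
        static Set<Integer> candidateValues(Set<Integer> constantsInCode) {
            Set<Integer> values = new LinkedHashSet<Integer>();
            values.add(0);
            values.add(1);
            values.add(-1);
            for (int c : constantsInCode) {
                values.add(c);
                values.add(-c);
                values.add(c + 1);
                values.add(c - 1);
            }
            return values;
        }
    }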

In some cases, Agitator lacks the information to construct an object, or to put an object in the specific state required to test a class. In those cases, users provide help by defining one or more factories that tell Agitator how to construct an object (see Section 4.1.1).

The above shortcuts represent a point of departure from the established research in test-input generation. Most of these were necessary to achieve adequate memory and runtime performance. An important benefit afforded by the iterative nature of software agitation is that it is not necessary for Agitator to achieve the desired goal on the first attempt. Moreover, because Agitator must execute every path multiple times (see the next section), it is perfectly acceptable for the analysis algorithms to "guess" a good set of data values and to adjust that guess if the execution does not take the expected path. Guessing the right mix of data values entails approximating an equivalence partition for each input parameter and trying different combinations of values in that partition. In order to facilitate invariant detection, Agitator attempts to provide five distinct data values for every input parameter for an execution path. Thus, the goal of test-input generation is to create a large number of input values that maximize the likelihood of achieving the desired coverage, rather than to guarantee it.

Further research is needed to adapt some of the recent work in test-input generation to the commercial-scale source code bases. We revisit this point in Section 6.

4.3.2 Execution and Invariant Detection

After test-input generation, Agitator exercises each method in each class under test. Agitator attempts to execute every path through a method multiple times, so that the invariant inferencing algorithm has sufficient distinct data points. Agitator uses heuristics to decide whether it executed a particular path enough times with sufficiently distinct data values. Infinite loops or long-running operations are terminated using user-settable timeouts.

Execution of potentially defective source code with random inputs may present a significant risk to system security and integrity. For example, a utility that deletes files from the filesystem must not be allowed to do so during its execution by an agitation tool. Agitator takes special care to isolate all application source code that it executes using the Java security manager.
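One way to express such isolation with the standard java.lang.SecurityManager is sketched below; the class and the specific policy choices are ours for illustration and do not reflect Agitator's actual configuration:

    // Simplified sandbox sketch: block file deletion and process execution
    // while agitated code runs; everything else is allowed here for brevity.
    import java.security.Permission;

    class AgitationSandbox extends SecurityManager {
        public void checkDelete(String file) {
            throw new SecurityException("file deletion blocked during agitation: " + file);
        }

        public void checkExec(String cmd) {
            throw new SecurityException("process execution blocked during agitation: " + cmd);
        }

        public void checkPermission(Permission perm) {
            // Permit everything else in this sketch.
        }
    }

    // Installed before executing code under agitation:
    //   System.setSecurityManager(new AgitationSandbox());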

During agitation of each testable method, Agitator applies its dynamic invariant detection algorithm to discover "interesting" method invariants that can be presented to the user as observations. The algorithm for invariant detection in Agitator is similar to that used in recent versions of Daikon, though it was developed independently. Perkins and Ernst [21] describe several algorithms for incremental detection of likely invariants; our approach is similar to their simple incremental algorithm. Agitator uses a different set of heuristics and optimizations than those mentioned in their paper to make the incremental algorithm efficient. Our decision to develop an incremental algorithm was, once again, motivated by the scalability and performance considerations.

Agitator starts out by positing a number of hypothetical relationships between values within a method. Because the number of possible combinations can be very large, Agitator filters the list of hypothetical relationships using simple heuristics. For example, we assume that two variables occurring in the same expression are related, whereas variables that never occur in the same context are independent. During agitation, Agitator discards relationships that do not always hold. When agitation is complete, the list of "surviving" relationships is presented to the user as observations.
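The core "posit, then discard" loop can be sketched as follows (a minimal illustration under our own names; Agitator's actual algorithm and heuristics are considerably more involved):

    // Minimal sketch of incremental invariant filtering: start with candidate
    // relationships and discard any candidate falsified by an observed sample.
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    interface Candidate {
        String asJavaExpression();                    // e.g., "x <= y"
        boolean holdsFor(Map<String, Object> sample); // evaluate on one data point
    }

    class InvariantFilter {
        private final List<Candidate> surviving;

        InvariantFilter(List<Candidate> initialCandidates) {
            this.surviving = new ArrayList<Candidate>(initialCandidates);
        }

        // Called once per observed execution of the method under test.
        void observe(Map<String, Object> sample) {
            for (Iterator<Candidate> it = surviving.iterator(); it.hasNext();) {
                if (!it.next().holdsFor(sample)) {
                    it.remove(); // falsified: never shown to the user
                }
            }
        }

        // The relationships that survive all samples become observations.
        List<Candidate> observations() {
            return surviving;
        }
    }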

Agitator finds data values to observe using the following heuristics:

• By considering fields related to the method under test.


• By examining the values of method parameters and instance variables.

• By checking properties of simple data types (array length, string length, etc.).

• By identifying “getter” methods for objects in scope.

• By including initial values (values prior to execution) for all observable expressions in a method and by considering the return value of that method.

When forming a hypothesis about data value relationships, Agitator considers a number of properties such as:

• Expressions in source code. Simple expressions found in the method under test are always observed and checked for invariability.

• Relational: x = y, x > y, x ≥ y, x < y, x ≤ y. These relationships are possible between numeric data types. Note that with our approach, hypothesis x ≠ y is useless (unless it occurs in the code), because the likelihood of disproving it is small.

• Range: A < x < B, A ≤ x < B, A < x ≤ B, A ≤ x ≤ B. Another set of relationships for numeric values. Agitator makes initial estimates for the values of A and B and then iteratively refines the boundaries.

• Logical: x ∧ y. Agitator checks this relationship for boolean values. Early versions of Agitator also checked x ∨ y and x ⇒ y, but we found that considering these relationships resulted in too much noise (useless observations).

• Linear: Ax + By = C, Ax + By + Cz = D. Agitator checks these relationships for integer values. The three-variable relationship is only considered when A, B, and C are +1 or -1; the two-variable relationship is checked for all constant values.

• Object: x = y, x.equals(y), x = null, x ≠ null. Agitator considers these relationships for all applicable object values.

Agitator considers relationships that remain after agitation invariant, sorts them by relevance, and presents them to the user. The relevance score reflects how closely components of the relationship are "related" in the source code. Expressions that appear in source code verbatim score the highest; expressions that are pure guesses score the lowest.

Our other point of departure from the Daikon approach is to preserve implied invariants, rather than attempt to eliminate them. (The latest releases of Daikon even include the interface to a theorem prover specifically for that purpose.) Our reasons for keeping redundant invariants are twofold. First, we found (via experimentation) that eliminating implied invariants proved too taxing on Agitator's performance. Because the tool is used interactively, reducing the response time is essential for usability. Second, we discovered that a fully reduced set of observations was not always appropriate for human inspection. For example, consider the following two sets of observations (full and implied):

Full set: x = y, y = z, x = z, x ≠ null, y ≠ null, z ≠ null
Reduced set: x = y, y = z, x ≠ null

Logically, these two sets are equivalent. The user, however, might want to see and test each assertion separately and not hide the observation z ≠ null inside the deductive chain (z = y ∧ y = x ∧ x ≠ null ⇒ z ≠ null). Consider another example:

(1) x > y
(2) x = y + 1

Here, we do not know whether the right behavior is (1) or (2). Perhaps the intended behavior is (1), but due to a bug in the user's program, the user discovers that the code always increments by 1. By presenting both options, Agitator shifts the responsibility for this decision to the user, where we believe it belongs.

5. AGITATING AGITATOR

After 18 months of development Agitator had enough functionality to be used on its own code. In this section we report some results of running Agitator on itself. Most of the metrics reported here are publicly available as part of Agitar's Open Quality initiative, an effort to demonstrate the quality of software via auto-generated publicly-viewable reports.2 These reports include code coverage achieved both by manually-written unit tests and by automatic agitation, the number of assertions, and other project code metrics. The information we present corresponds to Agitator 3.0.

5.1 Project Statistics

Agitator is implemented in Java. It consists of 1,734 application classes and 1,000 unit-test and test harness classes. Agitation, including execution of manually-written unit tests, achieves 80.2% code coverage. During agitation Agitator checks 30,998 assertions. This includes both the assertions in the manually-written tests and the observations promoted to assertions. There are 68,990 executable lines of application code and 29,404 executable lines of unit-test and test harness code. Agitation is supported by 26 custom test-input factories that help Agitator achieve adequate code coverage. The implementation of these factories takes another 766 executable lines of code. Taken together, these metrics indicate a test code to application code ratio of 1:2.3 and an assertion to code ratio of 1:2.2.

Comparing these statistics to other projects is challenging because very few publicly available software systems include unit tests that achieve comparable code coverage. Some systems, however, are better at this than others. For instance, consider the Commons-collections library [31], part of the Jakarta project. As of version 3.1, this library includes unit tests that result in 77.7% code coverage. The ratio of test code to application code, however, is significantly larger—1.62:1. This means that on average it takes 1.62 lines of test code to test a single line of application code, compared to 1 line of test code per 2.3 lines of application code in Agitator. The ratio of assertions to application code is comparable to that of Agitator—1 assertion per 2 lines of source code. This comparison demonstrates that the use of Agitator can reduce the effort by requiring less test code to achieve comparable application code coverage. Of course, some effort must be spent on checking Agitator's observations, possibly promoting them to assertions, and on manually adding new Agitator assertions in the observation view. We look at this process next.

2 http://www.agitar.com/openquality/

5.2 From Observations to Assertions

To measure Agitator's efficacy in helping Agitator developers we estimated the number of "useful" observations that it produces from its own source code. We define "usefulness" very narrowly, counting only those observations that the developers promoted to assertions without editing to make them stronger or more meaningful. Furthermore, we looked only at a single point in time and only at the observations that existed at that point. Because we use agitation continuously during the development process, many observations that are examined by the developers represent unexpected behavior in the code. These observations are clearly "useful" because they help to fix bugs, but they are not included in our computation.

In the source code for Agitator 3.0, Agitator is able to exercise 34,004 method outcomes. In doing so, it creates 126,374 observations. Over the time that Agitator has been self-testing, 13,868 of these observations were promoted to assertions. This indicates that at least 11% of Agitator's observations represent useful invariants that have been promoted to assertions. The total number of assertions that Agitator checks is actually higher—18,623. Some of the extra assertions resulted from observations that Agitator developers modified to make them stronger or more meaningful. Some of these assertions developers entered manually. Because we have not preserved the distinction between these two kinds of assertions, they are excluded from our estimate.

6. THE ROAD AHEAD

After two years of commercial deployment we are happy to state that Agitator has become an invaluable tool for many of our customers. Still, much work lies ahead. Some of the challenges are purely technical; others require more research in various areas, ranging from better test-input generation algorithms to understanding developers' motivation for writing tests. In this section, we outline several directions for further exploration.

6.1 Agitator as a Research Platform

Creating a research infrastructure is the bane of many software engineering research projects. On the one hand, the infrastructure is necessary to provide the support services, such as some basic source code analyses. On the other hand, building such an infrastructure typically involves a significant engineering effort by several researchers, while providing little opportunity for novel research.

We would like to extend Agitator to provide such a platform for research in testing tools. Even now, the Expert API (Section 4.1.2) provides some rudimentary hooks to Agitator's internals. We are interested in extending this API and creating a true plug-in architecture that would enable outside researchers to modify and extend Agitator's behavior.

6.2 Improving Testing Technology

In recent years, test-input generation has been an area of active research. Advances in techniques for symbolic execution (e.g., [33, 36, 27, 35, 23]), random testing (e.g., [9]), and test selection (e.g., [19, 38, 11]) suggest many plausible improvements to the algorithms currently used by Agitator. Yet, scalability of these approaches remains a major obstacle to adoption. Further research is needed in adapting these and similar techniques to large, commercial-scale code bases. One promising direction, suggested by Tillmann and Schulte [33], is to store the intermediate results of the analysis of common (library) code in a reusable manner. Sen [27] proposes a set of domain-specific optimizations that improve the performance of constraint solving. Robschink and Snelting [24] show how to solve path constraints efficiently in realistic-size programs by using interval analysis and binary decision diagrams.

Increasingly, we are seeing software systems that are implemented using several programming languages. For example, some of the Java libraries include native code components implemented in C and C++. Unfortunately, we are not aware of any research that attempts to address analysis and generation of test inputs across the boundary between two (or more) languages.

6.3 Helping the User

The human component plays a big role in software agitation. While we believe that we developed a successful user interface for software agitation, our work is far from complete. Below are just a few of the ideas that we feel merit further exploration.

Visual language for factory configuration. Input-value factories in Agitator are configured using interactive dialogs. Often, the factories need to nest (for example, an array factory that uses another factory for its components), sometimes more than a single level. Configuration of nested factories can be confusing to the user. The dependencies between different factories can be visualized using a dataflow network, similar to those used in visual programming languages, such as SCIRun [20] and Fabrik [14]. Designing a specialized visual programming language for factory composition and configuration would be a welcome improvement to Agitator.

Assertion generalization. Observations reported by Agitator are fairly low-level, code-based invariants. We are interested in exploring mechanisms that can generalize the inferred invariants into higher-level observations and assertions. The work of Xie and Notkin [38] and Henkel and Diwan [12] appears to be a start in that direction.

Outcome filtering. Some outcomes reported by Agitator surprise users. For instance, Agitator might report that a method throws a java.lang.NullPointerException, because it was able to cause that exception by supplying a null value as a parameter. The users, however, might object to such a discovery, claiming that they know that a null value cannot propagate to that location in their system. At present, the global data flow analysis that could detect this situation is beyond Agitator's capabilities. We are interested in finding mechanisms that would help Agitator in this and other similar cases.
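
The (invented) example below reconstructs the situation: agitating the formatting method in isolation legitimately produces the NullPointerException outcome, while the developer knows that the only caller validates the argument first.

    // Illustrative example of a surprising outcome (all names are ours).
    class AccountFormatter {
        // Agitated in isolation, this method is reported as throwing
        // java.lang.NullPointerException when owner is null -- a true
        // statement about the method by itself.
        String label(String owner, long balanceInCents) {
            return owner.toUpperCase() + ": " + (balanceInCents / 100.0);
        }
    }

    class AccountReport {
        private final AccountFormatter formatter = new AccountFormatter();

        // The developer's objection: every caller validates first, so the null
        // value Agitator synthesized cannot propagate to label() in the running
        // system. Detecting this requires global data flow analysis.
        String line(String owner, long balanceInCents) {
            if (owner == null) {
                throw new IllegalArgumentException("owner must not be null");
            }
            return formatter.label(owner, balanceInCents);
        }
    }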

6.4 Studying the Testers
Experience suggests that having great tools is often insufficient to achieve wide acceptance of a new software-development practice. Developer testing is no exception. Unless developers are motivated to make an effort (however small), they will not use even the best of tools. Unfortunately, very little is known about the human component of the developer testing practice, and no conclusive evidence exists on what motivates developers to test (or not to test) their code. Myers provides some high-level insights in Chapter 2 of The Art of Software Testing [16], yet much more work is needed in that domain.

In his recent post to the Psychology of Programming mailing list [3], Ruven Brooks (one of the founding fathers of the field) poses the following questions about testing:

• How do people decide what to test?

• How do people construct tests?

• How do people actually use test tools (as opposed to the way the authors of those tools envisioned their use)?

• How do programming language syntax and semantics affect test strategy and behavior?

• How do novice programmers learn to do testing?

These questions have no satisfactory answers in either the testing or the psychology literature.

7. CONCLUSION
Perhaps one of the most satisfying conclusions that we can draw from this paper is that academic research in software testing can and does have relevance to industrial applications. By paying attention to the usability, scalability, and performance requirements of commercial software development, we were able to deliver a tool that adapts a number of academic research ideas to the problem of developer testing.

The work on the Daikon invariant detector, as well as the wealth of research on test-input generation, was instrumental to our success. Yet we must ask why so few of the research ideas find their way into commercial applications. We do not have a definite answer to this question, but we suspect that the primary reason is the significant investment required to transform brilliant research ideas and prototypes into full-fledged, industrial-strength applications. The current version of Agitator (3.0) is the result of over 50 engineer-years of development and testing, with a very senior team of developers with significant prior experience in test automation and software development tools. And, as one might expect, the bulk of the development effort was not spent on the basic software agitation algorithms, but on usability, applicability, reliability, scalability, and performance. Tracking and addressing developers' changing preferences in terms of IDEs, frameworks, etc., is also a necessity: even though support for a specific IDE is rarely a relevant factor in academic work, it is an imperative for a commercial product. For example, when Eclipse became the IDE of choice for most Java developers, we had to port the original version of Agitator from a stand-alone Swing application to an Eclipse plug-in.

Ultimately, however, if the problem solved is big enough (and software testing is definitely a big problem), we believe that the significant cost and effort required to productize promising research ideas is well worth it. Researchers get the satisfaction of seeing their ideas used broadly, software development tools companies are able to differentiate themselves through innovation, and users benefit by having more powerful and more effective products to help them deliver quality software.

We hope that other individuals and companies will follow suit in driving some of the best research ideas from recent years to commercial success.

8. REFERENCES
[1] R. S. Arnold. Software restructuring. Proceedings of the IEEE, 77:607–617, 1989.
[2] K. Beck and E. Gamma. Test infected: Programmers love writing tests. Java Report, 3:37–50, 1998.
[3] R. E. Brooks. Commercial reality. Psychology of Programming Interest Group mailing list, March 2005. http://www.mail-archive.com/[email protected]/msg00958.html.
[4] J. deRaeve and S. P. McCarron. Automated test generation technology. Technical report, X/Open Company Ltd., 1997. http://adl.opengroup.org/documents/Archive/adl10rep.pdf.
[5] Eclipse.org. Eclipse platform: Technical overview, 2003. http://eclipse.org/white-papers/eclipse-overview.pdf.
[6] M. D. Ernst. Dynamically discovering likely program invariants. PhD thesis, University of Washington, 2000.
[7] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically discovering likely program invariants to support program evolution. In ICSE ’99: 21st International Conference on Software Engineering, pages 213–224, 1999.
[8] M. Fowler. Refactoring: Improving the Design of Existing Code. Object Technology Series. Addison-Wesley, 1999.
[9] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In PLDI ’05: 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 213–223, 2005.
[10] W. G. Griswold and D. Notkin. Automated assistance for program restructuring. ACM Transactions on Software Engineering and Methodology, 2(3):228–269, July 1993.
[11] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In ICSE ’03: 25th International Conference on Software Engineering, pages 60–73, 2003.
[12] J. Henkel and A. Diwan. Discovering algebraic specifications from Java classes. In ECOOP ’03: European Conference on Object-Oriented Programming, pages 431–456, 2003.
[13] Hibernate.org. Relational persistence for Java and .NET. http://www.hibernate.org/.
[14] D. Ingalls. Fabrik: A visual programming environment. In OOPSLA ’88: International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 176–190, Nov. 1988.
[15] JUnit. http://www.junit.org.
[16] G. J. Myers. The Art of Software Testing. Wiley-Interscience, New York, 1979.
[17] NUnit. http://www.nunit.org.
[18] W. F. Opdyke. Refactoring object-oriented frameworks. PhD thesis, University of Illinois at Urbana-Champaign, 1992.
[19] C. Pacheco and M. D. Ernst. Eclat: Automatic generation and classification of test inputs. In ECOOP ’05: European Conference on Object-Oriented Programming, pages 504–527, 2005.
[20] S. G. Parker, D. M. Weinstein, and C. R. Johnson. The SCIRun computational steering software system. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing. Birkhauser Press, 1997.
[21] J. H. Perkins and M. D. Ernst. Efficient incremental algorithms for dynamic detection of likely invariants. In SIGSOFT ’04/FSE-12: 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 23–32, 2004.
[22] N. H. Petschenik. Building awareness of system testing issues. In ICSE ’85: 8th International Conference on Software Engineering, pages 182–188, 1985.
[23] Robby, M. B. Dwyer, and J. Hatcliff. Bogor: An extensible and highly-modular software model checking framework. In ESEC/FSE-11: 9th European Software Engineering Conference/11th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 267–276, 2003.
[24] T. Robschink and G. Snelting. Efficient path conditions in dependence graphs. In ICSE ’02: 24th International Conference on Software Engineering, pages 478–488, 2002.
[25] D. Saff and M. D. Ernst. An experimental evaluation of continuous testing during development. In ISSTA ’04: 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 76–85, 2004.
[26] S. Sankar and R. Hayes. Specifying and testing software components using ADL. Technical Report SMLI TR-94-23, Sun Microsystems Laboratories, April 1994. http://research.sun.com/techrep/1994/smli_tr-94-23.pdf.
[27] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In ESEC/FSE-13: 10th European Software Engineering Conference/13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 263–272, 2005.
[28] SpringFramework.org. Spring application framework. http://www.springframework.org/.
[29] Sun Microsystems. Java Platform, Enterprise Edition. http://java.sun.com/javaee/.
[30] The Apache Software Foundation. Apache Struts project. http://struts.apache.org/.
[31] The Jakarta Project. Commons Collections. http://jakarta.apache.org/commons/collections/.
[32] D. Thomas and A. Hunt. Mock objects. IEEE Software, 19(3):22–24, 2002.
[33] N. Tillmann and W. Schulte. Parameterized unit tests. In ESEC/FSE-13: 10th European Software Engineering Conference/13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 253–262, 2005.
[34] R. van Megen and D. B. Meyerhoff. Costs and benefits of early defect detection—experiences from developing client-server and host applications. Software Quality Journal, 4(4):247–256, 1995.
[35] W. Visser, C. S. Pasareanu, and S. Khurshid. Test input generation with Java PathFinder. In ISSTA ’04: 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 97–107, 2004.
[36] T. Xie, D. Marinov, W. Schulte, and D. Notkin. Symstra: A framework for generating object-oriented unit tests using symbolic execution. In TACAS ’05: 11th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, volume 3440 of LNCS, pages 365–381. Springer-Verlag, Apr. 2005.
[37] T. Xie and D. Notkin. Tool-assisted unit test selection based on operational violations. In ASE ’03: International Conference on Automated Software Engineering, pages 40–48, 2003.
[38] T. Xie and D. Notkin. Automatically identifying special and common unit tests for object-oriented programs. In ISSRE ’05: International Symposium on Software Reliability Engineering, pages 277–287, 2005.