how to actually do high-volume automated testing

1

How to Actually DO High-Volume Automated Testing

Cem KanerCarol Oliver

Florida Institute of Technology

2

Presentation AbstractIn high volume automated testing (HiVAT), the test tool generates the test, runs it, evaluates the results, and alerts a human to suspicious results that need further investigation. Simple HiVAT approaches use simple oracles—run the program until it crashes or fails in some other extremely obvious way. More powerful HiVAT approaches are more sensitive to more types of errors. They are particularly useful for testing combinations of many variables and for hunting hard-to-replicate bugs that involve timing or corruption of memory or data. Cem Kaner and Carol Oliver present a new strategy for teaching HiVAT testing. We’ve been creating open source examples of the techniques applied to real (open source) applications. These examples are written in Ruby, making the code readable and reusable by snapping in code specific to your own application. Join Cem Kaner and Carol Oliver as they describe three HiVAT techniques, their associated code, and how you can customize them.

3

What is automated testing?• There are many testing activities, such as…

– Design a test, create the test data, run the test and fix it if it’s broken, write the test script, run the script and debug it, evaluate the test results, write the bug report, do maintenance on the test and the script

• When we automate a test, we use software to do one or more of the testing activities– Common descriptions of test automation emphasize test

execution and first-level interpretation– But that automates only part of the task– Automation of any part is automation to some degree

• All software tests are to some degree automated and no software tests are fully automated.

4

What is “high volume” automated testing?• Imagine automating so many testing activities– That the speed of a human is no longer a constraint on

how many tests you could run– That the number of tests you run is determined more by

how many are worth running than by how much time they take

• This is the underlying goal of HiVAT

5

Once upon an N+1th time…• Imagine developing an N+1th version of a program

– How could we test it?• We could try several techniques

1. Long sequence regression testing. Reuse regression tests from previous version. For those tests that the program can pass individually, run them in long random sequences

2. Program equivalence testing. • Start with function equivalence testing. For each function that exists in the old

version and the new one, generate random inputs, feed them to both versions and compare results. After a sufficiently extensive series, conclude the functions are equivalent

• Combine tests of several functions that compare (old versus new) successfully individually.

3. Large-scale system comparison by analyzing large sets of user data from the old version (satisfied user attests that they are correct). Check that the new version yields comparable results.• Before running the tests, run data qualification tests (test the plausibility of the

inputs and user-supplied sample outputs). The goal is to avoid garbage-in-garbage-out troubleshooting.

6

Working definitions of HiVAT• A family of test techniques that use software to generate,

execute and interpret arbitrarily many tests. • A family of test techniques that empower a tester to focus on

interesting results while using software to design, implement, execute, and interpret a large suite of tests (thousands to billions of tests).

7

Breaking out potential benefits of HiVAT(different techniques different benefits)

• Find bugs we don’t otherwise know how to find• Diagnostics-based• Long sequence regression• Load-enhanced functional testing• Input fuzzing• Hostile data stream testing

• Increase test coverage inexpensively • Fuzzing • Functional equivalence• High-volume combination testing• Inverse operations• State-model based testing• High-volume parametric variation

• Qualify large collections of input or output data• Constraint checks• Functional equivalence testing

8

Examples of the techniques• Equivalence testing– Function (e.g. MASPAR)– Program– Version-to-version with shared data

• Constraint checking– Continuing the version-to-version comparison

• LSRT– Mentsville

• Fuzzing– Dumb monkeys, stability and security tests

• Diagnostics– Telenova PBX intermittent failures

9

Oracles Enable HiVAT• Some specific to the testcase

– Reference Programs: Does the SUT behave like it?– Regression Tests: Do the previously-expected results still happen?

• Some independent of the testcase– Fuzzing: Run to crash, run to stack overflow– OS/System Diagnostics

• Some inform on only a very narrow aspect of the testcase– You’d never dream that passing the oracle check means passing the test in

all possible ways– Only that passing the oracle check means the program does not fail in this

specific way– E.g. Constraint Checks

• The more extensive your oracle, the more valuable your high-volume test suite– What potential errors have we covered?– What potential errors have to be checked in some other way?

Reference Oracles are Useful but PartialBased on notes from Doug Hoffman

Program state

System state

Configuration and system resources

Cooperating processes, clients or servers

System state

Impacts on connected devices / resources

To cooperating processes, clients or servers

Program state, (and uninspected outputs)

System under

test

Reference function

Monitored outputsIntended inputs

Program state

System state

Configuration and system resources

Cooperating processes, clients or servers

Program state, (and uninspected outputs)

System state

Impacts on connected devices / resources

To cooperating processes, clients or servers

Intended inputs Monitored outputs

11

Using Oracles• If you can detect a failure, you can use that oracle in any test,

automated or not• Many ways to combine oracles, as well as using them one at a

time• Each has time costs and timing costs• Heisenbug costs mean you might use just one or two oracles

at a time, rather than all the possible oracles you know, all at once.

12

MAKING HIVAT WORK FOR YOUPractical Things

13

Basic Ingredients• A Large Problem Space

– The payoff should be worth the cost of setting up the computer to do detailed testing for you

• A Data Generator– Data generation often requires more human effort than we expect. This

constrains the breadth of testing we can do, because all human effort takes time.

• A Test Runner – We have to be able to read the data, run the test, and feed the test to the

oracle without human intervention. Probably we need a sophisticated logger so we can trace what happened when the program fails.

• An Oracle– This might tell us definitively that the program passed or failed the test or

it might look more narrowly and say that the program didn’t fail in this way. In either case, the strength of the test is driven by your ability to tell whether it failed. The narrower the oracle, the weaker the technique.

14

Three examples• Function equivalence tests (testing against a reference

program)• Long-sequence regression (repurposing already-generated

data with new oracles)• TBD

15

Functional Equivalence• Oracle: A Reference Program• Method:

– The same testcases are sent to the SUT and to the Reference Program

– Matched behaviors are considered a Pass– Discrepant behaviors are considered a Fail

• Limits: – The reference program might be wrong– We can only test comparable functions– Generating semantically meaningful inputs can be hard, especially

as we create nontrivial ones• Benefits: For every comparable function or combination of functions,

you can establish that your SUT is at least as good as the Reference Program

16

Functional Equivalence Examples• Square Root calculated one way vs. Square Root calculated

another way (Hoffman)• Open Office Calc vs. a reference implementation– Comparing • Individual functions• Combinations of functions

– Reference• Currently reference formulas programmed in Ruby• Coming (maybe by STAR) compare to MS-Excel

17

Functional Equivalence Architecture

18

Functional Equivalence: Key Ideas• You provide your Domain Knowledge– How to invoke meaningful operations for your system– Which data to use during the tests– How to compare the resulting data states

• Test Runner sends the same test operations to both the Reference Oracle and the System Under Test (SUT)– Collects both answers– Compares the answers– Logs the result

19

Function Equivalence: Architecture• Component independence allows reusability for other applications

and for later versions of this application.– We can snap in a different reference function– We can snap in new versions of the software under test– The test generator has to know about the category of

application (this is a spreadsheet) but not which particular application is under test.

– In the ideal architecture, the test running passes test specifications to the SUT and oracle without knowing anything about those applications. To the maximum extent possible, we want to be able to reuse the test runner code across different applications

– The log interpreter probably has to know a lot about what makes a pass or a fail, which is probably domain-specific.

20

Functional Equivalence: Reference Code• Work in Progress– http://testingeducation.org/hivat/

http://testingeducation.org/hivat/

21

Long-Sequence Regression• Oracles:

– OS/System diagnostics– Checks built into each regression test

• Method:– Reconfigure some known-passing regression tests to ensure

they can build upon one another (i.e. never reset the system to a clean state)

– Run a subset of those regression tests randomly, interspersed with diagnostics

– When a test fails on the N+1st time that we run it, after passing N times during the run, analyze the diagnostics to figure out what happened and when in the sequence things first went wrong

22

Long-Sequence Regression• Limits:– Failures will be indirect pointers to problems– You will have to analyze the logs of diagnostic results– You may need to rerun the sequence with different

diagnostics– Heisenbug problem limits the number of diagnostics that

you can run between tests• Benefits: – This approach is proven good for finding timing errors and

memory misuse

23

Long-Sequence Regression Examples• Mentsville• Application to OpenOffice

24

Long-Sequence Regression Architecture

25

Long-Sequence Regression: Key Ideas• Pool of Long-Sequence-Compatible Regression Tests– Each of these is Known to Pass, individually, on this build– Each of these is Known to NOT reset system state

(therefore it contributes to simulating real use of the system in the field)

– Each of these is Known to Pass when run 100 times in a row, with just itself altering the system (if not, you’ve isolated a bug)

• Pool of Diagnostics– Each of these collects something interesting about the

system or program: memory state, CPU usage, thread count, etc.

26

Long-Sequence Regression: Key Ideas• Selection of Regression Tests

– Need to run a small enough set of regression tests that they will repeat frequently during the elapsed time of the Long Sequence

– Need to run a large enough set of regression tests that there’s real opportunity for interactions between functions of the system

• Selection of Diagnostics– Need to run a fixed set of diagnostics per Long Sequence, so

that you collect data to compare Regression Test failures against

– Need to run a small enough set of diagnostics that the act of running the diagnostics doesn’t significantly change the system state

27

Long-Sequence Regression: Reference Code• Work in Progress– http://testingeducation.org/hivat/



28

Third Application• We have several efforts in progress, mainly– Diagnostics-based tests – High-volume input-combination tests

• We don’t yet know which will be presentable by April, so– Stay tuned…

29

Is this the future of testing?• Maybe, but within limits• Expensive way to find simple bugs– Significant troubleshooting costs: shoot a fly with an

elephant gun and discover you have to spend enormous time wading through the debris to find the fly

• Ineffective way to hunt for design bugs• Emphasizes the families of tests that we know how to code

rather than the families of bugs that we need to hunt

As with all techniquesThese are useful under some

circumstancesProbably invaluable under some

circumstancesBut ineffective for some bugs and

inefficient for others

how to actually do high-volume automated testing

Documents

test results

test data

test script

test execution

test tool

testing combinations

testing activitiesthat

teaching hivat testing