
Page 1: L5 2019 Lifecycle, Documentation

Software Testing ETSA20, http://cs.lth.se/etsa20

Naik 7.1-7.4, 12.1-12.9; Chen, Memon, Mäntylä, Whittaker

Lifecycle, Documentation

Prof. Per Runeson

Page 2:

Lecture
• Lifecycle
– Ch 7.1-7.4
– Chen
– Mäntylä
• Planning and follow-up
– Ch 12.1-12.9
– Whittaker

Page 3:

Unit test vs Integration/system tests

Page 4:

What vs when

• Process models define conceptual workflow

[Waterfall figure from Royce 1970; visible phases: system, analysis, program design, coding, testing, operations.]

Figure 2. Implementation steps to develop a large computer program for delivery to a customer.

I believe in this concept, but the implementation described above is risky and invites failure. The problem is illustrated in Figure 4. The testing phase which occurs at the end of the development cycle is the first event for which timing, storage, input/output transfers, etc., are experienced as distinguished from analyzed. These phenomena are not precisely analyzable. They are not the solutions to the standard partial differential equations of mathematical physics for instance. Yet if these phenomena fail to satisfy the various external constraints, then invariably a major redesign is required. A simple octal patch or redo of some isolated code will not fix these kinds of difficulties. The required design changes are likely to be so disruptive that the software requirements upon which the design is based and which provides the rationale for everything are violated. Either the requirements must be modified, or a substantial change in the design is required. In effect the development process has returned to the origin and one can expect up to a 100-percent overrun in schedule and/or costs.

One might note that there has been a skipping-over of the analysis and code phases. One cannot, of course, produce software without these steps, but generally these phases are managed with relative ease and have little impact on requirements, design, and testing. In my experience there are whole departments consumed with the analysis of orbit mechanics, spacecraft attitude determination, mathematical optimization of payload activity and so forth, but when these departments have completed their difficult and complex work, the resultant program steps involve a few lines of serial arithmetic code. If in the execution of their difficult and complex work the analysts have made a mistake, the correction is invariably implemented by a minor change in the code with no disruptive feedback into the other development bases.

However, I believe the illustrated approach to be fundamentally sound. The remainder of this discussion presents five additional features that must be added to this basic approach to eliminate most of the development risks.


“I believe in this concept, but the implementation described above is risky and invites failure” (Royce 1970)

Page 5:

From waterfall to agile and beyond

[Figure: functionality delivered over time for Waterfall, Incremental (e.g. RUP), Agile (e.g. XP), and Continuous development; each cycle spans Requirements, Design, Implementation, Testing.]

Page 6:

Levels of Testing in the development process

See figures 1.7 and 1.8

[V-model figure with a regression test arrow: Requirements – Acceptance test; High-level design – System test; Detailed design – Integration test; Code – Unit test.]

Page 7:

Phases of Testing in the development process?

[The same V-model figure with a regression test arrow.]

Time? No!

Page 8:

Benefits of the V-model
• Intuitive and easy to explain
– Makes a good model for training people
– Shows how testing is related to other phases of the waterfall process
• Quite adaptable to various situations
– If the team is experienced and understands the inherent limitations of the model
• Beats the code-and-fix approach on any larger project

Page 9:

Weaknesses of the V-model
• Hard to follow in practice
• Document driven
– Relies on the existence, accuracy, and timeliness of documentation
– Asserts that testing on each level is designed based on the deliverables of a single design phase
• Communicates change poorly
– Does not show how changes, fixes, and test rounds are handled
• Testing windows get squeezed
• Based on the simplistic waterfall model
• Does not fit iterative development

(Brian Marick 2000; James Christie 2008)

Page 10:

Levels, Types, and Phases – not the same thing

Levels ≠ Types ≠ Phases

Page 11:

Test Levels

[V-model figure: Requirements – Acceptance testing; Functional specification – System testing; Architecture design – Integration testing; Module design – Unit testing; Coding; Build Test.]

• Test level – a test level is a group of test activities that focus on a certain level of the test target
– Test levels can be seen as levels of detail and abstraction
– How big a part of the system are we testing?

Page 12:

Test Levels

[Cube figure: level of detail (unit, integration, system, acceptance) on one axis; quality characteristics (functionality, usability, reliability, efficiency, maintainability, portability, accessibility) on another; white box vs. black box on the third.]

Page 13:

Test Types
• Test type – evaluates a system for a number of associated quality goals
– Testing selected quality characteristics
– Testing to reveal certain types of problems
– Testing to verify/validate selected types of functionality or parts of the system
• Common test types
– Functional: installation, smoke, concurrency testing
– Non-functional: security, usability, performance, …
• A certain test type can be applied on several test levels and in different phases

Page 14:

Test Types

[The same cube figure as for test levels, here with "(module)" marked in place of acceptance.]

Page 15:

Test Phases
• Test phase – test phases describe temporal parts of the testing process
– Test phases are phases in a development or testing project
• They do not always match the test levels or types
– In iterative development the phases may not match the test levels
• Often there is unit testing but no unit test phase
• You can have several integration phases
– or integration testing without an integration phase
– Phases might match the test levels
• as depicted in the V-model

Page 16:

≈ Test phases/process

[Cube figure again: levels of detail × quality characteristics × white box / black box.]

– When? Test scheduling/process
– How? Automated vs. manual
– What? New vs. regression test
– Whom? Test organization

Page 17:

Organizing testing (waterfall vs. agile)

[Figure: division of labor between separate groups of programmers and testers (waterfall) vs. integrated teams of customer, programmers, and tester (agile).]

Idea: Testing in collaboration
Challenge: Scaling up

Page 18:

Agile / time-paced development

Time is fixed, scope changes, e.g. Scrum
• 30 days to complete an iteration or sprint
• 90 days to complete an alpha release
• 90 days to complete a beta release
• 180 days for the whole project

Page 19:

Testing in Time-Paced Development

[V-model figure as on page 11.]

• Part of each development task
• Testing is not a phase
• Software is developed incrementally
– It has to be tested incrementally
• Rapid, time-paced development
– Time-boxed releases and increments
– Deadlines are not flexible
• Scope is flexible
– Testing provides information for scoping

Page 20:

Approaches to Testing in Time-Paced Development

1. Automated regression testing
– Automated unit testing
– Test-driven development
– Daily builds and automated tests

2. Stabilisation phase or increment
– Feature freeze
– Testing and debugging at the end of the increment or release

3. Separate system testing
– Independent testing
– Separate testers or testing team

[Figure: testing across increments.]

Page 21:

Time perspective of testing

Page 22:

Coping with the Agile Challenges 1(2)
• Requirements change all the time
• Specifications are never final
– Let them change; test design is part of each task
• Code is never ‘finished’, never ready for testing
• Not enough time to test
– Focus on developing ‘finished’ increments, tracking at the task level
– Testing is part of each development task

Page 23:

Coping with the Agile Challenges 2(2)
• Need to regression test everything in each increment
• Developers always break things again
• How can we trust?
• Trust comes from building in the quality, not from the external testing ‘safety net’
• Automation is critical

Page 24:

Agile development into the extreme – deploy change by change

Continuous Delivery: Huge Benefits, but Challenges Too
Lianping Chen, Paddy Power

// This article explains why Paddy Power adopted continuous delivery (CD), describes the resulting CD capability, and reports the huge benefits and challenges involved. This information can help practitioners plan their adoption of CD and help researchers form their research agendas. //

CONTINUOUS DELIVERY (CD) is a software engineering approach in which teams keep producing valuable software in short cycles and ensure that the software can be reliably released at any time. CD is attracting increasing attention and recognition.

CD advocates claim that it lets organizations rapidly, efficiently, and reliably bring service improvements to market and eventually stay a step ahead of the competition.¹ This sounds great. However, implementing CD can be challenging—especially in the context of a large enterprise’s existing development-and-release environment.

Here, I explain how we adopted CD at Paddy Power, a large bookmaking company. I describe the resulting CD capability and report the huge benefits and challenges involved. These experiences can provide fellow practitioners with insights for their adoption of CD, and the identified challenges can provide researchers with valuable input for developing their research agendas.

The Context
Paddy Power is a rapidly growing company, with a turnover of approximately €6 billion and 4,000 employees. It offers its services in regulated markets, through betting shops, phones, and the Internet.

The company relies heavily on an increasingly large number of custom software applications. These applications include websites, mobile apps, trading and pricing systems, live-betting-data distribution systems, and software used in the betting shops. We develop these applications using a range of technology stacks, including Java, Ruby, PHP, and .NET. To run these applications, the company has an IT infrastructure consisting of thousands of servers in different locations.

These applications are developed and maintained by the Technology Department, which employs about 400 people. A software development team’s size depends on the application’s size and complexity. Our teams range from two to 26 people; most teams have four to eight people.

The release cycle for each application also varies. Previously, each application typically had fewer than six releases a year. For each release cycle, we gathered the requirements at the cycle’s beginning. Engineers worked on development for months. Extensive testing and bug fixing occurred toward the cycle’s end. Then, the developers handed the software over to operations engineers for deploying to production. The deployment involved many manual activities.

This release model artificially delayed features completed early in the release cycle. The value these features could generate was lost, and early feedback on them wasn’t available.

Many releases were a “scary” experience because the release process wasn’t often practiced and there were many error-prone manual activities. Priority 1 incidents caused by manual-configuration mistakes weren’t uncommon. In addition, the release activities weren’t efficient. Just setting up the testing environment could take up to three weeks.

To improve the situation, Paddy Power started an initiative to implement CD. The company established a dedicated team of eight people, which has been working on this for more than two years.

The CD Pipeline
Because we needed to support many diverse applications, we built a platform that lets us create a CD pipeline for each application. Our team operates and maintains this platform. When an application development team needs a new CD pipeline for its application, we create one.

An application’s pipeline might differ slightly from another application’s pipeline, in terms of the number and type of stages, to best suit that application. Figure 1 shows an example pipeline.

Code Commit
The code commit stage provides immediate initial feedback to developers on the code they check in. When a developer checks in code to the software-configuration-management repository, this stage triggers automatically. It compiles the source code and executes unit tests.

When this stage encounters an error, the pipeline stops and notifies the developers. Developers fix the code, get the changes peer reviewed, and check in the code. This triggers the code commit stage again and starts a new execution of the pipeline. If everything goes well, the pipeline automatically moves to the next stage.

Build
The build stage executes the unit tests again to generate a code coverage report, runs integration tests and various static code analyses, and builds the artifacts for release. It uploads the artifacts to the repository that manages them for deployment or distribution. All later pipeline stages will run with this set of artifacts.

Before we moved to CD, the binaries released to production might differ from the tested binaries. This was because we built the software multiple times, each for a different stage. Each time we built the software, we ran the risk of introducing differences. We’ve seen the bugs these differences cause. Fixing them was frustrating because the software worked for the developers and testers but didn’t work in production. The CD pipeline eliminated these bugs.

If anything goes wrong, the pipeline stops and notifies the developers; otherwise, it automatically moves to the next stage.

Acceptance Test
The acceptance test stage mainly ensures that the software meets all specified user requirements. The pipeline creates the acceptance test environment, a production-like environment with the software deployed to it. This involves provisioning the servers, configuring them (for example, for network and security), deploying the software to them, and configuring the software. The pipeline then runs the acceptance test suite in this environment.

Previously, setting up this environment was a manual activity. For one of the very complex applications, setup took two weeks of a developer’s time. Even for a smaller application, it took up to half a day.

For a new project, setup took even longer. The developers needed to request new machines from the infrastructure team, request that the Unix or Windows engineering team configure the machines, request network engineers to open connections between the machines, and so on. This could take a month.

With the CD pipeline, the developers don’t need to perform these activities. The pipeline automatically sets up the environment in a few minutes.

Similar to the other stages, if any errors arise, the pipeline stops and notifies the developers; otherwise, it moves to the next stage.

FIGURE 1. An example continuous delivery (CD) pipeline. Promotion—advancing the pipeline’s execution from one stage to another—can be automatic or manual. Our confidence in a build’s production readiness increases as the build passes through each pipeline stage. (Stages shown: code commit, build, acceptance test, performance test, manual test, production; promotion is automatic for the early stages and manual for the later ones.)

Page 25:

Continuous delivery – alternative view

[Figure 5.2  Changes moving through the deployment pipeline: a delivery team check-in triggers version control, build & unit tests, automated acceptance tests, user acceptance tests, and release; each stage feeds back pass (P) or fail (F) to the team, and the later stages require approval.]

configuration of a production server. The discipline of the deployment pipeline mitigates this.

Second, when deployment and production release themselves are automated, they are rapid, repeatable, and reliable. It is often so much easier to perform a release once the process is automated that they become “normal” events—meaning that, should you choose, you can perform releases more frequently. This is particularly the case where you are able to step back to an earlier version as well as move forward. When this capability is available, releases are essentially without risk. The worst that can happen is that you find that you have introduced a critical bug—at which point you revert to an earlier version that doesn’t contain the bug while you fix the new release offline (see Chapter 10, “Deploying and Releasing Applications”).

To achieve this enviable state, we must automate a suite of tests that prove that our release candidates are fit for their purpose. We must also automate deployment to testing, staging, and production environments to remove these manually intensive, error-prone steps. For many systems, other forms of testing and so other stages in the release process are also needed, but the subset that is common to all projects is as follows.

Page 26:

A/B testing / Continuous experimentation
• Run multiple versions – evaluate online
• Deployment strategies
– Canary releases
– Dark launches
– Gradual rollouts
– A/B testing

2.3 Example Live Testing Strategy

A core observation underlying this paper is that rollouts in practice often consist of multiple sequential phases of live testing. For instance, a concrete rollout strategy may consist of initial dark launching, followed, if successful, by a gradual rollout over a defined period of time. If no problems are discovered, the new change may be A/B tested against an alternative implementation, which may have run through a similar live testing sequence.

Figure 1: A simplified example of a live testing strategy with multiple phases. A change is gradually rolled out to more and more users, and subsequently A/B tested. (Canary & gradual release: fastSearch serves 1%, 5%, 10%, and 20% of traffic for one day each; then a five-day A/B test at 50%/50%; finally either fastSearch or search serves 100%, XOR.)

A simple example live testing strategy, which will be used throughout the remainder of the paper as a running example, is given in Figure 1. Assume a company hosting a service-based web application selling consumer electronics. One of the integral services is the search service allowing customers to look for products they are interested in and to get an overview of the product catalog. The search service shall be redesigned and implement a new algorithm for delivering more accurate search results based on other users’ search requests and their buying patterns. As replacing the previous slow, but working, search service by the new one is associated with risks, the service shall be canary tested first. Once the service performs as expected from a technical perspective, an A/B test should be conducted between the stable and canary variant. In case that the new implementation performs better according to a priori defined business metrics, a complete rollout should happen, otherwise a fallback to the stable version is conducted. The canary tested reimplementation fastSearch shall be rolled out to 1% of the US users first. Search and fastSearch are continuously monitored and collected metrics include response time, processing time (i.e., how long does the actual search algorithm take to get results), number of 404 requests, and the number of search requests per hour. Thresholds for fastSearch are set based on historic values collected for the stable search service, e.g., response time below 150 ms. On a daily basis, and as long as the monitored metrics do not show any abnormalities, fastSearch shall be gradually rolled out to more and more users, first to 5%, then 10%, 20%, until at 50% the A/B test is conducted as shown in Figure 1. Besides more technical metrics, the A/B test focuses also on a business perspective (e.g., comparing the number of sold items on both variants), and is conducted for 5 days to capture enough data supporting statistical reasoning. State-of-the-art tools, such as the Configurator used by Facebook [24], require strategies such as this running example to be manually implemented by a human release engineer, for analyzing the data and tweaking the rollout configuration after every step. This is labor-intensive, error-prone, and requires substantial experience in data science. In this paper, we propose Bifrost as an automated and more principled approach towards managing such release strategies.

3. A MODEL OF LIVE TESTING

Before explaining the implementation of the Bifrost middleware, we first introduce the fundamental ideas and characteristics that the system is based on, as well as the underlying formal model.

3.1 Basic Characteristics

After thorough analysis of live testing in general, and the practices discussed in Section 2 specifically, we have identified the following basic characteristics of a formal model for live testing.

Data-Driven. Live testing requires extensive monitoring to decide on test outcomes or evaluate the current health state. This monitoring data is collected using existing tools in the application’s landscape using Application Performance Monitoring, such as Kieker [26] or New Relic¹. A model of live testing needs to support the inclusion of monitoring data into its runtime decision process.

Timed Execution. Live testing requires the collection, analysis, and processing of data in defined intervals. Gradual rollouts depend on timed increments to gradually introduce new versions or control the routed traffic. Depending on the concrete usage scenario, these methods may stretch over minutes, hours, or days.

Parallel Execution and Traffic Routing. All live testing practices require the parallel operation of multiple versions of a service, e.g., a stable previous version and an experimental new implementation for canary releases, or two alternative implementations for A/B testing. This also requires the correct routing of users to a specific version. For instance, canary releases are often targeted at specific user groups. For A/B tests, it is often important that the same user is directed to the same implementation across sessions.

Ordered Execution. Ordered execution is required to form live testing strategies consisting of chained phases of canary releasing, gradual rollouts, dark launches, and A/B tests. An example for such a live testing chain is given in Section 2.

3.2 Live Testing Model

Based on these identified characteristics, we derived a formal representation for live testing strategies. To begin with, a strategy S is modeled as a 2-tuple:

S : ⟨B, A⟩

A strategy S consists of a set of services {b₁, …, bₙ} and a deterministic finite automaton A. In our model, services bᵢ ∈ B represent atomic architectural components, for instance services in a microservice-based system. Services bᵢ themselves are available in different versions (e.g., a stable previous search service version, and an experimental new

¹ https://newrelic.com

Gerald Schermann et al. “Bifrost – Supporting Continuous Deployment with Automated Enactment of Multi-Phase Live Testing Strategies”. Middleware, 2016. doi: 10.1145/2988336.2988348.
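As a rough sketch of the routing mechanics behind the running example above (search vs. fastSearch) — not Bifrost’s actual implementation — deterministic bucketing can keep the same user on the same variant across sessions while the rollout percentage is raised phase by phase. All names and numbers below are illustrative assumptions:

```java
import java.util.UUID;

// Sketch of gradual-rollout traffic routing: hash the user id into a
// stable bucket, so the same user always sees the same variant, and
// compare the bucket against the current rollout percentage.
public class RolloutRouter {
    private volatile int fastSearchPercent; // 1, 5, 10, 20, 50, ... 100

    public RolloutRouter(int fastSearchPercent) {
        this.fastSearchPercent = fastSearchPercent;
    }

    // Deterministic: the same userId always lands in the same bucket,
    // which keeps A/B comparisons stable across sessions.
    public String variantFor(UUID userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < fastSearchPercent ? "fastSearch" : "search";
    }

    // Called between phases, e.g. 1 -> 5 -> 10 -> 20 -> 50, as long as
    // the monitored metrics show no abnormalities.
    public void promote(int newPercent) {
        this.fastSearchPercent = newPercent;
    }
}
```

With routing reduced to one number per phase, advancing the strategy becomes the kind of mechanical step that a tool such as Bifrost is meant to automate instead of a release engineer.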

Page 27:

Firefox case study

Empir Software Eng (2015) 20:1384–1425. DOI 10.1007/s10664-014-9338-4

On rapid releases and software testing: a case study and a semi-systematic literature review

Mika V. Mäntylä · Bram Adams · Foutse Khomh · Emelie Engström · Kai Petersen

Published online: 29 October 2014. © Springer Science+Business Media New York 2014

Abstract  Large open and closed source organizations like Google, Facebook and Mozilla are migrating their products towards rapid releases. While this allows faster time-to-market and user feedback, it also implies less time for testing and bug fixing. Since initial research results indeed show that rapid releases fix proportionally less reported bugs than traditional releases, this paper investigates the changes in software testing effort after moving to rapid releases in the context of a case study on Mozilla Firefox, and performs a semi-systematic literature review. The case study analyzes the results of 312,502 execution runs of the 1,547 mostly manual system-level test cases of Mozilla Firefox from 2006 to 2012 (5 major traditional and 9 major rapid releases), and triangulates our findings with a Mozilla QA engineer. We find that rapid releases have a narrower test scope that enables a deeper investigation of the features and regressions with the highest risk. Furthermore, rapid releases make testing more continuous and have proportionally smaller spikes before the main release. However, rapid releases make it more difficult to build a large testing community, and they

Communicated by: Yann-Gaël Guéhéneuc and Tom Mens

M. V. Mäntylä, Department of Computer Science and Engineering, Aalto University, Espoo, Finland. e-mail: [email protected]
B. Adams, MCIS, Polytechnique Montreal, Quebec, Canada. e-mail: [email protected]
F. Khomh, SWAT, Polytechnique Montreal, Quebec, Canada. e-mail: [email protected]
E. Engström, Department of Computer Science, Lund University, Lund, Sweden. e-mail: [email protected]
K. Petersen, School of Computing, Blekinge Institute of Technology, Karlskrona, Sweden. e-mail: [email protected]

oriented”. This observation matches the spikes in Fig. 16 for testing related to the main releases of both TR and RR. However, we received no confirmation of our findings in Fig. 15 that individual RR releases would be more deadline-oriented. Perhaps, this observation, which comes from 213 individual releases, is unnoticeable in practice. Additionally, the interview confirmed that after the switch to rapid release, testing continuously is done in a high volume, in the following sense: “Before the switch to rapid release the deadlines ran in serial, now they run in parallel. As such we now test in parallel (or near parallel). Betas get most of our attention as they’re the closest to release, Aurora is our second priority, and Nightly is our third priority”. However, the QA engineer pointed out that periods with very low official testing volume in TR, in particular the zeros in Fig. 16, might have masked other means of testing not visible in the Litmus system: “When builds were in Alpha before rapid release there was daily dogfooding”.

5 Confounding Factors and a Theoretical Model

During our empirical study, we realized that there are two important confounding factors that may affect the results: release length and the project’s natural evolution. First, one could easily think that the differences observed between TR and RR are not due to the choice of

[Plot omitted; y-axis: test executions, 28-day average, 0–400; x-axis: 2007–2012.]

Fig. 16  28-day moving average of the number of test executions for all TR (blue) and RR releases (red). Dashed lines represent main release dates for TR (blue: 2.0, 3.0, 3.5, 3.6, 4.0) and RR (red: 5.0–13.0)


Table 9  Partial correlation of three variables (release model, length and date) with test effort, while controlling for two out of the three variables

FF-RQ                                  Partial Kendall correlation coefficients
                                       Release Model   Release Length   Project Evolution
#Test executions/day (FF-RQ1)          0.026           −0.414***        0.048
#Test cases/day (FF-RQ1)               −0.231***       −0.593***        0.017
#Testers/day (FF-RQ2)                  −0.224***       −0.331***        −0.069
#Builds/day (FF-RQ3)                   −0.105*         −0.258***        −0.176***
#Locales/day (FF-RQ4)                  −0.281***       −0.443***        −0.115*
#OSs/day (FF-RQ4)                      0.149**         −0.822***        0.130**
Cohen’s Kappa (Test suite) (FF-RQ5)    0.187***        0.117*           0.177***
Cohen’s Kappa (Tester) (FF-RQ5)        0.422***        −0.036           0.030
Temporal test distance (FF-RQ6)        −0.176***       −0.002           −0.110*

Significance levels: * = 0.05, ** = 0.01, *** = 0.001

which test cases are important. Finally, testing starts later as time moves along. This might be caused by the use of a more similar set of test cases: testing a familiar set of test cases takes less time, thus, one starts working on it later. However, these findings regarding project evolution are hypotheses and should be more thoroughly investigated in further studies.

Taking all of this into account, our interpretation of the relation between release model and test effort is depicted in Fig. 17. Our initial hypothesis was that the shorter release length of RR releases was responsible for increased testing effort. However, even after controlling for the effect of the RRs’ shorter release length, the RRs have a lower number of test cases, testers, tested builds and tested locales. This means that Firefox’ development process must have changed, as supported by the received feedback, since otherwise one would actually expect a higher proportion of the above metrics. The change in process has focused the testing efforts for the RR releases compared to the TR releases. Only the increase in test executions per day can be fully attributed to the shorter release length.

Table 10 shows the effects of rapid releases before and after controlling for the confounding factors. We like to think that the first row of the table shows the impact of RR from a practitioner’s viewpoint. Practitioners are not interested whether something stems from changes to release length or to the rapid release model as these two are highly correlated, as seen in Fig. 2 and Table 8. However, from a researcher or developer viewpoint it is interesting to see whether something is explained directly by the release length or by the change in the release model, cf. Table 9. Additionally, the separation allowed us to create a model for rapid release testing that explains the effects of the shortening of release cycle length and the necessary level of test effort, see Fig. 17.

Fig. 17 Model explaining the relationship between release model, release length and test effort

Page 28:

Test Levels

[Cube figure repeated: level of detail (unit, integration, system, acceptance) × quality characteristics (functionality, usability, reliability, efficiency, maintainability, portability, accessibility) × white box / black box.]

Page 29:

Unit testing
• Unit = smallest testable software component (e.g. labs 1 and 2)
– Objects
– Procedures / functions
– Reusable components
• Tested in isolation
• Usually done by the programmer during development
– A pair tester can help
– A developer guide or coding guideline can require it
• Also known as component or module testing
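To make this concrete, here is a minimal sketch of a unit test in JUnit 5, exercising a hypothetical Account class in isolation; the class and its methods are illustrative, not part of the course labs.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Hypothetical unit under test: the smallest testable component here.
class Account {
    private int balance;
    Account(int openingBalance) { balance = openingBalance; }
    int balance() { return balance; }
    void deposit(int amount) { balance += amount; }
    void withdraw(int amount) {
        if (amount > balance) throw new IllegalArgumentException("insufficient funds");
        balance -= amount;
    }
}

class AccountTest {
    @Test
    void depositIncreasesBalance() {
        Account account = new Account(100); // isolated: no collaborators, no I/O
        account.deposit(50);
        assertEquals(150, account.balance());
    }

    @Test
    void overdraftIsRejected() {
        Account account = new Account(100);
        assertThrows(IllegalArgumentException.class, () -> account.withdraw(200));
    }
}
```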

Page 30:

Scaffolding: test harness

[Figure 3.2  Dynamic unit test environment: a test driver calls the unit under test, passing input parameters and receiving output parameters; the unit under test calls stubs, which acknowledge the calls and print results.]

in fact, called. Such evidence can be shown by merely printing a message. Second, the stub returns a precomputed value to the caller so that the unit under test can continue its execution.

The driver and the stubs are never discarded after the unit test is completed. Instead, those are reused in the future in regression testing of the unit if there is such a need. For each unit, there should be one dedicated test driver and several stubs as required. If just one test driver is developed to test multiple units, the driver will be a complicated one. Any modification to the driver to accommodate changes in one of the units under test may have side effects in testing the other units. Similarly, the test driver should not depend on the external input data files but, instead, should have its own segregated set of input data. The separate input data file approach becomes a very compelling choice for large amounts of test input data. For example, if hundreds of input test data elements are required to test more than one unit, then it is better to create a separate input test data file rather than to include the same set of input test data in each test driver designed to test the unit.

The test driver should have the capability to automatically determine the success or failure of the unit under test for each input test data. If appropriate, the driver should also check for memory leaks and problems in allocation and deallocation of memory. If the module opens and closes files, the test driver should check that these files are left in the expected open or closed state after each test. The test driver can be designed to check the data values of the internal variable that
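As a rough illustration of the harness in Figure 3.2, the sketch below shows a hand-written test driver and a stub in Java. All names (RateSource, PriceCalculator, and so on) are hypothetical, not from the book; the stub prints evidence that it was called and returns a precomputed value, and the driver decides pass or fail automatically, as the text above describes.

```java
// Interface the unit under test depends on.
interface RateSource {
    double vatRate(String country);
}

// Unit under test: computes a gross price via its collaborator.
class PriceCalculator {
    private final RateSource rates;
    PriceCalculator(RateSource rates) { this.rates = rates; }
    double gross(double net, String country) {
        return net * (1.0 + rates.vatRate(country));
    }
}

// Stub: shows it was called and returns a precomputed value so the
// unit under test can continue its execution.
class RateSourceStub implements RateSource {
    public double vatRate(String country) {
        System.out.println("stub called with " + country); // evidence of the call
        return 0.25;                                        // precomputed value
    }
}

// Test driver: supplies input parameters, captures the output, and
// determines success or failure automatically.
public class PriceCalculatorDriver {
    public static void main(String[] args) {
        PriceCalculator unit = new PriceCalculator(new RateSourceStub());
        double actual = unit.gross(100.0, "SE");
        double expected = 125.0;
        if (Math.abs(actual - expected) < 1e-9) {
            System.out.println("PASS");
        } else {
            System.out.println("FAIL: expected " + expected + " but got " + actual);
        }
    }
}
```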

Page 31:

Test-driven development

1. Write a test

2. See it fail

3. Make it run

4. Make it right
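A minimal sketch of one such cycle, assuming JUnit 5 and a hypothetical Stack class (not from the lecture):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Steps 1-2: write the test first and see it fail -- at this point
// Stack does not exist yet (compile error), or push()/top() do nothing.
class StackTest {
    @Test
    void pushedElementIsOnTop() {
        Stack stack = new Stack();
        stack.push(42);
        assertEquals(42, stack.top());
    }
}

// Step 3: make it run -- the simplest code that passes the test.
class Stack {
    private int top;
    void push(int value) { top = value; }
    int top() { return top; }
}

// Step 4: make it right -- refactor (e.g. back the stack with a proper
// list and handle empty stacks) while the test keeps passing, then
// continue with the next test.
```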

Page 32:

Test first or last?

• Contribution 2: Identification of outcomes for individual primary studies with respect to the outcome variables and their interpretation in the light of rigor and relevance. Studies are categorized in four groups: high rigor and high relevance, high rigor and low relevance, low rigor and high relevance, and low rigor and low relevance. The outcomes in each category are compared.
• Contribution 3: The outcome of this literature review is compared with the outcome of previous reviews. This provides new insights if the conclusions change and also adds further evidence for the usefulness of using the assessment instrument proposed by [12].

Overall, this assessment has practical relevance as companies should base their decisions on those studies with high rigor and relevance [12], in particular in case of conflicting results, as presented in previous literature reviews on TDD.

The remainder of the paper is structured as follows: Section 2 presents related work. Section 3 presents the research method (review protocol). Section 4 provides the results, followed by a Discussion (Section 5) and Conclusion (Section 6).

2. Related work

We identified 8 works on test-driven development with the reviews being the main or a major part of the contribution (cf. [6,8,13,7,10,9,14,15]). One review focuses on limiting factors that hinder the adoption of TDD [8], while we are focusing on the actual outcome achieved by using TDD in comparison to TLD. Hence, we discuss studies [13,7,10,9,14,15] in detail as those are directly related and most relevant to our systematic review, using the questions suggested in [16,17]. The results of the systematic review by Shull et al. [9] are based on [18].

What are the reviews’ objectives? Kollanus conducted a systematic review reported in two papers [13,7]. He analyzed empirical studies with respect to quality and productivity. While doing so, he distinguished different study types (in particular case study, experiment, and others).

Shull et al. [9] assessed internal code quality, external code quality and productivity. They assessed the results considering the relevance dimension by classifying a sub-set of studies as being of high quality. This has been done through defining criteria similar to what Ivarsson and Gorschek [12] consider as the relevance dimension. Shull et al. highlighted studies when subjects are on graduate or professional level, followed the TDD text-book process, and the size of the development task was realistic.

Sfetsos and Stamelos [10] reported on the impact of agile practices on quality, including TDD, pair programming, and other agile practices.

Rafique and Misic [14] conducted a statistical meta-analysis to determine the effects of TDD on external quality and productivity.

Jeffries and Melnik [15] aimed at summarizing the outcomes of TDD studies in academia and industry in relation to quality and productivity impacts.

We complement the existing studies in the following ways: We extend all studies by eliciting a wider set of outcome variables. Furthermore, we complement the study by Shull et al. [9] by adding the dimension of scientific rigor, and extend the relevance dimension to consider whether the evaluation has been done on a real industrial system, the research methodology is industry focused, and that the research context in which the study is conducted matches real usage (industry). The rigor and relevance dimensions provide a more fine-grained scale for rigor and relevance. The scale is used to compare sets of studies with different levels of quality with each other. Further details on the quality assessment can be found in Section 3.7 and are based on [12].

What sources were selected to identify the primary studies? Were there any restrictions? Was the search done on appropriate databases and were other potentially important sources explored? What were the restrictions and is the literature search likely to have covered all relevant studies?

Kollanus [13,7] used IEEE Explore, ACM, and Springer. The time period has not been specified. The search terms used were “TDD” and “test-driven development”.

Shull et al. [9] have not specified the data bases and the search string. The time period covered papers from 1999.

[Flowchart omitted.] Fig. 1. TDD vs. TLD (based on [87]). TDD: write a test, run it; if it fails, write code and run the test again until it passes, then refactor and repeat with the next test. TLD: write code, then write a test and run it; fix errors until the test passes.

H. Munir et al. / Information and Software Technology 56 (2014) 375–394, p. 377

Page 33:

Test first or last? [Munir et al. 2014]

[2×2 matrix of review outcomes along the axes high rigor and high relevance:
– No difference when TDD is compared with TLD
– Disagree on whether positive or no effect of using TDD; one study shows negative results
– TDD improves code quality, reduces complexity, and keeps or loses productivity levels
– Potential to increase external quality, but at the expense of time and productivity]

Page 34:

Systematic literature reviews on TDD

Study                            Quality with TDD               Productivity with TDD
Bissi et al.                     Improvement                    Inconclusive
Munir et al.                     Improvement or no difference   Degradation or no difference
Rafique and Misic                Improvement                    Inconclusive
Turhan et al. and Shull et al.   Improvement                    Inconclusive
Kollanus                         Improvement                    Degradation
Siniaalto                        Improvement                    Inconclusive

I. Karac and B. Turhan. What do we (really) know about test-driven development? IEEE Software, 35(4):81–85, 2018.

“TDD’s internal and external dynamics are more complex than the order in which tests are written. There’s no convincing evidence that TDD consistently fares better than any other development method, at least those methods that are iterative.”

Page 35:

Integration testing
• More than one (tested) unit
• Detecting defects
– On the interfaces of units
– In the communication between units
• Helps assemble a whole system incrementally
• Done by developers/designers or independent testers
– Preferably developers and testers in collaboration
• Often omitted due to setup difficulties
– Time is more efficiently spent on unit and system tests

Page 36:

Types of interfaces
• Procedure call interface
• Shared memory interface
• Message passing interface

[Naik & Tripathy, pp. 159–160]

Page 37:

System integration techniques
• Incremental
• Top down
• Bottom up
• Sandwich
• Big bang

Page 38:

Integration testing

[Figure: incremental integration – top-down integration uses stubs; bottom-up integration uses drivers.]

Page 39:

System testing
• Testing the system as a whole
• Functional
– Functional requirements and requirements-based testing
• Other quality attributes
– Performance, stress, configuration, security, ...
– As important as functional requirements
– Often poorly specified
– Often done by specialists
– Must be tested
• Often done by an independent test group
– Collaborating developers and testers

Page 40:

Types of system testing

Page 41:

Regression testing
• Testing of changed software
– Retest all
– Selective regression testing
• Regression test suite
• Types of changes
– Defect fixes
– Enhancements
– Adaptations
– Perfective maintenance

[Skoglund, Runeson, ISESE05]

Page 42:

Retest all
• Assumption:
– Changes may introduce errors anywhere in the code
• Expensive, prohibitive for large systems
• Reuse the existing test suite
• Add new tests as needed
• Remove obsolete tests (discrepancies between expected and actual output)

Page 43:

Selective regression testing
• Impact analysis
• Only code impacted by change needs to be retested
• Select tests that exercise such code
• Add new tests if needed
• Remove obsolete tests
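A minimal sketch of the selection step, assuming test-to-class traceability is available, e.g. from a coverage run; the coverage map and all class names are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Selective regression testing: rerun only tests whose exercised
// classes intersect the change set identified by impact analysis.
public class SelectiveRegression {

    // test name -> classes it exercises (from a coverage run)
    static final Map<String, Set<String>> COVERAGE = Map.of(
            "PaymentTest", Set.of("Payment", "Ledger"),
            "InvoiceTest", Set.of("Invoice", "Ledger"),
            "LoginTest",   Set.of("Session", "User"));

    static List<String> select(Set<String> changedClasses) {
        return COVERAGE.entrySet().stream()
                .filter(e -> e.getValue().stream().anyMatch(changedClasses::contains))
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        // A change to Ledger selects PaymentTest and InvoiceTest,
        // but not LoginTest.
        System.out.println(select(Set.of("Ledger")));
    }
}
```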

Page 44:

Industry practice of Regression Testing

What?
• Retesting, at different levels

When?
• Daily – before release

What?
• Functional focus
• “Smoke test” = selection

Challenges
• Cost/benefit of automation
• Automated test environment
• Presentation of test results

Good practices
• Automated daily module tests
• Automation below the UI
• Visualize test progress

E. Engström and P. Runeson. A qualitative survey of regression testing practices. PROFES, LNCS 6156, pp. 3–16. Springer, 2010.

Page 45:

Industry relevant research
• Dependent on many factors

Empirical Software Engineering (2019) 24:2020–2055. https://doi.org/10.1007/s10664-018-9670-1

On the search for industry-relevant regression testing research

Nauman bin Ali¹ · Emelie Engström² · Masoumeh Taromirad³ · Mohammad Reza Mousavi³,⁴ · Nasir Mehmood Minhas¹ · Daniel Helgesson² · Sebastian Kunze³ · Mahsa Varshosaz³

Published online: 12 February 2019. © The Author(s) 2019

Abstract  Regression testing is a means to assure that a change in the software, or its execution environment, does not introduce new defects. It involves the expensive undertaking of rerunning test cases. Several techniques have been proposed to reduce the number of test cases to execute in regression testing, however, there is no research on how to assess industrial relevance and applicability of such techniques. We conducted a systematic literature review with the following two goals: firstly, to enable researchers to design and present regression testing research with a focus on industrial relevance and applicability and secondly, to facilitate the industrial adoption of such research by addressing the attributes of concern from the practitioners’ perspective. Using a reference-based search approach, we identified 1068 papers on regression testing. We then reduced the scope to only include papers with explicit discussions about relevance and applicability (i.e. mainly studies involving industrial stakeholders). Uniquely in this literature review, practitioners were consulted at several steps to increase the likelihood of achieving our aim of identifying factors important for relevance and applicability. We have summarised the results of these consultations and an analysis of the literature in three taxonomies, which capture aspects of industrial relevance regarding the regression testing techniques. Based on these taxonomies, we mapped 38 papers reporting the evaluation of 26 regression testing techniques in industrial settings.

Keywords  Regression testing · Industrial relevance · Systematic literature review · Taxonomy · Recommendations

Communicated by: Shin Yoo

Nauman bin Ali, [email protected]
Emelie Engström, [email protected]

1 Blekinge Institute of Technology, Karlskrona, Sweden
2 Lund University, Lund, Sweden
3 Halmstad University, Halmstad, Sweden
4 University of Leicester, Leicester, UK


Table 5 Mapping of techniques to the taxonomy

ScopeAddressedcontextfactors

Desiredeffects

Utilised information (enti-ties)Technique Study ID

Selection

Prioritiz

ation

Minim

ization

System

-related

Process-related

Peop

le-related

Testcoverage

Efficiency

and

Effectiv

eness

Awareness

Requirements

Designartefacts

Source

code

Interm

ediatecode

Binarycode

Testcases

Testexecution

Testrepo

rts

Issues

TEMSA S13 – S18History based prioritiza-tion (HPro)

S26

classification tree testing(DART)

S19, S20

I-BACCI S9 – S12Value based S33multi-perspective priori-tisation (MPP)

S3, S4

RTrace S21Echelon S29Information retrieval(REPiR)

S2

EFW S7, S8Fix-cache S24, S25Most Frequent Failures S34Continuous Multi-scaleAdditional Greedy pri-oritisation (CMAG)

S28

GWT-SRT S30Clustering (based oncoverage, fault historyand code metrics)

S37

FTOPSIS S22Difference engine S1Change and coveragebased (CCB)

S5

Similarity based minimi-sation

S36

THEO S32DynamicOverlay /OnSpot

S23

Class firewall S6model-based architec-tural regression testing

S35

component interactiongraph test case selection

S31

keyword-based-traces S27FaultTracer S38
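Many of the selection techniques above share one core idea: keep a mapping from each test case to the entities it exercises, and re-run only the tests whose entities were affected by the change. A minimal, hypothetical sketch of that idea (the coverage map, file names, and function are invented for illustration and do not reproduce any specific technique from the table):

```python
# Minimal sketch of change-based regression test selection: re-run only
# the tests whose covered files were touched by the change.

# Test-to-coverage mapping, e.g. recorded by a coverage tool during an
# earlier full run (all names are illustrative).
coverage_map: dict[str, set[str]] = {
    "test_login":    {"auth.py", "session.py"},
    "test_checkout": {"cart.py", "payment.py"},
    "test_search":   {"search.py"},
}

def select_tests(changed_files: set[str]) -> list[str]:
    """Select every test that covers at least one changed file."""
    return sorted(test for test, files in coverage_map.items()
                  if files & changed_files)

if __name__ == "__main__":
    # A commit touching payment code selects only the checkout test.
    print(select_tests({"payment.py"}))  # -> ['test_checkout']
```

Real techniques differ mainly in which entities they track (source code, binaries, requirements, test history) and in how safe the selection must be, which is exactly what the taxonomy dimensions above capture.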

Page 46

Acceptance testing (α, β)

• Final stage of validation
  – Customer (user) should perform it or be closely involved
  – Customer can perform any test they wish
  – Final user sign-off
• Approach
  – Mixture of scripted and unscripted testing (a minimal scripted check is sketched below)
  – Performed in the real operating environment
• Project
  – Contract acceptance
  – Customer viewpoint
  – Validates that the right system was built
• Product
  – Final checks on releases
  – User viewpoint throughout the development; validation
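As an illustration of the scripted part of the mix, one acceptance check might be automated as below. This is a sketch under assumptions: the environment URL and the check itself are placeholders, since real acceptance criteria come from the customer's contract and sign-off list.

```python
# Minimal sketch of one scripted acceptance check, executed against the
# real operating environment. URL and assertion are placeholders.
import urllib.request

ACCEPTANCE_ENV = "https://staging.example.com"  # assumed deployment URL

def test_start_page_is_reachable():
    # One scripted item on the customer's sign-off list; unscripted
    # exploratory sessions complement checks like this one.
    with urllib.request.urlopen(f"{ACCEPTANCE_ENV}/") as response:
        assert response.status == 200
```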

Page 47

To discuss

• What is the difference between test levels and test phases?

• Which of the test types are more/less feasible for different test levels?

Page 48

Planning and follow-up – Chapter 12

Page 49

Test management

• Goal – Reach India
• Policy – Go west
• Plan – Get three ships
• Follow-up – …

Page 50

Plans are nothing; planning is everything

Former US President Dwight D. Eisenhower (1890–1969)

Page 51

Test planning

• Objectives
• What to test
• Who will test
• When to test
• How to test
• When to stop

Page 52

Test Planning – different levels

• Test policy – company or product level
• Test strategy – company or product level
• Master test plan – project level
• Detailed test plans – one for each test level within a project (e.g. unit, system, ...)

Example plan hierarchy: a master test plan, with an acceptance test plan, a system test plan, a unit and integration test plan, and an evaluation plan beneath it.

The number of levels needed depends on context (product, project, company, domain, development process, regulations, …)

Page 53

Software Project Management Plan

• Goals (quantitative or qualitative)
  – Business
  – Technical
  – Business/technical
  – Political
• Policies
  – High-level statements of principle or course of action that are used to govern a set of activities in an organization

Page 54

System test plan (Table 12.1)
1. Introduction
2. Feature description
3. Assumptions
4. Test approach
5. Test suite structure
6. Test environment
7. Test execution strategy
8. Test effort estimation (a simple sketch follows below)
9. Scheduling and milestones
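Item 8, test effort estimation, often begins as back-of-the-envelope arithmetic over the planned suite. A hypothetical sketch of that kind of calculation (all counts and rates are invented placeholders, not figures from the book):

```python
# Back-of-the-envelope test effort estimation (illustrative numbers).
new_test_cases = 120          # to be designed and executed
regression_test_cases = 400   # to be re-executed only

design_hours_per_case = 1.5   # assumed average design effort
execution_hours_per_case = 0.5
reruns_per_case = 2           # expected re-executions after fixes

effort_hours = (
    new_test_cases * design_hours_per_case
    + (new_test_cases + regression_test_cases)
      * execution_hours_per_case * reruns_per_case
)
print(f"Estimated effort: {effort_hours:.0f} h "
      f"≈ {effort_hours / 8:.0f} person-days")
```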

Page 55

Standards

• IEEE 829-1998, Standard for Software Test Documentation
• IEEE 1012-1998, Standard for Software Verification and Validation
• IEEE 1008-1993, Standard for Software Unit Testing

→ superseded by ISO/IEC/IEEE 29119, Software Testing (Part 1: Concepts and definitions)

Page 56

Test plan according to IEEE Std 829-1998

a) Test plan identifier
b) Introduction
c) Test items
d) Features to be tested
e) Features not to be tested
f) Approach
g) Item pass/fail criteria
h) Suspension criteria and resumption requirements
i) Test deliverables
j) Testing tasks
k) Environmental needs
l) Responsibilities
m) Staffing and training needs
n) Schedule
o) Risks and contingencies
p) Approvals

Page 57

A fresh contrast

• http://googletesting.blogspot.se/2011/09/10-minute-test-plan.html

• https://www.youtube.com/watch?v=QEu3wmgTLqo

Page 58

• Attributes
  – Application properties that need verification
• Components
  – Parts of the application that need verification
• Capabilities
  – What purpose does the application serve for users?

(the sketch below combines the three into a capability matrix)
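These three concepts come from the ACC (Attribute–Component–Capability) planning method described in the "10-minute test plan" post linked above: capabilities tie a component to an attribute and are what actually gets tested. A small sketch of such a matrix as plain data (all entries are invented for illustration):

```python
# Sketch of an ACC (Attribute-Component-Capability) matrix as data:
# keys pair a component with an attribute, and each cell lists the
# testable capabilities that tie the two together (entries invented).
acc_matrix: dict[tuple[str, str], list[str]] = {
    ("Login",  "Secure"):   ["Passwords are never stored in plain text"],
    ("Login",  "Fast"):     ["Sign-in completes within two seconds"],
    ("Search", "Accurate"): ["Results match the query terms"],
}

def capabilities_for(attribute: str) -> list[str]:
    """List every capability that verifies a given attribute."""
    return [cap
            for (component, attr), caps in acc_matrix.items()
            if attr == attribute
            for cap in caps]

print(capabilities_for("Secure"))  # -> the checks guarding "Secure"
```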

Page 59

To discuss

• What are the pros and cons of test standards?

• Which is more important, the test plan or the planning?

Page 60

This week

• Project
  – Work on your own.
  – Book a meeting if you haven't done so yet.
• Lab
  – Lab 3: Inspection and estimation
  – Delivery of report (Lab 2) on Thursday

Report via e-mail to [email protected] & [email protected]. Subject line: labX and your student ID.

Page 61

Recommended exercises

• Chapter 7– 1, 2, 3, 4, 5, 6, 7

• Chapter 12– 1, 2, 3, 5, 8, 13