
The Distribution of Faults in a Large Industrial Software System

Thomas Ostrand, Elaine Weyuker

AT&T Labs -- Research, Florham Park, NJ

The $59 Billion* Question

• Can we predict where severe bugs are likely to be?

• Can we identify characteristics of code units, files, or modules that indicate a higher probability of bugs? That is, can we characterize the fault-proneness of code units?

*The Economic Impacts of Inadequate Infrastructure for Software Testing, NIST, May 2002

Software Engineering Folklore

• A relatively small number of modules have most of the faults

• If a relatively small number of modules have most faults, the reason is that they also contain most of the code

• Large modules are buggier than small ones

• Buggy early → Buggy late (inherently bad modules)

• New code is {more, less} buggy than old code

• Code metrics are {good, bad} predictors of code quality

The Case Study

We studied the fault database of a

• large

• currently in-use

• under continuing development

inventory tracking system.

System Information

• System type: Inventory tracking

• Lifespan: First release ~1998. Subsequent releases every 3 months.

• Development stages: requirements, design, development, unit testing, integration testing, system testing, beta release, controlled release, and general release.

• Code: About ¾ of the files in Java, with smaller numbers written in shell script, makefiles, XML, HTML, Perl, C, SQL, and awk.

• Fault data studied from 13 successive releases.

Predicting Fault Proneness: Possible Factors

• File Size: Is the density of faults found in a file related to the file’s size?

• Faults found during early development stages: Does the number of faults found in early stages predict the number that will be found in later stages?

• Faults found during early releases: Does the number of faults found in early releases predict the number that will be found in later releases?

• Age of file: Are new files more likely to have faults than files that existed in earlier releases?
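The file-size question depends on how fault density is computed. A minimal sketch, assuming density is measured as faults per thousand lines of code (faults/KLOC), the unit that appears on the old/new file chart later in these slides. The function name and the two sample files are hypothetical and are not data from the study.

```python
def fault_density(faults: int, loc: int) -> float:
    """Return faults per KLOC for one file (hypothetical helper)."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    # Multiply before dividing so the result is exact for round inputs.
    return faults * 1000 / loc


# Made-up files for illustration: (name, lines of code, faults found)
files = [("small.java", 200, 2), ("large.java", 4000, 8)]

densities = {name: fault_density(faults, loc) for name, loc, faults in files}

for name, density in densities.items():
    print(f"{name}: {density:.1f} faults/KLOC")
```

In this made-up case the larger file has more faults in absolute terms but a lower density, which is the kind of pattern the file-size question is probing.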

Number of Files

[Chart: number of files (y-axis, 0 to 1800) by Release Number (x-axis, 1 to 13)]

Size of System (KLOCs, including comments)

[Chart: system size in KLOCs (y-axis, 100 to 500) by Release Number (x-axis, 1 to 12)]

Number of files and number of faults detected in each release

[Chart: Files and Faults series (y-axis, 0 to 1800) by Release Number (x-axis, 1 to 13)]

Size of System (KLOCs) and Number of Faults

[Chart: KLOC and Faults series (y-axis, 0 to 900) by Release Number (x-axis, 1 to 12)]

SW Engineering Folklore

Distribution of Faults over Files: Percent of Faulty Files

[Chart: percent of the release's files that are faulty, by Release Number (x-axis, 1 to 13)]

Distribution of Faults over Files: Number of Faulty Files

[Chart: number of faulty files by Release Number (x-axis, 1 to 13)]

Concentration of Faults in Rel 12

[Chart: Percent of Faults and Percent of Code Size (y-axis) against Percent of Files (x-axis)]
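The concentration chart relates a percent of files to the percent of faults those files contain. A minimal sketch of that calculation, assuming files are ranked by decreasing fault count before the top fraction is taken. The function name and the fault counts are hypothetical and are not data from the study.

```python
import math


def fault_share_in_top_files(fault_counts, top_fraction):
    """Fraction of all faults found in the top `top_fraction` of files,
    with files ranked by decreasing fault count (hypothetical helper)."""
    if not fault_counts:
        raise ValueError("fault_counts must not be empty")
    total = sum(fault_counts)
    if total == 0:
        return 0.0
    ranked = sorted(fault_counts, reverse=True)
    # Always include at least one file, rounding the cut-off upward.
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return sum(ranked[:k]) / total


# Made-up fault counts for ten files in one release.
counts = [12, 5, 2, 1, 0, 0, 0, 0, 0, 0]

share_top_10 = fault_share_in_top_files(counts, 0.10)
share_top_20 = fault_share_in_top_files(counts, 0.20)
print(f"top 10% of files: {share_top_10:.0%} of faults")
print(f"top 20% of files: {share_top_20:.0%} of faults")
```

Running the same function on per-file line counts instead of fault counts would give the matching code-size share, which is how the folklore claim that faulty files simply contain most of the code could be checked.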

Fault Density, All Releases (restricted to files with at least one fault)

[Chart: fault density (y-axis) by Size of File (x-axis, 0 to 8000)]

Fault Density vs. File Size for Release 12

[Chart: fault density (y-axis) by Size of File (x-axis, 0 to 9000)]

Fault density vs. file size (Fenton & Ohlsson, TSE 1997)

Fault density vs. module size (Hatton, Software 1997)

Does the number of faults found in early development stages predict the number that will be found in later stages?

Does the number of faults found in early releases predict the number that will be found in later releases?

Faults by Development Stages

[Chart: faults per stage by Release Number (x-axis, 1 to 13). Series: Early Pre (Dev & Unit Test), Late Pre (Int & Sys Test), Post (Beta & Gen Release)]

Once Faulty, Always Faulty? From Stage to Stage

In every release, ALL of the post-release faults were found in files whose contribution to the integration and system test faults was relatively low (6% - 28%).

In other words, 72% - 94% of the late pre-release faults were in files that had NO post-release faults.

Fenton & Ohlsson observed similar results.

Once Faulty, Always Faulty? From Stage to Stage

However, across all the releases, there were only a total of 128 post-release faults. No release had more than 20 post-release faults, and half had no more than 10 post-release faults.

Not enough data to draw meaningful conclusions.

Number of Post-Release Faults

Release   in Files with 0         in Files with 1 or more
          Late Pre-Rel Faults     Late Pre-Rel Faults
1          0                       0
...
7         18                       0
...
10         7                       5
11        10                       8
12        14                       6
13         7                       4
Total     87                      41

Faults by Stages

[Chart: Early Pre and Late Pre fault counts by Release Number (x-axis, 1 to 13)]

Faults by Stages (Sorted by Decreasing Number of Early Pre-release Faults)

[Chart: Early Pre and Late Pre fault counts by Release Number, in the order 1, 3, 8, 9, 4, 6, 5, 10, 12, 11, 2, 7, 13]

Once Faulty, Always Faulty? From Release to Release

High-fault files of a release: Top 20% of files ordered by decreasing number of faults.

Over all releases, roughly 35% of these files were also high-fault files in the preceding and/or succeeding releases.

For Release 12, more than 40% of its high-fault files were also high-fault files in Release 1.
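The persistence figures above can be reproduced from per-file fault counts for two releases. A minimal sketch, assuming the top 20% is taken over all files of the release, the cut-off is rounded up, and ties are broken by file name; the slides do not state how any of these were handled. The function names and the sample data are hypothetical.

```python
import math


def high_fault_files(faults_by_file, fraction=0.20):
    """Top `fraction` of files, ordered by decreasing number of faults.
    Ties are broken alphabetically (an assumption for determinism)."""
    ranked = sorted(faults_by_file, key=lambda name: (-faults_by_file[name], name))
    k = math.ceil(fraction * len(ranked))
    return set(ranked[:k])


def persistence(current, other, fraction=0.20):
    """Share of `current`'s high-fault files that are also high-fault in `other`."""
    cur = high_fault_files(current, fraction)
    oth = high_fault_files(other, fraction)
    return len(cur & oth) / len(cur) if cur else 0.0


# Made-up fault counts for ten files in two consecutive releases.
rel_a = {"a": 9, "b": 7, "c": 1, "d": 0, "e": 0,
         "f": 0, "g": 0, "h": 0, "i": 0, "j": 0}
rel_b = {"a": 5, "b": 1, "c": 4, "d": 0, "e": 0,
         "f": 0, "g": 0, "h": 0, "i": 0, "j": 0}

carried_over = persistence(rel_a, rel_b)
print(f"{carried_over:.0%} of high-fault files stay high-fault")
```

With few faulty files, a cut-off over all files can pull zero-fault files into the top 20%; restricting the ranking to faulty files is an alternative reading of the definition.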

Persistence of high fault count between releases

[Chart: share of high-fault files that are also high-fault in the next release and in the previous release, by Release (x-axis, 1 to 12)]

Old Files and New Files

For Release n:

an old file is a file that existed in some Release i < n, and still exists in Release n.

a new file is a file that did not exist in any Release i < n, and is in Release n.
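The two definitions translate directly into set operations. A minimal sketch, assuming each release is represented as a set of file names keyed by release number; the function name and the sample releases are hypothetical.

```python
def classify_files(releases, n):
    """Split the files of Release n into (old, new).

    releases: dict mapping release number -> set of file names.
    old: in Release n and in some Release i < n.
    new: in Release n and in no Release i < n.
    """
    current = releases[n]
    # Union of every file seen in any earlier release.
    earlier = set().union(*(files for i, files in releases.items() if i < n))
    return current & earlier, current - earlier


# Made-up example: "b" is missing from Release 2 but returns in Release 3.
releases = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"a", "b", "d"},
}

old_files, new_files = classify_files(releases, 3)
print("old:", sorted(old_files))
print("new:", sorted(new_files))
```

Under these definitions a file that disappears and later returns counts as old, because it existed in some earlier release, and every file in Release 1 is new.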

Are new files more likely to have faults than files that existed in earlier releases?

Do new files have higher fault density than old files?

Old Files, New Files: Percent Containing (any number of) Faults

[Chart: Old and New series by Release Number (x-axis, 2 to 12)]

Old Files, New Files: Faults/KLOC over all files of the release

[Chart: Old and New series by Release Number (x-axis, 2 to 12)]

Summary of Fault Proneness Observations

• Faults are concentrated in a relatively small number of files, and become more heavily concentrated as the system matures.

• Large files do not generally have higher fault density than small files; the opposite seems to be true.

• Files with high fault counts during pre-release do not generally have high fault counts during post-release.

Fault Proneness Observations

• Files with the largest numbers of faults in an early release seem more likely to have large numbers of faults in the next release and in later releases.

• Newly written files are more likely to be faulty than old files, and to have higher fault density than old files.

Continuing Work

• Study additional systems

• Statistical analysis

• Study relation between bugs in successive releases (Do persistently high-fault files have related bugs in successive releases?)

• Do the numbers change if we calculate them for different levels of fault severity?

• Are code metrics good predictors of faults?
