
Scalable Statistical Bug Isolation

Ben Liblit, Mayur Naik, Alice Zheng, Alex Aiken, and Michael Jordan, 2005

University of Wisconsin, Stanford University, and UC Berkeley

Mustafa Dajani, 27 Nov 2006, CMSC 838P

Overview of the Paper

• Explained a statistical debugging algorithm that isolates bugs in programs containing multiple undiagnosed bugs

• Showed a practical, scalable algorithm for isolating multiple bugs in many software systems

Outline:

• Introduction
• Background
• Cause Isolation Algorithm
• Experiments

Objective of the Study: to develop a statistical algorithm to hunt for the causes of failures

• Crash reporting systems are useful for collecting data

• Actual executions are a vast resource

• Feedback data can be used to find the causes of failures

Introduction

• Statistical debugging - a dynamic analysis for detecting the causes of run failures

– an instrumented program monitors its own behavior by sampling information

– this involves testing predicates at particular events during the run

• Predicates, P - candidate bug predictors; large programs may contain thousands of predicates

• Feedback report, R - records whether a run succeeded or failed

Introduction

• The study’s model of behavior:

“If P is observed to be true at least once during run R then R(P) = 1, otherwise R(P) = 0.”

– In other words, the instrumentation counts how often “P observed true” and “P observed”, using random sampling

• Previous work used regularized logistic regression, which tries to select predicates that determine the outcome of every run

– but that algorithm produces redundant predicates and has difficulty predicting multiple bugs
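The 0/1 model above can be sketched in a few lines (a hypothetical illustration; the data layout is invented here, and the real instrumentation samples counters inside the deployed program):

```python
# Sketch: collapsing sampled "P observed true" counts into the binary
# feedback-report value R(P). The dict layout is illustrative only.

def feedback_report(observed_true_counts):
    """R(P) = 1 if P was observed true at least once during the run."""
    return {p: 1 if count > 0 else 0
            for p, count in observed_true_counts.items()}

# One sampled run: "f == NULL" fired 3 times, "x > 0" never did.
run = {"f == NULL": 3, "x > 0": 0}
print(feedback_report(run))  # {'f == NULL': 1, 'x > 0': 0}
```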

Introduction

• Study design:

– determine all possible predicates
– eliminate predicates that have no predictive power
– loop:
• rank the surviving predicates by importance
• remove the top-ranked predicate P
• discard all runs R where R(P) = 1
• repeat until the set of runs or the set of predicates is empty
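The elimination loop might be sketched as follows (a sketch under stated assumptions, not the paper's implementation; `importance` is a placeholder for its harmonic-mean score, and the early-exit guard is a pragmatic addition):

```python
# Sketch of the iterative elimination loop. `importance` is a placeholder
# scoring function; the paper ranks predicates by a harmonic-mean score.

def isolate_bugs(runs, predicates, importance):
    """runs: list of feedback reports, each a dict predicate -> R(P) in {0,1}.
    Returns predicates in the order they were selected."""
    predictors = []
    runs = list(runs)
    remaining = set(predicates)
    while runs and remaining:
        top = max(remaining, key=lambda p: importance(p, runs))
        if importance(top, runs) <= 0:  # pragmatic guard, not in the paper
            break
        predictors.append(top)
        # Discard every run in which the chosen predictor was observed true.
        runs = [r for r in runs if r.get(top, 0) == 0]
        remaining.remove(top)
    return predictors
```

For example, with a toy importance score that just counts runs where R(P) = 1, a predicate explaining two of three runs is selected first and those runs are discarded before the next iteration.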

Bug Isolation Architecture

Program Source → Compiler (with Sampler and Predicates) → Shipping Application → Counts & pass/fail labels → Statistical Debugging → Top bugs with likely causes

Depicting failures through P

Failure(P) = F(P) / (F(P) + S(P))

F(P) = # of failures where P observed true
S(P) = # of successes where P observed true

Consider this code fragment:

if (f == NULL) {
    x = 0;
    *f;
}

When does a program fail?

Valid pointer assignment:

if (…) f = …some valid pointer…;
*f;

Predicting P’s truth or falsehood

Context(P) = F(P observed) / (F(P observed) + S(P observed))

F(P observed) = # of failures observing P
S(P observed) = # of successes observing P

Increase(P) = Failure(P) - Context(P): how much more likely failure becomes when P is actually true, not merely observed
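Assuming every run observes every predicate (a simplification; sampled runs only observe a subset), the scores can be sketched as:

```python
# Sketch: Failure(P), Context(P), and Increase(P) from labeled runs.
# Each run is (R, failed): R maps predicate -> 0/1, failed is a bool.
# Simplification: every run is assumed to observe every predicate.

def scores(pred, runs):
    f_true = sum(1 for r, failed in runs if failed and r[pred] == 1)
    s_true = sum(1 for r, failed in runs if not failed and r[pred] == 1)
    f_obs = sum(1 for _, failed in runs if failed)
    s_obs = sum(1 for _, failed in runs if not failed)
    failure = f_true / (f_true + s_true)        # Failure(P)
    context = f_obs / (f_obs + s_obs)           # Context(P)
    return failure, context, failure - context  # last term: Increase(P)
```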

Notes

• Two predicates are redundant if they predict the same, or nearly the same, set of failing runs

• Because elimination is iterative, Importance only needs to select a good predictor at each step, not necessarily the best one

Guide to Visualization

[Figure: each predicate’s bar encodes Increase(P) with its error bound, Context(P), S(P), and log(F(P) + S(P)).]

http://www.cs.wisc.edu/~liblit/pldi-2005/

Rank by Increase(P)

• High Increase() but very few failing runs!
• These are all sub-bug predictors
– Each covers one special case of a larger bug
• Redundancy is clearly a problem


Rank by F(P)

• Many failing runs but low Increase()!
• Tend to be super-bug predictors
– Each covers several bugs, plus lots of junk


Notes

• In the language of information retrieval
– Increase(P) has high precision, low recall
– F(P) has high recall, low precision

• Standard solution:
– Take the harmonic mean of both
– Rewards high scores in both dimensions


Rank by Harmonic Mean

• It works!
– Large increase, many failures, few or no successes

• But redundancy is still a problem

Lessons Learned

• Can learn a lot from actual executions
– Users are running buggy code anyway
– We should capture some of that information

• Crash reporting is a good start, but…
– Pre-crash behavior can be important
– Successful runs reveal correct behavior
– Stack alone is not enough for 50% of bugs
