Statistical Debugging
Ben Liblit, University of Wisconsin–Madison


TRANSCRIPT

Statistical Debugging: Lecture #2

Statistical Debugging
Ben Liblit, University of Wisconsin–Madison
Copyright © 2007, Benjamin Liblit. All rights reserved.

Bug Isolation Architecture

[Architecture diagram: Program Source, Compiler, Sampler, Shipping Application, Predicates, Counts & run outcomes (☺/☹), Statistical Debugging, Top bugs with likely causes]

Winnowing Down the Culprits

1,710 counters: 3 × 570 call sites
1,569 zero on all runs; 141 remain
139 nonzero on at least one successful run
Not much left!
    file_exists() > 0
    xreadline() == 0
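A minimal sketch of this winnowing pass, under an assumed data layout (one count per predicate per run, plus a failure flag per run; the names and example data here are hypothetical, not from the deck):

```python
# Hypothetical winnowing pass over sampled predicate counts.
# counts[p][r] = times predicate p was observed true in run r;
# failed[r] = True if run r crashed.

def winnow(counts, failed):
    survivors = []
    for p, per_run in counts.items():
        if all(c == 0 for c in per_run.values()):
            continue  # zero on all runs: never observed true
        if any(c > 0 and not failed[r] for r, c in per_run.items()):
            continue  # nonzero on a successful run: cannot be a smoking gun
        survivors.append(p)  # observed true only in failing runs
    return survivors

counts = {
    "file_exists() > 0": {1: 4, 2: 0, 3: 7},
    "n == 0":            {1: 0, 2: 0, 3: 0},   # zero everywhere: dropped
    "x > y":             {1: 2, 2: 5, 3: 0},   # true in a successful run: dropped
}
failed = {1: True, 2: False, 3: True}
print(winnow(counts, failed))  # ['file_exists() > 0']
```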

Multiple, Non-Deterministic Bugs

Strict process of elimination won't work
Can't assume program will crash when it should
No single common characteristic of all failures
Look for general correlation, not perfect prediction
Warning! Statistics ahead!

Ranked Predicate Selection

Consider each predicate P one at a time
Include inferred predicates (e.g. ≤, ≠, ≥)
How likely is failure when P is true?
(technically, when P is observed to be true)
Multiple bugs yield multiple bad predicates

Some Definitions
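The formulas on this slide come through only as a heading; in the notation the rest of the deck uses (matching Liblit et al., "Scalable Statistical Bug Isolation", PLDI 2005), they amount to:

```latex
% Per predicate P, over all feedback reports:
%   F(P) = number of failing runs in which P was observed to be true
%   S(P) = number of successful runs in which P was observed to be true
\mathrm{Bad}(P) = \frac{F(P)}{F(P) + S(P)}
```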

Are We Done? Not Exactly!

if (f == NULL) {
    x = 0;
    *f;
}

Bad(f == NULL) = 1.0
Bad(x == 0) = 1.0
Predicate (x == 0) is an innocent bystander
Program is already doomed

Fun With Multi-Valued Logic

Identify unlucky sites on the doomed path
Background risk of failure for reaching this site, regardless of predicate truth/falsehood
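That background risk, again following the companion paper's definitions:

```latex
% "P observed" counts runs that sampled P's site at all,
% whether P came out true or false there.
\mathrm{Context}(P) = \frac{F(P~\mathrm{observed})}{F(P~\mathrm{observed}) + S(P~\mathrm{observed})}
```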

Isolate the Predictive Value of P

Does P being true increase the chance of failure over the background rate?
Formal correspondence to likelihood ratio testing
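Putting the two scores together gives the definition the next slide's numbers follow:

```latex
\mathrm{Increase}(P) = \mathrm{Bad}(P) - \mathrm{Context}(P)
```

On the doomed-path example, Bad(x == 0) = 1.0 but so is the background rate at that line, so Increase(x == 0) = 0.0 and the bystander drops out.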

Increase Isolates the Predictor

if (f == NULL) {
    x = 0;
    *f;
}

Increase(f == NULL) = 1.0
Increase(x == 0) = 0.0

It Works!

…for programs with just one bug.
Need to deal with multiple bugs
How many? Nobody knows!
Redundant predictors remain a major problem
Goal: isolate a single best predictor for each bug, with no prior knowledge of the number of bugs.

Multiple Bugs: Some Issues

A bug may have many redundant predictors
Only need one, provided it is a good one
Bugs occur on vastly different scales
Predictors for common bugs may dominate, hiding predictors of less common problems

Guide to Visualization

Multiple interesting & useful predicate metrics
Simple visualization may help reveal trends
Metrics plotted: Increase(P) with error bound, S(P), log(F(P) + S(P)), Context(P)

Confidence Interval on Increase(P)

Strictly speaking, this is slightly bogus
Bad(P) and Context(P) are not independent
Correct confidence interval would be larger
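A sketch of that "slightly bogus" interval, treating Bad(P) and Context(P) as independent binomial proportions (the simplification the slide owns up to); the 1.96 factor assumes a 95% normal-approximation interval:

```python
import math

def increase_with_interval(f, s, f_obs, s_obs, z=1.96):
    """Increase(P) with a normal-approximation confidence half-width.

    f, s         -- failing/successful runs where P was observed true
    f_obs, s_obs -- failing/successful runs where P was observed at all
    """
    bad = f / (f + s)
    context = f_obs / (f_obs + s_obs)
    # Variance of a difference of two proportions, *assuming independence*;
    # per the slide, the true interval would be somewhat larger.
    var = bad * (1 - bad) / (f + s) + context * (1 - context) / (f_obs + s_obs)
    return bad - context, z * math.sqrt(var)

inc, hw = increase_with_interval(f=30, s=2, f_obs=40, s_obs=60)
print(f"Increase = {inc:.2f} +/- {hw:.2f}")
```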

Bad Idea #1: Rank by Increase(P)

High Increase but very few failing runs
These are all sub-bug predictors
Each covers one special case of a larger bug
Redundancy is clearly a problem

Bad Idea #2: Rank by F(P)

Many failing runs but low Increase
Tend to be super-bug predictors
Each covers several bugs, plus lots of junk

A Helpful Analogy

In the language of information retrieval:
Increase(P) has high precision, low recall
F(P) has high recall, low precision
Standard solution: take the harmonic mean of both
Rewards high scores in both dimensions

Rank by Harmonic Mean

Definite improvement
Large increase, many failures, few or no successes
But redundancy is still a problem
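A sketch of a harmonic-mean ranking in the spirit of these slides; the exact normalization of the F(P) term (here log-scaled against the total failure count, as in the PLDI 2005 paper's Importance score) is an assumption, not spelled out in the deck:

```python
import math

def importance(increase_p, f_p, num_f):
    """Harmonic mean of Increase(P) and a normalized failure count.

    increase_p -- Increase(P), assumed positive for candidates
    f_p        -- F(P), failing runs where P was observed true
    num_f      -- total number of failing runs
    """
    if increase_p <= 0 or f_p <= 1:
        return 0.0
    recall_term = math.log(f_p) / math.log(num_f)  # in (0, 1]
    return 2.0 / (1.0 / increase_p + 1.0 / recall_term)

# High precision with tiny coverage vs. broad coverage with decent precision:
print(importance(0.9, f_p=2, num_f=1000))    # sub-bug predictor: low score
print(importance(0.4, f_p=800, num_f=1000))  # balanced predictor: higher score
```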

Redundancy Elimination

One predictor for a bug is interesting
Additional predictors are a distraction
Want to explain each failure once
Similar to the minimum set-cover problem:
cover all failed runs with a subset of predicates
Greedy selection using the harmonic ranking

Simulated Iterative Bug Fixing

Rank all predicates under consideration
Select the top-ranked predicate P
Add P to the bug predictor list
Discard P and all runs where P was true
Simulates fixing the bug predicted by P
Reduces rank of similar predicates
Repeat until out of failures or predicates
(A sketch of this loop follows the Moss results below.)

Case Study: Moss

Reintroduce nine historic Moss bugs
High- and low-level errors
Includes wrong-output bugs
Instrument with everything we've got:
branches, returns, variable value pairs, the works
32,000 randomized runs at 1/100 sampling

Effectiveness of Filtering

Scheme         Total     Retained   Rate
branches       4,170          18    0.4%
returns        2,964          11    0.4%
value-pairs    195,864     2,682    1.4%

Effectiveness of Ranking

Five bugs: captured by branches, returns
Short lists, easy to scan
Can stop early if Bad drops down
Two bugs: captured by value-pairs
Much redundancy
Two bugs: never cause a failure
No failure, no problem
One surprise bug, revealed by returns!
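A minimal sketch of the iterative ranking-and-elimination loop; `score` stands in for the harmonic-mean ranking above, and `true_in` (a hypothetical name) maps each predicate to the set of runs where it was observed true:

```python
def isolate_bug_predictors(predicates, failing_runs, true_in, score):
    """Greedy simulated bug fixing: pick the best predicate, discard its runs."""
    predictors = []
    remaining_failures = set(failing_runs)
    candidates = set(predicates)
    while remaining_failures and candidates:
        # Rank against only the failures not yet "fixed".
        best = max(candidates, key=lambda p: score(p, remaining_failures))
        if score(best, remaining_failures) <= 0:
            break  # nothing left with predictive value
        predictors.append(best)
        # Discarding best and every run where it was true simulates fixing
        # that bug, which deflates the rank of redundant predictors.
        remaining_failures -= true_in[best]
        candidates.discard(best)
    return predictors
```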

Analysis of exif

3 bug predictors from 156,476 initial predicates
Each predicate identifies a distinct crashing bug
All bugs found quickly using analysis results

Analysis of Rhythmbox

15 bug predictors from 857,384 initial predicates
Found and fixed several crashing bugs

How Many Runs Are Needed?

Failing runs needed per bug:

             #1   #2   #3   #4   #5   #6   #9
Moss         18   10   32   12   21   11   20
ccrypt       26
bc           40
Rhythmbox    22   35
exif         28   12   13

Other Models, Briefly Considered

Regularized logistic regression
S-shaped curve fitting
Bipartite graphs trained with iterative voting:
predicates vote for runs; runs assign credibility to predicates
Predicates as random distribution pairs:
find predicates whose distribution parameters differ
Random forests, decision trees, support vector machines, …

Capturing Bugs and Usage Patterns

Borrow from natural language processing
Identify topics, given a term-document matrix
Identify bugs, given a feature-run matrix
Latent semantic analysis and related models:
topics ≈ bugs and usage patterns
noise words ≈ common utility code
salient keywords ≈ buggy code

Probabilistic Latent Semantic Analysis

Observed data Pr(pred, run) is factored through latent topics:
Pr(pred, run) = Σ_topic Pr(topic) · Pr(pred | topic) · Pr(run | topic)
with topic weights per run
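PLSA of this kind is closely related to nonnegative matrix factorization with a KL-divergence objective; a sketch of clustering runs by topic on a hypothetical predicate-by-run count matrix, using scikit-learn as a stand-in for the deck's model:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50))  # rows: predicates, cols: runs (toy data)

# KL-divergence NMF approximates PLSA's factorization: X ~ W @ H, with
# W playing the role of Pr(pred | topic) and H the per-run topic weights.
model = NMF(n_components=3, beta_loss="kullback-leibler", solver="mu",
            max_iter=500, init="random", random_state=0)
W = model.fit_transform(X)   # predicate-topic affinities
H = model.components_        # topic-run weights

# Cluster each run by its most probable topic (cf. "Uses of Topic Models"):
run_topics = H.argmax(axis=0)
print(run_topics[:10])
```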

Uses of Topic Models

Cluster runs by most probable topic
Failure diagnosis for multi-bug programs
Characterize a representative run for each cluster:
failure-inducing execution profile
likely execution path to guide developers
Relate usage patterns to failure modes
Predict system (in)stability in scenarios of interest

Compound Predicates for Complex Bugs

"Logic, like whiskey, loses its beneficial effect when taken in too large quantities."
Edward John Moreton Drax Plunkett, Lord Dunsany, "Weeds & Moss", My Ireland

Limitations of Simple Predicates

Each predicate partitions runs into 2 sets:
runs where it was true
runs where it was false
Can accurately predict bugs that match this partition
Unfortunately, some bugs are more complex:
complex border between good & bad
Requires a richer language of predicates

Motivation: Bad Pointer Errors

In function exif_mnote_data_canon_load:

for (i = 0; i < c; i++) {
    n->count = i + 1;
    if (o + s > buf_size) return;
    n->entries[i].data = malloc(s);
}

Crash on later use of n->entries[i].data

Kinds of Predicate    Best Predicate                     Score
Simple only           new len == old len                 0.71
Simple & compound     o + s > buf_size ∧ offset < len    0.94

Great! So What's the Problem?

Too many compound predicates:
2^(2^N) Boolean functions of N simple predicates
N² conjunctions & disjunctions of two variables
N ~ 100,000 even for small applications
Incomplete information due to sampling
Predicates at different locations

Conservative Definition

A conjunction C = p1 ∧ p2 is true in a run iff:
p1 is true at least once, and
p2 is true at least once
Disjunction is defined similarly
Disadvantage: C may be true even if p1, p2 are never true simultaneously
Advantages: the monitoring phase does not change; p1 ∧ p2 is just another predicate, inferred offline

Three-Valued Truth Tables

For each predicate & run, three possibilities:
true (at least once)
not true (and false at least once)
never observed (?)

Conjunction p1 ∧ p2:
        p2=T   p2=F   p2=?
p1=T      T      F      ?
p1=F      F      F      F
p1=?      ?      F      ?

Disjunction p1 ∨ p2:
        p2=T   p2=F   p2=?
p1=T      T      T      T
p1=F      T      F      ?
p1=?      T      ?      ?

Mixed Compound & Simple Predicates

Compute the score of each conjunction & disjunction:
C = p1 ∧ p2
D = p1 ∨ p2
Compare to scores of constituent simple predicates
Keep if higher score: better partition between good & bad
Discard if lower score: needless complexity
Integrates easily into iterative ranking & elimination

Still Too Many

Complexity: N²R
N = number of simple predicates
R = number of runs being analyzed
20 minutes for N ~ 500, R ~ 5,000
Idea: early pruning optimization
Compute an upper bound on the score and discard if too low
"Too low" = lower than constituent simple predicates
Reduces O(R) to O(1) per complex predicate
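The conservative three-valued semantics above drops straight into code; a minimal sketch using Python's None for "never observed":

```python
# Three-valued truth: True, False, or None ("never observed").

def conj(p1, p2):
    """Conservative conjunction, per the truth table above."""
    if p1 is False or p2 is False:
        return False          # one definite False decides it
    if p1 is None or p2 is None:
        return None           # otherwise unknown if either is unobserved
    return True               # both observed true (at least once each)

def disj(p1, p2):
    """Conservative disjunction: the dual of conj."""
    if p1 is True or p2 is True:
        return True
    if p1 is None or p2 is None:
        return None
    return False

# Spot-check a few cells of the tables:
assert conj(True, None) is None
assert conj(None, False) is False
assert disj(False, None) is None
assert disj(None, True) is True
```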

Upper Bound on Score

The score is a harmonic mean, so bound each ingredient
For a conjunction C = p1 ∧ p2:
find bounds on F(C), S(C), F(C obs) and S(C obs)
in terms of the corresponding counts for p1, p2

F(C) and S(C) for Conjunction

Assume the best case consistent with the individual counts:
F(C): failing true runs completely overlap, so F(C) ≤ min(F(p1), F(p2))
S(C): successful true runs are disjoint, so S(C) ≥ max(0, S(p1) + S(p2) − NumS)
(NumF, NumS = total failing and successful runs)

S(C obs)

Maximize two cases:
C = true: true runs of p1, p2 overlap
C = false: false runs of p1, p2 are disjoint
S(C obs) ≤ min(NumS, min(S(p1), S(p2)) + S(¬p1) + S(¬p2))

Usability

Complex predicates can confuse the programmer:
non-obvious relationship between constituents
Prefer to work with easy-to-relate predicates:
effort(p1, p2) = proximity of p1 and p2 in the PDG
PDG = CDG ∪ DDG, per Cleve and Zeller [ICSE '05]
measured as a fraction of the entire program
Usable only if effort < 5%
Somewhat arbitrary cutoff; works well in practice
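A sketch of the O(1) pruning bound for a conjunction, under the counts named above (the parameter names, including the negated-predicate counts s_not1 and s_not2, are mine, not the slides'):

```python
def conj_count_bounds(f1, f2, s1, s2, s_not1, s_not2, num_f, num_s):
    """Optimistic bounds on the counts for C = p1 /\ p2, in O(1).

    Feeding these into the harmonic-mean score gives an upper bound;
    if even that cannot beat p1's or p2's own score, prune C without
    ever touching per-run data.
    """
    f_c_max = min(f1, f2)              # failing true runs fully overlap
    s_c_min = max(0, s1 + s2 - num_s)  # successful true runs disjoint
    # C is observed in a successful run if both conjuncts were seen true
    # (overlap maximized) or either was seen false (false runs disjoint):
    s_obs_max = min(num_s, min(s1, s2) + s_not1 + s_not2)
    return f_c_max, s_c_min, s_obs_max
```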

Evaluation: What kind of predicate has the top score?

Evaluation: Effectiveness of Pruning

Analysis time: from ~20 minutes down to ~1 minute

Evaluation: Impact of Sampling

Evaluation: Usefulness Under Sparse Sampling
