Keynote HotSWUp 2012

DESCRIPTION
Keynote for the Fourth Workshop on Hot Topics in Software Upgrades (HotSWUp 2012), co-located with ICSE 2012, Zurich, Switzerland

TRANSCRIPT
Will my system run (correctly) after the upgrade?
Martin Pinzger, Assistant Professor, Delft University of Technology
Pfunds
Martin’s upgrades
PhD
Postdoc
Assistant Professor
My Experience with Software Upgrades
Bugs on upgrades get reported
Hmm, wait a minute
Can’t we learn “something” from that data?
Software repository mining for preventing upgrade failures
Martin Pinzger, Assistant Professor, Delft University of Technology
Goal of software repository mining
Making the information stored in software repositories available to software developers
Quality analysis and defect prediction
Recommender systems
...
Software repositories
Examples from my mining research
Predicting failure-prone source files using changes (MSR 2011)
The relationship between developer contributions and failures (FSE 2008)
There are many more studies: see MSR 2012, http://2012.msrconf.org/
A survey and taxonomy of approaches for mining software repositories in the context of software evolution, Kagdi et al. 2007
Using Fine-Grained Source Code Changes for Bug Prediction
Joint work with Emanuel Giger, Harald GallUniversity of Zurich
Bug prediction
Goal: Train models to predict the bug-prone source files of the next release
How: Using product measures, process measures, and organizational measures with machine learning techniques
Many existing studies on building prediction models: Moser et al., Nagappan et al., Zimmermann et al., Hassan et al., etc.
Process measures performed particularly well
Classical change measures
Number of file revisions
Code Churn aka lines added/deleted/changed
Research question of this study: Can we further improve these models?
Revisions are coarse grained
What did change in a revision?
Code Churn can be imprecise
Extra changes not relevant for locating bugs
Fine-Grained Source Code Changes (SCC)
Account.java 1.5:
    IF "balance > 0"
        THEN: MI "withDraw(amount);"

Account.java 1.6:
    IF "balance > 0 && amount <= balance"
        THEN: MI "withDraw(amount);"
        ELSE: MI "notify();"

(MI = method invocation node in the abstract syntax tree)

3 SCC: 1x condition change, 1x else-part insert, 1x invocation statement insert
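SCC are obtained by tree differencing two versions of a file (the study extracted them from Java ASTs with a tree differencer). As a rough, hypothetical sketch of the idea in Python — comparing node-type counts of two parse trees instead of diffing lines — consider the following; the function names and the counting scheme are invented, and a real tree differencer computes edit operations rather than count differences:

```python
import ast

def node_signatures(source):
    """Flatten an AST into a count of node types."""
    sigs = {}
    for node in ast.walk(ast.parse(source)):
        key = type(node).__name__
        sigs[key] = sigs.get(key, 0) + 1
    return sigs

def coarse_scc(old_src, new_src):
    """Approximate the amount of fine-grained change as the symmetric
    difference of node-type counts between two versions."""
    old, new = node_signatures(old_src), node_signatures(new_src)
    return sum(abs(old.get(k, 0) - new.get(k, 0)) for k in set(old) | set(new))

# Python renderings of the Account.java 1.5 -> 1.6 example above
v15 = "if balance > 0:\n    withdraw(amount)\n"
v16 = ("if balance > 0 and amount <= balance:\n    withdraw(amount)\n"
       "else:\n    notify()\n")
print(coarse_scc(v15, v16))
```

Unlike line-based churn, this counter ignores pure reformatting; unlike a real tree differencer, a count-based proxy over-counts the example above, which the study classifies as exactly 3 SCC.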
Research hypotheses
H1 SCC is correlated with the number of bugs in source files
H2 SCC is a predictor for bug-prone source files (and outperforms LM)
H3 SCC is a predictor for the number of bugs in source files (and outperforms LM)
15 Eclipse plug-ins
Data:
>850'000 fine-grained source code changes (SCC)
>10'000 files
>9'700'000 lines modified (LM)
>9 years of development history
... and a lot of bugs referenced in commit messages
H1: SCC is correlated with #bugs

Table 4: Non-parametric Spearman rank correlation of bugs, LM, and SCC. * marks significant correlations at α = 0.01. Larger values are printed in bold.

Eclipse Project   LM     SCC
Compare           0.68*  0.76*
jFace             0.74*  0.71*
JDT Debug         0.62*  0.80*
Resource          0.75*  0.86*
Runtime           0.66*  0.79*
Team Core         0.15*  0.66*
CVS Core          0.60*  0.79*
Debug Core        0.63*  0.78*
jFace Text        0.75*  0.74*
Update Core       0.43*  0.62*
Debug UI          0.56*  0.81*
JDT Debug UI      0.80*  0.81*
Help              0.54*  0.48*
JDT Core          0.70*  0.74*
OSGI              0.70*  0.77*
Median            0.66   0.77
We used a Related Samples Wilcoxon Signed-Ranks Test on the values of the columns in Table 4. The rationale is that (1) we calculated both correlations for each project, resulting in a matched correlation pair per project, and (2) we can relax any assumption about the distribution of the values. The test was significant at α = 0.01, rejecting the null hypothesis that the two medians are the same. Based on these results we accept H2: SCC does have a stronger correlation with bugs than code churn based on LM.
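The two statistics above can be reproduced in outline with SciPy. In the sketch below the per-file measures are invented toy numbers, while the matched per-project correlation pairs are the first five rows of Table 4:

```python
from scipy.stats import spearmanr, wilcoxon

# Per-file measures for one project (toy values, not the study's data)
lm   = [10, 250, 40, 5, 900, 120, 60]   # lines modified
scc  = [3, 80, 15, 1, 300, 45, 20]      # fine-grained source code changes
bugs = [0, 7, 2, 0, 12, 4, 1]

rho_lm, _ = spearmanr(lm, bugs)         # rank correlation LM vs. bugs
rho_scc, _ = spearmanr(scc, bugs)       # rank correlation SCC vs. bugs

# Matched correlation pairs per project (first five rows of Table 4),
# compared with a Wilcoxon signed-ranks test as in the text
per_project_lm  = [0.68, 0.74, 0.62, 0.75, 0.66]
per_project_scc = [0.76, 0.71, 0.80, 0.86, 0.79]
stat, p = wilcoxon(per_project_lm, per_project_scc)
```

The signed-ranks test needs only the matched pairs, not the underlying per-file data, which is why it can be run directly on the columns of Table 4.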
3.4 Correlation Analysis of Change Types & Bugs

For the correlation analysis in the previous Section 3.3 we did not distinguish between the different categories of the change types. We treated them equally and related the total number of SCC to bugs. One advantage of SCC over pure line-based code churn is that we can determine the exact change operation down to statement level and assign it to the source code entity that actually changed. In this section we analyze the correlation between bugs and the categories we defined in Section 3.1. The goal is to see whether there are differences in how certain change types correlate with bugs.

Table 5 shows the correlation between the different categories and bugs for each project. For each file of a project we counted the number of changes within each category and the number of bugs, and related both numbers by correlation. Regarding their mean, the highest correlations with bugs have stmt, func, and mDecl. They furthermore exhibit values for some projects that are close to or above 0.7 and are considered strong, e.g., func for Resource or JDT Core; mDecl for Resource and JDT Core; stmt for JDT Debug UI and Debug UI. oState and cond still have substantial correlation on average, but their means are marginally above 0.5. cDecl and else have means below 0.5. With some exceptions, e.g., Compare, they show many correlation values below 0.5. This indicates that change types do correlate differently with bugs in our dataset. A Related Samples Friedman Test was significant at α = 0.05, rejecting the null hypothesis that the distributions of the correlation values of the SCC categories, i.e., the rows in Table 5, are the same. The Friedman Test operates on the mean ranks of related groups. We used this test because we repeatedly measured the correlations of the different categories on the same dataset, i.e., our related groups, and because it does not
Table 5: Non-parametric Spearman rank correlation of bugs and categories of SCC. * marks significant correlations at α = 0.01.

Eclipse Project  cDecl  oState  func   mDecl  stmt   cond   else
Compare          0.54*  0.61*   0.67*  0.61*  0.66*  0.55*  0.52*
jFace            0.41*  0.47*   0.57*  0.63*  0.66*  0.51*  0.48*
Resource         0.49*  0.62*   0.70*  0.73*  0.67*  0.49*  0.46*
Team Core        0.44*  0.43*   0.56*  0.52*  0.53*  0.36*  0.35*
CVS Core         0.39*  0.62*   0.66*  0.57*  0.72*  0.58*  0.56*
Debug Core       0.45*  0.55*   0.61*  0.51*  0.59*  0.45*  0.46*
Runtime          0.47*  0.58*   0.66*  0.61*  0.66*  0.55*  0.45*
JDT Debug        0.42*  0.45*   0.56*  0.55*  0.64*  0.46*  0.44*
jFace Text       0.50*  0.55*   0.54*  0.64*  0.62*  0.59*  0.55*
JDT Debug UI     0.46*  0.57*   0.62*  0.53*  0.74*  0.57*  0.54*
Update Core      0.63*  0.40*   0.43*  0.51*  0.45*  0.38*  0.39*
Debug UI         0.44*  0.50*   0.63*  0.60*  0.72*  0.54*  0.52*
Help             0.37*  0.43*   0.42*  0.43*  0.44*  0.36*  0.41*
OSGI             0.47*  0.60*   0.66*  0.65*  0.63*  0.57*  0.48*
JDT Core         0.39*  0.60*   0.69*  0.70*  0.67*  0.62*  0.60*
Mean             0.46   0.53    0.60   0.59   0.63   0.51   0.48
make any assumption about the distribution of the data and the sample size.

A Related Samples Friedman Test is a global test that only tells whether all of the groups differ; it does not tell between which groups the difference occurs. However, the values in Table 5 show that, when comparing pairwise, some means are closer than others, for instance func vs. mDecl and func vs. cDecl. To test whether some pairwise groups differ more strongly than others, or do not differ at all, post-hoc tests are required. We performed a Wilcoxon Test and a Friedman Test on each pair. Figure 2 shows the results of the pairwise post-hoc tests. Dashed lines mean that both tests reject their H0, i.e., the row values of those two change types differ significantly; a straight line means both tests retain their H0, i.e., the row values of those change types do not differ significantly; a dotted line means only one test is significant, and it is difficult to say whether the values of these rows differ significantly.

When testing several post-hoc comparisons in the context of the result of a global test (the aforementioned Friedman Test), it is more likely that we commit a Type 1 Error when agreeing upon significance. In this case either the significance probability must be adjusted, i.e., raised, or the α-level must be adjusted, i.e., lowered [8]. For the post-hoc tests in Figure 2 we adjusted the α-level using the Bonferroni-Holm procedure [34]. In Figure 2 we can identify two groups where the categories are connected with a straight line among each other: (1) else, cond, oState, and cDecl, and (2) stmt, func, and mDecl. The correlation values of the change types within these groups do not differ significantly in our dataset. These findings are of more interest in the context of Table 2. Although func and mDecl occur much less frequently than stmt, they correlate evenly with bugs. The mass of rather small and local statement changes correlates as evenly as the changes of functionality and of method declarations, which occur relatively sparsely. The situation is different in the second group, where all change types occur with more or less the same relatively low frequency. We use the results and insights of the correlation analysis in Section 3.5 and Section 3.6 when we build prediction models to investigate whether SCC and change types are adequate to predict bugs in our dataset.
* significant correlation at 0.01
±0.5 substantial, ±0.7 strong
Predicting bug-prone files
Bug-prone vs. not bug-prone
Figure 2: Scatterplot between the number of bugs and number of SCC on file level. Data points were obtained for the entire project history.
3.5 Predicting Bug- & Not Bug-Prone Files

The goal of H3 is to analyze whether SCC can be used to discriminate between bug-prone and not bug-prone files in our dataset. We build models based on different learning techniques. Prior work states that some learners perform better than others. For instance, Lessmann et al. found with an extended set of various learners that Random Forest performs best on a subset of the NASA Metrics dataset. But they also state that performance differences between learners are marginal and not significant [20].

We used the following classification learners: Logistic Regression (LogReg), J48 (C4.5 Decision Tree), Random Forest (RndFor), and Bayesian Network (B-Net), implemented by the WEKA toolkit [36]; Exhaustive CHAID, a decision tree based on the chi-squared criterion, provided by SPSS 18.0; Support Vector Machine (LibSVM [7]); and Naive Bayes Network (N-Bayes) and Neural Nets (NN), both provided by the RapidMiner toolkit [24]. The classifiers calculate and assign to each file a probability of being bug-prone or not bug-prone.

For each Eclipse project we binned files into bug-prone and not bug-prone using the median of the number of bugs per file (#bugs):
bugClass = { not bug-prone : #bugs <= median
             bug-prone     : #bugs >  median
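The median split above can be sketched in a few lines of Python; the file names and bug counts below are invented toy data:

```python
from statistics import median

# Bug counts per file (toy values, not the Eclipse data)
bug_counts = {"A.java": 0, "B.java": 3, "C.java": 1, "D.java": 9, "E.java": 1}
cut = median(bug_counts.values())

# Label each file relative to the project's median, as in the formula above
bug_class = {
    f: "bug-prone" if n > cut else "not bug-prone"
    for f, n in bug_counts.items()
}
print(bug_class["D.java"])   # bug-prone: 9 > median of 1
```

Because the cut point is the project median, roughly half of the files end up in each class, which keeps the prior probabilities balanced across projects.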
When using the median as cut point, the labeling of a file is relative to how many bugs other files in a project have. There exist several other ways of binning files. They mainly vary in that they result in different prior probabilities: for instance, Zimmermann et al. [40] and Bernstein et al. [4] labeled files as bug-prone if they had at least one bug. With heavily skewed distributions this approach may lead to a high prior probability towards one class. Nagappan et al. [28] used a statistical lower confidence bound. The different prior probabilities make the use of accuracy as a performance measure for classification difficult.

As proposed in [20, 23] we therefore use the area under the receiver operating characteristic curve (AUC) as performance measure. AUC is independent of prior probabilities and is therefore a robust measure to assess the performance and accuracy of predictor models [4]. AUC can be seen as the probability that, when randomly choosing a bug-prone and a not bug-prone file, the trained model assigns the higher score to the bug-prone file [16].

Table 6: AUC values of E1 using logistic regression with LM and SCC as predictors for bug-prone and not bug-prone files. Larger values are printed in bold.

Eclipse Project   AUC LM  AUC SCC
Compare           0.84    0.85
jFace             0.90    0.90
JDT Debug         0.83    0.95
Resource          0.87    0.93
Runtime           0.83    0.91
Team Core         0.62    0.87
CVS Core          0.80    0.90
Debug Core        0.86    0.94
jFace Text        0.87    0.87
Update Core       0.78    0.85
Debug UI          0.85    0.93
JDT Debug UI      0.90    0.91
Help              0.75    0.70
JDT Core          0.86    0.87
OSGI              0.88    0.88
Median            0.85    0.90
Overall           0.85    0.89
We performed two bug-prone vs. not bug-prone classification experiments. In experiment 1 (E1) we used logistic regression, once with the total number of LM and once with the total number of SCC as predictor. E1 investigates H3 (SCC can be used to discriminate between bug- and not bug-prone files) and, in addition, whether SCC is a better predictor than code churn based on LM.

Second, in experiment 2 (E2) we used the above-mentioned classifiers with the number of each category of SCC defined in Section 3.1 as predictors. E2 investigates whether change types are good predictors and whether the additional type information yields better results than E1, where the type of a change is neglected. In the following we discuss the results of both experiments.

Experiment 1: Table 6 lists the AUC values of E1 for each project in our dataset. The models were trained using 10-fold cross-validation, and the AUC values were computed when reapplying a learned model to the dataset it was obtained from. Overall denotes the model learned when merging all files of the projects into one larger dataset. SCC achieves a very good performance with a median of 0.90; more than half of the projects have AUC values equal to or higher than 0.90. This means that logistic regression using SCC as predictor ranks bug-prone files higher than not bug-prone ones with a probability of 90%. Even Help, which has the lowest value, is still within the range of 0.7 that Lessmann et al. call "promising results" [20]. This low value is accompanied by the smallest correlation of SCC, 0.48, in Table 4. The good performance of logistic regression with SCC is confirmed by an AUC value of 0.89 when learning from the entire dataset. With a value of 0.004, the variance of AUC_SCC over all projects is low, indicating consistent models. Based on the results of E1 we accept H3: SCC can be used to discriminate between bug- and not bug-prone files.

With a median of 0.85, LM shows a lower performance than SCC. Help is the only case where LM is a better predictor than SCC. This is not surprising, as it is the project that yields the largest difference in favor of LM in Table 4. In general, the correlation values in Table 4 reflect the picture given by the AUC values. For instance, jFace Text and JDT Debug UI, which exhibit similar correlations, performed nearly equally.
H2: SCC can predict bug-prone files
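The setup of experiment E1 (logistic regression with a single churn predictor, 10-fold cross-validation, AUC) can be sketched with scikit-learn; the study used WEKA, SPSS, and RapidMiner, and the data below is synthetic, generated so that the label loosely follows the SCC count:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
scc = rng.poisson(30, n).astype(float)       # total SCC per file (synthetic)
bug_prone = scc + rng.normal(0, 10, n) > 30  # noisy synthetic label

# 10-fold cross-validated AUC of logistic regression with SCC as predictor
auc = cross_val_score(LogisticRegression(), scc.reshape(-1, 1),
                      bug_prone, cv=10, scoring="roc_auc").mean()
```

Because AUC is insensitive to the class priors, the same pipeline can be compared across projects with different bug distributions, which is the reason given above for preferring it over accuracy.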
SCC outperforms LM
Predicting the number of bugs
Nonlinear regression with an asymptotic model:
[Scatterplot: #SCC (x-axis, 0 to 4000) vs. #Bugs (y-axis, 0 to 60) for Team Core]
#Bugs = a1 + b2 * e^(b3 * #SCC)
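The asymptotic model can be fitted with SciPy's curve_fit. In the sketch below the data is synthetic, generated from the model itself with chosen coefficients plus noise, so the fit should roughly recover them:

```python
import numpy as np
from scipy.optimize import curve_fit

def asymptotic(x, a1, b2, b3):
    """#Bugs = a1 + b2 * exp(b3 * #SCC)."""
    return a1 + b2 * np.exp(b3 * x)

rng = np.random.default_rng(1)
scc = np.linspace(0, 4000, 50)
# Synthetic ground truth: bugs saturate around a1 = 60 as SCC grows
bugs = asymptotic(scc, 60.0, -55.0, -0.001) + rng.normal(0, 2, 50)

params, _ = curve_fit(asymptotic, scc, bugs, p0=(50.0, -50.0, -0.0005))
pred = asymptotic(scc, *params)
r2 = 1 - np.sum((bugs - pred) ** 2) / np.sum((bugs - bugs.mean()) ** 2)
```

A negative b3 with negative b2 gives the saturating shape visible in the Team Core scatterplot: adding changes to an already heavily changed file predicts few additional bugs.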
H3: SCC can predict the number of bugs

Table 8: Results of the nonlinear regression in terms of R² and Spearman correlation using LM and SCC as predictors.

Project        R² LM  R² SCC  Spearman LM  Spearman SCC
Compare        0.84   0.88    0.68         0.76
jFace          0.74   0.79    0.74         0.71
JDT Debug      0.69   0.68    0.62         0.80
Resource       0.81   0.85    0.75         0.86
Runtime        0.69   0.72    0.66         0.79
Team Core      0.26   0.53    0.15         0.66
CVS Core       0.76   0.83    0.62         0.79
Debug Core     0.88   0.92    0.63         0.78
jFace Text     0.83   0.89    0.75         0.74
Update Core    0.41   0.48    0.43         0.62
Debug UI       0.70   0.79    0.56         0.81
JDT Debug UI   0.82   0.82    0.80         0.81
Help           0.66   0.67    0.54         0.84
JDT Core       0.69   0.77    0.70         0.74
OSGI           0.51   0.80    0.74         0.77
Median         0.70   0.79    0.66         0.77
Overall        0.65   0.72    0.62         0.74
of the models, i.e., an accompanying increase/decrease of the actual and the predicted number of bugs.

With an average R²_LM of 0.7, LM has less explanatory power compared to SCC using an asymptotic model. Except for JDT Debug UI, which has equal values, LM performs lower than SCC for all projects including Overall. The Related Samples Wilcoxon Signed-Ranks Test on the R² values of LM and SCC in Table 8 was significant, denoting that the observed differences in our dataset are significant.

To assess the validity of a regression model one must pay attention to the distribution of the error terms. Figure 3 shows two examples of fit plots with normalized residuals (y-axis) and predicted values (x-axis) from our dataset: the plot of the regression model of the Overall dataset on the left side, and that of Debug Core, which has the highest R²_SCC value, on the right side. On the left side, one can spot a funnel, which is one of the "archetypes" of residual plots and indicates that the constant-variance assumption may be violated, i.e., the variability of the residuals is larger for larger predicted values of SCC [19]. This is an example of a model that shows adequate performance, i.e., an R²_SCC of 0.72, but whose validity is questionable. On the right side, there is a first sign of the funnel pattern, but it is not as evident as on the left side. The lower part of Figure 3 shows the corresponding histograms of the residuals. They are normally distributed with a mean of 0.

Therefore, we accept H3: SCC (using asymptotic nonlinear regression) achieves better performance when predicting the number of bugs within files than LM. However, one must be careful to investigate whether the models violate the assumptions of the general regression model. We analyzed all residual plots of our dataset and found that the constant-variance assumption may be generally problematic, in particular when analyzing software measures and open source systems that show highly skewed distributions. The other two assumptions concerning the error terms, i.e., zero mean and independence, are not violated. When using regression strictly for descriptive and prediction purposes only, as is the case for our experiments, these assumptions are less important, since the regression will still result in an unbiased estimate between the dependent and independent variable [19]. However, when inference based on the obtained regression models is made, e.g., conclusions about the slope
Figure 3: Fit plots of the Overall dataset (left) and Debug Core (right) with normalized residuals on the y-axis and the predicted values on the x-axis. Below are the corresponding histograms of the residuals.

(β coefficients) or the significance of the entire model itself, the assumptions must be verified.
3.6 Summary of Results

The results of our empirical study can be summarized as follows:

SCC correlates strongly with bugs. With an average Spearman rank correlation of 0.77, SCC has a strong correlation with the number of bugs in our dataset. Statistical tests indicated that the correlation of SCC and bugs is significantly higher than between LM and bugs (accepted H1).

SCC categories correlate differently with bugs. Except for cDecl, all SCC categories defined in Section 3.1 correlate substantially with bugs. A Friedman Test revealed that the categories have significantly different correlations. Post-hoc comparisons confirmed that the difference is mainly due to two groups of categories: (1) stmt, func, and mDecl, and (2) else, cond, oState, and cDecl. Within these groups the post-hoc tests were not significant.

SCC is a strong predictor for classifying source files into bug-prone and not bug-prone. Models built with logistic regression and SCC as predictor rank bug-prone files higher than not bug-prone ones with an average probability of 90%. They have a significantly better performance in terms of AUC than logistic regression models built with LM as predictor (accepted H2).

In a series of experiments with different classifiers using SCC categories as independent variables, LibSVM yielded the best performance; it was the best classifier for more than half of the projects. LibSVM was closely followed by B-Net, RndFor, N-Bayes, and NN. Decision tree learners resulted in a significantly lower performance. Furthermore, using categories, e.g., func, rather than the total number of SCC did not yield better performance.
SCC outperforms LM
Summary of results
SCC performs significantly better than LM
Advanced learners are not always better
Change types do not yield extra discriminatory power
Predicting the number of bugs is “possible”
More information: “Comparing Fine-Grained Source Code Changes And Code Churn For Bug Prediction”, MSR 2011
What is next?
Analysis of the effect(s) of changes
What is the effect on the design?
What is the effect on the quality?
Ease understanding of changes
Recommender techniques
Models that can provide feedback on the effects
Can developer-module networks predict failures?
Joint work with Nachi Nagappan, Brendan MurphyMicrosoft Research
Research question
Are binaries with fragmented contributions from many developers more likely to have post-release failures?
Should developers focus on one thing?
Study with MS Vista project
Data:
Released in January 2007
> 4 years of development
Several thousand developers
Several thousand binaries (*.exe, *.dll)
Several millions of commits
[Figure: contribution network — developers Alice, Bob, Dan, Eric, Fu, Go, Hin contributing to binaries a, b, c]
Approach in a nutshell
[Diagram: change logs and bug reports feed a regression analysis, validated with data splitting]
[Figure: contribution network with commit counts on the edges — developers Alice, Bob, Dan, Eric, Fu, Go, Hin contributing to binaries a, b, c]
Binary #bugs #centrality
a 12 0.9
b 7 0.5
c 3 0.2
Contribution network
[Figure: contribution network — developer nodes Alice, Bob, Dan, Eric, Fu, Go, Hin linked to binary nodes a, b, c]
Windows binary (*.dll) / Developer
Which binary is failure-prone?
Measuring fragmentation
[Figure: the contribution network annotated with the fragmentation measures Freeman degree, Closeness, and Bonacich's power]
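Centrality of a binary in such a developer-binary network can be computed with networkx; the node names below mirror the toy figure, while the edges are invented for illustration:

```python
import networkx as nx

# Toy contribution network: developer -> binary edges (invented)
G = nx.Graph()
G.add_edges_from([
    ("Alice", "a"), ("Bob", "a"), ("Dan", "a"), ("Eric", "a"),
    ("Bob", "b"), ("Fu", "b"),
    ("Fu", "c"), ("Go", "c"), ("Hin", "c"),
])

for binary in ["a", "b", "c"]:
    degree = G.degree(binary)                      # Freeman degree
    closeness = nx.closeness_centrality(G, binary) # Closeness
    print(binary, degree, round(closeness, 2))
```

A binary touched by many developers who also touch other binaries ends up central on both measures, which is exactly the kind of fragmented contribution the hypotheses below relate to post-release failures.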
Research hypotheses
H1 Binaries with fragmented contributions are failure-prone
H2 Fragmentation correlates positively with the number of post-release failures
H3 Advanced fragmentation measures improve failure estimation
Correlation analysis
           nrCommits  nrAuthors  Power  dPower  Closeness  Reach  Betweenness
Failures   0.700      0.699      0.692  0.740   0.747      0.746  0.503
nrCommits             0.704      0.996  0.773   0.748      0.732  0.466
nrAuthors                        0.683  0.981   0.914      0.944  0.830
Power                                   0.756   0.732      0.714  0.439
dPower                                          0.943      0.964  0.772
Closeness                                                  0.990  0.738
Reach                                                             0.773

Spearman rank correlation
All correlations are significant at the 0.01 level (2-tailed)
H1: Predicting failure-prone binaries
Binary logistic regression of 50 random splits
4 principal components from 7 centrality measures
[Boxplots over the 50 random splits: Precision, Recall, and AUC, each ranging roughly from 0.50 to 1.00]
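The validation scheme described above (50 random splits, PCA reducing the seven centrality measures to four components, binary logistic regression) can be sketched with scikit-learn; the data here is synthetic, not the Vista dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 7))        # seven centrality measures per binary
y = X[:, :3].sum(axis=1) + rng.normal(0, 1, 300) > 0  # synthetic label

aucs = []
for seed in range(50):               # 50 random splits with held-out thirds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, random_state=seed)
    model = make_pipeline(PCA(n_components=4), LogisticRegression())
    model.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

median_auc = float(np.median(aucs))
```

PCA is used here for the same reason as in the study: the centrality measures are highly correlated with each other (see the table above), and regressing on a few components avoids multicollinearity.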
H2: Predicting the number of failures
All correlations are significant at the 0.01 level (2-tailed)
[Boxplots over the 50 random splits: R-Square, Pearson, and Spearman, each ranging roughly from 0.50 to 1.00]
Linear regression of 50 random splits:
#Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
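The regression above can be sketched with ordinary least squares in NumPy; the coefficients and data below are synthetic, chosen only to show that the fit recovers them:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
closeness = rng.uniform(0.0, 1.0, n)            # nCloseness (normalized)
authors   = rng.integers(1, 40, n).astype(float)   # nrAuthors
commits   = rng.integers(10, 500, n).astype(float) # nrCommits

# Synthetic ground truth: failures grow with all three predictors
failures = 2 + 30*closeness + 0.5*authors + 0.01*commits + rng.normal(0, 2, n)

# Design matrix with intercept column; solve for b0..b3 by least squares
X = np.column_stack([np.ones(n), closeness, authors, commits])
b, *_ = np.linalg.lstsq(X, failures, rcond=None)
b0, b1, b2, b3 = b
```

In the study's setting, the model is fitted on one random two-thirds split and evaluated on the held-out third, repeated 50 times as in the boxplots above.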
H3: Basic vs. advanced measures
[Boxplots over the 50 random splits: R-Square and Spearman, ranging roughly from 0.30 to 1.00, for the model with nrAuthors and nrCommits vs. the model with nCloseness, nrAuthors, and nrCommits]
Summary of results
Centrality measures can predict more than 83% of failure-prone Vista binaries
Closeness, nrAuthors, and nrCommits can predict the number of post-release failures
Closeness or Reach can improve prediction of the number of post-release failures by 32%
More information: “Can Developer-Module Networks Predict Failures?”, FSE 2008
What can we learn from that?
Increase testing effort for central binaries? - yes
Re-factor central binaries? - maybe
Re-organize contributions? - maybe
What is next?
Analysis of the contributions of a developer
Who is working on which parts of the system?
What exactly is the contribution of a developer?
Who is introducing bugs/smells and how can we avoid it?
Global distributed software engineering
What are the contributions of teams, what are the smells, and how can we avoid them?
Can we empirically prove Conway’s Law?
Expert recommendation
Whom to ask for advice on a piece of code?
Ideas for software upgrade research
1. Mining software repositories to identify the upgrade-critical components
What are the characteristics of such components?
Product and process measures
What are the characteristics of the target environments?
Hardware, operating system, configuration
Train a model with these characteristics and reported bugs
Further ideas for research
Who is upgrading which applications, and when?
Study the upgrade behavior of users?
What is the environment of the users when they upgrade?
Where did it work, where did it fail?
Collect crash reports for software upgrades?
Upgrades in distributed applications?
Finding the optimal time when to upgrade which component?
Conclusions
[Recap figures: the #SCC vs. #Bugs scatterplot (Team Core) and the weighted developer-binary contribution network]
Questions?
Martin [email protected]