HASL (High Assurance Systems Lab)
Comparing Design and Code Metrics for Software Quality Prediction

Y. Jiang, B. Cukic, T. Menzies
Lane Department of CSEE
West Virginia University

PROMISE 2008
Predicting Faults Earlier Matters
• Boehm observed that fault removal is 50 to 200 times less costly when performed in the design phase than after deployment.
• NASA research shows that a fault introduced in the requirements, if it leaks into the design, code, test, integration, and operational phases, incurs correction cost factors of 5, 10, 50, 130, and 368, respectively.
• Therefore, the earlier we can identify fault-prone artifacts, the better.
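As a quick illustration, the cost factors quoted above can be encoded directly; this is a minimal sketch (the dictionary and function names are ours, not from the cited study):

```python
# Correction cost factors from the NASA study cited above: a fault
# introduced in the requirements phase costs progressively more to fix
# the longer it leaks into later phases (requirements-phase cost = 1x).
COST_FACTOR = {
    "requirements": 1,
    "design": 5,
    "code": 10,
    "test": 50,
    "integration": 130,
    "operation": 368,
}

def relative_fix_cost(phase, base_cost=1.0):
    """Relative cost of fixing a requirements fault caught in `phase`."""
    return base_cost * COST_FACTOR[phase]
```

For example, a fault that survives into operation costs 368 times its requirements-phase fix, and over 70 times its design-phase fix.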
How early?
• Do requirements metrics correlate with fault-proneness? [Jiang et al., ISSRE 2007]
Predicting From Design Metrics?
• It has been successfully demonstrated.
• Ohlsson and Alberg ('96) demonstrated that design metrics predict fault-prone modules effectively.
  – "Design metrics are better predictors than code size (LOC)"
  – Telephone switching domain
• Basili validated the so-called CK object-oriented (design) metrics using eight student-developed systems.
• Nagappan, Ball & Zeller confirmed Ohlsson's findings using OO design metrics on five Microsoft systems.
Goal of This Study
• Thorough comparison of fault prediction models which utilize:
1. Design metrics
2. Static code metrics
3. Combination of both
• Statistically significant number of projects and modules within projects.
Metrics Description (1)
• Code metrics
Metrics Description (2)
• Design metrics
Experimental Design
(Workflow, shown as a diagram on the slide:)
• Three metrics sets: Design, Code, All
• Classification with 10x10 cross-validation
• Illustrate results using ROC curves
• Evaluate results using AUC (trapezoid rule)
• Visualize using boxplot diagrams
• Compare using nonparametric statistical tests
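The trapezoid-rule AUC evaluation in this workflow can be sketched generically in a few lines; a simple illustration (our function name), not the authors' implementation:

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve given matching lists of false-positive
    and true-positive rates, computed with the trapezoid rule."""
    # Sort the ROC points by false-positive rate, then accumulate the
    # trapezoid area between each pair of consecutive points.
    pts = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A perfect classifier (the curve passes through the point (0, 1)) yields an AUC of 1.0; the chance diagonal yields 0.5.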
Datasets: NASA MDP
• Used every dataset which offered both design and code level metrics.
Experimental Design (2)
• 5 classification algorithms
  – Random forest, bagging, boosting, logistic regression, Naive Bayes
• 10x10-fold cross-validation:
  – each 10-fold experiment generates an ROC curve => 10 ROCs => 10 AUCs
• We analyzed 1,950 experiments!
  – 13 [data sets] * 3 [metrics sets] * 5 [classifiers] * 10 [CV repetitions]
• We show only the best model from each metrics set in each data set (project).
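The 10x10 cross-validation setup can be sketched as ten independent random partitions into ten folds; this is a generic illustration (function name ours), not the authors' exact code:

```python
import random

def ten_by_ten_cv_folds(n_samples, repeats=10, folds=10, seed=1):
    """Index partitions for `repeats` independent rounds of
    `folds`-fold cross-validation over n_samples modules."""
    all_repeats = []
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # fresh random shuffle for each repetition
        # slice the shuffled indices into `folds` disjoint folds
        all_repeats.append([idx[f::folds] for f in range(folds)])
    return all_repeats
```

Each repetition yields one ROC curve per classifier, hence 10 ROCs and 10 AUCs per (data set, metrics set, classifier) combination: 13 * 3 * 5 * 10 = 1,950 experiments.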
Analysis example: PC5 data set
(ROC curves and AUC boxplots for PC5 shown on the slide.)
• Mean AUC:
  – All: 0.979
  – Code: 0.967
  – Design: 0.956
Typical Results
Not So Typical Results
Atypical Results
Test Statistical Significance
• We use the procedure recommended by Demsar for each of the 13 data sets.
1. The Friedman test checks whether performance differs amongst the design, code, and all experiments.
   • If no, no further test is necessary.
   • If yes, then:
2. Use pairwise nonparametric tests (typically the Wilcoxon test or the Mann-Whitney test) to determine which group of metrics is best.
• A 95% confidence level is used in all experiments.
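The Friedman statistic in step 1 is computed from the per-dataset ranks of the metrics groups; the following is a plain-Python illustration of the textbook formula (our naming, not the authors' code):

```python
def rank(values):
    """Rank a list ascending (1 = smallest), averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def friedman_statistic(table):
    """Friedman chi-square for a table of N datasets x k treatments
    (e.g. AUC of design / code / all on each dataset)."""
    n, k = len(table), len(table[0])
    col_rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(rank(row)):
            col_rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(R * R for R in col_rank_sums) - 3 * n * (k + 1)
```

If one group dominates in every dataset the statistic is large (and the p-value small); identical performance everywhere gives a statistic of 0.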
Pairwise comparison
• We test the following hypotheses for a pairwise comparison of two experiments, A and B:
H0: There is no difference in the performance of the models built from group A and group B metrics.
H1: The performance of the group A metrics is better than that of the group B metrics.
H2: The performance of the group A metrics is worse than that of the group B metrics.
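The Wilcoxon signed-rank test used for these pairwise comparisons reduces, in its simplest form, to ranking the paired AUC differences; a minimal sketch (our naming; statistic only, no normal approximation for the p-value):

```python
def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired
    samples a and b (e.g. the 10 AUCs of two metrics groups)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero diffs
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # average ranks over runs of tied |d|
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

The smaller W is relative to its expected value, the stronger the evidence against H0; in practice one would use a statistics library to obtain the p-value.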
The Result of Hypothesis Test (1)
• Friedman's test
  – Average p-value = 0.00003604 (< 0.05)
  – Strongly suggests a statistically significant difference amongst the models built from all, code, and design metrics over all 13 datasets.
• The two pairwise nonparametric tests (the Wilcoxon test and the Mann-Whitney test) agree in all cases but one:
  – PC2: the Mann-Whitney test gives all > code, but the Wilcoxon test gives all = code.
  – This discrepancy does not affect the overall trend.
Findings
• Statistical significance tests used AUC for model comparison.
• In 7 datasets: all = code; in 6 datasets: all > code.
• In all 13 datasets: all > design.
• In 12 datasets: code > design.
• The only exception is the KC4 project, where design > code.
Summary of Observations
• The performance of models is influenced
  – MORE by metrics
  – THAN by classification algorithms.
• The combination of design AND code metrics provides better models than code or design metrics alone.
• Models built from code metrics generally perform better than those built from design metrics only.
• Design metrics are useful for predicting fault-prone modules earlier.
• A clear indication that integrating metrics from different phases of development is useful.
Threats to Validity
• Noise in the metrics data sets.
  – Would feature selection change some outcomes?
• Generality of the NASA datasets.
• Design metrics were reengineered from code.
  – They reflect the code base more accurately than metrics computed from design documentation would.
• The All metrics data set contains a few independent variables that belong to neither the Code nor the Design group.
  – This needs correction, but the results are unlikely to change.
Ensuing Research
• Software fault prediction can be improved.
  – Improvement is unlikely to come from applying more off-the-shelf data mining algorithms.
  – Accounting for a project's "business context" may contribute to improvement.
• Metrics from different development stages add information not available from the code.
• Evaluation of effectiveness should be tailored to project-specific (subsystem/module-specific) risks.
  – Reliable metrics collection.