PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"
Tomi Prifti, Sean Banerjee, and Bojan Cukic.

TRANSCRIPT
Detecting Bug Duplicate Reports Through Locality of Reference
Tomi Prifti, Sean Banerjee, Bojan Cukic
Lane Department of CSEE, West Virginia University, Morgantown, WV, USA
September 2011
Presentation Outline
• Introduction
• Goals
• Related Work
• Understanding the Firefox Repository
• Experimental Setup
• Results
• Summary
Introduction
• Bug tracking systems are essential for software maintenance and testing
• Developers and end users can report failure occurrences
• Advantages:
– Users are involved in error reporting
– Direct impact on software quality
• Disadvantages:
– A large number of reports arrive on a daily basis
– Significant effort is required to triage them
– Users may submit many duplicate reports
A typical bug report
Goals
• Comprehensive empirical analysis of a large bug report dataset.
• Creation of a search tool
– Encourage users to search the repository
– Avoid duplicate report submission whenever possible
– Assist with report triage
• Build a list of reports possibly describing the same problem
• Let a triager examine the suggested list
Related Work
• Providing Triagers with a Suggested List
– Provide a suggested list of similar bugs to triagers for examination
• Wang et al. exploit NLP techniques and execution information
• Duplicate detection rate as high as 67%–93%
• Semi-automated Filtering
– Determine the type of the report (duplicate or primary); if the new report is classified as a duplicate, filter it out
• Jalbert et al. use text semantics and a graph clustering technique to predict duplicate status
• Filtered out only 8% of duplicate reports
Related Work
• Semi-automated Assignment
– Apply text categorization techniques to predict the developer who should work on the bug
• Cubranic et al. apply supervised Bayesian learning; correctly classify 30% of the reports
• Anvik et al. use a supervised machine learning algorithm; precision rates of 57% and 64% for Firefox and Eclipse
• Improving Report Quality
– Duplicate reports are not considered harmful
• Bettenburg et al. developed a tool, called CUEZILLA, that measures the quality of bug reports in real time
• “Steps to reproduce” and “Stack traces” are the most useful information in bug reports
Related Work
• Bugzilla Search Tool
– Bugzilla 4.0, released around February 2011, provides duplicate detection
– The tool performs a Boolean full-text search on the title over the entire repository
– Generates a dozen or so reports that may match at least one of the search terms
– In some instances, searching with the exact title of an existing report did not return the report itself
– Unknown accuracy of reported matches
Firefox Repository
• Firefox releases: 1.0.5, 1.5, 2.0, 3.0, 3.5, and the current version 3.6 (as of June 2010).
• 65% of reports reside in groups of one.
• 90% of duplicates are distributed in small groups of 2–16 reports.
Time Interval Between Reports
• Many bugs receive the first duplicate within the first few months of the original report.
Experimental Setup
• Tokenization – “bag-of-words”
• Stemming – reducing words to their root
• Stop word removal
– Lucene API used for pre-processing
• Term Frequency/Inverse Document Frequency (TF/IDF) used for weighting words
• Cosine similarity used as the similarity measure
Example of tokenizing, stemming, and stop word removal:
“Sending email is not functional” → send email function
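The pre-processing pipeline above can be sketched in a few lines of Python. This is an illustrative stand-in, not the Lucene pipeline the paper uses: the stop word list and the crude suffix-stripping stemmer are simplified assumptions.

```python
import re

# Illustrative stop word list; the paper relies on the Lucene API instead.
STOP_WORDS = {"is", "not", "the", "a", "an", "to", "of"}

def stem(word):
    # Crude suffix stripping; a real stemmer (e.g. Porter) handles far more cases.
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(title):
    tokens = re.findall(r"[a-z]+", title.lower())        # tokenization ("bag of words")
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [stem(t) for t in tokens]                     # stemming

# preprocess("Sending email is not functional") -> ["send", "email", "function"]
```

On the slide's example title, this reproduces the three surviving root terms shown above.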
Experimental Procedure
• Start with the initial 50% of reports as historical information
• The group containing the most recent primary or duplicate is at the top of the initial list
• Build the suggested list using IR techniques
• As the experiment progresses, the historical repository grows
• Continue until all reports are classified as duplicate or primary
• If a bug is primary, it is forwarded to the repository
• This may not be realistic, as triagers may misjudge reports
Measuring Performance
• Performance of the bug search tool is measured by the recall rate, Nrecalled / Ntotal
– Nrecalled refers to the number of duplicate reports correctly classified
– Ntotal refers to the total number of duplicate reports
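The incremental procedure and the recall measurement can be sketched together as follows. The function names and data layout are hypothetical; `suggest` stands in for whichever IR technique builds the suggested list.

```python
def run_experiment(reports, suggest, m=20):
    """Sketch of the incremental evaluation (illustrative names, not the paper's code).
    reports: time-ordered list of (report_id, is_duplicate, true_group_id).
    suggest: callable(history, report_id, m) -> list of candidate group ids."""
    split = len(reports) // 2
    history = list(reports[:split])          # initial 50% as historical information
    n_recalled = n_total = 0
    for report in reports[split:]:
        report_id, is_duplicate, true_group = report
        if is_duplicate:
            n_total += 1
            if true_group in suggest(history, report_id, m):
                n_recalled += 1              # duplicate's group was on the suggested list
        history.append(report)               # repository grows as the experiment progresses
    return n_recalled / n_total if n_total else 0.0   # recall = N_recalled / N_total
```

The return value is exactly the recall rate defined on this slide, computed over the second half of the report stream.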
Approach methodology
• Reporters query the repository.
• Use “title” (summary) to compare reports.
• Four experiments:
– TF/IDF
– “Sliding Window” + TF/IDF
– “Sliding Window” + Group Centroids + TF/IDF
– “Sliding Window” + Group Centroids
• The centroid is composed of all unique terms from all reports in the group and the sum of their frequencies in each report. The total frequency of each term is divided by the number of reports in the group.
Sliding Window Defined
• “Sliding window” approach: keep a window of fixed size n
– Sort all groups by the time elapsed between each group’s last report and the new incoming report
– Select the top n groups (analysis shows n = 2000 is optimal, with a 95% chance of the duplicate’s group being among them)
– Apply IR techniques only to the top n groups
– Build a short list of the top m most similar reports to present to the triager/reporter
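The group-selection step above can be sketched as follows, assuming each group is keyed by the timestamp of its most recent report (an illustrative data layout, not the paper's).

```python
import heapq

def select_window(group_last_seen, new_report_time, n=2000):
    """Pick the n most recently active groups relative to an incoming report.
    group_last_seen maps group id -> timestamp of that group's latest report."""
    # Smallest elapsed time since the group's last report = most recently active.
    return heapq.nsmallest(n, group_last_seen,
                           key=lambda g: new_report_time - group_last_seen[g])
```

Using a heap keeps the selection cheap even when the repository holds far more than n groups; the IR comparison then runs only over the returned groups.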
Experimental Results
• Our results demonstrate that Time-Window/Group Centroid and report summaries predict duplicate problem reports with a recall ratio of up to 53%.
Performance and Runtime
• Large variance in recall rate initially. Time window approach stabilizes, while TF/IDF degrades.
• Classification run time is faster for the Time Window approach; each additional report increases computation time in TF/IDF
Result Comparisons
Group           | Approach                                | Results
Hiew et al.     | Text analysis                           | Recall rate ~50%
Cubranic et al. | Bayesian learning, text categorization  | Correctly predicted ~30% of duplicates
Jalbert et al.  | Text similarity, clustering             | Recall rate ~51% (list size 20)
Wang et al.     | NLP, execution information              | 67–93% detection rate (43–72% with NLP alone)
Wang et al.     | Enhanced version of prior algorithm     | 17–31% improvement over state of the art
Our approach    | Time Window/Centroids                   | ~53% recall rate
Threats to Validity
• Assumption that the ground truth on duplicates is correct
– The life cycle of a bug is ever-changing
– Some reports change state multiple times
Summary and Future Work
SUMMARY
• Comprehensive study analyzing long-term duplicate trends in a large, open source project.
• Improve search features in duplicate detection by providing a suggested list.
• The time interval between reports can be used to narrow the search space.
FUTURE WORK
• Compare with other projects (e.g., Eclipse) to generalize the approach.
• Study the effects on duplicate propagation caused by a user incorrectly selecting a report from the suggested list.
TF/IDF
• Compare vector representing a new report to every vector that is currently in the database.
• Vectors in the database are weighted using TF/IDF to emphasize rare words.
• The reports are ranked based on their cosine-similarity scores.
• Report ranking is used to build the suggested list presented to the user.
• Run time impacted as repository size grows.
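The weighting and ranking steps above can be sketched with a plain bag-of-words representation. This is a minimal illustration; Lucene's actual TF/IDF scoring differs in normalization details.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each document (a token list) with TF/IDF so rare words dominate."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    idf = {t: math.log(n / df[t]) for t in df}                # rare terms -> large IDF
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# To build the suggested list: score every repository report against the new
# report's vector with cosine(), then present the top m matches to the user.
```

Because every stored vector must be compared against each new report, run time grows with repository size, which is the weakness the sliding window addresses.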
Sliding Window - TF/IDF
• Apply time window to limit groups under consideration for search.
• Only the reports within 2,000 groups are considered.
• Reports are weighted using TF/IDF.
• Scoring and building of the suggested list are the same as in the TF/IDF approach
Sliding Window – Centroid
• Same time window.
• Reports from the 2,000 groups are not individually searched and weighted using TF/IDF.
• A centroid vector representing each group is used instead.
• Example:
– Summary 1: “unable send email”
– Summary 2: “send email function”
– Summary 3: “send email after enter recipient”
– The resulting centroid of the group is: 1.0 send, 1.0 email, 0.33 unable, 0.33 function, 0.33 after, 0.33 enter, 0.33 recipient.
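The centroid definition can be reproduced directly from the slide: sum each term's frequency across the group's reports and divide by the group size, so "send" and "email" (in all three summaries) get weight 1.0 and every other term gets 1/3.

```python
from collections import Counter

def group_centroid(summaries):
    """Centroid per the slide: each unique term's total frequency across the
    group's reports, divided by the number of reports in the group."""
    totals = Counter(term for summary in summaries for term in summary)
    return {term: count / len(summaries) for term, count in totals.items()}

centroid = group_centroid([
    ["unable", "send", "email"],
    ["send", "email", "function"],
    ["send", "email", "after", "enter", "recipient"],
])
# "send" and "email" occur in all three reports -> 1.0; every other term -> 1/3
```

Comparing a new report against one centroid per group, instead of against every report in the group, is what keeps the window approach's run time stable.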
Sliding Window – Centroid – TF/IDF
• Uses centroid technique described before.
• Weight each term in centroids using TF/IDF weighting scheme.
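One plausible reading of this weighting step treats each group centroid as a document for IDF purposes. This interpretation is an assumption; the slide does not spell out the exact formula.

```python
import math
from collections import Counter

def idf_weighted_centroids(centroids):
    """Apply IDF across group centroids (illustrative reading of the slide).
    centroids: list of {term: average frequency} dicts, one per group."""
    n = len(centroids)
    df = Counter(term for c in centroids for term in c)   # groups containing each term
    return [{term: w * math.log(n / df[term]) for term, w in c.items()}
            for c in centroids]
```

Under this scheme a term appearing in every group's centroid is weighted down to zero, while terms distinctive to a few groups keep most of their centroid weight.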