mining software data: code tao xie university of illinois at urbana-champaign taoxie/...
TRANSCRIPT
Mining Software Data: Code
Tao XieUniversity of Illinois at Urbana-Champaign
http://web.engr.illinois.edu/~taoxie/
2
MAIN GOAL Transform static record-
keeping SE data to active data
Make SE data actionable by uncovering hidden patterns and trends
Mining Software Engineering Data
MailingsBugzilla
Code repository
Executiontraces
CVS
Mining Software Engineering Data
code bases
change history
programstates
structuralentities
software engineering data
bugreports/nl
programming defect detection testing debugging maintenance
software engineering tasks
data mining techniques
…
…
https://sites.google.com/site/asergrp/dmse
Mining Software Engineering Data
code bases
change history
programstates
structuralentities
software engineering data
bugreports/nl
programming defect detection testing debugging maintenance
software engineering tasks
data mining techniques
…
…
5
5
Programmers commonly reuse APIs of existing frameworks or libraries
– Advantages: High productivity of development– Challenges: Complexity and lack of documentation– Consequences:
• Spend more efforts in understanding APIs• Introduce defects in API client code
– Solution: Mining API properties as common patterns across API client code
Frameworks
Motivation
6
Basic mining
algorithms
Solution-Driven Problem-Driven
Advanced mining
algorithmsNew/adapted
mining algorithms
Where can I apply X miner? What patterns do we really need?
E.g., frequent partial order mining [ESEC/FSE 07]
E.g., association rule mining, frequent itemset mining…
E.g., [ICSE 09], [ASE 09]
7
77
Code repositoriesCode repositories
1 2 N…
1 2mining patterns
searching mining patterns
Code search engine e.g., Open source code
on the web
Eclipse, Linux, …
Traditional approaches
Our new approaches
Often lack sufficient relevant data points (Eg. API call sites)
Code repositories
Mining Searching + Mining
8
Agenda MotivationMining Sequence Association Rules (CAR-Miner) [ICSE 09] Detecting Exception-Handling Defects
Mining Alternative Patterns(Alattin) [ASE 09] Detecting Neglected Condition Defects
Conclusion
9
APIs throw exceptions during runtime errorsExample: Session API of Hibernate framework throws
HibernateException
APIs expect client applications to implement recovery actions after exceptions occur
Example: Hibernate Session API expects client application to rollback open uncommitted transactions after HibernateException occurs
Failure to handle exceptions results inFatal issues, e.g., database lock won’t be released if the
transaction is not rolled back
Exception Handling
10
Use exception-handling specification to detect violations as defects
Problem: Often specifications are not documentedSolution: Mine specifications from existing API client codeChallenges:
Limited data points: Only from a few code bases searching + mining
Limited expressiveness: Not sufficient to characterize common exception-handling behaviors: why?
Problem Addressed by CAR-Miner
11
Example1.1: .. .1.2: Orac leD ataS o urce o ds = nu ll; S ess ion sess ion = n ull; Co nn ec tio n conn = nu ll; S tatem e nt state m ent = n ull;1.3: logg er .deb ug ("S ta rt ing u pd ate");1.4: try {1.5: od s = n ew Ora c leD ataS ou rce( );1.6: od s.setUR L(" jdbc :orac le:thin:sco tt/t iger@ 1 92 .1 68 .1 .2 :152 1:catfish") ;1.7: conn = od s.g etCo nne c tio n() ;1.8: state m ent = co nn.cre ate S ta tem e nt( );1.9: state m ent .exe cuteU pda te("D E LE TE F RO M table 1") ;1.10 : conn ec tio n.com m it() ; }1.11 : catch (S QL Excep tio n se ) {1.13 : log ger .error ("E xcep tion o ccur re d"); }1.14 : f ina lly {1.15 : if(s ta te m en t != n ull) { s tatem e nt.clo se () ; }1.16 : if(conn != nu ll) { conn .c lose( ); }1.17 : if(od s != n ull) { o ds .clo se () ; } } 1.18 : }
S ce nar io 1 Defect: No rollback done
when SQLException occurs Requires specification such
as “Connection should be rolled back when a connection is created and SQLException occurs”
Q: Should every connection instance has to be rolled back when SQLException occurs?
Missing “conn.rollback()”
12
Example (cont.)2.1: Co nn ec tio n conn = nu ll;2.2: S ta te m en t s tm t = null;2.3: B uffe re dW riter b w = null; F ileW riter fw = nu ll;2.3: try {2.4: fw = n ew F il eW riter( "ou tp ut.tx t") ;2.5: bw = B ufferedW r ite r(fw );2.6: conn = D riverM an age r.ge tCon ne ctio n(" jdb c:p l:d b", "ps ", "ps ");2.7: S tatem e nt stm t = c onn.cre ateS ta te m ent( );2.8: Re sultS e t re s = s tm t.ex ec ute Que ry("S E LE CT P a th FR OM File s") ;2.9: wh ile (res .n ex t() ) {2.10 : bw .w r ite (res .g etS tring (1 ));2.11 : }2.12 : re s.c lose( );2.13 : } catch( IO E xcep tio n ex ) { log ge r.er ror (" IOE xce ption o ccur re d");2.14 : } f ina lly {2.15 : if(s tm t != nu ll) s tm t.clo se () ;2.16 : if(conn != nu ll ) co nn.c lose( );2.17 : if (bw != nu ll) bw .c lose( );2.18 : }
1.1: ...1.2: Orac leD ataS o urce od s = nu ll; S ess ion sessio n = n ull; Co nn ec tio n conn = nu ll; S tatem e nt statem ent = n ull;1.3: logg er .deb ug ("S ta rt ing u pd ate");1.4: try {1.5: od s = n ew Ora c leD ataS ou rce( );1.6: od s.setUR L(" jdbc :orac le:th in:sco tt/t ig er@ 1 92 .1 68 .1 .2 :152 1:ca tfish");1.7: conn = od s.g etCo nne c tio n() ;1.8: state m ent = co nn.cre ate S ta tem e nt( );1.9: state m ent .exe cuteU pda te("D E LE TE F RO M table1 ") ;1.10 : conn ec tio n.com m it() ; }1.11 : catch (S QL Excep tio n se ) {1.12 : if (co nn != n ull) { c onn.rollb ack () ; } 1.13 : log ger .error ("E xcep tion o ccur re d"); }1.14 : f ina lly {1.15 : if(s ta te m en t != n ull) { s tatem e nt.clo se () ; }1.16 : if(conn != nu ll) { co nn .c lose( ); }1.17 : if(od s != n ull) { o ds .clo se () ; } } 1.18 : }
S cena rio 2S ce nar io 1
Specification: “Connection creation => Connection rollback” Satisfied by Scenario 1 but not by Scenario 2 But Scenario 2 has no defect
c
13
Simple association rules of the form “FCa => FCe” are not expressive
Requires more general association rules (sequence association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = Connection.createStatement()
FCa -> stmt.executeUpdate()
FCe1 -> conn.rollback()
Example (cont.)
14
Simple association rules of the form “FCa => FCe” are not expressive
Requires more general association rules (sequence association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = Connection.createStatement()
FCa -> stmt.executeUpdate() //Triggering ActionFCe1 -> conn.rollback()
Example (cont.)
15
Simple association rules of the form “FCa => FCe” are not expressive
Requires more general association rules (sequence association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = Connection.createStatement()
FCa -> stmt.executeUpdate()
FCe1 -> conn.rollback() //Recovery Action
Example (cont.)
16
Simple association rules of the form “FCa => FCe” are not expressive
Requires more general association rules (sequence association rules) such as
(FCc1 FCc2) Λ FCa => FCe1, where
FCc1 -> Connection conn = OracleDataSource.getConnection()
FCc2 -> Statement stmt = conn.createStatement() //Context
FCa -> stmt.executeUpdate()
FCe1 -> conn.rollback()
Example (cont.)
17
CAR-Miner Approach
InputApplication
Check whether there are any exception-related
defects
Classes and Functions
Open Source Projects on web Open Source Projects on web
1 2 N…
…Exception-Flow
GraphsStatic Traces
SequenceAssociation
RulesViolations
Extract classes and functions
reused
Issue queries and collect relevant code examples. Eg: “lang:java
java.sql.Statement executeUpdate”Construct exception-
flow graphs
Collect static traces
Mine static traces
Detect violations
18
CAR-Miner Approach
InputApplication
Classes and Functions
Open Source Projects on web Open Source Projects on web
1 2 N…
…Exception-Flow
GraphsStatic Traces
SequenceAssociation
RulesViolations
Exception-Flow-Graph Construction
Based on a previous algorithm [Sinha&Harrold TSE 00] : normal execution path ----: exceptional execution path
20
Exception-Flow-Graph Construction
Prevent infeasible edges using a sound static analysis [Robillard&Murphy FSE 99]
21
CAR-Miner Approach
InputApplication
Classes and Methods
Open Source Projects on web Open Source Projects on web
1 2 N…
…Exception-Flow
GraphsStatic Traces
SequenceAssociation
RulesViolations
22
Static Trace Generation
Collect static traces with the actions taken when exceptions occur
A static trace for Node 7:“4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17”
23
Static Trace Generation Includes 3 sections:
Normal function-call sequence (4 -> 5 -> 6)
Function call (7) Exception
function-call sequence (15 -> 16 -> 17)
A static trace for Node 7: “4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17”
24
Trace Post-Processing
Identify and remove unrelated function calls using data dependency
“4 -> 5 -> 6 -> 7 -> 15 -> 16 -> 17”
4: FileWriter fw = new FileWriter(“output.txt”)
5: BufferedWriter bw = new BufferedWriter(fw)
...
7: Statement stmt = conn.createStatement()
...
Filtered sequence “6 -> 7 -> 15 -> 16“
25
CAR-Miner Approach
InputApplication
Classes and Methods
Open Source Projects on web Open Source Projects on web
1 2 N…
…Exception-Flow
GraphsStatic Traces
SequenceAssociation
RulesViolations
26
Static Trace Mining Handle traces of each function call (triggering
function call) individually
Input: Two sequence databases with a one-to-one mapping
• normal function-call sequences (context)• exception function-call sequences (recovery)
Objective: Generate sequence association rules of the form
(FCc1 ... FCcn) Λ FCa => FCe1 ... FCenContext Trigger Recovery
27
Input: Two sequence databases with a one-to-one mapping
Mining Problem Definition
Objective: To get association rules of the formFC1 FC2 ... FCm -> FE1 FE2 ... FEn
where {FC1, FC2, ..., Fcm} Є SDB1 and {FE1, FE2, ..., Fen} Є SDB2
Existing association rule mining algorithms cannot be directly applied on multiple sequence databases
Context Recovery
28
Annotate the sequences to generate a single combined database
Mining Problem Solution
Apply frequent subsequence mining algorithm [Wang and Han, ICDE 04] to get frequent sequences
Transform mined sequences into sequence association rules
Rank rules based on the support assigned by frequent subsequence mining algorithm
(3 10) Λ FCa => (2 8)Context Trigger Recovery
29
CAR-Miner Approach
InputApplication
Classes and Methods
Open Source Projects on web Open Source Projects on web
1 2 N…
…Exception-Flow
GraphsStatic Traces
SequenceAssociation
RulesViolations
30
Violation Detection Analyze each call site of triggering call FCa
Step 1: Extract context call sequence “CC1 CC2 ... CCm” from the beginning of the function to the call site of FCa
Step 2: If CC1 CC2 ... CCm is super-sequence of FCc1 ... FCcn
Report any missing function calls of {FCe1 ... FCen} in any exception path
API client: (CC1 CC2 ... CCm) Λ FCa => Missing any? isSuperSeqOf API Rule: (FCc1 ... FCcn) Λ FCa => FCe1 ... FCen
Context Trigger Recovery
31
EvaluationResearch Questions:
1. Do the mined rules represent real rules?2. Do the detected violations represent real
defects?3. Does CAR-Miner perform better than WN-
miner [Weimer and Necula, TACAS 05]?4. Do the sequence association rules help
detect new defects?
32
Subjects
Internal Info: classes and methods belonging to the app External Info: classes and methods used by the app Code examples: #files collected through code search engine
33
RQ1: Real Rules
Real rules: 55% (Total: 294)Usage patterns: 3%False positives: 43%
Do the mined rules represent real rules?
34
RQ1: Distribution of Real Rules for Axion
#false positives is quite low between 1 to 60 rules
Distribution of rules based on ranks assigned by CAR-Miner
35
RQ2: Detected Violations Do the detected violations represent real defects?
Total number of defects: 160 New defects not found by WN-Miner approach: 87
36
RQ2: Status of Detected Violations
HsqlDB developers responded on the first 10 reported defects
Accepted 7 defects Rejected 3 defects
Reason given by HsqlDB developers for rejected defects:“Although it can throw exceptions in general, it should not throw with
HsqlDB, So it is fine”
37
RQ3: Comparison with WN-miner Does CAR-Miner performs better than WN-miner?
Found 224 new rules and missed 32 rules CAR-Miner detected most of the rules mined by WN-miner Two major factors:
sequence association rules Increase in the data scope
38
RQ4: New defects by sequence association rules
Detected 21 new real defects among all applications
Do the sequence association rules detect new defects?
39
Agenda MotivationMining Sequence Association Rules (CAR-Miner) [ICSE 09] Detecting Exception-Handling Defects
Mining Alternative Patterns(Alattin) [ASE 09] Detecting Neglected Condition Defects
Conclusion
40
40
Existing approaches produce a large number of false positives
One major observation: Programmers often write code in different ways for
achieving the same task Some ways are more frequent than others
Large Number of False Positives
Frequent ways
Infrequent ways
Mined Patterns
mine patterns detect violations
41
Example: java.util.Iterator.next()
PrintEntries1(ArrayList<string> entries){ … Iterator it = entries.iterator(); if(it.hasNext()) { string last = (string) it.next(); } …}
PrintEntries1(ArrayList<string> entries){ … Iterator it = entries.iterator(); if(it.hasNext()) { string last = (string) it.next(); } …}
Code Sample 1
PrintEntries2(ArrayList<string> entries)
{ … if(entries.size() > 0) { Iterator it = entries.iterator(); string last = (string) it.next(); } …}
PrintEntries2(ArrayList<string> entries)
{ … if(entries.size() > 0) { Iterator it = entries.iterator(); string last = (string) it.next(); } …}
Code Example 2
Code Sample 2
Java.util.Iterator.next() throws NoSuchElementException when invoked on a list without any elements
42
Example: java.util.Iterator.next()
PrintEntries1(ArrayList<string> entries){ … Iterator it = entries.iterator(); if(it.hasNext()) { string last = (string) it.next(); } …}
PrintEntries1(ArrayList<string> entries){ … Iterator it = entries.iterator(); if(it.hasNext()) { string last = (string) it.next(); } …}
Code Sample 1
PrintEntries2(ArrayList<string> entries){ … if(entries.size() > 0) { Iterator it = entries.iterator(); string last = (string) it.next(); } …}
PrintEntries2(ArrayList<string> entries){ … if(entries.size() > 0) { Iterator it = entries.iterator(); string last = (string) it.next(); } …}
Code Sample 2
1243 code examples
Sample 1 (1218 / 1243)
Sample 2 (6/1243)
Mined Pattern from existing approaches:“boolean check on return of Iterator.hasNext before Iterator.next”
43
Example: java.util.Iterator.next()
Require more general patterns (alternative patterns): P1 or P2
P1 : boolean check on return of Iterator.hasNext before Iterator.nextP2 : boolean check on return of ArrayList.size before Iterator.next
Cannot be mined by existing approaches, since alternative P2 is infrequent
PrintEntries1(ArrayList<string> entries){ … Iterator it = entries.iterator(); if(it.hasNext()) { string last = (string) it.next(); } …}
PrintEntries1(ArrayList<string> entries){ … Iterator it = entries.iterator(); if(it.hasNext()) { string last = (string) it.next(); } …}
Code Sample 1
PrintEntries2(ArrayList<string> entries){ … if(entries.size() > 0) { Iterator it = entries.iterator(); string last = (string) it.next(); } …}
PrintEntries2(ArrayList<string> entries){ … if(entries.size() > 0) { Iterator it = entries.iterator(); string last = (string) it.next(); } …}
Code Sample 2
44
Our Solution: ImMiner Algorithm Mines alternative patterns of the form P1 or P2
Based on the observation that infrequent alternatives such as P2 are frequent among code examples that do not support P1
1243 code examples
Sample 1 (1218 / 1243)
Sample 2 (6/1243)
P2 is frequent among code examples not supporting P1
P2 is infrequent among entire 1243 code examples
45
Alternative Patterns ImMiner mines three kinds of alternative patterns of the general form “P1 or P2”
Balanced: all alternatives (both P1 and P2) are frequent
Imbalanced: some alternatives (P1) are frequent and others are infrequent (P2). Represented as “P1 or P^
2”
Single: only one alternative
46
ImMiner Algorithm Uses frequent-itemset mining [Burdick et al. ICDE 01] iteratively
An input database with the following APIs for Iterator.next()
Input database Mapping of IDs to APIs
47
ImMiner Algorithm: Frequent AlternativesInput database
Frequent itemset mining
(min_sup 0.5)
Frequent item: 1P1: boolean-check on the return of
Iterator.hasNext() before Iterator.next()
48
ImMiner: Infrequent Alternatives of P1
Positive database (PSD)
Negative database (NSD)
Split input database into two databases: Positive and Negative
Mine patterns that are frequent in NSD and are infrequent in PSD Reason: Only such patterns serve as alternatives for P1
Alternative Pattern : P2 “const check on the return of ArrayList.size() before Iterator.next()”
Alattin applies ImMiner algorithm to detect neglected conditions
49
Neglected Conditions Neglected conditions refer to
Missing conditions that check the arguments or receiver of the API call before the API call
Missing conditions that check the return or receiver of the API call after the API call
One of the primary reasons for many fatal issues security or buffer-overflow vulnerabilities [Chang et
al. ISSTA 07]
50
Alattin Approach
ApplicationUnder Analysis
Detect neglected conditions
Classes and methods
Open Source Projects on web Open Source Projects on web
1 2 N…
…Pattern
Candidates
Alternative Patterns
Violations
Extract classes and methods
reused
Phase 1: Issue queries and collect relevant code samples. Eg: “lang:java
java.util.Iterator next”Phase 2: Generate pattern candidates
Phase 3: Mine alternative patterns
Phase 4: Detect neglected conditions statically
51
Evaluation Research Questions:
1. Do alternative patterns exist in real applications?
2. How high percentage of false positives are reduced (with low or no increase of false negatives) in detected violations?
52
Subjects
Two categories of subjects: 3 Java default API libraries 3 popular open source libraries
#Samples: #code examples collected from Google code search
53
RQ1: Balanced and Imbalanced Patterns How high percentage of balanced and imbalanced patterns exist in real
apps?
Balanced patterns: 0% to 30% (average: 9.69%) Imbalanced patterns:
30% to 100% (average: 65%) for Java default API libraries 0% to 9.5% (average: 5%) for open source libraries
Explanation: Java default API libraries provide more different ways of writing code compared to open source libraries
54
RQ2: False Positives and False Negatives How high % of false positives are reduced (with low or no increase of
false negatives)? Applied mined patterns (“P1 or P2 or ... or Pi or A^
1 or A^2 or ... or A^
j ”) in three modes:
Existing mode:
“P1 or P2 or ... or Pi or A^1 or A^
2 or ... or A^j ”
P1 ,P2, ... , Pi
Balanced mode:
“P1 or P2 or ... or Pi or A^1 or A^
2 or ... or A^j ”
“P1 or P2 or ... or Pi” Imbalanced mode:
“P1 or P2 or ... or Pi or A^1 or A^
2 or ... or A^j ”
“P1 or P2 or ... or Pi or A^1 or A^
2 or ... or A^j ”
55
RQ2: False Positives and False Negatives
Application Existing Mode Balanced Mode
Defects False Positives
Defects False Positives
% of reduction
False Negatives
Java Util 37 104 37 104 0 0
Java Transaction
51 105 51 105 0 0
Java SQL 56 143 56 90 37.06 0
BCEL 2 14 2 8 42.86 0
HSqlDB 1 0 1 0 0 0
Hibernate 10 9 10 8 11.11 0
AVERAGE/TOTAL
15.17 0
Existing Mode vs Balanced Mode
Balanced mode reduced false positives by 15.17% without any increase in false negatives
RQ2: False Positives and False Negatives
Application Existing Mode Imbalanced Mode
Defects False Positives
Defects False Positives
% of reduction
False Negatives
Java Util 37 104 36 74 28.85 1
Java Transaction
51 105 47 76 27.62 4
Java SQL 56 143 53 81 43.36 3
BCEL 2 14 2 6 57.04 0
HSqlDB 1 0 1 0 0 0
Hibernate 10 9 10 8 11.11 0
AVERAGE/TOTAL
28.01 8
Existing Mode vs Imbalanced Mode
Imbalanced mode reduced false positives by 28% with quite small increase in false negatives
56
57
Conclusion Problem-driven methodology by identifying
new problems, patterns mining algorithms, defects
CAR-Miner [ICSE 09]: mining sequence association rules of the form
(FCc1 ... FCcn) Λ FCa => (FCe1 ... Fcen) Context Trigger Recovery
reduce false negatives Alattin [ASE 09]: mining alternative patterns classified
into three categories: balanced, imbalanced, and single P1 or P2 or ... or Pi or A^
1 or A^2 or ... or A^
j reduce false positives
58
Thank You
Questions?
Tao XieUniversity of Illinois at Urbana-Champaign
http://web.engr.illinois.edu/~taoxie/[email protected]