msr2010 ibrahim
1
Should I contribute to this discussion?
Walid M. Ibrahim, Nicolas Bettenburg, Emad Shihab, Bram Adams and Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL), Queen’s University, Kingston, Canada
Which factors drive developers to contribute to a discussion?
2
Mailing Lists Co-ordinate Open-Source Development
Future directions
Code review
Design directions
User problems
Feature requests
Bug fixes
3
26 February
• Re: Hot Standby query cancellation and Streaming Replication integration Greg Smith (23:49)
• Performance Patches Was: Lock Wait Statistics (next commitfest) Mark Kirkwood (23:23)
• Re: Hot Standby query cancellation and Streaming Replication integration Josh Berkus (22:29)
• Re: Hot Standby query cancellation and Streaming Replication integration Josh Berkus (22:26)
• Re: Lock Wait Statistics (next commitfest) Greg Smith (22:23)
• Re: Hot Standby query cancellation and Streaming Replication integration Greg Smith (22:10)
• Re: NaN/Inf fix for ECPG Michael Meskes (21:56)
• Re: Lock Wait Statistics (next commitfest) Josh Berkus (21:16)
• Re: ECPG, two varchars with same name on same line Michael Meskes (21:10)
• Re: Lock Wait Statistics (next commitfest) Mark Kirkwood (21:04)
• Re: NaN/Inf fix for ECPG Tom Lane (20:51)
• Re: Hot Standby query cancellation and Streaming Replication integration Josh Berkus (20:31)
• Re: Anyone know if Alvaro is OK? Bruce Momjian (20:30)
• Anyone know if Alvaro is OK? Josh Berkus (20:28)
• Re: Lock Wait Statistics (next commitfest) Greg Smith (19:17)
• Re: Lock Wait Statistics (next commitfest) Greg Smith (18:18)
• Re: NaN/Inf fix for ECPG Rémi Zara (17:18)
• Re: NaN/Inf fix for ECPG Tom Lane (16:58)
• Re: Lock Wait Statistics (next commitfest) Tom Lane (16:44)
• Re: Re: Hot Standby query cancellation and Streaming Replication integration Tatsuo Ishii (14:56)
• Re: Path separator Magnus Hagander (14:48)
• Re: Small change of the HS document Bruce Momjian (14:45)
• Re: visibility maps and heap_prune Bruce Momjian (14:40)
• Re: Hot Standby query cancellation and Streaming Replication integration Bruce Momjian (14:34)
• Re: caracara failing to bind to localhost? Dickson S. Guedes (14:28)
• Re: NaN/Inf fix for ECPG Rémi Zara (13:40)
• Re: Lock Wait Statistics (next commitfest) Mark Kirkwood (10:40)
• Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (09:34)
• Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (09:07)
• Re: visibility maps and heap_prune Heikki Linnakangas (09:03)
• Re: Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (07:00)
• Re: Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (06:55)
• Re: Testing of parallel restore with current snapshot Gokulakannan Somasundaram (06:52)
• Re: Lock Wait Statistics (next commitfest) Gokulakannan Somasundaram (06:44)
• Re: Correcting Error message Tom Lane (06:16)
• Re: Correcting Error message Jaime Casanova (06:09)
• Re: A thought on Index Organized Tables Gokulakannan Somasundaram (05:42)
• …….
• …….
• …….
• Re: Avoiding bad prepared-statement plans. Tom Lane (00:03)
Over 100 Messages
4
Timely contributions move discussions forward without delays
Can we help developers identify which threads need their contributions the most?
5
Research Questions
Q1. Can we build a high-accuracy model for each developer to indicate the threads needing their contribution?
Q2. What are the most important factors that influence a developer to contribute to a thread?
6
Studied Mailing Lists
               Web Server   DBMS      Interpreter
Messages       121,288      162,741   93,919
Threads        18,838       19,945    10,671
Contributors   3,137        4,996     2,848
7
Can we build a high-accuracy model for each developer to indicate the threads needing their contribution?
Q1
8
Contribution Factors
Social Network: Contribution Activity, Known Poster, Known Starter
Time: Message Month, Message Day, Message Time
Thread: Thread Length, Word Count
Thread Content: Thread Subject, Thread Body
9
Model Building
Thread + Social Network + Time + Thread Content → Decision Tree
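The model-building slide feeds the factor groups into a decision tree. As a rough illustration of how such a tree could pick its first split, the sketch below ranks a few binarized factors by information gain on toy data. The feature names, the toy rows, and the choice of information gain as the split criterion are assumptions for illustration; the slides do not say which decision-tree learner or criterion was used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one categorical feature."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(label)
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in by_value.values()
    )
    return entropy(labels) - remainder


def best_root_feature(rows, labels):
    """Return the feature with the highest information gain (the root split)."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))


# Toy, made-up threads for one hypothetical developer.
rows = [
    {"long_thread": True, "known_starter": True, "weekend": False},
    {"long_thread": True, "known_starter": False, "weekend": True},
    {"long_thread": False, "known_starter": True, "weekend": False},
    {"long_thread": False, "known_starter": False, "weekend": True},
    {"long_thread": True, "known_starter": True, "weekend": True},
    {"long_thread": False, "known_starter": False, "weekend": False},
]
labels = ["contribute", "not", "contribute", "not", "contribute", "not"]

root = best_root_feature(rows, labels)
print(root)
```

In this toy data the `known_starter` flag separates the two classes perfectly, so it is chosen as the root; a full learner would repeat the same selection on each resulting subset.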
10
Composite Model
Thread Content → Spam Filter (Naïve Bayesian) → Spam Score
Thread + Social Network + Time + Spam Score → Decision Tree
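The composite model turns thread text into a single score with a Naïve Bayesian spam filter and passes that score to the decision tree alongside the other factors. The sketch below shows one way such a score could be computed. The class name `ContentScorer`, the labels, the add-one smoothing, and the toy training texts are all assumptions; the slides do not name the filter implementation or its scoring formula.

```python
import math
from collections import Counter


class ContentScorer:
    """Tiny Naive Bayes text scorer with add-one (Laplace) smoothing."""

    LABELS = ("contribute", "skip")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.thread_counts = Counter()

    def train(self, text, label):
        """Record the words of one thread under its known label."""
        self.word_counts[label].update(text.lower().split())
        self.thread_counts[label] += 1

    def score(self, text):
        """Return P(contribute | words); needs at least one thread per label."""
        vocab = set(self.word_counts["contribute"]) | set(self.word_counts["skip"])
        total_threads = sum(self.thread_counts.values())
        log_prob = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(vocab)
            lp = math.log(self.thread_counts[label] / total_threads)
            for word in text.lower().split():
                if word in vocab:  # words never seen in training are ignored
                    lp += math.log((counts[word] + 1) / denom)
            log_prob[label] = lp
        # Normalize the two log-probabilities into a posterior for "contribute".
        return 1.0 / (1.0 + math.exp(log_prob["skip"] - log_prob["contribute"]))


scorer = ContentScorer()
scorer.train("deadlock timeout trace debug", "contribute")  # toy example text
scorer.train("libpq compile bug report", "skip")            # toy example text

score_hi = scorer.score("deadlock trace in debug build")
score_lo = scorer.score("libpq compile bug")

# The score then becomes one more input next to the other (hypothetical) factors.
features = {"thread_length": 12, "known_starter": True, "content_score": score_hi}
```

Collapsing the text into one number keeps the decision tree small: it splits on a single content score instead of on thousands of individual words.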
11
Evaluation of the model
Confusion matrix (rows: true class; columns: classified as):
                  Contribute   Not Contribute
Contribute        a            b
Not Contribute    c            d
Overall Accuracy: $\frac{a + d}{a + b + c + d}$
Not Contribute Accuracy: $\frac{d}{c + d}$
Contribute Accuracy: $\frac{a}{a + b}$
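The three accuracy measures on the evaluation slide can be read off the confusion matrix, taking a and b as the true "contribute" threads classified as contribute and not contribute, and c and d as the true "not contribute" threads classified the same way. A minimal sketch, with made-up counts chosen only to illustrate the arithmetic:

```python
def accuracy_rates(a, b, c, d):
    """Per-class and overall accuracy from a 2x2 confusion matrix.

    Rows are the true class, columns the predicted class:
        a = true contribute, classified contribute
        b = true contribute, classified not contribute
        c = true not contribute, classified contribute
        d = true not contribute, classified not contribute
    """
    return {
        "contribute": a / (a + b),
        "not_contribute": d / (c + d),
        "overall": (a + d) / (a + b + c + d),
    }


# Illustrative counts, not taken from the study.
rates = accuracy_rates(a=80, b=20, c=10, d=90)
print(rates)
```

Reporting the two per-class rates separately matters when one class dominates, because overall accuracy alone can look high even if the rare class is classified poorly.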
12
Accuracy rates for PostgreSQL
                  Most Active   Med. Active   Least Active
Overall           84%           94%           98%
Not contribute    87%           98%           100%
Contribute        82%           65%           71%
14
Most Active: contributed to 10,148 threads; did NOT contribute to 8,654
Med. Active: contributed to 2,180 threads; did NOT contribute to 16,765
Least Active: contributed to 860 threads; did NOT contribute to 18,085
15
Re-balanced training data
Before resampling: Contribute 2,180; Not Contribute 16,765
After resampling: Contribute 9,472; Not Contribute 9,472
Test on un-balanced data
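The re-balancing step brings both classes to the same size (2,180 and 16,765 threads become 9,472 each) for training only; testing still uses the original distribution. The sketch below reproduces those counts under an assumed scheme: the minority class is oversampled by duplicating random members and the majority class is undersampled without replacement. The slides do not say how the resampling was actually implemented, and the function name and thread IDs are placeholders.

```python
import random
from collections import Counter


def rebalance(threads, labels, seed=0):
    """Resample the training set so every class has the same number of threads."""
    rng = random.Random(seed)
    by_class = {}
    for thread, label in zip(threads, labels):
        by_class.setdefault(label, []).append(thread)

    target = len(threads) // len(by_class)  # equal share of the original total
    balanced = []
    for label, members in by_class.items():
        if len(members) >= target:
            # Majority class: undersample without replacement.
            sample = rng.sample(members, target)
        else:
            # Minority class: keep every original and add random duplicates.
            sample = members + rng.choices(members, k=target - len(members))
        balanced.extend((thread, label) for thread in sample)
    rng.shuffle(balanced)
    return balanced


# Placeholder thread IDs; class sizes are the ones shown on the slide.
threads = list(range(2180 + 16765))
labels = ["contribute"] * 2180 + ["not_contribute"] * 16765

training_set = rebalance(threads, labels)
class_sizes = Counter(label for _, label in training_set)
print(class_sizes)
```

Only the training data should pass through `rebalance`; the held-out test threads keep their natural imbalance so the reported accuracy reflects what a developer would see in practice.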
16
Accuracy rates for PostgreSQL (re-sampled)
                      Most Active   Med. Active   Least Active
Overall               83%           89%           93%
Not contribute        85%           89%           93%
Contribute            81%           85%           81%
Change in Contribute  -1%           +20%          +10%
17
Accuracy for top 10 developers
(Bar chart; series: Overall, Not contribute, Contribute. Data labels as extracted: 87%, 86%, 89%, 88%, 86%, 90%, 87%, 81%, 81%)
18
Can we build a high-accuracy model for each developer to indicate the threads needing their contribution?
Back to our Q1
We can build a model for each developer with a high average accuracy, between 85% and 89%
19
What are the most important factors that influence a developer to contribute to a thread?
Q2
20
Top Node Analysis
22
Top Node Analysis
Lvl   #   Factor
0     3   Activity in Last Month
      1   Thread Length
1     2   Message Content
      1   Thread Length
      1   Activity in Last Month
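Top node analysis tallies which factor sits at the root (level 0) and at the next level (level 1) of each developer's decision tree; the factors that appear most often near the top are taken as the most influential. The sketch below shows the counting on toy trees. The nested-dict tree format, the function name, and the rule of counting every split node at a level are assumptions for illustration, not the authors' tooling.

```python
from collections import Counter


def top_node_counts(trees, max_level=1):
    """Count how often each factor appears at levels 0..max_level across trees.

    Each tree is a nested dict: {"factor": name, "children": [subtrees]}.
    Leaves are simply omitted from "children" in this toy format.
    """
    counts = {level: Counter() for level in range(max_level + 1)}
    for tree in trees:
        frontier = [tree]
        for level in range(max_level + 1):
            next_frontier = []
            for node in frontier:
                counts[level][node["factor"]] += 1
                next_frontier.extend(node.get("children", []))
            frontier = next_frontier
    return counts


# Toy trees for three hypothetical developers.
trees = [
    {"factor": "Activity in Last Month", "children": [{"factor": "Message Content"}]},
    {"factor": "Activity in Last Month", "children": [{"factor": "Thread Length"}]},
    {"factor": "Thread Length", "children": [{"factor": "Message Content"}]},
]

counts = top_node_counts(trees)
for level, tally in counts.items():
    for factor, n in tally.most_common():
        print(level, n, factor)
```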
24
Top Node Analysis for PostgreSQL
Developers follow their own behaviour
Lvl   Most Active Dev.        Med. Active Dev.             Least Active Dev.
      #    Factor             #   Factor                   #    Factor
0     10   Thread Length      8   Activity in Last Month   10   Message Content
                              2   Thread Length
1     10   Message Content    6   Message Content          8    Thread Length
                              3   Thread Length            2    Activity in Last Month
                              1   Activity in Last Month
25
Average Top Node Analysis for the Three Projects
Apache                       PostgreSQL                   Python
#   Factor                   #   Factor                   #   Factor
5   Message Content          4   Message Content          7   Message Content
4   Thread Length            3   Thread Length            3   Thread Length
1   Activity in Last Month   3   Activity in Last Month
Developers contribute to threads based on:
• Their knowledge
• The lack of other contributions
• Their availability
26
Words used by Naïve Bayesian
• Most active developer: Linux – baseline – version – package – deadlock timeout – trace – structure – debug
• Least active developer: Format – bug – directory – libpq – reporting – compile – log_directory – parser – testtable
27
Threats to Validity
• Need more case studies on commercial projects
• Should explore other factors
Marking a thread as "not to contribute" does not mean the developer should not read it
28
Conclusion