msr2010 ibrahim

Should I contribute to this discussion? Walid M. Ibrahim, Nicolas Bettenburg, Emad Shihab, Bram Adams and Ahmed E. Hassan. Software Analysis and Intelligence Lab (SAIL), Queen’s University, Kingston, Canada. Which factors drive developers to contribute to a discussion?

Post on 12-Apr-2017

Page 1: Msr2010 ibrahim

Should I contribute to this discussion?

Walid M. Ibrahim, Nicolas Bettenburg, Emad Shihab, Bram Adams and Ahmed E. Hassan

Software Analysis and Intelligence Lab (SAIL), Queen’s University, Kingston, Canada

Which factors drive developers to contribute to a discussion?

Page 2: Msr2010 ibrahim

Mailing Lists Co-ordinate Open-Source Development

Future directions

Code review

Design directions

User problems

Feature requests

Bug fixes

Page 3: Msr2010 ibrahim

26 February
• Re: Hot Standby query cancellation and Streaming Replication integration Greg Smith (23:49)
• Performance Patches Was: Lock Wait Statistics (next commitfest) Mark Kirkwood (23:23)
• Re: Hot Standby query cancellation and Streaming Replication integration Josh Berkus (22:29)
• Re: Hot Standby query cancellation and Streaming Replication integration Josh Berkus (22:26)
• Re: Lock Wait Statistics (next commitfest) Greg Smith (22:23)
• Re: Hot Standby query cancellation and Streaming Replication integration Greg Smith (22:10)
• Re: NaN/Inf fix for ECPG Michael Meskes (21:56)
• Re: Lock Wait Statistics (next commitfest) Josh Berkus (21:16)
• Re: ECPG, two varchars with same name on same line Michael Meskes (21:10)
• Re: Lock Wait Statistics (next commitfest) Mark Kirkwood (21:04)
• Re: NaN/Inf fix for ECPG Tom Lane (20:51)
• Re: Hot Standby query cancellation and Streaming Replication integration Josh Berkus (20:31)
• Re: Anyone know if Alvaro is OK? Bruce Momjian (20:30)
• Anyone know if Alvaro is OK? Josh Berkus (20:28)
• Re: Lock Wait Statistics (next commitfest) Greg Smith (19:17)
• Re: Lock Wait Statistics (next commitfest) Greg Smith (18:18)
• Re: NaN/Inf fix for ECPG Rémi Zara (17:18)
• Re: NaN/Inf fix for ECPG Tom Lane (16:58)
• Re: Lock Wait Statistics (next commitfest) Tom Lane (16:44)
• Re: Re: Hot Standby query cancellation and Streaming Replication integration Tatsuo Ishii (14:56)
• Re: Path separator Magnus Hagander (14:48)
• Re: Small change of the HS document Bruce Momjian (14:45)
• Re: visibility maps and heap_prune Bruce Momjian (14:40)
• Re: Hot Standby query cancellation and Streaming Replication integration Bruce Momjian (14:34)
• Re: caracara failing to bind to localhost? Dickson S. Guedes (14:28)
• Re: NaN/Inf fix for ECPG Rémi Zara (13:40)
• Re: Lock Wait Statistics (next commitfest) Mark Kirkwood (10:40)
• Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (09:34)
• Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (09:07)
• Re: visibility maps and heap_prune Heikki Linnakangas (09:03)
• Re: Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (07:00)
• Re: Re: Hot Standby query cancellation and Streaming Replication integration Heikki Linnakangas (06:55)
• Re: Testing of parallel restore with current snapshot Gokulakannan Somasundaram (06:52)
• Re: Lock Wait Statistics (next commitfest) Gokulakannan Somasundaram (06:44)
• Re: Correcting Error message Tom Lane (06:16)
• Re: Correcting Error message Jaime Casanova (06:09)
• Re: A thought on Index Organized Tables Gokulakannan Somasundaram (05:42)
• …….
• …….
• …….
• Re: Avoiding bad prepared-statement plans. Tom Lane (00:03)

Over 100 Messages

Page 4: Msr2010 ibrahim

Timely contributions move discussions forward without delays

Can we help developers identify which threads need their contributions the most?

Page 5: Msr2010 ibrahim

Research Questions

Q1. Can we build a high-accuracy model for each developer to indicate the threads needing their contribution?

Q2. What are the most important factors that influence a developer to contribute to a thread?

Page 6: Msr2010 ibrahim

Studied Mailing Lists

               Web Server   DBMS      Interpreter
Messages       121,288      162,741   93,919
Threads        18,838       19,945    10,671
Contributors   3,137        4,996     2,848

Page 7: Msr2010 ibrahim

Can we build a high-accuracy model for each developer to indicate the threads needing their contribution?

Q1

Page 8: Msr2010 ibrahim

Contribution Factors

Social Network: Contribution Activity, Known Poster, Known Starter

Time: Message Month, Message Day, Message Time

Thread: Thread Length, Word Count

Thread Content: Thread Subject, Thread Body
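The four factor groups can be pictured as one feature row per developer and thread. The sketch below is only an illustration: the slides name the factors but do not define them, so the `Message` record, the `thread_factors` helper, the 30-day activity window, and the meaning given to "known" posters are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    sender: str
    sent: datetime
    subject: str
    body: str


def thread_factors(thread, developer, past_threads):
    """Build one feature row for `developer` from `thread` (a list of Message).

    `past_threads` holds earlier threads, each a list of Message.
    """
    first = thread[0]
    month_ago = first.sent - timedelta(days=30)
    # Assumption: activity = messages the developer sent in the previous 30 days.
    activity = sum(
        1 for t in past_threads for m in t
        if m.sender == developer and month_ago <= m.sent < first.sent
    )
    # Assumption: a person is "known" if they shared an earlier thread with the developer.
    known = {
        m.sender
        for t in past_threads if any(x.sender == developer for x in t)
        for m in t
    } - {developer}
    text = " ".join(m.body for m in thread)
    return {
        "contribution_activity": activity,
        "known_poster": any(m.sender in known for m in thread),
        "known_starter": first.sender in known,
        "message_month": first.sent.month,
        "message_day": first.sent.weekday(),
        "message_time": first.sent.hour,
        "thread_length": len(thread),
        "word_count": len(text.split()),
        "thread_subject": first.subject,
        "thread_body": text,
    }
```

The numeric and boolean entries could go straight to a classifier, while the subject and body are free text and need a separate treatment.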

Page 9: Msr2010 ibrahim

Model Building

Thread, Social Network, Time, Thread Content -> Decision Tree
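As a rough illustration of the model-building step, here is a minimal decision tree learner over numeric factors. It is a generic Gini-based sketch, not necessarily the learner used in the study; the factor names `thread_length` and `activity_last_month`, the toy rows, and the labels are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (factor, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for factor in rows[0]:
        for t in sorted({r[factor] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[factor] <= t]
            right = [y for r, y in zip(rows, labels) if r[factor] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (factor, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a binary tree; leaves are class labels, inner nodes are dicts."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    factor, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[factor] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[factor] > t]
    return {
        "factor": factor,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["factor"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical training rows for one developer (values are made up).
rows = [
    {"thread_length": 2, "activity_last_month": 40},
    {"thread_length": 3, "activity_last_month": 35},
    {"thread_length": 12, "activity_last_month": 38},
    {"thread_length": 15, "activity_last_month": 2},
    {"thread_length": 1, "activity_last_month": 1},
    {"thread_length": 20, "activity_last_month": 45},
]
labels = ["contribute", "contribute", "not contribute",
          "not contribute", "not contribute", "not contribute"]
tree = build_tree(rows, labels)
```

One tree is trained per developer, so the factor chosen at the root can differ from one developer to the next.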

Page 10: Msr2010 ibrahim

Composite Model

Thread Content -> Spam Filter (Naïve Bayesian) -> Spam Score

Thread, Social Network, Time, Spam Score -> Decision Tree
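The composite idea is that free text is first reduced to a single number, which the decision tree can then treat like any other factor. Below is a minimal sketch of such a text scorer, assuming a multinomial Naïve Bayes with add-one smoothing; the slides do not say which spam filter was used, and the `NaiveBayesScorer` class and its training texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z_]+", text.lower())


class NaiveBayesScorer:
    """Scores text by the estimated probability that the developer contributes."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, contributed):
        self.word_counts[contributed].update(tokenize(text))
        self.doc_counts[contributed] += 1

    def score(self, text):
        """Return P(contribute | text), a value between 0 and 1."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_post = {}
        for label in (True, False):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(vocab)
            # Smoothed class prior, then smoothed word likelihoods.
            lp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokenize(text):
                if word in vocab:  # words never seen in training are skipped
                    lp += math.log((counts[word] + 1) / denom)
            log_post[label] = lp
        top = max(log_post.values())
        weights = {k: math.exp(v - top) for k, v in log_post.items()}
        return weights[True] / (weights[True] + weights[False])


scorer = NaiveBayesScorer()
scorer.train("deadlock timeout trace debug", contributed=True)
scorer.train("libpq compile parser bug", contributed=False)
spam_score = scorer.score("deadlock trace")  # would be added to the feature row
```

The resulting score is one more numeric column, so the decision tree itself never has to handle raw text.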

Page 11: Msr2010 ibrahim

Evaluation of the model

                              Classified as
True Class        Contribute   Not Contribute
Contribute        a            b
Not Contribute    c            d

Overall Accuracy: $\frac{a+d}{a+b+c+d}$

Not Contribute Accuracy: $\frac{d}{c+d}$

Contribute Accuracy: $\frac{a}{a+b}$
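The three measures follow directly from the confusion matrix cells: contribute accuracy is a/(a+b), not-contribute accuracy is d/(c+d), and overall accuracy is (a+d)/(a+b+c+d). A small sketch, where the function name `accuracies` is only illustrative:

```python
def accuracies(a, b, c, d):
    """Per-class and overall accuracy from confusion matrix cells.

    a: true contribute, classified contribute
    b: true contribute, classified not contribute
    c: true not contribute, classified contribute
    d: true not contribute, classified not contribute
    """
    return {
        "contribute": a / (a + b),
        "not_contribute": d / (c + d),
        "overall": (a + d) / (a + b + c + d),
    }


# A model that always answers "not contribute", applied to a developer who
# contributed to 860 threads and did not contribute to 18,085.
always_no = accuracies(a=0, b=860, c=0, d=18085)
```

Here `always_no["overall"]` is about 0.955 while `always_no["contribute"]` is 0, which is why the per-class rates are reported next to the overall rate.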

Page 12: Msr2010 ibrahim

Accuracy rates for PostgreSQL

                  Most Active   Med. Active   Least Active
Overall           84%           94%           98%
Not contribute    87%           98%           100%
Contribute        82%           65%           71%

Page 14: Msr2010 ibrahim

Accuracy rates for PostgreSQL, with the number of threads behind each developer's rates

                         Most Active   Med. Active   Least Active
Contributed to           10,148        2,180         860
Did NOT contribute to    8,654         16,765        18,085

Page 15: Msr2010 ibrahim

Re-balanced training data

                     Contribute   Not Contribute
Before resampling    2,180        16,765
After resampling     9,472        9,472

Test on un-balanced data
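One way to obtain equal class sizes while keeping the total near the original is to draw from each class with replacement. The slide only says "Resample", so the sampling scheme and the `rebalance` helper below are assumptions. With 2,180 + 16,765 = 18,945 rows and two classes, each class ends up with 18,945 // 2 = 9,472 rows.

```python
import random


def rebalance(rows, labels, seed=0):
    """Resample with replacement so that every class has the same number of rows."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    per_class = len(rows) // len(by_class)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # The minority class is over-sampled and the majority class under-sampled.
        out_rows.extend(rng.choices(members, k=per_class))
        out_labels.extend([label] * per_class)
    return out_rows, out_labels


# Placeholder rows standing in for the developer's 18,945 threads.
rows = list(range(18945))
labels = ["contribute"] * 2180 + ["not contribute"] * 16765
balanced_rows, balanced_labels = rebalance(rows, labels)
```

Only the training data would be re-balanced this way; the model is still evaluated on data with the original class proportions.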

Page 16: Msr2010 ibrahim

Accuracy rates for PostgreSQL (re-sampled)

                       Most Active   Med. Active   Least Active
Overall                83%           89%           93%
Not contribute         85%           89%           93%
Contribute             81%           85%           81%
Change in Contribute   -1%           +20%          +10%

Page 17: Msr2010 ibrahim

Accuracy for top 10 developers

Series: Overall, Not contribute, Contribute

Data labels (in extraction order): 87%, 86%, 89%, 88%, 86%, 90%, 87%, 81%, 81%

Page 18: Msr2010 ibrahim

Can we build a high-accuracy model for each developer to indicate the threads needing their contribution?

Back to our Q1

We can build a model with high average accuracy between 85% and 89%

Page 19: Msr2010 ibrahim

What are the most important factors that influence a developer to contribute to a thread?

Q2

Page 20: Msr2010 ibrahim

Top Node Analysis

Page 21: Msr2010 ibrahim

Top Node Analysis

Lvl   #   Factor
0     3   Activity in Last Month
1     2   Message Content
      1   Thread Length
      1   Activity in Last Month

Page 22: Msr2010 ibrahim

Top Node Analysis

Lvl   #   Factor
0     3   Activity in Last Month
      1   Thread Length
1     2   Message Content
      1   Thread Length
      1   Activity in Last Month
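Top node analysis tallies which factors sit in the first levels of the learned trees, on the view that factors near the root matter most. The sketch below assumes each tree is a nested dict with `factor`, `left`, and `right` keys and counts every node at levels 0 and 1; the slides do not state how the tally was made, and the four trees are hypothetical, built only so that the counts match the table.

```python
from collections import Counter


def node(factor, left="contribute", right="not contribute"):
    """Helper for building small hypothetical trees; strings are leaves."""
    return {"factor": factor, "left": left, "right": right}


def top_node_counts(trees, max_level=1):
    """Count how often each factor appears at levels 0..max_level."""
    counts = {level: Counter() for level in range(max_level + 1)}
    for tree in trees:
        frontier = [tree]
        for level in range(max_level + 1):
            next_frontier = []
            for n in frontier:
                if isinstance(n, dict):  # skip leaves
                    counts[level][n["factor"]] += 1
                    next_frontier += [n["left"], n["right"]]
            frontier = next_frontier
    return counts


trees = [
    node("Activity in Last Month", node("Message Content")),
    node("Activity in Last Month", node("Message Content")),
    node("Activity in Last Month", node("Thread Length")),
    node("Thread Length", node("Activity in Last Month")),
]
counts = top_node_counts(trees)
```

Sorting each level's counter with `counts[level].most_common()` gives the rows of the table in descending order.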

Page 24: Msr2010 ibrahim

Top Node Analysis for PostgreSQL

Developers follow their own behaviour

Lvl   Most Active Dev.        Med. Active Dev.              Least Active Dev.
      #   Factor              #   Factor                    #   Factor
0     10  Thread Length       8   Activity in Last Month    10  Message Content
                              2   Thread Length
1     10  Message Content     6   Message Content           8   Thread Length
                              3   Thread Length             2   Activity in Last Month
                              1   Activity in Last Month

Page 25: Msr2010 ibrahim

Average Top Node Analysis for the Three Projects

Apache                        PostgreSQL                    Python
#   Factor                    #   Factor                    #   Factor
5   Message Content           4   Message Content           7   Message Content
4   Thread Length             3   Thread Length             3   Thread Length
1   Activity in Last Month    3   Activity in Last Month

Developers contribute to threads based on:
• Their knowledge
• The lack of other contributions
• Their availability

Page 26: Msr2010 ibrahim

Words used by Naïve Bayesian

• Most active developer: Linux, baseline, version, package, deadlock timeout, trace, structure, debug

• Least active developer: Format, bug, directory, libpq, reporting, compile, log_directory, parser, testtable

Page 27: Msr2010 ibrahim

Threats to Validity

• Need more case studies on commercial projects

• Should explore other factors

• Marking a thread as "not contribute" does not mean the developer should not read it

Page 28: Msr2010 ibrahim

Conclusion