Topology and Evolution of the Open Source Software Community
Advisors:
Dr. Vincent W. Freeh
Dr. Kevin Bowyer
Supported in part by the National Science Foundation – Digital Science & Technology
Yongqin Gao
2
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusions
3
Overview (about OSS)
• What is OSS
– Free to use, free to distribute
– Unlimited user and usage
– Source code available and modifiable
• Potential advantages over commercial software
– Higher quality
– Faster development
– Lower cost
– Transparent
4
Overview (about our research)
• Our goal
– Understanding the OSS phenomenon
• Approach
– SourceForge is the source of our empirical data
– Modeling as a social network
– Analysis of topological statistics
– Use simulation to verify and validate the model
5
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis
• Simulations
• Publications
• Conclusions
6
Data Collection — Monthly
• Web crawler (scripts)
– Python
– Shell
– AWK
– Sed
• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
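A minimal sketch of how records in this layout could be loaded for analysis, assuming the pipe-delimited format shown on the slide. The function name `load_memberships` and the in-memory dictionaries are illustrative only; the study itself stored the records in a relational database.

```python
from collections import defaultdict

# Sample rows copied from the slide; the first row is the header.
RECORDS = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) as dicts of sets."""
    devs_of = defaultdict(set)
    projs_of = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        project, developer = line.split("|")
        devs_of[project].add(developer)
        projs_of[developer].add(project)
    return dict(devs_of), dict(projs_of)


devs_of, projs_of = load_memberships(RECORDS)
print(sorted(devs_of["8007"]))      # developers on project 8007
print(sorted(projs_of["dev8972"]))  # projects that dev8972 belongs to
```

In this sample, dev8972 appears on two projects (8001 and 8008), which is the kind of shared membership that links projects together in the network model.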
7
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusions
8
Modeling as Collaboration Network
• What is a collaboration network?
– A social network representing collaborative relationships
– Examples: movie actor network and scientist collaboration network
• How the SourceForge collaboration network differs
– Link detachment
– Virtual collaboration
– Voluntary
– Global
• Bipartite property of collaboration networks
9
Collaboration network - bipartite
Adapted from Newman, Strogatz and Watts, 2001
10
SourceForge Developer Network
15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]
[Figure: graph of the developers listed above, with links labeled Project 6882, Project 9859, Project 7597, Project 7028 and Project 15850]
OSS Developer Network (Part)
Developers are nodes / Projects are links
24 Developers
5 Projects
2 hub Developers
1 Cluster
11
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusion
12
Topological Analysis
• Statistics inspected
– Diameter
– Average degree
– Clustering coefficient
– Degree distribution
– Cluster size distribution
– Relative size of major cluster
– Fitness and life cycle
• Evolution of these statistics
• Dual networks
– developer network and project network
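The dual networks can both be derived from the same project membership data. The following is a minimal sketch, assuming membership is held as a mapping from project to its set of developers; the data and the function names `developer_network` and `project_network` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical membership data: project -> set of developers.
MEMBERS = {
    "p1": {"a", "b", "c"},
    "p2": {"c", "d"},
    "p3": {"e"},
}


def developer_network(members):
    """Link two developers when they share at least one project."""
    adj = defaultdict(set)
    for devs in members.values():
        for dev in devs:
            adj[dev]  # touch the key so isolated developers appear as vertices
        for u, v in combinations(sorted(devs), 2):
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def project_network(members):
    """Link two projects when they share at least one developer."""
    adj = {p: set() for p in members}
    for p, q in combinations(sorted(members), 2):
        if members[p] & members[q]:
            adj[p].add(q)
            adj[q].add(p)
    return adj


dev_net = developer_network(MEMBERS)
proj_net = project_network(MEMBERS)
```

Here developer "c" works on both p1 and p2, so "c" connects the two groups in the developer network and p1 and p2 become neighbors in the project network.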
13
Terminology
• Diameter
– Average length of shortest paths between all pairs of vertices
• Degree
– The count of edges connected to a given vertex
• Average degree
– Average of the degrees of all vertices in the network
• Cluster
– A connected component of the network
• Clustering coefficient (CC)
– CCi: Fraction representing the number of links actually present relative to the total possible number of links among the vertices in the neighborhood of vertex i
– CC: average of all CCi in a network
• Degree distribution
– The distribution of degrees throughout a network
• Major cluster
– The largest cluster in the network
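The definitions above can be made concrete on a toy network. This is a minimal sketch in plain Python, not the approximate method used in the study. The graph is hypothetical, and two conventions are assumptions: a vertex with fewer than two neighbors gets CCi = 0, and the diameter averages only over pairs that are actually connected.

```python
from collections import deque

# Hypothetical undirected network as an adjacency map.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}


def average_degree(graph):
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def local_cc(graph, v):
    """Links present among v's neighbors divided by links possible."""
    nbrs = sorted(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than 2 neighbors
    present = sum(
        1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in graph[nbrs[i]]
    )
    return present / (k * (k - 1) / 2)


def clustering_coefficient(graph):
    return sum(local_cc(graph, v) for v in graph) / len(graph)


def clusters(graph):
    """Connected components, largest (the major cluster) first."""
    seen, found = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            for w in graph[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        found.append(component)
    return sorted(found, key=len, reverse=True)


def diameter(graph):
    """Average shortest-path length over connected pairs (the slide's definition)."""
    total = pairs = 0
    for source in graph:
        dist, queue = {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

The breadth-first search in `diameter` runs once per vertex, which is fine for a toy graph but becomes expensive at the tens of thousands of vertices reported for SourceForge.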
14
Diameter of Developer Network vs. Time
• Network size increased from 30,000 to 70,000
15
Diameter of Project Network vs. Time
• Network size increased from 20,000 to 50,000.
• Diameter decreases over time for both the developer network and the project network
16
Clustering Coefficient of Developer Network vs. Time
17
Clustering Coefficient of Project Network vs. Time
18
Degree Distribution (developers)
19
Degree Distribution (projects)
20
Cluster Size Distribution
• R² with major cluster is 0.7426
• R² without major cluster is 0.9799
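The slides do not say how R² was computed. One common approach, assumed here, is a least-squares line through the points on log-log axes. The sketch below uses a made-up histogram (not the SourceForge data) to show why a single giant cluster can pull R² down.

```python
import math


def power_law_fit(sizes, counts):
    """Least-squares line through (log size, log count).

    Returns (exponent, r_squared). Assumes all sizes and counts are positive.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Made-up cluster-size histogram: count = 1000 * size^-2.
sizes = [1, 2, 4, 8, 16]
counts = [1000 / s**2 for s in sizes]
slope, r2 = power_law_fit(sizes, counts)

# Same histogram plus one hypothetical giant cluster of 20,000 vertices.
slope_all, r2_all = power_law_fit(sizes + [20000], counts + [1])
```

The giant cluster sits far off the line fitted to the small clusters, so the fit that includes it has a lower R², in the same direction as the two figures reported above.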
21
Relative Size of Major Cluster vs. Time
• Increase of the relative size of the major cluster
• The rate of increase is slowing
• May be an indication of the network evolution
22
Existence of Fitness
• Tracking the development of individual projects can verify the existence of the “newcomer” phenomenon
• We tracked every project created in July 2001 up to the present (1,660 projects in total)
• Maximal monthly growth per project is 13, while average monthly growth per project is just 0.3639
23
Life Cycle of Project
24
Summary
25
Summary of Results
• Power law rules
– Degree distributions, cluster distribution
• Average degree increasing with time
• Diameter decreasing with time
• Clustering coefficient decreasing with time
• Fitness exists in SourceForge
• Projects have life cycle behaviors
26
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusion
27
Conceptual Framework
[Figure: conceptual framework diagram. Labels: Empirical data, Model, Simulation, Characterization, Description, Generation, Adjustment, Verification, Validation]
28
Agent-based Modeling
• EBM vs. ABM
– Heterogeneous individuals
– Complex network
• Experiment environment
– Hardware: computer cluster
– Software:
  • Simulation toolkit: Swarm
  • Database: Oracle
  • Language: Java, PL/SQL
29
Model for SourceForge
• ABM based on bipartite graph
• Model description
– Agent: developer
– Behaviors: Create, join, abandon and idle
– Preference: developer’s and project’s
– Fitness
• Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic fitness
• Comparison of empirical and simulated data
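A minimal sketch of an agent-based model with the four behaviors named above, written in Python rather than the Swarm/Java setup used in the study. The probabilities, the fixed developer population, and the size-based join rule (preferential attachment in the spirit of BA) are all assumptions for illustration, not the parameters or rules of the actual model.

```python
import random


def simulate(n_devs, steps, seed=0, p_create=0.05, p_join=0.5, p_abandon=0.1):
    """Run a toy bipartite developer/project simulation.

    Each step one randomly chosen developer creates, joins, abandons or idles.
    All probabilities are illustrative, not the values used in the study.
    """
    rng = random.Random(seed)
    members = {}                                # project id -> developer ids
    joined = {d: set() for d in range(n_devs)}  # developer id -> project ids
    for _ in range(steps):
        dev = rng.randrange(n_devs)
        roll = rng.random()
        if roll < p_create or not members:
            # Create: start a new project with this developer as first member.
            pid = len(members)
            members[pid] = {dev}
            joined[dev].add(pid)
        elif roll < p_create + p_join:
            # Join: pick a project with probability proportional to size + 1.
            ids = list(members)
            weights = [len(members[p]) + 1 for p in ids]
            pid = rng.choices(ids, weights=weights)[0]
            members[pid].add(dev)
            joined[dev].add(pid)
        elif roll < p_create + p_join + p_abandon and joined[dev]:
            # Abandon: leave one of the developer's current projects.
            pid = rng.choice(sorted(joined[dev]))
            members[pid].discard(dev)
            joined[dev].discard(pid)
        # Any other roll: the developer idles this step.
    return members, joined


members, joined = simulate(n_devs=50, steps=500, seed=1)
```

Replacing the size-based weights with equal weights would give a random-attachment variant closer in spirit to ER, which is one way the iterations listed above could be compared on the same code path.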
30
ER Model - Diameter
• Average degree decreases in the simulation, while it increases in the empirical data
• Diameter increases in the simulation, while it decreases in the empirical data
31
ER Model – Clustering Coefficient
• Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in the empirical data
32
ER Model – Degree Distribution
• Degree distribution is a normal distribution, while it is a power law in the empirical data
33
ER Model – Cluster Size Distribution
• Power law distribution with R² of 0.6667 (0.9653 without the major cluster), while R² in the empirical data is 0.7426 (0.9799 without the major cluster)
• The actual distribution differs from the empirical data
34
BA Model – Diameter and Clustering Coefficient
• Small diameter and high clustering coefficient like empirical data
• Diameter and clustering coefficient are both decreasing like empirical data
35
BA Model – Degree Distribution
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9714.
• For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9838.
36
BA Model with Constant Fitness
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9714.
• For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9838.
37
BA Model with Dynamic Fitness
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9714.
• For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9838.
38
Advantage of Dynamic Fitness
• Intuition: Fitness should decrease with time.
• Statistics: projects show life cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
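To illustrate the intuition, the sketch below compares how attractive an old, large project and a new, small project are under constant and under decaying fitness. The slides do not give the fitness function, so the exponential decay, the `size + 1` attachment term, and all numbers are assumptions chosen only to show the effect.

```python
import math


def join_weight(size, age, initial_fitness=1.0, decay=0.0):
    """Attractiveness of a project: current fitness times (size + 1).

    decay=0 gives constant fitness; decay>0 makes fitness shrink with age.
    The exponential form is an assumption, not taken from the slides.
    """
    fitness = initial_fitness * math.exp(-decay * age)
    return fitness * (size + 1)


old_project = {"size": 20, "age": 24}  # hypothetical: large, 24 months old
new_project = {"size": 3, "age": 1}    # hypothetical: small, 1 month old

constant = [join_weight(p["size"], p["age"]) for p in (old_project, new_project)]
dynamic = [
    join_weight(p["size"], p["age"], decay=0.2) for p in (old_project, new_project)
]
```

With constant fitness the old project always outweighs the new one, so it keeps growing. With decaying fitness the ordering flips, which lets an aging project stop attracting developers and produces a rise-and-decline pattern over its lifetime.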
39
Summary
40
Summary of Results
• We use ABM to model and simulate the SourceForge collaboration network.
• Conceptual framework is proposed for agent-based modeling and simulation.
• Case study of this framework: SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.
41
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusion
42
Publications To-date
• Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003.
• Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh.
• Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh.
• Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.
43
Possible Journals
• Chapter 3
– Physica A: Statistical Mechanics and its Applications
– Journal of Social Structure (JSS)
• Chapter 4
– Journal of Artificial Societies and Social Simulation (JASSS)
– Journal of Statistical Computation and Simulation (JSCS)
44
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusion
45
Conclusion
• Study of the SourceForge collaboration network can help us understand the OSS community
• We investigate not only the topological statistics but also the evolution of these statistics.
• Simulation is used to investigate the SourceForge collaboration network.
46
Contribution
• Statistical study of the SourceForge community (snapshot and evolution)
• Verification of the approximate method to calculate the diameter and CC
• Proposal of a model for the SourceForge community
• Extension of the BA model with dynamic fitness
47
Future Work
• Data collection
– Database dump from SourceForge (PostgreSQL 8GB)
– All the possible attributes
– Database schema in UML
• More topology analysis (with more attributes)
– Discussion forum
– Task assignment
– Project management
– Active testing
• Behavior-based analysis
– Interaction between agents
– H. Peyton Young’s model
• Information entropy analysis
48
Acknowledgements
• Committee
• Advisors
• Colleagues
• SourceForge
• NSF
• Others
49
Thank you