dynamic social network analysis (and more!) with eresearch tools
DESCRIPTION
A presentation for the OSS Watch Expert Workshop on Profiling Communities, demonstrating eResearch methodology applied to replicating research on open source software development.
TRANSCRIPT
Dynamic Social Network Analysis (and more!) with eResearch Tools
Andrea Wiggins
iSchool @ Syracuse University
21 July, 2008
eResearch for FLOSS
• An approach to research using cyberinfrastructure
• Collaborative and transparent, like FLOSS
• Large-scale shared data sets
– FLOSSmole
– Notre Dame SourceForge dumps
– CVSanalY
– Etc…
• Uses tools and analyses that allow sharing among researchers to support open science ideals
– Taverna Workbench
– MyExperiment.org
Using Taverna
• Scientific analysis workflow tool
– Open source development led by myGrid team
– Target users are UK life sciences community
• Create analysis workflows by connecting modular components through input/output ports
– Produces (rigorous) analyses that are replicable, self-documenting, and easy to share
– Components include WSDL SOAP web services, Beanshell, RShell, and local Java shims
• Collaboratively developing our workflows
Replicating FLOSS Research as eResearch
• Replicating a selection of FLOSS papers and presentations, currently in progress
• Demonstrating utility and viability of eResearch approaches for FLOSS and social science
• Building reusable, customizable analysis components specific to FLOSS research, e.g. for data selection, sociomatrix generation for SNA, etc.
• Extending the original research analysis by parameterization (inputs, thresholds) and implementing “future work” suggestions of authors (plus our own ideas, of course)
Social dynamics of FLOSS communications
• Replication of Howison, Inoue & Crowston, 2006
– Compute dynamic network centrality of projects from trackers for 120 projects
• Extension
– Added exponentially-decayed edge weighting function (needs sensitivity testing)
– Made sliding window adjustable
– Can apply to any threaded communication venue for which data is available
– Completed: all venues for 2 projects; queued: 216 projects with 635 venues!
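As a rough illustration of the extension above, the following sketch computes exponentially decayed edge weights inside an adjustable sliding window and then a centralization score per window. It is not the workflow's actual code: the `(sender, recipient, day)` message format, the `half_life_days` and `window_days` parameters, and the max-normalized out-degree centralization are all assumptions made for the example.

```python
import math
from collections import defaultdict


def decayed_edge_weights(messages, window_end, window_days=90, half_life_days=30.0):
    """Sum exponentially decayed weights per (sender, recipient) edge.

    messages: iterable of (sender, recipient, day) tuples (hypothetical format).
    A message counts only if its age satisfies 0 <= age < window_days.
    """
    decay_rate = math.log(2) / half_life_days  # weight halves every half-life
    weights = defaultdict(float)
    for sender, recipient, day in messages:
        age = window_end - day
        if 0 <= age < window_days:
            weights[(sender, recipient)] += math.exp(-decay_rate * age)
    return dict(weights)


def outdegree_centralization(weights):
    """Simplified weighted out-degree centralization in [0, 1].

    Assumption: normalized by (n - 1) * max out-degree, so a network where
    one node sends everything scores 1.0.
    """
    nodes = {node for edge in weights for node in edge}
    outdegree = {node: 0.0 for node in nodes}
    for (sender, _recipient), weight in weights.items():
        outdegree[sender] += weight
    n = len(nodes)
    top = max(outdegree.values(), default=0.0)
    if n < 2 or top == 0.0:
        return 0.0
    return sum(top - d for d in outdegree.values()) / ((n - 1) * top)


def sliding_centralization(messages, start, stop, step, **window_args):
    """Centralization at each window end from start to stop (inclusive)."""
    messages = list(messages)
    return [
        (end, outdegree_centralization(decayed_edge_weights(messages, end, **window_args)))
        for end in range(start, stop + 1, step)
    ]


example = [("a", "b", 100), ("a", "c", 100), ("a", "d", 70), ("b", "a", 0)]
example_weights = decayed_edge_weights(example, window_end=100)
example_score = outdegree_centralization(example_weights)
```

Varying `half_life_days` and `window_days` over a grid is one way to approach the sensitivity testing that the slide flags as still needed.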
Workflow for Dynamic SNA
Dynamic SNA Across FLOSS Communication Channels
• Clearly a lot of variation across channels (user, developer & trackers), no easily observed patterns except overall trend toward decentralization
• Implications: carefully match theoretical constructs to data sampling, as different venues are very likely to yield different results, which significantly impacts interpretations
“Do the Rich Get Richer?”
• Replication of OSCon 2004 presentation by Conklin
– Demonstrate scale-free distribution of developers among projects
• Almost there
– A little more analysis to replicate
• Hoping to extend to dynamic analysis of preferential attachment
– Showing change to project sizes over time
– Comparing evolution and growth across repositories
Workflow for Rich Get Richer
• Using a single FLOSSmole summary statistic
• Very simple workflow, can expand analysis considerably
• Analysis of over 65K projects completes in under 3 minutes!
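A minimal sketch of the kind of check this workflow performs, assuming the summary statistic is a developer count per project (the slides do not name it). It tabulates how many projects have each team size and fits a straight line on log-log axes. A least-squares slope is only a rough diagnostic of a heavy-tailed distribution, not a rigorous power-law test, and the function names are hypothetical.

```python
import math
from collections import Counter


def size_frequency(dev_counts):
    """Return sorted (team_size, number_of_projects) pairs, ignoring sizes <= 0."""
    return sorted(Counter(c for c in dev_counts if c > 0).items())


def loglog_slope(freq):
    """Least-squares slope of log(number_of_projects) against log(team_size)."""
    xs = [math.log(size) for size, _ in freq]
    ys = [math.log(count) for _, count in freq]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct team sizes")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input (not FLOSSmole data): 16 one-person, 4 two-person, 1 four-person project.
toy_counts = [1] * 16 + [2] * 4 + [4]
toy_freq = size_frequency(toy_counts)
toy_slope = loglog_slope(toy_freq)
```

On real data the fitted slope depends heavily on how the sparse tail of large projects is binned, so it is best read alongside the log-log plot itself.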
Scale-free Developer Distribution in FLOSS
“Identifying success and tragedy of FLOSS Commons”
• Replication of English & Schweik, 2007
– Classification of project success by stage of growth for 110K projects as of August 2006
– Requires data from 2 repositories, FLOSSmole & ND
• Extension
– Parameterized all thresholds, makes sensitivity analysis possible
– Added 2 additional options for a criterion test, one suggested by authors in article
– Limitation: slightly less available data in FLOSSmole, 94K projects as of April 2005
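To show what parameterizing thresholds makes possible, here is a hedged sketch of a threshold-driven classifier. The rules and default values below are hypothetical stand-ins and are not the criteria from English & Schweik. Only the class labels (SG, TG, IG, initiation, null) follow the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions, not the published criteria.
    min_releases: int = 3
    min_downloads: int = 1
    max_days_since_activity: int = 365


def classify(releases, downloads, days_since_activity, t=Thresholds()):
    """Assign a class label using the hypothetical rules in this sketch."""
    if releases is None or downloads is None or days_since_activity is None:
        return "null"  # missing data
    if releases == 0:
        return "initiation"  # no release yet
    if releases >= t.min_releases and downloads >= t.min_downloads:
        return "SG"  # success in growth stage
    if days_since_activity > t.max_days_since_activity:
        return "TG"  # tragedy in growth stage
    return "IG"  # indeterminate in growth stage


def sweep_min_releases(projects, values):
    """Count class labels for each candidate min_releases threshold."""
    return {
        v: Counter(classify(*p, Thresholds(min_releases=v)) for p in projects)
        for v in values
    }


# Toy projects: (releases, downloads, days_since_activity)
toy_projects = [(5, 100, 10), (1, 100, 1000), (1, 100, 10), (0, 0, 5), (None, 1, 1)]
toy_sweep = sweep_min_releases(toy_projects, [1, 3])
```

Comparing the label counts across threshold values is a simple form of the sensitivity analysis the extension is meant to enable.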
Workflow for Success-Abandonment Classification
Classifying FLOSS Projects
• Very complex data requirements; meshing across repositories
• Difficult to scale and resource intensive
• Already using this workflow for project sampling
• For small (non-random) sample of 54 projects:
– 64% growth, 17% initiation, 19% null (i.e. missing data)
• Indeterminate Growth: 18.9%
• Success Growth: 39.6%
• Tragedy Growth: 7.5%
• Other: 34%
amsn,downloaded,growth,enough.releases,active,ok.release.rate,true,SG
anjuta,downloaded,growth,enough.releases,active,ok.release.rate,false,SG
anon,downloaded,growth,enough.releases,inactive,fast.release.rate,false,TG
etc…
Future Directions
• Replication of “Evolution & Growth in Large Libre Software Projects” by Robles et al., 2005
• Prototyping OWL ontology of FLOSS communication data, already in use with RDF & SPARQL
• Cross-linking data, analyses, and papers
• Increasing scale of analyses to thousands of projects
• Extending analyses, sensitivity testing to strengthen findings
• Building reusable analysis components to share, enabling cumulative research
Thanks!
• More at floss.syr.edu/publications/