analysis of the open source software development community...

Supported in part by the National Science Foundation – ISS/Digital Science & Technology. Analysis of the Open Source Software development community using ST mining: A Research Plan. Yongqin Gao, Greg Madey, Computer Science & Engineering, University of Notre Dame. NAACSOS Conference, Notre Dame, IN, June 26-28, 2005.


Page 1: Analysis of the Open Source Software development community ...oss/Papers/NAACSOS2005Gao_slides.pdf

Supported in part by the National Science Foundation – ISS/Digital Science & Technology

Analysis of the Open Source Software development community using ST mining: A Research Plan

Yongqin Gao, Greg Madey
Computer Science & Engineering
University of Notre Dame

NAACSOS Conference
Notre Dame, IN
June 26-28, 2005

Page 2

Outline

• Background
• Motivation
• Problem definition
• Research data
• Methodology
• Conclusion

Page 3

Background (OSS)

• What is OSS?
  • Free to use, modify and distribute
  • Source code available and modifiable
• Potential advantages over commercial software
  • Transparent and easy adoption
  • Fast development
  • Low cost
  • Potential high quality
• Why study OSS?
  • Software engineering: new development and coordination methods
  • Open content: model for other forms of open, shared collaboration
  • Complexity: successful example of self-organization/emergence
  • Growing popularity
  • Non-traditional governance and project management practices
  • Virtual --> Data!

Page 4

Open Source Software (OSS)

• Free …
  • to view source
  • to modify
  • to share
  • of cost
• Examples
  • Apache
  • Perl
  • GNU
  • Linux
  • Sendmail
  • Python
  • KDE
  • GNOME
  • Mozilla
  • Thousands more

[Image labels: Linux, GNU, Savannah]

Page 5

Leaders

• Linus Torvalds: Linux
• Larry Wall: Perl
• Richard Stallman: GNU Manifesto
• Eric Raymond: The Cathedral and the Bazaar

Page 6

Success of Apache

• Almost 70% market share (Netcraft.com)

Page 7

Research Approach

[Diagram of linked research components]

• Box labels:
  • Combined Data Mining
  • Understanding the Social and Task Dynamics that Predict Developer Behaviors
  • Social Network Analysis: Longitudinal Study of Preferential Attachment and Dynamic Attachment
  • Conceptual Explanatory Model of OSS: Agent-Based Modeling and Simulation
• Connector labels: Parameter Values, Structural Features, Cross Validation
• Opportunity: Huge amounts of relatively good data

Page 8

SourceForge.net

• VA Software

• Part of OSDN

• Started 12/1999

• Collaboration tools

• 100 K Projects

• 100 K Developers

• 1 M Registered Users

Page 9

150 GBytes of Data & Growing

Page 10

OSS Developer - Social Network
Developers are nodes / Projects are links

Project   Developer   Developer
15850     dev[46]     dev[83]
15850     dev[46]     dev[48]
15850     dev[46]     dev[56]
15850     dev[46]     dev[58]
6882      dev[58]     dev[47]
6882      dev[47]     dev[79]
6882      dev[47]     dev[52]
6882      dev[47]     dev[55]
7028      dev[46]     dev[99]
7028      dev[46]     dev[51]
7028      dev[46]     dev[57]
7597      dev[46]     dev[45]
7597      dev[46]     dev[72]
7597      dev[46]     dev[55]
7597      dev[46]     dev[58]
7597      dev[46]     dev[61]
7597      dev[46]     dev[64]
7597      dev[46]     dev[67]
7597      dev[46]     dev[70]
9859      dev[46]     dev[49]
9859      dev[46]     dev[53]
9859      dev[46]     dev[54]
9859      dev[46]     dev[59]

[Network diagram. Project labels: Project 6882, Project 9859, Project 7597, Project 7028, Project 15850. Node labels as printed: dev[46], dev[83], dev[56], dev[48], dev[52], dev[79], dev[72], dev[51], dev[57], dev[55], dev[99], dev[47], dev[58], dev[53], dev[58], dev[65], dev[45], dev[70], dev[67], dev[59], dev[54], dev[49], dev[64], dev[61]]

• 24 Developers
• 5 Projects
• 2 Linchpin Developers
• 1 Cluster
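
The "developers are nodes, projects are links" idea can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: it uses the rows for projects 15850 and 6882 from the edge list, and it assumes that a "linchpin" is a developer who appears in more than one project, which the slide does not define. The function names are hypothetical.

```python
from collections import defaultdict, deque

# (project_id, developer_a, developer_b) rows for the first two projects
# in the slide's edge list; the other projects are omitted for brevity.
EDGES = [
    (15850, 46, 83), (15850, 46, 48), (15850, 46, 56), (15850, 46, 58),
    (6882, 58, 47), (6882, 47, 79), (6882, 47, 52), (6882, 47, 55),
]


def build_network(edges):
    """Return (adjacency, projects_by_developer) for an undirected network."""
    adjacency = defaultdict(set)
    projects = defaultdict(set)
    for project, a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
        projects[a].add(project)
        projects[b].add(project)
    return adjacency, projects


def linchpins(projects):
    """Developers who appear in more than one project (assumed definition)."""
    return sorted(dev for dev, ps in projects.items() if len(ps) > 1)


def count_clusters(adjacency):
    """Number of connected components, found with breadth-first search."""
    seen, clusters = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        clusters += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return clusters


adjacency, projects = build_network(EDGES)
print(len(adjacency), "developers")
print("linchpins:", linchpins(projects))
print("clusters:", count_clusters(adjacency))
```

In this subset dev[58] is the only developer who joins the two projects, which is why removing such a node would split the network.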

Page 11

Scale free distribution: developer participation

# projects   # of developers on that many projects
1            21488
2            3688
3            1086
4            413
5            177
6            76
7            35
8            21
9            9
10           6
11           5
12           6
15           1
16           1
17           1

[Plot: Log(# of Developers) vs. Log(# of Projects), fitted line y = 10.6905 - 3.70892 x, R^2 = 0.979906]

Scale Free – Power Law (developers)
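
A power law appears as a straight line on log-log axes, so the exponent can be estimated with ordinary least squares on the logged values. The sketch below applies that to the table above using natural logarithms. It is an assumption that the slide's fit used the same points and log base, so the coefficients printed here need not match the slide's line exactly.

```python
import math

# (number of projects, number of developers on that many projects),
# copied from the table on this slide.
PARTICIPATION = [
    (1, 21488), (2, 3688), (3, 1086), (4, 413), (5, 177), (6, 76),
    (7, 35), (8, 21), (9, 9), (10, 6), (11, 5), (12, 6),
    (15, 1), (16, 1), (17, 1),
]


def fit_power_law(points):
    """Least squares on (ln x, ln y); returns (intercept, slope, r_squared)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot


intercept, slope, r_squared = fit_power_law(PARTICIPATION)
print(f"ln(developers) = {intercept:.4f} + ({slope:.4f}) * ln(projects)")
print(f"R^2 = {r_squared:.4f}")
```

The slope of the fitted line is the (negative) power-law exponent; a steep negative slope means very few developers work on many projects.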

Page 12

Scale free distribution: project sizes

Scale Free – Power Law (projects)

Page 13

Background (DM)

• Characteristics of data set
  • Incomplete, noisy, redundant
  • Complex structures, unstructured
  • Heterogeneous
  • Database not designed for research, but to support project management services of SourceForge.net
  • Temporal data is available, but not everything a researcher would want
  • Inferencing/discovery of temporal data potentially valuable opportunity
• What is DM (Data mining)
  • Nontrivial extraction of implicit, previously unknown and potentially useful information from data.

Page 14

Data Mining Procedure

[Flow diagram. Labels as printed:]

• Raw data
• Relevant data
• Feature selection
• Algorithm application
• Result Evaluation
• Data Integration
• Data Pre-processing
• Database

Page 15

Spatial-temporal DM (1)

• Temporal data mining
  • Discover the behavior-based knowledge instead of state-based knowledge.
  • Example: many wolves -> fewer rabbits
  • Relationship between timely feedback and quality of software/success of the OSS project

Page 16

Spatio-temporal DM

• New research domain: Spatio-temporal data mining
  • Growing interest in spatio-temporal data mining
    • Recommender systems
    • Location based services
    • Time based services
    • GIS applications
  • Extension of classic data mining techniques to data sets with spatial and temporal properties.
  • Challenges: complexity of spatial information and difficulty in reasoning about temporal information, e.g.,
    • Intervals
    • Points
    • Hybrids

Page 17

Motivations

• Limitations of OSS research to date
  • Mostly feature based data mining to date
  • Neglect of the inherent spatial and temporal information in the OSS community
• SourceForge.net properties
  • Spatial information
    • Collaboration network
  • Temporal information
    • History data and log tables

Page 18

Spatial information in OSS?

• The collaboration network in SF
  • Study of the topology of the collaboration network.
  • The network can be mapped as a graph
  • This graph is a non-metric space
  • Spread of ideas (software engineering tools and practices, new project opportunities)

Page 19

Temporal information in OSS

• The network is evolving and the histories of the site and individual entities comprise the temporal information in the network.
• Discrete time points
  • All the statistics are collected periodically.
• Partially ordered events
  • Multiple timelines exist in the system

[Diagram labels: ?, a, b, c, d]

Page 20

ST Mining

• Different from classic data mining
• Spatial and temporal relationships are complicated
  • Metric and non-metric spatial relations
  • Temporal relations
• Intrinsic dependency and heterogeneity
• Scale effect in space and time
• Significant modifications of many data mining techniques are needed.

Page 21

Problem definition I

• Dependency analysis
  • Extension of associations to ST mining
• Complicated associations
  • Vertical (temporal) and horizontal (spatial) associations
  • Combination of vertical and horizontal associations
  • Examples: lag effects between projects
• Flexible associations
  • Huge volume and scale effect of spatial-temporal data set introduce noise and error
  • Strict association is difficult to define
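
One simple way to look for a lag effect between two projects is to correlate one project's activity with the other's activity shifted by a number of months. The sketch below is an illustration of that idea, not the association-mining method proposed in the slides; the two monthly series are hypothetical, and Pearson correlation is swapped in as a stand-in measure of association.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no association signal
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a non-negative lag."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Lag in 0..max_lag with the largest absolute correlation."""
    return max(
        range(max_lag + 1),
        key=lambda lag: abs(lagged_correlation(leader, follower, lag)),
    )


# Hypothetical monthly activity counts; project_b repeats project_a's
# pattern two months later.
project_a = [1, 3, 7, 4, 2, 8, 5, 1, 6, 9, 3, 2]
project_b = [0, 0, 1, 3, 7, 4, 2, 8, 5, 1, 6, 9]

print("best lag in months:", best_lag(project_a, project_b, max_lag=4))
```

Because real data is noisy, a fixed threshold on the correlation would act as a strict association; picking the strongest lag within a window is one way to keep the association flexible.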

Page 22

Problem definition II

• Topic of this study: prediction support
  • Clustering: group the projects with similar evolution.
  • Summarization: summarize the representative characteristics of different project evolution patterns
  • Prediction: predict the project evolution (based on the pattern discovered)
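
Grouping projects with similar evolution can be illustrated with a small k-means run over normalized monthly series. This is a sketch under assumptions: the slides do not say which clustering algorithm is intended, the commit counts are hypothetical, and the initial centroids are chosen by hand to keep the run deterministic.

```python
import math


def normalize(series):
    """Scale to zero mean and unit variance so shape, not size, is compared."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in series]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, initial_centroids, iterations=20):
    """Plain k-means; returns one cluster index per input vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(len(centroids)), key=lambda k: distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical monthly commit counts: two growing and two declining projects.
projects = {
    "p1": [1, 2, 4, 7, 9, 12],
    "p2": [10, 20, 35, 60, 90, 130],
    "p3": [50, 40, 28, 15, 9, 3],
    "p4": [8, 7, 5, 4, 2, 1],
}
names = list(projects)
vectors = [normalize(projects[name]) for name in names]
labels = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
print(dict(zip(names, labels)))
```

Normalizing first means a small project and a large project with the same growth shape land in the same cluster; the cluster centroids then serve as a summary of each evolution pattern.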

Page 23

Research Data

• SourceForge.net database dump June 2005
  • 117 tables
  • Records up to 30 million per table
  • 23 Gigabytes
  • PostgreSQL
• Three types of tables
  • Data tables
  • History tables
  • Statistics tables

Page 24

Methodology

• Project development statistics
  • Numerical statistics.
  • Expertise and survey statistics.
• Time series analysis
  • Generate the time series for these statistics
• Classification generation
  • ABN algorithm used
• Classifier evaluation
  • Evaluation by comparing the predicted class with the actual class
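
The evaluation step, comparing predicted classes with actual classes, can be sketched as an accuracy figure plus a confusion count. The class labels below are hypothetical, since the slides do not name the classes, and the ABN classifier itself is not reproduced here; the predictions are simply given as input.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion); confusion[(actual, predicted)] is a count."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one labeled example is required")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Hypothetical class labels for six projects.
actual = ["growing", "stable", "declining", "growing", "stable", "growing"]
predicted = ["growing", "stable", "stable", "growing", "declining", "growing"]

accuracy, confusion = evaluate(actual, predicted)
print(f"accuracy = {accuracy:.2f}")
for (a, p), count in sorted(confusion.items()):
    print(f"actual={a:<10} predicted={p:<10} count={count}")
```

The confusion counts show which classes are mistaken for each other, which a single accuracy number hides.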

Page 25

Numerical statistics

• Statistics tables have the information about project history
• Stats_project_months
  • Every record stands for a monthly history of a single project
  • Records from November 1999 to June 2005
• There are 24 attributes in every record
  • Descriptive attributes (3)
  • Statistics (numeric) attributes (21)
• We use the statistics attributes

Page 26

Statistics Attributes

Attributes
Developers        Patches_opened
Downloads         Patches_closed
Subdomain_Views   Artifacts_opened
Page_views        Artifacts_closed
File_releases     Tasks_opened
Msg_posted        Tasks_closed
Bug_opened        Help_requests
Bug_closed        CVS_checkouts
Support_opened    CVS_commits
Support_closed    CVS_adds
Site_views

Page 27

Expertise statistics

• Rating scores
• Expertise rating
• User rating
• Importance parameter
• Domain importance
• Contribution parameter

Page 28

Time Series

• Time series used to describe the history of each attribute.
• Time series: an ordered sequence of values of a variable at equally spaced time intervals.
• The available monthly values of each statistic are used to generate the time series.
• Goal is to study the project history patterns.
  • Description
  • Prediction
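
Turning monthly records into equally spaced series can be sketched as follows. The record layout (`project_id`, `year`, `month`, `downloads` keys) is hypothetical, since the actual schema of the monthly statistics table is not shown in the slides, and leaving months without a record as `None` is an assumption made here so that every series keeps the same spacing.

```python
from collections import defaultdict


def month_index(year, month):
    """Map a calendar month to a consecutive integer so months are equally spaced."""
    return year * 12 + (month - 1)


def build_series(records, attribute, start, end):
    """Return {project_id: [one value per month]} covering start..end inclusive.

    start and end are (year, month) tuples. Months with no record stay None,
    so all series share the same length and spacing.
    """
    first, last = month_index(*start), month_index(*end)
    length = last - first + 1
    series = defaultdict(lambda: [None] * length)
    for record in records:
        position = month_index(record["year"], record["month"]) - first
        if 0 <= position < length:
            series[record["project_id"]][position] = record[attribute]
    return dict(series)


# Hypothetical monthly records; the values are made up for illustration.
RECORDS = [
    {"project_id": 6882, "year": 2004, "month": 11, "downloads": 120},
    {"project_id": 6882, "year": 2004, "month": 12, "downloads": 150},
    {"project_id": 6882, "year": 2005, "month": 2, "downloads": 90},
    {"project_id": 7597, "year": 2005, "month": 1, "downloads": 40},
]

series = build_series(RECORDS, "downloads", start=(2004, 11), end=(2005, 2))
for project_id, values in series.items():
    print(project_id, values)
```

How to treat the `None` gaps (drop, interpolate, or count as zero) is a separate modeling decision that depends on what a missing month means in the source data.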

Page 29

Conclusion

• Project prediction using ST mining
  • We used statistics to predict the project development
  • Calibration using new data is important to keep the prediction valid.

Page 30

Questions