When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems


Upload: tom-mens

Post on 12-Apr-2017


TRANSCRIPT

Page 1: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Alexandre Decan, Tom Mens, Maelick Claes, Philippe Grosjean

Software Engineering Lab, University of Mons, Belgium

http://informatique.umons.ac.be/genlog/projects/ecos

Page 2: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

SANER – Osaka, Japan, March 2016

Package-based Software Ecosystems

• A software ecosystem is:
  • “a collection of software products that have some given degree of symbiotic relationships.” [Messerschmitt & Szyperski 2003]
  • “a collection of software projects that are developed and evolve together in the same environment.” [Lungu 2008]

• A package-based ecosystem
  • is composed of interdependent software packages…
  • …that can be installed and distributed using a package manager

• Examples

Page 3: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

The R ecosystem

• R = popular open source language and software environment for statistical computing.

• Language popularity (IEEE Spectrum) [Figure: rankings for 2015 and 2014]

Page 4: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

The R ecosystem: Where can we find R packages?

Number of packages (March 2015):

• cran.r-project.org: 6,411
• www.bioconductor.org: 997
• r-forge.r-project.org: 1,883
• www.github.org: 5,150

Page 5: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

CRAN distribution platform

“The main advantage to getting your package on CRAN is that it will be easier for users to install. Your package will also be tested daily on multiple systems.” Karl Broman, “R package primer: a minimal tutorial”, 2015.

• But…
  – “It used to be that putting your package on CRAN also gave it some exposure, but with >6000 packages, that’s no longer quite true.” [Broman 2015]
  – “The non-transparent nature of the CRAN submission/rejection process is particularly at issue.” https://github.com/stnava/ANTsR/issues/8

Page 6: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

CRAN distribution platform

[Figure: package dependency graph of CRAN, 6,706 unarchived packages on June 1, 2015]

Credits: http://[email protected]/cran-gephi/

Page 7: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

CRAN distribution platform

• Continuous integration process based on R CMD check to ensure overall consistency
  – Daily check, on multiple OS, of all CRAN packages
  – Looks for common problems in package structure, metadata, code, documentation, data, …
  – Packages not passing the check need to be updated to avoid getting archived
  – Packages can only depend on the latest version of another package → if they depend on a failed package they may also fail the check
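The last point, that a failing package can drag its dependants down with it, can be illustrated with a small sketch. This is not CRAN's actual tooling; the package names and the `packages_at_risk` helper are hypothetical, and the dependency map is assumed to be given as a plain dictionary.

```python
from collections import defaultdict, deque


def packages_at_risk(depends_on, failed):
    """Return every package that directly or transitively depends on a failed one.

    depends_on: dict mapping a package name to the list of packages it requires.
    failed: set of package names that currently fail the check.
    """
    # Invert the edges so we can walk from a package to its dependants.
    reverse = defaultdict(set)
    for pkg, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(pkg)

    at_risk = set()
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for dependant in reverse[current]:
            if dependant not in at_risk and dependant not in failed:
                at_risk.add(dependant)
                queue.append(dependant)
    return at_risk


# Hypothetical package names, for illustration only.
depends_on = {
    "pkgA": ["pkgB"],
    "pkgB": ["pkgC"],
    "pkgC": [],
    "pkgD": [],
}
print(sorted(packages_at_risk(depends_on, {"pkgC"})))
```

Here a failure in pkgC puts both pkgB (direct dependant) and pkgA (transitive dependant) at risk, while pkgD is unaffected.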

Page 8: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

CRAN distribution platform

[Figure: automated daily CRAN package check (2016-02-23), summary of results]

Page 9: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

CRAN distribution platform

[Figure: automated daily CRAN package check (2016-02-23), results by package]

Page 10: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

R on GitHub

[Figure: githut.info, Q4 2014]

Page 11: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

How important is GitHub for R packages?

• CRAN used to be the main distribution platform for R packages

• With the appearance of GitHub, this is no longer the case

Is GitHub starting to replace or complement CRAN as package distribution medium?

Page 12: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

How important is GitHub for R packages?

• GitHub hosts many R packages
• Increasing number of new R packages on GitHub

[Figure: monthly number of new packages, up to June 1, 2015]

Page 13: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

How important is GitHub for R packages?

Is GitHub being used as a distribution platform for R packages?

• GitHub hosts thousands of R packages

• >40% of GitHub R packages are intended to be distributed, as they contain instructions to install them from GitHub.

• An increasing number of R packages on GitHub are not distributed elsewhere

• Packages like devtools facilitate distribution from GitHub

Page 14: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Managing R package dependencies

“The number of packages on CRAN and other repositories has increased beyond what might have been foreseen, and is revealing some limitations of the current design. One such problem is the general lack of dependency versioning in the infrastructure.”
Jeroen Ooms, “Possible directions for improving dependency versioning in R,” R Journal 5(1):197–206, June 2013

“One of the best things about the R ecosystem is being able to rely on other packages so that you don’t have to write everything from scratch. But there is a hard balance to strike with keeping the dependency list small.”
Jeff Leek, “How I decide when to trust an R package,” simplystatistics.org/?p=4409, November 2015.

Page 15: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Managing R package dependencies

In which repository are package dependencies satisfied?

depends on cmake, as well as some packages that depend on commercial packages not available in CRAN.

From a quantitative point of view, we found many R packages that are distributed on GitHub, providing specific installation instructions. We also found that packages that are only available on GitHub are younger than the GitHub packages that are also distributed on CRAN.

Conclusion. R package developers are using GitHub as a distribution platform for R packages.

V. RQ2: TO WHICH EXTENT DO R PACKAGES SUFFER FROM INTER-REPOSITORY DEPENDENCY PROBLEMS?

The R package devtools provides a wrapper around R’s built-in installation manager to facilitate the installation of R packages from different sources. Unfortunately, repositories such as GitHub have no central listing of available R packages. GitHub contains millions of Git projects filled with content from various programming languages. Even if we limit GitHub projects to those tagged with the R language, the vast majority do not contain an R package.

The lack of a central listing of packages prevents devtools from automatically finding and installing dependencies. Additionally, the same package can be hosted on multiple repositories and in different versions, making the problem of dependency resolution even more difficult.
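To make the resolution problem concrete, here is a minimal sketch of a lookup that searches several repositories in a fixed priority order. It is an illustration only, not how devtools or R's installer work: the `resolve` helper, the package names, the version strings, and the idea of an explicit per-repository index are all assumptions (the text above notes that GitHub offers no such index).

```python
def resolve(dependency, repositories, priority):
    """Return (repository, version) for the first repository in `priority`
    that hosts `dependency`, or None if no repository provides it.

    repositories: dict mapping a repository name to {package name: version}.
    priority: list of repository names, most preferred first.
    """
    for repo in priority:
        version = repositories.get(repo, {}).get(dependency)
        if version is not None:
            return repo, version
    return None


# Hypothetical indexes: pkgX exists in both repositories, in different versions.
repositories = {
    "CRAN": {"pkgX": "1.2.0"},
    "GitHub": {"pkgX": "1.3.0.9000", "pkgY": "0.1.0"},
}
priority = ["CRAN", "GitHub"]

for name in ["pkgX", "pkgY", "pkgZ"]:
    print(name, resolve(name, repositories, priority))
```

The choice of priority order decides which version of pkgX is installed, which is exactly the ambiguity that arises when the same package is hosted in several places.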

To analyze the extent to which R packages hosted on GitHub are subject to inter-repository dependency problems, we show that, while CRAN is the main source of packages to depend upon, the CRAN packages required by GitHub packages are the ones that are updated the most frequently. This shows that GitHub packages suffer from inter-repository problems, because many packages have an error status on CRAN caused by backward incompatible changes in their dependencies. To this end, we study the following subquestions:

• RQ2a: In which repository are package dependencies satisfied? We provide evidence that, while CRAN is the main source of packages to depend upon, many GitHub packages have dependencies on packages not provided by CRAN.

• RQ2b: Are CRAN packages more frequently updated than GitHub packages? We provide evidence that CRAN packages required by GitHub packages are more prone to be updated than CRAN packages not required by GitHub packages.

• RQ2c: How often do package updates cause backward incompatible changes? We provide evidence that many CRAN packages become broken due to CRAN package updates. We show how this problem is addressed on CRAN and that maintainers of GitHub packages do not benefit from CRAN’s solution.

RQ2a: In which repository are package dependencies satisfied?

We determined in which repository the dependencies of R packages hosted in CRAN or GitHub are satisfied. The results are summarised in Fig. 5. For each edge from repository A to repository B, the blue label on top represents the percentage of R packages from A that have a dependency satisfied by B, and the red label below the edge represents the percentage of dependencies of R packages in A that are satisfied by B. For example, 70.6% of all R packages on GitHub depend on at least one package in CRAN, while 85.3% of all declared dependencies in R packages on GitHub are satisfied by CRAN packages. Note that, if a package is hosted by both CRAN and GitHub, we count it as a CRAN package, because for dependency satisfaction we privilege the officially distributed package over its development version.

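[Figure 5: graph with three nodes: CRAN, GitHub, and Other repositories. Edge labels, as percentage of packages / percentage of dependencies: CRAN → CRAN 61.5 / 98.0; CRAN → Other repositories 1.8 / 1.5; CRAN → GitHub 0.8 / 0.5; GitHub → CRAN 70.6 / 85.3; GitHub → GitHub 8.6 (packages); GitHub → Other repositories 11.8 (packages). The two remaining labels, 10.7 and 4.0, are the dependency percentages of the last two edges.]

Figure 5. Inter-repository package dependencies for CRAN and GitHub. In blue: percentage of packages. In red: percentage of dependencies.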
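The two kinds of edge label can be computed as in the following sketch. It assumes a toy dataset with hypothetical package names, and it assumes that the package percentage is taken over all packages of the source repository, including those without dependencies; the helper name `edge_percentages` is not from the paper.

```python
def edge_percentages(packages, host_of, source, target):
    """Compute both labels of the edge source -> target.

    packages: dict mapping a package name to the list of its dependencies.
    host_of: dict mapping a package name to the repository it is counted in
             (a package hosted on both CRAN and GitHub is counted as CRAN).
    Returns (percentage of packages, percentage of dependencies).
    """
    in_source = {n: deps for n, deps in packages.items() if host_of.get(n) == source}
    all_deps = [d for deps in in_source.values() for d in deps]

    # "Blue" label: packages with at least one dependency satisfied by target.
    pkgs_hit = sum(
        1 for deps in in_source.values()
        if any(host_of.get(d) == target for d in deps)
    )
    # "Red" label: individual dependencies satisfied by target.
    deps_hit = sum(1 for d in all_deps if host_of.get(d) == target)

    pct_pkgs = 100.0 * pkgs_hit / len(in_source) if in_source else 0.0
    pct_deps = 100.0 * deps_hit / len(all_deps) if all_deps else 0.0
    return pct_pkgs, pct_deps


# Toy data with hypothetical names: four GitHub packages, two CRAN packages.
host_of = {"g1": "GitHub", "g2": "GitHub", "g3": "GitHub", "g4": "GitHub",
           "c1": "CRAN", "c2": "CRAN"}
packages = {"g1": ["c1", "c2"], "g2": ["c1", "g1"], "g3": ["g1"], "g4": []}

print(edge_percentages(packages, host_of, "GitHub", "CRAN"))
print(edge_percentages(packages, host_of, "GitHub", "GitHub"))
```

Because one package can depend on several repositories at once, the package percentages of the edges leaving a repository need not sum to 100, whereas the dependency percentages do.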

Because of the constraints imposed by CRAN’s daily R CMD check, CRAN is self-contained. 98.0% of its dependencies are satisfied within CRAN. Another 1.5% is satisfied by BioConductor and Omegahat, which are package distributions that are taken into account by the R CMD check.

The presence of 0.8% CRAN packages that depend on GitHub packages (totalling 0.5% of CRAN’s package dependencies) may seem to contradict this assertion, as the R CMD check does not allow the installation of packages from GitHub. However, these dependencies target packages that are distributed on BioConductor or Omegahat, and for which a development version is available on GitHub.

This implies that nearly all CRAN packages satisfy their dependencies within CRAN. In addition, 70.6% of the R packages on GitHub depend on a package belonging to CRAN (totalling 85.3% of GitHub’s package dependencies). This strongly suggests that CRAN is at the center of the ecosystem, and that it is nearly impossible to install R packages hosted on GitHub without relying on CRAN.

Considering R packages on GitHub, 8.6% of them have a dependency in GitHub and 11.8% have a dependency that is not in CRAN or GitHub. The union of these R

Blue = % of packages in repository A that have a dependency satisfied by a package in repository B

Red = % of dependencies of packages in repository A that are satisfied by a package in repository B

Other repositories = BioConductor, OmegaHat, …

CRAN is self-contained, so it does not suffer from inter-repository dependency problems. But what about GitHub?


Page 17: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Managing R package dependencies

Conclusions:
• Most packages on GitHub rely on packages in CRAN…
• …but the number of packages on GitHub that rely on packages outside of CRAN is increasing

Do R packages on GitHub suffer from inter-repository dependency problems?

Page 18: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

Observations
• CRAN’s policy requires depending on the most recent package version
• GitHub packages frequently depend on CRAN packages
• They cannot benefit from CRAN’s daily check
• But can use other continuous integration processes such as Travis CI
• Although less than 25% actually do this…

“I personally think it’s REALLY relevant to at least be ABLE to be very specific and rigid with regard to your dependencies.”

“And I think the R universe could provide better tools to fit the needs of developers and professionals out there in a better way.”

Page 19: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

→ CRAN package updates are more likely to lead to problems for GitHub packages than for CRAN packages

Page 20: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

CRAN package updates are more likely to lead to problems for GitHub packages than for CRAN packages

Survival analysis: probability that a CRAN package is NOT updated
Period 12/2014 – 6/2015, 3,740 considered packages

If a CRAN package is required by a GitHub package, it is more likely to be updated.
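A survival curve of this kind gives, for each point in time, the estimated probability that a package has not yet been updated. The slide does not name the estimator that was used; the sketch below assumes the common Kaplan-Meier estimator and uses made-up observations, so both the `kaplan_meier` helper and the data are illustrative only.

```python
def kaplan_meier(observations):
    """Estimate the probability of 'no update yet' over time.

    observations: list of (days, updated) pairs. `updated` is True if the
    package was updated after `days` days, False if it was still not updated
    when observation ended (a censored observation).
    Returns a list of (days, survival probability) at each update time.
    """
    event_times = sorted({days for days, updated in observations if updated})
    survival = 1.0
    curve = []
    for t in event_times:
        # Packages still under observation and not yet updated just before t.
        at_risk = sum(1 for days, _ in observations if days >= t)
        # Packages updated exactly at t.
        events = sum(1 for days, updated in observations if days == t and updated)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up observations: (days until update or end of observation, updated?)
observations = [(30, True), (60, True), (60, False), (90, True), (180, False)]
for days, prob in kaplan_meier(observations):
    print(days, round(prob, 3))
```

Comparing two such curves, one for CRAN packages required by GitHub packages and one for the rest, is one way to read the claim above: the curve that drops faster belongs to the group that is updated sooner.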

Page 21: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

How often do package updates cause backward incompatible changes on CRAN?

Page 22: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

How often do package updates cause backward incompatible changes on CRAN?

• 41% of the errors in CRAN packages are caused by backward incompatible changes (analysis based on daily R CMD check results over a 2-year period)

• We observed backward incompatible changes on average once every 20 updates (period December 2014 → June 2015)
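As a rough worked example of what "once every 20 updates" can mean for a dependent package, the sketch below computes the chance of seeing at least one backward incompatible change across several dependency updates. It assumes that each update breaks independently with probability 1/20, which is a simplification not made in the slides; the function name is hypothetical.

```python
def prob_at_least_one_break(n_updates, break_rate=1 / 20):
    """Probability that at least one of `n_updates` dependency updates is
    backward incompatible, assuming independent updates with a constant
    per-update break rate (a simplifying assumption)."""
    return 1.0 - (1.0 - break_rate) ** n_updates


for n in (1, 5, 20):
    print(n, round(prob_at_least_one_break(n), 3))
```

Under these assumptions, a package whose dependencies receive 20 updates in total has roughly a 64% chance of meeting at least one breaking change.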

Page 23: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

How often do package updates cause backward incompatible changes?

R package maintainers share this concern:

“It is more and more of a pain if the package I’m depending on breaks”

“If I have to load a Depends package, it adds a significant burden: I have to check for conflicts every time I take a dependency on a new package.”

Page 24: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Inter-repository dependency problems

How often do package updates cause backward incompatible changes?

R package maintainers share this concern:

“The risk of things breaking at some point due to the fact that a version of a dependency has changed without you knowing about it is immense.”

“I had one case where my package heavily depended on another package and after a while that package was removed from CRAN and stopped being maintained.”

Page 25: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Wrap-up

• GitHub is becoming increasingly used to develop and distribute R packages

• Package updates often lead to incompatibilities in depending packages

• GitHub packages cannot benefit from CRAN’s continuous integration process

“One recent example was the forced roll-back of the ggplot2 update to version 0.9.0, because the introduced changes caused several other packages to break.”
Jeroen Ooms, “Possible directions for improving dependency versioning in R,” R Journal 5(1):197–206, June 2013

Page 26: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

Wrap-up

• The R community needs better tool support for managing dependencies and dealing with package versions.

• Such tools are starting to emerge:
  – devtools, drat, packrat, miniCRAN, checkpoint, …
  – R-Hub: ongoing project accepted by the R Consortium
    • Infrastructure for developing, building, testing, validation, distribution, continuous integration of R packages.
    • A service that is complementary to CRAN and R-Forge

Page 27: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

References

• On the maintainability of CRAN packages [CSMR-WCRE 2014]

• maintaineR, a web-based dashboard for maintainers of CRAN packages [ICSME 2015]

• An empirical study of identical function clones in CRAN [IWSC 2015 @ SANER]

• Anonymized e-mail interviews with R package maintainers active on CRAN and GitHub [Technical Report]

Website: informatique.umons.ac.be/genlog/projects/ecos/

Page 28: When GitHub meets CRAN: An Analysis of Inter-Repository Package Dependency Problems

References

The Evolution of the R Software Ecosystem

Daniel M. German (University of Victoria), Bram Adams (École Polytechnique de Montréal), Ahmed E. Hassan (Queen's University)

CSMR 2013