Exploring Language Communities on GitHub Antigoni M. Founta


Post on 21-Feb-2017


Page 1: Exploring Language Communities on Github

Exploring Language Communities on GitHub

Antigoni M. Founta

Page 2: Exploring Language Communities on Github

Introduction

This study explores underlying patterns and detects communities among the programming languages used by GitHub users, via network analysis.

Two graphs are derived from the whole dataset and two are location-specific, in order to study both the general GitHub audience and the trends in some sample locations.

Goal: Understand how languages are grouped in practice, based on how developers use them, and discover trends either worldwide or in specific locations.

Nodes → Languages

Edges → Language co-occurrence in User Profiles (based on the user repositories)

Page 3: Exploring Language Communities on Github

GitHub

● GitHub is a web-based Git repository hosting service

● It offers distributed revision control and source code management (SCM)

● It is the largest host of source code in the world! [1]

Why GitHub?

“The introduction of social features in a code hosting site has drawn particular attention from researchers while the integrated social features, and the availability of metadata through an accessible api have made GitHub very attractive for software engineering researchers” [3]

Top Image Source: https://goo.gl/CWBMqb

Bottom Image Source: https://github.com/logos

Page 4: Exploring Language Communities on Github

Pros

● Developers will get a hint of which languages are used jointly, and thus perhaps serve the same purpose.

● Language creators will get a hint of what their audience prefers and trusts.

● Language communities might actually be another way to explore developer communities.

Challenges

● Programming language categorization ambiguity

● GitHub bias toward Web Development

● Locations and users follow a power-law distribution: there are numerous developers from a few locations (such as California, London, etc.) and a significant number of locations with few users

Page 5: Exploring Language Communities on Github

Fundamentals

Dataset Features

➔ ID, Username, Location, Followers, Public Repos, Languages & Bytes of code

Network Structure

➔ Nodes: Languages

◆ Attribute: Total Bytes of Code

➔ Edges: Pairs of Languages that co-occurred in at least one user profile

◆ Weight: Number of users that use both languages
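The network structure above can be sketched in a few lines. This is a minimal illustration and not the author's code: the function name `build_language_graph`, the input shape (one `language -> bytes` mapping per user), and the sample profiles are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def build_language_graph(profiles):
    """Build node and edge tables for the user-based language graph.

    profiles: iterable of dicts, one per user, mapping language -> bytes of code.
    Returns (node_bytes, edge_users):
      node_bytes[lang]   = total bytes of code written in `lang` (node attribute)
      edge_users[(a, b)] = number of users whose profile contains both a and b
    """
    node_bytes = Counter()
    edge_users = Counter()
    for languages in profiles:
        # Counter.update with a mapping adds the byte counts per language.
        node_bytes.update(languages)
        # Sorting makes (a, b) and (b, a) map to the same undirected edge key.
        for a, b in combinations(sorted(languages), 2):
            edge_users[(a, b)] += 1
    return node_bytes, edge_users


# Hypothetical sample data, not taken from the study's dataset.
profiles = [
    {"JavaScript": 5000, "CSS": 1200, "HTML": 800},
    {"JavaScript": 3000, "CSS": 400},
    {"C": 90000, "Python": 2500},
]
node_bytes, edge_users = build_language_graph(profiles)
```

`edge_users.most_common()` then gives the language pairs ranked by how many users share them, and `node_bytes.most_common()` the languages ranked by bytes of code.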

Data Challenges

➔ Only public repositories are accessible (users mainly work in private ones!)

➔ Languages are added by the user (entries may be empty, not real languages, or written inconsistently)

PyGithub[2]
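A hedged sketch of how the dataset features listed above could be gathered per user. The slides only name the PyGithub library, so the helper `collect_profile` and the record layout are assumptions. The function relies on attributes and methods that PyGithub user and repository objects expose (`id`, `login`, `location`, `followers`, `public_repos`, `get_repos()`, `get_languages()`), but it accepts any object with that shape, so it can be tried without network access.

```python
from collections import Counter


def collect_profile(user):
    """Aggregate one user's public repositories into a single profile record.

    `user` is expected to look like a PyGithub NamedUser: it has the
    attributes read below and a get_repos() method whose items provide
    get_languages() -> {language: bytes_of_code}.
    """
    languages = Counter()
    for repo in user.get_repos():
        # Sum bytes of code per language across all public repositories.
        languages.update(repo.get_languages())
    return {
        "id": user.id,
        "username": user.login,
        "location": user.location,
        "followers": user.followers,
        "public_repos": user.public_repos,
        "languages": dict(languages),
    }


# Possible usage with PyGithub (requires network access; unauthenticated
# requests are heavily rate-limited, so a token is advisable in practice):
#   from github import Github
#   record = collect_profile(Github().get_user("octocat"))
```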

Page 6: Exploring Language Communities on Github

Final Datasets

❏ 4,000 users since GitHub's founding + 150,000 from 2012

❏ Filter: Keep only users with locations!

❏ Final: 2,300 users since GitHub's founding + 37,000 from 2012
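The location filter can be illustrated as follows. This is only a sketch: the record layout and the rule that whitespace-only locations count as missing are assumptions, since the slides do not say how empty locations were detected.

```python
def has_location(record):
    """Return True when the user record carries a non-empty location string."""
    location = record.get("location")
    return bool(location and location.strip())


# Hypothetical records, not taken from the study's dataset.
records = [
    {"username": "a", "location": "California"},
    {"username": "b", "location": None},
    {"username": "c", "location": "   "},
    {"username": "d", "location": "Greece"},
]
located = [r for r in records if has_location(r)]
```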

Page 7: Exploring Language Communities on Github

Descriptive Statistics

Page 8: Exploring Language Communities on Github

Data Distribution

Page 9: Exploring Language Communities on Github

Methodology

Create graph (as described):

● Filters: Degree Range

● Layout: Force Atlas 2

● Node size: “Bytes of Code” Range

● Label size: Degree Range

Compute Modularity & get communities:

● Sometimes using edge weights, sometimes not

Visualize pairs of languages and the number of developers that use both
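To make the "with or without edge weights" choice concrete, here is a minimal sketch that scores a given partition with the standard modularity formula, Q = Σ_c [ L_c / m − (D_c / 2m)² ], where L_c is the edge weight inside community c, D_c the total degree of its nodes, and m the total edge weight. The slides do not say which tool or algorithm produced the communities, so this only evaluates a partition and does not detect one. The function name, the edge dictionary, and the two sample communities are assumptions.

```python
def modularity(edges, communities, use_weights=True):
    """Modularity of a partition of an undirected graph.

    edges: dict {(u, v): weight}, one entry per undirected edge.
    communities: list of disjoint sets that together cover every node in `edges`.
    use_weights: when False, every edge counts as weight 1.
    """
    if not edges:
        raise ValueError("modularity is undefined for a graph without edges")
    community_of = {node: i for i, nodes in enumerate(communities) for node in nodes}
    m = 0.0                                   # total edge weight
    internal = [0.0] * len(communities)       # L_c: weight of edges inside c
    degree = [0.0] * len(communities)         # D_c: summed degree of nodes in c
    for (u, v), w in edges.items():
        w = float(w) if use_weights else 1.0
        m += w
        cu, cv = community_of[u], community_of[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += w
    return sum(internal[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


# Hypothetical co-occurrence counts (edge weight = users sharing both languages).
edges = {
    ("JavaScript", "CSS"): 30, ("JavaScript", "HTML"): 25, ("CSS", "HTML"): 20,
    ("C", "C++"): 15, ("C", "Python"): 10, ("C++", "Python"): 5,
    ("JavaScript", "Python"): 2,
}
communities = [{"JavaScript", "CSS", "HTML"}, {"C", "C++", "Python"}]

q_weighted = modularity(edges, communities, use_weights=True)
q_unweighted = modularity(edges, communities, use_weights=False)
```

With weights, the weak bridge between the two groups matters less than the heavily shared pairs, so the same partition can score differently in the two settings.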

Page 10: Exploring Language Communities on Github

Results: All Data - All Languages

User-based Language Graph

Page 11: Exploring Language Communities on Github

Language Co-occurrences on User Profiles &

Top Languages based on Bytes of Code written

Page 12: Exploring Language Communities on Github

Results: All Data - Top Languages

User-based Language Graph

Page 13: Exploring Language Communities on Github

Language Co-occurrences

on User Profiles

# Top languages showed only minor differences and are thus not reported

Page 14: Exploring Language Communities on Github

Results: California - Top 3 Languages

User-based Language Graph

Page 15: Exploring Language Communities on Github

Language Co-occurrences on User Profiles &

Top Languages based on Bytes of Code written

Page 16: Exploring Language Communities on Github

Results: Greece - Top 3 Languages

User-based Language Graph

Page 17: Exploring Language Communities on Github

Language Co-occurrences on User Profiles &

Top Languages based on Bytes of Code written

Page 18: Exploring Language Communities on Github

Repo-based Language Graph

Communities (modularity: 0.23)

Blue: Web-oriented

Pink: Desktop-oriented

Yellow: Other

Page 19: Exploring Language Communities on Github

Conclusions

Language-Oriented

➔ “Web-oriented” is the most robust category of languages used on GitHub

➔ “JavaScript - CSS” is the leading pair of languages, always outnumbering all other pairs

➔ Even though JavaScript almost always dominates the language pairs, C is always the most used language in terms of bytes of code [perhaps C users are not language-extroverts…]

Scheme-Oriented

➔ With a user-based scheme we can understand the general preferences of developers and the patterns between languages. [difficult when the dataset is big!]

➔ With a repo-based scheme we can understand hidden (or at least not widely known) patterns of languages that are used for the same purposes.

➔ General purpose: repo-based scheme

Location purpose: user-based scheme
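The difference between the two schemes can be sketched by changing only the unit of co-occurrence. The slides do not spell out how repo-based edges were weighted, so counting repositories that contain both languages is an assumption, as are the helper `cooccurrence` and the sample data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(language_sets):
    """Count how many language sets contain each unordered pair of languages."""
    pairs = Counter()
    for langs in language_sets:
        # Sorted, de-duplicated names give one stable key per undirected pair.
        pairs.update(combinations(sorted(set(langs)), 2))
    return pairs


# Hypothetical data: each user maps to a list of repositories,
# and each repository is the list of languages it contains.
users = {
    "alice": [["JavaScript", "CSS"], ["C", "Assembly"]],
    "bob": [["Python"], ["JavaScript", "CSS", "HTML"]],
}

# User-based scheme: one language set per user (union over all repositories).
user_based = cooccurrence(set().union(*repos) for repos in users.values())

# Repo-based scheme: one language set per repository.
repo_based = cooccurrence(repo for repos in users.values() for repo in repos)
```

In this toy data the user-based graph links C with JavaScript because one developer uses both, while the repo-based graph does not, since the two never appear in the same repository.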

Page 20: Exploring Language Communities on Github

Future Work

● More data!

● More Locations and Comparisons

● Language Graphs based on Top/Most influential Users [using followers or stars]

● Association Rules on Languages for community detection

● User Graph to detect user communities per Location (e.g. web developers, game developers) and compare with the Language Graph of that Location

Page 21: Exploring Language Communities on Github

References

1. GitHub on Wikipedia: https://en.wikipedia.org/wiki/GitHub

2. PyGithub Library: https://github.com/PyGithub/PyGithub

3. Kalliamvakou, Eirini, et al. "The promises and perils of mining GitHub." Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014.

4. Thung, Ferdian, et al. "Network structure of social coding in GitHub." 2013 17th European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2013.

5. Takhteyev, Yuri, and Andrew Hilts. "Investigating the geography of open source software through GitHub." (2010).

6. Figueira Filho, Fernando, et al. "A study on the geographical distribution of Brazil’s prestigious software developers." Journal of Internet Services and Applications 6.1 (2015): 1.

Image Source: http://wifflegif.com/tags/58347-octocat-gifs

Page 22: Exploring Language Communities on Github

Thank you for your attention! Any questions?

Image Source: https://octodex.github.com/images/heisencat.png