

Source: himangshuranjan.weebly.com/uploads/6/5/8/2/6582934/...

The GitHub Repository Recommendation System

An effort towards enhanced collaboration in Open Source Community

Project Team - 5(Collaborator)

Ankur Kataia (akatari2)

Guanxu Yu (gyu9) Himangshu Ranjan Borah (hborah) Mukundram Muraliram (mmurali5) Sudhanshu Shekhar Singh (ssingh25)

Project Guide

Michael Kowolenko

Computer Science Department North Carolina State University, Raleigh NC



Table of Contents

Introduction
    A Quick Note on the GitHub Data
Methodology
    The Data Collection
    The Modeling
    Preprocessing of Text Data
    The Recommendation Engine
    The Application Flow and Development
Results
    Collaborative Filtering Model
    Content Based Filtering Model
Discussion
Conclusion and Future Work
Appendix
    Code
    Dictionaries
    Tools
References
Acknowledgement
Application Snapshots

CSC 591 - Project Report, Team 5


Introduction

In this era of rapid technological advancement, research and development across the different fields of Computer Science has become a pivotal part of the revolution happening around us. Many different facets make computer science what it is today: from core mathematical research to commercial development, it is a long pipeline of continuous change. Computer programming, needless to say, is one of the major aspects of the field, and it has evolved into an extremely sophisticated discipline from its beginnings in the late '80s. In the early days of software development, almost everything was proprietary, and the concept of Open Source as it exists today was nearly absent. Gradually, software architectures grew from basic designs into quite extensive ones, and system owners began releasing their code for open development by the community.

A major shift in open source development came with the founding of GitHub on February 8, 2008, which offered online hosting for version-controlled code. GitHub was initially built to support Git, a local version control system developed in 2005. Over time, GitHub became more than a version control host and started to form communities around itself. The tremendous networking capability within GitHub stems from its inherently collaborative nature. Many aspects of GitHub's usage dynamics make it a great candidate for studying collaboration behavior among users and for modeling socio-technical systems that help users direct their contribution effort more effectively. In this project, we consider the very interesting problem of recommending repositories to users which they might be interested in contributing to.
We currently see little research on this problem, which made it look like a viable topic for this academic project. Our basic assumption is that GitHub users are interested in contributing to the Open Source community and would find it very helpful if we could design an architecture for that purpose, using the data available from GitHub's daily event logs and its other APIs. Our system's core flow collects data about users and their repositories and builds models that recommend repositories to which a user is not yet connected. The results were quite interesting, and this report is dedicated to the approaches we followed, the results we observed, and the remaining work we would like to explore later if we have the opportunity.

In our survey of prior work, we found no study that addresses exactly this problem. There is, however, some related work on sentiment analysis of GitHub commit logs. In [1], the authors performed lexical sentiment analysis to study the emotions


expressed in commit comments of different open source projects and analyzed their relationship with factors such as the programming language used, the time of day, and the day of the week. They claim to have found some very interesting results. In [2], the authors studied the different but related task of recommending reviewers for pull requests, which is one of the most important aspects of GitHub's collaboration dynamics.

This report discusses in detail all the approaches taken to build the full application, from data collection through application development, and is organized as follows. First, the Methodology section describes the core mathematics and the corresponding implementations, mainly for the Machine Learning parts of the project. It also describes the data collection and the application development. We then present the results obtained with our models and discuss where they succeeded and where they fell short. Finally, we describe some core implementation details of our codebase and conclude with snapshots of the application and the presentation.

A Quick Note on the GitHub Data

GitHub provides data in two ways. First, we can download archived data at per-hour, per-day, and per-month granularities; these files contain logs of the different public events that occurred on GitHub during that window. Second, we can use the REST APIs provided by GitHub to ask for specific data, such as one user's details: username, company, other demographics, and so on. For our problem, there are essentially two ways of collecting the data:

● Start from one user and explore the network by recursively crawling the follower and collaborator relationships. This yields a network centered on the initial user, and we believed this approach would not give us the variability in the data that we were hoping to find.

● Another way is to start with a set of random users, which we call seeds, and then crawl outward from all of them simultaneously to build up a network.

Given the nature of our problem statement, the second approach seemed more appropriate. The initial users were picked from an archive file containing one hour of public GitHub event logs. Starting from this initial user list, we crawled to depth 1 for every user, i.e., we collected each user's followers, the people they follow, and the collaborators on all of their repositories. This gave us a total of 2964 users in our user database; the newly discovered users were appended to the same list of initial users. The Data Collection section describes the process of building the database in detail.
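As a sketch of the seed-based, depth-1 crawl described above: the neighbor-fetching function below is a stand-in for the real GitHub REST calls (followers, followees, repository collaborators) so the example stays offline, and the toy network data is purely illustrative.

```python
def crawl_depth1(seeds, fetch_neighbors):
    """Collect the seed users plus every user one hop away from any seed."""
    users = set(seeds)
    for seed in seeds:
        users.update(fetch_neighbors(seed))
    return users

# Toy stand-in for the follower/followee/collaborator relations.
toy_network = {
    "alice": ["bob", "carol"],
    "dave": ["carol", "erin"],
}

found = crawl_depth1(["alice", "dave"], lambda u: toy_network.get(u, []))
print(sorted(found))  # ['alice', 'bob', 'carol', 'dave', 'erin']
```

In the real crawler, fetch_neighbors would issue the REST calls, and the discovered users would be appended to the initial user list in the database.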


Methodology

Our system has three major blocks: the Data Collection layer, the Recommendation Engine layer, and the Application layer. A rough architecture is given in Figure 1. The Data Collection layer is responsible for collecting data about the users, their repositories, and the commit events on those repositories. The data is inserted into a relational database hosted on our Postgres server. The Recommendation Engine layer pulls this data and handles cleaning, preprocessing, and attribute synthesis; its main job is to make the data usable by the recommender algorithms and to build the prediction models. Once built, the models are serialized to persistent storage, and the next layer uses them for on-the-spot prediction. That layer is queried by the GUI web app whenever a user asks for a recommendation; the results are then shown to the user, completing the flow. We describe each step in detail below.

Figure 1: The flow of the system (abstract view).


The Data Collection

Open-source developers all over the world are working on millions of projects: writing code and documentation, filing and fixing bugs, and so forth. GitHub Archive is a project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. GitHub provides 20+ event types, ranging from new commits and fork events to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which can be accessed with any HTTP client.

Figure 2: Sample query command.

From the GitHub Archive we obtain a large number of different events. Below is the basic structure of one JSON object, corresponding to one event in the archive.

Figure 3: Basic JSON object structure of an event.

The initial users come from a list extracted from one hour of GitHub public logs. As the figure shows, from the "actor" attribute we can access each user's "id", "login" (their user name), and "url". We import this data into our "initial_user_list" table, using "user_id" as the primary key and a unique constraint on the combination of "user_login" and "user_api_url" to preserve data integrity. Below is the basic design of this table.
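As a sketch of this extraction step, the snippet below pulls the three fields we store from one archive event. The sample event values are illustrative, but the layout follows the "actor" structure shown above.

```python
def extract_actor(event):
    """Pull the fields stored in initial_user_list from one archive event."""
    actor = event["actor"]
    return {
        "user_id": actor["id"],
        "user_login": actor["login"],
        "user_api_url": actor["url"],
    }

# Illustrative event; real archive events carry many more fields.
sample_event = {
    "type": "PushEvent",
    "actor": {
        "id": 665991,
        "login": "petroav",
        "url": "https://api.github.com/users/petroav",
    },
}

row = extract_actor(sample_event)
print(row)
```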


Figure 4: initial_user_list table.

We now have enough structured initial data for further processing. In fact, we can get all the information about a user from their "user_api_url" alone: pasting the URL into a browser returns the user's basic JSON object, whose structure is shown below.

Figure 5: User JSON object example.

From the above JSON object we can easily grab all the useful attributes of a user and import them into our second table, "users". In this table we use "user_id" as the primary key and place a unique constraint on "user_login", because the login is the only handle used to retrieve a user's data in our application. Below is the basic design of the "users" table.


Figure 6: Users table.

With all the user data in place, the next step is to grab the attributes of their repositories. As Figure 5 shows, it is easy to list a user's repositories via the "repos_url" attribute, in the same way we fetched the user attributes. Below is a basic chunk of a repository JSON object.

Figure 7: Repository JSON object 1.


Figure 8: Repository JSON object 2. Below is the owner JSON object nested within the repository JSON object.

Figure 9: Owner JSON object.

From these two JSON objects we can grab all the useful attributes of a repository and import them into our third table, "repositories". In this table we set "repo_id" as the primary key and "owner_id" as a foreign key referencing "user_id" in the "users" table. Below is our basic design of the "repositories" table.


Figure 10: Repositories table.

That is not enough; we need more. Commit messages are critical for reflecting users' habits, skills, and so forth. This raises a question: should we collect all commits in our database? A repository can have many branches, and every branch can have many different commits. Should we include all of them? Our answer is no: we collect only the commits on the master branch. As the GitHub guide puts it, "There's only one rule: anything in the master branch is always deployable." Commits on the master branch are therefore the ones that matter, and those are the commits we grab.

According to the GitHub API documentation, "https://api.github.com/repos/user_name/repo_name/commits", which can be obtained from the repository JSON object in Figure 7, returns only the first page of commits on the default branch (in general, the default branch is master). That is not enough; we want more data. So we use "branches_url" to get the master branch and append the parameter "?per_page=100" to its URL, which guarantees enough data for the later analysis. Below is a basic commit JSON object.
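As an aside, the GitHub API returns many of these URLs as URI templates (for example, a "commits_url" ending in "{/sha}"), so a small helper can strip the template suffix and append the paging parameters discussed above. This is only a sketch; the "octocat/Hello-World" repository name is illustrative.

```python
def expand(template_url, **params):
    """Drop the unused URI-template suffix (e.g. "{/sha}") from a GitHub
    hypermedia URL and append query parameters in sorted order."""
    base = template_url.split("{")[0]
    if not params:
        return base
    query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"{base}?{query}"

commits_url = "https://api.github.com/repos/octocat/Hello-World/commits{/sha}"
print(expand(commits_url, per_page=100))
# https://api.github.com/repos/octocat/Hello-World/commits?per_page=100
```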


Figure 11: Commits JSON object. The figures below show the parts we are interested in.

Figure 12: Commit message object.

Figure 13: Part of the committer JSON object.

We can now use the commit attributes from the two JSON objects above to construct our "commit_events" table. One thing needs clarifying: every commit within a repository has a unique "sha", so we use ("sha", "repo_id") as a composite key. We also set "committer_id" and "repo_id" as foreign keys, referencing "user_id" in the users table


and "repo_id" in the repositories table, respectively. Below is our basic commit_events table design.

Figure 14: commit_events table.
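The schema described in this section can be sketched as DDL. We use an in-memory SQLite database here purely for illustration (the project itself uses Postgres), and the non-key columns are abbreviated.

```python
import sqlite3

DDL = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    user_login TEXT UNIQUE NOT NULL
);
CREATE TABLE repositories (
    repo_id  INTEGER PRIMARY KEY,
    owner_id INTEGER REFERENCES users(user_id)
);
CREATE TABLE commit_events (
    sha          TEXT NOT NULL,
    repo_id      INTEGER NOT NULL REFERENCES repositories(repo_id),
    committer_id INTEGER REFERENCES users(user_id),
    message      TEXT,
    PRIMARY KEY (sha, repo_id)  -- sha is unique only within one repository
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO commit_events (sha, repo_id, message) "
             "VALUES ('a1b2c3', 1, 'fix parser bug')")
```

The composite primary key lets the same sha appear under different repositories while rejecting a duplicate (sha, repo_id) pair.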

The Modeling

The modeling part of the architecture is the heart of the recommendation system. The major challenge is dealing simultaneously with the structured and unstructured information available in our database for the Users, Repositories, and Commit Logs. We have two tasks: first, converting the structured and unstructured data into a format that Machine Learning models can use; second, using the information about each user's collaboration and activity on their repositories to estimate their seriousness about, or confidence in, those repositories. In recommendation-system terms, we need some kind of "rating" for every repository from its owner. We call this rating the "degree of seriousness (S)" in our modeling.

Traditionally, two kinds of ratings or feedback are available to recommendation systems. The first is Explicit Feedback, where a user gives a rating for an item directly, e.g., rating a movie. The other kind is Implicit Feedback, which is trickier: the user gives no feedback on the items themselves, so we have to synthesize the ratings ourselves from other usage information, such as the user's interaction history with an item. There are recommendation algorithms that are inherently capable of handling implicit feedback, but it is usually a better option to synthesize the ratings ourselves, using modeling techniques and our domain expertise about the possible interactions, and then apply an explicit-feedback type of algorithm. In our scenario, this is precisely what we do; the details of the synthesis are explained below.


Before moving to the details, we want to note an aspect of our data. As explained in the Data Collection section above, we seed our user search from a one-hour archive log file. It is not possible for us to have every user on the GitHub servers, so what we have is a set of users, their followers and followees (the people whom each user follows), and the collaborators on all of their repositories. This gives each user a kind of ego network; these networks are connected to some extent, but may be disconnected at times as well. The recommendations we produce therefore share the same spatial and temporal limitations as the training data. This is an issue we cannot resolve unless the whole GitHub network is at our disposal; if this architecture were deployed on GitHub itself, with the full network available, we would be able to produce recommendations that are more accurate and better situated, both spatially and temporally, in each user's ego network.

Now we start with the synthesis of the user data. For every user in our database we collected user_id, user_login, html_url, name, company, location, email, bio, repo_count, followers_count, followees_count, created_at, updated_at, and blog_url, where each field is self-explanatory. Not all of these attributes are useful for modeling: nominal attributes such as the name or blog URL never help. For Machine Learning we need attributes that define a unique signature for each user, which we can use to learn models specific to that user. Additionally, we extract information by parsing the Bio field, which gives us some information unique to each user. So, finally, we have some attributes collected directly from the database and some synthesized from the Bio field. In the code, the BiographyAnalyzer class is the main manager for synthesizing Bio-related attributes. We parse the Bio text to find the following five pieces of information, which we believe uniquely characterize a user:

1. Interest: anything related to the user's general interests.
2. Technologies: the technologies the user might be working on, as listed in the bio text.
3. Programming Languages: the languages the user works with.
4. Positions: the positions the user holds.
5. Student Status: anything related to student status.

Each of the above five attributes is searched using its own dictionary, compiled to cover the words that category might contain. These dictionaries are too long to include here and are provided with the codebase for reference, if needed. While searching for possible matches, we use fuzzy string matching, which computes the distance


between two strings using string similarity measures and gives an approximate match ratio. We do this because, most of the time, the exact word we are looking for does not appear verbatim; if there is a sufficiently close match, we round it to the nearest dictionary entry. The string-matching distance we use is the Levenshtein Distance, defined as follows [3]: "Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t." For example:

If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed; the strings are already identical.

If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
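The distance in these examples can be computed with the classic dynamic-programming recurrence. The project used a fuzzy-matching library for this; the following self-contained implementation is only an illustrative sketch.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions
    required to transform s into t (dynamic programming, O(len(s)*len(t)))."""
    prev = list(range(len(t) + 1))  # distances for the empty prefix of s
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("test", "test"))  # 0
print(levenshtein("test", "tent"))  # 1
```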

The greater the Levenshtein distance, the more different the strings are. After computing the matches, we have to choose a cutoff that decides a hit or a miss. We keep this as a configuration parameter that can be tweaked through the ConfigurationManager class in our implementation. One performance enhancement we made is to precompute the cartesian product of the lexicons in the bio and the lexicons in the dictionaries, avoiding explicit nested loops in the matching code.

Having explained the details of the Biography Analyzer, we now show how it works in practice. Suppose we have the bio text "I am interested in software development. Currently working as a test automation engineer and working mainly with Java and Scala. In my free time, I like to go hiking and do adventure sports." The analyzer will synthesize five attributes, "interest_q", "tech_q", "languages_q", "positions_q", and "status_q", with values probably like ["hiking", "automation", "java", "testing", null]. This step is crucial, and its output may fluctuate heavily depending on the quality and legibility of the bio text. Along with these five attributes, we retain some of the original attributes, "location", "repo_count", "followers_count", "followee_count", "days_from_creation", and "days_from_update", which are added to the five above to form a full user vector, ready to be fed to any machine learning algorithm. All date attributes are converted to day differences, which gives continuous variables that learning algorithms handle well; otherwise, the dates would be mere nominal attributes of very little use. In our implementation, "user_orig_data" corresponds to the raw information collected from the database, and "user_data" corresponds to the synthesized data described in this paragraph.
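A minimal sketch of the dictionary matching just described, using Python's standard-library difflib as the fuzzy matcher. The mini-dictionaries and the 0.85 cutoff are illustrative stand-ins for the real dictionaries and the ConfigurationManager threshold.

```python
from difflib import SequenceMatcher

# Hypothetical mini-dictionaries; the real ones ship with the codebase.
DICTIONARIES = {
    "languages_q": ["java", "scala", "python", "javascript"],
    "positions_q": ["engineer", "developer", "manager", "student"],
}

def fuzzy_hits(bio, dictionaries, cutoff=0.85):
    """Per attribute, return the dictionary words that fuzzily match some
    token of the bio; cutoff plays the role of the configured threshold."""
    tokens = [w.strip(".,").lower() for w in bio.split()]
    hits = {}
    for attr, words in dictionaries.items():
        matched = [w for w in words
                   if any(SequenceMatcher(None, w, tok).ratio() >= cutoff
                          for tok in tokens)]
        hits[attr] = matched or None
    return hits

bio = ("I am interested in software development. Currently working as a "
       "test automation engineer and working mainly with Java and Scala.")
result = fuzzy_hits(bio, DICTIONARIES)
print(result)
```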


After getting the useful attributes for the users, the next task is to get the useful attributes for the repositories. In the repositories' case, very little processing is needed; we took the data almost directly from the original records. The fields we finalized for the repository signature are "repo_id", "owner_id", "is_private", "is_forked", "cont_count", "language", "days_from_creation", "days_from_updation", "days_from_push", "size", "watcher_count", "stargazer_count", "has_wiki", "fork_count", "open_issues", "sub_count", "readme", and "description", where every field is self-explanatory; together they are very good candidates for distinguishing two repositories. We will use another subset of the repository data for the content-based modeling part, which we explain in due time. As with the users, "repo_orig_data" corresponds to the raw repository data and "repo_data" to the modified subset in our implementation.

The final and most important part of our data synthesis is computing each user's seriousness about their repositories, which serves as our primary rating signal. The basic idea is to take a user's repositories and their various activities on them, and combine these in a linear model that yields a rating per (user, repository) pair. The seriousness factor depends on many attributes, some structured and some unstructured. For the structured data, the attributes below drive the seriousness factor (S); for each, we note whether the relation is directly or inversely proportional. We use the actual values for numeric attributes and a 0/1 encoding for the boolean variables.

1. S is directly proportional to the contributor count
2. S is inversely proportional to the days since the last push event
3. S is directly proportional to the size of the repository
4. S is directly proportional to the watcher count
5. S is directly proportional to the stargazer count
6. S is directly proportional to has_wiki
7. S is directly proportional to the fork count
8. S is directly proportional to open_issues
9. S is directly proportional to the subscriber count
10. S is inversely proportional to is_forked
11. S is directly proportional to the number of commits

Apart from the above factors, we have another very strong indicator: the user's commit logs on the repository under consideration. From all of a user's commit logs for a repository, we synthesize five confidence scores, numerical values that capture different aspects of confidence. The details of these five factors and how they are synthesized follow.


1. Total Length: the average normalized length of all the commits after preprocessing. If there are n commits and L(i) is the normalized length of commit i, then the average length is defined as (1/n) Σ_{i=1}^{n} L(i).

2. Structural Integrity Score: This checks the grammatical correctness of the commit message and finds the average number of errors in sentence formation, a strong signal of a user's seriousness about a log message. We use a library called "grammar_check" for this purpose, along with the same fuzzy string matching used in the bio synthesis.

3. Topic Relevance Score: We used concepts from topic modeling in this synthesis. According to a basic definition from Wikipedia, "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently." Using topic modeling, we extract topic clusters from the commit log, match them against a predefined dictionary such as ['fix', 'issue', 'implement', 'modify', 'changed', 'bug', 'error'] and against the repository description text, and compute the average match with those topics. We use this as another signal for the modeling, with "graphlab" handling the topic modeling.

4. Positivity Score: This is one of the most important attributes, and one that some previous research has also explored. Here we perform basic sentiment analysis on the commit logs to find the degree of positivity in each message. Sentiment analysis is a Machine Learning technique that uses predefined training corpora of positive and negative sentiment words and can be applied to a normalized, preprocessed text corpus to estimate its degree of positivity about a topic. This serves as another signature signal for the data. We use "graphlab" for this part as well.

5. Spelling Integrity Score: A somewhat less important factor that we still use in our modeling. It checks the correctness of spellings and finds the average number of errors. We suspect that when the spellings are wrong, the user is not very invested in the commit, though this may not hold at all times. We use "enchant.checker" for spell checking.

Having synthesized the above five attributes, we observe that all of them are directly proportional to the degree of seriousness, S. For the final step, the actual calculation, we use a linear combination of all the factors described above: 11 from the structured data and 5 newly synthesized from the unstructured text. We assign one weight to each factor and keep the weights as tweakable configuration parameters managed by the ConfigurationManager class. Mathematically, if attribute i has value V(i) and weight W(i), and we have n factors to


combine linearly (n = 16 in our case), the combined score is a variant of the weighted mean (the values are normalized beforehand, so no further normalization is done here):

Degree of Seriousness: S = Σ_{i=1}^{n} V(i) × W(i)

This is a very simple yet powerful aggregation, used in the preprocessing stages of many Machine Learning models. In a nutshell, we now have an S value for every (user, repository) pair, which gives us the much-needed user-item association matrix for our main recommendation engine. A sample matrix looks like this:

Figure 15: Sample Association Matrix
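A toy sketch of the weighted combination above. The factor names, weights, and values here are illustrative; the real weights are configuration parameters held by the ConfigurationManager, and only 6 of the 16 factors are shown.

```python
WEIGHTS = {
    "cont_count": 0.10, "watcher_count": 0.15, "stargazer_count": 0.15,
    "fork_count": 0.10, "commit_count": 0.20, "positivity": 0.30,
}

def seriousness(values, weights):
    """S = sum_i V(i) * W(i); assumes the values were normalized beforehand."""
    return sum(values[name] * w for name, w in weights.items())

values = {"cont_count": 0.5, "watcher_count": 0.2, "stargazer_count": 0.4,
          "fork_count": 0.0, "commit_count": 0.8, "positivity": 0.6}
print(round(seriousness(values, WEIGHTS), 3))  # 0.48
```

Inversely proportional factors, such as days since the last push, would enter this sum with a negative weight or an inverted value.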

After synthesizing the ratings, the next important issue was normalizing the values. The ratings column had a huge range, [-0.4, 2030113.3], and followed a power-law distribution.

Figure 16: The Power law distribution of the Ratings before Normalization.


We had to remove some outliers from the data to get the following normally distributed ratings column,

Figure 17: Normalized distribution.

We also mapped the whole range of ratings to the range [1, 10] using the following logic:

OldRange = OldMax - OldMin
if OldRange == 0:
    NewValue = NewMin
else:
    NewRange = NewMax - NewMin
    NewValue = ((OldValue - OldMin) * NewRange) / OldRange + NewMin
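The mapping logic above can be turned into a small runnable function (variable names follow the pseudocode; the [1, 10] defaults match the target range used in the report):

```python
# Linearly rescale a value from [old_min, old_max] to [new_min, new_max].
# Direct translation of the range-mapping pseudocode in the report.

def rescale(value, old_min, old_max, new_min=1.0, new_max=10.0):
    old_range = old_max - old_min
    if old_range == 0:
        return new_min  # degenerate input range: everything maps to new_min
    new_range = new_max - new_min
    return ((value - old_min) * new_range) / old_range + new_min

print(rescale(5.0, 0.0, 10.0))   # 5.5: the midpoint maps to the new midpoint
print(rescale(0.0, 0.0, 10.0))   # 1.0: the old minimum maps to the new minimum
```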

Preprocessing of Text Data

While working with the unstructured data, almost every processing step we used required some form of text preprocessing. In our implementation, the PreprocessManager class takes care of this. It has APIs for frequency counts and unique bag-of-words extraction. We also perform normalization, stop-word removal, and lemmatization of the text data, the details of which are omitted here for brevity. We used the well-known NLTK and scikit-learn Python libraries for all of our text-data preprocessing.
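The preprocessing steps just described (case normalization, stop-word removal, and frequency counts) can be illustrated with a small pure-Python stand-in. Note that this is only a sketch: the actual pipeline uses NLTK's stop-word lists and lemmatizers, while the tiny stop-word set and regex tokenizer below are simplified assumptions.

```python
# Simplified stand-in for the PreprocessManager steps. The stop-word list
# here is a toy; the real pipeline uses NLTK stop words and lemmatization.
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "for"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(text):
    """Frequency counts over the preprocessed tokens."""
    return Counter(preprocess(text))

print(preprocess("The GitHub Repositories"))   # ['github', 'repositories']
print(bag_of_words("repo repo user")["repo"])  # 2
```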

The Recommendation Engine

The recommendation algorithms are the core of the architecture supporting our models. Everything we have assembled so far is data preprocessing and synthesis, which makes the data usable so that machines can start mining it and extracting knowledge; the algorithms are what actually extract that knowledge. There are two basic kinds of recommender algorithms, described below.

1. Content-Based Algorithms: The basic idea is to represent the items in terms of their attribute vectors and then build a model per user over those items. In layman's terms, if you like a few items, you are very likely to like a similar item you have not yet seen. This works well when we are able to represent the items in terms of some of their properties. In our problem statement, the repositories are the items under consideration, and the repository vectors synthesized in the layer above are used for modelling. Mathematically, suppose a user u has repositories r1, r2, ..., rn with corresponding ratings p1, p2, ..., pn. We form a model that tries to predict the confidence p_new of an unseen repository r_new using the former data points. If every repository is expressed in terms of d attributes, i.e. a d-dimensional vector a1, a2, ..., ad, then the new prediction is a function of those attributes: p = f(a1, a2, ..., ad). Using this model, we can compute confidences for all the unseen repositories and return the top k as the recommendations from the model.

2. Collaborative Filtering Algorithms: These algorithms work on the assumption that latent factors underlie the decisions users make in their feedback. Using the implicit or explicit feedback signals synthesized for every user on their repositories in the layers above, we form a user-item association matrix and try to fill in its missing values using algorithms such as Alternating Least Squares, the details of which are beyond the scope of this report. These algorithms depend entirely on past behavior and ratings, not on the content of the items or the users. However, since our modeling uses the user and item contexts while synthesizing the implicit feedback signals, ours is not pure collaborative filtering as such.
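To make the content-based idea above concrete, here is a minimal self-contained sketch that scores unseen repositories by cosine similarity between their attribute vectors and a profile vector averaged from the repositories the user already rated highly. The vectors and repository names are hypothetical, and the project itself used graphlab's models rather than this code.

```python
# Toy content-based recommender: rank candidate repositories by cosine
# similarity to the average attribute vector of the user's liked repos.
# All vectors and names below are made-up illustrations.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def user_profile(liked_vectors):
    """Average the attribute vectors of the user's liked repositories."""
    n = len(liked_vectors)
    return [sum(col) / n for col in zip(*liked_vectors)]

def recommend(liked_vectors, candidates, k=2):
    """Return the top-k candidate repository names by profile similarity."""
    profile = user_profile(liked_vectors)
    ranked = sorted(candidates,
                    key=lambda name: cosine(profile, candidates[name]),
                    reverse=True)
    return ranked[:k]

liked = [[1.0, 0.0, 0.5], [0.8, 0.1, 0.7]]   # repos the user rated highly
candidates = {"repoA": [0.9, 0.0, 0.6],       # unseen repositories
              "repoB": [0.0, 1.0, 0.0],
              "repoC": [0.7, 0.2, 0.5]}
print(recommend(liked, candidates, k=2))      # ['repoA', 'repoC']
```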


Having described the algorithms in detail, it is now time to connect all the dots created so far and explain the full flow of the application. The following flowcharts show the training and the testing/recommendation phases in detail. A separate class in our code handles each phase, and we have named the code accordingly, so it is easy to follow the code along with the architecture. We also describe the steps briefly below.

Figure 18: High-Level Block Diagram of the Training Flow.


Figure 19: High-Level Overview of the Testing Phase.

In a nutshell, we first pull the data from the database. We then process the user and repository data and synthesize the rating for every user on every repository using the process described in detail above. The data is then passed to the recommendation systems to build the models. Once the models are built, we can query them with data from an incoming user. Recommendation systems have the inherent constraint of only being able to recommend items seen during the training phase, so as the data grows, we expect to see more relevant results going forward.

The Application Flow and Development

The presentation part of the project is based on Python Flask, HTML, and Twitter's Bootstrap. The user interface provides two different ways to search for recommendations:

1. The first way is to authenticate yourself using GitHub OAuth.
2. The second is to search with a specified username.

Once the user is authenticated or a valid username is entered, the web application sends an HTTP GET request to the model for recommendations. The response data is returned to the web page as a JSON array.
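The GET-request-and-JSON exchange described above can be sketched as follows. The endpoint path, query parameter, and response shape are assumptions for illustration; we parse a canned response here rather than contacting a live server.

```python
# Illustrative sketch of the web app's request/response exchange.
# Endpoint path and parameter names are assumptions, not the exact API.
import json
from urllib.parse import urlencode

def build_request_url(base, username):
    """Build the recommendation GET URL for a given GitHub username."""
    return base + "?" + urlencode({"user": username})

def parse_recommendations(body):
    """The model's response is a JSON array of recommended repositories."""
    return json.loads(body)

url = build_request_url("http://127.0.0.1:5000/recommend", "octocat")
sample_response = '["scikit-learn", "flask", "nltk"]'
print(url)                                   # http://127.0.0.1:5000/recommend?user=octocat
print(parse_recommendations(sample_response))
```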


The application flow is shown in the image below:

Figure 20: Application Flow

The different layers are:

Figure 21 : Application Flow Layers


Results

Recommendation systems have an inherent shortcoming: there are no truly good evaluation measures, because there is no way to know whether a predicted item is really wanted by the user. Often the user did not even know they needed the item in the first place, and sometimes newly recommended items start an entirely new trend. Nonetheless, such systems are commonly evaluated with basic measures such as accuracy, precision, and recall. Apart from the traditional measures, we will also analyze a few other interesting results we observed during our analysis. We primarily have two models: a collaborative filtering model and an item-content model. The following figures show the minimum and maximum accuracies obtained from both models, using a random train-test split produced by graphlab's splitters.
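For completeness, a precision-style measure for top-k recommendation lists can be sketched as below. This is a generic illustration with hypothetical item names, not the graphlab evaluators we actually used:

```python
# Precision@k: what fraction of the top-k recommended items are relevant?
# Generic illustration; item ids are hypothetical.

def precision_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of liked item ids."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

print(precision_at_k(["r1", "r2", "r3", "r4"], {"r2", "r4"}, k=2))  # 0.5
```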

Collaborative Filtering Model


Figure 22: Results of the Collaborative Filtering Model

Content Based Filtering Model

Figure 23: Results of the Content-Based Filtering Model


Though there were some inherent issues, we saw some genuinely interesting results during the optimization phase. The convex optimization process, judged by the MSE (Mean Squared Error), was able to converge to a minimum, with all the algorithms using regularization internally. We are therefore fairly confident that the modeling we attempt here gives meaningful results. We checked the recommendations from both algorithms we implemented, for both known and unknown users: known users are those who already have some ratings in the training set, while unknown users are completely new. One very interesting observation concerns the "cold start problem" that collaborative filtering is known to suffer from, namely that every brand-new user receives the same recommendations, and we saw exactly this in practice. For any user not in our training data, the collaborative filtering model returned identical recommendations. The same did not happen with the content-based model, which was entirely expected. Seeing in practice what we knew from theory was a genuinely new learning experience for us. Apart from this, we ran numerous other experiments whose details are omitted here for brevity; readers are encouraged to try our app, with the links and code details provided in the relevant sections.

Discussion

From the results we obtained, we have indeed learned a great deal about the internal dynamics of the GitHub social network and how it works. Even though we did not achieve tremendous success in the results, which we attribute mainly to data scarcity, we developed this system from scratch and it has given us new insight. We believe the pipeline we established will help the community build upon it and generate more concrete results with more robust feature synthesis. There are still issues to address, such as the scalability and performance of the recommendation system, which we did not have time to examine. Initially, we suspected that the available GitHub data might not be very informative, and for some features this turned out to be the case. For example, features such as contributor count and stargazer count individually followed power-law distributions, which led us to drop them in certain models because they were not information-rich to begin with. Even after using attribute-importance measures such as the Gini index and gain ratio, we had to apply our domain expertise and individual analysis of the distributions to fine-tune the results. Other issues, such as the mean-versus-median bias trade-off during aggregation, were also very prominent in almost all the phases in which we encountered them. Amid all this, we are excited about the whole architecture we have developed and would like to dig into it further in the future, given the opportunity. Finally, before wrapping up, we want to revisit the basic questions that drove us


throughout the project and formally point out our answers to them. They are described below:

● What do you want to know?
○ We want to know how the social dynamics of the GitHub network can help us generate repository recommendations for users.

● What problem or opportunity do you want to explore?
○ We want to explore the opportunities for improved collaboration between GitHub users through repository recommendations.

● What customer needs do you want to serve?
○ The need for suggestions that enable better social activity within the network.

● What capabilities do you want to test with the market?
○ We want to explore the possibilities of more collaborative software development in the open-source community.

● What new markets do you want to explore?
○ Beyond GitHub's own user base, we can go on to employ similar models for other open-source communities, such as Apache.

Conclusion and Future Work

Using the features of GitHub users and repositories to analyze social dynamics and make repository recommendations is a very fresh field of research, and we believe there are numerous possibilities in this application domain. In this project, we performed extensive feature aggregation and developed a pipeline that generates recommendations about the repositories users might want to work on. To our knowledge, this is a novel approach to this problem statement, and we hope it establishes a baseline for future work in the field. Many interesting follow-ups could start immediately from the work presented here. For example, a very similar approach could generate recommendations for people you might want to work with: a GitHub page that recommends people to follow and collaborate with, based on the repositories you own, would be a very useful feature. We could dig down even further and model people's coding styles from their code to recommend similar developers to them; one idea we initially considered for this was using attributed community graphs to create user clusters. Also, our repository recommendation system did not use the readme field of the repositories extensively; we believe that mining that field further will yield interesting insights as well. From a sample-size standpoint, the data we used is a very small chunk of the whole GitHub network, and we definitely look forward to finding out how our system performs with more data. Generating recommendations for GitHub is a relatively hard problem, for the reasons already discussed: the network itself was not designed for collaboration, so it helps very little in that regard. Currently, even the official GitHub website does not have any feature like


that. We believe that the results we obtained and the pipeline we established will help upcoming work in this area, and we eagerly look forward to finding out more about the prospects of our brainchild in the future.

Appendix

Code

The codebase for the project resides in the GitHub repository below (or scan the QR code that follows):

https://github.com/himangshunits/GitHub-Recommender

The code has four main modules:

1. The core recommendation engine
2. The web app providing the UI
3. The data collection scripts
4. The SQL scripts for RDBMS creation

The code uses the following libraries, so please install them on your system before starting to use it:

● fuzzywuzzy
● itertools
● unicodedata
● graphlab (a licensed library; you will need to create a student license)
● grammar_check
● pyenchant
● psycopg2
● pandas
● flask
● sklearn
● nltk


● nltk.corpus (may ship with the NLTK library itself)
● nltk.stem.snowball (may ship with the NLTK library itself)
● nltk.stem.wordnet (may ship with the NLTK library itself)
● requests
● datetime
● dateparser
● Flask-OAuthlib

The code also connects to our private Postgres DB for data extraction, so please change this in the code if you need to connect to a different server. The core engine has the following classes:

1. Driver.py: the main class, owned by the web-app server, which runs all the flows.
2. TrainFlowManager.py: manages the training phase; owned by the Driver class.
3. TestFlowManager.py: manages the test flow (the real-time recommendations); owned by the Driver class.
4. DatabaseConnector.py: manages all DB connectivity; owned separately by every component.
5. CommitLogAnalyzer.py: synthesizes the matrices for the commit logs; owned by TrainFlowManager.
6. BiographyAnalyzer.py: analyzes the bio fields of users.
7. LoggingManager.py: manages logging during the data synthesis part.
8. PreprocessManager.py: a class with static methods for preprocessing text data, using many NLP techniques.
9. ConfigurationManager.py: the config file for all custom parameters.
10. MainDriver.py: used for testing the core flows without the server present.
11. CustomExceptions.py: houses the custom exceptions used for error handling.
12. NewUserDataSynthesizer.py: one of the most important classes; generates the vector data for unseen test users.
13. Run.py: the entry point for the server serving user requests.
14. index.html: the HTML content and Bootstrap logic for the application flow, containing the home screen and the GitHub OAuth login details.
15. Repos.html: displays the repositories of a particular user.
16. Error.html: handles errors that occur when the model cannot find a particular user's information.
17. 404.html: handles 404 (file not found) errors in the application and displays the error message.

Instructions To Run:

1. Clone the repository to your machine.
2. Run run.py as "python run.py"; it will take some time to train the models.
3. Voila! Navigate to http://127.0.0.1:5000/


4. If you enter a new ID for a recommendation, it will take some time to synthesize the new user's data; the process can be tracked via the terminal logs.

Dictionaries

We used dictionaries only for the user-modeling part; all of them are in the directory called "bio_corpuses" in the source code tree. The rest of the dictionaries, for the sentiment analysis and topic modelling, were downloaded from the scikit-learn and GraphLab official repositories.

Tools

Described above in the code section.

References

1. Guzman, Emitza, David Azócar, and Yang Li. "Sentiment analysis of commit comments in GitHub: an empirical study." Proceedings of the 11th Working Conference on Mining Software Repositories . ACM, 2014.

2. Yu, Yue, et al. "Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment?" Information and Software Technology 74 (2016): 204-218.

3. http://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Fall2006/Assignments/editdistance/Levenshtein%20Distance.htm

4. Wikipedia.

Acknowledgement

An endeavour such as this one, in order to reach fruition, requires the contributions of a variety of sources, whose selflessness and painstaking efforts have enabled this undertaking to reach its best possible level. Dr. Michael Kowolenko has been of great support as both the course instructor and the project guide for the CSC 591 Data Driven Decision Making course. His input and helping hand were essential in creating the best possible version of the project. We take this opportunity to formally extend our gratitude and thank him for everything.


Application Snapshots

Home Screen

Login With Github


Search With Username

Recommendation Results
