analyzing stack overflow - problem

Assignment 6

Analyzing response time in Q&A websites

Question and Answer (Q&A) sites such as StackOverflow, Yahoo! Answers, Naver, Quora, LiveQnA, and WikiAnswers are becoming increasingly popular with the growth of the Web. These are large collaborative production and social computing platforms that crowdsource knowledge by allowing users to post and answer questions. They not only give experts a platform to share their knowledge and gain recognition but also help novice users solve their problems effectively. StackOverflow is one such community-driven Q&A website, used by more than a million software developers who post and answer questions related to computer programming. It is governed by a reputation system that rewards users with reputation points, badges, extra privileges on the website, etc., based on the usefulness of their posts. The usefulness of a question or an answer is largely determined by the number of votes it receives.

In such a crowdsourced system driven by a reputation mechanism, the time a question takes to receive its first answer plays an important role and largely determines the popularity of the website. People who post questions want to know when they can expect a response. In this assignment, we want to investigate whether, among several other factors, the tags of a question have a strong correlation with its response time. Tagging involves askers selecting appropriate keywords (e.g., android, jquery, c#) to broadly identify the domains to which their questions relate. There are also mechanisms by which other users can subscribe to tags, search via tags, mark tags as favorites, etc. As a result, tags should play a crucial role in how questions are answered and hence in determining their response time.

Input Dataset:

http://gaming.stackexchange.com/ (dataset: https://archive.org/download/stackexchange/gaming.stackexchange.com.7z) is a sister site of StackOverflow where questions related to gaming are discussed. We have attached the data dump of the website up to 26th September, 2014. Download and unzip the dataset and you will find the following files:

Badges.xml Comments.xml PostHistory.xml PostLinks.xml Posts.xml Tags.xml Users.xml Votes.xml


Information about all the posts (questions and answers) and tags can be found in the “Posts.xml” and “Tags.xml” files respectively. Examples from each of the files are given below.

Typical Question

<row Id="7" PostTypeId="1" AcceptedAnswerId="10" CreationDate="2014-05-14T00:11:06.457" Score="1" ViewCount="185" Body="As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.
" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="<education><open-source>" AnswerCount="3" CommentCount="4" FavoriteCount="1" ClosedDate="2014-05-14T08:40:54.950" ></row>

Typical Answer

<row Id="10" PostTypeId="2" ParentId="7" CreationDate="2014-05-14T00:53:43.273" Score="8" Body="One book that's freely available is "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman (published by Springer): <a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/">see Tibshirani's website</a>.

Another fantastic source, although it isn't a book, is Andrew Ng's Machine Learning course on Coursera. This has a much more applied-focus than the above book, and Prof. Ng does a great job of explaining the thinking behind several different machine learning algorithms/situations.
" OwnerUserId="22" LastActivityDate="2014-05-14T00:53:43.273" CommentCount="1" />

Typical Tag

<row Id="3" TagName="bigdata" Count="46" ExcerptPostId="66" WikiPostId="65" />

Output Deliverables:

A. Feature Calculation
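Rows like the ones shown above can be read with the StAX streaming parser that ships with the JDK. The following is a minimal sketch, not a reference solution. It assumes that the response time of a question is the gap between its CreationDate and the CreationDate of its earliest answer, that CreationDate values are ISO timestamps of the form 2014-05-14T00:11:06.457, and that unanswered questions are simply skipped. The class name ResponseTimes and the method compute are hypothetical.

```java
import java.io.Reader;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: streams Posts.xml and derives first-answer response times. */
class ResponseTimes {

    /** Returns question Id -> whole seconds until its earliest answer; unanswered questions are omitted. */
    static Map<Integer, Long> compute(Reader postsXml) {
        Map<Integer, LocalDateTime> asked = new HashMap<>();       // question Id -> CreationDate
        Map<Integer, LocalDateTime> firstAnswer = new HashMap<>(); // question Id -> earliest answer date
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(postsXml);
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT || !"row".equals(xml.getLocalName())) {
                    continue;
                }
                String type = xml.getAttributeValue(null, "PostTypeId");
                if ("1".equals(type)) {        // PostTypeId 1 = question
                    int id = Integer.parseInt(xml.getAttributeValue(null, "Id"));
                    asked.put(id, LocalDateTime.parse(xml.getAttributeValue(null, "CreationDate")));
                } else if ("2".equals(type)) { // PostTypeId 2 = answer
                    int parent = Integer.parseInt(xml.getAttributeValue(null, "ParentId"));
                    LocalDateTime created = LocalDateTime.parse(xml.getAttributeValue(null, "CreationDate"));
                    // Keep only the earliest answer seen so far for this question.
                    firstAnswer.merge(parent, created, (a, b) -> a.isBefore(b) ? a : b);
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Posts.xml is not well-formed", e);
        }

        Map<Integer, Long> seconds = new HashMap<>();
        for (Map.Entry<Integer, LocalDateTime> q : asked.entrySet()) {
            LocalDateTime answered = firstAnswer.get(q.getKey());
            if (answered != null) {
                seconds.put(q.getKey(), Duration.between(q.getValue(), answered).getSeconds());
            }
        }
        return seconds;
    }
}
```

A call such as ResponseTimes.compute(new FileReader("Posts.xml")) would return one entry per answered question. Streaming avoids loading the whole dump into memory, which matters more for larger sites than for this one.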

You should use Java to parse these XML files and, for each question, calculate the response time and the following tag-based features:

1. tag_popularity: We define the popularity of a tag t as its frequency, i.e., the number of questions that contain t as one of their tags. For each question, you should compute the average popularity of all its tags.

2. num_pop_tags: We consider a tag to be popular if its frequency is more than 20. Here you should count the number of popular tags each question contains. There will be at most 6 boxes in the box plot for this feature, since each question can have at most 5 tags and the count therefore ranges from 0 to 5.

3. num_subs_ans: We define an “active subscriber” of a tag t to be a user who has posted “sufficient” answers in the “recent past” to questions containing t. A user has posted “sufficient” answers when the number of their answers is greater than 5, and by “recent past” we mean answers posted after 7th Jan 2014. After computing the number of active subscribers for every tag, you should compute, for each question, the average number of active subscribers over its tags.

4. percent_subs_ans: For each tag, you should also compute the ratio of the number of “active subscribers” to the total number of subscribers, where the total number of subscribers is the number of users who have posted at least one answer to a question containing that tag. After computing the ratio for every tag, you should compute, for each question, the average ratio over its tags.
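The four definitions above can be sketched as a small accumulator that is fed every question and answer once and then queried per question. This is only an illustration under stated assumptions: the class TagFeatures and its methods are hypothetical names, “after 7th Jan 2014” is read as strictly later than 2014-01-07T00:00, and a tag that nobody has answered contributes a ratio of 0 to percent_subs_ans.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical accumulator for the four tag-based features. */
class TagFeatures {
    // Assumption: "after 7th Jan 2014" means strictly later than 2014-01-07T00:00.
    static final LocalDateTime RECENT_AFTER = LocalDateTime.of(2014, 1, 7, 0, 0);

    private final Map<String, Integer> popularity = new HashMap<>();           // tag -> number of questions
    private final Map<String, Set<Integer>> answerers = new HashMap<>();       // tag -> users with >= 1 answer
    private final Map<String, Map<Integer, Integer>> recent = new HashMap<>(); // tag -> user -> recent answers

    /** Splits a Tags attribute such as "<android><jquery>" into tag names. */
    static List<String> parseTags(String attr) {
        if (attr == null || attr.length() < 3) {
            return new ArrayList<>();
        }
        return Arrays.asList(attr.substring(1, attr.length() - 1).split("><"));
    }

    /** Call once per question. */
    void addQuestion(List<String> tags) {
        for (String t : tags) {
            popularity.merge(t, 1, Integer::sum);
        }
    }

    /** Call once per answer, passing the tags of the question it answers. */
    void addAnswer(List<String> questionTags, int userId, LocalDateTime created) {
        for (String t : questionTags) {
            answerers.computeIfAbsent(t, k -> new HashSet<>()).add(userId);
            if (created.isAfter(RECENT_AFTER)) {
                recent.computeIfAbsent(t, k -> new HashMap<>()).merge(userId, 1, Integer::sum);
            }
        }
    }

    /** Active subscribers of a tag: users with more than 5 recent answers on it. */
    int activeSubscribers(String tag) {
        Map<Integer, Integer> perUser = recent.get(tag);
        if (perUser == null) {
            return 0;
        }
        int n = 0;
        for (int count : perUser.values()) {
            if (count > 5) {
                n++;
            }
        }
        return n;
    }

    /** Returns {tag_popularity, num_pop_tags, num_subs_ans, percent_subs_ans} for one question. */
    double[] features(List<String> tags) {
        double pop = 0, popular = 0, subs = 0, ratio = 0;
        for (String t : tags) {
            Integer freq = popularity.get(t);
            Set<Integer> users = answerers.get(t);
            int active = activeSubscribers(t);
            if (freq != null) {
                pop += freq;
                if (freq > 20) {
                    popular++;
                }
            }
            subs += active;
            if (users != null && !users.isEmpty()) {
                ratio += (double) active / users.size();
            }
        }
        int n = Math.max(1, tags.size()); // avoid dividing by zero for untagged questions
        return new double[] {pop / n, popular, subs / n, ratio / n};
    }
}
```

Because an answer row carries only a ParentId, the tags of each question have to be known before its answers are added, so one pass over the questions followed by one pass over the answers is the simplest ordering.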

B. Feature Analysis

To analyze the question features and their correlation with response time, you should construct plots of the response time against the values of different features. You should distribute the feature values into ten equal bins and then use gnuplot to produce the following two plots:

1. Box plots that capture the median and the 25th and 75th percentiles of the response time distributions, as well as the minimum and maximum values, and

2. Cumulative distribution function (CDF) plots of the response time.
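The numbers behind both plots can be prepared in Java before handing them to gnuplot. The sketch below is one possible reading of the task, not a prescribed method: it assumes “ten equal bins” means ten equal-width intervals between the smallest and largest feature value, and it uses linear interpolation between neighbouring sorted values for the percentiles. The class ResponseTimePlots and its method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper that prepares the numbers behind the box and CDF plots. */
class ResponseTimePlots {

    /** Index (0 .. bins-1) of the equal-width bin between min and max that holds value. */
    static int binIndex(double value, double min, double max, int bins) {
        if (max <= min) {
            return 0; // all values identical: a single bin
        }
        int i = (int) ((value - min) / (max - min) * bins);
        return Math.min(i, bins - 1); // the maximum itself falls in the last bin
    }

    /** Returns {min, 25th percentile, median, 75th percentile, max} of a non-empty array. */
    static double[] fiveNumberSummary(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        return new double[] {s[0], quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75), s[s.length - 1]};
    }

    /** Quantile of an already sorted array, interpolating linearly between neighbours. */
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Empirical CDF: row i holds the i-th smallest value and the cumulative fraction (i + 1) / n. */
    static double[][] empiricalCdf(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        double[][] points = new double[s.length][2];
        for (int i = 0; i < s.length; i++) {
            points[i][0] = s[i];
            points[i][1] = (i + 1) / (double) s.length;
        }
        return points;
    }
}
```

One way to use it is to group the response times of all questions by binIndex of the feature value, compute fiveNumberSummary per group, and write one whitespace-separated row per bin to a text file that gnuplot reads as columns. The empiricalCdf points can be written out the same way.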
