Introduction to R Package Recommendation System Competition

Post on 20-Jun-2015

DESCRIPTION

John Myles White's Introduction to R Package Recommendation System Competition

TRANSCRIPT

R Recommendation System Contest

John Myles White

March 10, 2011

John Myles White R Recommendation System Contest

Kaggle

Kaggle is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.

Kaggle Features

Kaggle provides every contest with:

• Centralized data downloads

• Public and private leaderboards using RMSE, AUC and other metrics

• Public discussion forums for participants to use

Recent Kaggle Contests

• Tourism Forecasting

• Chess Ratings: Elo versus the Rest of the World

• INFORMS 2010: Short Term Stock Price Movements

Current and Upcoming Kaggle Contests

• Arabic Writer Identification

• Don’t Overfit: Dealing with Many Variables and Few Observations

• Heritage Health Prize

Advice on Running Kaggle Contests

• Stay involved: respond to forum posts quickly and make the contest seem alive

• Don’t use a prediction task where near-perfect accuracy can be achieved

Mistakes We Made

• Netflix Prize: 0.8616 RMSE

• R Recommendation Contest: 0.9882 AUC

The R Recommendation System Contest

• Contestants must be able to predict whether a user U will have a package P installed on their system

Full Data Set

• Outcomes: List of all packages installed on 52 R users’ systems

• Predictors: Metadata about 2485 CRAN packages

Metadata

• Dependencies

• Suggests

• Imports

• Views

• Core

• Recommended

• Maintainer

• Maintainer’s Package Count

Training Data / Test Data Split

• Uniform random split over rows in full data set

• Training Set: 99373 rows

• Test Set: 33125 rows
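
A uniform random split over rows can be sketched in base R as follows. This is a minimal illustration, not the contest's actual splitting script: the `full.data` frame, its column values, the seed, and the one-quarter test fraction (chosen to roughly match the row counts above) are all assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the full data set: one row per (user, package) pair.
full.data <- expand.grid(User = 1:5, Package = paste0("pkg", 1:20))
full.data$Installed <- rbinom(nrow(full.data), size = 1, prob = 0.3)

# Draw test rows uniformly at random, without replacement.
# Every row has the same chance of landing in the test set.
n.test <- round(0.25 * nrow(full.data))
test.rows <- sample(nrow(full.data), size = n.test)

# Everything not drawn for the test set becomes training data.
training.data <- full.data[-test.rows, ]
test.data <- full.data[test.rows, ]
```

Because the split is over rows rather than over users or packages, the same user and the same package will usually appear in both sets.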

Additional Metadata

• LDA topic assignments for CRAN packages

• Used 25 topics

• Used all documentation: manuals, vignettes, etc.

Example Models

1. Package Metadata

2. Package Metadata + Per User Intercepts

3. Package Metadata + Per User Intercepts + Package Topic Assignments

Example Model 1

library('ProjectTemplate')
try(load.project())

logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))

Example Model 2

logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))

Example Model 3

logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User) +
                   Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))

Model Performance

• Model 1: ∼ 0.80 AUC

• Model 2: ∼ 0.95 AUC

• Model 3: > 0.95 AUC
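
A comparison like the one above could be scored along these lines. This is a sketch on simulated data, so it will not reproduce the AUC figures listed here. The simulated data frame, the single predictor standing in for the package metadata, the effect sizes, and the helper `auc` are all assumptions made for the illustration. Only the `glm` call shape and the column names `Installed`, `LogDependencyCount`, and `User` come from the example models.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the contest data (all values invented).
n.users <- 20
n.packages <- 200
sim <- expand.grid(User = 1:n.users, Package = 1:n.packages)
package.feature <- rnorm(n.packages)       # plays the role of one metadata column
user.effect <- rnorm(n.users, sd = 1.5)    # some users install far more than others
sim$LogDependencyCount <- package.feature[sim$Package]
sim$Installed <- rbinom(nrow(sim), size = 1,
                        prob = plogis(-1 + sim$LogDependencyCount +
                                        user.effect[sim$User]))

# Uniform random split over rows: one quarter held out for testing.
test.rows <- sample(nrow(sim), size = nrow(sim) / 4)
training.data <- sim[-test.rows, ]
test.data <- sim[test.rows, ]

# Metadata only, then metadata plus per-user intercepts.
fit.1 <- glm(Installed ~ LogDependencyCount,
             data = training.data, family = binomial(link = 'logit'))
fit.2 <- glm(Installed ~ LogDependencyCount + factor(User),
             data = training.data, family = binomial(link = 'logit'))

# AUC via the rank-sum identity: the probability that a randomly chosen
# installed pair is scored above a randomly chosen non-installed pair.
auc <- function(labels, scores) {
  r <- rank(scores)  # average ranks handle ties
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

auc.1 <- auc(test.data$Installed,
             predict(fit.1, newdata = test.data, type = 'response'))
auc.2 <- auc(test.data$Installed,
             predict(fit.2, newdata = test.data, type = 'response'))
```

In this simulation the per-user intercepts help because users differ strongly in how many packages they install, which is one plausible reading of the jump from Model 1 to Model 2.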

Unexploited Structure in Data

Future Work

What makes a package useful?

• Need subjective ratings

• Some packages are only installed because they’re dependencies for other popular packages
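
One rough way to flag installs that may exist only as dependencies is to check, for each installed package, whether another installed package requires it. The sketch below uses a hand-written dependency map and a made-up user. The package names are used only as labels and the listed dependencies are invented for the illustration, not taken from CRAN.

```r
# Hypothetical dependency map: package -> packages it depends on.
depends.on <- list(
  ggplot2 = c("plyr", "reshape"),
  plyr    = character(0),
  reshape = c("plyr"),
  zoo     = character(0)
)

# Packages found on one hypothetical user's system.
installed <- c("ggplot2", "plyr", "reshape", "zoo")

# Everything that some installed package requires.
required.by.others <- unique(unlist(depends.on[installed]))

# Installed packages that another installed package pulls in:
# these may or may not have been chosen deliberately.
possibly.incidental <- intersect(installed, required.by.others)

# Installed packages that nothing else on the system requires.
plausibly.chosen <- setdiff(installed, required.by.others)
```

This only separates the two groups. It cannot tell whether a user would have installed `plyr` anyway, which is why subjective ratings are still needed.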

Future Work

Get a better data sample:

• Contest only used data from 52 users

• But we do have complete data for those users

• But data was not a random sample of R users

Future Work

• Do more with LDA to categorize R packages

• Prediction task allows us to evaluate “quality” of topic count and topic assignments

Future Work

• Build up various package-package similarity matrices for conditional recommendations
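
One possible package-package similarity matrix is cosine similarity over co-installation, sketched below. The installation matrix is a toy example with invented values, the package names are used only as labels, and the `recommend` helper is hypothetical. Other similarity measures, such as Jaccard or shared topics, would fit the same pattern.

```r
# Toy user-by-package installation matrix (1 = installed); values invented.
installed <- matrix(c(1, 1, 0, 0,
                      1, 1, 1, 0,
                      0, 1, 1, 0,
                      1, 0, 0, 1),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("user", 1:4),
                                    c("ggplot2", "plyr", "reshape", "zoo")))

# Co-installation counts: entry (i, j) is the number of users with both packages.
co.counts <- crossprod(installed)

# Cosine similarity between package columns.
norms <- sqrt(diag(co.counts))
similarity <- co.counts / outer(norms, norms)

# Conditional recommendation: given one installed package,
# return the k other packages most similar to it.
recommend <- function(package, k = 2) {
  scores <- similarity[package, colnames(similarity) != package]
  names(sort(scores, decreasing = TRUE))[seq_len(k)]
}

top.for.ggplot2 <- recommend("ggplot2")
```

With only 52 users, any such matrix would be noisy, so the similarity scores are better read as a ranking aid than as estimates.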

Future Work

• Can we understand the clustering in the network structure graph?

Resources

For more information, see

• The original Dataists’ contest announcement

• GitHub project page
