Introduction to R Package Recommendation System Competition

R Recommendation System Contest

John Myles White

March 10, 2011

Kaggle

Kaggle is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.

Kaggle Features

Kaggle provides every contest with:

- Centralized data downloads

- Public and private leaderboards using RMSE, AUC and other metrics

- Public discussion forums for participants to use
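As a rough illustration of the two leaderboard metrics named above, here is a minimal sketch in base R. The function names `rmse` and `auc` and the toy data are assumptions made for this example; they are not Kaggle's scoring code.

```r
# Root mean squared error between numeric predictions and observed values.
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored above a randomly chosen negative one.
auc <- function(scores, labels) {
  ranks <- rank(scores)  # average ranks, so tied scores count as half
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data, invented for illustration only.
labels <- c(1, 0, 1, 0, 1)
scores <- c(0.9, 0.2, 0.7, 0.6, 0.4)

rmse(scores, labels)
auc(scores, labels)  # 5 of the 6 positive/negative pairs are ordered correctly
```

Lower is better for RMSE, while an AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.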

Recent Kaggle Contests

- Tourism Forecasting

- Chess Ratings: Elo versus the Rest of the World

- INFORMS 2010: Short Term Stock Price Movements

Current and Upcoming Kaggle Contests

- Arabic Writer Identification

- Don’t Overfit: Dealing with Many Variables and Few Observations

- Heritage Health Prize

Advice on Running Kaggle Contests

- Stay involved: respond to forum posts quickly and make the contest seem alive

- Don’t use a prediction task where near perfect accuracy can be achieved

Mistakes We Made

- Netflix Prize: 0.8616 RMSE

- R Recommendation Contest: 0.9882 AUC

The R Recommendation System Contest

- Contestants must be able to predict whether a user U will have a package P installed on their system

Full Data Set

- Outcomes: List of all packages installed on 52 R users’ systems

- Predictors: Metadata about 2485 CRAN packages

Metadata

- Dependencies

- Suggests

- Imports

- Views

- Core

- Recommended

- Maintainer

- Maintainer’s Package Count

Training Data / Test Data Split

- Uniform random split over rows in full data set

- Training Set: 99373 rows

- Test Set: 33125 rows
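The row counts above work out to roughly a 75/25 split. A uniform random split over rows can be sketched in base R as follows; the tiny `full.data` frame, its column values, and the seed are hypothetical stand-ins, not the contest's actual data or code.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the full data set: one row per (User, Package) pair.
full.data <- expand.grid(User = 1:4, Package = c("a", "b", "c", "d", "e"))
full.data$Installed <- rbinom(nrow(full.data), size = 1, prob = 0.3)

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
training.rows <- sample(n, size = round(0.75 * n))

# Rows that were drawn form the training set; the remainder form the test set.
training.data <- full.data[training.rows, ]
test.data <- full.data[-training.rows, ]
```

Because the split is over rows rather than over users, every user can contribute rows to both sets.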

Additional Metadata

- LDA topic assignments for CRAN packages

- Used 25 topics

- Used all documentation: manuals, vignettes, etc.

Example Models

1. Package Metadata

2. Package Metadata + Per User Intercepts

3. Package Metadata + Per User Intercepts + Package Topic Assignments

Example Model 1

# Load the project's data sets via ProjectTemplate
library('ProjectTemplate')
try(load.project())

# Logistic regression on package metadata only
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))

Example Model 2

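# Package metadata plus one intercept per user
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))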

Example Model 3

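# Package metadata plus per-user intercepts plus the package's topic assignment
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User) +
                   Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))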

Model Performance

- Model 1: ~0.80 AUC

- Model 2: ~0.95 AUC

- Model 3: > 0.95 AUC
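To illustrate why per-user intercepts can raise held-out AUC, here is a minimal sketch on simulated data. Everything in it is invented for the example: the data frame `sim`, the single predictor, the effect sizes, the seed, and the helper `auc`. It does not reproduce the contest figures above.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the contest data; all values are invented.
n.users <- 20
n.packages <- 100
sim <- expand.grid(User = 1:n.users, Package = 1:n.packages)
user.effect <- rnorm(n.users, sd = 1.5)    # some users install far more than others
log.dependency.count <- rnorm(n.packages)  # one package-level predictor
sim$LogDependencyCount <- log.dependency.count[sim$Package]
sim$Installed <- rbinom(nrow(sim), size = 1,
                        prob = plogis(-1 + sim$LogDependencyCount +
                                        user.effect[sim$User]))

# Uniform random 75/25 split over rows.
train.rows <- sample(nrow(sim), size = round(0.75 * nrow(sim)))
train <- sim[train.rows, ]
test <- sim[-train.rows, ]

# Rank-based AUC (probability a positive row outscores a negative row).
auc <- function(scores, labels) {
  ranks <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Metadata only, then metadata plus one intercept per user.
fit.1 <- glm(Installed ~ LogDependencyCount,
             data = train, family = binomial(link = 'logit'))
fit.2 <- glm(Installed ~ LogDependencyCount + factor(User),
             data = train, family = binomial(link = 'logit'))

# Score both models on the held-out rows.
auc.1 <- auc(predict(fit.1, newdata = test, type = 'response'), test$Installed)
auc.2 <- auc(predict(fit.2, newdata = test, type = 'response'), test$Installed)
c(MetadataOnly = auc.1, WithUserIntercepts = auc.2)
```

In this simulation the user-level effect is built into the data, so the second model has more signal to recover; how much it helps on real data depends on how much users differ in the number of packages they install.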

Unexploited Structure in Data

Future Work

What makes a package useful?

- Need subjective ratings

- Some packages are only installed because they’re dependencies for other popular packages
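One way to start separating such installs is to flag installed packages that another installed package depends on. The sketch below is only a heuristic under stated assumptions: the package names, the `dependencies` table, and the `installed` vector are hypothetical, and a flagged package may still have been installed on purpose.

```r
# Hypothetical dependency edges: each row says Package depends on Dependency.
dependencies <- data.frame(
  Package = c("pkgA", "pkgA", "pkgC"),
  Dependency = c("pkgB", "pkgC", "pkgB"),
  stringsAsFactors = FALSE
)

# Hypothetical list of packages installed on one user's system.
installed <- c("pkgA", "pkgB", "pkgC", "pkgD")

# Dependencies pulled in by any package this user has installed.
needed.by.others <- unique(
  dependencies$Dependency[dependencies$Package %in% installed]
)

# Flag installed packages that some other installed package depends on.
flags <- data.frame(
  Package = installed,
  PossiblyDependencyOnly = installed %in% needed.by.others,
  stringsAsFactors = FALSE
)
flags
```

Here `pkgB` and `pkgC` are flagged, while `pkgA` and `pkgD` are not, since nothing installed depends on them.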

Future Work

Get a better data sample:

- Contest only used data from 52 users

- But we do have complete data for those users

- But data was not a random sample of R users

Future Work

- Do more with LDA to categorize R packages

- Prediction task allows us to evaluate “quality” of topic count and topic assignments

Future Work

- Build up various package-package similarity matrices for conditional recommendations
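As one possible instance of such a matrix, the sketch below computes Jaccard similarity between packages from a binary user-by-package install matrix. The choice of Jaccard similarity, the `installs` matrix, and the package names are assumptions for illustration; the slides do not specify a similarity measure.

```r
# Hypothetical install matrix: rows are users, columns are packages,
# and 1 means the package is installed on that user's system.
installs <- matrix(c(1, 1, 0,
                     1, 1, 0,
                     0, 1, 1,
                     0, 0, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("user", 1:4),
                                   c("pkgA", "pkgB", "pkgC")))

# Number of users who have both packages installed (co-install counts).
both <- crossprod(installs)

# Number of users who have at least one of the two packages installed.
counts <- diag(both)
either <- outer(counts, counts, "+") - both

# Jaccard similarity; a package with no installs at all would give 0/0 = NaN.
jaccard <- both / either

# Conditional recommendation: given that pkgA is installed,
# rank the remaining packages by their similarity to pkgA.
others <- setdiff(colnames(jaccard), "pkgA")
recommendations <- sort(jaccard["pkgA", others], decreasing = TRUE)
recommendations
```

Other measures, such as cosine similarity on the same matrix or similarity based on shared metadata, would slot into the same structure.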

Future Work

- Can we understand the clustering in the network structure graph?

Resources

For more information, see

- The original Dataists’ contest announcement

- GitHub project page
