Introduction to R Package Recommendation System Competition
R Recommendation System Contest
John Myles White
March 10, 2011
John Myles White R Recommendation System Contest
Kaggle
Kaggle is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.
Kaggle Features
Kaggle provides every contest with:
- Centralized data downloads
- Public and private leaderboards using RMSE, AUC and other metrics
- Public discussion forums for participants to use
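As a rough illustration of the two leaderboard metrics named above, here is a minimal base-R sketch. The helper names `rmse` and `auc` and the toy vectors are assumptions made for this example; they are not Kaggle's scoring code.

```r
# Root mean squared error between observed values and predictions.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
# `labels` must be coded 0/1; tied scores receive average ranks.
auc <- function(labels, scores) {
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  r <- rank(scores)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data, made up for illustration.
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

rmse(labels, scores)
auc(labels, scores)  # 0.75
```

RMSE suits numeric targets such as ratings, while AUC scores how well a model ranks positive cases above negative ones, which fits a yes/no outcome such as "installed or not".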
Recent Kaggle Contests
- Tourism Forecasting
- Chess Ratings: Elo versus the Rest of the World
- INFORMS 2010: Short Term Stock Price Movements
Current and Upcoming Kaggle Contests
- Arabic Writer Identification
- Don’t Overfit: Dealing with Many Variables and Few Observations
- Heritage Health Prize
Advice on Running Kaggle Contests
- Stay involved: respond to forum posts quickly and make the contest seem alive
- Don’t use a prediction task where near-perfect accuracy can be achieved
Mistakes We Made
- Netflix Prize: 0.8616 RMSE
- R Recommendation Contest: 0.9882 AUC
The R Recommendation System Contest
- Contestants must be able to predict whether a user U will have a package P installed on their system
Full Data Set
- Outcomes: List of all packages installed on 52 R users’ systems
- Predictors: Metadata about 2485 CRAN packages
Metadata
- Dependencies
- Suggests
- Imports
- Views
- Core
- Recommended
- Maintainer
- Maintainer’s Package Count
Training Data / Test Data Split
- Uniform random split over rows in full data set
- Training Set: 99373 rows
- Test Set: 33125 rows
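The row counts above add up to 132498 rows, with about 75% assigned to training. A uniform random split of that kind could be sketched as follows. The 0.75 fraction is inferred from the row counts, and the use of `sample()`, the seed, and the stand-in `full.data` frame are assumptions for illustration, not the contest's actual splitting code.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in for the full data set (hypothetical; the real one has 132498 rows).
full.data <- data.frame(Row = seq_len(1000))

# Draw 75% of the row indices uniformly at random, without replacement.
n <- nrow(full.data)
train.idx <- sample(n, size = floor(0.75 * n))

# Rows that were drawn form the training set; the rest form the test set.
training.data <- full.data[train.idx, , drop = FALSE]
test.data <- full.data[-train.idx, , drop = FALSE]

nrow(training.data)  # 750
nrow(test.data)      # 250
```

Because the split is over rows rather than over users or packages, the same user and the same package can appear in both sets.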
Additional Metadata
- LDA topic assignments for CRAN packages
- Used 25 topics
- Used all documentation: manuals, vignettes, etc.
Example Models
1. Package Metadata
2. Package Metadata + Per User Intercepts
3. Package Metadata + Per User Intercepts + Package Topic Assignments
Example Model 1
library('ProjectTemplate')
try(load.project())
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage,
                 data = training.data,
                 family = binomial(link = 'logit'))
Example Model 2
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User),
                 data = training.data,
                 family = binomial(link = 'logit'))
Example Model 3
logit.fit <- glm(Installed ~ LogDependencyCount +
                   LogSuggestionCount +
                   LogImportCount +
                   LogViewsIncluding +
                   LogPackagesMaintaining +
                   CorePackage +
                   RecommendedPackage +
                   factor(User) +
                   Topic,
                 data = training.data,
                 family = binomial(link = 'logit'))
Model Performance
- Model 1: ~0.80 AUC
- Model 2: ~0.95 AUC
- Model 3: > 0.95 AUC
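To show how AUC figures like these can be obtained from a fitted logistic regression, and why per-user intercepts can help, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data are synthetic, only one predictor name is borrowed from the slides, and the `auc` helper is hand-rolled. It does not reproduce the contest numbers.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Rank-based AUC for 0/1 labels (hypothetical helper, not from the slides).
auc <- function(labels, scores) {
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(rank(scores)[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Simulated data: each user has their own baseline tendency to install packages.
n.users <- 20
n.rows <- 4000
sim <- data.frame(User = factor(sample(n.users, n.rows, replace = TRUE)),
                  LogDependencyCount = rnorm(n.rows))
user.effect <- rnorm(n.users, sd = 1)
p <- plogis(-1 + 0.5 * sim$LogDependencyCount + user.effect[as.integer(sim$User)])
sim$Installed <- rbinom(n.rows, size = 1, prob = p)

# Hold out a quarter of the rows for evaluation.
train.idx <- sample(n.rows, size = 3000)
train <- sim[train.idx, ]
test <- sim[-train.idx, ]

# Metadata-only model versus metadata plus per-user intercepts.
fit1 <- glm(Installed ~ LogDependencyCount,
            data = train, family = binomial(link = 'logit'))
fit2 <- glm(Installed ~ LogDependencyCount + factor(User),
            data = train, family = binomial(link = 'logit'))

# Score held-out rows with predicted probabilities and compare AUC.
auc.model1 <- auc(test$Installed, predict(fit1, newdata = test, type = 'response'))
auc.model2 <- auc(test$Installed, predict(fit2, newdata = test, type = 'response'))
c(model1 = auc.model1, model2 = auc.model2)
```

In this simulation the second model scores higher because `factor(User)` lets it absorb each user's baseline install rate, which the metadata-only model cannot see.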
Unexploited Structure in Data
Future Work
What makes a package useful?
- Need subjective ratings
- Some packages are only installed because they’re dependencies for other popular packages
Future Work
Get a better data sample:
- Contest only used data from 52 users
- But we do have complete data for those users
- But data was not a random sample of R users
Future Work
- Do more with LDA to categorize R packages
- Prediction task allows us to evaluate “quality” of topic count and topic assignments
Future Work
- Build up various package-package similarity matrices for conditional recommendations
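One possible package-package similarity matrix is cosine similarity computed from co-installation counts, sketched below. The slides do not say which similarity measures are intended, so cosine similarity is an assumption here, and the install pattern in the toy matrix is made up.

```r
# Toy install matrix: rows are users, columns are packages (1 = installed).
# The package names are real, but the install pattern is invented.
installs <- matrix(c(1, 1, 0,
                     1, 1, 0,
                     0, 1, 1,
                     0, 0, 1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0('user', 1:4),
                                   c('ggplot2', 'plyr', 'lme4')))

# Co-installation counts: entry (i, j) is the number of users with both packages.
co.counts <- crossprod(installs)

# Cosine similarity: divide by the square roots of each package's install count.
norms <- sqrt(diag(co.counts))
similarity <- co.counts / outer(norms, norms)

# Conditional recommendation: rank the other packages by similarity to ggplot2.
others <- colnames(similarity) != 'ggplot2'
sort(similarity['ggplot2', others], decreasing = TRUE)
```

Here `plyr` ranks first for a `ggplot2` user because both users who have `ggplot2` also have `plyr`, while no `ggplot2` user has `lme4`. Other matrices could be built the same way from dependency links or topic assignments instead of install data.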
Future Work
- Can we understand the clustering in the network structure graph?
Resources
For more information, see
- The original Dataists’ contest announcement
- GitHub project page