Demographics and Weblog Hackathon – Case Study
![Page 1: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/1.jpg)
Demographics and Weblog Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate.
Learn by Doing
![Page 2: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/2.jpg)
http://www.meetup.com/HandsOnProgrammingEvents/
![Page 3: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/3.jpg)
Data Mining Hackathon
![Page 4: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/4.jpg)
Funded by Rapleaf
• With Motley Fool’s data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; R ran on individual PCs
• Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free
![Page 5: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/5.jpg)
Getting more subscribers
![Page 6: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/6.jpg)
Headline Data, Weblog
![Page 7: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/7.jpg)
Demographics
![Page 8: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/8.jpg)
Cleaning Data
• training.csv (201,000), headlines.tsv (811MB), entry.tsv (100k), demographics.tsv
• Feature Engineering
• Github:
![Page 9: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/9.jpg)
Ensemble Methods
• Bagging, Boosting, randomForests
• Overfitting
• Stability (small changes make large prediction changes)
• Previously, none of these worked at scale
• Small-scale results using R; large-scale versions exist in proprietary implementations (Google, Amazon, etc.)
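To make the bagging and stability points concrete, here is a minimal sketch in Python (the hackathon itself used R). It is an illustration only: the one-feature toy data, the decision-stump learner, and the names `fit_stump` and `bagged_stumps` are assumptions made for this example, not the hackathon's code. Each stump is trained on a bootstrap resample, and the majority vote smooths out the instability of any single stump.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-feature decision stump: choose the threshold and direction
    with the fewest training errors (ties go to the first candidate found)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [1 if sign * (x - t) >= 0 else 0 for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) >= 0 else 0

def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Bagging: train each stump on a bootstrap resample (drawn with
    replacement), then predict by majority vote over all stumps."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = sum(m(x) for m in models)
        return 1 if 2 * votes > len(models) else 0

    return predict

# Hypothetical toy data: one noisy feature, binary label (1 = subscriber).
xs = [0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
predict = bagged_stumps(xs, ys)
```

A single stump's threshold can shift when one or two training rows change; the vote across resamples is what reduces that sensitivity.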
![Page 10: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/10.jpg)
ROC Curves
Binary Classifier Only!
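As a rough sketch of how a ROC curve is built for a binary classifier, the Python below sweeps a threshold over the predicted scores and records the false-positive and true-positive rates at each step. The labels, scores, and function names are hypothetical and chosen only for illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per
    distinct score threshold, starting at (0, 0) and ending at (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("a ROC curve needs both classes present")
    # Visit examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical data: 1 = subscriber, 0 = non-subscriber.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]
curve = roc_points(labels, scores)
area = auc_trapezoid(curve)
```

The construction relies on exactly two classes, positives and negatives, which is why the curve applies to binary classifiers only.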
![Page 11: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/11.jpg)
Paid Subscriber ROC curve, ~61%
![Page 12: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/12.jpg)
Boosted Regression Trees Performance
• training data ROC score = 0.745
• cv ROC score = 0.737; se = 0.002
• 5.5% less performance than the winning score, without doing any data processing
• Random is 50%, or 0.50. At 0.737 we are 0.737 - 0.50 = 0.237 above random, i.e. 23.7 percentage points better
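The "better than random" arithmetic can be written out as a small Python sketch. The helper names are assumptions for illustration. The Gini coefficient (2 × AUC − 1) is added here as a common rescaling of the same quantity; it is not something the slides report.

```python
RANDOM_AUC = 0.50  # ROC score of a classifier that guesses at random

def points_above_random(auc):
    """Percentage points by which a ROC score exceeds random guessing."""
    return round((auc - RANDOM_AUC) * 100, 2)

def gini(auc):
    """Gini coefficient: rescales AUC so random is 0 and perfect is 1."""
    return round(2 * auc - 1, 4)

# ROC scores quoted on the slides.
roc_scores = {
    "paid subscriber curve": 0.61,
    "weblogs only": 0.6815,
    "boosted regression trees (cv)": 0.737,
}

lift = {name: points_above_random(auc) for name, auc in roc_scores.items()}
brt_gini = gini(roc_scores["boosted regression trees (cv)"])
```

Under this reading, "23.7%" means 23.7 percentage points above the 0.50 baseline, not a 23.7% relative improvement.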
![Page 13: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/13.jpg)
Contribution of predictor variables
![Page 14: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/14.jpg)
Predictive Importance
• Friedman: number of times a variable is selected for splitting, weighted by squared error or improvement to the model. Measure of sparsity in data
• Fit plots remove averages of model variables
• 1 pageV 74.0567852
• 2 loc 11.0801383
• 3 income 4.1565597
• 4 age 3.1426519
• 5 residlen 3.0813927
• 6 home 2.3308287
• 7 marital 0.6560258
• 8 sex 0.6476549
• 9 prop 0.3817017
• 10 child 0.2632598
• 11 own 0.2030012
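The importance values can be checked with a few lines of Python. The numbers are copied from the slide. The grouping is an assumption made for this sketch: `pageV` is treated as the weblog (behavioral) feature and every other variable as coming from the demographics file.

```python
# Relative importance values as listed on the slide.
importance = {
    "pageV": 74.0567852,
    "loc": 11.0801383,
    "income": 4.1565597,
    "age": 3.1426519,
    "residlen": 3.0813927,
    "home": 2.3308287,
    "marital": 0.6560258,
    "sex": 0.6476549,
    "prop": 0.3817017,
    "child": 0.2632598,
    "own": 0.2030012,
}

# The values are scaled to sum to 100, so each can be read as a percentage.
total = sum(importance.values())

# Assumption: pageV is the only weblog-derived variable in this model.
behavioral_share = importance["pageV"] / total
demographic_share = 1 - behavioral_share

# Variables ranked from most to least important.
ranked = sorted(importance, key=importance.get, reverse=True)
```

With that grouping, the single behavioral variable carries roughly three quarters of the total importance, which matches the deck's later point that weblogs are the best source.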
![Page 15: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/15.jpg)
Behavioral vs. Demographics
• Demographics are sparse
• Behavioral weblogs are the best source. Most sites aren’t using this information correctly. There is no single correct answer. Trial and error on features. The features are more important than the algorithm
• Linear vs. Nonlinear
![Page 16: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/16.jpg)
Fitted Values (Crappy)
![Page 17: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/17.jpg)
Fitted Values Better
![Page 18: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/18.jpg)
Predictor Variable Interaction
• Adjusting variable interactions
![Page 19: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/19.jpg)
Variable Interactions
![Page 20: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/20.jpg)
Plot Interactions age, loc
![Page 21: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/21.jpg)
Trees vs. other methods
• Can see multiple levels, which is good for trees. Do other variables match this? Simplify the model or add more features. Iterate to a better model
• No Math. Analyst
![Page 22: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/22.jpg)
Number of Trees
![Page 23: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/23.jpg)
Data Set Number of Trees
![Page 24: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/24.jpg)
Hackathon Results
![Page 25: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/25.jpg)
Weblogs only 68.15%, 18% better than random
![Page 26: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/26.jpg)
Demographics add 1%
![Page 27: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/27.jpg)
AWS Advantages
• Running multiple instances with different algorithms and parameters using R
• Add tutorial, install Screen, R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
![Page 28: Demographics and Weblog Hackathon – Case Study](https://reader036.vdocuments.us/reader036/viewer/2022062323/56816688550346895dda3f6b/html5/thumbnails/28.jpg)
Conclusion
• Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing.
• Tuning using visualization. Tune 3 parameters: tc (tree complexity), lr (learning rate), and number of trees. Didn’t cover 2/3.
• This isn’t reproducible in Hadoop/Mahout or any open source code I know of
• Other use cases, e.g. predicting which item will sell (eBay), search engine ranking.
• Careful with MR paradigms: Hadoop MR != Couchbase MR