Winning Kaggle 101: Mark Landry's Experience
TRANSCRIPT
H2O.ai Machine Intelligence
Competitive Data Science
Kaggle from a competitor’s view
Mark Landry, H2O
Competitive Data Scientist & Product Manager
Overview
• Personal background
• Iterative workflow
• Framing the problem
• Learning from other competitors
• Q&A
Background
Competitive data scientist & product manager, H2O
BS, computer science
Additional roles: data warehousing, BI, analytics
Preferred algorithm: GBM
Iterative Workflow
• Agile workflows generally outperform waterfall methodologies
• One of the most commonly cited insights from Kaggle employees regarding success
Iterative Workflow: Basics
• Work quickly to develop a reasonable model early
  o Model should be complete enough to gauge score, per competition setup
  o Simple models: understand how predicting the mean and the mode scores
  o Confirms understanding of the problem
  o Confirms validity of your internal loss calculation
• Enhance model iteratively
  o Explore and add features: additional data sets and/or transformations
  o Experiment with additional model classes
  o Experiment with hyperparameters within algorithm class
  o Ensemble
  o Validate enhancements via improvement from prior leading model
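The "simple models" step above can be sketched as follows. This is a minimal illustration, not the speaker's code: it assumes RMSE as the competition metric, and the function names (`rmse`, `constant_baselines`) and the toy targets are made up for the example. The idea is to score a constant prediction of the training mean and of the training mode with your own loss function, then check that the numbers look plausible before building anything more complex.

```python
from statistics import mean, mode


def rmse(actual, predicted):
    """Root mean squared error; stands in for whatever metric the competition uses."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


def constant_baselines(train_target, valid_target):
    """Score constant predictions (training mean, training mode) on a validation set."""
    scores = {}
    for name, stat in (("mean", mean), ("mode", mode)):
        constant = stat(train_target)
        scores[name] = rmse(valid_target, [constant] * len(valid_target))
    return scores


# Toy, made-up targets purely for illustration.
train_y = [0, 0, 1, 0, 1, 0, 0, 1]
valid_y = [0, 1, 0, 0]
baseline = constant_baselines(train_y, valid_y)
print(baseline)
```

If the locally computed baseline score differs noticeably from what the same constant submission scores on the leaderboard, that points to a misunderstanding of the problem or a bug in the internal loss calculation.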
Iterative Workflow: Benefits
• Allows the data to guide which modeling approach fits best
  o Availability and quality of data may not support complex modeling ideas
• Catch mistakes or incorrect assumptions early and clearly
  o If you observe no improvement after adding what you considered to be a vital feature, you know to immediately check the accuracy of the calculations and/or question how the model already captured that information
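One hedged way to make "no improvement" visible is to log every experiment against the current leading model. The sketch below is an assumption about how such a log might look, not a tool named in the talk; `ExperimentLog`, its fields, and the example scores are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ExperimentLog:
    """Tracks validation scores and flags changes that fail to beat the leading model."""
    lower_is_better: bool = True
    best_name: Optional[str] = None
    best_score: Optional[float] = None
    history: List[Tuple[str, float, bool]] = field(default_factory=list)

    def record(self, name: str, score: float) -> bool:
        """Store one experiment; return True only if it beats the prior leader."""
        if self.best_score is None:
            improved = True
        elif self.lower_is_better:
            improved = score < self.best_score
        else:
            improved = score > self.best_score
        self.history.append((name, score, improved))
        if improved:
            self.best_name, self.best_score = name, score
        return improved


# Made-up scores for illustration only.
log = ExperimentLog(lower_is_better=True)
log.record("baseline", 0.50)
log.record("+ feature A", 0.46)
if not log.record("+ feature B", 0.46):
    print("feature B did not help: check its calculation or whether the model already captures it")
```

A `False` return for a feature you expected to matter is the cue described above: verify the feature's calculation first, then ask whether existing features already carry the same information.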
Framing the Problem
• Have to make the data machine learning ready
  o 1 training file
  o 1 row per target
  o Features do not require additional methodology (e.g. text, images)
• Many Kaggle competitions arrive “ML-ready”
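As a rough sketch of what "1 row per target" can mean when the raw data is not ML-ready, the example below collapses a multi-row event log into a single training table. The field names (`user_id`, `amount`, `target`), the function `one_row_per_target`, and the data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def one_row_per_target(events, labels):
    """Aggregate many event rows per entity into exactly one feature row per label."""
    agg = defaultdict(lambda: {"n_events": 0, "total_amount": 0.0})
    for event in events:
        row = agg[event["user_id"]]
        row["n_events"] += 1
        row["total_amount"] += event["amount"]

    rows = []
    for user_id in sorted(labels):
        # Entities with no events still get a row, filled with zeros.
        feats = agg.get(user_id, {"n_events": 0, "total_amount": 0.0})
        rows.append({"user_id": user_id, **feats, "target": labels[user_id]})
    return rows


# Made-up event log and labels.
events = [
    {"user_id": 1, "amount": 5.0},
    {"user_id": 1, "amount": 2.5},
    {"user_id": 2, "amount": 1.0},
]
labels = {1: 1, 2: 0, 3: 0}
training_rows = one_row_per_target(events, labels)
for row in training_rows:
    print(row)
```

The output has one row per labeled entity regardless of how many raw rows it had, which is the shape most modeling libraries expect.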
Framing the Problem, 2
• My favorite competitions are those that are not ML-ready
  o Focuses more heavily on solving the data problem
  o More like solving a puzzle than tuning hyperparameters
Framing the Problem, 2
• Time permitting: brief intro to Avito
  o https://www.kaggle.com/c/avito-context-ad-clicks/
Learning from Kaggle
• Sharing during competition
  o Kaggle Scripts
  o Discussions on the forums
• Shared after the competition
  o Most often several of the top ranking competitors will share their methodology
  o Often a summary post, occasionally GitHub code
  o I find this the most valuable component of learning data science
Q & A