Winning Kaggle 101: Mark Landry's Experience

H2O.ai Machine Intelligence. Competitive Data Science: Kaggle from a competitor’s view. Mark Landry, H2O Competitive Data Scientist & Product Manager

Upload: ted-xiao

Post on 13-Feb-2017

Category: Data & Analytics


TRANSCRIPT

Page 1: Winning Kaggle 101: Mark Landry's Experience

H2O.ai Machine Intelligence

Competitive Data Science

Kaggle from a competitor’s view

Mark Landry, H2O Competitive Data Scientist & Product Manager

Page 2: Winning Kaggle 101: Mark Landry's Experience

Overview
• Personal background
• Iterative workflow
• Framing the problem
• Learning from other competitors
• Q&A

Page 3: Winning Kaggle 101: Mark Landry's Experience

Background

Competitive data scientist & product manager, H2O

BS, computer science

Additional roles: data warehousing, BI, analytics

Preferred algorithm: GBM

Page 4: Winning Kaggle 101: Mark Landry's Experience

Iterative Workflow
• Agile workflows generally outperform waterfall methodologies
• One of the most commonly cited insights from Kaggle employees regarding success

Page 5: Winning Kaggle 101: Mark Landry's Experience

Iterative Workflow: Basics
• Work quickly to develop a reasonable model early
  o Model should be complete enough to gauge score, per competition setup
  o Simple models: understand how the mean and mode score
  o Confirms understanding of the problem
  o Confirms validity of your internal loss calculation
• Enhance model iteratively
  o Explore and add features: additional data sets and/or transformations
  o Experiment with additional model classes
  o Experiment with hyperparameters within algorithm class
  o Ensemble
  o Validate enhancements via improvement from prior leading model
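
The "mean and mode" baselines on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the toy data, the choice of RMSE and accuracy as metrics, and all function names are assumptions made for the example. The idea is to score a constant prediction with a hand-written metric, so that the internal loss calculation can be compared against the competition's reported score before any real modeling starts.

```python
import math
from collections import Counter
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error, written by hand to mirror a competition metric."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def accuracy(y_true, y_pred):
    """Fraction of exact matches between labels and predictions."""
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


def mean_baseline(y_train, n):
    """Predict the training mean for every one of n rows (regression)."""
    return [mean(y_train)] * n


def mode_baseline(y_train, n):
    """Predict the most frequent training label for every one of n rows."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


# Hypothetical toy data, invented for illustration only.
y_train_reg, y_valid_reg = [3.0, 5.0, 4.0, 6.0], [4.0, 5.0]
y_train_cls, y_valid_cls = ["a", "b", "a", "a"], ["a", "b", "a"]

baseline_rmse = rmse(y_valid_reg, mean_baseline(y_train_reg, len(y_valid_reg)))
baseline_acc = accuracy(y_valid_cls, mode_baseline(y_train_cls, len(y_valid_cls)))

print(f"mean baseline RMSE: {baseline_rmse:.3f}")
print(f"mode baseline accuracy: {baseline_acc:.3f}")
```

Any later model that cannot beat these numbers on the same validation split points to a problem in the pipeline rather than in the algorithm.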

Page 6: Winning Kaggle 101: Mark Landry's Experience

Iterative Workflow: Benefits
• Allows the data to guide which modeling approach fits best
  o Availability and quality of data may not support complex modeling ideas
• Catch mistakes or incorrect assumptions early and clearly
  o If you observe no improvement after adding what you considered to be a vital feature, you know to immediately check the accuracy of the calculations and/or question how the model already captured that information
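
One way to act on the "no improvement" signal is sketched below. It is an illustrative helper, not something shown in the talk: the function names, the thresholds, and the use of Pearson correlation as a stand-in for "the model already captured that information" are all assumptions. The sketch assumes a lower score is better and that features are plain lists of numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def diagnose_feature(leader_score, candidate_score, new_feature,
                     existing_features, min_gain=1e-4,
                     redundancy_threshold=0.95):
    """Explain why a new feature did or did not beat the prior leading model.

    Assumes lower scores are better. Thresholds are arbitrary illustrative
    defaults, not recommendations from the talk.
    """
    if leader_score - candidate_score >= min_gain:
        return "improved"
    if len(set(new_feature)) <= 1:
        return "check calculation: feature is constant"
    for name, column in existing_features.items():
        if abs(pearson(new_feature, column)) >= redundancy_threshold:
            return f"likely redundant with {name}"
    return "check calculation: no gain and no obvious duplicate"


# Hypothetical example: the new feature is just an existing one doubled.
existing = {"price": [1.0, 2.0, 3.0, 4.0]}
print(diagnose_feature(0.50, 0.50, [2.0, 4.0, 6.0, 8.0], existing))
```

Correlation only catches linear duplicates. A tree-based model can recover a feature from combinations of other columns, so an empty result here still calls for checking the calculation by hand.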

Page 7: Winning Kaggle 101: Mark Landry's Experience

Framing the Problem
• Have to make the data machine learning ready
  o 1 training file
  o 1 row per target
  o Features do not require additional methodology (e.g. text, images)
• Many Kaggle competitions arrive “ML-ready”
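
As a rough illustration of the "1 row per target" requirement, the sketch below collapses several raw rows per entity into a single training row. The data, the column names (customer_id, amount), and the chosen aggregates are hypothetical and were invented for this example; the slide does not prescribe any particular aggregation.

```python
from collections import defaultdict

# Hypothetical raw data: several transaction rows per customer,
# plus one target label per customer.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]
targets = {1: 1, 2: 0, 3: 0}  # customer 3 has no transactions at all


def build_training_rows(transactions, targets):
    """Aggregate raw rows so that each target gets exactly one feature row."""
    amounts_by_customer = defaultdict(list)
    for row in transactions:
        amounts_by_customer[row["customer_id"]].append(row["amount"])

    rows = []
    for customer_id in sorted(targets):
        amounts = amounts_by_customer.get(customer_id, [])
        rows.append({
            "customer_id": customer_id,
            "n_transactions": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts, default=0.0),
            "target": targets[customer_id],
        })
    return rows


training_rows = build_training_rows(transactions, targets)
for row in training_rows:
    print(row)
```

Iterating over the targets rather than over the raw rows keeps entities with no raw records in the training file, which is an easy case to lose when joining in the other direction.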

Page 8: Winning Kaggle 101: Mark Landry's Experience

Framing the Problem, 2
• My favorite competitions are those that are not ML-ready
  o Focus more heavily on solving the data problem
  o More like solving a puzzle than tuning hyperparameters

Page 10: Winning Kaggle 101: Mark Landry's Experience

Learning from Kaggle
• Sharing during competition
  o Kaggle Scripts
  o Discussions on the forums
• Shared after the competition
  o Most often several of the top ranking competitors will share their methodology
  o Often a summary post, occasionally GitHub code
  o I find this the most valuable component of learning data science

Page 11: Winning Kaggle 101: Mark Landry's Experience

Q & A