10 things I wish I knew… …about Machine Learning Competitions

Upload: lyhanh

Post on 08-Apr-2018


Page 1: 10 things I wish I knew… - ETH Zpeople.inf.ethz.ch/jaggim/meetup/slides/ML-meetup-9-vonRohr-kaggle.… · Competition run-down • Head over to kaggle.com • Read the competition

10 things I wish I knew… …about Machine Learning Competitions

Introduction
• Theoretical competition run-down
• The list of things I wish I knew
• Code samples for a running competition

Kaggle – the platform

Reasons to compete
• Money
• Fame
• Learning experience
• Tough challenge
• Fun

Competition run-down
• Head over to kaggle.com
• Read the competition description
• Download the train/test set

Preparations
• Plot the data
• Look at the distributions
• Start simple (all-zeroes benchmark)
• Make sure to optimize the correct metric
• Read up on the specific properties
→ e.g. Logarithmic Loss, extreme predictions
https://www.kaggle.com/wiki/Metrics
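A minimal sketch of why extreme predictions matter under Logarithmic Loss. The `log_loss` helper, the clipping constant `eps`, and the toy labels are assumptions made for illustration; they are not taken from the slides or from Kaggle's scoring code.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary logarithmic loss; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 1]
cautious = [0.8, 0.2, 0.8, 0.2]  # last prediction is wrong, but hedged
extreme = [1.0, 0.0, 1.0, 0.0]   # last prediction is wrong and fully confident

loss_cautious = log_loss(y_true, cautious)
loss_extreme = log_loss(y_true, extreme)
print(loss_cautious, loss_extreme)
```

Both prediction sets get the same three labels right, yet the single fully confident mistake dominates the second score. Clipping predictions away from 0 and 1 is a common safeguard when the metric is log loss.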

Preprocessing
• Replace missing values
• Remove duplicates from the training set
• One-Hot encode categorical features
• Decide what to do with outliers
• Scaling/Standardizing
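The preprocessing steps above could look roughly like this with pandas. The toy data frame and its column names (`age`, `city`, `label`) are made up for illustration, and median imputation is just one possible choice; outlier handling is left out because it depends on the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training set: one numeric and one categorical feature.
train = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 35.0, 58.0],
    "city":  ["Zurich", "Bern", "Basel", "Basel", "Zurich"],
    "label": [0, 1, 1, 1, 0],
})

# 1) Remove exact duplicate rows from the training set.
train = train.drop_duplicates().reset_index(drop=True)

# 2) Replace missing values (here: the median of the training column).
age_median = train["age"].median()
train["age"] = train["age"].fillna(age_median)

# 3) One-hot encode the categorical column.
train = pd.get_dummies(train, columns=["city"])

# 4) Standardize the numeric column to zero mean and unit variance.
age_mean, age_std = train["age"].mean(), train["age"].std(ddof=0)
train["age"] = (train["age"] - age_mean) / age_std

print(train)
```

The statistics (`age_median`, `age_mean`, `age_std`) are computed on the training set only, so the same values can be reused to transform the test set.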

Building the model
• Start with a baseline or simple model
→ Random predictions
→ LogisticRegression
→ Decision trees
→ KNearestNeighbours
• Establish a cross-validation scheme
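One way to set up baselines together with a cross-validation scheme in scikit-learn is sketched below. The synthetic data from `make_classification`, the five stratified folds, and the specific model settings are assumptions for illustration, not the setup used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a competition training set.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)

# One fixed, seeded fold assignment that every model is scored on.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "random predictions": DummyClassifier(strategy="uniform", random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f} +/- {fold_scores.std():.3f}")
```

Reusing the same `cv` object for every model keeps the comparison fair: each candidate sees identical train/validation splits.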

Submit
• Leaderboard score vs. local score
• Mismatch?
→ Check your scoring function
→ Check the sample size of the public LB
→ Ignore the LB

Kaggle isn’t real world ML
• Trade-off: Accuracy vs. Interpretability vs. Speed
• Interpretability/speed is often more important than accuracy
• "Arrow splitting"
• "Netflix Problem"
http://fastml.com/kaggle-vs-industry-as-seen-through-lens-of-the-avito-competition/
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
http://machinelearningmastery.com/building-a-production-machine-learning-infrastructure/

1) Timing
• Don’t start too early
«Beat the benchmark», sharing, motivation
• Don’t start too late
You’ll certainly run out of time
• ~30 days before the deadline

2) Learn a tool, stick with it
• Python
• R
• Matlab/Octave

“The grass is always greener on the other side”

3) Make sure your results are reproducible
• Fix the seeds for algorithms that involve randomization
• Automate your pipeline
• Preferably one script from input to output

4) Make sure your results are reproducible

Examples:
• Weight initialization (Neural Networks)
• Data subsampling (e.g. Random Forest)

# scikit-learn
from sklearn.model_selection import train_test_split
train_test_split(X, y, random_state=42)

# numpy
import numpy as np
np.random.seed(42)
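A small sketch of what "fix the seeds" buys you. The function `noisy_experiment` is a hypothetical stand-in for a training run; it draws "weights" and a row subsample from seeded generators, so two runs with the same seed should give identical results.

```python
import random

import numpy as np

SEED = 42

def noisy_experiment(seed):
    """Stand-in for a training run: every random draw comes from a seeded generator."""
    random.seed(seed)                  # Python's built-in RNG
    rng = np.random.default_rng(seed)  # NumPy generator, passed around explicitly
    weights = rng.normal(size=3)       # e.g. weight initialization
    subsample = rng.choice(10, size=5, replace=False)  # e.g. row subsampling
    return weights, subsample, random.random()

run_a = noisy_experiment(SEED)
run_b = noisy_experiment(SEED)

print(np.array_equal(run_a[0], run_b[0]), np.array_equal(run_a[1], run_b[1]))
```

Wrapping the whole pipeline in one function (or one script) that takes the seed as an argument makes it easy to rerun a submission exactly, or to vary the seed on purpose later.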

5) Don’t trust the Leaderboard
• Danger of overfitting when tuning your models according to feedback of the public leaderboard
• Use cross-validation to estimate the performance of your model
• Don’t, if computationally too expensive
→ Train/Test split might cut it too
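The overfitting danger can be illustrated with a simulation. Everything here is synthetic and assumed for illustration: 200 "submissions" that are pure coin flips, a test set of 10,000 rows, and a public leaderboard that scores only the first 500 of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_public, n_submissions = 10_000, 500, 200

y_true = rng.integers(0, 2, size=n_test)
public = np.zeros(n_test, dtype=bool)
public[:n_public] = True  # small public slice, the rest is private

# 200 "models" that are pure coin flips: none of them has any real skill.
submissions = rng.integers(0, 2, size=(n_submissions, n_test))
correct = submissions == y_true
public_acc = correct[:, public].mean(axis=1)
private_acc = correct[:, ~public].mean(axis=1)

# Pick the submission the public leaderboard likes most.
best = public_acc.argmax()
print(f"public: {public_acc[best]:.3f}  private: {private_acc[best]:.3f}")
```

Selecting on the public score rewards luck on a small sample, so the chosen submission tends to look better than chance in public while staying close to 50% on the private part. A local cross-validation estimate does not suffer from this selection effect in the same way.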

6) Avoid Leakage
Common Sources
• PCA
• TfIdf
• Imputation (Mean/Median)
• Duplicate rows in the training set
• Inappropriate Cross-validation Scheme
Row, Person, Time, Location
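A hedged sketch of two common countermeasures in scikit-learn: fitting imputation and scaling inside a pipeline so they only ever see the training fold, and splitting by group so rows from the same person never land on both sides of a split. The synthetic data and the `person_id` grouping are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples = 200
X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan                # sprinkle in missing values
person_id = np.repeat(np.arange(n_samples // 4), 4)  # 4 rows per hypothetical person

# Imputer and scaler are refit on each training fold, never on validation rows.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Split by person instead of by row.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=person_id, cv=cv)
print(scores.mean())
```

The same pattern applies to PCA or TfIdf: put the transformer into the pipeline rather than fitting it once on the full data. For time-based data, a time-ordered split would take the place of `GroupKFold`.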

7) Bias/Variance Trade-off
High Variance (Overfitting)
High Bias (Underfitting)

https://www.coursera.org/course/ml

8) Think outside the box

• «Don’t get stuck in local minima»
• Stop doing what you’re doing if you’re not making significant progress
• Read up on relevant papers on the problem
• Explore a different model
• Try more feature engineering

9) Spend your time wisely
• Feature Engineering vs. Hyper-parameter tuning
• Read up on Error Analysis
• Read up on Learning Curves

9) Improving a learning algorithm
(V) = addresses high variance, (B) = addresses high bias
• Get more training examples (V)
• Try smaller sets of features (V)
• Try getting additional features (B)
• Try adding polynomial features (B)
• Increase regularization (V)
• Decrease regularization (B)
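To decide which of these remedies applies, it helps to compare training and validation error. The sketch below fits polynomials of increasing degree to a noisy sine curve; the data generator, the noise level, and the chosen degrees are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(3x) on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree:2d}: train={errors[degree][0]:.3f}  val={errors[degree][1]:.3f}")
```

Training error can only go down as the degree grows. A model where both errors are high is typically suffering from bias (B), while a large gap between a low training error and a higher validation error points to variance (V).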

10) Make use of ensembling
• Six bad models are usually better than one really good model [1, 2]
→ KNN, SVM, NeuralNet, RF, LogisticRegression, Ridge
→ Neural Nets (structurally, seed)
• Make yourself familiar with: Bagging, Boosting, Blending, Stacking
[1] http://www.tandfonline.com/doi/abs/10.1080/095400996116839#.VEebN_nkcyN
[2] http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf

An example would be handy…

…right about now.

Make use of ensembling (cont)

http://www.overkillanalytics.net/more-is-always-better-the-power-of-simple-ensembles/

[Figure: true signal, training data, linear model, non-linear model, and their averaged prediction]
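A simplified numerical illustration of why averaging helps. This is not the linear/non-linear example from the linked post: here each "model" is assumed to be the true signal plus its own independent noise, which is an idealization, since real models tend to make correlated errors.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
true_signal = np.sin(2.0 * np.pi * x)

# Five stand-in "models": each is the true signal plus its own independent error.
predictions = np.stack(
    [true_signal + rng.normal(scale=0.5, size=x.size) for _ in range(5)]
)

def mse(pred):
    return float(np.mean((pred - true_signal) ** 2))

individual = [mse(p) for p in predictions]
averaged = mse(predictions.mean(axis=0))

print("individual:", [round(e, 3) for e in individual])
print("averaged:  ", round(averaged, 3))
```

For squared error, the averaged prediction is never worse than the average of the individual errors. How much better it gets depends on how different the models' mistakes are, which is why mixing model families and seeds is suggested above.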

Working with features

• Feature selection
• Feature engineering
→ categorical
→ numerical
→ textual

Examples of feature selection/engineering

• Remove correlated features
• Remove features using statistical tests
• Try pair-wise feature interactions
a*b, a-b, a+b, a/b
• Try feature transformations
sqrt(a), log(a), abs(a)
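The interactions and transformations above, sketched with NumPy on two made-up feature columns `a` and `b`. Two details are choices of this sketch rather than of the slides: the ratio is filled with 0.0 where `b` is zero, and `log1p` is used in place of a plain `log` so that zeros do not break the transformation.

```python
import numpy as np

# Two hypothetical numeric feature columns.
a = np.array([1.0, 4.0, 9.0, 16.0])
b = np.array([2.0, 0.0, -3.0, 4.0])

interactions = {
    "a*b": a * b,
    "a-b": a - b,
    "a+b": a + b,
    # Guard the ratio against division by zero; 0.0 is an arbitrary fill value.
    "a/b": np.divide(a, b, out=np.zeros_like(a), where=b != 0),
}

transformations = {
    "sqrt(a)": np.sqrt(a),    # requires a >= 0
    "log1p(a)": np.log1p(a),  # log(1 + a), defined at a == 0
    "abs(b)": np.abs(b),
}

# Stack everything into one feature matrix, one column per new feature.
features = np.column_stack(list(interactions.values()) + list(transformations.values()))
print(features.shape)
```

Whether a generated feature is kept should be decided by the cross-validation score, not by how plausible it looks.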

Feature engineering (categorical)

• CabinID into deck and room number
‘A25’ → (‘A’, 25)
‘B16’ → (‘B’, 16)
• Recode number of siblings to binary (family)
• Decompose Dates
Year, month, day
Day of the week
Day of the month
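These three recodings can be sketched with the Python standard library alone. The function names and the handling of malformed cabin IDs (returning `(None, None)`) are assumptions of this sketch.

```python
import re
from datetime import date

def split_cabin(cabin_id):
    """'A25' -> ('A', 25); returns (None, None) if the ID does not match."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", cabin_id.strip())
    if match is None:
        return None, None
    return match.group(1).upper(), int(match.group(2))

def has_family(n_siblings):
    """Recode a sibling count to a binary 'travels with family' flag."""
    return int(n_siblings > 0)

def decompose_date(d):
    """Break a date into separate calendar features."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.weekday(),  # Monday == 0 ... Sunday == 6
    }

print(split_cabin("A25"), has_family(2), decompose_date(date(2000, 1, 1)))
```

The deck letter and the weekday are still categorical after the split, so they would typically be one-hot encoded before going into a linear model.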

Feature engineering (Textual)

• Lowercase
• Stemming (‘rainy’ → ‘rain’)
• Spelling correction
«I wsa hungray» → «I was hungry»
«It’s hotttt outside» → «It’s hot outside»
• Remove stopwords
• N-Grams
• TfIdf, Count, Hashing

What usually doesn’t work (for me)
• Dimensionality reduction (information loss)
• Feature elimination (information loss)
• Tree-based methods on High-dimensional/Sparse data (by design)

There is always a twist
• Feature engineering
→ a.k.a. “Golden Features”
• How exciting is this project?
→ linear decay towards the end
• Removing useless/noisy features

Dataset Trends
• Datasets become larger (millions of samples, thousands of features)
• Datasets are anonymized
→ Black-Box Machine Learning

Interesting stuff to keep an eye on
• Caffe, cuDNN
• Vowpal Wabbit (Wee-Dub)
• h2o from 0xdata
• Regularized Greedy Forests
• Factorization models

55 features, 15k training samples, ~500k test samples

Random predictions

Start simple: Decision tree

A little more complex

Let’s see what the model thinks

Next: SVM!

What?!

Feature scaling!

Enough playing, let’s get real.

73.5% accuracy?

Class distribution

Scale it up!

75.489% accuracy

Even more?

Nope, no more progress! Time to switch tactics.

Feature Engineering

78.212% accuracy

One more round

Mail: [email protected]
@mattvonrohr
LinkedIn: ch.linkedin.com/in/mattvonrohr/
Kaggle: kaggle.com/users/8376/matt