
1

Evaluating Machine Learning Models – A Beginner’s Guide

Alice Zheng, Dato
September 15, 2015

2

My machine learning trajectory

Applied machine learning (data science)

Build ML tools

Shortage of experts and good tools.

3

Why machine learning?

Model data. Make predictions. Build intelligent applications.

4

Machine learning pipeline

I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …

[Pipeline diagram] Raw data → Features → Models → Predictions → Deploy in production, supported by GraphLab Create, Dato Distributed, and Dato Predictive Services.
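The deck builds this pipeline with Dato's GraphLab Create; as a rough sketch of the same raw data → features → model → predictions flow, here is a minimal scikit-learn version. The texts and sentiment labels are made up for illustration.

# Minimal sketch of the pipeline: raw text -> features -> model -> predictions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

raw_texts = [
    "I fell in love the instant I laid my eyes on that puppy.",
    "The package arrived late and the box was crushed.",
]
labels = [1, 0]  # hypothetical sentiment labels: 1 = positive, 0 = negative

pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(raw_texts, labels)                            # raw data -> features -> model
print(pipeline.predict(["What a playful, soft puppy!"]))   # -> predictions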

The ML Jargon Challenge

6

Typical machine learning paper

… semi-supervised model for large-scale learning from sparse data … sub-modular optimization for distributed computation … evaluated on real and synthetic datasets … performance exceeds state-of-the-art methods

7

What it looks like to ML researchers

Such regularize! Much optimal. So sparsity. Wow! Amaze. Very scale.

8

What it looks like to normal people

9

What it’s like in practice

Doesn’t scale
Brittle
Hard to tune
Doesn’t solve my problem on my data

10

Achieve Machine Learning Zen

11

Why is evaluation important?
• So you know when you’ve succeeded
• So you know how much you’ve succeeded
• So you can decide when to stop
• So you can decide when to update the model

12

Basic questions for evaluation
• When to evaluate?
• What metric to use?
• On what data?

13

When to evaluate

[Diagram] The prototype model is evaluated offline on historical data, producing training results and validation results. The deployed model is evaluated online on live data, producing online evaluation results; offline evaluation can also be run on live data.

Evaluation Metrics

15

Types of evaluation metric
• Training metric
• Validation metric
• Tracking metric
• Business metric

“But they may not match!”

Uh-oh Penguin

16

Example: recommender system
• Given data on which users liked which items, recommend other items to users
• Training metric
  - How well is it predicting the preference score?
  - Residual mean squared error: (actual – predicted)²
• Validation metric
  - Does it rank known preferences correctly?
  - Ranking loss
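The training metric here is the squared difference between actual and predicted preference scores, averaged over user-item pairs. A minimal numpy sketch, with made-up scores for five pairs:

import numpy as np

# Hypothetical preference scores for five user-item pairs.
actual    = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
predicted = np.array([4.5, 2.5, 4.0, 2.0, 2.5])

squared_errors = (actual - predicted) ** 2   # (actual - predicted)^2 per pair
mse  = squared_errors.mean()                 # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error
print(mse, rmse)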

17

Example: recommender system
• Tracking metric
  - Does it rank items correctly, especially for top items?
  - Normalized Discounted Cumulative Gain (NDCG)
• Business metric
  - Does it increase the amount of time the user spends on the site/service?
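NDCG is not spelled out on the slide; a small sketch of one common formulation (relevance divided by a log2 position discount, normalized by the ideal ranking) might look like this. The relevance scores are hypothetical.

import numpy as np

def dcg(relevances):
    # Discounted cumulative gain of a ranked list of relevance scores.
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))  # positions 1..n -> log2(2..n+1)
    return np.sum(relevances / discounts)

def ndcg(relevances):
    # Normalized DCG: DCG of this ranking divided by DCG of the ideal ranking.
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

# Hypothetical relevance of the top five recommended items, in ranked order.
print(ndcg([3, 2, 3, 0, 1]))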

18

Dealing with metrics
• Many possible metrics at different stages
• Defining the right metric is an art
  - What’s useful? What’s feasible?
• Aligning the metrics will make everyone happier
  - Not always possible: cannot directly train the model to optimize for user engagement

“Do the best you can!”

Okedokey Donkey

Model Selection and Tuning

Model Selection and Tuning

[Diagram] Historical data is split into training data and validation data. Model training on the training data produces a model and training results; evaluating on the validation data produces validation results, which feed hyperparameter tuning.

21

Key questions for model selection
• What’s validation?
• What’s a hyperparameter and how do you tune it?

22

Model validation
• Measure generalization error
  - How well the model works on new data
  - “New” data = data not used during training
• Train on one dataset, validate on another
• Where to find “new” data for validation?
  - Clever re-use of old data

23

Methods for simulating new data
• Hold-out validation: split the data once into a training set and a validation set
• K-fold cross-validation: partition the data into K folds; each fold takes a turn as the validation set
• Bootstrap resampling: draw a resampled dataset (with replacement) from the data
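A quick sketch of the three schemes using scikit-learn utilities; the toy data array below stands in for a real dataset.

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.utils import resample

data = np.arange(20)  # stand-in for a real dataset

# Hold-out validation: one split into training and validation sets.
train, valid = train_test_split(data, test_size=0.25, random_state=0)

# K-fold cross-validation: each fold takes a turn as the validation set.
for train_idx, valid_idx in KFold(n_splits=4).split(data):
    pass  # train on data[train_idx], validate on data[valid_idx]

# Bootstrap resampling: draw a same-sized dataset with replacement.
boot = resample(data, replace=True, n_samples=len(data), random_state=0)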

24

Hyperparameter tuning vs. model training
[Diagram] Model training produces the best model parameters; hyperparameter tuning, wrapped around model training, produces the best hyperparameters.

25

Hyperparameters != model parameters
[Figure] Classification between two classes, plotted as Feature 1 vs. Feature 2: the decision boundary is a model parameter; how many features to use is a hyperparameter.
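To make the distinction concrete, here is a small scikit-learn sketch (not from the deck): the number of features to keep and the regularization strength C are hyperparameters fixed before training, while the learned coefficients are model parameters.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hyperparameters: chosen *before* training (how many features to keep, and C).
pipeline = make_pipeline(SelectKBest(f_classif, k=2), LogisticRegression(C=1.0))
pipeline.fit(X, y)

# Model parameters: learned *from the data* during training (the coefficients).
print(pipeline.named_steps["logisticregression"].coef_)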

26

Why is hyperparameter tuning hard?
• Involves model training as a sub-process
  - Can’t optimize directly
• Methods:
  - Grid search
  - Random search
  - Smart search
    • Gaussian processes/Bayesian optimization
    • Random forests
    • Derivative-free optimization
    • Genetic algorithms
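As an illustration of the first two methods, a scikit-learn sketch of grid search and random search; the SVC model, the parameter ranges, and scipy's loguniform distribution are assumptions for the example. Each fit call retrains the model many times, which is exactly why tuning is expensive.

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Grid search: try every combination on a fixed grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: sample hyperparameter settings from distributions.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)},
    n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)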

Online Evaluations

28

ML in production - 101

[Diagram] Historical data → batch training → model; live data → model → real-time predictions; feedback on the predictions flows back into the historical data.

29

ML in production - 101

[Diagram] The same pipeline with a second model: batch training on historical data produces the models, and Model 2 serves real-time predictions on live data.
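A toy end-to-end sketch of this loop, with synthetic data standing in for the historical and live streams (none of this is from the deck): the model is trained offline in batch, serves predictions one event at a time, and logs feedback that would be folded into the next batch run.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-ins for historical data and live traffic.
rng = np.random.RandomState(0)
historical_X = rng.rand(100, 3)
historical_y = (historical_X[:, 0] > 0.5).astype(int)

model = SGDClassifier().fit(historical_X, historical_y)     # batch training, offline

feedback = []                                                # live feedback for the next batch run
for features in rng.rand(5, 3):                              # "live" events arriving one at a time
    prediction = model.predict(features.reshape(1, -1))[0]   # real-time prediction
    true_outcome = int(features[0] > 0.5)                    # feedback observed later
    feedback.append((features, true_outcome))                # becomes tomorrow's historical data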

30

Why evaluate models online?
• Track real performance of the model over time
• Decide which model to use when

31

Choosing between ML models
A/B test: Group A (2000 visits) sees Model 1 and gets 10% CTR; Group B (2000 visits) sees Model 2 and gets 30% CTR → everybody gets Model 2.
Strategy 1: A/B testing. Select the best model and use it all the time.
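The slide stops at "everybody gets Model 2"; one standard way to check that the 10% vs. 30% CTR difference is not noise is a two-proportion z-test on the slide's numbers (scipy is used here only for the normal tail probability):

import numpy as np
from scipy.stats import norm

# The slide's numbers: Group A (Model 1) and Group B (Model 2).
clicks = np.array([200, 600])   # 10% and 30% CTR
visits = np.array([2000, 2000])

p_pool = clicks.sum() / visits.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visits[0] + 1 / visits[1]))
z = (clicks[1] / visits[1] - clicks[0] / visits[0]) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value
print(z, p_value)               # a tiny p-value -> Model 2's lift is significant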

32

Choosing between ML models
A statistician walks into a casino…
Three slot machines: pay-off $1:$1000 (play this 85% of the time), pay-off $1:$200 (play this 10% of the time), pay-off $1:$500 (play this 5% of the time).
Multi-armed bandits

33

Choosing between ML models
A statistician walks into an ML production environment…
Model 1: pay-off $1:$1000, use this 85% of the time (exploitation). Model 2: pay-off $1:$200, use this 10% of the time (exploration). Model 3: pay-off $1:$500, use this 5% of the time (exploration).
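The fixed 85/10/5 split above is one bandit policy; a minimal epsilon-greedy simulation shows the same exploit/explore idea. The three click-through rates below are made up and stand in for the models' true (unknown) performance.

import random

true_ctr = [0.10, 0.30, 0.20]          # hidden quality of the three models
clicks, plays = [0, 0, 0], [0, 0, 0]
epsilon = 0.15                          # fraction of traffic used to explore

random.seed(0)
for _ in range(10000):
    if random.random() < epsilon or 0 in plays:              # explore
        arm = random.randrange(3)
    else:                                                     # exploit the best estimate so far
        arm = max(range(3), key=lambda i: clicks[i] / plays[i])
    plays[arm] += 1
    clicks[arm] += random.random() < true_ctr[arm]            # simulated user feedback

print([c / p for c, p in zip(clicks, plays)], plays)          # most traffic ends up on the best arm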

34

MAB vs. A/B testing
Why MAB?
• Continuous optimization, “set and forget”
• Maximize overall reward

Why A/B test?
• Simple to understand
• Single winner
• Tricky to do right

35

That’s not all, folks!
Read the details
• Blog posts: http://blog.dato.com/topic/machine-learning-primer
• Report: http://oreil.ly/1L7dS4a
• Dato is hiring! jobs@dato.com

alicez@dato.com  @RainyData
