Optimal Learning for Fun and Profit with MOE

Scott Clark, SF Machine Learning, 10/15/14
Joint work with: Eric Liu, Peter Frazier, Norases Vesdapunt, Deniz Oktay, Jialei Wang
[email protected] @DrScottClark

Uploaded by yelp-engineering on 01-Dec-2014. Category: Engineering.

DESCRIPTION

Abstract: In this talk we will introduce MOE, the Metric Optimization Engine. MOE is an efficient way to optimize a system's parameters when evaluating parameters is time-consuming or expensive. It can be used to tackle a myriad of problems, including optimizing a system's click-through or conversion rate via A/B testing, tuning the parameters of a machine-learning prediction method or expensive batch job, designing an engineering system, or finding the optimal parameters of a real-world experiment. MOE is ideal for problems in which the objective function is a black box, not necessarily convex or concave, derivatives are unavailable, and we seek a global optimum rather than just a local one. This ability to handle black-box objective functions allows us to use MOE to optimize nearly any system, without requiring any internal knowledge or access. To use MOE, we simply specify an objective function, a set of parameters, and any historical data we may have from previous evaluations of the objective function. MOE then finds the set of parameters that maximizes (or minimizes) the objective function, while evaluating it as few times as possible. Internally, this is done using Bayesian Global Optimization on a Gaussian Process model of the underlying system, sampling next at the points of highest Expected Improvement. MOE provides easy-to-use Python, C++, CUDA, and REST interfaces to accomplish these goals and is fully open source. We will present the motivation and background, discuss the implementation, and give real-world examples.

Scott Clark Bio: Since finishing my PhD in Applied Mathematics at Cornell University in 2012, I have been working on the Ad Targeting team at Yelp Inc., applying a variety of machine learning and optimization techniques, from multi-armed bandits to Bayesian Global Optimization and beyond, to Yelp's vast dataset and problems.
I have also been leading the charge on academic research and outreach within Yelp through projects like the Yelp Dataset Challenge and open-sourcing MOE.

TRANSCRIPT

Page 1: Optimal Learning for Fun and Profit with MOE

Optimal Learning for Fun and Profit with MOE

Scott Clark, SF Machine Learning, 10/15/14
Joint work with: Eric Liu, Peter Frazier, Norases Vesdapunt, Deniz Oktay, Jialei Wang

[email protected] @DrScottClark

Page 2: Outline of Talk

● Optimal Learning
  ○ What is it?
  ○ Why do we care?

● Multi-armed bandits
  ○ Definition and motivation
  ○ Examples

● Bayesian global optimization
  ○ Optimal experiment design
  ○ Uses to extend traditional A/B testing
  ○ Examples

● MOE: Metric Optimization Engine
  ○ Examples and features

Page 3: What is optimal learning?

"Optimal learning addresses the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive."
Prof. Warren Powell, optimallearning.princeton.edu

"What is the most efficient way to collect information?"
Prof. Peter Frazier, people.orie.cornell.edu/pfrazier

"How do we make the most money, as fast as possible?"
Me, @DrScottClark

Page 4: Part I: Multi-Armed Bandits

Page 5: What are multi-armed bandits?

THE SETUP

● Imagine you are in front of K slot machines.
● Each one is set to "free play" (but you can still win $$$).
● Each has a possibly different, unknown payout rate.
● You have a fixed amount of time to maximize payout.

GO!
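The setup above can be simulated in a few lines of Python. This is purely illustrative (it is not MOE code); the three payout rates are the ones used in the example slides, and are hidden from the player:

```python
import random

class BernoulliArm:
    """A slot machine ("arm") that pays out 1 with a fixed, hidden probability."""

    def __init__(self, payout_rate):
        self.payout_rate = payout_rate  # unknown to the player

    def pull(self):
        # Each pull is an independent Bernoulli trial.
        return 1 if random.random() < self.payout_rate else 0

# Three arms with hidden payout rates, as in the example slides.
arms = [BernoulliArm(0.5), BernoulliArm(0.8), BernoulliArm(0.2)]
```

A bandit policy only ever sees the 0/1 rewards coming back from `pull()`; it must infer the rates from those.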

Page 6: What are multi-armed bandits?

THE SETUP (math version)

Page 7: Real World Bandits

Why do we care?

● Maps well onto click-through rate (CTR)
  ○ Each arm is an ad or search result
  ○ Each click is a success
  ○ Want to maximize clicks

● Can be used in experiments (A/B testing)
  ○ Want to find the best solutions, fast
  ○ Want to limit how often bad solutions are used

Page 8: Tradeoffs

Exploration vs. Exploitation

Gaining more knowledge about the system
vs.
Getting the largest payout with current knowledge

Page 9: Naive Example

Epsilon-First Policy

● Sample sequentially for the first εT < T pulls
  ○ only explore
● Pick the best arm so far and pull it for t = εT+1, ..., T
  ○ only exploit
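The policy can be sketched directly against simulated Bernoulli arms. This is an illustrative sketch, not MOE's implementation; the `pull` callback and the payout rates are made up for the demo:

```python
import random

def epsilon_first(pull, K, T, epsilon=0.1):
    """Epsilon-first policy: spend the first epsilon*T pulls exploring
    uniformly (round-robin over the arms), then exploit the arm with
    the best observed win ratio for the remaining pulls.
    `pull(arm)` returns a 0/1 reward for the chosen arm."""
    pulls, wins = [0] * K, [0] * K
    explore_steps = int(epsilon * T)
    total = 0
    for t in range(explore_steps):           # exploration phase
        arm = t % K
        reward = pull(arm)
        pulls[arm] += 1
        wins[arm] += reward
        total += reward
    best = max(range(K),                      # commit to the best observed ratio
               key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
    for _ in range(explore_steps, T):         # exploitation phase
        total += pull(best)
    return best, total

# Hidden payout rates from the example slides; the policy never sees them.
rates = [0.5, 0.8, 0.2]
random.seed(42)
best, payout = epsilon_first(lambda a: int(random.random() < rates[a]),
                             K=3, T=3000, epsilon=0.3)
```

With enough exploration this finds the best arm, but as the next slides show, a fixed exploration budget can still commit to the wrong arm forever.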

Page 10: Example (K = 3, t = 0)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               0         0         0
  WINS:                0         0         0
  RATIO:               -         -         -

Page 11: Example (K = 3, t = 1)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               1         0         0
  WINS:                1         0         0
  RATIO:               1         -         -

Page 12: Example (K = 3, t = 2)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               1         1         0
  WINS:                1         1         0
  RATIO:               1         1         -

Page 13: Example (K = 3, t = 3)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               1         1         1
  WINS:                1         1         0
  RATIO:               1         1         0

Page 14: Example (K = 3, t = 4)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               2         1         1
  WINS:                1         1         0
  RATIO:               0.5       1         0

Page 15: Example (K = 3, t = 5)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               2         2         1
  WINS:                1         2         0
  RATIO:               0.5       1         0

Page 16: Example (K = 3, t = 6)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               2         2         2
  WINS:                1         2         0
  RATIO:               0.5       1         0

Page 17: Example (K = 3, t = 7)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               3         2         2
  WINS:                2         2         0
  RATIO:               0.66      1         0

Page 18: Example (K = 3, t = 8)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               3         3         2
  WINS:                2         3         0
  RATIO:               0.66      1         0

Page 19: Example (K = 3, t = 9)

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               3         3         3
  WINS:                2         3         1
  RATIO:               0.66      1         0.33

Page 20: Example (K = 3, t > 9)

Exploit! Profit! Right?

Page 21: What if our observed ratio is a poor approximation?

Unknown payout rate:   p = 0.5   p = 0.8   p = 0.2

Observed Information:
  PULLS:               3         3         3
  WINS:                2         3         1
  RATIO:               0.66      1         0.33

Page 22: What if our observed ratio is a poor approximation?

Unknown payout rate:   p = 0.9   p = 0.5   p = 0.5
(identical observed information, but different true payout rates)

Observed Information:
  PULLS:               3         3         3
  WINS:                2         3         1
  RATIO:               0.66      1         0.33

Page 23: Fixed exploration fails

Regret is unbounded!
The amount of exploration needs to depend on the data.
We need better policies!

Page 24: What should we do?

Many different policies:
● Weighted random choice (another naive approach)
● Epsilon-greedy
  ○ Best arm so far with P = 1 - ε, random otherwise
● Epsilon-decreasing*
  ○ Best arm so far with P = 1 - ε·exp(-rt), random otherwise
● UCB-exp*
● UCB-tuned*
● BLA*
● SoftMax*
● etc., etc., etc. (60+ years of research)

* Regret bounded as t → ∞
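Epsilon-greedy and epsilon-decreasing differ only in whether ε is held fixed or decayed over time, so one sketch covers both. This is illustrative, not MOE's implementation; the `pull` callback and rates are made up:

```python
import math
import random

def epsilon_greedy(pull, K, T, epsilon=0.1, decay=0.0):
    """Epsilon-greedy: at every step, pull the best arm so far with
    probability 1 - eps, and a uniformly random arm otherwise.
    With decay > 0 this becomes epsilon-decreasing: eps shrinks as
    epsilon * exp(-decay * t), so exploration tapers off over time."""
    pulls, wins = [0] * K, [0] * K
    total = 0
    for t in range(T):
        eps = epsilon * math.exp(-decay * t)
        if random.random() < eps or sum(pulls) < K:
            arm = random.randrange(K)                      # explore
        else:
            arm = max(range(K),                            # exploit
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
        reward = pull(arm)
        pulls[arm] += 1
        wins[arm] += reward
        total += reward
    return total

rates = [0.5, 0.8, 0.2]
random.seed(7)
total = epsilon_greedy(lambda a: int(random.random() < rates[a]), K=3, T=5000)
```

Unlike epsilon-first, these policies keep exploring as data accumulates, which is what bounds regret for the starred variants above.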

Page 25: Bandits in the Wild

What if...
● Hardware constraints limit real-time knowledge? (batching)
● The payoff is noisy? Non-binary? Changes in time? (dynamic content)
● You need parallel sampling? (many concurrent users)
● Arms expire? (events, news stories, etc.)
● You have knowledge of the user? (logged in, contextual history)
● The number of arms increases? Is continuous? (parameter search)

Every problem is different. This is an active area of research.

Page 26: Part II: Global Optimization

Page 27: THE GOAL (more mathy version)

● Optimize some objective function
  ○ CTR, revenue, delivery time, or some combination thereof
● given some parameters
  ○ config values, cutoffs, ML parameters
● CTR = f(parameters)
  ○ Find the best parameters
● We want to sample the underlying function as few times as possible

Page 28: MOE, the Metric Optimization Engine

A global, black-box method for parameter optimization

History of how past parameters have performed → MOE → New, optimal parameters

Page 29: What does MOE do?

● MOE optimizes a metric (like CTR) given some parameters as inputs (like scoring weights)
● Given the past performance of different parameters, MOE suggests new, optimal parameters to test

Results of A/B tests run so far → MOE → New, optimal values to A/B test

Page 30: Example Experiment

Biz details distance in ad
● For each category we define a maximum distance
● Setting a different distance cutoff for each category to show the "X miles away" text in the biz_details ad

Parameters + objective function:

distance_cutoffs = {'shopping': 20.0, 'food': 14.0, 'auto': 15.0, ...}
objective_function = {'value': 0.012, 'std': 0.00013}

→ MOE → New parameters:

distance_cutoffs = {'shopping': 22.1, 'food': 7.3, 'auto': 12.6, ...}

Run A/B Test
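This loop can be driven programmatically through MOE's REST interface. The sketch below only builds the JSON payload for a single one-dimensional cutoff; the field names follow the general shape of MOE's REST schema but should be treated as assumptions and verified against yelp.github.io/MOE (likewise whether the endpoint minimizes or maximizes), and the second observation here is invented for illustration:

```python
import json

def build_next_points_request(points_sampled, domain_bounds, num_to_sample=1):
    """Package past A/B-test observations for a MOE next-points endpoint
    (e.g. POST to a locally running MOE server at http://localhost:6543).
    Field names are assumptions; check MOE's REST docs for the real schema."""
    return json.dumps({
        "domain_info": {
            "dim": len(domain_bounds),
            "domain_bounds": domain_bounds,
        },
        "gp_historical_info": {"points_sampled": points_sampled},
        "num_to_sample": num_to_sample,
    })

# One-dimensional example: the 'shopping' distance cutoff, with the
# measured objective value and noise variance for each cutoff tried.
# (The second observation is made up for this illustration.)
history = [
    {"point": [20.0], "value": 0.012, "value_var": 0.00013 ** 2},
    {"point": [22.1], "value": 0.014, "value_var": 0.00012 ** 2},
]
payload = build_next_points_request(history, [{"min": 1.0, "max": 50.0}])
```

The response would contain the next cutoff(s) to A/B test; after the test runs, the new observation is appended to the history and the loop repeats.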

Page 31: Why do we need MOE?

● Parameter optimization is hard
  ○ Finding the perfect set of parameters takes a long time
  ○ Hope it is well behaved and try to move in the right direction
  ○ Not possible as the number of parameters increases

● Intractable to find the best set of parameters in all situations
  ○ Thousands of combinations of program type, flow, category
  ○ Finding the best parameters manually is impossible

● Heuristics quickly break down in the real world
  ○ Dependent parameters (changes to one change all others)
  ○ Many parameters at once (location, category, map, place, ...)
  ○ Non-linear (complexity and chaos break assumptions)

MOE solves all of these problems in an optimal way

Page 32: How does it work?

MOE:
1. Build a Gaussian Process (GP) with the points sampled so far
2. Optimize the covariance hyperparameters of the GP
3. Find the point(s) of highest Expected Improvement within the parameter domain
4. Return the optimal next point(s) to sample

Page 33: Gaussian Processes

Rasmussen and Williams, GPML (gaussianprocess.org)

Page 34: Gaussian Processes

Prior: [figure]
Posterior: [figure]
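The prior-to-posterior update pictured above can be computed directly. Below is a minimal pure-Python sketch of GP regression with a squared-exponential kernel; it is not MOE's actual implementation (which uses Cholesky factorizations and computes gradients), just the textbook formulas:

```python
import math

def rbf(x1, x2, length=1.0):
    """Squared-exponential (RBF) covariance, a common default GP kernel."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Fine for tiny demos; real code would use a Cholesky factorization."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_star, length=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_star,
    conditioned on observations (xs, ys)."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], length) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(xi, x_star, length) for xi in xs]
    mean = sum(ks * a for ks, a in zip(k_star, solve(K, ys)))
    var = rbf(x_star, x_star, length) - sum(ks * v for ks, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)
```

At an observed point the posterior mean collapses onto the data and the variance shrinks toward the noise level; far from the data the GP reverts to its prior mean and variance.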

Page 35: Optimizing Covariance Hyperparameters

Finding the GP model that fits best
● All of these GPs are created with the same initial data
  ○ with different hyperparameters (length scales)
● Need to find the model that is most likely given the data
  ○ Maximum likelihood, cross validation, priors, etc.

Rasmussen and Williams, Gaussian Processes for Machine Learning
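The "most likely model given the data" criterion is the log marginal likelihood. Here is a toy sketch for exactly two observations, where the 2x2 covariance can be inverted by hand; MOE does this over many points with gradient-based optimizers rather than a grid search:

```python
import math

def log_marginal_likelihood(xs, ys, length, noise=1e-4):
    """Log p(y | X, length) for a zero-mean GP with an RBF kernel,
    written for exactly two observations so the 2x2 covariance matrix
    can be inverted analytically."""
    (x1, x2), (y1, y2) = xs, ys
    k = math.exp(-((x1 - x2) ** 2) / (2.0 * length ** 2))
    a = 1.0 + noise
    det = a * a - k * k
    # Quadratic form y^T K^{-1} y using the analytic 2x2 inverse.
    quad = (a * y1 * y1 - 2.0 * k * y1 * y2 + a * y2 * y2) / det
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2.0 * math.pi)

# Grid-search the length scale, i.e. pick the most likely model.
# Two similar observations at nearby x's favor a long length scale.
xs, ys = (0.0, 1.0), (0.5, 0.55)
grid = [0.1 * i for i in range(1, 51)]
best_length = max(grid, key=lambda l: log_marginal_likelihood(xs, ys, l))
```

Conversely, two very different observations at the same distance push the likelihood toward short length scales (wiggly models), which is exactly the effect pictured on this slide.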

Page 36: Optimizing Covariance Hyperparameters (continued)

Rasmussen and Williams, Gaussian Processes for Machine Learning

Page 37: Find point(s) of highest Expected Improvement

We want to find the point(s) that are expected to beat the best point seen so far, by the most.

[Jones, Schonlau, Welch 1998] [Clark, Frazier 2012]
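For a single candidate point, the expected improvement of a Gaussian posterior over the best value seen so far has a closed form (Jones, Schonlau, Welch 1998). A sketch using the maximization convention:

```python
import math

def expected_improvement(mean, var, best_so_far):
    """Closed-form EI for maximization: the expected amount by which a
    Gaussian posterior N(mean, var) beats the best value seen so far."""
    sigma = math.sqrt(var)
    if sigma == 0.0:
        return max(mean - best_so_far, 0.0)  # no uncertainty left
    z = (mean - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Exploitation term (mean beats incumbent) + exploration term (uncertainty).
    return (mean - best_so_far) * cdf + sigma * pdf
```

The two terms make the exploration/exploitation tradeoff explicit: a high posterior mean or a high posterior variance both raise EI. For several simultaneous points (Expected Parallel Improvement, the topic of [Clark, Frazier 2012]) there is no simple closed form and the expectation is estimated by Monte Carlo, which is where MOE's CUDA implementation comes in.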

Page 38: Tying it all Together #1: A/B Testing

[Diagram: Users interact with the App through an Experiment Framework (users → cohorts; cohorts → % traffic, params). A batch Metric System collects logs, metrics, and results; this step is time consuming and expensive. In a daily/hourly batch, MOE (Multi-Armed Bandits + Bayesian Global Optimization) consumes cohorts → params and params → objective function, and returns the optimal cohort % traffic and optimal new params.]

● Optimally assign traffic fractions for experiments (Multi-Armed Bandits)
● Optimally suggest new cohorts to be run (Bayesian Global Optimization)

Page 39: Tying it all Together #2: Expensive Batch Systems

[Diagram: A Machine Learning Framework (complex regression, deep learning system, etc.) consumes Big Data; this is time consuming and expensive. The framework output feeds Metrics (error, loss, likelihood, etc.), which MOE (Bayesian Global Optimization) uses to return optimal hyperparameters back to the framework.]

● Optimally suggest new hyperparameters for the framework to minimize loss (Bayesian Global Optimization)

Page 40: Tying it all Together #3: Physical Experiments

[Diagram: A Drug Trial (drug creation, FDA approval, expert administration) and the Evaluation of Results (requires an expert) are both time consuming and expensive. MOE (Multi-Armed Bandits + Bayesian Global Optimization) receives asynchronous results, conditions on outstanding experiments, and returns optimal Parameters (dosage, frequency, composition) and optimal experiments for new patients.]

● Optimally allocate trial sizes (Multi-Armed Bandits)
● Optimally suggest parameters for new patients (Bayesian Global Optimization)

Page 41: What is MOE doing right now?

MOE is now live in production:
● MOE is informing active experiments
● MOE is successfully optimizing towards all given metrics
● MOE treats the underlying system it is optimizing as a black box, allowing it to be easily extended to any system

Page 42: MOE is Open Source!

github.com/Yelp/MOE

Page 43: MOE is Fully Documented

yelp.github.io/MOE

Page 44: MOE has Examples

yelp.github.io/MOE/examples.html

Page 45

● Multi-Armed Bandits
  ○ Many policies implemented and more on the way
● Global Optimization
  ○ Bayesian Global Optimization via Expected Improvement on GPs

Page 46: MOE is Easy to Install

● yelp.github.io/MOE/install.html#install-in-docker
● registry.hub.docker.com/u/yelpmoe/latest

A MOE server is now running at http://localhost:6543

Page 48: References

Gaussian Processes for Machine Learning. Carl Edward Rasmussen and Christopher K. I. Williams. 2006. The MIT Press, Cambridge, MA. http://www.gaussianprocess.org/gpml/ (free electronic copy)

Parallel Machine Learning Algorithms in Bioinformatics and Global Optimization (PhD Dissertation), Part II: EPI: Expected Parallel Improvement. Scott Clark. 2012. Cornell University, Center for Applied Mathematics, Ithaca, NY. https://github.com/sc932/Thesis

Differentiation of the Cholesky Algorithm. S. P. Smith. 1995. Journal of Computational and Graphical Statistics, Volume 4, Number 2, pp. 134-147.

A Multi-points Criterion for Deterministic Parallel Global Optimization Based on Gaussian Processes. David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. 2008. Département 3MI, École Nationale Supérieure des Mines, Saint-Étienne, France.

Efficient Global Optimization of Expensive Black-Box Functions. D. R. Jones, M. Schonlau, and W. J. Welch. 1998. Journal of Global Optimization, 13, 455-492.

Page 49: Use Cases

● Optimizing a system's click-through or conversion rate (CTR).
  ○ MOE is useful when evaluating CTR requires running an A/B test on real user traffic, and getting statistically significant results requires running this test for a substantial amount of time (hours, days, or even weeks). Examples include setting distance thresholds, ad unit properties, or internal configuration values.
  ○ http://engineeringblog.yelp.com/2014/10/using-moe-the-metric-optimization-engine-to-optimize-an-ab-testing-experiment-framework.html

● Optimizing tunable parameters of a machine-learning prediction method.
  ○ MOE can be used when calculating the prediction error for one choice of the parameters takes a long time, which might happen because the prediction method is complex and takes a long time to train, or because the data used to evaluate the error is huge. Examples include deep learning methods or hyperparameters of features in logistic regression.

Page 50: More Use Cases

● Optimizing the design of an engineering system.
  ○ MOE helps when evaluating a design requires running a complex physics-based numerical simulation on a supercomputer. Examples include designing and modeling airplanes, the traffic network of a city, a combustion engine, or a hospital.

● Optimizing the parameters of a real-world experiment.
  ○ MOE can help guide design when every experiment needs to be physically created in a lab, or when very few experiments can be run in parallel. Examples include chemistry, biology, or physics experiments, or a drug trial.

● Any time sampling a tunable, unknown function is time consuming or expensive.