
Page 1:

Click Through Rate Prediction

Ameet Talwalkar January 14, 2015

Page 2:

Schedule

9:00am-12:30pm Distributed ML
• 9:00-10:00am Millionsong Exercise (continued)
• 10:00-11:00am Click-through Rate (CTR) Prediction
• 11:00-12:30pm Exercise on CTR Prediction

12:30pm-1:30pm Lunch / Feedback Session
• Intro Spark
• Data Science
• Distributed ML
• Databricks Cloud

1:30pm-3:00pm Exercise on CTR Prediction

Page 3:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 7:

Online Advertising

Big Business
• Multiple billion dollar industry
• $42B in 2013, 17% increase over previous year!
• How Google and other search engines make $

Page 8:

Online Advertising

Big Data
• Lots of people use the internet
• Easy to gather labeled data (privacy concerns!)

Big ML
• A great success story for ML
• Major motivation for scalable ML
• Problem is hard, so we need all the data we can get!

Page 9:

The Players

Publishers: NYTimes, Google, ESPN
• Make money displaying ads

Advertisers: Marc Jacobs, Fossil, Macy's, Dr. Pepper
• Pay for ads to attract business

Matchmakers: Google, Microsoft, Yahoo
• Match publishers with advertisers
• In real-time (i.e., as a specific user visits a website)

Page 10:

Why Do Advertisers Pay?

Impressions
• Get message to target audience
• e.g., brand awareness campaign

Performance (most common)
• Get users to do something
• e.g., click on ad and visit site
• e.g., buy something or join a mailing list

Page 11:

Efficient Matchmaking

Idea: Predict and Optimize
• Predict probability that user will click each ad
• Choose ads to maximize probability

Predictive features
• Ad's historical performance
• Advertiser and ad content info
• Publisher info
• User info (search and click history, social network)

Page 12:

Companies get billions of impressions per day…

…but data is high-dim, sparse, skewed
• Hundreds of millions of users (or userIDs)
• Millions of unique landing pages
• Millions of unique ads
• Very few ads get clicked (label skew)

Using more data is crucial to tease out signal

Goal: Estimate P(click | user, ad, publisher info)
Given: Massive amounts of labeled data

Page 13:

One Approach

Goal: Estimate P(click | user, ad, publisher info)
Given: Massive amounts of labeled data

Feature extraction
• Quadratic features (see the sketch after this slide)
• One-Hot-Encoding (categorical features)
• Feature Hashing (control dimensionality)

Logistic Regression: models probabilities directly
• Also evaluate classifier via probabilities
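The quadratic-features step pairs up raw features to capture interactions. Below is a minimal sketch of what such an expansion computes, on a dense toy vector; real CTR pipelines apply this to sparse one-hot features and combine it with hashing, so the dense list here is purely illustrative.

```python
import itertools

def quadratic_features(x):
    """Append all pairwise products x_i * x_j (i <= j) to the raw features."""
    pairs = [xi * xj for xi, xj in itertools.combinations_with_replacement(x, 2)]
    return list(x) + pairs

# [x1, x2] -> [x1, x2, x1*x1, x1*x2, x2*x2]
print(quadratic_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```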

Page 14:

Recap Question

Why is CTR modeling crucial for online advertising?
• CTR modeling allows for an efficient spot market
• Amount advertisers pay is based on the chance that their ad will have the desired effect (e.g., a click)
• Publishers want to maximize the money they make hosting ads (hence want to host ads with high CTR)
• 3rd party matchmakers need to make good matches to stay in business

Page 15:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 16:

Binary Classification

Goal: Learn a mapping from entities to discrete labels given a set of training examples (supervised learning)

Examples:
• Spam detection
• Fraud detection
• Face detection
• Click-through rate prediction


Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

1.2 Definitions and terminology

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm

Page 17:

Linear Classifiers

Features: x coordinate
Labels: y coordinate

Goal: find linear decision boundary: y = sign(wx + b)
• Model Parameters: slope (w) and intercept (b)

As with linear regression, we can add a "one" feature and drop the intercept term, i.e., y = sign(w⊤x)
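A minimal sketch of prediction with the intercept folded in, as described above (the weights and point are illustrative toy values):

```python
import numpy as np

def predict(w, x):
    """Predict a label in {-1, +1}; assumes a constant 1 has been appended to x."""
    return 1 if np.dot(w, x) >= 0 else -1

w = np.array([0.5, -1.0, 0.2])  # last entry plays the role of the intercept b
x = np.array([1.0, 0.3, 1.0])   # trailing 1 is the "one" feature
print(predict(w, x))            # 1, since w.x = 0.5 - 0.3 + 0.2 = 0.4 > 0
```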

Page 18:

Linear Classifiers

Features: x coordinate
Labels: y coordinate

Goal: find linear decision boundary: y = sign(w⊤x)

What to do when data is not separable?
• Minimize the number of errors!

Page 19:

0/1 Loss Minimization

$$\ell_{0/1}(z) = \begin{cases} 1 & \text{if } z < 0 \\ 0 & \text{otherwise} \end{cases}$$

$$\min_w \sum_{i=1}^n \ell_{0/1}(y_i \cdot w^\top x_i)$$

Issue: hard optimization problem, not convex!

[Plot: 0/1 loss as a function of the margin]

Page 20:

Approximate 0/1 Loss

Solution: Approximate 0/1 loss with convex loss
• Several options (hinge, logistic, exponential…)

[Plot: convex surrogate losses vs. the margin. Credit: Elements of Statistical Learning, Hastie, Tibshirani, Friedman]

Page 21:

Approximate 0/1 Loss

Logistic Regression uses logistic loss (logloss):

$$\ell_{\log}(z) = \log(1 + e^{-z})$$

[Plot: logistic loss vs. the margin. Credit: Elements of Statistical Learning, Hastie, Tibshirani, Friedman]
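Logloss is simple to compute directly, but a naive implementation overflows for large negative margins. A small sketch of a numerically stable form (a generic trick, not tied to any particular library):

```python
import numpy as np

def logistic_loss(z):
    """log(1 + exp(-z)) for margin z = y * w.x, computed stably.

    Uses the identity log(1 + e^{-z}) = max(0, -z) + log(1 + e^{-|z|}),
    which never exponentiates a large positive number.
    """
    return np.maximum(0.0, -z) + np.log1p(np.exp(-np.abs(z)))

print(logistic_loss(np.array([-10.0, 0.0, 10.0])))
# ~[10.00005, 0.69315, 0.00005]: confident wrong predictions are penalized heavily
```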

Page 22:

Gradient Descent

Start at a random point w0, then repeat:
• Pick descent direction (negative gradient)
• Choose a step size
• Update

[Plot: f(w) with iterates w0, w1, … descending toward the minimizer w*]

Page 23:

Gradient Descent Update

Objective:

$$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^\top x_i))$$

1D Derivative: (chain rule 2x)

$$\frac{df}{dw} = \sum_{i=1}^n \frac{1}{1 + \exp(-y_i w x_i)} \cdot \frac{d\,[1 + \exp(-y_i w x_i)]}{dw}$$
$$= \sum_{i=1}^n \frac{\exp(-y_i w x_i)}{1 + \exp(-y_i w x_i)} \cdot \frac{d\,(-y_i w x_i)}{dw}$$
$$= \sum_{i=1}^n \frac{\exp(-y_i w x_i)}{1 + \exp(-y_i w x_i)} \,(-y_i x_i)$$
$$= \sum_{i=1}^n \left(1 - \frac{1}{1 + \exp(-y_i w x_i)}\right)(-y_i x_i)$$

Gradient:

$$\nabla_w f = \sum_{i=1}^n \left(1 - \frac{1}{1 + \exp(-y_i w^\top x_i)}\right)(-y_i x_i)$$
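Putting the derivation to work: a minimal NumPy sketch of batch gradient descent with the gradient in exactly the form above. The toy data and fixed step size are illustrative choices; MLlib provides the distributed, production versions.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of f(w) = sum_i log(1 + exp(-y_i w.x_i)), using the form
    sum_i (1 - 1 / (1 + exp(-y_i w.x_i))) * (-y_i x_i) derived above."""
    margins = y * X.dot(w)                         # y_i * w.x_i for all i
    coeffs = 1.0 - 1.0 / (1.0 + np.exp(-margins))
    return (coeffs * -y) @ X                       # sum_i coeff_i * (-y_i) * x_i

def gradient_descent(X, y, step=0.1, iters=100):
    w = np.zeros(X.shape[1])                       # starting point
    for _ in range(iters):
        w -= step * gradient(w, X, y)              # move along negative gradient
    return w

# toy separable data with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(gradient_descent(X, y))
```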


Page 28:

Probabilistic Interpretation

Goal: model a probability via a linear model
• But w⊤x is a real number!

The logistic function maps reals to [0,1]:

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

Page 29:

Probabilistic Interpretation

Model probability of label (y) given features (x):

$$P(Y = y \mid x) = \frac{1}{1 + \exp(-y\,w^\top x)} = \sigma(y\,w^\top x)$$

The logistic function maps reals to [0,1].

Equivalent to previous setup:
• Maximizing likelihood gives the same objective as before
• $P(Y = 1 \mid x) = P(Y = -1 \mid x) \iff w^\top x = 0$
• $P(Y = 1 \mid x) > P(Y = -1 \mid x) \iff \operatorname{sign}(w^\top x) > 0$
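A small sketch of scoring under this model (the weights are illustrative): the predicted probability and the sign-based decision agree, matching the equivalences above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """P(Y = 1 | x) = sigma(w.x) under the logistic model."""
    return sigmoid(np.dot(w, x))

w = np.array([1.5, -0.5])
x = np.array([1.0, 1.0])
p = predict_proba(w, x)
print(p)        # ~0.73, since w.x = 1.0 > 0
print(p > 0.5)  # True: the same decision as sign(w.x) > 0
```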

Page 30:

Logloss and Cross Entropy

Cross entropy: information-theoretic way to compare the 'closeness' of two probability distributions
• E.g., two discrete distributions over t ∈ T, with T = {0, 1}

Consider predicted and observed labels for a training point as discrete distributions over T
• Cross entropy between the distributions equals logloss!

Logloss makes sense for evaluating probabilities
• Particularly good for CTR prediction b/c of label skew
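Concretely: treating the observed label as a point-mass distribution over {0, 1}, the cross entropy against the predicted distribution collapses to logloss. A small sketch (the clipping epsilon is an illustrative choice):

```python
import numpy as np

def cross_entropy(p_true, p_pred):
    """Cross entropy between two distributions over {0, 1},
    each given as the probability assigned to label 1."""
    eps = 1e-15
    p_pred = np.clip(p_pred, eps, 1 - eps)  # guard against log(0)
    return -(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))

# Observed label y = 1 puts all mass on 1, so this is -log(p_pred): the logloss.
print(cross_entropy(1.0, 0.9))  # ~0.105
print(cross_entropy(1.0, 0.1))  # ~2.303: confident mistakes cost a lot
```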

Page 31:

Recap Question

What is the purpose of a loss function?
• It's a way to penalize a model for incorrect predictions on training data
• It precisely defines the optimization problem to be solved for a particular learning model

Page 32:

Recap Question

What are some nice properties of logloss?
• It's a convex surrogate for 0/1 loss
• It has a nice probabilistic interpretation as the cross entropy between the true and approximate distributions in the case of logistic regression

Page 33:

Recap Question

(Bonus) How would you augment the logistic regression model to control for model complexity? Why might you want to do this?

• Add a regularization term, just as in linear regression.

• Adding such a term can prevent overfitting / encourage generalization.

Page 34:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 35:

Categorical Data

Raw entity data is often not numerical
• Earlier we spoke about text processing

Categorical data is common, often with high cardinality
• User features: Gender, Nationality, Occupation, …
• Advertiser / Ad features: Industry, Language, …

How do we convert categorical into numeric?
• One approach: One-hot-encoding

Page 36:

One-hot-encoding

Create dummy variables
• E.g., given a variable with categories ['UK', 'USA', 'CA'], we create 3 dummy variables:

'UK' ⇒ [1 0 0], 'USA' ⇒ [0 1 0], 'CA' ⇒ [0 0 1]

Can drastically increase dimensionality
• Number of dummy variables equals the cardinality of the variable
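A minimal sketch of the encoding (a dense list for clarity; since only one entry is nonzero, real pipelines store these vectors sparsely):

```python
def one_hot(value, categories):
    """Map a categorical value to its dummy-variable vector."""
    return [1 if value == c else 0 for c in categories]

categories = ['UK', 'USA', 'CA']
print(one_hot('UK', categories))   # [1, 0, 0]
print(one_hot('USA', categories))  # [0, 1, 0]
print(one_hot('CA', categories))   # [0, 0, 1]
```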

Page 37:

How to Reduce Dimension?

One option: Discard infrequent values
• Requires additional pass to compute feature counts
• Doesn't work well with new data
• Might be throwing out something useful

An alternative: Feature Hashing
• Hash into d ≪ D buckets
• No additional processing / storage
• Easy to implement


• Collisions have minimal impact given sparsity
• Under certain conditions preserves inner products
• Good empirical performance
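A minimal sketch of the hashing trick on (feature, value) pairs. The bucket count and feature names are illustrative; also note that production code uses a fast deterministic hash, whereas Python's built-in hash() is salted per process for strings.

```python
def hash_features(pairs, d):
    """Hash (feature_name, value) pairs into a length-d vector.

    Collisions simply add together, which matters little given
    how sparse one-hot-encoded features are.
    """
    vec = [0.0] * d
    for name, value in pairs:
        vec[hash(name) % d] += value
    return vec

# Hash three categorical features into d = 4 buckets instead of
# one dummy variable per distinct category (D could be millions).
print(hash_features([('country=UK', 1), ('gender=F', 1), ('job=eng', 1)], 4))
```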

Page 39:

Recap Question

Why is feature hashing attractive in the CTR prediction setting?

• CTR prediction uses many categorical features with large cardinality (e.g., User location, User occupation, Advertiser industry, etc.).

• Using a one-hot-encoding representation of these categorical features can blow up the feature space.

• Feature hashing can reduce this feature dimension.

Page 40:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 41:

CTR Exercise

Goal: Predict click-through rate
Data: Criteo Dataset from recent Kaggle competition
• Subsample of 10GB of CTR data
• 39 masked features
• Full data has 33M distinct categories

CTR pipeline exercise
• Featurize data using one-hot-encoding
• Reduce feature dimension via feature hashing
• Explore various properties of data
• Train, evaluate and tune hyperparameters for a logistic regression model using logloss
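A rough sketch of the kind of pipeline the exercise builds, using the Spark 1.x Python MLlib API. The file path, record layout, feature dimension, and hyperparameter values are illustrative assumptions, not the actual exercise setup.

```python
from math import log
from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

D = 2 ** 18  # hashed feature dimension (illustrative choice)

def parse_point(line):
    # Hypothetical layout: label, then categorical fields, tab-separated
    fields = line.split('\t')
    idx = sorted({hash('f%d=%s' % (i, v)) % D for i, v in enumerate(fields[1:])})
    return LabeledPoint(float(fields[0]), SparseVector(D, idx, [1.0] * len(idx)))

def logloss(p, y):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return -(y * log(p) + (1 - y) * log(1 - p))

sc = SparkContext(appName='ctr')
data = sc.textFile('ctr_train.txt').map(parse_point)  # path is illustrative
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = LogisticRegressionWithSGD.train(train, iterations=50, step=1.0)
model.clearThreshold()  # return probabilities rather than hard 0/1 labels

print(test.map(lambda lp: logloss(model.predict(lp.features), lp.label)).mean())
```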

Page 42:

MLlib: Spark’s Machine Learning Library

Ameet Talwalkar January 14, 2015

Page 43:

MLbase and MLlib

MLlib: Spark's core ML library
MLI, Pipelines: APIs to simplify ML development
• Tables, Matrices, Optimization, ML Pipelines
MLOpt: Declarative layer to automate hyperparameter tuning

MLbase aims to simplify development and deployment of scalable ML pipelines

[Diagram: the MLbase stack — MLOpt, MLI, and Pipelines layered over MLlib, which runs on Apache Spark; spanning experimental testbeds and production code]

Page 44:

History of MLlib

Initial Release
• Developed by MLbase team in AMPLab (11 contributors)
• Scala, Java
• Shipped with Spark v0.8 (Sep 2013)

17 months later…
• 80+ contributors from various organizations
• Scala, Java, Python
• Latest release part of Spark v1.2 (Dec 2014)

Page 45:

What's in MLlib?

Collaborative Filtering for Recommendation
• Alternating Least Squares

Prediction
• Lasso
• Ridge Regression
• Logistic Regression
• Decision Trees
• Naïve Bayes
• Support Vector Machines

Clustering
• K-Means

Optimization
• Gradient descent
• L-BFGS

Many Utilities
• Random data generation
• Linear algebra
• Feature transformations
• Statistics: testing, correlation
• Evaluation metrics

Page 46:

Benefits of MLlib

• Part of Spark
• Integrated data analysis workflow
• Free performance gains
• Scalable
• Python, Scala, Java APIs
• Broad coverage of applications & algorithms
• Rapid improvements in speed & robustness

[Diagram: the Apache Spark stack — Spark SQL, Spark Streaming, MLlib, GraphX]


Page 48:

Performance

Spark: 10-100X faster than Hadoop & Mahout

On a dataset with 660M users, 2.4M items, and 3.5B ratings, MLlib runs in 40 minutes with 50 nodes

[Chart: ALS on Amazon Reviews on 16 nodes — runtime in minutes (0-50) vs. number of ratings (0-800M), MLlib vs. Mahout]

Page 49:

Performance

Steady performance gains: ~3X speedups on average

[Chart: speedup (Spark 1.0 vs. 1.1) for ALS, Decision Trees, K-Means, Logistic Regression, and Ridge Regression]

Page 50:

ML Pipelines

Typical ML workflow is complex

Pipelines in 1.2 (alpha release)
• Easy workflow construction
• Standardized interface for model tuning
• Testing & failing early

Inspired by MLbase / Pipelines Project
• Collaboration between Databricks / AMPLab
• MLbase / MLOpt aims to autotune these pipelines

Page 51:

Resources

MLlib Programming Guide: spark.apache.org/docs/latest/mllib-guide.html

Databricks training info: databricks.com/spark-training

Spark user lists & community: spark.apache.org/community.html