
Page 1:

Click Through Rate Prediction

Ameet Talwalkar January 14, 2015

Page 2:

Schedule

9:00am-12:30pm Distributed ML
• 9:00-10:00am Millionsong Exercise (continued)
• 10:00-11:00am Click-through Rate (CTR) Prediction
• 11:00-12:30pm Exercise on CTR Prediction

12:30pm-1:30pm Lunch / Feedback Session
• Intro Spark
• Data Science
• Distributed ML
• Databricks Cloud

1:30pm-3:00pm Exercise on CTR Prediction

Page 3:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 7:

Online Advertising

Big Business
• Multiple billion dollar industry
• $42B in 2013, 17% increase over previous year!
• How Google and other search engines make $

Page 8:

Online Advertising

Big Data
• Lots of people use the internet
• Easy to gather labeled data (privacy concerns!)

Big ML
• A great success story for ML
• Major motivation for scalable ML
• Problem is hard, so we need all the data we can get!

Page 9:

The Players

Publishers: NYTimes, Google, ESPN
• Make money displaying ads

Advertisers: Marc Jacobs, Fossil, Macy's, Dr. Pepper
• Pay for ads to attract business

Matchmakers: Google, Microsoft, Yahoo
• Match publishers with advertisers
• In real-time (i.e., as a specific user visits a website)

Page 10:

Why Do Advertisers Pay?

Impressions
• Get message to target audience
• e.g., brand awareness campaign

Performance (most common)
• Get users to do something
• e.g., click on ad and visit site
• e.g., buy something or join a mailing list

Page 11:

Efficient Matchmaking

Idea: Predict and Optimize
• Predict probability that user will click each ad
• Choose ads to maximize probability

Predictive features
• Ad's historical performance
• Advertiser and ad content info
• Publisher info
• User info (search and click history, social network)

Page 12:

Companies get billions of impressions per day…

…but data is high-dim, sparse, skewed
• Hundreds of millions of users (or userIDs)
• Millions of unique landing pages
• Millions of unique ads
• Very few ads get clicked (label skew)

Using more data is crucial to tease out signal

Goal: Estimate P(click | user, ad, publisher info)
Given: Massive amounts of labeled data

Page 13:

One Approach

Goal: Estimate P(click | user, ad, publisher info)
Given: Massive amounts of labeled data

Feature extraction
• Quadratic features (see the sketch after this slide)
• One-Hot-Encoding (categorical features)
• Feature Hashing (control dimensionality)

Logistic Regression: models probabilities directly
• Also evaluate classifier via probabilities
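The quadratic-features step pairs up raw features to capture interactions. Below is a minimal sketch of what such an expansion computes, on a dense toy vector; real CTR pipelines apply this to sparse one-hot features and combine it with hashing, so the dense list here is purely illustrative.

```python
import itertools

def quadratic_features(x):
    """Append all pairwise products x_i * x_j (i <= j) to the raw features."""
    pairs = [xi * xj for xi, xj in itertools.combinations_with_replacement(x, 2)]
    return list(x) + pairs

# [x1, x2] -> [x1, x2, x1*x1, x1*x2, x2*x2]
print(quadratic_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```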

Page 14:

Recap Question

Why is CTR modeling crucial for online advertising?
• CTR modeling allows for an efficient spot market
• Amount advertisers pay is based on the chance that their ad will have the desired effect (e.g., a click)
• Publishers want to maximize the money they make hosting ads (hence want to host ads with high CTR)
• 3rd party matchmakers need to make good matches to stay in business

Page 15:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 16:

Binary Classification

Goal: Learn a mapping from entities to discrete labels given a set of training examples (supervised learning)

Examples:
• Spam detection
• Fraud detection
• Face detection
• Click-through rate prediction


Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

1.2 Definitions and terminology

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm

Page 17:

Linear Classifiers

Features: x coordinate
Labels: y coordinate

Goal: find linear decision boundary: y = sign(wx + b)
• Model Parameters: slope (w) and intercept (b)

As with linear regression, we can add a "one" feature and drop the intercept term, i.e., y = sign(w⊤x)
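A minimal sketch of prediction with the intercept folded in, as described above (the weights and point are illustrative toy values):

```python
import numpy as np

def predict(w, x):
    """Predict a label in {-1, +1}; assumes a constant 1 has been appended to x."""
    return 1 if np.dot(w, x) >= 0 else -1

w = np.array([0.5, -1.0, 0.2])  # last entry plays the role of the intercept b
x = np.array([1.0, 0.3, 1.0])   # trailing 1 is the "one" feature
print(predict(w, x))            # 1, since w.x = 0.5 - 0.3 + 0.2 = 0.4 > 0
```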

Page 18:

Linear Classifiers

Features: x coordinate
Labels: y coordinate

Goal: find linear decision boundary: y = sign(w⊤x)

What to do when data is not separable?
• Minimize the number of errors!

Page 19:

0/1 Loss Minimization

$$\ell_{0/1}(z) = \begin{cases} 1 & \text{if } z < 0 \\ 0 & \text{otherwise} \end{cases}$$

$$\min_w \sum_{i=1}^n \ell_{0/1}(y_i \cdot w^\top x_i)$$

Issue: hard optimization problem, not convex!

[Plot: 0/1 loss as a function of the margin]

Page 20:

Approximate 0/1 Loss

Solution: Approximate 0/1 loss with convex loss
• Several options (hinge, logistic, exponential…)

[Plot: convex surrogate losses vs. the margin. Credit: Elements of Statistical Learning, Hastie, Tibshirani, Friedman]

Page 21:

Approximate 0/1 Loss

Logistic Regression uses logistic loss (logloss):

$$\ell_{\log}(z) = \log(1 + e^{-z})$$

[Plot: logistic loss vs. the margin. Credit: Elements of Statistical Learning, Hastie, Tibshirani, Friedman]
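Logloss is simple to compute directly, but a naive implementation overflows for large negative margins. A small sketch of a numerically stable form (a generic trick, not tied to any particular library):

```python
import numpy as np

def logistic_loss(z):
    """log(1 + exp(-z)) for margin z = y * w.x, computed stably.

    Uses the identity log(1 + e^{-z}) = max(0, -z) + log(1 + e^{-|z|}),
    which never exponentiates a large positive number.
    """
    return np.maximum(0.0, -z) + np.log1p(np.exp(-np.abs(z)))

print(logistic_loss(np.array([-10.0, 0.0, 10.0])))
# ~[10.00005, 0.69315, 0.00005]: confident wrong predictions are penalized heavily
```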

Page 22:

Gradient Descent

Start at a random point w0, then repeat:
• Pick descent direction (negative gradient)
• Choose a step size
• Update

[Plot: f(w) with iterates w0, w1, … descending toward the minimizer w*]

Page 23:

Gradient Descent Update

Objective:

$$f(w) = \sum_{i=1}^n \log(1 + \exp(-y_i w^\top x_i))$$

1D Derivative: (chain rule 2x)

$$\frac{df}{dw} = \sum_{i=1}^n \frac{1}{1 + \exp(-y_i w x_i)} \cdot \frac{d\,[1 + \exp(-y_i w x_i)]}{dw}$$
$$= \sum_{i=1}^n \frac{\exp(-y_i w x_i)}{1 + \exp(-y_i w x_i)} \cdot \frac{d\,(-y_i w x_i)}{dw}$$
$$= \sum_{i=1}^n \frac{\exp(-y_i w x_i)}{1 + \exp(-y_i w x_i)} \,(-y_i x_i)$$
$$= \sum_{i=1}^n \left(1 - \frac{1}{1 + \exp(-y_i w x_i)}\right)(-y_i x_i)$$

Gradient:

$$\nabla_w f = \sum_{i=1}^n \left(1 - \frac{1}{1 + \exp(-y_i w^\top x_i)}\right)(-y_i x_i)$$
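Putting the derivation to work: a minimal NumPy sketch of batch gradient descent with the gradient in exactly the form above. The toy data and fixed step size are illustrative choices; MLlib provides the distributed, production versions.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of f(w) = sum_i log(1 + exp(-y_i w.x_i)), using the form
    sum_i (1 - 1 / (1 + exp(-y_i w.x_i))) * (-y_i x_i) derived above."""
    margins = y * X.dot(w)                         # y_i * w.x_i for all i
    coeffs = 1.0 - 1.0 / (1.0 + np.exp(-margins))
    return (coeffs * -y) @ X                       # sum_i coeff_i * (-y_i) * x_i

def gradient_descent(X, y, step=0.1, iters=100):
    w = np.zeros(X.shape[1])                       # starting point
    for _ in range(iters):
        w -= step * gradient(w, X, y)              # move along negative gradient
    return w

# toy separable data with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(gradient_descent(X, y))
```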


Page 28:

Probabilistic Interpretation

Goal: model a probability via a linear model
• But w⊤x is a real number!

The logistic function maps reals to [0,1]:

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

Page 29:

Probabilistic Interpretation

Model probability of label (y) given features (x):

$$P(Y = y \mid x) = \frac{1}{1 + \exp(-y\,w^\top x)} = \sigma(y\,w^\top x)$$

The logistic function maps reals to [0,1].

Equivalent to previous setup:
• Maximizing likelihood gives the same objective as before
• $P(Y = 1 \mid x) = P(Y = -1 \mid x) \iff w^\top x = 0$
• $P(Y = 1 \mid x) > P(Y = -1 \mid x) \iff \operatorname{sign}(w^\top x) > 0$
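A small sketch of scoring under this model (the weights are illustrative): the predicted probability and the sign-based decision agree, matching the equivalences above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """P(Y = 1 | x) = sigma(w.x) under the logistic model."""
    return sigmoid(np.dot(w, x))

w = np.array([1.5, -0.5])
x = np.array([1.0, 1.0])
p = predict_proba(w, x)
print(p)        # ~0.73, since w.x = 1.0 > 0
print(p > 0.5)  # True: the same decision as sign(w.x) > 0
```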

Page 30:

Logloss and Cross Entropy

Cross entropy: information-theoretic way to compare the 'closeness' of two probability distributions
• E.g., two discrete distributions over t ∈ T, with T = {0, 1}

Consider predicted and observed labels for a training point as discrete distributions over T
• Cross entropy between the distributions equals logloss!

Logloss makes sense for evaluating probabilities
• Particularly good for CTR prediction b/c of label skew
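Concretely: treating the observed label as a point-mass distribution over {0, 1}, the cross entropy against the predicted distribution collapses to logloss. A small sketch (the clipping epsilon is an illustrative choice):

```python
import numpy as np

def cross_entropy(p_true, p_pred):
    """Cross entropy between two distributions over {0, 1},
    each given as the probability assigned to label 1."""
    eps = 1e-15
    p_pred = np.clip(p_pred, eps, 1 - eps)  # guard against log(0)
    return -(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))

# Observed label y = 1 puts all mass on 1, so this is -log(p_pred): the logloss.
print(cross_entropy(1.0, 0.9))  # ~0.105
print(cross_entropy(1.0, 0.1))  # ~2.303: confident mistakes cost a lot
```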

Page 31:

Recap Question

What is the purpose of a loss function?
• It's a way to penalize a model for incorrect predictions on training data
• It precisely defines the optimization problem to be solved for a particular learning model

Page 32:

Recap Question

What are some nice properties of logloss?
• It's a convex surrogate for 0/1 loss
• It has a nice probabilistic interpretation as the cross entropy between the true and approximate distributions in the case of logistic regression

Page 33:

Recap Question

(Bonus) How would you augment the logistic regression model to control for model complexity? Why might you want to do this?

• Add a regularization term, just as in linear regression.

• Adding such a term can prevent overfitting / encourage generalization.

Page 34:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 35:

Categorical Data

Raw entity data is often not numerical
• Earlier we spoke about text processing

Categorical data is common, often with high cardinality
• User features: Gender, Nationality, Occupation, …
• Advertiser / Ad features: Industry, Language, …

How do we convert categorical into numeric?
• One approach: One-hot-encoding

Page 36:

One-hot-encoding

Create dummy variables
• E.g., given a variable with categories ['UK', 'USA', 'CA'], we create 3 dummy variables:

'UK' ⇒ [1 0 0], 'USA' ⇒ [0 1 0], 'CA' ⇒ [0 0 1]

Can drastically increase dimensionality
• Number of dummy variables equals the cardinality of the variable
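A minimal sketch of the encoding (a dense list for clarity; since only one entry is nonzero, real pipelines store these vectors sparsely):

```python
def one_hot(value, categories):
    """Map a categorical value to its dummy-variable vector."""
    return [1 if value == c else 0 for c in categories]

categories = ['UK', 'USA', 'CA']
print(one_hot('UK', categories))   # [1, 0, 0]
print(one_hot('USA', categories))  # [0, 1, 0]
print(one_hot('CA', categories))   # [0, 0, 1]
```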

Page 37:

How to Reduce Dimension?

One option: Discard infrequent values
• Requires additional pass to compute feature counts
• Doesn't work well with new data
• Might be throwing out something useful

An alternative: Feature Hashing
• Hash into d ≪ D buckets
• No additional processing / storage
• Easy to implement


• Collisions have minimal impact given sparsity
• Under certain conditions preserves inner products
• Good empirical performance
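A minimal sketch of the hashing trick on (feature, value) pairs. The bucket count and feature names are illustrative; also note that production code uses a fast deterministic hash, whereas Python's built-in hash() is salted per process for strings.

```python
def hash_features(pairs, d):
    """Hash (feature_name, value) pairs into a length-d vector.

    Collisions simply add together, which matters little given
    how sparse one-hot-encoded features are.
    """
    vec = [0.0] * d
    for name, value in pairs:
        vec[hash(name) % d] += value
    return vec

# Hash three categorical features into d = 4 buckets instead of
# one dummy variable per distinct category (D could be millions).
print(hash_features([('country=UK', 1), ('gender=F', 1), ('job=eng', 1)], 4))
```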

Page 39:

Recap Question

Why is feature hashing attractive in the CTR prediction setting?

• CTR prediction uses many categorical features with large cardinality (e.g., User location, User occupation, Advertiser industry, etc.).

• Using a one-hot-encoding representation of these categorical features can blow up the feature space.

• Feature hashing can reduce this feature dimension.

Page 40:

Online Advertising

Logistic Regression

One-hot-encoding and Feature hashing

CTR Exercise

Page 41:

CTR Exercise

Goal: Predict click-through rate
Data: Criteo Dataset from recent Kaggle competition
• Subsample of 10GB of CTR data
• 39 masked features
• Full data has 33M distinct categories

CTR pipeline exercise
• Featurize data using one-hot-encoding
• Reduce feature dimension via feature hashing
• Explore various properties of data
• Train, evaluate and tune hyperparameters for a logistic regression model using logloss
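A rough sketch of the kind of pipeline the exercise builds, using the Spark 1.x Python MLlib API. The file path, record layout, feature dimension, and hyperparameter values are illustrative assumptions, not the actual exercise setup.

```python
from math import log
from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

D = 2 ** 18  # hashed feature dimension (illustrative choice)

def parse_point(line):
    # Hypothetical layout: label, then categorical fields, tab-separated
    fields = line.split('\t')
    idx = sorted({hash('f%d=%s' % (i, v)) % D for i, v in enumerate(fields[1:])})
    return LabeledPoint(float(fields[0]), SparseVector(D, idx, [1.0] * len(idx)))

def logloss(p, y):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return -(y * log(p) + (1 - y) * log(1 - p))

sc = SparkContext(appName='ctr')
data = sc.textFile('ctr_train.txt').map(parse_point)  # path is illustrative
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = LogisticRegressionWithSGD.train(train, iterations=50, step=1.0)
model.clearThreshold()  # return probabilities rather than hard 0/1 labels

print(test.map(lambda lp: logloss(model.predict(lp.features), lp.label)).mean())
```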

Page 42:

MLlib: Spark’s Machine Learning Library

Ameet Talwalkar January 14, 2015

Page 43:

MLbase and MLlib

MLlib: Spark's core ML library
MLI, Pipelines: APIs to simplify ML development
• Tables, Matrices, Optimization, ML Pipelines
MLOpt: Declarative layer to automate hyperparameter tuning

MLbase aims to simplify development and deployment of scalable ML pipelines

[Diagram: the MLbase stack — MLOpt, MLI, and Pipelines layered over MLlib, which runs on Apache Spark; spanning experimental testbeds and production code]

Page 44:

History of MLlib

Initial Release
• Developed by MLbase team in AMPLab (11 contributors)
• Scala, Java
• Shipped with Spark v0.8 (Sep 2013)

17 months later…
• 80+ contributors from various organizations
• Scala, Java, Python
• Latest release part of Spark v1.2 (Dec 2014)

Page 45:

What's in MLlib?

Collaborative Filtering for Recommendation
• Alternating Least Squares

Prediction
• Lasso
• Ridge Regression
• Logistic Regression
• Decision Trees
• Naïve Bayes
• Support Vector Machines

Clustering
• K-Means

Optimization
• Gradient descent
• L-BFGS

Many Utilities
• Random data generation
• Linear algebra
• Feature transformations
• Statistics: testing, correlation
• Evaluation metrics

Page 46:

Benefits of MLlib

• Part of Spark
• Integrated data analysis workflow
• Free performance gains
• Scalable
• Python, Scala, Java APIs
• Broad coverage of applications & algorithms
• Rapid improvements in speed & robustness

[Diagram: the Apache Spark stack — Spark SQL, Spark Streaming, MLlib, GraphX]


Page 48:

Performance

Spark: 10-100X faster than Hadoop & Mahout

On a dataset with 660M users, 2.4M items, and 3.5B ratings, MLlib runs in 40 minutes with 50 nodes

[Chart: ALS on Amazon Reviews on 16 nodes — runtime in minutes (0-50) vs. number of ratings (0-800M), MLlib vs. Mahout]

Page 49:

Performance

Steady performance gains: ~3X speedups on average

[Chart: speedup (Spark 1.0 vs. 1.1) for ALS, Decision Trees, K-Means, Logistic Regression, and Ridge Regression]

Page 50:

ML Pipelines

Typical ML workflow is complex

Pipelines in 1.2 (alpha release)
• Easy workflow construction
• Standardized interface for model tuning
• Testing & failing early

Inspired by MLbase / Pipelines Project
• Collaboration between Databricks / AMPLab
• MLbase / MLOpt aims to autotune these pipelines

Page 51:

Resources

MLlib Programming Guide: spark.apache.org/docs/latest/mllib-guide.html

Databricks training info: databricks.com/spark-training

Spark user lists & community: spark.apache.org/community.html