SNU Seminar 20180523: Factorization Machines · Field-aware Factorization Machines · Practical Lessons from Predicting Clicks on Ads at Facebook


Page 1

20180523 Seminar

1. Factorization Machines - 2010

2. Field-aware Factorization Machines - 2016

3. Practical lessons from predicting clicks on ads at facebook - 2014

Presented by Hochul Kim

[email protected]

20180523

Factorization Machines

Factorization Machines (FM)

Other factorization models : (ex : MF, SVD++, PITF, FPMC, …)

Drawback

not applicable for general prediction tasks

work only with special input data

equations and optimization algorithms are derived individually

Feature of FM

model all interactions between variables (using factorized parameters)

Advantages of FM

general predictor working with any real-valued feature vector

they are able to estimate interactions even in problems with huge sparsity

calculated in linear time

Prediction Under Sparsity

terms in this paper

Page 2

prediction function: y : R^n -> T

feature vector: x in R^n; for regression T = R; for classification T = {+, -} (for binary classification, e.g. {+1, -1})

training dataset: D = {(x^(1), y^(1)), (x^(2), y^(2)), ...}

describe the sparsity of x:

m(x) = number of non-zero elements in x; m_D = average of m(x) over all x in D; condition of huge sparsity: m_D << n, where n is the length of x

indices

user index: u in U (set of users); item index: i in I (set of items); time index: t; rating value (e.g. r in {1, ..., 5}); example of data: Netflix

Input Data for example

Fig. 1: Example of sparse real-valued feature vectors created from the transactions of Example 1

FM Model Equation - 1 : Model

Page 3

y(x) = w0 + sum_i w_i x_i + sum_i sum_{j>i} <v_i, v_j> x_i x_j

where w0 in R, w in R^n, V in R^{n x k}

and the dot product is <v_i, v_j> = sum_{f=1}^{k} v_{i,f} v_{j,f}

where k is the dimensionality of the factorization

FM Model Equation - 2

2-way FMs (degree d = 2)

capture all single and pairwise interactions between variables

w0 : the global bias

w_i : models the strength of the i-th variable

w_ij ≈ <v_i, v_j> : models the interaction between the i-th variable and the j-th variable

Instead of using its own model parameter w_ij for every interaction, the FM models the interaction by factorizing it

for any positive definite matrix W there exists a matrix V such that W = V V^T, provided k is sufficiently large

typically a small k should be chosen, because

there is not enough data to estimate complex interactions

i.e., restricting k leads to better generalization

Example

Parameter Estimation Under Sparsity: want to estimate the interaction between x_A (Alice) and x_ST (Star Trek)

no case in the training data where both variables x_A and x_ST are non-zero

But with the factorized interaction parameters <v_A, v_ST> we can estimate it:

from Bob's and Charlie's similar interactions with Star Wars, v_B and v_C will be similar,

Page 4

from Alice's different interactions with Titanic and Star Wars, v_A will differ from v_C,

from Bob's similar interactions with Star Wars and Star Trek, v_ST will be similar to v_SW, so <v_A, v_ST> will be similar to <v_A, v_SW>

FM Model Equation - 3

Complexity

Complexity of the model equation: O(k n^2)

all pairwise interactions have to be computed

derivation: the pairwise term can be reformulated as
sum_{i<j} <v_i, v_j> x_i x_j = (1/2) sum_{f=1}^{k} [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
so the model can be computed in O(k n), and in O(k m(x)) under sparsity

Learning FM

can be learned by SGD with logistic loss (or hinge loss)

gradient of FM: d y(x) / d theta = 1 for theta = w0; x_i for theta = w_i; x_i (sum_j v_{j,f} x_j) - v_{i,f} x_i^2 for theta = v_{i,f}

the sum sum_j v_{j,f} x_j can be precomputed when evaluating y(x); thus each gradient can be computed in O(1), and each update step in O(k n) (or O(k m(x)) under sparsity)
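A minimal NumPy sketch of the 2-way FM prediction using the reformulated pairwise term above (linear in n and k). The parameter names (w0, w, V) and the toy dimensions are illustrative, not taken from the slides.

import numpy as np

def fm_predict(x, w0, w, V):
    """2-way FM: y(x) = w0 + <w, x> + 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2].
    x: (n,) feature vector, w0: scalar bias, w: (n,) linear weights, V: (n, k) factor matrix."""
    linear = w0 + w @ x
    s = V.T @ x                                   # (k,) precomputable sums  sum_i v_{i,f} x_i
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

# toy example with a sparse real-valued input
rng = np.random.default_rng(0)
n, k = 7, 3
x = np.array([1., 0., 0., 1., 0., 0.5, 0.5])
w0, w, V = 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))
print(fm_predict(x, w0, w, V))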

Page 5

Field-aware Factorization Machines for CTR Prediction

Introduction

Existing Models - 1

Poly2

Paper: Training and testing low-degree polynomial data mappings via linear SVM

uses a degree-2 polynomial mapping to capture the information of feature conjunctions

applying a linear model on the explicit form of degree-2 mappings

training and test time are much faster than with kernel methods

model: phi(w, x) = sum_{j1} sum_{j2 > j1} w_{h(j1, j2)} x_{j1} x_{j2}

where h(j1, j2) is a function hashing j1 and j2 into a natural number; the model size B is a user-specified parameter

Complexity: O(n_bar^2)

where n_bar is the average number of non-zero elements per instance

Existing Models - 2

FMs

Paper: Factorization Machines

model: phi(w, x) = sum_{j1} sum_{j2 > j1} <w_{j1}, w_{j2}> x_{j1} x_{j2}

size of w is n k; with the same reformulation as above,

the complexity is reduced to O(n_bar k)

Page 6

Why can FMs be better than Poly2 when the data set is sparse?

Example

for the (ESPN, Adidas) pair in the example, there is only one (negative) training instance for this pair

For Poly2

a very negative weight might be learned for this pair.

FMs

the prediction of (ESPN, Adidas) is determined by <w_ESPN, w_Adidas>, and these latent vectors are also learned from the (ESPN, Nike) and (NBC, Adidas) pairs, so the prediction may be more accurate

FFMs-1

Idea

Pairwise Interaction Tensor Factorization

use FMs for each pair of fields (e.g. (User, Item), (User, Tag), (Item, Tag))

Example

Clicked Publisher(P) Advertiser (A) Gender (G)

Y ESPN Nike Male

data for (Yes, ESPN, Nike, Male)

i.e. FM factorizes each feature (e.g. Nike, ESPN, ...) only

i.e. FFM factorizes each feature (e.g. Nike, ESPN, ...) together with its field (e.g. Advertiser, Publisher, Gender)

FFMs-2

Model equation: phi(w, x) = sum_{j1} sum_{j2 > j1} <w_{j1, f2}, w_{j2, f1}> x_{j1} x_{j2}

where f1 and f2 are respectively the fields of j1 and j2

Complexity

let the number of fields be f

size of w is n f k

n = number of features (= length of feature vector x), f = number of fields into which the features are categorized, k = length of the latent vectors
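A minimal sketch of the FFM model equation above, with one latent vector per (feature, field) pair. The field assignments, dimensions, and names are made up for illustration.

import numpy as np

def ffm_predict(x, field_of, W):
    """phi(w, x) = sum over non-zero pairs (j1, j2) of <w[j1, f2], w[j2, f1]> * x[j1] * x[j2].
    x: (n,) feature vector, field_of: (n,) field index of each feature, W: (n, f, k) latent vectors."""
    nz = np.nonzero(x)[0]
    y = 0.0
    for a in range(len(nz)):
        for b in range(a + 1, len(nz)):
            j1, j2 = nz[a], nz[b]
            f1, f2 = field_of[j1], field_of[j2]
            y += W[j1, f2] @ W[j2, f1] * x[j1] * x[j2]
    return y

# toy example: 3 fields (Publisher, Advertiser, Gender), 5 one-hot features
field_of = np.array([0, 0, 1, 1, 2])                              # ESPN, NBC, Nike, Adidas, Male
x = np.array([1., 0., 1., 0., 1.])                                # (ESPN, Nike, Male)
W = np.random.default_rng(0).normal(scale=0.1, size=(5, 3, 4))    # n=5, f=3, k=4
print(ffm_predict(x, field_of, W))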

Page 7

complexity: O(n_bar^2 k)

but usually k_FFM << k_FM, since each FFM latent vector only needs to learn the effect with one specific field

compare algorithms

model   # variables   complexity

LM      n             O(n_bar)

Poly2   B             O(n_bar^2)

FM      n k           O(n_bar k)

FFM     n f k         O(n_bar^2 k)

FFMs-3

Solving Optimization Problem - Algorithm

Let G be a tensor of all ones (it accumulates the squared gradients)

Run the following loop for t epochs

for i = 1, ..., m do

Sample a data point (y, x)

calculate kappa = -y / (1 + exp(y * phi(w, x)))

for each non-zero j1 in x do

for each non-zero j2 > j1 in x do

calculate sub-gradient, see (A)

for d = 1, ..., k do

update the accumulated sum G, see (B)

update the model w, see (C)

initial values

eta is a user-specified learning rate (set to 0.2 in the experiments); lambda is a regularization parameter (set to 2e-5 in the experiments); the entries of w are randomly sampled from a uniform distribution on [0, 1/sqrt(k)]

the entries of G are set to 1 in order to prevent a large value of G^(-1/2) in the first updates

FFMs-4

Solving Optimization Problem - formula

(A) sub-gradients:
g_{j1,f2} = lambda * w_{j1,f2} + kappa * w_{j2,f1} * x_{j1} * x_{j2}
g_{j2,f1} = lambda * w_{j2,f1} + kappa * w_{j1,f2} * x_{j1} * x_{j2}

Page 8

where kappa = d log(1 + exp(-y phi)) / d phi = -y / (1 + exp(y * phi(w, x)))

(B) for each coordinate d = 1, ..., k, the sum of squared gradients is accumulated:
(G_{j1,f2})_d += (g_{j1,f2})_d^2, (G_{j2,f1})_d += (g_{j2,f1})_d^2

(C) update w_{j1,f2} and w_{j2,f1}:
(w_{j1,f2})_d -= eta / sqrt((G_{j1,f2})_d) * (g_{j1,f2})_d, and similarly for w_{j2,f1}

where the size of w_{j1,f2} and w_{j2,f1} is k
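A minimal sketch of steps (A)-(C) for one non-zero pair (j1, j2). The variable names (eta, lam, G) follow the description above, but the exact code is an assumption, not the reference implementation.

import numpy as np

def ffm_pair_update(W, G, j1, j2, f1, f2, x1, x2, kappa, eta=0.2, lam=2e-5):
    """One FFM sub-gradient step for the pair (j1, j2).
    W, G: (n, f, k) latent vectors and accumulated squared gradients (G initialized to ones).
    kappa = d(logloss)/d(phi) = -y / (1 + exp(y * phi)) for the sampled instance."""
    g1 = lam * W[j1, f2] + kappa * W[j2, f1] * x1 * x2   # (A) sub-gradient w.r.t. w_{j1,f2}
    g2 = lam * W[j2, f1] + kappa * W[j1, f2] * x1 * x2   #     sub-gradient w.r.t. w_{j2,f1}
    G[j1, f2] += g1 ** 2                                 # (B) accumulate squared gradients
    G[j2, f1] += g2 ** 2
    W[j1, f2] -= eta / np.sqrt(G[j1, f2]) * g1           # (C) per-coordinate AdaGrad-style update
    W[j2, f1] -= eta / np.sqrt(G[j2, f1]) * g2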

empirical experience in paper

we find that normalizing each instance to have the unit length makes the test accuracy slightly better and insensitive to parameters.

Parallelization on Shared-memory Systems

apply HOGWILD! for parallelization

parallelized the first for-loop; see Experiments for more detail

Adding Field Information

data format used by the packages

LIBSVM data format: label feat1:val1 feat2:val2 ...

for FFM, the format is extended to: label field1:feat1:val1 field2:feat2:val2 ...

terms

for the label: 1 = true (clicked) / 0 = false (not clicked); P = Publisher, A = Advertiser, G = Gender

Categorical Feature

from a categorical record (e.g. Yes, ESPN, Nike, Male),

convert each category to a boolean (one-hot) feature with value 1
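A small sketch of turning one categorical record into the field:feature:value format described above. The dictionaries and index assignment are illustrative assumptions, not the packages' actual encoding.

def to_ffm_line(label, record, field_ids, feature_ids):
    """record: {field_name: category}. Each category becomes a binary feature (value 1)."""
    tokens = [str(label)]
    for field, category in record.items():
        fid = field_ids[field]
        jid = feature_ids.setdefault((field, category), len(feature_ids))  # one-hot index
        tokens.append(f"{fid}:{jid}:1")
    return " ".join(tokens)

field_ids = {"Publisher": 0, "Advertiser": 1, "Gender": 2}
feature_ids = {}
print(to_ffm_line(1, {"Publisher": "ESPN", "Advertiser": "Nike", "Gender": "Male"},
                  field_ids, feature_ids))
# -> "1 0:0:1 1:1:1 2:2:1"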

Page 9

Numerical Features

Accepted   AR      Hidx   Cite

Yes        45.73   2      3

No         1.04    100    50000

from the numerical data above,

convert the data with one of two strategies:

naive way: treat each numerical feature as its own field (a dummy field), so the fields are merely duplicates of the features

discretize each numerical feature to a categorical one with some strategy (e.g. discretization by rounding)

i.e. apply the strategy to the original value; the result of the discretization becomes the 'feature', and the value is converted to a boolean

drawback

not easy to determine the best setting (discretization strategy); some information will be lost after discretization
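A small sketch of the discretization idea for numerical features. The rounded log(x)^2 bucketing used here is just one possible strategy for illustration, not the one prescribed by the slides.

import math

def discretize(value, strategy=lambda v: int(round(math.log(v + 1.0) ** 2))):
    """Map a numerical value to a categorical bucket; the bucket id becomes the 'feature',
    and the value is set to 1 (boolean), as described above."""
    return strategy(value)

for v in (45.73, 1.04, 100, 50000):
    print(v, "->", discretize(v))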

Experiment-1

evaluation

use logistic loss: logloss = (1/m) * sum_{i=1}^{m} log(1 + exp(-y_i * phi(w, x_i)))

where m is the number of test instances
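A minimal sketch of this evaluation loss, assuming labels y in {-1, +1} and raw model outputs phi.

import numpy as np

def logloss(y, phi):
    """Mean logistic loss (1/m) * sum log(1 + exp(-y * phi)), m = number of test instances."""
    y, phi = np.asarray(y, float), np.asarray(phi, float)
    return np.mean(np.logaddexp(0.0, -y * phi))   # numerically stable log(1 + exp(.))

print(logloss([1, -1, 1], [2.0, -1.5, 0.3]))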

impact of parameters

Page 10

impact of parameters

impact of parameters

Page 11

Early Stopping

Various methods were tried, such as lazy update and ALS-based optimization, but early stopping was best.

1. Split the data set into a training set and a validation set.
2. At the end of each epoch, use the validation set to calculate the loss.
3. If the loss goes up, record the number of epochs. Stop or go to step 4.
4. If needed, use the full data set to re-train a model with the number of epochs obtained in step 3.
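A minimal sketch of the early-stopping loop above; train_one_epoch and validation_loss are placeholders for whatever training and evaluation routines are used.

def train_with_early_stopping(model, train_set, valid_set, max_epochs,
                              train_one_epoch, validation_loss):
    """Stop when the validation loss goes up; return the best epoch count."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, train_set)
        loss = validation_loss(model, valid_set)
        if loss > best_loss:          # loss went up -> record the epoch count and stop
            break
        best_loss, best_epoch = loss, epoch
    return best_epoch                 # optionally re-train on the full data for this many epochs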

Speedup

definition of speedup: (running time with one thread) / (running time with multiple threads)

Epoch-loss

the loss converges after about 13 epochs

Page 12

thread-speedup

speedup saturates beyond 8 threads (memory locking / contention)

Comparison on More Datasets

Page 13

Practical Lessons from Predicting Clicks on Ads at Facebook

Abstract

Volume: 750 million daily active users, over 1 million active advertisers

Combined decision trees with logistic regression

outperforming either of these methods on its own by over 3%

Explore how a number of fundamental parameters impact the final prediction performance of the system

Most important thing is to have the right features

historical information about the user or ad dominates other types of features (e.g. contextual features)

Right Model & Features -> other factors play small roles

Introduction

Billing : bid and pay per click auctions

The efficiency of an ads auction depends on the accuracy and calibration of click prediction.

Needs..

Robust and adaptive; capable of learning from massive volumes of data

Feature of Facebook Ads

Ads are not associated with a query

but specify demographic and interest targeting

Traditional sponsored search advertising

user query is used to retrieve candidate ads

Page 14

Experimental Setup

Evaluation Metrics

Normalized Entropy: the predictive log loss normalized by the entropy of the background CTR

p = empirical CTR of the training data set; NE = (average log loss per impression) / ( -(p log p + (1 - p) log(1 - p)) )

Calibration: ratio of the average estimated CTR and empirical CTR
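A minimal sketch of the two metrics, assuming labels y in {0, 1} and predicted click probabilities p; the names are illustrative.

import numpy as np

def normalized_entropy(y, p):
    """Predictive log loss normalized by the entropy of the background (empirical) CTR."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    logloss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    ctr = y.mean()                                  # empirical CTR of the data set
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return logloss / background

def calibration(y, p):
    """Ratio of the average estimated CTR to the empirical CTR."""
    return np.mean(p) / np.mean(y)

y = np.array([1, 0, 0, 1, 0]); p = np.array([0.6, 0.2, 0.1, 0.7, 0.3])
print(normalized_entropy(y, p), calibration(y, p))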

Prediction Model Structure-1

System Architecture

Figure 1: Boosted decision trees + Probabilistic sparse linear classifier (for online learning)

Prediction Model Structure-2

Model

Page 15

online learning schemes : based on the Stochastic Gradient Descent (SGD) algorithm

After the feature transform, x is a binary vector whose length n = number of features = total number of leaves in the boosted decision trees

e_i : the i-th unit vector; x_i : the value of the i-th categorical input feature

for a labeled impression (x, y), y in {-1, +1}

Bayesian online learning scheme for probit regression (BOPR)

GLM (with probit link function): p(y | x, w) = Phi( s(y, x, w) / beta ), where s(y, x, w) = y * w^T x

where Phi is the standardized normal CDF

Prior: p(w) = prod_k N(w_k; mu_k, sigma_k^2)

Posterior

The resulting model consists of the mean and the variance of the approximate posterior distribution of the weight vector

Prediction Model Structure-3

update: after each labeled impression, the mean mu_i and the variance sigma_i^2 of every active (non-zero) weight are updated in closed form

more discussion

Page 16

This inference can be viewed as an SGD scheme

where

(A) can be seen as a per-coordinate gradient descent like (B)

Prediction Model Structure-4

Decision tree feature transformers

Boosted decision tree

follows the Gradient Boosting Machine (GBM), using the L2-TreeBoost algorithm
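A minimal scikit-learn sketch of the trees-as-feature-transform idea: each boosted tree's leaf index becomes a one-hot categorical feature for a downstream logistic regression. The library choices, data, and parameters are illustrative assumptions, not the paper's setup.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0).fit(X_tr, y_tr)
enc = OneHotEncoder(handle_unknown="ignore")            # one binary feature per (tree, leaf)
leaves_tr = gbt.apply(X_tr)[:, :, 0]                    # (n_samples, n_trees) leaf indices
leaves_te = gbt.apply(X_te)[:, :, 0]
lr = LogisticRegression(max_iter=1000).fit(enc.fit_transform(leaves_tr), y_tr)
print(lr.score(enc.transform(leaves_te), y_te))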

experiment

compare two logistic regression models

result

Model Structure NE (relative to Trees only)

LR + Trees 96.58%

LR only (non-transformed) 99.43%

Trees only 100% (reference)

Table 1

Logistic Regression (LR) and boosted decision trees (Trees) make a powerful combination. We evaluate them by their Normalized Entropy (NE) relative to that of the Trees only model.

Prediction Model Structure -2

Data freshness

Click prediction systems are often deployed in dynamic environments where the data distribution changes over time.

Experiment

train the model on one particular day and test it on consecutive days

train on one day of data; evaluate on the six consecutive days

result

Page 17

worth retraining daily

But: too expensive

The boosted decision trees can be retrained daily (with some restrictions); the linear classifier can be trained in near real-time (online learning)

Prediction Model Structure -3

Online Linear Classifier - Compare BOPR and SGD -1

name Learning rate scheme Parameters

Per-coordinate α = 0.1, β = 1.0

Per-weight square root α = 0.01

Per-weight α = 0.01

Global α = 0.01

Constant α = 0.0005

SGD learning rate schemes and parameters
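A minimal sketch of the per-coordinate learning-rate update from the table above (alpha = 0.1, beta = 1.0), i.e. eta_{t,i} = alpha / (beta + sqrt(sum of squared gradients for coordinate i)). The gradient form assumes labels y in {-1, +1}; names are illustrative.

import numpy as np

def per_coordinate_sgd_step(w, grad_sq, x, y, alpha=0.1, beta=1.0):
    """One logistic-regression SGD step with a per-coordinate learning rate."""
    p = 1.0 / (1.0 + np.exp(-y * (w @ x)))
    g = -y * (1.0 - p) * x                       # gradient of log(1 + exp(-y * w.x))
    grad_sq += g ** 2                            # accumulate squared gradients per coordinate
    w -= alpha / (beta + np.sqrt(grad_sq)) * g
    return w, grad_sq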

Result

Page 18

Model Type NE (relative to LR)

LR 100% (reference)

BOPR 99.82%

Prediction Model Structure-4

Online Linear Classifier - Compare BOPR and SGD-2

Per-coordinate online LR vs BOPR

advantages of LR over BOPR

model size is half (it stores only the weights, while BOPR stores a mean and a variance per weight)

Fast (depending on the implementation); low computational cost

advantage of BOPR over LR

being a Bayesian formulation -> provides a full predictive distribution

can compute percentiles; can be used for explore/exploit

Online Data Joiner

Page 19

forms a tight closed loop

train classifier layer online

positive and negative

Positive = clicked

Negative = user did not click the ad after a fixed and sufficiently long period of time after seeing the ad

architecture for serving ad and join data

1. The initial data stream is generated when a user visits Facebook

2. A request is made to the ranker for candidate ads

3. The ads are passed back to the user’s device and in parallel

each ad and the associated features used in ranking that impression are added to the impression stream

4. Only after the full join window has expired is the labelled impression emitted to the training stream.

Other

protection mechanisms against anomalies that could corrupt the online learning system

Containing Memory and Latency-1

Number of boosting trees

Page 20

Boosting Feature importance

Containing Memory and Latency-2

Historical feature

contextual feature

depends on current information regarding the context, e.g. the device used by the user, the current page the user is on, ...

historical feature

depends on previous interactions of the ad or user

Page 21

e.g. CTR of the ad in the last week, average CTR of the user, ...

Percentage of historical feature

Type of features NE (relative to Contextual)

All 95.65%

Historical only 96.32%

Contextual only 100%

but contextual features are very important to handle the cold start problem

data freshness

historical features describe long-time accumulated user behavior

much more stable than contextual features

Massive Training Data

Massive Data -> Need Sampling

Page 22

Uniform subsampling

experiment with sampling rate 0.01, 0.1, 0.5, 1

Negative Down sampling
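A minimal sketch of the two sampling schemes (uniform subsampling of the whole training set, and down-sampling of negatives only), assuming a label array y in {0, 1}; names and rates are illustrative.

import numpy as np

def uniform_subsample(n, rate, rng):
    """Keep each training instance with probability `rate`."""
    return np.nonzero(rng.random(n) < rate)[0]

def negative_downsample(y, neg_rate, rng):
    """Keep all positives (clicks) and each negative with probability `neg_rate`."""
    keep = (y == 1) | (rng.random(len(y)) < neg_rate)
    return np.nonzero(keep)[0]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20)
print(uniform_subsample(len(y), 0.1, rng), negative_downsample(y, 0.1, rng))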

Page 23

Discussion

Practical Lessons

Data freshness matters: retraining frequently significantly increases the prediction accuracy. Best online learning method: BOPR is better than LR-SGD.

Results

tradeoff between the number of boosted decision trees and accuracy; feature-selection effect of boosted decision trees

Page 24: snu seminar sharpflow ty · compare two logistic regression models result Model Structure NE (relative to Trees only) LR + Trees 96.58% LR only (non-transformed) 99.43% Trees only