
A Two-Stage Ensemble of Classification, Regression, and Ranking Models for Advertisement Ranking

Presenter: Prof. Shou-de Lin

Team NTU members: Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, Perng-Hwa Kung, Chin-En Wang, Ting-Wei Ku, Chun-Yen Ho, Yi-Shu Tai, I-Kuei Chen, Wei-Lun Huang, Che-Ping Chou, Tse-Ju Lin, Han-Jay Yang, Yen-Kai Wang, Cheng-Te Li, Prof. Hsuan-tien Lin

About Team NTU (the catch up version)

• A team from the EECS college of National Taiwan University

• This year’s team is led by Prof. Hsuan-tien Lin and Prof. Shou-de Lin

• We have a course that trains students to analyze real-world, large-scale datasets
  – Every year we recruit new students to participate in this course as well as the KDD Cup
  – The majority of our students are undergraduates; they are inexperienced, but they are smart and quick learners

• Since 2008, the NTU team has won 4 KDD Cup championships (and one 3rd place) in the past 5 years

Facts about Track 2

• Predict the click-through rate (#click/#impression) of ads on a search engine

• 155,750,158 instances in training and 20,297,594 instances in testing

• Each training instance can be viewed as a vector (#click, #impression, DisplayURL, AdID, AdvertiserID, Depth, Position, QueryID, KeywordID, TitleID, DescriptionID, UserID)

• Each testing instance shares the same format, except that it lacks #click and #impression

• User gender and age, as well as token information, are also provided

• Goal: Maximize AUC on testing
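As a rough illustration of the evaluation metric, the sketch below computes AUC over click/impression data by treating each instance as #click positive samples and (#impression-#click) negative samples. The function name `weighted_auc` and the tie handling (tied pairs count as one half) are assumptions made for illustration, not the organizers' scoring code.

```python
def weighted_auc(clicks, impressions, scores):
    """AUC where instance k stands for clicks[k] positives and
    impressions[k] - clicks[k] negatives, all sharing scores[k]."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    pos_total = sum(clicks)
    neg_total = sum(impressions) - pos_total
    if pos_total == 0 or neg_total == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    area = 0.0
    neg_below = 0  # negatives with a strictly lower score than the current group
    start = 0
    while start < len(order):
        # Collect the group of instances that share the same score.
        end = start
        tie_pos = tie_neg = 0
        while end < len(order) and scores[order[end]] == scores[order[start]]:
            k = order[end]
            tie_pos += clicks[k]
            tie_neg += impressions[k] - clicks[k]
            end += 1
        # Each positive beats every lower negative and half of the tied ones.
        area += tie_pos * (neg_below + 0.5 * tie_neg)
        neg_below += tie_neg
        start = end
    return area / (pos_total * neg_total)
```

A perfect ranking of clicked over non-clicked impressions yields 1.0, and assigning the same score to everything yields 0.5.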

Framework for Track 2

• Individual models in five different categories

• Validation set blending to combine a portion of the models, boosting performance and enhancing diversity

• Test set ensemble to aggregate the high-performing blending models into our final solution

• This 3-stage framework was also used successfully in our solutions for KDD Cup 2011

[Figure: framework diagram. Individual models (Classification, Regression, Ranking, Combined Regression and Ranking, Matrix Factorization) feed the Validation Set Blending Models, which feed the Test Set Ensemble, which produces the Result.]

Validation Set

• We tried several strategies for creating the validation set, but none of them represented test performance more faithfully than the very naïve one below.

• We divide the training data into sub-train and validation sets (sub-train : validation = 10:1)
  – Models’ performance on the validation set and on the test set is slightly inconsistent; we think this is because the ratio of cold-start users differs between the sets (6.9% in the validation set, but 57.7% in the test set)

• Our conclusion: It is non-trivial to create a validation set on which the model’s performance is consistent with that of the testing dataset
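The naïve split and the cold-start check can be sketched as follows. This is a minimal illustration, assuming a random shuffle and assuming that the cold-start ratio is measured as the fraction of evaluation instances whose UserID never appears in the training portion; the function names are hypothetical.

```python
import random


def split_subtrain_validation(rows, ratio=10, seed=0):
    """Randomly split rows into sub-train and validation with ratio:1 sizes."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = len(shuffled) // (ratio + 1)
    return shuffled[n_validation:], shuffled[:n_validation]


def cold_start_ratio(train_user_ids, eval_user_ids):
    """Fraction of evaluation instances whose user is unseen in training."""
    seen = set(train_user_ids)
    unseen = sum(1 for user in eval_user_ids if user not in seen)
    return unseen / len(eval_user_ids)
```

Comparing `cold_start_ratio` on the validation set with the same quantity on the test set is one way to see why the two sets may rank models differently.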

General Features

• We create 6 categories of features, and each individual model may use a different subset of them
  – Categorical features
  – Sparse features
  – Click-through rate features
  – ID raw value features
  – Other numerical features
  – Token similarity features

• In Track 2, we found no ‘killer features’ such as the sequential features in Track 1.

Categorical & Sparse Features

• Categorical features
  – Only for Naïve Bayes
  – We treat IDs such as UserID and AdID directly as categorical features

• Sparse binary features
  – Expand categorical features into binary indicator features
  – Most of the feature values are 0

Click-through Rate Features

• For each category, we generate the average click-through rate as a one-dimensional feature

• For example, for each AdID, we compute the average click-through rate for all instances of the same AdID as one feature.

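• To handle biased CTR estimates due to insufficient statistics, we apply additive smoothing:

$$\frac{\#\text{click} + \alpha \times \beta}{\#\text{impression} + \beta}, \quad \text{we use } \alpha = 0.05 \text{ and } \beta = 75 \text{ in our experiment}$$

  – Smoothing significantly boosts the performance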
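A minimal sketch of the smoothed click-through rate feature, assuming the smoothing has the form (#click + α·β) / (#impression + β) with α = 0.05 acting as a prior CTR and β = 75 as a pseudo-count of impressions. The helper names and the `(id_value, clicks, impressions)` row layout are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.05  # assumed role: prior click-through rate
BETA = 75     # assumed role: pseudo-count of impressions


def smoothed_ctr(clicks, impressions, alpha=ALPHA, beta=BETA):
    """Additively smoothed click-through rate."""
    return (clicks + alpha * beta) / (impressions + beta)


def ctr_feature_by_id(rows):
    """Aggregate (id_value, clicks, impressions) rows into one smoothed CTR per ID."""
    totals = defaultdict(lambda: [0, 0])
    for id_value, clicks, impressions in rows:
        totals[id_value][0] += clicks
        totals[id_value][1] += impressions
    return {key: smoothed_ctr(c, n) for key, (c, n) in totals.items()}
```

With these constants, an ID that has never been shown gets a CTR of about 0.05 rather than an undefined 0/0, and an ID with one click in one impression gets about 0.0625 rather than 1.0.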

ID Raw Value

• We observed that the numerical values of the IDs contain some information

• For example, plotting the total #impression for each KeywordID shows that #impression decreases as the value of KeywordID increases

• We guess that the ID values may encode time information

Other Numerical Features

• Features for position & depth
  – ad’s position
  – depth
  – relative position, (depth-position)/depth

• Number of tokens for QueryID, KeywordID, TitleID and DescriptionID

• Weighted number of tokens for QueryID, KeywordID, TitleID and DescriptionID, where each token is weighted by its IDF value

• Number of impressions of each categorical feature
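Two of these numerical features can be sketched in a few lines. The relative position follows the formula on the slide; the IDF definition used here, log(N / document frequency), is an assumption, since the slides do not say which IDF variant was used. All function names are hypothetical.

```python
import math
from collections import Counter


def relative_position(depth, position):
    """(depth - position) / depth, as listed on the slide."""
    return (depth - position) / depth


def idf_table(token_lists):
    """IDF per token, assuming idf = log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def weighted_token_count(tokens, idf):
    """Number of tokens, each weighted by its IDF value (unknown tokens count 0)."""
    return sum(idf.get(tok, 0.0) for tok in tokens)
```

For example, an ad in position 1 of a depth-3 result list gets a relative position of 2/3, and a token that occurs in every text contributes nothing to the weighted count.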

Token Similarity Features

• Token similarity between QueryID, KeywordID, TitleID and DescriptionID as features
  – C(4,2)=6 pairs of similarity as 6 features
  – cosine similarity between tf-idf vectors of tokens
  – alternatively, we use an LDA model to extract topics for QueryID, KeywordID, TitleID and DescriptionID, and then compute the cosine similarity between latent topics
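The six tf-idf similarity features could be computed along these lines. This is a sketch under assumptions: tf is taken as the raw token count, the `idf` dictionary is supplied by the caller, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

FIELDS = ("QueryID", "KeywordID", "TitleID", "DescriptionID")


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector as a dict, with tf taken as the raw count."""
    return {tok: count * idf.get(tok, 0.0) for tok, count in Counter(tokens).items()}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_features(tokens_by_field, idf):
    """One cosine similarity per pair of fields: C(4,2) = 6 features."""
    vectors = {f: tfidf_vector(tokens_by_field[f], idf) for f in FIELDS}
    return {(a, b): cosine_similarity(vectors[a], vectors[b])
            for a, b in combinations(FIELDS, 2)}
```

The same `cosine_similarity` helper would apply to LDA topic vectors if they are stored as dicts keyed by topic index.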

Individual Models

• The click-through rate prediction problem is modeled as classification, regression and ranking problems

• For each strategy, we exploit several models and most of them reach competitive performance






Individual Models: Classification Models (1)

• We split each training instance into #click positive samples and (#impression-#click) negative samples

• We apply two classification methods
  – Naïve Bayes
  – Logistic Regression
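The splitting of an aggregated instance into positive and negative samples can be sketched as below. The weighted variant is an assumption about how one might avoid materializing every copy; the slides do not say whether the team did this. Function names are hypothetical.

```python
def expand_instance(features, clicks, impressions):
    """Literal expansion: #click positives and (#impression - #click) negatives."""
    positives = [(features, 1)] * clicks
    negatives = [(features, 0)] * (impressions - clicks)
    return positives + negatives


def weighted_instance(features, clicks, impressions):
    """Compact equivalent: at most one row per label, carrying a sample weight."""
    rows = []
    if clicks > 0:
        rows.append((features, 1, clicks))
    if impressions - clicks > 0:
        rows.append((features, 0, impressions - clicks))
    return rows
```

For a learner that supports per-sample weights, the two representations describe the same training set, but the weighted one keeps the number of rows bounded by twice the number of original instances.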






Individual Models: Classification Models (2)

• Naïve Bayes
  – Additive smoothing and Good-Turing are applied with promising results
  – The best AUC is 0.7760 on the public test set

• Logistic Regression
  – Train on a sampled subset to reduce the training time
  – Separate users into two groups (UserID=0 or not), train one model for each group, and then combine the results
  – This model achieves 0.7888 on the public test set

Individual Models: Regression Models (1)

• For the regression models, we use $\mathrm{CTR} = \frac{\#\text{click}}{\#\text{impression}}$ as the target to predict

• Two methods in this category
  – Linear Regression
  – Support Vector Regression

Individual Models: Regression Models (2)

• Linear Regression
  – degree-2 polynomial expansion on numerical value features
  – 0.7352 AUC on the public test set

• Support Vector Regression
  – Use degree-2 polynomial expansion
  – The best AUC of this model is 0.7705 on the public test set
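A degree-2 polynomial expansion of numerical features can be sketched as follows. This version keeps the original features and appends every pairwise product including squares; whether the team included squares or a constant term is not stated, so that choice is an assumption.

```python
from itertools import combinations_with_replacement


def degree2_expand(x):
    """Return the original features followed by all products x[i] * x[j], i <= j."""
    products = [x[i] * x[j]
                for i, j in combinations_with_replacement(range(len(x)), 2)]
    return list(x) + products
```

For d input features this yields d + d(d+1)/2 outputs, which is why the expansion is applied to the small set of numerical features rather than to the sparse ID indicators.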

Individual Models: Ranking Models (1)

• We split each training instance into #click positive samples and (#impression-#click) negative samples

• Optimize pairwise ranking

• Two methods in this category
  – Rank Logistic Regression
  – RankNet






Individual Models: Ranking Models (2)

• Rank Logistic Regression
  – The pairwise objective is optimized with Stochastic Gradient Descent (SGD)
  – The best AUC is 0.722 on the public test set

• RankNet
  – Optimizes the cross-entropy loss function with a neural network:

$$C_{ij} = -\hat{r}_{ij} + \log\left(1 + e^{\hat{r}_{ij}}\right)$$

    where

$$\hat{r}_{ij} = \hat{r}_i - \hat{r}_j \quad \text{and} \quad \hat{r}_i = \sum_{j=1}^{H} w^{(2)}_j \tanh\Big(\sum_{k} w^{(1)}_{jk} x_{ik}\Big)$$

  – Using SGD to update parameters
  – The best result is 0.7577 on the public test set

Surprisingly, the ranking-based models do not outperform the other models, perhaps because they are more complicated to train and tune.
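To make the RankNet loss concrete, here is a small sketch assuming a one-hidden-layer network with tanh units and no bias terms, scoring a pair (i, j) in which i is a clicked sample and j is a non-clicked one. The names `score` and `pair_loss` and the parameter layout (`W1` as a list of hidden-unit weight rows, `w2` as output weights) are hypothetical.

```python
import math


def score(x, W1, w2):
    """r_hat = sum_j w2[j] * tanh(sum_k W1[j][k] * x[k])."""
    hidden = [math.tanh(sum(w * xk for w, xk in zip(row, x))) for row in W1]
    return sum(w2j * hj for w2j, hj in zip(w2, hidden))


def pair_loss(r_i, r_j):
    """Cross-entropy for a pair where i should rank above j:
    C = -(r_i - r_j) + log(1 + exp(r_i - r_j)), i.e. log(1 + exp(-(r_i - r_j)))."""
    diff = r_i - r_j
    return -diff + math.log1p(math.exp(diff))
```

The loss is log 2 when both samples get the same score and shrinks toward 0 as the clicked sample is scored increasingly higher than the non-clicked one. For very large score differences a numerically stabler formulation would be needed; this sketch does not handle that.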

Individual Models: Combined Regression and Ranking Models (1)

• We also explore another model that combines the ranking loss and the regression loss

• In this model we try to optimize a combination of H and L, where H is the ranking loss and L is the regression loss

• Solved by SGD SVM; the best AUC is 0.7819
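One commonly used way to combine the two losses is a convex combination with a regularization term. The form below is an assumption for illustration: the trade-off parameter $\alpha \in [0, 1]$ and the regularization weight $\lambda$ are not given in the slides, and the objective actually used by the team may differ.

```latex
\min_{w} \; \alpha \, L(w) \; + \; (1 - \alpha) \, H(w) \; + \; \frac{\lambda}{2} \lVert w \rVert^{2}
```

Setting $\alpha = 1$ recovers a pure regression model and $\alpha = 0$ a pure pairwise ranking model, so intermediate values trade calibration of the predicted CTR against ordering quality.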






Individual Models: Matrix Factorization Models (1)

• We also have feature-based factorization models, which exploit latent information from data

• Two different matrix factorization models are used: one optimizes a regression loss, and the other optimizes a ranking loss







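• Regression-Based
  – Divide features into two groups: α as the user’s features and β as the item’s features
  – The prediction for an instance is

$$\hat{r}(x) = \underbrace{\sum_{j} w^{(u)}_j \alpha_j}_{\text{bias}} + \underbrace{\sum_{j} w^{(i)}_j \beta_j}_{\text{bias}} + \Big(\sum_{j} p_j \alpha_j\Big)^{T} \Big(\sum_{j} q_j \beta_j\Big)$$

  – Minimize RMSE
  – The best AUC is 0.7776 on the public test set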
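A feature-based factorization prediction of this kind can be sketched as below, assuming the prediction is the sum of a user-side bias, an item-side bias, and the inner product of two latent vectors built as feature-weighted sums of per-feature factors. The argument names (`w_user`, `w_item`, `P`, `Q`) are hypothetical.

```python
def predict(alpha, beta, w_user, w_item, P, Q):
    """alpha, beta: user and item feature values.
    w_user, w_item: per-feature bias weights.
    P[j], Q[j]: latent factor (list of floats) of user feature j / item feature j."""
    bias = (sum(w * a for w, a in zip(w_user, alpha))
            + sum(w * b for w, b in zip(w_item, beta)))
    dim = len(P[0])
    # Latent vectors are feature-weighted sums of the per-feature factors.
    user_vec = [sum(P[j][d] * alpha[j] for j in range(len(alpha))) for d in range(dim)]
    item_vec = [sum(Q[j][d] * beta[j] for j in range(len(beta))) for d in range(dim)]
    return bias + sum(u * v for u, v in zip(user_vec, item_vec))
```

With one-hot α and β this reduces to ordinary biased matrix factorization; additional non-zero features simply add their factors into the user or item vector.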


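• Ranking-Based
  – The prediction for an instance is

$$\hat{r}(x) = \sum_{j} w_j x_j + \sum_{j=1}^{B} \sum_{k=j+1}^{B} \langle p_j, p_k \rangle \, \alpha_j \beta_k$$

  – Features can belong to α, β or both
  – Optimize pairwise ranking as

$$\min \sum_{i=1}^{N} \sum_{j=1}^{N} L\big(\hat{r}(x_i) - \hat{r}(x_j)\big) + \frac{\lambda}{2} \lVert \theta \rVert^{2}, \quad \text{where } L(\delta) = \ln\big(1 + \exp(-\delta)\big)$$

  – The best AUC is 0.7968 on the public test set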
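The following sketch shows one way such a pairwise-interaction score could be computed, assuming a linear term over the feature vector plus latent inner products between features in group α and features in group β, summed over index pairs j < k. The exact form used by the team is not fully recoverable here, so the function signature and the pair range are assumptions.

```python
def interaction_score(x, alpha, beta, w, P):
    """x: feature values for the linear term.
    alpha[j], beta[k]: value of feature j in group alpha / feature k in group beta
    (0 when the feature does not belong to that group).
    w: linear weights; P[j]: latent factor of feature j."""
    linear = sum(wj * xj for wj, xj in zip(w, x))
    pairwise = 0.0
    n_features = len(P)
    for j in range(n_features):
        for k in range(j + 1, n_features):
            inner = sum(a * b for a, b in zip(P[j], P[k]))
            pairwise += inner * alpha[j] * beta[k]
    return linear + pairwise
```

Because a feature may appear in both groups, the same latent factor `P[j]` can interact with features on either side, which is what distinguishes this model from the strictly user-versus-item split of the regression-based variant.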

Validation Set Blending (1)

• Blend models and additional features non-linearly

• Re-blending to exploit additional enhancement

• Four models for blending
  – Support Vector Regression (SVR)
  – RankNet (RN)
  – Combined Regression and Ranking Models (CRR)
  – LambdaMART (LM)






Validation Set blending (2)

• Strategies for model selection
  1. By the difference between validation AUC and test AUC
  2. Or select a diverse model set by hand

• Different scores
  – Raw score
  – Normalized score
  – Ranked score

Performance of blending models:

Model                     Public Test Set AUC
SVR                       0.8038
RN (with re-blending)     0.8062
CRR (with re-blending)    0.8051
LM (with re-blending)     0.8060
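The three score types fed to the blending models can be illustrated as follows. The slides do not say which normalization was used, so min-max scaling is an assumption here, as is the rank scaling to [0, 1] with ties broken by input order; the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Scale raw scores linearly into [0, 1]."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0] * len(scores)
    return [(s - low) / (high - low) for s in scores]


def rank_transform(scores):
    """Replace each score by its rank, scaled to [0, 1]."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, index in enumerate(order):
        ranks[index] = rank / (len(scores) - 1) if len(scores) > 1 else 0.0
    return ranks
```

Since AUC depends only on the ordering of predictions, the ranked score discards scale differences between models, while the raw and normalized scores keep information about how confident each model is.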

Test Set Ensemble (1)

• Ensemble the selected models from validation set blending

• Combine the models linearly

• The weights of the linear combination depend on the AUC on the public test set

• It achieves 0.8064 on the public test set
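A linear test-set ensemble can be sketched as below. How exactly the weights were derived from the public-test AUC is not stated, so making each weight proportional to (AUC - 0.5) is purely an illustrative assumption, and the function names are hypothetical.

```python
def auc_weights(aucs):
    """Assumed rule: weight each model by how far its AUC exceeds 0.5."""
    gains = [auc - 0.5 for auc in aucs]
    total = sum(gains)
    return [gain / total for gain in gains]


def linear_ensemble(predictions, weights):
    """Weighted sum of per-model prediction lists, one output per test instance."""
    n_instances = len(predictions[0])
    return [sum(w * model[i] for w, model in zip(weights, predictions))
            for i in range(n_instances)]
```

Passing equal weights to `linear_ensemble` gives a uniform average of the models.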




Final Result

• We apply a uniform average over the top five models on the leaderboard to produce our final solution. It achieves 0.8089 on the private test set (0.8070 on the public test set), which outperforms all the other competitors in this competition.






Take Home Points

• The main reasons for our success:
  – Tried diverse models (ranking, classification, regression, factorization)
  – Novel ways of feature engineering (e.g. smoothing, latent features using LDA, IDs, etc.)
  – Complex two-stage blending models
  – Perseverance (we have probably tried more failed models than effective ones)

Acknowledgement

• We truly thank
  – the organizers for designing a successful competition
  – NTU EECS college, CSIE department, and INTEL-NTU center for their support
