A Two-Stage Ensemble of Classification, Regression, and Ranking Models for Advertisement Ranking
Presenter: Prof. Shou-de Lin
Team NTU members: Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, Perng-Hwa Kung, Chin-En Wang, Ting-Wei Ku, Chun-Yen Ho, Yi-Shu Tai, I-Kuei Chen, Wei-Lun Huang, Che-Ping Chou, Tse-Ju Lin, Han-Jay Yang, Yen-Kai Wang, Cheng-Te Li, Prof. Hsuan-tien Lin
About Team NTU (the catch up version)
• A team from the EECS college of National Taiwan University
• This year’s team is led by Prof. Hsuan-tien Lin and Prof. Shou-de Lin
• We have a course that trains students to analyze real-world, large-scale datasets.
– Every year we recruit new students to participate in this course as well as the KDD Cup.
– The majority of our students are undergraduates; they are inexperienced, but they are smart and quick learners.
• Since 2008, the NTU team has won 4 KDD Cup championships (and a 3rd place) in the past 5 years.
Facts about Track 2
• Predict click-through rate (#click/#impression) of ads on search engine
• 155,750,158 instances in training and 20,297,594 instances in testing
• Each training instance can be viewed as a vector (#click, #impression, DisplayURL, AdID, AdvertiserID, Depth, Position, QueryID, KeywordID, TitleID, DescriptionID, UserID)
• Each testing instance has the same format, except that it lacks #click and #impression
• Users’ gender and age, as well as token information, are also provided
• Goal: Maximize AUC on testing
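Since each instance aggregates #click and #impression, one way to evaluate AUC is to count an instance as #click positives and (#impression − #click) negatives that all share one predicted score. The sketch below follows that reading; the function name `weighted_auc` is hypothetical, it is not the official scoring script, and its quadratic loop is only meant for small examples.

```python
def weighted_auc(clicks, impressions, scores):
    """AUC where instance i counts as clicks[i] positives and
    impressions[i] - clicks[i] negatives, all sharing scores[i]."""
    rows = list(zip(clicks, impressions, scores))
    total_pos = sum(c for c, _, _ in rows)
    total_neg = sum(n - c for c, n, _ in rows)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    concordant = 0.0
    for c_i, _, s_i in rows:            # positives come from instance i
        for c_j, n_j, s_j in rows:      # negatives come from instance j
            neg_j = n_j - c_j
            if s_i > s_j:
                concordant += c_i * neg_j
            elif s_i == s_j:
                concordant += 0.5 * c_i * neg_j   # ties count as half
    return concordant / (total_pos * total_neg)
```

Only the ordering of the scores matters here, which is why the models in this talk can output a classification probability, a regression value, or a ranking score.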
Framework for Track 2
• Individual models in five different categories
• Validation set blending to combine subsets of the models, boosting performance and enhancing diversity
• Test set ensemble to aggregate the high-performance blending models into our final solution
• This 3-stage framework was also used successfully in our solutions for KDD Cup 2011
[Figure: framework diagram. Individual models (Classification Models, Regression Models, Ranking Models, Combined Regression and Ranking Models, Matrix Factorization Models), Validation Set Blending Models, Test Set Ensemble, Result]
Validation Set
• We tried several strategies to create the validation set, but none of them represented testing performance more faithfully than the very naïve one below.
• We divide the training data into sub-train and validation (sub-train : validation = 10 : 1)
– Models’ performance on the validation set and the test set is slightly inconsistent; we think this is because the ratio of cold-start users differs between the sets (6.9% in the validation set, but 57.7% in the test set)
• Our conclusion: it is non-trivial to create a validation set on which a model’s performance is consistent with its performance on the test set
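The split and the cold-start check can be sketched as below. This is a minimal illustration under stated assumptions: the slide does not say the split was random, so the shuffling is an assumption, and `cold_start_ratio` uses one possible reading of the statistic (the share of distinct users in the evaluation set that never appear in sub-train). Both function names are hypothetical.

```python
import random

def split_sub_train_validation(instances, ratio=10, seed=0):
    """Shuffle and split so that sub-train : validation is about ratio : 1."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_valid = len(shuffled) // (ratio + 1)
    return shuffled[n_valid:], shuffled[:n_valid]

def cold_start_ratio(train_user_ids, eval_user_ids):
    """Share of distinct evaluation users that never appear in training."""
    seen = set(train_user_ids)
    eval_users = set(eval_user_ids)
    if not eval_users:
        raise ValueError("evaluation set has no users")
    return len(eval_users - seen) / len(eval_users)
```

Comparing this ratio between the validation set and the test set is a quick way to spot the kind of mismatch described above before trusting validation AUC.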
General Features
• We create 6 categories of features, and each individual model may use different subsets of them
– Categorical features
– Sparse features
– Click-through rate features
– ID raw value features
– Other numerical features
– Token similarity features
• In track 2, we find no ‘killer features’ such as the sequential features in track 1.
Categorical & Sparse Features
• Categorical features
– Only for Naïve Bayes
– We treat IDs such as UserID, AdID as categorical features directly
• Sparse binary features
– Expand categorical features into binary indicator features
– Most of the features = 0
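Expanding categorical IDs into binary indicators can be sketched as follows. Because most indicators are 0, the sketch stores only the column indices that are 1. The helper names and the dictionary-based row format are assumptions made for illustration.

```python
def build_index(rows, fields):
    """Assign one column index to every (field, value) pair seen in rows."""
    index = {}
    for row in rows:
        for field in fields:
            index.setdefault((field, row[field]), len(index))
    return index

def to_sparse_indicator(row, fields, index):
    """Return the sorted column indices whose indicator value is 1.

    Values never seen when the index was built are skipped."""
    return sorted(
        index[(f, row[f])] for f in fields if (f, row[f]) in index
    )
```

For example, with fields `("UserID", "AdID")` every instance has at most two non-zero columns, no matter how many distinct users and ads exist.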
Click-through Rate Features
• For each category, we generate the average click-through rate as a one-dimensional feature
• For example, for each AdID, we compute the average click-through rate for all instances of the same AdID as one feature.
• To handle biased CTR due to insufficient statistics, we apply additive smoothing:
$$\mathrm{CTR} = \frac{\#click + \alpha \times \beta}{\#impression + \beta}$$, we use $\alpha = 0.05$ and $\beta = 75$ in our experiment
– Smoothing significantly boosts the performance
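A minimal sketch of a smoothed CTR feature, assuming additive smoothing of the form (#click + α·β) / (#impression + β) with α = 0.05 and β = 75. The row format (dictionaries with `click` and `impression` keys) and the function names are hypothetical.

```python
from collections import defaultdict

def smoothed_ctr(clicks, impressions, alpha=0.05, beta=75):
    """Additive smoothing: pulls the estimate toward alpha when data is scarce."""
    return (clicks + alpha * beta) / (impressions + beta)

def ctr_feature_table(rows, key, alpha=0.05, beta=75):
    """Smoothed CTR per distinct value of rows[...][key], e.g. per AdID."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for row in rows:
        clicks[row[key]] += row["click"]
        impressions[row[key]] += row["impression"]
    return {
        value: smoothed_ctr(clicks[value], impressions[value], alpha, beta)
        for value in impressions
    }
```

With these values, an ad shown once and clicked once gets a smoothed CTR of about 0.0625 instead of the raw 1.0, which is the bias the smoothing is meant to remove.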
ID Raw Value
• We observed that the numerical values of IDs contain some information
• For example, the figure plots the total #impression for each KeywordID, and shows that #impression decreases as the value of KeywordID increases
• We guess that the ID values may encode time information
Other Numerical Features
• Features for position & depth
– ad’s position
– depth
– relative position, (depth-position)/depth
• Number of tokens for QueryID, KeywordID, TitleID and DescriptionID
• Weighted number of tokens for QueryID, KeywordID, TitleID and DescriptionID, where each token is weighted by its IDF value
• Number of impressions for each categorical feature
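The position and token-count features above can be sketched as follows. The IDF definition used here, log(N / document frequency), is one common variant and is an assumption; the slides do not say which variant was used. All function names are hypothetical.

```python
import math

def relative_position(position, depth):
    """(depth - position) / depth, as listed on the slide."""
    return (depth - position) / depth

def idf_table(token_lists):
    """IDF per token, assuming idf = log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = {}
    for tokens in token_lists:
        for tok in set(tokens):
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def weighted_token_count(tokens, idf):
    """Sum of IDF weights over all tokens; unknown tokens contribute 0."""
    return sum(idf.get(tok, 0.0) for tok in tokens)
```

The plain token count is simply `len(tokens)`; the weighted version discounts tokens that appear in almost every query, keyword, title, or description.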
Token’s Similarity Features
• Token similarity between QueryID, KeywordID, TitleID and DescriptionID as features.
– C(4,2)=6 pairs of similarity as 6 features
– Cosine similarity between tf-idf vectors of tokens
– Alternatively, we use an LDA model to extract topics for QueryID, KeywordID, TitleID and DescriptionID, and then generate cosine similarity between latent topics
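The six tf-idf similarity features can be sketched as below. The sketch assumes raw term counts for tf and takes the IDF weights as a plain dictionary supplied by the caller; both are illustrative choices, and the LDA variant is not shown.

```python
import math
from collections import Counter
from itertools import combinations

FIELDS = ("QueryID", "KeywordID", "TitleID", "DescriptionID")

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: raw term count times the token's IDF weight."""
    return {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(tokens).items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_features(tokens_by_field, idf):
    """One cosine similarity per unordered pair of fields: C(4,2) = 6 values."""
    vecs = {f: tfidf_vector(tokens_by_field[f], idf) for f in FIELDS}
    return [cosine(vecs[a], vecs[b]) for a, b in combinations(FIELDS, 2)]
```

The features come out in the fixed order (Query, Keyword), (Query, Title), (Query, Description), (Keyword, Title), (Keyword, Description), (Title, Description).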
Individual Models
• The click-through rate prediction problem is modeled as classification, regression and ranking problems
• For each strategy, we exploit several models and most of them reach competitive performance
Individual Models: Classification Models (1)
• We split each training instance into #click positive samples and (#impression - #click) negative samples
• We apply two classification methods
– Naïve Bayes
– Logistic Regression
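Splitting an aggregated instance into positive and negative samples can be sketched as below. Instead of physically copying rows, this sketch emits one weighted row per label, which is equivalent for learners that accept sample weights; that shortcut is an assumption, not something stated on the slides.

```python
def expand_instance(clicks, impressions, features):
    """Return (label, weight, features) rows for one training instance.

    weight is the number of identical samples the row stands for."""
    rows = []
    if clicks > 0:
        rows.append((1, clicks, features))        # #click positives
    negatives = impressions - clicks
    if negatives > 0:
        rows.append((0, negatives, features))     # #impression - #click negatives
    return rows
```

An instance with 2 clicks out of 5 impressions becomes one positive row of weight 2 and one negative row of weight 3.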
Individual Models: Classification Models (2)
• Naïve Bayes
– Additive smoothing and Good-Turing are applied with promising results
– The best AUC is 0.7760 on the public test set
• Logistic Regression
– Train on a sampled subset to reduce the training time
– Separate users into two groups (UserID = 0 or not), train one model for each group, and then combine the results
– This model achieves 0.7888 on the public test set
Individual Models: Regression Models (1)
• For the regression models, we use CTR = #click / #impression as the target to predict
• Two methods in this category
– Linear Regression
– Support Vector Regression
Individual Models: Regression Models (2)
• Linear Regression
– Degree-2 polynomial expansion on numerical value features
– 0.7352 AUC on the public test set
• Support Vector Regression
– Use degree-2 polynomial expansion
– The best AUC of this model is 0.7705 on the public test set
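A degree-2 polynomial expansion can be sketched as below. Keeping the original features and adding every product x[i]·x[j] with i ≤ j (squares included) is one common convention and is an assumption here; the slides do not specify the exact expansion.

```python
from itertools import combinations_with_replacement

def poly2_expand(x):
    """Original features followed by all products x[i] * x[j] with i <= j."""
    x = list(x)
    pairs = combinations_with_replacement(range(len(x)), 2)
    return x + [x[i] * x[j] for i, j in pairs]
```

For n input features this yields n + n(n+1)/2 outputs, so it is only practical on the small set of numerical features, not on the sparse indicator features.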
Individual Models: Ranking Models (1)
• We split each training instance into #click positive samples and (#impression - #click) negative samples
• Optimize pairwise ranking
• Two methods in this category
– Rank Logistic Regression
– RankNet
Individual Models: Ranking Models (2)
• Rank Logistic Regression
– Optimizes a pairwise logistic loss
– Stochastic Gradient Descent (SGD) is used
– The best AUC is 0.722 on the public test set
• RankNet
– Optimizes the cross entropy loss function
$$C_{ij} = -\hat{r}_{ij} + \log\left(1 + e^{\hat{r}_{ij}}\right)$$
with a neural network, where
$$\hat{r}_{ij} = (\hat{r}_i - \hat{r}_j) \quad\text{and}\quad \hat{r}_i = \sum_{j=1}^{H} w^{(2)}_j \tanh\Big(\sum_{k} w^{(1)}_{jk} x_{ik}\Big)$$
– Using SGD to update parameters
– The best result is 0.7577 on the public test set
Surprisingly, the ranking-based models do not outperform the other models, maybe because they are more complicated to train and tune.
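A pairwise logistic loss and one SGD update for a linear scorer can be sketched as below. This illustrates the general pairwise idea behind the ranking models, assuming each pair consists of a clicked sample and an unclicked sample; it is not the team's implementation, and the learning rate and function names are hypothetical.

```python
import math

def pairwise_logistic_loss(score_pos, score_neg):
    """ln(1 + exp(-(score_pos - score_neg))), computed without overflow."""
    margin = score_pos - score_neg
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def sgd_pair_step(w, x_pos, x_neg, lr=0.1):
    """One SGD step on the pairwise logistic loss for the scorer w . x."""
    margin = sum(wk * (a - b) for wk, a, b in zip(w, x_pos, x_neg))
    # d loss / d margin = -1 / (1 + exp(margin)), written in a stable form
    if margin >= 0:
        e = math.exp(-margin)
        grad_scale = -e / (1.0 + e)
    else:
        grad_scale = -1.0 / (1.0 + math.exp(margin))
    return [wk - lr * grad_scale * (a - b) for wk, a, b in zip(w, x_pos, x_neg)]
```

Each step pushes the clicked sample's score above the unclicked one's. The number of candidate pairs grows with the product of positives and negatives, which is one reason such models take more effort to train and tune.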
Individual Models: Combined Regression and Ranking Models (1)
• We also explore another model that combines the ranking loss and the regression loss
• In this model we optimize a combination of H and L, where H is the ranking loss and L is the regression loss
• Solved by SGD SVM; the best AUC is 0.7819
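One way to combine a regression loss L and a ranking loss H is a weighted sum, sketched below for a linear model. The trade-off weight `alpha`, the squared error for L, and the pairwise hinge loss for H are all assumptions made for illustration; the slides only say that the two losses are combined and solved with SGD.

```python
def crr_objective(w, examples, pairs, alpha=0.5):
    """alpha * mean regression loss L + (1 - alpha) * mean ranking loss H.

    examples: list of (x, target) for the regression part
    pairs:    list of (x_preferred, x_other) for the ranking part
    """
    def score(x):
        return sum(wk * xk for wk, xk in zip(w, x))

    regression = sum((score(x) - y) ** 2 for x, y in examples) / len(examples)
    ranking = sum(
        max(0.0, 1.0 - (score(xp) - score(xn))) for xp, xn in pairs
    ) / len(pairs)
    return alpha * regression + (1.0 - alpha) * ranking
```

Setting `alpha=1.0` reduces this to pure regression and `alpha=0.0` to pure pairwise ranking, so the weight controls how much the model cares about calibrated CTR values versus correct ordering.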
Individual Models: Matrix Factorization Models (1)
• We also have feature-based factorization models, which exploit latent information from data
• Two different matrix factorization models are provided. One optimizes regression loss, and the other optimizes ranking loss
Individual Models: Matrix Factorization Models (2)
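• Regression-Based
– Divide features into two groups, α as the user’s features and β as the item’s features
– The prediction for an instance is
$$\hat{r}(x_i) = \Big(\sum_j w^{(u)}_j \alpha_j + \sum_j w^{(i)}_j \beta_j\Big) + \Big(\sum_j p_j \alpha_j\Big)^T \Big(\sum_j q_j \beta_j\Big)$$
where the two sums in the first parentheses are the bias terms
– Minimize RMSE
– The best AUC is 0.7776 on the public test set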
Individual Models: Matrix Factorization Models (3)
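• Ranking-Based
– The prediction for an instance, $\hat{r}(x_i)$, combines a linear term with weights $w_j$ and pairwise interaction terms $\langle p_j, p_k \rangle$ between latent vectors, with $j$ and $k$ running from 1 to $B$
– Features can belong to α, β or both
– Optimize pairwise ranking (with a regularization term) as
$$\min \sum_{i=1}^{N} \sum_{j=1}^{N} L\big(\hat{r}(x_i) - \hat{r}(x_j)\big), \quad\text{where } L(\delta) = \ln(1 + \exp(-\delta))$$
– The best AUC is 0.7968 on the public test set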
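A factorization-style scorer trained with a pairwise loss can be sketched as below. The scoring function is a generic factorization-machine form (a linear term plus inner products of latent vectors for each feature pair j < k); it is an assumption chosen for illustration and may differ from the exact formula on the slide. The loss sums ln(1 + exp(−δ)) over (clicked, unclicked) pairs, with δ the score difference, and omits regularization.

```python
import math

def fm_score(x, w, p):
    """Linear term plus <p[j], p[k]> * x[j] * x[k] over feature pairs j < k."""
    linear = sum(wj * xj for wj, xj in zip(w, x))
    pairwise = 0.0
    for j in range(len(x)):
        for k in range(j + 1, len(x)):
            inner = sum(a * b for a, b in zip(p[j], p[k]))
            pairwise += inner * x[j] * x[k]
    return linear + pairwise

def softplus(z):
    """ln(1 + exp(z)), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def pairwise_ranking_loss(scores_clicked, scores_unclicked):
    """Sum of ln(1 + exp(-(s_pos - s_neg))) over all (clicked, unclicked) pairs."""
    return sum(
        softplus(-(s_pos - s_neg))
        for s_pos in scores_clicked
        for s_neg in scores_unclicked
    )
```

Here `w` holds one weight per feature and `p` one latent vector per feature, so two features interact strongly only when their latent vectors point in similar directions.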
Validation Set blending (1)
• Blend models and additional features non-linearly
• Re-blending to exploit additional enhancement
• Four models for blending
– Support Vector Regression (SVR)
– RankNet (RN)
– Combined Regression and Ranking Models (CRR)
– LambdaMART (LM)
Validation Set blending (2)
• Strategies for Model Selection
1. By the difference between validation AUC and test AUC
2. Or select a diverse model set by hand
• Different Scores
– Raw score
– Normalized score
– Ranked score

Performance of blending models
Model                     Public Test Set AUC
SVR                       0.8038
RN (with re-blending)     0.8062
CRR (with re-blending)    0.8051
LM (with re-blending)     0.8060
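The three score types fed into blending can be sketched as below. Min-max scaling for the normalized score and average ranks for ties in the ranked score are assumptions; the slides name the score types without defining them.

```python
def normalized_score(scores):
    """Min-max scale scores to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ranked_score(scores):
    """Replace each score by its rank scaled to [0, 1]; ties share the average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average = (start + end) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    denom = max(n - 1, 1)
    return [r / denom for r in ranks]
```

The raw score is the model output unchanged. Ranked scores discard the scale of each model entirely, which can help when blending models whose outputs live on very different ranges (for example probabilities and SVR outputs).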
Test Set Ensemble (1)
• Ensemble the selected models from validation set blending
• Combine the models linearly
• Weights of the linear combination depend on AUC on the public test set
• It achieves 0.8064 on the public test set
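A linear ensemble whose weights depend on public-test AUC can be sketched as below. The specific weighting rule (weight proportional to AUC minus 0.5, the AUC of a random ranking) is hypothetical; the slides state only that the weights depend on AUC.

```python
def auc_weights(public_aucs, baseline=0.5):
    """Weights proportional to each model's AUC gain over a random ranking."""
    gains = [max(auc - baseline, 0.0) for auc in public_aucs]
    total = sum(gains)
    if total == 0.0:
        raise ValueError("no model beats the baseline")
    return [g / total for g in gains]

def linear_ensemble(predictions, weights):
    """Weighted sum of model predictions.

    predictions: one list of scores per model, all of the same length."""
    n = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions))
        for i in range(n)
    ]
```

Passing equal weights to `linear_ensemble` gives a uniform average. Because models may output scores on different scales, the inputs would typically be normalized or ranked scores.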
Final Result
• We apply a uniform average over the top five models on the leaderboard to produce our final solution. It achieves 0.8089 on the private test set (0.8070 on the public test set), which outperforms all the other competitors in this competition.
Take Home Points
• The main reasons for our success:
– Tried diverse models (ranking, classification, regression, factorization)
– Novel ways for feature engineering (e.g. smoothing, latent features using LDA, IDs, etc.)
– Complex two-stage blending models
– Perseverance (we have probably tried more failed models than effective ones)
Acknowledgement
• We truly thank
– the organizers for designing a successful competition
– NTU EECS college, CSIE department, and INTEL-NTU center for their support