recsys challenge 2014, semwexmff group
TRANSCRIPT
Hybrid Biased k-NN to Predict Movie Tweets PopularityLadislav Peška
Department of Software EngineeringCharles University in Prague
Malostranske namesti 25, Prague, Czech Republic
Peter VojtášDepartment of Software Engineering
Charles University in PragueMalostranske namesti 25, Prague, Czech Republic
ABSTRACTIn this paper we describe approach of our SemWexMFF group to the RecSys Challenge 2014. Target of the challenge was to predict level of user engagement on tweets generated automatically from IMDB. During experiments we have tested several state-of-the-art prediction techniques and proposed a variant of item based k-NN algorithm, which better reflects user engagement and nature of the movie domain content-based attributes. Our final solution (placed in the midfield of the challenge leader board) is an aggregation of several runs of this algorithm.
OUR APPROACHOur approach follows two main hypothesis:1. Engagement of similar objects should be similar2. Engagement depends on neighborhood (friends and
followers) of current user Similarity of objects is based on their content-based attributes: Numeric attributes: normalized distance
String attributes: relative Levenshtein distance
Nominal/Set attributes: Jaccard similarity
Effect of user’s friends and followers was approximated by user bias.Movie attributes were queried from OMDB API:- average rating, number of awards, IMDB metascore- number of ratings- movie name, release date, genre, country, language, director, actors
𝑠𝑖𝑚𝑥,𝑦,𝑚𝑎𝑥𝐷𝑖𝑠𝑡 =maxቆ0,𝑚𝑎𝑥𝐷𝑖𝑠𝑡−a�𝑥−𝑦a�𝑚𝑎𝑥𝐷𝑖𝑠𝑡 ቇ
𝑠𝑖𝑚𝑥,𝑦= 1−൬ 𝑙𝑒𝑣𝑒𝑛𝑠ℎ𝑡𝑒𝑖𝑛(𝑥,𝑦)max(𝑙𝑒𝑛𝑔ℎ𝑡ሺ𝑥ሻ, 𝑙𝑒𝑛𝑔ℎ𝑡(𝑦))൰
𝑠𝑖𝑚𝐱,𝐲= |ሺ𝐱∩𝐲ሻ| |ሺ𝐱∪𝐲ሻΤ |
RESULTS
HYBRID BIASED k-NNFor tweet tID, its movie mID and fixed k, the algorithm first compute similarities to other movies and selects k most similar movies. Then for each tweet about the movie the predicted ranking 𝑟Ƹ is increased according to similarity𝑠Ƹ, user engagement r and bias of the tweeting user. The bias of the current movie is added in the final 𝑟Ƹ prediction too.
function HybridBiasedKNN(tID, mID1, k){
𝑟Ƹ = 0; /*compute similarity for all movies */
foreach(mID2 ϵ TrainSet){
S[mID2] = similarity(mID1, mID2);
}
Sത = getKMostSimilar(S,k); /*get all tweets about movies in Sത */ foreach({uID, mID, r, 𝑠Ƹ}: {uID, mID, r} ϵ TrainSet && Sത[mID]=𝑠Ƹ){ 𝑟Ƹ += 𝑠Ƹ * r / (bias(uID) + ε); }
𝑟Ƹ = bias(mID1) + (𝑟Ƹ / sum(𝑠Ƹ)) return 𝑟Ƹ; }
BraveheartTID: 421065455743541248
UID: 25813709
The Patriot
8.4
7.1
1995
2000
68
63
Action, Biography, Drama
Action, Drama, War
Mel Gibson; James Robinson; Sean Lawlor; Sandy Nelson; James Cosmo
Mel Gibson; Heath Ledger; Joely Richardson; Jason Isaacs
Rating YearIMDB
metascoreGenre Actors
TID: 410808483345465344UID: 307867510Engagement: 0
TID: 421040870931320833UID: 296041028Engagement: 3 …
AVG Eng.: 0.0AVG Eng.: 0.3250
AVG Eng.: 0.024
AVG Eng.: 0.001
Results of state-of-the-art methods
Method nDCG
Random predictions 0.7482
Bi-Polar Slope One 0.7652
Factor Wise Matrix Factorization 0.7556
Item-Item k-NN 0.7604
Decision Tree 0.7494
Support Vector Machines (SVM) 0.8057
Results of Hybrid k-NN using only one attribute
Method nDCG Method nDCG
AVG rating 0.7918 Genres 0.7919
Awards 0.7652 Countries 0.7984
IMDB Metascore 0.8057 Languages 0.8005
Number of ratings 0.7964 Director 0.8029
Movie name 0.7947 Actors 0.7930
Release year 0.7962
Results of Hybrid k-NN combining more attributes
Method nDCGHybrid k-nn (Metascore, Language, Director, Country, Date, # of ratings)
0.7927
Hybrid k-nn(Metascore, Language, Director, Country, Date, # of ratings), no bias
0.7792
Linear Regression (Metascore, Language, Director, Country, Date, # of ratings)
0.7913
AVG (Metascore, Language, Director, Country, Date, # of ratings), omit best and worst prediction
0.8134
LESSONS LEARNED and POSSIBLE EXTENSIONSBoth hypothesis on which we based our solution seems to be confirmed. Omitting user bias lead to severe decrease of algorithm success metrics and almost all content-based attributes proved to be quite good measure of movie similarity.
Hybrid biased k-NN outperformed all considered state-of-the-art machine learning methods, however our results were placed in lower midfield of the challenge. Several extensions to the current approach is possible, namely:
- Considering temporal dependance in the dataset- Using some of the tweet characteristics, additional content-based attributes e.g. from DBPedia or some other meta-learning methods