RecSys Challenge 2014, SemWexMFF group

Hybrid Biased k-NN to Predict Movie Tweets Popularity

Ladislav Peška
Department of Software Engineering, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic

Peter Vojtáš
Department of Software Engineering, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic

ABSTRACT

In this paper we describe the approach of our SemWexMFF group to the RecSys Challenge 2014. The target of the challenge was to predict the level of user engagement on tweets generated automatically from IMDB. During the experiments we tested several state-of-the-art prediction techniques and proposed a variant of the item-based k-NN algorithm that better reflects user engagement and the nature of the content-based attributes in the movie domain. Our final solution (placed in the midfield of the challenge leaderboard) is an aggregation of several runs of this algorithm.

OUR APPROACH

Our approach follows two main hypotheses:
1. Engagement of similar objects should be similar.
2. Engagement depends on the neighborhood (friends and followers) of the current user.

Similarity of objects is based on their content-based attributes:
- Numeric attributes: normalized distance
- String attributes: relative Levenshtein distance
- Nominal/set attributes: Jaccard similarity

The effect of the user's friends and followers was approximated by a user bias.

Movie attributes were queried from the OMDB API:
- average rating, number of awards, IMDB metascore
- number of ratings
- movie name, release date, genre, country, language, director, actors

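Numeric attributes:

$$sim_{x,y,maxDist} = \max\left(0, \frac{maxDist - |x - y|}{maxDist}\right)$$

String attributes:

$$sim_{x,y} = 1 - \frac{levenshtein(x, y)}{\max(length(x), length(y))}$$

Nominal/set attributes:

$$sim_{\mathbf{x},\mathbf{y}} = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}$$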
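The three similarity measures can be sketched in Python as follows. This is a minimal illustration and not the group's implementation: the function names, the handling of empty strings and empty sets, and the example `max_dist` values are assumptions that the poster does not specify.

```python
def numeric_similarity(x: float, y: float, max_dist: float) -> float:
    """Normalized distance: max(0, (max_dist - |x - y|) / max_dist)."""
    return max(0.0, (max_dist - abs(x - y)) / max_dist)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(x: str, y: str) -> float:
    """Relative Levenshtein similarity: 1 - lev(x, y) / max(len(x), len(y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(x, y) / longest


def jaccard_similarity(x: set, y: set) -> float:
    """Jaccard similarity: |x ∩ y| / |x ∪ y|."""
    union = x | y
    if not union:
        return 0.0  # assumption: two empty sets count as dissimilar
    return len(x & y) / len(union)


# Release years 1995 and 2000 with a hypothetical max_dist of 10 years
print(numeric_similarity(1995, 2000, 10))
# Genre sets that share two of four distinct genres
print(jaccard_similarity({"Action", "Biography", "Drama"},
                         {"Action", "Drama", "War"}))
```

The numeric measure is clipped at zero, so any pair of values further apart than `max_dist` is treated as completely dissimilar.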

RESULTS

HYBRID BIASED k-NN

For a tweet tID about movie mID and a fixed k, the algorithm first computes the similarities to other movies and selects the k most similar ones. Then, for each training tweet about one of these movies, the prediction r̂ is increased according to the similarity ŝ, the user engagement r of that tweet, and the bias of the tweeting user. Finally, the bias of the current movie is added to the prediction r̂.

function HybridBiasedKNN(tID, mID1, k){
    r̂ = 0;
    /* compute similarity for all movies */
    foreach(mID2 ϵ TrainSet){
        S[mID2] = similarity(mID1, mID2);
    }
    S̄ = getKMostSimilar(S, k);
    /* get all tweets about movies in S̄ */
    foreach({uID, mID, r, ŝ}: {uID, mID, r} ϵ TrainSet && S̄[mID] = ŝ){
        r̂ += ŝ * r / (bias(uID) + ε);
    }
    r̂ = bias(mID1) + (r̂ / sum(ŝ));
    return r̂;
}
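The pseudocode can be turned into a runnable Python sketch under a few assumptions. The poster does not define how `similarity` and `bias` are computed, so they are passed in here as callables (`similarity`, `user_bias`, `movie_bias` are hypothetical names). The unused `tID` parameter is dropped, the training set is assumed to be a list of `(uID, mID, r)` triples, and the fallback for an empty neighborhood is also an assumption.

```python
from typing import Callable, Hashable, List, Tuple

Tweet = Tuple[Hashable, Hashable, float]  # (uID, mID, engagement r)


def hybrid_biased_knn(
    m_id1: Hashable,
    k: int,
    train_set: List[Tweet],
    similarity: Callable[[Hashable, Hashable], float],
    user_bias: Callable[[Hashable], float],
    movie_bias: Callable[[Hashable], float],
    eps: float = 1e-6,
) -> float:
    # Compute the similarity of the target movie to every training movie.
    movies = {m_id for _, m_id, _ in train_set}
    sims = {m_id2: similarity(m_id1, m_id2) for m_id2 in movies}

    # Keep only the k most similar movies.
    ranked = sorted(sims.items(), key=lambda item: item[1], reverse=True)
    top_k = dict(ranked[:k])

    # Accumulate similarity-weighted engagement, scaled by the user bias.
    r_hat = 0.0
    s_sum = 0.0
    for u_id, m_id, r in train_set:
        s = top_k.get(m_id)
        if s is None:
            continue  # tweet is about a movie outside the neighborhood
        r_hat += s * r / (user_bias(u_id) + eps)
        s_sum += s

    # Assumption: fall back to the movie bias when no similarity mass exists.
    if s_sum == 0.0:
        return movie_bias(m_id1)
    return movie_bias(m_id1) + r_hat / s_sum


# Toy usage with made-up data and constant biases
train = [("u1", "A", 2.0), ("u2", "B", 4.0), ("u3", "C", 10.0)]
toy_sims = {"A": 1.0, "B": 0.5, "C": 0.1}
print(hybrid_biased_knn("X", 2, train,
                        similarity=lambda a, b: toy_sims[b],
                        user_bias=lambda u: 1.0,
                        movie_bias=lambda m: 0.5,
                        eps=0.0))
```

As in the pseudocode, the sum of similarities runs over tweets rather than movies, so a neighbor movie with many tweets contributes proportionally more weight.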

[Figure: worked example comparing Braveheart (TID: 421065455743541248, UID: 25813709) with The Patriot]

Attribute: Braveheart / The Patriot
- Rating: 8.4 / 7.1
- Year: 1995 / 2000
- IMDB metascore: 68 / 63
- Genre: Action, Biography, Drama / Action, Drama, War
- Actors: Mel Gibson; James Robinson; Sean Lawlor; Sandy Nelson; James Cosmo / Mel Gibson; Heath Ledger; Joely Richardson; Jason Isaacs

Tweets shown in the figure:
- TID: 410808483345465344, UID: 307867510, Engagement: 0
- TID: 421040870931320833, UID: 296041028, Engagement: 3
- …

AVG Eng. values shown in the figure: 0.0, 0.3250, 0.024, 0.001

Results of state-of-the-art methods

Method                              nDCG
Random predictions                  0.7482
Bi-Polar Slope One                  0.7652
Factor Wise Matrix Factorization    0.7556
Item-Item k-NN                      0.7604
Decision Tree                       0.7494
Support Vector Machines (SVM)       0.8057

Results of Hybrid k-NN using only one attribute

Method               nDCG
AVG rating           0.7918
Awards               0.7652
IMDB Metascore       0.8057
Number of ratings    0.7964
Movie name           0.7947
Release year         0.7962
Genres               0.7919
Countries            0.7984
Languages            0.8005
Director             0.8029
Actors               0.7930

Results of Hybrid k-NN combining more attributes

All methods use the attributes Metascore, Language, Director, Country, Date, # of ratings.

Method                                   nDCG
Hybrid k-NN                              0.7927
Hybrid k-NN, no bias                     0.7792
Linear Regression                        0.7913
AVG, omit best and worst prediction      0.8134
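The best-scoring row averages the predictions of the individual runs after omitting the best and worst prediction. One possible reading, sketched below, is a trimmed mean that drops the highest and the lowest predicted value for a tweet before averaging. This interpretation, the function name `trimmed_mean`, and the fallback for fewer than three runs are assumptions; the poster does not give the exact procedure.

```python
from typing import Sequence


def trimmed_mean(predictions: Sequence[float]) -> float:
    """Average the predictions after dropping one highest and one lowest value."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    if len(predictions) < 3:
        # Assumption: with too few runs to trim, use the plain average.
        return sum(predictions) / len(predictions)
    kept = sorted(predictions)[1:-1]  # drop the minimum and the maximum
    return sum(kept) / len(kept)


# Hypothetical per-attribute predictions for a single tweet
print(trimmed_mean([0.0, 1.0, 2.0, 9.0]))
```

Trimming the extremes makes the aggregate less sensitive to a single attribute run that strongly over- or under-predicts engagement.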

LESSONS LEARNED and POSSIBLE EXTENSIONS

Both hypotheses on which we based our solution seem to be confirmed. Omitting the user bias led to a severe decrease in the algorithm's success metrics (nDCG 0.7792 without bias versus 0.7927 with it), and almost all content-based attributes proved to be quite good measures of movie similarity.

Hybrid biased k-NN outperformed all considered state-of-the-art machine learning methods; however, our results placed in the lower midfield of the challenge. Several extensions to the current approach are possible, namely:

- Considering temporal dependence in the dataset
- Using some of the tweet characteristics, additional content-based attributes (e.g. from DBPedia), or some other meta-learning methods

