RecSys Challenge 2014, SemWexMFF group


Upload: ladislav-peska

Post on 28-Jul-2015


Hybrid Biased k-NN to Predict Movie Tweets Popularity

Ladislav Peška
Department of Software Engineering, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic

Peter Vojtáš
Department of Software Engineering, Charles University in Prague
Malostranske namesti 25, Prague, Czech Republic

ABSTRACT
In this paper we describe the approach of our SemWexMFF group to the RecSys Challenge 2014. The target of the challenge was to predict the level of user engagement on tweets generated automatically from IMDB. During the experiments we tested several state-of-the-art prediction techniques and proposed a variant of the item-based k-NN algorithm that better reflects user engagement and the nature of the content-based attributes in the movie domain. Our final solution (placed in the midfield of the challenge leaderboard) is an aggregation of several runs of this algorithm.

OUR APPROACH
Our approach follows two main hypotheses:
1. Engagement of similar objects should be similar.
2. Engagement depends on the neighborhood (friends and followers) of the current user.

Similarity of objects is based on their content-based attributes:
- Numeric attributes: normalized distance
- String attributes: relative Levenshtein distance
- Nominal/set attributes: Jaccard similarity

The effect of a user's friends and followers was approximated by a user bias.

Movie attributes were queried from the OMDB API:
- average rating, number of awards, IMDB metascore
- number of ratings
- movie name, release date, genre, country, language, director, actors

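Numeric attributes (normalized distance):

$$\mathrm{sim}_{x,y,\mathit{maxDist}} = \max\left(0,\ \frac{\mathit{maxDist} - |x - y|}{\mathit{maxDist}}\right)$$

String attributes (relative Levenshtein distance):

$$\mathrm{sim}_{x,y} = 1 - \frac{\mathrm{levenshtein}(x, y)}{\max\left(\mathrm{length}(x),\ \mathrm{length}(y)\right)}$$

Nominal/set attributes (Jaccard similarity):

$$\mathrm{sim}_{\mathbf{x},\mathbf{y}} = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}$$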
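The three similarity measures can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the handling of empty strings and empty sets is an assumption that the poster does not specify.

```python
def numeric_similarity(x: float, y: float, max_dist: float) -> float:
    """Normalized distance: max(0, (maxDist - |x - y|) / maxDist)."""
    if max_dist <= 0:
        raise ValueError("max_dist must be positive")
    return max(0.0, (max_dist - abs(x - y)) / max_dist)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(x: str, y: str) -> float:
    """Relative Levenshtein: 1 - levenshtein(x, y) / max(len(x), len(y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(x, y) / longest


def set_similarity(x: set, y: set) -> float:
    """Jaccard similarity: |x ∩ y| / |x ∪ y|."""
    union = x | y
    if not union:
        return 0.0  # assumption: two empty sets carry no evidence of similarity
    return len(x & y) / len(union)
```

The value of maxDist has to be chosen per numeric attribute; the poster does not state which values were used.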

RESULTS

HYBRID BIASED k-NN
For a tweet tID, its movie mID and a fixed k, the algorithm first computes the similarities to the other movies and selects the k most similar ones. Then, for each tweet about one of these movies, the predicted ranking r̂ is increased according to the similarity ŝ, the user engagement r and the bias of the tweeting user. The bias of the current movie is added to the final prediction r̂ too.

function HybridBiasedKNN(tID, mID1, k) {
    r̂ = 0;
    /* compute similarity for all movies */
    foreach (mID2 ∈ TrainSet) {
        S[mID2] = similarity(mID1, mID2);
    }
    S̄ = getKMostSimilar(S, k);
    /* get all tweets about movies in S̄ */
    foreach ({uID, mID, r, ŝ} : {uID, mID, r} ∈ TrainSet && S̄[mID] = ŝ) {
        r̂ += ŝ * r / (bias(uID) + ε);
    }
    r̂ = bias(mID1) + (r̂ / sum(ŝ));
    return r̂;
}
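A runnable Python sketch of the pseudocode above might look as follows. It rests on several assumptions that the poster does not confirm: the training set is a list of (uID, mID, r) tuples, the biases are supplied as precomputed dictionaries (how they were estimated is not described), the target movie itself is excluded from its neighbors, and sum(ŝ) is read as the sum of similarities over all contributing tweets. All names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Tweet = Tuple[Hashable, Hashable, float]  # (uID, mID, engagement r)


def hybrid_biased_knn(
    target_movie: Hashable,
    k: int,
    train_set: Iterable[Tweet],
    similarity: Callable[[Hashable, Hashable], float],
    user_bias: Dict[Hashable, float],
    movie_bias: Dict[Hashable, float],
    eps: float = 1e-6,
) -> float:
    tweets = list(train_set)

    # Similarity of the target movie to every other movie in the training set.
    candidates = {m_id for _, m_id, _ in tweets if m_id != target_movie}
    sims = {m_id: similarity(target_movie, m_id) for m_id in candidates}

    # Keep only the k most similar movies.
    ranked = sorted(sims.items(), key=lambda item: item[1], reverse=True)
    top_k = dict(ranked[:k])

    # Accumulate the similarity-weighted engagement of tweets about those movies,
    # scaled by the bias of the tweeting user (unknown users get bias 0, an assumption).
    weighted_sum = 0.0
    sim_sum = 0.0
    for u_id, m_id, r in tweets:
        s = top_k.get(m_id)
        if s is None:
            continue
        weighted_sum += s * r / (user_bias.get(u_id, 0.0) + eps)
        sim_sum += s

    base = movie_bias.get(target_movie, 0.0)
    if sim_sum == 0.0:
        return base  # no usable neighbors: fall back to the movie bias alone
    return base + weighted_sum / sim_sum
```

Because the engagement is divided by the user bias, users with a bias close to zero contribute very large terms; ε only guards against division by exactly zero.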

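[Figure: example of a target tweet and a compared movie]

Target tweet: Braveheart, TID: 421065455743541248, UID: 25813709
Compared movie: The Patriot

Rating: 8.4 (Braveheart), 7.1 (The Patriot)
Year: 1995 (Braveheart), 2000 (The Patriot)
IMDB metascore: 68 (Braveheart), 63 (The Patriot)
Genre: Action, Biography, Drama (Braveheart); Action, Drama, War (The Patriot)
Actors: Mel Gibson; James Robinson; Sean Lawlor; Sandy Nelson; James Cosmo (Braveheart); Mel Gibson; Heath Ledger; Joely Richardson; Jason Isaacs (The Patriot)

Tweets shown with their engagement:
- TID: 410808483345465344, UID: 307867510, Engagement: 0
- TID: 421040870931320833, UID: 296041028, Engagement: 3
- …

Average engagement labels in the figure: 0.0, 0.3250, 0.024, 0.001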
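As a worked illustration, the set and numeric similarity formulas can be applied to the attribute values listed for Braveheart and The Patriot. The set similarities use only the genres and actors shown in the example, and maxDist is left symbolic because the poster does not give the values used.

$$\mathrm{sim}_{\text{genre}} = \frac{|\{\text{Action}, \text{Drama}\}|}{|\{\text{Action}, \text{Biography}, \text{Drama}, \text{War}\}|} = \frac{2}{4} = 0.5$$

$$\mathrm{sim}_{\text{actors}} = \frac{|\{\text{Mel Gibson}\}|}{5 + 4 - 1} = \frac{1}{8} = 0.125$$

$$\mathrm{sim}_{\text{year}} = \max\left(0,\ \frac{\mathit{maxDist} - |1995 - 2000|}{\mathit{maxDist}}\right) = \max\left(0,\ \frac{\mathit{maxDist} - 5}{\mathit{maxDist}}\right)$$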

Results of state-of-the-art methods

Method                              nDCG
Random predictions                  0.7482
Bi-Polar Slope One                  0.7652
Factor Wise Matrix Factorization    0.7556
Item-Item k-NN                      0.7604
Decision Tree                       0.7494
Support Vector Machines (SVM)       0.8057

Results of Hybrid k-NN using only one attribute

Method               nDCG      Method       nDCG
AVG rating           0.7918    Genres       0.7919
Awards               0.7652    Countries    0.7984
IMDB Metascore       0.8057    Languages    0.8005
Number of ratings    0.7964    Director     0.8029
Movie name           0.7947    Actors       0.7930
Release year         0.7962

Results of Hybrid k-NN combining more attributes

Method                                                                                              nDCG
Hybrid k-NN (Metascore, Language, Director, Country, Date, # of ratings)                            0.7927
Hybrid k-NN (Metascore, Language, Director, Country, Date, # of ratings), no bias                   0.7792
Linear Regression (Metascore, Language, Director, Country, Date, # of ratings)                      0.7913
AVG (Metascore, Language, Director, Country, Date, # of ratings), omit best and worst prediction    0.8134
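The best-scoring row, an average that omits the best and worst prediction, can be read as a trimmed mean over the predictions of the single-attribute runs for one tweet. The sketch below follows that reading; the interpretation of "best" and "worst" as the highest and lowest predicted value, the function name and the fallback for very short inputs are all assumptions.

```python
from typing import Sequence


def trimmed_average(predictions: Sequence[float]) -> float:
    """Average the predictions after dropping one highest and one lowest value."""
    if not predictions:
        raise ValueError("need at least one prediction")
    if len(predictions) <= 2:
        # Assumption: with too few values nothing is dropped.
        return sum(predictions) / len(predictions)
    kept = sorted(predictions)[1:-1]
    return sum(kept) / len(kept)
```

For example, six single-attribute predictions for one tweet would be reduced to the mean of the four middle values, which limits the influence of a single outlying run.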

LESSONS LEARNED AND POSSIBLE EXTENSIONS
Both hypotheses on which we based our solution seem to be confirmed. Omitting the user bias led to a severe decrease in the algorithm's success metric (nDCG dropped from 0.7927 to 0.7792), and almost all content-based attributes proved to be quite good measures of movie similarity.

Hybrid biased k-NN outperformed all considered state-of-the-art machine learning methods; however, our results were placed in the lower midfield of the challenge. Several extensions to the current approach are possible, namely:

- Considering temporal dependence in the dataset
- Using some of the tweet characteristics, additional content-based attributes (e.g. from DBPedia) or some other meta-learning methods