TRANSCRIPT
Performance of Recommender Algorithms on Top-N Recommendation Tasks
RecSys 2010
Intelligent Database Systems Lab.
School of Computer Science & Engineering
Seoul National University
Center for E-Business Technology, Seoul National University, Seoul, Korea
Presented by Sangkeun Lee, 1/14/2011
Paolo Cremonesi, Yehuda Koren, Roberto Turrin (Politecnico di Milano; Yahoo! Research, Haifa, Israel; Neptuny, Milan, Italy)
Introduction
Competition among recommender systems
Evaluated by error metrics such as RMSE (root mean squared error)
The average error between estimated ratings and actual ratings
Why is the majority of the literature focused on error metrics?
Logical & convenient
However, many commercial systems perform top-N recommendation tasks
The systems suggest a few specific items to the user that are likely to be very appealing to him
Introduction: Top-N Performance
Classical error measures (e.g. RMSE, MAE) do not really measure top-N performance
Measure for Top-N Performance
Accuracy metrics
– Recall and Precision
In this paper,
The authors present an extensive evaluation of several state-of-the-art recommender systems & naïve non-personalized algorithms
And they give us some insight from the experimental results
On Netflix & Movielens datasets
Testing Methodology: Dataset
For each dataset, known ratings are split into two subsets:
Training set M and test set T
Test set T contains only 5-star ratings
– So, we can reasonably state that T contains items relevant to the respective users
For the Netflix dataset,
Training set = the Netflix Prize training dataset (100M ratings)
Test set = 5-star ratings from probe dataset for Netflix prize (|T|=384,573)
For the Movielens dataset,
Randomly sub-sampled 1.4% of the ratings from the dataset to create the test set
Testing Methodology: Measuring Precision and Recall
1) Train the model over the ratings in M
2) For each item i rated 5 stars by user u in T:
Randomly select 1,000 additional items unrated by user u
Predict ratings for the test item i and for the additional 1,000 items
Form a ranked list by ordering all 1,001 items according to their predicted ratings. Let p denote the rank of the test item i within this list. (The best result: p=1)
Form a top-N recommendation list by picking the N top-ranked items from the list. If p <= N we have a hit; otherwise we have a miss.
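A minimal Python sketch of this per-test-case protocol is below. The `predict(user, item)` scorer, the `rated_by_user` set, and the item universe are placeholder assumptions standing in for whatever recommender and data structures are actually used; this is an illustration of the procedure, not the authors' code.

```python
import random

def rank_test_item(predict, user, test_item, all_items, rated_by_user, n_extra=1000):
    """Rank one 5-star test item against n_extra randomly sampled unrated items.

    predict(user, item) is a hypothetical scoring function for the recommender
    being evaluated; rated_by_user is the set of items the user rated in M.
    """
    unrated = [i for i in all_items if i not in rated_by_user and i != test_item]
    candidates = random.sample(unrated, n_extra) + [test_item]
    # Order the 1,001 items by predicted rating, best first.
    ranked = sorted(candidates, key=lambda i: predict(user, i), reverse=True)
    return ranked.index(test_item) + 1   # p = 1 is the best possible rank

def is_hit(p, N):
    """Top-N hit if the test item appears within the first N positions."""
    return p <= N
```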
Testing Methodology: Measuring Precision and Recall
For any single test case,
Recall for a single test case can assume either the value 0 (miss) or 1 (hit)
Precision for a single test case can assume either the value 0 (miss) or 1/N (hit)
The overall recall and precision are defined by averaging over all test cases
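Since each hit contributes 1 to recall and 1/N to precision, averaging over the |T| test cases (with #hits counting the cases where p <= N) gives:

$$\mathrm{recall}(N) = \frac{\#\text{hits}}{|T|}, \qquad \mathrm{precision}(N) = \frac{\#\text{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$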
Rating Distribution: Popular Items vs. Long-Tail
About 33% of the ratings collected by Netflix involve only the 1.7% most popular items
To evaluate the accuracy of recommender algorithms in suggesting non-trivial items, T has been partitioned into T_head and T_long
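A rough sketch of such a partition, assuming popularity is counted from training ratings; the 1.7% cut-off simply mirrors the figure quoted on this slide, and the paper's exact split criterion may differ.

```python
from collections import Counter

def split_head_long_tail(train_ratings, test_items, head_fraction=0.017):
    """Partition test items into T_head (most popular items) and T_long.

    train_ratings: list of (user, item, rating) triples from the training set M.
    head_fraction: fraction of items treated as the popular "head" (assumed).
    """
    popularity = Counter(item for _, item, _ in train_ratings)
    ranked_items = [item for item, _ in popularity.most_common()]
    head = set(ranked_items[: int(len(ranked_items) * head_fraction)])
    t_head = [i for i in test_items if i in head]
    t_long = [i for i in test_items if i not in head]
    return t_head, t_long
```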
Algorithms
Non-personalized models
Movie Rating Average (MovieAvg) – ranks items by their average rating
Top Popular (TopPop) – ranks items by their number of ratings – not applicable for measuring error metrics
Collaborative Filtering models
Neighborhood models
– The most common approaches
– Based on similarity among either users or items
Latent factor models
– Finding hidden factors
– Model users and items in the same latent factor space
– Predict ratings using proximity (e.g., inner product)
Neighborhood Models
Correlation Neighborhood (CorNgbr)
Estimated rating: $\hat{r}_{ui} = b_{ui} + \dfrac{\sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in D^k(u;i)} d_{ij}}$
$b_{ui}$ denotes the rating bias of user u for item i (e.g. average ratings)
$D^k(u;i)$ denotes the set of the k items most similar to i that user u has rated
$d_{ij} = \dfrac{n_{ij}}{n_{ij} + \lambda}\,s_{ij}$ represents the shrunk similarity
$n_{ij}$ is the number of common raters
$s_{ij}$ is the similarity between items i and j (cosine similarity)
Non-normalized Cosine Neighborhood (NNCosNgbr)
The weighted sum over neighbors is not normalized, so items with many similar rated neighbors get a higher ranking
The score is no longer an estimated rating, but we can still use it for top-N recommendation tasks (see the sketch below)
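The sketch below illustrates both scoring variants for one user, assuming a precomputed item-item similarity matrix and per-item baseline estimates; the names (`r_u`, `b_u`, `sim`) are illustrative, and this is a rough rendering of the idea, not the authors' implementation.

```python
import numpy as np

def ngbr_scores(r_u, b_u, sim, k=50, normalized=True):
    """Item-based neighborhood scores for one user.

    r_u : the user's rating vector (0 = unrated)            [n_items]
    b_u : baseline estimates b_ui for this user              [n_items]
    sim : item-item shrunk similarity matrix d_ij            [n_items, n_items]
    normalized=True  -> CorNgbr-style weighted average
    normalized=False -> NNCosNgbr-style unnormalized sum, which favours
                        items with many similar rated neighbours
    """
    rated = np.flatnonzero(r_u)               # items j rated by the user
    resid = r_u[rated] - b_u[rated]           # r_uj - b_uj
    scores = np.empty(len(r_u))
    for i in range(len(r_u)):
        d = sim[i, rated]
        top = np.argsort(d)[-k:]              # k most similar rated items D^k(u;i)
        num = np.dot(d[top], resid[top])
        if normalized:
            scores[i] = b_u[i] + num / (np.abs(d[top]).sum() + 1e-9)
        else:
            scores[i] = b_u[i] + num
    return scores
```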
Latent Factor Models
The key idea is to factorize the user-item matrix into two lower-rank matrices
One matrix containing user factors
One matrix containing item factors
Rating estimation is computed as the inner product of the corresponding user and item factor vectors, e.g. $\hat{r}_{ui} = q_i^T p_u$ (a minimal training sketch follows below)
SVD is undefined in the presence of unknown values
Replace unknown ratings with baseline estimations
Learn factor vectors through a suitable objective function which minimizes the prediction error
And so on. (out of scope)
Two state-of-the-art algorithms
Asymmetric-SVD (AsySVD)
SVD++ (high quality in RMSE)
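For concreteness, here is a minimal SGD matrix-factorization sketch of the general latent-factor idea (learn P and Q by minimizing squared prediction error). It is not AsySVD or SVD++, which add biases and implicit feedback; hyperparameters and the input format are assumptions.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=20, lr=0.01, reg=0.02, epochs=10):
    """Plain SGD matrix factorization.

    ratings: list of (user_idx, item_idx, rating) triples from the training set M.
    Returns user factors P and item factors Q; prediction is P[u] @ Q[i].
    """
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factor vectors p_u
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factor vectors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                 # prediction error for r_ui
            p_u = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q
```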
Latent Factor Models: PureSVD
Now,
We are interested only in a correct item ranking
We don’t need exact rating prediction
PureSVD
Considering all missing values in the user rating matrix as zeros
Let's define the truncated SVD of the rating matrix: $R = U \cdot \Sigma \cdot Q^T$
– The u-th row of $U \cdot \Sigma$ represents the user factor vector
– The i-th row of $Q$ represents the item factor vector
Ranking score: $\hat{r}_{ui} = \mathbf{r}_u \cdot Q \cdot q_i^T$, where $\mathbf{r}_u$ is the u-th row of the user rating matrix
The score is no longer an estimated rating, but we can still use it for top-N recommendation tasks
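One possible NumPy/SciPy rendering of this description, assuming (user, item, rating) triples as input; helper names are illustrative, not from the paper's code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def pure_svd(train_ratings, n_users, n_items, k=50):
    """PureSVD: missing entries are treated as zeros, the rating matrix is
    factorized with a truncated SVD, and the ranking score is r_u · Q · q_i^T."""
    users, items, vals = zip(*train_ratings)
    R = csr_matrix((vals, (users, items)), shape=(n_users, n_items), dtype=np.float64)
    U, s, Qt = svds(R, k=k)          # R ≈ U · diag(s) · Qt with k latent factors
    Q = Qt.T                          # i-th row of Q is the item factor vector q_i

    def scores_for_user(u):
        r_u = R[u].toarray().ravel()  # u-th row of the user rating matrix
        return (r_u @ Q) @ Q.T        # score for every item i: r_u · Q · q_i^T
    return scores_for_user
```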
RMSE Ranking
SVD++ 0.8911
AsySVD 0.9000
CorNgbr 0.9406
MovieAvg 1.053
Note that TopPop, NNCosNgbr, and PureSVD are not applicable for measuring error metrics
Result: Movielens dataset
All items, recall at N=10: AsySVD is about 0.28, TopPop about 0.29, SVD++ and NNCosNgbr about 0.44, PureSVD about 0.52 (with 50 factors)
All items, precision: PureSVD outperforms the others; TopPop and AsySVD perform similarly
The (widely used) CorNgbr underperforms!
Long-tail: the accuracy of TopPop falls dramatically
PureSVD is still the best (with 150 factors)
SVD++ is the best among the RMSE-oriented algorithms
Result: Netflix dataset
All items: TopPop outperforms CorNgbr
AsySVD and SVD++ perform slightly better than TopPop (note that these algorithms are possibly better tuned for the Netflix data)
NNCosNgbr works well
PureSVD is still the best
Long-tail: CorNgbr significantly underperforms on the head, but it performs well on long-tail data (this probably explains why CorNgbr has been widely used)
PureSVD??
Poor design in terms of rating estimation
The authors did not expect the result
PureSVD
Easy to code & good computational performance, both offline and online
When moving to longer-tail items, accuracy improves when raising the dimensionality of the PureSVD model (50 -> 150)
– This could mean that the first latent factors capture properties of popular items, while additional factors capture properties of long-tail items
Conclusions
Error metrics have been more popular
Mathematical convenience
Formal optimization
However, it is well recognized that accuracy measures may be more natural
In summary,
(1) There is no monotonic (trivial) relation between error metrics and accuracy metrics
(2) Test cases should be carefully selected, as the experimental results show (long-tail vs. head) – watch out for possible pitfalls!
(3) New variants of existing algorithms improve top-N performance
Q&A
Thank you