ht2014 tutorial: evaluating recommender systems - ensuring replicability of evaluation
DESCRIPTION
Tutorial on "Evaluating Recommender Systems: Ensuring Reproducibility of Evaluation". With Alan Said.TRANSCRIPT
![Page 1: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/1.jpg)
Evaluating Recommender Systems: Ensuring Replicability of Evaluation
Alejandro Bellogín (UAM, Spain)
Alan Said (TU-Delft, The Netherlands)
Tutorial at Hypertext 2014
![Page 2: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/2.jpg)
![Page 3: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/3.jpg)
About me
• 2014: Lecturer at UAM (Spain)
• 2013: Post-doctoral Marie Curie fellow at CWI (The Netherlands)
• 2007-2012: PhD student at UAM
• Topics (recommender systems): algorithms (probabilistic, hybrid, trust-based, social-based, graph-based), evaluation (methodologies, biases), user analysis (user clarity, coherence, performance)
![Page 4: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/4.jpg)
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [30 minutes]
• Reproducible Experimental Design [60 minutes]
• Summary [10 minutes]
![Page 5: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/5.jpg)
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [30 minutes]
• Reproducible Experimental Design [60 minutes]
• Summary [10 minutes]
![Page 6: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/6.jpg)
Background
• A recommender system aims to find and suggest items of likely interest based on the users’ preferences
![Page 7: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/7.jpg)
Background
• A recommender system aims to find and suggest items of likely interest based on the users’ preferences
![Page 8: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/8.jpg)
Background
• A recommender system aims to find and suggest items of likely interest based on the users’ preferences
• Examples:
– Netflix: tv shows and movies
– Amazon: products
– LinkedIn: jobs and colleagues
– Last.fm: music artists and tracks
![Page 9: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/9.jpg)
Background
• Typically, the interactions between user and system are recorded in the form of ratings
– But also: clicks (implicit feedback)
• This is represented as a user-item matrix: i1 … ik … im
u1
…
uj ?
…
un
![Page 10: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/10.jpg)
Motivation
• Evaluation is an integral part of any experimental research area
• It allows us to compare methods…
![Page 11: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/11.jpg)
Motivation
• Evaluation is an integral part of any experimental research area
• It allows us to compare methods…
• … and decide a winner (in competitions)
![Page 12: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/12.jpg)
Motivation
• A proper evaluation culture allows advance the field
• … or at least, identify when there is a problem!
![Page 13: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/13.jpg)
Motivation
• In recommendation, we find inconsistent evaluation results, for the “same”
– Dataset
– Algorithm
– Evaluation metric
Movielens 1M [Cremonesi et al, 2010]
Movielens 100k [Gorla et al, 2013]
Movielens 1M [Yin et al, 2012]
Movielens 100k, SVD [Jambor & Wang, 2010]
![Page 14: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/14.jpg)
Motivation
• In recommendation, we find inconsistent evaluation results, for the “same”
– Dataset
– Algorithm
– Evaluation metric
0
0.05
0.30
0.35
0.40
TR 3 TR 4 TeI TrI AI OPR
P@50 SVD50IBUB50
[Bellogín et al, 2011]
![Page 15: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/15.jpg)
Motivation
• In recommendation, we find inconsistent evaluation results, for the “same”
– Dataset
– Algorithm
– Evaluation metric
0
0.05
0.30
0.35
0.40
TR 3 TR 4 TeI TrI AI OPR
P@50 SVD50IBUB50
We need to understand why this happens
![Page 16: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/16.jpg)
In this tutorial
• We will present the basics of evaluation
– Accuracy metrics: error-based, ranking-based
– Also coverage, diversity, and novelty
• We will focus on reproducibility
– Define the context
– Present typical problems
– Propose some guidelines
![Page 17: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/17.jpg)
NOT in this tutorial
• In-depth analysis of evaluation metrics
– See chapter 9 on handbook [Shani & Gunawardana, 2011]
• Novel evaluation dimensions
– See tutorial at WSDM ’14 and SIGIR ‘13 on diversity and novelty
• User evaluation
– See tutorial at RecSys 2012 by B. Knijnenburg
• Comparison of evaluation results in research
– See RepSys workshop at RecSys 2013
![Page 18: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/18.jpg)
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [30 minutes]
• Reproducible Experimental Design [60 minutes]
• Summary [10 minutes]
![Page 19: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/19.jpg)
Recommender Systems Evaluation
• Typically: as a black box
Train Test Validation
Dataset
Recommender generates
a ranking (for a user)
a prediction for a given item (and user)
precision error
coverage …
![Page 20: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/20.jpg)
Evaluation metrics
• Accuracy metrics: typically reported in the literature (and usually, only these)
• Non accuracy metrics: related to other evaluation dimensions
– Coverage
– Diversity
– Novelty
– …
![Page 21: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/21.jpg)
Accuracy metrics
• Error-based
– RMSE, MAE
• Ranking-based
– Precision, recall, MAP, nDCG
• Other accuracy metrics
– AUC, NDPM, correlation
![Page 22: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/22.jpg)
Error-based metrics
• Assumption: more accurate predictions, better
• Pre-assumption: we are predicting ratings
• Conclusion: not useful for implicit feedback
MAE = Mean Absolute Error
RMSE = Root Mean Squared Error
![Page 23: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/23.jpg)
Error-based metrics
• Variations: – Normalize RMSE or MAE by the range of the
ratings (divide by rmax – rmin)
– Average RMSE or MAE to compensate for unbalance distributions of items or users
uMAE = user-averaged Mean Absolute Error
![Page 24: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/24.jpg)
Error-based metrics
• Limitations:
– Depend on the ratings range (unless normalized)
– Depend on the recommender output’s range
– Not valid for recommenders that produce a score (not a rating): probability, similarity, etc.
– Do not distinguish errors on top items and the rest User-item pairs Real Rec1 Rec2 Rec3
(u1, i1) 5 4 8 5
(u1, i2) 3 2 4 1
(u1, i3) 1 1 2 1
(u2, i1) 3 2 4 2
MAE/RMSE 0.75/0.87 1.5/1.73 0.75/1.12
![Page 25: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/25.jpg)
Ranking-based metrics
• Assumption: users only care about errors in the item rank order provided by the system
• They are usually computed up to a ranking position or cutoff k
P = Precision (Precision at k)
R = Recall (Recall at k)
![Page 26: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/26.jpg)
Ranking-based metrics
• Assumption: users only care about errors in the item rank order provided by the system
• They are usually computed up to a ranking position or cutoff k
MAP = Mean Average Precision
![Page 27: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/27.jpg)
Ranking-based metrics
• Assumption: users only care about errors in the item rank order provided by the system
• They are usually computed up to a ranking position or cutoff k
nDCG = normalized Discounted Cumulative Gain
![Page 28: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/28.jpg)
Ranking-based metrics
• There are many others:
– Rank score (half-life utility): like nDCG but with a different discount function
– Mean percentage ranking
– Mean reciprocal rank: only takes into account where the first relevant result occurs
– Average rank of correct recommendation
– Average reciprocal hit-rank
![Page 29: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/29.jpg)
Ranking-based metrics
• Limitations:
– Performance is, probably, underestimated (since real preferences are scarce and unknown preferences are assumed to be not relevant)
– Implementation-dependent when there are ties in the scores that generate the ranking
– Different results depending on the cutoff…
– … And no agreement about which cutoff is best: 1, 3, 5, 10, 50, …?
![Page 30: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/30.jpg)
Other accuracy metrics
• AUC: area under the (ROC) curve At each rank position:
– If item relevant: curve up
– Otherwise: curve right
• Random recommender – straight diagonal line
– AUC = 0.5
• Variations – Global ROC
– Customer ROC: same number of items to each user [Herlocker et al, 2004]
![Page 31: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/31.jpg)
Other accuracy metrics
• NDPM: normalized distance-based performance measure
• It compares two weakly ordered rankings
– con: number of discordant item pairs
– tie: number of compatible item pairs
– normalized by the number of pairs not tied in the real ranking
![Page 32: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/32.jpg)
Other accuracy metrics
• Rank correlation coefficients between predicted and ideal ranking:
– Spearman
– Kendall
• NDPM is similar but provides a more accurate interpretation of the effect of tied user ranks
• Limitation: interchange weakness
– Interchanges at the top of the ranking have the same weight as in the bottom
![Page 33: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/33.jpg)
Non accuracy metrics: Coverage
• User coverage
• Catalog/item coverage
– Simple ratio [Ge et al, 2010]
– Based on Gini’s index [Shani & Gunawardana, 2011]
– Based on Shannon’s entropy [Shani & Gunawardana, 2011]
• “Practical accuracy of a system”: combination of coverage and accuracy
– A system with low coverage is less useful
![Page 34: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/34.jpg)
Non accuracy metrics: Diversity
Diversity Relevance?
![Page 35: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/35.jpg)
Non accuracy metrics: Diversity
• How to measure diversity? Several proposals:
– Using a distance/dissimilarity function [Zhang & Hurley, 2008]
– Measuring the intra-list similarity [Ziegler et al, 2005]
– Using statistics to analyze the item distribution (concentration curve) [Zhang & Hurley, 2009]
– Based on entropy [Bellogín et al, 2010]
– Based on Gini’s index [Fleder & Hosanagar, 2009]
• Formal framework in [Vargas & Castells, 2011]
![Page 36: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/36.jpg)
Non accuracy metrics: Novelty
• Novel recommendations: items the user did not know prior to the recommendation
• Directly measured in online experiments
• Not clear how to do it in offline experiments:
– Using a taxonomy (items about novel topics) [Weng et al,
2007]
– New items over time [Lathia et al, 2010]
– Based on entropy, self-information and Kullback-Leibler divergence [Bellogín et al, 2010; Zhou et al, 2010; Filippone
& Sanguinetti, 2010]
![Page 37: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/37.jpg)
Recommender Systems Evaluation: Summary
• Usually, evaluation seen as a black box
• Mostly focused on metrics
– Especially, on accuracy metrics
• But there are other dimensions worth of interest
• No metric is perfect
• We should agree on standard implementations, parameters, instantiations, …
– Example: trec_eval in IR
![Page 38: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/38.jpg)
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [30 minutes]
• Reproducible Experimental Design [60 minutes]
• Summary [10 minutes]
![Page 39: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/39.jpg)
Reproducible Experimental Design
• We need to distinguish
– Replicability
– Reproducibility
• Different aspects:
– Algorithmic
– Published results
– Experimental design
![Page 40: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/40.jpg)
Definition: Replicability
To copy something
• The results
• The data
• The approach
Being able to evaluate in the same setting and obtain the same results
![Page 41: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/41.jpg)
Definition: Reproducibility
To recreate something
• The (complete) set of experiments
• The (complete) set of results
• The (complete) experimental setup
To (re) launch it in production with the same results
![Page 42: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/42.jpg)
Comparing against the state-of-the-art
Your settings are not exactly like those in paper X, but it is
a relevant paper
Reproduce results of paper X
They are (too) different
Replicate results of paper X
Do they agree with the original paper?
Let’s start from scratch
Congrats! You have shown that paper X behaves different in
the new context
Sorry, there is something wrong/incomplete in the
experimental design
![Page 43: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/43.jpg)
What do they mean?
Can we recreate them?
![Page 44: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/44.jpg)
Replicability
• Why do we need to replicate?
![Page 45: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/45.jpg)
Replicability
• Making sure your results were not a fluke
• Can others repeat/validate your experiments, results, conclusions?
http://validation.scienceexchange.com
![Page 46: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/46.jpg)
Reproducibility
Why do we need to reproduce?
Because these two are not the same
![Page 47: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/47.jpg)
Reproducibility
• In order to ensure that our experiments, settings, and results are:
– Valid
– Generalizable
– Of use for others
– etc.
we must make sure that others can reproduce our experiments in their setting
![Page 48: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/48.jpg)
Making reproducibility easier
• Description, description, description
• No magic numbers
• Specify the value for all the parameters
• Motivate!
• Keep a detailed protocol
• Describe process clearly
• Use standards
• Publish code (nobody expects you to be an awesome developer, you’re a researcher)
![Page 49: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/49.jpg)
Replicability, reproducibility, and progress
• Can there be actual progress if no valid comparison can be done?
• What is the point of comparing two approaches if the comparison is flawed?
• How do replicability and reproducibility facilitate actual progress in the field?
![Page 50: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/50.jpg)
Evaluation as a black box
Train Test Validation
Dataset
Recommender generates
a ranking (for a user)
a prediction for a given item (and user)
precision error
coverage …
![Page 51: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/51.jpg)
Evaluation as black boxes
Train Test Validation
Dataset
Recommender generates
a ranking (for a user)
a prediction for a given item (and user)
precision error
coverage …
![Page 52: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/52.jpg)
An experiment
• We used internal evaluation methods in Mahout (AM), LensKit (LK), and MyMediaLite (MML)
[Said & Bellogín, 2014]
![Page 53: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/53.jpg)
Evaluation as black boxes
PROS
• Easy
• Don’t reinvent the wheel
CONS
• Cherry-picking – Good results
– Wrong (not optimal) setting
• Not comparable – Add/remove bias from data
• Difficult to disclose all the details – Is step N important?
– What did I do after step M?
![Page 54: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/54.jpg)
Some problems with “black boxes”
• What do you do when a recommender cannot predict a score?
– This has an impact on coverage
![Page 55: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/55.jpg)
Some problems with “black boxes”
• What do you do when a recommender cannot predict a score?
– This has an impact on coverage
– It can also affect error-based metrics
User-item pairs Real Rec1 Rec2 Rec3
(u1, i1) 5 4 NaN 4
(u1, i2) 3 2 4 NaN
(u1, i3) 1 1 NaN 1
(u2, i1) 3 2 4 NaN
MAE/RMSE, ignoring NaNs 0.75/0.87 2.00/2.00 0.50/0.70
MAE/RMSE, NaNs as 0 0.75/0.87 2.00/2.65 1.75/2.18
MAE/RMSE, NaNs as 3 0.75/0.87 1.50/1.58 0.25/0.50
![Page 56: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/56.jpg)
Some problems with “black boxes”
• NDCG has at least two discounting functions
– Which one are you using: linear or exponential decay?
![Page 57: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/57.jpg)
Some problems with “black boxes”
• How do you select the candidate items to be ranked?
Solid triangle represents the target user. Boxed ratings denote test set.
0.85
0.90
0.95
1.00
1.05
1.10
SVD IB UB
RMSE
0
0.05
0.30
0.35
0.40
TR 3 TR 4 TeI TrI AI OPR
P@50 SVD50IBUB50
0
0.20
0.90
0.10
0.30
1.00
TR 3 TR 4 TeI TrI AI OPR
Recall@50
0
0.90
0.10
0.05
0.80
1.00
TR 3 TR 4 TeI TrI AI OPR
nDCG@50
![Page 58: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/58.jpg)
Some problems with “black boxes”
• How do you select the candidate items to be ranked?
[Said & Bellogín, 2014]
![Page 59: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/59.jpg)
Summary
• Important issues in recommendation
– Validity of results (replicability)
– Comparability of results (reproducibility)
– Validity of experimental setup
• We need to incorporate reproducibility and replication to facilitate the progress in the field
![Page 60: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/60.jpg)
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [30 minutes]
• Reproducible Experimental Design [60 minutes]
• Summary [10 minutes]
![Page 61: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/61.jpg)
Key Takeaways
• Every decision has an impact
– We should log every step taken in the experimental part and report that log
• There are more things besides papers
– Source code, web appendix, etc. are very useful to provide additional details not present in the paper
• You should not fool yourself
– You have to be critical about what you measure and not trust intermediate “black boxes”
![Page 62: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/62.jpg)
We must avoid this
From http://dilbert.com/strips/comic/2010-11-07/
![Page 63: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/63.jpg)
Pointers
• Email and Twitter
– Alejandro Bellogín
• @abellogin
– Alan Said
• @alansaid
• Slides: • In Slideshare... soon!
![Page 64: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/64.jpg)
Thank you!
![Page 65: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/65.jpg)
References and Additional reading • [Armstrong et al, 2009] Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since
1998. CIKM
• [Bellogín et al, 2010] A Study of Heterogeneity in Recommendations for a Social Music Service. HetRec
• [Bellogín et al, 2011] Precision-Oriented Evaluation of Recommender Systems: an Algorithm Comparison. RecSys
• [Cremonesi et al, 2010] Performance of Recommender Algorithms on Top-N Recommendation Tasks. RecSys
• [Filippone & Sanguinetti, 2010] Information Theoretic Novelty Detection. Pattern Recognition
• [Fleder & Hosanagar, 2009] Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity. Management Science
• [Ge et al, 2010] Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys
• [Gorla et al, 2013] Probabilistic Group Recommendation via Information Matching. WWW
• [Herlocker et al, 2004] Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems
![Page 66: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/66.jpg)
References and Additional reading • [Jambor & Wang, 2010] Goal-Driven Collaborative Filtering. ECIR
• [Knijnenburg et al, 2011] A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys
• [Koren, 2008] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. KDD
• [Lathia et al, 2010] Temporal Diversity in Recommender Systems. SIGIR
• [Li et al, 2010] Improving One-Class Collaborative Filtering by Incorporating Rich User Information. CIKM
• [Pu et al, 2011] A User-Centric Evaluation Framework for Recommender Systems. RecSys
• [Said & Bellogín, 2014] Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. RecSys
• [Schein et al, 2002] Methods and Metrics for Cold-Start Recommendations. SIGIR
• [Shani & Gunawardana, 2011] Evaluating Recommendation Systems. Recommender Systems Handbook
• [Vargas & Castells, 2011] Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems. RecSys
![Page 67: HT2014 Tutorial: Evaluating Recommender Systems - Ensuring Replicability of Evaluation](https://reader034.vdocuments.us/reader034/viewer/2022051515/553856a24a79598f768b4713/html5/thumbnails/67.jpg)
References and Additional reading • [Weng et al, 2007] Improving Recommendation Novelty Based on Topic Taxonomy. WI-IAT
• [Yin et al, 2012] Challenging the Long Tail Recommendation. VLDB
• [Zhang & Hurley, 2008] Avoiding Monotony: Improving the Diversity of Recommendation Lists. RecSys
• [Zhang & Hurley, 2009] Statistical Modeling of Diversity in Top-N Recommender Systems. WI-IAT
• [Zhou et al, 2010] Solving the Apparent Diversity-Accuracy Dilemma of Recommender Systems. PNAS
• [Ziegler et al, 2005] Improving Recommendation Lists Through Topic Diversification. WWW