Replicable Evaluation of Recommender Systems
Alejandro Bellogín (Universidad Autónoma de Madrid, Spain)
Alan Said (Recorded Future, Sweden)
Tutorial at ACM RecSys 2015
[Photo slides: Stephansdom, Vienna]
#EVALTUT
Outline
• Background and Motivation [10 minutes]
• Evaluating Recommender Systems [20 minutes]
• Replicating Evaluation Results [20 minutes]
• Replication by Example [20 minutes]
• Conclusions and Wrap-up [10 minutes]
• Questions [10 minutes]
Background
• A recommender system aims to find and suggest items of likely interest based on the users’ preferences
• Examples:
– Netflix: TV shows and movies
– Amazon: products
– LinkedIn: jobs and colleagues
– Last.fm: music artists and tracks
– Facebook: friends
Background
• Typically, the interactions between user and system are recorded in the form of ratings
– But also: clicks (implicit feedback)
• This is represented as a user-item matrix:
|    | i1 | … | ik | … | im |
|----|----|---|----|---|----|
| u1 |    |   |    |   |    |
| …  |    |   |    |   |    |
| uj |    |   | ?  |   |    |
| …  |    |   |    |   |    |
| un |    |   |    |   |    |
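As a minimal sketch (a hypothetical class, not from any framework used in this tutorial), such a matrix is typically stored sparsely, e.g. as nested maps, with absent entries playing the role of the "?" cells a recommender must fill in:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a sparse user-item rating matrix. */
public class RatingMatrix {

    // user id -> (item id -> rating); missing entries are the unknown "?" cells
    private final Map<Long, Map<Long, Double>> ratings = new HashMap<>();

    public void add(long user, long item, double rating) {
        ratings.computeIfAbsent(user, u -> new HashMap<>()).put(item, rating);
    }

    /** Empty when the user has not rated the item, i.e. the value a recommender predicts. */
    public Optional<Double> get(long user, long item) {
        return Optional.ofNullable(ratings.getOrDefault(user, Map.of()).get(item));
    }

    public static void main(String[] args) {
        RatingMatrix m = new RatingMatrix();
        m.add(1L, 10L, 5.0);                 // u1 rated i1 with 5
        System.out.println(m.get(1L, 10L));  // Optional[5.0]
        System.out.println(m.get(1L, 20L));  // Optional.empty -> a "?" cell
    }
}
```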
Motivation
• Evaluation is an integral part of any experimental research area
• It allows us to compare methods…
• … and identify winners (in competitions)
A proper evaluation culture allows the field to advance
… or at least, to identify when there is a problem!
In RecSys, we find inconsistent evaluation results for the “same”:
– Dataset
– Algorithm
– Evaluation metric
Examples reported in the literature:
– Movielens 1M [Cremonesi et al, 2010]
– Movielens 100k [Gorla et al, 2013]
– Movielens 1M [Yin et al, 2012]
– Movielens 100k, SVD [Jambor & Wang, 2010]
[Bar chart: P@50 of SVD50, IB, and UB50 under six candidate item selection strategies (TR 3, TR 4, TeI, TrI, AI, OPR); from Bellogín et al, 2011]
We need to understand why this happens
In this tutorial
• We will present the basics of evaluation
– Accuracy metrics: error-based, ranking-based
– Also coverage, diversity, and novelty
• We will focus on replication and reproducibility
– Define the context
– Present typical problems
– Propose some guidelines
Replicability
• Why do we need to replicate?
Reproducibility
Why do we need to reproduce?
Because replicating and reproducing are not the same
NOT in this tutorial
• In-depth analysis of evaluation metrics
– See chapter 9 of the handbook [Shani & Gunawardana, 2011]
• Novel evaluation dimensions
– See the tutorials at WSDM '14 and SIGIR '13 on diversity and novelty
• User evaluation
– See the tutorial at RecSys 2012
• Comparison of evaluation results in research
– See the RepSys workshop at RecSys 2013
– See [Said & Bellogín, 2014]
Recommender Systems Evaluation
Typically: as a black box
[Diagram: the dataset is split into train, test, and validation sets; the recommender generates a ranking (for a user) or a prediction for a given item (and user), which is then scored with metrics such as precision, error, and coverage]
The reproducible way: as black boxes
Recommender as a black box
What do you do when a recommender cannot predict a score?
This has an impact on coverage
[Said & Bellogín, 2014]
Candidate item generation as a black box
How do you select the candidate items to be ranked?
[Figure: example rating matrix; the solid triangle represents the target user, boxed ratings denote the test set]
[Charts: RMSE of SVD, IB, and UB, plus P@50, Recall@50, and nDCG@50 under the candidate item selection strategies TR 3, TR 4, TeI, TrI, AI, and OPR; from Bellogín et al, 2011; see also Said & Bellogín, 2014]
Evaluation metric computation as a black box
What do you do when a recommender cannot predict a score?
– This has an impact on coverage
– It can also affect error-based metrics
MAE = Mean Absolute Error
RMSE = Root Mean Squared Error
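For reference, the standard definitions (one common formulation; here $\mathcal{T}$ denotes the set of user-item pairs in the test set and $\hat{r}_{ui}$ the predicted rating):

$$
\mathrm{MAE} = \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left|\hat{r}_{ui} - r_{ui}\right|
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left(\hat{r}_{ui} - r_{ui}\right)^2}
$$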
Example (NaN = no prediction):

| User-item pair | Real | Rec1 | Rec2 | Rec3 |
|---|---|---|---|---|
| (u1, i1) | 5 | 4 | NaN | 4 |
| (u1, i2) | 3 | 2 | 4 | NaN |
| (u1, i3) | 1 | 1 | NaN | 1 |
| (u2, i1) | 3 | 2 | 4 | NaN |
| MAE/RMSE, ignoring NaNs | – | 0.75/0.87 | 1.00/1.00 | 0.50/0.71 |
| MAE/RMSE, NaNs as 0 | – | 0.75/0.87 | 2.00/2.65 | 1.75/2.18 |
| MAE/RMSE, NaNs as 3 | – | 0.75/0.87 | 1.50/1.58 | 0.25/0.50 |
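A minimal sketch of how the three strategies diverge on Rec2 above (hypothetical code over plain arrays, not tied to any framework):

```java
/** Sketch: MAE/RMSE under different strategies for unpredictable (NaN) scores. */
public class NaNHandlingExample {

    enum NaNStrategy { IGNORE, AS_ZERO, AS_MIDPOINT }

    /** Returns {MAE, RMSE} for the given predictions under the chosen strategy. */
    static double[] errors(double[] real, double[] predicted, NaNStrategy strategy, double midpoint) {
        double absSum = 0.0, sqSum = 0.0;
        int n = 0;
        for (int k = 0; k < real.length; k++) {
            double p = predicted[k];
            if (Double.isNaN(p)) {
                if (strategy == NaNStrategy.IGNORE) continue;             // skip the pair (reduces coverage)
                p = (strategy == NaNStrategy.AS_ZERO) ? 0.0 : midpoint;   // impute a constant instead
            }
            double diff = p - real[k];
            absSum += Math.abs(diff);
            sqSum += diff * diff;
            n++;
        }
        return new double[]{absSum / n, Math.sqrt(sqSum / n)};
    }

    public static void main(String[] args) {
        double[] real = {5, 3, 1, 3};                    // the "Real" column of the table above
        double[] rec2 = {Double.NaN, 4, Double.NaN, 4};  // Rec2's predictions
        for (NaNStrategy s : NaNStrategy.values()) {
            double[] e = errors(real, rec2, s, 3.0);
            System.out.printf("%s: MAE=%.2f RMSE=%.2f%n", s, e[0], e[1]);
        }
    }
}
```

Same recommender, three defensible numbers: unless the paper states which strategy was used, the result cannot be replicated.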
Evaluation metric computation as a black box
Using internal evaluation methods in Mahout (AM), LensKit (LK), and MyMediaLite (MML) [Said & Bellogín, 2014]
Evaluation metric computation as a black box
Variations on metrics:
Error-based metrics can be normalized or averaged per user:
– Normalize RMSE or MAE by the range of the ratings (divide by rmax – rmin)
– Average RMSE or MAE to compensate for unbalanced distributions of items or users
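In symbols (standard variants; $\mathcal{U}$ is the user set and $\mathcal{T}_u$ user $u$'s test ratings; NMAE is analogous):

$$
\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{r_{\max} - r_{\min}}
\qquad
\mathrm{RMSE}_{\mathrm{user}} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \sqrt{\frac{1}{|\mathcal{T}_u|} \sum_{i \in \mathcal{T}_u} \left(\hat{r}_{ui} - r_{ui}\right)^2}
$$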
Evaluation metric computation as a black box
Variations on metrics:
nDCG has at least two discounting functions (linear and exponential decay)
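One way the two variants are commonly written (the "linear" and "exponential" labels refer to how the graded relevance $\mathrm{rel}_j$ at rank $j$ enters the gain; exact formulations differ across implementations, which is precisely the replication hazard):

$$
\mathrm{DCG}@k = \sum_{j=1}^{k} \frac{\mathrm{rel}_j}{\log_2(j+1)}
\qquad \text{vs.} \qquad
\mathrm{DCG}@k = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)},
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
$$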
Evaluation metric computation as a black box
Variations on metrics:
Ranking-based metrics are usually computed up to a ranking position or cutoff k
P@k = Precision at k
R@k = Recall at k
MAP = Mean Average Precision
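Standard formulations (with $\mathrm{Rel}_u$ the relevant test items of user $u$, $\mathrm{Top}_k(u)$ the top-$k$ recommendations, and $i_j^u$ the item ranked at position $j$ for $u$):

$$
\mathrm{P}@k = \frac{|\mathrm{Rel}_u \cap \mathrm{Top}_k(u)|}{k}
\qquad
\mathrm{R}@k = \frac{|\mathrm{Rel}_u \cap \mathrm{Top}_k(u)|}{|\mathrm{Rel}_u|}
\qquad
\mathrm{MAP} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{|\mathrm{Rel}_u|} \sum_{j} \mathrm{P}@j \cdot \mathbb{1}\left[i_j^u \in \mathrm{Rel}_u\right]
$$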
Evaluation metric computation as a black box
If ties are present in the ranking scores, results may depend on the implementation [Bellogín et al, 2013]
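A minimal sketch of the problem (hypothetical items and scores, not tied to any framework): with equal scores, the comparator's tie-breaking silently decides P@k.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: tied ranking scores make P@k depend on the implementation's tie-breaking. */
public class TieBreakingExample {

    static double precisionAtK(Map<String, Double> scores, Set<String> relevant,
                               int k, Comparator<String> tieBreaker) {
        List<String> ranking = new ArrayList<>(scores.keySet());
        // Sort by score descending; equal scores fall through to the tie-breaker.
        ranking.sort(Comparator.comparingDouble((String i) -> scores.get(i))
                .reversed().thenComparing(tieBreaker));
        long hits = ranking.subList(0, k).stream().filter(relevant::contains).count();
        return (double) hits / k;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("A", 0.9);
        scores.put("B", 0.5);  // B, C, D are tied
        scores.put("C", 0.5);
        scores.put("D", 0.5);
        Set<String> relevant = Set.of("A", "B");

        System.out.printf("P@2, ties broken ascending:  %.2f%n",
                precisionAtK(scores, relevant, 2, Comparator.naturalOrder()));  // 1.00
        System.out.printf("P@2, ties broken descending: %.2f%n",
                precisionAtK(scores, relevant, 2, Comparator.reverseOrder())); // 0.50
    }
}
```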
Evaluation metric computation as a black box
It is not clear how to measure diversity/novelty in offline experiments (they can be directly measured in online experiments):
– Using a taxonomy (items about novel topics) [Weng et al, 2007]
– New items over time [Lathia et al, 2010]
– Based on entropy, self-information, and Kullback-Leibler divergence [Bellogín et al, 2010; Zhou et al, 2010; Filippone & Sanguinetti, 2010]
Recommender Systems Evaluation: Summary
• Usually, evaluation is seen as a black box
• The evaluation process involves everything: splitting, recommendation, candidate item generation, and metric computation
• We should agree on standard implementations, parameters, instantiations, …
– Example: trec_eval in IR
Reproducible Experimental Design
• We need to distinguish
– Replicability
– Reproducibility
• Different aspects:
– Algorithmic
– Published results
– Experimental design
• Goal: have a reproducible experimental environment
Definition: Replicability
To copy something
• The results
• The data
• The approach
Being able to evaluate in the same setting and obtain the same results
Definition: Reproducibility
To recreate something
• The (complete) set of experiments
• The (complete) set of results
• The (complete) experimental setup
To (re)launch it in production with the same results
Comparing against the state-of-the-art
Your settings are not exactly like those in paper X, but it is a relevant paper.
Reproduce the results of paper X and check whether they agree with the original paper:
– They agree: congrats, you're done!
– They do not agree: replicate the results of paper X and check whether they match the original paper:
• Yes! Congrats: you have shown that paper X behaves differently in the new setting
• No! Sorry: there is something wrong/incomplete in the experimental design
What about Reviewer 3?
• “It would be interesting to see this done on a different dataset…”
– Repeatability
– The same person doing the whole pipeline over again
• “How does your approach compare to [Reviewer 3 et al. 2003]?”
– Reproducibility or replicability (depending on how similar the two papers are)
Repeat vs. replicate vs. reproduce vs. reuse
Motivation for reproducibility
In order to ensure that our experiments, settings, and results are:
– Valid
– Generalizable
– Of use for others
– etc.
we must make sure that others can reproduce our experiments in their setting
Making reproducibility easier
• Description, description, description
• No magic numbers
• Specify values for all parameters
• Motivate!
• Keep a detailed protocol
• Describe process clearly
• Use standards
• Publish code (nobody expects you to be an awesome developer, you’re a researcher)
Replicability, reproducibility, and progress
• Can there be actual progress if no valid comparison can be done?
• What is the point of comparing two approaches if the comparison is flawed?
• How do replicability and reproducibility facilitate actual progress in the field?
Summary
• Important issues in recommendation
– Validity of results (replicability)
– Comparability of results (reproducibility)
– Validity of experimental setup (repeatability)
• We need to incorporate reproducibility and replication to facilitate the progress in the field
Replication by Example
• Demo time!
• Check
– http://www.recommenders.net/tutorial
• Checkout
– https://github.com/recommenders/tutorial.git
The things we write
mvn exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation"
The things we forget to write
mvn -o exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation" -Dexec.args="-u false"
mvn -o exec:java -Dexec.mainClass="net.recommenders.tutorial.CrossValidation" -Dexec.args="-t 4.0"
(note the Maven offline switch -o and the -Dexec.args options, both easy to omit from a paper)
Key Takeaways
• Every decision has an impact
– We should log every step taken in the experimental part and report that log
• There are more things besides papers
– Source code, web appendix, etc. are very useful to provide additional details not present in the paper
• You should not fool yourself
– You have to be critical about what you measure and not trust intermediate “black boxes”
We must avoid this
From http://dilbert.com/strips/comic/2010-11-07/
Next steps?
• Agree on standard implementations
• Replicable badges for journals / conferences
– http://validation.scienceexchange.com
– The aim of the Reproducibility Initiative is to identify and reward high quality reproducible research via independent validation of key experimental results
• Investigate how to improve reproducibility
• Benchmark, report, and store results
Pointers
• Email and Twitter
– Alejandro Bellogín
• @abellogin
– Alan Said
• @alansaid
• Slides: on SlideShare... soon!
RiVal
Recommender System Evaluation Toolkit
http://rival.recommenders.net
http://github.com/recommenders/rival
Thank you!
References and Additional reading
• [Armstrong et al, 2009] Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998. CIKM
• [Bellogín et al, 2010] A Study of Heterogeneity in Recommendations for a Social Music Service. HetRec
• [Bellogín et al, 2011] Precision-Oriented Evaluation of Recommender Systems: an Algorithm Comparison. RecSys
• [Bellogín et al, 2013] An Empirical Comparison of Social, Collaborative Filtering, and Hybrid Recommenders. ACM TIST
• [Ben-Shimon et al, 2015] RecSys Challenge 2015 and the YOOCHOOSE Dataset. RecSys
• [Cremonesi et al, 2010] Performance of Recommender Algorithms on Top-N Recommendation Tasks. RecSys
• [Filippone & Sanguinetti, 2010] Information Theoretic Novelty Detection. Pattern Recognition
• [Fleder & Hosanagar, 2009] Blockbuster Culture's Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity. Management Science
• [Ge et al, 2010] Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity. RecSys
• [Gorla et al, 2013] Probabilistic Group Recommendation via Information Matching. WWW
• [Herlocker et al, 2004] Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems
• [Jambor & Wang, 2010] Goal-Driven Collaborative Filtering. ECIR
• [Knijnenburg et al, 2011] A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys
• [Koren, 2008] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. KDD
• [Lathia et al, 2010] Temporal Diversity in Recommender Systems. SIGIR
• [Li et al, 2010] Improving One-Class Collaborative Filtering by Incorporating Rich User Information. CIKM
• [Pu et al, 2011] A User-Centric Evaluation Framework for Recommender Systems. RecSys
• [Said & Bellogín, 2014] Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. RecSys
• [Schein et al, 2002] Methods and Metrics for Cold-Start Recommendations. SIGIR
• [Shani & Gunawardana, 2011] Evaluating Recommendation Systems. Recommender Systems Handbook
• [Steck & Xin, 2010] A Generalized Probabilistic Framework and its Variants for Training Top-k Recommender Systems. PRSAT
• [Tikk et al, 2014] Comparative Evaluation of Recommender Systems for Digital Media. IBC
• [Vargas & Castells, 2011] Rank and Relevance in Novelty and Diversity Metrics for Recommender Systems. RecSys
• [Weng et al, 2007] Improving Recommendation Novelty Based on Topic Taxonomy. WI-IAT
• [Yin et al, 2012] Challenging the Long Tail Recommendation. VLDB
• [Zhang & Hurley, 2008] Avoiding Monotony: Improving the Diversity of Recommendation Lists. RecSys
• [Zhang & Hurley, 2009] Statistical Modeling of Diversity in Top-N Recommender Systems. WI-IAT
• [Zhou et al, 2010] Solving the Apparent Diversity-Accuracy Dilemma of Recommender Systems. PNAS
• [Ziegler et al, 2005] Improving Recommendation Lists Through Topic Diversification. WWW
Rank-score (Half-Life Utility)
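The rank-score / half-life utility is usually written as follows (one common formulation, after Breese et al.; $d$ is the default/neutral rating, $\alpha$ the half-life, $i_j$ the item at rank $j$, and $R_u^{\max}$ the score of a perfect ranking):

$$
R_u = \sum_{j} \frac{\max(r_{u,i_j} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}
\qquad
R = 100 \cdot \frac{\sum_u R_u}{\sum_u R_u^{\max}}
$$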
Mean Reciprocal Rank
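Standard definition ($\mathrm{rank}_u$ is the position of the first relevant item in user $u$'s recommendation list):

$$
\mathrm{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{\mathrm{rank}_u}
$$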
Mean Percentage Ranking
[Li et al, 2010]
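Commonly written as follows for implicit feedback (a formulation in the spirit of [Li et al, 2010]; $\mathrm{rank}_{ui}$ is the percentile rank of item $i$ in user $u$'s list, 0% for the top and 100% for the bottom, and $r_{ui}$ the held-out feedback; lower values indicate better rankings):

$$
\mathrm{MPR} = \frac{\sum_{(u,i)} r_{ui} \cdot \mathrm{rank}_{ui}}{\sum_{(u,i)} r_{ui}}
$$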
Global ROC
[Schein et al, 2002]
Customer ROC
[Schein et al, 2002]
Popularity-stratified recall
[Steck & Xin, 2010]