User Friendly Recommender Systems
MARK HINGSTON
SID: 0220763
SIDERE MENS EADEM MUTATO
Supervisor: Judy Kay
This thesis is submitted in partial fulfillment of the requirements for the degree of
Bachelor of Information Technology (Honours)
School of Information Technologies
The University of Sydney
Australia
3 November 2006
Abstract
Recommender systems are a recent but increasingly widely used resource. Yet most, if not all of
them suffer from serious deficiencies.
Recommender systems often require first time users to enter ratings for a large number of items —
a tedious process that often deters users. Thus, this thesis investigated whether useful recommendations
could be made without requiring users to explicitly rate items. It was shown that ratings automatically
generated from implicit information about a user can be used to make useful recommendations.
Most recommender systems also provide no explanations for the recommendations that they make,
and give users little control over the recommendation process. Thus, when these systems make a poor
recommendation, users cannot understand why it was made, and are not able to easily improve their
recommendations. Hence, this thesis investigated ways in which scrutability and control could be
implemented in such systems. A comprehensive questionnaire was completed by 18 participants as a basis
for a broader understanding of the issues mentioned above and to inform the design of a prototype; a
prototype was then created and two separate evaluations performed, each with at least 9 participants. This
investigation highlighted a number of key scrutability and control features that could be useful additions
to existing recommender systems.
The findings of this thesis can be used to improve the effectiveness, usefulness and user friendliness
of existing recommender systems. These findings include:
• Explanations, controls and a map based presentation are all useful additions to a recommender
system.
• Specific explanation types can be more useful than others for explaining particular recommen-
dation techniques.
• Specific recommendation techniques can be useful even when a user has not entered many
ratings.
• Ratings generated from purely implicit information about a user can be used to make useful
recommendations.
Acknowledgements
Firstly, I would like to thank my supervisor, Judy Kay, for all of the time and effort she has put into
guiding me through the production of this thesis.
I would like to thank Mark van Setten and the creators of the Duine Toolkit for producing a high
quality piece of software and making it available to the public.
I want to also thank Joseph Konstan, for taking the time to talk with me and give me encouragement
at the formative, early stages of my thesis.
I would also like to thank my lovely girlfriend Sarah Kulczycki, for her unwavering support and
fun-loving spirit.
CONTENTS

Abstract
Acknowledgements
List of Figures

Chapter 1  Introduction
  1.1 Background
  1.2 Research Questions

Chapter 2  Literature Review
  2.0.1 Social Filtering
  2.0.2 Content-Based Filtering
  2.1 Hybrid Recommenders (The Duine Toolkit)
  2.2 Unobtrusive Recommendation
  2.3 Scrutability and Control
  2.4 Conclusion

Chapter 3  Exploratory Study
  3.1 Introduction
  3.2 Qualitative Analysis
  3.3 Recommendation Algorithm Analysis
  3.4 Questionnaire - Design
    3.4.1 Part A - Presentation Style
    3.4.2 Part B - Understanding & Usefulness
    3.4.3 Final Questions - Integrative
  3.5 Questionnaire - Results
    3.5.1 Usefulness
    3.5.2 Understanding
    3.5.3 Understanding And Usefulness
    3.5.4 Control
    3.5.5 Presentation Method
    3.5.6 Final Questions
  3.6 Test Data
  3.7 Conclusion

Chapter 4  Prototype Design
  4.1 Introduction
  4.2 User's View
    4.2.1 iSuggest-Usability
    4.2.2 iSuggest-Unobtrusive
  4.3 Design & Architecture
    4.3.1 iSuggest-Usability
    4.3.2 iSuggest-Unobtrusive
  4.4 Conclusion

Chapter 5  Evaluations
  5.1 Introduction
  5.2 Design
    5.2.1 iSuggest-Usability
    5.2.2 iSuggest-Unobtrusive
  5.3 iSuggest-Usability Evaluations — Results
    5.3.1 Recommender Usefulness
    5.3.2 Explanations
    5.3.3 Controls
    5.3.4 Presentation Method
  5.4 iSuggest-Unobtrusive - Results
    5.4.1 Statistical Evaluations
    5.4.2 Ratings Generation
    5.4.3 Recommendations
  5.5 Conclusion

Chapter 6  Conclusion
  6.1 Future Work

References

Appendix A  Questionnaire Form
Appendix B  Questionnaire Results
Appendix C  iSuggest-Usability Evaluation Instructions
Appendix D  iSuggest-Usability Evaluation Results
Appendix E  iSuggest-Unobtrusive Evaluation Instructions
Appendix F  iSuggest-Unobtrusive Evaluation Results
List of Figures

2.1 MAE For The Duine Toolkit's System Lifecycle Test. Lower MAE Values Indicate Better Performance. The Numbers Below Each Group Indicate The Sample Size (In Number Of Predictions)
2.2 Examples Of Features That Can Be Computed For Various Item Types
2.3 Mean Response Of Users To Each Explanation Interface, Based On A Scale Of One To Seven. Explanations 11 And 12 Represent The Base Case Of No Additional Information. Shaded Rows Indicate Explanations With A Mean Response Significantly Different From The Base Cases.
3.1 Summary Of Possible Explanations And Control Features For The Major Algorithms In The Duine Toolkit.
3.2 Demographic Information For Each Of The Respondents.
3.3 List Based Presentation That Was Shown To Participants In The Questionnaire
3.4 Map Based Presentation That Was Shown To Participants In The Questionnaire
3.5 One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Learn By Example Technique
3.6 One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Social Filtering Technique
3.7 The Genre Based Control Shown To Participants In The Questionnaire
3.8 The Screens With The Maximum Average Usefulness For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
3.9 Average Ranking Given To Each Presentation Method. N = 18. Top Ranking = 1. Bottom Ranking = 6.
3.10 Average Response For Contribution That Each Method Should Make To A Combination Of Recommendation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
3.11 The Screens With The Maximum Average Understanding For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.12 Respondents' Average Understanding Of Recommendation Methods Before And After Explanations. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.13 Average Ratings For Questions Regarding Respondents' Understanding, Likelihood Of Using And Perceived Usefulness Of Each Control Feature. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.14 Users' Responses For Questions Regarding Recommendation Presentation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.15 Average Rating For The Usefulness Of Possible Features Of A Recommender. Error Bars Show One Standard Deviation Above And Below The Mean.
4.1 List Based Presentation Of Recommendations
4.2 The Star Bar That Users Used To Rate Items
4.3 Recommendation Technique Selection Screen. Note: The 'Word Of Mouth' Technique Shown Here Is Social Filtering And The 'Let iSuggest Choose' Technique Is The Duine Toolkit Taste Strategy
4.4 Explanation Screen For Genre Based Recommendations
4.5 Social Filtering (Simple Graph) Explanation Screen For Social Filtering Recommendations
4.6 Explanation Screen For Learn By Example Recommendations
4.7 Explanation Screen For Most Popular Recommendations
4.8 The Genre Based Control (Genre Slider)
4.9 The Social Filtering Control. Note: The Actual Control Is The 'Ignore This User' Link
4.10 Full Map Presentation — Zoomed Out View
4.11 Full Map Presentation — Zoomed In View
4.12 Similar Items Map Presentation
4.13 The Explanation Screen Displayed After Ratings Generation
4.14 Architecture Of The Basic Prototype, With Components Constructed During This Thesis Marked In Blue
4.15 Architecture Of iSuggest-Usability, With Components Constructed During This Thesis Marked In Blue
4.16 Architecture Of iSuggest-Unobtrusive, With Components Constructed During This Thesis Marked In Blue
5.1 Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Usability
5.2 Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Unobtrusive
5.3 Average Usefulness Ratings For Each Recommendation Method. Error Bars Show Standard Deviation.
5.4 Average Usefulness Ratings For Each Explanation. Error Bars Show Standard Deviation.
5.5 Users' Ratings For The Overall Use Of The iSuggest Explanations.
5.6 Users' Ratings For The Effectiveness Of Control Features.
5.7 Users' Ratings For The Overall Effectiveness Of The iSuggest Control Features.
5.8 Average Usefulness Of The Map Based Presentations. Error Bars Show Standard Deviation.
5.9 Sum Of Votes For The Preferred Presentation Type.
5.10 Comparison Of Distribution Of Ratings Values.
5.11 Comparison Of MAE And SDAE For MovieLens Recommendations And Recommendations Using Generated Ratings. Lower Scores Are Better. Techniques Are Sorted By MAE.
5.12 Average Usefulness Ratings For Each Recommendation Method. Error Bars Show Standard Deviation.
CHAPTER 1
Introduction
Recommender systems are a recent, but increasingly widely used resource. Yet most, if not all of them
suffer from serious deficiencies.
With so much information available over the Internet, people often turn to recommendation services
to highlight the items that will be of most interest to them. All of the significant systems in the area
of recommendation build up a profile of a user (usually through asking users to rate items they have
seen) and then use content-based or collaborative filtering, or a combination (hybrid) of these methods,
to make recommendations about what other pieces of information a user might be interested in. However,
many recommender systems require first time users to enter ratings for a large number of items.
Further, these systems do not always make useful recommendations. Recommendations can be poor
for a number of reasons, but what happens when a recommender does make a poor recommendation?
Most recommender systems offer no information about the reason that they made particular recommendations.
Further, most also offer users little opportunity to affect the system in a way that can improve
recommendations. The fact that recommenders require users to rate items can also be a failing, as the
tedious process of entering ratings can often deter users. When we take account of all of these factors,
it is obvious that many existing recommender systems are not meeting their potential for usefulness and
usability.
1.1 Background
Since about 1995, recommender systems have been deployed across many domains. Two of the most
important early recommender systems were Ringo (publicly available in 1994) and GroupLens
(www.grouplens.org/, available in 1996). The success of Ringo, one of the first large-scale music
recommendation systems, is reported in (Shardanand and Maes, 1995). GroupLens, an automated
collaborative filtering system for Usenet news, also proved highly successful. (Konstan et al., 1997)
reported trials of the GroupLens system, and this classic paper showed that collaborative filtering could
be effective on a large scale. The GroupLens project was soon adapted to produce MovieLens
(http://movielens.umn.edu/), a large-scale, publicly available movie recommendation system. Large
interest in recommender systems was soon fostered by the increasing public demand for systems that
helped deal with the problem of information overload. Since then, much academic and commercial
interest has been shown in recommender systems for many different domains. Although much of their
research is not published, Amazon.com is one of the most well known implementers of this technology.
Amazon.com makes use of collaborative filtering systems to recommend products that a user might like
to purchase. Other companies that use recommender systems include netflix.com for videos, TiVo for
digital television and Barnes and Noble for books. Many music recommendation systems are also
available today, such as Pandora.com (which maintains a staff of music analysts who tag songs as they
enter the system) and last.fm (http://www.last.fm). (Atkinson, 2006) rated these two systems as the best
music recommenders currently available to the public.
1.2 Research Questions
In order to make recommender systems more user friendly, the problems detailed above need to be
addressed. However, there is a lack of existing research into the way that recommender systems can:
make recommendations unobtrusively; explain recommendations and offer users useful control over the
recommendation process. This lack of research is especially prevalent in the area of music
recommendation, where little research has been published. Thus, this project investigated the following research
questions:
Scrutability & Control: What is the impact of adding scrutability and control to a recommender
system?
Unobtrusive Recommendation: Can a recommender system provide useful recommendations
without asking users to explicitly rate items?
This thesis originally aimed to investigate these questions with reference to music recommender systems.
To further this goal, a dataset containing unobtrusively obtained information about users was located for
use in investigating Unobtrusive Recommendation. However, it quickly became apparent that few music
datasets containing users' explicit ratings of music were available. Thus, in order to conduct a thorough
and rigorous study of Scrutability & Control, the MovieLens standard dataset was used. This contained
information on users and their ratings of movies.
The contributions of this thesis are: the identification of a lack of existing research into scrutability,
control and unobtrusiveness in recommender systems (Chapter 2); the identification of a number of
promising methods for adding scrutability and control to a recommender (Chapter 3); the creation of
a prototype that implements these scrutability and control methods, and can also provide unobtrusive
recommendations (Chapter 4); and the evaluation of the methods implemented in this prototype for
providing scrutability, control and unobtrusiveness within a recommender system (Chapter 5).
CHAPTER 2
Literature Review
The basic purpose of a music recommender is to recommend items that will be of interest to a specific
user. This task is required because an abundance of information is now available to people via the
Internet and many don't have the time to sort through it all. Currently, all major recommendation
systems use social filtering, content-based filtering, or some combination of these two approaches to
predict how interested a user will be in a specific item. This information is then used to recommend
items that the system believes will be of the most interest to that user. Each of these approaches to
recommendation is discussed below, with reference to Figure 2.1 (taken from (van Setten et al., 2002)).
This graph shows the results of testing a series of approaches to recommendation using the MovieLens
standard data set. These tests were evaluated using the Mean Absolute Error (MAE) metric, which
(Herlocker et al., 2004) lists as an appropriate metric for the evaluation of recommender systems.
Figure 2.1 gives a good indication of the relative levels of performance that can be achieved by using each
approach.
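As a concrete illustration, MAE is simply the average absolute difference between predicted and actual ratings. The sketch below is illustrative only (the rating values are made up); it is not code from the Duine Toolkit:

```python
def mean_absolute_error(predicted, actual):
    """Mean Absolute Error over paired prediction/rating lists.
    Lower values indicate more accurate predictions."""
    assert len(predicted) == len(actual) and predicted
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Predictions vs. a user's actual ratings on a 1-5 scale
mae = mean_absolute_error([4.2, 3.1, 5.0], [4.0, 3.0, 4.5])  # about 0.27
```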
2.0.1 Social Filtering
(Polcicova et al., 2000), (Breese et al., 1998) and (Shardanand and Maes, 1995) explain that social
filtering systems work by first asking users to rate items. Then, by comparing those ratings, they locate
users who share common interests and make personalized recommendations based on like-minded users'
opinions. Social filtering does not take formal content into account and makes judgments based purely
upon the ratings of users. The GroupLens project, documented in (Konstan et al., 1997), involved a
large-scale trial of a social filtering recommender system. This trial was confirmatory research: a large
number of users were asked to test the system, and the results of this testing were collated to provide
a statistical confirmation that social filtering could be effective on a large scale. Many further research
projects into social filtering have confirmed its utility through simulation. Such projects include (Breese
et al., 1998) and (van Setten et al., 2002), which both contain simulations run on the MovieLens data set
and evaluated using mean error metrics.
In general, social filtering algorithms work in the following way:
"In the first step, they identify the k users in the database that are the most similar to the active user.
During the second step, they compute the [set of] of items [liked] by these users and associate a weight
with each item based on its importance in the set. In the third and final step, fromthis [set] they select
and recommend the items that have the highest weight and have not already been seen by the active
user" - (Deshpande and Karypis, 2004), p 4.
Figure 2.1 shows the social filtering recommender to have the equal lowest MAE in four of the five
tests, showing that it is a highly effective recommendation method. However, social filtering is not
without its problems. (Adomavicius and Tuzhilin, 2005) summarises the issues with social filtering as:
• An inability to make accurate predictions for new users. (Referred to in this thesis as the cold
start problem for new users).
• Poor recommendation accuracy during the initial stages of the system. (Referred to in this
thesis as the cold start problem for new systems).
• A lack of ability to recommend new items until they are rated by users.
Social filtering was one recommendation technique used in this project to make music and movie related
recommendations. As stated above, social filtering does not make use of the content of items, only the
ratings that users have given each item. This means that social filtering approaches were easily adapted
for use in both music and movie related recommendation.
2.0.2 Content-Based Filtering
In content-based filtering systems, users are again asked to rate items. The system then analyses the
content of those items and creates a profile that represents a user's interests in terms of item content
(features, key phrases, etc.). Then the content of items unknown to the user is analysed and these are
compared with the user's profile in order to find the items that will be of interest to the user. The
information that a content-based filtering system can compute about a particular item falls into one of
two categories: content-derived and meta-content information. Content-derived information (used in
(Cano et al., 2005), (Logan, 2004) and (Mooney and Roy, 2000)) is computed by the system through
analysis of the actual content of an item (e.g. the beats per minute of a song or the key words found in
a document).

FIGURE 2.1: MAE For The Duine Toolkit's System Lifecycle Test. Lower MAE Values Indicate Better Performance. The Numbers Below Each Group Indicate The Sample Size (In Number Of Predictions)

Meta-content information (used in (Mak et al., 2003), (van Setten et al., 2002) and (van
Setten et al., 2003)) is any information that the system can glean about an item that does not come from
analysing the content of that item (such information may come from an external database, or a header
attached to the item). Examples of the type of features that can be computed for text, music and movie
data are given in Figure 2.2. Content-derived information about an item needs to make use of algorithms
that are specific to the type of item that is being analysed. In contrast, meta-content information does not
need to be computed from actual items and, in fact, meta-content information is often quite similar for
items from different domains. Figure 2.2 shows that meta-content information for each of the different
item types exhibits certain similarities, whereas the content-derived information is quite specific to the
type of item. This fact means that meta-content based recommenders are able to be easily adapted for
use in new domains, but that it is much more difficult to perform the same adaptation on recommenders
that use content-derived information. However, systems that make use of content-derived information
gain a better picture of each of the items in the system and thus should be able to make more accurate
recommendations than systems that use only meta-content information.
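The profile-matching step described above can be sketched as follows. Here a user profile is a set of weighted meta-content features (e.g. genres) learned from the user's ratings; the feature names and the simple additive weighting scheme are illustrative assumptions, not the approach of any particular system cited here:

```python
def content_score(user_profile, item_features):
    """Score an unseen item by summing the profile weights of the
    meta-content features it shares with the user's profile."""
    return sum(weight for feature, weight in user_profile.items()
               if feature in item_features)

# Profile weights derived from items the user rated highly (made-up values):
# the user strongly likes jazz, mildly likes rock, barely likes classical.
profile = {"jazz": 0.9, "rock": 0.4, "classical": 0.1}
content_score(profile, {"jazz", "classical"})  # 0.9 + 0.1
```

Items would then be ranked by this score, with the highest-scoring unseen items recommended.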
Like social filtering, content-based filtering also has weaknesses. (Adomavicius and Tuzhilin, 2005)
states that they:
• Become over specialised and only recommend very specific types of items to each user.
• Are also subject to the cold start problem for new users.
• May rely on content-derived information, which is often expensive (or impossible) to compute
accurately.
                    Text              Music        Movies
  Meta-content:     Author            Composer     Writer
                    Abstract          N/A          Synopsis
                    Publisher         Producer     Producer
                    Genre             Genre        Genre
                    N/A               Performer    Actors
  Content-derived:  Key phrases       Beats / min  Color Histogram
                    Term frequencies  MFCC's       Story Tempo

FIGURE 2.2: Examples Of Features That Can Be Computed For Various Item Types
(van Setten et al., 2002) makes use of content-based filtering using meta-content to make movie
recommendations. This content-based filtering approach is one of a number of prediction techniques used in
the Duine Toolkit to make recommendations. This toolkit is discussed in detail in Section 2.1. The tests
summarized in (van Setten et al., 2002) show that the content-based algorithm included in the Duine
Toolkit performed well during simulations. This project extended the Duine Toolkit to also include
content-based prediction techniques for music recommendations.
2.1 Hybrid Recommenders (The Duine Toolkit)
Hybrid recommender systems combine content-based and social filtering in the hope that this
combination might contain all the strengths of the two approaches, while also alleviating their problems. The
Duine Toolkit is a hybrid recommender that was produced as a part of a PhD completed by Mark van
Setten. It is a piece of software that makes available a number of prediction techniques (including both
social filtering and content-based techniques) and allows them to be combined dynamically. This project
involved using the Duine Toolkit to make both music and movie related recommendations.
This toolkit makes use of prediction strategies, which were introduced in (van Setten et al., 2002). Such
prediction strategies are a way of easily combining prediction techniques dynamically and intelligently
in an attempt to provide better and more reliable prediction results. (van Setten et al., 2002) introduces
these prediction strategies and demonstrates how they can be adapted depending upon the various states
that a system might be in. It introduces a software platform called Duine, which implements prediction
strategies and can be extended to include new prediction techniques and new strategies. Simulations run
in (van Setten et al., 2002) and (van Setten et al., 2004) showed that the combination of prediction
techniques into prediction strategies can improve the effectiveness of a recommendation system. The testing
done in these papers was of sound quality and was performed on the data set made available by the
MovieLens project, which is a well-known, standard data set for recommender systems. The results of
these tests are summarised in (van Setten et al., 2002). These results show that in every case, the Taste
Strategy (a particular prediction strategy used in testing) had the lowest MAE of all of the prediction
techniques used. This strategy is able to choose the most effective prediction technique for a particular
situation and thus is able to maximise prediction accuracy. The work done in (van Setten et al., 2002)
and (van Setten et al., 2004) focused on making predictions based on movie data. This project built upon
this work by extending the Duine Toolkit for use in music recommendation. As well as making use of
the Duine Toolkit in a new domain, this project also involved the addition of Scrutability & Control
features and Unobtrusive Recommendation to this toolkit. Each of these additions is discussed in the
following sections.
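The idea of a prediction strategy — choosing a technique to suit the current situation — can be illustrated with a small sketch. The dispatch rules, thresholds and class fields below are invented for illustration; they are not the Duine Toolkit's actual Taste Strategy logic:

```python
from dataclasses import dataclass

@dataclass
class User:
    num_ratings: int  # how many items this user has rated so far

@dataclass
class Item:
    num_ratings: int  # how many users have rated this item so far

def prediction_strategy(user, item, predictors, min_ratings=10):
    """Choose a prediction technique based on the current situation,
    in the spirit of Duine's prediction strategies."""
    if user.num_ratings < min_ratings:
        # Cold start for this user: social filtering would be unreliable,
        # so fall back on item popularity.
        return predictors["most_popular"](item)
    if item.num_ratings == 0:
        # Nobody has rated this item yet: only content can help.
        return predictors["content_based"](user, item)
    # Enough data on both sides: use social filtering.
    return predictors["social_filtering"](user, item)
```

A strategy like this degrades gracefully: each cold-start condition is routed to the technique least affected by it, which is why the combined strategy can outperform any single technique.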
2.2 Unobtrusive Recommendation
Generally, recommender systems build a profile of a user's likes and dislikes by asking a user to rate
specific items after they have listened to them. However, users often find this process to be tedious.
Further, the cold start problem for new users means that users may need to rate many items before
they receive useful recommendations. As a result, this thesis investigated ways in which a system
can elicit information about a user’s likes and dislikes in an unobtrusive manner. In order to investigate
Unobtrusive Recommendation, new features were added to the Duine Toolkit. This would allow
the system to make recommendations without needing to ask a user to rate the items that they have seen
or heard. Accomplishing this task required an unobtrusive way to gauge a user's level of interest in an
item. Some of the unobtrusive methods for judging how interested a user is in an item are summarised
in (Oard and Kim, 1998). These methods include the length of time that a user spends viewing an item,
the number of times a user has viewed an item, the items that a user is willing to purchase, the items
that a user deletes from their collection and the items that a user chooses to retain in their collection.
Unfortunately, (Oard and Kim, 1998) merely presents a summary of these methods and does not present
any testing of the methods it mentions. Of course, one of the problems with all of the methods mentioned
above for modelling users unobtrusively is the fact that preferences based upon such data are likely to be
less accurate than preferences based upon explicit user ratings. (Adomavicius and Tuzhilin, 2005) states
that "[unobtrusive] ratings (such as time spent reading an article) are often inaccurate and cannot fully
replace explicit ratings provided by the user. Therefore, the problem of minimizing intrusiveness while
maintaining certain levels of accuracy of recommendations needs to be addressed by the recommender
systems researchers" - (Adomavicius and Tuzhilin, 2005), p 12. This paper recognises the need for more
research into unobtrusive user modelling and notes a number of papers that have reported on work in
this area.
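To make these implicit signals usable by a rating-based recommender such as the Duine Toolkit, they must be mapped onto the same scale that explicit ratings use. The sketch below illustrates one such mapping; the signal names, weights and thresholds are illustrative assumptions, not values taken from (Oard and Kim, 1998) or from this thesis's implementation.

```python
def implicit_rating(play_count, fraction_listened, purchased, deleted,
                    max_rating=5.0):
    """Estimate a rating on a 1..max_rating scale from implicit listening
    behaviour. All weights below are illustrative assumptions."""
    if deleted:
        # Deleting an item from the collection is a strong negative signal.
        return 1.0
    score = 0.0
    # Repeated plays contribute up to half of the total score.
    score += min(play_count, 10) / 10.0 * 0.5
    # Listening to most of a track suggests interest.
    score += fraction_listened * 0.3
    # A purchase is a strong positive signal.
    if purchased:
        score += 0.2
    # Map the 0..1 score onto the explicit 1..max_rating scale.
    return 1.0 + score * (max_rating - 1.0)

# A track played ten times, heard in full and purchased gets the top rating;
# a deleted track gets the bottom rating.
print(implicit_rating(10, 1.0, True, False))
print(implicit_rating(3, 0.5, False, True))
```

Such generated ratings can then be fed to the recommender exactly as if the user had entered them explicitly, which is the approach this thesis takes.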
Unfortunately, there is a distinct lack of published research that deals with eliciting a user's musical
preferences unobtrusively. The literature available on unobtrusive user modelling is often concerned
with determining users' preferences in regard to websites, not their opinions on pieces of music.
(Kiss and Quinqueton, 2001) mentions the use of navigation histories to gauge a user's level of interest
in particular websites. It also proposes some more creative methods for using implicit input, such as
matching the sort order of a search with the order in which results were visited, and using the time taken
to press the 'back' button on a browser to judge a user's interest in a page. Although (Kiss and Quinqueton,
2001) is clearly based upon some amount of research, and claims "the implementation has started and
is well advancing, and we begin to have some experimental results" - (Kiss and Quinqueton, 2001), p
15, results from the project are not easily available. Further, as user modelling forms only
one part of that paper, it would be difficult to identify the impact that particular user
modelling techniques had upon its results. Nevertheless, the paper does present some
useful ideas on making use of implicit preference information that could be adapted for use in a music
recommender. (Middleton et al., 2001) describes similar techniques for user modelling and includes
results of a number of exploratory case studies showing that this form of user modelling can be quite
successful. This project built upon existing methods for user profiling and extended these to investigate
methods for inferring a user's level of interest in an item from implicit data alone.
2.3 Scrutability and Control
The literature discussed in the sections above all deals with the desire to make high quality recommendations.
Once these recommendations are made, scrutability is concerned with explaining to the user why
a particular recommendation was made. Further, control is concerned with allowing users to control a
recommender system in order to improve its recommendations. Research published in (Sinha and
Swearingen, 2001) and (Sinha and Swearingen, 2002) shows that users are more willing to trust or make use
of recommendations that are well explained (i.e. that are scrutable). Joseph Konstan, a leading figure
in recommender systems research, noted that "adding scrutability to recommender systems is important,
but hard" - (Konstan, J., personal communication, June 3, 2006). Scrutability is a key component of a
recommender system for a number of reasons. First, users are not always willing to trust a system when
they are just beginning to use it. If users can be provided with some level of assurance that the
recommendations made by a system are of a high quality, then they are more likely to trust that system. Such
assurances are given by showing the user why a particular recommendation was made. Scrutability is
also useful in cases where a recommendation is made that a user believes is not appropriate. In this case,
if a user can access an explanation for the recommendation, they may be more likely to understand
why that recommendation might be of interest to them. Explanations may also help a user to identify
areas where a system is making errors; ideally, control functions should then help the
user alter the behaviour of the system so that it is less likely to make inappropriate recommendations. The
value of control functions is not limited to allowing alterations to the recommendation process when
errors occur. Rather, users can often make use of control functions at any time during the operation of
a recommender system. This allows them to influence the recommendation process in a way that
should lead to improved recommendation accuracy.
Sinha and Swearingen have shown that scrutability improves the effectiveness of a recommendation
system. (Sinha and Swearingen, 2001) and (Sinha and Swearingen, 2002) published the results of
research that involved asking users to test a number of publicly available recommendation systems and then
evaluate their experience with each one. The findings of these studies show that "in general users like
and feel more confident in recommendations perceived as transparent" - (Sinha and Swearingen, 2002),
p 2. Although their experiments were only small scale, they were well designed, and the importance
of transparency is supported by other research, such as that conducted by "Johnson
& Johnson (1993) [who] point out that explanations play a crucial role in the interaction between
users and complex systems" - (Sinha and Swearingen, 2002), p 1. A similar experimental study was
conducted in (Herlocker, 2000), which describes scrutability experiments conducted on a much larger
sample group and confirms that "most users value explanations and would like to see them added to their
[recommendation] system. These sentiments were validated by qualitative textual comments given by
survey respondents" - (Herlocker et al., 2000), p 10. (Herlocker, 2000) describes in detail a series of
approaches to adding scrutability to social filtering recommender systems. It reports on user trials
involving a large number of users, who were each asked to use prototype recommender
systems and provide feedback on the value of the explanations given for recommendations. The results
of these tests can be seen in Figure 2.3, which shows that the most useful techniques for adding
scrutability were explanations showing histograms of ratings from like-minded users (nearest neighbours) and
explanations showing the past performance of the recommender. (van Setten, 2005) also describes a
small scale investigation into explanations for recommender systems, and (McSherry, 2005) and
(Cunningham et al., 2003) present methods for explaining a particular method of recommendation, named
Learn By Example. Some commercial systems (such as liveplasma1) also offer innovative ways of
presenting recommendations, such as Map Based presentation of items. Such presentations may increase
the usefulness of recommendations and the ability of a user to understand these explanations.
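The histogram explanation that performed best in Herlocker's trials can be produced directly from data a social filtering recommender already holds. The sketch below is a minimal illustration of the idea; the function names and the 1-5 rating scale are assumptions, not drawn from any of the systems discussed here.

```python
from collections import Counter

def neighbour_histogram(neighbour_ratings, scale=(1, 2, 3, 4, 5)):
    """Count how the current user's nearest neighbours rated the
    recommended item, one bucket per rating value."""
    counts = Counter(neighbour_ratings)
    return {r: counts.get(r, 0) for r in scale}

def render_histogram(hist):
    """Render the histogram as text bars, e.g. for a 'users like you
    rated this item as follows' explanation screen."""
    lines = []
    for rating in sorted(hist, reverse=True):
        lines.append(f"{rating} stars | {'#' * hist[rating]} ({hist[rating]})")
    return "\n".join(lines)

hist = neighbour_histogram([5, 5, 4, 5, 3, 4])
print(render_histogram(hist))
```

The same underlying data (the neighbours and their ratings) can also drive richer presentations, such as listing the most influential neighbours by name.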
The papers (and systems) mentioned above each demonstrate that scrutability can be beneficial in
recommender systems, and present some ways of achieving it. However, Scrutability & Control in recommender
systems is an area which has not received much research attention, and thus there are still many questions
to be answered regarding the best way to achieve these goals. Specifically, there is a lack of existing
research into:
• Comparison of multiple recommendation techniques in terms of their usefulness and ability
to be explained.
• Providing explanations for recommendation techniques other than social filtering.
• The impact of adding controls to a recommender system.
• The relationship between a user's understanding of a recommendation technique and the
usefulness of its recommendations, and the potential trade-off between the two.
• The effect of a Map Based presentation on the usefulness and understandability of recommendations.
As a result, this project added Scrutability & Control features to the Duine Toolkit in
order to build upon current research and investigate each of these areas.
1http://www.liveplasma.com
FIGURE 2.3: Mean Response Of Users To Each Explanation Interface, Based On A Scale Of One To Seven. Explanations 11 And 12 Represent The Base Case Of No Additional Information. Shaded Rows Indicate Explanations With A Mean Response Significantly Different From The Base Cases.
2.4 Conclusion
At this stage of the project, a number of key areas where more research was required were identified.
The first of these areas was the provision of Unobtrusive Recommendation to users. Although there
is existing work on unobtrusive modelling of a user's interests, most of this research has concentrated
upon the field of web browsing. Using implicit data to infer a user's interest in items such as music
or movies is an area where little research has been conducted. Thus, this project aimed to build upon
existing work in the field of unobtrusive user modelling and investigate unobtrusive music
recommendation. Adding Scrutability & Control to recommender systems is the second area where a lack of existing
research was identified. Current research into explaining and controlling recommender systems is quite
sparse, and although some research does exist, there are still many questions to be answered regarding
this goal. These questions include issues relating to the impact of adding controls to a recommender
system, as well as many issues related to providing scrutable recommendations. Ultimately, this project
aimed to advance research into both Scrutability & Control in recommender systems and Unobtrusive
Recommendation.
CHAPTER 3
Exploratory Study
3.1 Introduction
The review of literature in Chapter 2 highlighted that there is a lack of existing research in the areas
of scrutability, control and unobtrusiveness within recommender systems. This lack of research is
especially prominent in the area of music recommendation, where little research at all has been published.
Thus, this project aimed to investigate questions related to Scrutability & Control and Unobtrusive
Recommendation. In order to investigate these areas, an exploratory study was first conducted, which
involved the following tasks:
• A qualitative analysis of existing recommender technologies.
• The conduct of a questionnaire to investigate aspects of recommender systems, as a foundation for
gaining the understanding needed to create a prototype recommender system.
• The creation of a dataset of implicit information about a large number of users, required for
performing evaluations on a prototype at a later stage of the thesis.
The first stage of this research project was a qualitative analysis of a number of existing recommender
systems and recommendation algorithms. This aimed to identify a suitable code base that could be
extended into a prototype recommender system. An analysis of the recommendation algorithms contained
in the chosen code base was then performed. This analysis aimed to discover methods that could be used
to add controls and explanations to the prototype recommender system. To investigate users' attitudes
toward these explanations and controls (as well as attitudes toward other aspects of recommender
systems and usability), a questionnaire was conducted. The results of this questionnaire would later be
used to guide the construction of the prototype. Finally, a source of test data was established for
use in evaluating the prototype. Each of these tasks is detailed in the sections below.
3.2 Qualitative Analysis
The system chosen as a code base needed to be open source and have good code quality, resource
consumption (with particular reference to running time and memory usage) and recommendation quality.
It would also be highly useful if it provided support for the implementation of features such as
explanations, control features and unobtrusive recommendation. The recommendation toolkits that were
examined during the course of this qualitative analysis include:
Taste: open-source recommender, written in Java. Available from http://taste.sourceforge.net/
Cofi: open-source, written in Java. Available from http://www.nongnu.org/cofi/
RACOFI: open-source, written in Java. Available from http://www.daniel-lemire.com/fr/abstracts/COLA2003.html
SUGGEST: free, written in C. Available from http://www-users.cs.umn.edu/ karypis/suggest/
Rating-Based Item-to-Item: public domain, written in PHP. Available from http://www.daniel-lemire.com/fr/abstracts/TRD01.html
consensus: open-source, written in Python. Available from http://exogen.case.edu/projects/consensus/
The Duine Toolkit: open-source, written in Java. Available from http://sourceforge.net/projects/duine
The qualitative analysis of these systems began with an examination of the specifications of each toolkit.
Further analysis involved the examination of any available reference documentation. This analysis,
combined with learnings from the critical literature review described in Chapter 2, narrowed the
candidates down to just Taste and the Duine Toolkit. At this stage, the code for each of these toolkits was
downloaded and examined. Ultimately, the Duine Toolkit was chosen for the following reasons:
Well documented code base: the Duine Toolkit has complete and high quality documentation,
as well as reference documents.
Good recommendation quality: (van Setten et al., 2004) showed that the Duine Toolkit is able
to choose the most effective recommendation technique for a particular situation and thus is
able to maximise the quality of recommendations.
Good resource usage: the Duine Toolkit has been built to conserve resources and ensures that
the most resource intensive operations (which involve calculating the similarity between a user
and all other users) occur only once per user session, and not every time that a user rates
an item.
Multiple recommendation methods: the Duine Toolkit has six built-in recommendation
techniques and the facility to dynamically alter the recommendation technique that is being used.
This meant that a system could be built that allowed users to easily swap from using one
recommendation technique to another. It also meant that we could test issues regarding users'
interactions with not just one, but several methods of recommendation.
Built in explanation facility: the Duine Toolkit was designed with explanations in mind: each
recommendation created using this toolkit can have an explanation object attached to it,
which describes exactly how that prediction was produced. This feature was included in the
Duine Toolkit in anticipation of further extensions to the toolkit that enabled recommendations
to be displayed.
Easy to add user controls: in the Duine Toolkit, personal settings can be set and saved for each
user. Some of these settings affect the recommendations that are produced by the system.
The fact that the Duine Toolkit can set and save such personal settings means that it could be
extended to allow users to exert control over the recommendation process.
3.3 Recommendation Algorithm Analysis
Once the Duine Toolkit was chosen as the code base for this thesis, an analysis of the recommendation
techniques that it provided was necessary. The major recommendation techniques made available within
the Duine Toolkit are:
Most Popular: This technique recommends the most popular items, based on the average rating
each item was given across all users of the system.
Genre Based: This is a content-based technique that uses a user's ratings to decide which genres
that user likes and dislikes. It then recommends items based upon this decision.
Social Filtering: This is a social filtering technique that looks at the current user's ratings and
finds others who are similar to that user. These similar users are then used to recommend new
items. (Note: this method also makes use of ‘opposite users’.)
Learn By Example: This is a content-based technique that predicts how interested a user will
be in a new artist by looking at how they have rated other similar items in the past. (It requires
some measure of similarity between items to be defined.)
Information Filtering: This is a content-based technique that uses natural language processing
techniques to process a given piece of text for each item (e.g. a description). This information,
combined with a user's ratings, is used to predict the user's level of interest in new items.
Note that examination of this technique showed that it could be used to create
recommendations that were either Lyrics Based (using lyrics from songs) or Description Based (using
descriptions of particular artists).
Taste Strategy: As noted in Chapter 2, (van Setten et al., 2004) shows that this is the
recommendation technique that produces the highest quality recommendations within the Duine Toolkit.
This technique is, in fact, a ‘Prediction Strategy’ that is able to make recommendations
using any of the five techniques described above. It chooses the best
available recommendation technique at any given point in time and makes recommendations
using that technique. This is the default recommendation technique for the Duine Toolkit.
Note that this technique was not considered as a candidate for the addition of scrutability
or control, as it is a ‘Prediction Strategy’ that merely makes use of other recommendation
techniques and does not actually create recommendations itself.
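The core idea of such a strategy, trying the preferred technique first and falling back when it cannot make a confident prediction, can be sketched in a few lines. This is a simplified illustration of the concept, not the Duine Toolkit's actual selection logic; the names and the confidence threshold are assumptions.

```python
def predict(user, item, techniques, min_confidence=0.5):
    """Return (rating, technique_name) from the first technique that
    produces a sufficiently confident prediction.

    techniques: list of (name, predict_fn) pairs ordered from most to
    least preferred; each predict_fn returns (rating, confidence) or
    None when it cannot predict (e.g. a cold-start user)."""
    for name, predict_fn in techniques:
        result = predict_fn(user, item)
        if result is not None and result[1] >= min_confidence:
            return result[0], name
    return None, "fallback: no prediction"

# Social filtering cannot predict for a brand-new user, so the strategy
# falls back to the genre-based technique.
techniques = [
    ("social filtering", lambda u, i: None),
    ("genre based", lambda u, i: (4.0, 0.8)),
    ("most popular", lambda u, i: (3.5, 1.0)),
]
print(predict("new_user", "some_artist", techniques))
```

Recording which technique produced each prediction, as this sketch does, is also what makes per-technique explanations possible later on.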
Thorough examination and testing was conducted upon these algorithms to ascertain ways in which
they could be explained and controlled. The results of this investigation are summarised in Figure
3.1. This table shows the possible explanations and control features that could be implemented for
each of the recommendation algorithms within the Duine Toolkit. It also lists any problems that may
be encountered when adding scrutability and control to each algorithm. For example, the entry for the
Genre Based technique notes that recommendations produced using this technique could be explained by
telling the user what genres an item belongs to and how interested the system thinks that user is in those
genres. It also notes that one way users could be given control over this technique would
be to allow them to specify their level of interest in particular genres. Finally, it shows that a possible
problem that may be encountered when offering users controls and explanations for this technique would
arise if a user did not agree with the genres into which an item was classified.
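As a concrete illustration of this pairing of explanation and control, a genre-based predictor can expose the per-genre interest values it uses, and a control feature can simply let the user overwrite them. The sketch below is hypothetical and is not drawn from the Duine Toolkit's code.

```python
def genre_prediction(item_genres, genre_interest, default=3.0):
    """Predict a rating as the mean of the user's interest (1..5) in each
    genre the item belongs to; unknown genres fall back to a neutral value."""
    interests = [genre_interest.get(g, default) for g in item_genres]
    return sum(interests) / len(interests) if interests else default

# Interests inferred from ratings, then corrected via a control feature.
interest = {"rock": 4.5, "pop": 2.0}
print(genre_prediction(["rock", "pop"], interest))  # mean of 4.5 and 2.0
interest["pop"] = 4.0  # the user raises their interest in pop
print(genre_prediction(["rock", "pop"], interest))  # mean of 4.5 and 4.0
```

Because the prediction is a direct function of the interest values, the same values can be shown to the user as the explanation and edited as the control.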
Most Popular
Possible Explanations: Tell the user where this item ranks in terms of popularity. Tell the user the average rating that has been given to this item. Tell the user how many users have rated this item.

Genre Based
Possible Explanations: Tell the user the recommendation was based on the genres that item belongs to. Show the user how interested the system thinks they are in each genre.
Possible Control Features: Allow the user to specify their interest in a particular genre.
Problems: What if users don't agree with the genre classifications?

Social Filtering
Possible Explanations: Show the user how similar users have rated an item. Show the user the similar users that factored heavily in their recommendation.
Possible Control Features: Allow the user to specify the impact that similar and opposite users should have on recommendations. Allow the user to choose users who they want to be considered as similar to them.
Problems: What if users do not think they are really similar to particular users? There is a lot of information involved in this algorithm. The ‘opposite users’ idea is a hard one to convey.

Learn By Example
Possible Explanations: Show the user the similar items that factored heavily in their recommendation and how they rated those similar items.
Possible Control Features: Allow the user to specify what factors should determine the similarity between items.
Problems: What if users do not think this item is actually similar to the items they have rated in the past?

Information Filtering
Possible Explanations: Show the user the key words that are present in the descriptions of items that they have liked in the past.
Possible Control Features: Allow the user to control the features used in recommendation.
Problems: Users might disagree with the keywords used to categorise their interest, even if these key words are quite appropriate. Users might not understand how this approach is working, especially if it works on something other than descriptions (e.g. it may work on the text from forum posts about an item).

FIGURE 3.1: Summary Of Possible Explanations And Control Features For The Major Algorithms In The Duine Toolkit.
The Taste Strategy was also examined at this stage, but it was found that, because it switches between
recommendation techniques, it cannot be explained in a consistent way to users. This
meant that it was not considered a suitable technique for the addition of scrutability and control.
3.4 Questionnaire - Design
The recommendation algorithm analysis described in the previous section highlighted a number of
usability features that could be added to a recommender system. Further, the analysis of existing
recommender systems described in Section 3.2 and the review of literature in Chapter 2 also
brought to light some of the different usability features of existing recommender systems. In order to
investigate how understandable and effective users would find these usability features, a questionnaire
was designed. The results of this questionnaire would then be used to inform the construction of the
prototype. A questionnaire was chosen as it was the most efficient way to gather large amounts of
detailed information about users' opinions on the set of potential usability features. The specific aims of
the questionnaire were to assess several potential usability features related to:
• Understanding of recommendations provided by various recommendation techniques.
• Usefulness of recommendations provided by various recommendation techniques.
• Attitudes toward control features for recommenders and understanding of how these would be
used.
• Preferences for recommendation presentation format.
To this end, an extensive questionnaire was designed. It asked users to answer questions on a scale of 1
to 5, where 1 was the lowest score and 5 was the highest. Particular care was taken during the design of
the questionnaire to ensure that each question would elicit useful information from participants and that
all of the questions were clear and free of bias.
An initial group of five respondents filled out the questionnaire, each answering 60 questions. After
these respondents had completed the questionnaire, a number of revisions were made. These revisions
included the removal of two questions, the addition of seven new questions and minor changes to the
wording of a small number of questions. The questionnaire was then conducted with a further 13 people,
who answered 65 questions (58 in common with the original questionnaire). Most respondents took
around 40 minutes to complete the questionnaire. Figure 3.2 shows demographic information for each
of the respondents. The sample group for this questionnaire was carefully selected to contain people
from a variety of backgrounds, and both males and females. The majority (12/18) of the users who
completed the questionnaire were aged under 30. Since modern recommender systems are used most
often by people who fall in the 18-30 age range, a higher proportion of respondents in this age range
was deemed to be appropriate.
Participant:                              1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
Age:                                      22 21 20 30 22 51 52 19 21 22 22 21 19 47 48 18 47 19
Gender:                                   F  F  M  M  M  F  M  M  F  F  F  F  M  F  M  F  F  F
Has An IT Background?:                    N  N  N  N  Y  N  N  N  N  N  N  N  N  N  N  N  N  N
Has Used Any Type Of Recommender Before?: Y  N  Y  Y  Y  N  Y  Y  Y  Y  Y  N  Y  N  N  Y  Y  Y
FIGURE 3.2: Demographic Information For Each Of The Respondents.
Sections 3.4.1 to 3.4.3 now describe the final set of questions that were presented to respondents.
Although there were many questions, they fell into three groups: Part A had one set of 5 questions,
Part B had six sets of questions totalling 52 questions, and the Final Questions comprised one set of
seven questions. The entire questionnaire is included as Appendix A.
3.4.1 Part A - Presentation Style
This section of the questionnaire aimed to investigate users’ preferences for recommendation presenta-
tion format.
At this stage, respondents were shown two forms of recommendation presentation. The first of these
was a standard List Based format (shown in Figure 3.3) and the second was a Map Based format (shown
in Figure 3.4) that was similar to the liveplasma1 interface mentioned in Chapter 2. After viewing an
example of each presentation format, respondents were asked to rate how well they understood
the information conveyed by that example and how useful they would find recommendations
presented in this format. Finally, after viewing both formats, respondents were asked to indicate whether
they would prefer the List Based format, the Map Based format or both.
3.4.2 Part B - Understanding & Usefulness
This section of the questionnaire aimed to investigate understanding of recommendations, usefulness of
recommendations and attitudes toward control features.
This section presented six recommendation techniques to respondents (Most Popular, Genre Based,
Social Filtering, Learn By Example, Description Based and Lyrics Based). For each of these techniques,
respondents followed this process:
1http://www.liveplasma.com/
FIGURE 3.3: List Based Presentation That Was Shown To Participants In The Questionnaire
FIGURE 3.4: Map Based Presentation That Was Shown To Participants In The Questionnaire
Respondents were first presented with a short textual description of how the technique works. At this
stage, they rated their initial understanding of the technique. Respondents were then presented with
a number of explanation screens, each of which showed a recommended item and an explanation of
why it was recommended (example explanation screens are shown in Figures 3.5 and 3.6). For each
screen, respondents rated how well they understood why the recommendation had been made and how
useful they would find recommendations that were produced using this technique and explained in this
fashion. If the technique had control features, then respondents were also presented with a control
feature screen for each of the controls for that technique (an example control feature screen is shown in
Figure 3.7). After viewing each control feature screen, respondents rated how well they understood how
they would use the control, how likely they would be to use it and how useful they expected it would
be. Finally, respondents rated the overall usefulness of the recommendation technique, and their overall
understanding of it.
FIGURE 3.5: One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Learn By Example Technique
FIGURE 3.6: One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Social Filtering Technique
3.4.3 Final Questions - Integrative
This section of the questionnaire aimed to investigate the usefulness of recommendation techniques and
attitudes toward explanations and control features.
FIGURE 3.7: The Genre Based Control Shown To Participants In The Questionnaire
At this stage of the questionnaire, respondents were asked to indicate their general opinion on the
usefulness of all six recommendation techniques. They first ranked the techniques from 1 to 6 in order
of usefulness. Respondents were then asked to indicate the weight they would want to place on each
technique if a combination of techniques was to be used in a recommender system. The weight that they
could place on each technique ranged from ‘Not At All’ (weight of 0) to ‘Very Much’ (weight of 100).
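Such a weighted combination amounts to a weighted average of the per-technique predictions. The sketch below illustrates how weights like these could be applied; the names are illustrative, and this is not how any particular system discussed here implements it.

```python
def combine(predictions, weights):
    """Combine per-technique predicted ratings using 0..100 user weights.

    predictions: {technique: predicted rating}
    weights:     {technique: weight 0..100}, e.g. as gathered by the
                 questionnaire. Returns None if all weights are zero."""
    total = sum(weights.get(t, 0) for t in predictions)
    if total == 0:
        return None
    return sum(r * weights.get(t, 0) for t, r in predictions.items()) / total

predictions = {"Social Filtering": 4.0, "Genre Based": 3.0, "Lyrics Based": 1.0}
weights = {"Social Filtering": 90, "Genre Based": 80, "Lyrics Based": 0}
print(round(combine(predictions, weights), 2))
```

A zero weight simply removes a technique from the combination, which is what a ‘Not At All’ answer expresses.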
The final five questions in the questionnaire then asked respondents to rate how useful they would find
the following five potential features of a recommender system:
System Chooses Recommendation Method: The recommender system chooses the best
recommendation technique to use at any point in time.
System Chooses Combination Of Recommendation Methods: The recommender system chooses
a combination of recommendation techniques to be used.
View Results From Other Recommendation Methods: The recommender system chooses the
best recommendation technique to use at any point in time. However, users are then able to
view what their recommendations would look like if other recommendation techniques were
used.
Explanations: Explanations are provided for how recommendations were made.
Controls: Users are given some amount of control over how recommendations are made.
These final questions would give an overall picture of users' attitudes toward a variety of potential features
of a recommender system. As well as providing useful information, these questions also acted as internal
consistency checks, allowing a user's answers to be validated. For example, when asked to rank the
recommendation techniques in order of usefulness, a user's answers would be expected to correlate with
their answers to the usefulness questions asked earlier in the survey.
3.5 Questionnaire - Results
In total, 5 respondents answered the initial questionnaire (60 questions) and a further 13 respondents
answered the revised questionnaire (65 questions). We now present and discuss the results of the
questionnaire, with reference to its aims as expressed in Section 3.4. The results in this
section are rather long because they report respondents' answers in terms of recommendation
usefulness, recommendation understanding, control features and presentation method. Each of these factors is
important and each of them is different. For each factor, this section reports a small number of averages,
explained with illustrative additional data that helps in understanding the results. Then there
is a summary of the conclusions and a separate list of the implications for the prototype design. This
section is quite long, but it has not been relegated to an appendix because it is all new information about
how users can understand and control recommenders.
3.5.1 Usefulness
This section discusses the questionnaire results relevant to the aim of assessing the perceived usefulness
of recommendations provided using various recommendation techniques.
In Part B of the questionnaire, respondents rated the usefulness of 18 screens that presented
recommendations. The screens that had the maximum average usefulness for each technique are presented in
Figure 3.8, along with their average rating (error bars show one standard deviation above and below;
actual results for each respondent are shown in Appendix B). For example, of the five Social Filtering screens
presented to respondents, the one with the highest average usefulness rating was the Simple Text screen,
so this is the one shown in Figure 3.8.
In the Final Questions section of the questionnaire, respondents ranked the recommendation techniques
in order of usefulness (where 1 is the highest possible ranking, and 6 is the lowest). Figure
3.9 shows the average ranking given to each technique, with error bars showing one standard deviation
above and below the mean (actual results for each respondent are shown in Appendix B).
[Figure: bar chart of the average usefulness ratings (1.0-5.0) for Most Popular 2 (Avg. Rating Info.), Genre Based 1 (Genre Listing), Word Of Mouth 1 (Simple Text), Learn By Example 2 (Similar Artists), Description Based 1 (Simple Text) and Lyrics Based 1 (Simple Text).]

FIGURE 3.8: The Screens With The Maximum Average Usefulness For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
Technique           Avg.  St. Dev.
Word of Mouth       1.9   1.3
Genre Based         2.4   1.2
Most Popular        2.8   1.3
Learn By Example    3.3   1.0
Description Based   4.6   1.0
Lyrics Based        5.8   0.5

FIGURE 3.9: Average Ranking Given To Each Recommendation Technique. N = 18. Top Ranking = 1. Bottom Ranking = 6.
In the Final Questions section, respondents also indicated the weight they would want to place on each
technique if a combination of techniques was to be used. Figure 3.10 shows the average weight (0-100)
chosen for each method. Note that respondents could choose any value from 0 to 100 for each technique. For
example, Participant 6 gave Most Popular a weight of 30, Genre Based a weight of 80, Social Filtering
a weight of 90, Learn By Example a weight of 70, Description Based a weight of 30 and Lyrics Based a
weight of 0. We now discuss these results.
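The weighted combination that respondents were asked about can be sketched as a simple hybrid scorer. This is an illustrative sketch only: the function and the artist scores are hypothetical, and the prototype's actual combination logic is not defined here. Each technique's 0-100 weight is normalised against the sum of all weights and used to blend that technique's per-item scores.

```python
def combine_scores(technique_scores, weights):
    """Blend per-technique item scores using per-technique weights (0-100).

    technique_scores: {technique_name: {item: score}}
    weights: {technique_name: weight in 0-100}, as elicited in the questionnaire
    Returns items sorted from highest to lowest combined score.
    """
    total = sum(weights.values())
    combined = {}
    for technique, scores in technique_scores.items():
        # Normalise this technique's weight to a share of the total weight
        share = weights.get(technique, 0) / total if total else 0.0
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + share * score
    return sorted(combined, key=combined.get, reverse=True)

# Participant 6's weights, taken from the questionnaire responses above;
# the artist scores are invented purely for illustration
weights = {"Most Popular": 30, "Genre Based": 80, "Social Filtering": 90,
           "Learn By Example": 70, "Description Based": 30, "Lyrics Based": 0}
scores = {"Genre Based": {"Artist A": 4.0, "Artist B": 2.0},
          "Social Filtering": {"Artist B": 5.0, "Artist A": 1.0}}
print(combine_scores(scores, weights))
```

Under these hypothetical scores, Artist B ranks first because Participant 6 weighted Social Filtering most heavily.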
Social Filtering: This method had the best average ranking (1.9, where 1 is the best possible ranking) and had
high average usefulness scores, but, surprisingly, it had only the second highest average contribution,
with a weight of 68. Six people indicated that Social Filtering should have the most
contribution, but low scores from other respondents caused this technique to receive a lower
average contribution score than Genre Based. Social Filtering (Simple Text) was the highest
rated Social Filtering screen. This screen had the highest average usefulness rating (4.4/5)
of all screens shown in the questionnaire. The next highest rated Social Filtering screen was
[Figure: bar chart of the average weight (0-100) assigned to Most Popular, Genre Based, Word of Mouth, Learn By Example, Description Based and Lyrics Based.]

FIGURE 3.10: Average Response For Contribution That Each Method Should Make To A Combination Of Recommendation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
the Simple Graph screen with an average of 3.9/5. Although Social Filtering (Similar Users)
had an average usefulness score of 3.1/5 (the lowest for all Social Filtering screens), four respondents
commented that they thought the Social Filtering (Similar Users) screen was useful
because it allowed you to view similar users and their profiles. One respondent commented
that Social Filtering "is a great way to recommend new music." A further two people commented
that this method would be useful, as long as similarity between users was calculated
accurately. Another person commented that they did not like the idea of opposite users factoring
in their recommendations. Finally, another commented that they would like to be able to
indicate friends that have similar interests and are already using the recommender system.
Genre Based: This method received the highest average contribution score (76); six people
indicated that this technique should have the most contribution. It was also given the second
best average ranking (2.4). However, one respondent did mention that he thought classifying
items by genres was too broad. The Genre Based (Simple Text) screen had the second highest
average usefulness (4.1/5) of all screens presented in the questionnaire, and the two Genre
Based screens both had average scores of 4 or more. Two people commented that they thought
Genre Based (Genre Listing) was the best Genre Based screen as it provided more information.
Learn By Example: This method had an average contribution score of 58 and only two people
indicated that this method should have the highest contribution. This method was given an
average ranking of 3.3, the fourth highest average ranking. The Similar Artists screen had the
highest average usefulness score of the Learn By Example screens, with an average usefulness
of 4.0/5, the third highest average usefulness score. One respondent commented that they
doubted whether similarity between artists could be calculated objectively.
Most Popular: Five respondents commented that they would not necessarily be interested in
the most popular items. However, Most Popular had the second highest average contribution
score, with 68, and seven people indicated that Most Popular should have the most contribution.
Most Popular was also given an average ranking of 2.8, which was the third best average ranking.
The two screens displaying Most Popular recommendations, Most Popular (Ranking)
and Most Popular (Avg. Rating Info.), had average scores of 3.5/5 and 3.4/5 respectively.
Description Based: This method scored 41 average contribution and had the second worst average
ranking. Respondents viewed only one screen that presented Description Based recommendations.
This screen had an average usefulness rating of 2.7/5, the second lowest average
usefulness score. Nine people commented that they doubted the usefulness of using descriptions
to make recommendations. Four of these people commented that descriptions are too
subjective to be useful.
Lyrics Based: This method scored 12 average contribution and had the worst average ranking.
Respondents viewed only one screen that presented Lyrics Based recommendations. This
screen had an average usefulness rating of 2.2/5, the lowest average usefulness score. Nine
respondents commented that they didn't think lyrics would be useful for making recommendations.
Seven of these commented that lyrics did not determine whether they liked an item.
Findings.
• Social Filtering and Genre Based were judged by respondents to be the most useful techniques.
This is supported by the fact that these two methods both had either the first or the second best
average score on every question.
• Respondents were more interested in having Most Popular recommendations combined with
other techniques than in having them delivered on their own. This method had the second
highest average weight in the question regarding how techniques should be combined, yet
five respondents commented that they were not interested in just the most popular items.
• Respondents did not think that Description Based or Lyrics Based would be useful. This is
shown by the fact that these two methods consistently had the lowest average scores for each
question.
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn
By Example (Simple Text) were all judged by respondents to be the most useful screens for
their particular recommendation techniques.
• Genre Based (Simple Text) and Genre Based (Genre Listing) were approximately equally useful
(their average usefulness scores were quite similar) and each offered a different form of
useful information.
• Most Popular (Avg. Rating Info.) and Most Popular (Ranking) were approximately as useful
as one another (their average usefulness scores were quite similar) and each offered a different
form of useful information.
• Some users would find the Social Filtering (Similar Users) screen useful. This screen did not
receive a high average usefulness score, but four respondents commented that they liked the
ability it provided to examine the ratings of similar users.
Implications for the prototype.
• Social Filtering and Genre Based should be included as recommendation techniques.
• Most Popular should be included as an optional recommendation technique, or one which can
be combined with other techniques.
• Learn By Example should also be included as a recommendation technique, as it was not found
to be significantly less useful than the top three recommendation techniques.
• Description Based and Lyrics Based should not be included in the prototype.
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn
By Example (Simple Text) should all be included as explanation screens in the prototype.
• Genre Based (Simple Text) and Genre Based (Genre Listing) should be combined into a single
explanation screen, as their average usefulness scores were similar and each displays a different
piece of information which would be useful to users. Further, these two screens could easily
be combined without causing conflicting information to be displayed. For the same reasons,
Most Popular (Avg. Rating Info.) and Most Popular (Ranking) should also be combined.
• Social Filtering (Similar Users) should be considered for implementation in the prototype.
3.5.2 Understanding
This section discusses the questionnaire results relevant to the aim of: assessing understanding of recommendations
provided using various recommendation techniques.
In Part B of the questionnaire, respondents rated their understanding of the 18 screens that presented
recommendations. The screens that had the maximum average understanding for each technique are
presented in Figure 3.11, along with their average rating (error bars show one standard deviation above
and below the mean; actual results for each respondent are shown in Appendix B). For example, of the five
Social Filtering screens presented in the questionnaire, the one with the highest average understanding
rating was the Simple Text screen, so this is shown in Figure 3.11 (3rd bar from the left).
[Figure: bar chart of the average understanding ratings (1.0-5.0) for Most Popular 2 (Avg. Rating Info.), Genre Based 1 (Genre Listing), Word Of Mouth 1 (Simple Text), Learn By Example 1 (Avg. Rating Info.), Description Based 1 (Simple Text) and Lyrics Based 1 (Simple Text).]

FIGURE 3.11: The Screens With The Maximum Average Understanding For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
In Part B of the questionnaire, respondents also rated their understanding of four recommendation techniques
before and after they saw the screens for that technique. Figure 3.12 shows the average understanding rating
given to each technique, with error bars showing one standard deviation above and below the mean
(actual results for each respondent are shown in Appendix B). We now discuss the results shown in Figures
3.11 and 3.12.
Social Filtering: Social Filtering (Simple Text) had the highest average understanding of all the
Social Filtering screens, with 4.6/5, which was among the highest average understanding scores given
to any of the explanation screens. The Social Filtering (Simple Graph) screen (average of 4.5/5) and
the Social Filtering (Table) screen (average of 4.3/5) both also received high average scores for
understanding. Both Social Filtering (Graph w/ Opposites) and Social Filtering (Similar Users)
[Figure: grouped bar chart of average understanding ratings before and after explanations for Most Popular, Genre Based, Word Of Mouth and Learn By Example.]

FIGURE 3.12: Respondents' Average Understanding Of Recommendation Methods Before And After Explanations. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
showed ‘opposite users’ in their explanation, but three users said that they were confused by
the ‘opposite users’ concept, and these screens had the lowest average ratings of all of the
Social Filtering screens in the questionnaire (Social Filtering (Similar Users) averaged 3.9/5
and Social Filtering (Graph w/ Opposites) averaged 3.8/5; these were the only average scores
that were below 4.0).
Social Filtering was given the highest average understanding rating before explanations
were provided (average of 4.4/5). However, after explanations were provided, the average
for this technique dropped to 3.9/5, the lowest average understanding rating. As mentioned
above, three respondents commented that ‘opposite users’ had confused them and a further two
people commented that the explanations contained too much information and were confusing.
Genre Based: Two Genre Based screens were presented in the questionnaire. Genre Based (Simple
Text) received the highest average understanding of all the explanation screens: 4.7/5.
Genre Based (Genre Listing) also received a high average understanding rating of 4.6/5, the
third highest average understanding given to any of the 18 explanation screens. One respondent
commented that Genre Based (Simple Text) was the better of the two Genre Based screens as
it gave "more information about the individual artist and not just a genre". However, another
commented that Genre Based (Genre Listing) was better, as it was more related to his ratings
and profile.
Genre Based actually received the lowest average understanding rating before the explanation
screens were provided (average of 4.2/5). Remarkably, after explanations, the average
understanding rating for this method increased to 4.8/5. Eight people gave this method a higher
understanding rating after viewing the explanation screens, ten gave it the same rating, and no
respondents gave it a lower rating.
Learn By Example: Learn By Example (Simple Text) had the highest average understanding
rating of the two Learn By Example screens presented in the questionnaire. Learn By Example
(Simple Text) had an average of 4.2/5, which was just higher than the average of 4.1/5 for Learn
By Example (Similar Artists).
Learn By Example had the equal highest average understanding (4.4/5) before explanation
screens were presented. However, this dropped to an average of 4.1/5 after respondents viewed
the explanation screens; this was the second lowest after-explanation average. Only one
respondent gave Learn By Example a higher understanding rating after explanations, fourteen
gave it the same rating and three gave it a lower understanding rating.
Most Popular: The Most Popular screen with the highest average rating was Most Popular
(Ranking), with a score of 4.7/5 (which was the highest average understanding across all the
explanation screens). However, Most Popular (Avg. Rating Info.) also received a score of
4.5/5. Five people commented that Most Popular (Ranking) made recommendations easier to
understand as it gave more information. One person commented that he would like comments
from users about that item to be added to the screen, indicating why they liked or disliked it.
Figure 3.12 shows that this method improved from an average understanding of 4.3/5 before
explanations to an average of 4.6/5 after the viewing of explanation screens. The average
understanding rating for Most Popular after explanations is the second highest average understanding
score shown in Figure 3.12. Four respondents gave Most Popular a higher understanding
rating after explanations, twelve respondents gave it the same rating and two gave it a
lower understanding rating.
Description Based: Respondents viewed only one screen that presented Description Based recommendations.
This screen had an average understanding rating of 4.0/5, which is the lowest
of all the scores shown in Figure 3.11. Four respondents gave this method a score of 3 or less.
This method is not shown in Figure 3.12 because once the first five respondents had completed
the questionnaire, respondents were no longer asked to report their understanding of this
method before and after viewing its screens. This decision was made because this method had
been given low usefulness and low understanding scores by the first five respondents.
Lyrics Based: Respondents viewed only one screen that presented Lyrics Based recommendations.
This screen had an average understanding rating of 4.1/5, which is the second lowest of
all the scores shown in Figure 3.11. Three people gave this method a score of 3 or less. One
respondent commented that the way this method works "just seems to make no sense".
This method is not shown in Figure 3.12 for the same reason as Description Based: after the
first five respondents had completed the questionnaire, respondents were no longer asked to
report their understanding of this method before and after viewing its screens, because the first
five respondents had given it low usefulness and low understanding scores.
Findings. The findings that came from this section of the questionnaire were:
• Each of the recommendation techniques can be explained in a way that users can easily understand.
This is supported by the fact that nearly all of the values shown in Figure 3.12 were equal to or
above 4.0 (the lowest, Social Filtering after explanations, was 3.9/5).
• When explaining recommendations, providing more information can often be beneficial. This
is supported by user comments that indicated a desire for more information about recommendations.
However, it is important to find a clear, concise way to deliver that information
to people.
• Complicated or poor explanations can undermine a user's understanding of a recommendation
technique. For example, three people commented that the ‘opposite users’ idea was confusing.
Further, the screens showing opposite users received the lowest average understanding
scores, and after these screens were shown to users, the average understanding of the Social
Filtering technique dropped from 4.4/5 to 3.9/5. This finding was also reported in (Herlocker
et al., 2000).
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn
By Example (Simple Text) were judged by users to be the most understandable explanation of
each of their recommendation techniques (as each of these had the highest average understanding
of the screens for their technique).
• Social Filtering (Simple Graph) was almost as understandable as Social Filtering (Simple Text)
(as they had average understanding scores only 0.1 points apart).
• Similarly, Learn By Example (Similar Artists) was almost as understandable as Learn By Example
(Simple Text) (as they had average understanding scores only 0.1 points apart).
• Genre Based (Simple Text) and Genre Based (Genre Listing) were approximately as effective
at explaining recommendations as one another (their average understanding scores were quite
similar) and each offered a different form of useful information.
• Most Popular (Avg. Rating Info.) and Most Popular (Ranking) were also approximately as
effective at explaining recommendations as one another (their average understanding scores
were quite similar) and each offered a different form of useful information.
• The inclusion of the ‘opposite users’ concept negatively affected users' perceived understanding
of the Social Filtering (Similar Users) screen. This is supported by the fact that four respondents
commented that the ‘opposite users’ concept confused their understanding of Social
Filtering.
• People found Learn By Example to be harder to understand than techniques such as Most
Popular, Genre Based and even Social Filtering. This is surprising, as one of the benefits often
noted for the Learn By Example technique is the "potential to use retrieved cases to explain
[recommendations]" (Cunningham et al., 2003, p. 1).
• Different people prefer different styles of explanation. Evidence supporting this finding includes
the fact that different users rated their understanding of different explanation screens
higher than others.
Implications for the prototype.
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn
By Example (Simple Text) should all be included as explanation screens in the prototype.
• Learn By Example (Simple Text) and Learn By Example (Similar Artists) should be combined
into a single explanation screen, as their average understanding scores were similar and each
displays a different piece of information which would be useful to users. Further, these two
screens could easily be combined without causing conflicting information to be displayed.
• The case for combining Most Popular (Avg. Rating Info.) and Most Popular (Ranking) and
Genre Based (Simple Text) and Genre Based (Genre Listing) is also strengthened by these
results, as each of these pairs had similar average understanding ratings.
• Social Filtering (Similar Users) should be included in the prototype, without any reference to
‘opposite users’. This is because the ability to view similar users was deemed useful by some
respondents, and the ratings for this control may have been negatively affected by the fact that
it displayed ‘opposite users’ — a concept which consistently confused people.
3.5.3 Understanding And Usefulness
The Pearson Correlation was calculated between the ratings that respondents gave for the usefulness of
particular explanation screens and the ratings that they gave for their understanding of these screens.
This correlation was calculated to be 0.28. Squaring this value gives 0.078,or 7.8 percent. This suggests
that a user’s understanding of a recommendationdoesaffect how useful they deem it to be. In fact, this
value suggests that 7.8 percent of a user’s opinion on the usefulness of a recommendation technique is
determined by how well they understand that recommendation. This result is confirmed by a number
of cases that were observed within the questionnaire. Particularly significant were the cases in which a
user’s understanding was confused by complicated concepts within explanations. This often caused a
decrease in both the user’s understanding rating and their usefulness rating for that screen.
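The calculation above can be reproduced with a short routine. The ratings below are made up purely for illustration (the actual respondent data is in Appendix B): Pearson's r is the covariance of the two rating lists divided by the product of their standard deviations, and squaring r gives the proportion of variance in one set of ratings that is shared with the other.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical 1-5 ratings, for illustration only
usefulness = [4, 5, 3, 2, 4, 5, 1, 3]
understanding = [5, 4, 4, 3, 4, 5, 2, 2]

r = pearson_r(usefulness, understanding)
print(round(r, 2), round(r * r, 3))  # r, and r^2 = proportion of variance shared
```

With the value reported here, r = 0.28 gives r² = 0.078, i.e. about 7.8 percent of the variance in usefulness ratings is shared with understanding ratings.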
Findings.
• A user’s opinions on the usefulness of recommendations are related to theirunderstanding of
these recommendations.
3.5.4 Control
This section discusses the questionnaire results relevant to the aim of: assessing users’ attitudes toward
features that provide control over recommender techniques and their understanding of how these would
be used.
In Part B of the questionnaire, respondents rated three control features according to how well they
understood each control, how useful they thought each control would be and how likely they would be
to use that control. Figure 3.13 shows the average score for each of these questions, with error bars
showing one standard deviation above and below the mean (actual results for each user shown in
Appendix B).
Genre Based Control (Genre Slider): This control had the highest average scores for understanding
(4.9/5), usefulness (4.5/5) and likelihood of use (4.6/5). All but two respondents gave
this control a 5 for understanding; the other two respondents gave it a 4. All but three people
gave this control a 5 for how likely they would be to use it, and all but one user gave this
control a rating of 4 or 5 when asked how useful they thought it would be. Further, seven users
[Figure: grouped bar chart of average understanding, usefulness and likelihood-of-use ratings for Genre Based Control, Word Of Mouth Control 1 (Ignore User) and Word Of Mouth Control 2 (Adjust Influence).]

FIGURE 3.13: Average Ratings For Questions Regarding Respondents' Understanding, Likelihood Of Using And Perceived Usefulness Of Each Control Feature. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
commented that they strongly liked this control. One respondent commented that they would
like to specify interest in more specific genres (i.e. sub-genres), but another commented that
they thought too many genres would become confusing for users.
Social Filtering Control (Like/Not Like): This control had the second highest average scores
on all questions. Its average ratings were 4.6/5 for understanding, 3.5/5 for likelihood of use
and 4.3/5 for usefulness. All but two respondents gave this control a rating of 4 or 5 for
understanding, and the other two gave a rating of 3. Most users also gave this control a rating
of 4 or 5 for usefulness. However, there was much more variation in the likelihood of use
ratings for this control. In fact, this question had the second highest standard deviation (1.3)
of any question asked about the three controls and responses to this question were distributed
relatively evenly between 1 and 5.
Social Filtering Control (Adjust Influence): This control had the lowest average scores for all
questions. Social Filtering Control (Adjust Influence) had an average understanding rating of
3.8/5, likelihood of use rating of 3.0/5 and usefulness rating of 3.4/5. This method asked users to
adjust the impact of ‘opposite users’ upon recommendations. As mentioned in Section 3.5.2,
three users commented that the concept of ‘opposite users’ was confusing, and the average
understanding ratings for the Social Filtering technique fell when this concept was introduced.
The ratings given to this method were highly varied: three people responded with a 5 for the
usefulness of this control and 5 for their likelihood of using it, yet three others gave
scores of only 1 or 2 for both of these questions (each of these three gave lower ratings for
their understanding of the Social Filtering technique once the concept of ‘opposite users’ was
introduced).
Findings.
• The Genre Based Control (Genre Slider) would get used often and would be easy to understand.
Further, respondents also believed that it would be very useful. These findings are supported
by the fact this control received the highest average usefulness scores, and most users gave a
rating of 4 or 5 for all questions regarding this control.
• It is important to get the number of available genres correct when allowing users to specify
their interest in genres. This is supported by the fact that users commented that having too
many genres would be overwhelming.
• Social Filtering Control (Like/Not Like) is easy to understand (most users gave a rating of 4 or
5 for understanding). It would be used by some, but not all users (as there was a high variation
in likelihood of use ratings). Further, most users would find this control to be quite useful
(most users gave 4 or 5 for usefulness).
• In general, most users would not understand how Social Filtering Control (Adjust Influence)
works and most users would not use it. Most respondents believed that this control would not
be very useful. These findings are supported by the fact that this control scored the lowest
average rating in every question and three users commented that they were confused by the
opposite users concept, which is a part of Social Filtering Control (Adjust Influence).
Implications for the prototype. Based upon these findings, it was decided:
• To include Genre Based Control (Genre Slider) in the prototype. It is important that the right
number of genres is used with this control. The number of genres should not be too large (as
this may become overwhelming) and should not be too small (as this may not be useful).
• To include Social Filtering Control (Like/Not Like) in the prototype. This control may not be
rated highly by all users, but it is worth testing its effectiveness in a real prototype.
• Not to include Social Filtering Control (Adjust Influence) in the prototype.
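The Genre Based Control (Genre Slider) adopted above can be illustrated with a minimal sketch. The function and data names here are hypothetical, not the prototype's actual code: each genre slider value (0-100) represents the user's stated interest, and an item is scored by averaging the slider values of its genres.

```python
def genre_score(item_genres, sliders):
    """Score an item (0-100) from the user's genre slider settings.

    item_genres: genres tagged on the item
    sliders: {genre: 0-100 interest level set by the user}
    Genres with no slider count as zero interest.
    """
    if not item_genres:
        return 0.0
    # Average the user's interest across all of the item's genres
    return sum(sliders.get(g, 0) for g in item_genres) / len(item_genres)

# Hypothetical slider settings for a user
sliders = {"Rock": 80, "Jazz": 20, "Pop": 50}
print(genre_score(["Rock", "Pop"], sliders))  # 65.0
```

Moving a slider immediately re-weights every item tagged with that genre, which is what makes the control both transparent and easy to predict.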
3.5.5 Presentation Method
This section discusses the questionnaire results relevant to the aim of: assessing users’ preferences for
recommendation presentation format.
In Part A of the questionnaire, respondents rated their understanding and opinion on the usefulness of
two presentation methods: Map Based and List Based. Figure 3.14(a) shows the average score for each
of these questions, with error bars showing one standard deviation above and below the mean. Users
also indicated their preference for the way in which they would like recommendations to be displayed.
Figure 3.14(b) shows the sums of responses to this question. The actual results for each user are shown
in Appendix B.
[Figure: (a) bar chart of average understanding and usefulness ratings for the List and Map presentation methods; (b) bar chart of the sum of presentation preferences for List Only, Both List And Map, and Map Only.]

FIGURE 3.14: Users' Responses For Questions Regarding Recommendation Presentation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
Ten users indicated that they would prefer to have only the List Based presentation. Four of these users
commented that List Based is quicker to understand and read. These comments are supported by the
results shown in Figure 3.14, which shows that List Based had an average understanding rating of 4.7/5,
exactly one point higher than the average understanding rating for Map Based, which was 3.7/5. In addition,
seven users commented that the map took longer to work out. However, List Based and Map Based
had similar average usefulness scores: List Based scored an average of 3.8/5 and Map Based had an
average of 3.5/5. Two users indicated that they would like to have recommendations presented in a
Map Based format only and six users indicated that they would like to have recommendations displayed
in both List Based and Map Based formats. Four users commented that the map gave more information
and was useful for that reason.
Findings.
• Most users would find a List Based presentation easier to understand and quicker to read than
a Map Based presentation. This is supported by the fact that users commented that a list based
presentation is quicker and easier to read and by the fact that the List Based presentation scored
a higher average understanding rating than Map Based.
• In general, users indicated they would find a List Based presentation useful. This is evidenced
by the fact that 16/18 respondents indicated that they would want List Based as a part of their
recommendation system and this presentation received the highest average usefulness score.
• Some users indicated they would also find a Map Based presentation to be useful. Evidence
supporting this finding includes the fact that 8/18 users indicated that they would want a Map
Based presentation included in a recommender.
• Different people prefer different styles of presentation. This was shown through the variation
in the ratings that were given for the questions regarding presentation.
Implications for the prototype. Based upon these findings, it was decided:
• To definitely include a List Based presentation in the prototype.
• That there was enough support for the usefulness of a Map Based presentation to include it
in the prototype, in order to examine how users would interact with an implementation of a
Map Based presentation.
3.5.6 Final Questions
This section discusses the results from the final questions asked of users, which gave an overall indication
of their opinion of the various features shown in the questionnaire.
In the Final Questions section of the questionnaire, respondents rated the general usefulness of five
features that could be included in a recommender system. Figure 3.15 shows the average ratings for
each of these features, with error bars showing one standard deviation above and below the mean.
Choice Of Recommendation Method: The average rating for the usefulness of the system deciding
what recommendation method should be used was 3.6/5. Most people gave this feature a
rating of 3 or more, but one person gave this feature a rating of 1, while giving all other features
mentioned in this section a rating of 5. The average rating for this feature was much lower than
[Figure: bar chart of average usefulness ratings for System Chooses Reco. Method, System Chooses Combination Of Reco. Methods, View Results From Other Reco. Methods, Explanations and Controls.]

FIGURE 3.15: Average Rating For The Usefulness Of Possible Features Of A Recommender. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
the average rating for the usefulness of having the system choose a combination of methods
(average of 4.6/5). There was very little deviation in the responses given to the usefulness of
the system selecting a combination of methods, with all respondents giving ratings of either 4
or 5. This feature had the highest average rating of all features presented in this section of the
questionnaire. Another feature with a high average usefulness rating was the ability to view
recommendations made using different recommendation techniques, which had an average of
4.5/5. One respondent commented that "viewing what your recommendations would be like
with different methods allows you to compare the usefulness of each method and choose the
best one" and another commented that it would be "interesting and useful to see what your
recommendations would look like using different methods."
Explanations: The average rating for the usefulness of explanations was 3.8/5. One respondent
commented that the addition of explanations "allows you to make your own judgments about
the usefulness of the results." More than half of the respondents for this question gave
explanations a usefulness rating of 4 or 5.
Controls: The average rating given by users for the usefulness of controls was 4.5/5. As noted in
Section 3.5.4, seven respondents commented that they had a strong liking for the Genre Based
Control (Genre Slider). Twelve respondents rated the usefulness of controls as 5, four
users rated it as 4 and the remaining two gave controls a score of 2 and 1.
Findings.
• Rather than having the system choose a single recommendation technique to use, people would
prefer to have the system choose a combination of recommendation techniques or allow them
to view recommendations using various techniques. This is supported by the fact that, on
average, users rated the usefulness of the ‘System chooses recommendation method’ feature
lower than the features that involved a combination of recommendation techniques and viewing
recommendations using different techniques.
• People in our study believed that explanations would be a useful addition to a recommender
system. This is evidenced by the fact that users gave an average rating of 3.8/5 when asked to rate the
usefulness of explanations, and more than half of the respondents for this question gave a score
of 4 or 5.
• In general, people in our study believed that having control over a recommender system would
be very useful. This is supported by the fact that users gave an average of 4.5/5 when asked to
rate the usefulness of having control over a recommender system.
Implications for the prototype.
• The prototype should allow users to view recommendations produced using various techniques
and/or make recommendations using a combination of prediction techniques.
• The prototype should contain explanations for the recommendations that it produces. These
explanations should be offered to users if they are interested.
• The prototype should allow users to have control over certain elements of the recommender
system, to help them improve their recommendations.
3.6 Test Data
In order to perform evaluations at a later stage in the thesis, a source of test data needed to be established.
(Polcicova et al., 2000), (Maltz and Ehrlich, 1995), (Konstan et al., 1997) and (Basu et al., 1998)
mention the fact that recommender systems are likely to exhibit poor performance unless they contain
a significantly large number of user ratings. As a result, the data set used for testing needed to be large
enough to allow effective recommendations to be made. In addition, the type and quantity of test data
that could be obtained would heavily influence the process of creating and evaluating a prototype at later
stages of the project. An ideal set of test data for this project would have been a data set that contained
information about around 1000 users, detailing:
• Their ratings for particular artists.
• The time that they spent listening to individual music tracks.
• The actions that they performed while listening to music tracks.
This mixture of music ratings information and listening patterns was desirable, as this would allow
ratings generated from implicit data to be compared with each user’s explicit ratings. However, the
lack of sources for information regarding music ratings and listening patterns meant that it was not
possible to find a single data set containing both users’ explicit ratings and information about listening
habits. Further, it was not possible to find any significant source of information about actions users had
performed while listening to music. A dataset used in (Hu et al., 2005) was identified as a possible
source of test data. This dataset is a collection of users’ ratings for particular albums, taken from the
epinions.com2 website. However, this dataset was inadequate for use in this project, as it was deemed to
be too small to enable a recommendation system to produce good recommendations.
last.fm, an online radio service, was another source of data that was identified. This service makes a large
amount of data on users’ play-counts available through a web service. Due to the large amount of data
available through this service, it was decided to use it to produce a dataset for use in investigating
Unobtrusive Recommendation. Reading data from this service produced an initial dataset of 500,000
play-counts, spanning 10,000 artists and 5,000 users. This dataset was then culled (to remove the users
and artists that had few play-counts associated with them) to a size of 100,000 play-counts, spanning
3333 artists and 948 users. However, at this stage, the only source of test data that had been established
was implicit data based upon users’ listening patterns. This data would indeed be useful for exploring
the Unobtrusive Recommendation question, yet it was not ideal for exploring the Scrutability & Control
question. This is because, if scrutability and control features were to be added to a prototype that
made ratings based upon implicit data, then the performance of these features might be affected by the
fact that this was implicit and not explicit data. Therefore, a data set consisting of explicit ratings was
required in order to investigate the Scrutability & Control question. At this point, no significant source of
explicit music ratings could be located, and so it was decided that the MovieLens standard dataset
(which provides explicit ratings on movies) should be used to investigate issues relating to Scrutability
& Control. This dataset contains 100,000 ratings, from 943 users, on 1682 movies. Thus, two datasets
were chosen for use in this thesis — a dataset compiled from data taken from last.fm and the MovieLens
standard dataset.
2 http://www.epinions.com
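The culling step described above can be sketched as an iterative filter over (user, artist, play-count) triples. The thresholds below are illustrative assumptions; the thesis does not state the exact cut-offs used to reduce the 500,000 triples to 100,000.

```python
from collections import Counter

def cull(play_counts, min_user_events=20, min_artist_events=10):
    """Iteratively drop users and artists with few play-count records.

    play_counts: list of (user, artist, count) triples. The threshold
    values are assumptions made for illustration only.
    """
    data = list(play_counts)
    while True:
        users = Counter(u for u, _, _ in data)
        artists = Counter(a for _, a, _ in data)
        kept = [(u, a, c) for u, a, c in data
                if users[u] >= min_user_events
                and artists[a] >= min_artist_events]
        if len(kept) == len(data):   # stable: nothing more to drop
            return kept
        data = kept                  # dropping rows may expose new sparse entries
```

Note that the filter must iterate: removing a sparse user can push an artist below the artist threshold, and vice versa, so a single pass is not sufficient.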
Implications for the prototype. The prototype would need two variants in order to separately
test the two goals of the thesis. These two variants would be:
• A prototype based upon the MovieLens standard dataset, that investigated Scrutability & Control.
• A prototype based upon the last.fm dataset that was created, that investigated Unobtrusive
Recommendation.
3.7 Conclusion
In order to investigate the areas of Scrutability & Control and Unobtrusive Recommendation, an
exploratory study was conducted. This began with a Qualitative Analysis, which identified the Duine Toolkit as
the most appropriate code base for extension. This toolkit makes available six different recommendation
techniques that could be used within a prototype system. A thorough examination of each technique
was then conducted to ascertain ways in which they could be explained and controlled. A number of
possible recommender usability features were brought to light through this analysis, and these, along
with existing recommender usability features, were investigated through a questionnaire.
Based upon the results of this questionnaire, a large number of findings could be gleaned about
the respondents in general. However, the data that was collected through this questionnaire was quite
rich, and demonstrated the individuality of each of the respondents. Particular respondents had preferences
for different types of presentation and their answers clearly reflected this. This type of variance in
preferences makes a strong case for providing personalisation of presentations and explanations within
recommender systems. In summary, the findings of this chapter were:
• Each of the recommendation techniques can be explained in a way that users can easily understand.
• When explaining recommendations, providing more information can often be beneficial.
• Complicated or poor explanations will often confuse a user’s understanding of a recommenda-
tion technique.
• A user’s opinions on the usefulness of recommendations are related to their understanding of
these recommendations.
• Social Filtering and Genre Based were judged by respondents to be the most useful recommendation
techniques.
• Respondents wanted the Most Popular recommendation technique to be combined with other
techniques.
• Respondents did not think that Description Based or Lyrics Based recommendation techniques
would be useful.
• Respondents believed that Social Filtering (Simple Text), Genre Based (Simple Text), Most
Popular (Ranking) and Learn By Example (Simple Text) screens were the easiest to understand
and most useful for their recommendation techniques.
• Some respondents had a strong interest in the ability to view the profiles of other similar users.
• Respondents indicated they would use the Genre Based Control (Genre Slider) often and that
it was easy to understand. Further, respondents believed that it would be very useful.
• Most respondents indicated they would find a List Based presentation easier to understand
and quicker to read than a Map Based presentation. Most users indicated they would find a
List Based presentation useful and some users indicated they would also find a Map Based
presentation to be useful.
• Respondents indicated they would like to have the system choose a combination of recommendation
techniques or allow them to view recommendations using various techniques.
• Respondents believed that explanations would be a useful addition to a recommender system.
• Respondents also believed that having control over a recommender system would be very useful.
• Different users prefer different forms of presentation and explanation.
These findings meant that the prototype should:
• Include both List Based and Map Based presentations.
• Allow users to view recommendations produced using various techniques and/or make recommendations
using a combination of prediction techniques.
• Contain explanations for recommendations.
• Allow users to have control over certain elements of the recommender system.
• Allow users to view the profiles of users similar to them.
• Include Social Filtering, Genre Based, Most Popular and Learn By Example recommendation
techniques.
• Include the following optional explanation screens:
– Social Filtering (Simple Text), Social Filtering (Simple Graph) and Social Filtering (Similar Users)
– Combination of Genre Based (Simple Text) and Genre Based (Genre Listing)
– Combination of Most Popular (Avg. Rating Info.) and Most Popular (Ranking)
– Combination of Learn By Example (Simple Text) and Learn By Example (Similar Artists)
• Include the following controls:
– Genre Based Control (Genre Slider)
– Social Filtering Control (Like/Not Like)
Finally, two sources of test data were established for use in conducting simulations and evaluations at
a later stage in the thesis. The results of the investigations described in this chapter, along with the test
data that was acquired, would inform the construction of a prototype, described in Chapter 4.
CHAPTER 4
Prototype Design
4.1 Introduction
In order to investigate questions regarding Scrutability & Control in recommender systems and Unobtrusive
Recommendation, a prototype was developed. This prototype would later be used to conduct
user evaluations and simulations to establish the usefulness of a number of unobtrusive user modeling
and usability features. The findings of the questionnaire described in Chapter 3 were used to guide the
construction of this prototype and ensure that only features that were likely to be of use in improving
recommendation quality would be included in the prototype.
Section 1 stated that this thesis aimed to investigate two main questions: the Scrutability & Control question
and the Unobtrusive Recommendation question. However, each of these is a separate research
question. If a prototype was created to investigate both of these questions at once, it could be difficult
to link each of the findings of this study to one specific research question. So, it was decided that two
variants of our prototype should be created - one to investigate each of the major research questions for
this project. Each of these prototype variants could then be evaluated separately and the results from
each evaluation would provide findings that would clearly be related to only one research question. The
prototype that we created to investigate these questions was called iSuggest. The two variants that we
created of this prototype were called iSuggest-Usability and iSuggest-Unobtrusive.
iSuggest-Usability incorporated the highest rated usability interface features from the questionnaire.
This version of the prototype made movie recommendations, based upon the MovieLens standard data
set. iSuggest-Usability would later be used to investigate the Scrutability & Control question for recommenders
through user evaluations.
iSuggest-Unobtrusive made music recommendations based upon the last.fm1 dataset described in Section
3.6. It would be used to investigate Unobtrusive Recommendation. iSuggest-Unobtrusive incorporated
the ability to automatically generate the ratings that a user would give particular items using only
unobtrusively obtained information. Specifically, this meant that it read the play-counts from a user’s
iPod and then automatically generated a set of ratings that the user would give to particular artists. The
automatically generated ratings were then used to produce recommendations for that user. This prototype
aimed to generate ratings for a user in a way that was accurate, but was also easy for them to understand.
iSuggest-Unobtrusive would later be used to investigate Unobtrusive Recommendation through both
user evaluations and statistical evaluations.
This chapter describes the functions that each prototype variant made available to users; it then describes
the architecture of each of the two variants.
4.2 User’s View
The basic iSuggest prototype showed users the standard type of interface that is used within most current
recommender systems. A user’s first interaction with the basic iSuggest system was to create an account
within iSuggest and then log in. Users could then view three basic screens:
Rate Items: Showed the items that the user had not yet rated and could still enter a rating for.
My Ratings: Showed the items that the user had rated, and the rating that the user
had given each item.
Recommendation List: Showed a list of the recommendations that the system had produced for
the user. Figure 4.1 shows an example of this screen.
Each of these screens used a standard List Based presentation style, as suggested by the study reported
in Chapter 3. Users were able to click to view more information about any of the items shown on any
of these screens. They could then click to search the Internet for more information about any of these
items (this linked to imdb.com2 for movie items and Amazon.com3 for music items). Users rated items
by clicking on the Star Bar (shown in Figure 4.2) and dragging their mouse to produce a rating between
0 stars (worst) and 5 stars (best) for each item. This basic prototype made all recommendations using a
1 www.last.fm
2 www.imdb.com
3 www.amazon.com
single recommendation method — the Duine Toolkit’s default Taste Strategy (described in Section 3.3).
The Taste Strategy was chosen for use within the basic prototype as it is shown in (van Setten et al.,
2004) to be the most effective recommendation method available for use in the Duine Toolkit. In this
way, the basic iSuggest prototype utilised the optimum configuration of the Duine Toolkit and provided
a standard List Based presentation of information. The two prototype variants that would be used to
investigate the research goals of this thesis extended this basic prototype to incorporate new features and
enable these features to be evaluated.
FIGURE 4.1: List Based Presentation Of Recommendations
FIGURE 4.2: The Star Bar That Users Used To Rate Items
4.2.1 iSuggest-Usability
This version of the prototype extended the basic iSuggest prototype to incorporate all of the usability
features that the results of the questionnaire suggested would be useful additions to a recommender system.
This version of the prototype made movie recommendations, based upon the MovieLens standard
data set. When using iSuggest-Usability, users were presented with the following new usability and
interface features:
• Multiple recommendation techniques.
• Explanations for all recommendations that were produced.
• The ability to view a list of users similar to the current user.
• Control features that allowed the user to affect the recommendation process.
• A Map Based presentation of recommendations.
Each of these features is discussed in detail in the sections below.
Multiple Recommendation Techniques. Social Filtering, Genre Based, Most Popular and Learn By
Example recommendation techniques were all included as additional recommendation techniques that
could be used by iSuggest-Usability. These were included as the questionnaire suggested that users
would find these recommendation techniques to be the most useful. The questionnaire also suggested
that users would like a recommendation system to combine multiple techniques to make recommendations
and/or allow users to select which recommendation technique should be used. Thus, iSuggest-Usability
allowed users to select which of the five available methods (including the standard Taste Strategy)
should be used to create recommendations. Users selected the recommendation technique to be
used by accessing an options screen that presented them with the five techniques. An example of this
screen is shown in Figure 4.3. Each of these techniques had a small description underneath its name to
describe how it functioned. Users selected one option from the list of techniques and confirmed
this choice. This would cause the user’s recommendations to be replaced with a new set of recommendations.
The questionnaire suggested that it would also have been desirable for iSuggest-Usability to enable
combinations of recommendation techniques to be used. However, this was deemed to be outside the
scope of the project.
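The selection mechanism described above amounts to dispatching on a named technique. The sketch below illustrates the idea with a single toy technique; the function and registry names are hypothetical and do not reflect the actual Duine Toolkit or iSuggest APIs.

```python
from collections import defaultdict

def recommend_most_popular(ratings, user):
    """Toy technique: score each unseen item by its global average rating.

    ratings: dict mapping (user, item) -> rating.
    """
    totals = defaultdict(list)
    for (u, item), r in ratings.items():
        totals[item].append(r)
    seen = {item for (u, item) in ratings if u == user}
    scores = {item: sum(rs) / len(rs)
              for item, rs in totals.items() if item not in seen}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical registry; Social Filtering, Genre Based, Learn By Example
# and the Taste Strategy would be registered here in the same way.
TECHNIQUES = {
    "Most Popular": recommend_most_popular,
}

def update_recommendations(ratings, user, technique_name):
    # The confirmed choice replaces the user's recommendation list.
    return TECHNIQUES[technique_name](ratings, user)
```

The registry pattern keeps the options screen decoupled from the individual algorithms: adding a technique only requires registering one more entry.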
Explanations. Every recommendation that was produced using the Social Filtering, Genre Based,
Most Popular or Learn By Example techniques was accompanied by an explanation that users could
view by clicking to see "More Info" about the recommended movie. The explanations provided to users
depended upon the recommendation technique that was used to create the recommendation. The way in
which recommendations from each technique were explained is described below.
Most Popular: The questionnaire suggested that the Most Popular (Avg. Rating Info.) and Most
Popular (Ranking) screens would be useful in explaining this technique to users. Most Popular
was therefore explained using a combination of these two screens, which displayed the number of
FIGURE 4.3: Recommendation Technique Selection Screen. Note: The ‘Word Of Mouth’ Technique Shown Here Is Social Filtering And The ‘Let iSuggest Choose’ Technique Is The Duine Toolkit Taste Strategy
FIGURE 4.4: Explanation Screen For Genre Based Recommendations
FIGURE 4.5: Social Filtering (Simple Graph) Explanation Screen For Social Filtering Recommendations
users who had rated the recommended movie, the average rating these users had given to the
FIGURE 4.6: Explanation Screen For Learn By Example Recommendations
FIGURE 4.7: Explanation Screen For Most Popular Recommendations
movie and the rank that this movie therefore had in the database. The Most Popular explanation
screen is shown in Figure 4.7.
Genre Based: The questionnaire suggested that the Genre Based (Simple Text) and Genre Based
(Genre Listing) screens would be useful in explaining this technique to users. However, the
Genre Based (Genre Listing) screen showed users the average rating that they had given movies
within a particular genre. Unfortunately, this average is not used by the Genre Based technique
to create recommendations, so using it to explain recommendations would not necessarily produce
useful explanations. Rather, the Genre Based technique calculates a user’s interest in
particular genres and uses this to make recommendations. Hence, the explanation for the
Genre Based technique contained a listing of the genres that a movie belonged to and a link to
a screen where the user could view their calculated interest in each genre. The Genre Based
explanation screen is shown in Figure 4.4.
Social Filtering: The questionnaire showed that Social Filtering (Simple Text), Social Filtering
(Simple Graph) and Social Filtering (Similar Users) could all be useful ways to describe this
technique. However, these explanations could not easily be combined. As a result, three
different types of Social Filtering explanations were provided to users — Simple Text, Graph
and Similar Users. Simple Text presented text indicating the number of similar users this
recommendation was based upon. Graph (shown in Figure 4.5) presented text indicating the
number of similar users that this recommendation was based upon and displayed a graph of
the number of users who ‘Liked This Movie’ and ‘Didn’t Like This Movie’. Finally, Similar
Users showed the names of the similar users who were most significant in the creation of this
recommendation and whether these users ‘Liked This Movie’ or ‘Didn’t Like This Movie’.
Users could then click to view the detailed profiles of these similar users.
Learn By Example: The questionnaire suggested that the Learn By Example (Simple Text) and
Learn By Example (Similar Artists) screens would be useful in explaining this technique to
users. Thus, Learn By Example was described using a combination of these two screens. This
combined screen listed the similar items that this recommendation was based upon (including
the rating that the user had given each item) and stated the average rating that this user had given
to these similar items. The Learn By Example explanation screen is shown in Figure 4.6.
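As an illustration of the data behind one of these screens, the three figures shown by the Most Popular explanation (number of raters, average rating, and rank in the database) could be computed as follows. The function name and the rank-by-average rule are assumptions; the thesis does not spell out the computation or how ties are broken.

```python
from collections import defaultdict

def most_popular_explanation(ratings, movie):
    """Return (num_raters, average_rating, rank) for one movie.

    ratings: dict mapping (user, movie) -> rating. Ranking here is by
    average rating in descending order, an assumed convention.
    """
    by_movie = defaultdict(list)
    for (_, m), r in ratings.items():
        by_movie[m].append(r)
    averages = {m: sum(rs) / len(rs) for m, rs in by_movie.items()}
    ranked = sorted(averages, key=averages.get, reverse=True)
    return len(by_movie[movie]), averages[movie], ranked.index(movie) + 1
```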
Similar Users. This screen allowed a user to view a list of other users who the system believed were
the most similar to them. A user could then click to view the ratings given by each of the similar users
displayed in the list. This screen was included because the questionnaire suggested that users had a
strong interest in the ability to view the profiles of other similar users.
Control Features. The questionnaire suggested that control features would be a useful addition to a
recommender system. In particular, it was suggested that Genre Based Control (Genre Slider) and Social
Filtering Control (Like/Not Like) would be quite useful to users. As a result, these two features were
incorporated into iSuggest-Usability. These control features are detailed below.
FIGURE 4.8: The Genre Based Control (Genre Slider)
Genre Based Control (Genre Slider): (shown in Figure 4.8) This control screen displayed the
interest that the system had calculated the user had in each genre. These interest levels were
displayed using slider bars and the user was able to manually adjust these sliders to indicate
their actual interest level in each genre.
FIGURE 4.9: The Social Filtering Control. Note: The actual control is the ‘Ignore This User’ Link
Social Filtering Control: (shown in Figure 4.9) This control was integrated into all screens that
displayed similar users to the current user. On every screen where the system displayed the
details of a similar user, these details were accompanied by the option to ‘Ignore This User’.
Users could then choose to ignore a particular user if they felt that user was not similar to them.
This control feature was a slight variation upon the Social Filtering Control screen shown in
the questionnaire. The difference is that this feature no longer allowed users to confirm that
another user was indeed similar to them. This is because such a confirmation would not have
had any impact upon recommendations (as the system already believed that these two users
were similar).
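The effect of the ‘Ignore This User’ control on similar-user selection might be sketched as follows. The data structures and names are assumed for illustration, and the computation of the similarity scores themselves is not shown; the limit of 9 matches the Similar Users screen described later in this chapter.

```python
def similar_users(similarities, current_user, ignored, limit=9):
    """Pick the most similar users, honouring the 'Ignore This User' control.

    similarities: dict mapping other_user -> similarity score with the
    current user (how these scores are produced is not shown here).
    ignored: set of users the current user has chosen to ignore.
    """
    candidates = {u: s for u, s in similarities.items()
                  if u != current_user and u not in ignored}
    # Highest-similarity users first, capped at `limit` entries.
    return sorted(candidates, key=candidates.get, reverse=True)[:limit]
```

Filtering ignored users out of the neighbour set before recommendations are computed is what gives the control its effect on subsequent predictions.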
Map Based Presentation. The questionnaire suggested that many users would find the option of a
Map Based presentation of recommendations to be useful. As a result, this form of presentation was
incorporated into the prototype. The Map Based presentation displayed items to users so that:
• Each movie on the map was shown as a circle and the name of the movie was written on that
circle.
• The closer that two circles were to one another, the more related they were (e.g. two very
closely related movies would appear right next to one another and two movies not related to
one another at all would appear far away from one another). Note: different relationships
between items existed for different map types; these are discussed below.
• If a user had seen a movie, it was coloured blue.
• If a user had not seen a movie, but their predicted rating for that movie was above 2.5 stars, it
was coloured a shade of green (darker green indicated a higher rating).
• If a user had not seen a movie, but their predicted rating for that movie was close to 2.5 stars,
it was coloured orange.
• If a user had not seen a movie, but their predicted rating for that movie was less than 2.5 stars,
it was coloured a shade of red (darker red indicated a lower rating).
• Users were allowed to zoom in and out on the map and move left, right, up and down on the
map.
• Users could click on a particular circle to view more information about the movie that circle
represented.
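The colour rules above can be summarised in a small function. The width of the "close to 2.5 stars" band is an assumption, as the thesis gives no exact figure, and the darker/lighter shading within the green and red categories is omitted for brevity.

```python
def map_colour(seen, predicted, band=0.25):
    """Colour category for one movie circle on the map.

    seen: whether the user has seen the movie.
    predicted: the user's predicted rating (0-5 stars).
    band: assumed half-width of the "close to 2.5" orange zone.
    """
    if seen:
        return "blue"
    if abs(predicted - 2.5) <= band:
        return "orange"
    return "green" if predicted > 2.5 else "red"
```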
Three variants of Map Based presentation were included in iSuggest-Usability. These variants were
included in order to investigate how useful users would find particular styles of Map Based presentation.
The details of each of these variants are described below.
FIGURE 4.10: Full Map Presentation — Zoomed Out View
Full Map: (shown in Figures 4.10 & 4.11) This map displayed all of the movies found in the
MovieLens dataset. Each movie on this map was placed close to the genres that it belonged to.
The names of the genres that movies were divided into were displayed in large writing on the
map.
FIGURE 4.11: Full Map Presentation — Zoomed In View
FIGURE 4.12: Similar Items Map Presentation
Top 100 Map: This map was exactly the same as the Full Map, except that to reduce clutter and
confusion on the map, it displayed only 100 movies. These 100 movies were the movies with
the highest predicted rating for this user.
Similar Items Map: (shown in Figure 4.12) This map showed the user a single focus item, surrounded
by a number of items. These items were described to users as being related to the
focus item because the users who liked the focus item also liked these items. This map was
chosen for inclusion because it displays items in a way similar to the way that liveplasma4 displays
items.
4.2.2 iSuggest-Unobtrusive
This version of the prototype extended the basic iSuggest prototype to incorporate the ability to generate
ratings using only unobtrusively obtained information about a user. iSuggest-Unobtrusive made use of
the play-counts that were stored on users’ iPods to automatically generate a set of ratings that these
users would give to particular artists. These ratings were then used to generate recommendations for
that user. When using iSuggest-Unobtrusive, users connected their iPod, then clicked ‘Get Ratings
From My iPod’; ratings were then generated from the iPod connected to the system and an explanation
of the ratings generation was shown. Users could then see the ratings that had been generated for
them and the recommendations that had been produced for them. Users were able to choose from three
different recommendation techniques — Random (which merely assigned a random number as the user’s
predicted rating for each item), Social Filtering and Genre Based.
The explanation of the ratings generation that was displayed is shown in Figure 4.13. It described the
number of ratings that had been generated. It also noted that artists the user listened to frequently had
been given a high rating and artists the user listened to less frequently received lower ratings. The
construction of the ratings generation algorithm and this explanation screen was guided by the findings of
the questionnaire. A particularly important consideration was the suggestion that complicated explanations
could confuse a user’s understanding and do more harm than good. Thus, this explanation screen
was designed to be simple for users to understand, yet still communicate effectively the way that ratings
had been generated.
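One plausible reading of this behaviour (frequently played artists receive high ratings, rarely played artists low ones) is a linear scaling of each artist's play count into the 0-5 star range. This is only an illustrative scheme, not necessarily the algorithm iSuggest-Unobtrusive actually used.

```python
def ratings_from_play_counts(play_counts, top=5.0):
    """Map per-artist play counts to 0-5 star ratings.

    play_counts: dict mapping artist -> total play count for one user.
    Linear scaling against the user's most-played artist is an assumed
    scheme chosen for simplicity of explanation.
    """
    if not play_counts:
        return {}
    peak = max(play_counts.values())
    return {artist: round(top * count / peak, 1)
            for artist, count in play_counts.items()}
```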
FIGURE 4.13: The Explanation Screen Displayed After Ratings Generation
4 http://www.liveplasma.com
4.3 Design & Architecture
The architecture of the basic prototype is shown in Figure 4.14, with components constructed during
this thesis marked in blue. The core components of the basic prototype were the iSuggest Controller,
the iSuggest Interface and the Duine Toolkit. The iSuggest Controller managed the iSuggest system,
allowing users to log in, submit ratings, set preferences and receive recommendations. It submitted any
ratings and preferences to the Duine Toolkit and decided when a user’s recommendations needed to
be updated. Such an update was required whenever a user changed their preferences or had submitted
a certain number of new ratings to the Duine Toolkit. The iSuggest Interface managed all of the user
interaction for the iSuggest system. This component was built using the Processing graphical toolkit
(available from http://processing.org/). The basic iSuggest Interface incorporated List Based presentation
screens that enabled users to rate items and view recommendations. The iSuggest Interface submitted the
users’ ratings and preferences to the iSuggest Controller and received new recommendations from
the iSuggest Controller whenever the user’s recommendations were updated. The Duine Toolkit received
ratings and preferences from the iSuggest Controller and used these, along with a Ratings Database, to
generate recommendations when required.
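The controller's update rule (refresh on a preference change, or after a certain number of new ratings) can be sketched as follows. The class structure and the threshold value are assumptions for illustration, not the actual iSuggest Controller code.

```python
class RecommendationController:
    """Sketch of the update rule: refresh a user's recommendations when
    their preferences change, or after a certain number of new ratings.
    The threshold of 5 new ratings is an assumed value."""

    def __init__(self, ratings_threshold=5):
        self.threshold = ratings_threshold
        self.new_ratings = 0   # ratings submitted since the last refresh
        self.refreshes = 0     # how many times recommendations were rebuilt

    def _refresh(self):
        self.refreshes += 1    # here the real system would call the Duine Toolkit
        self.new_ratings = 0

    def on_preferences_changed(self):
        # Any preference change forces an immediate update.
        self._refresh()

    def on_rating_submitted(self):
        self.new_ratings += 1
        if self.new_ratings >= self.threshold:
            self._refresh()
```

Batching rating-driven refreshes behind a threshold avoids recomputing recommendations on every single rating, while preference changes take effect immediately.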
FIGURE 4.14: Architecture Of The Basic Prototype, With Components Constructed During This Thesis Marked In Blue
4.3.1 iSuggest-Usability
iSuggest-Usability extended the basic prototype by adding scrutability and control features. This version
of the prototype made movie recommendations, based upon the MovieLens standard data set. Figure
4.15 shows the architecture of iSuggest-Usability, with components constructed during this thesis
marked in blue.
The additional features included in this version of the prototype were:
FIGURE 4.15: Architecture Of iSuggest-Usability, With Components Constructed During This Thesis Marked In Blue
Map Based Presentation Screens: These presentation screens made use of the traer.physics5 and traer.animation6 libraries. The traer.physics library was used to create a simulated particle system. In such a system, all particles repel one another, and links hold particles close to one another. This particle system was used to determine the positions of items in the Map Based presentation. The Full Map and Top 100 Map began by placing all of the system’s movie genres onto the map as particles. Items were then placed one-by-one onto the map, and each item was linked to the genres that it belonged to. In this way, each item was repelled by all other items in the system, but stayed close to the genres that it belonged to. The Similar Items Map used a different method to position items. This map calculated the correlation between each movie and all other movies in the database in terms of the ratings that users had given them. It then displayed a single focus item, encircled by all of the movies that had a high level of correlation with the focus item.
Similar Users Screen: This screen made use of a list of similar users that was output from the Social Filtering algorithm. It then displayed the users who were the most similar to the current user (to a maximum of 9 similar users).
Control Features: These features received input from the user regarding their preferences and
forwarded this information to the iSuggest Controller. The iSuggest Controller then set these
preferences in the Duine Toolkit and updated the user’s recommendations.
Modified Recommendation Algorithms: The Social Filtering, Genre Based, Learn By Exam-
ple and Most Popular algorithms were all modified so that they attached extensive explanation
information to each recommendation that was made. This allowed the Explanation Screens to
5 http://www.cs.princeton.edu/~traer/physics/
6 http://www.cs.princeton.edu/~traer/animation/
fully explain each of the recommendations. The Social Filtering and Genre Based algorithms
were also modified to make use of the user preferences that were set using control features.
Explanation Screens: These screens took the explanation information that was attached to each recommendation and displayed this information in a way that the user should be able to understand.
4.3.2 iSuggest-Unobtrusive
iSuggest-Unobtrusive extended the basic prototype by adding the ability to automatically generate a
user’s ratings from play-counts stored on their iPod. This version of the prototype made music recom-
mendations based upon the last.fm dataset. The architecture of iSuggest-Unobtrusive is shown in Figure
4.16, with components constructed during this thesis marked in blue.
FIGURE 4.16: Architecture Of iSuggest-Unobtrusive, With Components Constructed During This Thesis Marked In Blue
The additional features included in this version of the prototype were:
Ratings Generation Algorithm. This algorithm needed to be both accurate at generating ratings from a user’s play-counts and easy to explain to users. The algorithm that was chosen to generate ratings worked in the following way:
Input: Artists and play-counts from an iPod
Output: User’s ratings for artists found on the iPod

1  minimum count = min(play-counts)
2  maximum count = max(play-counts)
3  foreach artist on the iPod do
4      artist play-count = sum(play-counts from songs by this artist)
5      normalized play-count = (artist play-count - minimum count) / (maximum count - minimum count)
6      new rating = (normalized play-count + 1) * 2.5
7  end

Algorithm 1: Ratings Generation Algorithm
On line 5, the play-counts are normalized with reference to the other play-counts that exist on the iPod. This places them on a scale of 0.0 – 1.0. Then, on line 6, these normalized values are converted to ratings on the 0.0 – 5.0 scale. The minimum rating produced by this algorithm is 2.5, as this is a neutral rating, and the worst that any artist on a user’s iPod should be is neutral (as the mere fact that the artist is on their iPod implies that the user has at least a neutral attitude toward that artist).
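A minimal sketch of Algorithm 1 in code, assuming the minimum and maximum are taken over the per-artist totals (so the normalized values fall in 0.0 – 1.0) and that play-counts arrive as per-song (artist, count) pairs; the data layout is an assumption for illustration:

```python
def generate_ratings(song_play_counts):
    """Map artists' iPod play-counts to ratings on the 0-5 scale.

    song_play_counts: list of (artist, play_count) pairs, one per song.
    Every artist receives at least the neutral rating of 2.5, since an
    artist's presence on the iPod implies at least neutral interest.
    """
    # Line 4: sum the per-song play-counts for each artist.
    totals = {}
    for artist, count in song_play_counts:
        totals[artist] = totals.get(artist, 0) + count

    lo, hi = min(totals.values()), max(totals.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal

    ratings = {}
    for artist, total in totals.items():
        normalized = (total - lo) / span          # line 5: 0.0 - 1.0
        ratings[artist] = (normalized + 1) * 2.5  # line 6: 2.5 - 5.0
    return ratings
```

The artist with the fewest plays maps to exactly 2.5 (neutral) and the most-played artist maps to 5.0, matching the rationale given above.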
Explanation Screen. This screen took the explanation information that was provided by the ratings
generation algorithm and displayed this in a way that users should be able to understand.
4.4 Conclusion
To investigate the research goals of this project, a prototype called iSuggest was developed. This proto-
type was offered in two different versions, named iSuggest-Usability and iSuggest-Unobtrusive, each of
which was built to explore a separate research question. The basic iSuggest system was created to imitate
existing recommender interfaces and use the default Duine Toolkit recommendation technique (the Taste
Strategy). This basic prototype was extended to create the two prototype variants - iSuggest-Usability
and iSuggest-Unobtrusive.
iSuggest-Usability incorporated the highest rated usability interface features from the questionnaire.
This prototype made movie recommendations, based upon the MovieLens standard data set. It would
later be used to investigate the first research goal of the project through user evaluations. iSuggest-
Usability made the following functions available to the user:
Multiple Recommendation Techniques: The questionnaire suggested that the ability to choose the recommendation technique to be used would be useful to users. Thus, iSuggest-Usability allowed users to request that recommendations be produced using any of five different recommendation techniques (Social Filtering, Genre Based, Most Popular, Learn By Example and the Duine Toolkit’s Taste Strategy).
Explanations: Explanations were provided for all recommendations that were produced. Each
recommendation technique was explained using its highest rated explanation screen from the
questionnaire. Social Filtering was explained using three different explanation screens, each
of which was shown by the questionnaire to be useful.
Similar Users: Users were given the ability to view a list of the other users of the system who were deemed to be the most similar to the current user. Users could view all of the ratings entered by each similar user.
Control Features: These allowed the user to affect the recommendation process. The control
features implemented were the Genre Based Control (Genre Slider) and Social Filtering Con-
trol, as respondents of the questionnaire rated these highly.
Map Based Presentation Of Recommendations: This form of presentation was rated as useful by many questionnaire respondents. Three different map based presentations were made available to the user - Full Map, Top 100 Map and Similar Items Map.
iSuggest-Unobtrusive incorporated the ability to read the play-counts from a user’s iPod and then gener-
ate a set of ratings that the user would give to particular artists. These ratings could then be used to produce
recommendations for a user. This prototype made the following functions available to the user:
Automatic ratings generation: Users could have ratings automatically generated from the play-
counts on their iPod.
Ratings generation explanation: Every time that ratings were automatically generated by this
system, an explanation screen was shown to users that described how many ratings were gen-
erated and how these had been generated.
Recommendations using unobtrusive information: Recommendations were provided to each
user based upon the ratings that had been automatically generated. iSuggest-Unobtrusive made
use of the last.fm dataset, which contains only unobtrusively obtained information, to make
recommendations.
Once the construction of the prototypes was complete, each of them needed to be evaluated to investigate
the research goals of the project. The evaluation of these prototypes is described in Chapter 5.
CHAPTER 5
Evaluations
5.1 Introduction
In order to investigate the research goals for this thesis, the two versions of the prototype — iSuggest-
Usability and iSuggest-Unobtrusive — were evaluated. These evaluations aimed to establish the ef-
fectiveness of the methods implemented in the prototype for providing scrutability, control and unob-
trusiveness. iSuggest-Usability was evaluated through a user evaluation, which was completed by 10
people. This evaluation aimed to investigate the effectiveness of explanations, controls and Map Based
presentations for improving explanations and providing scrutability. It also aimed to investigate how
users interact with these elements. iSuggest-Unobtrusive was evaluated through both a user evaluation
and statistical evaluations. These evaluations aimed to assess the ability of the prototype to generate
ratings from implicit data, and its ability to make useful recommendations using these ratings. Each of
these evaluations needed to be rigorously designed to ensure that it meaningfully and accurately tested
effectiveness and investigated users’ interactions with the prototype system. This chapter describes the
design of these evaluations and their results.
5.2 Design
In order to investigate the way in which users interact with recommender systems and the usefulness of
particular Scrutability & Control elements that we added to the two prototype systems that we developed,
we designed two user evaluations, one for each of the prototype systems that we produced. During the
completion of these evaluations, users were asked to answer questions about the usefulness of particular
aspects of iSuggest-Usability. For each of these questions, 1 was the lowest score that could be
given, and 5 was the highest. Further, the evaluations were conducted through a process called a Think-
aloud (detailed in (Nielsen, 1993)), which involves asking users to verbalise their thought process while
making use of particular elements of a system. During the Think-aloud process, notes were made to
record the thought processes expressed by users. Through the Think-aloud process, we aimed to discover
information about how users interacted with recommender systems and how useful they found particular
elements of the prototype that could not be captured by asking simple questions. The design of the two
user evaluations is described below.
5.2.1 iSuggest-Usability
The evaluations of iSuggest-Usability were designed with the following goals in mind:
Goal 1: Investigate whether providing explanations for recommendations can improve the use-
fulness of these recommendations.
Goal 2: Investigate the most effective way to explain recommendations to users.
Goal 3: Investigate whether there is a trade-off between recommender usefulness and under-
standing of recommendations.
Goal 4: Investigate whether users can utilise control features to improve the quality of their rec-
ommendations.
Goal 5: Investigate whether a recommender system benefits from the introduction of a map based
presentation.
Goal 6: Investigate the way in which users interact with a map-based style of presentation.
In order to achieve each of these goals, the user evaluations for iSuggest-Usability consisted of a Setup stage, Part A and Part B. Each user began by entering ratings for movies at the Setup stage. Following this stage, users were asked to complete the Part A and Part B stages, each of which asked them to view recommendations and rate a number of different elements that were presented to them. Finally, users were presented with a set of final questions to answer about their general opinion of iSuggest-Usability. Part A presented users with a standard set of recommendations, with no additional Scrutability & Control features at all. This stage was included in the evaluation in order to serve as a control, to gauge the quality of the recommendations presented to users and to present them with a standard method of recommendation, without any Scrutability & Control features. Part B presented users with recommendations that incorporated the Scrutability & Control elements of this prototype and asked them to rate the recommendations and the usefulness of particular Scrutability & Control elements. In order to produce a Double Cross-over study, half of the participants in the evaluations were
asked to complete Part A before Part B (Type 1), and the other half completed Part B before Part A
(Type 2). A full description of the details of each of the stages of the evaluation is included below (The
instructions that users followed during these evaluations can be found in Appendix C).
Setup. During this stage, users moved through a list of movies and rated any of the movies that they had seen, according to how much they liked or disliked that movie. Users were asked to rate approximately 30 movies, as this number of ratings meant that the user was still considered to be a new user to the system, and the cold start problem for new users would still be very apparent for this user. The choice to simulate the cold start problem for new users during these user evaluations was motivated by the fact that explanation and control features are both elements that we have added to our prototype with the specific intention of: building users’ trust in the system, despite the quality of recommendations produced; aiding users in making better use of poor recommendations; and improving the quality of recommendations that are produced by the system. The cold start problem for new users is a well documented problem with recommender systems that causes such systems to produce poor recommendations. Thus, simulating this problem should produce some poor quality recommendations and allow us to assess the effectiveness of the Scrutability & Control elements that were added to this prototype.
Part A. During Part A of the user evaluations, users were presented with a list of recommendations that were produced using the Duine Toolkit’s Main Strategy. These recommendations were presented to the user without any form of explanation and users were offered no form of control over these recommendations. Recommendations were presented in this form to reflect the fact that recommender systems often do not provide the Scrutability & Control features that were introduced with this prototype.
Part B. During this part of the user evaluations, users were presented with multiple sets of recommendations, accompanied by Scrutability & Control features such as explanations and controls. During Part B, users were asked a number of questions in order to assess the usefulness of the recommendation methods and the Scrutability & Control features that were added to the prototype. Users were instructed to select and use each of the different recommendation methods in turn. Each of these recommendation methods was accompanied by a short explanation of how it worked, to give users some idea of how recommendations would be produced. The questions that were presented to the user during this stage were divided into the following categories:
Recommender Usefulness: After each set of recommendations was presented, the user was asked to rate how useful they found these recommendations.
Explanation Usefulness: The recommendations presented to users at this stage were each accompanied by an explanation, and users were asked to rate how useful they found that explanation for helping them to understand and make use of the recommendations that were provided. In the case of the Social Filtering recommendations, users were in fact presented with three different forms of explanation for each recommendation and they were asked to rate each of these forms of explanation in turn.
Control Feature Usefulness: For the Genre Based and Social Filtering recommendations, users were instructed to make use of specific control features that were intended to improve the quality of recommendations. Users were then asked to rate how useful they found each control feature for improving their predictions.
Map Usefulness: Users were presented with the three different Map Based presentations, Full Map, Top 100 Map and Similar Items Map. They were asked to spend some time making use of each Map Based presentation and then they were asked to rate its usefulness as a method for viewing recommendations. In addition to asking users to rate each form of Map Based presentation, the way in which users interacted with each of them was observed. This section of the user trial focused on discovering whether users were interested in having a map based presentation of recommendations and, if so, how such a presentation could most effectively be created.
Final Questions. Upon completion of the user evaluations, users were asked five questions. They were
asked to rate the general usefulness of the explanations provided by the system and the usefulness of the
control features in improving recommendations. Users were also asked whether they would prefer a list
based presentation of recommendations, a map based presentation, or both. Finally, they were asked to
state what the best and worst features of the iSuggest prototype were.
Participants. In all, 10 people completed the evaluations of iSuggest-Usability. This is well beyond the recommended minimum of 3 to 5 people for usability evaluations stated in (Nielsen, 1994). The sample group for this evaluation was carefully selected to contain people from a variety of backgrounds and both males and females. The majority (8/10) of the users who completed the evaluation were aged under 30, but modern recommender systems are used most often by people who fall in the 18-30 age range, so a higher proportion of participants in this age range was deemed to be appropriate. Figure
5.1 shows demographic information about each of the participants, as well as indicating whether they
completed Part A first (Type 1) or Part B first (Type 2).
                  Group 1             Group 2
Participant     1   2   3   4   5   6   7   8   9  10
Age            22  52  18  21  21  30  23  51  25  23
Gender          F   M   F   F   M   M   M   F   M   F
Type 1 or 2     1   2   1   2   1   2   1   2   1   2

FIGURE 5.1: Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Usability
5.2.2 iSuggest-Unobtrusive
The evaluations of iSuggest-Unobtrusive were designed with the following goals in mind:
Goal 1: Investigate whether users’ play counts can be accurately mapped to their ratings.
Goal 2: Investigate whether effective recommendations can be made for users using only ratings
generated from play counts.
In order to achieve each of these goals, the user evaluations for iSuggest-Unobtrusive consisted of Parts A and B. The instructions that users followed during this evaluation can be found in Appendix E. During Part A, ratings were generated for each user by applying the ratings generation algorithm, and users were then asked to indicate how well they understood how these ratings had been generated and how accurate the ratings were. Part B presented three sets of recommendations to users:
Random Recommendations: These recommendations were created by assigning a random number as the user’s predicted interest in each item. These recommendations were included to act as a control, a reference point which could be used to judge the utility of the rest of the recommendations presented to users.
Social Filtering Recommendations: These recommendations were created using the Social Filtering recommendation technique. This technique was chosen as it was the top performing algorithm in a set of statistical evaluations (the results of these statistical evaluations are summarised in Section 5.4.1).
Genre Based Recommendations: These recommendations were created using the Genre Based recommendation technique. This technique was chosen as it was the second highest performing algorithm in the same set of statistical evaluations (the results are summarised later in this chapter, in Section 5.4.1).
For each set of recommendations, users were first presented with the list of recommendations, then they were asked to spend as much time as they wanted assessing how useful they found the recommendations that were provided. Users were then asked to give the recommendations a rating according to how useful they were. In order to produce a Double Cross-over study, five of the participants in the evaluations were shown Random Recommendations before Social Filtering and Genre Based Recommendations (Type 1), and the other four were shown Social Filtering and Genre Based Recommendations before Random Recommendations (Type 2). Once users had completed the trial, they were also asked to indicate whether or not they would like to have the ‘Get Ratings From My iPod’ feature incorporated into the iSuggest system.
Participants. In all, 9 people completed the evaluations of iSuggest-Unobtrusive. These users were not all the same users that completed the evaluation of iSuggest-Usability, though some users did complete both evaluations. Again, the sample group for this evaluation was carefully selected to contain people from a variety of backgrounds and both males and females. The majority (6/9) of the users who completed the evaluation were again aged under 30. Figure 5.2 shows demographic information about each of the participants, as well as indicating whether they were shown Random Recommendations first (Type 1) or Social Filtering and Genre Based recommendations first (Type 2).
Participant     1   2   3   4   5   6   7   8   9
Age            18  52  20  51  19  21  20  23  31
Gender          F   M   F   F   M   F   M   M   F
Type 1 or 2     1   2   1   2   1   2   1   2   1

FIGURE 5.2: Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Unobtrusive
Statistical Evaluations. In order to evaluate more thoroughly the ratings and recommendations that
were produced by iSuggest-Unobtrusive, a set of simulations were carried out, and statistical data was
collected during these simulations. An important issue in the execution of these simulations was the
choice of statistical measures for evaluating performance. The chosen measures needed to provide a
useful and reliable gauge of each system’s performance. It was decided to evaluate the performance of
the ratings algorithm through the distribution of the ratings that were produced by that algorithm. This
distribution could then be compared to the distribution of ratings within the MovieLens standard dataset.
Evaluation of the usefulness of recommendations produced by iSuggest-Unobtrusive was slightly more complicated. (Herlocker, 2000) provides an evaluation of a number of possible measures for evaluating the usefulness of recommendations. This work concluded that the MAE metric is an appropriate metric for use in evaluating recommender systems. This metric judges the accuracy of the predictions that a recommender system makes about a user’s level of interest in specific items. More accurate predictions will lead to higher quality recommendations and thus, a better MAE will result in better recommendations. One of the advantages of calculating the MAE is the fact that this metric was also used in (van Setten et al., 2002). This means that results from this simulation should be roughly comparable to the results of that study. MAE measures the absolute difference between a predicted rating and the user’s true rating for an item. The MAE is computed by taking the average value of this difference across the entire system, and represents the overall accuracy of predictions (and thus recommendations) made by that system. The standard deviation of the absolute error values (SDAE) is also useful to compute, as this measure describes how consistently a system will produce reliable predictions (and thus reliable recommendations). Thus, the MAE and SDAE metrics were used to evaluate the iSuggest-Unobtrusive prototype.
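Both measures are straightforward to compute. A minimal sketch (not the actual evaluation code used in this thesis), taking parallel lists of predicted and true ratings:

```python
from math import sqrt


def mae_and_sdae(predictions, actuals):
    """Mean absolute error and standard deviation of the absolute errors.

    predictions, actuals: parallel sequences of predicted and true ratings.
    MAE gauges overall prediction accuracy; SDAE gauges how consistently
    the system achieves that accuracy.
    """
    errors = [abs(p - a) for p, a in zip(predictions, actuals)]
    mae = sum(errors) / len(errors)
    # Population standard deviation of the absolute errors.
    sdae = sqrt(sum((e - mae) ** 2 for e in errors) / len(errors))
    return mae, sdae
```

A lower MAE indicates more accurate predictions on average, while a lower SDAE indicates that the prediction quality varies less from item to item.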
5.3 iSuggest-Usability Evaluations — Results
This section reports the results of the evaluations of iSuggest-Usability. Theresults are reported in terms
of recommendation usefulness, explanations, control features and presentation method. At this point, it
is important to note that the average number of ratings that were entered by users during evaluations was 27.1. This is only a small number of ratings for a user to have entered into a recommender system, so the cold start problem for new users existed for each user during evaluations.
5.3.1 Recommender Usefulness
Users rated the usefulness of the six sets of recommendations produced. Figure 5.3 shows the average
score for each of the different techniques, with error bars showing one standard deviation above and
below the mean (actual results for each user shown in Appendix D). We now discuss these techniques
in order of average usefulness.
FIGURE 5.3: Average Usefulness Ratings For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 10
Genre Based (Revised): (average score of 3.9/5 after control features were used, ranked 1st). The Genre Based recommendations were the lowest rated when first presented, with an average score of 2.7/5. Five users gave their lowest rating to these recommendations and no users gave their highest score. However, once users were given the chance to adjust their genre interests, the average score for this method improved from 2.7/5 to 3.9/5. Seven people gave their highest score to these revised recommendations, and only two did not (due to an error in copying the questionnaire, one user did not give a rating for the revised Genre Based recommendations).
Learn By Example: (average score of 3.7/5, ranked 2nd). This method produced the largest
variation in users’ ratings, with most users rating this method above 3, yet others rating it as a
2. Despite the variation, this method had the second highest average score, and six users gave
this method their highest score.
Most Popular: (average score of 3.3/5, ranked 3rd). Three users rated this method highest and
two of these users spontaneously commented that they would be very interested in the movies
that were the most popular overall. In contrast, two other users rated this method lowest and
one user spontaneously commented that this recommendation method was unlikely to ever
produce good recommendations for him, as he was not interested in popularmovies.
Duine: (average score of 3.1/5, ranked 4th). Most users were observed to find that these recom-
mendations contained just a few items that were very interesting to them, among many that
they were uninterested in. Similar to the Most Popular method, three users rated Duine the
highest, and two users rated it lowest.
Social Filtering: (average score of 2.8/5, ranked 5th). Four users rated this method the lowest,
and although three users did give this method a score of 4/5, in general it was observed to often
recommend movies that were completely unsuited to the user’s tastes.
Discussion. Individuals differentiated the quality of the recommender techniques. However, there was no consistently superior technique: all methods were given at least one user’s highest rating, yet all methods were also given at least one user’s lowest rating. This suggests the value of allowing users to choose their recommendation method. Further, participants commented that the different recommendation methods could be useful for different tasks (e.g. one user commented that if he were in the mood to see something quite mainstream, he would choose Most Popular recommendations. However, if he were in the mood to see something more tailored to his own interests, he could choose Genre Based recommendations). The fact that some users commented that they would be interested in Most Popular recommendations, while others commented that they would not be, is an example of the individuality of users. Such individuality makes a case for providing personalisation of presentations and explanations within recommender systems.
Of significant interest is the fact that allowing users to adjust their genre interests improved recommendations significantly, moving the Genre Based recommendations from the lowest rated to the highest rated set of recommendations. The average rating for Genre Based recommendations increased from 2.7/5 to 3.9/5 after the introduction of the Genre Control. This is strong evidence of the usefulness of control features in recommender systems. Also interesting was the impact of the cold start problem for new users on the performance of recommendation techniques. The Learn By Example algorithm was rated highly by users, indicating that it is able to produce good recommendations even when users have entered few ratings. In contrast, users rated the Social Filtering recommendations the second lowest, indicating that it produced poor recommendations. The poor performance of this recommendation algorithm was due to its inability to cope with such a small amount of ratings information. This serves as confirmation of the existence of the cold-start problem in our evaluations. It is in this case, where the recommendations produced by the social filtering algorithm are not good, that the explanations provided to users are quite crucial — in order to help the user to decide how much trust to place in recommendations by allowing them to
FIGURE 5.4: Average Usefulness Ratings For Each Explanation. Error Bars Show Standard Deviation. N = 10
understand how and why the system made a recommendation, especially if it is a recommendation that the user feels is not useful.
5.3.2 Explanations
Users rated six explanation methods according to their usefulness for helping them understand and use recommendations. Figure 5.4 shows the average score for each of the different explanations, with error
bars showing one standard deviation above and below the mean (actual results for each user shown in
Appendix D). We now discuss these explanations in order of average usefulness.
Most Popular: (average score of 4.0/5, ranked equal 1st). Seven people gave the Most Popular
explanation a score of 4 or more, and no users rated it below 3. However, one user did state
that he believed that the Most Popular recommendations were calculated using more than just
a simple average of the ratings given to each item — this belief was incorrect.
Social Filtering (Graph): (average score of 4.0/5, ranked equal 1st). This explanation had the
highest average rating of all the Social Filtering explanations. Seven users rated this explana-
tion highest and no users rated it lowest.
Learn By Example: (average score of 3.6/5, ranked 3rd). Nine users gave this explanation a
rating of 3 or more and four of these users rated this explanation the highest. However, while
viewing these explanations, two users spontaneously commented that they disagreed with the
similarity measure used by the Learn By Example technique. They were interested in knowing
more information about how similarity is computed. One of these users expressed a desire to
control the way that similarity is calculated.
Genre Based: (average score of 3.4/5, ranked 4th). Five users gave this explanation their lowest score. Users were often observed to find these explanations inadequate. Two users spontaneously commented that although these explanations indicated the genres that each item belonged to, the reason that items from these genres were recommended was not made clear.
Social Filtering (Simple Text): (average score of 2.8/5, ranked 5th). This explanation had the
highest variance of all the explanations. Two users gave this explanation a score of 4 or more,
and yet five users rated this method the lowest of all the explanations.
Social Filtering (Similar Users): (average score of 2.6/5, ranked 6th). As with Social Filtering
(Simple Text), five users rated this method the lowest of all the explanations. No users
gave this method a 5, and only two users gave this method a score above 3.
Users also rated the overall usefulness of the iSuggest explanations for helping them understand and use
recommendations. The average score for this question was 3.7/5. Figure 5.5 shows each user’s response
to this question (actual results for each user shown in Appendix D).
FIGURE 5.5: Users' Ratings For The Overall Usefulness Of The iSuggest Explanations. N = 10
Discussion. The fact that users gave an average rating of 3.7 when asked to rate the usefulness of the
iSuggest explanations shows that explanations appear to improve the usefulness and understandability
of recommendations. After viewing the explanations provided for the Learn By Example technique,
one user even expressed a desire to control how similarity between items was computed. This suggests
that scrutability might spur some users to take more control over a system. In general, most of the
complaints that users did have about the explanations provided were that they wanted to know more
details about how the recommendation process worked. In particular, users wanted the Genre Based
and Learn By Example explanations to contain more information. Possible extensions to the existing
iSuggest explanations could include:
Genre Based: Indicating the user's calculated interest in each genre that an item belongs to.
Learn By Example: Indicating why items were judged to be similar to one another. Further, a
useful control feature could be the ability to adjust the factors that are used to judge similarity
between items.
Of course, further research would be required to discover if these extensions could be useful in improving
the understandability and usefulness of recommendations.
It was not surprising that the Most Popular explanations were rated highest on average. This method is
quite simple in operation and thus is easy to explain to users. However, the fact that the Social Filtering
(Graph) explanations were also rated highest on average was remarkable, as this recommendation
method is much more complicated. On average, the Graph-based explanation of the Social Filtering
technique was rated higher than both the Simple Text and the Similar Users forms of explanation.
This suggests that users found this graph of the ratings of similar users to aid their understanding and
ability to use recommendations. The high performance of the Social Filtering (Graph) conflicted with the
results of the questionnaire (where Social Filtering (Simple Text) had the highest average understanding
rating). The fact that Social Filtering (Graph) scored a higher average rating than Simple Text demon-
strated the value of implementing and testing explanations. In fact, this result is supported by research
in (Herlocker, 2000), where it was found that a histogram of similar users' ratings was the most effec-
tive form of Social Filtering explanation. The fact that the Learn By Example explanations were rated
third is somewhat surprising, as one of the benefits often noted for the Learn By Example technique is
the "potential to use retrieved cases to explain [recommendations]" (Cunningham et al., 2003, p. 1).
Finally, the Genre Based explanations scored poorly mainly because these explanations did
not contain enough detail.
5.3.3 Controls
Users rated two control features according to their effectiveness in improving recommendations.
Figure 5.6 shows users' ratings for each of the control features, with error bars showing the
standard deviation (results for each user also shown in Appendix D).
FIGURE 5.6: Users' Ratings For The Effectiveness Of Control Features. (a) Genre Based; (b) Social Filtering.
Prediction Method Control: No specific statistical results were collected with respect to the
ability of users to control the recommendation method that was used. However, three users of
the system spontaneously commented that the ability to use many different prediction mech-
anisms was quite useful, and one user stated that this helped him to "work with the system to
produce recommendations rather than simply be given a set of 'take-it-or-leave-it' recommen-
dations."
Genre Based Control: (average score of 4.4/5, rated 1st). Nine users gave this method a score
of 4 or more, and one user gave this control a 3. As noted in Section 5.3.1, the original Genre
Based recommendations received the lowest average score. However, once users were given
the chance to adjust their genre interests, the revised Genre Based recommendations received
an average of 3.9/5 — the highest average score. One user spontaneously commented that he
would like his genre interests to be used as input to other recommendation techniques, not just
Genre Based. Another user spontaneously commented that he would like to be able to adjust
his interest in sub-genres, as well as genres. He felt that the ability to specify interest in
sub-genres would enable this control to improve his recommendations even further.
Social Filtering Control: (average score of 2.6/5, rated 2nd). Three users rated this control ei-
ther 4 or 5, while the other seven users gave this control a rating of 2 or less. One user was
observed to find no users whom he thought should be ignored, despite examining the ratings for
all of the 9 most similar users. Two other users spontaneously commented that although they
did click to ignore particular users, this had little to no impact upon their recommendations.
Users also rated the overall effectiveness of the iSuggest control features for improving their recommen-
dations. The average score for this question was 4.4/5. Figure 5.7 shows each user's response to this
question (actual results for each user shown in Appendix D).
FIGURE 5.7: Users' Ratings For The Overall Effectiveness Of The iSuggest Control Features.
Discussion. The results of the survey showed that users were highly interested in having control over
their recommender system. The results of these evaluations confirmed that such control features can be
effectively incorporated into a recommender system. When asked how useful they found the iSuggest
control features in improving their recommendations, all users gave consistently high scores. This is strong
evidence to support the case for including controls in recommender systems. However, the Social Fil-
tering control feature was rated quite low by many users. This is most probably because the
average number of users ignored through the use of this control was only 2.3 — which is often
not enough to produce a significant change. This result suggests that most users would not use this
control to ignore a large number of users, and thus it would not be likely to be highly effective. However,
some users did rate this control highly, so further investigation is needed. Despite the poor performance
of this particular control, the overall results from this section of the evaluation show that control fea-
tures can be highly effective — as long as the controls that are incorporated are able to demonstrate a
noticeable effect.
The conclusions that we can draw from this investigation into the usefulness of control features include:
• Controls can be useful in improving recommendations.
• Users have shown a strong interest in being offered control over their recommender system.
• The Genre Based Control is a very useful method for allowing users to improve the quality of
recommendations.
• Users found the ability to choose which recommendation technique was used to be highly use-
ful.
5.3.4 Presentation Method
Five users rated the usefulness of three types of Map Based Presentation. After these users completed
evaluations, their feedback was used to make the following changes to the Map Based Presentations:
• Spread out the items in the map to make it less cluttered.
• Allowed users to click on a genre to zoom in on that genre.
• Had the map start in the 'zoomed out' state, rather than a very 'zoomed in' state.
• Allowed users to zoom in further to read movie titles more clearly.
A further group of five users then rated the usefulness of the Map Based Presentations. Figure 5.8 shows
the average score that each group gave to the different forms of Map Based Presentation, with error bars
showing the standard deviation (actual results for each user shown in Appendix D).
FIGURE 5.8: Average Usefulness Of The Map Based Presentations. Error Bars Show Standard Deviation. (a) Group 1; (b) Group 2 (After Revision Of Maps).
Full Map Presentation: (average of 2.0/5 from Group 1, average of 4.3/5 from Group 2). Group
1 gave this method a maximum rating of 3. Two users from this group commented that the
Map was too crowded. One user spontaneously commented that sometimes items were placed
near genres that they didn’t really belong to — which was confusing. However, following
the revision of the maps, Group 2 gave this method an average 4.3/5 — the highest score for
any of the maps. Further, all users from Group 2 gave the Full Map more than 3/5. Three
users from Group 2 rated Full Map the highest. One user from Group 2 commented that the
Full Map "gives you a scope and makes it easier to navigate between genres". Another user
spontaneously commented that she found the colour coding to be a useful way to quickly
discover what genres the system thought you were interested in.
Top 100 Presentation: (average of 2.6/5 from Group 1, average of 4.0/5 from Group 2). On
average, Group 1 rated Top 100 slightly higher than Full Map. However, as was the case with
the Full Map presentation, all users from Group 1 rated Top 100 as 3 or below. The average rating
for Top 100 from Group 2 (4.0/5) was slightly lower than the average for Full Map, but 4.0 was
the second highest average score for any of the maps. One user from Group 2 gave this map a
5, three gave it a 4 and one user gave it a 3. Two users from Group 2 rated Top 100 the highest.
Item-to-item Similarity: (average of 2.6/5 from Group 1, average of 3.0/5 from Group 2). Two
users from Group 2 gave this method a four, but all other users from Groups 1 and 2 gave this
method 3 or less. In Group 1, this map had the equal highest average score. In Group 2, the
average scores of Full Map and Top 100 improved, but the average score for this map did not.
This meant that this map had the lowest average score for Group 2. One user spontaneously
commented that this map was not useful as it showed items that were not highly rated for
her and that often the map would display relationships between items that she felt were not
related. Another user volunteered that he felt this map should show more levels of Item-To-
Item similarity.
Users also reported their preferred presentation type ('List Only', 'Map Only' or 'Both List And
Map'). Figure 5.9 shows the sum of the responses given by Groups 1 and 2 (actual results for each user
shown in Appendix D).
Discussion. The initial group of five users gave all of the map based forms of presentation quite low
scores. Only one of this initial group indicated he would like Map Based Presentations included in a
recommender system. In general, users in Group 1 felt that the map based presentations were difficult
FIGURE 5.9: Sum Of Votes For The Preferred Presentation Type. (a) Group 1; (b) Group 2 (After Revision Of Maps).
to use. This was because the map seemed very crowded and it was hard to zoom in on particular items
or areas of interest. However, once the map interface was revised, the second group of users gave the
map-based presentation higher scores for utility. Users in Group 2 found the Full Map and Top 100
maps to be especially useful. The probable cause for the lower performance of the Item-to-Item map
lies in the fact that the Item-to-Item collaborative filtering process can sometimes produce relationships
between items that a user might not expect. This confused users who were expecting items that were
more directly related to be displayed with one another (e.g. movies in the same genre).

After the revision of the maps, four out of five users said they would like both List-Based and Map-
Based presentation. This strongly suggests that Map Based Presentation of recommendations would be
a worthwhile addition to a recommender system. The Full Map and Top 100 presentations are useful
presentation methods, though user interaction and scalability are two areas where more research needs
to be conducted. However, in general, once the initial usability issues were overcome, users seemed
quite keen on having a Full Map presentation incorporated into a recommender system.
5.4 iSuggest-Unobtrusive - Results
This section reports the results of both statistical and user evaluations of iSuggest-Unobtrusive. At this
point, it is important to note that the average number of ratings that were automatically generated for
users during user evaluations was 80.5. This was a sufficient number of ratings to ensure that the cold
start problem for new users would not be a factor during evaluations.
5.4.1 Statistical Evaluations
Before any user evaluations were performed, statistical evaluations were carried out on iSuggest-Unobtrusive.
These evaluations attempted to investigate the performance of the ratings generation algorithm and the
quality of recommendations produced using these ratings. The datasets used to complete these evalu-
ations were the MovieLens standard dataset, which contained 100,000 ratings, and the last.fm dataset,
which contained 100,000 play-counts that were converted into 70,149 ratings. The two statistical eval-
uations that were conducted were: a calculation of the distribution of the ratings that existed or were
produced for each dataset; and a calculation of the MAE and SDAE for four recommendation tech-
niques using each of the datasets. The results of these evaluations are reported below.
The distribution of the ratings that were calculated from play-count data was calculated. This was
compared to the distribution of ratings within the MovieLens standard data set. Figures 5.10(a) and
5.10(b) show these distributions.
FIGURE 5.10: Comparison Of Distribution Of Ratings Values. (a) Unobtrusively Generated Music Ratings; (b) Movie Ratings From MovieLens Dataset.
The rating scale that was used to calculate the distribution of ratings was a scale of 0.0-5.0, with incre-
ments of 0.5 (as all ratings within iSuggest were displayed on this scale). However, the ratings contained
within the MovieLens dataset were based on a scale of 1.0-5.0, with increments of 1. This means that
there are a number of values shown in Figure 5.10(b) for which no ratings exist. Despite this, the gen-
eral distribution of ratings in the MovieLens dataset is clear. Only sixteen percent of the ratings in the
MovieLens dataset occur below the value of 2.5, and zero percent of the ratings in the generated set
occur below this value. Twenty-seven percent of the MovieLens ratings were 2.5s, compared to sixteen
percent of the generated ratings. Thirty-five percent of MovieLens ratings occur within the range
of 3.0 to 4.5 (inclusive), whereas eighty-three percent of the generated ratings occur within this range.
Finally, twenty percent of the MovieLens ratings were 5s; only one percent of the generated ratings were
5s.
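The distribution calculation itself is straightforward. As a sketch (using hypothetical ratings rather than the evaluation datasets), it amounts to counting what percentage of ratings fall on each half-star value of the 0.0-5.0 scale:

```python
from collections import Counter

def rating_distribution(ratings, scale_min=0.0, scale_max=5.0, step=0.5):
    """Percentage of ratings falling on each half-star value of the scale."""
    counts = Counter(ratings)
    total = len(ratings)
    dist = {}
    v = scale_min
    while v <= scale_max:
        dist[v] = 100.0 * counts.get(v, 0) / total
        v = round(v + step, 1)  # round to avoid floating-point drift
    return dist

# Hypothetical generated ratings, clustered in the 2.5-4.5 band
generated = [2.5, 3.0, 3.0, 3.5, 4.0, 4.5, 5.0, 3.0]
dist = rating_distribution(generated)
```

Applying this to both datasets on the same 0.5-increment scale is what produces the comparison shown in Figure 5.10.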
The MAE for four different recommendation techniques was calculated using the ratings generated from
play-count data. This was compared to the MAE for the same techniques when recommending movies
using the MovieLens standard data set. Figures 5.11(a) and 5.11(b) show the MAE for each of the four
recommendation techniques, using MovieLens ratings and the generated ratings.
(a) MAE And SDAE Of Recommendation Techniques Using Unobtrusively Generated Music Ratings:

Technique          MAE    St. Dev.
Social Filtering   0.091  0.171
Genre Based        0.101  0.174
Learn By Example   0.102  0.185
Most Popular       0.106  0.178

(b) MAE And SDAE Of Recommendation Techniques Using Movie Ratings Taken From MovieLens Dataset:

Technique          MAE    St. Dev.
Social Filtering   0.384  0.490
Genre Based        0.425  0.530
Learn By Example   0.465  0.592
Most Popular       0.384  0.488

FIGURE 5.11: Comparison Of MAE And SDAE For MovieLens Recommendations And Recommendations Using Generated Ratings. Lower Scores Are Better. Techniques Are Sorted By MAE.
The average MAE for the recommendations using the generated ratings was calculated to be 0.315
lower than the average MAE for the recommendations using the MovieLens dataset. Further, the av-
erage SDAE for recommendations using generated ratings was 0.348 lower than the average SDAE for
recommendations using MovieLens ratings. The Most Popular technique had the best (i.e. the lowest)
MAE for recommendations using the MovieLens dataset. It also had the lowest standard deviation.
In contrast, this technique had the highest MAE for the recommendations created using generated rat-
ings. Genre Based had the second best MAE for the simulations. Learn By Example had the second worst
MAE for the MovieLens recommendations, and the worst MAE for the generated-rating recommenda-
tions. Finally, Social Filtering had the second worst MAE when recommendations were made using
the MovieLens ratings. However, it had the best MAE when the generated ratings were used to make
recommendations.
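For reference, the two error measures used here can be computed as follows. This is a minimal sketch with hypothetical predicted and actual ratings, not the evaluation data: the MAE is the mean of the absolute prediction errors, and the SDAE is the standard deviation of those same absolute errors.

```python
import math

def mae_sdae(predicted, actual):
    """Mean Absolute Error and Standard Deviation of the Absolute Errors."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    mae = sum(errors) / len(errors)
    variance = sum((e - mae) ** 2 for e in errors) / len(errors)
    return mae, math.sqrt(variance)

# Hypothetical predictions against hypothetical true ratings
predicted = [3.5, 4.0, 2.5, 5.0]
actual    = [3.0, 4.5, 2.5, 4.0]
mae, sdae = mae_sdae(predicted, actual)
```

A low MAE means predictions are close to true ratings on average; a low SDAE means the size of the errors is consistent rather than occasionally very large.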
Discussion. The statistical evaluations showed that the ratings generation algorithm was generally
quite conservative — the percentage of generated ratings above 3 was smaller than the percentage of
ratings above 3 in the MovieLens data. One of the causes of this was the fact that the data used to
generate ratings was counts of songs that users listened to. Often this data will contain artists for whom
the user has only one song, and whom the user listens to infrequently. Such artists would be given
a rating quite close to 2.5 by the generation algorithm. Another cause is the fact that often, a user
will listen to one 'favourite' artist very frequently, and other artists less frequently. In this case, the
normalisation performed by the generation algorithm will result in the 'favourite' artist getting a high
rating and the other artists getting lower ratings. In fact, the more a user listens to a single artist, the
lower the ratings for other artists will be. As many users listen to a few 'favourite' artists very often,
the ratings for the artists who are not a user's favourites are likely to be relatively close to 2.5. The use
of additional information in the ratings generation process (such as the number of songs by each artist
that are on a user's iPod and the amount of time that a user has spent listening to each track) would be
likely to improve the accuracy of the ratings generation.
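One plausible normalisation consistent with the behaviour described above can be sketched as follows. This is an illustrative reconstruction, not the actual iSuggest-Unobtrusive algorithm: each artist's play count is scaled by the most-played artist's count and mapped onto the upper half of the rating scale, so the favourite artist receives the top rating while rarely played artists land just above the neutral midpoint of 2.5.

```python
def ratings_from_playcounts(playcounts):
    """Map each artist's play count to a 0-5 star rating in half-star steps.

    Normalising by the most-played artist means the favourite gets the top
    rating, while infrequently played artists end up near the neutral 2.5,
    reproducing the conservative distribution observed in the evaluation.
    """
    max_count = max(playcounts.values())
    ratings = {}
    for artist, count in playcounts.items():
        raw = 2.5 + 2.5 * (count / max_count)  # neutral midpoint + normalised boost
        ratings[artist] = round(raw * 2) / 2   # snap to the nearest half star
    return ratings

# Hypothetical play counts for three artists
counts = {"Favourite Band": 400, "Occasional Band": 40, "One Song Band": 4}
generated = ratings_from_playcounts(counts)
```

Under this scheme, the more a user listens to a single favourite, the smaller the normalised counts (and hence ratings) of everyone else become, which matches the clustering of generated ratings near 2.5 reported above.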
The evaluation of the ratings algorithm using MAE and SDAE showed that the average MAE and SDAE
for the recommendations using the generated ratings were much lower than those for recom-
mendations using MovieLens. For the most part, this is due to the fact that the generated ratings were
distributed over a much smaller range than the MovieLens ratings. The smaller range of the generated
ratings meant that predictions for a user's interest in a particular item using these generated ratings would
be more likely to be correct than the predictions made using MovieLens data. Therefore, the MAE when
using generated ratings is likely to be much lower than the MAE when using the MovieLens ratings.
Due to the complexity of this situation, the MAE and SDAE calculations for the two simulations are not
comparable. However, the MAE does still provide a useful measure of the performance of each of the
prediction techniques. The two techniques that had the best MAE for the generated-ratings simulation
were Genre Based and Social Filtering. This meant that these two techniques were likely to be the most
useful for making recommendations based upon the generated ratings.
Once these statistical evaluations had been completed, user evaluations were conducted. The results of
the user evaluations are reported in Sections 5.4.2 to 5.4.3.
5.4.2 Ratings Generation
Users rated their understanding of how ratings had been generated from their iPod. They also rated the
accuracy of the ratings that were generated. The results from these questions are discussed below.
Understanding Of Ratings Generation: (average score of 5.0/5). All users responded to this
question with a score of 5/5.
Accuracy Of The Ratings Generated: (average score of 4.3/5). One user spontaneously com-
mented that the program seemed to be a little bit conservative — being quite hesitant to give
out higher ratings, and tending to give out ratings of mainly 2.5 and 3 stars. However, this
question received very high scores from all users — no users responded with less than a score
of 4, and three users gave a score of 5. Two users spontaneously commented that their favourite
artist had been given the highest rating.
Discussion. Users gave consistently high scores when asked about their understanding of how their
ratings were generated. This indicates that they believed they had a very clear understanding of how their
ratings had been generated. Users also gave consistently high scores when asked about the accuracy of
their ratings. This suggests that the algorithm implemented in this prototype was able to successfully
model users' interests in particular artists. Some users did comment that, as was shown in Section
5.4.1, the ratings generation process was quite conservative. Yet despite this, users felt that the ratings
generated were quite accurate, especially due to the fact that the users' favourite artists were consistently
given the highest ratings.
5.4.3 Recommendations
Users rated the usefulness of the three sets of recommendations produced from their generated ratings.
Figure 5.12 shows the average score for each of the different techniques, with error bars showing the
standard deviation (actual results for each user shown in Appendix F). We now discuss these techniques
in order of average usefulness.
Genre Based Recommendations: (average score of 3.9/5, ranked 1st). The average rating for
these recommendations was substantially higher than the average for Random recommenda-
tions. In fact, all but one of the users gave Genre Based recommendations their highest rating.

Social Filtering Recommendations: (average score of 3.1/5, ranked 2nd). This method received
a higher average score than the Random recommendations, yet it was not the highest rated
recommendation method. One user commented that some artists that were recommended did
seem to be quite appropriate, but that the recommendation list contained too many incorrect
recommendations for it to be really useful.

Random Recommendations: (average score of 2.2/5, ranked 3rd). Seven users gave this method
their lowest rating. No users gave this method their highest rating.
FIGURE 5.12: Average Usefulness Ratings For Each Recommendation Method. Error Bars Show Standard Deviation.
Users also reported whether they would like the 'Get Ratings From My iPod' feature incorporated into a
recommender system. In answer to this question, all users reported that they would like to have the 'Get
Ratings From My iPod' function incorporated into a recommender system. One user spontaneously
commented that "this is a great idea, and a really useful time saver". Three users commented that
having ratings generated was highly preferable to rating items individually by moving through a large
list. One of these users continued, saying that they would be willing to make minor adjustments to
the ratings produced by the generation process to make the ratings more accurate and receive better
recommendations.
Discussion. The fact that Random recommendations received the lowest average score is not surpris-
ing, as these recommendations were presented to users to act as a control. The fact that 2.2/5 is the score
that users would give a random set of recommendations can serve as a reference point for judging the
utility of the recommendations presented to users. Social Filtering performed the best in the statistical
evaluations described in Section 5.4.1, so it was assumed that users would find it to be highly useful.
However, on average the usefulness of this method was rated lower than the Genre Based recommenda-
tions. The most likely reason for this is the fact that the ratings produced by the generation algorithm
were distributed over only a small range. This meant that the process of matching similar users to one
another was less successful, as the differences between users in terms of their ratings were less pro-
nounced. This resulted in lower quality Social Filtering recommendations. Social Filtering performed
well in statistical evaluations because it predicts a user's rating for a new item in a way that is similar to
taking the average rating that similar users gave this item. When there is such a low range of ratings in
the system, this 'average rating' style approach is very likely to calculate a predicted value that is close
to the average rating that users gave to items. Basically, because the range of ratings was so small in
this example, a predictor such as this, which draws heavily upon users' ratings, is more likely to perform
well on statistical evaluations. However, when used in a real world system, this recommendation method
does not produce optimum results because it struggles to clearly identify similar and opposite users and
thus produces poor recommendations.
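The 'average rating' behaviour described above can be illustrated with a small sketch. This is a simplified stand-in for the Social Filtering technique, with a toy similarity measure and hypothetical users, not the thesis's actual implementation: the predicted rating is a similarity-weighted average of the ratings that other users gave the item, so when all ratings sit in a narrow band, every prediction collapses toward the overall mean.

```python
def predict_rating(target, item, ratings, similarity):
    """Predict target's rating for item as a similarity-weighted average
    of the ratings that other users gave that item."""
    num = den = 0.0
    for user, user_ratings in ratings.items():
        if user == target or item not in user_ratings:
            continue
        sim = similarity(ratings[target], user_ratings)
        num += sim * user_ratings[item]
        den += abs(sim)
    return num / den if den else None

def overlap_similarity(a, b):
    """Toy similarity: 1 minus the mean absolute difference on co-rated items,
    scaled by the width of the 5-point rating scale."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    return 1.0 - sum(abs(a[i] - b[i]) for i in common) / (5.0 * len(common))

# Hypothetical users and ratings
ratings = {
    "alice": {"m1": 3.0, "m2": 4.0},
    "bob":   {"m1": 3.0, "m2": 4.0, "m3": 4.5},
    "carol": {"m1": 1.0, "m3": 2.0},
}
pred = predict_rating("alice", "m3", ratings, overlap_similarity)
```

With ratings clustered in a narrow range, nearly all users look 'similar' under such a measure, so the weighted average approaches the item's mean rating — good for MAE, but poor at distinguishing genuinely like-minded users.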
The fact that Genre Based recommendations were rated highly by the majority of users is strong evi-
dence to suggest that useful recommendations can indeed be made using only implicit ratings data. The
most likely reason that this recommendation method was able to produce high quality recommendations
is the fact that it does not use the ratings that are input by a user in the same way that the Social Filtering
method does. The Genre Based method uses the user's ratings to adjust their predicted interest in partic-
ular genres. This predicted interest is most significantly affected by the items that a user has rated very
high or very low. Items that the user has given a relatively neutral rating affect these predicted interests
in a much less significant way. As a result, this recommendation method is not adversely affected by the
fact that the ratings generation algorithm produced a large number of relatively neutral ratings. Thus, this
recommendation method was able to use the items that the user has rated highly to infer genre interests
and make successful recommendations.
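The property described above — extreme ratings dominating the inferred genre interests while neutral ratings barely register — can be sketched as follows. This is an illustrative reconstruction with hypothetical items and genres, not the actual iSuggest Genre Based algorithm:

```python
def genre_interests(user_ratings, item_genres, neutral=2.5):
    """Infer per-genre interest from a user's ratings.

    Each rating contributes its signed deviation from the neutral midpoint,
    so strongly liked or disliked items dominate the estimate while
    near-neutral ratings barely move it.
    """
    totals, counts = {}, {}
    for item, rating in user_ratings.items():
        for genre in item_genres.get(item, []):
            totals[genre] = totals.get(genre, 0.0) + (rating - neutral)
            counts[genre] = counts.get(genre, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

# Hypothetical ratings: one loved item, one neutral, one disliked
user_ratings = {"m1": 5.0, "m2": 2.5, "m3": 1.0}
item_genres = {"m1": ["rock"], "m2": ["rock", "pop"], "m3": ["pop"]}
interests = genre_interests(user_ratings, item_genres)
```

Here the neutral rating for "m2" contributes nothing to either genre, so the estimates are driven entirely by the strongly rated items — which is why a cluster of near-2.5 generated ratings does not degrade this technique.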
The results of these user trials strongly suggest that useful recommendations can be made using only
implicit data as ratings information. One big indicator of this lay in the fact that, when asked, all users
reported that they would like to have the 'Get Ratings From My iPod' function incorporated into a
recommender system. In the future, more research is required to investigate whether ratings generated
using a different algorithm might alter the performance of each recommendation technique.
5.5 Conclusion
Evaluations were designed and conducted for each of the two prototype variants. These evaluations
aimed to investigate the research questions defined in Chapter 1 and build upon the knowledge that was
gained from the questionnaire conducted in Chapter 3.
iSuggest-Usability was evaluated through user evaluations, conducted with 10 people. These user eval-
uations produced the following findings:
Recommendation usefulness.
• Despite the fact that very few ratings had been entered by each user, the Genre Based and Learn
By Example techniques were highly rated by users. This suggests that these two techniques
would be useful, even in situations where the cold start problem for new users exists.
Understanding.
• Explanations were shown to be a useful addition to a recommender system.
• A graph based method was shown to be the most effective way to explain Social Filtering
recommendations.
• On average, the Learn By Example recommendations were rated to be the third most under-
standable recommendations — a curious result, given that one of the benefits of the Learn
By Example technique is stated to be the "potential to use retrieved cases to explain [recom-
mendations]" (Cunningham et al., 2003).
• Some of the explanations incorporated into the prototype would benefit from the addition of
extra information.
• Comments made during evaluations suggested that the addition of scrutability might spur
some users to take more control over a system.
User Control.
• Controls can be useful for allowing users to improve their recommendations, particularly the
Genre Based control.
• Users have a high level of interest in being given control of their recommender system.
• Evidence showed that allowing users to select which recommendation technique should be
used is highly useful.
Presentation.
• Evidence suggested that a Map Based presentation of recommendations (such as the Full Map
or Top 100 Map included in iSuggest-Usability) would be a useful addition to a recommender
system.
Evaluations also highlighted the individuality of users, many of whom preferred different presentation
styles, explanation styles and recommendation techniques. In general, users found many of the features
included in iSuggest-Usability to be quite useful for improving the quality of recommendations and the
scrutability of a recommender system.
iSuggest-Unobtrusive was evaluated through user evaluations, conducted with 9 people, as well as
through statistical evaluations. These evaluations produced the following findings:
• Ratings can be generated from implicit information in a way that users have indicated is easy
to understand and is generally accurate.
• Useful recommendations can be made based purely upon ratings generated from implicit in-
formation about users.
• The ratings generation algorithm implemented in iSuggest-Unobtrusive is conservative, and
could definitely be improved upon.
• Genre Based is a useful recommendation technique to use when the distribution of ratings
values is conservative.
• The addition of other types of implicit data to the ratings generation process (such as time spent
listening to each track) could improve quality of the generated ratings.
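The conservative ratings generation described above can be sketched roughly as follows. This is a hypothetical illustration, not the algorithm implemented in iSuggest-Unobtrusive: it assumes per-track play counts as the implicit signal, and it is conservative in that ratings are scaled against the user's own most-played track and rounded down, so only heavily played tracks receive high ratings.

```python
def generate_ratings(play_counts, max_rating=5):
    """Map per-track play counts to ratings on a 1..max_rating scale.

    Hypothetical sketch: each track is scored relative to the user's
    most-played track, rounding down so that borderline tracks receive
    the lower (more conservative) rating.
    """
    if not play_counts:
        return {}
    top = max(play_counts.values())
    ratings = {}
    for track, plays in play_counts.items():
        # int() truncates, so a track needs a full fifth of the top
        # play count (for max_rating=5) to earn each rating step.
        ratings[track] = max(1, int(plays / top * max_rating))
    return ratings
```

For instance, with play counts of 40, 10 and 2, only the most-played track is rated 5 and the others fall to 1, illustrating why such a distribution of generated ratings tends to be conservative.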
Generally, the evaluations found that iSuggest-Unobtrusive incorporated highly useful features that
enabled ratings to be generated unobtrusively and effective recommendations to be produced from this
information.
Overall, the evaluations of the two prototype variants produced a number of important findings regarding
both the Scrutability & Control and Unobtrusive Recommendation research questions.
CHAPTER 6
Conclusion
The research questions for this thesis were expressed in Chapter 1 to be:
Scrutability & Control: What is the impact of adding scrutability and control to a recommender
system?
Unobtrusive Recommendation: Can a recommender system provide useful recommendations
without asking users to explicitly rate items?
As noted in Chapter 2, there is very little published research that deals with either of these two ques-
tions, but there is clear recognition of their importance and of the challenges in addressing them. Thus, this thesis
investigated each of these questions. An exploratory study was conducted, which involved an analysis of
existing systems and a questionnaire. The results from this study informed the creation of
a prototype system, which included a number of scrutability, control and unobtrusive recommendation
features. Finally, this system was evaluated through a combination of statistical methods and user eval-
uations. Both the exploratory study and the evaluations of the prototype produced significant findings.
These findings include:
Scrutability & Control. Based on the results from the questionnaire (which had 18 respondents and
is detailed in Chapter 3) and the two user evaluations (each of which had at least 9 participants and are
detailed in Chapter 5), the following findings were made:
• Explanations are a useful addition to a recommender system. However, complicated or poor
explanations can often confuse a user’s understanding of recommendations.
• Specific explanation types were found to be more useful than others for explaining particular
recommendation techniques.
• Different users prefer different forms of presentation and explanation.
• Genre Based and Learn By Example are both techniques that could be utilised to avoid the
cold start problem for new users.
• A Map Based presentation of recommendations can be a useful addition to a recommender
system.
• Users have a high level of interest in being given control of their recommender system. Further,
such controls can be useful for allowing users to improve the usefulness of recommendations.
• Respondents to our questionnaire did not think that Description Based or Lyrics Based recom-
mendation techniques would be useful.
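To make the Genre Based finding above concrete, the following is a minimal sketch of how a genre-level profile can side-step the cold-start problem: with only a handful of item ratings, averaging ratings per genre still yields a usable score for every unrated item in that genre. This is an illustration only, not the thesis's actual implementation; the function name and data layout are hypothetical.

```python
def genre_based_scores(user_ratings, item_genres, catalogue):
    """Score unrated items by the mean rating the user gave to their genres.

    Hypothetical sketch of a Genre Based technique: even one or two
    ratings produce a genre profile covering many unseen items, which
    is why such techniques help with the cold-start problem.
    """
    # Build per-genre rating averages from the (possibly tiny) set of ratings.
    totals, counts = {}, {}
    for item, rating in user_ratings.items():
        for genre in item_genres.get(item, []):
            totals[genre] = totals.get(genre, 0) + rating
            counts[genre] = counts.get(genre, 0) + 1
    genre_avg = {g: totals[g] / counts[g] for g in totals}

    # Score every unrated catalogue item as the mean of its known genres.
    scores = {}
    for item in catalogue:
        if item in user_ratings:
            continue
        genres = [g for g in item_genres.get(item, []) if g in genre_avg]
        if genres:
            scores[item] = sum(genre_avg[g] for g in genres) / len(genres)
    return scores
```

For example, a user who rated one rock track 5 and one pop track 1 would see an unrated rock track scored 5.0 and an unrated pop track scored 1.0, without any further explicit input.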
Unobtrusive Recommendation.
• Ratings can be generated from implicit information in a way that users have indicated is easy
to understand and is generally accurate. These ratings can then be used to make useful recom-
mendations.
Overall, this thesis was highly successful. It highlighted a number of key scrutability and control features
that would appear to be useful additions to existing recommender systems. These features can be used
to improve recommendation quality and usefulness, as well as improve users' trust and understanding
of recommender systems. Further, the Genre Based and Learn By Example techniques were shown to
produce useful recommendations, even when users had not entered a large number of ratings (a situation
that causes many recommendation techniques to produce poor recommendations). It was also shown
that a Map Based presentation would be a useful presentation method, which could be incorporated
into existing recommender systems. Finally, it was shown that ratings automatically generated from
implicit information about a user can be used to make useful recommendations. Each of these findings is
significant, as they can be used to improve the effectiveness, usefulness and user friendliness of existing
recommender systems.
6.1 Future Work
Despite the substantial progress made during this thesis, there are a number of areas that require future
research. These areas include:
• Investigation of the usefulness of dynamically combining multiple recommendation techniques.
• Investigation of new or extended ways of providing explanations and control to users.
• Further investigation into the most useful methods for providing a Map Based presentation of
recommendations.
• Improvements to the ratings generation algorithm presented in this thesis.
• Investigation of other types of implicit data that could be used to generate ratings.
References
G. Adomavicius and A. Tuzhilin. 2005. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749.
J. Atkinson. 2006. Free music recommendation services, 25th May.
C. Basu, H. Hirsh, and W. Cohen. 1998. Recommendation as classification: Using social and content-based information in recommendation. Proceedings of the Fifteenth National Conference on Artificial Intelligence.
J. S. Breese, D. Heckerman, and C. Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 461.
P. Cano, M. Koppenberger, and N. Wack. 2005. An industrial-strength content-based music recommendation system. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 673–673.
P. Cunningham, D. Doyle, and J. Loughrey. 2003. An Evaluation of the Usefulness of Case-Based Explanation. Case-Based Reasoning Research and Development. LNAI, 2689:122–130.
M. Deshpande and G. Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177.
J. L. Herlocker, J. A. Konstan, and J. Riedl. 2000. Explaining collaborative filtering recommendations. Proceedings of the 2000 ACM conference on Computer supported cooperative work, pages 241–250.
J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53.
J. L. Herlocker. 2000. Understanding and Improving Automated Collaborative Filtering Systems. Ph.D. thesis, University of Minnesota.
X. Hu, J. S. Downie, K. West, and A. Ehmann. 2005. Mining Music Reviews: Promising Preliminary Results. Proceedings of the 6th International Symposium on Music Information Retrieval, pages 536–539.
A. Kiss and J. Quinqueton. 2001. Machine learning of user preferences in a corporate knowledge management system. Proceedings of ISMCIK '01, pages 257–269.
J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. 1997. GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87.
B. Logan. 2004. Music recommendation from song sets. Proceedings of ISMIR.
H. Mak, I. Koprinska, and J. Poon. 2003. Intimate: a web-based movie recommender using text categorization. Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, pages 602–605.
D. Maltz and K. Ehrlich. 1995. Pointing the way: active collaborative filtering. Proceedings of the SIGCHI conference on Human factors in computing systems, pages 202–209.
D. McSherry. 2005. Explanation in Recommender Systems. Artificial Intelligence Review, 24(2):179–197.
S. E. Middleton, D. C. De Roure, and N. R. Shadbolt. 2001. Capturing knowledge of user preferences: ontologies in recommender systems. Proceedings of the international conference on Knowledge capture, pages 100–107.
R. J. Mooney and L. Roy. 2000. Content-based book recommending using learning for text categorization. Proceedings of the fifth ACM conference on Digital libraries, pages 195–204.
J. Nielsen. 1993. Evaluating the thinking-aloud technique for use by computer scientists. Advances in human-computer interaction, 3:69–82.
J. Nielsen. 1994. Estimating the number of subjects needed for a thinking aloud test. International Journal of Human-Computer Studies, 41(3):385–397.
D. W. Oard and J. Kim. 1998. Implicit feedback for recommender systems. Proceedings of the AAAI Workshop on Recommender Systems, pages 81–83.
G. Polcicova, R. Slovak, and P. Navrat. 2000. Combining content-based and collaborative filtering. Proceedings of ADBIS-DASFAA Symposium 2000, pages 118–127.
U. Shardanand and P. Maes. 1995. Social information filtering: algorithms for automating "word of mouth". Proceedings of the SIGCHI conference on Human factors in computing systems, pages 210–217.
R. Sinha and K. Swearingen. 2001. Beyond algorithms: An HCI perspective on recommender systems. Proceedings of the SIGIR 2001 Workshop on Recommender Systems.
R. Sinha and K. Swearingen. 2002. The role of transparency in recommender systems. Proceedings of the conference on Human Factors in Computing Systems, pages 830–831.
M. van Setten, M. Veenstra, and A. Nijholt. 2002. Prediction strategies: Combining prediction techniques to optimize personalization. Proceedings of the workshop Personalization in Future TV'02, pages 23–32.
M. van Setten, M. Veenstra, A. Nijholt, and B. van Dijk. 2003. Prediction strategies in a TV recommender system: Framework and experiments. Proceedings of IADIS WWW/Internet 2003, pages 203–210.
M. van Setten, M. Veenstra, A. Nijholt, and B. van Dijk. 2004. Case-based reasoning as a prediction strategy for hybrid recommender systems. Proceedings of the Atlantic Web Intelligence Conference, pages 13–22.
M. van Setten. 2005. Supporting People In Finding Information. Telematica Institut.
APPENDIX A
Appendix A — Questionnaire Form
Note: On this questionnaire, the technique referred to in the thesis as Learn By Example is called
Learning From Similar. Also, the technique referred to in the thesis as Social Filtering is called Word
Of Mouth.
APPENDIX B
Appendix B — Questionnaire Results
Note: A * indicates that this user did not answer this question because the content of the questionnaire
changed after the first five respondents.
APPENDIX D
Appendix D — iSuggest-Usability Evaluation Results
Note: A * indicates that this user did not answer this question due to a copying error.