Utrecht University - Universiteit Utrecht
TRANSCRIPT
Recommendation Systems
Unsupervised Machine Learning
Prof. Yannis Velegrakis, Utrecht University
[email protected] | https://velgias.github.io
Disclaimer

The following set of slides originates from a number of different presentations and courses by the following people:
- Yannis Velegrakis (Utrecht University)
- Jeff Ullman (Stanford University)
- Bill Howe (University of Washington)
- Martin Fowler (ThoughtWorks)
- Ekaterini Ioannou (Tilburg University)
- Themis Palpanas (University of Paris-Descartes)

Copyright stays with the authors. No distribution is allowed without prior permission by the authors.
Example: Recommender Systems

- Customer X
  - Buys Metallica CD
  - Buys Megadeth CD
- Customer Y
  - Does search on Metallica
  - Recommender system suggests Megadeth from data collected about customer X
Recommendations

[Diagram: items (products, web sites, blogs, news items, …) reach users via search and via recommendations]
From Scarcity to Abundance

- Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters, …
- The Web enables near-zero-cost dissemination of information about products
  - From scarcity to abundance
- More choice necessitates better filters
  - Recommendation engines
  - How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html
Sidenote: The Long Tail

Source: Chris Anderson (2004)
Physical vs. Online
Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!
Types of Recommendations

- Editorial and hand curated
  - List of favorites
  - Lists of "essential" items
- Simple aggregates
  - Top 10, Most Popular, Recent Uploads
- Tailored to individual users
  - Amazon, Netflix, …
Formal Model

- X = set of Customers
- S = set of Items
- Utility function u: X × S → R
  - R = set of ratings
  - R is a totally ordered set
  - e.g., 0-5 stars, real number in [0,1]
Utility Matrix

          Avatar  LOTR  Matrix  Pirates
  Alice     1             0.2
  Bob              0.5             0.3
  Carol    0.2              1
  David                             0.4

(blank cells are unknown ratings)
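In code, a utility matrix like this is usually stored sparsely, keeping only the known ratings. A minimal Python sketch using the example values above (the dict layout and the function name `u` are illustrative choices, not from the slides):

```python
# Sparse utility matrix: store only the known ratings.
utility = {
    ("Alice", "Avatar"): 1.0, ("Alice", "Matrix"): 0.2,
    ("Bob", "LOTR"): 0.5, ("Bob", "Pirates"): 0.3,
    ("Carol", "Avatar"): 0.2, ("Carol", "Matrix"): 1.0,
    ("David", "Pirates"): 0.4,
}

def u(x, s):
    """Utility function u: X x S -> R; returns None for an unknown rating."""
    return utility.get((x, s))
```

Everything that follows is about filling in the `None` entries with good estimates.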
Key Problems

- (1) Gathering "known" ratings for the matrix
  - How to collect the data in the utility matrix
- (2) Extrapolating unknown ratings from the known ones
  - Mainly interested in high unknown ratings: we are not interested in knowing what you don't like, but what you like
- (3) Evaluating extrapolation methods
  - How to measure the success/performance of recommendation methods
(1) Gathering Ratings

- Explicit
  - Ask people to rate items
  - Doesn't work well in practice: people can't be bothered
- Implicit
  - Learn ratings from user actions
  - E.g., purchase implies high rating
  - What about low ratings?
(2) Extrapolating Utilities

- Key problem: the utility matrix U is sparse
  - Most people have not rated most items
  - Cold start:
    - New items have no ratings
    - New users have no history
- Two approaches to recommender systems:
  - 1) Content-based
  - 2) Collaborative

Content-based Recommender Systems
Content-based Recommendations

- Main idea: recommend to customer x items similar to previous items rated highly by x

Examples:
- Movie recommendations
  - Recommend movies with the same actor(s), director, genre, …
- Websites, blogs, news
  - Recommend other sites with "similar" content
Plan of Action

[Diagram: from the items the user likes, build item profiles; from those, build a user profile; match the user profile against candidate item profiles and recommend the best matches]
Item Profiles

- For each item, create an item profile
- A profile is a set (vector) of features
  - Movies: author, title, actor, director, …
  - Text: set of "important" words in the document
- How to pick important features?
  - The usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
  - Term … feature; document … item
Sidenote: TF-IDF

- f_ij = frequency of term (feature) i in doc (item) j
- TF_ij = f_ij / max_k f_kj
  (note: we normalize TF to discount for "longer" documents)
- n_i = number of docs that mention term i; N = total number of docs
- IDF_i = log(N / n_i)
- TF-IDF score: w_ij = TF_ij × IDF_i
- Doc profile = set of words with the highest TF-IDF scores, together with their scores
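The TF-IDF scoring above can be sketched in a few lines of Python. The toy documents are made up for illustration, and TF is normalized by the most frequent term in each document, as in the note above:

```python
import math

def tf_idf(docs):
    """TF-IDF profiles for a list of tokenized documents.

    TF_ij = f_ij / max_k f_kj   (normalized to discount longer documents)
    IDF_i = log2(N / n_i)
    w_ij  = TF_ij * IDF_i
    """
    N = len(docs)
    df = {}                                  # n_i: number of docs mentioning term i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    profiles = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        max_f = max(counts.values())
        profiles.append({t: (f / max_f) * math.log2(N / df[t])
                         for t, f in counts.items()})
    return profiles

docs = [["apple", "apple", "pie"], ["apple", "cider"], ["pie", "chart"]]
profiles = tf_idf(docs)
```

A document's profile would then keep only the highest-scoring words; a term that appears in every document gets IDF 0 and drops out.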
User Profiles

Example 1: Boolean Utility Matrix

Example 2: Star Ratings

Making Predictions
User Profiles and Prediction

- User profile possibilities:
  - Weighted average of rated item profiles
  - Variation: weight by difference from the average rating for the item
  - …
- Prediction heuristic:
  - Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
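The prediction heuristic above can be sketched directly. This assumes the simplest user-profile choice (an unweighted average of rated item profiles) and uses made-up binary feature vectors:

```python
import math

def cosine(x, i):
    """Cosine similarity between user-profile vector x and item-profile i."""
    dot = sum(a * b for a, b in zip(x, i))
    nx = math.sqrt(sum(a * a for a in x))
    ni = math.sqrt(sum(b * b for b in i))
    return dot / (nx * ni)

# Item profiles the user rated highly (binary feature vectors).
rated = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
# User profile: simple (unweighted) average of the rated item profiles.
profile = [sum(col) / len(rated) for col in zip(*rated)]

candidate = [1.0, 0.0, 0.0]
u_hat = cosine(profile, candidate)   # estimated utility u(x, i)
```

Items are then ranked by their estimated utility and the top ones are recommended.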
Pros: Content-based Approach

- +: No need for data on other users
  - No cold-start or sparsity problems
- +: Able to recommend to users with unique tastes
- +: Able to recommend new & unpopular items
  - No first-rater problem
- +: Able to provide explanations
  - Can list the content features that caused an item to be recommended
Cons: Content-based Approach

- –: Finding the appropriate features is hard
  - E.g., images, movies, music
- –: Recommendations for new users
  - How to build a user profile?
- –: Overspecialization
  - Never recommends items outside the user's content profile
  - People might have multiple interests
  - Unable to exploit the quality judgments of other users

Collaborative Filtering

- Harnessing the judgment of other users
Collaborative Filtering

- Consider user x
- Find a set N of other users whose ratings are "similar" to x's ratings
- Estimate x's ratings based on the ratings of the users in N

[Diagram: user x and the neighborhood N of similar users]
Option 1: Jaccard Similarity

Option 2: Cosine Similarity

Option 3: Centered Cosine

Centered Cosine Similarity (2)

Rating Predictions
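The bodies of these slides did not survive in the transcript, so here is a hedged sketch: centered cosine (option 3, also known as Pearson correlation) mean-centers each user's observed ratings before taking the cosine, and a user-user rating prediction averages the ratings of the most similar users. The rating data below is made up, and restricting to positively correlated neighbors is one common design choice:

```python
import math

def centered_cosine(a, b):
    """Centered cosine (Pearson-style) similarity between two rating
    vectors; missing ratings are None and count as 0 after centering."""
    def center(r):
        known = [v for v in r if v is not None]
        mu = sum(known) / len(known)
        return [(v - mu) if v is not None else 0.0 for v in r]
    ca, cb = center(a), center(b)
    dot = sum(x * y for x, y in zip(ca, cb))
    na = math.sqrt(sum(x * x for x in ca))
    nb = math.sqrt(sum(y * y for y in cb))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def predict(ratings, x, i, k=2):
    """Estimate user x's rating of item i from the k most similar users
    (only positively correlated neighbors are used)."""
    sims = []
    for u, r in enumerate(ratings):
        if u == x or r[i] is None:
            continue
        s = centered_cosine(ratings[x], r)
        if s > 0:
            sims.append((s, u))
    sims = sorted(sims, reverse=True)[:k]
    num = sum(s * ratings[u][i] for s, u in sims)
    den = sum(s for s, _ in sims)
    return num / den if den > 0 else None

ratings = [
    [5, 4, None, 1],   # user 0: wants a prediction for item 2
    [4, 5, 3, 2],      # user 1: similar taste to user 0
    [1, 2, 4, 5],      # user 2: opposite taste
]
r_hat = predict(ratings, 0, 2)
```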
Item-Item Collaborative Filtering

- So far: user-user collaborative filtering
- Another view: item-item
  - For item i, find other similar items
  - Estimate the rating for item i based on the ratings for similar items
  - Can use the same similarity metrics and prediction functions as in the user-user model

  r_xi = ( Σ_{j ∈ N(i;x)} s_ij · r_xj ) / ( Σ_{j ∈ N(i;x)} s_ij )

  - s_ij … similarity of items i and j
  - r_xj … rating of user x on item j
  - N(i;x) … set of items rated by x that are similar to i
Item-Item CF (|N|=2)

            users
            1  2  3  4  5  6  7  8  9 10 11 12
  movie 1:  1  .  3  .  .  5  .  .  5  .  4  .
  movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
  movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
  movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
  movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
  movie 6:  1  .  3  .  3  .  .  2  .  .  4  .

  (. = unknown rating; ratings are between 1 and 5)
Item-Item CF (|N|=2)

Estimate the rating of movie 1 by user 5.

            users
            1  2  3  4  5  6  7  8  9 10 11 12
  movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .
  movie 2:  .  .  5  4  .  .  4  .  .  2  1  3
  movie 3:  2  4  .  1  2  .  3  .  4  3  5  .
  movie 4:  .  2  4  .  5  .  .  4  .  .  2  .
  movie 5:  .  .  4  3  4  2  .  .  .  .  2  5
  movie 6:  1  .  3  .  3  .  .  2  .  .  4  .
Item-Item CF (|N|=2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

            users
            1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
  movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
  movie 2:  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
  movie 3:  2  4  .  1  2  .  3  .  4  3  5  .     0.41
  movie 4:  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
  movie 5:  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
  movie 6:  1  .  3  .  3  .  .  2  .  .  4  .     0.59

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the rows
Item-Item CF (|N|=2)

Compute similarity weights: s_{1,3} = 0.41, s_{1,6} = 0.59.

            users
            1  2  3  4  5  6  7  8  9 10 11 12   sim(1,m)
  movie 1:  1  .  3  .  ?  5  .  .  5  .  4  .     1.00
  movie 2:  .  .  5  4  .  .  4  .  .  2  1  3    -0.18
  movie 3:  2  4  .  1  2  .  3  .  4  3  5  .     0.41
  movie 4:  .  2  4  .  5  .  .  4  .  .  2  .    -0.10
  movie 5:  .  .  4  3  4  2  .  .  .  .  2  5    -0.31
  movie 6:  1  .  3  .  3  .  .  2  .  .  4  .     0.59
Item-Item CF (|N|=2)

Predict by taking a weighted average:
r_{1,5} = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

            users
            1  2  3  4   5   6  7  8  9 10 11 12
  movie 1:  1  .  3  .  2.6  5  .  .  5  .  4  .
  movie 2:  .  .  5  4   .   .  4  .  .  2  1  3
  movie 3:  2  4  .  1   2   .  3  .  4  3  5  .
  movie 4:  .  2  4  .   5   .  .  4  .  .  2  .
  movie 5:  .  .  4  3   4   2  .  .  .  .  2  5
  movie 6:  1  .  3  .   3   .  .  2  .  .  4  .
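The weighted-average prediction from the worked example can be checked in code. This implements the formula r_xi = Σ s_ij·r_xj / Σ s_ij from the item-item slide, with the two neighbors found above:

```python
def item_item_predict(neighbors):
    """r_xi = sum_j s_ij * r_xj / sum_j s_ij over the neighborhood N(i; x),
    given as (similarity, rating) pairs."""
    num = sum(s * r for s, r in neighbors)
    den = sum(s for s, _ in neighbors)
    return num / den

# Movie 3 (similarity 0.41, rated 2 by user 5) and movie 6 (0.59, rated 3):
r_15 = item_item_predict([(0.41, 2), (0.59, 3)])
```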
Item-Item vs. User-User

[Utility matrix example (Alice, Bob, Carol, David × Avatar, LOTR, Matrix, Pirates) with estimated ratings filled in]

- In practice, it has been observed that item-item often works better than user-user
- Why? Items are simpler; users have multiple tastes
Implementing Collaborative Filtering

Collaborative Filtering: Complexity
Pros/Cons of Collaborative Filtering

- + Works for any kind of item
  - No feature selection needed
- – Cold start:
  - Need enough users in the system to find a match
- – Sparsity:
  - The user/ratings matrix is sparse
  - Hard to find users that have rated the same items
- – First rater:
  - Cannot recommend an item that has not been previously rated
  - New items, esoteric items
- – Popularity bias:
  - Cannot recommend items to someone with unique taste
  - Tends to recommend popular items
Hybrid Methods

- Implement two or more different recommenders and combine predictions
  - Perhaps using a linear model
- Add content-based methods to collaborative filtering
  - Item profiles for the new-item problem
  - Demographics to deal with the new-user problem

Global Baseline Estimate

Combining Global Baseline Estimate with CF
Evaluation

[Matrix R: 480,000 users × 17,700 movies, sparsely filled with ratings between 1 and 5]
Evaluation

[Matrix R: 480,000 users × 17,700 movies; part of the known ratings is withheld as a test set (shown as ?), the rest forms the training set]

- r̂_xi … predicted rating; r_xi … true rating of user x on item i (e.g., r_{3,6})
- RMSE = sqrt( (1/|T|) · Σ_{(x,i) ∈ T} (r̂_xi − r_xi)² ), where T is the test set
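The RMSE above is straightforward to compute; a small self-contained sketch with made-up predictions:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over pairs of predicted and true ratings."""
    n = len(actual)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, actual)) / n)

err = rmse([3.5, 2.0, 4.0], [3.0, 2.0, 5.0])
```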
Problems with Error Measures

- A narrow focus on accuracy sometimes misses the point
  - Prediction diversity
  - Prediction context
  - Order of predictions
- In practice, we only care about predicting high ratings:
  - RMSE might penalize a method that does well for high ratings and badly for others
Collaborative Filtering: Complexity

- The expensive step is finding the k most similar customers: O(|X|)
- Too expensive to do at runtime
  - Could pre-compute
- Naïve pre-computation takes time O(k · |X|)
  (X … set of customers)
- We already know how to do this!
  - Near-neighbor search in high dimensions (LSH)
  - Clustering
  - Dimensionality reduction
Tip: Add Data

- Leverage all the data
  - Don't try to reduce data size in an effort to make fancy algorithms work
  - Simple methods on large data do best
- Add more data
  - E.g., add IMDB data on genres
- More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
Performance of Various Methods (RMSE on the Netflix data)

- Global average: 1.1296
- User average: 1.0651
- Movie average: 1.0533
- Netflix: 0.9514
- Basic collaborative filtering: 0.94
- Collaborative filtering++: 0.91
- Latent factors: 0.90
- Latent factors + biases: 0.89
- Latent factors + biases + time: 0.876
- Grand Prize: 0.8563

When no prize… Getting desperate. Try a "kitchen sink" approach!

Dimensionality Reduction
Dimensionality Reduction

- Assumption: data lies on or near a low d-dimensional subspace
- The axes of this subspace are an effective representation of the data

Dimensionality Reduction

- Compress / reduce dimensionality:
  - 10^6 rows; 10^3 columns; no updates
  - Random access to any cell(s); small error: OK

[Example matrix not in transcript] The matrix is really "2-dimensional": all rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]
Rank is "Dimensionality"

- Cloud of points in 3D space:
  - Think of the point positions as a matrix, 1 row per point:
    A = [1 2 1]
    B = [-2 -3 1]
    C = [-1 -1 2]
- We can rewrite the coordinates more efficiently!
  - Old basis vectors: [1 0 0], [0 1 0], [0 0 1]
  - New basis vectors: [1 2 1], [-2 -3 1]
  - Then A has new coordinates [1 0], B: [0 1], C: [1 1]
  - Notice: we reduced the number of coordinates!
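The basis change can be verified in a few lines. The point matrix itself is garbled in the transcript, so C = [-1, -1, 2] is inferred here from the stated new coordinates [1, 1] (i.e., C = A + B):

```python
# The three points from the slide, one row per point.
A = [1, 2, 1]
B = [-2, -3, 1]
C = [-1, -1, 2]          # inferred: C = A + B, so it lies in the same plane

basis = [A, B]           # new basis vectors [1 2 1] and [-2 -3 1]

def from_new_coords(coords):
    """Map 2-D coordinates in the new basis back to 3-D space."""
    return [sum(c * axis[d] for c, axis in zip(coords, basis))
            for d in range(3)]

# A -> [1, 0], B -> [0, 1], C -> [1, 1] in the new basis:
recovered_C = from_new_coords([1, 1])
```

Two coordinates per point suffice because the matrix has rank 2.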
Dimensionality Reduction

- The goal of dimensionality reduction is to discover the axes of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line). By doing this we incur a bit of error, as the points do not lie exactly on the line.
Why Reduce Dimensions?

- Discover hidden correlations/topics
  - Words that occur commonly together
- Remove redundant and noisy features
  - Not all words are useful
- Interpretation and visualization
- Easier storage and processing of the data
SVD - Definition

A_[m×n] = U_[m×r] S_[r×r] (V_[n×r])^T

- A: input data matrix
  - m × n matrix (e.g., m documents, n terms)
- U: left singular vectors
  - m × r matrix (m documents, r concepts)
- S: singular values
  - r × r diagonal matrix (strength of each 'concept')
  - (r: rank of the matrix A)
- V: right singular vectors
  - n × r matrix (n terms, r concepts)
SVD - Properties

It is always possible to decompose a real matrix A into A = U S V^T, where:

- U, S, V: unique
- U, V: column orthonormal
  - U^T U = I; V^T V = I (I: identity matrix)
  - (columns are orthogonal unit vectors)
- S: diagonal
  - Entries (singular values) are positive, sorted in decreasing order (σ1 ≥ σ2 ≥ … ≥ 0)
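The column-orthonormality property U^T U = I can be checked numerically on the (rounded) U matrix from the users-to-movies example that follows; deviations from the identity come only from the rounding of the slide's values:

```python
# Rounded U from the users-to-movies SVD example below (7 users, 3 concepts).
U = [
    [0.13,  0.02, -0.01],
    [0.41,  0.07, -0.03],
    [0.55,  0.09, -0.04],
    [0.68,  0.11, -0.05],
    [0.15, -0.59,  0.65],
    [0.07, -0.73, -0.67],
    [0.07, -0.29,  0.32],
]

def gram(M):
    """Compute M^T M as a list of lists."""
    cols = list(zip(*M))
    return [[sum(a * b for a, b in zip(c1, c2)) for c2 in cols] for c1 in cols]

G = gram(U)   # should be close to the 3x3 identity matrix
```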
SVD – Example: Users-to-Movies

A = U S V^T. Rows of A are users, columns are the movies Matrix, Alien, Serenity, Casablanca, Amelie; the first four users are SciFi fans, the last three Romance fans.

A =
  1 1 1 0 0
  3 3 3 0 0
  4 4 4 0 0
  5 5 5 0 0
  0 2 0 4 4
  0 0 0 5 5
  0 1 0 2 2

U =
  0.13  0.02 -0.01
  0.41  0.07 -0.03
  0.55  0.09 -0.04
  0.68  0.11 -0.05
  0.15 -0.59  0.65
  0.07 -0.73 -0.67
  0.07 -0.29  0.32

S =
  12.4  0    0
  0     9.5  0
  0     0    1.3

V^T =
  0.56  0.59  0.56  0.09  0.09
  0.12 -0.02  0.12 -0.69 -0.69
  0.40 -0.80  0.40  0.09  0.09
SVD – Example: Users-to-Movies

Same decomposition as above. The first concept is a "SciFi-concept", the second a "Romance-concept".
SVD – Example: Users-to-Movies

U is the "user-to-concept" similarity matrix (first column: SciFi-concept, second column: Romance-concept).
SVD – Example: Users-to-Movies

The singular values express the "strength" of each concept (12.4 for the SciFi-concept, 9.5 for the Romance-concept).
SVD – Example: Users-to-Movies

V is the "movie-to-concept" similarity matrix (first row of V^T: SciFi-concept).

Dimensionality Reduction with SVD
SVD – Dimensionality Reduction

- Goal: minimize the sum of reconstruction errors Σ_ij (x_ij − z_ij)², where the x_ij are the "old" and the z_ij are the "new" coordinates
- SVD gives the 'best' axis to project on:
  - 'best' = minimizing the reconstruction errors
- In other words, minimum reconstruction error

[Figure: scatter plot of movie-1 rating vs. movie-2 rating, with v1, the first right singular vector, as the projection axis]
SVD - Interpretation #2

More details. Q: How exactly is dimensionality reduction done?

(Decomposition A = U S V^T as in the example above.)
SVD - Interpretation #2

Q: How exactly is dimensionality reduction done?
A: Set the smallest singular values to zero.

(Same decomposition; the smallest singular value here is 1.3.)
SVD - Interpretation #2

Setting the smallest singular value (σ3 = 1.3) to zero gives an approximation: A ≈ U S V^T with only the two largest singular values retained.
SVD - Interpretation #2

Q: How exactly is dimensionality reduction done?
A: Set the smallest singular values to zero. Keeping only the two largest singular values, A is approximated by a rank-2 product A ≈ U' S' V'^T with

U' =
  0.13  0.02
  0.41  0.07
  0.55  0.09
  0.68  0.11
  0.15 -0.59
  0.07 -0.73
  0.07 -0.29

S' =
  12.4  0
  0     9.5

V'^T =
  0.56  0.59  0.56  0.09  0.09
  0.12 -0.02  0.12 -0.69 -0.69
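Putting the truncation into code: the rank-2 product above can be multiplied out in pure Python and compared with the original matrix A. The reconstruction error stays small because the dropped singular value (1.3) carries less than 1% of the total "energy" Σ σ_i²:

```python
# Original users-to-movies matrix and the rank-2 SVD factors
# (values rounded, as on the slides).
A = [[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0], [5, 5, 5, 0, 0],
     [0, 2, 0, 4, 4], [0, 0, 0, 5, 5], [0, 1, 0, 2, 2]]
U2 = [[0.13, 0.02], [0.41, 0.07], [0.55, 0.09], [0.68, 0.11],
      [0.15, -0.59], [0.07, -0.73], [0.07, -0.29]]
S2 = [12.4, 9.5]
Vt2 = [[0.56, 0.59, 0.56, 0.09, 0.09],
       [0.12, -0.02, 0.12, -0.69, -0.69]]

# B = U2 * diag(S2) * Vt2: the rank-2 approximation of A.
B = [[sum(U2[r][k] * S2[k] * Vt2[k][c] for k in range(2))
      for c in range(5)] for r in range(7)]

# Fraction of the total energy retained when sigma_3 = 1.3 is dropped:
energy = (12.4**2 + 9.5**2) / (12.4**2 + 9.5**2 + 1.3**2)
```

Every entry of B stays within roughly 0.8 of the corresponding entry of A, even though we store 2 concepts instead of 3.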