CS 124/LINGUIST 180 From Languages to Information Dan Jurafsky Stanford University Recommender Systems & Collaborative Filtering Slides adapted from Jure Leskovec


Page 1:

CS 124/LINGUIST 180: From Languages to Information

Dan Jurafsky, Stanford University

Recommender Systems & Collaborative Filtering

Slides adapted from Jure Leskovec

Page 2:

Recommender Systems

Customer X: buys Metallica CD, buys Megadeth CD

Customer Y: does a search on Metallica. The recommender system suggests Megadeth from data collected about customer X.

2/24/16 JURE LESKOVEC, STANFORD C246: MINING MASSIVE DATASETS 2

Page 3:

Recommendations


[Diagram: users reach Items either via Search or via Recommendations]

Examples: products, web sites, blogs, news items, …

Page 4:

From Scarcity to Abundance

Shelf space is a scarce commodity for traditional retailers. Also: TV networks, movie theaters, …

The Web enables near-zero-cost dissemination of information about products: from scarcity to abundance.

More choice necessitates better filters: recommendation engines. How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html


Page 5:

Sidenote: The Long Tail

Source: Chris Anderson (2004)

Page 6:

Physical vs. Online

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!

Page 7:

Types of Recommendations

Editorial and hand-curated: lists of favorites, lists of "essential" items

Simple aggregates: Top 10, Most Popular, Recent Uploads

Tailored to individual users: Amazon, Netflix, … (today's class)

Page 8:

Formal Model

X = set of Customers
S = set of Items

Utility function u: X × S → R
R = set of ratings; R is a totally ordered set
e.g., 0–5 stars, or a real number in [0,1]

Page 9:

Utility Matrix

         Avatar   LOTR   Matrix   Pirates
Alice       1              0.2
Bob                 0.5               0.3
Carol      0.2              1
David                                 0.4
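The utility matrix above is mostly blank, so it is usually stored sparsely. A minimal sketch, assuming a dict keyed by (user, item) pairs with the values as reconstructed from the slide:

```python
# Sparse utility matrix: only observed (user, item) ratings are stored;
# every other cell of the matrix is unknown.
ratings = {
    ("Alice", "Avatar"): 1.0, ("Alice", "Matrix"): 0.2,
    ("Bob", "LOTR"): 0.5, ("Bob", "Pirates"): 0.3,
    ("Carol", "Avatar"): 0.2, ("Carol", "Matrix"): 1.0,
    ("David", "Pirates"): 0.4,
}

def rating(user, item):
    """Return the known rating, or None if this cell is blank."""
    return ratings.get((user, item))

print(rating("Alice", "Matrix"))   # 0.2
print(rating("Alice", "Pirates"))  # None: the kind of value we want to predict
```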

Page 10:

Key Problems

(1) Gathering "known" ratings for the matrix: how to collect the data in the utility matrix

(2) Extrapolating unknown ratings from known ones: mainly interested in high unknown ratings. We are not interested in knowing what you don't like, but what you like.

(3) Evaluating extrapolation methods: how to measure the success/performance of recommendation methods

Page 11:

(1) Gathering Ratings

Explicit: ask people to rate items. Doesn't work well in practice: people can't be bothered. Crowdsourcing: pay people to label items.

Implicit: learn ratings from user actions, e.g., a purchase implies a high rating. What about low ratings?

Page 12:

(2) Extrapolating Utilities

Key problem: the utility matrix U is sparse. Most people have not rated most items. Cold start: new items have no ratings; new users have no history.

Three approaches to recommender systems:
1. Content-based (today!)
2. Collaborative (today!)
3. Latent factor based

Page 13:

Content-based Recommender Systems

Page 14:

Content-based Recommendations

Main idea: recommend items to customer x similar to previous items rated highly by x.

Examples:
Movie recommendations: recommend movies with the same actor(s), director, genre, …
Websites, blogs, news: recommend other sites with "similar" content

Page 15:

Plan of Action

[Diagram: from items the user likes, build item profiles (features such as red, circles, triangles); aggregate them into a user profile; match the user profile against item profiles to recommend new items]

Page 16:

Item Profiles

For each item, create an item profile.

A profile is a set (vector) of features:
Movies: author, genre, director, actors, year, …
Text: set of "important" words in the document

How to pick important features? TF-IDF (term frequency × inverse document frequency), with term ↔ feature and document ↔ item.
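TF-IDF weighting can be sketched as below. The slides don't fix an exact variant, so this uses one common choice (length-normalized term frequency times log inverse document frequency), and the documents are made-up toy examples:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict per doc mapping term -> TF*IDF."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in docs for t in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

docs = [["pirate", "ship", "pirate"], ["spy", "ship"], ["spy", "gadget"]]
weights = tf_idf(docs)
# "pirate" occurs in only one document, so it gets a high weight there;
# "ship" occurs in two of the three documents, so its weight is lower.
```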

Page 17:

Content-based Item Profiles

Binary features (actors such as Actor A, Actor B, …, Melissa McCarthy, Johnny Depp; genres such as Pirate, Spy, Comic) plus a scaled average rating:

Movie X:  0 1 1 0 1 1 0 1 | 3α
Movie Y:  1 1 0 1 0 1 1 0 | 4α

• Maybe there is a scaling factor α between binary and numeric features
• Or maybe α = 1

Cosine(Movie X, Movie Y) = (X · Y) / (||X|| · ||Y||)

Page 18:

User Profiles

Want a vector with the same components/dimensions as items. Entries could be 1s representing user purchases, or arbitrary numbers from a rating.

The user profile is an aggregate of items: a (weighted?) average of the rated item profiles.

Page 19:

Sample User Profile

• Items are movies
• The utility matrix has a 1 if the user has seen the movie
• 20% of the movies user U has seen have Melissa McCarthy
• U["Melissa McCarthy"] = 0.2

          Melissa McCarthy   Actor A   Actor B   …
User U          0.2            .005       0      0 …

Page 20:

Prediction

User and item vectors have the same components/dimensions! So just recommend the items whose vectors are most similar to the user vector.

Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| · ||i||)
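The prediction rule can be sketched as follows; the profile vectors here are made-up illustrative numbers, not data from the slides:

```python
import math

def cos(x, y):
    """Cosine similarity between two vectors over the same feature dimensions."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Hypothetical profiles over the same features (actors, genres, ...).
user_profile = [0.2, 0.0, 0.5, 0.1]
item_profiles = {"Movie X": [1, 0, 1, 0], "Movie Y": [0, 1, 0, 1]}

# Recommend the items whose profile is most similar to the user's.
ranked = sorted(item_profiles, key=lambda i: cos(user_profile, item_profiles[i]),
                reverse=True)
print(ranked[0])  # Movie X: it shares the features the user's profile weights highly
```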

Page 21:

Pros: Content-based Approach

+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items: no first-rater problem
+ Able to provide explanations: can list the content features that caused an item to be recommended

Page 22:

Cons: Content-based Approach

–: Finding the appropriate features is hard (e.g., images, movies, music)
–: Recommendations for new users: how to build a user profile?
–: Overspecialization: never recommends items outside the user's content profile; people might have multiple interests; unable to exploit quality judgments of other users

Page 23:

Collaborative Filtering
Harnessing quality judgments of other users

Page 24:

Collaborative Filtering

Consider user x.
Find a set N of other users whose ratings are "similar" to x's ratings.
Estimate x's ratings based on the ratings of the users in N.

Page 25:

Finding Similar Users

Let rx be the vector of user x's ratings.

Jaccard similarity measure. Problem: ignores the value of the rating.

Cosine similarity measure: sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||). Problem: treats missing ratings as "negative".

Example:
rx = [*, _, _, *, ***]
ry = [*, _, **, **, _]

rx, ry as sets: rx = {1, 4, 5}, ry = {1, 3, 4}
rx, ry as points: rx = (1, 0, 0, 1, 3), ry = (1, 0, 2, 2, 0)
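Both measures can be checked on the slide's star-rating example:

```python
import math

# Star ratings from the slide: rx = [*, _, _, *, ***], ry = [*, _, **, **, _]
rx = [1, 0, 0, 1, 3]   # 0 = not rated
ry = [1, 0, 2, 2, 0]

# As sets (which items were rated at all): Jaccard ignores the rating values.
sx = {i for i, r in enumerate(rx) if r}    # {0, 3, 4}
sy = {i for i, r in enumerate(ry) if r}    # {0, 2, 3}
jaccard = len(sx & sy) / len(sx | sy)      # 2/4 = 0.5

# As points: cosine uses the values, but treats the 0s as real (low) ratings.
cosine = sum(a * b for a, b in zip(rx, ry)) / (
    math.sqrt(sum(a * a for a in rx)) * math.sqrt(sum(b * b for b in ry)))

print(jaccard)           # 0.5
print(round(cosine, 3))  # 0.302
```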

Page 26:

Utility Matrix

322 CHAPTER 9. RECOMMENDATION SYSTEMS

9.3.1 Measuring Similarity

The first question we must deal with is how to measure the similarity of users or items from their rows or columns in the utility matrix. We have reproduced Fig. 9.1 here as Fig. 9.4. This data is too small to draw any reliable conclusions, but its small size will make clear some of the pitfalls in picking a distance measure. Observe specifically the users A and C. They rated two movies in common, but they appear to have almost diametrically opposite opinions of these movies. We would expect that a good distance measure would make them rather far apart. Here are some alternative measures to consider.

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A     4                 5     1
B     5     5     4
C                       2     4     5
D                 3                 3

Figure 9.4: The utility matrix introduced in Fig. 9.1

Jaccard Distance

We could ignore values in the matrix and focus only on the sets of items rated. If the utility matrix only reflected purchases, this measure would be a good one to choose. However, when utilities are more detailed ratings, the Jaccard distance loses important information.

Example 9.7: A and B have an intersection of size 1 and a union of size 5. Thus, their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; i.e., they are very far apart. In comparison, A and C have a Jaccard similarity of 2/4, so their Jaccard distance is the same value, 1/2. Thus, A appears closer to C than to B. Yet that conclusion seems intuitively wrong. A and C disagree on the two movies they both watched, while A and B seem both to have liked the one movie they watched in common.
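Example 9.7 can be verified directly from the sets of items each user rated in Fig. 9.4:

```python
# Items rated by each user in Fig. 9.4 (sets only; rating values ignored).
rated = {
    "A": {"HP1", "TW", "SW1"},
    "B": {"HP1", "HP2", "HP3"},
    "C": {"TW", "SW1", "SW2"},
}

def jaccard_distance(x, y):
    """1 minus the Jaccard similarity of the two users' rated-item sets."""
    return 1 - len(rated[x] & rated[y]) / len(rated[x] | rated[y])

print(jaccard_distance("A", "B"))  # 0.8  (similarity 1/5)
print(jaccard_distance("A", "C"))  # 0.5  (similarity 2/4)
# A looks closer to C than to B, even though A and C disagree sharply.
```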

Cosine Distance

We can treat blanks as a 0 value. This choice is questionable, since it has the effect of treating the lack of a rating as more similar to disliking the movie than liking it.

Example 9.8: The cosine of the angle between A and B is

(4 × 5) / (√(4² + 5² + 1²) · √(5² + 5² + 4²)) = 0.380

Intuitively we want: sim(A, B) > sim(A, C)
Jaccard similarity: 1/5 < 2/4
Cosine similarity: 0.380 > 0.322

Problem: cosine considers missing ratings as "negative".
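The two cosine values can be checked by treating the blanks of Fig. 9.4 as 0:

```python
import math

# Fig. 9.4 rows over [HP1, HP2, HP3, TW, SW1, SW2, SW3], blanks as 0.
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]

def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(f"{cos(A, B):.3f}")  # 0.380, as in Example 9.8
print(f"{cos(A, C):.3f}")  # 0.322, so A looks slightly closer to B than to C
```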

Page 27:


• Problem with cosine: a 0 acts like a negative review
  • C really loves SW
  • A hates SW
  • B just hasn't seen it
• Another problem: we'd like to normalize for raters
  • D rated everything the same; not very useful

Page 28:

Modified Utility Matrix: subtract the means of each row

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A    2/3               5/3  −7/3
B    1/3   1/3  −2/3
C                     −5/3   1/3   4/3
D                0                  0

Figure 9.6: The utility matrix of Fig. 9.4 with each row's mean subtracted

The cosine of the angle between A and C is

((5/3) × (−5/3) + (−7/3) × (1/3)) / (√((2/3)² + (5/3)² + (−7/3)²) · √((−5/3)² + (1/3)² + (4/3)²)) = −0.559

Notice that under this measure, A and C are much further apart than A and B, and neither pair is very close. Both these observations make intuitive sense, given that A and C disagree on the two movies they rated in common, while A and B give similar scores to the one movie they rated in common.

9.3.2 The Duality of Similarity

The utility matrix can be viewed as telling us about users or about items, or both. It is important to realize that any of the techniques we suggested in Section 9.3.1 for finding similar users can be used on columns of the utility matrix to find similar items. There are two ways in which the symmetry is broken in practice.

1. We can use information about users to recommend items. That is, given a user, we can find some number of the most similar users, perhaps using the techniques of Chapter 3. We can base our recommendation on the decisions made by these similar users, e.g., recommend the items that the greatest number of them have purchased or rated highly. However, there is no symmetry. Even if we find pairs of similar items, we need to take an additional step in order to recommend items to users. This point is explored further at the end of this subsection.

2. There is a difference in the typical behavior of users and items, as it pertains to similarity. Intuitively, items tend to be classifiable in simple terms. For example, music tends to belong to a single genre. It is impossible, e.g., for a piece of music to be both 60's rock and 1700's baroque. On the other hand, there are individuals who like both 60's rock and 1700's baroque, and who buy examples of both types of music. The consequence is that it is easier to discover items that are similar because they belong to the same genre, than it is to detect that two users are similar because they prefer one genre in common, while each also likes some genres that the other doesn't care for.


• Now a 0 means no information
• And negative ratings mean that viewers with opposite ratings will have vectors in opposite directions!

Page 29:


The cosine of the angle between A and C is

(5 × 2 + 1 × 4) / (√(4² + 5² + 1²) · √(2² + 4² + 5²)) = 0.322

Since a larger (positive) cosine implies a smaller angle and therefore a smaller distance, this measure tells us that A is slightly closer to B than to C.

Rounding the Data

We could try to eliminate the apparent similarity between movies a user rates highly and those with low scores by rounding the ratings. For instance, we could consider ratings of 3, 4, and 5 as a "1" and consider ratings 1 and 2 as unrated. The utility matrix would then look as in Fig. 9.5. Now, the Jaccard distance between A and B is 3/4, while between A and C it is 1; i.e., C appears further from A than B does, which is intuitively correct. Applying cosine distance to Fig. 9.5 allows us to draw the same conclusion.

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A     1                 1
B     1     1     1
C                             1     1
D                 1                 1

Figure 9.5: Utilities of 3, 4, and 5 have been replaced by 1, while ratings of 1 and 2 are omitted

Normalizing Ratings

If we normalize ratings, by subtracting from each rating the average rating of that user, we turn low ratings into negative numbers and high ratings into positive numbers. If we then take the cosine distance, we find that users with opposite views of the movies they viewed in common will have vectors in almost opposite directions, and can be considered as far apart as possible. However, users with similar opinions about the movies rated in common will have a relatively small angle between them.

Example 9.9: Figure 9.6 shows the matrix of Fig. 9.4 with all ratings normalized. An interesting effect is that D's ratings have effectively disappeared, because a 0 is the same as a blank when cosine distance is computed. Note that D gave only 3's and did not differentiate among movies, so it is quite possible that D's opinions are not worth taking seriously.

Let us compute the cosine of the angle between A and B:

((2/3) × (1/3)) / (√((2/3)² + (5/3)² + (−7/3)²) · √((1/3)² + (1/3)² + (−2/3)²)) = 0.092
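Both normalized cosines (0.092 for A and B, −0.559 for A and C) can be verified by centering each row of Fig. 9.4 over its rated entries and treating the remaining blanks as 0:

```python
import math

# Fig. 9.4 rows over [HP1, HP2, HP3, TW, SW1, SW2, SW3]; None = blank.
rows = {
    "A": [4, None, None, 5, 1, None, None],
    "B": [5, 5, 4, None, None, None, None],
    "C": [None, None, None, 2, 4, 5, None],
}

def center(user):
    """Subtract the user's mean rating; blanks become 0 (no information)."""
    rated = [r for r in rows[user] if r is not None]
    mean = sum(rated) / len(rated)
    return [0.0 if r is None else r - mean for r in rows[user]]

def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(f"{cos(center('A'), center('B')):.3f}")  # 0.092
print(f"{cos(center('A'), center('C')):.3f}")  # -0.559
```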


Cos(A,B) = 0.092

Cos(A,C) = −0.559

Now A and C are (correctly) way further apart than A, B

Page 30:

Cosine after subtracting the mean turns out to be the same as the Pearson correlation coefficient!

Cosine similarity is correlation when the data is centered at 0.

Terminological note: subtracting the mean is zero-centering, not normalizing (normalizing is dividing by a norm, e.g., to turn something into a probability), but the textbook (and common usage) sometimes overloads the term "normalize".

Page 31:

Finding Similar Users

Let rx be the vector of user x's ratings.

Cosine similarity measure: sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| · ||ry||)
Problem: treats missing ratings as "negative".

Example:
rx = [*, _, _, *, ***]
ry = [*, _, **, **, _]
rx, ry as points: rx = (1, 0, 0, 1, 3), ry = (1, 0, 2, 2, 0)

Pearson correlation coefficient:
Sxy = items rated by both users x and y
r̄x, r̄y = average rating of x, of y

sim(x, y) = Σ_{s∈Sxy} (r_xs − r̄x)(r_ys − r̄y) / (√(Σ_{s∈Sxy} (r_xs − r̄x)²) · √(Σ_{s∈Sxy} (r_ys − r̄y)²))
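A minimal sketch of the Pearson formula, with each user's mean taken over that user's rated items and the sums taken over the co-rated set Sxy (0 stands for "unrated" here):

```python
import math

def pearson(rx, ry):
    """Pearson correlation over S_xy, the items rated by both users (0 = unrated)."""
    common = [i for i in range(len(rx)) if rx[i] and ry[i]]
    mean_x = sum(r for r in rx if r) / sum(1 for r in rx if r)  # avg over x's rated items
    mean_y = sum(r for r in ry if r) / sum(1 for r in ry if r)
    num = sum((rx[i] - mean_x) * (ry[i] - mean_y) for i in common)
    den_x = math.sqrt(sum((rx[i] - mean_x) ** 2 for i in common))
    den_y = math.sqrt(sum((ry[i] - mean_y) ** 2 for i in common))
    return num / (den_x * den_y)

# The star-rating example from the slide.
rx = [1, 0, 0, 1, 3]
ry = [1, 0, 2, 2, 0]
print(round(pearson(rx, ry), 2))  # 0.32
```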

Page 32:

Rating Predictions

From similarity metric to recommendations:
Let rx be the vector of user x's ratings.
Let N be the set of the k users most similar to x who have rated item i.

Prediction for item i of user x (shorthand: s_xy = sim(x, y)):

Simple average:       r_xi = (1/k) Σ_{y∈N} r_yi
Similarity-weighted:  r_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

Many other tricks are possible…
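Both prediction rules can be sketched with a hypothetical neighbor set; the (similarity, rating) pairs below are made up for illustration:

```python
# (s_xy, r_yi) for each neighbor y in N: similarity to user x,
# and the neighbor's rating of item i (illustrative numbers).
neighbors = [(0.9, 4), (0.5, 5), (0.2, 2)]

# Simple average ignores how similar each neighbor is.
simple_avg = sum(r for _, r in neighbors) / len(neighbors)

# Similarity-weighted average lets closer neighbors count for more.
weighted = sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)

print(round(simple_avg, 2))  # 3.67
print(round(weighted, 2))    # 4.06: pulled toward the most similar neighbor's 4
```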

Page 33:

Item-Item Collaborative Filtering

So far: user-user collaborative filtering.
Another view: item-item.
For item i, find other similar items.
Estimate the rating for item i based on the ratings for similar items.
Can use the same similarity metrics and prediction functions as in the user-user model:

r_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i

Page 34:

Item-Item CF (|N| = 2)

                         users
          1   2   3   4   5   6   7   8   9  10  11  12
movie 1   1       3           5           5       4
movie 2           5   4           4           2   1   3
movie 3   2   4       1   2       3       4   3   5
movie 4       2   4       5           4           2
movie 5           4   3   4   2                   2   5
movie 6   1       3       3           2           4

(blank = unknown rating; entries are ratings between 1 and 5)

Page 35:

Item-Item CF (|N| = 2)

Same matrix as on the previous slide; the rating of movie 1 by user 5 is unknown (?). Goal: estimate it.

Page 36:

Item-Item CF (|N| = 2)

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i's row
   m1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
   row 1: [−2.6, 0, −0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

sim(1, m) for movies 1–6: 1.00, −0.18, 0.41, −0.10, −0.31, 0.59
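The whole worked example (this slide and the next two) can be reproduced in code: center each movie's row over its rated entries, compute cosine (i.e., Pearson) similarities to movie 1, pick the two most similar movies that user 5 has rated, and take the weighted average:

```python
import math

# The 6-movie x 12-user matrix from the slide (0 = unknown rating).
M = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],  # movie 1: user 5's rating is missing
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],  # movie 2
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],  # movie 3
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],  # movie 4
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],  # movie 5
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],  # movie 6
]

def centered(row):
    """Subtract the movie's mean rating; unknown entries become 0."""
    mean = sum(r for r in row if r) / sum(1 for r in row if r)
    return [r - mean if r else 0.0 for r in row]

def sim(a, b):
    """Pearson correlation = cosine between the centered rows."""
    ca, cb = centered(M[a]), centered(M[b])
    dot = sum(x * y for x, y in zip(ca, cb))
    return dot / (math.sqrt(sum(x * x for x in ca)) * math.sqrt(sum(y * y for y in cb)))

user, item, k = 4, 0, 2                      # user 5, movie 1 (0-indexed), |N| = 2
candidates = [j for j in range(len(M)) if j != item and M[j][user]]
top = sorted(candidates, key=lambda j: sim(item, j), reverse=True)[:k]
pred = sum(sim(item, j) * M[j][user] for j in top) / sum(sim(item, j) for j in top)

print(round(sim(0, 2), 2), round(sim(0, 5), 2))  # 0.41 0.59
print(round(pred, 1))                             # 2.6
```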

Page 37:

Item-Item CF (|N| = 2)

Compute similarity weights: the two movies most similar to movie 1 that user 5 has rated are movies 3 and 6, with s(1,3) = 0.41 and s(1,6) = 0.59.

Page 38:

Item-Item CF (|N| = 2)

Predict by taking a weighted average:

r(1,5) = (0.41 × 2 + 0.59 × 3) / (0.41 + 0.59) = 2.6

In general: r_ix = Σ_{j∈N(i;x)} s_ij · r_jx / Σ_{j∈N(i;x)} s_ij

Page 39:

Item-Item vs. User-User

[Figure: a small utility matrix over movies Avatar, LOTR, Matrix, Pirates and users Alice, Bob, Carol, David, with estimated ratings filled in.]

In practice, it has been observed that item-item often works better than user-user.

Why? Items are simpler; users have multiple tastes.

Page 40:

Pros/Cons of Collaborative Filtering

+ Works for any kind of item: no feature selection needed.

- Cold start: need enough users in the system to find a match.

- Sparsity: the user/ratings matrix is sparse, so it is hard to find users who have rated the same items.

- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items).

- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items.

Page 41:

Hybrid Methods

Implement two or more different recommenders and combine their predictions, perhaps using a linear model.

Add content-based methods to collaborative filtering:
- item profiles to deal with the new-item problem
- demographics to deal with the new-user problem

Page 42:

Evaluation

[Figure: a users × movies utility matrix of known ratings.]

Page 43:

Evaluation

[Figure: the same users × movies utility matrix, with some known ratings withheld (shown as ?) to form the test data set.]

Page 44:

Evaluating Predictions

Compare predictions with known ratings.

Root-mean-square error (RMSE):

RMSE = sqrt( (1/N) · Σ_{(x,i)} (r_xi − r*_xi)² )

where r_xi is the predicted rating, r*_xi is the true rating of user x on item i, and N is the number of test ratings.

Rank correlation: Spearman's correlation between the system's and the user's complete rankings.
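Both measures are short one-liners in practice. A minimal sketch (the rating values are made up for illustration; the tie-free ranking helper is my own simplification):

```python
import math

# Hypothetical predicted vs. true ratings for a handful of (user, item) pairs.
predicted = [3.8, 2.1, 4.5, 3.0]
actual    = [4,   2,   5,   3]

def rmse(pred, true):
    """Root-mean-square error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def ranks(xs):
    """Rank positions (1 = smallest); ties are not handled in this sketch."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, 1):
        r[i] = rank
    return r

def spearman(pred, true):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rp, rt = ranks(pred), ranks(true)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    num = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    den = math.sqrt(sum((a - mp) ** 2 for a in rp) *
                    sum((b - mt) ** 2 for b in rt))
    return num / den

print(round(rmse(predicted, actual), 3))  # → 0.274
print(spearman(predicted, actual))        # → 1.0 (identical orderings)
```

Note that a system can have perfect rank correlation (it orders the items exactly as the user would) while still having nonzero RMSE, which is one reason to report both.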

Page 45:

Problems with Error Measures

A narrow focus on accuracy sometimes misses the point:
- prediction diversity
- prediction context
- order of predictions

In practice, we mostly care about predicting high ratings: RMSE might penalize a method that does well on high ratings and badly on the others.

Page 46:

There's No Data like Mo' Data

Leverage all the data: simple methods on large data do best.

Add more data, e.g., IMDB data on genres.

More data beats better algorithms.

Page 47:

Latent Factor Models (like SVD) [Bellkor Team]

[Figure: movies and users plotted in a 2-D latent space. One axis runs from "geared towards females" to "geared towards males", the other from "serious" to "escapist". Movies such as The Princess Diaries, Sense and Sensibility, The Color Purple, Amadeus, The Lion King, Ocean's 11, Braveheart, Lethal Weapon, Independence Day, and Dumb and Dumber are placed in this space, along with users Gus and Dave.]
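The kind of embedding sketched in the figure can be obtained with a truncated SVD. A minimal numpy sketch on a fully observed toy matrix (real systems fit the latent factors only on the observed entries; the ratings below are invented for illustration):

```python
import numpy as np

# Toy ratings matrix (4 users × 4 movies). A rank-k SVD places both users
# and movies in a shared k-dimensional latent space, like the two axes
# (female/male, serious/escapist) sketched on the slide.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]          # each user as a point in latent space
movie_factors = Vt[:k].T                 # each movie as a point in latent space
R_hat = user_factors @ movie_factors.T   # rank-k reconstruction of the ratings

print(np.round(R_hat, 1))
```

A predicted rating is just the dot product of a user's factor vector with a movie's factor vector, so unseen (user, movie) cells of R_hat serve as predictions.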

Page 48:

Famous Historical Example: The Netflix Prize

Training data: 100 million ratings, 480,000 users, 17,770 movies; 6 years of data (2000-2005).

Test data: the last few ratings of each user (2.8 million).

Evaluation criterion: root-mean-squared error (RMSE). Netflix's Cinematch RMSE: 0.9514.

Competition: 2700+ teams; $1 million prize for a 10% improvement on Cinematch.

The BellKor system won in 2009 by combining many factors:
- overall deviations of users/movies
- regional effects
- local collaborative filtering patterns
- temporal biases

Page 49:

Summary on Recommendation Systems

• The Long Tail
• Content-based Systems
• Collaborative Filtering
• Latent Factors