Preliminaries: Data Mining - University of Notre Dame (rjohns15/cse40647.sp14/www/content/lectures/02...)

Preliminaries

Data Mining

The art of extracting knowledge from large bodies of structured data.

Let’s put it to use!

Recommendations

Basic Recommendations with Collaborative Filtering

Making Recommendations

The Netflix Prize (2006-2009)

What was the Netflix Prize?

• In October 2006, Netflix released a dataset containing 100 million anonymous movie ratings and challenged the data mining, machine learning, and computer science communities to develop systems that could beat the accuracy of its recommendation system, Cinematch.

• Thus began the Netflix Prize, an open competition for the best collaborative filtering algorithm to predict user ratings for films, solely based on previous ratings without any other information about the users or films.

The Netflix Prize Datasets

• Netflix provided a training dataset of 100,480,507 ratings that 480,189 users gave to 17,770 movies.

– Each training rating (or instance) is of the form <user, movie, date of rating, rating>.

– The user and movie fields are integer IDs, while ratings are from 1 to 5 (integral) stars.

• The qualifying dataset contained 2,817,131 instances of the form <user, movie, date of rating>, with ratings known only to the jury.

• A participating team’s algorithm had to predict ratings on the entire qualifying set, which consisted of a validation (quiz) set and a test set.

– During the competition, teams were informed only of their score on the validation (quiz) set of 1,408,342 ratings.

– The jury used the test set of 1,408,789 ratings to determine potential prize winners.

The Netflix Prize Data

Rows are users (instances, also called samples, examples, or observations); columns are movie ratings (features, also called attributes, or dimensions).

            Movie Ratings
Users       1   2   .   .   m
1           5   2   5   4
2           2   5   3
.
.           2   2   4   2
n           5   1   5   ?

The Netflix Prize Goal

Movie ratings (Star Wars, Hoop Dreams, Contact, Titanic) by user:

Joe: 5, 2, 5, 4
John: 2, 5, 3 (one of the four movies is unrated)
Al: 2, 2, 4, 2
Everaldo: 5, 1, 5, ?

Goal: Predict ? (a movie rating) for a user.

The Netflix Prize Methods

Bennett, James, and Stan Lanning. "The Netflix Prize." Proceedings of KDD Cup and Workshop. Vol. 2007. 2007.

Some of these methods we will discuss now; the rest we will discuss by the end of the course.

Raw Averages

User average: Simply assign $\bar{r}_u$, the average rating given by user $u \in U$, where $U$ is the set of all users.

Item average: Simply assign $\bar{r}_i$, the average of the ratings $r_{u,i}$ given to item $i \in I$, where $I$ is the set of all items and $r_{u,i}$ is the rating given to item $i$ by user $u$.

What about universally good or bad movies? Or skewed rating systems?
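A minimal sketch of the two raw-average baselines, assuming ratings are stored as a nested dictionary. The data, user names, item names, and function names are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical toy data: ratings[user][item] = stars (1 to 5).
ratings = {
    "u1": {"A": 5, "B": 2, "C": 5, "D": 4},
    "u2": {"A": 2, "B": 5, "C": 3},
    "u3": {"A": 2, "B": 2, "C": 4, "D": 2},
    "u4": {"A": 5, "B": 1, "C": 5},  # the rating of "D" is unknown
}

def user_average(ratings, user):
    """Baseline 1: mean of all ratings given by `user`."""
    return mean(ratings[user].values())

def item_average(ratings, item):
    """Baseline 2: mean of all ratings given to `item` by users who rated it."""
    return mean(r[item] for r in ratings.values() if item in r)

# Two different guesses for the unknown rating of item "D" by user "u4".
guess_from_user = user_average(ratings, "u4")
guess_from_item = item_average(ratings, "D")
```

The user average ignores how good the movie is in general, and the item average ignores how generous the user is, which is the weakness the questions above point at.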

Bayesian Method

Apply Bayes’ Theorem: Of the ratings $r \in R$ a user could give for a movie, assign the one with the highest probability:

$$P(r \mid i) = \frac{P(i \mid r)\, P(r)}{P(i)}$$

where $P(r \mid i)$ is the (conditional) probability of rating $r$ given item $i$, $P(i \mid r)$ is the (conditional) probability of item $i$ given rating $r$, $P(r)$ is the (prior) probability of rating $r$, and $P(i)$ is the (prior) probability of item $i$.

But this method still doesn’t account for the similarity between users.
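As a rough illustration, the probabilities in Bayes' Theorem can be estimated from counts. The sketch below assumes a pooled list of (item, rating) observations; the data and the function names `posterior` and `bayes_prediction` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (item, rating) pairs pooled over all users.
observations = [
    ("A", 5), ("A", 5), ("A", 2),
    ("B", 2), ("B", 1), ("B", 2), ("B", 5),
]

def posterior(observations, item):
    """Return {r: P(r | item)}, computed as P(item | r) * P(r) / P(item)."""
    n = len(observations)
    rating_counts = Counter(r for _, r in observations)
    item_counts = Counter(i for i, _ in observations)
    pair_counts = Counter(observations)

    p_item = item_counts[item] / n  # prior P(i)
    result = {}
    for r, count_r in rating_counts.items():
        p_r = count_r / n                                  # prior P(r)
        p_item_given_r = pair_counts[(item, r)] / count_r  # likelihood P(i | r)
        result[r] = p_item_given_r * p_r / p_item
    return result

def bayes_prediction(observations, item):
    """Assign the rating with the highest posterior probability."""
    post = posterior(observations, item)
    return max(post, key=post.get)
```

Note that nothing here depends on which user is asking, so every user gets the same prediction for a given item.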

Cute Kitten Picture Intermission

Key to Collaborative Filtering

Common insight: personal tastes are correlated

If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y, especially (perhaps) if Bob knows Alice.

Collaborative Filtering

Collaborative filtering (CF) systems work by collecting user feedback in the form of ratings for items in a given domain and exploiting similarities in rating behavior among several users to determine how to recommend an item.

Collaborative Filtering Dataset

Rows are users; columns are items.

            Items
Users       1   2   .   .   m
1           5   2   5   4
2           2   5   3
.
.           2   2   4   2
n           5   1   5   ?

Goal: Predict ? (the rating of an item) for n (a user).

Types of Collaborative Filtering

1. Neighborhood- or memory-based
2. Model-based
3. Hybrid

We’ll talk about type 1, neighborhood- or memory-based, now.

Neighborhood-based CF

A subset of users is chosen based on their similarity to the active user, and a weighted combination of their ratings is used to produce predictions for that user.

It has three steps:

1. Assign a weight to every user according to their similarity with the active user.
2. Select the k users that have the highest similarity with the active user, commonly called the neighborhood.
3. Compute a prediction from a weighted combination of the selected neighbors’ ratings.

Step 1

In step 1, the weight $w_{a,u}$ is a measure of similarity between the user $u$ and the active user $a$. The most commonly used measure of similarity is the Pearson correlation coefficient between the ratings of the two users:

$$w_{a,u} = \frac{\sum_{i \in I} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in I} (r_{a,i} - \bar{r}_a)^2 \sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}}$$

where $I$ is the set of items rated by both users, $r_{u,i}$ is the rating given to item $i$ by user $u$, and $\bar{r}_u$ is the mean rating given by user $u$.
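The Pearson weight can be sketched in a few lines of plain Python. This is an illustration on hypothetical toy data, not a tuned implementation; returning 0.0 when the denominator vanishes is an assumption made here, since the formula is undefined in that case.

```python
from math import sqrt

# Hypothetical toy data: ratings[user][item] = stars (1 to 5).
ratings = {
    "u1": {"A": 5, "B": 2, "C": 5, "D": 4},
    "u2": {"A": 2, "B": 5, "C": 3},
    "u3": {"A": 2, "B": 2, "C": 4, "D": 2},
    "u4": {"A": 5, "B": 1, "C": 5},
}

def mean_rating(ratings, user):
    """Mean of all ratings given by `user` (r-bar_u)."""
    values = ratings[user].values()
    return sum(values) / len(values)

def pearson(ratings, a, u):
    """Similarity weight w_{a,u}, summed over the items rated by both users."""
    common = set(ratings[a]) & set(ratings[u])
    mean_a, mean_u = mean_rating(ratings, a), mean_rating(ratings, u)
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = sqrt(
        sum((ratings[a][i] - mean_a) ** 2 for i in common)
        * sum((ratings[u][i] - mean_u) ** 2 for i in common)
    )
    return num / den if den else 0.0  # assumption: undefined weight treated as 0

# Weights of every other user with respect to the active user "u4".
weights = {u: pearson(ratings, "u4", u) for u in ratings if u != "u4"}
```

In this toy data, "u1" rates the shared items in the same pattern as "u4" and gets a weight of about 1, while "u2" rates them in the opposite pattern and gets a negative weight.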

Step 2

In step 2, a threshold on the similarity score, or a fixed neighborhood size k, is used to determine the “neighborhood.”
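Both ways of picking the neighborhood are short to write down. The sketch below assumes the weights from step 1 are already available in a dictionary; the values and the function names are hypothetical.

```python
# Hypothetical similarity weights w_{a,u} between the active user and others.
weights = {"u1": 1.0, "u2": -0.94, "u3": 0.49, "u5": 0.10}

def top_k_neighbors(weights, k):
    """Keep the k users with the highest similarity to the active user."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

def threshold_neighbors(weights, min_weight):
    """Keep every user whose similarity is at least `min_weight`."""
    return [u for u, w in weights.items() if w >= min_weight]

neighborhood = top_k_neighbors(weights, k=2)
```

A fixed k guarantees a neighborhood of known size, while a threshold guarantees a minimum quality but may return very few users, or none.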

Step 3

In step 3, predictions are generally computed as the weighted average of deviations from the neighbors’ means, as in:

$$p_{a,i} = \bar{r}_a + \frac{\sum_{u \in K} (r_{u,i} - \bar{r}_u) \times w_{a,u}}{\sum_{u \in K} w_{a,u}}$$

where $p_{a,i}$ is the prediction for the active user $a$ for item $i$, $w_{a,u}$ is the similarity between users $a$ and $u$, and $K$ is the neighborhood or set of most similar users.
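A small sketch of the prediction formula, assuming each neighbor is summarized by a tuple of (weight, rating of the item, mean rating). The function name `predict`, the tuple layout, and the numbers are hypothetical; falling back to the active user's mean when the weights sum to zero is an assumption made here.

```python
def predict(active_mean, neighbors):
    """Prediction p_{a,i} for the active user.

    `neighbors` is a list of (w_au, r_ui, mean_u) tuples, one per neighbor u
    in K who rated item i.
    """
    num = sum(w * (r - m) for w, r, m in neighbors)
    den = sum(w for w, _, _ in neighbors)
    if den == 0:
        return active_mean  # assumption: no usable neighbors, use the user's mean
    return active_mean + num / den

# Hypothetical example: the active user's mean rating is 11/3, and two
# neighbors rated the item 4 and 2 against their own means of 4.0 and 2.5.
prediction = predict(11 / 3, [(1.0, 4, 4.0), (0.49, 2, 2.5)])
```

Working with deviations from each neighbor's mean is what corrects for users who rate systematically high or low.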

Neighborhood-based CF

• Common problems:

– The search for similar users has high computational complexity, so conventional neighborhood-based CF algorithms do not scale well.

– The active user often has highly correlated neighbors whose correlation is based on very few co-rated (overlapping) items; such neighbors often make bad predictors.

– When measuring the similarity between users, items that have been rated by all (and universally liked or disliked) are not as useful as less common items.

Item-to-Item Matching

• An extension to neighborhood-based CF.

• Addresses the problem of high computational complexity of searching for similar users.

• The idea: Rather than matching similar users, match a user’s rated items to similar items.

In this approach, similarities between pairs of items $i$ and $j$ are computed off-line using Pearson correlation, given by:

$$w_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2 \sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}$$

where $U$ is the set of all users who have rated both items $i$ and $j$, $r_{u,i}$ is the rating of user $u$ on item $i$, and $\bar{r}_i$ is the average rating of the $i$th item across users.

Now, the rating for item $i$ for user $a$ can be predicted using a simple weighted average, as in:

$$p_{a,i} = \frac{\sum_{j \in K} r_{a,j}\, w_{i,j}}{\sum_{j \in K} w_{i,j}}$$

where $K$ is the neighborhood set of the $k$ items rated by $a$ that are most similar to $i$, and $r_{a,j}$ is the rating of user $a$ on item $j$.
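Item-to-item matching can be sketched end to end on a small example. The data and function names below are hypothetical; in practice the item similarities would be computed off-line and cached rather than recomputed per prediction. The denominator follows the weighted average as written above, so the sketch returns None when the weights sum to zero, which is an assumption made here.

```python
from math import sqrt

# Hypothetical toy data: ratings[user][item] = stars (1 to 5).
ratings = {
    "u1": {"A": 5, "B": 2, "C": 5, "D": 4},
    "u2": {"A": 2, "B": 5, "C": 3},
    "u3": {"A": 2, "B": 2, "C": 4, "D": 2},
    "u4": {"A": 5, "B": 1, "C": 5},
}

def item_mean(ratings, item):
    """Average rating of `item` across the users who rated it."""
    values = [r[item] for r in ratings.values() if item in r]
    return sum(values) / len(values)

def item_pearson(ratings, i, j):
    """Similarity w_{i,j}, summed over users who rated both items."""
    users = [u for u, r in ratings.items() if i in r and j in r]
    mi, mj = item_mean(ratings, i), item_mean(ratings, j)
    num = sum((ratings[u][i] - mi) * (ratings[u][j] - mj) for u in users)
    den = sqrt(
        sum((ratings[u][i] - mi) ** 2 for u in users)
        * sum((ratings[u][j] - mj) ** 2 for u in users)
    )
    return num / den if den else 0.0

def predict_item_based(ratings, a, i, k):
    """Weighted average of a's ratings on the k items most similar to i."""
    sims = {j: item_pearson(ratings, i, j) for j in ratings[a] if j != i}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    den = sum(sims[j] for j in neighbors)
    if den == 0:
        return None  # assumption: no prediction when the weights cancel out
    return sum(ratings[a][j] * sims[j] for j in neighbors) / den

prediction = predict_item_based(ratings, "u4", "D", k=2)
```

Here the two items most similar to "D" are "A" and "C", both of which "u4" rated 5, so the weighted average is 5.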

Significance Weighting

• Another extension to neighborhood-based CF.

• Addresses the problem of bad predictions that arise when the active user’s highly correlated neighbors are based on very few co-rated (overlapping) items.

• The idea: Multiply the similarity weight by a significance weighting factor, which devalues correlations based on only a few co-rated items.
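The slides do not give a formula for the significance weighting factor. One simple choice, assumed here purely for illustration, scales the weight linearly with the number of co-rated items up to a cutoff; the cutoff of 50 and the function name are assumptions, not something stated in the lecture.

```python
def significance_weight(similarity, n_corated, threshold=50):
    """Devalue `similarity` when it rests on fewer than `threshold` co-rated items.

    Assumed factor: min(n_corated, threshold) / threshold, which is 1.0 once
    the two users share at least `threshold` items.
    """
    return similarity * min(n_corated, threshold) / threshold

# A high correlation based on only 5 shared items is heavily discounted,
# while the same correlation based on 80 shared items is left unchanged.
weak_evidence = significance_weight(0.9, n_corated=5)
strong_evidence = significance_weight(0.9, n_corated=80)
```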

Inverse User Frequency

• Yet another extension to neighborhood-based CF.

• Addresses the problem of the dominance of items that have been rated by all (and universally liked or disliked), yet are not as useful as less common items.

• The idea: Weight an item’s rating by the inverse of the frequency with which that item is rated.

When measuring the similarity between users, items that have been rated by all (and universally liked or disliked) are not as useful as less common items. To account for this, compute

$$f_i = \log \frac{n}{n_i}$$

where $n_i$ is the number of users who have rated item $i$ out of the total number of $n$ users. To apply inverse user frequency while using similarity-based CF, the original rating for item $i$ is transformed by multiplying it by the factor $f_i$.
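A short sketch of the inverse user frequency transform on hypothetical toy data. The slides do not state the base of the logarithm; the natural log is assumed here, and the function names are made up for illustration.

```python
from math import log

# Hypothetical toy data: ratings[user][item] = stars (1 to 5).
ratings = {
    "u1": {"A": 5, "B": 2, "C": 5, "D": 4},
    "u2": {"A": 2, "B": 5, "C": 3},
    "u3": {"A": 2, "B": 2, "C": 4, "D": 2},
    "u4": {"A": 5, "B": 1, "C": 5},
}

def inverse_user_frequency(ratings, item):
    """f_i = log(n / n_i), with n users in total and n_i of them rating `item`."""
    n = len(ratings)
    n_i = sum(1 for r in ratings.values() if item in r)
    return log(n / n_i)  # assumption: natural logarithm

def iuf_transform(ratings):
    """Multiply every rating by the factor f_i of its item."""
    items = {i for r in ratings.values() for i in r}
    factors = {i: inverse_user_frequency(ratings, i) for i in items}
    return {u: {i: x * factors[i] for i, x in r.items()} for u, r in ratings.items()}

transformed = iuf_transform(ratings)
```

In this toy data item "A" is rated by everyone, so its factor is log(1) = 0 and it no longer contributes to the similarity between users, while the less common item "D" keeps a positive weight.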

And Now…

Let’s run “the data mining” on some data!

References

• Prem Melville and Vikas Sindhwani. Recommender Systems. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey Webb (Eds), Springer, 2010.
