algorithms and data structures...pick the n best items of all the predictions and show to user a...

38
Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck Algorithms and Data Structures Spring 2016 JAVA PROJECT

Upload: others

Post on 22-May-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Algorithms and Data Structures

Spring 2016

JAVA PROJECT

Page 2: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Practical information

Work in groups of 2 → Register on Minerva before 31/03

No collaboration between different groups

Hand in: Deadline 05/05, both for Java files and written report

Short informal talk about project during last lab session 11/05

Questions can be asked during the second Java lab session

E-mail: [email protected]

No questions, other than practical, will be answered after 01/05

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 3: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system

A recommender system is a system that seeks to predict which items a

user likes and/or dislikes. This is done by estimating the rating that a

user would give to an item.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 4: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

If you want to make predictions for a user A, then these predictions are based

on similar users. Similar users are defined as users having rated the same items

with a similar number.

The underlying assumption of the collaborative filtering approach is that if a

person A has the same opinion as a person B on an item, A is more likely to

have B's opinion on a different item x than to have the opinion on x of a person

chosen randomly

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 5: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

How to make predictions for user A

1. Find similar users for user A

2. Average the ratings of the similar users for unrated items of user A.

3. Pick the n best items of all the predictions and show to user A

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

items = movies ⇒

Page 6: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

/2 /2 /1 /1

/2

/2

Implement one: 7.5

Implement both: 10

Page 7: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 8: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Calculate the distance between two users

A user is represented as a vector consisting of its ratings.

The distance between two users 𝐴 = (𝐴1, 𝐴2, … , 𝐴𝑛) and 𝐵 = 𝐵1, 𝐵2, … , 𝐵𝑛

is calculated using the cosine distance:

d 𝐴, 𝐵 = 1 − 𝐴𝑖 ∗ 𝐵𝑖𝑖

𝐴𝑖2

𝑖 ∗ 𝐵𝑖2

𝑖

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie x Movie y Movie z

User A 5 1

User B 1 2

𝑑 𝐴, 𝐵 = 1 − 0 ∗ 1 + 5 ∗ 0 + 1 ∗ 2

25 + 1 ∗ 1 + 4

Not rated = 0

Page 9: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Calculate the distance between two users

Special case:

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie x Movie y Movie z

User A 5

User B 1 2

𝑑 𝐴, 𝐵 = +∞

Movie x Movie y Movie z

User A 0 5

User B 1 2

𝑑 𝐴, 𝐵 = 1

Page 10: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 11: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Normalising the data

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie x Movie y Movie z

User A 5 5

User B 3 3

User C 3 1

Similar users, just different scale

Normalised ratings of user A = ratings of user A – average rating of user A

Movie x Movie y Movie z

User A 0 0

User B 0 0

User C 1 -1

If all values are the same, normalised

ratings = 0. If we leave it this way,

then distance will be +∞. To avoid

this, we add noise in this case.

Movie x Movie y Movie z

User A 0.000127 0.001201

User B 0.000541 0.000143

User C 1 -1

Page 12: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 13: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar user: Exhaustive

search Assume that the user who wants predictions is user A.

1. For all other users X

a. Calculate the distance 𝑑 𝐴, 𝑋

b. Keep track of the n closest users

Handy data structure: fixed size priority queue

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 14: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Fixed Size Priority Queue

= priority queue with fixed size

Assume priority = distance, highest priority = highest distance, and size = n

When we pop (remove one element from the queue), the element with

highest distance will be removed

Only the n elements with smallest distance will remain.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 15: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Fixed Sized Priority Queue: example with n = 3

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Add user B with d(A,B) = 5

Add user C with d(A,C) = 10

Add user D with d(A,D) = 7

Add user E with d(A,E) = 8

B, 5

B, 5 C, 10

B, 5 C, 10 D, 7

B, 5 D, 7

B, 5 E,8 D, 7

Page 16: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 17: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Average the ratings

Input: the set of similar users for user A (here user B and C) and their ratings

for all movies

Output: the predicted ratings for unseen movies for user A

prediction for movie x = average of the ratings for movie x given by

the similar users (if a similar user didn’t rate

movie x, then he doesn’t count for the

average)

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z

User A 1 -1

User B 1/3 1/3 -2/3

User C 1 -1

Movie w Movie x Movie y Movie z

User A 1 1/3 -2.5/3 -1

User B 1/3 1/3 -2/3

User C 1 -1

Page 18: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 19: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

[Optional] Average the ratings with weights

prediction for movie x = weighted average of the ratings for movie x given by

the similar users (if a similar user didn’t rate movie x,

then he doesn’t count for the average)

The closer a user is, the higher its assigned weight. The furthest similar user

will get weight 1, the second furthest similar user gets 11,…, and the most

similar user will get 1 + n*10, where n is the number of similar users who have

rated the movie

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z Weight

User A 1 1/3 -32/33 -1

User B 1/3 1/3 -2/3 1

User C 1 -1 10

Page 20: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Redo the normalisation

We already normalised earlier, so we restore the data by re-adding the

average rating of a user.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z

User A 1 1/3 -2.5/3 -1 average rating user A = 2

Movie w Movie x Movie y Movie z

User A 3 2.33 1.166 1

Page 21: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 22: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Select the top m recommendations

The m best recommendations = the m recommendations with the highest

predicted scores.

Again, the fixed size priority queue can be used.

Assume priority = rating, highest priority = highest rating, and size = m

When we pop (remove one element from the queue), the element with the

highest predicted score will be removed

Only the elements with the m highest scores will remain.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 23: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 24: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar user: branch and

bound approach For the branch and bound approach a k-d tree is constructed. This is a data

structure used for organizing some points in a k-dimensional space.

More specifically, it is a binary search tree where each internal node has the

following properties:

• Dimension 𝑑 on which the split should happen

• Left child: subtree containing the points 𝑝 for which 𝑝𝑑< median value of dimension 𝑑

• Right child: subtree containing the points 𝑝 for which 𝑝𝑑> median value of dimension 𝑑

Only the leaf nodes contain data points.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 25: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Constructing a k-d tree K-DTree(group):

1. If enough elements in the group:

a. While(good dimension is not yet found):

I. Select the best dimension to split on (see next slide)

II. If good dimension is found:

i. Split the group into left children and right children

ii. If the split created two groups:

- leftChild = K-DTree(left children)

- rightChild = K-DTree(right children)

iii. Else:

- Find other dimension

III. Else:

i. Make this group a leaf node

2. Else:

a. Make this group a leaf node

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 26: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Constructing a k-d tree: selecting the best

dimension to split over A dimension = a movie

The best movie to split on = the movie that is rated the most often. This amount

should be larger than a predefined amount (eg. 1)

Here the best movie would be movie w

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z

User A 1 -1

User B 1 1/3 -4/3

User C 1 -1

Page 27: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Constructing a k-d tree: split the group

The group is divided based on the median value of the chosen dimension.

Here the chosen dimension = movie w, and the median of the ratings [1,1,1] is 1.

All users with a rating for movie w smaller than 1, are assigned to the left

child and all users with a rating for movie w greater or equal to 1 to the right

child.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z

User A 1 -1

User B 1 1/3 -4/3

User C 1 -1

Left children = {}

Right children = {A,B,C}

Not a good split, need to

find another dimension

Page 28: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Constructing a k-d tree: finding the best

dimension The best movie to split on = the movie with the most ratings but not yet tried

out for splitting

Movie w is not possible anymore, so the next

movie having the most ratings is Movie y

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z

User A 1 -1

User B 1 1/3 -4/3

User C 1 -1

Page 29: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

construction k-d tree: split the group

Here the chosen dimension = movie y, and the median of the ratings [-4/3 -1]

is -3/6.

What to do with users that didn’t rate the movie? Assign it to the left child.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Movie w Movie x Movie y Movie z

User A 1 -1

User B 1 1/3 -4/3

User C 1 -1

Left children = {A,B}

Right children = {C}

Page 30: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Constructing a k-d tree: finding best

dimension It is possible that a good dimension can’t be found. This happens

when all the movies in a group are only rated less than the required predefined

amount.

In that case you should make this node a leaf node.

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Page 31: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar users: branch and

bound method

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

If you want to find the n most similar users of user A:

1. Descend the created k-d tree to find the leaf node containing user A

2. In this leaf node, you can assume that the amount of users present > n

Now, you just need to select the n most similar users from this group in an

exhaustive way.

Page 32: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Recommender system: collaborative filtering

Similar users Average ratings Select the top m

Exhaustive

search

Branch and

bound Hashing

Distance

between 2

users

Normalise the

data Normal Weighted

Select best

dimension Create

division

Search in

k-d tree

Create the

k-d tree

Create bands Create

buckets

Search

buckets

Page 33: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar users: hashing

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

1. Take the ratings matrix and divide the columns per band

Here the amount of bands b is 4 and the amount of movies in a band is 2

Movie s Movie t Movie u Movie v Movie w Movie x Movie y Movie z

User A 1 -1/2 -1/2

User B 1 1/3 -4/3 2/3 -2/3

User C 1 -1 1 -1

Band 1 1 2 2 3 3 4 4

Page 34: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar users: hashing

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

1. Take the ratings matrix and divide the columns per band

Movie s Movie t Movie u Movie v Movie w Movie x Movie y Movie z

User A 1 -1/2 -1/2

User B 1 1/3 -4/3 2/3 -2/3

User C 1 -1 1 -1

Band 1 1 2 2 3 3 4 4

Page 35: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar users: hashing

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

1. Take the ratings matrix and divide the columns per band

2. For each band, apply a hash function,

(sign of average) * [ (the rounded value of (the average * 100)) modulo

(amount of buckets) ]

and create buckets

Assume in this example amount of buckets = 10

Band 1 Band 2 Band 3 Band 4

User A 1 0 0

User B 7 -3 7 -7

User C 1 -1 1 -1

Hash User

Band 1 1 A,C

7 B

Band 2 0 A

-1 C

-3 B

Band 3 7 B

1 C

…..

Page 36: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Finding the n most similar users: hashing

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

So to find similar users for user A, you first calculate for each band its hash

value, resulting in b hashes.

In each band, you select all the users belonging to the same bucket to which

user A belongs. These must be added to a collection of possible similar users.

To obtain the final n most similar users, you again work in an exhaustive

search.

Page 37: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Submission details

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

When: Deadline 05/05 (11:59 p.m.)

How: Submit using Dropbox functionality of Minerva: “Alle cursusbeheerders”

What: 1 zip file named GroupX_LastName1_LastName2.zip containing:

- The written report in pdf format

- The relevant Java files

Do not change the package structure!

Do not submit entire Eclipse or other project folders

Do not submit class file

Naming: - X = Number of group on Minerva

- Mention your group and names in the written report and the Java files

Page 38: Algorithms and Data Structures...Pick the n best items of all the predictions and show to user A Java project - Algorithms and Data Structures, Spring 2016 Leen De Baets & Joeri Ruyssinck

Some tips

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Be sure to complete the tasks mentioned in the Java code.

You can add as much code as you want, but do not change existing code.

Do not neglect your written report. It can be short but needs to be precise. Answer the

questions present in RecommenderSystem.java. If asked what the time or memory

complexity is, a high level reasoning is enough. You do not need to prove it.

Try to work together in your group and do not simply divide the workload.