algorithms and data structures...pick the n best items of all the predictions and show to user a...

Java project - Algorithms and Data Structures, Spring 2016

Leen De Baets & Joeri Ruyssinck

Algorithms and Data Structures

Spring 2016

JAVA PROJECT

Practical information

Work in groups of 2 → Register on Minerva before 31/03

No collaboration between different groups

Hand in: Deadline 05/05, both for Java files and written report

Short informal talk about project during last lab session 11/05

Questions can be asked during the second Java lab session

E-mail: [email protected]

No questions, other than practical, will be answered after 01/05



Recommender system

A recommender system is a system that seeks to predict which items a

user likes and/or dislikes. This is done by estimating the rating that a

user would give to an item.



Recommender system: collaborative filtering

Predictions for a user A are based on the ratings of similar users. Similar users are users who have rated the same items with similar ratings.

The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an item, A is more likely to share B's opinion on a different item x than to share the opinion on x of a randomly chosen person.




How to make predictions for user A

1. Find similar users for user A

2. Average the ratings of the similar users for the items that user A has not rated.

3. Pick the m best items of all the predictions and show them to user A.



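In this project, items = movies.

[Figure: project overview. The three steps are: Similar users, Average ratings, Select the top m. Similar users can be found by Exhaustive search, Branch and bound, or Hashing. Components shown: Distance between 2 users, Normalise the data; Select best dimension, Create division, Create the k-d tree, Search in k-d tree; Create bands, Create buckets, Search buckets. Average ratings: Normal, Weighted. Marks shown: /2, /2, /1, /1, /2, /2. Implement one: 7.5. Implement both: 10.]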




Calculate the distance between two users

A user is represented as a vector consisting of its ratings.

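The distance between two users A = (A_1, A_2, …, A_n) and B = (B_1, B_2, …, B_n) is calculated using the cosine distance:

$$d(A, B) = 1 - \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}}$$

Example (- = not rated, counted as 0):

          Movie x   Movie y   Movie z
User A    -         5         1
User B    1         -         2

$$d(A, B) = 1 - \frac{0 \cdot 1 + 5 \cdot 0 + 1 \cdot 2}{\sqrt{25 + 1} \, \sqrt{1 + 4}} \approx 0.82$$

Not rated = 0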
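The formula above can be sketched in Java as follows. This is a minimal sketch, not the project's reference code: the class name `CosineDistance`, the `double[]` representation (0 = not rated) and the choice to return +∞ for a zero-norm vector are assumptions made for illustration. The other special cases from the slides are not modelled here.

```java
/** Sketch of the cosine distance between two rating vectors (0 = not rated). */
final class CosineDistance {

    private CosineDistance() { }

    /**
     * Returns 1 - (a . b) / (|a| * |b|).
     * Assumption: if either vector has norm 0 the quotient is undefined,
     * and this sketch returns +infinity in that case.
     */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

For the example above, `CosineDistance.distance(new double[] {0, 5, 1}, new double[] {1, 0, 2})` evaluates 1 - 2 / (sqrt(26) * sqrt(5)).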

Calculate the distance between two users

Special case:




User A    5
User B    1    2
d(A, B) = +∞

User A    0    5
User B    1    2
d(A, B) = 1




Normalising the data




User A    5    5
User B    3    3
User C    3    1

Users A and B are similar, they just use a different scale.

Normalised ratings of user A = ratings of user A – average rating of user A

User A    0    0
User B    0    0
User C    1    -1

If all values are the same, the normalised ratings are all 0. If we leave it this way, the distance will be +∞. To avoid this, we add noise in this case.

User A    0.000127    0.001201
User B    0.000541    0.000143
User C    1           -1
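The normalisation step could look roughly like the sketch below. The class name `RatingNormaliser`, the map representation (movie name to rating, containing only rated movies) and the noise range `NOISE_SCALE` are assumptions for illustration; the slides only say that noise is added when all ratings of a user are equal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Sketch: subtract the user's average rating, or add noise if all ratings are equal. */
final class RatingNormaliser {

    /** Hypothetical upper bound for the noise; not specified in the assignment. */
    static final double NOISE_SCALE = 1e-3;

    private RatingNormaliser() { }

    static Map<String, Double> normalise(Map<String, Double> ratings, Random random) {
        Map<String, Double> result = new HashMap<>();
        if (ratings.isEmpty()) {
            return result;
        }
        double first = ratings.values().iterator().next();
        boolean allSame = true;
        double sum = 0.0;
        for (double r : ratings.values()) {
            sum += r;
            if (r != first) {
                allSame = false;
            }
        }
        double average = sum / ratings.size();
        for (Map.Entry<String, Double> e : ratings.entrySet()) {
            if (allSame) {
                // nextDouble() lies in [0, 1); flipping it gives (0, 1], so the noise is never exactly 0.
                result.put(e.getKey(), (1.0 - random.nextDouble()) * NOISE_SCALE);
            } else {
                result.put(e.getKey(), e.getValue() - average);
            }
        }
        return result;
    }
}
```

Passing the `Random` in as a parameter keeps the sketch reproducible in tests; a fixed seed gives the same noise on every run.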




Finding the n most similar users: exhaustive search

Assume that the user who wants predictions is user A.

1. For all other users X

a. Calculate the distance 𝑑 𝐴, 𝑋

b. Keep track of the n closest users

Handy data structure: fixed size priority queue
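The loop above might be sketched as follows. The class `ExhaustiveSearch`, the helper `Neighbour` and the `double[]` user vectors (0 = not rated) are hypothetical names and representations chosen for this illustration; the bounded queue is built directly on `java.util.PriorityQueue`.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Sketch: find the n users closest to a target user by checking every other user. */
final class ExhaustiveSearch {

    private ExhaustiveSearch() { }

    /** A candidate neighbour together with its distance to the target user. */
    private static final class Neighbour {
        final String id;
        final double distance;

        Neighbour(String id, double distance) {
            this.id = id;
            this.distance = distance;
        }
    }

    /** Returns the ids of the n closest users, nearest first. */
    static List<String> nearest(String target, Map<String, double[]> users, int n) {
        // Largest distance at the head, so poll() evicts the worst candidate.
        PriorityQueue<Neighbour> queue =
                new PriorityQueue<>((x, y) -> Double.compare(y.distance, x.distance));
        double[] a = users.get(target);
        for (Map.Entry<String, double[]> other : users.entrySet()) {
            if (other.getKey().equals(target)) {
                continue;
            }
            queue.add(new Neighbour(other.getKey(), cosineDistance(a, other.getValue())));
            if (queue.size() > n) {
                queue.poll();
            }
        }
        LinkedList<String> result = new LinkedList<>();
        while (!queue.isEmpty()) {
            result.addFirst(queue.poll().id); // polled farthest first, so prepend
        }
        return result;
    }

    /** Cosine distance; returns +infinity for a zero-norm vector (assumption). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Each of the other users is visited once, and the queue never holds more than n + 1 elements.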



Fixed Size Priority Queue

= priority queue with fixed size

Assume priority = distance, highest priority = highest distance, and size = n

When we pop (remove one element from the queue), the element with

highest distance will be removed

Only the n elements with smallest distance will remain.



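Fixed Size Priority Queue: example with n = 3

Add user B with d(A,B) = 5:  queue = (B, 5)
Add user C with d(A,C) = 10: queue = (B, 5) (C, 10)
Add user D with d(A,D) = 7:  queue = (B, 5) (C, 10) (D, 7)
Add user E with d(A,E) = 8:  the queue is full, so the element with the highest distance, (C, 10), is popped: queue = (B, 5) (D, 7). Then E is added: queue = (B, 5) (E, 8) (D, 7)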
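One possible way to build such a queue is to wrap `java.util.PriorityQueue`, as in the sketch below. The class name `FixedSizePriorityQueue` and its methods are assumptions for illustration and need not match the class provided with the project. Note that this sketch adds the new element first and then pops, which gives the same result as the example above and also handles a new element that is itself the worst one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch: a priority queue that never holds more than a fixed number of elements. */
final class FixedSizePriorityQueue<T> {

    private final int capacity;
    private final Comparator<T> byPriority;
    private final PriorityQueue<T> heap;

    /** byPriority orders elements from lowest to highest priority. */
    FixedSizePriorityQueue(int capacity, Comparator<T> byPriority) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.byPriority = byPriority;
        // java.util.PriorityQueue keeps the smallest element at its head,
        // so the comparator is reversed to keep the highest priority there.
        this.heap = new PriorityQueue<>(byPriority.reversed());
    }

    /** Adds an element; if the queue overflows, the highest-priority element is dropped. */
    void add(T element) {
        heap.add(element);
        if (heap.size() > capacity) {
            heap.poll();
        }
    }

    /** Removes and returns the element with the highest priority, or null if empty. */
    T pop() {
        return heap.poll();
    }

    int size() {
        return heap.size();
    }

    /** Returns the current elements, lowest priority first. */
    List<T> elements() {
        List<T> sorted = new ArrayList<>(heap);
        sorted.sort(byPriority);
        return sorted;
    }
}
```

With priority = distance and capacity 3, adding the distances 5, 10, 7 and 8 leaves 5, 7 and 8 in the queue, as in the example.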




Average the ratings

Input: the set of similar users for user A (here user B and C) and their ratings

for all movies

Output: the predicted ratings for unseen movies for user A

prediction for movie x = average of the ratings for movie x given by the similar users (a similar user who didn't rate movie x is not counted in the average)

Before prediction:

          Movie w   Movie x   Movie y   Movie z
User A    1         -         -         -1
User B    1/3       1/3       -2/3      -
User C    1         -         -1        -

After prediction for user A:

          Movie w   Movie x   Movie y   Movie z
User A    1         1/3       -2.5/3    -1
User B    1/3       1/3       -2/3      -
User C    1         -         -1        -
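The averaging step can be illustrated with the sketch below. The class name `AveragePredictor` and the map representation (movie name to normalised rating, containing only rated movies) are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: predict ratings for the target's unrated movies by plain averaging. */
final class AveragePredictor {

    private AveragePredictor() { }

    /**
     * For every movie the target has not rated, returns the average of the
     * ratings given by the similar users who did rate it.
     */
    static Map<String, Double> predict(Map<String, Double> target,
                                       List<Map<String, Double>> similarUsers) {
        Map<String, Double> sums = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Double> ratings : similarUsers) {
            for (Map.Entry<String, Double> e : ratings.entrySet()) {
                if (target.containsKey(e.getKey())) {
                    continue; // the target already rated this movie
                }
                sums.merge(e.getKey(), e.getValue(), Double::sum);
                counts.merge(e.getKey(), 1, Integer::sum);
            }
        }
        Map<String, Double> predictions = new HashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            predictions.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
        }
        return predictions;
    }
}
```

With the ratings of users B and C from the example, this yields 1/3 for movie x (only B rated it) and -2.5/3 for movie y (the average of -2/3 and -1).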




[Optional] Average the ratings with weights

prediction for movie x = weighted average of the ratings for movie x given by the similar users (a similar user who didn't rate movie x is not counted in the average)

The closer a user is, the higher its assigned weight. The furthest similar user will get weight 1, the second furthest similar user gets 11, …, and the most similar user will get 1 + n*10, where n is the number of similar users who have rated the movie.

          Movie w   Movie x   Movie y   Movie z   Weight
User A    1         1/3       -32/33    -1
User B    1/3       1/3       -2/3      -         1
User C    1         -         -1        -         10
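A weighted average can be sketched as below. The class name `WeightedPredictor` is hypothetical, and the weights are passed in by the caller rather than derived from the distance ranking, so this sketch makes no assumption about the exact weighting scheme. The check at the end uses the weights 1 and 10 from the table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: predict ratings for unrated movies as a weighted average. */
final class WeightedPredictor {

    private WeightedPredictor() { }

    /**
     * weights[i] is the weight of similarUsers.get(i). A user who did not
     * rate a movie contributes neither a rating nor a weight to that movie.
     */
    static Map<String, Double> predict(Map<String, Double> target,
                                       List<Map<String, Double>> similarUsers,
                                       double[] weights) {
        if (weights.length != similarUsers.size()) {
            throw new IllegalArgumentException("one weight per similar user is required");
        }
        Map<String, Double> weightedSums = new HashMap<>();
        Map<String, Double> weightTotals = new HashMap<>();
        for (int i = 0; i < similarUsers.size(); i++) {
            for (Map.Entry<String, Double> e : similarUsers.get(i).entrySet()) {
                if (target.containsKey(e.getKey())) {
                    continue; // the target already rated this movie
                }
                weightedSums.merge(e.getKey(), weights[i] * e.getValue(), Double::sum);
                weightTotals.merge(e.getKey(), weights[i], Double::sum);
            }
        }
        Map<String, Double> predictions = new HashMap<>();
        for (Map.Entry<String, Double> e : weightedSums.entrySet()) {
            predictions.put(e.getKey(), e.getValue() / weightTotals.get(e.getKey()));
        }
        return predictions;
    }
}
```

For movie y this computes (1 * (-2/3) + 10 * (-1)) / (1 + 10) = -32/33, the value shown for user A.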

Undo the normalisation

We normalised the ratings earlier, so we restore the original scale by adding the average rating of the user back to each value.

Average rating of user A = 2

          Movie w   Movie x   Movie y   Movie z
User A    1         1/3       -2.5/3    -1        (normalised)
User A    3         2.33      1.167     1         (restored)
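Restoring the scale is a one-line operation per rating. A minimal sketch, assuming a hypothetical class `RatingRestorer` and the same map representation (movie name to rating):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: add the user's average rating back to normalised ratings. */
final class RatingRestorer {

    private RatingRestorer() { }

    static Map<String, Double> restore(Map<String, Double> normalised, double averageRating) {
        Map<String, Double> restored = new HashMap<>();
        for (Map.Entry<String, Double> e : normalised.entrySet()) {
            restored.put(e.getKey(), e.getValue() + averageRating);
        }
        return restored;
    }
}
```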




Select the top m recommendations

The m best recommendations = the m recommendations with the highest

predicted scores.

Again, the fixed size priority queue can be used.

Assume priority = rating, highest priority = lowest rating, and size = m

When we pop (remove one element from the queue), the element with the lowest predicted score will be removed

Only the elements with the m highest scores will remain.
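Selecting the top m can be sketched with a bounded heap that evicts the lowest predicted score whenever it grows beyond m elements, so that the m highest scores remain. The class name `TopM` and the map representation (movie name to predicted rating) are assumptions for illustration.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Sketch: pick the m movies with the highest predicted ratings. */
final class TopM {

    private TopM() { }

    /** Returns the m best movies, best first. */
    static List<String> select(Map<String, Double> predictions, int m) {
        // Lowest predicted score at the head, so poll() evicts the worst movie.
        PriorityQueue<Map.Entry<String, Double>> heap =
                new PriorityQueue<>((x, y) -> Double.compare(x.getValue(), y.getValue()));
        for (Map.Entry<String, Double> e : predictions.entrySet()) {
            heap.add(e);
            if (heap.size() > m) {
                heap.poll();
            }
        }
        LinkedList<String> best = new LinkedList<>();
        while (!heap.isEmpty()) {
            best.addFirst(heap.poll().getKey()); // polled worst first, so prepend
        }
        return best;
    }
}
```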






Finding the n most similar users: branch and bound approach

For the branch and bound approach a k-d tree is constructed. This is a data structure used for organizing points in a k-dimensional space.

More specifically, it is a binary search tree where each internal node has the following properties:

• Dimension d on which the split should happen

• Left child: subtree containing the points p for which p_d < median value of dimension d

• Right child: subtree containing the points p for which p_d ≥ median value of dimension d

Only the leaf nodes contain data points.



Constructing a k-d tree

K-DTree(group):

1. If enough elements in the group:
   a. While (good dimension is not yet found):
      I. Select the best dimension to split on (see next slide)
      II. If good dimension is found:
          i. Split the group into left children and right children
          ii. If the split created two groups:
              - leftChild = K-DTree(left children)
              - rightChild = K-DTree(right children)
          iii. Else:
              - Find other dimension
      III. Else:
          i. Make this group a leaf node
2. Else:
   a. Make this group a leaf node



Constructing a k-d tree: selecting the best dimension to split over

A dimension = a movie

The best movie to split on = the movie that is rated the most often. The number of ratings should be larger than a predefined amount (e.g. 1).

Here the best movie would be movie w.

          Movie w   Movie x   Movie y   Movie z
User A    1         -         -         -1
User B    1         1/3       -4/3      -
User C    1         -         -1        -
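Selecting the dimension can be sketched as counting ratings per movie. The class name `DimensionSelector`, the `tried` set (movies already rejected for this group) and the alphabetical tie-break are assumptions for illustration; the slides do not say how ties are broken.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: choose the movie with the most ratings as the split dimension. */
final class DimensionSelector {

    private DimensionSelector() { }

    /**
     * Returns the most-rated movie that has not been tried yet and has more
     * than minRatings ratings, or null if no such movie exists.
     */
    static String selectBest(Collection<Map<String, Double>> group,
                             Set<String> tried, int minRatings) {
        // TreeMap iterates alphabetically, which makes ties deterministic (assumption).
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, Double> ratings : group) {
            for (String movie : ratings.keySet()) {
                if (!tried.contains(movie)) {
                    counts.merge(movie, 1, Integer::sum);
                }
            }
        }
        String best = null;
        int bestCount = minRatings;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Returning `null` corresponds to the case where no good dimension can be found and the group becomes a leaf node.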

Constructing a k-d tree: split the group

The group is divided based on the median value of the chosen dimension.

Here the chosen dimension = movie w, and the median of the ratings [1,1,1] is 1.

All users with a rating for movie w smaller than 1 are assigned to the left child, and all users with a rating for movie w greater than or equal to 1 to the right child.

          Movie w   Movie x   Movie y   Movie z
User A    1         -         -         -1
User B    1         1/3       -4/3      -
User C    1         -         -1        -

Left children = {}

Right children = {A,B,C}

Not a good split, need to find another dimension.

Constructing a k-d tree: finding the best dimension

The best movie to split on = the movie with the most ratings that has not yet been tried for splitting.

Movie w is not possible anymore, so the next movie having the most ratings is movie y.

          Movie w   Movie x   Movie y   Movie z
User A    1         -         -         -1
User B    1         1/3       -4/3      -
User C    1         -         -1        -

Constructing a k-d tree: split the group

Here the chosen dimension = movie y, and the median of the ratings [-4/3, -1] is -7/6.

What to do with users that didn't rate the movie? Assign them to the left child.

          Movie w   Movie x   Movie y   Movie z
User A    1         -         -         -1
User B    1         1/3       -4/3      -
User C    1         -         -1        -

Left children = {A,B}

Right children = {C}
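The split rule can be sketched as follows: users below the median, or without a rating for the chosen movie, go left; all others go right. The class name `MedianSplit` and the return shape (a two-element list holding the left and right user ids) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Sketch: split a group of users on the median rating of one movie. */
final class MedianSplit {

    private MedianSplit() { }

    /** Median of a non-empty list; the mean of the two middle values if the size is even. */
    static double median(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("at least one rating is required");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        return sorted.size() % 2 == 1
                ? sorted.get(mid)
                : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    /** Returns [left, right]: the user ids of the left and right children. */
    static List<List<String>> split(Map<String, Map<String, Double>> group, String movie) {
        List<Double> rated = new ArrayList<>();
        for (Map<String, Double> ratings : group.values()) {
            Double r = ratings.get(movie);
            if (r != null) {
                rated.add(r);
            }
        }
        double median = median(rated);
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> user : group.entrySet()) {
            Double r = user.getValue().get(movie);
            if (r == null || r < median) {
                left.add(user.getKey()); // unrated or below the median
            } else {
                right.add(user.getKey()); // greater than or equal to the median
            }
        }
        return Arrays.asList(left, right);
    }
}
```

On the example group, splitting on movie w leaves the left side empty, while splitting on movie y gives {A,B} and {C}.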

Constructing a k-d tree: finding the best dimension

It is possible that a good dimension can't be found. This happens when no movie in the group has more ratings than the required predefined amount.

In that case you should make this node a leaf node.



Finding the n most similar users: branch and bound method

If you want to find the n most similar users of user A:

1. Descend the created k-d tree to find the leaf node containing user A

2. In this leaf node, you can assume that the number of users present is greater than n

Now, you just need to select the n most similar users from this group with an exhaustive search.
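The descent in step 1 can be sketched as a loop that follows the same rule used when splitting: unrated or below the median goes left, otherwise right. The class `KdNode` and its fields are hypothetical and only model what the descent needs; tree construction is not shown.

```java
import java.util.Map;
import java.util.Set;

/** Sketch of a k-d tree node and the descent to the leaf of a given user. */
final class KdNode {

    final String dimension;   // movie to split on (internal nodes only)
    final double median;      // split value (internal nodes only)
    final KdNode left;
    final KdNode right;
    final Set<String> users;  // user ids (leaf nodes only)

    private KdNode(String dimension, double median, KdNode left, KdNode right,
                   Set<String> users) {
        this.dimension = dimension;
        this.median = median;
        this.left = left;
        this.right = right;
        this.users = users;
    }

    static KdNode leaf(Set<String> users) {
        return new KdNode(null, 0.0, null, null, users);
    }

    static KdNode internal(String dimension, double median, KdNode left, KdNode right) {
        return new KdNode(dimension, median, left, right, null);
    }

    boolean isLeaf() {
        return users != null;
    }

    /** Returns the users stored in the leaf reached with the given ratings. */
    static Set<String> findLeaf(KdNode root, Map<String, Double> ratings) {
        KdNode node = root;
        while (!node.isLeaf()) {
            Double r = ratings.get(node.dimension);
            // Same rule as during construction: unrated or below the median goes left.
            node = (r == null || r < node.median) ? node.left : node.right;
        }
        return node.users;
    }
}
```

The n most similar users are then chosen from the returned set (excluding user A itself) with an exhaustive search.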




Finding the n most similar users: hashing



1. Take the ratings matrix and divide the columns into bands

Here the number of bands b is 4 and the number of movies in a band is 2

Movie s Movie t Movie u Movie v Movie w Movie x Movie y Movie z

User A 1 -1/2 -1/2

User B 1 1/3 -4/3 2/3 -2/3

User C 1 -1 1 -1

Band 1 1 2 2 3 3 4 4










2. For each band, apply a hash function to the average of the user's ratings in that band,

(sign of average) * [ (the rounded value of (the average * 100)) modulo (number of buckets) ]

and create buckets.

Assume in this example that the number of buckets = 10.

Band 1 Band 2 Band 3 Band 4

User A 1 0 0

User B 7 -3 7 -7

User C 1 -1 1 -1

Band      Hash   Users
Band 1    1      A, C
          7      B
Band 2    0      A
          -1     C
          -3     B
Band 3    7      B
          1      C

…..




So to find similar users for user A, you first calculate the hash value of each band for user A, resulting in b hashes.

In each band, you select all the users belonging to the same bucket as user A. These are added to a collection of possible similar users.

To obtain the final n most similar users, you again run an exhaustive search over this collection.
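The banding, hashing and bucket lookup can be sketched as below. Several points are assumptions made for this illustration: the class name `BandHashing`, the `double[]` representation with `Double.NaN` for unrated movies, skipping a band in which the user rated nothing, and applying the modulo to the absolute value of the rounded average. That last interpretation was chosen because it reproduces the hashes 7, -3, 7, -7 shown for user B.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: hash each band of a rating vector and collect users sharing a bucket. */
final class BandHashing {

    private BandHashing() { }

    /** Hash of ratings[from, to), or null if no movie in the band was rated (NaN = not rated). */
    static Integer hashBand(double[] ratings, int from, int to, int buckets) {
        double sum = 0.0;
        int count = 0;
        for (int i = from; i < to && i < ratings.length; i++) {
            if (!Double.isNaN(ratings[i])) {
                sum += ratings[i];
                count++;
            }
        }
        if (count == 0) {
            return null;
        }
        double average = sum / count;
        int magnitude = (int) (Math.round(Math.abs(average) * 100) % buckets);
        return average < 0 ? -magnitude : magnitude;
    }

    /** One hash per band, in band order. */
    static List<Integer> hashes(double[] ratings, int bandSize, int buckets) {
        List<Integer> result = new ArrayList<>();
        for (int from = 0; from < ratings.length; from += bandSize) {
            result.add(hashBand(ratings, from, from + bandSize, buckets));
        }
        return result;
    }

    /** Users that share at least one bucket with the target (all vectors have equal length). */
    static Set<String> candidates(String target, Map<String, double[]> users,
                                  int bandSize, int buckets) {
        List<Integer> targetHashes = hashes(users.get(target), bandSize, buckets);
        // bucketsPerBand.get(b) maps a hash value to the users whose band b hashed to it.
        List<Map<Integer, Set<String>>> bucketsPerBand = new ArrayList<>();
        for (int b = 0; b < targetHashes.size(); b++) {
            bucketsPerBand.add(new HashMap<>());
        }
        for (Map.Entry<String, double[]> user : users.entrySet()) {
            List<Integer> h = hashes(user.getValue(), bandSize, buckets);
            for (int b = 0; b < h.size(); b++) {
                if (h.get(b) != null) {
                    bucketsPerBand.get(b)
                            .computeIfAbsent(h.get(b), k -> new TreeSet<>())
                            .add(user.getKey());
                }
            }
        }
        Set<String> result = new TreeSet<>();
        for (int b = 0; b < targetHashes.size(); b++) {
            Integer h = targetHashes.get(b);
            if (h != null) {
                result.addAll(bucketsPerBand.get(b).getOrDefault(h, Collections.emptySet()));
            }
        }
        result.remove(target);
        return result;
    }
}
```

The returned candidates are then ranked with an exhaustive search to obtain the final n most similar users.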

Submission details



When: Deadline 05/05 (11:59 p.m.)

How: Submit using Dropbox functionality of Minerva: “Alle cursusbeheerders”

What: 1 zip file named GroupX_LastName1_LastName2.zip containing:

- The written report in pdf format

- The relevant Java files

Do not change the package structure!

Do not submit entire Eclipse or other project folders

Do not submit .class files

Naming: - X = Number of group on Minerva

- Mention your group and names in the written report and the Java files

Some tips



Be sure to complete the tasks mentioned in the Java code.

You can add as much code as you want, but do not change existing code.

Do not neglect your written report. It can be short but needs to be precise. Answer the

questions present in RecommenderSystem.java. If asked what the time or memory

complexity is, a high level reasoning is enough. You do not need to prove it.

Try to work together in your group and do not simply divide the workload.
