Java project - Algorithms and Data Structures, Spring 2016
Leen De Baets & Joeri Ruyssinck
Algorithms and Data Structures
Spring 2016
JAVA PROJECT
Practical information
Work in groups of 2 → Register on Minerva before 31/03
No collaboration between different groups
Hand in: Deadline 05/05, both for Java files and written report
Short informal talk about project during last lab session 11/05
Questions can be asked during the second Java lab session
E-mail: [email protected]
No questions, other than practical, will be answered after 01/05
Recommender system
A recommender system is a system that seeks to predict which items a
user likes and/or dislikes. This is done by estimating the rating that a
user would give to an item.
Recommender system: collaborative filtering
If you want to make predictions for a user A, then these predictions are based
on similar users. Similar users are defined as users who have rated the same
items with similar ratings.
The underlying assumption of the collaborative filtering approach is that if a
person A has the same opinion as a person B on an item, A is more likely to
have B's opinion on a different item x than to have the opinion on x of a
randomly chosen person.
Recommender system: collaborative filtering
How to make predictions for user A
1. Find similar users for user A
2. Average the ratings of the similar users for unrated items of user A.
3. Pick the n best items of all the predictions and show to user A
In this project, items = movies.
Recommender system: collaborative filtering
Overview of the project parts:
1. Similar users
   - Exhaustive search: distance between 2 users, normalise the data
   - Branch and bound: select best dimension, create division, create the k-d tree, search in k-d tree
   - Hashing: create bands, create buckets, search buckets
2. Average ratings: normal, weighted
3. Select the top m
Points shown on the overview: /2, /2, /1, /1, /2, /2
Implement one: 7.5
Implement both: 10
Calculate the distance between two users
A user is represented as a vector consisting of its ratings.
The distance between two users A = (A_1, A_2, …, A_n) and B = (B_1, B_2, …, B_n)
is calculated using the cosine distance:
$$d(A, B) = 1 - \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \cdot \sqrt{\sum_i B_i^2}}$$
          Movie x   Movie y   Movie z
User A       -         5         1
User B       1         -         2
(- = not rated)
$$d(A, B) = 1 - \frac{0 \cdot 1 + 5 \cdot 0 + 1 \cdot 2}{\sqrt{25 + 1} \cdot \sqrt{1 + 4}}$$
Not rated = 0
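The cosine distance above could be sketched in Java as follows. This is only a minimal illustration, not the code of the project skeleton: the class name `CosineDistance` and the use of plain `double[]` vectors are assumptions, and returning +∞ for an all-zero vector is an assumed way of handling a zero denominator.

```java
final class CosineDistance {
    /**
     * Cosine distance between two rating vectors of equal length.
     * Movies that a user did not rate are encoded as 0.
     */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, squaredNormA = 0, squaredNormB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            squaredNormA += a[i] * a[i];
            squaredNormB += b[i] * b[i];
        }
        // Assumption: an all-zero vector makes the denominator 0; report +infinity.
        if (squaredNormA == 0 || squaredNormB == 0) {
            return Double.POSITIVE_INFINITY;
        }
        return 1 - dot / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
    }
}
```

For the example above, `CosineDistance.distance(new double[]{0, 5, 1}, new double[]{1, 0, 2})` evaluates 1 - 2 / (√26 · √5), which is roughly 0.82.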
Calculate the distance between two users
Special case:
Movie x Movie y Movie z
User A 5
User B 1 2
𝑑 𝐴, 𝐵 = +∞
Movie x Movie y Movie z
User A 0 5
User B 1 2
𝑑 𝐴, 𝐵 = 1
Normalising the data
Movie x Movie y Movie z
User A 5 5
User B 3 3
User C 3 1
Similar users, just different scale
Normalised ratings of user A = ratings of user A – average rating of user A
Movie x Movie y Movie z
User A 0 0
User B 0 0
User C 1 -1
If all of a user's ratings are the same, the normalised
ratings are all 0. If we leave it this way,
then the distance will be +∞. To avoid
this, we add noise in this case.
Movie x Movie y Movie z
User A 0.000127 0.001201
User B 0.000541 0.000143
User C 1 -1
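The normalisation step could look roughly like the sketch below. The class name `Normaliser`, the map-based representation (movie name to rating, unrated movies absent) and the size of the noise are all assumptions made for illustration; the slides only show that the noise values are small and positive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

final class Normaliser {
    /**
     * Subtracts the user's average rating from every rating.
     * If all normalised ratings are 0, small random noise is used instead.
     */
    static Map<String, Double> normalise(Map<String, Double> ratings, Random random) {
        double sum = 0;
        for (double rating : ratings.values()) {
            sum += rating;
        }
        double average = ratings.isEmpty() ? 0 : sum / ratings.size();

        Map<String, Double> normalised = new HashMap<>();
        boolean allZero = true;
        for (Map.Entry<String, Double> entry : ratings.entrySet()) {
            double centred = entry.getValue() - average;
            if (centred != 0) {
                allZero = false;
            }
            normalised.put(entry.getKey(), centred);
        }
        if (allZero) {
            // Assumption: noise drawn from roughly (0, 0.001]; the exact range is not specified.
            for (String movie : ratings.keySet()) {
                normalised.put(movie, 1e-6 + random.nextDouble() * 1e-3);
            }
        }
        return normalised;
    }
}
```

With ratings 3 and 1 (user C above) this yields 1 and -1; with ratings 5 and 5 (user A) it yields two small non-zero values.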
Finding the n most similar users: exhaustive search
Assume that the user who wants predictions is user A.
1. For all other users X
a. Calculate the distance 𝑑 𝐴, 𝑋
b. Keep track of the n closest users
Handy data structure: fixed size priority queue
Fixed Size Priority Queue
= priority queue with fixed size
Assume priority = distance, highest priority = highest distance, and size = n
When we pop (remove one element from the queue), the element with
highest distance will be removed
Only the n elements with smallest distance will remain.
Fixed Size Priority Queue: example with n = 3
Step                              Queue contents
Add user B with d(A,B) = 5        B, 5
Add user C with d(A,C) = 10       B, 5   C, 10
Add user D with d(A,D) = 7        B, 5   C, 10   D, 7
Add user E with d(A,E) = 8        B, 5   D, 7            (C, the user with the highest distance, is removed)
                                  B, 5   E, 8   D, 7     (E is added)
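A fixed size priority queue of this kind could be sketched on top of `java.util.PriorityQueue`, as below. The class name `NearestUsers` and its methods are hypothetical, and the sketch adds the new element first and then removes the element with the highest distance once the size exceeds n, which gives the same final contents as the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class NearestUsers {
    /** A user name together with its distance to the target user. */
    static final class Neighbour {
        final String user;
        final double distance;
        Neighbour(String user, double distance) {
            this.user = user;
            this.distance = distance;
        }
    }

    private final int capacity;
    // Highest distance first, so poll() removes the least similar user.
    private final PriorityQueue<Neighbour> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Neighbour nb) -> nb.distance).reversed());

    NearestUsers(int capacity) {
        this.capacity = capacity;
    }

    /** Adds a user; if the queue grows beyond its capacity, the furthest user is dropped. */
    void add(String user, double distance) {
        queue.add(new Neighbour(user, distance));
        if (queue.size() > capacity) {
            queue.poll();
        }
    }

    /** Returns the remaining user names, closest first. */
    List<String> usersByDistance() {
        List<Neighbour> sorted = new ArrayList<>(queue);
        sorted.sort(Comparator.comparingDouble((Neighbour nb) -> nb.distance));
        List<String> names = new ArrayList<>();
        for (Neighbour nb : sorted) {
            names.add(nb.user);
        }
        return names;
    }
}
```

Adding B (5), C (10), D (7) and E (8) to a queue of capacity 3 leaves B, D and E, as in the example. An exhaustive search would call `add` once for every other user X with d(A, X).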
Average the ratings
Input: the set of similar users for user A (here user B and C) and their ratings
for all movies
Output: the predicted ratings for unseen movies for user A
prediction for movie x = average of the ratings for movie x given by the
similar users (a similar user who didn't rate movie x does not count
towards the average)
Normalised ratings (- = not rated):
          Movie w   Movie x   Movie y   Movie z
User A       1         -         -        -1
User B      1/3       1/3      -2/3        -
User C       1         -        -1         -
With the predictions for user A filled in:
          Movie w   Movie x   Movie y   Movie z
User A       1        1/3     -2.5/3      -1
User B      1/3       1/3      -2/3        -
User C       1         -        -1         -
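The averaging step might be sketched as follows. The class name `RatingAverager` and the map-based representation (movie name to normalised rating, unrated movies absent) are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RatingAverager {
    /**
     * Predicts ratings for the movies the target user has not rated, as the plain
     * average over those similar users who did rate the movie.
     */
    static Map<String, Double> predict(Map<String, Double> target,
                                       List<Map<String, Double>> similarUsers) {
        Map<String, Double> sums = new TreeMap<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, Double> ratings : similarUsers) {
            for (Map.Entry<String, Double> entry : ratings.entrySet()) {
                if (target.containsKey(entry.getKey())) {
                    continue; // already rated by the target user, nothing to predict
                }
                sums.merge(entry.getKey(), entry.getValue(), Double::sum);
                counts.merge(entry.getKey(), 1, Integer::sum);
            }
        }
        Map<String, Double> predictions = new TreeMap<>();
        for (String movie : sums.keySet()) {
            predictions.put(movie, sums.get(movie) / counts.get(movie));
        }
        return predictions;
    }
}
```

With user A rating w = 1 and z = -1, user B rating w = 1/3, x = 1/3, y = -2/3, and user C rating w = 1, y = -1, the predictions for A are 1/3 for movie x (only B rated it) and (-2/3 - 1) / 2 = -2.5/3 for movie y.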
[Optional] Average the ratings with weights
prediction for movie x = weighted average of the ratings for movie x given by
the similar users (a similar user who didn't rate movie x does not count
towards the average)
The closer a user is, the higher its assigned weight. The furthest similar user
will get weight 1, the second furthest similar user gets 11, …, and the most
similar user will get 1 + n*10, where n is the number of similar users who have
rated the movie.
          Movie w   Movie x   Movie y   Movie z   Weight
User A       1        1/3     -32/33      -1
User B      1/3       1/3      -2/3        -         1
User C       1         -        -1         -        10
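A weighted average could be sketched as below. To stay neutral about how the weights are assigned, this hypothetical `WeightedAverager` takes the weight of each similar user as an input; the class name and the map-based representation are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WeightedAverager {
    /**
     * Predicts ratings for the movies the target user has not rated, as the weighted
     * average over those similar users who did rate the movie.
     * weights[i] is the weight of similarUsers.get(i).
     */
    static Map<String, Double> predict(Map<String, Double> target,
                                       List<Map<String, Double>> similarUsers,
                                       double[] weights) {
        Map<String, Double> weightedSums = new TreeMap<>();
        Map<String, Double> weightTotals = new TreeMap<>();
        for (int i = 0; i < similarUsers.size(); i++) {
            for (Map.Entry<String, Double> entry : similarUsers.get(i).entrySet()) {
                if (target.containsKey(entry.getKey())) {
                    continue; // already rated by the target user
                }
                weightedSums.merge(entry.getKey(), weights[i] * entry.getValue(), Double::sum);
                weightTotals.merge(entry.getKey(), weights[i], Double::sum);
            }
        }
        Map<String, Double> predictions = new TreeMap<>();
        for (String movie : weightedSums.keySet()) {
            predictions.put(movie, weightedSums.get(movie) / weightTotals.get(movie));
        }
        return predictions;
    }
}
```

With the weights from the table (1 for user B, 10 for user C), the prediction for movie y is (1 · (-2/3) + 10 · (-1)) / 11 = -32/33, and the prediction for movie x stays 1/3 because only user B rated it.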
Undo the normalisation
We normalised the data earlier, so we restore it by re-adding the
average rating of the user.
Average rating of user A = 2
                          Movie w   Movie x   Movie y   Movie z
User A (normalised)          1        1/3     -2.5/3      -1
User A (restored)            3        2.33     1.166       1
Select the top m recommendations
The m best recommendations = the m recommendations with the highest
predicted scores.
Again, the fixed size priority queue can be used.
Assume priority = rating, highest priority = lowest rating, and size = m
When we pop (remove one element from the queue), the element with the
lowest predicted score will be removed.
Only the elements with the m highest scores will remain.
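Selecting the top m predictions could be sketched as follows. The class name `TopRecommendations` is hypothetical. Note that in this sketch the queue removes the entry with the lowest predicted score whenever it holds more than m entries, so that the m highest scores are the ones that remain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopRecommendations {
    /** Returns the m movies with the highest predicted score, best first. */
    static List<String> topM(Map<String, Double> predictions, int m) {
        Comparator<Map.Entry<String, Double>> lowestScoreFirst = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Double>> queue = new PriorityQueue<>(lowestScoreFirst);
        for (Map.Entry<String, Double> entry : predictions.entrySet()) {
            queue.add(entry);
            if (queue.size() > m) {
                queue.poll(); // drop the currently lowest predicted score
            }
        }
        List<String> movies = new ArrayList<>();
        while (!queue.isEmpty()) {
            movies.add(queue.poll().getKey()); // polled in ascending order of score
        }
        Collections.reverse(movies);
        return movies;
    }
}
```

For example, with predicted scores x = 2.33, y = 1.166 and a made-up third movie v = 4.0, `topM(predictions, 2)` returns v followed by x.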
Finding the n most similar users: branch and bound approach
For the branch and bound approach a k-d tree is constructed. This is a data
structure used for organising points in a k-dimensional space.
More specifically, it is a binary search tree where each internal node has the
following properties:
• Dimension d on which the split happens
• Left child: subtree containing the points p for which p_d < median value of dimension d
• Right child: subtree containing the points p for which p_d ≥ median value of dimension d
Only the leaf nodes contain data points.
Constructing a k-d tree
K-DTree(group):
  1. If enough elements in the group:
     a. While (good dimension is not yet found):
        I.   Select the best dimension to split on (see next slide)
        II.  If good dimension is found:
             i.   Split the group into left children and right children
             ii.  If the split created two groups:
                  - leftChild = K-DTree(left children)
                  - rightChild = K-DTree(right children)
             iii. Else:
                  - Find other dimension
        III. Else:
             i.   Make this group a leaf node
  2. Else:
     a. Make this group a leaf node
Constructing a k-d tree: selecting the best dimension to split over
A dimension = a movie
The best movie to split on = the movie that is rated the most often. The number
of ratings should be larger than a predefined amount (e.g. 1).
Here the best movie would be movie w.
          Movie w   Movie x   Movie y   Movie z
User A       1         -         -        -1
User B       1        1/3      -4/3        -
User C       1         -        -1         -
Constructing a k-d tree: split the group
The group is divided based on the median value of the chosen dimension.
Here the chosen dimension = movie w, and the median of the ratings [1,1,1] is 1.
All users with a rating for movie w smaller than 1 are assigned to the left
child, and all users with a rating for movie w greater than or equal to 1 to the
right child.
Left children = {}
Right children = {A,B,C}
Not a good split, need to
find another dimension
Constructing a k-d tree: finding the best dimension
The best movie to split on = the movie with the most ratings that has not yet
been tried for splitting.
Movie w is not possible anymore, so the next
movie having the most ratings is movie y.
Constructing a k-d tree: split the group
Here the chosen dimension = movie y, and the median of the ratings [-4/3, -1]
is -7/6.
What to do with users that didn't rate the movie? Assign them to the left child.
Left children = {A,B}
Right children = {C}
Constructing a k-d tree: finding the best dimension
It is possible that a good dimension can't be found. This happens
when no movie in the group has more ratings than the required predefined
amount.
In that case you should make this node a leaf node.
Finding the n most similar users: branch and bound method
If you want to find the n most similar users of user A:
1. Descend the created k-d tree to find the leaf node containing user A.
2. In this leaf node, you can assume that the number of users present > n.
Now, you just need to select the n most similar users from this group in an
exhaustive way.
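The construction and the descent described above could be sketched in Java as follows. This is an illustrative sketch under several assumptions, not the structure of the project skeleton: the names `KdTreeSketch`, `User`, `Node`, `build`, `bestDimension`, `median` and `findLeaf` are hypothetical, `minGroupSize` stands for "enough elements in the group", `minRatings` stands for the predefined amount that the number of ratings must exceed, and ties between equally often rated movies are broken alphabetically.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class KdTreeSketch {
    static final class User {
        final String name;
        final Map<String, Double> ratings; // movie -> normalised rating; unrated movies are absent
        User(String name, Map<String, Double> ratings) {
            this.name = name;
            this.ratings = ratings;
        }
    }

    static final class Node {
        String movie;     // split dimension (internal nodes only)
        double median;    // split value (internal nodes only)
        Node left, right; // children (internal nodes only)
        List<User> users; // data points (leaf nodes only)
    }

    static Node build(List<User> group, int minGroupSize, int minRatings) {
        Set<String> tried = new HashSet<>();
        while (group.size() >= minGroupSize) {
            String movie = bestDimension(group, tried, minRatings);
            if (movie == null) {
                break; // no good dimension left: make this group a leaf
            }
            double median = median(group, movie);
            List<User> left = new ArrayList<>();
            List<User> right = new ArrayList<>();
            for (User user : group) {
                Double rating = user.ratings.get(movie);
                // Users who did not rate the movie go to the left child.
                if (rating == null || rating < median) left.add(user); else right.add(user);
            }
            if (!left.isEmpty() && !right.isEmpty()) {
                Node node = new Node();
                node.movie = movie;
                node.median = median;
                node.left = build(left, minGroupSize, minRatings);
                node.right = build(right, minGroupSize, minRatings);
                return node;
            }
            tried.add(movie); // not a good split, try another dimension
        }
        Node leaf = new Node();
        leaf.users = group;
        return leaf;
    }

    /** Most often rated movie that was not tried yet and has more than minRatings ratings, or null. */
    static String bestDimension(List<User> group, Set<String> tried, int minRatings) {
        Map<String, Integer> counts = new TreeMap<>();
        for (User user : group) {
            for (String movie : user.ratings.keySet()) counts.merge(movie, 1, Integer::sum);
        }
        String best = null;
        int bestCount = minRatings;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (!tried.contains(entry.getKey()) && entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Median of the ratings given to the movie by the users in the group. */
    static double median(List<User> group, String movie) {
        List<Double> values = new ArrayList<>();
        for (User user : group) {
            Double rating = user.ratings.get(movie);
            if (rating != null) values.add(rating);
        }
        values.sort(null); // natural ordering
        int mid = values.size() / 2;
        return values.size() % 2 == 1 ? values.get(mid) : (values.get(mid - 1) + values.get(mid)) / 2;
    }

    /** Descends the tree and returns the users stored in the leaf that the query user falls into. */
    static List<User> findLeaf(Node root, User query) {
        Node node = root;
        while (node.users == null) {
            Double rating = query.ratings.get(node.movie);
            node = (rating == null || rating < node.median) ? node.left : node.right;
        }
        return node.users;
    }
}
```

Built from the three example users with `minGroupSize = 2` and `minRatings = 1`, the sketch first rejects movie w (all users end up on the right), then splits on movie y at the median -7/6, giving the left group {A, B} and the right group {C}. The users returned by `findLeaf` are the candidates for the final exhaustive selection; the query user itself is among them and would be skipped there.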
Finding the n most similar users: hashing
1. Take the ratings matrix and divide the columns per band
Here the number of bands b is 4 and the number of movies in a band is 2
Movie s Movie t Movie u Movie v Movie w Movie x Movie y Movie z
User A 1 -1/2 -1/2
User B 1 1/3 -4/3 2/3 -2/3
User C 1 -1 1 -1
Band 1 1 2 2 3 3 4 4
2. For each band, apply a hash function to the average of the user's ratings in that band,
hash = (sign of average) * [ (the rounded value of (|average| * 100)) modulo
(number of buckets) ]
and create buckets
Assume in this example number of buckets = 10
Band 1 Band 2 Band 3 Band 4
User A 1 0 0
User B 7 -3 7 -7
User C 1 -1 1 -1
Band      Hash   Users
Band 1      1    A, C
            7    B
Band 2      0    A
           -1    C
           -3    B
Band 3      7    B
            1    C
…
So to find similar users for user A, you first calculate its hash value for
each band, resulting in b hashes.
In each band, you select all the users in the bucket to which user A belongs.
These are added to a collection of possible similar users.
To obtain the final n most similar users, you again perform an exhaustive
search over this collection.
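The band hashing and the bucket lookup could be sketched as below. Everything named here is an assumption for illustration: the class `BandHashing`, the `double[]` rows with `Double.NaN` marking an unrated movie, the choice to leave a user out of a band in which it rated nothing, and taking the modulo of the rounded absolute value before applying the sign.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BandHashing {
    /** sign(average) * (round(|average| * 100) mod bucketCount). */
    static int hash(double average, int bucketCount) {
        long magnitude = Math.round(Math.abs(average) * 100) % bucketCount;
        return (int) (Math.signum(average) * magnitude);
    }

    /** Average of the rated entries in ratings[from, to); NaN if the band has no ratings. */
    static double bandAverage(double[] ratings, int from, int to) {
        double sum = 0;
        int count = 0;
        for (int i = from; i < to; i++) {
            if (!Double.isNaN(ratings[i])) {
                sum += ratings[i];
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    /** One map per band: hash value -> names of the users in that bucket. */
    static List<Map<Integer, Set<String>>> buildBuckets(Map<String, double[]> users, int movieCount,
                                                        int bandSize, int bucketCount) {
        List<Map<Integer, Set<String>>> bands = new ArrayList<>();
        for (int from = 0; from < movieCount; from += bandSize) {
            int to = Math.min(from + bandSize, movieCount);
            Map<Integer, Set<String>> buckets = new HashMap<>();
            for (Map.Entry<String, double[]> entry : users.entrySet()) {
                double average = bandAverage(entry.getValue(), from, to);
                if (Double.isNaN(average)) continue; // assumption: no ratings in band, no bucket
                buckets.computeIfAbsent(hash(average, bucketCount), h -> new TreeSet<>())
                       .add(entry.getKey());
            }
            bands.add(buckets);
        }
        return bands;
    }

    /** All users sharing a bucket with the target user in at least one band. */
    static Set<String> candidates(String target, double[] targetRatings,
                                  List<Map<Integer, Set<String>>> bands,
                                  int bandSize, int bucketCount) {
        Set<String> result = new TreeSet<>();
        for (int band = 0; band < bands.size(); band++) {
            int from = band * bandSize;
            int to = Math.min(from + bandSize, targetRatings.length);
            double average = bandAverage(targetRatings, from, to);
            if (Double.isNaN(average)) continue;
            result.addAll(bands.get(band)
                    .getOrDefault(hash(average, bucketCount), Collections.<String>emptySet()));
        }
        result.remove(target);
        return result;
    }
}
```

With 10 buckets, a band average of 2/3 hashes to 7 and a band average of -4/3 hashes to -3. The candidate set returned by `candidates` would then be searched exhaustively for the n most similar users.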
Submission details
When: Deadline 05/05 (11:59 p.m.)
How: Submit using Dropbox functionality of Minerva: “Alle cursusbeheerders”
What: 1 zip file named GroupX_LastName1_LastName2.zip containing:
- The written report in pdf format
- The relevant Java files
Do not change the package structure!
Do not submit entire Eclipse or other project folders
Do not submit .class files
Naming: - X = Number of group on Minerva
- Mention your group and names in the written report and the Java files
Some tips
Be sure to complete the tasks mentioned in the Java code.
You can add as much code as you want, but do not change existing code.
Do not neglect your written report. It can be short but needs to be precise. Answer the
questions present in RecommenderSystem.java. If asked what the time or memory
complexity is, a high level reasoning is enough. You do not need to prove it.
Try to work together in your group and do not simply divide the workload.