SVD and the Netflix Dataset
DESCRIPTION
A short summary and explanation of LSI (SVD) and how it can be applied to recommendation systems, and to the Netflix dataset in particular.
TRANSCRIPT
SVD Applied to Collaborative Filtering
~ URUG 7-12-07 ~
Recommendation System
Answers the question: What do I want next?!?
Very consumer driven.
Must provide good results, or a user may not trust the system in the future.
Collaborative Filtering
Base user recommendations on:
User’s past history.
History of like-minded users.
View data as product X user matrix.
Find a “neighborhood” of similar users for that user.
Return the top-N recommendations.
Early Approaches
Goldberg et al. (1992), Using Collaborative Filtering to Weave an Information Tapestry.
Konstan, J., et al. (1997), Applying Collaborative Filtering to Usenet News.
Use Pearson Correlation or cosine similarity as a measure of similarity to form neighborhoods.
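As a rough illustration of how such neighborhoods can be formed, the sketch below computes the Pearson correlation between users over the products both have rated and keeps the most similar users. The ratings dictionary, the user names, and the helper names `pearson` and `neighborhood` are made up for this example; they are not taken from the cited papers.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the products both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # too little overlap to correlate
    xs = [ratings_a[p] for p in common]
    ys = [ratings_b[p] for p in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant rater has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def neighborhood(target, all_ratings, size=2):
    """Up to `size` users with the highest positive correlation to `target`."""
    scored = [
        (pearson(all_ratings[target], other), user)
        for user, other in all_ratings.items()
        if user != target
    ]
    scored.sort(reverse=True)
    return [user for score, user in scored[:size] if score > 0]

# Hypothetical ratings: user -> {product: rating}
ratings = {
    "ann": {"m1": 5, "m2": 3, "m3": 1},
    "bob": {"m1": 4, "m2": 3, "m3": 2},
    "cat": {"m1": 1, "m2": 3, "m3": 5},
    "dan": {"m4": 5},
}

print(neighborhood("ann", ratings))
```

In this toy data "dan" shares no rated product with "ann", so no correlation can be computed for that pair at all.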
Early CF Challenges
Sparsity - Users share too few rated products for a correlation to be found, so coverage is reduced.
Scalability - The computation time of nearest-neighbor algorithms grows with the number of products and users.
Synonymy - Similar products listed under different names are treated as unrelated.
Dimensionality Reduction
Latent Semantic Indexing (LSI)
Algorithm from the IR community (late 80s to early 90s).
Addresses the problems of synonymy, polysemy, sparsity, and scalability for large datasets.
Reduces dimensionality of a dataset and captures the latent relationships.
Easily maps to CF!
Framing LSI for CF
Products X Users matrix instead of Terms X Documents.
Netflix Dataset
480,189 users, 17,770 movies, only ~100 million ratings.
17,770 X 480,189 matrix that is 99% sparse!
About 8.5 billion potential ratings.
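The sparsity figure follows directly from the counts on the slide. The short calculation below uses the slide's rounded "~100 million" ratings, so the percentages are approximate.

```python
users = 480_189
movies = 17_770
ratings = 100_000_000  # the slide's rounded "~100 million"

potential = users * movies   # every (movie, user) cell of the matrix
density = ratings / potential
sparsity = 1.0 - density

print(f"potential ratings: {potential:,}")
print(f"filled: {density:.2%}, empty: {sparsity:.2%}")
```

Only a little over 1% of the cells hold a rating, which is where the "99% sparse" figure comes from.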
SVD - The Math Behind LSI
Singular Value Decomposition
Any M x N matrix A of rank r can be decomposed as:
$A = U \Sigma V^T$
U is an M x M orthogonal matrix.
V is an N x N orthogonal matrix.
Σ is an M x N diagonal matrix whose first r diagonal entries are the nonzero singular values of A:
$\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > \sigma_{r+1} = \dots = \sigma_n = 0$
Related to eigenvalue decomposition (PCA)
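The columns of U are orthonormal eigenvectors of AA^T. They span the "column space" and are known as the left singular vectors.
The columns of V are orthonormal eigenvectors of A^TA. They span the "row space" and are known as the right singular vectors.
The singular values are the square roots of the eigenvalues of A^TA (the nonzero eigenvalues of AA^T are the same).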
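The link between singular values and eigenvalues can be checked by hand on a tiny matrix. The sketch below is only an illustration: the 2 x 2 matrix `A` is an arbitrary choice, and the helper names `gram` and `sym2_eigenvalues` are invented here. It uses the closed-form eigenvalues of a symmetric 2 x 2 matrix rather than a general eigensolver.

```python
import math

def gram(a):
    """A^T A for a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]

def sym2_eigenvalues(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], largest first."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean + radius, mean - radius

A = [[3.0, 0.0], [4.0, 5.0]]          # toy matrix, chosen for easy arithmetic
eigs = sym2_eigenvalues(gram(A))      # eigenvalues of A^T A
singular_values = [math.sqrt(e) for e in eigs]

print(eigs, singular_values)
```

As a sanity check, the product of the singular values of a square matrix equals the absolute value of its determinant, which is 15 for this `A`.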
Reducing Dimensionality
Keep only the k largest singular values and their singular vectors:
$A_k = U_k \Sigma_k V_k^T$
A_k is the closest rank-k approximation to A.
A_k minimizes the Frobenius norm $\|A - A_k\|_F$ over all rank-k matrices.
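A small numeric check of this property, for k = 1 only, might look like the sketch below. It is an assumption-laden illustration rather than a general method: the 2 x 2 matrix is arbitrary, the helper names are invented, and the leading singular triplet is found by power iteration on A^T A instead of a full SVD. For this matrix the singular values are sqrt(45) and sqrt(5), so the rank-1 error should equal the discarded singular value, sqrt(5).

```python
import math

def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def rank1_approximation(a, iterations=100):
    """sigma_1 * u_1 * v_1^T, found by power iteration on A^T A."""
    at = transpose(a)
    # Starting vector; it must not be orthogonal to v_1 (true for this example).
    v = [1.0] + [0.0] * (len(a[0]) - 1)
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))   # one multiplication by A^T A
        n = norm(w)
        v = [x / n for x in w]
    av = matvec(a, v)                  # equals sigma_1 * u_1
    return [[ai * vj for vj in v] for ai in av]

def frobenius_error(a, b):
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))

A = [[3.0, 0.0], [4.0, 5.0]]
A1 = rank1_approximation(A)
error = frobenius_error(A, A1)

print(A1, error)
```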
Making Recommendations
Cosine similarity - a common way to find a neighborhood:
$\cos(i, j) = \dfrac{i \cdot j}{\|i\|_2 \, \|j\|_2}$
Somehow base recommendations off of that neighborhood and its users.
Can also make predictions for products with a simple dot product if the singular values are combined with the singular vectors:
$CP_{prod} = C_{avg} + U_k S_k^{1/2}(c) \cdot S_k^{1/2} V_k^T(p)$
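Both formulas reduce to a few lines of arithmetic once the reduced factors exist. In the sketch below the factor vectors are made-up numbers standing in for the customer's row of U_k S_k^(1/2) and the product's column of S_k^(1/2) V_k^T; the function names are likewise hypothetical.

```python
import math

def cosine(i, j):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(i, j))
    norms = math.sqrt(sum(x * x for x in i)) * math.sqrt(sum(y * y for y in j))
    return dot / norms if norms else 0.0

def predict(customer_avg, customer_factors, product_factors):
    """Customer average plus the dot product of two k-dimensional factor vectors."""
    return customer_avg + sum(c * p for c, p in zip(customer_factors, product_factors))

# Hypothetical k = 2 factors for one customer and one product.
customer = [0.9, -0.2]
product = [1.0, 0.5]
customer_avg = 3.5

print(cosine([1.0, 2.0], [2.0, 4.0]))           # parallel vectors
print(predict(customer_avg, customer, product))
```

Because the square roots of the singular values are folded into both factor matrices ahead of time, the online prediction is just one k-term dot product per customer and product pair.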
Challenges with SVD
Scalability - Once again, compute time grows with the number of users and products: O(m^3).
Offline stage.
Online stage.
Even doing the SVD computation offline is not possible for large datasets. Other methods are needed.
Incremental SVD
$u_k = u^T V_k \Sigma_k^{-1}$
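The projection above can be tried on a tiny example. Everything concrete in this sketch is an assumption made for illustration: the toy matrix A = [[3, 0], [4, 5]], its hand-computed factors `V_k` and `sigma_k`, and the helper name `fold_in`. Folding in a row that already belongs to A should reproduce that row's coordinates in U, which gives a quick sanity check.

```python
import math

# Hand-computed SVD factors of the toy matrix A = [[3, 0], [4, 5]].
r = 1 / math.sqrt(2)
V_k = [[r, r], [r, -r]]                  # columns are the right singular vectors
sigma_k = [math.sqrt(45), math.sqrt(5)]  # singular values, largest first

def fold_in(u, v_k, sigma_k):
    """Project a ratings vector u into the k-dimensional space: u^T V_k Sigma_k^-1."""
    return [
        sum(u[n] * v_k[n][f] for n in range(len(u))) / sigma_k[f]
        for f in range(len(sigma_k))
    ]

# The first row of A, treated as if it were a newly arrived vector.
u_k = fold_in([3.0, 0.0], V_k, sigma_k)
print(u_k)
```

The appeal of folding in is that a new vector is placed in the existing space with one small matrix product, without recomputing the decomposition.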
Incremental SVD Results
GHA for SVD
Gorrell (2006), GHA for Incremental SVD in NLP
Based on Sanger's (1989) GHA for eigendecomposition.
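$\Delta c_i^a = c_i^b \cdot b \, \bigl( a - \sum_{j<i} (a \cdot c_j^a) \, c_j^a \bigr)$
$\Delta c_i^b = c_i^a \cdot a \, \bigl( b - \sum_{j<i} (b \cdot c_j^b) \, c_j^b \bigr)$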
GHA extended by Funk
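    void train(int user, int movie, real rating) {
        // Prediction error, pre-scaled by the learning rate.
        real err = lrate * (rating - predictRating(movie, user));
        // Nudge both feature values in the direction that reduces the error.
        userValue[user] += err * movieValue[movie];
        movieValue[movie] += err * userValue[user];
    }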
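For readers who want to run the update rule, here is a minimal Python sketch of the same idea for a single feature. The tiny ratings list, the learning rate, the epoch count, and the 0.1 starting value are all assumptions chosen for the example, and `predict_rating` is assumed to be the product of the two feature values. Unlike the slide's snippet, it caches the old user value so that both updates use pre-step values.

```python
import math

def predict_rating(user_value, movie_value, user, movie):
    """Single-feature prediction: product of the user and movie values."""
    return user_value[user] * movie_value[movie]

def rmse(ratings, user_value, movie_value):
    total = sum((r - predict_rating(user_value, movie_value, u, m)) ** 2
                for u, m, r in ratings)
    return math.sqrt(total / len(ratings))

def train(ratings, n_users, n_movies, lrate=0.01, epochs=300, init=0.1):
    user_value = [init] * n_users
    movie_value = [init] * n_movies
    for _ in range(epochs):
        for user, movie, rating in ratings:
            err = lrate * (rating - predict_rating(user_value, movie_value, user, movie))
            old_user = user_value[user]  # cache so both updates use pre-step values
            user_value[user] += err * movie_value[movie]
            movie_value[movie] += err * old_user
    return user_value, movie_value

# Hypothetical (user, movie, rating) triples.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 1)]

before = rmse(ratings, [0.1] * 3, [0.1] * 3)
user_value, movie_value = train(ratings, n_users=3, n_movies=3)
after = rmse(ratings, user_value, movie_value)
print(before, after)
```

Only the observed ratings are ever visited, so the cost of one pass is proportional to the number of ratings rather than to the size of the full matrix.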
Netflix Results
Best RMSEs
0.9283
0.9212
Blended to get 0.9189, 3.42% better than Netflix.
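The improvement figure can be reproduced with a short calculation. The baseline used below, 0.9514, is the widely reported RMSE of Netflix's own system on the contest's quiz set; it is not stated on the slide, so treat it as an assumption of this sketch. The function names are illustrative.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def improvement_pct(baseline, score):
    """Relative RMSE reduction versus a baseline, in percent."""
    return (baseline - score) / baseline * 100

NETFLIX_BASELINE_RMSE = 0.9514  # assumed baseline, not taken from the slide

print(round(improvement_pct(NETFLIX_BASELINE_RMSE, 0.9189), 2))
```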
Summary
SVD provides an elegant and automatic recommendation system that has the potential to scale.
There are many different algorithms that calculate, or at least approximate, the SVD; these can be used in the offline stage by websites that need CF.
Every dataset is different and requires experimentation to get the best results.