music recommendations at scale with spark
DESCRIPTION
Spotify uses a range of Machine Learning models to power its music recommendation features including the Discover page, Radio, and Related Artists. Due to the iterative nature of these models they are a natural fit to the Spark computation paradigm and suffer from the IO overhead incurred by Hadoop. In this talk, I review the ALS algorithm for Matrix Factorization with implicit feedback data and how we’ve scaled it up to handle 100s of Billions of data points using Scala, Breeze, and Spark.TRANSCRIPT
![Page 1: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/1.jpg)
June 27, 2014
Music Recommendations at Scale with Spark
Chris Johnson @MrChrisJohnson
![Page 2: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/2.jpg)
Who am I??•Chris Johnson
– Machine Learning guy from NYC – Focused on music recommendations – Formerly a PhD student at UT Austin
![Page 3: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/3.jpg)
3Recommendations at Spotify !
• Discover (personalized recommendations) • Radio • Related Artists • Now Playing
![Page 4: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/4.jpg)
How can we find good recommendations?
!•Manual Curation !!!
•Manually Tag Attributes !!• Audio Content,
Metadata, Text Analysis !!• Collaborative Filtering
4
![Page 5: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/5.jpg)
How can we find good recommendations?
!•Manual Curation !!!
•Manually Tag Attributes !!• Audio Content,
Metadata, Text Analysis !!• Collaborative Filtering
5
![Page 6: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/6.jpg)
Collaborative Filtering - “The Netflix Prize” 6
![Page 7: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/7.jpg)
Collaborative Filtering 7
Hey,I like tracks P, Q, R, S!
Well,I like tracks Q, R, S, T!
Then you should check out track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
![Page 8: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/8.jpg)
Section name 8
![Page 9: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/9.jpg)
Explicit Matrix Factorization 9
Movies
Users
Chris
Inception
•Users explicitly rate a subset of the movie catalog •Goal: predict how users will rate new movies
![Page 10: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/10.jpg)
• = bias for user • = bias for item • = regularization parameter
Explicit Matrix Factorization 10
ChrisInception
? 3 5 ? 1 ? ? 1 2 ? 3 2 ? ? ? 5 5 2 ? 4
•Approximate ratings matrix by the product of low-dimensional user and movie matrices •Minimize RMSE (root mean squared error)
• = user rating for movie • = user latent factor vector • = item latent factor vector
X YUsers
Movies
![Page 11: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/11.jpg)
Implicit Matrix Factorization 11
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
Songs
![Page 12: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/12.jpg)
Alternating Least Squares (ALS) 12
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
SongsFix songs
![Page 13: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/13.jpg)
Alternating Least Squares (ALS) 13
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
SongsFix songs
Solve for users
![Page 14: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/14.jpg)
Alternating Least Squares (ALS) 14
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
Songs Fix users
![Page 15: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/15.jpg)
Alternating Least Squares (ALS) 15
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
SongsSolve for songs
Fix users
![Page 16: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/16.jpg)
Alternating Least Squares (ALS) 16
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
SongsSolve for songs
Fix users
Repeat until convergence…
![Page 17: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/17.jpg)
Alternating Least Squares (ALS) 17
1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1
•Instead of explicit ratings use binary labels – 1 = streamed, 0 = never streamed •Minimize weighted RMSE (root mean squared error) using a
function of total streams as weights
• = bias for user • = bias for item • = regularization parameter
• = 1 if user streamed track else 0 • • = user latent factor vector • =i tem latent factor vector
X YUsers
SongsSolve for songs
Fix users
Repeat until convergence…
![Page 18: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/18.jpg)
18Alternating Least Squares
code: https://github.com/MrChrisJohnson/implicitMF
![Page 19: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/19.jpg)
Section name 19
![Page 20: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/20.jpg)
Scaling up Implicit Matrix Factorization with Hadoop
20
![Page 21: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/21.jpg)
Hadoop at Spotify 2009 21
![Page 22: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/22.jpg)
Hadoop at Spotify 2014 22
700 Nodes in our London data center
![Page 23: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/23.jpg)
Implicit Matrix Factorization with Hadoop 23
Reduce stepMap step
u % K = 0i % L = 0
u % K = 0i % L = 1 ... u % K = 0
i % L = L-1
u % K = 1i % L = 0
u % K = 1i % L = 1 ... ...
... ... ... ...
u % K = K-1i % L = 0 ... ... u % K = K-1
i % L = L-1
item vectorsitem%L=0
item vectorsitem%L=1
item vectorsi % L = L-1
user vectorsu % K = 0
user vectorsu % K = 1
user vectorsu % K = K-1
all log entriesu % K = 1i % L = 1
u % K = 0
u % K = 1
u % K = K-1
Figure via Erik Bernhardsson
![Page 24: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/24.jpg)
Implicit Matrix Factorization with Hadoop 24
One map taskDistributed
cache:All user vectors where u % K = x
Distributed cache:
All item vectors where i % L = y
Mapper Emit contributions
Map input:tuples (u, i, count)
where u % K = x
andi % L = y
Reducer New vector!
Figure via Erik Bernhardsson
![Page 25: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/25.jpg)
Hadoop suffers from I/O overhead 25
IO Bottleneck
![Page 26: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/26.jpg)
Spark to the rescue!! 26
Vs
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Spark
Hadoop
![Page 27: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/27.jpg)
Section name 27
![Page 28: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/28.jpg)
28
ratings user vectors item vectors
First Attempt (broadcast everything)
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
• For each iteration: 1. Compute YtY over item vectors and broadcast 2. Broadcast item vectors 3. Group ratings by user 4. Solve for optimal user vector
![Page 29: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/29.jpg)
29
ratings user vectors item vectors
First Attempt (broadcast everything)
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
• For each iteration: 1. Compute YtY over item vectors and broadcast 2. Broadcast item vectors 3. Group ratings by user 4. Solve for optimal user vector
![Page 30: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/30.jpg)
First Attempt (broadcast everything) 30
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
• For each iteration: 1. Compute YtY over item vectors and broadcast 2. Broadcast item vectors 3. Group ratings by user 4. Solve for optimal user vector
![Page 31: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/31.jpg)
31
ratings user vectors item vectors
First Attempt (broadcast everything)
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
• For each iteration: 1. Compute YtY over item vectors and broadcast 2. Broadcast item vectors 3. Group ratings by user 4. Solve for optimal user vector
![Page 32: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/32.jpg)
First Attempt (broadcast everything) 32
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
• For each iteration: 1. Compute YtY over item vectors and broadcast 2. Broadcast item vectors 3. Group ratings by user 4. Solve for optimal user vector
![Page 33: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/33.jpg)
First Attempt (broadcast everything) 33
![Page 34: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/34.jpg)
First Attempt (broadcast everything) 34
•Cons: – Unnecessarily shuffling all data across wire each iteration. – Not caching ratings data – Unnecessarily sending a full copy of user/item vectors to all workers.
![Page 35: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/35.jpg)
Second Attempt (full gridify) 35
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 36: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/36.jpg)
Second Attempt (full gridify) 36
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 37: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/37.jpg)
Second Attempt (full gridify) 37
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 38: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/38.jpg)
Second Attempt (full gridify) 38
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 39: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/39.jpg)
Second Attempt (full gridify) 39
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 40: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/40.jpg)
Second Attempt (full gridify) 40
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 41: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/41.jpg)
Second Attempt (full gridify) 41
ratings user vectors item vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
•Group ratings matrix into K x L, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector send a copy to each rating block in the item % L column 3. Compute intermediate terms for each block (partition) 4. Group by user, aggregate intermediate terms, and solve for optimal user vector
![Page 42: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/42.jpg)
Second Attempt 42
![Page 43: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/43.jpg)
Second Attempt 43
•Pros – Ratings get cached and never shuffled – Each partition only requires a subset of item (or user) vectors in memory each iteration – Potentially requires less local memory than a “half gridify” scheme
•Cons - Sending lots of intermediate data over wire each iteration in order to aggregate and solve for optimal vectors - More IO overhead than a “half gridify” scheme
![Page 44: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/44.jpg)
Third Attempt (half gridify) 44
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
![Page 45: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/45.jpg)
Third Attempt (half gridify) 45
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
![Page 46: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/46.jpg)
Third Attempt (half gridify) 46
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
![Page 47: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/47.jpg)
Third Attempt (half gridify) 47
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
![Page 48: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/48.jpg)
Third Attempt (half gridify) 48
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
![Page 49: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/49.jpg)
Third Attempt (half gridify) 49
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
![Page 50: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/50.jpg)
Third Attempt (half gridify) 50
ratings user vectors item vectors
•Partition ratings matrix into K user (row) and item (column) blocks, partition, and cache •For each iteration: 1. Compute YtY over item vectors and broadcast 2. For each item vector, send a copy to each user rating partition that requires it (potentially
all partitions) 3. Each partition aggregates intermediate terms and solves for optimal user vectors
worker 1 worker 2 worker 3 worker 4 worker 5 worker 6
YtY YtY YtY YtY YtY YtY
Note that we removed the extra shuffle from the full gridify approach.
![Page 51: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/51.jpg)
51Third Attempt (half gridify)
•Pros – Ratings get cached and never shuffled – Once item vectors are joined with ratings partitions each partition has enough information to solve optimal user
vectors without any additional shuffling/aggregation (which occurs with the “full gridify” scheme) •Cons - Each partition could potentially require a copy of each item vector (which may not all fit in memory) - Potentially requires more local memory than “full gridify” scheme
Actual MLlib code!
![Page 52: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/52.jpg)
ALS Running Times 52
Hadoop Spark (full gridify)
Spark (half gridify)
10 hours 3.5 hours 1.5 hours
•Dataset consisting of Spotify streaming data for 4 Million users and 500k artists -Note: full dataset consists of 40M users and 20M songs but we haven’t yet successfully run with Spark
•All jobs run using 40 latent factors •Spark jobs used 200 executors with 20G containers •Hadoop job used 1k mappers, 300 reducers
![Page 53: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/53.jpg)
ALS Running Times 53
ALS runtime numbers via @evansparks using Spark version 0.8.0
![Page 54: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/54.jpg)
Section name 54
![Page 55: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/55.jpg)
Random Learnings 55
•PairRDDFunctions are your friend!
![Page 56: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/56.jpg)
Random Learnings 56
•Kryo serialization faster than java serialization but may require you to write and/or register your own serializers
![Page 57: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/57.jpg)
Random Learnings 57
•Kryo serialization faster than java serialization but may require you to write and/or register your own serializers
![Page 58: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/58.jpg)
Random Learnings 58
•Running with larger datasets often results in failed executors and job never fully recovers
![Page 59: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/59.jpg)
Section name 59
Fin
![Page 60: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/60.jpg)
Section name 60
![Page 61: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/61.jpg)
Section name 61
![Page 62: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/62.jpg)
Section name 62
![Page 63: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/63.jpg)
Section name 63
![Page 64: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/64.jpg)
Section name 64
![Page 65: Music Recommendations at Scale with Spark](https://reader034.vdocuments.us/reader034/viewer/2022042713/540dc1798d7f728d7e8b4aba/html5/thumbnails/65.jpg)
Section name 65