Big data and machine learning @ Spotify
![Page 2: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/2.jpg)
● D-student starting 2009
● Graduated last year from CSALL
(Student in this class 2013)
● Master thesis at Spotify
● Data Engineer at Spotify in Gothenburg
Me
![Page 3: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/3.jpg)
● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning
Outline
![Page 4: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/4.jpg)
Supervised learning: data (X), labels (Y)
Unsupervised learning: data (X)
In the Machine Learning class:
![Page 5: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/5.jpg)
What is data at Spotify?
● Songs / track metadata
○ Cover art
○ Album
○ Genres, mood etc.
● User generated
○ Listens
○ Clicks
○ Page views
● Users
○ Country, email etc.
● Playlists
○ Tracks of playlist
○ Adds/removes
30 Million songs
60 Million Monthly Active Users
58 Markets
15 Million subscribers
1.5 Billion Playlists
![Page 6: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/6.jpg)
● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning
Outline
![Page 7: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/7.jpg)
Big Data and processing it
● 20 TB compressed data / day
○ 200 TB generated and stored / day (replication)
● Our business is highly dependent on these logs
○ We pay artists depending on plays, plays = logs
Too much to store on a single computer, so we need a cluster to process it. This is typically what is called “Big Data”.
![Page 8: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/8.jpg)
Big Data and processing it
● Distributed computing and storage
○ Hadoop
■ MapReduce
○ Cassandra
● Hadoop cluster
○ 1100 nodes
○ ~8000 jobs/day
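To make the MapReduce model concrete, here is a minimal single-process sketch of counting plays per track. It is not Hadoop code: the tab-separated log format, the sample lines, and the function names are all assumptions made for illustration. On a real cluster the map and reduce phases run on many nodes and the framework does the shuffle.

```python
from collections import defaultdict

# Hypothetical log lines: "user_id<TAB>track_id" (format assumed for illustration).
LOG_LINES = [
    "u1\tt1",
    "u2\tt1",
    "u1\tt2",
    "u3\tt2",
    "u3\tt1",
]


def map_phase(line):
    """Emit one (track_id, 1) pair per play log line."""
    _user_id, track_id = line.split("\t")
    yield track_id, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts emitted for one track."""
    return key, sum(values)


def count_plays(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


play_counts = count_plays(LOG_LINES)  # {"t1": 3, "t2": 2}
```

Because each map call looks at one line and each reduce call at one key, both phases can be split across machines without coordination.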
![Page 9: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/9.jpg)
● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning
Outline
![Page 10: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/10.jpg)
Using data at Spotify
Every part of the company is interested in our data
● Product
○ Are people using X? Should we focus on features such as Y?
● Insights
○ What music is trending? Which artists are popular where?
● Performance
○ How is latency in country Y? Did this reduce stutter in country X?
![Page 11: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/11.jpg)
Using data at Spotify
● Data-driven decision making
○ Like... every decision.
○ Analysts / Data scientists
● A/B test everything!
● A/B testing:
○ Statistical hypothesis testing
○ Simple randomized experiment with >= 2 variants (A, B)
![Page 12: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/12.jpg)
Using data at Spotify: A/B testing
Objective: Decrease time from loading playlist to first play
Hypothesis: The bigger the button, the faster users find it
Test setup:
● A - variant 1
○ 2% of US and SE monthly active users (MAU)
● B - variant 2
○ 2% of US and SE monthly active users (MAU)
● Control - normal
○ Rest of users in US and SE
“The shuffle button”
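One common way to put a fixed share of users into each variant is to hash the user id into a bucket. The sketch below assumes that approach; the function name, the experiment key, and the use of SHA-256 are illustrative choices, not a description of Spotify's actual assignment system.

```python
import hashlib


def assign_variant(user_id, experiment, test_share=0.02):
    """Deterministically map a user to "A", "B" or "control".

    Hashing experiment + user id gives a stable, roughly uniform value in
    [0, 1), so the same user always sees the same variant and different
    experiments get independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    if bucket < test_share:
        return "A"
    if bucket < 2 * test_share:
        return "B"
    return "control"


# Hypothetical usage: restrict to eligible users (e.g. US and SE) before calling.
variant = assign_variant("user42", "shuffle-button-size")
```

Deterministic assignment means no lookup table is needed, and the variant can be recomputed later when the logs are analysed.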
![Page 13: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/13.jpg)
Using data at Spotify: A/B testing
CONTROL A B
![Page 14: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/14.jpg)
Analytics: A/B testing
Metric: Share of users with time to first play > 500 ms
(500 ms is made up)
Let's roll out A to all users and throw away B!
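A share-of-users metric like this one can be compared between two groups with a two-proportion z-test, which is one standard form of the statistical hypothesis testing mentioned earlier. This is a sketch under assumptions: the slides do not say which test was used, and the counts in the usage line are invented.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: users over the threshold out of all users in each group.
z, p_value = two_proportion_z_test(300, 1000, 400, 1000)
```

A small p-value suggests the difference between the groups is unlikely to be chance alone; the cut-off (often 0.05) should be fixed before the experiment starts.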
![Page 15: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/15.jpg)
● What is data at Spotify?
● Big data and processing it
● Using data at Spotify
● Machine Learning
Outline
![Page 16: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/16.jpg)
● Machine Learning
○ User analysis
○ Artist disambiguation
○ Recommender systems
Outline
![Page 17: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/17.jpg)
“A music session somehow represents a moment for the user. Can we find these moments and describe them?”
![Page 18: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/18.jpg)
● Take a subset of user listening data with new genre data
○ Combine listens into sessions
■ Consecutive plays with no pause of 15 minutes or more
○ Session = [genres]
● Clustering algorithms to find similar sessions
○ K-means / Hierarchical clustering
● Describe the clusters using logistic regression
Machine Learning: Cluster user music sessions
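The sessionization step can be sketched as follows. The input shape (a list of `(timestamp, genre)` pairs for one user) is an assumption, and for simplicity the gap is measured between play start times rather than from the end of one track to the start of the next.

```python
SESSION_GAP_SECONDS = 15 * 60  # a pause this long or longer starts a new session


def split_sessions(plays):
    """Split one user's plays into sessions, each a list of genres.

    plays: iterable of (unix_timestamp_seconds, genre) pairs (assumed format).
    """
    sessions = []
    last_timestamp = None
    for timestamp, genre in sorted(plays):
        if last_timestamp is None or timestamp - last_timestamp >= SESSION_GAP_SECONDS:
            sessions.append([])  # first play, or the pause was too long
        sessions[-1].append(genre)
        last_timestamp = timestamp
    return sessions


# Two plays close together, a 15 minute pause, then two more plays.
example = split_sessions([(0, "rock"), (200, "rock"), (1100, "jazz"), (1300, "jazz")])
# example == [["rock", "rock"], ["jazz", "jazz"]]
```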
![Page 19: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/19.jpg)
Machine Learning: Cluster user music sessions
K-Means Per cluster classification
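A minimal k-means sketch in plain Python, assuming each session has already been turned into a fixed-length vector of genre shares (that encoding, and the toy data, are assumptions). Production work would normally use a library implementation; this only shows the two alternating steps.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Toy sessions as (rock share, jazz share) vectors.
SESSIONS = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 0.9]]
assignments, centroids = kmeans(SESSIONS, k=2)
```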
![Page 20: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/20.jpg)
Machine Learning: Cluster user music sessions
Per cluster logistic regression
w: weight vector
Each w_i can be interpreted as the effect of the x_i variable
x_i = genres
![Page 21: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/21.jpg)
Machine Learning: Cluster user music sessions
Clusters described by logistic regression: the name of the x_i with the largest w_i
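The labelling idea can be sketched as a one-vs-rest logistic regression per cluster, trained with plain gradient descent, followed by picking the genre whose weight is largest. The learning rate, epoch count, genre names, and toy data are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Batch gradient descent on the log loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - learning_rate * g / len(X) for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(X)
    return w, b


def describe_cluster(X, assignments, cluster, genre_names):
    """Name a cluster after the genre x_i with the largest weight w_i."""
    y = [1 if a == cluster else 0 for a in assignments]  # this cluster vs the rest
    w, _bias = fit_logistic_regression(X, y)
    return genre_names[max(range(len(w)), key=lambda i: w[i])]


GENRES = ["rock", "jazz", "pop"]
X = [[1.0, 0.0, 0.0], [0.8, 0.0, 0.2], [0.0, 1.0, 0.0],
     [0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.0, 0.2, 0.8]]
CLUSTERS = [0, 0, 1, 1, 2, 2]
label = describe_cluster(X, CLUSTERS, 0, GENRES)
```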
![Page 22: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/22.jpg)
Machine Learning: Cluster user music sessions
![Page 23: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/23.jpg)
Machine Learning: Cluster user music sessions
![Page 24: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/24.jpg)
Machine Learning
Artist disambiguation
Cleaning up the artists pages
![Page 25: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/25.jpg)
Machine Learning: Artist disambiguation
![Page 26: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/26.jpg)
Machine Learning: Artist disambiguation
Let's listen to those tracks!
Is it really the same Fredrik?
![Page 27: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/27.jpg)
Machine Learning: Artist disambiguation
![Page 28: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/28.jpg)
Machine Learning: Artist disambiguation
● Rank artists by probability of being ambiguous
● Apply clustering to each “ambiguous” artist's albums/tracks
○ Using features such as country, release year, label/licensor etc.
○ Distinct clusters could be different artists
● Present this nicely for manual curation
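As a rough sketch of the clustering step, the code below groups one artist page's releases with single-linkage clustering over country, release year, and label. The `Release` type, the distance function, the threshold, and the sample releases are all hypothetical; the slides do not say which algorithm or feature weighting was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    title: str
    country: str
    year: int
    label: str


def distance(a, b):
    """Assumed mixed distance: mismatches count 1, a decade or more apart counts 1."""
    d = 0.0 if a.country == b.country else 1.0
    d += 0.0 if a.label == b.label else 1.0
    d += min(abs(a.year - b.year) / 10.0, 1.0)
    return d


def cluster_releases(releases, threshold=1.5):
    """Single linkage: a release joins (and merges) every cluster it is close to."""
    clusters = []
    for release in releases:
        linked, rest = [], []
        for cluster in clusters:
            close = any(distance(release, other) <= threshold for other in cluster)
            (linked if close else rest).append(cluster)
        merged = [release] + [r for cluster in linked for r in cluster]
        clusters = rest + [merged]
    return clusters


# Hypothetical releases listed under one artist name.
RELEASES = [
    Release("Album A", "SE", 2010, "Label X"),
    Release("Album B", "SE", 2012, "Label X"),
    Release("Album C", "US", 1975, "Label Y"),
    Release("Album D", "US", 1978, "Label Y"),
]
groups = cluster_releases(RELEASES)  # two groups: candidates for two different artists
```

Each resulting group is only a suggestion; the final call stays with the manual curators.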
![Page 29: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/29.jpg)
Machine Learning: Recommender system
The discover page
![Page 30: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/30.jpg)
Machine Learning: Recommender system
Collaborative filtering
![Page 31: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/31.jpg)
Machine Learning: Recommender system
Collaborative filtering
● Build a matrix of user plays
● Compute similarity between items
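A toy version of these two steps: store the play matrix sparsely, view each track as a vector of per-user play counts, and compare tracks with cosine similarity. The user names, track ids, and counts are made up.

```python
import math

# Sparse user x track play matrix: user -> {track: play count} (made-up data).
PLAYS = {
    "alice": {"t1": 5, "t2": 3},
    "bob": {"t1": 2, "t2": 4, "t3": 1},
    "carol": {"t3": 6},
}


def item_vectors(plays):
    """Transpose the matrix: track -> {user: play count}."""
    vectors = {}
    for user, counts in plays.items():
        for track, count in counts.items():
            vectors.setdefault(track, {})[user] = count
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


TRACKS = item_vectors(PLAYS)
similar = cosine(TRACKS["t1"], TRACKS["t2"])    # played by the same users
dissimilar = cosine(TRACKS["t1"], TRACKS["t3"])  # little listener overlap
```

Tracks that share listeners score close to 1, while tracks with little overlap score close to 0.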
![Page 32: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/32.jpg)
Machine Learning: Recommender system
4 million tracks x 60 million users
→ Pairwise similarity infeasible
Approximate the matrix with NMF (non-negative matrix factorization)
![Page 33: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/33.jpg)
Machine Learning: Recommender system
Matrix factorization (latent factor models)
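A small illustration of non-negative matrix factorization: approximate the user x track matrix V by W · H, where each row of W is a short user vector and each column of H a short track vector. The sketch uses the classic multiplicative update rules in plain Python; the rank, iteration count, and toy matrix are assumptions, and the production system described in the talk runs at a very different scale.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Squared reconstruction error ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factor V (users x tracks) into non-negative W (users x rank), H (rank x tracks)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(len(V))]
    H = [[rng.random() + 0.1 for _ in range(len(V[0]))] for _ in range(rank)]
    for _ in range(iterations):
        Wt = transpose(W)
        numer, denom = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, numer, denom)]
        Ht = transpose(H)
        numer, denom = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, numer, denom)]
    return W, H


# Toy play counts: two groups of users with different taste.
V = [[5, 3, 0, 0], [4, 2, 0, 0], [0, 0, 4, 5], [0, 0, 3, 4]]
W, H = nmf(V, rank=2)
```

With rank far smaller than the number of users or tracks, the two factors take much less space than V and every user and track gets a short latent vector.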
![Page 34: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/34.jpg)
Machine Learning: Recommender system
Small vectors
Cosine similarity and dot product are efficient
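With short latent vectors, scoring a track for a user is a single dot product. The sketch below ranks tracks exactly by that score; the vectors are made up, and this brute-force scan is the baseline that an approximate nearest neighbour index is meant to speed up.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k_tracks(user_vector, track_vectors, k=2):
    """Return the k track ids with the highest dot product, best first."""
    return heapq.nlargest(
        k, track_vectors, key=lambda track: dot(user_vector, track_vectors[track])
    )


# Hypothetical 2-dimensional latent vectors.
TRACK_VECTORS = {"t1": [0.9, 0.1], "t2": [0.8, 0.3], "t3": [0.1, 0.9]}
recommended = top_k_tracks([1.0, 0.0], TRACK_VECTORS)  # ["t1", "t2"]
```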
![Page 35: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/35.jpg)
Machine Learning: Recommender system
Finding recommendations: Approximate nearest neighbour (ANN), code: https://github.com/spotify/annoy
Related artists & Radio: Similar to user recommendations, more models and not all CF-based
Multiple models: Score candidates from all models, combine and rank!
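One simple way to combine candidates from several models is to rescale each model's scores to [0, 1] and rank by a weighted sum. The min-max rescaling, the weights, and the model names below are assumptions; the slides do not describe how the combination is actually done.

```python
def combine_scores(model_scores, weights):
    """Rank tracks by a weighted sum of per-model min-max normalised scores.

    model_scores: model name -> {track: raw score}
    weights: model name -> weight (assumed to be chosen by hand here)
    """
    combined = {}
    for model, scores in model_scores.items():
        if not scores:
            continue
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid dividing by zero when all scores match
        for track, score in scores.items():
            normalised = (score - low) / span
            combined[track] = combined.get(track, 0.0) + weights[model] * normalised
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical candidate scores from a CF model and a content-based model.
MODEL_SCORES = {
    "cf": {"t1": 0.9, "t2": 0.5, "t3": 0.1},
    "content": {"t2": 10.0, "t3": 8.0, "t4": 2.0},
}
ranking = combine_scores(MODEL_SCORES, {"cf": 0.5, "content": 0.5})
# ranking == ["t2", "t1", "t3", "t4"]
```

Rescaling first matters because the models score on different scales; without it the model with the largest raw numbers would dominate the ranking.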
![Page 36: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/36.jpg)
Machine Learning: Recommender system
I just went through this quickly; read more details of the Spotify rec sys here:
● Doing this on MapReduce
● Comparing with Netflix
● Music Rec @ MLConf 2014
![Page 37: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/37.jpg)
● More content-based ML
○ Fingerprinting: Echo Nest
○ Content-based music recommendation using convolutional neural networks
● Personalize everything
○ Emails
○ Ads
○ User profiling
● ML on other parts of the product than Rec Sys
Final words on the future of ML at Spotify
![Page 38: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/38.jpg)
Summary
● Multiple data sources -> multiple angles
● Data drives decisions with A/B testing
● User analysis
○ Cluster and describe with classifier
● Artist disambiguation
○ Cluster and give to manual curators
● Recommender systems
○ Collaborative filtering
![Page 39: Big data and machine learning @ Spotify](https://reader033.vdocuments.us/reader033/viewer/2022042701/55a50e0b1a28abdf588b48e1/html5/thumbnails/39.jpg)
● We supervise thesis workers
○ Artist disambiguation/deduplication
○ Cluster user music sessions
○ Context-based recommender systems
○ Personalized ads / Personalized emails
● We have internships!
www.spotify.com/jobs
.. and potentially you could help us?