The Evolution of Big Data at Spotify
TRANSCRIPT
![Page 2: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/2.jpg)
Who Am I?
• Technical Product Owner at Spotify, responsible for Hadoop
• @l_phant
![Page 3: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/3.jpg)
Overview
• Creating Music Charts in three parts:
• Playing Music
• Collecting Data
• Processing Data
• The Future
![Page 4: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/4.jpg)
Building Music Charts
![Page 5: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/5.jpg)
![Page 6: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/6.jpg)
Building Music Charts
Play Music → Collect Data → Process
![Page 7: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/7.jpg)
Playing Music
![Page 8: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/8.jpg)
What is Spotify?
• Music Streaming Service
• Browse and Discover Millions of Songs, Artists and Albums
• By the end of 2014
• 60 Million Monthly Users
• 15 Million Paid Subscribers
![Page 9: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/9.jpg)
What is Spotify?
• Data Infrastructure
• 1300 Hadoop Nodes
• 42 PB Storage
• 30 TB data ingested via Kafka/day
• 400 TB generated by Hadoop/day
![Page 10: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/10.jpg)
Powered by Data
• Running App
• Matches music to running tempo
• Personalized running playlists in multiple tempos for millions of active users
http://www.theverge.com/2015/6/1/8696659/spotify-running-is-great-for-discovery
![Page 11: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/11.jpg)
Powered by Data
• Now Page
• Shows, podcasts and playlists based on day-parts
• Personalized layout so you always have the right music for the right moment
![Page 12: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/12.jpg)
![Page 13: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/13.jpg)
Building Music Charts
10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
• Raw data is complicated
• Often dirty
• Evolving structure
• Duplication all over
• Getting data to a central processing point is HARD
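As a rough illustration of what "dirty" and "duplicated" mean in practice, here is a minimal sketch that parses log lines shaped like the sample above, drops anything that does not match, and removes repeats. The field layout, the names `LOG_PATTERN`, `parse_line` and `clean`, and the choice of deduplication key are all assumptions made for this example; none of them comes from the talk.

```python
import re

# Assumed layout: host, two dashes, [timestamp], "request", status, size.
# The referrer and user agent that follow are ignored in this sketch.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # dirty record: skip it rather than crash
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent (for example a 304 response).
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def clean(lines):
    """Parse lines, dropping malformed ones and repeated events."""
    seen = set()
    records = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        # Assumption: same host, timestamp, method and path is a duplicate.
        key = (record["host"], record["timestamp"],
               record["method"], record["path"])
        if key in seen:
            continue
        seen.add(key)
        records.append(record)
    return records
```

Run over the six sample lines, this would keep five records, since the last line repeats an earlier request. A real pipeline would also have to cope with the evolving structure the slide mentions, which a single fixed pattern does not.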
![Page 14: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/14.jpg)
Collecting Data
![Page 15: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/15.jpg)
“It’s simple, we just throw the data into Hadoop”
A naive data engineer
![Page 16: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/16.jpg)
LogArchiver
• Original method to transport logs from APs to HDFS
• Lasted from 2009 - 2013
• Relies on rsync/scp to move files around
• Regularly scheduled via cron
![Page 17: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/17.jpg)
![Page 18: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/18.jpg)
LogArchiver Fails
• Worked well with small number of APs
• Issues with scale
• Manual Processes of adding new hosts
• Frequent dying of hosts or network issues caused massive congestion
• Manual process of overrides
![Page 19: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/19.jpg)
Apache Kafka to the rescue!
• A publish-subscribe messaging system open sourced by LinkedIn in 2011
• High level overview:
• Topic: Feeds of messages
• Producer: A message publisher
• Consumer: A subscriber of topics
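The three terms above can be made concrete with a toy in-memory model. This is not Kafka's API and leaves out everything that makes Kafka useful at scale (partitions, persistence, replication); `ToyBroker` and `ToyConsumer` are hypothetical names that only show how topics, producers and consumers relate.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only list."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append a message to the topic's feed."""
        self._topics[topic].append(message)

    def read(self, topic, offset):
        """Consumer side: return the messages at or after the offset."""
        return self._topics[topic][offset:]


class ToyConsumer:
    """Tracks its own read position, so consumers progress independently."""

    def __init__(self, broker, topic):
        self._broker = broker
        self._topic = topic
        self._offset = 0

    def poll(self):
        """Return messages not yet seen by this consumer."""
        messages = self._broker.read(self._topic, self._offset)
        self._offset += len(messages)
        return messages


broker = ToyBroker()
hdfs_loader = ToyConsumer(broker, "song_plays")  # hypothetical topic name
dashboard = ToyConsumer(broker, "song_plays")

broker.publish("song_plays", "track-1")
first_batch = hdfs_loader.poll()   # the loader sees track-1
broker.publish("song_plays", "track-2")
second_batch = hdfs_loader.poll()  # the loader sees only track-2
catch_up = dashboard.poll()        # the dashboard sees both messages
```

Because each consumer keeps its own position, a slow consumer does not hold back a fast one, and producers never need to know who is reading.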
![Page 20: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/20.jpg)
Apache Kafka to the rescue!
• Log -> HDFS latency reduced from hours to seconds!
• Benefits:
• Community supported
• Division of responsibilities
• Allowed for enhanced streaming use-cases
![Page 21: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/21.jpg)
![Page 22: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/22.jpg)
Processing Data
![Page 23: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/23.jpg)
Workflow Management Fail!
0 * * * * spotify-core hadoop jar merge_hourly.jar
15 * * * * spotify-core hadoop jar aggregate_song_plays.jar
30 * * * * spotify-analytics hadoop jar merge_artist_song.jar
* 1 * * * spotify-core hadoop jar daily_aggregate.jar
* 2 * * * spotify-core hadoop jar calculate_royalties.jar
*/2 22 * * * spotify-radio hadoop jar generate_radio.jar
![Page 24: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/24.jpg)
Luigi - Python Workflow Manager
Handles the ‘plumbing’ for Hadoop jobs
Easy to get started; no XML, unlike Oozie
https://github.com/spotify/luigi
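A crontab can only express job ordering through clock offsets, so a job that runs late silently breaks everything scheduled after it. A workflow manager has each job declare what it depends on instead. The sketch below shows that idea in plain Python; it is not Luigi's API. `ToyTask` and `build` are hypothetical names, and the dependency chain between the jobs (named after the crontab entries) is assumed for illustration.

```python
class ToyTask:
    """A unit of work that names the tasks it depends on."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)
        self.done = False

    def run(self, log):
        log.append(self.name)  # stand-in for launching the real job
        self.done = True


def build(task, log, _in_progress=None):
    """Run a task after all of its requirements, skipping finished ones."""
    in_progress = set() if _in_progress is None else _in_progress
    if task.done:
        return
    if task.name in in_progress:
        raise ValueError("dependency cycle at " + task.name)
    in_progress.add(task.name)
    for dependency in task.requires:
        build(dependency, log, in_progress)
    in_progress.discard(task.name)
    task.run(log)


# Assumed chain: royalties need the daily aggregate, which needs song plays,
# which need the hourly merge.
merge_hourly = ToyTask("merge_hourly")
aggregate_song_plays = ToyTask("aggregate_song_plays", [merge_hourly])
daily_aggregate = ToyTask("daily_aggregate", [aggregate_song_plays])
calculate_royalties = ToyTask("calculate_royalties", [daily_aggregate])

run_log = []
build(calculate_royalties, run_log)
```

Asking for the final task is enough: upstream jobs run first, in order, and a second call does nothing because every task is already marked done.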
![Page 25: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/25.jpg)
Hadoop Availability
• In 2013:
• Hadoop expanded to 200 nodes
• It was business critical
• It was not very reliable :-(
• Created a ‘squad’ with two missions:
• Migrate to a new distribution with YARN
• Make Hadoop reliable
![Page 26: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/26.jpg)
How did we do?
[Chart: Hadoop uptime (y-axis, 90% to 100%) by quarter, Q3-2012 to Q2-2015. Annotated phases: Hadoop ownerless; Dedicated squad launches; Upgrade instability; Continually improving]
![Page 27: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/27.jpg)
Going from Python to Crunch
• Most of our jobs were Hadoop Streaming jobs written in Python
• Lots of failures, slow performance
• Had to find a better way
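For readers who have not seen the style, a Hadoop Streaming job is a pair of scripts that read lines and write tab-separated key/value lines, with the framework sorting by key in between. The sketch below counts plays per track in that style. The input layout (`user_id<TAB>track_id`) and the function names are assumptions, and `run_locally` only imitates the shuffle step with an in-process sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'track_id<TAB>1' for each play event line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed events
        yield fields[1] + "\t1"


def reducer(sorted_lines):
    """Sum counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for track_id, group in groupby(pairs, key=lambda pair: pair[0]):
        yield track_id + "\t" + str(sum(int(count) for _, count in group))


def run_locally(lines):
    """Imitate the shuffle by sorting mapper output before reducing."""
    return list(reducer(sorted(mapper(lines))))
```

Every value crossing the mapper/reducer boundary is an untyped string, so a malformed field only shows up when the job runs, which is one plausible source of the failures listed above.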
![Page 28: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/28.jpg)
Moving from Python to Crunch
• Investigated several frameworks*
• Selected Crunch:
• Real types - compile time error detection, better testability
• Higher level API - let the framework optimize for you
• Better performance #JVM_FTW
*thewit.ch/scalding_crunchy_pig
![Page 29: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/29.jpg)
![Page 30: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/30.jpg)
• Play Music: Data-driven features that allow for new ways to play and discover music
• Collect Data: Lower latency and enhanced reliability for passing data from Access Points to HDFS via Kafka
• Process: Increased Hadoop reliability, Luigi scheduling and better performance with Crunch
Improving Charts!
![Page 31: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/31.jpg)
The Future
![Page 32: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/32.jpg)
[Chart: Growth of Hadoop vs. Spotify Users. Y-axis: Growth % (0 to 3500); x-axis: 2012 to 2015; series: Hadoop Usage, Spotify Users]
![Page 33: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/33.jpg)
Explosive Growth
• Increased Spotify Users
• More users listening to more music -> more data -> longer running jobs
• Increased Use Cases
• Beyond simple analytics into Machine Learning, advanced processing
• Increased Engineers
• In 2014, the number of data and machine learning engineers grew rapidly
![Page 34: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/34.jpg)
Scaling Machines: Easy
Scaling People: Hard
![Page 35: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/35.jpg)
User Feedback: Automate it!
![Page 37: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/37.jpg)
Hadoop Report Card
• Contains Statistics
• Guidelines and Best Practices
• Sent Quarterly
![Page 38: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/38.jpg)
Real Time Use Cases
• Expanding our use of Storm for:
• Targeting Ads based on genres
• Visualizing Data
• Quicker recommendations
• More information:
• https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
![Page 39: The Evolution of Big Data at Spotify](https://reader038.vdocuments.us/reader038/viewer/2022103019/55c4d0c8bb61ebc9218b4882/html5/thumbnails/39.jpg)
Two takeaways
• Getting data into Hadoop is half the challenge. Think early and often about scale.
• Increasing infrastructure reliability and performance leads to expanded use. This adds challenges but it’s a good problem to have.