shortening the feedback loop
Post on 08-Jan-2017
5.425 Views
Preview:
TRANSCRIPT
Shortening the Feedback Loop How Spotify’s Big Data Ecosystem Has Evolved to Leverage Actionable Insights
Josh Baer (jbx@spotify.com)Note: opinions expressed in these slides are the authors and not necessarily those of Spotify
Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously, building out Spotify’s 2500 node Hadoop cluster
@l_phant
• Spotify Launches
• Access to a gigantic catalog of music
• Click to play instantaneous!
In 2008
Behind the Scenes: Days to Insights
Behind the Scenes
Behind the Scenes
Minutes to transfer
Hours to Clean and Bucket
Hours to Run Jobs or Ad Hoc
QueriesDAYS TO INSIGHTS
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009
Real-time
ProcessingBatch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009
To leverage actionable insights, we need a
faster feedback loop!
• Music Streaming Service
• Launched in 2008
• Premium and Free Tiers
• Available in 60 Countries
What is Spotify?
Over 100 Million Active Users
Over 30 Million Songs
Over 1 Billion Plays Per Day
And we have Data
Hadoop at Spotify
• ~2,500 Nodes
• >100 PB Capacity
• >100 TB Memory accessible by jobs
• 20K Jobs/Day
Apache Kafka at Spotify
• 500 Kafka-related machines
• 40 TB/day from logs
Real-Time at Spotify
• Storm Topologies fed via Kafka
• Mostly used for hack ideas or proof of concepts
Migrating to the Cloud
In the Beginning…
• Spotify was almost completely on-premise/bare metal
• Grew to 2,500 node Hadoop cluster and over 10K total machines in production at four globally distributed data centers
• “Flirted” with cloud providers at various times
In 2014
• Maybe we should try this cloud thing for real
Why Move to the Cloud?
• Cloud Providers have matured, decreasing in costs and increasing in reliability and variety of service offered
• Owning and operating physical machines is not a competitive advantage for Spotify
Why Google’s Cloud?
• We believe Google’s industry leading background in Big Data technologies will give us a data processing advantage
Google Cloud Data Building Blocks
BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
• Dremel (query execution engine)
• Colossus (distributed storage)
• Borg (distributed compute)
• Jupiter (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
BigQuery vs. Hive
• Example Queries:
• What are the top 10 songs by popularity in Spain during October 2016?
• How many hours did users in Spain spend listening to Spotify during October?
BigQuery vs. Hive
• What are the top 10 songs by popularity in Spain during October 2016?
• Hive
• 2647s (44min, 7sec)
• 15.5 TB processed
• BigQuery
• 108s (1min, 48sec)
• 1.50 TB processed
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
Top 10 Tracks in Spain during October 2016
Rank Artist(s) Track Name1 JBalvin Safari
2 DJSnake LetMeLoveYou
3 RickyMar8n VentePa'Ca
4 Sebas8anYatra Traicionera
5 Zion&Lennox(feat.JBalvin) OtraVez
6 CarlosVives,Shakira LaBicicleta
7 TheChainsmokers Closer
8 MajorLazer(feat.Jus8nBieber&MØ) ColdWater
9 Sia TheGreatest
10 IAmChino(feat.Pitbull,Yandel&Chacal) AyMIDios
BigQuery vs. Hive
• How much time did users in Spain spend listening to Spotify during October?
• Hive
• 969s (16min, 9 sec)
• 15.5 TB processed
• BigQuery
• 33s
• 780 GB processed
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
Nearly 10,000 Years!
BigQuery at Spotify
• Interactive and ad-hoc querying immediately started to transfer to BQ once the data was available on the cloud
• Pace of learning increases as friction to question decreases
Cloud Pub/Sub
• At least once globally distributed message queue
• For high volume, low topic (<10,000) publish subscribe behavior
• Like Kafka, but without needing to operate servers and supporting services (zookeeper)
Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 Latency of ingestions into ES: 500ms
• Ingestion from globally distributed non-GCP datacenters is painless
• Managed Service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and Millwheel
• Programming model open-sourced as Apache Beam (currently incubating)
Cloud Dataflow
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a scala API for running Dataflow jobs and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being written in Scio/Dataflow
Cloud Dataflow (Batch) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and Storm workloads at Spotify
• Optimizes for consistency which can complicate real-time workloads
Cloud Dataflow (Streaming) at Spotify
Spotify + Google Cloud Timeline
2015 2016
Beginning of Google Cloud evaluation
BigQuery begins to replace Hive
Cloud Pub/Sub begins to replace Kafka
Dataflow (streaming) begins to replace Storm
Dataflow (batch) replacing Map/Reduce
Note: Dates are approximations
Putting It All Together
The Problem
• We want to detect within minutes if we’ve introduced a bug in a client release that affects important event logging behavior
Before…
Minutes to transfer
Hours to Clean and Bucket
Hours to Run Jobs or Ad Hoc
QueriesDAYS TO INSIGHTS
Getting Data from Clients to Pub/Sub
• Built Pulsar, a simple service aggregating data from Access Points and feeding it into Cloud Pub/Sub
• Replaces the Kafka real-time event feed
Pulsar
Dataflow
• Subscribes to important event Pub/Sub topics
• Aggregate events into minute windows
• Always running, no need to schedule or wait for results
BigQuery
• Receives aggregates from Dataflow
• Allows for ad-hoc inspection or slicing on different dimensions
Tableau
• Data Visualization Tool that integrates nicely with BigQuery
• Pulls data from BigQuery periodically and caches for quick inspection
Milliseconds to transfer
Milliseconds to process
Seconds to Query
SECONDS TO INSIGHTS
Faster Insights to Client Behavior
Problem
As a developer, I want to be able to instantly explore data being logged by the clients.
Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
Benefits
• Able to understand what’s being sent by the client as it happens
• Exploring events, visualizing distribution (i.e. does this field actually get populated)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
Other Uses
Ad Targeting
• Real-time genre targeting
• Session insights — explicit filter
Real-time Recommendations
Live Results for X-Factor
• X-Factor: music competition
• Songs available on Spotify immediately after show airs
• Listener behavior determines the order of contestants on the playlist
Review
Real-time
ProcessingBatch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009
Behind the Scenes
Minutes to transfer
Hours to Clean and Bucket
Hours to Run Jobs or Ad Hoc
QueriesDAYS TO INSIGHTS
To leverage actionable insights, we need a
faster feedback loop!
Putting it all togetherMilliseconds
to transfer
Milliseconds to process
Seconds to Query
SECONDS TO INSIGHTS
The Value of a Fast Feedback Loop
• Detecting problems early in data avoids long backfills or long term data loss
• Instant insights on newly developed features allows teams to iterate quicker and take risks
• Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster
Use Anything and Everything
• Spotify has leveraged Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery
• Opensource and other cloud providers offer many alternatives to this stack
• Opensource tools (Elasticsearch/Kibana) and proprietary solutions (Tableau) have also been useful additions
Where Are We Going?
• The real-time mission is in the early stages at Spotify
Stream Processing First
• The sun never sets on Spotify, why impose boundaries on our datasets?
• What’s the shortest distance between two points? Zero!
• Can we reduce the feedback cycle to zero?
We’re Hiring!Engineers, Managers, Product Owners needed in NYC and Stockholm
https://www.spotify.com/jobs
Thanks! BigDataSpain!
top related