Spotify infrastructure
Behind the scenes
Javier Ubillos, 2013-11-25
What is Spotify?
► Lightweight on-demand streaming
► Over 24 Million active users
► Over 6 Million subscribers
► 20 Million tracks (fully licensed; catalogue size varies per country)
► Available in 32 countries
► Convenient (more so than piracy)
► Legal
More numbers
► Over 20,000 new tracks added each day
► Over 1 billion playlists created
► Over $500M has now been paid to rights holders since launch in October 2008
► Spotify is the overall number two digital service in Europe, after iTunes. (IFPI Digital Music Report, Jan 2011)
Work setting
► Spotify's version of Scrum or Kanban
► Three-week sprints (varies greatly; shorter is encouraged)
► Features are developed in squads of ~3 to ~10 people
► Each squad should feel like a mini start-up company
► Tech offices in Stockholm, Gothenburg, New York and San Francisco
► Hack days: 1.5 days every sprint (~10% of work time)
About me
► Started at Spotify in 2011 as a backend-infrastructure developer
► Previously a PhD student in computer communications at Mälardalen University (MDH), employed at the Swedish Institute of Computer Science (SICS)
► Mainly research at (OSI) layer 3
► Now working as a "technical product owner" for the team developing our tools for operational monitoring
► Create the alignment necessary to give the squad the autonomy and freedom to act and develop (both technically and personally)
The infrastructure
[Diagram: clients connect to the Spotify backend through an access point; clients also exchange data with each other over a P2P overlay]
Latency is everything
► Google web search: adding 100 ms to 400 ms of latency to serving search results increased the drop in searches from 0.2% to 0.6% (Speed Matters for Google Web Search, Jake Brutlag, 2009)
► Amazon reports that every 100 ms of latency costs 1% of profit
► Google Maps found that an increase from 400 ms to 900 ms to render search results dropped traffic by 20%
► A blink of an eye takes between 300 ms and 400 ms
► Median playback latency in Spotify is 265 ms (feels immediate)
Caching
► Player caches tracks it has played
► Default policy is to use 10% of free space (capped at 10 GB)
► Caches are large (56% are over 5GB)
► Least Recently Used policy for cache eviction
► Over 50% of data comes from local cache
► Cached files are served in P2P overlay
Note: Numbers are from the desktop client
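Since eviction is plain LRU over whole tracks, here is a minimal sketch of the idea (Python; illustrative only, the real client manages a proprietary on-disk cache):

```python
from collections import OrderedDict

class LRUTrackCache:
    """Bounded cache that evicts the least recently used tracks."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # track_id -> size in bytes

    def touch(self, track_id):
        """Mark a track as recently used (e.g. on playback)."""
        if track_id in self.entries:
            self.entries.move_to_end(track_id)

    def put(self, track_id, size_bytes):
        """Insert a track, evicting least recently used tracks as needed."""
        if track_id in self.entries:
            self.used -= self.entries.pop(track_id)
        self.entries[track_id] = size_bytes
        self.used += size_bytes
        while self.used > self.capacity:
            _, evicted_size = self.entries.popitem(last=False)  # oldest first
            self.used -= evicted_size
```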
Data sources (2012)
► 55.4% from local caches
► 35.8% from P2P
► 8.8% from Spotify servers
► Median latency: 265 ms
► 75th percentile: 515 ms
► 90th percentile: 1047 ms
► Less than 1% of playbacks had stutter occurrences
Transfer strategy
[Diagram: the client fetches data both from the Spotify backend via an access point and from other clients over P2P]
► Fetch first piece from Spotify servers
► Meanwhile; search P2P for remainder
► Switch back and forth between Spotify servers and peers as needed
► Towards end of a track, start prefetching the next track
► Spotify traffic is very bursty
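A hedged sketch of that switching logic (Python; the server, p2p and buffer objects are hypothetical stand-ins for client internals):

```python
def stream_track(track, server, p2p, buffer):
    """Sketch: server-first start, then peer-assisted streaming."""
    # Fetch the first piece from Spotify servers for low startup latency.
    buffer.add(server.fetch(track, piece=0))

    # Meanwhile, search the P2P overlay for peers holding the rest.
    peers = p2p.search(track)

    for piece in range(1, track.num_pieces):
        if buffer.running_low() or not peers:
            # Behind on playback (or no peers found): use the servers.
            buffer.add(server.fetch(track, piece))
        else:
            buffer.add(p2p.fetch(peers, track, piece))

    # Towards the end of the track, prefetching of the next track
    # would start here.
```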
The parts
► The streaming protocol
► The P2P net
► The backend
Streaming
► Proprietary protocol
► Designed for on-demand streaming
► Only Spotify can add tracks
► 96-320 kbps audio streams (most are Ogg Vorbis q5, 160 kbps)
► Peer-assisted streaming (YAY!)
Protocol use
► Everything over TCP (almost)
► Everything encrypted (almost)
► Multiplex messages over a single TCP connection
► TCP_NODELAY (disables Nagle's algorithm)
► Avoids repeated TCP slow starts (one long-lived connection instead of many short ones)
► Persistent TCP connection to the server while logged in
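At the socket layer, TCP_NODELAY is a one-line option; a minimal Python illustration (the access point hostname and port are made up):

```python
import socket

# Long-lived connection to an access point; endpoint is illustrative.
sock = socket.create_connection(("ap.example.com", 4070))

# Disable Nagle's algorithm so small multiplexed messages go out
# immediately instead of being coalesced into larger segments.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```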
Intermission: NAT and DNS error reporting
► Even when 'normal' internet access is blocked, you can often get DNS packets through
► We use this for error signaling, to find out why our users are having problems
► Other implementations of this kind of communication are OzymanDNS and iodine, both of which tunnel traffic over DNS
Send: TXT query for com.domain.sub.<base32-encoded data>
Receive: TXT reply containing <base32-encoded data>
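A hedged sketch of the client side of such a channel (Python; assumes the third-party dnspython package, and err.example.com is a hypothetical reporting zone, not Spotify's):

```python
import base64
import dns.resolver  # third-party package: dnspython

REPORT_ZONE = "err.example.com"  # hypothetical error-reporting domain

def report_error(payload: bytes) -> bytes:
    """Send an error report as a DNS TXT query and read the TXT reply."""
    # Encode the report into a DNS label (labels are limited to 63 bytes,
    # so a real implementation would split larger payloads).
    label = base64.b32encode(payload).decode().rstrip("=")
    answer = dns.resolver.resolve(f"{label}.{REPORT_ZONE}", "TXT")
    # The server's response rides back in the TXT record data.
    txt = b"".join(answer[0].strings)
    return base64.b32decode(txt + b"=" * (-len(txt) % 8), casefold=True)
```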
Why P2P?
► Offloading our servers
► Less bandwidth
► Fewer servers to buy & support
► Somewhat higher reliability
P2P overlay
► One P2P overlay for all tracks (actually three)
► Unstructured network (not DHT)
► All peers are equal (no supernodes)
► Nodes that are not actively playing drop out of the overlay
Peer coordination
[Diagram: clients find each other through a rendezvous service in the Spotify backend]
► Server-side tracker (BitTorrent style)
► No overlay routing (e.g. Kademlia)
► Neighborhood in overlay (Gnutella style)
► Peers are found via the server-side tracker, Gnutella-style queries (search, 2 hops), and LAN peer discovery
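A minimal sketch of combining the three discovery mechanisms (Python; the tracker, neighbor, and LAN objects are hypothetical stand-ins):

```python
def find_peers(track_id, tracker, neighbors, lan):
    """Merge the three peer-discovery mechanisms into one candidate set."""
    peers = set(tracker.lookup(track_id))        # ask the backend tracker
    peers |= set(lan.broadcast_query(track_id))  # peers on the same LAN

    # Gnutella-style search, limited to 2 hops in the overlay.
    for neighbor in neighbors:
        peers |= set(neighbor.query(track_id, ttl=2))
    return peers
```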
Peer coordination
► Fixed maximum degree (60)
► Fairness is not enforced (e.g. no tit-for-tat)
► Neighbor eviction by heuristic evaluation of utility
► No pro-active caching! Be kind to the client
► Nodes that are not actively playing drop out; the tracker returns the last few hundred observed plays of each specific track
► A balance between getting collaboration from nodes and not being abusive to our customers
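A hedged sketch of utility-based neighbor eviction (Python; the scoring terms are plausible inputs, not Spotify's actual heuristic):

```python
MAX_DEGREE = 60  # fixed maximum number of neighbors

def utility(peer):
    """Illustrative utility score; the real heuristic is not public."""
    return (peer.bytes_served_recently
            + 0.5 * peer.bytes_received_recently
            - peer.seconds_since_last_useful_data)

def enforce_degree(neighbors):
    """Evict the least useful neighbors once the degree cap is exceeded."""
    while len(neighbors) > MAX_DEGREE:
        worst = min(neighbors, key=utility)
        neighbors.remove(worst)
        worst.disconnect()
```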
Rendezvous?
► We use TCP (almost) everywhere
► Both sender and receiver are instructed by the rendezvous service to connect to each other
► Typically, one of the two sides will be connectable
► We try to open ports using UPNP
► In some regions, UPNP and connectivity totally sucks! [1]
► On average, 65% … failure.
[1] Measurements on the Spotify Peer-Assisted Music-on-Demand Streaming System, Mikael Goldmann & Gunnar Kreitz, IEEE P2P 2011
Peer finding efficiency
Sources for peers   Fraction of searches
Tracker and P2P     75.1%
Only tracker         9.0%
Only P2P             7.0%
No peers found       8.9%
► Most efficient during
► Weekends
► Peak hours
Javier Ubillos
Peer finding efficiency
Type                 Fraction
Music data, used     94.80%
Music data, unused    2.38%
Search overhead       2.33%
Other overhead        0.48%
► Measured at the socket layer
► Unused data means cancelled or duplicate requests
When should we start playing?
► Model the user's bandwidth using a Markov chain
► Feed the Markov model with bandwidth-transition observations made every 100 ms
► Simulate 100 times
► If more than 1 run results in stuttering, download more before starting playback
[Diagram: Markov chain over bandwidth states (10 KB/s, 20 KB/s, 30 KB/s) with transition probabilities x%, y%, z%, plus an absorbing "Finished" state]
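A hedged sketch of that decision rule (Python; the state space and transition probabilities below are illustrative placeholders, whereas the real model is fed with observed transitions):

```python
import random

# Illustrative transition table over bandwidth states (KB/s).
TRANSITIONS = {
    10: ([10, 20, 30], [0.6, 0.3, 0.1]),
    20: ([10, 20, 30], [0.2, 0.6, 0.2]),
    30: ([10, 20, 30], [0.1, 0.3, 0.6]),
}

def simulate_once(state, buffered_kb, track_kb, playback_kbps):
    """Walk the chain in 100 ms steps; return True if playback stutters."""
    downloaded, played = float(buffered_kb), 0.0
    while played < track_kb:
        states, probs = TRANSITIONS[state]
        state = random.choices(states, probs)[0]
        downloaded = min(track_kb, downloaded + state * 0.1)
        played = min(track_kb, played + playback_kbps * 0.1)
        if played > downloaded:
            return True  # buffer underrun: the player would stutter
    return False

def safe_to_start(state, buffered_kb, track_kb, playback_kbps):
    """Start playback only if at most 1 of 100 simulated runs stutters."""
    runs = (simulate_once(state, buffered_kb, track_kb, playback_kbps)
            for _ in range(100))
    return sum(runs) <= 1
```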
Cellphones, my nemesis
► Spotify on handheld devices has proven to be very popular.
► But it breaks my baby (P2P)!
► Giving life to P2P on the cellphone is a very hard problem
[Diagram: the device reaches the Spotify backend through the radio base station (RBS) and radio network controller (RNC); the cellular hop brings latency, packet drops, congestion, and cost]
Cellphones, discoveries so far
► PhD student Raul Jimenez (KTH) did a study on battery consumption
► The radio is up most of the time
► The radio & screen are the biggest battery consumers
► We can optimize for radio downtime by buffering communication between the device and the backend
► The current situation, where the radio is up most of the time, suggests that P2P could be a viable option
The backend
► Built on (lots of) small services
► All traffic enters and exits via the Access Points
► Try to keep service interdependence low
Backend internals
[Diagram: the Access Point talks to services such as Playlist, Search, Storage and User; the client side uses the proprietary protocol, while services talk to each other over HTTP]
► The request-reply style of RPC is limiting
► A (bad) alternative would be HTTP long-polling
► A change to a better protocol would allow us to:
► Route on service (not a specific host)
► Message-passing style communication
► Push and publish-subscribe will actually work
Backend internals
[Diagram: the same services now talk to each other over Hermes, a subset of the backend protocol; the client side remains proprietary]
► So, we did that: a "service-oriented protocol" called Hermes
► Can be routed on the service name (~DNS)
► Request-reply, push/pull, pub/sub
► Message-passing style communication
► Push and publish-subscribe actually work
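Hermes itself is proprietary; as a loose illustration of routing requests on a service name rather than a host, here is a sketch using ZeroMQ (pyzmq), with a made-up resolver and endpoint:

```python
import zmq

# Hypothetical resolver mapping service names to endpoints (~DNS).
SERVICES = {"playlist": "tcp://playlist.example.internal:5555"}

def request(service: str, message: bytes) -> bytes:
    """Send a request addressed to a service name, not a specific host."""
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect(SERVICES[service])  # resolve service -> endpoint
    sock.send(message)
    return sock.recv()

# Usage: reply = request("playlist", b"GET playlist/123")
```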
Playlist
► Our most complex service!
► Simultaneous writes with automatic conflict resolution
► Publish-subscribe to clients
► Changes are automatically versioned; deltas are transmitted
► Terabytes of data; over 500 million playlists
► Cassandra database
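A hedged sketch of merging concurrent playlist edits (Python; this mirrors the idea of replaying versioned change-sets, not Spotify's actual algorithm):

```python
def merge_concurrent_edits(base, edits_a, edits_b):
    """Replay two concurrent edit lists on a common base version.

    Each edit is ('add', track) or ('del', track). Both sides' edits
    are applied; on a direct conflict the edit replayed last wins.
    """
    result = list(base)
    for op, track in edits_a + edits_b:
        if op == "add" and track not in result:
            result.append(track)
        elif op == "del" and track in result:
            result.remove(track)
    return result

# Usage: two clients edit the same base version concurrently.
# merge_concurrent_edits(["t1", "t2"], [("add", "t3")], [("del", "t2")])
# -> ["t1", "t3"]
```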
Data collection
► Four general monitoring mechanisms
► Plain logging of events
► Probabilistic logging (mainly to avoid data transfer overhead in the client)
► Munin (one data sample every five minutes)
► In-house statistics library
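For the probabilistic case, the trick is that logging each event with probability p and scaling counts by 1/p keeps totals unbiased while cutting client traffic. A minimal sketch (Python; the sampling rate is illustrative):

```python
import random

SAMPLE_RATE = 0.01  # illustrative: log roughly 1% of events

def maybe_log(event, sink):
    """Log an event with probability SAMPLE_RATE."""
    if random.random() < SAMPLE_RATE:
        sink.write(event)

def estimate_total(sampled_count):
    """Scale a sampled count back up to an unbiased estimate."""
    return sampled_count / SAMPLE_RATE
```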
Data collection
► Business data and operational data live in two separate systems
► Business data takes on average ~4 h to propagate from producer to consumer
► Operational data takes 5-10 minutes from production to display
► Such delays make it unreasonable to use the data as a foundation for decisions when developing in tight iterations
Accesspoint data
► Access points produce both business data and operational data
► They produce about half of all data (most likely more, as this includes all user-produced logs as well)
► ~2 TB every day
► Played songs, user logins/logouts, P2P data, user views, user events (e.g. clicks)
► All data flies in as log files
Operational data
► For services in the backend we use Munin and RRDs*
► Every data source is sampled once every five minutes
► Collected by site-local data collectors
► Forwarded to a central data repository
► Graphed and presented to service owners and operations
* Round Robin Database: a read/write data store of constant size
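A minimal sketch of the round-robin idea behind an RRD (Python): a fixed-size ring of samples, so the store never grows no matter how long it runs:

```python
class RoundRobinStore:
    """Fixed-size sample store: new samples overwrite the oldest ones."""

    def __init__(self, slots):
        self.samples = [None] * slots
        self.next = 0

    def record(self, value):
        self.samples[self.next] = value
        self.next = (self.next + 1) % len(self.samples)

# Usage: one sample every five minutes, keeping one day of history.
store = RoundRobinStore(slots=24 * 60 // 5)
store.record(42.0)
```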
Merging the datastreams
► How can we merge these data streams?
► Pipe logs through Kafka (publish-subscribe infrastructure developed by LinkedIn)
► Push to Storm (developed by Twitter)
► Back to Kafka, deliver to subscribers (graphing systems)
► Have all operational metrics pushed to a narrow waist (collectd)
► Publish to Kafka, deliver to subscribers
► Graphed and presented to service owners and operations
Gotchas encountered along the way:
► Linux kernel send-window limitations
► Completely different message-passing paradigm
► UDP too spammy, need to batch with TCP
► WebSockets? Browsers don't handle a high rate of packets very well
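A hedged sketch of the log-shipping step (Python; assumes the third-party kafka-python package, and the broker and topic names are made up):

```python
from kafka import KafkaProducer  # third-party package: kafka-python

# Hypothetical broker and topic names.
producer = KafkaProducer(bootstrap_servers=["kafka.example.internal:9092"])

def ship_log_line(line: str) -> None:
    """Publish one log line; subscribers (e.g. Storm topologies and
    graphing systems) consume it from the topic."""
    producer.send("service-logs", line.encode("utf-8"))

ship_log_line('{"event": "song_played", "track": "t123"}')
producer.flush()  # batch over TCP rather than spamming small packets
```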