Barcelona Developers Conference - 18 November 2011
Playing for millions, tuning for more
David Poblador i Garcia - @davidpoblador
Nick Barkas - @snb
Spotifiera anyone?
Outline
Growth
Deploying lots of servers
Backend architecture overview
Communication protocols
Storage
Monitoring
Future improvements
We’re kind of big
Over ten million registered users
Over two million paying subscribers
Launched in 12 countries
Over 15 million tracks*
Over 400 million playlists
Three datacentres
Over 1300 servers
* Number of tracks licensed globally. Catalogue size varies in each country.
We’re getting bigger!
More countries
• Added US (July) and Denmark (October) this year
• Austria, Switzerland, and Belgium added this week
More users
• Sign-up via Facebook
• From one to two million paying subscribers in six months
More music!
• Adding over 20,000 tracks each day
How to manage so many servers?
ServerDB
FAI
Debian Packaging
Puppet (yes, we also hate it sometimes)
Monitoring
ServerDB
In-house tool
An authoritative database of equipment
• Locations
• Datacentres
• Hostnames
Aiming to have it as the single source of truth for
• DNS config
• What server does what
• Puppet classes
• FAI classes
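ServerDB itself is in-house and not public, so purely to illustrate the idea, a record tying one machine to everything derived from it might look like this Python sketch (all field names are invented):

    # Hypothetical ServerDB entry: one authoritative record per machine,
    # from which DNS config, Puppet classes, and FAI classes are derived.
    # All field names are invented; the real in-house schema is not public.
    server = {
        "hostname": "frob1.example.com",
        "datacentre": "lon2",                      # physical location
        "rack": "a12",
        "role": "frobnicator",                     # what the server does
        "puppet_classes": ["base", "frobnicator"],
        "fai_class": "FROBNICATOR",                # partitioning + base install
    }

    # Everything else is generated from this single source of truth.
    print("%(hostname)s in %(datacentre)s runs %(role)s" % server)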
FAI and Puppet
FAI installs all the basic stuff on TFTP boot
• Partitions based on server type (and FAI class)
• Installs base packages (.deb, of course)
• Sets the basic network configuration
• Bootstraps Puppet
Puppet takes over
• Installs packages based on Puppet recipes
• Our devs write Puppet manifests
• We hate it (sometimes)
Let’s install a server!
Overview of Spotify components
[Architecture diagram: clients connect to access points, behind which sit the backend services (storage, search, playlist, user, web api, browse, ads, social, key, ...). Record labels feed content ingestion, indexing, and transcoding; music files are served via a CDN and Amazon S3. www.spotify.com and Facebook integrate with the backend, and logs flow into Hadoop for analysis.]
Reducing bandwidth: P2P and caching
DNS: finding services and resources
What's the hostname and port for the service I want?
• SRV record:
  _frobnicator._http.example.com. 3600 SRV 10 50 8081 frob1.example.com.
  (fields: name, TTL, priority, weight, port, host)
Which service instance should I ask for a resource?
• Distributed hash tables (DHT). Ring configuration:
  config._frobnicator._http.example.com. 3600 TXT "slaves=0"
  config._frobnicator._http.example.com. 3600 TXT "slaves=2 redundancy=host"
• Mapping ring segment to service instance:
  tokens.8081.frob1.example.com. 3600 TXT "00112233445566778899aabbccddeeff"
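These zone-file snippets translate directly into ordinary DNS queries. A minimal sketch of both lookups, assuming the dnspython library and the slide's placeholder _frobnicator names:

    # Minimal sketch of the DNS-based discovery above, using dnspython.
    # The _frobnicator._http.example.com names are the slide's placeholder
    # examples, not real Spotify services.
    import dns.resolver

    def find_service_instances(service="_frobnicator._http.example.com"):
        """Return (host, port) pairs from SRV records, highest priority first
        (proper weighted random selection within a priority is omitted)."""
        answers = dns.resolver.resolve(service, "SRV")
        ordered = sorted(answers, key=lambda r: (r.priority, -r.weight))
        return [(str(r.target).rstrip("."), r.port) for r in ordered]

    def ring_tokens(host, port):
        """Read the DHT tokens an instance owns from its tokens TXT record."""
        answers = dns.resolver.resolve("tokens.%d.%s" % (port, host), "TXT")
        return [b"".join(r.strings).decode() for r in answers]

    for host, port in find_service_instances():
        print(host, port, ring_tokens(host, port))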
Communication between services
Clients -> AP: proprietary protocol
AP -> service and service <-> service
• HTTP
‣ Originally all services used this
‣ Simple, well known, battle tested
‣ Each service defines its own (usually) RESTful protocol (toy sketch after this list)
• Splat: Service Platform
‣ Custom-built by Spotify devs
‣ Protocol defined with Thrift
‣ Provides replication and load balancing
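As a toy illustration of that per-service HTTP style (the playlist URL scheme and host below are invented for this example, not a real Spotify endpoint):

    # Toy illustration of a per-service RESTful HTTP protocol. The host,
    # port, and URL layout are invented; each real service defines its own.
    import json
    import urllib.request

    def get_playlist(playlist_id, host="playlist1.example.com", port=8081):
        url = "http://%s:%d/playlist/%s" % (host, port, playlist_id)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

A caller would first resolve the host and port with the SRV lookup shown earlier, then issue a plain HTTP request.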
New communication framework: hermes
Thin layer on top of ØMQ
Data in messages are serialized as protobuf
• Services define their APIs partly as protobuf messages
Hermes messages embedded in client <-> AP protocol
• AP doesn’t need to translate protocols; acts as ØMQ router
In addition to request/reply, we get pub/sub
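A minimal sketch of the request/reply half of such a setup with pyzmq; the socket layout is the standard ØMQ broker pattern, and the plain-bytes payload stands in for what would really be a protobuf-serialized Hermes message:

    # Request/reply through a router, sketched with pyzmq. In Hermes the
    # payload would be a protobuf message defined by the service's API;
    # plain bytes are used here to keep the sketch self-contained.
    import threading
    import zmq

    ctx = zmq.Context.instance()

    def service():
        # A backend service (e.g. playlist): answer one request.
        sock = ctx.socket(zmq.REP)
        sock.connect("tcp://127.0.0.1:5560")
        request = sock.recv()
        sock.send(b"reply to " + request)

    def access_point():
        # The AP does no protocol translation; it just routes frames
        # between clients and services.
        frontend = ctx.socket(zmq.ROUTER)
        frontend.bind("tcp://127.0.0.1:5559")
        backend = ctx.socket(zmq.DEALER)
        backend.bind("tcp://127.0.0.1:5560")
        zmq.proxy(frontend, backend)

    threading.Thread(target=access_point, daemon=True).start()
    threading.Thread(target=service, daemon=True).start()

    client = ctx.socket(zmq.REQ)
    client.connect("tcp://127.0.0.1:5559")
    client.send(b"get playlist 42")
    print(client.recv())  # b'reply to get playlist 42'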
Storage technologies
Critical, consistency important: PostgreSQL
• User info required for authentication
Huge, growing, eventual consistency OK: Cassandra
• Playlists, other user info, social
Fast, small, read-only key-value: Tokyo Cabinet
• Track/artist/album metadata, encryption keys
Large files, read-only: Nginx caching proxy + Amazon S3
• Music files, album cover art
Monitoring
We graph all our systems
• Munin plugins to collect data (minimal plugin sketch below)
‣ Server-related figures (CPU, disk...)
‣ System-related figures (latency, playbacks...)
• We use our own frontend to display the data
Alerts are handled using Zabbix
• We classify alerts by severity
• High severity alerts are delivered to our pagers
‣ Currently we only get a handful per week
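Munin plugins are just executables that print graph metadata when called with "config" and current values otherwise. A minimal sketch of one, with an invented playbacks-per-minute metric:

    #!/usr/bin/env python
    # Minimal Munin plugin sketch. The playbacks-per-minute metric is
    # invented for illustration; a real plugin would read the figure
    # from the service or its logs.
    import sys

    def read_playbacks_per_minute():
        return 12345  # placeholder value

    if len(sys.argv) > 1 and sys.argv[1] == "config":
        # Munin asks for graph metadata first.
        print("graph_title Playbacks per minute")
        print("graph_category spotify")
        print("playbacks.label playbacks/min")
    else:
        # Then polls for the current value.
        print("playbacks.value %d" % read_playbacks_per_minute())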
Future (and current) challenges
Self-recovery
• Diagnose
• Take measures
Auto notification
• Do not bother ops, bother our suppliers
Auto scaling
• Bring up new servers
Better way to register services than DNS
• ZooKeeper? Faster to update, always consistent
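The attraction of the ZooKeeper idea is ephemeral registration: a node vanishes when its owner dies, so lookups stay current without waiting on DNS TTLs. A hedged sketch using the kazoo client library (the /services path layout is invented):

    # Sketch of service registration in ZooKeeper via the kazoo library.
    # The /services path layout and instance naming are invented.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # Ephemeral + sequential: the node disappears automatically if this
    # process loses its ZooKeeper session, and each instance gets a
    # unique name, so registrations never go stale.
    zk.create("/services/frobnicator/instance-",
              b"frob1.example.com:8081",
              ephemeral=True, sequence=True, makepath=True)

    # Clients discover live instances by listing the children.
    print(zk.get_children("/services/frobnicator"))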
Gràcies! (Thank you!)
Preguntes? (Questions?)
Nick Barkas - @snb
David Poblador i Garcia - @davidpoblador