Obvious and Non-Obvious Scalability Issues: Spotify Learnings
DESCRIPTION
These are the slides for the talk I gave at the Barcelona Developers Conference 2013. In this talk, I cover some of the scalability issues we have faced during the intense growth we have experienced since 2008. The talk is mostly aimed at systems and backend engineers. Note: some of the slides are not super awesome because the transitions were lost in the conversion to PDF.
TRANSCRIPT
November 12, 2013
Obvious and Non-Obvious Scalability Issues: Spotify Learnings
David Poblador i Garcia (@davidpoblador)
BcnDevCon13
Spotify in numbers
2011 to 2013: one order of magnitude bigger in some dimensions
1,000,000,000 playlists (400M two years ago)
2M new playlists every day
6000+ servers (1300 two years ago, 20 in 2008)
Available in 32 markets (12 two years ago)
Two years ago, fewer than 10 people in Ops + Infra; more than 10 times bigger now
4 data centers
More than 20M songs, adding 20K every day
More than 24M active users, 6M paying subscribers
More than 50 teams building products & features
Around 100 backend systems
Learning to Scale
Scaling Data Centers
1. Admit that when you are small, there is someone better than you at building data centers.
2009?
2. Streamline your procurement process.
2012
3. Have a “unit of capacity”. We call it a POD.
2012
4. Data centers are being commoditized. Chances are that only a few players will deploy DCs in the future. Keep an eye on that: it might make sense for your needs.
Scaling Operations
cloud
Scaling your backend
[Diagram: users connect to APs, which talk to backend services]
Know your limits
[Diagram: users connect to APs, which talk to backend service nodes]
Backend service node: 60K users, 5000 reqs/second
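Knowing the per-node limits makes capacity planning simple arithmetic. The following is a minimal sketch, assuming the 60K users and 5000 reqs/second figures above are per-node limits; the `nodes_needed` helper, its `headroom` parameter, and the peak-load inputs are hypothetical and not taken from the talk.

```python
import math

# Per-node limits from the slide above.
USERS_PER_NODE = 60_000
REQS_PER_NODE = 5_000


def nodes_needed(peak_users, peak_reqs_per_sec, headroom=0.25):
    """Return how many nodes cover the peak load.

    `headroom` is the fraction of each node's capacity kept free
    (an assumption made for this sketch).
    """
    usable = 1.0 - headroom
    by_users = math.ceil(peak_users / (USERS_PER_NODE * usable))
    by_reqs = math.ceil(peak_reqs_per_sec / (REQS_PER_NODE * usable))
    # Whichever limit is hit first decides the size of the cluster.
    return max(by_users, by_reqs)
```

With no headroom, 600,000 concurrent users and 40,000 reqs/second would need 10 nodes, because the user limit is reached before the request limit.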
examples
Do not try to be ‘too smart’
DNS à la Spotify
Error Reporting
DHT ring lookup
Service Discovery
User Distribution
[Diagram: users are distributed across APs by DNS GeoIP magic]
8.8.8.8
[Diagram: users, APs, 8.8.8.8]
Storage Devices
[Diagram: APs talk to backend service nodes; data in RAM: 5000 reqs/second]
[Diagram: APs talk to big backend service nodes; data does not fit in RAM anymore: ? reqs/second]
Hard Drives: 200 IOPS
SSD: 10,000 IOPS
Fusion IO: 250,000 IOPS
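To put the IOPS figures next to the 5000 reqs/second target, here is a rough back-of-the-envelope sketch. It assumes every request costs a fixed number of random reads that all miss RAM (one by default), which is a simplification; the `devices_needed` helper is hypothetical.

```python
import math

# IOPS figures from the slides above.
DEVICE_IOPS = {"hard drive": 200, "SSD": 10_000, "Fusion IO": 250_000}


def devices_needed(reqs_per_sec, reads_per_req=1):
    """Return, per device type, how many devices sustain the request rate.

    Assumes each request performs `reads_per_req` random reads from storage.
    """
    needed_iops = reqs_per_sec * reads_per_req
    return {
        name: math.ceil(needed_iops / iops)
        for name, iops in DEVICE_IOPS.items()
    }
```

Under that assumption, `devices_needed(5000)` gives 25 hard drives against a single SSD or Fusion IO card.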
Page Cache
[Diagram: APs talk to backend service nodes at 5000 reqs/second]
Example
RAM: 32 GB
OS RAM: 2 GB
Songs: 10M
Index size: 10 GB
Increase in data (songs…)
Index: approx 13 GB
posix_fadvise(2)
orchestrate index deployment
mlock(2)
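As an illustration of the posix_fadvise(2) hint, here is a minimal sketch using Python's `os.posix_fadvise`, which is only available on Unix-like systems such as Linux. The `advise_index` helper and the idea of calling it around an index deployment are assumptions for the example, not a description of how Spotify does it. mlock(2) is not shown, because the Python standard library has no direct wrapper for it.

```python
import os


def advise_index(path, keep_cached):
    """Hint the kernel about a (hypothetical) index file; return its size.

    keep_cached=True  asks the kernel to read the file into the page cache.
    keep_cached=False tells the kernel the cached pages are no longer needed.
    These are hints only: the kernel is free to ignore them.
    """
    advice = os.POSIX_FADV_WILLNEED if keep_cached else os.POSIX_FADV_DONTNEED
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # offset=0 and length=0 cover the file from the start to its end.
        os.posix_fadvise(fd, 0, 0, advice)
        return size
    finally:
        os.close(fd)
```

A deployment script could, for instance, call `advise_index(new_index, True)` before switching traffic and `advise_index(old_index, False)` afterwards, so that the old and new index compete less for the same page cache.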
Retry (not much). Back Off. Fail Fast. Degrade Gracefully
[Diagram: users connect to APs, which talk to backend services]
[Diagram: users connecting to an AP at 5000 conns/sec]
DDoS’d by your clients
Exponential Back Off Retry
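A minimal sketch of a client retry loop with exponential back off, so that clients do not reconnect in lockstep and overwhelm the AP. The jitter, the default delays, and the names `backoff_delays` and `call_with_retry` are assumptions chosen for illustration, not details from the talk.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, max_retries=5):
    """Return the maximum delay before each retry: base, base*factor, ... up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def call_with_retry(operation, max_retries=5, base=0.5, factor=2.0, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call operation(); on ConnectionError, back off and retry a few times.

    Each wait is scaled by a random factor in [0, 1) so that many clients
    failing at the same moment do not all retry at the same moment.
    """
    delays = backoff_delays(base, factor, cap, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retry budget used up: give up instead of hammering
            sleep(delays[attempt] * rng())
```

The retry budget is deliberately small ("not much"): once it is exhausted the error is raised to the caller instead of looping forever.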
Fail Fast
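One common way to fail fast is a circuit breaker: after a few consecutive failures, calls to a backend are rejected immediately for a while instead of waiting on it. The talk does not say which mechanism Spotify uses, so the sketch below is only an example of the idea; `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the backend."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=10.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to keep rejecting calls
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend marked unhealthy; failing fast")
            # Cool-down is over: allow one trial call; one more failure
            # opens the circuit again.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

While the circuit is open the caller gets an error right away, which frees its resources and gives the struggling backend room to recover.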
Degrade gracefully
Acceptable Behaviour
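Degrading gracefully can be as simple as wrapping a call to a non-essential backend with a fallback that returns something acceptable. This is a minimal sketch; `with_fallback` and the example functions in the usage note are hypothetical names, not Spotify APIs.

```python
def with_fallback(primary, fallback):
    """Return a function that tries primary() and falls back on any error."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # The feature is degraded, but the caller still gets an answer.
            return fallback(*args, **kwargs)
    return wrapped
```

For example, `with_fallback(fetch_recommendations, lambda user_id: [])` would show an empty recommendations list when a hypothetical `fetch_recommendations` backend is down, rather than failing the whole page.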
Test in real world conditions
Use your most valuable asset
Start by sending X% of users to X% of your servers
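One way to send X% of users to X% of the servers is to hash each user ID into a stable bucket, so the same user always lands in the same group. The sketch below assumes string user IDs and two server pools; `bucket`, `in_canary`, `pick_pool`, and the salts are hypothetical names chosen for the example.

```python
import hashlib


def bucket(user_id, buckets=100, salt="rollout"):
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256((salt + ":" + user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_canary(user_id, percent):
    """True for roughly `percent`% of users, always the same ones."""
    return bucket(user_id) < percent


def pick_pool(user_id, percent, canary_servers, stable_servers):
    """Route canary users to the canary servers and everyone else to the rest."""
    pool = canary_servers if in_canary(user_id, percent) else stable_servers
    # A different salt keeps the server choice independent of the rollout bucket.
    return pool[bucket(user_id, len(pool), salt="server")]
```

Raising `percent` step by step widens the test from a small slice of real users to everyone, while each user's assignment stays stable between requests.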
Automate
When necessary
http://xkcd.com/1205/
Take a self-service approach everywhere
Configuration Management
Databases and Storage
Provisioning of Servers
Service Discovery
Load Balancing
Monitoring
…
Scaling Operations (the team)
2011
1. Start having teams carry operational responsibility for their own services, including on-call duties for the systems they own.
2012
2. Infrastructure and Operations provide expert guidance/help on how to run the service(s) teams own in production (and everywhere else).
2013
3. Infrastructure and Operations focus the effort on building and extending our platform to create an awesome place to run services.
devops
Incident Management Process
“Prevent an issue from happening twice”
OPS-6000
Incident (severity)
Postmortem meeting with stakeholders
Remediations (urgency)
Thank you very much!
David Poblador i Garcia (@davidpoblador)