Obvious and Non-Obvious Scalability Issues: Spotify Learnings

November 12, 2013

David Poblador i Garcia (@davidpoblador)

BcnDevCon13

Spotify in numbers

One order of magnitude bigger in some dimensions

1,000,000,000 playlists (400M two years ago)

2M new playlists every day

6000+ servers (1300 two years ago, 20 in 2008)

Available in 32 markets (12 two years ago)

Two years ago, fewer than 10 people in Ops + Infra; more than 10 times bigger now

4 data centers

More than 20M songs, adding 20K every day

More than 24M active users, 6M paying subscribers

More than 50 teams building products & features

Around 100 backend systems

Learning to Scale

Scaling Data Centers

1. Admit that when you are small, there is someone better than you at building data centers

2009?

2. Streamline your procurement process


2012

3. Have a “unit of capacity”. We call it POD


2012

4. Data centers are being commoditized. Chances are that only a few players will deploy DCs in the future. Keep an eye on that. It might make sense for your needs.

Scaling Operations

cloud

Scaling your backend


[Diagram: users connect to APs (access points), which call backend services]


know your limits


[Diagram: users connect to APs, which call backend service nodes; 60K users, 5000 reqs/second]

examples

Do not try to be ‘too smart’

DNS à la Spotify


Error Reporting
DHT ring lookup
Service Discovery
User Distribution
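One way DNS can carry service discovery is through SRV records, where each record has a priority, a weight, a port and a target host. The slide does not say which record types are used, so SRV is an assumption here, and the record values and the `pick_target` helper below are hypothetical. The sketch only shows the selection rule from RFC 2782: use the lowest priority group, then choose within it in proportion to weight.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower value is preferred
    weight: int    # relative share within one priority group
    port: int
    target: str


def pick_target(records: List[SrvRecord],
                rng: Optional[random.Random] = None) -> SrvRecord:
    """Pick one record: lowest priority first, then weighted random."""
    if not records:
        raise ValueError("no SRV records to choose from")
    rng = rng or random.Random()
    lowest = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == lowest]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        # All weights are zero: treat the candidates as equally likely.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical records for an imaginary backend service.
records = [
    SrvRecord(10, 1, 8080, "node-a.example.net."),
    SrvRecord(10, 0, 8080, "node-b.example.net."),
    SrvRecord(20, 5, 8080, "node-c.example.net."),
]
chosen = pick_target(records, random.Random(1))
print(chosen.target, chosen.port)
```

Because the choice happens in the client, adding or draining a node is a DNS change rather than a client release.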


[Diagram: users connect to APs; label: DNS GeoIP magic]


8.8.8.8


[Diagram: users connect to APs; label: 8.8.8.8]

Storage Devices

[Diagram: APs call backend service nodes; 5000 reqs/second; RAM]

[Diagram: APs call one big backend service node; ? reqs/second]

Does not fit in RAM anymore

Hard Drives: 200 IOPS

[Diagram: APs; ? reqs/second]

SSD: 10,000 IOPS

Fusion IO: 250,000 IOPS
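To put the IOPS figures next to the 5000 reqs/second from the earlier slide, here is a rough back-of-the-envelope sketch. It assumes each request costs exactly one random I/O once the data no longer fits in RAM, which is a simplification and not something the slides state; `devices_needed` is a hypothetical helper.

```python
import math

# IOPS figures as quoted on the slides.
DEVICE_IOPS = {"Hard drive": 200, "SSD": 10_000, "Fusion IO": 250_000}


def devices_needed(reqs_per_second: float, iops_per_device: float,
                   ios_per_request: float = 1.0) -> int:
    """Devices required to sustain the load if every request hits storage."""
    return math.ceil(reqs_per_second * ios_per_request / iops_per_device)


for name, iops in DEVICE_IOPS.items():
    print(f"{name}: {devices_needed(5000, iops)} device(s) for 5000 reqs/second")
```

Under that assumption, 5000 reqs/second would need 25 hard drives, while a single SSD or Fusion IO card would cover it.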

Page Cache

[Diagram: APs call backend service nodes; 5000 reqs/second]

Example

RAM: 32 GB
OS RAM: 2 GB
Songs: 10M
Index size: 10 GB

[Diagram: APs call backend service nodes]

Increase in data (songs…)

Index: approx. 13 GB

Page Cache

posix_fadvise(2)

orchestrate index deployment

mlock(2)
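As a minimal sketch of the posix_fadvise(2) item, the snippet below hints the kernel about a file's place in the page cache through Python's `os.posix_fadvise` wrapper. How the hint fits into an index deployment is an assumption (for example, warming a new index file before switching to it); the `advise_file` helper and its `warm` flag are hypothetical. mlock(2), which pins pages in RAM, is not shown.

```python
import os


def advise_file(path: str, warm: bool = True) -> int:
    """Hint the kernel about `path` and return the file size in bytes.

    warm=True  -> POSIX_FADV_WILLNEED: start reading the file into the page cache.
    warm=False -> POSIX_FADV_DONTNEED: the cached pages may be dropped.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            advice = os.POSIX_FADV_WILLNEED if warm else os.POSIX_FADV_DONTNEED
            # Offset 0 with length 0 covers the file through to its end.
            os.posix_fadvise(fd, 0, 0, advice)
        return size
    finally:
        os.close(fd)
```

The call is only advice: the kernel is free to ignore it, so it complements rather than replaces careful sizing of RAM against the index.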

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: users connect to APs, which call backend services]


[Diagram: users connect to an AP; 5000 conns/sec]

DDoS’d by your clients


[Diagram: users connect to an AP; 5000 conns/sec]

Exponential Back Off Retry
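A minimal sketch of a bounded retry with exponential back off follows. The function names, the default base and cap, and the random jitter are all assumptions added for illustration; the slides only name the technique. Jitter is included because it spreads reconnecting clients out in time instead of having them return in synchronized waves.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound in seconds for the wait after failed attempt `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(fn: Callable[[], T], max_attempts: int = 4,
                      base: float = 0.5, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call `fn`, retrying a limited number of times with growing, jittered waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the error to the caller
            # Wait a random fraction of the exponentially growing bound.
            sleep(rng() * backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting.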


[Diagram: users connect to APs, which call backend services]

Fail Fast
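One common way to fail fast is a circuit breaker: after a few consecutive failures, callers get an immediate error for a cool-down period instead of queuing up behind a struggling backend. The slides do not say which mechanism is used, so this is a swapped-in illustration; the class, its thresholds and the error type are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised immediately while the backend is considered unhealthy."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

While the circuit is open the wrapped function is not called at all, which is what keeps load off the failing service.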


[Diagram: users connect to APs, which call backend services]

Degrade gracefully

Acceptable Behaviour
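Graceful degradation can be as simple as serving the last known good answer when a backend call fails. The sketch below is one possible shape, with a hypothetical `LastGoodFallback` wrapper; what counts as acceptable behaviour for a given feature is a product decision the code cannot make.

```python
from typing import Callable, Generic, Tuple, TypeVar

T = TypeVar("T")


class LastGoodFallback(Generic[T]):
    """Wrap a fetch function and fall back to the last successful result."""

    def __init__(self, fetch: Callable[[], T], default: T) -> None:
        self._fetch = fetch
        self._last_good = default  # served until the first successful fetch

    def get(self) -> Tuple[T, bool]:
        """Return (value, degraded); degraded is True when the fetch failed."""
        try:
            self._last_good = self._fetch()
            return self._last_good, False
        except Exception:
            return self._last_good, True
```

Returning the `degraded` flag lets the caller decide whether to show stale data as-is or mark it in some way.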

Test in real-world conditions

Use your most valuable asset

Start by sending X% of users to X% of your servers
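A sketch of how a fixed share of users could be routed to a matching share of servers is shown below. It assumes users are identified by a string ID and hashes that ID into a stable bucket, so the same user always lands in the same group; the function names, the salt and the pool labels are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "rollout-1") -> bool:
    """True if this user falls inside the first `percent` of hash buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def pool_for(user_id: str, percent: float) -> str:
    """Name of the server pool that should handle this user."""
    return "canary" if in_rollout(user_id, percent) else "stable"


print(pool_for("user-42", 5))
```

Changing the salt reshuffles which users are selected, which avoids always testing on the same group.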

Automate

When necessary

http://xkcd.com/1205/
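The linked comic asks how long you can work on making a routine task more efficient before you spend more time than you save, counted across five years. A small sketch of that arithmetic follows; the function names are hypothetical and the inputs are made-up examples.

```python
def break_even_hours(seconds_saved_per_run: float, runs_per_day: float,
                     years: float = 5.0) -> float:
    """Total hours saved over the period, i.e. the most you should invest."""
    return seconds_saved_per_run * runs_per_day * 365 * years / 3600


def worth_automating(hours_to_automate: float, seconds_saved_per_run: float,
                     runs_per_day: float, years: float = 5.0) -> bool:
    return hours_to_automate < break_even_hours(
        seconds_saved_per_run, runs_per_day, years)


# Example: saving one minute on a task that runs five times a day.
print(round(break_even_hours(60, 5), 1), "hours")
```

In this example the budget comes out at roughly 152 hours, or a little over six days.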

Take a self-service approach everywhere

Configuration Management
Databases and Storage
Provisioning of Servers
Service Discovery
Load Balancing
Monitoring
…

Scaling Operations (the team)

2011

1. Start having teams carry operational responsibility for their own services, including on-call duties for the systems they own

2012

2. Infrastructure and Operations provide expert guidance/help on how to run the service(s) that teams own in production (and everywhere else)

2013

3. Infrastructure and Operations focus the effort on building and extending our platform to create an awesome place to run services

Scaling Operations

devops

Incident Management Process

“Prevent an issue from happening twice”


OPS-6000


Incident (severity)

Postmortem meeting with stakeholders

Remediations (urgency)

Moltes gràcies! (Thank you very much!)
