Obvious and Non-Obvious Scalability Issues: Spotify Learnings

November 12, 2013 Obvious and Non-Obvious Scalability Issues : Spotify Learnings David Poblador i Garcia @davidpoblador BcnDevCon 13

Upload: david-poblador-i-garcia

Post on 05-Dec-2014


DESCRIPTION

These are the slides for the talk I gave at the Barcelona Developers Conference 2013. In this talk, I cover some of the scalability issues we have faced during the intense growth we have experienced since 2008. The talk is mostly aimed at systems and backend engineers. Note: some of the slides are not superawesome because the transitions were lost in the conversion to PDF.

TRANSCRIPT

Page 1: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

November 12, 2013

Obvious and Non-Obvious Scalability Issues: Spotify Learnings

David Poblador i Garcia @davidpoblador

BcnDevCon13

Page 2: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Spotify in numbers

Page 3: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

2011

Page 4: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

2011

Page 5: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

2013

Page 6: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

One order of magnitude bigger in some dimensions

Page 7: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

1,000,000,000 playlists (400M two years ago)

2M new playlists every day

Page 8: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

6000+ servers (1300 two years ago)

20 in 2008

Page 9: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Available in 32 markets (12 two years ago)

Page 10: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Two years ago, fewer than 10 people in Ops + Infra; more than 10 times bigger now

Page 11: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

4 Data Centers

Page 12: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

More than 20M songs, adding 20K every day

Page 13: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

More than 24M active users, 6M paying subscribers

Page 14: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

More than 50 teams building products & features

Page 15: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Around 100 backend systems

Page 16: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Learning to Scale

Page 17: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

Page 18: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

1. Admit that when you are small, there is someone better than you at building data centers

Page 19: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

2009?

Page 20: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

2. Streamline your procurement process

Page 21: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

2012

Page 22: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

3. Have a “unit of capacity”. We call it POD

Page 23: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

2012

Page 24: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Data Centers

4. Data Centers are being commoditized

Chances are that only a few players will deploy DCs in the future.

Keep an eye on that. It might make sense for your needs.

Page 25: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

cloud

Page 26: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling your backend

Page 27: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling your backend

[Diagram: Users, APs and backend services]

Page 28: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling your backend

know your limits

Page 29: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling your backend

[Diagram: Users, APs and backend service nodes]

Examples: 60K users, 5000 reqs/second
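Per-node limits like the ones on this slide can feed a simple capacity estimate. The sketch below is only an illustration: the function name `nodes_needed`, the headroom factor and the peak-load figure are assumptions, and only the 5000 reqs/second limit comes from the slide.

```python
import math

def nodes_needed(peak_load, per_node_limit, headroom=0.3):
    """Return how many nodes are required to serve peak_load.

    headroom is the fraction of each node's measured limit kept in
    reserve (a hypothetical default), so each node is planned at
    (1 - headroom) of its limit.
    """
    if per_node_limit <= 0 or not 0 <= headroom < 1:
        raise ValueError("per_node_limit must be > 0 and 0 <= headroom < 1")
    usable = per_node_limit * (1 - headroom)
    return math.ceil(peak_load / usable)

# Hypothetical peak of 40,000 reqs/second against the 5000 reqs/second
# per-node limit quoted on the slide.
print(nodes_needed(40_000, 5_000))
```

The estimate is only as good as the measured limit, which is why the limit has to be known before it is reached in production.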

Page 30: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

Page 31: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

DNS à la Spotify

Page 32: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

Error Reporting

DHT ring lookup

Service Discovery

User Distribution

Page 33: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

[Diagram: Users and APs, with “DNS GeoIP magic”]

Page 34: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

Page 35: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

8.8.8.8

Page 36: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Do not try to be ‘too smart’

[Diagram: Users and APs, with 8.8.8.8]
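One reading of the 8.8.8.8 slides is that DNS-based GeoIP decides on the address of the resolver that asks, not on the address of the end user, so users behind a public resolver can be steered to a distant AP. The toy model below illustrates that reading only. The region table, the hostnames and the documentation-range addresses are invented and do not describe Spotify's DNS setup.

```python
import ipaddress

# Hypothetical GeoIP table: network -> region. The first two networks are
# documentation ranges; the region assigned to 8.8.8.0/24 is an assumption.
GEOIP = {
    ipaddress.ip_network("192.0.2.0/24"): "eu",
    ipaddress.ip_network("198.51.100.0/24"): "us",
    ipaddress.ip_network("8.8.8.0/24"): "us",
}
# Hypothetical AP hostnames per region.
AP_BY_REGION = {"eu": "ap.eu.example.net", "us": "ap.us.example.net"}

def region_of(ip, default="us"):
    """Look up the region of an address in the toy GeoIP table."""
    addr = ipaddress.ip_address(ip)
    for network, region in GEOIP.items():
        if addr in network:
            return region
    return default

def geo_dns_answer(resolver_ip):
    """Pick an AP from the resolver's address, the only one the DNS server sees."""
    return AP_BY_REGION[region_of(resolver_ip)]

# A user in the "eu" network gets a different answer depending on the resolver.
print(geo_dns_answer("192.0.2.53"))  # resolver close to the user
print(geo_dns_answer("8.8.8.8"))     # public resolver
```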

Page 37: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

Page 38: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

[Diagram: APs and backend service nodes; data in RAM, 5000 reqs/second]

Page 39: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

[Diagram: APs and big backend service nodes; ? reqs/second]

Does not fit in RAM anymore

Page 40: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

Hard Drives: 200 IOPS

Page 41: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

[Diagram: APs and big backend service nodes; ? reqs/second]

Page 42: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

SSD: 10,000 IOPS

Page 43: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Storage Devices

Fusion IO: 250,000 IOPS
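The IOPS figures on the storage slides translate into a rough ceiling on request rate once the data no longer fits in RAM. The sketch assumes that every request that misses RAM costs a fixed number of random reads on a single device. That cost model and the function name are assumptions; only the IOPS numbers come from the slides.

```python
# IOPS figures quoted on the slides.
DEVICE_IOPS = {"Hard Drive": 200, "SSD": 10_000, "Fusion IO": 250_000}

def max_requests_per_second(iops, reads_per_request=1, ram_hit_ratio=0.0):
    """Upper bound on reqs/second that one device can sustain.

    reads_per_request: random reads needed when a request misses RAM (assumed).
    ram_hit_ratio: fraction of requests served entirely from RAM (assumed).
    """
    if reads_per_request <= 0 or not 0 <= ram_hit_ratio < 1:
        raise ValueError("invalid cost model")
    disk_reads_per_request = reads_per_request * (1 - ram_hit_ratio)
    return iops / disk_reads_per_request

for name, iops in DEVICE_IOPS.items():
    print(f"{name}: {max_requests_per_second(iops):,.0f} reqs/second")
```

Under these assumptions a single hard drive cannot sustain the 5000 reqs/second that the in-RAM node handled, which is the gap the slides point at.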

Page 44: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Page Cache

Page 45: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Page Cache

[Diagram: APs and backend service nodes; 5000 reqs/second]

Example

RAM: 32 GB

OS RAM: 2 GB

Songs: 10M

Index size: 10 GB

Page 46: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Page Cache

[Diagram: APs and backend service nodes]

Increase in data (songs…)

Index: approx. 13 GB

Page 47: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Page Cache

Page 48: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Page Cache

posix_fadvise(2)

orchestrate index deployment

mlock(2)
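The slide names `posix_fadvise(2)` and `mlock(2)` without showing how they were used. One plausible use, sketched here as an assumption, is to copy a new index onto a node and then tell the kernel not to keep the copied pages cached, so that the copy does not evict the index currently being served. Python exposes the call as `os.posix_fadvise` on Unix; the function names and the chunk size are illustrative.

```python
import os

def _advise(fd, advice):
    """Best-effort hint to the kernel; a failed hint must not fail the copy."""
    if hasattr(os, "posix_fadvise"):
        try:
            os.posix_fadvise(fd, 0, 0, advice)  # length 0 means "to end of file"
        except OSError:
            pass

def copy_without_caching(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst, then ask the kernel to drop the copied pages."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # only clean pages can be dropped
        if hasattr(os, "POSIX_FADV_DONTNEED"):
            _advise(dst.fileno(), os.POSIX_FADV_DONTNEED)
            _advise(src.fileno(), os.POSIX_FADV_DONTNEED)
    return copied

def prefetch(path):
    """Hint that the file will be read soon, e.g. just before switching to it."""
    if hasattr(os, "POSIX_FADV_WILLNEED"):
        with open(path, "rb") as f:
            _advise(f.fileno(), os.POSIX_FADV_WILLNEED)
```

A deployment tool could call `copy_without_caching` while the old index is still serving traffic and `prefetch` right before the switch. Pinning pages with `mlock(2)` is not shown because the Python standard library has no direct wrapper for it.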

Page 49: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

Page 50: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: Users, APs and backend services]

Page 51: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: User and AP; 5000 conns/sec]

DDoS’d by your clients

Page 52: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: User and AP; 5000 conns/sec]

Exponential Back Off Retry
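A minimal sketch of a client-side retry with exponential back off, assuming a reconnect that raises `ConnectionError` on failure. The attempt limit, the delays and the jitter are illustrative choices, not values from the talk.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() and retry on ConnectionError with growing delays.

    The delay ceiling doubles on every attempt (base_delay * 2**attempt,
    capped at max_delay) and is scaled by a random factor so that clients
    do not reconnect in lockstep. sleep and rng are injectable for testing.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry "not much": give up after a few attempts
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(ceiling * rng())
```

Without the growing delay, every disconnected client retries immediately and the reconnect storm looks like the DDoS described on the previous slide.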

Page 53: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: Users, APs and backend services]

Fail Fast
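The slide does not say which mechanism was used to fail fast. A circuit breaker is one common technique, sketched here under that assumption: after a few consecutive failures, calls are rejected immediately instead of waiting on a struggling backend. The class names, thresholds and timings are hypothetical.

```python
import time

class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend marked unhealthy, failing fast")
            self.opened_at = None  # waiting period over: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```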

Page 54: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Retry (not much). Back Off. Fail Fast. Degrade Gracefully

[Diagram: Users, APs and backend services]

Degrade gracefully

Acceptable Behaviour
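Graceful degradation can be as simple as serving a cheaper answer when a backend fails. The sketch below assumes a hypothetical personalised-playlist service and a static fallback list; none of these names come from the talk.

```python
def with_fallback(primary, fallback):
    """Return (value, degraded): primary() if it works, fallback() otherwise."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True

def fetch_personalised_playlists(user_id):
    # Stand-in for a call to a hypothetical backend service that is down.
    raise TimeoutError("recommendation backend unavailable")

def popular_playlists():
    # Hypothetical static content that is always cheap to serve.
    return ["Top 50", "New Releases"]

def home_screen(user_id):
    playlists, degraded = with_fallback(
        lambda: fetch_personalised_playlists(user_id),
        popular_playlists,
    )
    return {"playlists": playlists, "degraded": degraded}

print(home_screen("user-1"))
```

The user still gets an acceptable screen, and the `degraded` flag lets monitoring see how often the fallback path is taken.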

Page 55: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Test in real world conditions

Page 56: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Test in real world conditions

Use your most valuable asset

Start by sending X% of users to X% of your servers
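One way to send a fixed share of users to a matching share of servers is to hash each user id into a stable bucket. This is a sketch under that assumption; the pool names and the bucket count are hypothetical.

```python
import hashlib

def bucket(user_id, buckets=100):
    """Map a user id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def pick_pool(user_id, canary_percent):
    """Send canary_percent% of users to the 'canary' pool, the rest to 'stable'."""
    return "canary" if bucket(user_id) < canary_percent else "stable"

users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_pool(u, 5) == "canary" for u in users) / len(users)
print(f"share of users on the canary pool: {share:.1%}")
```

Because the bucket depends only on the user id, a user stays on the same pool between requests, and raising the percentage only moves additional users onto the new servers.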

Page 57: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Automate

Page 58: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Automate

When necessary

Page 59: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Automate

http://xkcd.com/1205/
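The linked comic weighs the time spent automating a task against the time it saves over five years. A back-of-the-envelope version of that comparison might look like this; the function name and the example numbers are assumptions.

```python
def worth_automating(seconds_saved_per_run, runs_per_day, hours_to_automate,
                     horizon_days=5 * 365):
    """Return (worth_it, hours_saved) over the given horizon."""
    hours_saved = seconds_saved_per_run * runs_per_day * horizon_days / 3600
    return hours_saved > hours_to_automate, hours_saved

# Hypothetical task: saves one minute, runs once a day, takes 10 hours to automate.
print(worth_automating(60, 1, 10))
```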

Page 60: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Take a self service approach everywhere

Page 61: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Take a self service approach everywhere

Configuration Management

Databases and Storage

Provisioning of Servers

Service Discovery

Load Balancing

Monitoring

…

Page 62: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations (the team)

Page 63: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

2011

Page 64: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

1. Start having teams carry operational responsibility for their own services, including on-call duties for the systems they own

Page 65: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

2012

Page 66: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

2. Infrastructure and Operations provide expert guidance/help on how to run the service(s) teams own in production (and everywhere else)

Page 67: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

2013

Page 68: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

3. Infrastructure and Operations focus their effort on building and extending our platform to create an awesome place to run services

Page 69: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Scaling Operations

devops

Page 70: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Incident Management Process

Page 71: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Incident Management Process

“Prevent an issue from happening twice”

Page 72: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Incident Management Process

OPS-6000

Page 73: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

Incident Management Process

Incident (severity)

Postmortem meeting with stakeholders

Remediations (urgency)

Page 74: Obvious and Non-Obvious Scalability Issues: Spotify Learnings

November 12, 2013

Moltes gràcies! (Thank you very much!)

David Poblador i Garcia @davidpoblador

BcnDevCon13