Why Did We Think Large Scale Distributed Systems Would be Easy? - PuppetConf 2013

Gordon Rowell, PuppetConf San Francisco 2013, [email protected]


DESCRIPTION

"Why Did We Think Large Scale Distributed Systems Would be Easy?" by Gordon Rowell, Site Reliability Manager, Google.

Presentation Overview: Google's Corporate Engineering SRE team provides infrastructure services used by many of Google's desktops, laptops and servers. This talk gives an overview of the design philosophy, challenges, technologies and some interesting failures seen while implementing infrastructure at scale.

Speaker Bio: Gordon Rowell is a site reliability manager at Google, Sydney. His team focuses on delivering services to Googlers around the world. They have migrated major internal services to run on Google technology and are currently focused on removing dependencies on the corporate network. He enjoys the challenges of building robust systems that scale and has a particular passion for configuration management. Prior to joining Google in 2006, he worked as an independent systems developer with a focus on telecommunications infrastructure. He also worked at e-smith/Mitel building an open source Internet small business server/gateway. He lives in Sydney, but used to live in Ottawa, where he ice-skated to work. Gordon earned a Bachelor of Science with Honours in Computer Science from the University of NSW.

TRANSCRIPT

Page 1: Why Did We Think Large Scale Distributed Systems Would be Easy? - PuppetConf 2013

Why Did We Think Large Scale Distributed Systems Would be Easy?

Gordon Rowell
PuppetConf San Francisco 2013
[email protected]

Page 2

Background

Site Reliability Engineering runs many services. The same rules always apply:

●  Make the service scale
●  Make the deployment consistent
●  Understand all layers of the system
●  Monitor everything
●  Plan for failure
●  Break things, under controlled conditions

Page 3

Scaling is fun

We don't deploy "a server"

•  Servers break, power fails
•  Clients/DNS need to be reconfigured

We don't deploy "a cluster"

•  Networks break, servers break, power fails
•  Clients/DNS need to be reconfigured

We deploy redundant clusters

•  Attempt to send clients to the nearest serving cluster
•  Anycast allows for unified client configuration

Page 4

But client DoS is not

Poorly written code...

●  on small numbers of clients...
●  is annoying

Poorly written code...

●  on a huge number of clients...
●  can cause serious infrastructure pain

Write good code and stage your releases

●  Work with the service owners
●  Stage rollouts, allow soak time
●  Have a rollback plan for clients and test it
●  Have DoS limits for services, and test them
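The server-side DoS limits mentioned above are often implemented as a token bucket. A minimal Python sketch of the idea; the rate and burst numbers are invented for illustration, not anyone's production settings:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: shed requests once callers exceed
    the allowed rate, rather than letting a misbehaving client fleet
    overwhelm the backend."""

    def __init__(self, rate, burst):
        self.rate = rate            # tokens replenished per second
        self.capacity = burst       # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Illustrative limits: 10 requests/second sustained, bursts of 5.
bucket = TokenBucket(rate=10, burst=5)
results = [bucket.allow() for _ in range(8)]
```

In a burst of back-to-back calls, the first `burst` requests pass and the rest are refused until tokens refill, which is exactly the "test your DoS limits" behaviour the slide asks for.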

Page 5

Load balancing is fun

Do you have enough capacity?

•  How many backends do you need?
•  What happens if half of your backends lose power?
•  What about when half are already out for repairs?

How do you send clients to the right cluster?

•  Client configuration
•  DNS round-robin (simple global load balancing)
•  DNS views (give best answer for client IP)
•  Anycast (portable IP, routed to "nearest" cluster)
•  Consider: DNS views plus Anycast
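The round-robin option above can be modeled in a few lines: each lookup rotates the record order, so naive clients that take the first answer spread evenly across backends. This is a toy illustration using RFC 5737 example addresses, not a real resolver:

```python
# Hypothetical backend addresses for illustration; a real deployment
# would publish these as multiple A records for one service name.
BACKENDS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

class RoundRobinResolver:
    """Toy model of DNS round-robin load balancing."""

    def __init__(self, records):
        self.records = list(records)

    def lookup(self):
        answer = list(self.records)
        # Rotate the record order for the next caller.
        self.records.append(self.records.pop(0))
        return answer

resolver = RoundRobinResolver(BACKENDS)
first_choices = [resolver.lookup()[0] for _ in range(6)]
```

Note what this model also makes obvious: round-robin is blind to client location and backend health, which is why the slide suggests combining DNS views with Anycast.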

Page 6

But global outages are not

Monitor everything

●  Health check failures bring down your service
●  ...by design

Test everything

●  You should expect (and test) data center outages
●  A global outage can ruin your day
●  Cascading failures are unpleasant

Learn from outages

●  Write postmortems
●  Focus on the facts!
●  What went wrong and what can be better?
●  A postmortem is not about blame

Page 7

Thundering herds are not

For Puppet

•  "Lots" of Mac desktops and laptops
•  "Lots" of Ubuntu desktops, laptops and servers
•  "Some" others

What if they all want to do a puppet run?

•  What about every hour?
•  What about every five minutes?

Randomize your cron jobs! (and test it)

How can you shed load on the server?
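One common way to randomize the runs is a deterministic per-host splay, in the spirit of Puppet's `splay`/`fqdn_rand`: hash the hostname so each machine always gets the same offset within the run period, spreading the fleet evenly instead of having every client hit the master at the top of the hour. A sketch, with hypothetical hostnames:

```python
import hashlib

def splay_seconds(hostname, period=3600):
    """Deterministic per-host delay within the run period.

    Hashing the hostname (rather than calling random()) means each
    host keeps the same slot across reboots, so the load on the
    server stays evenly spread and predictable.
    """
    digest = hashlib.sha256(hostname.encode()).hexdigest()
    return int(digest, 16) % period

# Each host would sleep for its own offset before running
# `puppet agent -t` from cron.
offsets = [splay_seconds(h) for h in ("web-01", "web-02", "db-01")]
```

The same idea works as a cron entry that sleeps for the computed offset before invoking the agent.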

Page 8

Anycast is fun

Anycast is "coarse-grain" load balancing

•  Routes traffic to the "nearest", "serving" cluster

Networks break

•  Physical issues
•  Routing issues
•  Configuration issues
•  Load balancer bugs

Anycast monitoring is hard

Page 9

Anycast directed to one site is not fun

Page 10

Anycast directed to one site is not fun

All clients could be sent to the same cluster

•  Be ready for that
•  Can a single cluster handle worldwide traffic?
•  What do you do if it can't?

Have a mitigation strategy to shed load

●  Include load calculations early in health checks
●  Consider DNS views to redirect some traffic
●  Drop traffic if you have to
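Folding load calculations into the health check, as the first point suggests, means a cluster starts reporting itself unhealthy before it saturates, so the load balancer drains traffic away instead of letting the site melt down. A minimal sketch, assuming a Unix host and an invented per-CPU load threshold:

```python
import os

# Illustrative threshold; a real service would derive this from
# measured capacity, not a hard-coded constant.
MAX_LOAD_PER_CPU = 4.0

def health_check():
    """Health check that includes load in the answer.

    Returns an HTTP-style (status, body) pair: 503 tells the load
    balancer to stop sending traffic here while the machine sheds
    its backlog; 200 means the backend can accept more work.
    """
    load1, _, _ = os.getloadavg()          # 1-minute load average
    cpus = os.cpu_count() or 1
    if load1 / cpus > MAX_LOAD_PER_CPU:
        return (503, "overloaded: shedding traffic")
    return (200, "ok")

status, body = health_check()
```

The design choice matters: failing the health check is a deliberate, reversible way to drop traffic, which is exactly the "by design" behaviour mentioned on the monitoring slide.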

Page 11

Diversity is good...for people

Be ruthless against platform diversity

If you can’t automate it, don’t do it

●  “Could we bring up another 50 today, please?”
●  “That backend was just a little different and...oops”

Anycast helps you be consistent

●  Traffic could go anywhere

Every OS upgrade is a time to refactor and clean

Page 12

Questions?

Gordon Rowell
[email protected]