the economics of scale: promises and perils of going distributed

The Economics of Scale: Promises and Perils of Going Distributed. Tyler Treat, Workiva. September 19, 2015

Upload: tyler-treat

Post on 07-Jan-2017


TRANSCRIPT

Page 1: The Economics of Scale: Promises and Perils of Going Distributed

The Economics of Scale: Promises and Perils of Going Distributed

Tyler Treat

Workiva

September 19, 2015

Page 2: The Economics of Scale: Promises and Perils of Going Distributed

About The Speaker

• Backend engineer at Workiva

• Messaging platform tech lead

• Distributed systems

• bravenewgeek.com @tyler_treat

Page 3: The Economics of Scale: Promises and Perils of Going Distributed

About The Talk

• Why distributed systems?

• Case study

• Advantages/Disadvantages

• Strategies for scaling and resilience patterns

• Scaling Workiva

Page 4: The Economics of Scale: Promises and Perils of Going Distributed

What does it mean to “scale” a system?

Page 5: The Economics of Scale: Promises and Perils of Going Distributed

Scale Up vs. Scale Out

Vertical Scaling

❖ Add resources to a node

❖ Increases node capacity, load is unaffected

❖ System complexity unaffected

Horizontal Scaling

❖ Add nodes to a cluster

❖ Decreases load, capacity is unaffected

❖ Availability and throughput w/ increased complexity

Page 6: The Economics of Scale: Promises and Perils of Going Distributed

Okay, cool—but what does this actually mean?

Page 7: The Economics of Scale: Promises and Perils of Going Distributed

Let’s look at a real-world example…

Page 8: The Economics of Scale: Promises and Perils of Going Distributed

How does Twitter work?

Page 9: The Economics of Scale: Promises and Perils of Going Distributed

Just write tweets to a big database table.

Page 10: The Economics of Scale: Promises and Perils of Going Distributed
Page 11: The Economics of Scale: Promises and Perils of Going Distributed

Getting a timeline is a simple join query.
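To make the naive design concrete, here is a minimal sketch of that join. It uses an in-memory SQLite database as a stand-in (the talk discusses MySQL), and the table names, column names, and the `timeline` helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Twitter's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets  (id INTEGER PRIMARY KEY, user_id INTEGER,
                          body TEXT, created_at INTEGER);
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
""")
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (1, 10, "hello from user 10", 100),
    (2, 20, "hello from user 20", 200),
    (3, 30, "user 30 is not followed by user 1", 300),
])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 10), (1, 20)])


def timeline(conn, user_id, limit=20):
    """Return the newest tweets written by accounts that `user_id` follows."""
    rows = conn.execute(
        """
        SELECT t.id, t.body
        FROM tweets AS t
        JOIN follows AS f ON f.followee_id = t.user_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()


home = timeline(conn, 1)
```

Every timeline read pays for the join and the sort, which is the cost the following slides are concerned with.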

Page 12: The Economics of Scale: Promises and Perils of Going Distributed
Page 13: The Economics of Scale: Promises and Perils of Going Distributed

But…this is crazy expensive.

Page 14: The Economics of Scale: Promises and Perils of Going Distributed

Joins are sloooooooooow.

Page 15: The Economics of Scale: Promises and Perils of Going Distributed

Prior to 5.5, MySQL's default storage engine (MyISAM) used table-level locking. Since 5.5, the default engine (InnoDB) uses row-level locking. Either way, lock contention everywhere.

Page 16: The Economics of Scale: Promises and Perils of Going Distributed

As the table grows larger, lock contention on indexes becomes worse too.

Page 17: The Economics of Scale: Promises and Perils of Going Distributed

And UPDATES get higher priority than SELECTS.

Page 18: The Economics of Scale: Promises and Perils of Going Distributed
Page 19: The Economics of Scale: Promises and Perils of Going Distributed

So now what?

Page 20: The Economics of Scale: Promises and Perils of Going Distributed

Distribute!

Page 21: The Economics of Scale: Promises and Perils of Going Distributed

Specifically, shard.

Page 22: The Economics of Scale: Promises and Perils of Going Distributed

Partition tweets into different databases using some consistent hash scheme

(put a hash ring on it).
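A minimal sketch of a hash ring, assuming string shard names and tweet keys. The `HashRing` class, the `db0`..`db3` names, and the choice of MD5 with 64 virtual nodes per shard are illustrative assumptions, not details from the talk.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (a sketch, not production code)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _point(key):
        # Use a stable hash; Python's built-in hash() is randomized per process.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add(self, node):
        for i in range(self._vnodes):
            point = self._point(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owners[point] = node

    def node_for(self, key):
        # Walk clockwise to the first ring position past the key's hash.
        idx = bisect.bisect_right(self._points, self._point(key))
        return self._owners[self._points[idx % len(self._points)]]


ring = HashRing(["db0", "db1", "db2"])
keys = [f"tweet:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("db3")  # grow the cluster by one shard
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the ring is visible in `moved`: adding a shard relocates only the keys that the new shard takes over, rather than reshuffling everything as `hash(key) % n` would.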

Page 23: The Economics of Scale: Promises and Perils of Going Distributed
Page 24: The Economics of Scale: Promises and Perils of Going Distributed

This alleviates lock contention and improves throughput…

but fetching timelines is still extremely costly

(now scatter-gather query across multiple DBs).

Page 25: The Economics of Scale: Promises and Perils of Going Distributed

Observation: Twitter is a consumption mechanism more than an ingestion one…

i.e. cheap reads > cheap writes

Page 26: The Economics of Scale: Promises and Perils of Going Distributed

Move tweet processing to the write path rather than read path.

Page 27: The Economics of Scale: Promises and Perils of Going Distributed

Ingestion/Fan-Out Process

1. Tweet comes in

2. Query the social graph service for followers

3. Iterate through each follower and insert tweet ID into their timeline (stored in Redis)

4. Store tweet on disk (MySQL)
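The four steps above can be sketched as follows. Plain Python containers stand in for the social graph service, Redis, and MySQL; the `ingest` function, the sample users, and the `TIMELINE_LIMIT` cap are hypothetical.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # hypothetical cap on cached entries per user

# Stand-ins for the real services (assumptions for illustration only).
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}       # social graph
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))   # Redis
tweet_store = {}                                                # MySQL


def ingest(tweet_id, author, body):
    """Step 1: a tweet comes in."""
    # Steps 2-3: look up followers and push the tweet ID onto each timeline.
    for follower in followers.get(author, []):
        timelines[follower].appendleft(tweet_id)  # newest first
    # Step 4: persist the tweet itself.
    tweet_store[tweet_id] = {"author": author, "body": body}


ingest(1, "alice", "first")
ingest(2, "bob", "second")
```

The loop over followers is the cost that gets moved to the write path: one insert per follower for every tweet.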

Page 28: The Economics of Scale: Promises and Perils of Going Distributed
Page 29: The Economics of Scale: Promises and Perils of Going Distributed

Ingestion/Fan-Out Process

• Lots of processing on ingest, no computation on reads

• Redis stores timelines in memory—very fast

• Fetching timeline involves no queries—get timeline from Redis cache and rehydrate with multi-get on IDs

• If timeline falls out of cache, reconstitute from disk

• O(n) on writes, O(1) on reads

• http://www.infoq.com/presentations/Twitter-Timeline-Scalability
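A rough sketch of the read path described in these bullets, again with dictionaries standing in for Redis and MySQL. The function names and the assumption that tweet IDs increase over time are illustrative.

```python
# Stand-ins for the real stores (assumptions for illustration only).
timelines = {"carol": [2, 1]}   # Redis timeline cache: user -> tweet IDs
tweet_store = {                 # MySQL: tweet ID -> tweet
    1: {"author": "alice", "body": "first"},
    2: {"author": "bob", "body": "second"},
}
following = {"carol": ["alice", "bob"], "dave": ["bob"]}


def rebuild_timeline(user):
    """Cache miss: reconstitute the timeline from durable storage."""
    authors = set(following.get(user, []))
    ids = [tid for tid, tweet in tweet_store.items() if tweet["author"] in authors]
    return sorted(ids, reverse=True)  # assumes IDs increase over time


def read_timeline(user):
    ids = timelines.get(user)
    if ids is None:
        # Timeline fell out of cache: rebuild it once and cache the result.
        ids = timelines[user] = rebuild_timeline(user)
    # "Rehydrate": a single multi-get of tweet bodies by ID.
    return [tweet_store[tid]["body"] for tid in ids]


carol_view = read_timeline("carol")  # served from cache
dave_view = read_timeline("dave")    # rebuilt from the tweet store, then cached
```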

Page 30: The Economics of Scale: Promises and Perils of Going Distributed

Key Takeaway: think about your access patterns and design accordingly.

Optimize for the critical path.

Page 31: The Economics of Scale: Promises and Perils of Going Distributed

Let’s Recap…

• Advantages of single database system:

• Simple!

• Data and invariants are consistent (ACID transactions)

• Disadvantages of single database system:

• Slow

• Doesn’t scale

• Single point of failure

Page 32: The Economics of Scale: Promises and Perils of Going Distributed

Going distributed solved the problem, but at what cost?

(hint: your sanity)

Page 33: The Economics of Scale: Promises and Perils of Going Distributed

Distributed systems are all about trade-offs.

Page 34: The Economics of Scale: Promises and Perils of Going Distributed

By choosing availability, we give up consistency.

Page 35: The Economics of Scale: Promises and Perils of Going Distributed

This problem happens all the time on Twitter.

For example, you tweet, someone else replies, and I see the reply before the original tweet.

Page 36: The Economics of Scale: Promises and Perils of Going Distributed

Can we solve this problem?

Page 37: The Economics of Scale: Promises and Perils of Going Distributed

Sure, just coordinate things before proceeding…

“Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.”

Page 38: The Economics of Scale: Promises and Perils of Going Distributed

Sooo what do you do when Justin Bieber tweets to his 67 million followers?

Page 39: The Economics of Scale: Promises and Perils of Going Distributed
Page 40: The Economics of Scale: Promises and Perils of Going Distributed

Coordinating for consistency is expensive when data is distributed because processes can’t make progress independently.

Page 41: The Economics of Scale: Promises and Perils of Going Distributed
Page 42: The Economics of Scale: Promises and Perils of Going Distributed
Page 43: The Economics of Scale: Promises and Perils of Going Distributed

Source: Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design

Page 44: The Economics of Scale: Promises and Perils of Going Distributed

Key Takeaway: strong consistency is slow and distributed coordination is expensive (in terms of latency and throughput).

Page 45: The Economics of Scale: Promises and Perils of Going Distributed

Sharing mutable data at large scale is difficult.

Page 46: The Economics of Scale: Promises and Perils of Going Distributed

If we don’t distribute, we risk scale problems.

Page 47: The Economics of Scale: Promises and Perils of Going Distributed

Let’s say we want to count the number of times a tweet is retweeted.

Page 48: The Economics of Scale: Promises and Perils of Going Distributed

“Get, add 1, and put” transaction will not scale.
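One way to see why, as a sketch: without coordination the read-modify-write loses updates, and with coordination every writer has to take turns on the same key. The in-process dict, the lock, and the `retweets:42` key are stand-ins for a shared database row and its transaction.

```python
import threading

store = {"retweets:42": 0}  # stand-in for a single shared counter row


def get(key):
    return store[key]


def put(key, value):
    store[key] = value


# Without coordination: two clients read the same value, so one update is lost.
a = get("retweets:42")
b = get("retweets:42")
put("retweets:42", a + 1)
put("retweets:42", b + 1)
uncoordinated_total = get("retweets:42")  # two retweets, counter moved by one

# With coordination: a lock makes "get, add 1, put" atomic, at the price of
# serializing every writer that touches this key.
lock = threading.Lock()


def retweet(key):
    with lock:
        put(key, get(key) + 1)


def worker():
    for _ in range(1000):
        retweet("retweets:42")


put("retweets:42", 0)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
coordinated_total = get("retweets:42")
```

The locked version is correct, but its throughput is bounded by how fast a single key can be updated one writer at a time.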

Page 49: The Economics of Scale: Promises and Perils of Going Distributed

If we do distribute, we risk consistency problems.

Page 50: The Economics of Scale: Promises and Perils of Going Distributed
Page 51: The Economics of Scale: Promises and Perils of Going Distributed

What do we do when our system is partitioned?

Page 52: The Economics of Scale: Promises and Perils of Going Distributed
Page 53: The Economics of Scale: Promises and Perils of Going Distributed

If we allow writes on both sides of the partition, how do we resolve conflicts when the partition heals?

Page 54: The Economics of Scale: Promises and Perils of Going Distributed

Distributed systems are hard!

Page 55: The Economics of Scale: Promises and Perils of Going Distributed

But lots of good research going on to solve these problems…

CRDTs, Lasp, SyncFree, RAMP transactions, etc.
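As an illustration of the CRDT idea applied to the retweet counter, here is a minimal grow-only counter (G-Counter). The `GCounter` class and the `east`/`west` replica names are hypothetical; this is a textbook sketch, not something shown in the talk.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by per-slot max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica ID -> increments observed from that replica

    def increment(self, n=1):
        # Each replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-slot max is commutative, associative, and idempotent, so
        # replicas converge regardless of merge order or repetition.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


# Both sides of a partition accept retweets independently...
east, west = GCounter("east"), GCounter("west")
east.increment()
east.increment()
west.increment()

# ...and reconcile without conflicts once the partition heals.
east.merge(west)
west.merge(east)
east.merge(west)  # merging again changes nothing
```

Both replicas accept writes without coordinating, which is the property the "get, add 1, and put" transaction lacks.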

Page 56: The Economics of Scale: Promises and Perils of Going Distributed

Twitter has 316 million monthly active users. Facebook has 1.49 billion monthly active users. Netflix has 62.3 million streaming subscribers.

Page 57: The Economics of Scale: Promises and Perils of Going Distributed

How do you build resilient systems at this scale?

Page 58: The Economics of Scale: Promises and Perils of Going Distributed

Embrace failure.

Page 59: The Economics of Scale: Promises and Perils of Going Distributed

Provide partial availability.

Page 60: The Economics of Scale: Promises and Perils of Going Distributed

If an overloaded service is not essential to core business, fail fast to prevent availability or latency problems upstream.
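A circuit breaker is one common way to fail fast. The transcript does not name the technique, so treat this as an assumed example: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is marked unhealthy."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: reject immediately rather than queue behind a sick service.
                raise CircuitOpenError("failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
attempts = []


def overloaded():
    attempts.append(1)
    raise RuntimeError("timeout")


for _ in range(3):
    try:
        breaker.call(overloaded)
    except (RuntimeError, CircuitOpenError):
        pass
```

The third call never reaches `overloaded`; the caller gets an immediate error it can handle, for instance by rendering the page without the non-essential feature.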

Page 61: The Economics of Scale: Promises and Perils of Going Distributed
Page 62: The Economics of Scale: Promises and Perils of Going Distributed
Page 63: The Economics of Scale: Promises and Perils of Going Distributed

It’s better to fail predictably than fail in unexpected ways.

Page 64: The Economics of Scale: Promises and Perils of Going Distributed

Use backpressure to reduce load.

Page 65: The Economics of Scale: Promises and Perils of Going Distributed
Page 66: The Economics of Scale: Promises and Perils of Going Distributed
Page 67: The Economics of Scale: Promises and Perils of Going Distributed

Flow-Control Mechanisms

• Rate limit

• Bound queues/buffers

• Backpressure - drop messages on the floor

• Increment stat counters for monitoring/alerting

• Exponential back-off

• Use application-level acks for critical transactions
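A small sketch of three of the mechanisms above: a bounded queue, dropping with a stat counter, and exponential back-off. `BoundedInbox` and `backoff_delays` are hypothetical names, and the back-off uses full jitter, which is an assumption rather than something the slides specify.

```python
import queue
import random


class BoundedInbox:
    """Bounded buffer that drops new messages when full and counts the drops."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # stat counter to export for monitoring/alerting

    def offer(self, msg):
        try:
            self._q.put_nowait(msg)
            return True
        except queue.Full:
            self.dropped += 1  # drop on the floor, but leave a trace
            return False

    def poll(self):
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def backoff_delays(base=0.1, cap=5.0, attempts=5, jitter=random.random):
    """Yield exponentially growing retry delays, randomized and capped at `cap`."""
    for attempt in range(attempts):
        yield jitter() * min(cap, base * 2 ** attempt)


inbox = BoundedInbox(maxsize=2)
accepted = [inbox.offer(i) for i in range(5)]

# With jitter disabled, the underlying schedule is easy to see.
schedule = list(backoff_delays(base=0.1, cap=0.5, attempts=4, jitter=lambda: 1.0))
```

A sender that sees `offer` return `False` would typically wait for the next delay from `backoff_delays` before retrying, or give up if the message is not critical.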

Page 68: The Economics of Scale: Promises and Perils of Going Distributed

Bounding resource utilization and failing fast helps maintain predictable performance and impedes cascading failures.

Page 69: The Economics of Scale: Promises and Perils of Going Distributed

Going distributed means more wire time. How do you improve performance?

Page 70: The Economics of Scale: Promises and Perils of Going Distributed
Page 71: The Economics of Scale: Promises and Perils of Going Distributed

Cache everything.

Page 72: The Economics of Scale: Promises and Perils of Going Distributed
Page 73: The Economics of Scale: Promises and Perils of Going Distributed

“There are only two hard things in computer science: cache invalidation and naming things.”
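A minimal read-through cache sketch showing the two usual invalidation levers: a time-to-live and an explicit invalidate on write. The `TTLCache` class, the `loader` callback, and the injectable `clock` are hypothetical.

```python
import time


class TTLCache:
    """Read-through cache with per-entry expiry and explicit invalidation."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]   # hit: no trip over the wire
        value = loader(key)   # miss or expired: go to the source of truth
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        # Call this when the underlying data changes.
        self._entries.pop(key, None)


# Usage with a fake clock and a loader that records how often it is called.
now = [0.0]
loads = []


def load_profile(key):
    loads.append(key)
    return f"profile for {key}"


cache = TTLCache(ttl=60.0, clock=lambda: now[0])
cache.get("alice", load_profile)
cache.get("alice", load_profile)   # served from cache
now[0] = 61.0
cache.get("alice", load_profile)   # expired, reloaded
cache.invalidate("alice")
cache.get("alice", load_profile)   # invalidated, reloaded
```

Between a write and the next expiry or invalidation, readers can see stale data, which is the hard part the quote refers to.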

Page 74: The Economics of Scale: Promises and Perils of Going Distributed
Page 75: The Economics of Scale: Promises and Perils of Going Distributed

Embrace asynchrony.
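One reading of this slide, sketched with `asyncio`: issue independent remote calls concurrently so the request waits for the slowest one rather than the sum of all of them. The service names and latencies are invented, and `asyncio.sleep` stands in for network time.

```python
import asyncio


async def call_service(name, latency):
    await asyncio.sleep(latency)  # stand-in for wire time to a remote service
    return f"{name}: ok"


async def handle_request():
    # The three calls are independent, so run them concurrently.
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(
        call_service("profile", 0.05),
        call_service("timeline", 0.05),
        call_service("ads", 0.05),
    )


results = asyncio.run(handle_request())
```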

Page 76: The Economics of Scale: Promises and Perils of Going Distributed
Page 77: The Economics of Scale: Promises and Perils of Going Distributed
Page 78: The Economics of Scale: Promises and Perils of Going Distributed

Distributed systems are not just about workload scale, they’re about organizational scale.

Page 79: The Economics of Scale: Promises and Perils of Going Distributed

In 2010, Workiva released a product to streamline financial reporting.

Page 80: The Economics of Scale: Promises and Perils of Going Distributed

A specific solution to solve a very specific problem, originally built by a few dozen engineers.

Page 81: The Economics of Scale: Promises and Perils of Going Distributed

Fast forward to today: a couple hundred engineers,

more users, more markets, more solutions.

Page 82: The Economics of Scale: Promises and Perils of Going Distributed

How do you ramp up new products quickly?

Page 83: The Economics of Scale: Promises and Perils of Going Distributed

You stop thinking in terms of products and start thinking in terms of platform.

Page 84: The Economics of Scale: Promises and Perils of Going Distributed

From Product to Platform

Page 85: The Economics of Scale: Promises and Perils of Going Distributed

At this point, going distributed is all but necessary.

Page 86: The Economics of Scale: Promises and Perils of Going Distributed

Service-Oriented Architecture allows us to independently build, deploy, and scale discrete parts of the platform.

Page 87: The Economics of Scale: Promises and Perils of Going Distributed

Loosely coupled services let us tolerate failure. And things fail constantly.

Page 88: The Economics of Scale: Promises and Perils of Going Distributed

Shit happens — network partitions, hardware failure, GC pauses, latency, dropped packets…

Page 89: The Economics of Scale: Promises and Perils of Going Distributed

Build resilient systems.

Page 90: The Economics of Scale: Promises and Perils of Going Distributed

“You have to design distributed systems with the expectation of failure.”

–Ken Arnold

Page 91: The Economics of Scale: Promises and Perils of Going Distributed

Design for failure.

Page 92: The Economics of Scale: Promises and Perils of Going Distributed

Consider the trade-off between consistency and availability.

Page 93: The Economics of Scale: Promises and Perils of Going Distributed

Most important?

Page 94: The Economics of Scale: Promises and Perils of Going Distributed

Don’t distribute until you have a reason to!

Page 95: The Economics of Scale: Promises and Perils of Going Distributed

Scale up until you have to scale out.

Page 96: The Economics of Scale: Promises and Perils of Going Distributed

“You can have a second computer once you’ve shown you know how to use the first one.”

–Paul Barham

Page 97: The Economics of Scale: Promises and Perils of Going Distributed

And when you do distribute, don’t go overboard.

Walk before you run.

Page 98: The Economics of Scale: Promises and Perils of Going Distributed

Remember, when it comes to distributed systems…

for every promise there’s a peril.

Page 99: The Economics of Scale: Promises and Perils of Going Distributed

Thanks!

@tyler_treat

github.com/tylertreat

bravenewgeek.com