cassandra: one (is the loneliest number)

2015-12-09

One (is the loneliest number)
donny@pagerduty.com & paul@pagerduty.com

Failure

2015-12-08

Background

• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s

Cassandra Replication

Client

Cassandra Replication - Failure

Client

Foreshadowing
• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes

Coordinator Read Latency (in ms, by host)

6 seconds

~25 ms

The next day…

The Plan
• Trigger repair… …with lots of people watching
• Use our load shedding strategies for any problems:
  • Proactively disable non-critical services
  • Disable thrift

Surprise!
• Cron triggers a repair of a different keyspace
• Plus a compaction for a large CF

[Chart: Outgoing Notification Backlog Size, with regions labeled Normal, Bad, and Horrible]

Cassandra Pending Tasks: ReadStage (by host)

Over 9000

Cassandra CPU (by host)

Factory Reset
Success… kind of

What went wrong?

or: What can we learn from Aimee Mann?

One is the loneliest number that you'll ever do Two can be as bad as one It's the loneliest number since the number one

No, is the saddest experience you'll ever know Yes, it's the saddest experience you'll ever know

No, is the saddest experience you’ll ever know

• Cassandra sheds load when overloaded
• Shedding drops “stale” requests
• Clients see timeouts and have trouble making progress

• Shedding only reduces load if clients abandon the failed requests
• But if clients retry those requests…

Event Processing (×2)

So I heard you like retries…

Notification Management

App Host (×3)

Cassandra Cluster

Cass Client retries (S)

Service client retries (T)

Load balancer retries (H)

Retries are multiplicative

Total # of retries: O(S*H*T)

Interactive Request (from user)

Load Balancer
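
To make the multiplication concrete, here is a minimal sketch of the worst case. It assumes each layer makes one initial attempt plus its configured retries, and that every attempt at an outer layer can exhaust all attempts at the layers beneath it. The function and parameter names are illustrative and do not come from the slides or from any library.

```python
def worst_case_attempts(cass_retries: int, service_retries: int, lb_retries: int) -> int:
    """Worst-case number of Cassandra operations caused by one user request.

    cass_retries    -- S: retries made by the Cassandra client
    service_retries -- T: retries made by the service client
    lb_retries      -- H: retries made by the load balancer
    Each layer contributes (1 initial attempt + its retries), and the
    layers nest, so the factors multiply rather than add.
    """
    return (1 + cass_retries) * (1 + service_retries) * (1 + lb_retries)


# Two retries per layer looks modest, but the layers compound.
print(worst_case_attempts(2, 2, 2))
# One retry per layer already means 8 operations for a single request.
print(worst_case_attempts(1, 1, 1))
```

Under these assumptions, two retries at each of the three layers turn a single failing user request into as many as 27 Cassandra operations.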

Yes, it’s the saddest experience you’ll ever know

• Dropped requests were retried
• …causing load amplification
• …causing more dropped requests
• …causing even more retries
• …causing misery.
• i.e. too much load leads to much too much load
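
The feedback loop can be illustrated with a toy model. This is a deliberately simplified sketch, not a model of Cassandra itself: it assumes a fixed per-tick capacity, a one-off burst in the first tick, and a hypothetical `retries_per_drop` factor standing in for the stacked retry layers. All names and numbers are made up for illustration.

```python
def simulate_offered_load(capacity, fresh_per_tick, burst, ticks, retries_per_drop):
    """Return the load offered to the cluster at each tick.

    Requests above `capacity` are dropped. Every dropped request comes
    back in the next tick as `retries_per_drop` new requests.
    """
    history = []
    carried = 0  # retried requests arriving from the previous tick
    for tick in range(ticks):
        offered = fresh_per_tick + carried + (burst if tick == 0 else 0)
        dropped = max(0, offered - capacity)
        carried = dropped * retries_per_drop
        history.append(offered)
    return history


# Clients give up on dropped requests: the burst is absorbed in one tick.
print(simulate_offered_load(100, 90, 50, 4, retries_per_drop=0))
# Each drop turns into three retries: load grows instead of recovering.
print(simulate_offered_load(100, 90, 50, 4, retries_per_drop=3))
```

In this model the same 50-request burst either disappears after one tick or grows without bound, depending only on how aggressively clients retry.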

How does overload get started?

• Unpredictable workloads
• Could be from request volume
• In our case, from batch-style processes
• Repairs, compaction, application-level tasks (e.g. archiving)

PagerDuty system architecture

Cassandra Cluster

Inbound Event Buffer

Data Access

Message Delivery

Monitoring Events

SMS, Phone Calls

App Host

Interactive Requests (from users)

Load Balancer

=Workload A + B

Workload A Workload B

…and more bursts are more worst

One (cluster) is the loneliest number that you’ll ever do

• How many ops are A vs. B?
• Must reverse engineer the contributions
• Build (constantly evolving) models
• Hard to reason about system behaviour
• …and gets substantially harder when your entire production stack is overloaded

How we fixed it

Stop poking the bear

• Only retry when necessary - is failure an option?
• Less risky to retry user-initiated requests
• Don’t retry retries (much)
• Specifically:
  • Only try a single fallback C* host at the driver level, not N-1
  • Only try a single fallback service host, not M-1
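
A bounded fallback of this kind might look like the sketch below. It is not a Cassandra driver API (real drivers expose their own retry and load-balancing policies); `call_with_single_fallback`, `hosts`, and `request` are hypothetical names used only to show the idea of capping attempts at two hosts instead of walking the whole list.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_single_fallback(
    hosts: Sequence[str],
    request: Callable[[str], T],
    max_fallbacks: int = 1,
) -> T:
    """Try the first host, then at most `max_fallbacks` further hosts.

    With the default of 1, a failing request costs two attempts no
    matter how many hosts are available, rather than N attempts.
    """
    if not hosts:
        raise ValueError("need at least one host")
    last_error: Optional[Exception] = None
    for host in hosts[: 1 + max_fallbacks]:
        try:
            return request(host)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    assert last_error is not None
    raise last_error
```

With five hosts and a request that always times out, this makes two attempts and then surfaces the error to the caller, which can decide whether failure is acceptable.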

Prepare for the worst case

• To avoid overload, must provision for the worst case
• So either scale for the (bursty) stars aligning…
• …or prevent stars from aligning in the first place
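
A small worked example of why alignment matters, using invented load figures: the cluster has to be sized for the peak of the summed workloads, and that peak depends on whether the bursts coincide. The function name and the numbers are illustrative assumptions.

```python
def combined_peak(workloads):
    """Peak of the summed load across equally sampled workload series."""
    return max(sum(step) for step in zip(*workloads))


workload_a = [10, 10, 80, 10]   # bursts in the third interval
b_aligned = [5, 5, 60, 5]       # bursts at the same time as A
b_shifted = [60, 5, 5, 5]       # same burst, moved to the first interval

print(combined_peak([workload_a, b_aligned]))  # bursts coincide
print(combined_peak([workload_a, b_shifted]))  # bursts interleaved
```

With these numbers the aligned case needs capacity for 140 units, while the interleaved case peaks at 85, even though the total work is identical.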

Preventing star-bursts, part 1: coordinate

• Explicit scheduling to interleave bursts
• Repairs, compactions, batch jobs - Cassandra & services
• Automation can help…
• …but still error prone

Preventing star-bursts, part 2: smooth, not chunky

• Jobs can be done more frequently
• But with smaller batch size
• In the limit, aims for continuous & constant intensity workload
• Some Cassandra options too:
  • Compaction, transfer, and other throttle limits
  • Levelled compaction vs. size-tiered compaction
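
On the application side, "more frequent, smaller batches" can be sketched as a simple planning function. This is an assumption-laden illustration, not PagerDuty's scheduler: `plan_runs` and its parameters are hypothetical, and it only computes when each run starts and how many items it handles.

```python
def plan_runs(total_items, period_seconds, runs):
    """Split one large periodic batch into `runs` evenly spaced smaller runs.

    Returns a list of (start_offset_seconds, batch_size) tuples. Any
    remainder is spread over the earliest runs so sizes differ by at most 1.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    base, remainder = divmod(total_items, runs)
    plan = []
    for i in range(runs):
        offset = period_seconds * i / runs
        size = base + (1 if i < remainder else 0)
        plan.append((offset, size))
    return plan


# One hourly job of 1000 items becomes four runs of 250, 15 minutes apart.
for offset, size in plan_runs(1000, 3600, 4):
    print(f"t+{offset:.0f}s: {size} items")
```

Increasing `runs` pushes the plan toward the continuous, constant-intensity workload described above, at the cost of more scheduling overhead per item.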

Preventing star-bursts, part 3: isolation

• Air gap between each workload
• Distinct Cassandra cluster for each service/workload
• Cons:
  • More infrastructure
  • More configuration management
• Pros:
  • Easy to monitor, reason about, diagnose, and scale
  • Reduces the blast radius when failures happen (and they will)

PagerDuty system architecture: today

Inbound Event Buffer

Message Delivery

Cassandra Cluster

Lessons learned

What have we learned?

• Retries: the devil’s in the details
• Variable workloads: bad, especially if unpredictable
• Workload peaks: additive, and bad in multiples
• Isolation: the gift that keeps on giving

One is the loneliest number that you'll ever do Two can be as bad as one It's the loneliest number since the number one

No, is the saddest experience you'll ever know Yes, it's the saddest experience you'll ever know

donny@pagerduty.com & paul@pagerduty.com PAGERDUTY.COM/JOBS

Questions?
donny@pagerduty.com & paul@pagerduty.com
