
Page 1: Capacity Planning

Capacity Planning

Sandy Strong

Reliability Engineering Meetup, August 2016

Page 2: Capacity Planning

About:
● Sandy (@st5are)
● SRE @ Twitter - Ads Serving Team
● Previous experience:
○ Systems Engineer @ web hosting company
○ DevOps @ social game startup
○ Application Developer, Infrastructure Engineer @ web app startup
○ Operations Engineer @ MMO studio

Page 3: Capacity Planning

What this talk is and isn’t
● I can’t tell you how to plan capacity for your services
○ Every service is different
■ Different needs (resource utilization profile, SLA)
■ Always in flux (usage patterns, organic growth, feature launches)
● I will tell you how I approach capacity planning, using an example service
○ Things I’ve learned
○ Strategies that work for me

Page 4: Capacity Planning

Quick Survey
● Are you responsible for capacity planning?
● Do you run on bare metal?
● Public cloud? Private cloud?

Page 5: Capacity Planning

“Physicists, in search of basic principles, tend to stand too close (and some biologists too far away). At the right distance, wondrous and lovely things appear.”

- Dudley Herschbach

Page 6: Capacity Planning

Ok, what is “capacity”?

The amount of work (requests) that can be done over a period of time (1 second) with a given amount of resources (CPU, network, RAM).

Example:

I have 1 dedicated server with 24 CPU. The server is running a web application. The web application is CPU bound. I spin up 23 worker processes to handle user requests. Each worker has its own CPU. Each worker can handle 100 QPS.

My service has the capacity to handle 2,300 QPS.

Page 7: Capacity Planning

So “capacity planning” is...

The process of determining the production resources needed to meet product demand.

Example:

My service needs to accommodate 10,000 QPS. I already know (from load testing or observation) that each worker can handle 100 QPS, and each server has 23 CPU available to use. To figure out how many servers I need:

10,000 QPS / (23 workers * 100 QPS) ≈ 4.3 servers → round up to 5 servers

Five servers leaves me slightly over-provisioned, but I will be able to comfortably handle 10,000 QPS.
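
A minimal sketch of this calculation, using the numbers from the example above (23 workers per server, 100 QPS per worker); the function name and defaults are just illustrative:

```python
import math

def servers_needed(target_qps, workers_per_server=23, qps_per_worker=100):
    """Total servers needed to serve target_qps, rounded up (you can't deploy 0.3 of a server)."""
    per_server_qps = workers_per_server * qps_per_worker  # 2,300 QPS per server
    return math.ceil(target_qps / per_server_qps)

print(servers_needed(10_000))  # 10,000 / 2,300 ≈ 4.3 -> 5 servers
```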

Page 8: Capacity Planning

My Service
● Lives in the middle of the stack
○ Multiple upstreams/downstreams
● Stateless
● JVM
● Downstream services are of a variety of types:
○ Business logic
○ Cache
○ Relational & non-relational databases
○ PubSub
● Runs in private cloud (mesos)
● Large deployment
○ Increasing capacity by a meaningful percentage represents a lot of hardware
○ Potential to impact other customers competing for a slice of the same cloud

Page 9: Capacity Planning

There’s a lot going on in that diagram, but I focus on...

1. Resource utilization for my service

2. My service’s utilization of my immediate downstreams

3. What’s going on with my immediate upstreams

Page 10: Capacity Planning

#1 Resource utilization for my service

How to think about my resources?

● Reserved resources in mesos: CPU, RAM, disk

● Resources with host-wide quota: network egress

● Learn about how resource isolation works for the system your service runs in

○ How much CPU time do I get, and over what time interval, when I reserve “2 CPU”?

○ At what point will network bandwidth throttling kick in, and what happens when it does?

Page 11: Capacity Planning

Example: CPU in Mesos

4 CPU = 400ms of CPU time guaranteed (continuously) for the duration of each 100ms interval

The available CPU time can be used at any point in the interval.

This flexibility allows the CPU time to be used in different ways, depending on what the application needs.

● Scenario A: the application can use up to 4 cores continuously for every 100 ms interval. It is never throttled and starts processing new requests immediately.

● Scenario B: the application uses up to 8 cores (depending on availability) but is throttled after 50 ms. The CPU quota resets at the start of each new 100 ms interval.

● Scenario C: is like Scenario A, but there is a garbage collection event in the second interval that consumes all CPU quota. The application throttles for the remaining 75 ms of that interval and cannot service requests until the next interval. In this example, the garbage collection finished in one interval but, depending on how much garbage needs collecting, it may take more than one interval and further delay service of requests.

Source: http://aurora.apache.org/documentation/latest/features/resource-isolation/
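
The arithmetic behind these scenarios can be sketched as below. This is not the Mesos/Aurora implementation, only an illustration of how a per-interval CPU-time budget (reserved cores × 100 ms) leads to throttling once it is spent; the 16-core GC burst in the last example is an assumed figure chosen to reproduce the 75 ms throttle from Scenario C:

```python
def throttled_ms(reserved_cpus, usage_segments, interval_ms=100):
    """How long a task sits throttled in one interval.

    reserved_cpus: CPU reservation (e.g. 4 -> 400 core-ms of budget per 100 ms interval).
    usage_segments: list of (wall_ms, cores_used) segments within the interval.
    """
    budget = reserved_cpus * interval_ms
    elapsed = 0
    for wall_ms, cores in usage_segments:
        cost = wall_ms * cores
        if cost >= budget:
            # Budget exhausted partway through this segment: throttled for the rest.
            elapsed += budget / cores
            return interval_ms - elapsed
        budget -= cost
        elapsed += wall_ms
    return 0  # quota never exhausted this interval

# Scenario A: 4 cores for the full 100 ms -> never throttled.
print(throttled_ms(4, [(100, 4)]))   # 0
# Scenario B: 8 cores burns the 400 core-ms budget in 50 ms -> throttled for 50 ms.
print(throttled_ms(4, [(100, 8)]))   # 50.0
# Scenario C-like: an assumed 16-core GC burst empties the budget after 25 ms -> throttled for 75 ms.
print(throttled_ms(4, [(100, 16)]))  # 75.0
```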

Page 12: Capacity Planning

Example: Disk Utilization (even for stateless services)

My service reserves 10GB of disk, and I run 100 instances in a private cloud (mesos). There are 200 bare metal hosts in the cloud, each has a 15GB disk.

This means:

● My service runs on 100 bare metal hosts
● Only services with <5GB disk reservation can co-locate with mine

Implications:

● Inefficient bin packing
○ More valuable resources may be left on the table

● Likelihood of someone having my service as a neighbor is 50%
○ Can cause side-effects for a lot of other services if I become a “bad neighbor”
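
These implications fall out of simple arithmetic. A quick sketch using the numbers from this example (10 GB reservation, 100 instances, 200 hosts, 15 GB of disk per host):

```python
host_disk_gb = 15
my_reservation_gb = 10
my_instances = 100
total_hosts = 200

# Two 10 GB reservations can't share a 15 GB host, so each instance gets its own host.
instances_per_host = host_disk_gb // my_reservation_gb        # 1
hosts_used = my_instances // instances_per_host               # 100

# Anyone sharing a host with me is limited to the leftover disk.
max_neighbor_reservation_gb = host_disk_gb - my_reservation_gb  # 5

# Roughly half the fleet now has my service on it.
fraction_of_fleet = hosts_used / total_hosts                  # 0.5

print(hosts_used, max_neighbor_reservation_gb, fraction_of_fleet)
```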

Page 13: Capacity Planning

#2 Utilization of downstream services (= resources)

● If I make 100 QPS to a downstream, I am consuming that resource.

○ How does my consumption of that resource impact their service?

○ What (in my service) drives utilization of the downstream resource?

● What types of operations is the downstream performing for me?
○ How expensive are those operations for the downstream?
■ Bandwidth
■ CPU

Page 14: Capacity Planning

I learn about my downstreams...
● Considerations vary, depending on what type of system the downstream is
○ What demand am I placing on a cache downstream?
■ How does this demand impact the cache?
■ Is the cache configured for my usage pattern? (heavy on writes, reads, what’s the TTL?)
■ For multi-tenant: how do I impact other users, and how do other users impact me?
○ What demand am I placing on a relational or non-relational database?
■ Relational: slow queries, network egress from DB → service
■ Non-relational: write throughput, compaction

I find out from downstream service owners how they measure my utilization of their systems. In this sense, I help them help me.

Page 15: Capacity Planning

Identify and understand important metrics for your downstreams
● If someone other than you (or your team) owns your downstreams, this means you need to talk to those people
● Understanding is really important
○ Happier services, happier downstream owners → everybody wins!
● Identify which metric(s) you need to care about for your downstreams
○ Varies depending on the type of downstream
○ Depends on usage type: read, write, both?
○ Sending spiky QPS?
○ Sustained elevated QPS (peak traffic)?

Page 16: Capacity Planning

Example: PubSub

My service publishes events to a stream. This stream is owned by the PubSub team. The PubSub team provisioned this stream to handle 500 writes/second.

I’m launching a new feature that means my service will begin writing 1000 events/second to the stream. Uh oh!

How did the PubSub team come up with 500 events/second? (event size * # events/second = throughput)

My stream is provisioned to handle 500 events/second, assuming my events are 5KB in size. (5KB * 500 events/second = 2,500 KB/s max throughput)

With this feature launch, my event size is decreasing to 2KB. What is the new capacity of my stream?

2,500 KB/s / 2KB = 1,250 events/second

My stream does not need a capacity increase; I actually have headroom now!
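
A sketch of that throughput check, treating the stream’s provisioned byte throughput (5 KB × 500 events/s = 2,500 KB/s) as the fixed budget and recomputing event capacity when the event size changes; the function name is just for illustration:

```python
def event_capacity(max_throughput_kb_per_s, event_size_kb):
    """Events/second a stream can absorb at a given event size."""
    return max_throughput_kb_per_s / event_size_kb

provisioned = 5 * 500                  # 2,500 KB/s of byte throughput on the stream
print(event_capacity(provisioned, 5))  # 500 events/s at 5 KB events
print(event_capacity(provisioned, 2))  # 1,250 events/s at 2 KB events

planned_write_rate = 1_000             # events/s after the feature launch
print(planned_write_rate <= event_capacity(provisioned, 2))  # True: headroom, no increase needed
```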

Page 17: Capacity Planning

#3 Keeping tabs on my immediate upstreams

The resources my service consumes are a function of the traffic I’m getting from my upstreams.

● Monitor existing upstreams
○ Usage can increase for existing endpoints
○ New endpoints and features may drive more usage
● Be aware of new upstreams that may be coming online
○ Meet with the owners ahead of launch, discuss initial resource requirements

Page 18: Capacity Planning

Just to recap...

1. Resource utilization for my service

2. My service’s utilization of my immediate downstreams

3. What’s going on with my immediate upstreams

Page 19: Capacity Planning

How I approach these 3 things...

Page 20: Capacity Planning

Capacity is a moving target

● Efficiency of a system changes over time

○ Performance regressions in code

○ Workload imposed by each request increases/decreases

○ Internal business logic/complexity

Page 21: Capacity Planning

Capacity is a moving target

● Demand changes over time

○ New endpoints to support new product functionality

○ New clients onboarding to use your service

○ Throughout the day (high vs low traffic times)

○ Over time (organic growth)

○ Marketing pushes or special events

Page 22: Capacity Planning

It’s best to plan continuously

● It’s a marathon, not a sprint

● Production systems are always changing and growing → capacity is always changing

● You don’t always know when (exactly) a change is coming, or what its real impact will be in production

○ Communication is hard

○ Stress testing pre-launch may not show the entire picture

● Ideally: additional capacity is deployed before it is needed (duh)
○ This can be a challenge (see above)

Page 23: Capacity Planning

Can’t I just add more capacity when my service starts to struggle under the load?

It depends...

● What’s your lead time on getting more capacity?
○ Depends on the environment your service runs in, and on how much hardware you need

● How much more expensive is it to get capacity last-minute?
○ Planning ahead is almost always cheaper

● What happens to your service when it reaches/exceeds max capacity?
○ Cost to the business & users
○ Cost to on-call
○ Impact on your upstreams (clients) - will it cause cascading failures?

Page 24: Capacity Planning

These things help:
● Know what resource your service is bound on

○ CPU bound, network bound

● Automate where appropriate
○ Set up alerts to warn when resource utilization passes a certain threshold

■ e.g. “CPU throttled for 20% of interval for 5 minutes”; “Write throughput exceeded 10Mbps”

● Create simple graphs with essential metrics, review them (often)
○ In many cases human discretion > alert emails

○ High-level dashboards that make it easy to view WoW and MoM trends for utilization metrics

● Know that your models will change regularly (a toy sketch follows this list)
○ Keep formulas up to date

○ Re-evaluate the accuracy of models: “Does this still make sense? Is this calculation still telling me what I need to know?”

○ Develop new models when you need to, save historical data to test them against
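
As a rough illustration of keeping a model honest, here is a hedged sketch of a tiny capacity model plus a check of its predictions against observed per-instance utilization. The linear model, its coefficients, the 5-point drift threshold, and the data are all invented for the example, not anything from the talk:

```python
# A deliberately simple capacity model: predicted CPU utilization per instance
# as a linear function of QPS per instance. Coefficients are assumptions.
def predicted_cpu_util(qps_per_instance, cost_per_request=0.0007, baseline=0.05):
    return baseline + cost_per_request * qps_per_instance

# Historical samples of (qps_per_instance, observed_cpu_util) -- made-up data.
history = [(400, 0.34), (600, 0.48), (800, 0.62), (1000, 0.77)]

# "Does this still make sense?" -- flag the model if its error drifts too far.
worst_error = max(abs(predicted_cpu_util(q) - u) for q, u in history)
if worst_error > 0.05:
    print(f"model drift: worst error {worst_error:.2f}, revisit the formula")
else:
    print(f"model still within 5 points of observed utilization (worst error {worst_error:.2f})")
```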

Page 25: Capacity Planning

Stress Testing & Graceful Degradation
● Define your SLA
○ Success rate, latency, exceptions
● Load test your service in staging
○ Catch performance regressions before they hit production
○ The tool you decide to use will depend on the service you need to load test
■ There’s a lot out there; you should not have to roll your own
○ Generate load with synthetic traffic
■ iago, ab, tsung
○ Capture and replay traffic
■ gor

Page 26: Capacity Planning

Stress Testing & Graceful Degradation
● Find your service’s redline
○ The maximum number of requests a single instance can take before its SLA is violated
● Shift traffic between DCs/availability zones
○ How does your service hold up when you remove a percentage of its global capacity?
○ What are the expectations here, and what is acceptable? (e.g. “When we lose an availability zone, p999 latency increases by 15%”)
■ How much more $$ do you need to spend to keep p999 latency within the normal SLA during a traffic shift? Is it worth the additional cost?
● Load shedding
○ Consider implementing backpressure to protect against over-zealous upstreams (see the sketch below)
● Do downstreams have backpressure mechanisms?
○ Find out what they are, and when and how they kick in
○ Understand how a downstream’s backpressure mechanism impacts the quality of your service
■ Consider user impact; your service should be able to operate in a degraded state
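
For the load-shedding bullet above, a minimal, illustrative sketch (not from the talk) of one common approach: cap in-flight requests per instance at a redline derived from load testing and reject the rest quickly, so upstream retries don’t push the instance past its SLA. The class name and the redline of 100 are assumptions:

```python
import threading

class LoadShedder:
    """Reject work beyond a fixed in-flight limit instead of queueing it."""

    def __init__(self, max_in_flight):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, do_work):
        # Non-blocking acquire: if we're at the redline, shed immediately.
        if not self._sem.acquire(blocking=False):
            return {"status": 503, "body": "shedding load, retry later"}
        try:
            return do_work(request)
        finally:
            self._sem.release()

# Example: redline of 100 concurrent requests per instance (an assumed number).
shedder = LoadShedder(max_in_flight=100)
print(shedder.handle({"path": "/ads"}, lambda req: {"status": 200, "body": "ok"}))
```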

Page 27: Capacity Planning

Continuous Conversations
● Talk to other individuals and teams

○ Upstream and downstream teams
○ Developers on your team
○ Teams (or vendors) responsible for provisioning/providing resources
○ Other SREs

■ Learn about what tools other people are creating and using, share knowledge/tools
■ Share models

○ Other infrastructure teams: NetEng, DBA
○ Talk to PMs, business teams, and marketing teams to find out what’s on the horizon

Build relationships with other teams. If they know who you are and what your use case is, they will be better equipped to help you.

Page 28: Capacity Planning

Questions?