from 0 to capacity planning

From Zero To Capacity Planning

@Randommood

INES Sombra

Globally distributed and Highly available

Why capacity planning?

Or a journey of discovery and ingenuity

The views reflected in this talk are not to be considered a reflection of the skills of my coworkers, who are extremely nice human beings and way better at capacity planning than I am.

😜

NOT A monitoring person💀

🚨🚨

INSTRUMENT

MONITOR & ALERT

PLAN & PREDICT

The Road to Capacity planning

?

Findings

Books

0 Day One

Some Learning

Our Discoveries

Rituals & Myths

Asking Around

Bringing it Home

our Path today

Checking The Edge

zero… Oh shit!

a convenient “situation”

Handles State

Many Clients

Other systems depend on this service to be: up, healthy, and available!

A bit F*cked

Our World

Edge ✨ Core ✨

a Fastly POP

I Rule the Edge!

Evaluates weekly global POPs performance & makes projections

Publishes capacity performance report in clear location

Plans for our physical capacity & transit capacity

Meet Catharine

Planning Our Capacity

Some metrics
- Network Capacity (Gb)
- Ordered Network Capability (Gb)
- Planned Network Capacity (Gb)
- RPS Capacity (k)
- Network peak (Gb)
- RPS peak (k)
- Site CPU Peak (%)
- Network Utilization (%)

Over 30%: flagged, Over 70%: Red status
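As a rough illustration of the thresholds above, here is a minimal Python sketch that classifies a POP by network utilization. The 30% and 70% cut-offs come from the slide; everything else is an assumption made for the example: the `PopMetrics` record, its field names, the idea that utilization is peak divided by capacity, and the choice of network utilization as the metric the thresholds apply to.

```python
from dataclasses import dataclass

FLAG_THRESHOLD = 0.30  # from the slide: over 30% is flagged
RED_THRESHOLD = 0.70   # from the slide: over 70% is red status


@dataclass
class PopMetrics:
    """Hypothetical per-POP record; fields echo two of the metrics listed above."""
    name: str
    network_capacity_gb: float
    network_peak_gb: float


def network_utilization(pop: PopMetrics) -> float:
    """Assumption: utilization is peak traffic divided by installed capacity."""
    if pop.network_capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return pop.network_peak_gb / pop.network_capacity_gb


def status(utilization: float) -> str:
    """Map a utilization ratio (0.0 to 1.0) to a report status."""
    if utilization > RED_THRESHOLD:
        return "red"
    if utilization > FLAG_THRESHOLD:
        return "flagged"
    return "ok"
```

A weekly report could then call `status(network_utilization(pop))` for each POP and sort the results so that red sites appear first.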

Edge Insights

Our ability to correctly plan for capacity is critical to our bottom line

Capacity doesn’t just involve hardware; software optimizations matter

People affect capacity

Hitting The Books

Defining Capacity planning

Measuring, planning, & managing system growth

Determines what your system needs & when

From the observation of actual traffic. Use current performance as baseline.

Must happen regardless of what you might optimize

ARE WE RIGHT NOW?

We have to be this fast & reliable

X per second & Y% Uptime

MEASURE HOW FAST/RELIABLE WE ARE

HARDWARE / SOFTWARE / ARCHITECTURE

CHANGE / ADD / REMOVE

FIGURE OUT HOW TO STAY FAST/RELIABLE ENOUGH

Yes!

No!
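One pass through the flowchart above might be sketched in Python as follows. This is only an illustration: the `Goal` and `Measurement` types, their field names, and the mapping of the "Yes!" and "No!" branches to the two outcomes are assumptions, since the slide text does not spell them out.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """Hypothetical target: the "X per second & Y% uptime" from the slide."""
    min_rps: float
    min_uptime_pct: float


@dataclass
class Measurement:
    """Hypothetical observation of how fast/reliable the system currently is."""
    sustained_rps: float
    uptime_pct: float


def uptime_pct(total_seconds: float, downtime_seconds: float) -> float:
    """Uptime as a percentage of the observation window."""
    if total_seconds <= 0:
        raise ValueError("observation window must be positive")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


def meets_goal(goal: Goal, measured: Measurement) -> bool:
    """True when both the throughput target and the uptime target are met."""
    return (measured.sustained_rps >= goal.min_rps
            and measured.uptime_pct >= goal.min_uptime_pct)


def next_action(goal: Goal, measured: Measurement) -> str:
    """Assumed branch mapping: yes means stay on track, no means change something."""
    if meets_goal(goal, measured):
        return "stay"    # figure out how to stay fast/reliable enough
    return "change"      # change / add / remove hardware, software, or architecture
```

Because capacity planning is a process rather than a one-time event, a check like this would be rerun after every change.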

Allspaw's Wisdom

From The Art of Capacity Planning👈

System’s Ceiling: critical level of a resource that cannot be crossed without failure. Find yours

Another form of Capacity Planning: Controlled load testing

Predictions: ceilings + historical data
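To make "ceilings + historical data" concrete, here is a minimal Python sketch that fits a straight line to evenly spaced usage samples and estimates how many more periods remain before the trend reaches a known ceiling. The linear model, the function names, and the return conventions are assumptions chosen for illustration, not a method taken from the book or the talk, and any such projection is only a guess.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_ceiling(history, ceiling):
    """Periods after the last sample until the fitted line reaches the ceiling.

    Returns 0.0 if the fitted line is already at or above the ceiling at the
    last sample, and None if the trend is flat or declining.
    """
    slope, intercept = linear_fit(history)
    last_x = len(history) - 1
    if slope * last_x + intercept >= ceiling:
        return 0.0
    if slope <= 0:
        return None
    return (ceiling - intercept) / slope - last_x
```

For example, with made-up weekly peaks of `[10, 20, 30, 40]` and a ceiling of `70`, the fitted line gains 10 per week and reaches the ceiling three weeks after the last sample.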

Allspaw's Wisdom

System architecture can affect your ability to add capacity

Identify & track your application’s metrics

Tying metrics to user behavior is helpful

If you don’t have ways to measure your current capacity you can’t plan

Little’s Law & Capacity planning

L = λW

Capacity (L), Throughput (λ), and Latency (W)

Applies to stable systems

Use this information to better understand our workload and to define constraints
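A small Python sketch of how Little's Law can turn two known quantities into the third. The function names and units (requests per second, seconds) are assumptions for illustration, and the result only holds for a stable system, as noted above.

```python
def concurrency(throughput_rps: float, latency_s: float) -> float:
    """L = λW: average number of requests in the system at once."""
    if throughput_rps < 0 or latency_s < 0:
        raise ValueError("throughput and latency must be non-negative")
    return throughput_rps * latency_s


def max_throughput(capacity: float, latency_s: float) -> float:
    """λ = L / W: the throughput a fixed capacity supports at a given latency."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return capacity / latency_s


def latency_budget(capacity: float, throughput_rps: float) -> float:
    """W = L / λ: the latency at which a given throughput fills the capacity."""
    if throughput_rps <= 0:
        raise ValueError("throughput must be positive")
    return capacity / throughput_rps
```

For instance, a service with room for 64 in-flight requests and an average latency of 0.25 seconds can sustain about `max_throughput(64, 0.25)`, or 256 requests per second, before work starts to queue. That is one way to express a constraint on the workload.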

Literature Insights

Possible to have plenty of capacity and a slow site nonetheless

Projections & curve fitting are guesses

Keep track of API calls & their rate

Always gonna be spikes & hiccups. Take the bad with the good & plan for it

Rituals & Myths

Crowdsourcing Capacity planning

Industry Insights

Hard to extrapolate general advice into something applicable for my situation

Simplicity & ability to reason are the only things I could trust

Confusing community stance on the ROI of capacity planning

Findings & Putting things in practice

Steps followed

Step One: Discovery

Documented system architecture & request lifecycle

Formalized: clients, SLAs, & operational requirements

Step Two: Gauging & Planning

Confirmed constraints & determined strategy

Parallelized capacity & optimization tasks

Organized a team

Request flow (diagram)

Labels: Edge, Core

Components: APP / API (×2), LB (×2), COORDINATOR A, COORDINATOR B, COORDINATOR C 🐤

Caches: LON, DFW, FRA, LAX, AMS, SYD

Steps followed

Step Three: Capacity expansion

Doubled RAM: our constrained resource

Horizontally scaled to 3 servers + 1 canary

Step Four: Re-evaluation

Start process again

Tons of tuning left to do. We know we have suboptimal configs!

System Before

System After

System Before System After

Unexpected Challenges

Our goal when adding capacity was no service disruption.

Localhost is the goddamn devil

Gap from metric/graph to insight can be huge

Slowness is the nemesis of distributed systems

The Oprah Problem

Developing operational insights into a non-owned system under pressure is not great

Use playbooks, debug.md, rotations, & rollout owners

Proactivity and clarity are your best tools

Everyone gets more capacity!

Some Insights

Anything API-driven ought to carry a rate limit - we can easily DDoS ourselves!

Monitor and alert on expensive API actions

Mind your system dependencies: practice defensive system design & architecture
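One common way to give an API a rate limit is a token bucket. The sketch below is a minimal, single-threaded Python illustration under stated assumptions: the `TokenBucket` class, its parameters, and the per-call `cost` are invented for this example and are not the mechanism used in the talk. Charging a higher `cost` for expensive API actions is one possible way to keep them from crowding out cheap ones.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter: refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock            # injectable so tests can control time
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the call fits in the budget."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond the bucket size.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client or per API key and reject (or queue) any request for which `allow()` returns `False`. Counting those rejections is also a natural signal to monitor and alert on.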

CAPACITY PLANNING

ALERTING

MONITORING

Some Findings

Capacity tied to murky organizational structure is both good & bad (but mostly bad)

Mind your error descriptions! Cheeky today ⇒ misleading tomorrow!

Finding my system’s ceiling is still tricky

Services owned by engineers mean you need to level up your Ops skills

Back to re-evaluate setup to get more out of this new capacity

Performance testing ought to be done on the core’s side (& edge)

My Insights

TL;DR

Capacity planning

Is a process, not a one-time event

Pushes you to better understand your system, its capacity & its boundaries - that is good!

Proactivity is best

Distributed systems

Request lifecycle gets tricky

System boundaries, dependencies & SLAs must be discussed

Your system’s capacity may bound other systems’ capacity

github.com/Randommood/ZerotoCapacityPlanning

Special Thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako

Thank you!
