from 0 to capacity planning
From Zero To Capacity Planning
@Randommood
INES Sombra
Globally distributed and Highly available
Why capacity planning?
Or a journey of discovery and ingenuity
The views reflected in this talk are not to be considered a reflection of the skills of my
coworkers who are extremely nice human beings and way better at capacity planning
than I am.
😜
NOT A monitoring person💀
🚨🚨
INSTRUMENT
MONITOR & ALERT
PLAN & PREDICT
The Road to Capacity planning
?
Findings
Books
0
Day One
Some Learning
Our Discoveries
Rituals & Myths
Asking Around
Bringing it Home
our Path today
Checking The Edge
zero… Oh shit!
a convenient “situation”
Handles State
Many Clients
Other systems depend on this service to be: up, healthy, and available!
A bit F*cked
Our World
Edge ✨ Core ✨
a Fastly POP
I Rule the Edge!
Evaluates weekly global POPs performance & makes projections
Publishes capacity performance report in clear location
Plans for our physical capacity & transit capacity
Meet Catharine
Planning Our Capacity
Some metrics
- Network Capacity (Gb)
- Ordered Network Capability (Gb)
- Planned Network Capacity (Gb)
- RPS Capacity (k)
- Network peak (Gb)
- RPS peak (k)
- Site CPU Peak (%)
- Network Utilization (%)
Over 30%: flagged, Over 70%: Red status
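A minimal sketch of how the thresholds above could be applied per POP. It assumes Network Utilization is Network peak divided by Network Capacity; that formula, the function name, and the "ok" label are illustrative assumptions, not taken from the actual report.

```python
def utilization_status(peak_gb: float, capacity_gb: float) -> str:
    """Classify a site by network utilization (assumed: peak / capacity)."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    utilization = 100.0 * peak_gb / capacity_gb
    if utilization > 70.0:
        return "red"      # over 70%: red status
    if utilization > 30.0:
        return "flagged"  # over 30%: flagged
    return "ok"           # hypothetical label for everything else


if __name__ == "__main__":
    # Made-up sites and numbers, for illustration only.
    for site, peak, capacity in [("POP-A", 20, 100), ("POP-B", 50, 100), ("POP-C", 80, 100)]:
        print(site, utilization_status(peak, capacity))
```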
Edge Insights
Our ability to correctly plan for capacity is critical to our bottom line
Capacity doesn’t just involve hardware; software optimizations matter
People affect capacity
Hitting The Books
Defining Capacity planning
Measuring, planning, & managing system growth
Determines what your system needs & when
From the observation of actual traffic. Use current performance as baseline.
Must happen regardless of what you might optimize
ARE WE RIGHT NOW?
We have to be this fast & reliable
X per second & Y% Uptime
MEASURE HOW FAST/RELIABLE WE ARE
HARDWARE / SOFTWARE / ARCHITECTURE
CHANGE / ADD / REMOVE
FIGURE OUT HOW TO STAY FAST/RELIABLE ENOUGH
Yes!
No!
Allspaw's Wisdom
From The Art of Capacity Planning👈
System’s Ceiling: critical level of a resource that cannot be crossed without failure. Find yours
Another form of Capacity Planning: Controlled load testing
Predictions: ceilings + historical data
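One way to combine a ceiling with historical data is to fit a trend to past peaks and see when it reaches the ceiling. The sketch below uses an ordinary least-squares straight line over evenly spaced daily samples. The linear model, the function name, and the sample numbers are assumptions for illustration, and the result is a guess rather than a forecast.

```python
from typing import Optional, Sequence


def days_until_ceiling(history: Sequence[float], ceiling: float) -> Optional[float]:
    """Estimate days from the last sample until a linear trend hits the ceiling.

    history: one peak measurement per day, oldest first.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing_day = (ceiling - intercept) / slope
    # Clamp to zero if the trend has already reached the ceiling.
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # Hypothetical daily RAM peaks (GB) against a hypothetical 28 GB ceiling.
    print(days_until_ceiling([10, 12, 14, 16, 18], ceiling=28))
```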
Allspaw's Wisdom
System architecture can affect your ability to add capacity
Identify & track your application’s metrics
Tying metrics to user behavior is helpful
If you don’t have ways to measure your current capacity you can’t plan
Little’s Law & Capacity planning
L = λW
Capacity (L), Throughput (λ), and Latency (W)
Applies to stable systems
Use this information to better understand our workload and to define constraints
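A small worked example of L = λW, using the slide's reading of L as capacity (requests in flight), λ as throughput, and W as latency. The function names and the numbers are made up for illustration, and the relation only holds for a stable system.

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """L = λW: average number of requests in the system."""
    return throughput_rps * latency_s


def max_throughput(capacity: float, latency_s: float) -> float:
    """λ = L / W: throughput sustainable with a fixed number of concurrent slots."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return capacity / latency_s


if __name__ == "__main__":
    # 400 requests/s at 250 ms average latency keeps 100 requests in flight.
    print(in_flight(400, 0.25))
    # With 100 concurrent slots and 250 ms latency, the constraint is 400 requests/s.
    print(max_throughput(100, 0.25))
```

Read as a constraint: if latency doubles while the number of concurrent slots stays fixed, sustainable throughput halves.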
Literature Insights
Possible to have plenty of capacity and a slow site nonetheless
Projections & curve fitting are guesses
Keep track of API calls & their rate
Always gonna be spikes & hiccups. Take the bad with the good & plan for it
Rituals & Myths
Crowdsourcing Capacity planning
Industry Insights
Hard to extrapolate general advice into something applicable for my situation
Simplicity & ability to reason are the only things I could trust
Confusing community stance on the ROI of capacity planning
Findings & Putting things in practice
steps followed
Step One: Discovery
Documented system architecture & request lifecycle
Formalized: clients, SLAs, & operational requirements
Step Two: Gauging & Planning
Confirmed constraints & determined strategy
Parallelized capacity & optimizations tasks
Organized a team
REQUEST flow
Edge: CACHE LON, CACHE DFW, CACHE FRA, CACHE LAX, CACHE AMS, CACHE SYD
Core: APP / API, APP / API, LB, LB, COORDINATOR A, COORDINATOR B, COORDINATOR C 🐤
steps followed
Step Three: Capacity expansion
Doubled RAM: our constrained resource
Horizontally scaled to 3 servers + 1 canary
Step Four: re-Evaluation
Start process again
Tons of tuning left to do. We know we have suboptimal configs!
System Before
System After
Unexpected Challenges
Our goal when adding capacity was no service disruption.
Localhost is the goddamn devil
Gap from metric/graph to insight can be huge
Slowness is the nemesis of distributed systems
The Oprah Problem
Developing operational insights into a non-owned system under pressure is not great
Use playbooks, debug.md, rotations, & rollout owners
Proactivity and clarity are your best tools
Everyone gets more capacity!
Some Insights
Anything API driven ought to carry a rate limit - We can easily DDoS ourselves!
Monitor and alert on expensive API actions
Mind your system dependencies: practice defensive system design & architecture
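As an illustration of the rate-limit point, here is a minimal token-bucket sketch. The talk does not say which algorithm was used, so the token bucket, the class name, and the parameters are assumptions. The `cost` argument is one hypothetical way to charge expensive API actions more than cheap ones.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or burst <= 0:
            raise ValueError("rate and burst must be positive")
        self.rate = rate
        self.burst = burst
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = float(burst)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(float(self.burst), self._tokens + elapsed * self.rate)
        self._last = now
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5.0, burst=10)
    accepted = sum(bucket.allow() for _ in range(25))
    print(f"accepted {accepted} of 25 back-to-back calls")
```

Counting the rejected calls per client also gives a simple signal to monitor and alert on.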
CAPACITY PLANNING
ALERTING
MONITORING
Some Findings
Capacity tied to murky organizational structure is both good & bad (but mostly bad)
Mind your error descriptions! Cheeky today ⇒ misleading tomorrow!
Finding my system’s ceiling is still tricky
Services owned by engineers means you need to level up on Ops skills
Back to re-evaluate setup to get more out of this new capacity
Performance testing ought to be done on the core’s side (& edge)
My Insights
TL;DR
Capacity planning
Is a process, not a one-time event
Pushes you to better understand your system, its capacity & its boundaries - that is good!
Proactivity is best
Distributed systems
Request lifecycle gets tricky
System boundaries, dependencies & SLAs must be discussed
Your system’s capacity may bound other systems’ capacity
github.com/Randommood/ZerotoCapacityPlanning
Special Thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako
Thank you!