Capacity Management from Flickr
![Page 1: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/1.jpg)
Capacity Management
for Web Operations
John Allspaw
Operations Engineering
![Page 2: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/2.jpg)
the book I’m writing
![Page 3: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/3.jpg)
???
![Page 4: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/4.jpg)
Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks
(with some Flickr statistics sprinkled in)
![Page 5: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/5.jpg)
Things that can cause downtime
bugs (disguised as capacity problems)
edge cases (disguised as capacity problems)
security incidents
real capacity problems*
* (should be the last thing you need to worry about)
![Page 6: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/6.jpg)
Capacity != Performance
Forget about performance for right now
Measure what you have right NOW
Don’t count on it getting any better
![Page 7: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/7.jpg)
Thank You HPC Industry!
Automated Stuff
Scalable Metric Collection/Display
a lot of great deployment and management tricks come from them, adopted by web ops
![Page 8: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/8.jpg)
Good Measurement Tools
record and store
metrics in/out
custom metrics
easily compare
lightweight-ish
I
![Page 9: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/9.jpg)
Clouds need planning too
Makes deployment and procurement easy and quick
But clouds are still resources with costs and limits, just like your own stuff
Black-boxes: you may need to pay even more attention than before
![Page 10: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/10.jpg)
Metrics
System Statistics
![Page 11: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/11.jpg)
Metrics
“Application” Level
(photos processed per minute)
(average processing time per photo)
(apache requests)
(concurrent busy apache procs)
![Page 12: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/12.jpg)
Metrics
App-level meets system-level
here, total CPU = ~1.12 * # busy apache procs (ymmv)
![Page 13: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/13.jpg)
2400
photos per minute being uploaded right NOW (Tuesday afternoon)
![Page 14: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/14.jpg)
Ceilings
the maximum amount of “work” your resources will allow before degradation or failure
![Page 15: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/15.jpg)
Forget Benchmarking
![Page 16: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/16.jpg)
Find your ceilings
The End
what you have left
![Page 17: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/17.jpg)
Use real live production data
to find ceilings
Production: “it’s like a lab, but bigger!”
![Page 18: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/18.jpg)
Like: database ceilings
replication lag: bad!
![Page 19: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/19.jpg)
Ceilings
waiting on disk too much
sustained disk I/O wait of >40% creates slave lag*
* for us, YMMV
![Page 20: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/20.jpg)
35,000
photo requests per second on a Tuesday peak
![Page 21: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/21.jpg)
Safety Factors
![Page 22: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/22.jpg)
Safety Factors
Ceiling * Factor of Safety = UR LIMITZ
![Page 23: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/23.jpg)
Safety Factors
webserver!
![Page 24: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/24.jpg)
Safety Factors
“safe” ceiling @85% CPU
what you have left
85% total CPU = ~76 busy apache procs
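The arithmetic behind this slide can be sketched in a few lines of Python. The 1.12 ratio is the one quoted on the earlier metrics slide (total CPU = ~1.12 * busy apache procs); the function name is hypothetical, and the ratio only holds for the servers it was measured on (ymmv).

```python
# Observed on the metrics slide: total CPU % ~= 1.12 * busy apache procs.
CPU_PER_BUSY_PROC = 1.12


def busy_procs_at_cpu(cpu_percent, cpu_per_proc=CPU_PER_BUSY_PROC):
    """Return the number of busy Apache procs that corresponds to a CPU level."""
    return cpu_percent / cpu_per_proc


# The "safe" ceiling on the slide is 85% total CPU.
safe_ceiling_procs = round(busy_procs_at_cpu(85.0))
print(safe_ceiling_procs)
```

Expressing the limit in busy procs rather than CPU keeps it in application-level units, which is what the forecasting slides count.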
![Page 25: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/25.jpg)
Safety Factors
Yahoo Front Page link to Chinese New Year Photos
(photo requests/second)
(8% spike)
![Page 26: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/26.jpg)
Forecasting
![Page 27: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/27.jpg)
Forecasting
Fictional Example: webservers
![Page 28: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/28.jpg)
Forecasting
Fictional example: 15 webservers. 1 week.
peak of the week
![Page 29: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/29.jpg)
Forecasting
...bigger sample, 6 weeks... isolate the peaks...
![Page 30: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/30.jpg)
Forecasting
...“Add a Trendline” with some decent correlation...
not too shabby
now
![Page 31: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/31.jpg)
Forecasting
15 servers @76 busy apache proc limit = 1140 total procs
when is this?
this will tell you when it is
ceiling
what you have left
![Page 32: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/32.jpg)
Forecasting
(week #10, duh)
(1140-726) / 42.751 = 9.68
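The calculation on this slide can be written out as a small Python sketch. It assumes, as a reading of the slide, that 726 is the current weekly peak of busy procs and 42.751 is the slope of the trendline in procs per week; the function name is hypothetical.

```python
def weeks_until_limit(limit, current_peak, growth_per_week):
    """Weeks until a linearly growing peak reaches the limit."""
    if growth_per_week <= 0:
        raise ValueError("peak is not growing; the limit is never reached")
    return (limit - current_peak) / growth_per_week


servers = 15
procs_per_server = 76                  # "safe" ceiling per box
limit = servers * procs_per_server     # total procs across the fleet

# Assumed reading of the slide: 726 = current peak, 42.751 = trendline slope.
weeks = weeks_until_limit(limit, current_peak=726, growth_per_week=42.751)
print(f"{weeks:.2f} weeks remaining")
```

A straight-line projection like this is only as good as the trendline's correlation, so it is worth re-running as new weekly peaks arrive.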
![Page 33: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/33.jpg)
Forecasting Automation
Writing Excel macros is boring
All we want is “days remaining”, so all we need is the curve-fit
Use http://fityk.sf.net to automate the curve-fit
![Page 34: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/34.jpg)
Forecasting
Fictional Example: storage consumption
![Page 35: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/35.jpg)
Forecasting Automation
actual flickr storage consumption from early 2005, in GB
(ceiling is fictional)
this will tell you when this is
![Page 36: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/36.jpg)
Forecasting Automation
cmd line script output
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
![Page 37: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/37.jpg)
Forecasting Automation
fityk gave:
y = 0.786854x^2 + 146.657x + 14147.4
(R^2 = 99.84)
Excel gave:
y = 0.7675x^2 + 146.96x + 14147.3
(R^2 = 99.84)
(SAME)
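Once the curve-fit is in hand, finding when the curve meets a ceiling is a matter of solving the quadratic. A minimal Python sketch follows, using the fityk coefficients above. The 20,000 GB ceiling is a hypothetical value chosen for illustration, the function name is made up, and x is in whatever time units the input data file uses.

```python
import math

# Coefficients reported by fityk for y = A*x^2 + B*x + C (storage in GB).
A, B, C = 0.786854, 146.657, 14147.4


def x_at_ceiling(ceiling_gb, a=A, b=B, c=C):
    """Return the larger root x at which the fitted curve reaches ceiling_gb."""
    disc = b * b - 4 * a * (c - ceiling_gb)
    if disc < 0:
        raise ValueError("fitted curve never reaches this ceiling")
    return (-b + math.sqrt(disc)) / (2 * a)


CEILING_GB = 20000.0   # hypothetical ceiling, not from the slides
x = x_at_ceiling(CEILING_GB)
print(f"curve reaches {CEILING_GB:.0f} GB at x = {x:.1f}")
```

Subtracting the x value of the most recent data point from this result gives the "remaining" figure the earlier slide asks for.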
![Page 38: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/38.jpg)
Capacity Health
12,629 nagios checks
1314 hosts
6 datacenters
4 photo “farms”
farm = 2 DCs (east/west)
![Page 39: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/39.jpg)
High and Low Water Marks
alert if higher
alert if lower
Per server, squid requests per second
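A high/low water mark check can be sketched as below. The threshold values and the function name are hypothetical; the slide does not give the actual marks used for squid.

```python
def check_water_marks(value, low, high):
    """Classify a metric reading against low and high water marks."""
    if value > high:
        return "ALERT: above high water mark"
    if value < low:
        return "ALERT: below low water mark"
    return "OK"


# Hypothetical thresholds for per-server squid requests per second.
LOW_MARK, HIGH_MARK = 200, 950

for reading in (120, 600, 1010):
    print(reading, check_water_marks(reading, LOW_MARK, HIGH_MARK))
```

The low mark matters as much as the high one: a server receiving far less traffic than expected often points to a problem upstream.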
![Page 40: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/40.jpg)
A good dashboard looks something like...
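| type | # | limit/box | ceiling units | limit (total) | current (peak) | % peak | Est days left |
|---|---|---|---|---|---|---|---|
| www | 20 | 80 | busy procs | 1600 | 1000 | 62.50% | 36 |
| shard db | 20 | 40 | I/O wait | 800 | 220 | 27.50% | 120 |
| squid | 18 | 950 | req/sec | 17,100 | 11,400 | 66.67% | 48 |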
(yes, fictional numbers)
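The derived columns of such a dashboard (total limit and percent of peak) follow directly from the per-box limit and the current peak. A minimal Python sketch using the fictional numbers above; the function and variable names are hypothetical, and the "est days left" column is not computed because it needs a growth trend that the slide does not show.

```python
rows = [
    # (type, boxes, limit per box, ceiling units, current peak)
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]


def dashboard_row(boxes, limit_per_box, current_peak):
    """Return (total limit, current peak as a percentage of that limit)."""
    total_limit = boxes * limit_per_box
    percent_of_limit = 100.0 * current_peak / total_limit
    return total_limit, percent_of_limit


for name, boxes, per_box, units, peak in rows:
    total, pct = dashboard_row(boxes, per_box, peak)
    print(f"{name:<9} {total:>6} {units:<10} {peak:>6} {pct:6.2f}%")
```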
![Page 41: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/41.jpg)
Diagonal Scaling
Image processing machines
Replace Dell PE860s with HP DL140G3s
vertically scaling your already horizontal nodes
![Page 42: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/42.jpg)
Diagonal Scaling
example: image processing
4 cores
8 cores
(about the same CPU “usage” per box)
![Page 43: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/43.jpg)
Diagonal Scaling
example: image processing
throughput
~45 images/min @ peak
~140 images/min @ peak
(same CPU usage, but ~3x more work)
“processing” means making 4 sizes from originals
![Page 44: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/44.jpg)
Diagonal Scaling
example: image processing
went from: 23 Dell PE860s, 23U rack, 3008.4 Watts, 1035 photos/min
to: 8 HP DL140 G3s, 8U rack, 1036.8 Watts, 1120 photos/min
(75% faster, even)
!!!
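The before/after numbers on this slide can be compared with a short Python sketch. The dictionary layout and variable names are hypothetical; the figures are the ones on the slide.

```python
before = {"boxes": 23, "watts": 3008.4, "photos_per_min": 1035, "rack_units": 23}
after = {"boxes": 8, "watts": 1036.8, "photos_per_min": 1120, "rack_units": 8}


def per_box_throughput(fleet):
    """Photos per minute handled by one box at peak."""
    return fleet["photos_per_min"] / fleet["boxes"]


def watts_per_photo_per_min(fleet):
    """Power drawn for each photo/min of peak throughput."""
    return fleet["watts"] / fleet["photos_per_min"]


power_saved = 1 - after["watts"] / before["watts"]
rack_saved = 1 - after["rack_units"] / before["rack_units"]

print(f"per-box throughput: {per_box_throughput(before):.0f} -> "
      f"{per_box_throughput(after):.0f} photos/min")
print(f"watts per photo/min: {watts_per_photo_per_min(before):.2f} -> "
      f"{watts_per_photo_per_min(after):.2f}")
print(f"power saved: {power_saved:.1%}, rack space saved: {rack_saved:.1%}")
```

Normalizing by throughput rather than by box count shows the effect of the swap: fewer, larger boxes do slightly more work for roughly a third of the power and rack space.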
![Page 45: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/45.jpg)
3.52
terabytes will be consumed today (on a Tuesday)
![Page 46: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/46.jpg)
2nd Order Effects
(beware the wandering bottleneck)
running hot, so add more
![Page 47: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/47.jpg)
2nd Order Effects
(beware the wandering bottleneck)
running great now, so more traffic!
now these run hot
![Page 48: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/48.jpg)
Stupid Capacity Tricks
![Page 49: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/49.jpg)
Stupid Capacity Tricks
quick and dirty management
DSH
http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2
![Page 50: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/50.jpg)
Stupid Capacity Tricks
quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
![Page 51: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/51.jpg)
Stupid Capacity Tricks
Turn Stuff OFF
Disable heavy-ish features of the site (on/off switches)
We have 195 different things to disable in case of emergency.
![Page 52: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/52.jpg)
Stupid Capacity Tricks
Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
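On/off switches of this kind can be sketched as follows. This is a minimal in-memory illustration, not Flickr's implementation: the class, the feature names, and the handler are all hypothetical, and a real site would need the switch state shared across every server.

```python
class FeatureSwitches:
    """In-memory on/off switches for site features (illustrative only)."""

    def __init__(self, names):
        # Every feature starts enabled.
        self._enabled = {name: True for name in names}

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = value

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]


def handle_video_upload(switches):
    """Refuse the heavy operation cheaply when its switch is off."""
    if not switches.is_enabled("video_uploads"):
        return "video uploads are temporarily disabled"
    return "accepted"


switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")   # emergency: shed the heavy-ish feature
print(handle_video_upload(switches))
```

Checking the switch at the top of the handler means a disabled feature costs almost nothing, which is the point of turning it off under load.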
![Page 53: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/53.jpg)
Stupid Capacity Tricks
Outages Happen
Host your outage/status/blog page in more than one datacenter.
Tell your users WTF is going on; they’ll appreciate it.
![Page 54: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/54.jpg)
Stupid Capacity Tricks
Hit the Pause Button
Bake the dynamic into static
Some Y! properties have a big red button to instantly bake (and un-bake) at will
![Page 55: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/55.jpg)
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
![Page 56: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/56.jpg)
We’re Hiring!
flickr.com/jobs
Come see me!
![Page 57: Capacity Management from Flickr](https://reader033.vdocuments.us/reader033/viewer/2022061300/54c7f5944a795917178b45c0/html5/thumbnails/57.jpg)
questions?