How does the Cloud Foundry Diego Project Run at Scale?
![Page 1: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/1.jpg)
How does the Cloud Foundry Diego Project Run at Scale?
and updates on .NET Support
![Page 2: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/2.jpg)
Who’s this guy?
• Amit Gupta
• https://akgupta.ca
• @amitkgupta84
![Page 3: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/3.jpg)
Who’s this guy?
• Berkeley math grad school… dropout
• Rails consulting… deserter
• now I do BOSH, Cloud Foundry, Diego, etc.
![Page 4: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/4.jpg)
Testing Diego Performance at Scale
• current Diego architecture
• performance testing approach
• test specifications
• test implementation and tools
• results
• bottom line
• next steps
![Page 5: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/5.jpg)
Current Diego Architecture
![Page 6: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/6.jpg)
Current Diego Architecture
What’s new-ish?
• consul for service discovery
• receptor (API) to decouple from CC
• SSH proxy for container access
• NATS-less auction
• garden-windows for .NET applications
![Page 7: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/7.jpg)
Current Diego Architecture
Main components:
• etcd: ephemeral data store
• consul: service discovery
• receptor: Diego API
• nsync: sync CC desired state w/Diego
• route-emitter: sync with gorouter
• converger: health mgmt & consistency
• garden: containerization
• rep: sync garden actual state w/Diego
• auctioneer: workload scheduling
![Page 8: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/8.jpg)
Performance Testing Approach
• full end-to-end tests
• do a lot of stuff:
– is it correct, is it performant?
• kill a lot of stuff:
– is it correct, is it performant?
• emit logs and metrics (business as usual)
• plot & visualize
• fix stuff, repeat at higher scale*
![Page 9: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/9.jpg)
Test Specifications
[Figure: the four tests, #1 to #4, in a 2×2 grid]
![Page 10: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/10.jpg)
Test Specifications
[Figure: the four tests, #1 to #4, repeated at scale factors n: x 1, x 2, x 5, x 10]
![Page 11: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/11.jpg)
Test Specifications
• Diego does tasks and long-running processes
• launch 10n, …, 400n tasks:
– workload distribution?
– scheduling time distribution?
– running time distribution?
– success rate?
– growth rate?
• launch 10n, …, 400n-instance LRP:
– same questions…
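The questions above come down to a few summary statistics over per-task records. A minimal sketch in Python, assuming each task is logged as a hypothetical record with a cell ID, a scheduling duration, and a success flag (the slides do not show the real log format, so the field names are illustrative):

```python
from collections import Counter
from statistics import median


def summarize(tasks):
    """Summarize hypothetical task records: dicts with 'cell', 'scheduling_s', 'ok'."""
    per_cell = Counter(t["cell"] for t in tasks)        # workload distribution
    sched = sorted(t["scheduling_s"] for t in tasks)    # scheduling time distribution
    return {
        "tasks_per_cell": dict(per_cell),
        # Difference between the busiest and the idlest cell that ran anything.
        "spread": max(per_cell.values()) - min(per_cell.values()),
        "median_scheduling_s": median(sched),
        "max_scheduling_s": sched[-1],
        "success_rate": sum(t["ok"] for t in tasks) / len(tasks),
    }
```

Running the same summary at 10n, …, 400n and comparing the medians would give a rough view of the growth rate.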
![Page 12: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/12.jpg)
Test Specifications
• Diego+CF stages and runs apps
• > cf push
• upload source bits
• fetch buildpack and stage droplet (task)
• fetch droplet and run app (LRP)
• dynamic routing
• streaming logs
![Page 13: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/13.jpg)
Test Specifications
• bring up n nodes in parallel
– from each node, push a apps in parallel
– from each node, repeat this for r rounds
• a is always ≈ 20
• r is always = 40
• n starts out = 1
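The n / a / r structure can be sketched as two nested levels of parallelism. This is an illustrative Python harness, not the actual test suite (which the slides say is built on ginkgo & gomega); `push` is a hypothetical stand-in for whatever performs a `cf push`, and the app naming scheme is invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor


def run_node(node, apps_per_round, rounds, push):
    """One test-lab node: for each round, push `apps_per_round` apps in parallel."""
    results = []
    with ThreadPoolExecutor(max_workers=apps_per_round) as pool:
        for r in range(rounds):
            names = [f"app-{node}-{r}-{i}" for i in range(apps_per_round)]
            # Consuming the map blocks until every push in this round is done.
            results.extend(pool.map(push, names))
    return results


def run_experiment(n, a, r, push):
    """Run n nodes in parallel; returns one result per push (n * a * r in total)."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        per_node = pool.map(lambda node: run_node(node, a, r, push), range(n))
        return [res for node_results in per_node for res in node_results]
```

With n=1, a=20 and r=40 this issues 800 pushes in total; doubling n doubles the load without changing what any single node does.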
![Page 14: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/14.jpg)
Test Specifications
• the pushed apps have varying characteristics:
– 1-4 instances
– 128M-1024M memory
– 1M-200M source code payload
– 1-20 log lines/second
– crash never vs. every 30 s
![Page 15: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/15.jpg)
Test Specifications
• starting with n=1:
– app instances ≈ 1k
– instances/cell ≈ 100
– memory utilization across cells ≈ 90%
– app instances crashing (by-design) ≈ 10%
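The slides give the instance count and the per-cell density but not the number of cells. A back-of-envelope sketch recovers it; the rounding up and the function name are assumptions made for illustration:

```python
import math


def cells_needed(app_instances, instances_per_cell=100):
    """Estimate the cell count from total instances and target density."""
    return math.ceil(app_instances / instances_per_cell)
```

By this estimate the n=1 round corresponds to roughly 10 cells, and a 10k-instance run at the same density to roughly 100.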
![Page 16: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/16.jpg)
Test Specifications
• evaluate:
– workload distribution
– success rate of pushes
– success rate of app routability
– times for all the things in the push lifecycles
– crash recovery behaviour
– all the metrics!
![Page 17: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/17.jpg)
Test Specifications
• kill 10% of cells
– watch metrics for recovery behaviour
• kill moar cells… and etcd
– does system handle excess load gracefully?
• revive everything with > bosh cck
– does system recover gracefully…
– with no further manual intervention?
![Page 18: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/18.jpg)
Test Specifications
– Figure Out What’s Broke –
– Fix Stuff –
– Move On Scale Up & Repeat –
![Page 19: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/19.jpg)
Test Implementation and Tools
• S3: log, graph, plot backups
• ginkgo & gomega: testing DSL
• BOSH: parallel test-lab deploys
• tmux & ssh: run test suites remotely
• papertrail: log archives
• datadog: metrics visualizations
• cicerone (custom): log visualizations
![Page 20: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/20.jpg)
Results
400 tasks’ lifecycle timelines, dominated by container creation
![Page 21: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/21.jpg)
Results
Maybe some cells’ gardens were running slower?
![Page 22: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/22.jpg)
Results
Grouping by cell shows uniform container creation slowdown
![Page 23: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/23.jpg)
Results
So that’s not it…
Also, what’s with the blue steps?
Let’s visualize logs a couple more ways
Then take stock of the questions raised
![Page 24: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/24.jpg)
Results
Let’s just look at scheduling (ignore container creation, etc.)
![Page 25: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/25.jpg)
Results
Scheduling again, grouped by which API node handled the request
![Page 26: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/26.jpg)
Results
And how about some histograms of all the things?
![Page 27: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/27.jpg)
Results
From the 400-task request from “Fezzik”:
• only 3-4 (out of 10) API nodes handle reqs?
• recording task reqs take increasing time?
• submitting auction reqs sometimes slow?
• later auctions take so long?
• outliers wtf?
• container creation takes increasing time?
![Page 28: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/28.jpg)
Results
• only 3-4 (out of 10) API nodes handle reqs?
– when multiple requests look up the same address at once, Golang returns a single DNS response to all of them; this results in only 3-4 API endpoint lookups for the whole set of tasks
• recording task reqs take increasing time?
– API servers use an etcd client with throttling on # of concurrent requests
• submitting auction reqs sometimes slow?
– auction requests require API node to look up auctioneer address in etcd, using throttled etcd client
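Why would a cap on concurrent etcd requests make later requests look slower? A rough model, assuming all requests arrive at once, every etcd call takes the same time, and the client admits at most `limit` calls at a time (simplifications for illustration; the real limit and call times are not given in the slides):

```python
import math


def completion_times(num_requests, limit, call_s):
    """Finish time of each request when all arrive at t=0, at most `limit`
    run concurrently, and each call takes `call_s` seconds."""
    return [math.ceil((i + 1) / limit) * call_s for i in range(num_requests)]
```

Under this model the requests finish in waves of `limit`, so the time to record a request grows linearly with its position in the queue, which matches the "increasing time" pattern described above.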
![Page 29: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/29.jpg)
Results
• later auctions take so long?
– reps were taking longer to report their state to auctioneer, because they were making expensive calls to garden, sequentially, to determine current resource usage
• outliers wtf?
– combination of missing logs due to papertrail lossiness, + cicerone handling missing data poorly
• container creation takes increasing time?
– garden team tasked with investigation
![Page 30: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/30.jpg)
Results
Problems can come from:
• our software
– throttled etcd client
– sequential calls to garden
• software we consume
– garden container creation
• “experiment apparatus” (tools and services):
– papertrail lossiness
– cicerone sloppiness
• language runtime
– Golang’s DNS behaviour
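The DNS behaviour in the last item can be simulated in a few lines. This is a Python sketch of lookup coalescing in general, not Go's resolver code; the hostname, the addresses, and the round-robin answers are assumptions made for illustration:

```python
import asyncio
import itertools


class CoalescingResolver:
    """Shares one in-flight lookup among all callers asking for the same host."""

    def __init__(self, addresses):
        self._next_address = itertools.cycle(addresses)  # round-robin answers
        self._in_flight = {}  # host -> asyncio.Task for the pending lookup
        self.real_lookups = 0

    async def _resolve(self, host):
        try:
            self.real_lookups += 1
            await asyncio.sleep(0)  # stand-in for network latency
            return next(self._next_address)
        finally:
            self._in_flight.pop(host, None)  # later callers start a fresh lookup

    async def lookup(self, host):
        task = self._in_flight.get(host)
        if task is None:
            task = asyncio.ensure_future(self._resolve(host))
            self._in_flight[host] = task
        return await task  # every concurrent caller receives the same answer


async def simulate(waves, requests_per_wave, addresses):
    """Issue several bursts of concurrent lookups; report real lookups and nodes hit."""
    resolver = CoalescingResolver(addresses)
    chosen = []
    for _ in range(waves):
        chosen += await asyncio.gather(
            *(resolver.lookup("receptor.example") for _ in range(requests_per_wave))
        )
    return resolver.real_lookups, sorted(set(chosen))
```

With 400 requests arriving in 4 bursts against 10 addresses, the simulation performs 4 real lookups and touches only 4 of the 10 nodes, the same shape as the 3-4 out of 10 reported earlier.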
![Page 31: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/31.jpg)
Results
Fixed what we could control, and now it’s all garden
![Page 32: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/32.jpg)
Results
Okay, so far, that’s just been
[Figure: test grid, #1 to #4 at x 1, x 2, x 5, x 10]
![Page 33: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/33.jpg)
Results
Next, the timelines of pushing 1k app instances
![Page 34: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/34.jpg)
Results
• for the fastest pushes
– dominated by red, blue, gold
– i.e. upload source & CC emit “start”, staging process, upload droplet
• pushes get slower
– growth in green, light blue, fuchsia, teal
– i.e. schedule staging, create staging container, schedule running, create running container
• main concern: why is scheduling slowing down?
![Page 35: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/35.jpg)
Results
• we had a theory (blame app log chattiness)
• reproduced experiment in BOSH-Lite
– with chattiness turned on
– with chattiness turned off
• appeared to work better
• tried it on AWS
• no improvement
![Page 36: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/36.jpg)
Results
• spelunked through more logs
• SSH’d onto nodes and tried hitting services
• eventually pinpointed it:
– auctioneer asks cells for state
– cell reps ask garden for usage
– garden gets container disk usage (bottleneck)
![Page 37: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/37.jpg)
Results
Garden stops sending disk usage stats, scheduling time disappears
![Page 38: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/38.jpg)
Results
Let’s let things stew between [figure] and [figure]
![Page 39: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/39.jpg)
Results
Right after all app pushes, decent workload distribution
![Page 40: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/40.jpg)
Results
… an hour later, something pretty bad happened
![Page 41: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/41.jpg)
Results
• cells heartbeat their presence to etcd
• if ttl expires, converger reschedules LRPs
• cells may reappear after their workloads have been reassigned
• they remain underutilized
• but why do cells disappear in the first place?
• added more logging, hope to catch in n=2 round
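The failure mode above can be reproduced with a toy model. This is a simplified Python sketch, not the converger's actual logic: the least-loaded placement rule, the cell names, and the data structures are assumptions made for illustration.

```python
from collections import Counter


def converge(now, last_heartbeat, placement, ttl):
    """Move LRP instances off cells whose heartbeat is older than `ttl`.

    last_heartbeat: cell -> time of last heartbeat
    placement:      LRP instance -> cell (updated in place and returned)
    """
    live = [c for c, t in last_heartbeat.items() if now - t <= ttl]
    if not live:
        return placement  # nowhere to reschedule to
    for lrp, cell in sorted(placement.items()):
        if cell not in live:
            load = Counter(placement.values())
            # Assumed policy: pick the least-loaded live cell, ties by name.
            placement[lrp] = min(live, key=lambda c: (load[c], c))
    return placement
```

In this model, if one of three evenly loaded cells misses its TTL, its instances move to the other two. When the cell heartbeats again, nothing moves back, because instances are only rescheduled on expiry, so the returning cell sits empty.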
![Page 42: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/42.jpg)
Results
With the one lingering question about cell disappearance, on to n=2
[Figure: test grid, #1 to #4 at x 1, x 2, x 5, x 10, with ✓ and ? marking the status of each test]
![Page 43: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/43.jpg)
Results
With 800 concurrent task reqs, found a garden container cleanup bug
![Page 44: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/44.jpg)
Results
With an 800-instance LRP, found the API node scheduling requests serially
![Page 45: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/45.jpg)
Results
• we added a story to the garden backlog
• the serial request issue was an easy fix
• then, with n=2 parallel test-lab nodes, we pushed 2x the apps
– things worked correctly
– system was performant as a whole
– but individual components showed signs of scale issues
![Page 46: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/46.jpg)
Results
Our “bulk durations” doubled
![Page 47: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/47.jpg)
Results
• nsync fetches state from CC and etcd to make sure CC desired state is reflected in Diego
• converger fetches desired and actual state from etcd to make sure things are consistent
• route-emitter fetches state from etcd to keep gorouter in sync
• bulk loop times doubled from n=1
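Each of these bulk loops fetches a full snapshot of state and reconciles it. A generic reconciliation step can be sketched as follows; this is an illustration in Python, not Diego's code, and the process IDs and dict shapes are hypothetical:

```python
def bulk_diff(desired, actual):
    """Compare two snapshots mapping process ID -> instance count.

    Returns (to_start, to_stop): how many instances of each process to add
    or remove so that actual state matches desired state.
    """
    to_start, to_stop = {}, {}
    for pid in desired.keys() | actual.keys():
        delta = desired.get(pid, 0) - actual.get(pid, 0)
        if delta > 0:
            to_start[pid] = delta
        elif delta < 0:
            to_stop[pid] = -delta
    return to_start, to_stop
```

A loop of this shape visits every process in both snapshots, so its cost grows with the amount of state fetched, which is consistent with loop times doubling when the number of apps doubles.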
![Page 48: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/48.jpg)
Results
… and this happened again
![Page 49: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/49.jpg)
Results
– the etcd and consul story –
![Page 50: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/50.jpg)
Results
Fast-forward to today
[Figure: test grid, #1 to #4 at x 1, x 2, x 5, x 10, with ✓ and ? marking the status of each test]
![Page 51: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/51.jpg)
Bottom Line
At the highest scale:
• 4000 concurrent tasks ✓
• 4000-instance LRP ✓
• 10k “real app” instances @ 100 instances/cell:
– etcd (ephemeral data store) ✓
– consul (service discovery) ? (… it’s a long story)
– receptor (Diego API) ? (bulk JSON)
– nsync (CC desired state sync) ? (because of receptor)
– route-emitter (gorouter sync) ? (because of receptor)
– garden (containerizer) ✓
– rep (garden actual state sync) ✓
– auctioneer (scheduler) ✓
![Page 52: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/52.jpg)
Next Steps
• Security
– mutual SSL between all components
– encrypting data-at-rest
• Versioning
– handle breaking API changes gracefully
– production hardening
• Optimize data models
– hand-in-hand with versioning
– shrink payload for bulk reqs
– investigate faster encodings; protobufs > JSON
– initial experiments show 100x speedup
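To give a feel for why a schema-based binary encoding shrinks bulk payloads, here is a small Python comparison. Protobuf needs a schema and generated code, so the sketch uses the standard `struct` module as a stand-in for a fixed-layout binary format; the record fields are hypothetical, and the example shows payload size only. It does not reproduce the 100x speedup mentioned above.

```python
import json
import struct

# Hypothetical record: (process index, instance index, memory in MB, state code).
# "<" selects little-endian with no padding: 4 + 2 + 2 + 1 = 9 bytes per record.
RECORD = struct.Struct("<IHHB")
KEYS = ("process", "index", "memory_mb", "state")


def encode_json(records):
    """Self-describing encoding: field names are repeated in every record."""
    return json.dumps([dict(zip(KEYS, r)) for r in records]).encode("utf-8")


def encode_binary(records):
    """Fixed-layout encoding: the schema lives in code, not in the payload."""
    return b"".join(RECORD.pack(*r) for r in records)


def decode_binary(payload):
    """Inverse of encode_binary; returns a list of tuples."""
    return [
        RECORD.unpack_from(payload, offset)
        for offset in range(0, len(payload), RECORD.size)
    ]
```

For 1000 such records the binary payload is exactly 9000 bytes, while the JSON payload is several times larger because every record carries its own field names and decimal digits.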
![Page 54: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/54.jpg)
Updates on .NET Support
• what’s currently supported?
– ASP.NET MVC
– nothing too exotic
– most CF/Diego features, e.g. security groups
– VisualStudio plugin, similar to the Eclipse CF plugin for Java
• what are the limitations?
– some newer Diego features, e.g. SSH
– in α/β stage, dev-only
![Page 55: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/55.jpg)
Updates on .NET Support
• what’s coming up?
– make it easier to deploy Windows cell
– more VisualStudio plugin features
– hardening testing/CI
• further down the line?
– remote debugging
– the “Spring experience”
![Page 56: How does the Cloud Foundry Diego Project Run at Scale?](https://reader037.vdocuments.us/reader037/viewer/2022110200/55c472c8bb61eb926e8b458c/html5/thumbnails/56.jpg)
Updates on .NET Support
• shout outs
– CenturyLink
– HP
• feedback & questions?
– Mark Kropf (PM): [email protected]
– David Morhovich (Lead): [email protected]