anynines - running cloud foundry for 12 months - an experience report
DESCRIPTION
anynines runs a public PaaS based on Cloud Foundry, located in a German datacenter. In more than 12 months of running a Cloud Foundry PaaS, many lessons about security, high availability, OpenStack and other exciting topics have been learned. See how Bosh can be used and how it shouldn't be used, learn how to perform Cloud Foundry upgrades, and read how to harden Cloud Foundry by adding more fault tolerance with Pacemaker.
TRANSCRIPT
Running Cloud Foundry An Experience Report
About this talk
• An opinionated take on running Cloud Foundry (CF)
• How to shoot yourself in the foot with CF over-commitment settings
• How to perform CF updates
• How to harden CF
• Wise words about CF services
Introduction
about.me/fischerjulian
Running a public Cloud Foundry
for more than a year.
It works.
In order to run Cloud Foundry smoothly …
… “refer to the package leaflet for risks and side effects and consult Pivotal, CloudCredo or anynines.”
The details
The anynines Stack
Hardware
OpenStack
Cloud Foundry
VMware
We migrated from rented VMware to a
self-hosted OpenStack.
For more details on this: http://rh.gd/a9vmw2sos
Proof point made…
Cloud Foundry protects investments in software development
by being infrastructure-agnostic.
Running Cloud Foundry. What happened.
Security Issues
• Pivotal informs partners early about issues
• Usually along with fixes
OpenStack Issues
• Ext4 vs. Ext3
• DEA MTU
• rsyslogd command not found
CF Gotchas
DEA evacuate & Bosh timeout race-condition
• Removing a DEA → apps will be evacuated → DEA will be stopped
• Bosh deployment will fail when evacuation takes longer than the Bosh timeout
• Set your Bosh timeout accordingly!
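That advice can be made concrete in the deployment manifest's `update` block. A hedged sketch, assuming classic Bosh v1 manifest syntax; the values are illustrative, not anynines' actual settings:

```yaml
# `update` block of a CF deployment manifest. The watch times are the
# windows (in ms) Bosh waits for a job to settle; raising the upper
# bound gives a slow DEA evacuation room to finish before Bosh
# declares the deployment failed.
update:
  canaries: 1
  canary_watch_time: 30000-600000   # up to 10 minutes per canary
  update_watch_time: 30000-600000   # up to 10 minutes per DEA
  max_in_flight: 1
```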
DEA over-commitment
Default over-commitment factor = 4
RAM peaks may cause random errors
• Failures during staging
• Random application crashes
• No meaningful log information
Reducing over-commitment
• Native strategy
• Reduce over-commitment factor
• Bosh deploy
• 8 GB VM, OC factor 4 → Announces 32 GB (V)RAM
• 8 GB VM, OC factor 2 → Announces 16 GB (V)RAM
• When evacuating a 32 GB (V)RAM host, another 32 GB (V)RAM host will be preferred (more free space)
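The announcement arithmetic behind those bullets is simply physical RAM times the over-commitment factor; a minimal sketch (the function name is mine, not from the DEA code):

```python
def announced_ram_gb(physical_ram_gb: int, oc_factor: int) -> int:
    """RAM a DEA advertises to the Cloud Controller:
    physical RAM multiplied by the over-commitment factor."""
    return physical_ram_gb * oc_factor

print(announced_ram_gb(8, 4))  # 32 -> old DEAs, default factor
print(announced_ram_gb(8, 2))  # 16 -> new DEAs after the reduction
```

This is why a freshly reduced deployment evacuates onto the remaining factor-4 hosts first: they still announce the most free capacity.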
Evacuation Wave
1 GB
1 GB
1 GB
1 GB
= maximum impact on running apps!
New DEAs (OC 2) will receive apps when old DEAs
(OC 4) have been stopped.
Hints
• Create 2nd resource pool for new DEAs
• Deploy the 2nd resource pool before startup to stop old DEAs
• (-) Needs more resources
• (+) Smoother transition
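The two-pool hint can be sketched as a manifest excerpt. Everything here is illustrative, assuming classic Bosh v1 manifest syntax; names and instance types are invented:

```yaml
resource_pools:
- name: dea_old                 # existing DEAs, over-commitment factor 4
  network: cf
  stemcell: {name: bosh-openstack-kvm-ubuntu, version: latest}
  cloud_properties: {instance_type: m1.large}
- name: dea_new                 # second pool, over-commitment factor 2;
  network: cf                   # deployed first, then the old pool is
  stemcell: {name: bosh-openstack-kvm-ubuntu, version: latest}
  cloud_properties: {instance_type: m1.large}   # drained and removed
```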
Updating Cloud Foundry
Required: Staging System
• Structurally identical
• Fewer VMs
1. Determine new features
since last release
2. Study
deployment manifest changes
3. Apply
deployment manifest changes
4. First staging attempt
5. Debug and Fix it!
6. Simulate the live-upgrade
7. Perform the upgrade
and cross your fingers.
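With the v1 Bosh CLI the staging steps above might look like this; release names, versions and manifest paths are placeholders:

```shell
bosh upload release cf-170.tgz            # step 1: fetch the new release
diff cf-staging-old.yml cf-staging.yml    # steps 2-3: study and apply
                                          #   manifest changes
bosh deployment cf-staging.yml            # target the staging manifest
bosh deploy                               # step 4: first staging attempt
# Steps 5-6: debug, fix, redeploy on staging until the simulated
# live upgrade succeeds -- only then run the same deploy in production.
```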
CF Hardening
Accept that VMs are ephemeral
VM Failover Strategies
Resurrect
• Monitor VM
• Re-Build VMs automatically
• e.g. using Cloud Foundry Bosh
• + Easy
• - Takes long (minutes not seconds)
• - OpenStack doesn’t release persistent disks automatically
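Automatic rebuilding is what the Bosh Health Monitor's resurrector plugin does; a sketch of enabling it in the Bosh director's own manifest (property names follow the open-source Bosh release; treat the exact values and semantics as assumptions):

```yaml
properties:
  hm:
    resurrector_enabled: true
    resurrector:
      minimum_down_jobs: 5     # below this many down jobs, resurrect
      percent_threshold: 0.2   # above 20% down, assume a wider outage
      time_threshold: 600      # window (s) for the thresholds above
```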
Failover to Standby VM
Distribute CF components across availability zones
• Build disjoint networks, racks, etc.
• Each disjoint zone = availability zone
• Tell your IaaS about availability zones
• On provision choose the AZ
• Build Bosh releases accordingly
• Provide stand-by VM
• Monitor VM and perform failover
• IP failover using Pacemaker
• + Fast failover (seconds)
• - Pacemaker not easy to use (& boshify)
• - Increased resource usage by stdby VM(s)
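The IP failover itself can be sketched with the crm shell and the standard `ocf:heartbeat:IPaddr2` resource agent; the address and interface are placeholders:

```shell
# Floating IP that Pacemaker moves between the active and standby VM.
crm configure primitive cc_vip ocf:heartbeat:IPaddr2 \
    params ip=10.0.1.100 cidr_netmask=24 nic=eth0 \
    op monitor interval=10s
# On node failure the VIP is reassigned within seconds, which is the
# fast failover claimed above -- at the price of idle standby VMs.
```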
• 2 * UAA
• 2 * CC
• 2 * n * DEAs
• 2 * Health Manager
• …
UAA & CC DB =
SPOF
HA Postgres
• UAA and Cloud Controller database
• Single point of failure for Cloud Foundry
• Postgres is not inherently clusterable → failover with standby VM
• Master/slave replication
• Pacemaker/corosync
• IP-Failover using NIC-reattachment
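The replication half of that setup is plain PostgreSQL streaming replication (PostgreSQL 9.x era; hosts and the replication user are placeholders):

```ini
# postgresql.conf on the master
wal_level = hot_standby
max_wal_senders = 3

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=10.0.1.10 port=5432 user=replicator'
```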
That’s halfway towards a PostgreSQL CF Service
• Add a V2 Service Broker
• Add a provisioning logic
• Provision a 2-node DB cluster on cf create-service postgres medium-cluster
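The broker's first contract with the Cloud Controller is the catalog. A minimal Python sketch of the JSON a `GET /v2/catalog` response could carry for the plan named above; all IDs, names and descriptions are invented:

```python
import json

# Hypothetical /v2/catalog payload for a clustered-PostgreSQL broker.
# Field names follow the CF v2 service broker API.
CATALOG = {
    "services": [{
        "id": "a9-postgres-service",
        "name": "postgres",
        "description": "Dedicated PostgreSQL, optionally clustered",
        "bindable": True,
        "plans": [{
            "id": "a9-postgres-medium-cluster",
            "name": "medium-cluster",
            "description": "2-node master/slave cluster with failover",
        }],
    }]
}

# The plan a user asks for with `cf create-service postgres medium-cluster`:
print(json.dumps(CATALOG["services"][0]["plans"][0]["name"]))
```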
Services
“The best way to find yourself is to lose yourself in the service of others.”
― Mahatma Gandhi
Wardenized Services (community services)
are cute for pet projects.
Not suitable for production.
• Implementations are outdated
• One size doesn’t fit all!
No production CF without high quality services.
CF Service Design
• Use clusterable services if possible
• Implement automatic failover if not
• Autoprovisioning using Bosh
• Organize self-healing
• (Semi-)Automatic recovery from degraded mode
Summary
• Bosh & the CF release are powerful, yet you can cut yourself.
• HA services are essential.
• CF is ready to be used in production.
Questions?
Thank you!