Moving a Running OpenStack Cloud to a New Data Center
Introduction
Matt Fischer
– [email protected]
– IRC: mfisch
– Twitter: @openmfisch
Craig DeLatte
– [email protected]
– IRC: cdelatte
Background
• OpenStack in two national data centers
• Hundreds of nodes per data center
• Running lots of business-critical applications
Why Move?
• Data centers physically out of space for expansion
• Separation of environments for performance and robustness
• Allow us to control our own full hardware stack
• Redesign the network layout
Mission Impossible?
• We aren’t allowed in the data centers
• We weren’t allowed to make network switch changes
• We don’t use the corporate change management system
• We don’t set schedules or priorities for other groups
• Customer VMs are pets
First Technical Planning Session
“You want to do what?” “You mean like physically move the boxes?”
Hardware Plan
• What do we “need” to accomplish:
– Network layout change
– Upgrade firmware across the board
• What do we “want” to accomplish:
– Burn-in testing to eliminate hardware issues
– Fix server hardware layout
Physical Node Move Steps
• After the node has been cleared of running services:
– Don’t forget to wipe your boot drive
– Physically move the server
– Swap NICs
– Re-cable
– Upgrade firmware
• Re-IP / update MAC
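A minimal sketch of the “clear and wipe” step, assuming the boot drive is /dev/sda and the node has already been drained of services (the device name and commands are illustrative, not the team’s actual tooling):

# Run on the node once it has been cleared of running services.
# ASSUMPTION: /dev/sda is the boot drive; adjust for your hardware.
BOOT_DRIVE=/dev/sda

# Wipe filesystem/RAID signatures and the partition table so the node
# cannot come back up on its old OS after the physical move.
wipefs --all "$BOOT_DRIVE"
sgdisk --zap-all "$BOOT_DRIVE"

# Power off for the physical move, NIC swap, re-cabling, and firmware upgrade.
poweroff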
Hardware Hurdles
• Standing up infra is hard; things to consider:
– Firmware upgrades
– Hardware config
– Burn-in testing
“Infrastructure automation is a prerequisite for a project like this”
We Designed For Automation
• Full node server build automation with PXE/Cobbler/Puppet
• Hardware load balancer automation with Ansible
• Network switch automation with Ansible
• API quiescing with xinetd
• Guest VM live-migration
• Tooling to manage and move virtual routers
Load Balancers
• Software load balancers (haproxy) already managed by Puppet
• Hardware load balancer without automation
– Config done by hand
– Validation is looking for a green dot in a GUI
• We automated the A10 deploy and post-deploy validation with Ansible
Switches
• Switch config before automation
– Required approval from three teams
– Configs pasted in from a wiki
– Could take days or weeks
• Automated Juniper switch deployment
– Done with Ansible + Jenkins
– Follows the code review process
– Network Engineering team using Gerrit!
Caveats
• Some expensive pieces of hardware lack full API support or documentation
• Ansible or Puppet support may be missing
• You may be the first one to ask your vendor rep about their automation story
“This should be no more disruptive to APIs than a normal weekly deploy or to guests than a live-migration”
General Node Move Process
• Drop DNS TTL
• Evacuate node / quiesce traffic
• Wipe drive
• Power off box
• Physically move the node*
• Update DNS record
• Build box with PXE
• Test new node
• Update load balancers / nova / ceph config
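For the DNS steps, one way to confirm the record and its TTL before and after the change (api.twc.net and the answer shown are placeholders taken from the load balancer diagrams later in the deck; ns1.example.com is a stand-in for your authoritative server):

# Check the current address and TTL (second field) for the API endpoint.
dig +noall +answer api.twc.net
# e.g.  api.twc.net.  60  IN  A  1.2.3.4

# Query the authoritative server directly to confirm the new record is
# published before waiting out the old TTL on caching resolvers.
dig +noall +answer api.twc.net @ns1.example.com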
Traffic Quiescing
• API services have a special health check port
– Utilizes xinetd and socat
– Used by haproxy and the A10
• Dropping a file into place marks the node as disabled, but doesn’t interrupt active connections
• This also works for internal services like MySQL and RabbitMQ
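A rough sketch of the quiescing pattern, assuming xinetd runs a small check script on a dedicated port (the port, path, and flag-file name below are assumptions; the production setup also involved socat):

#!/bin/bash
# /usr/local/bin/api-healthcheck - run by xinetd for each connection on
# the check port (e.g. 9200). haproxy ("check port 9200" on the server
# line) and the A10 poll it and stop sending NEW connections once it
# returns 503, while established connections are left to finish.
DISABLE_FLAG=/etc/openstack/disabled   # touch this file to quiesce the node

if [ -f "$DISABLE_FLAG" ]; then
    printf 'HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n'
else
    printf 'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n'
fi

The same check-port idea covers the MySQL back end (the Percona clustercheck script uses this xinetd pattern) and RabbitMQ.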
Ordering - First Production Move
Ordering - Second Production Move
Puppet Master / Build Server
• Puppet master moved via “brain transplant”
• Automated this process with Ansible
• The puppet master also handles PXE boot via Cobbler
– First box to be moved
• Wanted to avoid inter-DC PXE booting, but had emergency procedures
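The deck doesn’t show the rebuild commands; a hedged sketch of re-provisioning a node through the local build server with Cobbler plus IPMI (system name, IPMI host, and credentials are placeholders):

# Flag the node for reinstall on its next PXE boot.
cobbler system edit --name=compute-01 --netboot-enabled=true
cobbler sync

# Force a PXE boot so the node rebuilds from the build server in its own
# DC rather than PXE booting across the WAN to the old data center.
ipmitool -H compute-01-ipmi -U admin -P "$IPMI_PASS" chassis bootdev pxe
ipmitool -H compute-01-ipmi -U admin -P "$IPMI_PASS" power cycle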
Load Balancer + VIP Move
[Diagram: starting state. api.twc.net resolves to VIP 1.2.3.4 on an haproxy pair (active + backup node) in the old DC; API calls flow through it to the API services. The new DC is empty.]
Load Balancer: Move Node + Test
[Diagram: the backup haproxy node has been rebuilt in the new DC behind a new VIP, 5.6.7.8. api.twc.net still resolves to 1.2.3.4, so production API calls stay in the old DC while test API calls are sent against the new VIP.]
Load Balancer: Move DNS & Wait
[Diagram: api.twc.net is repointed to the new VIP 5.6.7.8. New API calls land on the haproxy in the new DC while already-running API connections drain off the old VIP 1.2.3.4.]
Load Balancer: Final State
[Diagram: final state. api.twc.net resolves to VIP 5.6.7.8 on the haproxy pair (active + backup node) in the new DC; all API calls flow to the API services there. The old DC is empty.]
Keystone
• Quiesce traffic & wait for connections to drop
• Stop services
• Power off box
• Rebuild new box in new DC
• Test new box before adding to API cluster
• No impact - done during the day
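One way to do the “quiesce & wait” step, assuming Keystone listens on the usual 5000/35357 ports and reusing the assumed flag file from the quiescing sketch above:

# Mark the node disabled so haproxy/the A10 stop sending new connections.
touch /etc/openstack/disabled

# Wait for established client connections to Keystone to drain.
while [ "$(ss -tn state established '( sport = :5000 or sport = :35357 )' | tail -n +2 | wc -l)" -gt 0 ]; do
    sleep 10
done

# Now it is safe to stop services and power the box off.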
Control - Routers
• Router moves are the most customer-impacting part of this process
• Some customers have a lot of FIPs per router
• Evacuating all routers at once is a bad idea, although that’s what we did
• Moved all routers before stopping OpenStack services
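The router tooling itself isn’t shown in the deck, but the underlying operation is rescheduling a router from an L3 agent in the old DC to one in the new DC; with the classic neutron CLI that looks roughly like this (IDs are placeholders):

# Which L3 agent hosts this router right now?
neutron l3-agent-list-hosting-router "$ROUTER_ID"

# Move it. Traffic for the router's FIPs blips during the move, which is
# why this was the most customer-impacting step.
neutron l3-agent-router-remove "$OLD_AGENT_ID" "$ROUTER_ID"
neutron l3-agent-router-add "$NEW_AGENT_ID" "$ROUTER_ID"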
Control - API Services + RabbitMQ
• Quiesce connections on this node
• But what about connections to RabbitMQ?
– Stop OpenStack on this node
– Restart OpenStack on the other control nodes
– Restart nova-compute on the other nodes
• Stop RabbitMQ / stop MySQL
• Power down
• Rebuild, test, add to API cluster
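A hedged sketch of the RabbitMQ portion of this step (the Ubuntu-style service names are an assumption):

# On the control node being moved: stop the OpenStack services so their
# RabbitMQ and MySQL connections go away cleanly.
for svc in nova-api nova-conductor nova-scheduler neutron-server; do
    systemctl stop "$svc"
done

# Confirm nothing important is still attached to the local rabbit node.
rabbitmqctl list_connections user peer_host state

# On the other control and compute nodes: restart the OpenStack services
# and nova-compute so they reconnect to the surviving RabbitMQ cluster
# members instead of the node about to disappear.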
Compute
• Basic plan: build, evac, move. Repeat.
• Ansible tooling for live migration
– Canary VM with ping check
• But… live-migration is not guaranteed to work
– Limit your parallelization
– Bigger and busier VMs may never live-migrate
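A minimal sketch of the canary check around live migration (the real tooling was Ansible; the UUID, floating IP, and target host are placeholders):

# Ping the canary VM's floating IP in the background; dropped packets
# during the migration show up in the log.
ping "$CANARY_FIP" > /tmp/canary-ping.log 2>&1 &
PING_PID=$!

# Live-migrate the canary off the node being evacuated.
nova live-migration "$CANARY_UUID" "$TARGET_HOST"

# Wait until nova reports the canary on the target host.
until nova show "$CANARY_UUID" | grep "OS-EXT-SRV-ATTR:host" | grep -q "$TARGET_HOST"; do
    sleep 5
done

kill "$PING_PID"
grep 'bytes from' /tmp/canary-ping.log | tail -n 5   # spot-check for gaps in icmp_seq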
“..uh what about our Petabytes of data?”
Swift
• Power off, move, then rebuild node
– Leave data drives alone during rebuild; only incrementally migrate data
• Add in nodes to accept data
– Ensure all routes are in place to the new networks
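Adding rebuilt nodes so they can accept data is standard ring-builder work; a hedged sketch (region/zone, IP, device, weight, and port are placeholders and depend on your Swift deployment):

# On the ring master: add the new node's data device, then rebalance so
# data migrates incrementally rather than all at once.
swift-ring-builder object.builder add r1z2-10.20.0.15:6200/sdb 100
swift-ring-builder object.builder rebalance

# Distribute the updated ring files to all proxy and storage nodes and
# let replication move the data over the new routes.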
Cephmon
• Attempts to virtualize cephmon IPs failed
– Alerts that it may be a security breach
• Multiple steps to get the right cephmon IPs into instances:
– Instance boot drive: nova stop/start or nova resize
– Attached volumes: live-migration
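Reading between the lines of this slide, both paths boil down to making nova regenerate the instance’s Ceph connection info so it points at the new monitor IPs; roughly (the UUID is a placeholder):

# rbd-backed boot drive: a stop/start (or a resize) rebuilds the guest
# definition with the current cephmon addresses.
nova stop "$VM_UUID" && nova start "$VM_UUID"

# Attached volumes: a live-migration re-establishes the volume connection
# on the destination host without rebooting the guest.
nova live-migration "$VM_UUID"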
Ceph OSDs
• Bring up some new nodes
• Add new OSDs to the crushmap
• Data migrated to new nodes
• Remove old OSDs from the crushmap
• Power off some old nodes
• Physically move some old nodes to the new DC
• Repeat...
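In command form, the add/drain/remove loop looks roughly like this (OSD IDs, weight, and host name are placeholders):

# New node is racked and its OSDs are created; weight them into the
# crushmap so data starts migrating to the new DC.
ceph osd crush add osd.42 1.0 host=newdc-storage-01

# Watch recovery until the cluster returns to HEALTH_OK.
ceph -s

# Drain an old OSD (marking it "out" migrates its data away), then
# remove it from the crushmap and the cluster once recovery finishes.
ceph osd out 17
ceph osd crush remove osd.17
ceph auth del osd.17
ceph osd rm 17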
Data Migrations
Can you spot the issue?
Watch Your Bottlenecks
“I can promise you there will be problems.”
Issues
• Networking
– ACLs
– Incorrect cabling
– Bottlenecks
• Software
– VTEP address overlap
– keepalived VIPs
Issues (cont.)
• Vendors
– Bugs, bugs, bugs…
• Deployment process
– Running different levels of deployments until the move was complete
• Customers
– Gaining customer buy-in is a chess match
Delays
• Vendors
– Found multiple issues with PXE booting, VLANs, and LACP
• Space
– Needed to build a new data hall
– Not to mention a new data center
Customer Issues
• Actual
– VTEP overlap
– Oops, we upgraded OVS
– File descriptor limits on qemu processes
• Perceived
– High latency reports
• App owners released a new campaign targeted at millions of customers
“If you are going to do this...”
If You’re Going to Do This...
• Our cloud has lots of interdependencies; tracking these was key
– Caching DNS on load balancers
• Which things in your system are still configured using IP addresses?
– Galera, ceph, haproxy
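One cheap way to answer the “configured by IP address” question is to grep the deployed configs (or the config management tree) for literal addresses; the paths below are only examples:

# Find hard-coded IPv4 addresses in the configs most likely to break when
# the addressing plan changes (galera, ceph, haproxy, OpenStack services).
grep -rEn '([0-9]{1,3}\.){3}[0-9]{1,3}' \
    /etc/haproxy /etc/mysql /etc/ceph /etc/nova /etc/neutron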
If You’re Going to Do This...
• What resources are protected by VLAN-specific ACLs in your company?
– DNS, LDAP/AD
• Do you have maintenance plans and automation for each of your nodes?
If You’re Going to Do This...
• Communicate with customers, but don’t over-communicate
– Most get nervous if they know too much
• Don’t get overly aggressive with your timeline
• Practice, practice, practice
– Production was our 3rd time, not our 1st
– We made improvements to the process every time
Summary