Moving a Running OpenStack Cloud to a New Data Center
Introduction
Matt Fischer
– [email protected]
– IRC: mfisch
– Twitter: @openmfisch
Craig DeLatte
– [email protected]
– IRC: cdelatte
Background
• OpenStack in two national data centers
• Hundreds of nodes per data center
• Running lots of business-critical applications
Why Move?
• Data centers physically out of space for expansion
• Separation of environments for performance and robustness
• Allow us to control our own full hardware stack
• Redesign the network layout
Mission Impossible?
• We aren’t allowed in the data centers
• We weren’t allowed to make network switch changes
• We don’t use the corporate change management system
• We don’t set schedules or priorities for other groups
• Customer VMs are pets
First Technical Planning Session
“You want to do what?” “You mean like physically move the boxes?”
Hardware Plan
• What do we “need” to accomplish:
– Network layout change
– Upgrade firmware across the board
• What do we “want” to accomplish:
– Burn-in testing to eliminate hardware issues
– Fix server hardware layout
Physical Node Move Steps
• After the node has been cleared of running services:
– Don’t forget to wipe your boot drive
– Physically move the server
– Swap NICs
– Re-cable
– Upgrade firmware
• Re-IP / update MAC
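A minimal sketch of the “clear and wipe” step, assuming the boot drive is /dev/sda and the node has already been drained of services (the device name and commands are illustrative, not the team’s actual tooling):

# Run on the node once it has been cleared of running services.
# ASSUMPTION: /dev/sda is the boot drive; adjust for your hardware.
BOOT_DRIVE=/dev/sda

# Wipe filesystem/RAID signatures and the partition table so the node
# cannot come back up on its old OS after the physical move.
wipefs --all "$BOOT_DRIVE"
sgdisk --zap-all "$BOOT_DRIVE"

# Power off for the physical move, NIC swap, re-cabling, and firmware upgrade.
poweroff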
Hardware Hurdles
• Standing up infra is hard; things to consider:
– Firmware upgrades
– Hardware config
– Burn-in testing
“Infrastructure automation is a prerequisite for a project like this”
We Designed For Automation
• Full node server build automation with PXE/Cobbler/Puppet
• Hardware load balancer automation with Ansible
• Network switch automation with Ansible
• API quiescing with xinetd
• Guest VM live-migration
• Tooling to manage and move virtual routers
Load Balancers
• Software load balancers (haproxy) already managed by Puppet
• Hardware load balancer without automation
– Config done by hand
– Validation is looking for a green dot in a GUI
• We automated the A10 deploy and post-deploy validation with Ansible
Switches
• Switch config before automation
– Required approval from three teams
– Configs pasted in from a wiki
– Could take days or weeks
• Automated Juniper switch deployment
– Done with Ansible + Jenkins
– Follows the code review process
– Network Engineering team using Gerrit!
Caveats
• Some expensive pieces of hardware lack full API support or documentation
• Ansible or Puppet support may be missing
• You may be the first one to ask your vendor rep about their automation story
“This should be no more disruptive to APIs than a normal weekly deploy or to guests than a live-migration”
General Node Move Process
• Drop DNS TTL
• Evacuate node / quiesce traffic
• Wipe drive
• Power off box
• Physically move the node*
• Update DNS record
• Build box with PXE
• Test new node
• Update load balancers / nova / ceph config
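For the DNS steps, one way to confirm the record and its TTL before and after the change (api.twc.net and the answer shown are placeholders taken from the load balancer diagrams later in the deck; ns1.example.com is a stand-in for your authoritative server):

# Check the current address and TTL (second field) for the API endpoint.
dig +noall +answer api.twc.net
# e.g.  api.twc.net.  60  IN  A  1.2.3.4

# Query the authoritative server directly to confirm the new record is
# published before waiting out the old TTL on caching resolvers.
dig +noall +answer api.twc.net @ns1.example.com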
Traffic Quiescing
• API services have a special health check port
– Utilizes xinetd and socat
– Used by haproxy and the A10
• Dropping a file into place marks the node as disabled, but doesn’t interrupt active connections
• This also works for internal services like MySQL and RabbitMQ
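A rough sketch of the quiescing pattern, assuming xinetd runs a small check script on a dedicated port (the port, path, and flag-file name below are assumptions; the production setup also involved socat):

#!/bin/bash
# /usr/local/bin/api-healthcheck - run by xinetd for each connection on
# the check port (e.g. 9200). haproxy ("check port 9200" on the server
# line) and the A10 poll it and stop sending NEW connections once it
# returns 503, while established connections are left to finish.
DISABLE_FLAG=/etc/openstack/disabled   # touch this file to quiesce the node

if [ -f "$DISABLE_FLAG" ]; then
    printf 'HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n'
else
    printf 'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n'
fi

The same check-port idea covers the MySQL back end (the Percona clustercheck script uses this xinetd pattern) and RabbitMQ.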
Ordering - First Production Move
Ordering - Second Production Move
Puppet Master / Build Server
• Puppet master moved via “brain transplant”
• Automated this process with Ansible
• The puppet master also handles PXE boot via Cobbler
– First box to be moved
• Wanted to avoid inter-DC PXE booting, but had emergency procedures
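The deck doesn’t show the rebuild commands; a hedged sketch of re-provisioning a node through the local build server with Cobbler plus IPMI (system name, IPMI host, and credentials are placeholders):

# Flag the node for reinstall on its next PXE boot.
cobbler system edit --name=compute-01 --netboot-enabled=true
cobbler sync

# Force a PXE boot so the node rebuilds from the build server in its own
# DC rather than PXE booting across the WAN to the old data center.
ipmitool -H compute-01-ipmi -U admin -P "$IPMI_PASS" chassis bootdev pxe
ipmitool -H compute-01-ipmi -U admin -P "$IPMI_PASS" power cycle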
Load Balancer + VIP Move
[Diagram: starting state. api.twc.net resolves to VIP 1.2.3.4 on an haproxy pair (active + backup node) in the old DC; API calls flow through it to the API services. The new DC is empty.]
Load Balancer: Move Node + Test
[Diagram: the backup haproxy node has been rebuilt in the new DC behind a new VIP, 5.6.7.8. api.twc.net still resolves to 1.2.3.4, so production API calls stay in the old DC while test API calls are sent against the new VIP.]
Load Balancer: Move DNS & Wait
[Diagram: api.twc.net is repointed to the new VIP 5.6.7.8. New API calls land on the haproxy in the new DC while already-running API connections drain off the old VIP 1.2.3.4.]
Load Balancer: Final State
[Diagram: final state. api.twc.net resolves to VIP 5.6.7.8 on the haproxy pair (active + backup node) in the new DC; all API calls flow to the API services there. The old DC is empty.]
Keystone
• Quiesce traffic & wait for connections to drop
• Stop services
• Power off box
• Rebuild new box in new DC
• Test new box before adding to API cluster
• No impact - done during the day
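One way to do the “quiesce & wait” step, assuming Keystone listens on the usual 5000/35357 ports and reusing the assumed flag file from the quiescing sketch above:

# Mark the node disabled so haproxy/the A10 stop sending new connections.
touch /etc/openstack/disabled

# Wait for established client connections to Keystone to drain.
while [ "$(ss -tn state established '( sport = :5000 or sport = :35357 )' | tail -n +2 | wc -l)" -gt 0 ]; do
    sleep 10
done

# Now it is safe to stop services and power the box off.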
Control - Routers
• Router moves are the most customer-impacting part of this process
• Some customers have a lot of FIPs per router
• Evacuating all routers at once is a bad idea, although that’s what we did
• Moved all routers before stopping OpenStack services
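The router tooling itself isn’t shown in the deck, but the underlying operation is rescheduling a router from an L3 agent in the old DC to one in the new DC; with the classic neutron CLI that looks roughly like this (IDs are placeholders):

# Which L3 agent hosts this router right now?
neutron l3-agent-list-hosting-router "$ROUTER_ID"

# Move it. Traffic for the router's FIPs blips during the move, which is
# why this was the most customer-impacting step.
neutron l3-agent-router-remove "$OLD_AGENT_ID" "$ROUTER_ID"
neutron l3-agent-router-add "$NEW_AGENT_ID" "$ROUTER_ID"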
Control - API Services + RabbitMQ
• Quiesce connections on this node
• But what about connections to RabbitMQ?
– Stop OpenStack on this node
– Restart OpenStack on the other control nodes
– Restart nova-compute on the other nodes
• Stop RabbitMQ / stop MySQL
• Power down
• Rebuild, test, add to API cluster
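A hedged sketch of the RabbitMQ portion of this step (the Ubuntu-style service names are an assumption):

# On the control node being moved: stop the OpenStack services so their
# RabbitMQ and MySQL connections go away cleanly.
for svc in nova-api nova-conductor nova-scheduler neutron-server; do
    systemctl stop "$svc"
done

# Confirm nothing important is still attached to the local rabbit node.
rabbitmqctl list_connections user peer_host state

# On the other control and compute nodes: restart the OpenStack services
# and nova-compute so they reconnect to the surviving RabbitMQ cluster
# members instead of the node about to disappear.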
Compute
• Basic plan: build, evac, move. Repeat.
• Ansible tooling for live migration
– Canary VM with ping check
• But… live-migration is not guaranteed to work
– Limit your parallelization
– Bigger and busier VMs may never live-migrate
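A minimal sketch of the canary check around live migration (the real tooling was Ansible; the UUID, floating IP, and target host are placeholders):

# Ping the canary VM's floating IP in the background; dropped packets
# during the migration show up in the log.
ping "$CANARY_FIP" > /tmp/canary-ping.log 2>&1 &
PING_PID=$!

# Live-migrate the canary off the node being evacuated.
nova live-migration "$CANARY_UUID" "$TARGET_HOST"

# Wait until nova reports the canary on the target host.
until nova show "$CANARY_UUID" | grep "OS-EXT-SRV-ATTR:host" | grep -q "$TARGET_HOST"; do
    sleep 5
done

kill "$PING_PID"
grep 'bytes from' /tmp/canary-ping.log | tail -n 5   # spot-check for gaps in icmp_seq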
“..uh what about our Petabytes of data?”
Swift
• Power off, move, then rebuild node
– Leave data drives alone during rebuild; only incrementally migrate data
• Add in nodes to accept data
– Ensure all routes are in place to the new networks
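Adding rebuilt nodes so they can accept data is standard ring-builder work; a hedged sketch (region/zone, IP, device, weight, and port are placeholders and depend on your Swift deployment):

# On the ring master: add the new node's data device, then rebalance so
# data migrates incrementally rather than all at once.
swift-ring-builder object.builder add r1z2-10.20.0.15:6200/sdb 100
swift-ring-builder object.builder rebalance

# Distribute the updated ring files to all proxy and storage nodes and
# let replication move the data over the new routes.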
Cephmon
• Attempts to virtualize cephmon IPs failed
– Alerts that it may be a security breach
• Multiple steps to get the right cephmon IPs into instances:
– Instance boot drive: nova stop/start or nova resize
– Attached volumes: live-migration
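Reading between the lines of this slide, both paths boil down to making nova regenerate the instance’s Ceph connection info so it points at the new monitor IPs; roughly (the UUID is a placeholder):

# rbd-backed boot drive: a stop/start (or a resize) rebuilds the guest
# definition with the current cephmon addresses.
nova stop "$VM_UUID" && nova start "$VM_UUID"

# Attached volumes: a live-migration re-establishes the volume connection
# on the destination host without rebooting the guest.
nova live-migration "$VM_UUID"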
Ceph OSDs
• Bring up some new nodes
• Add new OSDs to the crushmap
• Data migrated to new nodes
• Remove old OSDs from the crushmap
• Power off some old nodes
• Physically move some old nodes to the new DC
• Repeat...
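In command form, the add/drain/remove loop looks roughly like this (OSD IDs, weight, and host name are placeholders):

# New node is racked and its OSDs are created; weight them into the
# crushmap so data starts migrating to the new DC.
ceph osd crush add osd.42 1.0 host=newdc-storage-01

# Watch recovery until the cluster returns to HEALTH_OK.
ceph -s

# Drain an old OSD (marking it "out" migrates its data away), then
# remove it from the crushmap and the cluster once recovery finishes.
ceph osd out 17
ceph osd crush remove osd.17
ceph auth del osd.17
ceph osd rm 17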
Data Migrations
Can you spot the issue?
Watch Your Bottlenecks
“I can promise you there will be problems.”
Issues
• Networking
– ACLs
– Incorrect cabling
– Bottlenecks
• Software
– VTEP address overlap
– keepalived VIPs
Issues (cont.)
• Vendors
– Bugs, bugs, bugs…
• Deployment process
– Running different levels of deployments until the move was complete
• Customers
– Gaining customer buy-in is a chess match
Delays
• Vendors
– Found multiple issues with PXE booting, VLANs, and LACP
• Space
– Needed to build a new data hall
– Not to mention a new data center
Customer Issues
• Actual
– VTEP overlap
– Oops, we upgraded OVS
– File descriptor limits on qemu processes
• Perceived
– High latency reports
• App owners released a new campaign targeted at millions of customers
“If you are going to do this...”
If You’re Going to Do This...
• Our cloud has lots of interdependencies; tracking these was key
– Caching DNS on load balancers
• Which things in your system are still configured using IP addresses?
– Galera, ceph, haproxy
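One cheap way to answer the “configured by IP address” question is to grep the deployed configs (or the config management tree) for literal addresses; the paths below are only examples:

# Find hard-coded IPv4 addresses in the configs most likely to break when
# the addressing plan changes (galera, ceph, haproxy, OpenStack services).
grep -rEn '([0-9]{1,3}\.){3}[0-9]{1,3}' \
    /etc/haproxy /etc/mysql /etc/ceph /etc/nova /etc/neutron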
If You’re Going to Do This...
• What resources are protected by VLAN-specific ACLs in your company?
– DNS, LDAP/AD
• Do you have maintenance plans and automation for each of your nodes?
If You’re Going to Do This...
• Communicate with customers, but don’t over-communicate
– Most get nervous if they know too much
• Don’t get overly aggressive with your timeline
• Practice, practice, practice
– Production was our 3rd time, not our 1st
– We made improvements to the process every time
Summary