openstack operations quick ramp-up and survival guide · 2019-02-26 · openstack operations quick...

Post on 13-Mar-2020


OpenStack Operations Quick Ramp-up and Survival Guide
Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan
Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe

Joshua Guan, Operations Lead, IBM Bluemix Private Cloud

Fan He, Cloud Architect, IBM Bluemix Private Cloud

A Little Bit of Background …
• Bluemix Private Cloud is IBM’s private cloud as a service, based on OpenStack
• Bluemix Private Cloud landed in China to support IBM’s Cloud business there
• We were building an OpenStack Operations Team from scratch

Agenda
• Define an OpenStack Operations Team
  • Operating Model
  • Processes
  • Tooling
  • Teaming
• Tooling Integration
• Cliché: OpenStack upgrade, HA, Live Migration

Operating OpenStack is like …

You thought you would work like this

And, Welcome to the real world

Define an OpenStack Operations Team

Operating Model
• How the cloud services are offered
• What is the SLA
• Collaboration with Business Partners, Data Centers and backend teams, etc.

Processes
• Operation Tiers
• Escalation Levels
• Incident Management
• Change Management
• Shifts
• Onboard & Offboard
• …

Tooling
• Monitoring
• Collaboration
• Cloud Management
• Knowledge Base
• Security
• Customer Support

Teaming
• Roles and Responsibilities
• Shift Model

Operating Model

[Diagram. Nodes: Data Center, Service Level Agreement, Business Partner, Development Team, OpenStack Service Offering, Customers, OpenStack Operations, Support Entry Points. Edge labels: use, consume, complies, operates, route, collaborate/escalate.]

Processes

Operation Tiers

Escalation Flows

Incident Management

Change Management

Shifts

Security

Processes: Operation Tiers
• Roles
• Responsibilities

Tier  Role                   Responsibilities
1     Support                First line of defense
2     Operations             Deploy, upgrade, admin
3     OpenStack Engineering  Build the product
3     Network Engineering    Undercloud networks

Processes: Escalation Flows
• How tickets/alerts/incidents go between different tiers

[Diagram labels: customer, Tier 1, Tier 2, and three Tier 3 boxes.]

Processes: Incident Management

Definition                  Example
Priority level              P0, P1, P2
Incident definition         OpenStack node failure, data center network interruption
Management activities       RFO, Outage Track
Response time               Immediate, 15min, 1hr
Update interval             Every 30min
Communication method        Customer ticket, email, statuspage.io
Escalation to leadership    1hr
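The timers in the incident table can be sketched as a small lookup. This is a minimal illustration, not the team's actual tooling: the pairing of P0/P1/P2 with Immediate/15min/1hr is an assumption based only on the order the slide lists them in, and the names `RESPONSE_TIME`, `IncidentDeadlines` and `deadlines` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: priorities and response times are paired in the order
# the slide lists them (P0 -> Immediate, P1 -> 15min, P2 -> 1hr).
RESPONSE_TIME = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}
UPDATE_INTERVAL = timedelta(minutes=30)      # "Every 30min"
LEADERSHIP_ESCALATION = timedelta(hours=1)   # "Escalation to leadership: 1hr"


@dataclass
class IncidentDeadlines:
    respond_by: datetime
    next_update_by: datetime
    escalate_to_leadership_at: datetime


def deadlines(priority: str, opened_at: datetime) -> IncidentDeadlines:
    """Compute the timers an operator has to meet for one incident."""
    return IncidentDeadlines(
        respond_by=opened_at + RESPONSE_TIME[priority],
        next_update_by=opened_at + UPDATE_INTERVAL,
        escalate_to_leadership_at=opened_at + LEADERSHIP_ESCALATION,
    )
```

Keeping the numbers in one table like this makes it easy to feed the same values to alerting and to the customer-facing status updates.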

Processes: Change Management
• Different types of changes
• How the change will be rolled out
• When the change will be rolled out
• Review and approval
• Customer communication

Processes: Shifts

[Diagram: rows of at-work, on-call primary, and on-call secondary blocks laid out along a Time axis.]

Processes: Security
• Security Compliance Activities
• Health Check
• Patch Reporting
• Vulnerability Scanning
• Continuous Business Need

Tooling

OpenStack Operations

Monitoring

Collaboration

Cloud Management

Knowledge Base

Security

Customer Support

Tooling: Monitoring
• Monitoring
• Alerting
• Log Aggregation
• Dashboard

Tooling: Collaboration
• Chat
• File Sharing
• Project Kanban
• Shift Management

Tooling: Cloud Management
• CMDB
• Asset Management
• Change Management
• Incident Management

Tooling: Knowledge Base
• Internal Wiki/Runbooks
• Product Documents for Customers

Tooling: Security
• Access Management
• Security Compliance Management
• Health Checking
• Patching Reporting
• Vulnerability Scanning

Tooling: Customer Support
• Ticketing System
• Customer Chat
• Customer Satisfaction
• Cloud Level Maintenance Communication
• Site Level Maintenance Communication

Teaming

Service Level Agreement

Service Availability

Shift Model

Teaming
• 24x7 Availability
• Spread the pain
• Eliminate interruptions as far as possible

[Diagram: rows of at-work, on-call primary, and on-call secondary blocks laid out along a Time axis.]
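One way to read the shift diagram is as a rotation in which each group takes a different role in each time slot. The sketch below assumes exactly three groups and a one-step rotation per slot; neither detail is stated on the slide, and the names `ROLES` and `rotation` are hypothetical.

```python
ROLES = ["at-work", "on-call primary", "on-call secondary"]


def rotation(teams, slots):
    """Yield one {team: role} mapping per time slot.

    Roles shift by one position each slot, so with three teams every
    role is covered exactly once per slot and the on-call load is shared.
    """
    n = len(ROLES)
    for slot in range(slots):
        yield {team: ROLES[(i + slot) % n] for i, team in enumerate(teams)}


if __name__ == "__main__":
    # Hypothetical team names, for illustration only.
    for slot, assignment in enumerate(rotation(["A", "B", "C"], 3)):
        print(slot, assignment)
```

With three teams, each slot has one team at work, one as primary on-call, and one as secondary, which is one way to get 24x7 coverage while spreading the pain.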

[Diagram: Operators on shift (at-work blocks) handle Triage; SME On-call 1, SME On-call 2, and SME On-call 3 each have a primary and a secondary.]

Tooling Integration
• A lot of screens to watch
• A lot of systems to work on
• A lot of interruptions
• Use your tools to “kill” them

Tooling Integration
As a good start: Kill “context switch” – work on a single platform


Tooling Integration
What’s next: Kill “all interruptions” – workflow automation across platforms
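Workflow automation across platforms can be pictured as one event fanning out to several tools without an operator copying data between them. The sketch below is a toy in-process router under that assumption; `Workflow`, `open_ticket` and `notify_chat` are hypothetical stand-ins and do not call any real monitoring, ticketing or chat API.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[dict], Any]


class Workflow:
    """Tiny event router: one event is passed to every registered handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> list:
        # Handlers run in registration order; their results are collected.
        return [handler(payload) for handler in self._handlers[event_type]]


# Hypothetical handlers standing in for ticketing and chat integrations.
def open_ticket(alert: dict) -> str:
    return f"ticket opened: {alert['summary']}"


def notify_chat(alert: dict) -> str:
    return f"chat notified: {alert['summary']}"


workflow = Workflow()
workflow.on("alert", open_ticket)
workflow.on("alert", notify_chat)
```

Calling `workflow.emit("alert", {"summary": "node down"})` runs both handlers from a single monitoring event.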

Cliché – Where BOOOOOM Happens
• Implementations & Operations: Change management
• The Practices of Upgrade
• The Story of HA
• The Myth of Live Migration

Change management
• “Infrastructure as Code”
• Incoming change requests
  • Customer initiated requirements
  • Internal enhancements roll out
  • Compliance
• Change planning for Consistency
  • Priorities
  • Dependencies
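Planning changes by priorities and dependencies can be sketched as a topological ordering. This is a minimal illustration using the standard-library `graphlib` module (Python 3.9+); the function `plan_changes` and the change ids in the example are hypothetical.

```python
from graphlib import TopologicalSorter


def plan_changes(changes):
    """Return change ids in a rollout order that respects dependencies.

    `changes` maps a change id to (priority, [ids it depends on]); every
    dependency must itself be a key of `changes`. Within each batch of
    changes that are ready at the same time, the lower priority number
    goes first.
    """
    sorter = TopologicalSorter({cid: deps for cid, (_, deps) in changes.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda cid: changes[cid][0])
        for cid in ready:
            order.append(cid)
            sorter.done(cid)
    return order


# Hypothetical change requests: one per source listed on the slide.
example = {
    "compliance-patch": (0, []),
    "customer-request": (1, ["compliance-patch"]),
    "internal-enhancement": (2, []),
}
```

Here `plan_changes(example)` places `compliance-patch` before `customer-request`, because the latter depends on it.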

OpenStack Upgrade
• Prerequisites: deployment automation
  • Consistency – cloud configurations in CMDB
  • Idempotency – code to run OpenStack upgrade
• Upgrade process design
• Upgrade orchestration
• Repeatable success & minimum disruption

Reference: Upgrading OpenStack: A Best Practices Guide
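Idempotency here means an upgrade step can be re-run safely: it changes the system only when the current state differs from the desired state. The sketch below shows the idea on an in-memory config dictionary; `ensure_option` and the section and option names are hypothetical, not real OpenStack settings.

```python
def ensure_option(config: dict, section: str, key: str, value: str) -> bool:
    """Make config[section][key] equal value; return True only if it changed."""
    current = config.setdefault(section, {})
    if current.get(key) == value:
        return False  # already in the desired state, nothing to do
    current[key] = value
    return True


if __name__ == "__main__":
    config = {}
    # The first run applies the change; the second run is a no-op.
    print(ensure_option(config, "example_section", "example_option", "on"))
    print(ensure_option(config, "example_section", "example_option", "on"))
```

Returning whether anything changed lets an orchestrator decide if a follow-up action, such as a service restart, is needed.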

Let’s talk about High Availability…
• Architecture decisions for HA
  • Eliminate SPOF; Non-disruptive upgrade; Load Balancing; …
  • Inherent availability = MTTF / (MTTF + MTTR)
• HA’s “dark side” for cloud operations
  • Recovery with HA resetting
  • Complexity’s impact on recovery time
• Mitigation plan
  • Built-in monitoring for HA mechanism
  • Recovery automation
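The availability formula above shows why recovery time matters as much as failure rate. The worked example below uses made-up numbers purely for illustration; `inherent_availability` is a hypothetical helper name.

```python
def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Inherent availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative numbers only: same failure rate, different recovery times.
    quick = inherent_availability(999.0, 1.0)   # recovery takes 1 hour
    slow = inherent_availability(999.0, 3.0)    # recovery takes 3 hours
    print(f"{quick:.4%} vs {slow:.4%}")
```

With the failure rate held constant, tripling the recovery time lowers availability, which is the cost that added HA complexity can impose if it slows recovery.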

Live Migration?
• Does “nova live-migrate” work?
• Manage customer expectations
• Abuse prevention
• Limited appropriate scenarios
• Automation with caution
• Integration with pre & post-verification routine

Reference: Live Migration is a Perk, not a Panacea
@kiwik http://kiwik.github.io/openstack/2015/05/23/Nova-Live-Migration-Workflow/
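"Automation with caution" and the pre and post verification routine can be sketched as a wrapper that migrates only when every pre-check passes and reports when a post-check fails. The migration itself and all checks are injected as callables, so nothing here calls a real Nova API; `migrate_with_verification` and the example checks are hypothetical.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]


def migrate_with_verification(
    server_id: str,
    pre_checks: Iterable[Check],
    migrate: Callable[[str], None],
    post_checks: Iterable[Check],
) -> str:
    """Run pre-checks, migrate, then run post-checks; return a status string."""
    failed = [check.__name__ for check in pre_checks if not check(server_id)]
    if failed:
        # Do not touch the instance if any precondition is not met.
        return "skipped: pre-check failed (" + ", ".join(failed) + ")"
    migrate(server_id)
    failed = [check.__name__ for check in post_checks if not check(server_id)]
    if failed:
        return "needs attention: post-check failed (" + ", ".join(failed) + ")"
    return "ok"


# Hypothetical checks, for illustration only.
def instance_is_active(server_id: str) -> bool:
    return True


def instance_responds(server_id: str) -> bool:
    return True
```

In a real deployment the `migrate` callable would wrap whatever migration command or API the operations team has approved, and a failed post-check would open an incident rather than be retried blindly.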

The Open Cloud: Delivering Solutions with Choice
October 26th, CCIB Room 116

11:25 Kickoff with Todd Moore, IBM Vice President, Open Technology
11:30 OpenStack for Beginners: Shamail Tahir • Tyler Britten
12:15 The Open Cloud: A Platform of Possibilities: Jesse Proudman • Azmir Mohamed

Enterprise Perspectives
2:15 Don’t Just Take Our Word for It: Use Cases from Materna & AT&T: Armin von Dolenga (Materna) • Jacob Caspi (AT&T)

Microservices on the Open Cloud
3:05 Part 1 - Designing Effective Microservices: Manuel Silveyra
3:55 Part 2 - Deploying Infrastructure Foundations: Shaun Murikami • Andrew Bodine
5:05 Part 3 - Delivering Application Microservices: Daniel Krook
5:55 Part 4 - Directing Deployments with DevOps: Megan Kostick • Michael Brewer • Manuel Silveyra

4:30 Join Brad Topol and the Interop Challenge Vendors for refreshments

Thank You
