OpenStack Operations Quick Ramp-up and Survival Guide · 2019-02-26
TRANSCRIPT
OpenStack Operations Quick Ramp-up and Survival Guide
Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan
Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe
Joshua Guan, Operations Lead, IBM Bluemix Private Cloud
Fan He, Cloud Architect, IBM Bluemix Private Cloud
A Little Bit of Background …
• Bluemix Private Cloud is IBM’s private cloud as a service, based on OpenStack
• Bluemix Private Cloud landed in China to support IBM’s Cloud business there
• We were building an OpenStack Operations Team from scratch
Agenda
• Define an OpenStack Operations Team
  • Operating Model
  • Processes
  • Tooling
  • Teaming
• Tooling Integration
• Cliché: OpenStack upgrade, HA, Live Migration
Operating OpenStack is like …
You thought you would work like this
And, Welcome to the real world
Define an OpenStack Operations Team
Operating Model
• How the cloud services are offered
• What the SLA is
• Collaboration with Business Partners, Data Centers, backend teams, etc.
Processes
• Operation Tiers
• Escalation Levels
• Incident Management
• Change Management
• Shifts
• Onboard & Offboard
• …
Tooling
• Monitoring
• Collaboration
• Cloud Management
• Knowledge Base
• Security
• Customer Support
Teaming
• Roles and Responsibilities
• Shift Model
Operating Model
[Diagram. Entities: Data Center, Service Level Agreement, Business Partner, Development Team, OpenStack Service Offering, Customers, OpenStack Operations, Support Entry Points. Relationship labels: use, consume, complies, operates, route, collaborate/escalate.]
Processes
• Operation Tiers
• Escalation Flows
• Incident Management
• Change Management
• Shifts
• Security
Processes: Operation Tiers (roles and responsibilities)
Tier  Role                   Responsibilities
1     Support                First line of defense
2     Operations             Deploy, upgrade, admin
3     OpenStack Engineering  Build the product
3     Network Engineering    Undercloud networks
Processes: Escalation Flows
• How tickets, alerts and incidents move between the different tiers
[Diagram: customer, Tier 1, Tier 2, then three Tier 3 teams]
Processes: Incident Management
Definition                 Example
Priority level             P0, P1, P2
Incident definition        OpenStack node failure, data center network interruption
Management activities      RFO, Outage Track
Response time              Immediate, 15 min, 1 hr
Update interval            Every 30 min
Communication method       Customer ticket, email, statuspage.io
Escalation to leadership   1 hr
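The incident table above can be turned into a small deadline calculator. This is only a sketch: the slide lists the priorities (P0, P1, P2) and the response times (immediate, 15 min, 1 hr) in the same order, and pairing them one-to-one is an assumption, as are all names in the code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed pairing of priority to response time (same order as on the slide).
RESPONSE_TIME = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}
UPDATE_INTERVAL = timedelta(minutes=30)     # "Update interval: every 30 min"
LEADERSHIP_ESCALATION = timedelta(hours=1)  # "Escalation to leadership: 1 hr"


@dataclass
class IncidentDeadlines:
    respond_by: datetime
    next_update_by: datetime
    escalate_to_leadership_at: datetime


def deadlines(priority: str, opened_at: datetime) -> IncidentDeadlines:
    """Compute the timestamps an operator has to meet for one incident."""
    return IncidentDeadlines(
        respond_by=opened_at + RESPONSE_TIME[priority],
        next_update_by=opened_at + UPDATE_INTERVAL,
        escalate_to_leadership_at=opened_at + LEADERSHIP_ESCALATION,
    )
```

A ticketing or alerting tool could call `deadlines()` when an incident is opened and schedule reminders from the result.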
Processes: Change Management
• Different types of changes
• How the change will be rolled out
• When the change will be rolled out
• Review and approval
• Customer communication
Processes: Shifts
[Diagram: shift timeline. Along the Time axis, staff rotate through three states: at-work, on-call primary, on-call secondary.]
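One way to read the shift diagram is as a simple rotation of the three states. The sketch below assumes exactly three operators (or groups) and one role change per shift; the slide does not state the actual rotation rule, and the function and team names are hypothetical.

```python
from collections import deque

ROLES = ("at-work", "on-call primary", "on-call secondary")


def rotation(operators, shifts):
    """Yield one {operator: role} mapping per shift, rotating roles each shift."""
    if len(operators) != len(ROLES):
        raise ValueError("this sketch assumes one operator (or group) per role")
    roles = deque(ROLES)
    for _ in range(shifts):
        yield dict(zip(operators, roles))
        roles.rotate(1)  # each operator takes a different role in the next shift


for shift_number, assignment in enumerate(rotation(["team-a", "team-b", "team-c"], 3)):
    print(shift_number, assignment)
```

Every shift still has exactly one holder per role, which is what keeps the 24x7 coverage intact while the load moves around.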
Processes: Security
• Security Compliance Activities
• Health Check
• Patch Reporting
• Vulnerability Scanning
• Continuous Business Need
Tooling
OpenStack Operations
• Monitoring
• Collaboration
• Cloud Management
• Knowledge Base
• Security
• Customer Support
Tooling: Monitoring
• Monitoring
• Alerting
• Log Aggregation
• Dashboard
Tooling: Collaboration
• Chat
• File Sharing
• Project Kanban
• Shift Management
Tooling: Cloud Management
• CMDB
• Asset Management
• Change Management
• Incident Management
Tooling: Knowledge Base
• Internal Wiki/Runbooks
• Product Documents for Customers
Tooling: Security
• Access Management
• Security Compliance Management
• Health Checking
• Patch Reporting
• Vulnerability Scanning
Tooling: Customer Support
• Ticketing System
• Customer Chat
• Customer Satisfaction
• Cloud Level Maintenance Communication
• Site Level Maintenance Communication
Teaming
Service Level Agreement
Service Availability
Shift Model
Teaming
• 24x7 Availability
• Spread the pain
• Eliminate interruptions as far as possible
[Diagram: shift timeline. Along the Time axis, staff rotate through at-work, on-call primary and on-call secondary.]
[Diagram: Operators on shift (at-work), Triage, and three SME on-call rotations (SME On-call 1, 2 and 3), each with a primary and a secondary]
Tooling Integration
• A lot of screens to watch
• A lot of systems to work on
• A lot of interruptions
• Use your tools to “kill” them
Tooling Integration
As a good start: Kill “context switch” – work on a single platform
Tooling Integration
What’s next: Kill “all interruptions” – workflow automation across platforms
Cliché – Where BOOOOOM Happens
• Implementations & Operations: Change management
• The Practices of Upgrade
• The Story of HA
• The Myth of Live Migration
Change management
• “Infrastructure as Code”
• Incoming change requests
  • Customer initiated requirements
  • Internal enhancements roll out
  • Compliance
• Change planning for Consistency
  • Priorities
  • Dependencies
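Planning changes by priorities and dependencies can be sketched with a topological sort. The example below is an illustration only: the change ids, the priority scale (lower number first) and the `plan_changes` helper are hypothetical, and it assumes every dependency is itself listed as a change. It uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def plan_changes(changes):
    """Order changes so that dependencies are rolled out first.

    `changes` maps a change id to (priority, set_of_dependency_ids).
    Within each batch of changes whose dependencies are already done,
    the change with the lower priority number goes first.
    """
    sorter = TopologicalSorter({cid: deps for cid, (_, deps) in changes.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda cid: changes[cid][0])
        for cid in ready:
            order.append(cid)
            sorter.done(cid)
    return order


example = {
    "security-patch": (0, set()),
    "service-upgrade": (1, {"security-patch"}),
    "customer-request": (2, set()),
    "enable-feature": (1, {"service-upgrade"}),
}
print(plan_changes(example))
```

Rejecting circular dependencies at planning time is the main benefit: an inconsistent plan fails before anything is rolled out.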
OpenStack Upgrade
• Prerequisites: deployment automation
  • Consistency – cloud configurations in CMDB
  • Idempotency – code to run OpenStack upgrade
• Upgrade process design
• Upgrade orchestration
• Repeatable success & minimum disruption
Reference: Upgrading OpenStack: A Best Practices Guide
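The idempotency prerequisite means that running the upgrade code twice leaves the cloud in the same state as running it once. A minimal sketch of that property follows, with plain dictionaries standing in for real nodes and for the desired configuration held in a CMDB; the helper names and version labels are hypothetical.

```python
def ensure_version(node: dict, package: str, target: str) -> bool:
    """Bring `package` on `node` to `target`; return True only if something changed."""
    if node.get(package) == target:
        return False        # already converged: re-running is a no-op
    node[package] = target  # stand-in for the real install or upgrade action
    return True


def upgrade(nodes: dict, desired: dict) -> list:
    """Apply the desired versions (for example, read from a CMDB) to every node."""
    changed = []
    for name, node in sorted(nodes.items()):
        for package, target in desired.items():
            if ensure_version(node, package, target):
                changed.append((name, package))
    return changed


nodes = {"controller-1": {"nova": "old"}, "compute-1": {"nova": "new"}}
first_run = upgrade(nodes, {"nova": "new"})
second_run = upgrade(nodes, {"nova": "new"})
```

Here `first_run` reports only the node that actually needed the change, and `second_run` is empty, which is what makes a failed upgrade safe to retry.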
Let’s talk about High Availability …
• Architecture decisions for HA
  • Eliminate SPOF; Non-disruptive upgrade; Load Balancing; …
  • Inherent availability = MTTF / (MTTF + MTTR)
• HA’s “dark side” for cloud operations
  • Recovery with HA resetting
  • Complexity’s impact on recovery time
• Mitigation plan
  • Built-in monitoring for HA mechanism
  • Recovery automation
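The availability formula above shows why recovery time matters as much as failure rate. A short worked sketch follows; the MTTF and MTTR figures are illustrative inputs, not measurements from any real cloud.

```python
HOURS_PER_YEAR = 24 * 365


def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Inherent availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected unavailable hours in one year for the given MTTF and MTTR."""
    return (1 - inherent_availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Same failure rate, but a more complex HA setup that takes four times
# longer to recover (illustrative numbers only).
simple = inherent_availability(mttf_hours=999, mttr_hours=1)
complex_setup = inherent_availability(mttf_hours=999, mttr_hours=4)
```

With the failure rate held constant, the longer recovery time alone lowers availability, which is the "dark side" the slide refers to and the reason recovery automation is listed as a mitigation.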
Live Migration?
• Does “nova live-migrate” work?
• Manage customer expectations
• Abuse prevention
• Limited appropriate scenarios
• Automation with caution
• Integration with pre- and post-verification routines
Reference: Live Migration is a Perk, not a Panacea
@kiwik http://kiwik.github.io/openstack/2015/05/23/Nova-Live-Migration-Workflow/
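"Automation with caution" and the pre- and post-verification routines can be sketched as a guard around the migration call. Everything here is hypothetical: the checks and the `migrate` callable are injected by the caller, and no real Nova API is invoked.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]


def guarded_migration(
    instance_id: str,
    migrate: Callable[[str], None],
    pre_checks: Iterable[Check],
    post_checks: Iterable[Check],
) -> str:
    """Run `migrate` only if all pre-checks pass, then verify the result."""
    if not all(check(instance_id) for check in pre_checks):
        return "skipped"          # not an appropriate scenario: do nothing
    migrate(instance_id)
    if not all(check(instance_id) for check in post_checks):
        return "needs-attention"  # hand over to an operator instead of retrying
    return "ok"
```

Returning a status instead of retrying automatically keeps a human in the loop whenever verification fails.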
11:25 Kickoff with Todd Moore, IBM Vice President, Open Technology
11:30 OpenStack for Beginners: Shamail Tahir • Tyler Britten
12:15 The Open Cloud: A Platform of Possibilities: Jesse Proudman • Azmir Mohamed
2:15 Don’t Just Take Our Word for It: Use Cases from Materna & AT&T: Armin von Dolenga (Materna) • Jacob Caspi (AT&T)
3:05 Part 1 - Designing Effective Microservices: Manuel Silveyra
3:55 Part 2 - Deploying Infrastructure Foundations: Shaun Murikami • Andrew Bodine
5:05 Part 3 - Delivering Application Microservices: Daniel Krook
5:55 Part 4 - Directing Deployments with DevOps: Megan Kostick • Michael Brewer • Manuel Silveyra
Microservices on the Open Cloud
Enterprise Perspectives
4:30 Join Brad Topol and the Interop Challenge Vendors for refreshments
The Open Cloud: Delivering Solutions with Choice, October 26th, CCIB Room 116
Thank You