DevopsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity Model

Chef’s Operations Maturity Model: Helping Horses Become Unicorns

Matt Ray, DevopsDays Austin, May 5, 2014

Introductions

• Matt Ray

• Director Partner Integration at Chef

• matt@getchef.com

• mattray GitHub|IRC|Twitter

“There’s nothing horses hate more than hearing stories about unicorns.”

John Arbuckle, Chief Architect at GE Capital, "Hunting the DevOps Whale in Large Enterprises", ChefConf 2014

http://pichost.me/1468004/

DevOps Unicorns

• Etsy

• Facebook

• Netflix

https://keepinghouseandhorse.files.wordpress.com/2013/10/photoshop3.jpeg

But… Enterprise

• Our applications are too complex

• Politics get in the way

• We’ve always done it this way

It’s Not Magic

• Not everyone requires Continuous Delivery

• They require:

• Higher reliability

• Greater visibility

• More resilience

• Faster response

https://img0.etsystatic.com/000/0/5209298/il_fullxfull.282855902.jpg

How Do We Get There?

The Map is not the Territory

• Comparative study of Operational Maturity Models

• On one end: ad-hoc, slow to respond, “traditional” approach

• At the other: very fast, fully automated, and disaster indifferent

• Figure out what is most important to your Organization

https://www.chimacumtack.com/images/measurehorse.jpg

Fitting the Model

• Varying degrees of adoption

• Operational trends often correlated and relational, but not definitive

• Roadmap for improving time to deployment and lowering time to recovery

• Understand the challenges, set real expectations for progress

http://www.web3dservice.com/3d_models/images/unicorn_3d_model_03.jpg

Roadmap Considerations

• Hardware Management

• OS Management

• Infrastructure Management

• Software Deployments

• Incident Management

• Disaster Recovery

http://cultofunicorn.com/wp-content/uploads/2013/05/Unicorn_horse.jpg

Hardware Management

Every Server is Sacred!

• HA Support expected across the entire stack

• Dependence on vendor/on-site SE for replacement/maintenance

• “This is the best hardware money can buy!”

• Architecture Review and Request Forms for all changes

• “Tier 1” data centers

• Every project is a special snowflake

1 SysAdmin to 25-250 systems?

Automate Common Tasks

Maybe not ALL servers are sacred…

• Start using some farms of standardized machines

• Fewer support contracts, less dependence on vendor/on-site support

• Architecture Reviews for new services with some implementation standardization

• HA support across most of the stack

• Probably still using “Tier 1” data centers with excess redundancy

1 Systems Engineer to 250-500 systems

Configuration Management

Most of these servers aren’t sacred?

• Limited support on ALL systems

• On-site support used sparingly, lower-skill onsite staff for “normal” failures

• Architecture Reviews only manage exceptions. Automated requests may be exposed via emerging APIs

• Wide adoption of virtualization: server instances are commoditized

• Hardware becoming standardized and easy to replace

• Smaller, more efficient data centers.

• Limited redundancy with hot/hot/hot N+1/N HA strategies

Application Management

1 Systems Engineer to 500-1000 Systems

None of the servers are sacred

• Infrastructure as a Service

• Hardware (if any) is fully commoditized

• Hardware is completely standardized, special cases are regarded as a risk to business

• Redundant Array of Inexpensive Data centers

1 Site Reliability Engineer to 1000+ Systems

Continuous Delivery
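The engineer-to-systems ratios quoted across the four hardware stages can be read as a rough lookup. The sketch below is only an illustration of that reading: the `Stage` type and `hardware_stage` function are hypothetical names, the lower bounds are treated as inclusive to resolve the overlapping endpoints, and ratios below 25 are assumed to fall into the first stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str         # slide title for the stage
    min_systems: int  # lower bound of systems per engineer (inclusive, an assumption)


# Ratios as quoted on the slides: 25-250, 250-500, 500-1000, 1000+.
STAGES = [
    Stage("Every Server is Sacred!", 25),
    Stage("Maybe not ALL servers are sacred", 250),
    Stage("Most of these servers aren't sacred?", 500),
    Stage("None of the servers are sacred", 1000),
]


def hardware_stage(systems: int, engineers: int) -> Stage:
    """Return the stage whose ratio band contains systems / engineers."""
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    ratio = systems / engineers
    current = STAGES[0]  # ratios below 25 fall back to the first stage (assumption)
    for stage in STAGES:
        if ratio >= stage.min_systems:
            current = stage
    return current


# Hypothetical team: 4 engineers running 1800 systems (450 systems each).
print(hardware_stage(systems=1800, engineers=4).name)
```

The ratio is one indicator among several; as the deck notes, these trends are correlated but not definitive.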

Operating System Management

Operating Systems Management

• Many OS flavors and versions. Manual, irregular patching

• Limited flavors and versions, planned upgrades. “Patch Tuesday!”

• Standard versions using JEOS with regular upgrades. Automated patching.

• Internally maintained versions, constant upgrades

http://www.smallwebs.com/Swords/images/UK1796HC2d/SCOTLANDFOREVER2.jpg

Incident Management

Incident Threshold: Recovery Time

• Which teams have regular on call responsibilities?

• What is expected of someone on call?

• How are people notified & engaged on an incident?

Incident Threshold: Recovery Time

• "Something is wrong!" 12+ hours

• "Something is wrong with the…!" 1-12 hours

• "Something went wrong with your deployment!" <60 minutes

• "The core infrastructure fabric is down!" seconds - 10 minutes
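The four recovery-time bands above can be turned into a simple classifier for past incidents. This is a minimal sketch: `recovery_tier` is a hypothetical helper, and the handling of boundaries (for example, counting exactly 10 minutes in the fastest band) is an assumption, since the slide leaves the edges open.

```python
from datetime import timedelta


def recovery_tier(recovery_time: timedelta) -> str:
    """Map an incident's recovery time to one of the bands from the slide."""
    if recovery_time <= timedelta(minutes=10):
        return "seconds - 10 minutes"
    if recovery_time < timedelta(minutes=60):
        return "<60 minutes"
    if recovery_time < timedelta(hours=12):
        return "1-12 hours"
    return "12+ hours"


# Hypothetical incidents, for illustration only.
for duration in (timedelta(minutes=5), timedelta(minutes=45), timedelta(hours=3)):
    print(duration, "->", recovery_tier(duration))
```

Tallying incidents per band over a quarter gives a rough picture of where an organization currently sits.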

Postmortems

http://photography.nationalgeographic.com/photography/photo-of-the-day/

Postmortems

• Postmortem Focus

• Root Cause Orientation

• Root Cause Mitigation/Resolution

• Root Cause Elimination Rate

http://img3.wikia.nocookie.net/__cb20111008164412/mlpfanart/images/thumb/b/b2/Twilight_Sparkle_Angry_by_Ivan-Chan.png/597px-Twilight_Sparkle_Angry_by_Ivan-Chan.png

Postmortems: Ad Hoc

• "Human Error": blame finding & punishment

• "Triggering Event": blaming specific operator error or specific hardware failures

• Cycle between protecting heroes and then firing them

• <10% - Mostly break fix detection

Postmortems: Formal

• Focus on "Triggering Event" or "Human Error", but blaming process and/or infrastructure

• "Let's implement more process and overhead"

• 10% within 3 months - mostly simple fixes

• Tracking but little progress against goals vs. other priorities, frequent recurrence

Postmortems: Officially "Blame Free"

• Primary focus on underlying technical root causes, systemic fixes

• Improved tooling, programmatic checks, operator tools for special cases. Some focus on building resiliency

• 20% - Easily fixable issues eliminated within 3 months, programs to eliminate larger issues over time

Postmortems: “5 Whys”

• Including business and cultural issues

• Primary focus on insights and opportunities from lessons learned

• Increased resiliency and appropriate operator tools, focus on self-healing fixes

• Recurrence becomes infrequent and is a big deal
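The postmortem levels cite a root cause elimination rate (under 10%, 10% within 3 months, 20%). One possible way to measure it is sketched below, under stated assumptions: the rate is taken as the share of identified root causes eliminated within 90 days of identification, and the `elimination_rate` function and the sample records are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# (date identified, date eliminated or None if still open)
RootCause = Tuple[date, Optional[date]]


def elimination_rate(causes: List[RootCause],
                     window: timedelta = timedelta(days=90)) -> float:
    """Fraction of root causes eliminated within `window` of being identified."""
    if not causes:
        return 0.0
    eliminated = sum(
        1 for identified, fixed in causes
        if fixed is not None and fixed - identified <= window
    )
    return eliminated / len(causes)


# Hypothetical postmortem records, for illustration only.
causes = [
    (date(2014, 1, 6), date(2014, 2, 3)),   # fixed in 28 days
    (date(2014, 1, 20), None),              # still open
    (date(2014, 2, 10), date(2014, 7, 1)),  # fixed, but after the window
    (date(2014, 3, 3), date(2014, 3, 10)),  # fixed in 7 days
    (date(2014, 3, 17), None),              # still open
]
print(f"{elimination_rate(causes):.0%}")
```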

Navigating the Change

• Many more mile markers

• Roadmap to improve your:

• Mean Time To Production

• Mean Time to Recovery
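Both metrics are averages over start/end pairs, so tracking them can start very simply. The sketch below assumes Mean Time To Production is measured from commit to production deploy and Mean Time to Recovery from detection to resolution; `mean_duration` and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end)


def mean_duration(intervals: List[Interval]) -> timedelta:
    """Average elapsed time across a list of (start, end) pairs."""
    if not intervals:
        raise ValueError("need at least one interval")
    total = sum((end - start for start, end in intervals), timedelta())
    return total / len(intervals)


# Hypothetical incidents: (detected, resolved).
incidents = [
    (datetime(2014, 5, 1, 9, 0), datetime(2014, 5, 1, 9, 45)),
    (datetime(2014, 5, 3, 14, 0), datetime(2014, 5, 3, 16, 15)),
]

# Hypothetical changes: (committed, running in production).
changes = [
    (datetime(2014, 4, 28, 10, 0), datetime(2014, 4, 30, 10, 0)),
    (datetime(2014, 5, 2, 8, 0), datetime(2014, 5, 6, 8, 0)),
]

print("Mean Time to Recovery:", mean_duration(incidents))
print("Mean Time To Production:", mean_duration(changes))
```

Recomputing these figures at regular intervals shows whether the roadmap is actually moving either number.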

Becoming a Unicorn is Possible

• Approach the challenges with realistic expectations for your organization

• Always room for improvement

• Culture trumps everything

http://webecoist.momtastic.com/wp-content/uploads/2010/09/unicorns_3x.jpg

Where Can I Download It?

bit.ly/Chef-OMM

Thanks!

Matt Ray, matt@getchef.com, @mattray

Thanks to George Miranda, Paul Edelhertz & Jesse Robbins
