miniature guide to operational features - edindevops - skeltonthatcher

The Miniature Guide to Operational Features

Edinburgh DevOps Meetup – 15th September 2015

Rob Thatcher & Matthew Skelton

“Operational Features”

how to develop and test

prioritisation techniques

availability is the best feature

Operational Features

“the properties of a system which make it work well in Production”

Not PIMP MY RIDE

MORE

Greasy Mechanic

Terminology

what happened to NFRs? (non-functional requirements)

Non-Functional Functional

language impact

non-starter

non compos mentis

non-compete

nonsense!

holistic product view

How did we get to this?

admission: IT folk have been guilty of making operational features quite scary & mysterious

long lists of requirements

crazy test plans

poor explanation of needs

failure to engage stakeholders

gold-plating

de-mystify operational features

better approach

pragmatic and effective

rapid, safe, valuable

“the properties of a system which make it work well in Production”

Why value Operational Features?

downtime:

$$$

reputation

($$)

non-linear increase in complexity and problems

Internet of Things

we can no longer deal manually with the scale/volume of potential problems

agility and response to incidents

remote car hacking:

security as an operational feature

HA + DR + Backup + Metrics + Diagnostics + …

think: “when it fails, how will we recover?”

it will fail

How do we develop and test Operational Features?

defined features

testable and measurable

ahead lie the ‘ilities’...

1. What

2. How to test

Operational Hooks

Deployment Pipeline

Configurability

re-read config (SIGHUP)

text files in version control

inject settings – no ‘black boxes’

toggle features via config

“Postcode lookup unavailable”

better UX
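A minimal sketch of the configurability points above, assuming a JSON settings file kept in version control. The `Config` class, the `postcode_lookup` toggle name and the file layout are hypothetical, chosen only to illustrate re-reading config on SIGHUP and degrading gracefully when a feature is toggled off.

```python
import json
import signal


class Config:
    """Settings loaded from a plain text (JSON) file kept in version control."""

    def __init__(self, path):
        self.path = path
        self.settings = {}
        self.reload()

    def reload(self, signum=None, frame=None):
        # Signature also matches what signal.signal() passes to a handler.
        with open(self.path) as f:
            self.settings = json.load(f)

    def enabled(self, feature):
        # Unknown toggles default to "off".
        return bool(self.settings.get("features", {}).get(feature, False))


def install_sighup_handler(config):
    # SIGHUP only exists on POSIX systems; call this from the main thread.
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, config.reload)


def postcode_lookup(config, postcode):
    # With the toggle off, the user sees a clear message instead of an error.
    if not config.enabled("postcode_lookup"):
        return "Postcode lookup unavailable"
    return f"Looking up {postcode}"  # placeholder for the real lookup
```

With the handler installed, `kill -HUP <pid>` would make the running process pick up an edited file without a restart.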

Deployability

immutable artefacts

concurrent releases (SxS)

symlinks

rapid

scriptable

simple failure modes
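One way the side-by-side releases and symlinks above could fit together, sketched for a POSIX filesystem. The directory layout (`releases/<version>` plus a `current` link) and the `activate_release` name are assumptions for illustration, not something prescribed by the talk.

```python
import os


def activate_release(releases_dir, version, current_link):
    """Point the 'current' symlink at an already-unpacked release."""
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        # Simple failure mode: refuse to switch to a release that is not there.
        raise FileNotFoundError(target)

    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # rename(2) replaces the old link in one step on POSIX filesystems.
    os.replace(tmp_link, current_link)
    return os.readlink(current_link)
```

Because old releases stay on disk untouched, rolling back is the same call with the previous version number.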

Maintainability

holding page as MVP!

live system component diagrams

modularity

ability to upgrade

version numbering (SemVer?)

Testability

every component has a /health endpoint

stubbed/mocked/faked endpoints

test things individually
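A rough sketch of a /health endpoint using only the Python standard library. The dependency names in `check_dependencies` are hypothetical stand-ins; a real component would probe whatever it actually depends on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies():
    # Hypothetical checks; replace with real probes (database ping, queue depth...).
    return {"database": True, "queue": True}


def health_response(checks):
    """Turn a dict of named boolean checks into an HTTP status and JSON body."""
    healthy = all(checks.values())
    body = json.dumps({"healthy": healthy, "checks": checks}).encode("utf-8")
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    # Call make_server().serve_forever() to run the endpoint.
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Keeping `health_response` separate from the handler means the health logic can be tested on its own, without starting a server.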

Recoverability

asynchronous service start

expect services to be erroring

logs are not wiped (rotated: okay)

avoid flooding logs

no nasty zombies after failures

MTTR more important than MTBF (for most kinds of F)
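One common way to relate the two measures is steady-state availability, MTBF / (MTBF + MTTR). The sketch below uses illustrative figures (1000 hours between failures, 1 hour to recover) that are assumptions, not numbers from the talk.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(1000, 1.0)         # fail every 1000 h, recover in 1 h
faster_recovery = availability(1000, 0.5)  # same failure rate, half the MTTR
fewer_failures = availability(2000, 1.0)   # half the failure rate, same MTTR
```

Under this model, halving the time to recover buys the same availability as doubling the time between failures.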

Performance

run key 'hotspot' areas early

use a deployment pipeline

‘critical path’

early pipeline tests act as a barometer for later performance problems

derive transit time metrics

Monitorability

stream of metrics

transaction tracing

BasketItemAdded

grep BasketItem

logging for insights
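A small sketch of logging named events so that a transaction can be traced with a plain `grep BasketItem`. The `event=` prefix, the key=value layout and the field names are assumptions made for illustration.

```python
import logging


def format_event(name, **fields):
    """Render one event as a single grep-friendly line of key=value pairs."""
    pairs = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"event={name} {pairs}".rstrip()


def log_event(logger, name, **fields):
    logger.info(format_event(name, **fields))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("shop")
    # The same transaction_id on every event lets one request be followed end to end.
    log_event(log, "BasketItemAdded", transaction_id="t-42", sku="ABC-1")
```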

Resilience

Saboteur for network failure testing

deployment pipeline

assume missing or failing

Chaos Monkey

don’t crash on HTTP 503
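A sketch of "assume missing or failing" for an HTTP dependency: an HTTP 503 or a network failure produces a fallback value rather than a crash. The function name and the injectable `opener` parameter are hypothetical conveniences so the behaviour can be exercised without a real network.

```python
import urllib.error
import urllib.request


def fetch_or_fallback(url, fallback, opener=urllib.request.urlopen, timeout=2.0):
    """Return the response body, or `fallback` if the dependency is unavailable."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 503:
            # Dependency says "temporarily unavailable": degrade, don't crash.
            return fallback
        raise  # other HTTP errors may be genuine bugs, so let them surface
    except OSError:
        # Covers URLError, timeouts and refused connections.
        return fallback
```

The same fake `opener` trick is one way to run this kind of failure test early in a deployment pipeline.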

Scalability

concurrent workers

queues and bottlenecks

throttling is your friend
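A minimal sketch of throttling with concurrent workers and a bounded queue: when the queue is full, new work is refused instead of piling up behind a bottleneck. The `ThrottledPool` class and its parameters are assumptions for illustration.

```python
import queue
import threading


class ThrottledPool:
    """Fixed number of workers fed from a bounded queue."""

    def __init__(self, handler, workers=2, max_pending=4):
        self._handler = handler
        self._queue = queue.Queue(maxsize=max_pending)
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, job):
        """Return True if the job was accepted, False if it was throttled."""
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            job = self._queue.get()
            try:
                self._handler(job)
            finally:
                self._queue.task_done()

    def drain(self):
        # Block until every accepted job has been handled.
        self._queue.join()
```

A caller that gets `False` back can respond with "try again later" (for example an HTTP 503), which keeps the overload visible rather than hidden in an ever-growing backlog.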

Security and ‘securability’

securability by practice

SSL certs & HEARTBLEED

Gauntlt

deployment pipeline

# nmap-simple.attack
Feature: simple nmap attack to check for open ports

  Background:
    Given "nmap" is installed
    And the following profile:
      | name     | value       |
      | hostname | example.com |

  Scenario: Check standard web ports
    When I launch an "nmap" attack with:
      """
      nmap -F <hostname>
      """
    Then the output should match /80.tcp\s+open/
    Then the output should not match:
      """
      25\/tcp\s+open
      """

Availability

“available but unusable”

synthetic transactions

special HTTP header: trigger additional metrics/reporting
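A sketch of a synthetic transaction marked with a special HTTP header, so the service can recognise probe traffic and record extra metrics for it. The header name `X-Synthetic-Transaction`, the metric names and `handle_checkout` are all hypothetical.

```python
import urllib.request

SYNTHETIC_HEADER = "X-Synthetic-Transaction"  # hypothetical header name


def build_synthetic_request(url, run_id):
    """Probe side: a normal request, tagged so the service can recognise it."""
    return urllib.request.Request(url, headers={SYNTHETIC_HEADER: run_id})


def is_synthetic(headers):
    # HTTP header names are case-insensitive, so compare in lower case.
    return any(name.lower() == SYNTHETIC_HEADER.lower() for name in headers)


def handle_checkout(headers, metrics):
    """Service side: same code path for everyone, extra reporting for probes."""
    metrics["requests"] = metrics.get("requests", 0) + 1
    if is_synthetic(headers):
        metrics["synthetic_requests"] = metrics.get("synthetic_requests", 0) + 1
    return "order accepted"
```

Because the probe exercises the real code path and checks the result, it can catch the “available but unusable” case that a simple ping would miss.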

How the organisation affects Operational Features

Budgets

bonuses:

story points delivered

tickets closed

Capex vs Opex

tax breaks

avoiding the Capex/Opex evil

Developers seen as more valuable than Ops people

3x hiring bonus for Devs (!)

improved awareness in product teams

share ownership and decision making

features

end-user

operational

end-user

single product backlog

Product Owner on call for incidents

tricky!

high degree of maturity

honesty about the product

Product Owner and Tech Lead are both on the hook for outages

AVOID

Product Owner for ‘user features’ and Tech Lead for ‘operational features’

How to evaluate Operational Features vs User Features

treat Ops team folk as another user persona

alternatives to User Stories?

NOT:

"as a logging subsystem, I want..."

Metrics

Live: downtime, A/B for operational aspects (speed)

Pre-live: time spent re-deploying

Metrics for better conversations

metric-ify your delivery and test infrastructure

99.99% uptime, but 20 redeployments every time
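To put numbers on that contrast: 99.99% uptime over a 365-day year allows roughly 52.6 minutes of downtime. The sketch below sets that live figure beside a pre-live one; the 15 minutes per redeployment attempt is an assumed value, not a number from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Live metric: minutes of downtime permitted by an uptime target."""
    return (100.0 - uptime_percent) / 100.0 * period_minutes


def redeploy_overhead_minutes(attempts, minutes_per_attempt):
    """Pre-live metric: time spent re-deploying before a release sticks."""
    return attempts * minutes_per_attempt


yearly_budget = downtime_budget_minutes(99.99)        # about 52.6 minutes
per_release_cost = redeploy_overhead_minutes(20, 15)  # assumed 15 min per attempt
```

Seen side by side, a handful of releases at that rate would cost more team time than the whole year's downtime budget, which is the kind of conversation these metrics are meant to start.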

Heuristics for operational features

30% of total product budget

30% of dev team time

holistic product view

MVP: ‘service unavailable’ page

test early for operational features

using a deployment pipeline

single product backlog:

(user) features +

(operational) features

availability is the best feature

Books!

operabilitybook.com

operationalfeatures.com

thank you

http://skeltonthatcher.com/

@SkeltonThatcher

+44 (0)20 8242 4103
