The Miniature Guide to Operational Features
Edinburgh DevOps Meetup – 15th September 2015
Rob Thatcher & Matthew Skelton
“Operational Features”
how to develop and test
prioritisation techniques
availability is the best feature
Operational Features
“the properties of a system which make it work well in
Production”
Not PIMP MY RIDE
MORE
Greasy Mechanic
Terminology
what happened to NFRs? (non-functional requirements)
Non-Functional Functional
language impact
non-starter
non compos mentis
non-compete
nonsense !
holistic product view
How did we get to this?
admission: IT folk have been guilty of making operational
features quite scary & mysterious
long lists of requirements
crazy test plans
poor explanation of needs
failure to engage stakeholders
gold-plating
de-mystify operational features
better approach
pragmatic and effective
rapid, safe, valuable
“the properties of a system which make it work well in
Production”
Why value Operational Features?
downtime:
$$$
reputation ($$)
non-linear increase in complexity and problems
Internet of Things
we can no longer deal manually with the scale/volume
of potential problems
agility and response to incidents
remote car hacking:
security as an operational feature
HA + DR + Backup + Metrics + Diagnostics + …
think: “when it fails, how will we recover?”
it will fail
How do we develop and test Operational Features?
defined features
testable and measurable
ahead lie the ‘ilities’...
1. What
2. How to test
Operational Hooks
Deployment Pipeline
Configurability
re-read config (SIGHUP)
text files in version control
inject settings – no ‘black boxes’
toggle features via config
“Postcode lookup unavailable”
better UX
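The configurability points above can be sketched roughly as follows. This is a minimal illustration, not the speakers' implementation: the `Config` class, the JSON file layout, the `postcode_lookup` feature name and the helper functions are all assumptions made for the example. It re-reads a text config file on SIGHUP and uses a feature toggle to degrade to a friendly message.

```python
import json
import signal


class Config:
    """Settings loaded from a JSON text file (assumed to live in version control)."""

    def __init__(self, path):
        self.path = path
        self.settings = {}
        self.reload()

    def reload(self, signum=None, frame=None):
        # The (signum, frame) parameters let this method double as a signal handler.
        try:
            with open(self.path) as f:
                new_settings = json.load(f)
        except (OSError, ValueError):
            return  # unreadable or malformed file: keep the last good settings
        self.settings = new_settings

    def feature_enabled(self, name):
        # Features that are not mentioned in the file default to "off".
        return bool(self.settings.get("features", {}).get(name, False))


def postcode_lookup(config, postcode):
    # Hypothetical feature: when toggled off, tell the user instead of failing.
    if not config.feature_enabled("postcode_lookup"):
        return "Postcode lookup unavailable"
    return f"(looked up {postcode})"


def install_sighup_reload(config):
    # SIGHUP only exists on Unix-like systems, so guard the registration.
    if hasattr(signal, "SIGHUP"):
        signal.signal(signal.SIGHUP, config.reload)
```

With this in place, an operator could edit the file and run `kill -HUP <pid>` to switch a feature off without a restart.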
Deployability
immutable artefacts
concurrent releases (SxS)
symlinks
rapid
scriptable
simple failure modes
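One common way to combine side-by-side releases with symlinks is sketched below. The directory layout (`releases/<version>` plus a `current` link) and the function name are assumptions for illustration, and the atomic rename relies on POSIX behaviour.

```python
import os


def activate_release(releases_dir, version, current_link):
    """Point current_link at releases_dir/version and return the new target."""
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        # Simple failure mode: refuse to switch to a release that is not there.
        raise FileNotFoundError(target)
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear debris from an interrupted earlier switch
    os.symlink(target, tmp_link)
    # Renaming the new link over the old one is atomic on POSIX filesystems,
    # so readers see either the old release or the new one, never neither.
    os.replace(tmp_link, current_link)
    return os.readlink(current_link)
```

Because the old release directory is left untouched, rolling back is just another call with the previous version.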
Maintainability
holding page as MVP!
live system component diagrams
modularity
ability to upgrade
version numbering (SemVer?)
Testability
every component has a /health endpoint
stubbed/mocked/faked endpoints
test things individually
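As a rough sketch of what a per-component /health endpoint might look like, the example below uses only the Python standard library. The response body (`{"status": "ok"}`) and the helper name are assumptions; a real component would probably also report on its own dependencies.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this example


def start_health_server(port=0):
    # Port 0 asks the OS for any free port; read it from server.server_address.
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A test harness or deployment pipeline stage can then check each component individually by requesting `/health` and expecting a 200.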
Recoverability
asynchronous service start
expect services to be erroring
logs are not wiped (rotated: okay)
avoid flooding logs
no nasty zombies after failures
MTTR more important than MTBF*
* for most kinds of F
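The MTTR point can be illustrated with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The figures below are invented purely for illustration and do not come from the talk.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumptions, not measurements).
baseline = availability(mtbf_hours=500, mttr_hours=2)           # about 0.9960
longer_mtbf = availability(mtbf_hours=1000, mttr_hours=2)       # about 0.9980
faster_recovery = availability(mtbf_hours=500, mttr_hours=0.2)  # about 0.9996
```

In this made-up case, recovering ten times faster improves availability more than failing half as often.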
Performance
run key 'hotspot' areas early
use a deployment pipeline
‘critical path’
early pipeline tests act as a barometer for later
performance problems
derive transit time metrics
Monitorability
stream of metrics
transaction tracing
BasketItemAdded
grep BasketItem
logging for insights
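The `grep BasketItem` idea relies on log lines carrying stable, searchable event names. A minimal sketch follows; the `Event` enum, its numeric IDs and the `log_event` helper are hypothetical names chosen for this example.

```python
import logging
from enum import Enum


class Event(Enum):
    # Hypothetical event names and IDs; the shared prefix makes them greppable.
    BasketItemAdded = 2001
    BasketItemRemoved = 2002
    BasketCheckoutFailed = 2003


def log_event(logger, event, **fields):
    """Write one line that starts with the event name, then key=value pairs."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    line = f"{event.name} id={event.value} {details}".rstrip()
    logger.info(line)
    return line
```

For example, `log_event(logging.getLogger("shop"), Event.BasketItemAdded, sku="A1", qty=2)` produces the line `BasketItemAdded id=2001 qty=2 sku=A1`, which `grep BasketItem` would find alongside the other basket events.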
Resilience
Saboteur for network failure testing
deployment pipeline
assume missing or failing
Chaos Monkey
don’t crash on HTTP 503
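"Assume missing or failing" and "don't crash on HTTP 503" might look something like the sketch below, which treats an unavailable dependency as a normal case with a fallback value. The function name and the injectable `opener` parameter are assumptions made so the behaviour can be exercised without a real network.

```python
import urllib.error
import urllib.request


def fetch_or_fallback(url, fallback, opener=urllib.request.urlopen, timeout=2.0):
    """Return the response body, or `fallback` if the dependency is unavailable."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 503:
            return fallback  # dependency is overloaded or down: degrade, don't crash
        raise  # other HTTP errors (e.g. 404) probably indicate a bug, so surface them
    except OSError:
        # Covers URLError (DNS failure, refused connection) and socket timeouts.
        return fallback
```

A caller could pass a message such as "Postcode lookup unavailable" as the fallback and carry on serving the rest of the page.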
Scalability
concurrent workers
queues and bottlenecks
throttling is your friend
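One well-known way to throttle is a token bucket, sketched below. The talk does not name a specific algorithm, so this choice, the class name and the injectable `clock` parameter are all assumptions for illustration.

```python
import threading
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()
        self.lock = threading.Lock()  # concurrent workers may share one bucket

    def allow(self):
        with self.lock:
            now = self.clock()
            # Refill in proportion to the elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should reject or queue the request
```

Rejecting early when `allow()` returns `False` keeps queues short, so a bottleneck shows up as fast, explicit refusals rather than slow timeouts.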
Security and ‘securability’
securability by practice
SSL certs & HEARTBLEED
Gauntlt
deployment pipeline
# nmap-simple.attack
Feature: simple nmap attack to check for open ports
  Background:
    Given "nmap" is installed
    And the following profile:
      | name     | value       |
      | hostname | example.com |
  Scenario: Check standard web ports
    When I launch an "nmap" attack with:
      """
      nmap -F <hostname>
      """
    Then the output should match /80.tcp\s+open/
    Then the output should not match:
      """
      25\/tcp\s+open
      """
Availability
“available but unusable”
synthetic transactions
special HTTP header: trigger additional metrics/reporting
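A synthetic transaction flagged by a special header could be handled along these lines. The header name `X-Synthetic-Transaction`, the `handle_request` function and the shape of the response are all hypothetical; the slides do not specify them.

```python
import time

# Hypothetical header name, chosen for this example only.
SYNTHETIC_HEADER = "X-Synthetic-Transaction"


def handle_request(headers, do_work, clock=time.perf_counter):
    """Run the normal code path; attach extra diagnostics for synthetic calls.

    Assumes `headers` is a dict whose keys use the exact casing above.
    """
    synthetic = headers.get(SYNTHETIC_HEADER, "").lower() == "true"
    start = clock()
    result = do_work()  # the same work real users trigger
    response = {"status": 200, "body": result}
    if synthetic:
        elapsed_ms = round((clock() - start) * 1000, 3)
        response["diagnostics"] = {"elapsed_ms": elapsed_ms}
    return response
```

Because the synthetic call goes through the same code path as a real one, it can detect the "available but unusable" case that a simple ping would miss.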
How the organisation affects Operational Features
Budgets
bonuses:
story points delivered
tickets closed
Capex vs Opex
tax breaks
avoiding the Capex/Opex evil
Developers seen as more valuable than Ops people
3x hiring bonus for Devs (!)
improved awareness in product teams
share ownership and decision making
features
end-user
operational end-user
single product backlog
Product Owner on call for incidents
tricky!
high degree of maturity
honesty about the product
Product Owner and Tech Lead are both on the hook for
outages
AVOID
Product Owner for ‘user features’ and Tech Lead for
‘operational features’
How to evaluate Operational Features vs User Features
treat Ops team folk as another user persona
alternatives to User Stories?
NOT:
"as a logging subsystem, I want..."
Metrics
Live: downtime, A/B for operational aspects (speed)
Pre-live: time spent re-deploying
Metrics for better conversations
metric-ify your delivery and test infrastructure
99.99% uptime, but 20 redeployments every time
Heuristics for operational features
30% of total product budget
30% of dev team time
holistic product view
MVP: ‘service unavailable’ page
test early for operational features
using a deployment pipeline
single product backlog:
(user) features +
(operational) features
availability is the best feature
Books!
operabilitybook.com
operationalfeatures.com