bol.com - what would you do if you could do it all over without getting fired
DESCRIPTION
At bol.com we have been working very hard for the last 2 years on redoing the way we run our web operations. On the fifth of April this year we successfully switched all customers to our new datacenters. During this presentation we would like to share what happened under the hood: how we have been inspired by the community, how we dealt with bad choices and wrong assumptions, and how to get the best out of ‘freshly’ made decisions and the limitations encountered. Luckily there’s also a bunch of successes we can share and some principles we took on board. Finally we would like to give you a preview of what’s next. This talk will explore both technical and team topics.
TRANSCRIPT
About us
Jos Houtman: - professional bit byter - 10 years on the job - [email protected]
Guido Bakker: - stubborn guy from Hoorn who next time will be on stage! - 15 years on the job and the technical orchestrator - @guido_bakker
Niels van de Wall: - responsible for IT operations - 15 years on the job - [email protected]
>6.700.000 products >12 categories >4.000.000 customers
>8.000.000 products >4.000.000 customers
>150 engineers >50 applications
>30 scrum teams
What happened under the hood last 2 years
start
platform and automation
ways of working
team
……
Build team
With passionate, experienced professionals:
• who have a great feel for how to deal with risky situations
• who love to automate and structurally improve stuff based on measurements
• who take the lead and own it
and…..
have the right attitude!
The BAD:
• Takes time to find the right people…. we’ve succeeded but next time…
• New team, new ways of working, new platform…… initially takes more time and energy to fine-tune
The UGLY:
• Pressure cooking… joiners needed to go through an aggressive ramp-up period!
The GOOD:
• Got the right people just in time, without concessions
• To-be colleagues were observed on how they behave and deal with matters on the command line
• Building and running done by the same team
• Ownership and focus
• Automation mindset
• Fun!
What happened under the hood last 2 years
start
ways of working
team
platform and automation
……
Platform and automation
Principles:
• single version of truth
• no manual actions
• if it isn't highly available, it’s bad
• set boundaries, be conditional
• measure and monitor everything
• manage all environments the same
• only peer reviewed changes
Asset management with API
Goal: holds the truth of our infrastructure and is used during the whole lifetime of an asset.
• provisioning: os, hostname, network, etc.
• configuration: role
• operation: state determines monitoring visibility
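To make the lifecycle idea concrete, here is a minimal sketch of an asset record whose state decides whether a host shows up in monitoring. It is not the bol.com implementation: the class names, the state names and the hostnames used with it are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical lifecycle states; the talk does not name the real ones.
    PROVISIONING = "provisioning"
    IN_SERVICE = "in_service"
    MAINTENANCE = "maintenance"
    DECOMMISSIONED = "decommissioned"


@dataclass
class Asset:
    hostname: str
    os: str
    network: str
    role: str
    state: State = State.PROVISIONING


class AssetRegistry:
    """In-memory stand-in for an asset management API (illustrative only)."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        # One record per hostname keeps a single version of the truth.
        if asset.hostname in self._assets:
            raise ValueError(f"duplicate asset: {asset.hostname}")
        self._assets[asset.hostname] = asset

    def get(self, hostname):
        return self._assets[hostname]

    def set_state(self, hostname, state):
        self._assets[hostname].state = state

    def monitored_hosts(self):
        # Only assets that are in service are visible to monitoring.
        return sorted(
            a.hostname for a in self._assets.values()
            if a.state is State.IN_SERVICE
        )
```

Provisioning, configuration and monitoring tools would each read the same record instead of keeping their own copy of hostname, role or state.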
The GOOD:
• administration is up-to-date and enforced
• tie key components together with APIs/scripts/whatever
• changes are cheap
• strict naming scheme allows for easier automation
The BAD:
• It’s a good start but needs more to it!
• Majority of infrastructure information ended up in hiera.
The UGLY:
• Need for a place to store infrastructure information
• no CLI
• Truth needs to be available
What happened under the hood last 2 years
start
ways of working
team
platform and automation
……
Configuration management
Source: http://www.craigdunn.org/2012/05/239/
Config – hiera data
• Hiera is suboptimal as a data source for complex information used by different modules / functionality
• Solution: custom functions to retrieve only subsections of a hiera hash
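The idea of retrieving only a subsection of a large hash can be sketched as follows. This is a Python stand-in rather than the custom Puppet function from the talk; the function name `hash_subsection`, the dotted-path convention and the sample data are assumptions made for illustration.

```python
def hash_subsection(data, path, default=None):
    """Return the value at a dotted key path inside a nested dict.

    Falls back to `default` when any key along the path is missing.
    """
    node = data
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hypothetical infrastructure data, shaped like one large nested hash.
infrastructure = {
    "webshop": {
        "database": {"host": "db01", "port": 5432},
        "cache": {"host": "cache01"},
    }
}

# A module asks only for the part it needs instead of the whole hash.
db_settings = hash_subsection(infrastructure, "webshop.database")
```

Each module then depends on one narrow slice of the data rather than on the layout of the entire hash.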
Config – deployments
• Complete state is maintained, puppet installs releases.
• Rundeck does orchestration of puppet runs, database deploys, restarts
• More tomorrow by Steven Meunier
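As a rough illustration of what orchestration means here, the sketch below runs named steps in order and stops at the first failure. It does not use Rundeck or Puppet; `run_steps` and the step names passed to it are hypothetical.

```python
def run_steps(steps):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns a tuple: (names of completed steps, name of the failed step or None).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # Later steps, such as a restart, must not run after a failure.
            return completed, name
        completed.append(name)
    return completed, None
```

A deployment could then be expressed as an ordered list such as puppet run, database deploy, restart, where a failed database deploy prevents the restart from happening.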
Config – monitoring
• Exported resources to configure nagios checks.
• checks defined on abstraction levels: role, os, etc.
• then exported in the various classes of the profile layer
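Defining checks per abstraction level instead of per host can be sketched like this. The sketch is plain Python, not Puppet exported resources; the table `CHECKS_BY_LEVEL`, the role names and the check names are assumptions chosen for illustration.

```python
# Checks are attached to an abstraction level (os, role), never to a host.
CHECKS_BY_LEVEL = {
    "os": {"linux": ["check_disk", "check_load"]},
    "role": {"webserver": ["check_http"], "database": ["check_db"]},
}


def checks_for(host):
    """Collect the checks a host inherits from its os and role.

    `host` is a dict such as {"os": "linux", "role": "webserver"}.
    Order is preserved and duplicates are dropped.
    """
    checks = []
    for level in ("os", "role"):
        value = host.get(level)
        for check in CHECKS_BY_LEVEL[level].get(value, []):
            if check not in checks:
                checks.append(check)
    return checks
```

Adding a new host with an existing os and role then requires no monitoring changes at all, because its checks follow from the levels it belongs to.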
The GOOD:
• Define on abstraction levels, not individual systems
• Monitoring, logging and metrics integral part of our profiles
• No separate deployment needed after installation
• 2 hours from scratch to fully working environment
• Destroyed and rebuilt entire environments
The BAD:
• Puppet(db) slow due to amount of resources
• Prone to dependency hell
The UGLY:
• Double administration necessary in hiera
• Exported resources is the wrong choice for most problems
What happened under the hood last 2 years
start
……
team
platform and automation
ways of working
Ways of working – next steps
collaboration & shared responsibility
Continuous delivery