bol.com - what would you do if you could do it all over without getting fired

About us. Jos Houtman: professional bit byter, 10 years on the job, [email protected]. Guido Bakker: stubborn guy from Hoorn who next time will be on stage! 15 years on the job and the technical orchestrator, @guido_bakker. Niels van de Wall: responsible for IT operations, 15 years on the job, [email protected].

Upload: jos-houtman

Post on 18-Nov-2014

Category: Technology


DESCRIPTION

At bol.com we have been working very hard for the last two years on redoing the way we run our web operations. On the fifth of April this year we successfully switched all customers to our new datacenters. In this presentation we would like to share what happened under the hood: how we were inspired by the community, how we dealt with bad choices and wrong assumptions, and how to get the best out of ‘freshly’ made decisions and the limitations we encountered. Luckily there are also a bunch of successes we can share, along with some principles we took on board. Finally, we would like to give you a preview of what’s next. This talk will explore both technical and team topics.

TRANSCRIPT

Page 1: bol.com - What would you do if you could do it all over without getting fired

About us

Jos Houtman:
- professional bit byter
- 10 years on the job
- [email protected]

Guido Bakker:
- stubborn guy from Hoorn who next time will be on stage!
- 15 years on the job and the technical orchestrator
- @guido_bakker

Niels van de Wall:
- responsible for IT operations
- 15 years on the job
- [email protected]

Page 2: bol.com - What would you do if you could do it all over without getting fired

>6.700.000 products, >12 categories, >4.000.000 customers

>8.000.000 products, >4.000.000 customers

>150 engineers, >50 applications

>30 scrum teams

Page 3: bol.com - What would you do if you could do it all over without getting fired

What happened under the hood last 2 years

start
platform and automation
ways of working
team
……

Page 4: bol.com - What would you do if you could do it all over without getting fired

Build team

With passionate, experienced professionals who:
•  have a great feel for how to deal with risky situations
•  love to automate and structurally improve stuff based on measurements
•  take the lead and own it

and…

have the right attitude!

Page 5: bol.com - What would you do if you could do it all over without getting fired

The BAD:
•  Takes time to find the right people… we’ve succeeded, but next time…
•  New team, new ways of working, new platform… initially takes more time and energy to fine-tune

The UGLY:
•  Pressure cooking… joiners needed to go through an aggressive ramp-up period!

The GOOD:
•  Got the right people just in time, without concessions
•  We observed how to-be colleagues behave and deal with matters on the command line
•  Building and running done by the same team
•  Ownership and focus
•  Automation mindset
•  Fun!

Page 6: bol.com - What would you do if you could do it all over without getting fired

What happened under the hood last 2 years

start
ways of working
team
platform and automation
……

Page 7: bol.com - What would you do if you could do it all over without getting fired

Platform and automation

Principles:
•  single version of truth
•  no manual actions
•  if it isn't highly available, it’s bad
•  set boundaries, be conditional
•  measure and monitor everything
•  manage all environments the same
•  only peer-reviewed changes

Page 8: bol.com - What would you do if you could do it all over without getting fired

Asset management with API

Goal: holds the truth of our infrastructure and is used during the whole lifetime of an asset.
•  provisioning: os, hostname, network, etc.
•  configuration: role
•  operation: state determines monitoring

visibility
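As an illustration of "state determines monitoring", here is a minimal Python sketch of an asset record that carries the provisioning, configuration, and operation fields listed above. The slides do not show the real asset API, so the field names, the lifecycle states, the example hosts, and the rule that only live assets are monitored are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    # Hypothetical lifecycle states; the slides do not list the real ones.
    PROVISIONING = "provisioning"
    LIVE = "live"
    MAINTENANCE = "maintenance"
    DECOMMISSIONED = "decommissioned"


@dataclass
class Asset:
    # Provisioning data
    hostname: str
    os: str
    network: str
    # Configuration data
    role: str
    # Operational data
    state: State


def should_monitor(asset: Asset) -> bool:
    """Assumed policy: only assets in the LIVE state are monitored."""
    return asset.state is State.LIVE


def monitored_hosts(assets: List[Asset]) -> List[str]:
    """Return the sorted hostnames of all assets that should be monitored."""
    return sorted(a.hostname for a in assets if should_monitor(a))


# Made-up inventory, standing in for what an asset API would return.
inventory = [
    Asset("web-001", "linux", "frontend", "webshop", State.LIVE),
    Asset("web-002", "linux", "frontend", "webshop", State.MAINTENANCE),
    Asset("db-001", "linux", "backend", "database", State.LIVE),
]

print(monitored_hosts(inventory))
```

Because monitoring is derived from the record rather than configured separately, changing an asset's state is the only action needed to take it in or out of monitoring.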

Page 9: bol.com - What would you do if you could do it all over without getting fired

The GOOD:
•  Administration is up-to-date and enforced
•  Tie key components together with APIs/scripts/whatever
•  Changes are cheap
•  Strict naming scheme allows for easier automation

The BAD:
•  It’s a good start but needs more to it!
•  Majority of infrastructure information ended up in hiera

The UGLY:
•  Need for a place to store infrastructure information
•  No CLI
•  Truth needs to be available

Page 10: bol.com - What would you do if you could do it all over without getting fired

What happened under the hood last 2 years

start
ways of working
team
platform and automation
……

Page 11: bol.com - What would you do if you could do it all over without getting fired

Configuration management

Source: http://www.craigdunn.org/2012/05/239/

Page 12: bol.com - What would you do if you could do it all over without getting fired

Config – hiera data

•  Hiera is suboptimal as a data source for complex information used by different modules / functionality

•  Solution: custom functions to retrieve only subsections of a hiera hash
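The idea of retrieving only a subsection of a large hash can be sketched in a language-neutral way. The Python example below is not the team's Puppet function (the slides do not show it); the function name `hash_subsection`, the key path, and the sample data are hypothetical and only illustrate the lookup pattern.

```python
from typing import Any, Sequence


def hash_subsection(data: dict, path: Sequence[str], default: Any = None) -> Any:
    """Walk `path` through nested dicts and return only that subsection.

    Returns `default` when any key along the path is missing.
    """
    current: Any = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical data shaped like one large hash shared by several modules.
infrastructure = {
    "webshop": {
        "database": {"host": "db-001", "port": 5432},
        "cache": {"host": "cache-001", "port": 11211},
    }
}

# A module that only cares about the database asks for just that part.
print(hash_subsection(infrastructure, ["webshop", "database"]))
```

Each consumer then depends only on the part of the hash it actually reads, instead of on the shape of the whole structure.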

Page 13: bol.com - What would you do if you could do it all over without getting fired

Config – deployments

•  Complete state is maintained, puppet installs releases.

•  Rundeck does orchestration of puppet runs, database deploys, restarts

•  More tomorrow by Steven Meunier
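Orchestration of this kind boils down to running a fixed sequence of steps and stopping when one fails. The Python sketch below shows that pattern only; it does not use Rundeck or Puppet, the step order simply follows the bullet above, and the step actions are placeholders invented for the example.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that reports success (True) or failure (False).
Step = Tuple[str, Callable[[], bool]]


def run_deployment(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


# Placeholder actions; a real job would trigger a puppet run,
# a database deploy, and a service restart here.
steps = [
    ("puppet run", lambda: True),
    ("database deploy", lambda: True),
    ("restart", lambda: True),
]

print(run_deployment(steps))
```

Stopping at the first failed step keeps a broken database deploy from being followed by a restart against a half-migrated schema.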

Page 14: bol.com - What would you do if you could do it all over without getting fired

Config – monitoring

•  Exported resources configure the Nagios checks.

•  Checks are defined on abstraction levels: role, os, etc.

•  They are then exported in the various classes of the profile layer.
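Defining checks per abstraction level rather than per host can be illustrated with a small Python sketch. It shows only how a host's checks follow from its os and role; it does not model Puppet's exported resources, and the check names, levels, and host data are assumptions chosen for the example.

```python
from typing import Dict, List

# Illustrative check definitions, keyed by abstraction level.
CHECKS_BY_OS: Dict[str, List[str]] = {
    "linux": ["check_disk", "check_load"],
}
CHECKS_BY_ROLE: Dict[str, List[str]] = {
    "webshop": ["check_http"],
    "database": ["check_db_connections"],
}


def checks_for(host: Dict[str, str]) -> List[str]:
    """Collect the checks a host inherits from its os and its role."""
    checks: List[str] = []
    checks += CHECKS_BY_OS.get(host["os"], [])
    checks += CHECKS_BY_ROLE.get(host["role"], [])
    return checks


host = {"name": "web-001", "os": "linux", "role": "webshop"}
print(checks_for(host))
```

Adding a new host with an existing role needs no monitoring work at all, which matches the goal of never defining checks on individual systems.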

Page 15: bol.com - What would you do if you could do it all over without getting fired

The GOOD:
•  Define on abstraction levels, not individual systems
•  Monitoring, logging and metrics are an integral part of our profiles
•  No separate deployment needed after installation
•  2 hours from scratch to a fully working environment
•  Destroyed and rebuilt entire environments

The BAD:
•  Puppet(db) slow due to the amount of resources
•  Prone to dependency hell

The UGLY:
•  Double administration necessary in hiera
•  Exported resources are the wrong choice for most problems

Page 16: bol.com - What would you do if you could do it all over without getting fired

What happened under the hood last 2 years

start
……
team
platform and automation
ways of working

Page 17: bol.com - What would you do if you could do it all over without getting fired

Ways of working – next steps

collaboration & shared responsibility

Continuous delivery

Page 18: bol.com - What would you do if you could do it all over without getting fired