Avoiding Downtime Using Linux High Availability Jeremy Rust [email protected] @linbit @nerdhacker

Post on 11-Mar-2020

Avoiding Downtime Using Linux High Availability

Jeremy Rust

[email protected] @linbit

@nerdhacker

Introduction & Agenda

• Downtime is not cheap
• What is High Availability? (Not a backup!)
• RAID or RAID over the network (DRBD)
• SANs and clustered applications
• The Linux cluster stack
• Cluster management with Pacemaker
• Disaster Recovery / Linking sites
• DRBD and the Cloud

DRBD HA and DR

Downtime = $$$

• Lost revenue
• Lost reputation
• Almost every business these days has a critical database or file system that it could not do without.
• HP estimates downtime at $31,705 per hour; the quoted total of $481,900 per year corresponds to about 15.2 hours of downtime, for example four outages of 3.8 hours each
• 40% of internet traffic stops when Google goes down

Source: survey "Critical IT systems in medium-sized businesses", 2013, Techconsult on behalf of HP Germany. Basis: 300 medium-sized companies from Germany.
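The HP figures can be checked with a few lines of arithmetic. The sketch below uses only the three numbers quoted on the slide. Reading 3.8 hours as the length of a single outage is an assumption made here, not something the slide states, and the function name is illustrative.

```python
# Figures quoted on the slide (HP / Techconsult survey, 2013).
COST_PER_HOUR_USD = 31_705
HOURS_QUOTED = 3.8          # assumption: interpreted as hours per outage
ANNUAL_COST_USD = 481_900


def annual_downtime_cost(cost_per_hour, hours_per_outage, outages_per_year):
    """Yearly cost of downtime for a given outage length and frequency."""
    return cost_per_hour * hours_per_outage * outages_per_year


# How many hours of downtime does the annual total imply?
implied_hours = ANNUAL_COST_USD / COST_PER_HOUR_USD

# How many 3.8-hour outages would that be?
implied_outages = implied_hours / HOURS_QUOTED

print(f"Implied downtime per year: {implied_hours:.1f} hours")
print(f"Implied outages per year:  {implied_outages:.1f}")
print(f"Cost of four such outages: ${annual_downtime_cost(COST_PER_HOUR_USD, HOURS_QUOTED, 4):,.0f}")
```

Under that reading, the annual total matches roughly four outages a year rather than a single 3.8-hour outage.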

Downtime = $$$

“YOU LOST THE DATABASE?!?!”

• “Ummm, can you ping ____?”
• “I can’t seem to reach our inventory system.”
• “Can you try pulling up this record?”

Devotion to Duty - xkcd

Why Monitor?

• Hardware dies
• DDoS attacks
• Set it and forget it mentality
• Internet connection
• Security programs

Hosting / XaaS

• Reliability
• Security
• Multi-tenant architecture
• Scalability
• Uptime

The Pillars of IT Security

Integrity

Types of Clustering Solutions

• Hardware redundancy
• SAN solutions
• NAS boxes
• External hard drives or JBODs
• Software solutions

Recovery Time/Point Objectives

What is RAID? Is it enough?

RAID

Microsoft Library http://msdn.microsoft.com/en-us/library/aa226166(v=sql.70).aspx#sql7perftune_moreinfo

What Could Go Wrong

• Your shiny new hardware will fail
• Single points of failure are dangerous
• Dropped alerts
• Internet outage
• Power outage

SAN/NAS

• Easy to implement - high cost per TB
• Large SLAs - quality of technicians
• Management via GUI
• Scalable - with the right packages
• SAN maintenance - learning curve
• Off-site replication is expensive

Single Point of Failure

[Diagram labels: SAN, NFS, MySQL, VMs]

Pitfalls

• High initial and ongoing costs
• Vendor lock-in is required
• Ongoing worry of voiding the warranty
• Maintenance is tricky and ongoing
• It is a black box, typically Solaris based
• Cannot add or remove features
• It is still a single point of failure

Software-Only Solutions

Things to look for:
• Synchronous or asynchronous replication
• Stability / maturity
• Time to recovery
• Chance of data loss
• Onsite / offsite
• Is it real time (live) or snapshots
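The link between replication mode and chance of data loss can be shown with a toy model. The sketch below is not DRBD code. It assumes one primary and one secondary held in plain Python lists, and every name in it (`ReplicatedVolume`, `drain`, `fail_primary`) is hypothetical.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary that mirrors writes to a single secondary."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []          # blocks stored on the primary
        self.secondary = []        # blocks stored on the secondary
        self.in_flight = deque()   # acknowledged writes not yet on the secondary

    def write(self, block):
        """Store a block and return once the write counts as acknowledged."""
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the secondary has the block before we acknowledge.
            self.secondary.append(block)
        else:
            # Asynchronous: acknowledge now, ship the block later.
            self.in_flight.append(block)
        return "ack"

    def drain(self, count):
        """Ship up to `count` queued blocks to the secondary."""
        for _ in range(min(count, len(self.in_flight))):
            self.secondary.append(self.in_flight.popleft())

    def fail_primary(self):
        """Primary dies: return acknowledged blocks the secondary never got."""
        lost = list(self.in_flight)
        self.in_flight.clear()
        return lost


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["a", "b", "c"]:
    sync_vol.write(block)
    async_vol.write(block)

async_vol.drain(1)  # only one block reaches the secondary before the crash
print("sync lost: ", sync_vol.fail_primary())
print("async lost:", async_vol.fail_primary())
```

In this model the synchronous volume loses nothing on failover, while the asynchronous one loses whatever was still queued. The price of synchronous mode, which the model does not show, is that every write waits on the network round trip.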

Asynchronous Architecture

Synchronous Architecture

[Diagram label: Secondary]

Layer Cake of Replication

• Virtualization
• Application
• File system
• Object store
• Block layer

http://images.pinkcakebox.com/cake696.jpg

Cluster Cake Fail

Common Issues / Pitfalls

• File locking
• Network congestion
• Data consistency / data corruption
• High overhead and/or additional CPU cycles
• Asynchronous or even backup based
• Require ongoing licensing and royalties

DRBD

• Completely hardware and application agnostic
• German engineering
• In development since 2001
• Created by LINBIT founder and CEO Philipp Reisner
• DRBD built into the native Linux kernel as of 2.6.33
• Ships in all major Linux distributions
• Does not void RHEL or Oracle support

DRBD Users

A DRBD Cluster Stack

[Diagram labels: LAN, Server, RAID, High Speed RAID-Controller, High Speed NIC, Replication Network, shared nothing, Storage]

Fully Redundant System

[Diagram: Storage 1 (Active) and Storage 2 (Passive), serving MySQL.com]

Fully Redundant System

[Diagram: Storage 1 and Storage 2. MySQL.com: Active / Passive. Ausweb.com: Passive / Active.]

Heartbeat/Corosync: The Comm Layer

• These are the communication tools of the cluster
• “Are you dead?”
• “Are you alive?”
• Heartbeat is seasoned and stable (reliability = HA)
• Corosync is newer and under development
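The "are you alive?" exchange boils down to timeout-based failure detection. The sketch below illustrates the idea only. It is not how Heartbeat or Corosync implement cluster membership. The class name, node names, and the 3-second timeout are assumptions, and timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Marks a node as dead when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> timestamp of its latest heartbeat

    def record_heartbeat(self, node, now):
        """Note that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def is_alive(self, node, now):
        """A node is alive if it was heard from within the timeout window."""
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout_seconds

    def dead_nodes(self, now):
        """Known nodes that have been silent for longer than the timeout."""
        return sorted(n for n in self.last_seen if not self.is_alive(n, now))


monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=4.0)   # node-b has gone quiet

print("node-a alive:", monitor.is_alive("node-a", now=5.0))
print("dead nodes:  ", monitor.dead_nodes(now=5.0))
```

A silent node is not necessarily a dead one, because a broken network link looks the same from the outside. Deciding what to do about a silent node is the job of the cluster manager on top of the comm layer.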

Pacemaker

The Linux Cluster Resource Manager
• The powerful and bossy cluster manager
• Manages all aspects of the system
• Decides who is alive and primary
• Well known
• Widely deployed
• Does not require applications to have specific plugins

Pacemaker: Sleep All Night

• It lets you sleep through the night even if there’s a failure.
• Highly configurable
• Used with a number of clustering tools / file systems
• Very powerful if done well, disastrous if done wrong

Linux HA Stack

Disaster Recovery / Offsite Replication

• True Disaster Recovery happens live
• Interval-based snapshots no longer meet today’s SLA requirements
• DRBD does real-time replication on-site and off-site
• The DRBD Proxy tool mitigates throughput constraints and latency, and is highly configurable

Real-time Disaster Recovery

Scaling DRBD

• DRBD Proxy is typically deployed in 3-node configurations.
• Extremely configurable
• Proxy mitigates bandwidth constraints and latency
• Can replicate across 4 machines, even across distances

3 node HA / DR

[Diagram: Location A (Live Site) and Location B (DR Site), with three Proxy instances]

4 node DR + Active-Active HA

[Diagram: Location A (Live Site) and Location B (DR Site), with four Proxy instances]

Dedicated Proxy - Many Resources

[Diagram: Location A (Live Site) and Location B (DR Site), each with a Proxy on a Dedicated Server]

How to apply this in your cloud

http://www.gamesparks.com/wp-content/uploads/2013/07/the-cloud.jpg

DRBD works in the cloud and AWS VPC

DRBD can be used as backing storage for iSCSI

On native bare hardware or as part of your hardware or software appliance

HA with Nagios!

• Filesystem (which has many symlinks in it)
• MySQL
• PostgreSQL
• Crond
• Ndo2db
• The Nagios application itself
• A Virtual IP

Q+A Jeremy Rust

[email protected] @NerdHacker

877-DRBD247

www.linkedin.com/in/RustJeremy

DRBD.org Linbit.com

Linux-HA.org

DRBD 9: the future

DRBD 8 Branch build structure

DRBD 9 Branch build structure

2 Fully Redundant Systems

[Diagram: Storage 1 and Storage 2. MySQL.com: Active / Passive. Ausweb.com: Passive / Active.]