disaster recovery

36
Beolink.org Fabrizio Manfredi Furuholmen Create a backup site with opensource

Upload: manfred-furuholmen

Post on 06-Dec-2014

815 views

Category:

Business


2 download

DESCRIPTION

Disaster recovery and business continuity planning are processes which help organizations prepare for disruptive events. The talk explains the basic concepts of business continuity, giving a brief overview on the business continuity plan and more detail informations (technical) on how to setup a Disaster Recovery site . We show two different approaches for creating a disaster recovery (DR) site, one the based on operating system layer and one based on the right design of the applications . The common elements on the two approaches are network design, data replication, monitoring system and system/configuration management. All these elements can be implemented with open source software, we explain advantages and disadvantages, performances and layouts on each solutions.

TRANSCRIPT

Page 1: Disaster recovery

Beolink.org Fabrizio Manfredi Furuholmen

Create a backup site with opensource

Page 2: Disaster recovery

Beolink.org

8/23/09 2

Agenda

  Introduction (Part I)   Disaster Recovery Plan   Business continuity Plan

  Disaster Recovery Components (Part II)   Starting point   Software list   Design Pattern

  Case Studies (Part III)

Page 3: Disaster recovery

Beolink.org Disaster Statistics & Potential Implications   90% of businesses that lose data from a disaster are forced to close

down within 2 years since the disaster.

  80% of businesses without a well structured recovery plan are forced to close down within 12 months since the flood or fire.

  43% of companies experiencing disasters never recover.

  50% of companies experiencing a computer outage will be forced to close down within 5 years

  Companies experiencing a computer outage lasting longer than 10 days will never recover its full financial capacity.

  Less than 50% of all organizations in the UK have a business continuity plan or disaster recovery plan.

  One out of 500 data centers experience a severe disaster every year.

8/23/09 3

Page 4: Disaster recovery

Beolink.org

8/23/09 4

What is a Disaster Recovery Plan ?

Disaster recovery plan (DRP) Disaster recovery plan

consists of the precautions taken so that the effects of a disaster will be minimized and the organization will be able to either maintain or quickly resume mission critical functions

Page 5: Disaster recovery

Beolink.org Classification of Risk

 Natural disasters Such as tornadoes, floods, blizzards, earthquakes and

fire  Accidents  Sabotage  Power and energy disruptions  Service sector failure Such as communications, transportation …  Environmental disasters Such as pollution and hazardous materials spills  Cyber attacks and hacker activity

8/23/09 5

Page 6: Disaster recovery

Beolink.org BIA

  How many transactions can be lost without significantly impacting revenue or productivity?

  Does a majority of the businesses depend upon one or more mission critical applications?

  How much revenue per hour would be lost when these critical applications remain unavailable?

  Are there periods of time, for example the end of fiscal quarters or holiday seasons, when an outage would cause a greater disruption?

  How will productivity be affected if critical applications become unavailable?

  In the event of an unexpected outage, how will your partners, vendors and customers be affected ?

  Historically, what has been the total cost of lost productivity and revenue during downtime?

8/23/09 6

Page 7: Disaster recovery

Beolink.org Business Continuity Plan

Disaster recovery planning (DRP) is a subset of a larger process known as Business Continuity Plan.

Business Continuity Plan (BCP) A business continuity plan

enables critical services or products to be continually delivered to clients.

8/23/09 7

Page 8: Disaster recovery

Beolink.org Technology problem ?

Exercise, Maintain & Audit Exercising BCM Controls BS 25999 compliance

Developing BCM Culture Assessing Designing and Delivery Measuring Results

Developing & Implementing BCM Response Crisis Management Business Continuity Plans Business Unit Plan

BCM Strategies Organization BCM Strategy Process Level BCM Strategy Resource Recovery BCM Strategy

Understand your business Organizational Strategy Business Impacy Analysis Risk Assessment & Control

8/23/09 8

Page 9: Disaster recovery

Beolink.org Disaster recovery starting point

 Recovery point objective (RPO) Maximum data you can loose

 Recovery time objective (RTO) The amount of time you need to restart from

the RPO

8/23/09 9

Page 10: Disaster recovery

Beolink.org Basic Steps

8/23/09 10

 System Duplication  Hardware  Network  Operating System  Application

 Data Replication  Filesystem  Database  Application Data

 Switch Procedure   Activation

 Rollback  How to switch back  How to merge data

Page 11: Disaster recovery

Beolink.org Disaster Recovery Components

Elements  Routing  Application  Data  Infrastructures

8/23/09 11

Working mode  Off-line  Near on-line  Online

Page 12: Disaster recovery

Beolink.org Standby

8/23/09 12

Type OS Database Application

Off Line OFF OFF OFF

Near Online

ON OFF/ON OFF

Online ON ON ON

Several minutes

Zero Keep your DR site unreachable until switch

Low Cost High Cost

Page 13: Disaster recovery

Beolink.org Bottlenecks

 Latency (speed of light)  Asynchronous operation  MultiStream/Parallel

8/23/09 13

 Bandwidth  Caching  Compression (AVG 60%)  Diffs transmission

Page 14: Disaster recovery

Beolink.org Build a Disaster recovery site

What do you need ?

8/23/09 14

Page 15: Disaster recovery

Beolink.org Infrastructure Software

8/23/09 15

Componets Software

Virtualization Vmware, XEN, KVM, Cloud computing, jeos

Firewall Pfsense, iptable

System Configuration

Cfengine, puppet

Snapshoot ZFS,LVM

Montoring Nagios, Zenoss, Zabbix

Page 16: Disaster recovery

Beolink.org Data Layer

Off-Line • Amanda, Bacula, tar :

incremenetal, os dump

Near on-line • rsync: compression,

incremental

On Line • GlousterFS, openAFS, DB

Replication

8/23/09 16

Days

Hours

Mins

Secs

Tape Backup

Periodic Replication

Asynchronous Replication

Synchronous Replication

Page 17: Disaster recovery

Beolink.org Data Layer: Amanda

Amanda.conf: inparallel 4 # maximum dumpers that will run in parallel (max 63) # this maximum can be increased at compile-time, # modifying MAX_DUMPERS in server-src/driverio.h

netusage 600 Kbps # maximum net bandwidth for Amanda, in KB per se

define dumptype comp-high { global comment "very important partitions on fast machines" compress client best kencrypt no priority high }

define interface lo0 { comment "a local disk" use 1000 kbps }

8/23/09 17

Compression   Client Fast/Best   Server Fast/Best

Multi Stream:   Number of parrells dump

Diskbase   Holding disk   changer file base

Bandwith   Global   Per interface

Encryption   SSH   KRB5 DiskList:

Dbserver /dumps comp-user-tar …

Page 18: Disaster recovery

Beolink.org Data Layer: rsync

8/23/09 18

Diffs Only actual changed pieces of files are transferred, rather than the whole file.

Compression The tiny pieces of diffs are then compressed on the fly, further saving you file transfer time and reducing the load on the network.

Encryption The stream from rsync is passed through the ssh protocol to encrypt your session.

Latancy Pipelining of file transfers to minimize latency costs

Simple Usage

rsync --verbose --progress --stats --compress --rsh=/usr/local/bin/ssh --recursive --times --perms --links --delete --exclude "*bak" --exclude "*~” /www/* webserver:path_name

Page 19: Disaster recovery

Beolink.org Data Layer: GlusterFS

Features   Automatic Replication   Aggregation   Scalable Striping   Distributed Locking   Performance Modules   Pluggable I/O Scheduler   Pluggable Transport   Pluggable Auth

Administration   NFS-like Backend   Self-healing   Staking Vol/modules

And much more ..

8/23/09 19

Page 20: Disaster recovery

Beolink.org Data Layer: GlusterFS

volume posix type storage/posix option directory /data/export end-volume

volume locks type features/locks subvolumes posix end-volume

Volume brick type performance/io-threads option thread-count 8 subvolumes locks end-volume

volume server type protocol/server option transport-type tcp option auth.addr.brick.allow * subvolumes brick end-volume

8/23/09 20

Replication   Client side   Server side

Raid   Raid 1   Stripe   Raid 10

Performance   I/O modules   Cache modules

volume remote1 type protocol/client option transport-type tcp option remote-host storage1.example.com option remote-subvolume brick end-volume

volume remote2 … end-volume

volume replicate type cluster/replicate subvolumes remote1 remote2

end-volume

volume writebehind type performance/write-behind option window-size 1MB subvolumes replicate end-volume

volume cache type performance/io-cache option cache-size 512MB subvolumes writebehind end-volume

Page 21: Disaster recovery

Beolink.org Data Layer: Database

Builtin  Mysql Master – Slave  Mysql Multi-Master

8/23/09 21

External  Tungsten Replicator  Slony-I (postgresql)

Page 22: Disaster recovery

Beolink.org Application Layer

 Data replication  Cluster JDBC  Group messaging (spread,

jgroup, ..)  Custom protocol

 Distributed Application Design  Data must be Unique  Primary key unique for all

sites (hash, mixing key ..)

8/23/09 22

Page 23: Disaster recovery

Beolink.org Application Layer: spread

Spread functions as a unified message bus for

distributed applications, and provides highly tuned application-level multicast, group communication, and point to point support.

  Reliable and scalable messaging and group communication   Easy to use, deploy and maintain.   Highly scalable from one local area network to complex wide

area networks.   Supports thousands of groups with different sets of

members.   Enables message reliability in the presence of machine

failures, process crashes and recoveries, and network partitions and merges.

  Provides a range of reliability, ordering and stability guarantees for messages.

  .Completely distributed algorithms with no central point of failure.

8/23/09 23

Page 24: Disaster recovery

Beolink.org Application Layer: Example

Session

  Distributed file system

  Replicated Database

  AppServer Cluster

8/23/09 24

Page 25: Disaster recovery

Beolink.org Routing Layer

 DNS: Bind  keep low value of TTL  dynamic update

 MPLS  Change MPLS configuration, keep

the same ip configuration

 Routing Table : quagga  OSPF/BGP announce new routing

for your side

8/23/09 25

Several minutes

Seconds

Page 26: Disaster recovery

Beolink.org Routing Layer: Quagga

 Quagga  CISCO IOS syntax  All routing protocol  Device base interface

! Work only for !  Autonomous System  Large network

8/23/09 26

Quagga.conf:

interface eth0 ip address XXX.XXX.XXX.XXX ! … Ospf.conf

router ospf ospf router-id XXX.XXX.XXX.XX log-adjacency-changes compatible rfc1583 auto-cost reference-bandwidth 10000 network XXX.XXX.XXX.XXX/XX area …

Page 27: Disaster recovery

Beolink.org Routing Layer: Heart

Monitoring System (outside)  Probe/check external  Remote Invocation scripts  Manual/Automatic

8/23/09 27

Linux HA (inside)  Cluster definition with Quorum

Server  Local Scripts / Fencing script  Manual/Automatic

Page 28: Disaster recovery

Beolink.org Important Things

Business critical

Application Which order to

do things How the service

is linked to others

When to run them (1)

When to run them (2)

Keep close disaster

recovery site

Where have you put

employees ?

8/23/09 28

Page 29: Disaster recovery

Beolink.org Case Studies

8/23/09 29

Case Studies

Page 30: Disaster recovery

Beolink.org Case study

Company type: Heavy industry Location: Single production site Web Farm/IT: Inside the production site APPS: Order management, production control RPO/RTO: 2h DR site: distance 400 km Cost: a Lot

Disaster Event: Flooding of the production site

GOOD Solution ?

8/23/09 30

Page 31: Disaster recovery

Beolink.org Case study

What happened ? Perfect plan, everything OK

But… Production restart after 6 months …

WRONG SOLUTION   Disaster not related to Business   Misunderstanding between BCP and DRP

IT Manager was Fired

8/23/09 31

Page 32: Disaster recovery

Beolink.org Base Solution: small-middle companies

8/23/09 32

 Switch by hand   Configuration   Restore   Starting point

 RTO/RPO Hours   Linear to storage

dimension

 Synchronization   Backup

 No more licenses needed

Page 33: Disaster recovery

Beolink.org Medium/High Solution: Enterprise

8/23/09 33

 Automatic Switch   IP migration (one IP

per service)   Change Firewall role

 RTO/RPO seconds

 Synchronization   Multi-master FS   Database Replication   Configuration

Management

 Double licenses needed

Page 34: Disaster recovery

Beolink.org Conclusion

8/23/09 34

“Everything right now” is

hard Only Business

Application Right RTO/

RPO Keep it Simple

Bandwidth Latency Are you sure the data is on

the filesystem ?

Manual rollback

Periodic Testing

Donʼt forget your time

Donʼt forget the power of

rsync How Far Away is Far Enough?

Page 35: Disaster recovery

Beolink.org I look forward to meeting you…

XVI European AFS meeting 2009 Rome: September 28-30

Who should attend:   Everyone interested in deploying a globally accessible file

system   Everyone interested in learning more about real world usage

of Kerberos authentication in single realm and federated single sign-on environments

  Everyone who wants to share their knowledge and experience with other members of the AFS and Kerberos communities

  Everyone who wants to find out the latest developments affecting AFS and Kerberos

More Info: www.openafs.it

8/23/09 35

Page 36: Disaster recovery

Beolink.org

Thank you [email protected] www.beolink.org