disast~1

7/31/2019 DISAST~1


System Administration Made Easy 2 –1

Chapter 2: Disaster Recovery

Contents

Overview
Why Plan for a Disaster?
Planning for a Disaster
Test your Disaster Recovery Procedure
Other Considerations
Minimizing the Chances for a Disaster


Release 4.6 A/B 2 –2

Overview

The purpose of this chapter is to help you understand what we feel is the most critical job of a system administrator: disaster recovery.

We included this chapter at the beginning of our guidebook for two reasons:

• To emphasize the importance of the subject
  Disaster recovery needs to be planned as soon as possible, because it takes time to develop, test, and refine.
• To emphasize the importance of being prepared for a potential disaster
  Murphy’s Law says: “Disaster will strike when you are not prepared for it.”

The sooner you begin planning, the more prepared you will be when a disaster does happen.

This chapter is not a disaster recovery “how to.” It is only designed to get you thinking and working on disaster recovery.

What Is a Disaster?

The goal of disaster recovery is to restore the system so that the company can continue doing business. A disaster is anything that results in the corruption or loss of the R/3 System.

Examples include:

• Database corruption
  For example, when test data is accidentally loaded into the production system. This happens more often than people realize.
• A serious hardware failure
• A complete loss of the R/3 System and infrastructure
  For example, the destruction of the building due to natural disaster.

The ultimate responsibility of a system administrator is to successfully restore R/3 after a disaster.

The ultimate consequence of not restoring the system is that your company goes out of business.

The administrator’s goal is to prevent the system from ever reaching the situation where the ultimate responsibility is called upon.

Disaster recovery planning is a major project. Depending on your situation and the size and complexity of your company, disaster recovery planning could take more than a year to prepare, test, and refine. The plan could fill many volumes. This chapter helps you start thinking about and planning for disaster recovery.

Why Plan for a Disaster?

• A system administrator should expect and plan for the worst, and then hope for the best.
• During a disaster recovery, nothing should be done for the first time.
  Unpleasant surprises could be fatal to the recovery process.

Here are some of the reasons to develop a disaster recovery plan:

• Will business operations stop if R/3 fails?
• How much lost revenue and cost will be incurred for each hour that the system is down?
• Which critical business functions cannot be completed?
• How will customers be supported?
• How long can the system be down before the company goes out of business?
• Who is coordinating and managing the disaster recovery?
• What will the users do while R/3 is down?
• How long will the system be down?
• How long will it take before the R/3 System is available for use?

If you plan properly, you will be under less stress, because you know that the system can be recovered and how long this recovery will take.

If the recovery downtime is unacceptable, management should invest in:

• Equipment, facilities, and personnel
• High availability (HA) options
  HA options can be expensive. There are different degrees of HA, so customers need to determine which option is right for them.

HA is an advanced topic beyond the scope of this guidebook. If you are interested in this topic, contact an HA vendor.

Planning for a Disaster

This chapter is not a disaster recovery “how to.” It is only designed to get you thinking and working on disaster recovery.

Creating a Plan

Creating a disaster recovery plan is a major project because:

• It can take over a year and considerable effort to develop, test, and document.
• The documentation may be extensive (literally thousands of pages long).

If you do not know how to plan for a disaster recovery, get the assistance of an expert. A bad plan (one that will fail) is worse than no plan, because it provides a false sense of security.

What Are the Business Requirements for Disaster Recovery?

Who will provide the requirements?

• Senior management needs to provide global (or strategic) requirements and guidelines.
• The business units’ needs drive the specific detailed requirements.
  These units should understand that as the required recovery time decreases, the cost of disaster recovery increases. The units should budget for it, or if the funds come from an administrative or IT budget, the units should support it.

What are the requirements? Each requirement should answer the following questions:

• Who is the requestor?
• What is the requirement?
• Are other departments or customers affected by this requirement?
• Why is the requirement necessary?
  When R/3 is offline, what does (or does not) happen?
  What is the cost (or lost revenue) of an hour or a day of R/3 downtime?
  The justification should be a concrete, objective value (such as $20,000 an hour).

Define the cost (per hour, per day, etc.) of having the R/3 System down.

Example

What: No more than one hour of transaction data may be lost.
Why: The cost is 1,000 lost transactions per hour that were entered in R/3 and cannot be recreated from memory.
  This inability to recreate lost transactions may result in lost sales and upset customers. If the lost orders are those that the customer needs quickly, this situation can be critical.

Example

What: The system cannot be offline for more than three hours.
Why: The cost (an average of $25,000 per hour) is the inability to book sales.

Example

What: In the event of disaster, such as the loss of the building containing the R/3 data center, the company can only tolerate a two-day downtime.
Why: At that point, permanent customer loss begins.
Other: There must be an alternate method of continuing business.
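
A short calculation can make justifications like these concrete. The following is a minimal sketch in Python and is not part of the original guidebook: the function names are hypothetical, and the rates used ($25,000 per hour, 1,000 transactions per hour) are simply the sample figures from the examples above.

```python
def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Estimated business cost of an outage, in the currency of cost_per_hour."""
    if hours_down < 0 or cost_per_hour < 0:
        raise ValueError("hours_down and cost_per_hour must be non-negative")
    return hours_down * cost_per_hour


def transactions_at_risk(hours_of_data_lost: float, transactions_per_hour: float) -> float:
    """Number of transactions that would have to be recreated after a data loss."""
    if hours_of_data_lost < 0 or transactions_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return hours_of_data_lost * transactions_per_hour


if __name__ == "__main__":
    # Three hours offline at an average of $25,000 per hour.
    print(downtime_cost(3, 25_000))
    # One hour of lost data at 1,000 transactions per hour.
    print(transactions_at_risk(1, 1_000))
```

Replace the sample rates with values agreed on with your own business units.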

When Should a Disaster Recovery Procedure Begin?

Ask yourself the following questions:

• What criteria constitute a disaster?
• Have these criteria been met?
• Who needs to be consulted?
  The person must be aware of the effect of the disaster on the company’s business and the critical nature of the recovery.

Expected Downtime or Recovery Time

Expected Downtime

Expected downtime is only part of the business cost of disaster recovery. For defined scenarios, it is the expected minimum time before R/3 can be productive again. Downtime may mean that no orders can be processed and no products shipped.

Management must approve this cost, so it is important that they understand that downtimes are potential business costs.

To help business continue, it is important to find out if there are alternate processes that can be used while the R/3 System is being recovered.

The following costs are involved with downtimes:

• The length of time that R/3 is down.
  The longer the system is down, the longer the catch-up period when it is brought back up. The transactions from the alternate processes that were in place during the disaster have to be applied to the system to make it current. This situation is more critical in a high-volume environment.
• A downed system is more expensive during the business day, when business activity would stop, than at the end of the business day, when everyone has gone home.
• When customers cannot be serviced or supported, they may be lost to a competitor.

The duration of acceptable downtime depends on the company and the nature of its business.

Recovery Time

Unless you test your recovery procedure, the recovery time is only an estimate, or worse, a guess. Different disaster scenarios have different recovery times, which are based on what needs to be done to become operational again.

The time to recover must be matched to the business requirements. If this time is greater than the business requirements allow, the mismatch needs to be communicated to the appropriate managers or executives.

Resolving this mismatch involves either:

• Investing in equipment, processes, and facilities to reduce the recovery time
• Changing the business requirements to accept the longer recovery time, and accepting the consequences

An extreme (but possible) example: A company cannot afford the cost and lost revenue for the month it would take one person to recover the system. During that time, the competition would take away customers, payment would be due to vendors, and bills would not be collected. In this situation, senior management needs to allocate resources to reduce the recovery time to an acceptable level.
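
The comparison between recovery time and business requirement can be written down as a simple check. This is a minimal sketch in Python, not a method from the guidebook: the function names are hypothetical, and pairing the sample downtimes with the sample tolerances from this chapter is only an illustration.

```python
def recovery_gap(recovery_hours: float, tolerated_hours: float) -> float:
    """Hours by which the recovery time exceeds what the business tolerates (0 if none)."""
    return max(0.0, recovery_hours - tolerated_hours)


def mismatches(scenarios: dict[str, tuple[float, float]]) -> list[str]:
    """Return one line per scenario whose recovery time exceeds the requirement.

    `scenarios` maps a scenario name to (recovery hours, tolerated hours).
    """
    lines = []
    for name, (recovery, tolerated) in scenarios.items():
        gap = recovery_gap(recovery, tolerated)
        if gap > 0:
            lines.append(f"{name}: recovery exceeds requirement by {gap:g} hours")
    return lines


if __name__ == "__main__":
    # Illustrative pairing only: (recovery time, tolerated downtime) in hours.
    sample = {
        "corrupt database": (8, 3),
        "loss of facility": (8 * 24, 2 * 24),
        "minor outage": (2, 3),
    }
    for line in mismatches(sample):
        print(line)
```

Every line this check reports is a mismatch to raise with the appropriate managers. Recovery times should come from tests rather than estimates.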

Recovery Group and Staffing Roles

There are four key roles in a recovery group. The number of employees performing these roles will vary depending on your company size. In a smaller company, for example, the recovery manager and the communication liaison could be the same person. Titles and tasks will probably differ based on your company’s needs.

We defined the following key roles:

• Recovery manager
  Manages the entire technical recovery. All recovery activities and issues should be coordinated through this person.
• Communication liaison
  Handles user phone calls and keeps top management updated with the recovery status. One person handling all phone calls allows the group doing the technical recovery to proceed without interruptions.

• Technical recovery team
  Does the actual technical recovery. As the recovery progresses, the original plan may have to be modified. This role must manage the changes and coordinate the technical recovery.
• Review and certification manager
  Coordinates and plans the post-recovery testing and certification with users.

To reduce interruption of the recovery staff, we recommend you maintain a status board. The status board should list key points in the recovery plan and an estimate of when the system will be recovered and available to use.

• If the disaster is a major geographical event (like an earthquake), your local staff will be more concerned with their families than with the company.
• Depending on the disaster, key personnel could be injured or killed.

You should expect and plan for these situations. Plan for staff from other geographic sites to be flown in and participate as disaster recovery team members.

A final staffing consideration is to plan for at least one staff member to be “unavailable.” Without this person, the rest of the department must be able to perform a successful recovery. This issue may become vital during an actual disaster.

Types of Disaster Recovery

Disaster recovery scenarios can be grouped into two types:

• Onsite
• Offsite

Onsite

Onsite recovery is disaster recovery done at your site. The infrastructure usually remains intact. The best case scenario is a recovery done on the original hardware. The worst case scenario is a recovery done on a backup system.

Offsite

Offsite recovery is disaster recovery done at a disaster recovery site. In this scenario, all hardware and infrastructure are lost as a result of facility destruction such as a fire, a flood, or an earthquake. The new servers must be configured from scratch.

A major consideration is that once the original facility has been rebuilt and tested, a second restore must take place back to the customer’s original facility. This second restore can be planned and scheduled at a convenient time to disrupt as few users as possible, but its timing is just as critical as during the disaster: while the system is being recovered, it is down.

Disaster Scenarios

There are an infinite number of disaster scenarios that could occur. It would take an infinite amount of time to plan for them, and you will never account for all of them. To make this task manageable, you should plan for at least three and no more than five scenarios. In the event of a disaster, you would adapt the closest scenario(s) to the actual disaster.

The disaster scenarios are made up of:

• Description of the disaster event
• High-level plan of major tasks to be performed
• Estimated time to have the system available to the users

To create your final scenarios:

1. Use the Three Common Disaster Scenarios section below as a starting point.
2. Prepare three to five scenarios that cover a wide range of disasters that would apply to you.
3. Create a high-level plan (made up of major tasks) for each scenario.
4. Test the planned scenarios by creating different test disasters and determining if (and how) your scenario(s) would adapt to an actual disaster.
5. If the test scenario(s) cannot be adapted, modify or develop more scenarios.
6. Repeat the process.

Three Common Disaster Scenarios

The following three examples are ordered from best case to worst case.

The downtimes in the examples below are only samples. Your downtimes will be different. You must replace the sample downtimes with the downtimes applicable to your environment.

A Corrupt Database

• A corrupt database could result from:
  - Accidentally loading test data into the production system
  - A bad transport into production, which results in the failure of the production system
• Such a disaster requires the recovery of the R/3 database and related operating system files.
• The “sample” downtime is eight hours.

A Hardware Failure

• The following types of items may fail:
  - A system processor
  - A drive controller
  - Multiple drives in a drive array, so that the drive array fails
• Such a disaster scenario requires:
  - Replacing failed hardware
  - Rebuilding the server (operating system and all programs)
  - Recovering the R/3 database and related files
• The “sample” downtime is seven days and comprises:
  - Five days to procure replacement hardware
  - Two days to rebuild the NT server (one person); 16 hours of actual work time

A Complete Loss or Destruction of the Server Facility

• The following items can be lost:
  - Servers
  - All supporting infrastructure
  - All documentation and materials in the building
  - The building
• A complete loss of the facility can result from the following types of disasters:
  - Fire
  - Earthquake
  - Flood
  - Hurricane
  - Tornado
  - Man-made disasters, such as the World Trade Center bombing
• Such a disaster requires:
  - Replacing the facilities
  - Replacing the infrastructure
  - Replacing lost hardware
  - Rebuilding the server and R/3 environment (hardware, operating system, database, etc.)
  - Recovering the R/3 database and related files
• The “sample” downtime lasts eight days and comprises:
  - At least five days to procure hardware
    In a regional disaster, this purchase could take longer if your suppliers were also affected by the disaster. Use national vendors with several regional distribution centers and, as a backup, have an out-of-area alternate supplier.
  - Two days to rebuild the NT server (one person); 16 hours of actual work time
    As the hardware is procured and the server is being rebuilt, an alternate facility is obtained and an emergency (minimal) network is constructed.
  - One day to integrate into the emergency network
• Complete loss or destruction requires a recovery back to a new facility.
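
The sample downtimes in the three scenarios are sums of their major tasks. The sketch below, in Python, adds them up in hours. It is an illustration only: the dictionary layout and function name are hypothetical, and the durations are the guidebook's sample values, which you must replace with your own.

```python
HOURS_PER_DAY = 24

# Sample figures quoted in the three scenarios; replace with your own.
SAMPLE_SCENARIOS = {
    "corrupt database": [
        ("recover database and related files", 8),
    ],
    "hardware failure": [
        ("procure replacement hardware", 5 * HOURS_PER_DAY),
        ("rebuild the server", 2 * HOURS_PER_DAY),
    ],
    "loss of facility": [
        ("procure hardware", 5 * HOURS_PER_DAY),
        ("rebuild the server", 2 * HOURS_PER_DAY),
        ("integrate into the emergency network", 1 * HOURS_PER_DAY),
    ],
}


def total_downtime_hours(tasks: list[tuple[str, float]]) -> float:
    """Sum the durations of tasks that run one after another."""
    return sum(hours for _, hours in tasks)


if __name__ == "__main__":
    for name, tasks in SAMPLE_SCENARIOS.items():
        hours = total_downtime_hours(tasks)
        print(f"{name}: {hours} hours ({hours / HOURS_PER_DAY:g} days)")
```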

Recovery Script

What

A recovery script is a document that provides step-by-step instructions about:

• The process required to recover R/3
• Who will complete each step
• The expected time for long steps
• Dependencies between steps

Why

A script is necessary because it helps you:

• Develop and use a proven series of steps to restore R/3
• Prevent missing steps
  Missing a critical step may require restarting the recovery process from the beginning, which delays the recovery.

If the primary recovery person is unavailable, a recovery script helps the backup person complete the recovery.

Creating a Recovery Script

Creating a recovery script requires:

• A checklist for each step
• A document with screenshots to clarify the instructions, if needed
• Flowcharts, if the flow of steps or activities is critical or confusing
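
The four elements of a recovery script (the step, who completes it, the expected time, and the dependencies) map naturally onto a small data structure. The following Python sketch is one possible representation, not the guidebook's format. The `Step` class, the `execution_order` function, and every step name, owner, and duration are hypothetical placeholders. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to derive an order that respects the dependencies.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass
class Step:
    name: str
    owner: str              # who will complete the step
    expected_hours: float   # expected time for the step
    depends_on: list[str] = field(default_factory=list)


def execution_order(steps: list[Step]) -> list[str]:
    """Return step names in an order that respects every dependency."""
    known = {s.name for s in steps}
    for s in steps:
        missing = set(s.depends_on) - known
        if missing:
            raise ValueError(f"{s.name} depends on unknown steps: {sorted(missing)}")
    graph = {s.name: set(s.depends_on) for s in steps}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical script: names, owners, and times are placeholders.
SCRIPT = [
    Step("restore database", "DBA", 6, depends_on=["install database software"]),
    Step("install operating system", "server admin", 4),
    Step("install database software", "DBA", 2, depends_on=["install operating system"]),
    Step("user acceptance test", "key users", 3, depends_on=["restore database"]),
]

if __name__ == "__main__":
    for name in execution_order(SCRIPT):
        print(name)
```

Catching an unknown or circular dependency while writing the script is cheaper than finding the missing step during a recovery.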

Recovery Process

To reduce recovery time, define a process by:

• Completing as many tasks as possible in parallel
• Adding timetables for each step
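
The effect of running tasks in parallel can be estimated from the task durations and their prerequisites. The Python sketch below computes the earliest finish time of each task, assuming enough staff to start every task as soon as its prerequisites are done. It is an illustration, not a planning tool: the function name is hypothetical, and the four-day duration for the alternate facility is an assumed value that does not come from the guidebook.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def earliest_finish(tasks: dict[str, tuple[float, list[str]]]) -> dict[str, float]:
    """Map each task to its earliest finish time.

    `tasks` maps a task name to (duration, list of prerequisite task names).
    Every prerequisite must itself be a key of `tasks`.
    """
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    finish: dict[str, float] = {}
    for name in TopologicalSorter(graph).static_order():
        duration, deps = tasks[name]
        start = max((finish[d] for d in deps), default=0.0)
        finish[name] = start + duration
    return finish


# Durations in days. The facility duration is an assumption for illustration.
PLAN = {
    "procure hardware": (5, []),
    "set up alternate facility and network": (4, []),
    "rebuild server": (2, ["procure hardware"]),
    "integrate into emergency network": (
        1,
        ["rebuild server", "set up alternate facility and network"],
    ),
}

if __name__ == "__main__":
    finish = earliest_finish(PLAN)
    print("elapsed with parallel work:", max(finish.values()), "days")
    print("elapsed if done one at a time:", sum(d for d, _ in PLAN.values()), "days")
```

The per-task finish times double as a first draft of the timetable for each step.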

Major Steps

1. During a potential disaster, anticipate a recovery by:
   • Collecting facts
   • Recalling the latest offsite tapes
   • Recalling the crash kit (see the Crash Kit section for more information)
   • Calling all required personnel
     These personnel include the internal SAP team, affected key users, infrastructure support, IT, facilities, on-call consultants, etc.
   • Preparing functional organizations (sales, finance, and shipping) for alternate procedures for key business transactions and processes
2. Minimize the effect of the disaster by:
   • Stopping all additional transactions into the system
     Waiting too long could worsen the problem.
   • Collecting transaction records that have to be manually reentered
3. Begin the planning process by:
   • Analyzing the problem
   • Fitting the disaster to your predefined scenario plans
   • Modifying the plans as needed
4. Define when to initiate a disaster recovery procedure.
   • What are the criteria to declare a disaster, and have they been met?
   • Who will make the final decision to declare a disaster?
5. Declare the disaster.
6. Perform the system recovery.
7. Test and sign off on the recovered system.
   Key users, who will use a criteria checklist to determine that the system has been satisfactorily recovered, should perform the testing.
8. Catch up with transactions that may have been handled by alternate processes during the disaster.
   Once completed, this step should require an additional sign-off.
9. Notify the users that the system is ready for normal operations.
10. Conduct a postmortem debriefing session.
    Use the results from this session to improve your disaster recovery planning.

Crash Kit

What

A crash kit contains everything needed to:

• Rebuild the R/3 servers
• Reinstall R/3
• Recover the R/3 database and related files

Why

During a disaster, everything that is needed to recover the R/3 environment is contained in one (or a few) containers. If you have to evacuate the site, you will not have the time to run around gathering the items at the last minute, hoping that you get everything you need.

In a major disaster, you may not even have that opportunity.

When

When a change is made to a component (hardware or software) on the server, replace the outdated items in the crash kit with updated items that have been tested.

A periodic review of the crash kit should be performed to determine if items need to be added or changed. A service contract is a perfect example of an item that requires this type of review.

Where to Put the Crash Kit

The crash kit should be physically separated from the servers. If it is located in the server room, and the server room is destroyed, this kit is lost.

Some crash kit storage areas include:

• Commercial offsite data storage
• Other company sites
• Another secure section of the building

How

The following is an inventory list of some of the major items to put into the crash kit. You will need to add or delete items for your specific environment. This inventory list is organized into the following categories:

• Documentation
• Software

Documentation

An inventory of the crash kit should be taken by the person who seals the kit. If the seal is broken, items may have been removed or changed, making the kit useless in a recovery.

The inventory list below must be signed and dated by the person checking the crash kit. The following documentation must be included in the crash kit:

• Disaster recovery script
• Installation instructions for the:
  - Operating system
  - Database
  - R/3 System
• Special installation instructions for:
  - Drivers that have to be manually installed
  - Programs that must be installed in a specific manner
• Copies of:
  - SAP license for all instances
  - Service agreements (with phone numbers) for all servers
    Check that the maintenance agreements are still valid and have not expired. This check should be a regularly scheduled task.
• Instructions to recall tapes from offsite data storage
• List of personnel authorized to recall tapes from offsite data storage
  This list must correspond to the list maintained by the data storage company.
• A parts list
  If the server is destroyed, this list should be in sufficient detail to purchase or lease replacement hardware. Over time, if original parts are no longer available, an alternate parts list will have to be prepared. At this point, you might consider upgrading the equipment.
• File system layout
• Hardware layout
  You need to know which:
  - Cards go in which slots
  - Cables go where (connector-by-connector)
  Labeling cables and connectors greatly reduces confusion.
• Phone numbers for:
  - Key users
  - Information services personnel
  - Facilities personnel
  - Other infrastructure personnel
  - Consultants (SAP, network, etc.)
  - SAP hotline
  - Offsite data storage
  - Security department or personnel
  - Service agreement contacts
  - Hardware vendors

Software

• Operating system:
  - Installation kit
  - Drivers for hardware, such as a Network Interface Card (NIC) or a SCSI controller, which are not included in the installation kit
  - Service packs, updates, and patches
• Database:
  - Installation kit
  - Service packs, updates, and patches
  - Recovery scripts, to automate the database recovery
• For R/3:
  - Installation kit
  - Currently installed kernel
  - System profile files
  - tpparam file
  - saprouttab file
  - saplogon.ini
• Other R/3 integrated programs (for example, a tax package)
• Other software for the R/3 installation:
  - Utilities
  - Backup
  - UPS control program
  - Hardware monitor
  - FTP client
  - Remote control program
  - System monitor
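
During a periodic review, the sealed inventory can be compared against the list of required items. The Python sketch below shows one way to do this, assuming both lists are kept as plain text entries. The function names are hypothetical, and the required list shown is only a small excerpt of the inventory above.

```python
def normalize(item: str) -> str:
    """Compare entries without regard to case or surrounding whitespace."""
    return item.strip().casefold()


def missing_items(required: list[str], sealed_inventory: list[str]) -> list[str]:
    """Return the required items that do not appear in the sealed inventory."""
    present = {normalize(item) for item in sealed_inventory}
    return [item for item in required if normalize(item) not in present]


# Excerpt only; build the full list from your own crash kit inventory.
REQUIRED = [
    "Disaster recovery script",
    "Parts list",
    "tpparam file",
    "saprouttab file",
    "saplogon.ini",
]

if __name__ == "__main__":
    inventory = ["disaster recovery script", "Parts list ", "saplogon.ini", "tpparam file"]
    for item in missing_items(REQUIRED, inventory):
        print("missing from crash kit:", item)
```

A check like this only confirms that an item is listed. It does not replace the signed and dated inventory, or testing that the items still work.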

Business Continuation During Recovery

Business continuation during a recovery is an alternate process to continue doing business while recovering from a disaster. It includes:

• Cash collection
• Order processing
• Product shipping
• Bill paying
• Payroll processing
• Alternate locations to continue doing business

Why

Without an alternate process, your company would be unable to do business.

Some of the problems you would encounter include:

• Orders cannot be entered
• Product cannot be shipped
• Money cannot be collected

How

There are many alternate processes, including:

• Manual paper-based
• Standalone PC-based products

Offsite Disaster Recovery Sites

• Other company sites
• Commercial disaster recovery sites
• Shared or rented space from other companies

Integration with your Company’s General Disaster Planning

Because there are many dependencies, the R/3 disaster recovery process must be integrated with your company’s general disaster planning. General disaster planning covers telephone, network, product deliveries, mail, etc.

When the R/3 System Returns

How will the transactions that were handled with the alternate process be entered into R/3 when it is operational?

Test your Disaster Recovery Procedure

Unless you test your recovery process, you do not know if you can actually recover your system.

A test is a simulated disaster recovery, which verifies that you can recover the system and exercises every task outlined in the disaster recovery plan.

• Test to find out if:
  - Your disaster recovery procedure works
  - Something changed, or was not documented or updated
  - There are steps that need clarification for others
    The information that is clear to the person documenting the procedure may be unclear to the person reading the procedure.
  - Older hardware is no longer available
    Here, alternate planning is needed. You may have to upgrade your hardware to be compatible with currently available equipment.

Since many factors affect recovery time, actual recovery times can only be determined by testing. Once you have actual times (not guesses or estimates), your disaster planning becomes more credible. If the procedure is practiced often, everyone will know what to do when a disaster occurs. This way, the chaos of a disaster will be reduced.

How

1. Execute your disaster recovery plan on a backup system or at an offsite location.
2. Generate a random disaster scenario.
3. Execute your disaster plan to see if it handles the scenario.

When

A full disaster recovery should be practiced at least once a year.

Where

• The disaster recovery test should be done at the same site where you expect to recover.
  If you have multiple recovery sites, perform a test recovery at each site. The equipment, facilities, and configuration may be different at each site. Document all specific items that need to be completed for each site. You do not want to discover that you cannot recover at a site after a disaster occurs.
• A backup onsite server
• Another company site
• At another company where you have a mutual support agreement
• A company that provides disaster recovery sites and services

Who Should Participate

• Primary and backup personnel who will do the job during a real disaster recovery
  A provision should be made that some of the key personnel are to be unavailable during a disaster recovery. A test procedure might involve randomly picking a name and declaring that person unavailable to participate. This procedure duplicates a real situation in which a key person is seriously injured or killed.
• Personnel at other sites
  Integrate these people into the test, since they may be needed to perform the recovery during an actual disaster. These people will fill in for unavailable personnel.
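
Both random elements of a test (the disaster scenario and the person declared unavailable) can be drawn mechanically, so that nobody can prepare for a known case. The following Python sketch uses the standard `random` module. The function name, the scenario labels, and the staff names are hypothetical placeholders.

```python
import random
from typing import Optional


def plan_drill(scenarios: list[str], staff: list[str], seed: Optional[int] = None) -> dict:
    """Pick a random scenario and declare one random staff member unavailable."""
    if not scenarios or len(staff) < 2:
        raise ValueError("need at least one scenario and two staff members")
    rng = random.Random(seed)  # pass a seed only if the draw must be repeatable
    unavailable = rng.choice(staff)
    return {
        "scenario": rng.choice(scenarios),
        "unavailable": unavailable,
        "participants": [person for person in staff if person != unavailable],
    }


if __name__ == "__main__":
    drill = plan_drill(
        ["corrupt database", "hardware failure", "loss of facility"],
        ["recovery manager", "communication liaison", "DBA", "server admin"],
    )
    print(drill)
```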

Other Considerations

Other Upstream or Downstream Applications

For the company to function, other upstream (or downstream) applications also need to be recovered with R/3. Some of these applications may be tightly associated with R/3. The applications should be accounted for and protected in the company-wide disaster recovery planning.

Applications located on only one person’s desktop computer must be backed up to a safe location.

Backup Sites

Having a contract with a disaster recovery site does not guarantee that the site will be available. In a regional disaster, such as an earthquake or flood, many other companies will be competing for the same commercial disaster sites. In this situation, you may not have a site to recover to, if others have booked it before you.

The emergency backup site may not have equipment of the same performance level as your production system. Reduced performance and transaction throughput must be considered.

Examples:

• A reduced batch schedule of only critical jobs
• Only essential business tasks will be done while on the recovery system

Minimizing the Chances for a Disaster

There are many ways to minimize chances for a disaster. Some of these ideas seem obvious, but it is these ideas that are often forgotten.

Minimize Human Error

Many disasters are caused by human error, such as a mistake or a tired operator. Do not attempt dangerous tasks when you are tired. If you have to do a dangerous task, get a second opinion before you start.

• Dangerous tasks should be scripted, with checkpoints included to verify the steps.
  Such tasks include:
  - Deleting the test database
    Check that the delete command specifies the Test, not the Production, database.
  - Moving a file
    Verify that the target file (to be overwritten) is the old, not the new, file.
  - Formatting a new drive
    Verify that the drive to be formatted is the new drive, not an existing drive with data on it.
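
A checkpoint in a scripted task can be as simple as refusing to continue unless the target has been verified. The Python sketch below illustrates the idea for the database deletion case. It performs no deletion itself. The function name, the exception class, and the system IDs `TST` and `PRD` are hypothetical, and this is not an SAP or database vendor API.

```python
class CheckpointError(Exception):
    """Raised when a safety check fails and the task must not continue."""


def verify_delete_target(
    target: str,
    intended: str,
    protected: set[str],
    typed_confirmation: str,
) -> None:
    """Stop unless the delete target is the intended, unprotected database."""
    if target in protected:
        raise CheckpointError(f"{target} is a protected database")
    if target != intended:
        raise CheckpointError(f"target {target} is not the intended {intended}")
    if typed_confirmation != target:
        raise CheckpointError("operator confirmation does not match the target")


if __name__ == "__main__":
    # Passes silently: the target is the test system and the operator typed its name.
    verify_delete_target("TST", intended="TST", protected={"PRD"}, typed_confirmation="TST")
    print("checkpoint passed; the delete step may run")
```

Requiring the operator to type the target name is a deliberate pause. It gives a tired operator one more chance to notice a mistake.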

Minimize Single Points of Failure

A single-point failure is when the failure of one component causes the entire system to fail.

To minimize single-point failure:

• Identify conditions where a single-point failure can occur
• Anticipate what will happen if this component or process fails
• Eliminate as many of these single points of failure as practical
  Practical is defined as the level of work involved or cost compared to the level of risk and failure.

Types of single points of failure include:

• The backup R/3 server is located in the same data center as the production R/3 server.
  If the data center is destroyed, the backup server is also destroyed.
• All the R/3 servers are on a single electrical circuit.
  If the circuit breaker opens, everything on that circuit loses power, and all the servers will crash.
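
One way to identify such conditions is to record what each server depends on and look for anything that every server shares. The Python sketch below assumes a simple hand-maintained mapping. The function name, server names, attribute names, and values are all hypothetical.

```python
def shared_dependencies(servers: dict[str, dict[str, str]]) -> dict[str, str]:
    """Return each attribute whose value is identical for every server.

    `servers` maps a server name to its dependencies, for example
    {"data_center": "DC1", "circuit": "A"}.
    """
    if len(servers) < 2:
        return {}
    common_attributes = set.intersection(*(set(attrs) for attrs in servers.values()))
    shared = {}
    for attribute in sorted(common_attributes):
        values = {attrs[attribute] for attrs in servers.values()}
        if len(values) == 1:
            shared[attribute] = values.pop()
    return shared


if __name__ == "__main__":
    landscape = {
        "production": {"data_center": "DC1", "circuit": "A"},
        "backup": {"data_center": "DC1", "circuit": "B"},
    }
    # Both servers share the data center, so losing DC1 takes out both.
    print(shared_dependencies(landscape))
```

The result is only as complete as the recorded attributes. A dependency nobody wrote down, such as a shared network switch, will not be reported.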

Cascade Failures

A cascade failure is when one failure triggers additional failures, which increases the complexity of a problem. The recovery involves the coordinated fixing of many problems.

Example: A Cascade Failure

1. A power failure in the air conditioning system causes an environmental (air conditioning) failure in the server room.
2. Without cooling, the temperature in the server room rises above the equipment’s acceptable operating temperature.
3. The overheating causes a hardware failure in the server.
4. The hardware failure causes a database corruption.

In addition, overheating can damage many things, such as:

• Network equipment
• Phone system
• Other servers

The recovery becomes complex because:

• Fixing one problem may uncover other problems or damaged equipment.
• Certain items cannot be tested or fixed until other equipment is operational.

In this case, a system that monitors the air conditioning system or the temperature in the server room could alert the appropriate employees before the temperature in the server room becomes too hot.
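
The core of such a monitor is a threshold check with a warning level set below the critical level, so that staff are alerted while there is still time to act. The Python sketch below shows only that logic. The function names and the threshold values are hypothetical; take real limits from your equipment's specifications, and note that reading a sensor and notifying staff are outside this sketch.

```python
from typing import Optional


def temperature_status(reading_c: float, warn_at_c: float, critical_at_c: float) -> str:
    """Classify a server room temperature reading as ok, warning, or critical."""
    if warn_at_c >= critical_at_c:
        raise ValueError("the warning threshold must be below the critical threshold")
    if reading_c >= critical_at_c:
        return "critical"
    if reading_c >= warn_at_c:
        return "warning"
    return "ok"


def first_alert(readings_c: list[float], warn_at_c: float, critical_at_c: float) -> Optional[int]:
    """Return the index of the first reading that is not ok, or None."""
    for index, reading in enumerate(readings_c):
        if temperature_status(reading, warn_at_c, critical_at_c) != "ok":
            return index
    return None


if __name__ == "__main__":
    # Hypothetical thresholds in degrees Celsius.
    readings = [22.0, 24.5, 28.0, 33.0]
    print(first_alert(readings, warn_at_c=27.0, critical_at_c=32.0))
```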
