7/31/2019 DISAST~1
http://slidepdf.com/reader/full/disast1 1/18
System Administration Made Easy 2 –1
Chapter 2: Disaster Recovery
Contents
Overview
Why Plan for a Disaster?
Planning for a Disaster
Test your Disaster Recovery Procedure
Other Considerations
Minimizing the Chances for a Disaster
Release 4.6 A/B 2 –2
Overview
The purpose of this chapter is to help you understand what we feel is the most critical job of
a system administrator: disaster recovery.
We included this chapter at the beginning of our guidebook for two reasons:
< To emphasize the importance of the subject
  Disaster recovery needs to be planned as soon as possible, because it takes time to
  develop, test, and refine.
< To emphasize the importance of being prepared for a potential disaster
  Murphy’s Law says:
  “Disaster will strike when you are not prepared for it.”
The sooner you begin planning, the more prepared you will be when a disaster does happen.
This chapter is not a disaster recovery “how to.” It is only designed to get you thinking
about and working on disaster recovery.
What Is a Disaster?
The goal of disaster recovery is to restore the system so that the company can continue
doing business. A disaster is anything that results in the corruption or loss of the R/3
System.
Examples include:
< Database corruption
  For example, when test data is accidentally loaded into the production system.
  This happens more often than people realize.
< A serious hardware failure
< A complete loss of the R/3 System and infrastructure
  For example, the destruction of the building due to natural disaster.
The ultimate responsibility of a system administrator is to successfully restore R/3 after a
disaster.
The ultimate consequence of not restoring the system is that your company goes out of business.
The administrator’s goal is to prevent the system from ever reaching the situation where the
ultimate responsibility is called upon.
Disaster recovery planning is a major project. Depending on your situation and the size and
complexity of your company, disaster recovery planning could take more than a year to
prepare, test, and refine. The plan could fill many volumes. This chapter helps you start
thinking about and planning for disaster recovery.
Why Plan for a Disaster?
< A system administrator should expect and plan for the worst, and then hope for the best.
< During a disaster recovery, nothing should be done for the first time.
  Unpleasant surprises could be fatal to the recovery process.
The following questions show why you should develop a disaster recovery plan:
< Will business operations stop if R/3 fails?
< How much lost revenue and cost will be incurred for each hour that the system is down?
< Which critical business functions cannot be completed?
< How will customers be supported?
< How long can the system be down before the company goes out of business?
< Who is coordinating and managing the disaster recovery?
< What will the users do while R/3 is down?
< How long will the system be down?
< How long will it take before the R/3 System is available for use?
If you plan properly, you will be under less stress, because you know that the system can be
recovered and how long this recovery will take.
If the recovery downtime is unacceptable, management should invest in:
< Equipment, facilities, and personnel
< High availability (HA) options
  HA options can be expensive. There are different degrees of HA, so customers need to
  determine which option is right for them.
HA is an advanced topic beyond the scope of this guidebook. If you are interested in this
topic, contact an HA vendor.
Planning for a Disaster
This chapter is not a disaster recovery “how to.” It is only designed to get you thinking
about and working on disaster recovery.
Creating a Plan
Creating a disaster recovery plan is a major project because:
< It can take over a year and considerable effort to develop, test, and document.
< The documentation may be extensive (literally thousands of pages long).
If you do not know how to plan for a disaster recovery, get the assistance of an expert. A
bad plan (one that will fail) is worse than no plan, because it provides a false sense of security.
What Are the Business Requirements for Disaster Recovery?
Who will provide the requirements?
< Senior management needs to provide global (or strategic) requirements and guidelines.
< The business units’ needs drive the specific detailed requirements.
  These units should understand that as the required recovery time decreases,
  the cost of disaster recovery increases. The units should budget for it, or if the funds
  come from an administrative or IT budget, the units should support it.
What are the requirements? Each requirement should answer the following questions:
< Who is the requestor?
< What is the requirement?
< Are other departments or customers affected by this requirement?
< Why is the requirement necessary?
  When R/3 is offline, what does (or does not) happen?
  What is the cost (or lost revenue) of an hour or a day of R/3 downtime?
  The justification should be a concrete, objective value (such as $20,000 an hour).
  Define the cost (per hour, per day, etc.) of having the R/3 System down.
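As a rough illustration of turning a justification into a concrete value, the sketch below multiplies an hourly cost by the length of the outage. The helper name `downtime_cost` and the eight-hour business day are assumptions made for this example; only the $20,000 figure comes from the text.

```python
def downtime_cost(cost_per_hour: float, hours_down: float) -> float:
    """Return the estimated business cost of an outage (hypothetical helper)."""
    if cost_per_hour < 0 or hours_down < 0:
        raise ValueError("cost and duration must be non-negative")
    return cost_per_hour * hours_down


# Sample value from the text: $20,000 per hour of R/3 downtime.
HOURLY_COST = 20_000
# Assumption: one business day is counted as 8 working hours.
BUSINESS_DAY_HOURS = 8

cost_per_day = downtime_cost(HOURLY_COST, BUSINESS_DAY_HOURS)
print(f"Cost per hour: ${HOURLY_COST:,}")
print(f"Cost per business day: ${cost_per_day:,.0f}")
```

A real estimate would probably need different rates for different times of day, but even a single rate gives management a number to weigh against the cost of faster recovery.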
Example
What: No more than one hour of transaction data may be lost.
Why: The cost is 1,000 lost transactions per hour: transactions that were entered
in R/3 and cannot be recreated from memory.
This inability to recreate lost transactions may result in lost sales and upset
customers. If the lost orders are those that the customer needs quickly, this
situation can be critical.
Example
What: The system cannot be offline for more than three hours.
Why: The cost (an average of $25,000 per hour) is the inability to book sales.
Example
What: In the event of disaster, such as the loss of the building containing the R/3
data center, the company can only tolerate a two-day downtime.
Why: At that point, permanent customer loss begins.
Other: There must be an alternate method of continuing business.
When Should a Disaster Recovery Procedure Begin?
Ask yourself the following questions:
< What criteria constitute a disaster?
< Have these criteria been met?
< Who needs to be consulted?
  This person must be aware of the effect of the disaster on the company’s business and the
  critical nature of the recovery.
Expected Downtime or Recovery Time
Expected Downtime
Expected downtime is only part of the business cost of disaster recovery. For defined
scenarios, this cost is the expected minimum time before R/3 can be productive again.
Downtime may mean that no orders can be processed and no products shipped.
Management must approve this cost, so it is important that they understand that downtimes
are potential business costs.
To help business continue, it is important to find out if there are alternate processes that can
be used while the R/3 System is being recovered.
The following costs are involved with downtimes:
< The length of time that R/3 is down.
  The longer the system is down, the longer the catch-up period when it is brought back
  up. The transactions from the alternate processes that were in place during the disaster
  have to be applied to the system to make it current. This situation is more critical in a
  high-volume environment.
< A downed system is more expensive during the business day, when business activity
  would stop, than at the end of the business day, when everyone has gone home.
< When customers cannot be serviced or supported, they may be lost to a competitor.
The duration of acceptable downtime depends on the company and the nature of its
business.
Recovery Time
Unless you test your recovery procedure, the recovery time is only an estimate, or worse, a
guess. Different disaster scenarios have different recovery times, which are based on what
needs to be done to become operational again.
The time to recover must be matched to the business requirements. If this time is greater
than the business requirements allow, the mismatch needs to be communicated to the appropriate
managers or executives.
Resolving this mismatch involves either:
< Investing in equipment, processes, and facilities to reduce the recovery time.
< Changing the business requirements to accept the longer recovery time and accepting
  the consequences.
An extreme (but possible) example: A company cannot afford the cost and lost revenue for
the month it would take one person to recover the system. During that time, the competition
would take away customers, payment would be due to vendors, and bills would not be
collected. In this situation, senior management needs to allocate resources to reduce the
recovery time to an acceptable level.
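One way to make such a mismatch visible is to list each scenario's tested recovery time next to the downtime the business can tolerate. The sketch below is only an illustration: the `Scenario` class, the `find_mismatches` helper, and the pairing of numbers are assumptions, not part of any SAP tool.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    tested_recovery_hours: float   # measured in a recovery test, not guessed
    required_max_hours: float      # taken from the business requirements


def find_mismatches(scenarios):
    """Return scenarios whose tested recovery time exceeds the requirement."""
    return [s for s in scenarios if s.tested_recovery_hours > s.required_max_hours]


# Illustrative numbers only; replace them with your own tested times.
scenarios = [
    Scenario("corrupt database", tested_recovery_hours=8, required_max_hours=3),
    Scenario("facility loss", tested_recovery_hours=48, required_max_hours=48),
]

for s in find_mismatches(scenarios):
    gap = s.tested_recovery_hours - s.required_max_hours
    print(f"{s.name}: {gap:g} hours over the requirement; escalate to management")
```

Each flagged scenario then needs one of the two resolutions listed above: invest to shorten the recovery, or accept a longer downtime.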
Recovery Group and Staffing Roles
There are four key roles in a recovery group. The number of employees performing these
roles will vary depending on your company size. In a smaller company, for example, the
recovery manager and the communication liaison could be the same person. Titles and tasks
will probably differ based on your company’s needs.
We defined the following key roles:
< Recovery manager
  Manages the entire technical recovery. All recovery activities and issues should be
  coordinated through this person.
< Communication liaison
  Handles user phone calls and keeps top management updated with the recovery status.
  One person handling all phone calls allows the group doing the technical recovery to
  proceed without interruptions.
< Technical recovery team
  Does the actual technical recovery. As the recovery progresses, the original plan may
  have to be modified. This role must manage the changes and coordinate the technical
  recovery.
< Review and certification manager
  Coordinates and plans the post-recovery testing and certification with users.
To reduce interruption of the recovery staff, we recommend you maintain a status board.
The status board should list key points in the recovery plan and an estimate of when the
system will be recovered and available to use.
< If the disaster is a major geographical event (like an earthquake), your local staff will be
  more concerned with their families than with the company.
< Depending on the disaster, key personnel could be injured or killed.
You should expect and plan for these situations. Plan for staff from other geographic sites
to be flown in and participate as disaster recovery team members.
A final staffing consideration is to plan for at least one staff member to be “unavailable.” Without this
person, the rest of the department must be able to perform a successful recovery. This issue
may become vital during an actual disaster.
Types of Disaster Recovery
Disaster recovery scenarios can be grouped into two types:
< Onsite
< Offsite
Onsite
Onsite recovery is disaster recovery done at your site. The infrastructure usually remains
intact. The best case scenario is a recovery done on the original hardware. The worst case
scenario is a recovery done on a backup system.
Offsite
Offsite recovery is disaster recovery done at a disaster recovery site. In this scenario, all
hardware and infrastructure are lost as a result of facility destruction such as a fire, a flood,
or an earthquake. The new servers must be configured from scratch.
A major consideration is that once the original facility has been rebuilt and tested, a second
restore must take place back to the customer’s original facility. Although this second restore can
be planned and scheduled at a convenient time to disrupt as few users as possible, the
timing is just as critical as in the disaster. While the system is being recovered, it is down.
Disaster Scenarios
There are an infinite number of disaster scenarios that could occur. It would take an infinite
amount of time to plan for them, and you will never account for all of them. To make this
task manageable, you should plan for at least three and no more than five scenarios. In the
event of a disaster, you would adapt the closest scenario(s) to the actual disaster.
The disaster scenarios are made up of:
< Description of the disaster event
< High-level plan of major tasks to be performed
< Estimated time to have the system available to the users
To create your final scenarios:
1. Use the Three Common Disaster Scenarios section below as a starting point.
2. Prepare three to five scenarios that cover a wide range of disasters that would apply to
   you.
3. Create a high-level plan (made up of major tasks) for each scenario.
4. Test the planned scenarios by creating different test disasters and determining if (and
   how) your scenario(s) would adapt to an actual disaster.
5. If the scenario(s) cannot be adapted, modify them or develop more scenarios.
6. Repeat the process.
Three Common Disaster Scenarios
The following three examples are ordered from the best-case to the worst-case scenario.
The downtimes in the examples below are only samples. Your downtimes will be different.
You must replace the sample downtimes with the downtimes applicable to your
environment.
A Corrupt Database
< A corrupt database could result from:
  Accidentally loading test data into the production system.
  A bad transport into production, which results in the failure of the production
    system.
< Such a disaster requires the recovery of the R/3 database and related operating system
  files.
< The “sample” downtime is eight hours.
A Hardware Failure
< The following types of items may fail:
  A system processor
  A drive controller
  Multiple drives in a drive array, so that the drive array fails
< Such a disaster scenario requires:
  Replacing failed hardware
  Rebuilding the server (operating system and all programs)
  Recovering the R/3 database and related files
< The “sample” downtime is seven days and comprises:
  Five days to procure replacement hardware
  Two days to rebuild the NT server (one person); 16 hours of actual work time
A Complete Loss or Destruction of the Server Facility
< The following items can be lost:
  Servers
  All supporting infrastructure
  All documentation and materials in the building
  The building
< A complete loss of the facility can result from the following types of disasters:
  Fire
  Earthquake
  Flood
  Hurricane
  Tornado
  Man-made disasters, such as the World Trade Center bombing
< Such a disaster requires:
  Replacing the facilities
  Replacing the infrastructure
  Replacing lost hardware
  Rebuilding the server and R/3 environment (hardware, operating system, database,
    etc.)
  Recovering the R/3 database and related files
< The “sample” downtime lasts eight days and comprises:
  At least five days to procure hardware.
    In a regional disaster, this purchase could take longer if your suppliers were also
    affected by the disaster.
    Use national vendors with several regional distribution centers and, as a backup,
    have an out-of-area alternate supplier.
  Two days to rebuild the NT server (one person); 16 hours of actual work time
    As the hardware is procured and the server is being rebuilt, an alternate facility is
    obtained and an emergency (minimal) network is constructed.
  One day to integrate into the emergency network
< Complete loss or destruction requires a recovery back to a new facility.
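To see how the sample downtimes add up, the sketch below records the three scenarios as task lists and sums the durations. The dictionary layout, the shortened task labels, and the helper `total_downtime_days` are assumptions made for this example; the durations are the samples quoted above, and the sum assumes the tasks run one after another.

```python
# Sample downtimes from the three scenarios above, expressed in days.
# The task names are shortened labels; the durations are the text's samples.
SAMPLE_SCENARIOS = {
    "corrupt database": {"recover database and files": 8 / 24},
    "hardware failure": {
        "procure replacement hardware": 5,
        "rebuild the server": 2,
    },
    "facility loss": {
        "procure hardware": 5,
        "rebuild the server": 2,
        "integrate into emergency network": 1,
    },
}


def total_downtime_days(tasks: dict) -> float:
    """Sum task durations, assuming the tasks run one after another."""
    return sum(tasks.values())


for name, tasks in SAMPLE_SCENARIOS.items():
    print(f"{name}: {total_downtime_days(tasks):g} days")
```

Replacing the sample durations with times measured in your own tests turns this into a simple planning table.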
Recovery Script
What
A recovery script is a document that provides step-by-step instructions about:
< The process required to recover R/3
< Who will complete each step
< The expected time for long steps
< Dependencies between steps
Why
A script is necessary because it helps you:
< Develop and use a proven series of steps to restore R/3
< Prevent missing steps
  Missing a critical step may require restarting the recovery process from the beginning,
  which delays the recovery.
If the primary recovery person is unavailable, a recovery script helps the backup person
complete the recovery.
Creating a Recovery Script
Creating a recovery script requires:
< A checklist for each step
< A document with screenshots to clarify the instructions, if needed
< Flowcharts, if the flow of steps or activities is critical or confusing
Recovery Process
To reduce recovery time, define a process by:
< Completing as many tasks as possible in parallel
< Adding timetables for each step
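The effect of running steps in parallel can be estimated from the owners, expected times, and dependencies that a recovery script records. The sketch below is a minimal illustration using Python's standard `graphlib` module; the step names, owners, and hours are invented for the example, and it assumes enough staff are available for independent steps to overlap.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: name -> (owner, expected hours, steps that must finish first)
STEPS = {
    "recall offsite tapes": ("operator", 4, []),
    "replace hardware": ("vendor", 8, []),
    "install operating system": ("admin", 3, ["replace hardware"]),
    "install database": ("dba", 2, ["install operating system"]),
    "restore database": ("dba", 6, ["install database", "recall offsite tapes"]),
}


def earliest_finish(steps):
    """Return each step's earliest finish time when independent steps run in parallel."""
    order = TopologicalSorter({name: deps for name, (_, _, deps) in steps.items()})
    finish = {}
    for name in order.static_order():  # dependencies always come before dependents
        _, hours, deps = steps[name]
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + hours
    return finish


finish = earliest_finish(STEPS)
serial_hours = sum(hours for _, hours, _ in STEPS.values())
parallel_hours = max(finish.values())
print(f"One step at a time: {serial_hours} hours")
print(f"With parallel steps: {parallel_hours} hours")
```

The per-step finish times double as a first draft of the timetable, which a recovery test would then confirm or correct.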
Major Steps
1. During a potential disaster, anticipate a recovery by:
   < Collecting facts
   < Recalling the latest offsite tapes
   < Recalling the crash kit (see the Crash Kit section for more information)
   < Calling all required personnel
     These personnel include the internal SAP team, affected key
     users, infrastructure support, IT, facilities, on-call consultants, etc.
   < Preparing functional organizations (sales, finance, and shipping) for alternate
     procedures for key business transactions and processes.
2. Minimize the effect of the disaster by:
   < Stopping all additional transactions into the system
     Waiting too long could worsen the problem.
   < Collecting transaction records that have to be manually reentered
3. Begin the planning process by:
   < Analyzing the problem
   < Fitting the disaster to your predefined scenario plans
   < Modifying the plans as needed
4. Define when to initiate a disaster recovery procedure.
   < What are the criteria to declare a disaster, and have they been met?
   < Who will make the final decision to declare a disaster?
5. Declare the disaster.
6. Perform the system recovery.
7. Test and sign off on the recovered system.
   Key users, who will use a criteria checklist to determine that the system has been
   satisfactorily recovered, should perform the testing.
8. Catch up with transactions that may have been handled by alternate processes during
   the disaster.
   Once completed, this step should require an additional sign-off.
9. Notify the users that the system is ready for normal operations.
10. Conduct a postmortem debriefing session.
    Use the results from this session to improve your disaster recovery planning.
Crash Kit
What
A crash kit contains everything needed to:
< Rebuild the R/3 servers
< Reinstall R/3
< Recover the R/3 database and related files
Why
During a disaster, everything that is needed to recover the R/3 environment should be contained in
one (or a few) containers. If you have to evacuate the site, you will not have the time to run
around gathering the items at the last minute, hoping that you get everything you need.
In a major disaster, you may not even have that opportunity.
When
When a change is made to a component (hardware or software) on the server, replace the
outdated items in the crash kit with updated items that have been tested.
A periodic review of the crash kit should be performed to determine if items need to be
added or changed. A service contract is a perfect example of an item that requires this type
of review.
Where to Put the Crash Kit
The crash kit should be physically separated from the servers. If it is located in the server
room, and the server room is destroyed, this kit is lost.
Some crash kit storage areas include:
< Commercial offsite data storage
< Other company sites
< Another secure section of the building
How
The following is an inventory list of some of the major items to put into the crash kit. You
will need to add or delete items for your specific environment. This inventory list is
organized into the following categories:
< Documentation
< Software
Documentation
An inventory of the crash kit should be taken by the person who seals the kit. If the seal is
broken, items may have been removed or changed, making the kit useless in a recovery.
The inventory list below must be signed and dated by the person checking the crash kit. The
following documentation must be included in the crash kit:
< Disaster recovery script
< Installation instructions for the:
  Operating system
  Database
  R/3 System
< Special installation instructions for:
  Drivers that have to be manually installed
  Programs that must be installed in a specific manner
< Copies of:
  SAP license for all instances
  Service agreements (with phone numbers) for all servers
    Ensure that maintenance agreements are still valid, and check whether any agreement has expired.
    These checks should be part of a regularly scheduled task.
< Instructions to recall tapes from offsite data storage
< List of personnel authorized to recall tapes from offsite data storage
  This list must correspond to the list maintained by the data storage company.
< A parts list
  If the server is destroyed, this list should be in sufficient detail to purchase or lease
  replacement hardware. Over time, if original parts are no longer available, an alternate
  parts list will have to be prepared. At this point, you might consider upgrading the
  equipment.
< File system layout
< Hardware layout
  You need to know which:
  Cards go in which slots
  Cables go where (connector-by-connector)
  Labeling cables and connectors greatly reduces confusion.
< Phone numbers for:
  Key users
  Information services personnel
  Facilities personnel
  Other infrastructure personnel
  Consultants (SAP, network, etc.)
  SAP hotline
  Offsite data storage
  Security department or personnel
  Service agreement contacts
  Hardware vendors
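The regular check of service agreements mentioned in the list above could be automated along the following lines. This is only a sketch: the server names, the expiry dates, the 60-day warning window, and the helper `agreements_needing_attention` are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical agreements: server name -> expiry date of its service agreement.
AGREEMENTS = {
    "production server": date(2030, 6, 30),
    "backup server": date(2020, 1, 31),
}


def agreements_needing_attention(agreements, today, warn_days=60):
    """Return (expired, expiring_soon) server names as sorted lists."""
    limit = today + timedelta(days=warn_days)
    expired = sorted(n for n, d in agreements.items() if d < today)
    soon = sorted(n for n, d in agreements.items() if today <= d <= limit)
    return expired, soon


# A fixed date keeps the example repeatable; a scheduled job would pass date.today().
expired, soon = agreements_needing_attention(AGREEMENTS, today=date(2030, 6, 1))
print("Expired:", expired)
print("Expiring within 60 days:", soon)
```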
Software
< Operating system:
  Installation kit
  Drivers for hardware, such as a Network Interface Card (NIC) or a SCSI
    controller, which are not included in the installation kit
  Service packs, updates, and patches
< Database:
  Installation kit
  Service packs, updates, and patches
  Recovery scripts, to automate the database recovery
< For R/3:
  Installation kit
  Currently installed kernel
  System profile files
  tpparam file
  saprouttab file
  saplogon.ini
< Other R/3 integrated programs (for example, a tax package)
< Other software for the R/3 installation:
  Utilities
  Backup
  UPS control program
  Hardware monitor
  FTP client
  Remote control program
  System monitor
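When the crash kit is reviewed, the contents can be compared with the inventory list. The sketch below shows one way to do that with Python sets; the item names are abbreviated from the lists above, and the helper `check_crash_kit` is an assumption made for this example.

```python
# Hypothetical required inventory, abbreviated from the categories above.
REQUIRED_ITEMS = {
    "disaster recovery script",
    "operating system installation kit",
    "database installation kit",
    "R/3 installation kit",
    "system profile files",
}


def check_crash_kit(required, found):
    """Compare the kit's contents with the required inventory."""
    missing = sorted(required - found)
    unexpected = sorted(found - required)
    return missing, unexpected


# Hypothetical result of opening the kit during a periodic review.
found_in_kit = {
    "disaster recovery script",
    "operating system installation kit",
    "R/3 installation kit",
    "old network diagram",
}
missing, unexpected = check_crash_kit(REQUIRED_ITEMS, found_in_kit)
print("Missing:", missing)
print("Not on the inventory list:", unexpected)
```

An item that is present but not on the list is worth investigating too, since it may mean the inventory was not updated after a change.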
Business Continuation During Recovery
Business continuation during a recovery is an alternate process to continue doing business
while recovering from a disaster. It includes:
< Cash collection
< Order processing
< Product shipping
< Bill paying
< Payroll processing
< Alternate locations to continue doing business
Why
Without an alternate process, your company would be unable to do business.
Some of the problems you would encounter include:
< Orders cannot be entered
< Product cannot be shipped
< Money cannot be collected
How
There are many alternate processes, including:
< Manual paper-based
< Standalone PC-based products
Offsite Disaster Recovery Sites
< Other company sites
< Commercial disaster recovery sites
< Share or rent space from other companies
Integration with your Company’s General Disaster Planning
Because there are many dependencies, the R/3 disaster recovery process must be integrated
with your company’s general disaster planning. This process includes telephone, network,
product deliveries, mail, etc.
When the R/3 System Returns
How will the transactions that were handled with the alternate process be entered into R/3
when it is operational?
Test your Disaster Recovery Procedure
Unless you test your recovery process, you do not know if you can actually recover
your system.
A test is a simulated disaster recovery which verifies that you can recover the system and
exercise every task outlined in the disaster recovery plan.
< Test to find out if:
  Your disaster recovery procedure works
  Something changed and was not documented or updated
  There are steps that need clarification for others
    The information that is clear to the person documenting the procedure may be
    unclear to the person reading the procedure.
  Older hardware is no longer available
    Here, alternate planning is needed. You may have to upgrade your hardware to be
    compatible with currently available equipment.
Since many factors affect recovery time, actual recovery times can only be determined by
testing. Once you have actual times (not guesses or estimates), your disaster planning
becomes more credible. If the procedure is practiced often, when a disaster occurs, everyone
will know what to do. This way, the chaos of a disaster will be reduced.
How
1. Execute your disaster recovery plan on a backup system or at an offsite location.
2. Generate a random disaster scenario.
3. Execute your disaster plan to see if it handles the scenario.
When
A full disaster recovery should be practiced at least once a year.
Where
< The disaster recovery test should be done at the same site where you expect to recover.
  If you have multiple recovery sites, perform a test recovery at each site. The
  equipment, facilities, and configuration may be different at each site. Document
  all specific items that need to be completed for each site. You do not want
  to discover that you cannot recover at a site after a disaster occurs.
< A backup onsite server
< Another company site
< At another company where you have a mutual support agreement
< A company that provides disaster recovery site and services
Who Should Participate
< Primary and backup personnel who will do the job during a real disaster recovery
  A provision should be made that some of the key personnel are to be unavailable during
  a disaster recovery. A test procedure might involve randomly picking a name and
  declaring that person unavailable to participate. This procedure duplicates a real situation
  in which a key person is seriously injured or killed.
< Personnel at other sites
  Integrate these people into the test, since they may be needed to perform the recovery
  during an actual disaster. These people will fill in for unavailable personnel.
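The random selection described above can be as simple as drawing a name from a hat. For teams that prefer a script, the sketch below does the same thing; the role names and the helper `pick_unavailable` are hypothetical.

```python
import random


def pick_unavailable(team, rng=None):
    """Randomly declare one team member unavailable for the test."""
    rng = rng or random.Random()
    absent = rng.choice(team)
    remaining = [person for person in team if person != absent]
    return absent, remaining


# Hypothetical role names; use your own recovery team roster.
team = ["recovery manager", "communication liaison", "DBA", "network admin"]
absent, remaining = pick_unavailable(team)
print(f"Unavailable for this test: {absent}")
print(f"Team performing the recovery: {remaining}")
```

Passing a seeded `random.Random` instance as `rng` makes the draw repeatable, which can be useful when documenting a test.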
Other Considerations
Other Upstream or Downstream Applications
For the company to function, other upstream (or downstream) applications also need to be
recovered with R/3. Some of these applications may be tightly associated with R/3. The
applications should be accounted for and protected in the company-wide disaster recovery
planning.
Applications located on only one person’s desktop computer must be backed up to a safe
location.
Backup Sites
Having a contract with a disaster recovery site does not guarantee that the site will be
available. In a regional disaster, such as an earthquake or flood, many other companies will
be competing for the same commercial disaster sites. In this situation, you may not have a
site to recover to, if others have booked it before you.
The emergency backup site may not have equipment of the same performance level as your
production system. Reduced performance and transaction throughput must be considered.
Examples:
< A reduced batch schedule of only critical jobs
< Only essential business tasks will be done while on the recovery system
Minimizing the Chances for a Disaster
There are many ways to minimize chances for a disaster. Some of these ideas seem obvious,
but it is these ideas that are often forgotten.
Minimize Human Error
Many disasters are caused by human error, such as a mistake or a tired operator. Do not
attempt dangerous tasks when you are tired. If you have to do a dangerous task, get a
second opinion before you start.
< Dangerous tasks should be scripted and checkpoints included to verify the steps.
  Such tasks include:
  Deleting the test database
    Check that the delete command specifies the Test, not the
    Production, database.
  Moving a file
    Verify that the target file (to be overwritten) is the old, not the new, file.
  Formatting a new drive
    Verify that the drive to be formatted is the new drive, not an existing drive with data
    on it.
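A scripted task with checkpoints might look like the following sketch, which refuses to act on a protected system and asks for a second opinion before deleting anything. Everything here is hypothetical: the function `guarded_delete`, the database names `TST` and `PRD`, and the stand-in delete command, which only records what it was asked to remove.

```python
def guarded_delete(target, protected, confirm, delete):
    """Run `delete(target)` only if every checkpoint passes.

    All four parameters are hypothetical: `protected` is a set of names that
    must never be deleted, `confirm` asks a second person, and `delete` is
    whatever command actually removes the database.
    """
    if target in protected:
        raise ValueError(f"checkpoint failed: {target!r} is a protected system")
    if not confirm(f"Delete {target!r}?"):
        return False  # second opinion said no; nothing is deleted
    delete(target)
    return True


deleted = []
PROTECTED = {"PRD"}  # assumed name of the production database

# The test database passes both checkpoints.
guarded_delete("TST", PROTECTED, confirm=lambda q: True, delete=deleted.append)

# The production database is refused before the delete command is reached.
try:
    guarded_delete("PRD", PROTECTED, confirm=lambda q: True, delete=deleted.append)
except ValueError as err:
    print(err)

print("Deleted:", deleted)
```

The same pattern applies to moving a file or formatting a drive: name the target explicitly, verify it against what must not be touched, and only then run the command.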
Minimize Single Points of Failure
A single-point failure is when the failure of one component causes the entire system to fail.
To minimize single-point failure:
< Identify conditions where a single-point failure can occur
< Anticipate what will happen if this component or process fails
< Eliminate as many of these single points of failure as practical.
  Practical is defined as the level of work involved or cost compared to the level of risk
  and failure.
Types of single points of failure include:
< The backup R/3 server is located in the same data center as the production R/3 server.
  If the data center is destroyed, the backup server is also destroyed.
< All the R/3 servers are on a single electrical circuit.
  If the circuit breaker opens, everything on that circuit loses power, and all the servers
  will crash.
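Identifying shared dependencies can start from a simple inventory. The sketch below flags any resource that every listed server depends on; the server names, locations, breaker numbers, and the helper `single_points_of_failure` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: each server and the shared resources it depends on.
SERVERS = {
    "production R/3": {"data center": "building A", "circuit": "breaker 12"},
    "backup R/3": {"data center": "building A", "circuit": "breaker 14"},
}


def single_points_of_failure(servers):
    """Return the resources that every listed server depends on."""
    users = defaultdict(set)
    for server, resources in servers.items():
        for kind, name in resources.items():
            users[(kind, name)].add(server)
    return sorted(res for res, deps in users.items() if len(deps) == len(servers))


for kind, name in single_points_of_failure(SERVERS):
    print(f"All servers depend on {kind} {name!r}")
```

In this made-up inventory the two servers sit on separate circuits but share one data center, which is the first type of single point of failure listed above.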
Cascade Failures
A cascade failure is when one failure triggers additional failures, which increases the
complexity of a problem. The recovery involves the coordinated fixing of many problems.
Example: A Cascade Failure
1. A power failure in the air conditioning system causes an environmental (air
   conditioning) failure in the server room.
2. Without cooling, the temperature in the server room rises above the equipment’s
   acceptable operating temperature.
3. The overheating causes a hardware failure in the server.
4. The hardware failure causes a database corruption.
In addition, overheating can damage many things, such as:
  Network equipment
  Phone system
  Other servers
The recovery becomes complex because:
< Fixing one problem may uncover other problems or damaged equipment.
< Certain items cannot be tested or fixed until other equipment is operational.
In this case, a system that monitors the air conditioning system or the temperature in the
server room could alert the appropriate employees before the temperature in the server
room becomes too hot.
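The core of such a monitor is a threshold check on each temperature reading. The sketch below is a minimal illustration, not a description of any particular monitoring product: the thresholds of 27 and 32 degrees Celsius, the sample readings, and the `notify` callback are all assumptions, and real limits should come from your equipment's specifications.

```python
def check_temperature(reading_c, warn_c=27.0, critical_c=32.0):
    """Classify a server-room temperature reading (thresholds are assumptions)."""
    if reading_c >= critical_c:
        return "critical"
    if reading_c >= warn_c:
        return "warning"
    return "ok"


def alerts(readings, notify):
    """Call `notify` for every reading that is not 'ok'."""
    for minute, reading_c in readings:
        level = check_temperature(reading_c)
        if level != "ok":
            notify(f"{level}: server room at {reading_c:.1f} C after {minute} min")


# Hypothetical readings after the air conditioning loses power.
sent = []
alerts([(0, 21.0), (10, 25.5), (20, 28.0), (30, 33.5)], notify=sent.append)
print("\n".join(sent))
```

In practice, `notify` would page or phone the appropriate employees, and the warning level would be set low enough to leave time to act before hardware is damaged.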