Business Continuity Planning in an Organisation
Smartha Guha Thakurta, EMC
EMC Proven Professional Knowledge Sharing 2009
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
CHAPTER 2: BUSINESS CONTINUITY & DATA PROTECTION OPTIONS
CHAPTER 3: BUSINESS CONTINUITY PLANNING OBJECTIVES
CHAPTER 4: DEFINING DISASTER TYPES
CHAPTER 5: BEST PRACTICES AND TRENDS IN BC PLANNING
CHAPTER 6: CASE STUDY
CHAPTER 7: CHANGE MANAGEMENT & DECISION MAKING
CHAPTER 8: RECOMMENDATIONS
CHAPTER 9: TCO AND ROI ANALYSIS
CHAPTER 10: CONCLUSIONS
APPENDIX A: SAMPLE RISK ANALYSIS
APPENDIX B: REFERENCES AND BIBLIOGRAPHY
BIOGRAPHY
Disclaimer: The views, processes, or methodologies published in this compilation are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.
List of Figures
Fig. 1 Cost of downtime by Industry segments
Fig. 2 Market growth in BC/ DR worldwide estimates
Fig. 3 Obstacles to availability
Fig. 4 Perspectives of various stakeholders in Business Continuance
Fig. 5 Components of a Business continuity model
Fig. 6 The Foundation Pillars of Business Continuity Planning
List of Tables
Table 1: Components of BC and corresponding plans
Table 2: Classification of Business Process Service Levels
Chapter 1: Introduction
Preface
On the morning of 26th December 2004, people were enjoying the beaches in Indonesia,
Malaysia, Sri Lanka and India. No one was aware that a tsunami was going to destroy everything
– people, industries and resources. Disaster struck.
On 11th September 2001, people working at the World Trade Centre never thought that their world
would end. Many of the organizations with offices in those buildings could not recover.
These and many more disasters, like the power outage in Europe and North America (Black
Friday), have significantly changed people’s lives. Technology managers, CEOs, and CIOs were
particularly affected not only personally but by the business challenges that these disasters
provoked.
Today’s marketplace is largely driven by a single principle: we want it, and we want it now. When
we turn the spigot, we expect water. When we pick up the phone, we expect a dial tone. When we
turn on the television, we expect programming. When services are normal, they are expected;
when the unexpected happens, we continue to expect services.
Successful organizations realize that these basic but unrelenting expectations are translated into
business demands. When we launch an application, we expect full and immediate functionality.
When we click the “buy” button, we expect a successful and secure transaction. When we
connect, we expect full and expeditious access to the information we require.
The Evolution of Business Continuity
Too often, IT organizations (ITOs) view business continuity planning from a technical rather than a
business perspective that is aligned to user requirements. Availability has evolved into a business
issue with many nuances, including data protection and disaster recovery.
Traditional Approaches
Traditional approaches to disaster recovery reacted to outage events by returning systems to the
status quo. Purely an IT function, traditional disaster recovery cobbled together a series of
hardware redundancies, outsourcing arrangements, tape backups, and often impractical ideas on
geographical dispersion.
Many ITOs questioned the affordability, scalability, and reliability of disaster recovery plans.
During an outage, traditional disaster recovery involved vulnerabilities such as unreliable manual
processes, untested third-party interventions, exposure of corporate data and applications to
external and untrained parties, and a host of other failure opportunities. Most disturbing, traditional
disaster recovery was exorbitantly expensive in that most outsourcing arrangements offered a
‘one size fits all’ approach, protecting all systems and applications with a single service level.
Business requirements demand a shift from reactive disaster recovery to a proactive, high
availability state. Coupled with disaster preparedness, traditional high availability sought to
provide near fault tolerance by using multiple locations and recovery architectures such as
high-speed networking, high-performance servers, and automation. Again, this approach failed to
provide a realistic balance between availability and cost effectiveness. No significant focus was
placed on mission-critical activities, user perspectives, or loss impacts. High availability is fine, but
business requires an appropriate level of availability. Traditional high availability assumes an
unlimited budget. The cost of downtime associated with each industry is stunning.
The Paradigm Shift
Business continuity planning has emerged as the new availability benchmark by adopting the
benefits and shedding the shortcomings of traditional approaches. This has blurred the distinction
between business as usual and always available. Proactive, testable, and pragmatic business
continuity requires a holistic approach to address availability from user and business
perspectives, rather than from infrastructure or recovery perspectives. Taking availability to a
higher level, business continuity planning does not focus on a specific deployment or project.
Instead, it sets availability management as a corporate mentality and measures all deployments.
The Need for Business Continuity
Many organizations have a flawed perception of business continuity. Users don’t care about
availability, disaster recovery, and associated processes and policies. Users care about
seamless, uninterrupted, and rapid access. When business continuity is effective, availability
occurs under the covers, ultimately providing the best service levels to all stakeholders.
The need for business continuity in the Asia-Pacific market can be visualized by looking at IDC
data that forecasts 100% YoY growth in the domain.
Fig. 2 Market growth in BC/ DR worldwide estimates
Planned or Unplanned: It's All Downtime
Research shows that the majority of users’ resources for availability initiatives (time, money, and
effort) are dedicated to addressing unplanned downtime. Furthermore, many organizational
stakeholders view downtime in the context of a tragedy or disaster.
This approach is flawed for several reasons. First, planned activities account for as much as 90%
of downtime.
Fig. 3 Obstacles to availability
Second, planned downtime is necessary due to mandatory IT functions such as maintenance,
configurations, backups, etc. Downtime must be addressed with planning and control.
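The arithmetic behind this point can be sketched in a few lines of Python. This is a minimal illustration, not a method from the paper: the function names and the choice of a one-year measurement period are assumptions made for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumed period)


def availability(planned_hours, unplanned_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up.

    From the user's point of view planned and unplanned hours count
    equally: both reduce availability.
    """
    downtime = planned_hours + unplanned_hours
    return 1 - downtime / period_hours


def planned_share(planned_hours, unplanned_hours):
    """Fraction of total downtime caused by planned activities."""
    return planned_hours / (planned_hours + unplanned_hours)
```

With 90 hours of planned and 10 hours of unplanned downtime in a year, `planned_share(90, 10)` is 0.9, which matches the upper figure quoted above, and the 100 combined hours are what `availability` subtracts from the year.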
The Perspectives
Fig. 4 Perspectives of various stakeholders in Business Continuance
The Perspective of the Internal User
Internal users (employees, contractors, other stakeholders) are customers who contribute to the
bottom line through productivity and efficiency. Providing continuous availability to internal users
has become increasingly difficult due to many rapid marketplace evolutions, including mobile,
wireless, and distributed computing. Given the benefits of global connectedness, ITOs realize that
complex computing must be adequately supported and available.
The Perspective of the External User - The Supply Chain
While affording many benefits, the practice of connecting supply chains across enterprises
creates vulnerabilities via dependency. Research shows that organizations are not fully protecting
themselves against supply chain failure through enforcement of delivery or performance
requirements. In an outage, this breakdown in accountability will affect the bottom line by
disabling the supply chain that often feeds mission-critical activities. Ultimately, organizations
should look to business continuity planning to avoid jeopardizing trust or breaching contracts with
partners in the supply chain.
Most Importantly, the Perspective of Customers
Everyone remembers the customer service tenets of the brick-and-mortar world. Successful
organizations realize that external users expect unfettered access and availability without glitches,
abnormal pauses, reconnections, or error messages. Fickle and unforgiving, customers will
abandon a user session mid-transaction, never to return. In a world of instantaneous information,
a competitor or substitute is but one click away.
Business continuity decision-makers must place themselves in the shoes of the customer to
assess experience, tolerance, flexibility, forgiveness levels, and switching costs. Business
functions in some industries enjoy more forgiveness than others. If the switching costs are low,
the consumer encounters no barriers to exit and little to no pain in changing providers. Even with
generous forgiveness and high switching costs, a consumer will not remain loyal to an
inconsistent provider.
The Perspective of Law and Governance
Companies are under increasing pressure to gain control of core business processes to comply
with current and emerging legislation. Three United States regulatory agencies (the Federal
Reserve System, the Department of the Treasury, and the Securities and Exchange Commission) are
driving the new surge toward corporate governance by issuing business continuity and disaster
recovery guidelines. Organizations should begin planning for future legislative requirements to
minimize corporate exposure and drive business continuity principles.
For instance, the Sarbanes-Oxley Act requires control over material business processes and
related information. Under the threat of personal indictment, executives of publicly traded
companies must report and validate that their financial information is complete and the underlying
controls secure. Compliance is further complicated since much of the new legislation is industry
specific. Guidelines are in their infancy and often lack specific instruction on how to achieve
compliance. The holistic approach, proactively managing business processes and related
information with technology through effective business continuity planning, is the only viable way
to achieve compliance with Sarbanes-Oxley.
Chapter 2: Business Continuity & Data Protection Options
Once an autonomous and largely technical consideration, the concept and implementation of data
backup systems has evolved from a ‘nice to have’ IT add-on to a strategic investment in ensuring
availability. Many tradeoffs force the ITO to prioritize based on cost, business impact,
scalability/capability, data accuracy/coherency, and elapsed time to copy and restore. Although many
different data protection variations and configurations exist, data protection is a critical component
to ensure seamless and expeditious availability.
Tape Backup
Most organizational downtime is attributed to planned activities. The goal of effective backup in
the context of business continuity planning should tend toward reducing the impact of mandatory
backup processes on business operations while minimizing associated costs. As always, the
needs of users remain an underlying consideration.
Offline backups often create unnecessary risk and uncertainty, and their duration depends highly
on the volume of data and I/O throughput capacity variables. While guaranteeing that data is in a
static and defined state, an offline approach engenders significant obstacles such as
unavailability, bandwidth, capacity, and scalability issues.
Disk Backup
Although tape backup is not completely obsolete, disk backup is emerging as a low-cost,
space-saving alternative or complement that ultimately provides speedier backup and recovery. In
addition, disk backup allows for an effective allocation of resources because excessive workloads
can be offloaded to tape systems. This is particularly beneficial in archiving situations. However,
disk backup still fails to cost effectively address the storage objective of long-term retention
because of the associated costs of cooling, power, and handling.
Furthermore, tape media better serve long-term off-site retention, because transferring disk
drives to safe locations is less practical.
Replication
In conjunction with backup, replication strategies complement traditional approaches by providing
alternative levels of data protection and integrity while minimizing user disruptions. Replication
creates a point-in-time copy of data to be used as the backup source. Thus, replication addresses
the shortcomings of both offline and online backup by reducing downtime intervals while
minimizing application disruption and drag during the backup and synchronization processes.
However, replication involves many different approaches based on data protection needs.
Appropriate use of replication techniques is essential to a sound business continuity plan. If the
primary data is corrupted or unavailable, replication processes enable the organization to call up a
viable, current, and accurate secondary data copy to support mission-critical business
applications.
Asynchronous Versus Synchronous
Replication occurs in two different modes, synchronous and asynchronous. In real-time,
synchronous replication writes and confirms data to both the primary and secondary storage
before the application continues. This approach entails a two-phase commit; no transaction may
occur until the secondary source copies and acknowledges the valid copy with the primary
source.
Because of this committed approach, synchronous replication results in near-zero loss of data,
and rapid recovery times in case of failure. This means that systems and operations could be
transferred to a different location with little user disruption. As the premium and highly costly
solution, synchronous replication may not meet some geographical dispersion requirements due
to Fibre Channel distance limitations. Nevertheless, synchronous replication is often considered
mandatory for mission-critical applications or applications for which delay is unacceptable.
Providing this protection substantially increases cost due to high-bandwidth network connectivity,
robust communication infrastructure and storage performance requirements.
Conversely, asynchronous replication writes and confirms data to the primary source only, or to a
temporary staging area. The write and confirm process to the secondary storage occurs when the
application resumes normal operation, at some pre-programmed interval or whenever an update
takes place. A common drawback of asynchronous replication is that the copy is not as current as the
production original.
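The difference between the two modes can be sketched as a toy model in Python. This is an illustration under stated assumptions, not any vendor's implementation: the class names `SyncReplicator` and `AsyncReplicator`, the in-memory lists standing in for primary and secondary storage, and the explicit `drain()` call that stands in for the pre-programmed interval are all hypothetical.

```python
from collections import deque


class SyncReplicator:
    """Toy model: a write is acknowledged only after both copies hold it."""

    def __init__(self):
        self.primary = []
        self.secondary = []

    def write(self, record):
        self.primary.append(record)
        self.secondary.append(record)  # the application waits for this step
        return "ack"


class AsyncReplicator:
    """Toy model: a write is acknowledged once the primary holds it."""

    def __init__(self):
        self.primary = []
        self.secondary = []
        self.pending = deque()  # staging area for updates not yet shipped

    def write(self, record):
        self.primary.append(record)
        self.pending.append(record)  # shipped later, not before the ack
        return "ack"

    def drain(self):
        """Ship staged updates to the secondary, e.g. at a set interval."""
        while self.pending:
            self.secondary.append(self.pending.popleft())

    def lag(self):
        """Number of acknowledged writes the secondary has not yet received."""
        return len(self.pending)
```

After three writes, a `SyncReplicator` holds identical primary and secondary copies, while an `AsyncReplicator` reports a lag of three until `drain()` runs. In this model, the writes still in the staging area are the ones at risk if the primary fails before they are shipped.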
Hardware versus Software
Replication can be hardware- or software-based. Hardware replication is generally performed by
specialized controllers, freeing server bandwidth for other functions. Available from most
hardware storage vendors, hardware-based replication solutions require specific disk arrays
accompanied by replication software.
While hardware-based replication typically represents a synchronous approach, many vendors
are responding to consumer demand for more robust functionality through automation, data
transfer modes, and other add-ons. Unfortunately, the effective use of these add-ons may require
additional advanced programming and configuration. Some of these costs are offset since the
initial configuration is typically straightforward.
Conversely, software-based replication works at the application or DBMS level, alleviating some
of the challenges of hardware-based replication. Software-based replication solutions are not
bound to any particular hardware vendor or brand, providing more flexibility and versatility. Thus,
a software-based approach enables data replication from any type of storage system to any other,
over any type of IP network connection. Application-level replication is transaction-aware,
assisting in advanced disaster recovery initiatives. However, software-based replication leverages
server cycles that can affect performance.
Database-Level Replication
Database-level replication is the most cost-effective option, addressing both planned and
unplanned downtime as well as disaster recovery functionality. Because replication takes place at
the database level, organizations can switch users to a second node in near real time, avoiding a
complete database restart. Furthermore, since the replicated copy is a different data set, it can be
remote.
Block/Storage-Level Replication
Block/storage-level replication is best for applications that can tolerate up to 30 minutes of
downtime. It replicates storage blocks as they change on the primary site, unaware of the
application or database that runs above them. Thus, the target site is a complete copy of the
source, replicating all changes including applications and agents. Block-level replication usually
occurs in synchronous mode, resulting in a high level of data consistency. Asynchronous
replication can be enabled when high transaction volumes are present or distances are great.
Because replication is done at the block level, no applications can be running on the target site.
This necessitates a cold restart that introduces manual processes and time delays. Furthermore,
because the replication occurs at the block level, errors on the primary site, such as database
corruptions, could be repeated at the recovery site.
Mirroring
In conjunction with backup, mirroring is a real-time approach to data protection. Mirroring
technology continuously copies data and mirrors it to a secondary server unless specifically
instructed to stop. Unlike a traditional point-in-time backup, a mirrored copy keeps no record of
the original source, simply replacing the original with updated data.
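The contrast between a mirror and a point-in-time backup can be sketched as follows. This is a toy model only: the class names `Mirror` and `PointInTimeBackup`, and the use of a Python dictionary as the "data", are assumptions made for illustration.

```python
import copy


class Mirror:
    """Toy model: the mirror always holds only the latest state."""

    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value  # overwrites in place; no history is kept


class PointInTimeBackup:
    """Toy model: each backup is a frozen copy that can be restored later."""

    def __init__(self):
        self.copies = []

    def take(self, source):
        # Deep copy so later changes to the source do not alter the backup.
        self.copies.append(copy.deepcopy(source))

    def restore(self, index):
        return copy.deepcopy(self.copies[index])
```

If a bad update changes a value in the source, applying it to the `Mirror` replaces the good value there as well, whereas `restore(0)` on the `PointInTimeBackup` still returns the state captured before the bad update. This is why mirroring is used in conjunction with backup rather than instead of it.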
Mirroring can either be synchronous or asynchronous. Synchronous mirroring results in multiple
exact duplicates, but suffers from latency issues that limit its effectiveness over geographical
distances. Although not error free, asynchronous mirroring deployments are effective over distance,
reducing network costs and write latency at the price of a greater lag between source and mirror. Mirroring technology
proves most effective when responding to a local component failure.
Furthermore, aging mirrors produce diminished returns for use as a fallback point. Due to cost
considerations, especially in storage capacity, mirroring implementations (particularly
synchronous) are suggested for mission-critical applications and data categories only.
Business Continuity Planning
Given a choice, stakeholders will opt for 100% availability. However, decision makers realize this
is not possible. Therefore, organizations must approach an availability management plan with
prudence through careful planning, testing, and preparation. A sound, enterprise-wide business
continuity plan is an enormous undertaking.
Roles and Responsibilities
As with any enterprise-wide deployment, organizational acceptance of the need, plan, and
approach is mandatory from the boardroom to the mailroom. To avoid competing agendas, the
first step is to establish roles and responsibilities. This will instill a sense of participation and
encourage collaboration. The IT organization is a significant player in the availability management
process. IT retains responsibility for technical planning, implementation, and operational details of
the business continuity plan.
In conjunction, lines of business (LOBs) are responsible for:
• communicating mission-critical functionality
• understanding and executing business continuity procedures and policies
• planning and executing contingency plans
Lastly, corporate leadership must define and implement policies regarding a business continuity
plan by actively participating and dedicating resources.
Definition of Losses
We can predict tangible losses with some degree of accuracy using consistent or known metrics.
Although this approach may appear straightforward, organizations are naive to think that
bottom-line impacts due to outages are simple to predict and quantify. Predicting and quantifying
intangible losses proves much more difficult since they require a clear understanding of business
functions, user expectations, market position, and brand impact. For instance, corporate credibility
and consumer trust are not debited or credited on a balance sheet or income statement. However,
the associated losses affect the bottom line, causing considerable and justifiable concern.
In addition, several loss categories behave differently in the short and long terms. Unavailability of
a particular service may be acceptable in the short term, but cause significant pain long term.
Loss Potential
For each system and application, business leaders must define appropriate and acceptable
recovery time objectives (RTOs) and recovery point objectives (RPOs) to understand lost
business opportunity and the resulting financial impacts.
RTO identifies the point in time when application (and associated business process) technology
components are functional to the extent that transactions, business functions, etc. can be
resumed. RTO does not mean 100% recovered; it usually indicates a degraded processing mode
(e.g., less capacity, lower performance).
RPO defines the point in time during the recovery cycle at which consistent data recovery can
begin. It can incorporate the time needed to make the disaster declaration (i.e., activate the plan),
stage equipment and personnel resources, transport backup media, and install software
infrastructure (e.g., operating systems, middleware), applications, network switchovers, etc.
Many organizations consider advanced recovery options (e.g., standby operating system,
electronic vaulting of data to the recovery site) to reduce RPO windows (sometimes to zero, i.e.,
consistent data recovery begins immediately).
Categorization Framework
The categorization framework seeks to set base-level availability minimums that are aligned with
most major business requirements. Best practices dictate a weighted and tiered approach based
on the metrics described above: RPO, RTO, loss of core business functions, and both tangible
and intangible financial impacts.
Pinnacle systems and applications are classified as platinum, meaning mission-critical
functionality and high loss potential. Platinum systems demand high availability and a recovery
time as close to zero as possible, minimizing risk, vulnerability, and repercussion. Systems and
applications in a gold classification withstand outages better, with recovery times up to an hour.
Correspondingly, the potential losses from gold-level availability failure have far less impact than those
associated with platinum systems, and so on.
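A tiered classification of this kind can be sketched as a simple lookup. The sketch below is illustrative only: the text fixes just the ordering (platinum as close to zero as possible, gold up to an hour), so the tier names `silver` and `bronze` and every threshold other than gold's 60 minutes are placeholder assumptions, as is the function name `classify`.

```python
# Upper bound on recovery time, in minutes, offered by each tier,
# ordered from most to least demanding.
# Assumed placeholders: the 5-minute platinum bound, and the silver and
# bronze tiers with their limits. Only gold's 60 minutes is from the text.
TIER_LIMITS_MINUTES = [
    ("platinum", 5),
    ("gold", 60),
    ("silver", 8 * 60),
    ("bronze", 24 * 60),
]


def classify(tolerable_minutes):
    """Return the least demanding tier that still meets the tolerance.

    tolerable_minutes is the longest recovery time the business can
    accept for a system. Tiers are checked from loosest to tightest,
    and the first one whose limit fits within the tolerance is chosen.
    """
    for tier, limit in reversed(TIER_LIMITS_MINUTES):
        if limit <= tolerable_minutes:
            return tier
    # Tolerance is tighter than every limit: use the most demanding tier.
    return TIER_LIMITS_MINUTES[0][0]
```

Under these assumed thresholds, a system that can tolerate 90 minutes of recovery time lands in gold, while one that can tolerate only 10 minutes must be platinum. Choosing the least demanding tier that fits keeps platinum-level spending for the systems that need it.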
Business Impact Analysis
With unlimited funds, availability management would prioritize all systems and applications at the
platinum level. Although that might be possible in an ideal world, the categorization framework assumes a finite
pool of resources. Because of this, business impact analysis assists in allocating the appropriate
level of scarce funds to the systems that impact the business most during an outage. A business
impact analysis enables organizations to identify and assign costs to key business processes.
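The prioritization idea can be sketched as a small ranking exercise. This is a simplified illustration, not the paper's method: the field names (`loss_per_hour`, `expected_outage_h`, `protection_cost`), the greedy allocation rule, and the function names are assumptions, and the sketch covers tangible losses only, leaving out the intangible losses discussed earlier.

```python
def outage_cost(process):
    """Estimated tangible cost of one outage for a business process."""
    return process["loss_per_hour"] * process["expected_outage_h"]


def rank_by_outage_cost(processes):
    """Sort processes by estimated outage cost, highest first."""
    return sorted(processes, key=outage_cost, reverse=True)


def allocate(processes, budget):
    """Fund protection in priority order while the budget allows.

    Each process is a dict with hypothetical fields:
      name               label for the business process
      loss_per_hour      estimated tangible loss per hour of downtime
      expected_outage_h  estimated outage duration in hours
      protection_cost    cost of protecting the process
    Returns the names of the funded processes, in priority order.
    """
    funded = []
    for process in rank_by_outage_cost(processes):
        if process["protection_cost"] <= budget:
            funded.append(process["name"])
            budget -= process["protection_cost"]
    return funded
```

A process that cannot be afforded is skipped rather than ending the allocation, so a cheaper, lower-priority process may still be funded from what remains. A real analysis would also weigh intangible impacts and dependencies between processes.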
Process and Testing
Processes that support the business continuity plan are established and documented as the
business impact analysis tests and solidifies priorities. Policies must be created regarding process
improvement, sustainability, organizational resiliency, and conflict resolution. Gartner research
shows that organizations are poorly positioned and woefully unprepared to execute on their
business continuity plans. Only 25% of organizations include training and education about the
business continuity plan, leaving key personnel to act independently during a crisis. Fewer than
50% of organizations plan for transportation logistics and telecommunications/network outages.
Rehearsals are the best way to prepare for uncertainty and test the business continuity plan.
Announced or unannounced rehearsals should confirm time objectives, staff preparedness and
awareness, duplication or overcommitment of resources, and the responsiveness and
effectiveness of external parties.
Chapter 3: Business Continuity Planning Objectives
Today, a Business Continuity Management (BCM) strategy ensures that an organization can
survive during and after any disaster causing data loss. It’s also a key tool to build a
comprehensive emergency management system to sustain critical business processes.
BCM/BCP is often treated as synonymous with disaster recovery, but it encompasses more than just DR. It ties in
all the essential components needed to deal with disaster, and ensures uninterrupted provisioning of
critical business operations and services even during a total system collapse.
Business continuity planning is an integrated, enterprise-wide process that should include business impact analysis, resumption planning, business recovery, contingency planning, crisis communication systems, disaster recovery, information security, risk management, and software management.
Business continuity is moving from a reactive to a proactive investment. Most organizations think that they
already have Business Continuity Plans in place under other names, such as crisis plans, emergency
evacuation plans, disaster recovery plans, communication plans, recovery plans, or just Plan B.
HOW BUSINESS CONTINUITY WORKS (BUSINESS CONTINUITY MODEL)
Issue: Availability
  Solution: Enterprise High Availability
  Objective: Achieve and maintain the chosen availability level of the enterprise’s IT infrastructure.
  Emphasis: Technology
Issue: Reliability
  Solution: Service Level Mgmt.
  Objective: Effectively manage and control the IT infrastructure to improve the overall operational reliability.
  Emphasis: Processes
Issue: Recoverability
  Solution: Business Continuity Planning
  Objective: Provide an effective plan to minimise downtime of key processes in the event of a major disruption.
  Emphasis: People
Focus: Proactive and preventive; response and recovery
Fig. 5 Components of a Business Continuity Model
The Business Continuity Institute (BCI) and the Disaster Recovery Institute International
(DRII) have agreed to ten (10) standards for Business Continuity Management. The ten
certification standards for business continuity practitioners are:
1. Project initiation and management: Establish the need for a business continuity plan, including management support and the elements of organizing and managing the project to completion.
2. Risk identification, analysis and control: Determine the risks that can adversely affect an organization, analyze the results, and determine the controls needed to prevent or minimize risks.
3. Business impact analysis (BIA): Quantify the risks identified in item 2. Establish critical functions, their recovery priorities, and interdependencies so that recovery time objectives can be set.
4. Developing continuity strategies: Determine and guide the selection of alternative recovery operating strategies to maintain the organization’s critical functions.
5. Emergency response and operations: Develop and implement procedures to respond to and stabilize an incident or event.
6. Developing and implementing Business Continuity Plans: Design, develop, deliver and implement the continuity plans.
7. Awareness and training programmes: Prepare a program to create awareness and enhance the skills required to develop, implement, maintain and execute the continuity plan.
8. Maintaining and exercising continuity plans: Coordinate, evaluate, test and exercise the continuity plan; document results. Develop processes to maintain the continuity capabilities and the plan document, in accordance with the strategic direction of the organization.
9. Public relations and media communication: Develop, coordinate, evaluate, implement, and exercise plans to handle the media during a crisis. Communicate with employees and their families, key customers, critical suppliers and other suppliers, owners/stockholders, and corporate management during the crisis to ensure that all stakeholders are informed.
10. Co-ordination with public authorities: Establish applicable procedures and policies for coordinating continuity and restoration activities with other emergency management agencies as required by statutes or regulations.
Note: Details of item 10 vary from country to country, and from industry to industry.
The first five (5) steps are critical before BC Plans are produced.
How many BC Plans should an organization have? There should be at least one for each location
and then separate plans within each location for mission critical business units or divisions at that
location.
A Management Recovery Team Plan binds the BC plans together. The Management Recovery
Team includes senior representatives of all the mission critical functions and specialists that have
company wide responsibilities including human resources, legal, finance, public relations,
telecommunications, systems, and strategic planning. Keep in mind that the business units that
these key people come from will also have their own separate Business Continuity Plans.
Disaster recovery planning is not limited to IT. It is a business issue.
When developing the Plan, address the following points:
• Senior management must understand the level of effort needed to research, define,
construct, and test the Plan. There needs to be support and commitment from the top!
• Management must support the planning effort and ensure its success both on a short-term
and an ongoing basis. This means allocating resources to manage tasks such as
documentation and testing on an ongoing basis.
• Select a project team to ensure an adequate balance between IT and business community
members. This will ensure that the resulting Plan will cover the requirements of both the
IT and business communities.
• Define and agree upon the recovery requirements of the business and IT communities.
Furthermore, they should be posted where everyone in the organization can access them (such
as on the company intranet). Visibility ensures that people realize the importance of the
effort, and their role in its success.
• Design solutions to fit the requirements of the business and the IT communities, including
risk mitigation.
• The final Plan, incorporating those solutions, must be easy to understand, put into
practice, and maintain.
• Develop a process to keep the Plan current, representing the business and
computing environments at all times.
Disaster recovery planning is a highly complex and time-consuming activity that requires a firm
commitment from management to allocate the hours and funds necessary to achieve success. In
addition, implementing solutions designed to mitigate risk often requires major expenditures.
Chapter 4: Defining Disaster Types
The word ‘disaster’ is derived from the Latin word for “evil star”, a metaphor for a comet, once
thought to be a harbinger of impending doom. Today, IT is a vital part of the Value Chain. It has
evolved from a support function to the lifeline of the entire business. Disaster, in this context,
means the unplanned interruption of normal business processes resulting from the interruption of
the IT infrastructure components used to support them.
Disasters happen. Diligent planning and preparation help us to control our disaster response.
Defining Disasters
The first step in planning to recover from a disaster is to define what types of disaster may occur.
There are three categories of disaster:
Category I, the least serious category, may include such events as electrical failure, rolling blackouts,
or an accident that severs a major power line. These are the most common types of disaster;
they may be serious if an organization is not prepared.
Category II, localized man-made or natural disasters of a more serious nature, such as a flooded
or fire damaged computer room, require more extensive planning. Since downtime caused by
these disasters can last for days or weeks, they can devastate an unprepared company.
Mitigation of risk for these may include contracting with an outside agency for a mobile computer
center, or maintaining a hot site in another location.
Category III, widespread natural disasters such as earthquakes or floods, require the most
planning and can be the most difficult to recover from. Although these events do not happen
often, they do happen. They can drive a company out of business without adequate planning.
Mitigating Risk
Disasters in the first category are relatively easy to mitigate, and many businesses will have
recognized the need for this during their initial business planning processes. Cost is often a factor
in determining what, if anything, is done to prepare for disasters. The impact of data loss and the
inability to continue business for an extended period of time may be enough to put any company
out of business! While a cost / benefit analysis should be performed, it is important to view risk
mitigation as an insurance policy.
Define what Types of Disasters need to be planned for
Disasters in categories II and III are more difficult to mitigate and require extensive preparation
and planning. Every organization that relies on its IT community in the day-to-day operations of
business should assess its risk of disaster.
As part of risk mitigation, many companies face the question of whether it is better to maintain a
hot site of their own or to contract with an outside agency for recovery services at the agency’s
own recovery site(s) or mobile unit. Cost, rather than ability to recover, is often the key consideration.
A company's location is the first determining factor. If the company is in an area that is not subject to
widespread natural disasters, such as are described in category III, a mobile solution may be
acceptable. If a company resides in an area that is subject to such disasters, it would be better to
have the remote recovery site in an area that is not subject to those forces of nature. Having a
hot site that resides in another geographic location, and possibly a more rural setting, also makes
sense when mitigating the risk of terrorism.
Size is another determining factor. A company with relatively modest computing requirements
may find, depending on location, that either a mobile or fixed recovery site supplied by an outside
agency is adequate. A company that has large scale computing needs may find that, depending
on the size and number of systems being restored, a mobile recovery solution is not practical.
Availability is another factor. Many agencies that sell recovery services sell the same services to
a large number of clients. This is perfectly reasonable business practice since supplying a
dedicated site and system to each of the companies contracting for these services would be
financially impractical, both for the agency and the client. In the event a company has a fire in its
computer room and requires recovery services until their own facility can be rebuilt, this may work
well. However, in the event of a widespread regional disaster, a company may find itself sharing
computing systems with other clients, or worse, waiting until another client is finished with the
facilities. This aspect of recovery in the event of an actual disaster must be understood,
acknowledged, and managed using contractual agreements that provide for priority access to one
or more recovery sites.
Chapter 5: Best Practices and Trends in BC Planning
Strategic Imperative: Real-time enterprises cannot afford to accept the risks associated with
business continuity (BC) vulnerabilities — the consequences could be fatal.
Real-Time Enterprise and BCP: A Collision Course
Business is moving faster than ever before, with a focus on:
• Real-time enterprise business process integration
• Significant reliance on partners in the value chain
• Faster flow and immediate responses expected
• You are only as strong as the weakest link
Historically, business continuity (BC) was focused on protecting the enterprise against unlikely but
large events — fire, flood, and natural disaster. With the real-time enterprise (RTE), however,
even the smallest of interruptions — minutes or hours outage of a critical business system,
interruption in service from a critical supplier or outside service provider, or the potential business
impact caused by the economy and its effects on critical customers/suppliers — can have serious
business consequences.
It is estimated that less than 25% of large enterprises have comprehensive business continuity
planning programs, and just 50% have comprehensive disaster recovery programs. Those that do
not are on a collision course with destruction. Those that have done BC planning are confident
in their ability to adapt and survive, whatever the incident/situation facing them.
| BC Components | Disaster Recovery | Business Recovery | Business Resumption | Contingency Planning |
| --- | --- | --- | --- | --- |
| Objective | Mission critical applications | Mission critical business processing (workspace) | Business Process workarounds | External events |
| Focus | Site or component outage (external) | Site outage (external) | Application outage (internal) | External behavior forcing change to internal |
| Deliverable | Disaster Recovery Plan | Business Recovery Plan | Alternate Processing Plan | Business Contingency Plan |
| Sample Event(s) | Fire at the datacenter; critical server failure | Electrical outage in the building | Credit authorization system down | Main supplier cannot ship due to its own problem |
| Sample solution | Recovery site in a different location | Recovery site in a different power grid | Manual procedure | 25% backup of vital products; backup suppliers |

CRISIS MANAGEMENT (spans all four components)
Table 1: Components of BC and Corresponding Plans
The shift from disaster recovery planning to BCP recognizes that IT services are just one
essential component of a business process. Planning and mitigation of all critical resources — IT,
people, facilities, specialized equipment - are required to effectively recover from a disaster. BCP
is a top-level concern and is vital to maintain financial confidence and the reputation of the
business.
BCP includes five components:
• disaster recovery
• business recovery
• business resumption
• contingency planning
• crisis management
The crisis management component addresses managing the event, protecting employees, and
maintaining confidence in the business despite the business interruption. This presentation
focuses on best practices in BCP, emphasizing the RTE impact. RTE does not change the five
components of BCP but rather places more importance on the enterprise’s contingency and crisis
management plans because of the public nature of outages and the increasing reliance on
external services providers (ESPs) for processing. It also reduces recovery point and time
objectives toward real time — 24x7 continuous availability.
How has BC evolved, and what is the impact of real-time enterprise?
BCP has evolved significantly during the past 20 years. In the early 1990s, BCP was IT
disaster recovery, providing protection from natural disasters and critical component failure by
enabling recovery in another data center in about 72 hours. In the mid-1990s, enterprises added
business process protection, and recovery plans were developed (e.g., those for customer call
centers). In the late 1990s, as enterprises re-engineered their business processes and assessed
business processes from a year 2000 remediation perspective, it became apparent that
traditional recovery plans with 72-hour recovery periods were not good enough. Thus, enterprises
significantly increased spending to achieve recovery times of between 4 and 24 hours. The
evolution toward e-commerce and RTE resulted in yet another discontinuity affecting BCP.
For many RTEs, a 4- to 24-hour site outage would cause irreparable damage to the enterprise.
Consequently, many enterprises are incorporating BCP into their business process, application
and technology architecture designs — and building in continuous 24x7 availability. Furthermore,
the risks are greater with RTE, so the BC plan must address new scenarios — and BC processes
must integrate with a greater number of enterprise processes.
One of the most important lessons learned from recent disasters is that people issues need to
take center stage in planning — safety, communication and resiliency in workspace and process
issues. As a result, crisis management plans and call trees are being created or updated as are
contingency plans regarding availability of outside service providers and partners. New scenarios
are being developed to address new vulnerabilities.
Business processes (and their integration with external constituents) are rapidly transforming with
enterprise investment in RTE applications and infrastructures. The new risks and the integration
of continuous availability into the business process affect the business in new ways. The
boundaries between “business as usual” and an emergency event, so easily drawn prior
to RTE, can no longer be maintained, and there is no distinction between these two operating
environments. The cost of operating the RTE application environment increases because the
decision for BC is pushed up into the design phase of the project. The risk of RTE application
service downtime reduces the risks that can be accepted by business management; therefore,
they must be addressed with recovery solutions. The new risks need to be reviewed by an
integrated business/IT team. An outage is public knowledge; therefore, the reaction to it must be
immediate and well-managed. Outages take on many faces: the application might hum, but the
operating processes around the application environment might be the cause.
What best practices should enterprises pursue in striving for BC program excellence?
[Figure 6 labels: Process (Change Management, Education, Testing, Review); Testing; Group Plans & Procedures; Risk Reduction; Implement Standby Facilities; Create Planning Organisation; Recovery Strategy; Risk Analysis; Business Impact Analysis; Policy, Organisation, Resources, Scope; Business Continuity Planning Initiation]
Fig. 6 The Foundation Pillars of Business Continuity Planning
Senior management sponsorship and participation are the foundations of BC excellence. Build BC
into the enterprise culture by weaving BC processes into the life cycle of every project and change
management process. In the requirements phase, the business impact analysis (BIA) identifies
what the enterprise has at risk and which business processes are most critical, thereby prioritizing
risk management and recovery investments. The direct/indirect impact of business
interruptions is assessed over time, resulting in requirements for recovery time and point
objectives. Risk analysis identifies the enterprise’s vulnerability to risks so that they can be
mitigated in the project design phase. Recovery strategies and processes are developed in the
architecture and design phase. When cost of recovery is outside the project budget, enterprises
often must revert to business requirements to re-justify investments or change requirements.
During construction, detailed plans and procedures are created by those responsible for the daily
operation of the processes. The recovery process must be tested prior to implementation to
ensure that requirements can be met. Establish a process to keep the plan current by initiating a
review of every change to business processes or systems.
Action Item: Enterprises need to formalize business continuity processes, starting with the creation of a BC organization responsible for setting policy, governance and reporting.
To determine appropriate availability investments, enterprises need to understand the
consequences of downtime to justify investments for operational availability and BC. A first step
in developing a BC plan is to perform a Business Impact Analysis (BIA). Identify and prioritize
critical business processes and evaluate costs of downtime.
Key goals of the BIA:
1) agree on the cost of business downtime over varying time periods,
2) identify business process availability and recovery time objectives, and
3) identify business process recovery point objectives.
The BIA results feed into the recovery strategy and process.
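The first BIA goal, agreeing on the cost of downtime over varying time periods, can be sketched as a small calculation. The process names, the hourly cost figures and the linear cost model below are hypothetical and chosen only for illustration; a real BIA would use figures agreed with the business and may find that cost grows non-linearly with outage length.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    cost_per_hour: float  # hypothetical agreed cost of one hour of downtime


def downtime_cost(process: BusinessProcess, hours: float) -> float:
    """Cost of an outage, assuming cost grows linearly with duration."""
    return process.cost_per_hour * hours


def rank_by_impact(processes, hours):
    """Order processes from highest to lowest cost for an outage of `hours`."""
    return sorted(processes, key=lambda p: downtime_cost(p, hours), reverse=True)


# Illustrative figures only; they are not taken from the case study.
processes = [
    BusinessProcess("order entry", 50_000.0),
    BusinessProcess("payroll", 2_000.0),
    BusinessProcess("credit authorization", 20_000.0),
]

for hours in (1, 8, 24, 72):
    ranked = rank_by_impact(processes, hours)
    print(hours, [(p.name, downtime_cost(p, hours)) for p in ranked])
```

The ranked list is one possible input when setting recovery time objectives: the processes at the top are the ones whose outage becomes expensive soonest.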
Action Item: Integrate BCP into the enterprise project life cycle to ensure that recovery needs are identified in the initial phases of new projects, or when business processes and systems change.
Tactical Guideline: Even a failed disaster recovery test is useful. BC plans require frequent testing
to ensure the support of critical business requirements. Every plan must be tested to ensure
credible recovery preparedness. Testing familiarizes all BC team members with the experience of
a sudden, unexpected business processing interruption and exposes potential problems and
unforeseen situations. Continuously testing and modifying plans is the key to recovery
preparedness, maximizing the chances of surviving a disaster.
Action Item: Testing RTE recovery plans requires an integrated effort by all parties involved with the transaction. The participation of all owners is critical to the success of the recovery process. When it is not possible to conduct a live test of a BC plan or a component plan, conduct tabletop testing to ensure that external dependencies are addressed.
Tactical Guideline: Annual assessment of recovery capabilities in light of changing requirements
is necessary to ensure the BC plan meets changing business requirements.
Critical Success Factor: Evaluate Capabilities vs. Goals, and Act. Business requirements change
over time; reassess recovery capabilities frequently to ensure they meet business requirements.
This reassessment may be a formal process (such as a mini-BIA). Often, a failed disaster
recovery (DR) test will propel an enterprise into conducting a more detailed analysis.
Action Item: Know your enterprise’s recovery requirements and capabilities; frequently synchronize them with changing business requirements.
What to Focus on When BC Funds Are Limited
• Crisis management plan — ensure the safety of employees, continuity of decision making,
and view from the outside world (includes employee call-tree and facilities diagrams)
• Asset list and key supplier contact information
• Secure, offsite backup tape storage
• Prioritize spending on most critical business processes — perform a BIA to determine
priorities
• Work-at-home programs for workspace recovery
• Contingency planning — mitigate the risks of external events
The most important activity is ensuring that, regardless of the level of spending, it is spent in the
right place — to protect the most-critical business processes. Performing a BIA will aid in
identifying business process and resource criticality, priority and dependencies so that spending
can be prioritized accordingly.
Classifying Business Process Service Levels in Project Life Cycle

| Class | Business Process Services | Service Levels |
| --- | --- | --- |
| Class 1 (RTE) | Customer/partner facing; functions critical to revenue production | 24 x 7 scheduled; 99.9% availability; RTO = 2 hrs., RPO = 0 hrs. |
| Class 2 | Less-critical revenue-producing functions; supply chain | 24 x 6-3/4 scheduled; 99.5% availability; RTO = 8-24 hrs., RPO = 4 hrs. |
| Class 3 | Enterprise back-office functions | 18 x 7 scheduled; 99% availability; RTO = 3 days, RPO = 1 day |
| Class 4 | Departmental functions | 24 x 6-1/2 scheduled; 98% availability; RTO = 5 days, RPO = 1 day |
Table 2: Classification of Business Process Service Levels
Define business requirements for application service availability and DR during the business
requirements phase. Ignoring requirements often results in a solution that requires significant re-
architecture to improve service. Service-level definitions should include scheduled uptime, percent
availability in scheduled uptime, and recovery time and point objectives. Availability is day-to-day
availability of the service. Recovery means the time to recover from a significant event (a rolling
hardware failure or natural disaster) affecting the business process. In this example, Class 1
application services are those with an RTE.
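To make the service levels in Table 2 concrete, the sketch below converts scheduled uptime and percent availability into the unplanned downtime each class can absorb in a year. It assumes a 52-week year and reads a schedule such as "24 x 6-3/4" as 24 hours a day for 6.75 days a week; both are assumptions made for illustration, and the function names are hypothetical.

```python
def scheduled_hours_per_year(hours_per_day, days_per_week, weeks_per_year=52):
    """Scheduled uptime in hours, assuming a 52-week year."""
    return hours_per_day * days_per_week * weeks_per_year


def allowed_downtime_hours(hours_per_day, days_per_week, availability_pct):
    """Unplanned downtime per year that still meets the availability target."""
    scheduled = scheduled_hours_per_year(hours_per_day, days_per_week)
    return scheduled * (100.0 - availability_pct) / 100.0


# (hours per day, days per week, percent availability) as read from Table 2.
# Interpreting "6-3/4" and "6-1/2" as days per week is an assumption.
service_classes = {
    "Class 1": (24, 7, 99.9),
    "Class 2": (24, 6.75, 99.5),
    "Class 3": (18, 7, 99.0),
    "Class 4": (24, 6.5, 98.0),
}

for name, (hours, days, pct) in service_classes.items():
    downtime = allowed_downtime_hours(hours, days, pct)
    print(f"{name}: {downtime:.1f} hours of unplanned downtime per year")
```

Under these assumptions Class 1 tolerates under nine hours of unplanned downtime a year, while Class 4 tolerates more than 160, which is why the classes call for very different architectures.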
Action Item: Develop a service-level classification system with associated development, infrastructure and operations architecture requirements. A repeatable process is a process that works.
Technologies to Reduce RTO/RPO
Traditional BC plans provide 24- to 72-hour application/business process recoverability. Many
enterprises need shorter recovery times for critical applications. High-availability techniques are
escalating (especially for RTE applications), enabling enterprises to achieve RTOs and RPOs in
minutes rather than days. With RTE, hot standby (an idle standby application environment that
waits for a disaster affecting the primary physical site) often isn’t good enough, and many enterprises design
application architectures that span several active physical sites. Thus, if one data center has an
outage, the others continue processing requests.
Action Item: Although rapid RTE recovery is expensive, the alternative (recovery in three or more days) could jeopardize an enterprise’s survival. A BIA will help assess the recovery ROI.
How are technologies and service providers evolving to meet BC’s needs?
Multi-site architectures are used for application services with Class 1 or 2 (short RTO/RPO).
Often, a new RTE application service starts with single-site architecture and migrates to multiple
sites as its risks grow. Multiple sites complicate applications architecture design (load balancing,
database partitioning, database replication and site synchronization must be designed into the
architecture).
For non-transaction processing applications, multiple sites run concurrently to connect users to
the closest or least-used site. To reduce complexity, most transaction processing (TP)
applications replicate databases (or disks) to an alternative site, but the alternative databases are
idle unless a disaster occurs. A switch to the alternative site can be accomplished in 15 to 30
minutes. Some enterprises prefer to partition databases and split the TP load between sites, and
consolidate data later for decision support and reporting. This reduces the impact of a site outage,
affecting only a portion of the user base. Others prefer more-complex architectures with
bidirectional replication between sites to maintain a single database image. All application
services require end-to-end data backup and offsite storage as a component of the DR strategy.
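The routing rule described above for non-transaction processing applications, connecting users to the least-used site and skipping a site that is down, can be sketched as follows. This is a minimal illustration rather than a load balancer design: the Site fields, the site names and the session counts are hypothetical, and health checking is reduced to a flag.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool
    active_sessions: int


def choose_site(sites):
    """Return the least-used healthy site; fail loudly if every site is down."""
    candidates = [site for site in sites if site.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=lambda site: site.active_sessions)


# Hypothetical two-site deployment.
sites = [Site("primary", True, 120), Site("secondary", True, 80)]
print(choose_site(sites).name)   # the less loaded site is chosen

sites[1].healthy = False         # simulate an outage at the secondary site
print(choose_site(sites).name)   # requests continue at the remaining site
```

Transaction processing applications need more than this, because the database state must also be available at the surviving site, which is what the replication options in the paragraph above address.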
Chapter 6: Case study
Methodology
Since our expertise is primarily in the core domain of Storage, we shared BCP best practices with
the customer. At the same time, we helped them understand the value to be gained by
engaging our expert consultancy services.
Tailor each disaster recovery-planning project to the individual organization. For this organization,
we had a specific charter designed for the various critical processes and the teams involved. The
project phases can be broadly classified as follows:
Disaster Recovery Planning for ABC Project Phases
Phase I Project Initiation
The objectives of this phase are to gain an understanding of the organization’s existing and
planned future IT environment, define the scope of the project, develop the project schedule, and
identify project risks.
We established a Steering Committee during this phase. As with any project, the Steering
Committee provided guidance to the project team. The Committee included key personnel from
the business and IT communities.
Phase II Assessment of Disaster Risk
This should include, but not be limited to, an assessment of geographical location, building
composition, computing environment/physical plant security, installed security devices (including
automated fire extinguishers and automated shut-down devices), computing environment/physical
plant access control systems and software, personnel practices, operating practices, and backup
practices. This is a good time to perform an IT Assessment, Practices and Procedures Audit, and
Single Points of Failure Analysis.
Phase III Business Impact Analysis
We conducted an analysis of all key business units supported by the IT team to identify which
systems and functions were critical to the continuation of business, and to determine the length of
time those units could survive without the critical systems.
Phase IV Definition of Requirements
This was the most difficult and time-consuming part of the project. All requirements of, and
relating to, the Plan were defined and detailed. These included the recovery requirements of the
business and IT communities, the requirements generated by the business impact analysis, and
the requirements generated by the assessment of disaster risk and the mitigation of disaster risk.
Phase V Project Planning
It is important here to distinguish between the Project Plan and the Disaster Recovery Plan. The
Project Plan defines the project that is being executed. One of its objectives is to develop the
Disaster Recovery Plan.
Chapter 7: Change Management & Decision Making
Conducting BC Plan exercises & “scenario planning”
I recommend the term “exercise” rather than “test.” This is because the word test suggests either
success or failure; that is not what BCM is all about. Indeed “failure” is not an option for most
businesses today.
The organization should expect to find flaws in BC Plans during each exercise but that does not
equate to failure. When reporting results, report outcomes and recommended actions.
Example of a BCP “Exercising Policy”
Exercising Business Continuity Plans should occur annually or as scheduled by the BCP
Committee. It is important to consider the opinions and recommendations of all Business Unit
Recovery Team members.
The purpose of the exercise is to:
1. Confirm the validity, accuracy and workability of the Business Continuity Plans
2. Ensure that all required resources, including personnel, are available in time
3. Validate that Recovery Team personnel are trained and understand the Business Continuity
Plans
The BCP Committee or Management Recovery Team is responsible for BC Plan exercises.
Results of individual Business Unit exercises must be reported to the BCP Committee in a
prescribed format and should define the action taken by the business unit to make the BC Plan changes
identified by the exercise.
Suggested Draft policy statement - Exercising of all Business Continuity Plans
“Exercising Business Continuity Plans is an integral and critical part of Business Continuity
Management and Business Continuity Plans.”
This is a typical statement of exercising objectives for a BCP Committee: “Provide exercise platforms to ensure business unit confidence in the recovery process for technology & systems, people & planning, business systems, and continuity of operations.”
Scenario Planning or WHAT IF?
You can be either surprised by the future or prepared for it. With markets and technology
becoming less predictable, scenario planning will help exercise your organization’s business
continuity strategy and BC Plans. Consultants, academics and corporate planners agree that
rapid change in markets and technology makes reliance on traditional planning increasingly
dangerous.
Documentation and Exercising Output
It is imperative that all parties review their BCP documentation to ensure it is current prior to
initiating the exercise. After the exercise is complete, produce a report summarizing results, any
action items (with expected completion dates), and a table showing expected recovery times for
each system. Distribute this report to all stakeholders, scope document signatories, and exercise
participants.
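The recovery-time table in the exercise report can be sketched as below. Comparing the time observed in the exercise against each system's recovery time objective is an assumption about how the table would be used, and the system names and timings are hypothetical. In keeping with the exercise-not-test principle, a system that misses its objective is recorded as an action item rather than a failure.

```python
def exercise_report(results):
    """Build report rows from (system, rto_hours, observed_hours) tuples."""
    rows = []
    for system, rto_hours, observed_hours in results:
        needs_action = observed_hours > rto_hours
        rows.append({
            "system": system,
            "rto_hours": rto_hours,
            "observed_hours": observed_hours,
            "outcome": "action item" if needs_action else "objective met",
        })
    return rows


# Hypothetical exercise results; not taken from the case study.
results = [
    ("order entry", 2, 1.5),
    ("email", 24, 30),
    ("payroll", 72, 48),
]

for row in exercise_report(results):
    print(f"{row['system']:<12} RTO {row['rto_hours']:>4} h  "
          f"observed {row['observed_hours']:>5} h  {row['outcome']}")
```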
Reviewing the exercise plan, timing, cost, objectives
This is an ongoing task for the BC Committee or Management Recovery Team.
Consideration should be given to:
• Change management
• New business units
• Results of previous exercises
• New scenarios identified from risk analysis and identification
• Critical staff views on regularity of exercises
• Incident register recordings
• Cost identification recorded for analysis
Audit the exercise regularly to evaluate different elements of BCP
According to a recent Gartner audit, business units in all SE Asia regions carry out BCP exercises annually.
A few performed additional, smaller scale tests on a monthly or quarterly basis.
A business case would be a standard part of the deliverable. It would include how much it will
cost and how much it would save in operations after implementation.
Chapter 8: Recommendations
Exercising BC Plans is the most important continuing function of BC management. It requires
regular and consistent planning, documenting, communicating and analyzing of incidents that
unexpectedly require invoking BC Plans. Therefore, the element of surprise should be a feature of
all BC Plan exercises. If you rely on internal resources, it may become too routine.
Recommendations
1. Consolidate server-attached storage to a Networked Storage Solution
2. Establish a Hot/Warm DR Site with remote replication
3. Improve the Backup/Restore Process
4. Increase Availability
5. Implement Storage Resource Management (SRM) Tools
This allows you to plan your IT purchases to optimally meet scalability requirements.
Chapter 9: TCO and ROI Analysis
Disaster Recovery Strategies that have evolved as part of the exercise
Now that all the Disaster Recovery requirements have been documented, they should be rolled
into viable Disaster Recovery strategies.
These Disaster Recovery strategies include the recovery of critical infrastructure and services but
not the recovery of Desktop platforms or their supporting network infrastructure. The proposed
network infrastructure will only support host and backup platforms at the proposed alternate site.
This doesn’t mean that the networking requirements for desktop or notebook access aren’t
critical; it’s just that desktops are distributed according to the DRP of each individual Business Unit, which
would make it hard to consolidate the deployment of required switches at any single location.
Assumptions:
1. Based on currently available information and established scope, pricing is expected to be
within a 20% range of the indicative pricing (plus or minus). During the subsequent Stage 2
engagement a fully costed proposal, technical rollout plan, and timeframe for rollout would be
delivered.
2. This pricing doesn’t consider the ABC resources required during the scoping, solutioning, and
delivery stages of the DRP solution.
3. The pricing presented (in the CAPEX portion) is based on the purchase of the infrastructure.
Other financial alternatives (including leasing and asset management options) are available
and will be covered in the subsequent Stage 2 engagement.
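The first assumption states that final pricing is expected to fall within 20% of the indicative price in either direction. As a minimal sketch, the resulting budget range can be computed as follows; the indicative figure used here is hypothetical, since the actual prices are not disclosed.

```python
def indicative_range(indicative_price, tolerance=0.20):
    """Return the (low, high) bounds around an indicative price."""
    low = indicative_price * (1 - tolerance)
    high = indicative_price * (1 + tolerance)
    return low, high


# Hypothetical indicative price, in millions.
low, high = indicative_range(5.0)
print(f"Budget between {low:.2f} and {high:.2f} million")
```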
The preferred strategy follows.
Strategy 1: Rapid Recovery Solution with Advanced Functionality - The Preferred Strategy
FOCUS: Rapid Recovery
Target for recovery of service: 2 hrs – 8 hrs
Primary Site Requirements:
• Storage Arrays and SAN deployed to consolidate information.
• Real Time Bi-Directional Replication Infrastructure (DWDM / Replicating Software)
• Possibly consolidate all production systems at a single site or distribute between Primary & Secondary sites
Between Sites Requirements:
• Telecommunications Links (Dark Fiber)
• WAN links (existing)
Secondary Site Requirements: Hot DR site with
• Secondary site power & environmental facilities
• Storage Arrays and SAN deployed
• Real Time Bi-Directional Replication Infrastructure (DWDM / Replicating Software)
• Dedicated Network infrastructure for fail-over. Details of Network failover worked
• Dedicated Host & Backup Infrastructure
Advanced Functionality: (Incremental Value over Strategy 2)
Automation and Integration:
• Discovery, configuration, operation of multiple disk replicas
• Automation of source data restoration to production servers or alternate hosts
• Automation of provisioning and related reporting
• Automation of policy-based activities and system administration tasks
• Integrated Performance Monitoring
Scalability & Performance:
• Solution can scale to include Enterprise Business Application across BUs
• Ensure minimal performance degradation as these platforms are integrated into the deployed solution
DRP Process Integration Consulting:
• Professional Services engagements during the implementation process to ensure features implemented during DRP implementation are integrated into BCP plans of other BUs.
| Consideration | Pro | Con |
| --- | --- | --- |
| Management | 1. Better control of future DRP requirements; 2. Least Business Impact; 3. More options for replication and distribution of data | 1. Some resource redeployment and training required; 2. Needs commitment from other Business Units |
| Ease of Recoverability | 1. Most available and scalable of options; 2. Frequent Testing of DR procedure possible | |
| Price | 1. Strong TCO + Strongest ROI | 1. Higher initial + maintenance cost |
Solution: 2 site synchronous storage based replication using EMC Enterprise Storage
Indicative Pricing:
INFRASTRUCTURE COMPONENTS: PRICING:
CAPEX (INCLUSIVE OF 1 YR. MAINTENANCE)
Storage Arrays and SAN deployed at Primary & Secondary sites + SRM Software + Replicating Software $ x million
Dedicated Host Infrastructure at Secondary site $ x million
Dedicated Backup Infrastructure at Secondary site $ x million
Dedicated Network infrastructure for fail-over at Secondary site $ x million
Sub Total: $ 3x
REVEX (ONE-TIME IMPLEMENTATION EXPENSE)
One-time Professional Services to deploy and integrate infrastructure $ x million
OPEX (ON-GOING OPERATING EXPENSE)
Secondary site facilities (per/annum) 12 x $ x/ 1.78 = $ x million
Real Time Replication Infrastructure (DWDM / Telecommunications Links) as a service 2 x $ x/ 1.78 = $ x million
Maintenance from second year $ x million
Tapes bunkered at new site No additional cost
Sub Total: (Per Year) $ 1x
Synchronous versus Asynchronous Replication (Intermediate Recovery) Comparative Pricing:
Based on earlier pricing presented to ABC, we established that the incremental CAPEX for Asynchronous replication (the networking equipment required to support it) negated some of the OPEX benefit of its lower Telecom operating costs. We decided that the incremental benefits of a predominantly synchronous solution outweighed the marginal additional costs.
Pricing Summary for ABC Requirements:
Based on Strategy 1 and the data provided to us, the following pricing changes should be considered when planning for ABC’s DRP requirements:
INFRASTRUCTURE COMPONENTS: PRICING:
CAPEX (INCLUSIVE OF 1 YR. MAINTENANCE)
Storage Arrays and SAN deployed at Primary & Secondary sites + SRM Software + Replicating Software $ x million
Dedicated Host Infrastructure at Secondary site ABC Hosts $ x million
Dedicated Backup Infrastructure at Secondary site $ x million
Dedicated Network infrastructure for fail-over at Secondary site $ x million
Sub Total: $ 3.3 x
REVEX (ONE-TIME IMPLEMENTATION EXPENSE)
One-time Professional Services to deploy and integrate infrastructure $ x million
OPEX (ON-GOING OPERATING EXPENSE)
Secondary site facilities (per/annum) 12 x $ x/ 1.78 = $ x million
Real Time Replication Infrastructure (DWDM / Telecommunications Links) as a service 2 x $ x/ 1.78 = $ x million
Maintenance from second year $ x million
Tapes bunkered at new site No additional cost
Sub Total: (Per Year) $ 1 x
Strategy 2: Standard Rapid Recovery Solution
FOCUS: Rapid Recovery
Target for recovery of service:
2hrs – 8 hrs
Primary Site Requirements:
• Storage Arrays and SAN deployed to consolidate information.
• Real Time Bi-Directional Replication Infrastructure (DWDM / Replicating Software)
• Possibly consolidate all production systems at a single site or distribute between Primary & Secondary sites
Between Sites Requirements:
• Telecommunications Links (Dark Fiber)
• WAN links (existing)
Secondary Site Requirements: Hot DR site with
• Secondary site power & environmental facilities
• Storage Arrays and SAN deployed
• Real Time Bi-Directional Replication Infrastructure (DWDM / Replicating Software)
• Dedicated Network infrastructure for fail-over. Details of Network failover worked
• Dedicated Host & Backup Infrastructure
Consideration: Management
Pro:
1. Better control of future DRP requirements
2. Some Business Impact
3. Not as option rich for replication and distribution of data
Con:
1. Less resource redeployment and training required
2. Needs commitment from other Business Units
Consideration: Ease of Recoverability
Pro:
1. Still high availability
2. Frequent Testing of DR procedure possible
Con:
1. Not the most scalable options
Consideration: Price
Pro:
1. Stronger TCO + Strong ROI
2. Lower initial + maintenance cost
Con:
1. Incremental functionality has to be negotiated on a case by case basis with incremental cost
Solution: EMC Enterprise Storage based Synchronous Replication
Indicative Pricing:
INFRASTRUCTURE COMPONENTS: PRICING:
CAPEX (INCLUSIVE OF 1 YR. MAINTENANCE)
Storage Arrays and SAN deployed at Primary & Secondary sites + SRM Software + Replicating Software $ x million
Dedicated Host Infrastructure at Secondary site $ x million
Dedicated Backup Infrastructure at Secondary site $ x million
Dedicated Network infrastructure for fail-over at Secondary site $ x million
Sub Total: $ 2.5 x
REVEX (ONE-TIME IMPLEMENTATION EXPENSE)
One-time Professional Services to deploy and integrate infrastructure $ x million
OPEX (ON-GOING OPERATING EXPENSE)
Secondary site facilities (per/annum) 12 x $ x/ 1.78 = $ x million
Real Time Replication Infrastructure (DWDM / Telecommunications Links) as a service 2 x $ x/ 1.78 = $ x million
Maintenance from second year $ x million
Tapes bunkered at new site No additional cost
Sub Total: (Per Year) $ 0.75x
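The indicative sub-totals can be compared over a planning horizon with a simple cumulative-cost calculation. The following is a minimal sketch, not the author's costing model. It assumes that "x" denotes the same unit in every sub-total, that the CAPEX sub-totals (3x for the 2 site synchronous solution, 2.5x for Strategy 2) and the yearly OPEX sub-totals (1x and 0.75x) are the only quantified inputs, that REVEX is unknown and therefore left as a parameter defaulting to zero, and that no discounting is applied. The names `Option` and `cumulative_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One DR option, with all costs expressed as multiples of the unit x."""
    name: str
    capex: float            # CAPEX sub-total, in units of x
    opex_per_year: float    # OPEX sub-total per year, in units of x
    revex: float = 0.0      # one-time implementation cost (not quantified in the text)


def cumulative_cost(option: Option, years: int) -> float:
    """Undiscounted cumulative cost after `years` years, in units of x."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return option.capex + option.revex + option.opex_per_year * years


# Sub-totals taken from the indicative pricing tables (assumption: same unit x).
two_site_sync = Option("2 site synchronous (CAPEX 3x, OPEX 1x)", 3.0, 1.0)
strategy_2 = Option("Strategy 2 (CAPEX 2.5x, OPEX 0.75x)", 2.5, 0.75)

if __name__ == "__main__":
    for years in (1, 3, 5):
        a = cumulative_cost(two_site_sync, years)
        b = cumulative_cost(strategy_2, years)
        print(f"{years} yr: {a:.2f}x vs {b:.2f}x (difference {a - b:.2f}x)")
```

Under these assumptions the gap between the two options widens each year by the difference in yearly OPEX, so the comparison depends on the horizon chosen as well as on the initial outlay.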
Chapter 10: Conclusions
People have begun to understand that ‘Business Continuity Planning is, in essence, a pragmatic undertaking.’ BCP is well within the grasp of common-sense individuals possessing sound analytical and communication skills.
Jargon aside, the "new focus" of BCP is to recover critical business operations as expeditiously as possible following an unplanned interruption. Now is the time to develop Business Continuity capability. New technologies will continue to evolve and BCP will keep changing with them, but for organizations to sustain their competitive edge, this is the way to go. BCP has to be driven by top management, and every individual in the organization MUST understand his or her role and value. Change management is the most CRITICAL SUCCESS FACTOR for the business to continue after a disaster. If you have a plan in place but no change guidelines, it is better to have no plan.
The Art of War suggests that the victor survives not through luck or chance but through a clearly outlined strategy executed at all levels within the ranks, with each individual's objectives directed at achieving the team's collective goal.
Appendix A: Sample Risk Analysis
Likelihood (Rate 1-5) x Impact (Rate 1-5) = Relative Weight (W), which sets the Risk Category
Likelihood: 1 = Very Low, 2 = Low, 3 = Medium, 4 = High, 5 = Very High
Impact: 1 = Negligible, 2 = Some, 3 = Moderate, 4 = Significant, 5 = Severe
Risk Category: A => W<8, B => 8<W<12, C => 12<W
S.No. Threat or Trigger Likelihood x Impact = Relative Weight (W) Risk Category
1 Earthquake 2 x 4 = 8 B
2 Power Failure 4 x 3 = 12 C
3 Fire 2 x 2 = 4 A
4 Hurricane 1 x 4 = 4 A
5 Tsunami 3 x 3 = 9 B
6 Flood 1 x 5 = 5 A
7 Bombing 4 x 4 = 16 C
8 NBC* Attack at Site 2 x 5 = 10 C
9 NBC* Attack within 50 kms 4 x 4 = 16 C
10 Cyber Attack 5 x 3 = 15 C
11 Kidnapping 2 x 3 = 6 A
12 Sabotage 3 x 2 = 6 A
13 Hazardous Accident 3 x 3 = 9 B
14 Product Recall 2 x 2 = 4 A
15 Public Health 2 x 2 = 4 A
16 Work Stoppage 3 x 2 = 6 A
*NBC = Nuclear, Biological, Chemical
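The scoring rule used in this appendix (Likelihood x Impact = Relative Weight, then a category from the weight) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's tool. The published thresholds leave the boundary values W = 8 and W = 12 undefined, so the sketch assumes 8 falls in B and 12 falls in C, which matches the Earthquake and Power Failure rows. Note that the table lists NBC Attack at Site (W = 10) as C, whereas the threshold rule alone would give B; the sketch follows the threshold rule only. The function names `risk_weight` and `risk_category` are hypothetical.

```python
def risk_weight(likelihood: int, impact: int) -> int:
    """Relative weight W = likelihood x impact, each rated on a 1-5 scale."""
    for label, value in (("likelihood", likelihood), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{label} must be between 1 and 5, got {value}")
    return likelihood * impact


def risk_category(weight: int) -> str:
    """Map a weight to A, B or C.

    Assumption: boundary values are resolved as 8 -> B and 12 -> C,
    consistent with the Earthquake (8, B) and Power Failure (12, C) rows.
    """
    if weight < 8:
        return "A"
    if weight < 12:
        return "B"
    return "C"


# A few rows from the sample table: (threat, likelihood, impact).
SAMPLE_THREATS = [
    ("Earthquake", 2, 4),
    ("Power Failure", 4, 3),
    ("Fire", 2, 2),
    ("Cyber Attack", 5, 3),
]

if __name__ == "__main__":
    for threat, likelihood, impact in SAMPLE_THREATS:
        w = risk_weight(likelihood, impact)
        print(f"{threat}: {likelihood} x {impact} = {w} -> {risk_category(w)}")
```

Keeping the thresholds in one function makes it straightforward to adjust the category boundaries, or to add an explicit override for threats that an organisation wants to treat as category C regardless of weight.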
Biography
Smartha Guha Thakurta is a Technology Consultant and Product Marketing Manager who holds the following credentials:
• EMC Technology Architect NAS Expert
• EMC Technology Architect CAS Specialist
• EMC Technology Architect Storage and Infrastructure Specialist
• Symmetrix Speed
• NAS Speed
He has also achieved the following certifications:
• Brocade: Brocade Certified SAN Designer
• Network Appliance: Network Appliance Storage Associate
• Hitachi Data Systems: Hitachi 9500 Presales Certified Professional
• Sun Solaris
• Sun Certified Cluster 3.x Installer
• Sun Certified Field Engineer
• Sun Certified Network Administrator for Solaris 8
• Sun Certified System Administrator for Solaris 8, Level 2 & 1
• Veritas: Veritas Certified Presales Professional
• Microsoft: Microsoft Certified Professional