
Gartner IT Security Summit 2005

Donna Scott

6-8 June 2005

Marriott Wardman Park Hotel

Washington, District of Columbia

Disaster Recovery and Data Replication Architectures

These materials can be reproduced only with Gartner's written approval. Such approvals must be requested via e-mail [email protected].



C4, SEC11, 6/05, AE

© 2005 Gartner, Inc. and/or its affiliates. All rights reserved. Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results. The opinions expressed herein are subject to change without notice.

Strategic Planning Assumptions: Through 2007, fewer than 20 percent of enterprises will operate at Stage 3, the highest level of disaster recovery management process maturity (0.8 probability). By year-end 2007, the share of large enterprises with well-defined disaster recovery processes and regularly tested plans will rise from the current level of approximately 60 percent to 80 percent (0.8 probability).

Disaster Recovery Management Is Maturing

Stage 0: No data recovery or shelfware plan

Stage 1: Data recovery as an IT project
  Platform-based
  Plan occasionally tested
  Ad hoc project status reporting

Stage 2: Data recovery as a process, component of business continuity management
  Link DR to business process requirements
  Defined organization
  Plan regularly tested
  Formalized reporting

Stage 3: Business integration
  Partner integration
  Process integration
  Continuous improvement culture
  Frequent, diverse testing
  Formalized reporting to BCM, executives and board

Disaster recovery management (DRM) has evolved over 20 years, from its roots in platform-based IT recovery (such as mainframe) to integration in business continuity plans. Enterprises tend to evolve through at least four DRM stages, moving to the next stage when the benefits outweigh the risks of inaction.

In the first phase, disaster recovery (DR) plans are nonexistent or exist only as shelfware. They are not tested or maintained and would not enable or direct recovery actions.

Enterprises typically move next to define a DR plan on a project basis. Typically, there is a realization inside IT that some disaster risk mitigation must be implemented, or there is outside business pressure to protect a specific business process (such as the call center). The project is focused on a plan with occasional testing; however, it is typically not integrated into other IT/business processes and is not maintained.

In the next phase, enterprises focus on building a DRM organization and processes, ensuring a life cycle approach to maintaining a plan, and regularly testing the plan (once or twice per year). Business process owners actively determine IT recoverability requirements and participate in tests.

In the final phase, the focus is on process integration: with change management, to ensure that DR plans are kept up to date, and with incident/problem management, to leverage IT support processes. DRM is also considered early in the stages of a new project. Emphasis is also on end-to-end planning, including partner integration and continuous improvement/best practices.








Investing to Reduce Unplanned Downtime

40% Operations Errors
  People and Process: Hiring and Training; IT Process Maturity; Reduce Complexity; Automation; Change and Problem Mgt.

40% Application Failure
  People and Process: App. Architecture/Design; Mgmt. Instrumentation; Change Management; Problem Management; Configuration Management; Performance/Capacity Mgt.

20% Environmental Factors, HW, OS, Power, Disasters
  Investment Strategy: Redundancy; Service Contracts; Availability Monitoring; BCM/DRM and Testing

Based on extensive feedback from clients, we estimate that, on average, 40 percent of unplanned mission-critical application downtime is caused by application failures (including bugs, performance issues or changes to applications that cause problems); 40 percent by operations errors (including performing an operations task incorrectly or not at all); and about 20 percent by hardware, operating systems, environmental factors (for example, heating, cooling and power failures), first-day security vulnerabilities, and natural or manmade disasters. To address the 80 percent of unplanned downtime caused by people failures (vs. technology failures or disasters), enterprises should invest in improving their change and problem management processes (to reduce the downtime caused by application failures); automation tools, such as job scheduling and event management (to reduce the downtime caused by operator errors); and improving availability through enterprise architecture (including reducing complexity) and management instrumentation. The balance should be addressed by eliminating single points of failure through redundancy, implementing BC/DR plans and reducing time-to-repair through technology support/maintenance agreements.

Action Item: Don't let redundancy give you a false sense of security, since 80 percent of downtime is caused by people and process issues.

Client Issue: How will enterprises justify investments in technologies, people and business processes needed to deliver continuous application availability and protection from site disasters?








Justification Vehicle: The Business Impact Assessment

Know your downtime costs per hour, day, two days ...

Revenue
  Direct loss
  Compensatory payments
  Lost future revenue
  Billing losses
  Investment losses

Productivity
  Number of employees affected x hours out x burdened hourly rate

Damaged Reputation
  Customers
  Suppliers
  Financial markets
  Banks
  Business partners
  ...

Financial Performance
  Revenue recognition
  Cash flow
  Lost discounts (A/P)
  Payment guarantees
  Credit rating
  Stock price

Other Expenses
  Temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses, legal obligations ...

RTO, RPO

Client Issue: How will enterprises justify investments in technologies, people and business processes needed to deliver continuous application availability and protection from site disasters?

Enterprises need to understand the consequences of downtime to justify investments for operational availability and business continuity (BC)/DR. A first step in developing a BC/DR plan is performing a business impact analysis (BIA), where critical business processes are identified and prioritized and costs of downtime are evaluated over time. The BIA is performed by a project team consisting of business unit, security and IT personnel. Key goals of the BIA are to: 1) agree on the cost of business downtime over varying time periods, 2) identify business process availability and recovery time objectives, and 3) identify business process recovery point objectives. The BIA results feed into the recovery strategy and process. Enterprises that have never instituted a BIA into their application life cycle processes typically initiate a BIA project and use their findings to ensure that current recovery strategies meet business process requirements. With real-time enterprise (RTE) applications, it is critical that BC/DR is built into the life cycle for new applications and business process enhancement projects so that availability and recovery requirements are built into the architecture and design.
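As a rough illustration of the first BIA goal, the productivity term from the slide (number of employees affected x hours out x burdened hourly rate) can be evaluated over several outage durations. The sketch below is only an example: the `Outage` type, the linear `revenue_loss_per_hour` term and all figures are assumptions made for illustration, not Gartner data or a Gartner method.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    employees_affected: int             # employees unable to work
    hours_out: float                    # outage duration in hours
    burdened_hourly_rate: float         # fully loaded cost per employee-hour
    revenue_loss_per_hour: float = 0.0  # hypothetical direct revenue loss rate


def productivity_loss(outage: Outage) -> float:
    """Employees affected x hours out x burdened hourly rate."""
    return (outage.employees_affected
            * outage.hours_out
            * outage.burdened_hourly_rate)


def downtime_cost(outage: Outage) -> float:
    """Productivity loss plus a simple linear revenue term (an assumption)."""
    return productivity_loss(outage) + outage.revenue_loss_per_hour * outage.hours_out


if __name__ == "__main__":
    # Cost per hour, per day and per two days, using made-up inputs.
    for hours in (1, 24, 48):
        print(hours, downtime_cost(Outage(200, hours, 50.0, 10_000.0)))
```

A real BIA would add the other cost categories (reputation, financial performance, other expenses), and those rarely grow linearly with time.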

Action Item: Integrate business continuity management (BCM) and DRM into the enterprise project life cycle to ensure that recovery needs are identified in projects' initial phases or in changes in business processes and systems.

Strategic Planning Assumption: By year-end 2007, 65 percent of large enterprises will integrate disaster recovery requirements into the new project life cycle, up from fewer than 25 percent in 2004 (0.8 probability).








Criticality Ratings/Classification Systems

Class 1 (RTE)
  Business process services: Customer-/partner-facing; functions critical to revenue production where loss is significant
  DR service levels and strategy: RTO = 0-4 hrs.; RPO = 0-4 hrs.; dedicated recovery environment; architecture may include automated failover

Class 2
  Business process services: Less-critical revenue-producing functions; supply chain
  DR service levels and strategy: RTO = 8-24 hrs.; RPO = four hours; dedicated or shared recovery environment

Class 3
  Business process services: Enterprise back-office functions
  DR service levels and strategy: RTO = three days; RPO = one day; shared recovery environment; may include quick-ship programs

Class 4
  Business process services: Departmental functions
  DR service levels and strategy: RTO = five+ days; RPO = one day; quick-ship contracts typical; sourcing at time of disaster where RTOs are lengthy

Client Issue: How will enterprises justify investments in technologies, people and business processes needed to deliver continuous application availability and protection from site disasters?

Business needs for application service availability/DR should be defined during the business requirements phase. Ignoring this early often results in a solution that doesn't meet needs and ultimately requires significant re-architecture to improve service. We recommend a classification scheme of supported service levels and associated costs. These drive tasks and spending in development and application architecture, systems architecture and operations. Business managers then develop a business case for a particular service classification. From a DR perspective, this case is developed in the BIA and recovery strategy phases. Service-level definitions should include scheduled uptime, percentage availability in scheduled uptime, and recovery time and point objectives. In this example, Class 1 application services have an RTE strategy and are those whose unavailability would cause the enterprise irreparable harm. Not all applications in a critical business process would be grouped in Class 1; rather, only those deemed most critical or with the greatest downtime effect. The DR architecture for Class 1 and even Class 2 would result in an implementation across two physical sites to meet availability/recovery needs.

Action Item: Develop a service-level classification system with associated development, infrastructure and operations architecture requirements. A repeatable process is a process that works.
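One way to make such a classification repeatable is to encode it as a lookup. The sketch below uses the RTO/RPO figures from the example slide and picks the least stringent class that still meets a stated requirement. The names `ServiceClass` and `classify`, the ordering by stringency, and the reading of "five+ days" as 120 hours are assumptions made for illustration.

```python
from typing import NamedTuple


class ServiceClass(NamedTuple):
    name: str
    rto_hours: float  # worst-case recovery time the class commits to
    rpo_hours: float  # worst-case data loss window the class commits to


# Ordered from least to most stringent; upper bounds taken from the example slide.
CLASSES = [
    ServiceClass("Class 4", 120.0, 24.0),  # "five+ days" read as 120 hours (assumption)
    ServiceClass("Class 3", 72.0, 24.0),
    ServiceClass("Class 2", 24.0, 4.0),
    ServiceClass("Class 1 (RTE)", 4.0, 4.0),
]


def classify(required_rto_hours: float, required_rpo_hours: float) -> str:
    """Return the least stringent class whose RTO and RPO both meet the requirement."""
    for c in CLASSES:
        if c.rto_hours <= required_rto_hours and c.rpo_hours <= required_rpo_hours:
            return c.name
    raise ValueError("requirement is tighter than any class in the example table")


if __name__ == "__main__":
    print(classify(24, 4))    # a supply chain function that tolerates a day of downtime
    print(classify(200, 12))  # a long RTO, but the 12-hour RPO rules out Classes 3 and 4
```

Note that a short RPO alone can push a service into a higher class than its RTO would suggest, which is one reason both objectives belong in the service-level definition.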








Client Issue: What technologies will be critical for data replication architectures and what are their trade-offs?

Tactical Guideline: There is no one right disaster recovery data center strategy. Many companies implement all four methods, depending on their application and recovery requirements.

Disaster Recovery Strategies

Common Strategies
  Production/Load Sharing
  Production/Outsourcing/DR
  Production/Development and Test DR
  Production/Standby DR

Client Questions:
  How many data centers should I have? One or many?
  Where should they be located? Is close better than far?
  Should I reduce the cost of DR by using idle assets for other purposes?

Trade-off Costs, Risks, Complexity

Gartner frequently gets questions from clients about data center strategies, such as how many data centers to operate, how they are used, and what the strategy for DR should be. Although there are no right answers to these questions (the right answer for your organization depends on your business and IT strategy), there are common themes across large enterprises. Although data center consolidation has often been used to reduce cost through economies of scale, consolidation across the oceans is fairly rare. This is due to network latency causing unacceptable response time levels for worldwide applications. However, those that do operate a single application instance worldwide achieve greater visibility across business units (such as the supply chain) and reduced overall costs of operation and application integration.

As for the number of data centers, since the Sept. 11 tragedy there has been a slight increase in the overall number of data centers for many organizations, to achieve protection from disasters. From a DR perspective, the trend is toward a sub-24-hour recovery time objective (RTO), and often sub-four-hours, resulting in dedicated recovery environments, either internally or at outsourcers. Often, to reduce total cost of ownership, the recovery environment is shared with development/test or production load sharing, or through shared contracts with DR service providers. Furthermore, capacity-on-demand programs are popular for internally recovered mainframe environments.








Data Recovery Architectures: Redundant Everything

[Figure: a production site and a secondary site, each with a geographic load balancer, a site load balancer, and web server, application server and database server clusters. The sites are linked by transaction replication, DB replication and remote copy. Disk is protected by PIT image and tape backup; LAN and PC data by tape backup.]



For application services with short RTO/recovery point objective (RPO) requirements, multi-site architectures are used. Often, a new IT service or application is initially deployed with a single-site architecture and migrates to multiple sites as its criticality grows. Multiple sites complicate architecture design (for example, load balancing, database partitioning, database replication and site synchronization must be designed into the architecture). For non-transaction processing applications, multiple sites often run concurrently, connecting users to the closest or least-used site. To reduce complexity, most transaction processing (TP) applications replicate data to an alternative site, but the alternative databases are idle unless a disaster occurs. A switch to the alternative site can typically be accomplished in 15 minutes to 60 minutes. Some enterprises prefer to partition databases and split the TP load between sites and consolidate data later for decision support and reporting. This reduces the impact of a site outage, affecting only a portion of the user base. Others prefer more complex architectures with bidirectional replication to maintain a single database image. All application services require end-to-end data backup and offsite storage as a component of the DR strategy. Often, the DR architecture will implement point-in-time replicas to enable synchronized backup and recovery. Application services with greater than 24-hour RTO typically recover via tape in the alternative site.

Tactical Guideline: Zero transaction loss requires transaction mirroring and all the costs associated with it.








Point In Time Copies: The Data Corruption Solution

Controller based
  EMC TimeFinder/Snap
  IBM FlashCopy
  HDS ShadowImage
  StorageTek SnapShot

Software
  Oracle 10g Flashback
  BMC SQL Backtrack
  Veritas Storage Foundation



Point-in-time (PIT) copy solutions are a prerequisite to building Real Time Infrastructures (RTIs). Their penetration rate in enterprise sites is at least three to five times greater than the penetration rate of remote copy solutions. There are two key reasons for this disparity. First, PIT copies protect against a frequent source of downtime: data corruption. Second, they shrink planned downtime for backups from hours to seconds or minutes, and they simplify other operational issues like check-pointing production workloads and application testing.

Software-based PIT copy technologies limit storage vendor lock-in and have the potential to turn their closeness to the protected applications into greater efficiency and tighter application integration. However, these advantages come with the downside of potentially more complex software architectures (with many tools potentially implemented) and the need for additional testing. Storage- or controller-based solutions give up some intimacy with the applications to deliver a solution that is more platform- and application-agnostic, but at the cost of greater storage vendor lock-in.

In most situations, choosing between software- and controller-based solutions will be driven by prior investments, internal skills, application scale and complexity, and price.
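To show why a PIT copy can be taken in seconds and still protect against corruption, here is a minimal copy-on-write sketch. It is a toy in-memory model, not the mechanism of any product named on the slide; the `Volume` class and its methods are hypothetical names introduced for illustration.

```python
class Volume:
    """A toy block volume with copy-on-write point-in-time snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []  # one dict per snapshot: block index -> preserved old data

    def snapshot(self) -> int:
        """Create a PIT copy; nothing is copied until a block is overwritten."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index: int, data) -> None:
        # Before overwriting, preserve the old contents in every snapshot
        # that has not already saved this block.
        for saved in self.snapshots:
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap_id: int) -> list:
        """Reconstruct the volume image as it was when the snapshot was taken."""
        saved = self.snapshots[snap_id]
        return [saved.get(i, current) for i, current in enumerate(self.blocks)]

    def restore(self, snap_id: int) -> None:
        """Roll the volume back to the snapshot, for example after data corruption."""
        for i, data in enumerate(self.read_snapshot(snap_id)):
            if self.blocks[i] != data:
                self.write(i, data)


if __name__ == "__main__":
    vol = Volume(["a", "b", "c"])
    snap = vol.snapshot()      # near-instant: only metadata is created
    vol.write(1, "CORRUPT")    # a bad write lands on production
    vol.restore(snap)          # recovery from the PIT image
    print(vol.blocks)
```

Because the snapshot stores only blocks that change afterward, it is cheap to create, which is also why a backup can run against the snapshot while production continues.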

Strategic Planning Assumption: By 2008, more than 75 percent of all enterprises using SAN storage will have deployed a PIT solution to meet service level objectives (0.7 probability).








Data Replication Alternatives

Application/Transaction Level

Pros
  Architected for no downtime, transparent data access
  Full or partial recovery scenarios
  Loosely coupled application components designed for integrity
  Supports heterogeneous disk
  DBAs understand and have confidence in the solution

Cons
  Must be designed upfront
  Significant re-architecture when not designed upfront
  Requires applications/DB groups to be responsible for recovery
  Does not offer prepackaged solutions that provide integrity and consistency across applications

Product Examples: IBM WebSphere MQ or other message-oriented middleware; Teradata Dual Active Warehouse or other built-in application-level replication; also fault-tolerant middleware or virtualization

Client Issue: What technologies will be critical for data replication architectures and what are their trade-offs?

The best method of achieving continuous 24x7 availability is building recovery right into the application or enterprise architecture. This way, enterprises architect transparent access to data, even when some components are down. Users are never or very rarely impacted by downtime, even with a site failure. Typically, the architecture consists of asynchronous message queuing middleware, but it may be implemented with fault-tolerant infrastructure middleware that replicates the transaction to redundant applications in another location. Application and database architects and database administrators (DBAs) have confidence in this type of solution because it is based on transactions, not on bits, blocks or files that lack application/transaction context. However, this type of architecture may take significant effort in the application design stages, and most enterprises do not task their application development organization with recovery responsibilities. Furthermore, this method does not provide a prepackaged solution for conflict resolution or for ensuring consistency and integrity across applications during recovery. Rather, application developers and architects must assess methods to roll back or forward to a consistent point in time and may code consistency transactions into applications to enable this to happen. Due to these drawbacks, most enterprises use the infrastructure to enable recovery for the majority of their needs and reserve this method for the most critical subset of applications.

Decision Framework: Use application/transaction-level replication where 24x7 continuous application availability (no downtime) is required and for new application projects.

Strategic Planning Assumption: Through 2007, application and transaction-level replication will be used by less than 10 percent of large enterprises (0.8 probability).








Additional Data Replication Alternatives

DBMS Log-Based Replication

Pros
  Often included with DBMS
  Some enable read/write use of second copy and conflict resolution
  Hardware-independent; supports heterogeneous disk
  No special application design required
  Allows flexibility in recovery to a specific point in time other than last backup/point of failure
  DBAs understand and have confidence in the solution
  Generally low bandwidth requirement; lower network costs
  Can be used for short RPO, with longer RTO, reducing software license costs

Cons
  DBMS-specific solution
  More operational complexity than storage-controller-based replication
  Solution automation/integration varies
  Does not replicate configuration data in files
  Requires active duplication of server resources
  Requires DB logs and/or journaling; could impact production performance
  No assurance of cross-application data integrity
  Complex failback due to synchronization issues



Database log-based replication is a popular method for ensuring recoverability. Logs are read, with changes typically shipped asynchronously (some solutions offer synchronous replication, but it's rarely used), and can be applied continuously, on a delay or to a backup upon disaster (this decision depends on the RTO and RPO). As with transaction-level replication, DBAs and application architects understand and have confidence in the solution. Furthermore, many solutions allow read/write access of the second copy, and, therefore, it is possible to create failover transparency in the solution (if replication is closely synchronized). However, care must be taken to avoid conflicts that would require resolution. To minimize conflict resolution, most enterprises apply transactions at a primary location and only switch to the secondary when the application cannot access the primary database. A major downside of this solution is that replication is needed for every database; thus, labor/management costs increase. Furthermore, configuration data stored in file systems (rather than the database) is not replicated, and synchronization must be designed separately, typically through change control procedures. Moreover, cross-application recovery integrity must be built into the solution, for example, by writing synchronization transactions and rolling back or forward to achieve consistency. Despite the drawbacks, thousands of enterprises use database log-based replication to achieve short RTOs or RPOs.
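The ship-then-apply flow described above can be sketched in a few lines. This is a toy in-memory model, not the behavior of any specific DBMS; `Primary`, `Standby`, `ship` and the integer log sequence number (LSN) scheme are all hypothetical names and simplifications introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered (lsn, key, value) change records

    def commit(self, key, value) -> None:
        self.data[key] = value
        self.log.append((len(self.log) + 1, key, value))


@dataclass
class Standby:
    data: dict = field(default_factory=dict)
    received: list = field(default_factory=list)  # shipped, possibly not yet applied
    applied_lsn: int = 0

    def receive(self, records) -> None:
        self.received.extend(records)

    def apply(self, up_to_lsn=None) -> None:
        """Apply received records in log order, optionally stopping at an LSN."""
        for lsn, key, value in self.received:
            if lsn <= self.applied_lsn:
                continue  # already applied
            if up_to_lsn is not None and lsn > up_to_lsn:
                break     # point-in-time recovery: stop before later changes
            self.data[key] = value
            self.applied_lsn = lsn


def ship(primary: Primary, standby: Standby, shipped_lsn: int) -> int:
    """Ship every record after shipped_lsn; return the new high-water mark."""
    standby.receive([r for r in primary.log if r[0] > shipped_lsn])
    return primary.log[-1][0] if primary.log else shipped_lsn


if __name__ == "__main__":
    p, s = Primary(), Standby()
    p.commit("a", 1)
    p.commit("b", 2)
    mark = ship(p, s, 0)   # asynchronous shipment of LSN 1 and 2
    p.commit("a", 3)       # committed but not yet shipped: at risk in a disaster
    s.apply()              # apply on disaster (or continuously, or on a delay)
    print(s.data, "unshipped records:", len(p.log) - mark)
```

In this model the unshipped records correspond to the RPO exposure, while how much of the received log is still unapplied at failover drives the RTO, which is why a "short RPO, longer RTO" configuration ships continuously but applies only when needed.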

Decision Framework: Consider replication at the database management system level to provide short RPO and RTO for mission-critical applications, keeping in mind that data integrity across applications must be designed into the applications and transactions.








DBMS Log-Based/Journaling/Shadowing

Oracle Data Guard
  Strength: Automation, function included
  Weakness: Failover resynchronization

DB2 UDB HADR for v.8.2
  Strength: Log apply automation; included in ESE
  Weakness: Failover automation; cannot read target

SQL Server Log Shipping
  Strength: Function included
  Weakness: Automation, failover resynchronization

Quest SharePlex for Oracle
  Strength: Bidirectional
  Weakness: Cost

GoldenGate Data Synchronization
  Strength: Bidirectional, multi-DBMS support
  Weakness: Gaining ground outside NonStop

Lakeview, Vision, DataMirror
  Strength: AS/400
  Weakness: Gaining ground outside AS/400

ENET RRDF
  Strength: z/OS
  Weakness: z/OS only

HP NonStop RDF
  Strength: NonStop
  Weakness: NonStop only



Oracle Data Guard is a popular method for DR, as it is included with the Oracle license and has integrated automation built in for failover and failback. In 10g, it also offers failback resynchronization, which enables changes made at the secondary site to be integrated back into the primary site database management system (DBMS), resulting in no lost transactions. Data Guard offers two methods for replication: shipping of archive logs (after commitment), which could mean an RPO of 15 to 30 minutes or more, or shipping of redo log changes, which could be implemented in synchronous mode for zero data loss or in the more commonly implemented asynchronous mode. DB2 log shipping is included and offers asynchronous replication but does not have built-in automation. In DB2 UDB v.8.2, IBM added HADR (included with Enterprise Server Edition only), which automates shipping, receiving and applying logs (not failover). HADR does not enable users to read the target DBMS; for this, you must implement the more-complex DB2 replication. SQL Server also lacks failover and failback automation. Quest SharePlex, a popular tool for Oracle replication, provides close synchronization (a few seconds to a minute) and bidirectional support. GoldenGate offers similar technology for multiple DBMS platforms. HP NonStop has strong replication functionality for its DBMS. Suppliers of AS/400 replication technology are Lakeview Technology, Vision Solutions and DataMirror. On the mainframe, ENET RRDF is often deployed in environments with short RPO and longer RTO, so that the changed data is maintained in an alternative site but not applied until disaster (or in tests).

Decision Framework: Most relational DBMS products include log-based replication, but the degree of synchronization (speed) and automation varies considerably. Some third-party tools offer multi-DBMS replication support, as well as integrated automation and synchronization.








Data Replication Alternatives Pros/Cons

Storage Controller Unit-based

Pros
  Infrastructure-based solution requires less effort from application groups
  Platform and data type independence (including mainframe)
  Single solution for all applications/DBMSs
  Operational simplicity
  Most solutions assure, but do not guarantee, data integrity across servers and logical disk volumes
  Minimal host resource consumption

Cons
  Data copies not available for read/write access
  Short recovery time, but user work is interrupted
  Storage hardware and software dependent
  Less storage vendor pricing leverage
  Failover is typically not packaged and requires scripting
  High connectivity costs
  Monitoring/control must be built
  Lack of customer and vendor understanding of procedures and technology to assure data integrity; taken for granted that it works
  Homogeneous SAN required



Storage controller unit-based solutions are popular for enterprises seeking to build recoverability into the infrastructure and use the same solution for all applications and databases. Software on the disk array captures block-level writes and transmits them to a disk array in another location. Because many servers (including mainframes) can be attached to a single disk array, there are fewer replication sessions to manage, thus greatly reducing the complexity. Solutions generally ensure write-order integrity within each array, and some also provide a method for ensuring data integrity of the copy across arrays. These solutions started out synchronous and therefore have been highly utilized in close proximity (under 50 miles). Synchronous solutions are extremely popular in financial services industries, where RPOs are set to no-loss-of-work. However, asynchronous solutions are slowly gaining ground. The major drawback to storage controller-based solutions is that recovery cannot be transparent to the applications because control of the target copy is maintained by the primary site, which is only an issue for applications requiring 24x7 availability. Many enterprises use storage controller-based solutions and use transaction-level or DBMS-level replication for those few applications requiring more stringent availability. Another drawback is lock-in to storage hardware and software.
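The idea of preserving data integrity across arrays can be illustrated with a toy consistency group: when replication fails for one member, replication halts for all members, so the secondary never holds a later write without the earlier ones it depends on. This is a conceptual sketch only, not how any vendor implements the feature; the `ConsistencyGroup` class, its `link_ok` flag and the array names are hypothetical.

```python
class ConsistencyGroup:
    """Toy model: replication for every array in the group halts together."""

    def __init__(self, array_names):
        self.remote = {name: [] for name in array_names}  # writes applied at the secondary
        self.suspended = False

    def replicate(self, array: str, write, link_ok: bool = True) -> bool:
        """Send one write to the secondary; return True if it was applied remotely."""
        if self.suspended:
            return False           # group already halted: nothing reaches the secondary
        if not link_ok:
            self.suspended = True  # one failure suspends replication for all members
            return False
        self.remote[array].append(write)
        return True


if __name__ == "__main__":
    group = ConsistencyGroup(["log_array", "data_array"])
    group.replicate("log_array", "begin T1")
    group.replicate("data_array", "T1 row update", link_ok=False)  # link to one array fails
    group.replicate("log_array", "commit T1")                      # held back with the group
    print(group.remote)
```

Without the group-wide halt, the secondary could record "commit T1" on one array while the matching data write on the other array never arrived, leaving a copy that looks current but is internally inconsistent.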

Decision Framework: Consider storage controller unit-based replication to achieve short RPO/RTO, where enterprises desire to move to a single solution to address many applications/data sources in the enterprise.








Storage Controller Unit-Based Synchronous

EMC SRDF
  Strength: Strong market leader; cross-platform MF and distributed; m:n consistency groups
  Weakness: Price, but getting more competitive

IBM ESS Metro Mirror (formerly PPRC)
  Strength: Price competitive; consistency groups across mainframe and distributed
  Weakness: Late entry to market in early 2004

Hitachi TrueCopy
  Strength: Consistency groups; 4:1 MF consistency; small but loyal base; Sun reseller
  Weakness: Consistency groups 1:1 for distributed operating system environments

HP Continuous Access XP
  Strength: Small but loyal base; MC/SG integration
  Weakness: Nearly exclusive to HP-UX

Client Issue: What technologies will be critical for data replication architectures and what are their trade-offs?

EMC SRDF was first to market with an array-based replication solution in the mid-1990s, and, as a result, is the clear market leader. In addition, EMC offers consistency groups across multiple arrays so that when problems occur, all replication is halted to provide greater assurance that the secondary site has data integrity. Furthermore, it is the only solution that supports distributed servers and mainframes on the same array.

In the late 1990s, when SRDF had little competition, users often complained about pricing; however, with additional market entrants, SRDF is being priced more competitively. IBM's Enterprise Storage Server (ESS) Metro Mirror (formerly Peer-to-Peer Remote Copy, or PPRC) is price-competitive. It supports consistency groups across mainframe and distributed environments, and now supports Geographically Dispersed Parallel Sysplex (GDPS). Hitachi's TrueCopy supports consistency groups via time-stamping for mainframe and distributed operating system platforms. However, consistency groups are more functional on mainframe platforms, where they can support four arrays vs. one in the distributed environment. HP licenses Hitachi's TrueCopy and adds value to it for its solution. It can support multiple server operating system platforms, but most customers use it almost exclusively for HP-UX systems.

Decision Framework: Consider synchronous, storage controller unit replication where the two facilities are less than 50 to 60 miles apart, so that network latency does not affect the performance of applications.
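The distance guideline can be related to latency with a back-of-the-envelope calculation: a synchronous write cannot complete until the remote array acknowledges it, so each write pays at least one network round trip. The sketch below assumes light travels roughly 200 km per millisecond in optical fiber and ignores switching, protocol and array overhead, so real figures will be higher. The function names and the 50-mile default threshold (the lower bound of the guideline) are assumptions made for illustration.

```python
MILES_TO_KM = 1.609344
FIBER_KM_PER_MS = 200.0  # assumed propagation speed of light in fiber


def round_trip_ms(distance_miles: float) -> float:
    """Propagation delay only; ignores switching, protocol and array overhead."""
    return 2 * distance_miles * MILES_TO_KM / FIBER_KM_PER_MS


def suggested_mode(distance_miles: float, sync_limit_miles: float = 50.0) -> str:
    """Apply the 50-to-60-mile rule of thumb using a configurable threshold."""
    return "synchronous" if distance_miles < sync_limit_miles else "asynchronous"


if __name__ == "__main__":
    for miles in (10, 50, 500):
        print(miles, "miles:", round(round_trip_ms(miles), 3), "ms per write,",
              suggested_mode(miles))
```

The added delay applies to every write, so write-intensive transaction workloads feel it long before the raw number looks large.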








Storage Controller Unit-Based Asynchronous

Hitachi TrueCopy Async
  Strength: Market leader; Sun reseller
  Weakness: Mostly mainframe installed base

Hitachi Universal Replicator
  Strength: Journal-based; pull tech; Sun reseller
  Weakness: 1:1 consistency groups; new to market

EMC SRDF/A
  Strength: m:n consistency groups (new)
  Weakness: Fairly new to market; few production installs

EMC/Hitachi/IBM XRC controller and host-based replication
  Strength: Supported by multiple storage vendors
  Weakness: Mainframe only

NetApp SnapMirror
  Strength: Proven market leader
  Weakness: Historically a midrange player

IBM ESS Global Mirror
  Strength: HACMP integration; 8:1 consistency groups, MF and distributed
  Weakness: New to market; does not support GDPS (planned YE04)

HP Continuous Access XP Extension
  Strength: Integrated with MC/Serviceguard
  Weakness: Nearly exclusive to HP-UX



Compared with synchronous storage controller unit replication, asynchronous storage controller unit replication is relatively new, making its debut with Hitachi TrueCopy in the late 1990s. Although Hitachi is the market leader in asynchronous storage controller unit-based replication, its installed base and market share pale in comparison with those of synchronous replication. However, for many enterprises that have recovery sites more than 50 to 60 miles apart, asynchronous replication alternatives have reduced the complexity of their recovery environment because they could migrate to asynchronous from synchronous multihop architectures. A synchronous multihop architecture is one where, due to the greater distance between facilities, a local copy is taken, then split and replicated in an asynchronous mode to the secondary site. In this architecture, four to six copies of the data are required vs. the two copies required otherwise. EMC's SRDF has many multihop installations, and many clients are testing its new SRDF/A to assess whether they can migrate from their multihop architectures to a single hop with SRDF/A. In April 2004, IBM announced its first asynchronous solution for the ESS, branding it Global Mirror, rather than PPRC. Hitachi released its new Universal Replicator in September 2004, enabling more replication flexibility and the promise of future heterogeneity.

Decision Framework: Asynchronous storage controller unit replication should be considered where the two facilities are beyond synchronous replication distances (more than 50 to 60 miles apart).
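The distance threshold can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a signal speed in optical fiber of about 200,000 km/s (a common rule of thumb, not a figure from this presentation) and ignores switching, protocol and storage controller overhead, so real delays are higher. The function names are hypothetical.

```python
# Assumption (not from the presentation): light travels through optical
# fiber at roughly 200,000 km/s, about two-thirds of its speed in a vacuum.
FIBER_KM_PER_SECOND = 200_000.0
KM_PER_MILE = 1.609344


def round_trip_ms(distance_miles: float) -> float:
    """Best-case round-trip propagation delay over fiber, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    return 2.0 * distance_km / FIBER_KM_PER_SECOND * 1000.0


def added_wait_ms(distance_miles: float, serial_writes: int) -> float:
    """Extra wait if each of `serial_writes` dependent writes must be
    acknowledged by the remote site before the next one is issued."""
    return round_trip_ms(distance_miles) * serial_writes


if __name__ == "__main__":
    for miles in (50, 60, 500):
        print(f"{miles} miles: {round_trip_ms(miles):.2f} ms per round trip")
```

Under these assumptions, a round trip costs a little under 1 ms at 60 miles and about 8 ms at 500 miles. A synchronous write pays that delay before it completes, which is one way to read the guidance that longer distances call for asynchronous replication.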


Data Replication Alternatives: Pros/Cons

File-based
    Pros:
        Storage hardware independent
        One solution for all applications/data on a server
        Failover/failback may be integrated and automated
        Read access for second copy, supporting horizontal scaling
        Low cost for software
    Cons:
        File system dependent
        More operational complexity than storage controller-based replication
        Application synchronization must be designed in the application

Volume manager-based
    Pros:
        Storage hardware independent
        One solution for all applications/data on a server
        Failover/failback may be integrated and automated
    Cons:
        Volume manager dependent
        Data copies not available for read/write access
        Application synchronization must be designed in the application
        More operational complexity than storage controller-based replication


Client Issue: What technologies will be critical for data replication architectures and what are their trade-offs?

File-based replication is a single-server solution that captures file writes and transmits them to an alternative location. The major benefits are: 1) it does not require storage area network (SAN) storage, and 2) the files can be used for read access at the alternative location. File-based replication is most popular in the Windows environment, where SAN storage is not as prevalent, especially for critical applications such as Exchange. A drawback to this type of solution is that it is server-based; therefore, management complexity is higher than with storage controller unit replication.

Volume manager-based replication is similar to storage controller unit-based replication in that it replicates at the block level and the target copy cannot be accessed for read/write. It requires a replication session for each server and, therefore, has high management complexity. However, no SAN storage is required and it supports all types of disk storage solutions. Both of these solutions are used for one-off applications/servers where recoverability is critical. Furthermore, both solutions tend to offer integrated and automated failover/failback functionality.

Decision Framework: Consider file-level replication to provide short RPO and RTO for

Windows-based applications. Consider volume manager-based replication for applications

requiring short RTO/RPO, where a heterogeneous disk is implemented.


File and Volume/Host-Based Replication

NSI DoubleTake
    Pros: Market leader, integrated automation
    Cons: Windows-only

Legato RepliStor
    Pros: EMC, integrated automation
    Cons: Lack of focus

XOsoft WANSync
    Pros: No planned downtime required, integrated automation
    Cons: New to market

Veritas Volume Replicator
    Pros: Market leader, integrated with commonly used volume manager, multiplatform, VCS integration
    Cons: Price; requires VxVM

IBM Geo Remote Mirror
    Pros: Integrated with AIX volume manager
    Cons: AIX only

Softek Replicator
    Pros: Multiplatform
    Cons: Low market penetration


Client Issue: What technologies will be critical for data replication architectures and what are their trade-offs?

In file-based replication, NSI DoubleTake was the market leader in 2003 with an estimated $19.4 million in new license revenue. NSI primarily sells through indirect channels (such as Dell and HP) to midmarket and enterprise clients. Many use DoubleTake for Exchange and file/print. In the mid-to-late 1990s, Legato had significant file-based replication market share for its RepliStor product (then called Octopus), but it then narrowed its focus (and thus its market share); it has been broadening its focus again since EMC's acquisition of Legato. RepliStor provides EMC with a solution for enterprises that do not have or want heterogeneous disk. A newcomer on the market, XOsoft differentiates itself in scheduled uptime: no planned downtime is necessary to implement replication. Therefore, one common use is disk migrations.

Veritas is the leader in volume manager-based replication, and its Volume Replicator has the same look and feel as its popular volume manager product, VxVM. It is also integrated into VCS, where the DR option provides long-distance replication with failover. Veritas improves manageability of multiple, heterogeneous replication sessions and geographic clusters with CommandCentral Availability, previously called Global Cluster Manager. Softek also offers a multiplatform volume manager-based solution, but it has low market penetration. Formerly called TDMF Open, it has been rebranded Replicator. IBM offers a volume manager-based solution for AIX called Geo Remote Mirror.

Decision Framework: Consider file-based replication for critical Windows-based applications and volume-based replication for critical applications where a heterogeneous disk is deployed.


Other Recovery Technologies

Emerging network-based replication
    Topio Data Protection Suite, Kashya KBX4000, FalconStor IPStor Mirroring, DataCore SAN Symphony Remote Mirroring, StoreAge multiMirror, IBM SAN Volume Controller

Point in time or snapshots to quickly recover from data corruption
    EMC TimeFinder/Snap, IBM Flashcopy, HDS ShadowImage, Oracle 10g Flashback, BMC SQL Backtrack, Imceda SQL Lightspeed, StorageTek SnapShot, Veritas Storage Foundation

Wide-area clusters for automated recovery
    HP Continental Cluster, IBM Geographically Dispersed Parallel Sysplex, Veritas Cluster Server Global Cluster Option

Stretching local clusters across a campus to increase return on investment
    HP MC/ServiceGuard, IBM HACMP, Microsoft Clustering, Oracle RAC, SunCluster, Veritas Cluster Server

Capacity on demand/emergency backup for in-house recovery. Becoming mainstream on S/390 and zSeries mainframes

Speed server recovery with Server Provisioning and Configuration Management Tools

Client Issue: What technologies will be critical for data replication architectures and what are their trade-offs?

There are many other recovery technologies that may be used in disaster recovery architectures. A relatively new set of network-based replication products (sometimes called virtualization controllers) moves the software from the storage array controller into a separate array controller sitting in the storage fabric. This group of suppliers hopes it can change the game and be successful at chipping away at storage controller unit-based replication market share. They offer similar benefits in addition to heterogeneous disk support. Clustering (local, campus and wide-area) offers automation for failover and failback, speeding recovery time and reducing manual errors. Stretch clustering, where a local cluster is stretched across buildings or campuses using the same architecture as local clustering, is becoming more popular as a way to use already purchased redundancy to achieve some degree of disaster recovery (with a single point of failure for the data and networks). Servers configured with capacity on demand enable pre-loaded but idle CPUs and memory to be turned on at the recovery site for disaster recovery testing and in the event of disasters. This reduces the overall cost of dedicated hardware for disaster recovery. And, finally, many enterprises are implementing standard server images (or scripted installation routines) and using these templates (on disk) to restore servers and applications. This is significantly faster than restoring the server from tape and can restore many servers in parallel, significantly reducing manual effort.

Strategic Imperative: Managing the diversity of the infrastructure will reduce complexity and improve recoverability and the ability to automate the process.


Best Practices in Disaster Recovery

Consider DR requirements in new project design phase and annually thereafter

Testing, Testing, Testing
    end-to-end test where possible
    partial where not
    tabletop tests can be advantageous to assess capabilities to address scenarios as well as procedures
    fast follow-up/response to test findings

Incident/Problem/Crisis Process
    where IT incident could result in invocation of DR plan, leverage problem management process, which should already be in place
    damage assessment: must assess costs of failing over to alternate location vs. time to recover in primary location

Use automation to reduce errors in failover/failback
    Use same automation for planned downtime (which results in frequent testing)

The most important parts of disaster recovery management are: 1) considering DR requirements during the new project design phase, to match an appropriate solution to business requirements rather than retrofitting it at a higher cost, and 2) testing. It is only as a result of testing that an enterprise can be confident about its plan and improve the plan by refining procedures and process. As much as possible, tests should be end-to-end in nature and include business process owners as well as external partners (for example, those that integrate with enterprise systems). When an end-to-end test is not possible, partial tests should be done, with tabletop walkthroughs to talk through the other components of the tests. Through frequent testing, participants become comfortable with solving many kinds of problems. In a way, they become more agile, so that whatever the disaster, people are likely to be able to react in a positive way to recover the enterprise without lapsing into a chaotic state (which would threaten recoverability). Moreover, for IT disasters, enterprises should leverage their incident and problem management processes and pull in the DR team during the assessment process.

Another best practice is using automation as much as possible, not only to avoid human error during times of crisis, but also to enable other employees who may be implementing the plan to proceed with recovery, even if members of the primary recovery team are unavailable. By using the automation during planned downtime periods, testing becomes part of standard production operations.

Client Issue: What architectures and best practices will enable enterprises to achieve 24x7 availability and disaster protection as required by the business?


Case Study: DBMS Log-Based Replication Provides RTO Under One Hour

[Figure: Primary Production Site and Secondary Production Site. Diagram labels: Production DBMS, Standby DBMS, Month-End DBMS, Test DBMSs, Local Failover Server (HP-UX, MC/Serviceguard), Disaster Recovery DBMS, DR Test DBMSs, Failover, Failback.]

Async replication: RPO = 0 to 15 seconds
Standby DBMS to mitigate risk of data corruption
Month-end DBMS for reporting
Quest SharePlex captures DBMS changes from Oracle redo logs
SQL is applied continuously to remote DBMSs. In the event of disaster, replication is reversed.
RTO less than one hour
Test disaster process once/quarter
Architecture minimizes planned downtime for migrations/upgrades
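The capture-and-apply flow on this slide can be illustrated with a toy model. The sketch below is not SharePlex and does not read Oracle redo logs: it uses Python's built-in sqlite3 module, an in-memory list as a stand-in for the change log, and hypothetical helper names (make_db, capture, post). It only shows the general idea of recording changes in order and applying them as SQL to a target that tracks the last change it has posted.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a small example table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    return db


def capture(primary: sqlite3.Connection, log: list, sql: str, params: tuple) -> None:
    """Apply a change on the primary and append it to the change log.

    The log is a hypothetical stand-in for a database redo log; each
    entry is (sequence_number, sql_text, params).
    """
    primary.execute(sql, params)
    primary.commit()
    log.append((len(log) + 1, sql, params))


def post(target: sqlite3.Connection, log: list, last_applied: int) -> int:
    """Apply, in order, every logged change the target has not yet seen.

    Returns the sequence number of the last change applied.
    """
    for seq, sql, params in log:
        if seq > last_applied:
            target.execute(sql, params)
            last_applied = seq
    target.commit()
    return last_applied


if __name__ == "__main__":
    primary, target = make_db(), make_db()
    log: list = []
    capture(primary, log, "INSERT INTO accounts VALUES (?, ?)", (1, 100))
    capture(primary, log,
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (30, 1))
    applied = post(target, log, 0)
    print(applied, target.execute("SELECT id, balance FROM accounts").fetchall())
```

In this model, changes that are logged but not yet posted are what the target would lose if the primary disappeared, which is the exposure an RPO figure describes. Reversing replication after a failover corresponds to swapping the roles of the two connections.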

Client Issue: What architectures and best practices will enable enterprises to achieve 24x7 availability and disaster protection as required by the business?

A financial services company processes transaction data with a packaged application based on Oracle RDBMS. Database access comes from internal (such as loan officers) and external customers (such as automated teller machines), with some 300,000 transactions per day. To ensure data availability/recovery, the company deployed Quest SharePlex to replicate its 500GB Oracle DBMS: 1) locally to mitigate data corruption risks in the production database and provide a reporting database and 2) remotely (500 miles) as part of its DR plan. SharePlex captures the changes to the DBMS (from the Oracle redo logs) and transmits them to local and remote hosts. Changes are then applied continuously (by converting them to SQL and applying them to the target DBMSs). SharePlex keeps primary and target DBMSs synchronized, and the company maintains a maximum of 15 seconds RPO. In a site disaster, the target is activated as the primary, any unposted changes are posted and the active database is updated in the application middleware. The failover process, once initiated, takes approximately one hour. Once the remote site is processing transactions, the replication process is reversed back to the primary data center. Although the remote site is missing some transactions (


Recommendations

Make your disaster recovery management processes mature so they are integrated with business and IT processes and meet changing business requirements.

Infuse a continuous improvement culture.

Plan for disaster recovery and availability requirements during the design phase of new projects and annually re-assess for production systems.

Test, test and test more.

Use automation to reduce complexity and errors associated with failover/failback.

Select the replication methods that match business requirements for RPO and RTO. If a single infrastructure-based solution is desirable, consider storage controller-based replication. If 24x7 continuous availability is required, consider application, transaction or database-level replication.
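As a rough illustration, the decision frameworks stated in this presentation can be collected into a small lookup. The sketch below is an assumption-laden simplification: the class, field and function names are hypothetical, the 50-mile default is simply the lower end of the "50 to 60 miles" range quoted earlier, and real selection would also weigh cost, platform and vendor support.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Hypothetical summary of what a project needs from replication."""
    distance_miles: float
    needs_continuous_availability: bool = False   # 24x7 requirement
    wants_single_infrastructure_solution: bool = False
    windows_application: bool = False
    heterogeneous_disk: bool = False


def suggest_replication(req: Requirements,
                        sync_limit_miles: float = 50.0) -> List[str]:
    """Return the replication approaches the decision frameworks point to."""
    suggestions: List[str] = []
    if req.needs_continuous_availability:
        suggestions.append("application, transaction or database-level replication")
    if req.wants_single_infrastructure_solution:
        if req.distance_miles <= sync_limit_miles:
            suggestions.append("synchronous storage controller-based replication")
        else:
            suggestions.append("asynchronous storage controller-based replication")
    if req.windows_application:
        suggestions.append("file-based replication")
    if req.heterogeneous_disk:
        suggestions.append("volume manager-based replication")
    return suggestions


if __name__ == "__main__":
    remote_site = Requirements(distance_miles=500,
                               wants_single_infrastructure_solution=True)
    print(suggest_replication(remote_site))
```

The function returns every approach that matches rather than a single winner, because the frameworks above are phrased as options to consider, not as mutually exclusive rules.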

