TRANSCRIPT
Third Intelligent Storage Consortium, University of Minnesota
IBM Labs in Haifa, © 2005 IBM Corporation
Supporting continuous availability
Alain Azagury, [email protected]
IBM Labs in Haifa
Agenda
Background and motivation
Key building blocks
Characteristics of continuous remote copy
Deep dive
  Synchronous remote copy: Metro Mirror (PPRC)
  Asynchronous remote copy: Extended Remote Copy (XRC), Global Mirror
The latest “hot” technology in the market…
Summary
Scope of this Talk (or Disclaimer)
Focus on (block) storage controllers
  As opposed to host-based, switch-based, or file-system solutions
Focus on IBM solutions
  The topic is not widely described in the literature
Focus on core technologies in storage controllers
  An end-to-end view of business continuity would include:
    Application- and database-specific automation of core technologies for business continuity
    Integration of core technologies into server environments and automation to enable business continuity
    Core technologies for backup and restore, disaster recovery and continuous operations
    Hardware infrastructure
Trends
By 2008, 45% of Global 2000 users will utilize two data centers to deliver continuous availability; of these, 25% will support real-time recovery.
By 2006, more than 60% of G2000 data centers will utilize capacity on demand to satisfy less critical recovery services.
Through 2008, more than 50% of G2000 users will utilize a single "hardened" data center augmented by third-party services to deliver traditional, cost-effective disaster recovery services (48- to 72-hour recovery).
META Trend, 3/8/04
Some lessons from September 11
Recovery requires less of a dependency on people and a greater dependency upon automation
The "rolling disaster" scenario was validated
Disasters may cause multiple companies to recover at once, which puts stress on the commercial business recovery services
Recovery of data from distributed systems and desktops ranged from grade "A" to grade "F"
Tape, as well as disk, is a crucial part of the recovery capability
Rethinking of distance between data centers
Rethinking of synchronous versus asynchronous
D/R plan after successful recovery from disaster
Recovery Metrics
[Chart: the seven recovery tiers plotted by time to recover against cost (TCO of servers, network bandwidth and storage)]
Time to Recover: how quickly is the application recovered after a disaster? (15 min., 1-4 hr., 4-8 hr., 8-12 hr., 12-16 hr., 24 hr., days)
Recovery Point Objective: amount of lost data
Tier 7 - RPO = near zero, RTO < 1 hr.: server/workload/network/data automatic site switch (active secondary site)
Tier 6 - RPO = near zero, RTO = manual: disk or tape data mirroring
Tier 5 - RPO > 15 min., RTO = manual: PiT or SW data replication
Tier 4 - Database log replication and host log apply at remote
Tier 3 - Electronic tape vaulting
Tier 2 - PTAM and hot site; point-in-time backup to tape
Tier 1 - PTAM*
RPO/RTO bands on the chart: RPO = 24+ hours, RTO = days; RPO = 4+ hrs., RTO = 4+ hrs.; RPO = 0 to seconds, RTO = < 1 hr. to 4 hrs.
*PTAM = Pickup Truck Access Method
Key Building Blocks
Point-in-Time Copy
  The ability to create a consistent snapshot of large volumes of data, potentially spanning multiple controllers
  IBM’s FlashCopy, EMC’s TimeFinder, HDS’s ShadowImage
Continuous replication
  Synchronous: IBM Metro Mirror, EMC SRDF
  Asynchronous, no consistency guarantees: IBM Global Copy, SRDF/Adaptive
  Asynchronous, consistency guarantees: IBM Global Mirror, EMC SRDF/A, HDS TrueCopy
  zSeries only: Global Mirror for zSeries (XRC)
Point-in-Time Copy
Three major techniques
  Split mirror
  Changed block
  Concurrent
Current expectations
  Consistent across thousands of volumes and multiple controllers
  Production I/O cannot be disrupted for more than 100’s of milliseconds
  Target copy needs to survive failures
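The "changed block" technique above can be sketched as copy-on-write: taking the snapshot moves no data, and a block's original contents are preserved only when it is first overwritten. This is a minimal illustration with hypothetical names, not the FlashCopy implementation.

```python
# Minimal copy-on-write sketch of the "changed block" point-in-time
# copy technique: snapshot creation is instantaneous; the old contents
# of a block are saved to the snapshot area only on first overwrite.
class Volume:
    def __init__(self, nblocks):
        self.blocks = [b"\0"] * nblocks
        self.snapshot = None          # block index -> original contents

    def take_snapshot(self):
        self.snapshot = {}            # no data moved at snapshot time

    def write(self, idx, data):
        if self.snapshot is not None and idx not in self.snapshot:
            self.snapshot[idx] = self.blocks[idx]  # preserve old data once
        self.blocks[idx] = data

    def read_snapshot(self, idx):
        # Snapshot view: saved original if the block changed, else current.
        if self.snapshot is not None and idx in self.snapshot:
            return self.snapshot[idx]
        return self.blocks[idx]

vol = Volume(4)
vol.write(0, b"v1")
vol.take_snapshot()
vol.write(0, b"v2")               # triggers copy-on-write of block 0
assert vol.read_snapshot(0) == b"v1" and vol.blocks[0] == b"v2"
```

Because nothing is copied up front, production I/O sees only the brief first-write overhead, which is why such copies can span thousands of volumes with sub-second disruption.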
Continuous Remote Copy – Characteristics
Consistency – does the remote copy reflect a consistent view of the data as seen at the source site at some point in time?
No guarantee
  With enough time without modifications, a consistent state will be reached
Power-failure consistency
  Dependent-writes consistency
Application consistency
  Assumes application-specific knowledge or application control
  Application performs regular checkpoints
  Allows faster restarts, at the expense of currency
Continuous Remote Copy – Characteristics (cont.)
Currency – how out of date are the data at the remote site?
Synchronous
  No data loss
  However – what about rolling disasters such as virus contamination or other data corruption issues?
Asynchronous
  Minimal impact on application performance
  Hot-spot data transfer reduction
  Better bandwidth utilization
Continuous Remote Copy – Characteristics (cont.)
Latency impact – what is the impact on application response time?
Impact felt on application writes
For synchronous solutions, a function of the distance between sites
  At least 5 µsec per kilometer (10 µsec round trip), based on the speed of light in glass
  Compare to < 1 msec for local writes
For asynchronous solutions, a function of the overhead of bookkeeping
  Can be as simple as setting a bit in a bitmap, or more complex, such as queuing a message to transfer the data
Continuous Remote Copy – Characteristics (cont.)
Bandwidth requirements – what network bandwidth is required for the solution?
For synchronous solutions, it is determined by the peak write load
For asynchronous solutions, it is determined by the average write load and the tolerance to lag in currency
Additional considerations
  Transfer of modified bytes only
  Level of granularity for modified-data bookkeeping
  Hot-spot elimination
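The peak-versus-average distinction can be made concrete with a toy queueing model (illustrative only; the sample write rates are hypothetical):

```python
# Link-sizing sketch: a synchronous link must carry the peak write
# rate, while an asynchronous link sized near the average rate absorbs
# bursts as replication backlog, i.e. as currency lag.
def sync_link_mbps(write_samples_mbps):
    return max(write_samples_mbps)          # peak write load

def max_lag_seconds(write_samples_mbps, link_mbps, sample_period_s=1.0):
    """Simulate the replication backlog; return the worst currency lag."""
    backlog_mb, worst_lag = 0.0, 0.0
    for w in write_samples_mbps:
        backlog_mb = max(0.0, backlog_mb + (w - link_mbps) * sample_period_s)
        worst_lag = max(worst_lag, backlog_mb / link_mbps)
    return worst_lag

writes = [100, 100, 800, 100, 100, 100]     # Mb/s, one write burst
assert sync_link_mbps(writes) == 800        # sync link sized for the peak
assert max_lag_seconds(writes, 200) == 3.0  # async 200 Mb/s link lags 3 s
```

If a 3-second lag is within the tolerated RPO, the asynchronous link can be a quarter of the synchronous one in this example.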
Synchronous Replication – ESS Metro Mirror (PPRC)
Ensures that the data written will be applied to the secondary before the application host is notified
Consists of two major phases
  Full-track asynchronous transfer mode
    During initial establishment or during resynchronization
  Changed-sector synchronous transfer mode
    Sends only modified sectors to the secondary volume
Supports distances of up to 300 km
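The steady-state (changed-sector) write path can be sketched as follows. The `Controller` class and function names are hypothetical stand-ins, not the PPRC interface; the point is the ordering of the acknowledgment:

```python
# Sketch of the synchronous write path: the host is acknowledged only
# after the secondary confirms the changed data, so the acknowledged
# state can never be lost if the primary site fails.
class Controller:
    def __init__(self):
        self.store = {}

    def write(self, volume, sector, data):
        self.store[(volume, sector)] = data   # fast write into cache/NVS

def sync_write(primary, secondary, volume, sector, data):
    primary.write(volume, sector, data)       # write to primary
    secondary.write(volume, sector, data)     # send only the modified sector
    return "write complete"                   # ack to host happens last

primary, secondary = Controller(), Controller()
assert sync_write(primary, secondary, "vol1", 0, b"x") == "write complete"
assert secondary.store[("vol1", 0)] == b"x"   # secondary matches on ack
```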
Synchronous Replication – ESS Metro Mirror (PPRC): Characteristics
No data loss
Has a direct impact on write processing time
  Processing time in primary ESS to send modified blocks
  Processing time at secondary site (fast write)
  Network delay time
Consistency
  Single-volume consistency guaranteed due to the synchronous nature of the transfer
  Consistency groups allow consistency across volumes (and controllers) in the event of volume suspension
    When a volume pair becomes "suspended", changed tracks are recorded in a bitmap
    However, other volume pairs must be prevented from continuing to receive updates
    PPRC provides a message to the host processors, and commands to freeze all secondary volumes upon detection of the first failing volume
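The freeze idea above can be sketched in a few lines. The classes are hypothetical (the real freeze is driven by host commands across controllers); what matters is that after the first pair fails, no pair keeps updating its secondary, and changes are only recorded for later resynchronization:

```python
# Sketch of consistency-group "freeze": on the first suspended volume
# pair, mirroring stops for every pair in the group, so the secondary
# site stays at a single dependent-write-consistent instant.
class VolumePair:
    def __init__(self, name):
        self.name = name
        self.mirroring = True
        self.changed_tracks = set()   # bitmap of tracks to resync later

    def write(self, track):
        if self.mirroring:
            pass                      # normally: synchronous send to secondary
        else:
            self.changed_tracks.add(track)  # suspended: just record the change

def freeze_group(pairs):
    for p in pairs:
        p.mirroring = False           # no pair may keep updating its secondary

pairs = [VolumePair("db-log"), VolumePair("db-data")]
freeze_group(pairs)                   # triggered on the first failing pair
pairs[1].write(42)                    # update recorded, not mirrored
assert all(not p.mirroring for p in pairs)
assert pairs[1].changed_tracks == {42}
```

Without the group-wide freeze, a database's data volume could keep mirroring after its log volume suspended, leaving the secondary with writes that depend on log records it never received.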
Synchronous Replication – ESS Metro Mirror (PPRC): Performance
[Chart: local writes (no PPRC) vs. synchronous copy (PPRC at 75 km) vs. asynchronous copy with no consistency guarantees (PPRC-XD)]
Performance measurements from the test lab of the IBM Storage Systems Group, March 22, 2002
Asynchronous Replication – Extended Remote Copy (XRC)
Supports a single zSeries or a zSeries Parallel Sysplex
The controller puts information about the write operations in a "side file"
  zSeries I/O operations include a timestamp provided by the host; the zSeries Parallel Sysplex has a cluster-wide timer facility
  The side file also holds a pointer to the modified data
An external data-mover process issues commands to the primary control unit to read the host's modifications
  The timestamps allow the data-mover process to ensure causal consistency among the writes
Note: if a write is received for data that are referenced from the side file before those data are transferred to the secondary site, the control unit cannot allow the data to be overwritten
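The mechanism above can be sketched as a timestamp-ordered log drained by an external mover. The data structures are illustrative, not the XRC implementation; the key property is that the secondary only ever reflects a causally consistent prefix of the writes:

```python
# Sketch of the XRC idea: each write's host-supplied timestamp goes
# into a side file; the data mover drains it in timestamp order, so
# applying everything up to some time T yields a consistent state.
import heapq

side_file = []                 # min-heap of (timestamp, volume, sector, data)

def log_write(ts, volume, sector, data):
    heapq.heappush(side_file, (ts, volume, sector, data))

def data_mover_drain(secondary, up_to_ts):
    """Apply all logged writes with timestamp <= up_to_ts, in order."""
    while side_file and side_file[0][0] <= up_to_ts:
        ts, volume, sector, data = heapq.heappop(side_file)
        secondary[(volume, sector)] = data

secondary = {}
log_write(2, "v1", 0, b"new")
log_write(1, "v1", 0, b"old")   # arrives out of order, sorts by timestamp
data_mover_drain(secondary, up_to_ts=1)
assert secondary[("v1", 0)] == b"old"
data_mover_drain(secondary, up_to_ts=2)
assert secondary[("v1", 0)] == b"new"
```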
Asynchronous Replication – Extended Remote Copy (XRC)
Asynchronous, continuous remote copy solution for zSeries data
Supported by the disk subsystem
Driven by software running on a zSeries host
Asynchronous Replication – ESS Global Mirror
Asynchronous solution for zSeries, iSeries and open systems
Consistency always maintained at the mirrored site
Mirror lags in currency by as little as 5 seconds
A tertiary copy is required to preserve consistency
Data loss limited to data in queue or in transit
Consistent Asynchronous Mirroring
1. Write
2. Write acknowledgment (channel end / device end)
3. Write to secondary
4. Write acknowledged by secondary
Asynchronous Replication – ESS Global Mirror
[Diagram: volumes A at the local site are mirrored over the SAN to volumes B at the remote site; FlashCopy creates the tertiary volumes C]
Copy consistency managed by Master Control Server
Uses tertiary copy to ensure consistency
Applies a point-in-time copy every 5 seconds
Asynchronous Replication – ESS Global Mirror: Consistency Group Formation
1. Coordinate ESSs; record new writes at the local site
2. Let consistent data drain to the remote site
3. Prepare FlashCopy
4. Commit FlashCopy
The cycle repeats every CG Interval Time*
*Consistency Group Interval may be set from 0 seconds (consistency continuously formed) up to 18 hours
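The cycle above can be sketched end to end. The classes and method names are hypothetical stand-ins for the controllers' roles; the point is that new writes are segregated at coordination time, drained to the B volumes, and only then captured into the C volumes:

```python
# Sketch of one Global Mirror consistency-group cycle: coordinate,
# drain, then FlashCopy B -> C so C always holds a consistent group.
class LocalESS:
    def __init__(self):
        self.current, self.pending = [], []

    def write(self, data):
        self.pending.append(data)       # host writes accumulate here

    def start_new_group(self):
        # Coordinate: freeze the group; new writes recorded separately.
        self.current, self.pending = self.pending, []

    def drain_to(self, remote):
        remote.b.extend(self.current)   # consistent data reaches volumes B
        self.current = []

class RemoteESS:
    def __init__(self):
        self.b, self.c = [], []

    def prepare_flashcopy(self):
        self._staged = list(self.b)

    def commit_flashcopy(self):
        self.c = self._staged           # volumes C: the new consistent set

def form_consistency_group(local_ess_list, remote):
    for ess in local_ess_list:
        ess.start_new_group()
    for ess in local_ess_list:
        ess.drain_to(remote)
    remote.prepare_flashcopy()
    remote.commit_flashcopy()

local, remote = LocalESS(), RemoteESS()
local.write("w1"); local.write("w2")
form_consistency_group([local], remote)
local.write("w3")                       # recorded for the next group
assert remote.c == ["w1", "w2"]
```

If a disaster strikes mid-drain, B may be inconsistent, but C still holds the last committed group, which is why the tertiary copy is required.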
Asynchronous Replication – ESS Global Mirror: Characteristics
Minimal impact on write processing time
  Consistency group formation will inhibit writes from completing until all controllers have acknowledged
Minimal data loss
  Can create a consistency group every 5 seconds
Consistency
  Maintains power-failure consistency across heterogeneous sets of volumes (zSeries, iSeries, open volumes)
  Volumes "C" are the consistent set
    Except for the case where the FlashCopy from "B" to "C" succeeds partially, in which case volumes "B" are consistent
Continuous Data Protection
What is it?
  A new paradigm in data protection
  A storage mechanism that keeps a time-ordered history of application writes
  Granularity of history varies between products, from every write to every few minutes
  State of storage can be quickly reverted to any previous point in time
Major issues
  Space efficiency
  Management of, and adaptation to, current applications
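At its finest granularity (every write), CDP is essentially a time-ordered journal that can be replayed to any instant. A minimal sketch of that idea (illustrative structures only; real products index and prune this history, which is exactly the space-efficiency issue above):

```python
# Minimal CDP sketch: journal every write with its timestamp, then
# reconstruct the storage state as of any earlier point in time.
journal = []                        # time-ordered list of (ts, block, data)

def cdp_write(ts, block, data):
    journal.append((ts, block, data))   # appends arrive in time order

def state_as_of(ts):
    """Reconstruct block -> data as it stood at time ts."""
    state = {}
    for t, block, data in journal:
        if t > ts:
            break                   # later writes are ignored
        state[block] = data
    return state

cdp_write(1, "b0", b"v1")
cdp_write(2, "b0", b"v2")           # e.g. a corrupting overwrite
cdp_write(3, "b1", b"x")
assert state_as_of(1) == {"b0": b"v1"}   # revert to before the overwrite
assert state_as_of(3) == {"b0": b"v2", "b1": b"x"}
```

This is what distinguishes CDP from mirroring for rolling disasters: a virus-corrupted write is mirrored faithfully, but with a write history the pre-corruption state remains recoverable.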
Summary
Continuous availability is moving from a Fortune 500 requirement to the masses
  Small and medium businesses are requiring continuous availability
  New regulations impose additional requirements
    Additional copies (synchronous and asynchronous)
    Longer distances
Advanced controllers offer sophisticated infrastructure to support continuous availability
  Point-in-time copy, and synchronous and asynchronous remote copy
Requirements are becoming more stringent
  Better support for rolling disasters
  Enhanced resiliency to failures