Microsoft Exchange Server 2007 High Availability and Disaster Recovery Deep Dive

High Availability and Disaster Recovery Deep Dive

Agenda

Solutions for Disaster Recovery
Mailbox Server High Availability
CCR and SCR: Better Together
Why CCR? Why not SCC?
Continuous Replication Demystified


Solutions for Disaster Recovery

Deleted Item Retention – default 14 days
Deleted Mailbox Retention – default 30 days
Mailbox Service and Data Recovery
  Server Recovery
    Setup /m:RecoverServer
    Setup /recoverCMS
  Database portability
  Dial tone portability
  Continuous replication
Backup and Restore
  Legacy streaming ESE backups
  Volume Shadow Copy Service (VSS) backups
  Recovery Storage Groups, alternate restores
Edge Transport Server Cloned Configuration

Solutions for Disaster Recovery

Augment built-in solutions with other processes
  Configuration Management
    Server build standardization
    Server build documentation
  Change management
  Release management
  Proactive monitoring
  Detailed recovery plans
  Regular integrity checks
  Regular practice drills

Server Recovery

Setup /m:RecoverServer
  All roles except Edge
  Fresh install and ImportEdgeConfig for Edge
  All custom settings on Client Access server must be recreated
  Restrictions: Can’t use this for…
    repairing a failed setup
    migrating between different operating systems
    recovering or un-clustering a clustered mailbox server

Setup /recoverCMS
  For CCR and SCC only
  Restrictions: Can’t use this for…
    changing from CCR to SCC or vice versa
    migrating between different operating systems
    clustering a standalone Mailbox server
    splitting or merging clustered Exchange environments
  Does not trigger Transport Dumpster
  Windows 2003 clustering has dependency on PDC Emulator

Data Recovery

Switch to a replicated copy (Activation)
  Passive copy (LCR/CCR)
  Target copy (SCR)
Restore from backup
  Same server
  Database portability on alternate server
    Database portability from Windows 2003 to Windows 2008 has initial performance impact
Dial tone and data merge using RSG

Mailbox Server High Availability


Built-in features for various levels of availability
  Local Continuous Replication (LCR) – data availability
  Single Copy Cluster (SCC) – service availability
  Cluster Continuous Replication (CCR) – data and service availability
  Standby Continuous Replication (SCR) – disaster recovery and site resilience


Local Continuous Replication (LCR)

Single Copy Cluster (SCC)

Cluster Continuous Replication (CCR)

Standby Continuous Replication

[Figure: SCR sources (CCR, SCC, Standalone) and SCR targets (Standalone Mailbox Server without LCR, Standby Cluster with Passive Mailbox Role)]

CCR and SCR: Better Together


CCR provides high availability for Mailbox data and services within the datacenter
SCR replicates data remotely to provide site resilience for the Mailbox data

[Figure: Datacenter A and Datacenter B]

CCR across 2 Sites

[Figure: Datacenter A and Datacenter B]

CCR local / SCR to remote Site

[Figure: Datacenter A and Datacenter B]

CCR/SCR vs SCC/Sync – 2 sites

[Figure: Datacenter A and Datacenter B, comparing CCR and SCC under physical corruption. Diagram labels: DB, Logs, Q, VSS Clone, CCR, SCC. Callouts:]
  Exchange Disaster Recovery or 3rd Party Failover
  Physical Corruption
  Undetected Physical Corruption
  1 month later, Undetected Physical Corruption
  On full Storage or Site Failure in Primary Site, corruption is detected, must Recover from Backup
  Log corruption detected immediately on replication at both targets
  Setup /recoverCMS, play logs forward
  On Site Failure in Primary Site, if corruption not detected and corrected from a test failover, must Recover from Backup

Why CCR? Why Not SCC?

Single Point of Failure
  CCR: None when stretched across sites or combined with SCR for site resiliency
  SCC: Data, Storage and Site single points of failure
       Potential for massive data loss on single failure:
       • Storage device failures can lose collocated backups
       • Hardware replication can propagate physical errors
       • Storage failure requires activation of remote copy if one exists
       • Requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR

Simplicity
  CCR: Simple setup
       • No special storage configuration
       Built-in Site Resilience
       • Same technology and redundancy model for intra- and inter-site protection
  SCC: Shared storage
       • Storage configuration before and after forming cluster
       Complex storage stack
       Complex deployment to get RTO/RPO of 1 CCR cluster

Backups
  CCR: Backups off passive copy eliminates/reduces backup window
  SCC: Backups must be off active

TCO
  CCR: Reduced TCO
       • Cheaper hardware
       • No special storage expertise required
       • In-the-box solution
       • Integrated management
       • Single operations team
       • Reduced backup cost
  SCC: Higher TCO
       • Additional products needed to achieve equivalent combined RTO/RPO
       • Separate management tools for HA operations may be required
       • Higher-end servers and storage required
       • Storage expertise needed

Large Mailboxes
  CCR: Great RTO/RPO, Simplicity, No Maintenance Window, Reduced TCO → improved support for larger mailboxes
  SCC: Higher TCO, long recovery times constrain mailbox size

Failure scenarios compared
  CCR: Stretched CCR or CCR + SCR
  SCC: SCC + SCR/3rd party replication + 2 VSS clones to approach combined RTO/RPO of 1 CCR cluster

RTO
  Server
    CCR: ~2 minutes
    SCC: ~2 minutes
  Data or LUN
    CCR: ~2 minutes
    SCC: 15 min – 1 hour
  Full Storage
    CCR: ~2 minutes
    SCC: ~15 min with synchronous replication; days with VSS clones only
  Site
    CCR: ~2 minutes for Stretched CCR; 30-60 minutes for CCR + SCR
    SCC: ~15 min with synchronous replication; days with VSS clones only

RPO
  Server
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 – uses same copy of data
  Physical Corrupt DB
    CCR: 0
    SCC: Hours to days if sync repl; point in time if VSS
  Physical Corrupt Logs
    CCR: 0 (must reseed passive)
    SCC: N/A if log not needed; same as DB if needed
  DB LUN dies
    CCR: 0
    SCC: 0 with synchronous replication; point-in-time with VSS clones
  LOG LUN dies
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 with synchronous replication; point-in-time with VSS clones
  Full Storage
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 with synchronous replication; hours to days with VSS clones only
  Site
    CCR: Same as Server for Stretched CCR; 1 Log**
    SCC: 0 with synchronous replication; hours to days with VSS clone

* Assumes following best practice guidance for Transport Dumpster
** Assumes replication’s keeping up


Physical Corruption
  SCC: no mechanism to detect database corruption on the copy replicated by 3rd Party solutions
  SCC: no mechanism to detect log corruption on the copy replicated by 3rd Party solutions
  With hardware-based replication, deeper stack can lead to corruption caused by:
    HBA driver/firmware
    Multi-path driver
    Server hardware
    FC Switch firmware
    Storage controller firmware/OS
    Target storage controller firmware/OS

Logical Corruption
  Corruptions caused by the application
  Logical corruption replicated by all replication solutions
  SCR with lag replay can mitigate if detected early

Continuous Replication Demystified

Basic Replication Pipeline

[Figure: Store → Source DB and Source Log Directory → Log Copier → Inspector Directory → Log Inspector → Replica Log Directory → Log Replayer → Target DB]

Continuous Replication Basics

When current log file is closed, it is copied to the replication target by the Replication service
Replication service
  at source: creates read-only shares for log directory
  at target: reads from the shares and pulls a copy of the log file
  contains a ReplicaInstance for each storage group
Configuration discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR)
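The pull model above can be pictured with a toy sketch in Python. This is not Exchange code: the directory names, the file-name pattern, and the "non-empty file" check are hypothetical stand-ins for the Log Copier, Log Inspector, and Log Replayer stages, chosen only to show the order in which a closed log moves through the pipeline.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical naming: closed generations look like E0000000001.log,
# while the still-open current log is E00.log and must not be copied.
CLOSED_LOG_PATTERN = "E00[0-9A-F]*.log"


def copy_closed_logs(source_dir, inspector_dir, seen):
    """Toy Log Copier: the target pulls each closed log it has not seen yet."""
    for log in sorted(source_dir.glob(CLOSED_LOG_PATTERN)):
        if log.name not in seen:
            shutil.copy2(log, inspector_dir / log.name)
            seen.add(log.name)


def inspect_logs(inspector_dir, replica_dir):
    """Toy Log Inspector: release non-empty logs to the replica log directory."""
    for log in sorted(inspector_dir.glob(CLOSED_LOG_PATTERN)):
        if log.stat().st_size > 0:  # stand-in for a real integrity check
            shutil.move(str(log), str(replica_dir / log.name))


def replay_logs(replica_dir, target_db, replayed):
    """Toy Log Replayer: apply released logs to the target copy in order."""
    for log in sorted(replica_dir.glob(CLOSED_LOG_PATTERN)):
        if log.name not in replayed:
            target_db.append(log.read_text())
            replayed.add(log.name)


def run_demo():
    """Ship three closed logs through the pipeline and return the target state."""
    with tempfile.TemporaryDirectory() as root:
        source, inspector, replica = (
            Path(root, name) for name in ("source", "inspector", "replica")
        )
        for directory in (source, inspector, replica):
            directory.mkdir()
        for generation in (1, 2, 3):
            (source / f"E00{generation:08X}.log").write_text(f"change-{generation}")
        (source / "E00.log").write_text("still open")  # current log, skipped

        target_db, seen, replayed = [], set(), set()
        copy_closed_logs(source, inspector, seen)
        inspect_logs(inspector, replica)
        replay_logs(replica, target_db, replayed)
        return target_db


if __name__ == "__main__":
    print(run_demo())
```

The open log never leaves the source directory in this sketch, which mirrors the point that only closed log files are copied.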


Continuous Replication Basics

Communication is done via logs, registry, cluster database and RPC
  Logs: replicate database changes and backup status
  Registry: used in LCR and SCR. Also in CCR for checkpointing the current log generation value for loss calculation
  Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
  RPCs: Target Replication service RPCs into Store for log truncation coordination

Lost Log Resilience (LLR)

Designed to minimize need to reseed after lossy failover
Database changes written to log file prior to database, and the database can be updated as soon as change is logged
LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created
Utilizes a new log stream marker called the waypoint
  Minimum Log Required to prevent database divergence
  No modifications after the waypoint have been written to the database

Log Stream Markers

Committed: Log generation 20
Checkpoint: Log generation 2
Waypoint: Log generation 10
What this means:
  Only logs 2-10 are needed
  Logs 11-20 can be discarded

Initiating FILE DUMP mode...
  Database: priv1.edb
  ...
  State: Dirty Shutdown
  Log Required: 2-10 (0x2-0xA)
  Log Committed: 0-20 (0x0-0x14)
  ...
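The marker arithmetic above can be checked with a small Python sketch. The function names are hypothetical, and the sketch only reproduces the slide's reading of the three markers (required = checkpoint through waypoint, discardable = everything after the waypoint up to the committed generation); it does not parse real database headers.

```python
def classify_log_generations(checkpoint, waypoint, committed):
    """Return (required, discardable) generation ranges for a log stream."""
    if not checkpoint <= waypoint <= committed:
        raise ValueError("expected checkpoint <= waypoint <= committed")
    required = range(checkpoint, waypoint + 1)        # needed for recovery
    discardable = range(waypoint + 1, committed + 1)  # not yet written to the database
    return required, discardable


def format_range(generations):
    """Render a non-empty range in decimal and hex, e.g. 2-10 (0x2-0xA)."""
    first, last = generations[0], generations[-1]
    return f"{first}-{last} (0x{first:X}-0x{last:X})"


if __name__ == "__main__":
    required, discardable = classify_log_generations(
        checkpoint=2, waypoint=10, committed=20
    )
    print("Log Required:", format_range(required))
    print("Can be discarded:", format_range(discardable))
```

With the slide's values this prints 2-10 as required and 11-20 as discardable, matching the "Log Required" line in the dump.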

[Figure: LLR failover walkthrough on a two-node CCR cluster (NodeA, NodeB), showing log generations 1-21 with checkpoint and waypoint markers. Steps:]
  Healthy CCR
  NodeA fails and a failover to NodeB occurs
  Validate database can mount: logs lost < AutoDatabaseMountDial
  Logs are generated on NodeB (beyond gen21)
  NodeA recovers and performs a divergence check
  NodeA performs incremental reseed and copies logs
  Healthy CCR
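The two decisions in the failover walkthrough (can the surviving copy mount, and can the recovered node reseed incrementally) can be sketched in Python. Everything here is a simplified assumption: the function names are hypothetical, the mount dial is passed in as a plain log count with illustrative values, and the reseed rule encodes only the LLR idea that an incremental reseed is possible when no lost log has been written into the failed node's database.

```python
def can_auto_mount(last_generated, last_copied, mount_dial_logs):
    """Follow the slide's check: logs lost < AutoDatabaseMountDial."""
    logs_lost = last_generated - last_copied
    return logs_lost < mount_dial_logs


def reseed_kind(failed_node_waypoint, last_copied):
    """Decide how the recovered node rejoins after a lossy failover.

    If the failed node's waypoint is at or before the last log that reached
    the other node, none of the lost logs were applied to its database, so
    the diverged logs can be discarded and replaced (incremental reseed).
    """
    return "incremental" if failed_node_waypoint <= last_copied else "full"


if __name__ == "__main__":
    # Illustrative numbers: NodeA generated up to 21, NodeB received up to 17.
    print(can_auto_mount(last_generated=21, last_copied=17, mount_dial_logs=6))
    print(reseed_kind(failed_node_waypoint=16, last_copied=17))
```

The values 6 and 16 are illustrative only; the slide gives neither the dial setting nor the waypoint position for this scenario.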


When Do I Need A Full Reseed?

Rarely
Lost log past current Waypoint
  Admin accepted large amount of loss by running Restore-StorageGroupCopy
  Automatic mount while LLR was “not honored”
  Automatic lossy mount with “stale” loss window calculation
Log corruption prior to log replay
  ESE cannot skip over logs
Database files modified outside of Store or Replication service
  E.g., Offline defrag, eseutil /r

Transport Dumpster

Hub Transport servers retain messages that have been delivered to destination mailbox until size or time limit is reached
Transport Dumpster is per storage group per Hub Transport server for servers in same Active Directory site as the storage group
Transport Dumpster statistics:
  Get-StorageGroupCopyStatus -DumpsterStatistics
  Output:
    DumpsterServersNotAvailable:{HUB1}
    DumpsterStatistics:{HUB2(2/25/2009 10:20:37 PM; 2 ; 1032KB)}

[Figure: Transport Dumpster redelivery walkthrough. A CCR CMS (nodes MBX1 and MBX2, Active and Passive, storage groups SG1 and SG2) and two Hub Transport servers, HUB1 and HUB2, each holding per-storage-group dumpster contents (Msg1 to Msg4). The CMS tracks "SG Resubmit Required" per storage group (HUB1,HUB2 for SG1 and SG2, later HUB1 only) and sends "Redeliver SG1,SG2" requests that return Retry, timeout, or Success.]

Transport Dumpster

How much data loss can transport dumpster mitigate?
  18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
  [20 MB per user / 10 hours] x [100 users / SG] = 200 MB message traffic in one hour
  Putting the above two together gives
    60 min x 144 / 200 = 43.2 minutes worth of data
    in 43.2 minutes 144+ logs created per SG
Customize transport dumpster size/time limit
  Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
No time window guarantees
  If there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for destination storage group(s) on a given Hub Transport server
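The sizing arithmetic above can be reproduced with a short Python sketch. The function and parameter names are hypothetical, and the sketch assumes the "20 MB / 10 hour" figure means 20 MB of mail per user over a 10-hour day, which is the reading that yields the slide's 200 MB per hour.

```python
def dumpster_coverage_minutes(dumpster_mb_per_hub, hub_count,
                              mb_per_user_per_day, workday_hours, users_per_sg):
    """Minutes of message traffic the combined dumpsters can hold for one storage group."""
    capacity_mb = dumpster_mb_per_hub * hub_count  # e.g. 18 MB x 8 hubs = 144 MB
    traffic_mb_per_hour = mb_per_user_per_day / workday_hours * users_per_sg
    return 60 * capacity_mb / traffic_mb_per_hour


if __name__ == "__main__":
    default = dumpster_coverage_minutes(18, 8, 20, 10, 100)
    larger = dumpster_coverage_minutes(30, 8, 20, 10, 100)
    print(f"18 MB dumpster: {default:.1f} minutes")
    print(f"30 MB dumpster: {larger:.1f} minutes")
```

With the same traffic assumptions, raising the per-storage-group dumpster size to 30 MB extends the covered window from 43.2 to 72 minutes.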


Transport Dumpster

When CCR detects a lossy failover:
  Expands loss window by 12 hours back and 4 hours forward
  Finds all Hub Transport servers in the local Active Directory site
  Requests transport dumpster redelivery from all detected servers
    New servers not added to redelivery list
    Inaccessible servers: CCR retries same request every 30 seconds until configured MaxDumpsterTime
  If multiple lossy failovers take place, new loss window is added to previous one
Restore-StorageGroupCopy on LCR is one time request, no retries
Redelivery not triggered as part of Setup /recoverCMS
No other ways to redeliver messages from transport dumpster
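The loss window expansion described above can be sketched in Python. The function names are hypothetical, and treating "added to previous one" as taking the earliest start and latest end of the two windows is an assumption, not a statement of how the Replication service stores the window.

```python
from datetime import datetime, timedelta

EXPAND_BACK = timedelta(hours=12)    # from the slide: 12 hours back
EXPAND_FORWARD = timedelta(hours=4)  # from the slide: 4 hours forward


def expand_loss_window(loss_start, loss_end):
    """Widen the calculated loss window before requesting redelivery."""
    return loss_start - EXPAND_BACK, loss_end + EXPAND_FORWARD


def combine_windows(previous, new):
    """Assumed merge rule for repeated lossy failovers: cover both windows."""
    return min(previous[0], new[0]), max(previous[1], new[1])


if __name__ == "__main__":
    first = expand_loss_window(datetime(2009, 2, 25, 22, 0),
                               datetime(2009, 2, 25, 22, 20))
    second = expand_loss_window(datetime(2009, 2, 26, 9, 0),
                                datetime(2009, 2, 26, 9, 5))
    print(first)
    print(combine_windows(first, second))
```

The timestamps are made up for illustration; only the 12-hour and 4-hour offsets come from the slide.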

Redundant Networks

Use for log shipping and seeding in CCR
  Enable-ContinuousReplicationHostName
Seeding
  Update-StorageGroupCopy -DataHostNames:Host1,Host2
Get-ClusteredMailboxServerStatus
  OperationalReplicationHostNames:
  FailedReplicationHostNames:
  InUseReplicationHostNames:
Watch out for misconfigured hosts file

Circular Logging

One configuration setting with two consumers
  Store service: requires database to be dismounted and re-mounted to take effect
  Replication service: picks up new setting dynamically
In CCR, it’s no big deal to switch between on/off/on
In some settings, logs are deleted prematurely
  Example: turn off circular logging, then enable LCR without dismount/mount of database
    ESE is still doing log truncation with circular logging logic
    Logs will get truncated before making it to the LCR copy
To be safe follow this recipe: Suspend, dismount, change setting, mount, resume

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
