Microsoft Exchange Server 2007 High Availability and Disaster Recovery Deep Dive
High Availability and Disaster Recovery Deep Dive
Agenda
• Solutions for Disaster Recovery
• Mailbox Server High Availability
• CCR and SCR: Better Together
• Why CCR? Why not SCC?
• Continuous Replication Demystified
Solutions for Disaster Recovery
• Deleted Item Retention – default 14 days
• Deleted Mailbox Retention – default 30 days
• Mailbox Service and Data Recovery
  • Server Recovery
    • Setup /m:RecoverServer
    • Setup /recoverCMS
  • Database portability
  • Dial tone portability
  • Continuous replication
• Backup and Restore
  • Legacy streaming ESE backups
  • Volume Shadow Copy Service (VSS) backups
  • Recovery Storage Groups, alternate restores
• Edge Transport Server Cloned Configuration
Solutions for Disaster Recovery
Augment built-in solutions with other processes:
• Configuration management
  • Server build standardization
  • Server build documentation
• Change management
• Release management
• Proactive monitoring
• Detailed recovery plans
• Regular integrity checks
• Regular practice drills
Server Recovery
• Setup /m:RecoverServer
  • All roles except Edge
  • Fresh install and ImportEdgeConfig for Edge
  • All custom settings on Client Access server must be recreated
  • Restrictions: Can’t use this for
    • repairing a failed setup
    • migrating between different operating systems
    • recovering or un-clustering a clustered mailbox server
• Setup /recoverCMS
  • For CCR and SCC only
  • Restrictions: Can’t use this for
    • changing from CCR to SCC or vice versa
    • migrating between different operating systems
    • clustering a standalone Mailbox server
    • splitting or merging clustered Exchange environments
  • Does not trigger Transport Dumpster
  • Windows 2003 clustering has dependency on PDC Emulator
Data Recovery
• Switch to a replicated copy (activation)
  • Passive copy (LCR/CCR)
  • Target copy (SCR)
• Restore from backup
  • Same server
  • Database portability on alternate server
    • Database portability from Windows 2003 to Windows 2008 has initial performance impact
• Dial tone and data merge using RSG
Mailbox Server High Availability
Built-in features for various levels of availability:
• Local Continuous Replication (LCR) – data availability
• Single Copy Cluster (SCC) – service availability
• Cluster Continuous Replication (CCR) – data and service availability
• Standby Continuous Replication (SCR) – disaster recovery and site resilience
Mailbox Server High Availability: Local Continuous Replication (LCR)
Mailbox Server High Availability: Single Copy Cluster (SCC)
Mailbox Server High Availability: Cluster Continuous Replication (CCR)
Standby Continuous Replication
• SCR sources: CCR, SCC, standalone
• SCR targets: standalone Mailbox server (w/o LCR), standby cluster with passive Mailbox role
CCR and SCR: Better Together
• CCR provides high availability for Mailbox data and services within the datacenter
• SCR replicates data remotely to provide site resilience for the Mailbox data
(Diagram: Datacenter A, Datacenter B)
CCR across 2 Sites (diagram: Datacenter A, Datacenter B)
CCR local / SCR to remote Site (diagram: Datacenter A, Datacenter B)
CCR/SCR vs SCC/Sync – 2 sites (diagram: Datacenter A and Datacenter B, with one CCR panel and one SCC panel, each showing DB and log volumes; the SCC panel also shows VSS clones)
Diagram annotations:
• Exchange Disaster Recovery or 3rd Party Failover
• Physical corruption
• Log corruption detected immediately on replication at both targets
• Setup /recoverCMS, play logs forward
• Undetected physical corruption
• 1 month later, undetected physical corruption
• On full storage or site failure in primary site, corruption is detected, must recover from backup
• On site failure in primary site, if corruption not detected and corrected from a test failover, must recover from backup
Why CCR? Why Not SCC?
Why CCR? Why not SCC?

Single point of failure
  CCR: None when stretched across sites or combined with SCR for site resiliency
  SCC: Data, storage and site single points of failure. Potential for massive data loss on single failure:
    • Storage device failures can lose collocated backups
    • Hardware replication can propagate physical errors
    • Storage failure requires activation of remote copy if one exists
    • Requires two VSS clones plus a remote copy of data to achieve RPO equal to CCR

Simplicity
  CCR:
    • Simple setup: no special storage configuration
    • Built-in site resilience: same technology and redundancy model for intra- and inter-site protection
  SCC:
    • Shared storage
    • Storage configuration before and after forming cluster
    • Complex storage stack
    • Complex deployment to get RTO/RPO of 1 CCR cluster
Why CCR? Why not SCC?

Backups
  CCR: Backups off passive copy eliminates/reduces backup window
  SCC: Backups must be off active

TCO
  CCR: Reduced TCO
    • Cheaper hardware
    • No special storage expertise required
    • In-the-box solution
    • Integrated management
    • Single operations team
    • Reduced backup cost
  SCC: Higher TCO
    • Additional products needed to achieve equivalent combined RTO/RPO
    • Separate management tools for HA operations may be required
    • Higher-end servers and storage required
    • Storage expertise needed

Large mailboxes
  CCR: Great RTO/RPO, simplicity, no maintenance window, reduced TCO → improved support for larger mailboxes
  SCC: Higher TCO, long recovery times constrain mailbox size
Why CCR? Why not SCC?

Configurations compared
  CCR: Stretched CCR or CCR + SCR
  SCC: SCC + SCR/3rd party replication + 2 VSS clones to approach combined RTO/RPO of 1 CCR cluster

RTO by failure
  Server
    CCR: ~2 minutes
    SCC: ~2 minutes
  Data or LUN
    CCR: ~2 minutes
    SCC: 15 min – 1 hour
  Full storage
    CCR: ~2 minutes
    SCC: ~15 min with synchronous replication; days with VSS clones only
  Site
    CCR: ~2 minutes for Stretched CCR; 30-60 minutes for CCR + SCR
    SCC: ~15 min with synchronous replication; days with VSS clones only

RPO by failure
  Server
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 – uses same copy of data
  Physical corruption, DB
    CCR: 0
    SCC: Hours to days if sync repl; point in time if VSS
  Physical corruption, logs
    CCR: 0 (must reseed passive)
    SCC: N/A if log not needed; same as DB if needed
  DB LUN dies
    CCR: 0
    SCC: 0 with synchronous replication; point-in-time with VSS clones
  Log LUN dies
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 with synchronous replication; point-in-time with VSS clones
  Full storage
    CCR: 0 for mail*; appointment, contact, task, draft
    SCC: 0 with synchronous replication; hours to days with VSS clones only
  Site
    CCR: Same as Server for Stretched CCR; 1 log**
    SCC: 0 with synchronous replication; hours to days with VSS clone

* Assumes following best practice guidance for Transport Dumpster
** Assumes replication is keeping up
Why CCR? Why not SCC?

Physical corruption
• SCC: no mechanism to detect database corruption on the copy replicated by 3rd party solutions
• SCC: no mechanism to detect log corruption on the copy replicated by 3rd party solutions
• With hardware-based replication, deeper stack can lead to corruption caused by:
  • HBA driver/firmware
  • Multi-path driver
  • Server hardware
  • FC switch firmware
  • Storage controller firmware/OS
  • Target storage controller firmware/OS

Logical corruption
• Corruptions caused by the application
• Logical corruption replicated by all replication solutions
• SCR with lag replay can mitigate if detected early
Continuous Replication Demystified
Basic Replication Pipeline (diagram)
• Source side: Store, source DB, source log directory
• Replication components: Log Copier, Log Inspector, Log Replayer
• Target side: inspector directory, replica log directory, target DB
Continuous Replication Basics
• When current log file is closed, it is copied to the replication target by the Replication service
• Replication service
  • at source: creates read-only shares for log directory
  • at target: reads from the shares and pulls a copy of the log file
  • contains a ReplicaInstance for each storage group
• Configuration discovered from Active Directory (every 30 sec for LCR/CCR, every 3 min for SCR)
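The pull model above can be sketched as a toy pipeline. This is a minimal Python illustration, not the Replication service's implementation: the directories are plain dictionaries, the CRC comparison stands in for whatever validation the Log Inspector performs (an assumption; the slides do not describe it), and every function name is hypothetical.

```python
import zlib


def copy_closed_logs(source_dir, open_generation, inspector_dir):
    """Log Copier (toy): pull every closed log generation from the source."""
    for generation, record in source_dir.items():
        if generation != open_generation:
            inspector_dir[generation] = record


def inspect_logs(inspector_dir, replica_log_dir):
    """Log Inspector (toy): keep only logs whose payload matches its checksum."""
    for generation in sorted(inspector_dir):
        payload, checksum = inspector_dir.pop(generation)
        if zlib.crc32(payload) == checksum:
            replica_log_dir[generation] = payload


def replay_logs(replica_log_dir, last_replayed):
    """Log Replayer (toy): replay consecutive generations, stop at the first gap."""
    generation = last_replayed + 1
    while generation in replica_log_dir:
        generation += 1
    return generation - 1


# Generations 1-4 are closed; generation 5 is the log still being written.
source_dir = {}
for gen in range(1, 6):
    payload = f"log {gen}".encode()
    source_dir[gen] = (payload, zlib.crc32(payload))

# Simulate a damaged copy of generation 3: payload and checksum disagree.
payload, checksum = source_dir[3]
source_dir[3] = (payload, checksum ^ 1)

inspector_dir, replica_log_dir = {}, {}
copy_closed_logs(source_dir, open_generation=5, inspector_dir=inspector_dir)
inspect_logs(inspector_dir, replica_log_dir)
last_replayed = replay_logs(replica_log_dir, last_replayed=0)

print(sorted(replica_log_dir))  # generations that passed inspection
print(last_replayed)            # replay cannot skip over the missing generation
```

In this toy run the open log is never copied, the damaged generation never reaches the replica log directory, and replay stops just before the gap.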
Continuous Replication Basics
Communication is done via logs, registry, cluster database and RPC
• Logs: replicate database changes and backup status
• Registry: used in LCR and SCR. Also in CCR for checkpointing the current log generation value for loss calculation
• Cluster database: cluster res "Exchange Information Store Instance (CMSName)" /priv | findstr /i replay
• RPCs: Target Replication service RPCs into Store for log truncation coordination
Lost Log Resilience (LLR)
• Designed to minimize need to reseed after lossy failover
• Database changes written to log file prior to database, and the database can be updated as soon as change is logged
• LLR modifies this behavior by delaying updates to the database until 1 or more log generations are created
• Utilizes a new log stream marker called the waypoint
  • Minimum log required to prevent database divergence
  • No modifications after the waypoint have been written to the database

Log Stream Markers
• Committed: log generation 20
• Checkpoint: log generation 2
• Waypoint: log generation 10
• What this means:
  • Only logs 2-10 are needed
  • Logs 11-20 can be discarded

Initiating FILE DUMP mode...
  Database: priv1.edb
  ...
  State: Dirty Shutdown
  Log Required: 2-10 (0x2-0xA)
  Log Committed: 0-20 (0x0-0x14)
  ...
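The marker arithmetic on this slide can be checked with a short Python sketch. The class and function names are hypothetical, and the model only restates the slide's example (required logs run from checkpoint to waypoint, discardable logs from just past the waypoint to the committed generation); it is not how ESE stores these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogStreamMarkers:
    checkpoint: int  # oldest generation still needed for recovery
    waypoint: int    # no changes past this generation have reached the database
    committed: int   # highest generation written to the log stream


def required_generations(markers):
    """Generations the database needs in order to recover (checkpoint..waypoint)."""
    return range(markers.checkpoint, markers.waypoint + 1)


def discardable_generations(markers):
    """Generations that can be lost without diverging the database."""
    return range(markers.waypoint + 1, markers.committed + 1)


markers = LogStreamMarkers(checkpoint=2, waypoint=10, committed=20)
required = required_generations(markers)
discardable = discardable_generations(markers)

print(f"Log Required: {required[0]}-{required[-1]} "
      f"({required[0]:#x}-{required[-1]:#X})".replace("0X", "0x"))
print(f"Discardable: {discardable[0]}-{discardable[-1]}")
```

With the slide's values this reproduces the "Log Required: 2-10 (0x2-0xA)" line of the file dump and the 11-20 range that can be discarded.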
Lost Log Resilience: failover sequence (diagram: NodeA and NodeB log streams, generations 1-21, with checkpoint and waypoint markers)
• Healthy CCR
• NodeA fails and a failover to NodeB occurs
• Validate database can mount: logs lost < AutoDatabaseMountDial
• Logs are generated on NodeB (beyond gen 21)
• NodeA recovers and performs a divergence check
• NodeA performs incremental reseed and copies logs
• Healthy CCR
When Do I Need A Full Reseed?
• Rarely
• Lost log past current waypoint
  • Admin accepted large amount of loss by running Restore-StorageGroupCopy
  • Automatic mount while LLR was “not honored”
  • Automatic lossy mount with “stale” loss window calculation
• Log corruption prior to log replay
  • ESE cannot skip over logs
• Database files modified outside of Store or Replication service
  • E.g., offline defrag, eseutil /r
Transport Dumpster
• Hub Transport servers retain messages that have been delivered to destination mailbox until size or time limit is reached
• Transport Dumpster is per storage group per Hub Transport server for servers in same Active Directory site as the storage group
• Transport Dumpster statistics:
  Get-StorageGroupCopyStatus -DumpsterStatistics
  Output:
  DumpsterServersNotAvailable:{HUB1}
  DumpsterStatistics:{HUB2(2/25/2009 10:20:37 PM; 2 ; 1032KB)}
Transport Dumpster (diagram: a CCR CMS with nodes MBX1 and MBX2, one active and one passive, hosting SG1 and SG2; HUB1 and HUB2 each hold per-storage-group dumpster contents for Msg1 through Msg4. The "Resubmit Required" list for SG1 and SG2 starts empty, then shows HUB1,HUB2, then HUB1 only. "Redeliver SG1,SG2" requests are shown returning retry, timeout and success.)
Transport Dumpster
How much data loss can transport dumpster mitigate?
• 18 MB dumpster per storage group on 8 Hub Transport servers = 144 MB / storage group
• [20 MB per user / 10 hours] x [100 users / SG] = 200 MB message traffic in one hour
• Putting the above two together gives
  • 60 min x 144 / 200 = 43.2 minutes worth of data
  • in 43.2 minutes, 144+ logs created per SG
• Customize transport dumpster size/time limit
  Set-TransportConfig -MaxDumpsterSizePerStorageGroup 30MB -MaxDumpsterTime 07.00:00:00
• No time window guarantees
  • If there are no message size limits, a single large message (e.g., 15 MB) will purge all other messages for destination storage group(s) on a given Hub Transport server
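The sizing estimate above can be reproduced with a few lines of Python. The function names are hypothetical, the traffic profile is the slide's own example, and the 1 MB log file size used for the log count is an assumption stated here rather than on the slide.

```python
import math

LOG_FILE_MB = 1  # assumption: one transaction log file holds about 1 MB


def dumpster_capacity_mb(dumpster_mb_per_sg, hub_servers):
    """Total dumpster space available to one storage group across all hubs."""
    return dumpster_mb_per_sg * hub_servers


def hourly_traffic_mb(mb_per_user, hours, users_per_sg):
    """Message traffic delivered to one storage group per hour."""
    return mb_per_user / hours * users_per_sg


def coverage_minutes(capacity_mb, traffic_mb_per_hour):
    """How many minutes of delivered mail the dumpsters can hold."""
    return 60 * capacity_mb / traffic_mb_per_hour


capacity = dumpster_capacity_mb(dumpster_mb_per_sg=18, hub_servers=8)
traffic = hourly_traffic_mb(mb_per_user=20, hours=10, users_per_sg=100)
minutes = coverage_minutes(capacity, traffic)
logs = math.ceil(capacity / LOG_FILE_MB)

print(capacity, traffic, round(minutes, 1), logs)

# Same traffic with the dumpster raised to 30 MB per storage group.
larger = coverage_minutes(dumpster_capacity_mb(30, 8), traffic)
print(round(larger, 1))
```

Under the same traffic assumptions, raising the per-storage-group limit to 30 MB stretches the estimate from about 43 minutes to about 72 minutes. As the slide notes, neither figure is a guaranteed time window.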
Transport Dumpster
When CCR detects a lossy failover:
• Expands loss window by 12 hours back and 4 hours forward
• Finds all Hub Transport servers in the local Active Directory site
• Requests transport dumpster redelivery from all detected servers
  • New servers not added to redelivery list
  • Inaccessible servers: CCR retries same request every 30 seconds until configured MaxDumpsterTime
• If multiple lossy failovers take place, new loss window is added to previous one
• Restore-StorageGroupCopy on LCR is one time request, no retries
• Redelivery not triggered as part of Setup /recoverCMS
• No other ways to redeliver messages from transport dumpster
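The window expansion and retry interval described above can be sketched in Python. This only illustrates the arithmetic: the function names and the sample loss window are hypothetical, and the 7-day limit is simply the MaxDumpsterTime value used in the earlier Set-TransportConfig example.

```python
from datetime import datetime, timedelta

LOOK_BACK = timedelta(hours=12)        # loss window is expanded 12 hours back
LOOK_FORWARD = timedelta(hours=4)      # and 4 hours forward
RETRY_INTERVAL = timedelta(seconds=30)


def redelivery_window(loss_start, loss_end):
    """Expand the calculated loss window before asking hubs for redelivery."""
    return loss_start - LOOK_BACK, loss_end + LOOK_FORWARD


def max_redelivery_attempts(max_dumpster_time):
    """Upper bound on retries against a hub that stays inaccessible."""
    return max_dumpster_time // RETRY_INTERVAL


# Hypothetical loss window: 20 minutes of mail lost on the evening of 2/25/2009.
window = redelivery_window(datetime(2009, 2, 25, 22, 0),
                           datetime(2009, 2, 25, 22, 20))
attempts = max_redelivery_attempts(timedelta(days=7))

print(window[0], window[1])
print(attempts)
```

A 20-minute loss window therefore becomes a redelivery request covering more than 16 hours, and a hub that stays unreachable for the full 7 days would be retried about twenty thousand times.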
Redundant Networks
• Use for log shipping and seeding in CCR
  Enable-ContinuousReplicationHostName
• Seeding
  Update-StorageGroupCopy -DataHostNames:Host1,Host2
• Get-ClusteredMailboxServerStatus
  OperationalReplicationHostNames:
  FailedReplicationHostNames:
  InUseReplicationHostNames:
• Watch out for misconfigured hosts file
Circular Logging
• One configuration setting with two consumers
  • Store service: requires database to be dismounted and re-mounted to take effect
  • Replication service: picks up new setting dynamically
• In CCR, it’s no big deal to switch between on/off/on
• In some settings, logs are deleted prematurely
  • Example: turn off circular logging, then enable LCR without dismount/mount of database
    • ESE is still doing log truncation with circular logging logic
    • Logs will get truncated before making it to the LCR copy
• To be safe follow this recipe: Suspend, dismount, change setting, mount, resume
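The "one setting, two consumers" problem can be illustrated with a toy Python model. It is a deliberate simplification with hypothetical names: the store caches the setting at mount time, and circular logging is modeled as "keep only the newest log", which is cruder than real checkpoint-based truncation.

```python
KEEP_NEWEST = 1  # toy stand-in for circular logging truncation


class ToyStore:
    """Store that reads the circular logging setting only when it mounts."""

    def __init__(self, config):
        self.config = config
        self.cached_circular_logging = None
        self.log_directory = []
        self.next_generation = 1

    def mount(self):
        self.cached_circular_logging = self.config["circular_logging"]

    def dismount(self):
        self.cached_circular_logging = None

    def write_log(self):
        self.log_directory.append(self.next_generation)
        self.next_generation += 1
        if self.cached_circular_logging:
            # Still truncating with circular logging logic.
            del self.log_directory[:-KEEP_NEWEST]


def logs_left_for_copy(follow_recipe):
    """Turn circular logging off, write 5 logs, return what the copier can pull."""
    config = {"circular_logging": True}
    store = ToyStore(config)
    store.mount()
    if follow_recipe:
        store.dismount()
    config["circular_logging"] = False  # the replication side sees this at once
    if follow_recipe:
        store.mount()
    for _ in range(5):
        store.write_log()
    return list(store.log_directory)


without_remount = logs_left_for_copy(follow_recipe=False)
with_remount = logs_left_for_copy(follow_recipe=True)
print(without_remount)
print(with_remount)
```

Without the dismount and mount, the toy store keeps truncating and the copy would see only the newest log; following the recipe leaves all five generations available.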
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.