Exchange Server 2013 High Availability - Site Resilience
DESCRIPTION
More info on http://techdays.be.
TRANSCRIPT
![Page 1: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/1.jpg)
Exchange Server 2013
High Availability | Site Resilience
Scott Schnoll
Principal Technical Writer
Microsoft Corporation
![Page 2: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/2.jpg)
Agenda
Storage
High Availability
Site Resilience
![Page 3: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/3.jpg)
Storage
![Page 4: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/4.jpg)
Storage Challenges
Capacity is increasing, but IOPS are not
Database sizes must be manageable
Reseeds must be fast and reliable
Passive copy IOPS are inefficient
Lagged copies have asymmetric storage requirements
Low agility from low disk space recovery
![Page 5: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/5.jpg)
Storage Innovations
Multiple Databases Per Volume
Automatic Reseed
Automatic Recovery from Storage Failures
Lagged Copy Enhancements
![Page 6: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/6.jpg)
Multiple databases per volume
![Page 7: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/7.jpg)
Multiple Databases Per Volume
[Diagram: four DAG members, each with one volume holding copies of DB1, DB2, DB3, and DB4; copies are labeled Active, Passive, or Lagged]
4-member DAG
4 databases
4 copies of each database
4 databases per volume
Symmetrical design with balanced activation preference
Number of copies per database = number of databases per volume
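One way to picture the symmetrical design is as a rotation: each database starts its preference order on a different server, so every server holds exactly one copy at each activation preference. The following is a minimal sketch of that idea, not taken from the slides; the server names MBX1-MBX4 are hypothetical, and it assumes activation preference 1 marks the preferred active copy.

```python
def balanced_layout(servers, databases):
    """Return {database: {server: activation_preference}} using a rotation.

    Assumes len(servers) == len(databases), matching the rule that the number
    of copies per database equals the number of databases per volume.
    """
    n = len(servers)
    if n != len(databases):
        raise ValueError("this sketch needs as many databases as servers")
    layout = {}
    for d, db in enumerate(databases):
        # Rotate the server order per database so each preference value
        # lands on each server exactly once.
        layout[db] = {servers[(d + k) % n]: k + 1 for k in range(n)}
    return layout


servers = ["MBX1", "MBX2", "MBX3", "MBX4"]   # hypothetical server names
databases = ["DB1", "DB2", "DB3", "DB4"]
layout = balanced_layout(servers, databases)
for db, prefs in layout.items():
    print(db, prefs)
```

With this rotation, losing any one server moves each of its databases to a different surviving server rather than piling them onto one neighbor.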
![Page 8: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/8.jpg)
Multiple Databases Per Volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs
[Diagram: copies of DB1 labeled Active and Passive; seeding stream labeled 20 MB/s]
![Page 9: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/9.jpg)
Multiple Databases Per Volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs
4 database copies/disk:
Reseed 2TB Disk = ~9.7 hrs
Reseed 8TB Disk = ~39 hrs
[Diagram: four DAG members, each volume holding DB1-DB4 with copies labeled Active, Passive, or Lagged; a replacement volume is seeded by four parallel streams labeled 12 MB/s, 20 MB/s, 20 MB/s, and 12 MB/s]
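The benefit of several copies per disk comes from seeding them in parallel, so the per-stream rates add up. The sketch below shows only that arithmetic, using the stream rates from the slide and assuming the four streams run concurrently. The slide's hour figures also depend on assumptions it does not state (for example, how full the disk is), so this sketch is not expected to reproduce them exactly.

```python
def reseed_hours(data_mb, stream_rates_mb_s):
    """Estimate hours to copy data_mb when all streams run in parallel."""
    aggregate = sum(stream_rates_mb_s)
    return data_mb / aggregate / 3600


single_stream = [20]               # one database copy per disk
four_streams = [12, 20, 20, 12]    # four database copies per disk
speedup = sum(four_streams) / sum(single_stream)
print(f"aggregate throughput: {sum(four_streams)} MB/s, {speedup:.1f}x a single stream")

# Illustrative data size only (1,000,000 MB); not a figure from the slides.
print(f"{reseed_hours(1_000_000, single_stream):.1f} h single stream")
print(f"{reseed_hours(1_000_000, four_streams):.1f} h four streams")
```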
![Page 10: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/10.jpg)
Multiple Databases Per Volume
Requirements
Single logical disk/partition per physical disk
Best Practices
Same neighbors on all servers
Balance activation preferences
Database copies per volume = copies per database
![Page 11: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/11.jpg)
Autoreseed
![Page 12: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/12.jpg)
Seeding Challenges
Disk failure on active copy = database failover
Failed disk and database corruption issues need to be addressed quickly
Fast recovery to restore redundancy is needed
![Page 13: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/13.jpg)
Seeding Innovations
Automatic Reseed (Autoreseed) - use spares to automatically restore database redundancy after a disk failure
![Page 14: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/14.jpg)
Autoreseed
[Diagram: in-use storage volumes and spare volumes; one disk is marked failed (X)]
![Page 15: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/15.jpg)
Autoreseed Workflow
Periodically scan for failed and suspended copies
Check prerequisites: single copy, spare availability
Allocate and remap a spare
Start the seed
Verify that the new copy is healthy
Admin replaces failed disk
![Page 16: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/16.jpg)
Autoreseed Workflow
1. Detect a copy in a FailedAndSuspended (F&S) state for 15 minutes in a row
2. Try to resume the copy 3 times (with 5-minute sleeps in between)
3. Try assigning a spare volume 5 times (with 1-hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1-hour sleeps in between)
5. Once all retries are exhausted, the workflow stops
6. If 3 days have elapsed and the copy is still F&S, the workflow state is reset and starts from Step 1
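The retry limits in this workflow can be sketched as a small simulation. This is a toy model, not Exchange code. It assumes each stage is tried only if the previous one did not recover the copy, and that no wait is added between stages; the slide does not spell out either point. The stage names and the `run_workflow` function are hypothetical, and the 15-minute detection and 3-day reset are left out.

```python
STAGES = [
    # (stage name, max attempts, minutes slept between attempts)
    ("resume copy", 3, 5),
    ("assign spare volume", 5, 60),
    ("in-place seed", 5, 60),
]


def run_workflow(actions):
    """Try each stage in order until one reports success.

    actions maps a stage name to a zero-argument callable that returns True
    on success. Returns (recovered, attempt_log, simulated_minutes_slept).
    """
    log, slept = [], 0
    for name, max_attempts, sleep_minutes in STAGES:
        for attempt in range(1, max_attempts + 1):
            log.append((name, attempt))
            if actions[name]():
                return True, log, slept
            if attempt < max_attempts:
                slept += sleep_minutes   # simulated; nothing actually sleeps
    return False, log, slept


# Example: resume never works, the spare is assigned on the second try.
spare_results = iter([False, True])
actions = {
    "resume copy": lambda: False,
    "assign spare volume": lambda: next(spare_results),
    "in-place seed": lambda: False,
}
recovered, log, slept = run_workflow(actions)
print(recovered, log[-1], slept)
```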
![Page 17: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/17.jpg)
Autoreseed Workflow
Prerequisites:
Copy is not ReseedBlocked or ResumeBlocked
Logs and EDB files are on the same volume
Database and log folder structure matches the required naming convention
No active copies on the failed volume
All copies on the failed volume are FailedAndSuspended (F&S)
No more than 8 F&S copies on the server (more than that may indicate a controller failure)
For InPlaceReseed:
If EDB files exist, wait 2 days before in-place reseeding (based on the LastWriteTime of the EDB file)
Only up to 10 concurrent seeds are allowed
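The prerequisite list reads naturally as a single yes/no check. A minimal sketch, assuming the state of a copy has already been gathered into a record; the `CopyState` type and its field names are hypothetical and simply mirror the bullet points, and the InPlaceReseed conditions are not modeled.

```python
from dataclasses import dataclass, replace

MAX_FAILED_COPIES_PER_SERVER = 8   # above this, suspect a controller failure


@dataclass
class CopyState:
    # Hypothetical field names that mirror the prerequisites listed above.
    reseed_blocked: bool
    resume_blocked: bool
    logs_and_edb_same_volume: bool
    folder_layout_matches: bool
    active_copies_on_volume: int
    all_copies_on_volume_failed: bool
    failed_copies_on_server: int


def autoreseed_allowed(c: CopyState) -> bool:
    """Return True only when every listed prerequisite holds."""
    return (
        not c.reseed_blocked
        and not c.resume_blocked
        and c.logs_and_edb_same_volume
        and c.folder_layout_matches
        and c.active_copies_on_volume == 0
        and c.all_copies_on_volume_failed
        and c.failed_copies_on_server <= MAX_FAILED_COPIES_PER_SERVER
    )


ok = CopyState(False, False, True, True, 0, True, 2)
print(autoreseed_allowed(ok))
print(autoreseed_allowed(replace(ok, failed_copies_on_server=9)))
```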
![Page 18: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/18.jpg)
Autoreseed
Configure storage subsystem with spare disks
Create DAG, add servers with configured storage
Create directory and mount points
Configure 3 AutoDAG properties
Create mailbox databases and database copies
[Diagram: folder tree under the root (\) with ExchDbs (MDB1, MDB2) and ExchVols (Vol1, Vol2, Vol3); database folders contain MDB1.DB and MDB1.log]
AutoDagDatabasesRootFolderPath
AutoDagVolumesRootFolderPath
AutoDagDatabaseCopiesPerVolume = 1
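The folder structure in the diagram can be sketched by building the paths for one database. This is an illustration only: the `autodag_paths` helper is hypothetical, the `C:` drive letter is an assumption (the diagram shows only a root), and the folder names follow the labels in the diagram (ExchDbs, ExchVols, MDB1.DB, MDB1.log).

```python
from pathlib import PureWindowsPath


def autodag_paths(db_name, volume_name,
                  databases_root=r"C:\ExchDbs",    # assumed drive letter
                  volumes_root=r"C:\ExchVols"):    # assumed drive letter
    """Return the mount points and folders expected for one database."""
    db_mount = PureWindowsPath(databases_root) / db_name
    return {
        "volume_mount_point": PureWindowsPath(volumes_root) / volume_name,
        "database_mount_point": db_mount,
        "edb_folder": db_mount / f"{db_name}.DB",
        "log_folder": db_mount / f"{db_name}.log",
    }


paths = autodag_paths("MDB1", "Vol1")
for label, path in paths.items():
    print(f"{label}: {path}")
```

Keeping the database file folder and the log folder under the same database mount point is what satisfies the "logs and EDB files on the same volume" prerequisite.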
![Page 19: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/19.jpg)
Autoreseed
Requirements:
Single logical disk/partition per physical disk
Specific database and log folder structure must be used
Recommendations:
Same neighbors on all servers
Databases per volume should equal the number of copies per database
Balance activation preferences
Configuration instructions at http://aka.ms/autoreseed
![Page 20: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/20.jpg)
Autoreseed
Numerous fixes in CU1:
Autoreseed not detecting spare disks correctly or using detected disks
GetCopyStatus has a new field 'ExchangeVolumeMountPoint' which shows the mount point of the database volume under C:\ExchangeVolumes
Better tracking around mount path and ExchangeVolume path
Increased autoreseed copy limits (previously 4, now 8)
![Page 21: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/21.jpg)
Automatic Recovery From Storage Failures
![Page 22: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/22.jpg)
Storage Challenges
Storage controllers are essentially mini-PCs and they can crash/hang
Other operator-recoverable conditions can occur:
Loss of vital system elements
Hung or highly latent IO
![Page 23: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/23.jpg)
Storage Innovations
Exchange Server 2013 includes functionality to automatically recover from a variety of new storage-related failures
Innovations added in Exchange 2010 also carried forward
Even more behaviors added in CU1
![Page 24: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/24.jpg)
Automatic Recovery from Storage Failures
Exchange Server 2010
ESE Database Hung IO (240s)
Failure Item Channel Heartbeat (30s)
SystemDisk Heartbeat (120s)
Exchange Server 2013
System Bad State (302s)
Long I/O times (41s)
MSExchangeRepl.exe memory threshold (4GB)
System Bus Reset (Event 129)
Replication service endpoints not responding
![Page 25: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/25.jpg)
Lagged Copy Challenges
![Page 26: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/26.jpg)
Lagged Copy Challenges
Activation is difficult
Require manual care
Cannot be page patched
![Page 27: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/27.jpg)
Lagged Copy Innovations
Automatic play down of log files in critical situations
Integration with Safety Net
![Page 28: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/28.jpg)
Lagged Copy Innovations
Automatic log play down:
Low disk space (enable in registry)
Page patching (enabled by default)
Less than 3 other healthy copies (enable in AD; configure in registry)
Simpler activation with Safety Net:
No need for log surgery or hunting for the point of corruption
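The three play-down triggers differ in whether they are on by default. A minimal sketch of that decision, assuming the condition of the lagged copy is already known; the `should_play_down` function and its parameter names are hypothetical, and the real registry and AD settings are not modeled.

```python
def should_play_down(low_disk_space, page_patch_needed, other_healthy_copies,
                     low_disk_trigger_enabled=False,     # opt-in per the slide
                     copy_count_trigger_enabled=False,   # opt-in per the slide
                     min_other_healthy_copies=3):
    """Return the list of reasons that would trigger automatic log play down."""
    reasons = []
    if page_patch_needed:                       # enabled by default
        reasons.append("page patching")
    if low_disk_trigger_enabled and low_disk_space:
        reasons.append("low disk space")
    if copy_count_trigger_enabled and other_healthy_copies < min_other_healthy_copies:
        reasons.append("too few healthy copies")
    return reasons


print(should_play_down(True, False, 1))
print(should_play_down(True, True, 1, low_disk_trigger_enabled=True,
                       copy_count_trigger_enabled=True))
```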
![Page 29: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/29.jpg)
High Availability
![Page 30: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/30.jpg)
High Availability Challenges
High availability focuses on database health
Best copy selection insufficient for new architecture
Management challenges around maintenance and DAG network configuration
![Page 31: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/31.jpg)
High Availability Innovations
Managed Availability
Best Copy and Server Selection
DAG Network Autoconfig
![Page 32: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/32.jpg)
Managed Availability
![Page 33: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/33.jpg)
Managed Availability
Key tenet for Exchange 2013:
All access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active copy of the user’s mailbox
If a protocol is down on a Mailbox server, all active databases lose access via that protocol
Managed Availability was introduced to detect these kinds of failures and automatically correct them:
For most protocols, quick recovery is achieved via a restart action
If the restart action fails, a failover can be triggered
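The restart-then-failover sequence can be sketched as a small escalation function. This is a toy model under stated assumptions, not Managed Availability itself: the `recover_protocol` function, its callables, and the single restart attempt are all hypothetical.

```python
def recover_protocol(is_healthy, restart, failover, max_restarts=1):
    """Try a restart first; fall back to a failover if health does not return.

    Returns the name of the action that was taken last.
    """
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "restart"
    failover()
    return "failover"


# Example: the restart does not help, so the failover is triggered.
state = {"healthy": False, "events": []}


def fake_restart():
    state["events"].append("restart")


def fake_failover():
    state["events"].append("failover")
    state["healthy"] = True


action = recover_protocol(lambda: state["healthy"], fake_restart, fake_failover)
print(action, state["events"])
```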
![Page 34: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/34.jpg)
Managed Availability
An internal framework used by component teams
Sequencing mechanism to control when recovery actions are taken versus alerting and escalation
Includes a mechanism for taking servers in/out of service (maintenance mode)
Enhancement to best copy selection algorithm
![Page 35: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/35.jpg)
Managed Availability
MA failovers come in two forms:
Server: Protocol failure can trigger server failover
Database: Store-detected database failure can trigger database failover
MA includes Single Copy Alert:
Alert is per-server to reduce flow
Still triggered across all machines with copies
Monitoring triggered through a notification
Logs 4138 (red) and 4139 (green) events based on 4113 (red) and 4114 (green) events
![Page 36: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/36.jpg)
Best Copy and Server Selection
![Page 37: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/37.jpg)
Best Copy Selection Challenges
Process for finding the “best” copy of a specific database to activate
Exchange 2010 uses several criteria:
Copy queue length
Replay queue length
Database copy status – including activation blocked
Content index status
Not good enough for Exchange Server 2013, because protocol health is not considered
![Page 38: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/38.jpg)
Best Copy and Server Selection
Still an Active Manager algorithm performed at *over (switchover or failover) time based on extracted health of the system:
Replication health still determined by same criteria and phases
Criteria now include health of the entire protocol stack:
Considers a prioritized protocol health set in the selection
Four priorities – critical, high, medium, low (all health sets have a priority)
Failover responders trigger added checks to select a “protocol not worse” target
![Page 39: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/39.jpg)
Best Copy and Server Selection
All Healthy: Checks for a server hosting a copy that has all health sets in a healthy state
Up to Normal Healthy: Checks for a server hosting a copy that has all health sets Medium and above in a healthy state
All Better than Source: Checks for a server hosting a copy that has health sets in a state that is better than the current server hosting the affected copy
Same as Source: Checks for a server hosting a copy of the affected database that has health sets in a state that is the same as the current server hosting the affected copy
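The four checks can be sketched as an ordered list of tests applied to candidate servers. This is a simplified illustration, not the Active Manager algorithm: it ignores replication health entirely, the health set names are hypothetical, and "better than source" is assumed here to mean a lexicographically smaller tuple of unhealthy counts (critical first), which the slides do not define.

```python
PRIORITIES = ("critical", "high", "medium", "low")


def unhealthy_counts(health_sets):
    """health_sets: {name: (priority, is_healthy)} -> unhealthy counts by priority."""
    return tuple(
        sum(1 for prio, ok in health_sets.values() if prio == p and not ok)
        for p in PRIORITIES
    )


def pick_target(source, candidates):
    """Return (server, check_name) for the first check any candidate passes, else None."""
    src = unhealthy_counts(source)
    checks = [
        ("All Healthy", lambda c: sum(c) == 0),
        ("Up to Normal Healthy", lambda c: sum(c[:3]) == 0),   # medium and above
        ("All Better than Source", lambda c: c < src),        # assumed ordering
        ("Same as Source", lambda c: c == src),
    ]
    for name, passes in checks:
        for server, sets in candidates.items():
            if passes(unhealthy_counts(sets)):
                return server, name
    return None


source = {"OWA": ("critical", False), "Search": ("low", True)}   # hypothetical health sets
candidates = {
    "MBX2": {"OWA": ("critical", True), "Search": ("low", False)},
    "MBX3": {"OWA": ("critical", True), "Search": ("low", True)},
}
print(pick_target(source, candidates))
print(pick_target(source, {"MBX2": candidates["MBX2"]}))
```

Running the checks in order means a fully healthy server wins even if a partially healthy one is listed first.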
![Page 40: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/40.jpg)
DAG Network Autoconfig
![Page 41: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/41.jpg)
DAG Network Autoconfig
Automatic or manual DAG network config:
Default is Automatic
Requires specific configuration settings on MAPI and Replication network interfaces
Manual edits and EAC controls blocked when automatic networking is enabled:
Set DAG to manual network setup to edit or change DAG networks
DAG networks automatically collapsed in multi-subnet environment
![Page 42: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/42.jpg)
![Page 43: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/43.jpg)
Site Resilience
![Page 44: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/44.jpg)
Site Resilience Challenges
Operationally complex
Mailbox and Client Access recovery connected
Namespace is a SPOF
![Page 45: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/45.jpg)
Site Resilience Innovations
Operationally simplified
Mailbox and Client Access recovery independent
Namespace provides redundancy
![Page 46: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/46.jpg)
Site Resilience
Key Characteristics:
DNS resolves to multiple IP addresses
Almost all protocol access in Exchange 2013 is HTTP
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Admins can switchover by removing VIP from DNS
Namespace no longer a SPOF
No dealing with DNS latency
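The client-side behavior described here, skipping past an IP that produces a hard TCP failure, can be sketched as a loop over the resolved addresses. This is an illustration of the idea, not how any particular HTTP client implements it; the `connect_first_reachable` helper is hypothetical and the connect step is passed in so the example runs without a network.

```python
def connect_first_reachable(addresses, connect):
    """Try each resolved address in order and return (address, connection).

    connect is any callable that raises OSError on a hard TCP failure, for
    example: lambda ip: socket.create_connection((ip, 443), timeout=5)
    """
    last_error = None
    for ip in addresses:
        try:
            return ip, connect(ip)
        except OSError as exc:
            last_error = exc     # hard failure: move on to the next IP
    raise ConnectionError(f"no address reachable: {last_error}")


def fake_connect(ip):
    # Stand-in for a real TCP connect: the first VIP is down.
    if ip == "192.168.1.50":
        raise ConnectionRefusedError("VIP down")
    return f"connected to {ip}"


resolved = ["192.168.1.50", "10.0.1.50"]   # both records for the namespace
print(connect_first_reachable(resolved, fake_connect))
```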
![Page 47: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/47.jpg)
Site Resilience
Previously, loss of a CAS, CAS array, VIP, load balancer, or some portion of the DAG required an admin to perform a datacenter switchover
In Exchange Server 2013, recovery happens automatically:
The admin focuses on fixing the issue, instead of restoring service
![Page 48: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/48.jpg)
Site Resilience
Previously, CAS and Mailbox server recovery were tied together in site recoveries
In Exchange Server 2013, recovery is independent, and may come automatically in the form of failover:
This is dependent on the customer’s business requirements and configuration
![Page 49: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/49.jpg)
Site Resilience
With the namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, de-coupling of CAS and Mailbox by AD site, and load balancing changes…
if available, three locations can simplify mailbox recovery in response to datacenter-level events
![Page 50: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/50.jpg)
Site Resilience
You must have at least three locations:
Two locations with Exchange; one with witness server
Exchange sites must be well-connected
Witness server site must be isolated from network failures affecting Exchange sites
![Page 51: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/51.jpg)
Site Resilience
[Diagram: primary datacenter Redmond and alternate datacenter Portland; CAS servers cas1, cas2, cas3, cas4; VIP: 192.168.1.50 and VIP: 10.0.1.50]
mail.contoso.com: 192.168.1.50, 10.0.1.50
![Page 52: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/52.jpg)
Site Resilience
[Diagram: primary datacenter Redmond and alternate datacenter Portland; CAS servers cas1, cas2, cas3, cas4; VIP: 192.168.1.50 is marked failed (X); VIP: 10.0.1.50 remains in service]
mail.contoso.com: 192.168.1.50, 10.0.1.50
Removing the failing IP from DNS puts you in control of the in-service time of the VIP
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP(s)
mail.contoso.com: 10.0.1.50
![Page 53: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/53.jpg)
Site Resilience
[Diagram: primary datacenter Redmond, alternate datacenter Portland, third datacenter Paris; dag1 members mbx1, mbx2, mbx3, mbx4; witness server; a failure is marked (X)]
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur
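Why the surviving pair needs the witness comes down to majority voting. A minimal sketch, assuming the usual majority rule of the underlying failover cluster for a DAG with an even number of members, where the witness contributes one extra vote; the `has_quorum` function is hypothetical.

```python
def has_quorum(members_up, total_members, witness_reachable):
    """Majority vote among DAG members plus one witness vote."""
    total_votes = total_members + 1                  # the witness adds one vote
    votes = members_up + (1 if witness_reachable else 0)
    return votes > total_votes // 2


# Four-member DAG: two members lost, two still running.
print(has_quorum(members_up=2, total_members=4, witness_reachable=True))
print(has_quorum(members_up=2, total_members=4, witness_reachable=False))
```

With the witness in a third location that survives the datacenter failure, the two remaining members hold 3 of 5 votes and can fail over automatically; without it they hold only 2 of 5.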
![Page 54: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/54.jpg)
Site Resilience
[Diagram: primary datacenter Redmond and alternate datacenter Portland; dag1 members mbx1, mbx2, mbx3, mbx4; witness server; three components are marked failed (X)]
![Page 55: Exchange Server 2013 High Availability - Site Resilience](https://reader033.vdocuments.us/reader033/viewer/2022061218/54b7c9ec4a7959466a8b45b3/html5/thumbnails/55.jpg)
Site Resilience
[Diagram: primary datacenter Redmond and alternate datacenter Portland; dag1 members mbx1, mbx2, mbx3, mbx4; witness and alternate witness; a failure is marked (X)]
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate the DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland