© 2014 VMware Inc. All rights reserved.
Implementing a Holistic BC/DR Strategy with VMware
James O’Mahony, Technical Support Engineer
Klaus Kremser, Manager Systems Engineering
What’s on the agenda?
• Defining the problem
• Definitions
• VMware technologies that provide BC and DR
– vSphere HA and App HA
– vSphere FT
– vSphere Data Protection / Advanced
– vCenter Availability
– vCenter Infrastructure Navigator (VIN)
– vCloud Hybrid Service Disaster Recovery
– vSphere Replication
– vCenter Site Recovery Manager (SRM)
• Find out more
Disaster Recovery vs. Business Continuity
Example: Tuesday, August 23, 2011 at 1:51 PM EDT - Magnitude 5.8 earthquake near Mineral, Virginia
Disaster recovery required?
No
Interruption to business?
YES!
Fault Tolerance vs. High Availability
• Fault tolerance
– Ability to recover from component loss
– Example: Hard drive failure
• High availability
Uptime percentage in one year    Downtime in one year
99                               3.65 days
99.9                             8.76 hours
99.99                            52 minutes
99.999 (“five nines”)            5 minutes
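The downtime figures above follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative only):

```python
# Minutes in a 365-day year, the basis assumed for the downtime figures.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * MINUTES_PER_YEAR


for pct in (99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime: {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```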
RTO, RPO, and MTD
• Recovery Time Objective (RTO)
– How long it should take to recover
• Recovery Point Objective (RPO)
– Amount of data loss that can be incurred
• Maximum Tolerable Downtime (MTD)
– Downtime that can occur before significant loss is incurred
– Examples: Financial, reputation
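To make the three objectives concrete, here is a minimal sketch that checks one incident against them. The class, the function and all timestamps are hypothetical and only illustrate how RTO, RPO and MTD relate to each other.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # how long recovery should take
    rpo: timedelta  # how much data loss (measured in time) is acceptable
    mtd: timedelta  # downtime before significant loss is incurred


def evaluate_incident(objectives, last_recovery_point, failure_time, service_restored):
    """Compare one incident's timeline with the stated objectives."""
    data_loss_window = failure_time - last_recovery_point
    downtime = service_restored - failure_time
    return {
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
        "within_mtd": downtime <= objectives.mtd,
    }


# Hypothetical numbers: 4 h RTO, 1 h RPO, 24 h MTD.
objectives = RecoveryObjectives(
    rto=timedelta(hours=4), rpo=timedelta(hours=1), mtd=timedelta(hours=24)
)
result = evaluate_incident(
    objectives,
    last_recovery_point=datetime(2014, 1, 1, 12, 0),
    failure_time=datetime(2014, 1, 1, 12, 45),
    service_restored=datetime(2014, 1, 1, 18, 0),
)
print(result)
```

In this made-up incident the data loss stays inside the RPO and the outage stays inside the MTD, but the recovery takes longer than the RTO.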
vSphere App HA
New
[Diagram: a vSphere HA cluster of vSphere hosts with the vFabric Hyperic virtual appliance, the vSphere App HA virtual appliance, Hyperic agents running in VMs, and vCenter Server]
vSphere HA – Keep In Mind…
• RTO – measured in minutes (not seconds)
• Requires shared storage
• Best practices
– Use admission control – percentage policy
– Test post-failure performance with host maintenance mode
– Isolation response – leave powered on
– Network and storage redundancy
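For the percentage-based admission control policy, a common rule of thumb is to reserve the share of cluster capacity that the tolerated number of host failures represents. The sketch below assumes equally sized hosts; the function name is hypothetical and this is not a VMware API.

```python
def failover_capacity_percent(total_hosts: int, host_failures_to_tolerate: int = 1) -> int:
    """Percentage of cluster resources to reserve, rounded up to a whole percent.

    Assumes every host contributes the same CPU and memory capacity.
    """
    if total_hosts < 2:
        raise ValueError("an HA cluster needs at least two hosts")
    if not 1 <= host_failures_to_tolerate < total_hosts:
        raise ValueError("tolerated failures must be at least 1 and fewer than the host count")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (100 * host_failures_to_tolerate + total_hosts - 1) // total_hosts


for hosts in (3, 4, 8):
    print(f"{hosts} hosts, 1 failure tolerated: reserve {failover_capacity_percent(hosts)}%")
```

With uneven host sizes the reservation would have to cover the largest host instead, which this sketch does not attempt.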
vSphere Fault Tolerance (FT)
• Zero recovery time, zero data loss
– Host hardware failure only
– Does not protect against OS and application failure
• Works fine with HA, App HA
• Why not FT?
– Resource requirements – does workload really need it?
– VM has multiple CPUs
– No VM snapshots – backups require agent
Data Protection (Backup and Restore)
• Agents? No Agents? – Both!
– No agents for majority of workloads – keep it simple
– Agents for certain apps
• vSphere Data Protection (VDP) Advanced
– Backup and recovery for VMware, from VMware
– Based on proven, mature EMC Avamar™
– Agent-less VM backup and restore
– Agents for granular tier-1 application protection
VDP Advanced – Keep In Mind…
• Engineered for SMB environments
• Uses VADP – VM snapshots, CBT
• Utilizes Windows VSS in VMware Tools
• Works fine with HA, not with FT
• RDM – virtual yes, physical no
• Is it DR?
– Maybe – depends on RTO, RPO
– Needs replication offsite, right?
VDP Advanced – Keep In Mind…
• Best Practices
– Prepopulate DNS, always use FQDN
– Manage VM snapshots
– Avoid deploying to slow storage
– Do not power-off, always shut down gracefully
– Do not schedule backups during maintenance window
vCenter Availability
• Run vCenter Server application in a VM
• Run vCenter Server database in a VM
• Run both in same VM?
• Protect with vSphere HA
– vCenter and DB VM restart priority set to High
– Enable guest OS and App monitoring
• App HA can protect SQL Server database
vCenter Availability
• Back up vCenter Server VM and database
– Image-level backup for vCenter Server VM
– App-level backup using agent for database backup
• Why not FT for vCenter Server?
– vCenter Server requires minimum of 2 vCPUs
– FT does not protect against application failure
vCloud Hybrid Service - Disaster Recovery
Simple and secure asynchronous replication and failover for vSphere
• Warm standby capacity on vCHS
• Self-service protection, failover and failback workflows per VM
• Recovery point objective (RPO) from 15 min (dependent on available bandwidth) to 24 hr.
• Initial data seeding by shipping a disk
• Includes:
– 2x 7-day DR tests per year
– 30 days of recovered VM run time
[Diagram: Site A (Primary) servers running VMware vSphere, VMware vCenter Server and vSphere Replication, replicating to vCHS, Site B (Recovery), in the US East or US West region]
Disaster Recovery – New Core Class of Service
• Minimum size: 10 GHz vCPU, 20 GB vRAM
• Starts at 1 TB
• 10 Mbps allocated
• 2 public IPs
• 2 tests*
• Term lengths: 1-, 12-, 24- and 36-month subscriptions
• vCloud Hybrid Service standard service tiers: Dedicated Cloud instance, Virtual Private Cloud instance
• New instance type as DR service tier: DR-VDC instance
vSphere Provides The Best Foundation For Disaster Recovery in the Cloud
Encapsulation: Simple Application Protection
• Entire system – including application, OS, and data – is stored as virtual machine files
• Entire system can be protected with data protection tools
Hardware-Independence: Flexible Infrastructure
• Eliminate the need for SAN or array-based replication
• Enable consistent recovery throughout data center lifecycle changes
Hybrid Aware: Seamless Integration with vCHS
• Reduced costs by leveraging the cloud for DR
• Scale your protection capacity to meet variable demand
Fully integrated with vSphere Web Client
Consistent management and operational best practices…
• Single interface and common management
• Designed to integrate with vCloud Hybrid Service
• Doesn’t require “console hopping”
Disaster Recovery System Requirements
Primary Data Center
• VMware vSphere 5.1 or above
– vSphere Essentials Plus
– vSphere Standard
– vSphere Enterprise
– vSphere Enterprise Plus
• VMware vCenter 5.1 or above
– Includes vSphere Web Client
• vSphere Replication Appliance 5.6
– 1:1 mapping with vCenter*
• Public internet connectivity
vCloud Hybrid Service
• DR subscription (DR Virtual Data Center instance)
Disaster Recovery and Site Recovery Manager
Disaster Recovery as a complementary DR solution to traditional SRM deployments
[Flowchart: the decisions “Seeking DR solution?” and “SRM in scope?” lead to vCloud Hybrid Service - DR, an internal/DIY hosted solution, an on-premise solution, co-existence, or pass; the branches carry the labels “(Default)” and “(Partner service contract)”]
RaaS Alternative
• True multi-tenancy and multi-site
• Storage-agnostic support
• Support for different vSphere versions
• Shared cloud infrastructure
• Simplified management
– UI embedded in vSphere (v5.1+)
– Protect VMs with a couple of clicks
– Failover and testing through API
– Installable in current environment
• Administration via vCHS console and API*
[Diagram: VMware vSphere customers connecting to vCHS US-East, vCHS US-West and vCHS EUR-UK]
vSphere Replication – Standalone
• Native tool built into the platform
• Per-VM hypervisor replication, managed in VC
• Selectable RPO from 15 min up to 24 hours
• Selectable destination datastore (disk-type agnostic)
Replication Across Sites
[Diagram: two sites, each with a vCenter Server, a VR Appliance, ESXi hosts (each with NFC and a VRA) and storage; VMDK1 is replicated from the storage at one site to the storage at the other]
Four Steps for Full Recovery
1. Right-click, select “Recover”
2. Select a target folder
3. Select a target resource
4. Click Finish
Your choices are validated as you go.
New Feature – Retain Historical Replicas
• Retention of multiple points in time (MPIT) allows reversion to earlier known good states
• After recovery, use the snapshot manager to revert to earlier points
[Diagram: vSphere host with VR Agent]
MPIT Presented as VM Snapshots after Failover
• Use the snapshot manager to revert to earlier points, an interface all administrators have been comfortable with for many years.
vSphere Replication – Interoperability
• Fault Tolerance – doesn’t work with VR
– FT conflicts at the vSCSI disk filter level
• VDP
– Mostly no problem!
• HA, vMotion, DRS
• Storage vMotion and Storage DRS
– Now supported!
vSphere Replication – Best Practices
• RPO
– Only what is necessary!
– Just because you can…
• RTO
– Don’t commit to one! Standalone VR has no testing and no automation; recovery is a manual process.
• VSS – Only if necessary!
• What about bandwidth?
– Very hard to determine. Do a local loopback first.
• RDMs?
– Don’t use them. If you must, use virtual compatibility mode.
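Bandwidth really is hard to determine, but a rough lower bound can be sketched: the data that changes within one RPO window has to cross the link within that window. The function below is an illustration only. Its name and the sample change rates are assumptions, and it ignores protocol overhead, bursts and any competing traffic.

```python
def required_mbps(changed_gb_per_window: float, rpo_minutes: float) -> float:
    """Average link speed (megabits per second) needed to ship the changed
    data of one RPO window within that window. Uses decimal units (1 GB = 8000 Mb)."""
    if rpo_minutes <= 0:
        raise ValueError("rpo_minutes must be positive")
    if changed_gb_per_window < 0:
        raise ValueError("changed_gb_per_window cannot be negative")
    megabits = changed_gb_per_window * 8 * 1000
    return megabits / (rpo_minutes * 60)


# Hypothetical change rates, for illustration only.
print(f"5 GB changed per 60 min RPO: {required_mbps(5, 60):.1f} Mbps")
print(f"9 GB changed per 15 min RPO: {required_mbps(9, 15):.1f} Mbps")
```

Measured change rates from the real workload should replace the sample numbers before any sizing decision is made.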
SRM
What is it?
• A Disaster Recovery engine
• A tool that uses externally replicated data (VR or array based) to speed the RTO of a BCP
• A product that allows for DR to be tested, automated, planned, repeatable and customizable
What is it not?
• A replication engine
• A disaster avoidance stretched cluster
Key Components of SRM
• Replication
• vCenter Server: one vCenter Server (Windows or VCVA) per site, same versions
• SRM Server: one SRM Server per site, same versions
• vSphere hosts: same versions recommended per site (pre-vSphere 5.x only if using array replication)
• vSphere Essentials Plus and higher editions supported
SRM Replication Options
• SRM can utilize BOTH array-based AND vSphere Replication
• SRM will “see” existing standalone vSphere Replication protected VMs
• SRM can install vSphere Replication from scratch if needed
[Diagram: multi-tier apps (Web, App, DB); one protected by vSphere Replication, one on LUN 1 and LUN 2 protected by storage-based replication]
Recovery Workflows
Failover Automation
• User-defined recovery plan
• Minimize errors
Non-disruptive Failover Testing
• Isolated test environment
• Increase confidence in DR process
Planned Migration
• Zero data loss
• Operational migration
Failback Automation
• Re-protect VMs, migrate back
SRM Interoperability
• Works with VR and ABR
• Backups, VADP or other, are fine
• HA is no problem at all
• vMotion and DRS are fine
• Storage vMotion and Storage DRS – Sort of…
– Replication dependent
• FT is “yellow”
– Array replicated only, and the FT status is not recovered
• Web vs vSphere Client
SRM – A Few Best Practices
• Not exhaustive
• Plenty of support material available on blogs, vmware.com, tech sites
• Big ones:
– Storage layout
– Test network configuration
– Test often!
– Size vCenter correctly
• Biggest one: do a Business Impact Analysis
– RPO, RTO, cost of downtime, interdependencies, criticality of applications, priorities, units of failover, overlooked externalities, executive buy-in, …
Protection Groups (PGs)
• More PGs = more granular testing/failover
– DR testing is easier – fewer resource requirements
– Fail-over only what is needed
– More configuration/complexity
• Fewer protection groups = less complex
– Fewer LUNs, PGs, recovery plans
– Less flexibility
• Find a good balance between flexibility and simplicity
[Diagram: a spectrum from fewer LUNs/PGs (less complexity, less flexibility) to more LUNs/PGs (more complexity, more flexibility); the right combination of complexity and flexibility varies by customer]
• Majority of outages are partial (not entire data center) – design accordingly
Test Network
– Use VLAN or isolated network for test environment
• Default “Auto” setting does not allow VM communication between hosts
– Different vSwitch can be specified in SRM for test versus run
• Specified in Recovery Plan
Typical failover
[Diagram: primary and secondary sites, each with VirtualCenter, Site Recovery Manager and storage, linked by array replication or vSphere Replication]
Array Based Replication with SRM
[Diagram: “Protected” and “Recovery” sites, each with a vSphere Client and SRM plug-in, a vCenter Server, an SRM Server with an SRA, ESX hosts, and VMFS datastores on storage running replication software; replication runs between the storage at the two sites]
SRA Commands
“Configure arrays” (done during the SRM array configuration)
• 1. discoverArrays
• 2. discoverDevices
Test failover (test the DR solution at a point in time using LUN snapshots)
• 3. testFailoverStart
• 4. testFailoverStop
Failover (Planned Migration or Disaster recovery)
• 5. failover
SRA Commands Continued…
Failback (SRM 5.x onwards)
• 6. reverseReplication
• 7. queryReplicationSetting
Synchronization calls
• 8. syncOnce
• 9. querySyncStatus
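The grouping of SRA commands by workflow can be captured in a small lookup table. This is only a restatement of the command names exactly as listed on these slides, not a call into SRM or any SRA; the dictionary and function names are hypothetical.

```python
# Command names are copied from the slides; this table is illustrative only.
SRA_COMMANDS_BY_WORKFLOW = {
    "configure arrays": ["discoverArrays", "discoverDevices"],
    "test failover": ["testFailoverStart", "testFailoverStop"],
    "failover": ["failover"],
    "failback": ["reverseReplication", "queryReplicationSetting"],
    "synchronization": ["syncOnce", "querySyncStatus"],
}


def commands_for(workflow: str) -> list:
    """Return the SRA commands the slides associate with a workflow name."""
    try:
        return SRA_COMMANDS_BY_WORKFLOW[workflow.lower()]
    except KeyError:
        raise ValueError(f"unknown workflow: {workflow!r}") from None


for name, commands in SRA_COMMANDS_BY_WORKFLOW.items():
    print(f"{name}: {', '.join(commands)}")
```

A table like this can help when reading SRM logs, since it shows which workflow a given SRA command belongs to.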
vSphere Replication with SRM
[Diagram: “Protected” and “Recovery” sites, each with a vSphere Client and SRM plug-in, a vCenter Server, an SRM Server, a VR Server, and ESX hosts with VMFS datastores on storage; HBR on the ESX hosts handles the replication between sites]
vSphere Replication failover workflow
Test Failover
[Diagram: the protected site (servers, virtual machines, VMDKs, VirtualCenter, Site Recovery Manager) replicates through vSphere Replication to the recovery site (servers, virtual machines, replicated VMDKs, vSphere Replication Appliance, VirtualCenter, Site Recovery Manager)]
Full Failover
[Diagram: the same protected and recovery sites, with a Synchronize step before the virtual machines run at the recovery site]
Re-protect
[Diagram: the same two sites with the protected and recovery roles exchanged and a vSphere Replication Appliance at each site]
Pros and Cons of the Replication Technologies
ABR = Array Based Replication
Pros
• Mature
• Can be synchronous as well as asynchronous
• Datastore consistency groups
• Supports SDRS; however, all LUNs involved in the SDRS cluster must be within the same consistency group
Cons
• Coarse granularity (per LUN)
• Requires compatible hardware at both sites
• Dedicated physical resources
• Managed outside of vCenter
• Licensed asset
VR = vSphere Replication
Pros
• Fine granularity (per VM)
• Any-to-any storage
• Integrated into vSphere
• Uses existing network infrastructure to replicate the data
• Supports SDRS
• MPIT feature added to allow for failover to an older point in time
• Is available as a standalone appliance outside of SRM
Cons
• Lack of multi-VM consistency groups (CGs)
• Does not support low RPOs (< 15 min)
• VSS is not compatible if another solution, such as a backup, also uses VSS quiescing
• Dependent on the network bandwidth between sites
vSphere Replication with SRM
[Diagram: “Protected” and “Recovery” sites, each with a vSphere Client and SRM plug-in, a vCenter Server, an SRM Server, a VR Server, and ESX hosts with VMFS datastores on storage; HBR on the ESX hosts handles the replication between sites]
SRM Advanced Settings
- Every environment is different so one setting does not fit all
- These settings are “per site” so without consistency, failover and failback will behave differently
SRM Logs
SRM Log Files location
C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs
Alternatively, generate a log bundle from within SRM (this can also gather the VR logs) by right-clicking the selected site and selecting “Export System Logs”
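When checking the log directory by hand, it can help to list the most recently written files first. A minimal sketch, assuming the default location given above and making no assumption about individual log file names (the helper name is hypothetical):

```python
from pathlib import Path

# Default SRM log location as stated above; adjust if SRM was installed elsewhere.
SRM_LOG_DIR = Path(r"C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs")


def newest_files(directory: Path, limit: int = 5) -> list:
    """Return up to `limit` files in `directory`, most recently modified first.

    Returns an empty list if the directory does not exist.
    """
    if not directory.is_dir():
        return []
    files = [p for p in directory.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


for path in newest_files(SRM_LOG_DIR):
    print(path.name)
```

On a machine without SRM installed the loop simply prints nothing.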
Takeaways
- SRM may need to be customized based on your environment
- If an environmental change is made, run test failovers to confirm the change has not caused an unforeseen issue
- The test failover workflow exists to test without a production outage, and its purpose is to highlight any issue that could cause a full failover to fail. Accordingly, it is important to make test failovers a part of scheduled maintenance
- If VMware needs to be engaged to troubleshoot an issue, generate the SRM logs at both sites, and include the logs for a subset of the source and DR ESXi servers as well as the vCenter logs at both sites (the failover report is also helpful)
- If vSphere Replication is also in use, provide the logs from the appliances as well (to match with the ESXi logs from the server hosting the production VM)
Find Out More
• Take an online hands-on lab
• Ask for a demo
• Install 60-day evaluation