availability and disaster recovery foe sap hana

5/24/2018 Availability and Disaster Recovery Foe SAP HANA


High Availability and Disaster Recovery for IBM Systems Solution for SAP HANA

1. Distinguish between HA and DR

It is all about continuous business processes. Depending on the business requirements, both planned and unplanned downtimes need to be covered. For unplanned downtimes, SAP talks about the datacenter readiness of SAP HANA. This covers hardware or software failures, network malfunctions, security threats, natural or man-made disasters, failures of compliance and operations, etc.

For this document we define High Availability (HA) as covering a single-node hardware failure, e.g. one node breaks in a running scale-out configuration for whatever reason (CPU, memory, storage, network, ...). IBM eX5 hardware provides a degree of redundancy which covers failures like that, but breaks can still happen.

Disaster Recovery (DR) solutions, also known as Disaster Tolerance (DT), cover the cases where multiple nodes fail at the same time or a whole data center goes down because of a fire, flood, or other catastrophe, and a secondary site needs to take over the SAP HANA system.

If the customer is running a side-by-side SAP HANA scenario (e.g. CO-PA, sales planning, or smart metering), the data will still be available in the source / backend SAP Business Suite system. Only the fast planning or analytical tasks will run significantly slower, just as they did before HANA existed.

The situation is more critical if SAP HANA is the primary database, e.g. under BW. Then the "productive" data sits in SAP HANA as the DB and, depending on the business service level agreements, protection against a failure is more crucial.

Either way, the requirements for recovery point objective (RPO) and recovery time objective (RTO) need to be discussed with the customer. These requirements can differ greatly from one customer scenario to the next.

RPO = the maximum tolerable period in which data might be lost from an IT service due to a major incident.

RTO = the duration of time and the service level within which a business process must be restored after a disaster or disruption.

Infolink: Availability and Disaster Recovery for IBM Systems Solution for SAP HANA



2. High Availability

With the IBM Workload Optimized Solution for SAP HANA, HA is covered within one system at one location via GPFS functionality. The prerequisite at this time is either two scale-out nodes plus a dedicated quorum node, or three or more scale-out nodes. Three nodes are always required, because one needs to run as a quorum node to determine who is the master in the cluster. This can be either an active system used as worker or standby, or a dedicated, less expensive quorum server. GPFS writes the data to the primary location and also replicates it, in a striped fashion, to the other nodes in the cluster.

If one of the worker nodes dies or loses its connection to the quorum node, the standby node recovers the data (data and log files) into its main memory and the cluster keeps working without downtime, with just some delay to the queries in flight. The response to a query is provided as soon as the data it requires has been recovered. The remainder of the data is recovered thereafter.
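
The reason three nodes are needed can be illustrated with a minimal sketch. It assumes a simple majority rule over the nodes that hold a quorum vote; the function name `has_quorum` and the node names are hypothetical and are not part of any GPFS or SAP HANA interface.

```python
def has_quorum(quorum_nodes, reachable_nodes):
    """Return True if a strict majority of the quorum nodes is reachable.

    Assumption: plain majority voting (more than half of the quorum nodes).
    """
    votes = len(set(quorum_nodes) & set(reachable_nodes))
    return votes > len(quorum_nodes) // 2


# Hypothetical three-node cluster: one worker, one standby, one quorum server.
cluster = ["worker01", "standby01", "quorum01"]

# The worker fails: two of three votes remain, so the cluster keeps its quorum.
print(has_quorum(cluster, ["standby01", "quorum01"]))

# With only two nodes, a single failure leaves one of two votes: no majority.
print(has_quorum(["worker01", "standby01"], ["standby01"]))
```

Under this assumption a two-node cluster cannot survive the loss of either node, which is why a third node (worker, standby, or dedicated quorum server) is required.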

With just one standby node, the cluster loses its HA capability until a repaired or new node is added to the remaining worker nodes as a new standby node. With more than one standby node, the GPFS configuration needs to be reconfigured (re-striped); please see SAP Note 1650046. Once this reconfiguration is complete and at least one standby node is left in the cluster, HA capability is maintained.

All kinds of scenarios are possible: n workers + 1 standby, n workers + n standbys, and any other combination. Certified configurations range from 2 to 16 nodes; larger configurations can be certified on request at the customer site. Note that the standby node does not require an SAP HANA license. The license fee is calculated based on the amount of memory in the worker nodes only.
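
As a small worked example of the licensing rule, the following sketch sums the memory of the worker nodes and ignores the standby node. The node names, the 512 GB size, and the helper `licensed_memory_gb` are assumptions chosen for illustration only; actual license terms are defined by SAP.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str       # "worker" or "standby"
    memory_gb: int


def licensed_memory_gb(nodes):
    """Sum the memory of worker nodes only; standby nodes are not counted."""
    return sum(node.memory_gb for node in nodes if node.role == "worker")


# Hypothetical 3 worker + 1 standby scale-out configuration.
nodes = [
    Node("hana01", "worker", 512),
    Node("hana02", "worker", 512),
    Node("hana03", "worker", 512),
    Node("hana04", "standby", 512),
]

print(licensed_memory_gb(nodes))  # 1536
```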

To clear up the confusion about "hot" and "cold" standby nodes: from an SAP HANA perspective the standby node is considered cold, but from an infrastructure perspective it is hot, because SLES and GPFS are running.

3. Disaster Recovery

The basic features for disaster recovery are available from SAP today. Details are described here.

Possible solutions                                   Recovery time (RTO)   Recovery point (RPO)             Availability
Backup / recovery over distance                      hours                 > 0 (depends on backup cycle)    today
Synchronous replication to two sites (IBM GPFS)      minutes               0                                today
System Replication ("warm standby", SAP HANA SPS5)   minutes               0                                today
Log shipping (1)                                     seconds               -                                planned: 2014
Enhanced DR (asynchronous replication)               -                     -                                planned: late 2013
https://w3-connections.ibm.com/files/app?lang=en#/file/6c358227-9c9b-40e5-9c25-9dd41a80eb0e



(1) Log shipping functionality needs to be provided by SAP.
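
To show how the table can be read against a customer's requirements, here is a minimal sketch that filters the three solutions listed as available today. The coarse RTO classes and the helper `candidates` are assumptions made for this illustration; the planned options are left out because the table gives no complete values for them.

```python
# Coarse ordering of the RTO classes used in the table (smaller is faster).
RTO_ORDER = {"seconds": 0, "minutes": 1, "hours": 2}

# (solution, RTO class, RPO is zero) for the options listed as available today.
SOLUTIONS = [
    ("Backup / recovery over distance", "hours", False),
    ("Synchronous replication to two sites (IBM GPFS)", "minutes", True),
    ('System Replication ("warm standby")', "minutes", True),
]


def candidates(max_rto, zero_rpo_required):
    """Return the solutions whose RTO class and RPO meet the requirement."""
    result = []
    for name, rto, rpo_is_zero in SOLUTIONS:
        if RTO_ORDER[rto] > RTO_ORDER[max_rto]:
            continue  # recovery takes longer than tolerated
        if zero_rpo_required and not rpo_is_zero:
            continue  # some data loss is possible with this option
        result.append(name)
    return result


# A customer who tolerates minutes of downtime but no data loss:
for name in candidates("minutes", zero_rpo_required=True):
    print(name)
```

With these inputs the sketch prints the two replication-based options and excludes backup/recovery over distance, whose RPO depends on the backup cycle.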

If multiple nodes fail at the same time, or if a second node fails before the GPFS reconfiguration following a single node failure has completed, the whole cluster will go down, because some primary and secondary (replicated) data is lost. This is therefore considered a disaster. If such a disaster happens and a whole site (here the primary) goes down, a setup in which a secondary datacenter takes over is called disaster tolerant.

As of today, the following procedures are feasible and recommended.

Backup/Recovery over distance

Back up your data on the primary site regularly (at least daily) to a defined staging area, which might be an external disk on an NFS share or a directly attached SAN subsystem (e.g. DS8K or existing storage). Transfer the backup to the remote site regularly (mirror functionality can be used here).

On the remote site an identical SAP HANA system (number of nodes, size, hostnames, SID, etc.) needs to exist. This system can run, for example, a Quality Assurance (QA), Development (DEV), or Test (TST) system, or another second-tier system. If the primary site goes down, this second-tier HANA system needs to be cleared from the hardware (hostname and SID potentially adapted; a fresh install of the SAP HANA software is recommended) and the backup can then be restored. Once the application systems have been configured to use the secondary site instead of the primary one, operation can be resumed. In case of a disaster, SAP HANA recovers from the latest backup.
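
Because recovery starts from the latest backup that has reached the remote site, the worst-case data loss of this approach can be estimated with simple arithmetic. The sketch below assumes that a backup reflects the state at the moment it starts and that it only becomes usable once it has fully arrived at the remote site; the function name and the example durations are hypothetical.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval, time_to_remote):
    """Upper bound on lost work for backup-based DR.

    backup_interval: time between the starts of two consecutive backups.
    time_to_remote:  time from the start of a backup until its copy is
                     complete on the remote site (backup run plus transfer).

    Worst case: the disaster hits just before the next backup finishes
    arriving, so everything since the start of the previous one is lost.
    """
    return backup_interval + time_to_remote


# Daily backup that needs three hours to be written and mirrored.
print(worst_case_rpo(timedelta(hours=24), timedelta(hours=3)))  # 1 day, 3:00:00
```

Shortening the backup cycle or speeding up the transfer reduces this bound, which is what "depends on backup cycle" in the table refers to.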

GPFS based synchronous replication

A mirror can also be configured on GPFS from the primary site to the secondary site. This has been validated and certified since 18 Dec. 2012.

The failover has to be initiated manually, the application systems have to be adapted, and the secondary cluster has to be restarted. With such a restart, the system is restored to the latest savepoint and the available logs are recovered. This scenario implies that the remote system is not being used for a different SAP HANA installation, but is available in standby with current data, ready to start with RPO = 0 in case the primary site goes down.

The maximum distance between the two sites for this solution is defined by the maximum latency between the internal switches of the appliance. SAP allows a maximum latency of 320 µs (microseconds). Under ideal conditions this translates to a distance of about 64 km or 40 miles. SAP reserves the right to validate the DR solution at the customer site. If there is competing traffic or the latency is too high, the customer might be asked to optimize the network accordingly.

We also created a solution where the remote site can run non-productive systems at the same time. For this, additional directly attached disk space in expansion units needs to be added. This is a unique feature of the GPFS based DR solution.
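
The relation between the latency limit and the site distance quoted above can be checked with a short calculation. It assumes that the limit is read as one-way propagation delay and that signals travel through optical fibre at roughly 200,000 km/s (about 5 µs per km); both are assumptions for this sketch, and real links add switching and routing delay on top.

```python
MAX_LATENCY_US = 320.0            # latency limit from the text, in microseconds
FIBRE_SPEED_KM_PER_S = 200_000.0  # assumed signal speed in optical fibre
KM_PER_MILE = 1.609344


def max_distance_km(latency_us, speed_km_per_s=FIBRE_SPEED_KM_PER_S):
    """Distance a signal covers within the given one-way latency."""
    return latency_us * 1e-6 * speed_km_per_s


km = max_distance_km(MAX_LATENCY_US)
print(f"{km:.0f} km, {km / KM_PER_MILE:.0f} miles")  # 64 km, 40 miles
```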

Disaster recovery solutions for SAP HANA that are based on storage replication need to be validated by SAP. Validated solutions are documented in SAP Note 1755396, "Released DT solutions for SAP HANA with disk replication".
https://service.sap.com/sap/support/notes/1755396



SAP System Replication

Since SPS5, SAP can deliver a so-called "warm standby" solution, now called "System Replication". With this, SAP HANA itself is able to write synchronously to a remote site. This requires an identical system on the remote site. That system will be idle (not available for, say, non-productive SAP HANA instances). It is "warm", meaning the database is already loaded in memory, which ensures a short switch-over time (RTO of less than 5 minutes) if the primary site goes down. This also applies to single-node installations on both sides. The documentation of the SPS5 solution is here.

With SPS 6, SAP delivers the same functionality in asynchronous form. The documentation can be found here; the only difference is that the parameter "mode=sync" is now set to "mode=async".
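
The practical difference between the two modes can be illustrated with a toy model. It is a simplified sketch, not SAP HANA's replication logic: it assumes that in "sync" mode a commit is acknowledged only after the secondary has received it, while in "async" mode it is acknowledged immediately and shipped after a fixed delay. The function `lost_commits` and all timings are hypothetical.

```python
def lost_commits(commit_times, failure_time, mode, replication_delay=0.0):
    """Return acknowledged commits that are missing on the secondary site.

    commit_times:      times at which the primary acknowledged a commit
    failure_time:      time at which the primary site goes down
    replication_delay: assumed shipping delay in async mode
    """
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    lost = []
    for t in commit_times:
        if t > failure_time:
            continue  # never acknowledged, the primary was already down
        if mode == "sync":
            continue  # acknowledged only once the secondary had the data
        if t + replication_delay > failure_time:
            lost.append(t)  # acknowledged, but still in flight at failure
    return lost


commits = [1.0, 2.0, 3.0, 3.9]
print(lost_commits(commits, failure_time=4.0, mode="sync"))
print(lost_commits(commits, failure_time=4.0, mode="async", replication_delay=0.5))
```

In this model the synchronous run loses nothing (RPO = 0), while the asynchronous run loses the commit acknowledged at 3.9, which had not yet reached the secondary when the primary failed.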

The most desired solution is certainly a hot standby (using log shipping) of the secondary site. SAP and IBM are working closely together on such a solution. At this point in time, it is not yet available.

For more info: https://w3-connections.ibm.com/wikis/home?lang=en#!/wiki/Waef4c0eb0f35_427f_a25e_670e392682b1/page/Business%20Continuity%20for%20SAP%20HANA
https://w3-connections.ibm.com/files/app#/file/8cd47fa0-561b-4e09-b0a1-010dc43c7360
http://www.saphana.com/docs/DOC-2010
