Availability and Disaster Recovery for SAP HANA

Upload: bebetto38

Post on 14-Oct-2015


  • 5/24/2018 Availability and Disaster Recovery for SAP HANA


    High Availability and Disaster Recovery for IBM Systems Solution for SAP HANA

    1. Distinguish between HA and DR

    It is all about continuous business processes. Depending on the business requirements, both planned
    and unplanned downtimes need to be covered. For unplanned downtimes, SAP talks about the datacenter
    readiness of SAP HANA. This covers hardware or software failures, network malfunctions, security
    threats, natural or man-made disasters, failures of compliance and operations, etc.

    For this document we define High Availability (HA) as covering a single-node hardware failure,
    e.g. one node breaking in a running scale-out configuration for whatever reason (CPU, memory,
    storage, network, ...). IBM eX5 hardware certainly provides a degree of redundancy which
    covers failures like that, but breaks can still happen.

    Disaster Recovery (DR) solutions, also known as Disaster Tolerance (DT), cover the cases where
    multiple nodes fail at the same time or a whole data center goes down due to a fire, flood, or
    other catastrophe, and a secondary site needs to take over the SAP HANA system.

    If the customer is running a side-by-side SAP HANA scenario (e.g. CO-PA, sales planning, or
    smart metering), the data will still be available in the source / backend SAP Business Suite
    system. While SAP HANA is unavailable, only the fast planning or analytical tasks will run
    significantly slower, just as they did before the existence of HANA.

    The situation is more critical if SAP HANA is the primary database, e.g. under BW. Then the
    "productive" data sits in SAP HANA as the DB and, according to the business service level
    agreements, protection against a failure is more crucial.

    Either way, the requirements for recovery point objective (RPO) and recovery time objective
    (RTO) need to be discussed with the customer. These requirements can differ greatly from one
    customer scenario to another.

    RPO = the maximum tolerable period in which data might be lost from an IT service due to a
    major incident.

    RTO = the duration of time and the service level within which a business process must be
    restored after a disaster or disruption.

    Infolink: Availability and Disaster Recovery for IBM Systems Solution for SAP HANA


    2. High Availability

    With the IBM Workload Optimized Solution for SAP HANA, this is covered in one system at one
    location via GPFS functionality. The prerequisite at this time is either two scale-out nodes
    plus a dedicated quorum node, or three or more scale-out nodes. Three nodes are always required,
    because one needs to run as a quorum node to define which node is the master in the cluster.
    This can either be an active system used as a worker or standby, or a dedicated, less expensive
    quorum server. GPFS writes the data to the primary location and also replicates it, in a striped
    fashion, to the other nodes in the cluster.

    If one of the worker nodes dies or loses its connection to the quorum node, the standby node
    recovers the data (data and log files) into its main memory and the cluster keeps working without
    downtime, with just some delay for queries in flight. The response to a query is provided as
    soon as the data it requires has been recovered. The remainder of the data is loaded thereafter
    as it is required.
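    The three-node requirement can be illustrated with a generic majority-quorum rule. The sketch
    below is not GPFS code; it assumes that a cluster stays operational only while a strict majority
    of its quorum nodes is reachable, and the function names are hypothetical.

```python
def quorum_size(quorum_nodes: int) -> int:
    """Smallest strict majority of the quorum nodes (assumed majority rule)."""
    return quorum_nodes // 2 + 1


def survives_single_failure(quorum_nodes: int) -> bool:
    """True if losing one quorum node still leaves a strict majority."""
    return quorum_nodes - 1 >= quorum_size(quorum_nodes)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, quorum_size(n), survives_single_failure(n))
```

    Under this assumption a two-node cluster cannot lose a node without losing its majority, while a
    three-node cluster can, which is consistent with the requirement of a third (possibly dedicated)
    quorum node.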

    With just one standby node, the cluster will then have lost its HA capability until a repaired or
    new node is added to the remaining worker nodes as a new standby node. With more than one
    standby node, the GPFS configuration needs to be reconfigured (re-striped); please see SAP
    Note 1650046. Once this reconfiguration is complete and at least one standby node is left in the
    cluster, HA capability is maintained.

    All kinds of scenarios are possible: n workers + 1 standby, n workers + n standbys, and any other
    combination. Certified configurations are in the range of 2-16 nodes; larger configurations can
    be certified on request at the customer site. Note that the standby node does not require an SAP
    HANA license. The license fee is calculated based on the amount of memory residing in the
    worker nodes only.
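    The standby and licensing rules above can be sketched as a small model. This is only an
    illustration: the Node type, the node names, and the 512 GB memory size are hypothetical, and the
    failover step simply promotes the first standby without modelling GPFS re-striping.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    name: str
    role: str        # "worker" or "standby"
    memory_gb: int   # hypothetical size, for illustration only


def licensed_memory_gb(nodes):
    """Only worker memory counts toward the license."""
    return sum(n.memory_gb for n in nodes if n.role == "worker")


def has_ha(nodes):
    """HA capability exists while at least one standby node is left."""
    return any(n.role == "standby" for n in nodes)


def fail_worker(nodes, failed_name):
    """Drop the failed node and promote the first standby, if there is one."""
    remaining = [n for n in nodes if n.name != failed_name]
    for i, n in enumerate(remaining):
        if n.role == "standby":
            remaining[i] = replace(n, role="worker")
            break
    return remaining


if __name__ == "__main__":
    cluster = [Node(f"node{i}", "worker", 512) for i in (1, 2, 3)]
    cluster.append(Node("node4", "standby", 512))
    print(licensed_memory_gb(cluster), has_ha(cluster))
    after = fail_worker(cluster, "node1")
    print(licensed_memory_gb(after), has_ha(after))
```

    In this 3 + 1 example the licensed memory stays the same after the failover, but the cluster has
    no standby left and therefore no HA capability until a new standby node is added.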

    To clear up the confusion about "hot" and "cold" standby nodes: from an SAP HANA perspective the
    standby node is considered cold, but from an infrastructure perspective it is hot, because SLES
    and GPFS are running.

    3. Disaster Recovery

    The basic features for disaster recovery are available from SAP today. Details are described here.

    Possible solutions                                    Recovery time (RTO)   Recovery point (RPO)             Availability
    Backup / recovery over distance                       hours                 > 0 (depends on backup cycle)    today
    Synchronous replication to two sites (IBM GPFS)       minutes               0                                today
    System Replication ("warm standby", SAP HANA SPS5)    minutes               0                                today
    Log shipping (1)                                      seconds                                                planned: 2014
    Enhanced DR (asynchronous replication)                                                                       planned: late 2013

    https://w3-connections.ibm.com/files/app?lang=en#/file/6c358227-9c9b-40e5-9c25-9dd41a80eb0e

    (1) Log shipping functionality needs to be provided by SAP.
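    As a rough aid for the RPO/RTO discussion with the customer, the options listed above as
    available today can be filtered against a requirement. This is a minimal sketch: the DrOption
    type and the candidates function are hypothetical, and the RTO values are only the coarse
    classes ("hours", "minutes") given above, not measured figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    rto_class: str   # coarse class from the overview: "minutes" or "hours"
    rpo_zero: bool   # True where the overview lists RPO = 0


# Only the options listed as available "today".
OPTIONS = [
    DrOption("Backup / recovery over distance", "hours", False),
    DrOption("Synchronous replication to two sites (IBM GPFS)", "minutes", True),
    DrOption('System Replication ("warm standby")', "minutes", True),
]

RTO_ORDER = {"minutes": 0, "hours": 1}


def candidates(max_rto_class: str, needs_rpo_zero: bool) -> list:
    """Names of the options that meet the RTO class and the RPO requirement."""
    limit = RTO_ORDER[max_rto_class]
    return [
        o.name
        for o in OPTIONS
        if RTO_ORDER[o.rto_class] <= limit and (o.rpo_zero or not needs_rpo_zero)
    ]


if __name__ == "__main__":
    print(candidates("minutes", needs_rpo_zero=True))
    print(candidates("hours", needs_rpo_zero=False))
```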

    If multiple nodes fail at the same time, or if a second node fails before the GPFS
    reconfiguration that follows a single-node failure has completed, the whole cluster will go down,
    as some primary and secondary (replicated) data is lost. This is therefore considered a disaster.
    If such a disaster happens and a whole site (here the primary) goes down, a scenario in which a
    secondary datacenter takes over is called disaster tolerant.

    As of today, the following procedures are feasible and recommended.

    Backup/Recovery over distance

    Back up your data on the primary site regularly (at least daily) to a defined staging area, which
    might be an external disk on an NFS share or a directly attached SAN subsystem (e.g. DS8K or
    existing storage). Transfer the backup to the remote site regularly (mirror functionality can be
    used here).

    On that site, an identical SAP HANA system (number of nodes, size, hostnames, SID, etc.) needs to
    exist. This system can run, for example, a Quality Assurance (QA), Development (DEV), or Test
    (TST) system or another second-tier system. If the primary site goes down, this second-tier HANA
    system needs to be cleared from the machines (hostname and SID potentially adapted; a fresh
    install of the SAP HANA software is recommended) and the backup can then be restored. Once the
    application systems have been configured to use the secondary site instead of the primary one,
    operation can be resumed. In case of a disaster, SAP HANA recovers from the latest backup.
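    Because recovery starts from the latest backup that has reached the remote site, the worst-case
    data loss of this approach depends on the backup cycle and on how long the transfer takes. A
    minimal sketch of that arithmetic follows; the 2-hour transfer time is an assumed example value,
    not a figure from this document.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Worst-case data loss: the disaster strikes just before the next backup arrives.

    At that moment the newest backup on the remote site was taken one full
    backup interval plus one transfer time ago.
    """
    return backup_interval + transfer_time


if __name__ == "__main__":
    # Daily backup (from the text) and an assumed 2-hour transfer to the remote site.
    print(worst_case_rpo(timedelta(hours=24), timedelta(hours=2)))  # 1 day, 2:00:00
```

    Shortening the backup cycle or the transfer time reduces this bound, but with this approach the
    RPO always stays above zero.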

    GPFS based synchronous replication

    A mirror can also be configured on GPFS from the primary site to the secondary site. This has
    been validated and certified since 18 Dec. 2012.

    The failover has to be initiated manually, the application systems have to be adapted, and the
    secondary cluster has to be restarted. With such a restart, the system is restored to the latest
    savepoint and the available logs are recovered. This scenario implies that the remote system is
    not being used for a different SAP HANA installation, but is available in standby with current
    data, ready to start with RPO = 0 in case the primary site goes down.

    The maximum distance between the two sites for this solution is defined by the maximum
    latency between the internal switches of the appliance. SAP allows a maximum latency of
    320 µs (microseconds). Under ideal conditions this translates to a distance of about 64 km or
    40 miles. SAP reserves the right to validate the DR solution at the customer site. If there is
    competing traffic or the latency is too high, the customer might be asked to optimize the network
    accordingly.

    We have also created a solution in which the remote site can run non-productive systems at the
    same time. For this, additional direct-attached disk space in expansion units needs to be added.
    This is a unique feature of the GPFS-based DR solution.
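    The distance quoted for the latency limit can be reproduced with a short calculation. The sketch
    below assumes that the 320 microseconds is a one-way propagation delay and that light travels
    through optical fibre at roughly 200,000 km/s (about 5 microseconds per km); both are
    assumptions, since the text only says "under ideal conditions".

```python
FIBRE_DELAY_US_PER_KM = 5.0   # assumption: ~200,000 km/s signal speed in optical fibre
KM_PER_MILE = 1.609344


def max_distance_km(max_latency_us: float) -> float:
    """Distance covered within the latency budget, ignoring switch and protocol overhead."""
    return max_latency_us / FIBRE_DELAY_US_PER_KM


if __name__ == "__main__":
    d = max_distance_km(320)
    print(f"{d:.0f} km, {d / KM_PER_MILE:.0f} miles")  # 64 km, 40 miles
```

    Any additional delay from switches, routing, or competing traffic eats into the same budget, so
    the practical distance is shorter than this ideal figure.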

    Disaster recovery solutions based on storage replication for SAP HANA need to be validated by
    SAP. Validated solutions are documented in SAP Note 1755396,
    "Released DT solutions for SAP HANA with disk replication":

    https://service.sap.com/sap/support/notes/1755396

    SAP System Replication

    Since SPS5, SAP has delivered a so-called "warm standby" solution, now called "System
    Replication". With this, SAP HANA itself is able to write synchronously to a remote site.
    This requires an identical system on the remote site. This system will be idle (not available
    for, say, non-productive SAP HANA instances). It is kept "warm", meaning the DB is already loaded
    in memory, which ensures a short switch-over time (RTO of less than 5 minutes) if the primary
    site goes down. This also applies to single-node installations on both sides. The documentation
    of the SPS5 solution is here.

    With SPS 6, SAP delivers the same functionality in asynchronous form. The documentation can be
    found here; only the parameter "mode=sync" is now set to "mode=async".

    The most desired solution is certainly a hot standby (using log shipping) at the secondary site.
    SAP and IBM are working closely together on such a solution. At this point in time, it is not
    yet available.

    For more info: https://w3-connections.ibm.com/wikis/home?lang=en#!/wiki/Waef4c0eb0f35_427f_a25e_670e392682b1/page/Business%20Continuity%20for%20SAP%20HANA

    https://w3-connections.ibm.com/files/app#/file/8cd47fa0-561b-4e09-b0a1-010dc43c7360
    http://www.saphana.com/docs/DOC-2010