
Enterprise Imaging

Document Status: Approved | Livelink ID: 51293216 | Version: 19

Printed copies are not controlled and shall be verified on the electronic document management system. Template 46865842, v.4 (template revision history at 39703795)

Agfa HealthCare Planning document

Business Continuity and Disaster Recovery (BCDR) Strategy White Paper

Enterprise Imaging 8.1.x

LiveLink ID 51293216

Author/Responsible (Owner)

Global PD&V Manager Dirk Somers (*)

Mandatory Reviewers

Development Representative Nikolas Boel,

Verification Representative Peter Libbrecht

Service Design Ives Heymans (*)

Global Client Engagement Lukas Gut (*), Andrew Frick(*)

Optional Reviewers

QARA Representative Jodi Coleman

Global R&D Director Matthieu Ferrant (*)

Documentation Representative Beatriz Briceno, Nicolaas Lammens

Program Managers Nadia De Paepe, Robert Menko(*)

Product Managers Peter McCarthy, Andrew Benfield (*), JP Slabbaert

Verification Representative Peter Libbrecht, Romain Berthon

Platform Development Dieter Van Bogaert, James Vanderleeuw

Service Design/Architecture Beena Alexander, Willy Druyts

Global Client Engagement André Emin (*)

Approvers

Global R&D Director Matthieu Ferrant (*)

Program Management Robert Menko

Global Client Engagement Director André Emin (*)

Verification Representative Peter Libbrecht

Global PD&V Manager Dirk Somers (*)

(*) Reduced RACI for 8.1.2/consistency check refresh



CONTENT

1 Introduction .......................................................................................................... 4

1.1 Purpose ............................................................................................................................. 4

1.2 Scope ................................................................................................................................ 4

1.3 References......................................................................................................................... 4

1.4 Definitions, Acronyms, and Abbreviations......................................................................... 5

2 Terminology .......................................................................................................... 7

3 High Availability .................................................................................................. 10

3.1 IT HA concepts & components......................................................................................... 10

3.2 Enterprise Imaging High Availability components........................................................... 14

4 Business Continuity & Disaster Recovery .............................................................. 17

4.1 IT BCDR concepts............................................................................................................ 17

4.2 Enterprise Imaging BCDR concepts and principles .......................................................... 17

5 BCDR Scenarios Overview.................................................................................... 19

5.1 Generic Definitions ......................................................................................................... 19

5.2 Supported EI Scenarios ................................................................................................... 19

5.3 Supported Technologies.................................................................................................. 21

6 Supported BCDR Scenarios - Details ..................................................................... 22

7 Scenario 1 – Basic Backup / Recovery ................................................................... 23

7.1 Solution Summary .......................................................................................................... 23

7.2 Components .................................................................................................................... 26

7.3 Fail Over scenario............................................................................................................ 27

7.4 Fail back scenario ............................................................................................................ 28

7.5 References / High-level Runbook .................................................................................... 28

8 Scenario 2 – Active-Passive................................................................................... 29

8.1 Solution Summary .......................................................................................................... 29

8.2 Components .................................................................................................................... 31

8.3 Fail over scenario (high level) ......................................................................................... 34

8.4 Fail back scenario ............................................................................................................ 34

9 Scenario 5a: Active/Standby ............................................................................... 36

9.1 Solution Summary .......................................................................................................... 36

9.2 Components .................................................................................................................... 38

9.3 Fail over scenario (high level) ......................................................................................... 42

9.4 High-level fail back scenario............................................................................................ 42

10 APPENDIX A – EI JBoss Cluster support for Layer 3 networks ................................ 44

11 APPENDIX B – EI Auxiliary System ....................................................................... 45


11.1 Definition:....................................................................................................................... 45

11.2 Features and dependencies: ............................................................................................ 45

11.3 EI AUX Relationship to the BCDR scenarios..................................................................... 46

11.4 Starting condition: .......................................................................................................... 46

11.5 EI AUX “Fail-over” scenario ............................................................................................. 47

11.6 EI AUX “Fail-back” scenario............................................................................................. 47

12 APPENDIX C - Oracle HA/DR ............................................................................... 48

12.1 Oracle Recovery .............................................................................................................. 48

12.2 Dependencies on SYNC Hardware storage replication..................................................... 48

12.3 Oracle RAC...................................................................................................................... 49

12.4 Oracle Data Guard........................................................................................................... 49

12.5 Oracle Data Guard on ODA ............................................................................................. 51

12.6 Oracle Fast Start failover (FSFO): ................................................................................... 52

13 APPENDIX D - VMware Site Recovery Manager (SRM).......................................... 54

13.1 BCDR Assumptions and Dependencies ............................................................................ 54

13.2 Dependency diagram and availability calculations .......................................................... 54

13.3 Frequency of incidents and downtime:............................................................................ 55

13.4 What-If scenarios ............................................................................................................ 56

13.5 High Availability ............................................................................................................. 56

13.6 Disaster recovery............................................................................................................. 56

14 Revision History................................................................................................... 58


1 Introduction

1.1 Purpose

The purpose of this White Paper is to document the Business Continuity and Disaster Recovery (BCDR) strategy and the associated supported scenarios for Enterprise Imaging 8.1.x.

The intended audience for this white paper is:
- AGFA presales and sales consultants: to understand the different supported scenarios
- AGFA engineers and project management: to understand the components for this solution
- AGFA Professional Services: as definition of the solutions for future implementation guidance

1.2 Scope

This white paper relates to all versions of Enterprise Imaging. When specific functions apply to specific versions of EI, it is mentioned as such. Covered versions:
- EI 8.0
- EI 8.1.x
- Future versions: to be defined, and will require a document refresh

IMPORTANT NOTE: For Agfa supported standard FITCO BCDR scenarios, please select one of the pre-defined strategies, scenarios and deployments from the FITCO Wiki pages below.

http://wikihealthcare.agfa.net/display/FSADS/Fitco+SA+Deployment+Scenarios

1.3 References

Table 1 - Design documents

Node ID / Link Related Platform Definition Name

FITCO Wiki Standard FITCO Solution scenarios

LLID 49278555 EI 8.1 Storage System groups and Storage behavior requirements document

LLID 51611198 EI 8.1 Storage & BCDR Vision document

LLID 54453272 Enterprise Imaging Design Cookbook (BCDR reference configurations)

LLID 54453272 Bid Support – EI 8.x Platform Design Template configurations

LLID 52339646 Bid Support – EI 8.x Solution Architecture Workbook template (SAW)

LLID 48718647 AGFA IITS vSphere and SRM Design Guidelines


Table 2 - Operational procedures

Node ID / Link Related Platform Definition Name

LLID 63112397 EI 8.1.x - BCDR Runbook Decision Tree TEMPLATE

LLID 68588615 FITCO BCDR scenario 1c.2A Runbook Overview

LLID 68589800 FITCO BCDR scenario 1c.2A Test Plan

8.1.2 KB Enterprise Imaging Disaster Recovery Guide

LLID 64595443 BCDR scenario 5a Runbook Overview

LLID 64595360 BCDR scenario 5a Test Plan

LLID 55955892 Oracle DB on Linux - Toolkit for the Agfa HealthCare EI database server

(includes FSFO, ODG, failover)

LLID 55953388 Oracle DB on WINDOWS - Toolkit for the Agfa HealthCare EI database server

(includes FSFO, ODG, failover)

ODA Wiki ODA Service Plan

http://wikihealthcare.agfa.net/display/IITSSERVICE/ODA+Service+Plan

1.4 Definitions, Acronyms, and Abbreviations

Agfa HealthCare terminology and abbreviations can be found in the Agfa HealthCare glossary in IMS: http://ims.agfa.net/he/en/intranet/ims/overview.jsp?ID=12462977

Abbreviation – Full form

(EI) ASR – Enterprise Imaging Application Storage Replication: a feature in Enterprise Imaging 8.1 and later allowing writing to multiple storage groups.
(EI) WD – The Enterprise Imaging Watch Dog process, monitoring Enterprise Imaging services availability and starting/stopping services when needed. A crucial component in the BCDR capabilities of Enterprise Imaging.
HA – High Availability: a system and/or service with specific design elements intended to keep the availability above a certain threshold (e.g. 99.9%).
DR – Disaster Recovery
BC – Business Continuity: the set of plans to ensure that functions can continue, or be recovered to an operational state; encompasses HA and DR.
EI DB – Enterprise Imaging Database Server
EI CS – Enterprise Imaging Core Server (historic name: CSP)
EI WS – Enterprise Imaging Web Server (historic name: CWP)
LUN – Logical Unit: logical disk on direct attached (DAS) storage infrastructure or internal server disks. Accessed as block IO (SCSI).
DAS – Direct Attached Storage: SCSI protocol, block IO device.
NAS – Network Attached Storage: TCP/IP protocol (NFS or CIFS) over the network, usually through a storage gateway.
RAC – Oracle Real Application Clusters: the cluster HA solution for Oracle. As a turnkey solution, supported on ODA.
ODA – Oracle Database Appliance
ODG – Oracle Data Guard
SRM – VMware Site Recovery Manager
Active – Application services actively running and contributing to workload processing.


Passive – No application services or OS actively running; hardware in "idle" mode.
Standby – Hardware and OS running, limited application services running, not actively contributing to workload processing.


2 Terminology

Throughout this document, a distinction is made in terminology between High Availability (HA) and Disaster Recovery (DR). The combination of both HA and DR constitutes the Enterprise Imaging Business Continuity (BC) and Disaster Recovery (DR) strategy. Business Continuity is often a broader set of processes and policies owned by the hospital, encompassing HA and DR.

Enterprise Imaging BCDR strategy Concept diagram

High Availability (HA)

Availability: The percentage of total time that a system and/or service is available for use.

High Availability (HA): a system and/or service with specific design elements intended to keep the availability above a certain threshold (e.g. 99.9%)
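To make such a threshold concrete, the short sketch below (illustrative only, not part of the EI product) converts an availability percentage into the corresponding downtime budget; the percentages chosen match the uptime baselines quoted later in this document.

```python
# Illustrative only: translate an availability target into a downtime budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget(availability_pct: float) -> dict:
    """Maximum allowed downtime for a given availability percentage."""
    unavailable = 1.0 - availability_pct / 100.0
    minutes_per_year = unavailable * MINUTES_PER_YEAR
    return {"hours_per_year": minutes_per_year / 60.0,
            "minutes_per_month": minutes_per_year / 12.0}

for target in (99.8, 99.9, 99.98, 99.99):
    b = downtime_budget(target)
    print(f"{target}%: ~{b['hours_per_year']:.1f} h/year, "
          f"~{b['minutes_per_month']:.0f} min/month")
```

For example, 99.9% availability allows roughly 8.8 hours of downtime per year (about 44 minutes per month), while 99.99% allows less than one hour per year.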

HA in this document: Covers for automated failover to another server within the same data center, in case of minor or larger (technical) failures, or for planned maintenance on individual components.

Covered by design for each Enterprise Imaging deployment, building on Application or IT components such as:
- Enterprise Imaging multiple storage groups and affinity groups, including multiple archive locations.
- Redundant HW components without single point of failure:
  o Redundant N+1 servers
  o Redundant network components and NICs
  o Redundant storage components and FC HBAs
- VMware High Availability Cluster
- Oracle DB on a VMware HA Cluster, or Oracle RAC on an Oracle Database Appliance (ODA)
- Enterprise Imaging Application Availability covered by design: redundant application servers (such as Core Servers and Web Servers), combined with Load Balancers, and monitored by the EI Watchdog to trigger Core Server failover to the 2nd Data Center.

(Concept diagram: (server) component failure – Ref. LLID 54453272)



Business Continuity (BC)

Continuity of Services: The set of plans, design and potentially also auxiliary systems to ensure that functions can continue, or be recovered to an operational state; encompasses HA and DR.

The combination of:
- Design (-> resilience)
- Processes (-> recovery)
- People (-> contingency)

… encompassing planning and preparation to ensure that an organization can continue to operate in case of serious incidents or disasters and is able to recover to an operational state within a reasonably short period.

This White Paper will mainly focus on the design and (technical) processes.



Disaster Recovery (DR)

A solution which offers a (usually) manual failover, intended as protection against "disasters" or severe technical outages, usually resulting in evacuation to a 2nd computer room or a remote site after a human decision-making process (ITIL: Emergency Change Advisory Board – ECAB) and manual intervention from the SysAdmin or Operator.

Covers for major technical (or non-technical) outages requiring a fail-over or evacuation of the complete solution to a second data center.

The most complete DR solution would allow for Partition Tolerance (i.e. two separate independent clusters replicating data).

An extra "option" that needs to be designed for, to increase availability and continuity of the EI services. DR options can be provided by the EI Application and/or at the IT infrastructure level.

The DR design covers for major breakdowns of the infrastructure and can cover for downtimes during planned data center maintenance. In certain scenarios it can also cover for downtimes during (minor) software upgrades (not necessarily for major software upgrades).

Components that contribute to this solution:
- Enterprise Imaging Storage Replication (ASR and SS)
- Enterprise Imaging dual archiving and cloud archiving
- Oracle Data Guard (with or without FSFO)
- Hardware storage replication
- VMware SRM

Users typically experience a somewhat longer interruption in service than with HA solutions.

DR solutions require a runbook for failover procedure, a runbook for failback procedure, and a drill procedure for controlled test evacuations.

Figure 1 - Full Data Center failure (Ref. LLID 54453272)


3 High Availability

3.1 IT HA concepts & components

An important principle in the Enterprise Imaging Business Continuity (BC) strategy described in this document is the distinction between High Availability and Disaster Recovery: both aspects need to be considered in an overall BCDR strategy.

High Availability within this document is defined as the automated failover within the same data center or server room.

This section describes the HA aspects. BC/DR is covered in chapter 4.

3.1.1 HA components overview

High Availability provides automated fail-over of the solution – mostly within the same Data Center - at times of smaller technical incidents or planned maintenance on individual components.

High-Availability is mainly covered by:

Redundant hardware components:
  o Designed without Single Point of Failure (SPoF)
  o N+1 redundant servers, which are always included as part of the EI solution

Storage: all data on the EI storage components is protected by RAID levels, as defined in the different EI Storage Tiers (e.g. RAID-10, RAID-5, RAID-6); for SAN, redundant storage controllers or SAN switches, with redundant data paths from server to storage; for NAS, redundant network infrastructure (redundant NICs and NAS heads).

Underlying VMware virtualization infrastructure: all EI components run in a virtualized environment and inherit the HA features of the underlying platform. VMware covers for automated fail over in case of technical platform issues (OS crash, server breakdown, HBA and NIC breakdown, breakdown of the network and storage data paths, etc.).

Database Redundancy: In scenarios where the EI Oracle Database runs on native (bare metal) hardware, high availability can be covered by Oracle RAC. Oracle RAC ensures seamless failover of transactions between the two cluster nodes. When delivered as a turnkey solution by AGFA, RAC is deployed on the Oracle Database Appliance (ODA).

Redundant software components: There are typically multiple EI Server Components (Core Servers, Web Servers, Proxy Servers, etc.) for redundancy. However, even when there is only one, the components run by default on the VMware virtualization platform, inheriting the HA capabilities of the virtualization platform and the redundant HW components.

The EI server components can be load balanced by hardware or software Load Balancer technology - which is always part of the EI core infrastructure.


The LB will take care of seamless fail-over to the redundant components, so that minimal interruption in services is experienced by end-users.

3.1.2 VMware HA

Reference: https://docs.vmware.com/en/VMware-vSphere/6.5/vsphere-esxi-vcenter-server-65-availability-guide.pdf

VMware HA is a component of the VMware vSphere™ platform (not for the ESXi free version!).

VMware High Availability (HA) provides high availability for applications running in virtual machines. VMware HA continuously monitors all virtualized servers in a resource pool and detects physical server and operating system failures.

Physical server failure: VMware HA detects physical server failures and initiates restart of the affected virtual machines on a different physical server in the resource pool, without human intervention. Affected virtual machines are automatically restarted on other production servers with spare capacity. To monitor physical servers, an agent on each server maintains a heartbeat with the other servers in the resource pool, such that a loss of heartbeat automatically initiates the restart of all affected virtual machines on other servers in the resource pool.

Figure 2 - VMware High Availability: server failure

Operating System failure: In the case of an operating system failure, VMware HA restarts the affected virtual machine on the same physical server. To monitor operating system failures, VMware HA monitors heartbeat information provided by the VMware Tools package installed in each virtual machine in the VMware HA cluster. Failures are detected when no heartbeat is received from a given virtual machine within a user-specified time interval.


Figure 3 - VMware HA - OS failure
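The heartbeat principle used for both failure types can be illustrated with a minimal, generic sketch. This is not VMware code: the timeout value and the restart action are assumptions for illustration only; a node is simply declared failed when no heartbeat arrives within the configured interval.

```python
# Minimal, generic heartbeat-monitor sketch. It illustrates the detection
# principle described above; it is not the VMware HA implementation, and the
# timeout and restart action are assumptions for illustration only.
import time
from typing import Callable, Dict, List

HEARTBEAT_TIMEOUT_S = 30.0      # assumed "user-specified time interval"
_last_heartbeat: Dict[str, float] = {}

def record_heartbeat(node: str) -> None:
    """Call whenever a heartbeat message is received from a node."""
    _last_heartbeat[node] = time.monotonic()

def failed_nodes() -> List[str]:
    """Nodes whose most recent heartbeat is older than the timeout."""
    now = time.monotonic()
    return [n for n, t in _last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT_S]

def monitor_loop(restart_elsewhere: Callable[[str], None]) -> None:
    """Periodically check heartbeats; restart the VMs of failed hosts elsewhere."""
    while True:
        for node in failed_nodes():
            restart_elsewhere(node)          # e.g. restart affected VMs on another host
            _last_heartbeat.pop(node, None)  # avoid repeated restarts for the same loss
        time.sleep(5)
```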

3.1.3 Oracle HA: Oracle RAC on Oracle Database Appliance (ODA)

The Enterprise Imaging database availability can be further improved by Oracle RAC on the Oracle Database Appliance (ODA).

For details: refer to the Enterprise Imaging on ODA Install/Configuration Guide – LLID 56088611.

ODA – Oracle Database Appliance Summary: ODA X7-2

Appliance Content:
  o Two Oracle Linux servers (nodes) and one storage shelf, in one rack-mount appliance, with two CPU sockets per node. CPU cores are enabled in firmware under a (licensed!) capacity-on-demand licensing model: the number of enabled cores also drives the cost of the database licensing.
  o Includes 10 Gbps external networking connectivity (for the Enterprise Imaging PRODUCTION or APPLICATION network), and a dedicated redundant InfiniBand interconnect for cluster communication (private cluster interconnect network for ODA – required!).
  o Tiered storage on a combination of SSD and SAS drives, managed by Oracle Automatic Storage Management (ASM), i.e. raw disks and no filesystems (this is different from the standard EI deployments on x86!). External NFS storage is supported for online backups and data staging, but not for EI DB files.
  o ODA is dedicated infrastructure for the Enterprise Imaging production database; the EI application components are deployed on standard x86 hardware, in a VMware framework.

Oracle database:
  o The EI database is deployed on the bare metal ODA hardware: Oracle Virtualization is not supported for Enterprise Imaging. The ODA requires an Oracle EE version, supporting Oracle RAC and Oracle RAC One Node.
  o Note that the database is separately licensed (i.e. not included in the ODA package!).


Oracle RAC: Oracle Real Application Clusters (RAC) is the foundation for data center high availability (HA).

o Reliability: If an instance fails, the remaining instances in the cluster remain open and active. Oracle Clusterware monitors all Oracle processes and immediately restarts any failed component.

o Error Detection – Oracle Clusterware monitors Oracle RAC databases as well as other Oracle processes (Oracle ASM, instances, Listeners, etc.) and provides fast detection of problems. It also automatically recovers from failures, often before users notice that a failure has occurred.

o Recoverability – If an instance fails in an Oracle RAC database, it is recognized by the other instance in the cluster and recovery will start automatically. Fast Application Notification (FAN) and Fast Connection Failover (FCF) make it easy to mask a component failure from the user.

o Continuous Operations – Oracle RAC provides continuous service for both planned and unplanned outages. If a server (or an instance) fails, the database remains open and applications continue to be able to access data, allowing for business critical workloads to finish, mostly without a delay in service delivery.

Oracle RAC One Node: Oracle Real Application Clusters One Node (Oracle RAC One Node) is a single instance of an Oracle RAC database that runs on one node in a cluster. Instead of stopping and starting instances, you can use the Oracle RAC One Node online database relocation feature to relocate an Oracle RAC One Node instance to another server. Administration of Oracle RAC One Node databases on Oracle Database Appliance is different from administering Oracle RAC or single-instance Oracle Databases. For Oracle RAC One Node databases, one node is the primary node, and the other node is a candidate node, which is available to accommodate services if the primary node fails or is shut down for maintenance. The nodes, Oracle Databases, and database services reside in the generic server pool.

Figure 4 - Oracle Database with Oracle RAC Architecture (conceptual picture – Oracle documentation)

References:
- Oracle Database Appliance X7-2 spec sheet (June 2019): https://www.oracle.com/technetwork/database/database-appliance/learnmore/odax7-2-ha-ds-3933489.pdf

- Oracle Real Application Clusters Administration and Deployment Guide: https://docs.oracle.com/database/121/RACAD/toc.htm
- Oracle ODA Administration and Reference Guide: https://docs.oracle.com/cd/E68623_01/doc.121/e68637/GUID-F39A374F-3047-4956-AD80-5633C31A491D.htm#CMTAR911

3.2 Enterprise Imaging High Availability components

High Availability (HA) within this document is defined as automated failover within the same Data center or Server Room.

Enterprise Imaging has High Availability features included in the design and the architecture; the most important ones are listed below and further explained in detail in the next sections.

Multiple Enterprise Imaging Core Server and Web Server components can be added to the solution to provide application failover and redundancy.

Load balancers further complete the application failover and offer transparent load balancing and failover between two or more Core Servers and Web Servers. A Software Load Balancer is included in the solution per default, and can be replaced by a Hardware Load Balancer for larger sites.
  o EI 8.1.x Core Server Hardware Load Balancer references – LLID 57887747
  o EI 8.1.x Web Server Hardware Load Balancer references – LLID 58216568
  o Detailed EI Health Check info – LLID 57990289
  o EI HW LB Guide (BigIP) – LLID 60028887

Core Server JBoss Clusters: All EI Core Servers in the solution belong to the same JBoss cluster

Web Server JBoss Clusters: Web servers can be grouped into one or more JBoss clusters; the Web Server JBoss clusters are usually grouped per location ( DC-1, DC-2, Remote Facilities,…)

Enterprise Imaging Application Storage Replication (EI ASR)

3.2.1 JBoss clusters

Enterprise Imaging is built on JBoss clusters. Within the Data Center, all EI Core Servers in the solution belong to one JBoss Cluster. The EI Web Servers can be grouped into several JBoss Clusters, depending on the location or functionality (Web Servers in different Data Centers, Web Servers in Remote Facilities, etc.).

JBoss Clustering allows running an application on several parallel servers (cluster nodes). The load is distributed across different servers; in the event of a server failure, the application is still accessible via other JBoss cluster nodes.

A cluster is a set of nodes. In a JBoss cluster, a node is a JBoss server instance. To build a cluster, several JBoss instances are grouped together (known as a "partition"). Different JBoss clusters can live on the same network: in order to differentiate them, each cluster must have an individual name. Each JBoss server instance (node) specifies which cluster (i.e., partition) it joins


(ClusterPartition MBean in the deploy/cluster-service.xml file). All nodes that have the same ClusterPartition MBean configuration join the same cluster.

The EI solution (JBoss nodes) is divided into one Core Server cluster (EI CS cluster) and one or more Web Server clusters (e.g. EI WS cluster, EI WS Proxy cluster), each with a different ClusterPartition MBean configuration; the Web Server cluster(s) can expose XERO/Xtend and Proxy services.


3.2.2 EI Components overview and HA/DR

For details on the different Enterprise Imaging Components, refer to the following Platform Design Documents:

Enterprise Imaging – core components:
- EI Database Server – LLID 50024659
- EI Core Server (CS) – LLID 50024659
- EI Web Server (WS) – LLID 50027980

Enterprise Imaging "peripheral" components:
- EI Business Intelligence – LLID 50046574
- EI Teaching Files – LLID 49977848
- EI Global Third Party Integrations (GTI) server – LLID 46906102

Note that the BCDR Strategy focuses on the EI Core Components.


4 Business Continuity & Disaster Recovery

Next to High Availability (HA), Disaster Recovery (DR) design is the second aspect to be considered as part of an Enterprise Imaging Business Continuity (BC) strategy.

The previous chapter described the High Availability aspects. This chapter describes the Disaster Recovery (DR) aspects of the Enterprise Imaging solution.

Some of the DR components in the solution are standard IT industry solutions (such as VMware SRM, Oracle Data Guard, etc.). In addition, Enterprise Imaging has built-in DR features that inherently protect against disaster events.

4.1 IT BCDR concepts

The following IT Infrastructure Components can be part of the Enterprise Imaging BCDR design:

- Oracle Data Guard: to replicate the EI database
- Oracle Fast Start Failover (FSFO), Data Guard Broker and the FSFO Observer (https://www.oracle.com/technetwork/articles/smiley-fsfo-084973.html)
- 3rd party storage vendor storage replication (Dell-EMC, HPE 3PAR, ...)
- VMware Site Recovery Manager

4.2 Enterprise Imaging BCDR concepts and principles

4.2.1 EI Application Storage Replication (ASR)

(Supported as of EI 8.1.x)

Application Storage Replication (ASR) is the EI application process that writes acquired images to the different defined storage locations (Storage System Groups or SSG's): the ASR service pulls the data from the SSG in the local storage location (Affinity) and writes it to another Storage System Group in the same or a different Affinity (Data Center). The EI Application Storage Rules define what data is copied to what destination under what condition (for example: between SSG's, or from one storage tier to another).

As this service writes the data over SMB (or NFS in some cases), file systems are cross-mounted across the Data Centers. Latencies for this solution scenario should therefore be as low as possible. Writing to the storage is managed by application queues.
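As a simplified illustration of this queued, asynchronous copy (the mount paths and queue handling below are hypothetical assumptions, not the actual EI implementation), an ASR-style worker could be sketched as follows:

```python
# Simplified, hypothetical sketch of an ASR-style replication worker (not the
# actual EI implementation): studies queued for replication are copied from
# the local SSG mount to a destination SSG mount that is cross-mounted over
# SMB/NFS. Paths and queue handling are illustrative assumptions.
import queue
import shutil
from pathlib import Path

SOURCE_SSG = Path("/mnt/ssg_dc1_online")   # hypothetical local SSG mount
DEST_SSG = Path("/mnt/ssg_dc2_online")     # hypothetical remote SSG mount

replication_queue: "queue.Queue[str]" = queue.Queue()   # relative study paths

def replicate_next_study() -> None:
    """Copy one queued study from the source SSG to the destination SSG."""
    study_rel_path = replication_queue.get()
    src = SOURCE_SSG / study_rel_path
    dst = DEST_SSG / study_rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst, dirs_exist_ok=True)   # idempotent re-copy on retry
    replication_queue.task_done()
```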

4.2.2 EI Watch Dog Process

(Supported as of EI 8.1.x)

As of Enterprise Imaging version 8.1, each EI Core Server (CS) will include the EI CS Watch Dog (WD) process.

Concepts:

The EI CS Watch Dog (WD) Process is part of the EI Core Server (CS) software. When enabled, the EI CS WD process is continuously checking the health of the active open database (primary DB).
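A heavily simplified sketch of this watchdog principle is shown below. The credentials, the health query and the service commands are placeholders for illustration only; they are not the actual EI CS WD implementation.

```python
# Heavily simplified watchdog sketch (not the actual EI CS WD implementation):
# poll whether the local database is the active open (primary) database and
# start or stop the local EI services accordingly. The credentials, query and
# service commands are placeholders for illustration only.
import subprocess
import time

def local_db_is_active_open() -> bool:
    """Placeholder health probe: is the local Oracle instance open read/write?"""
    try:
        result = subprocess.run(
            ["sqlplus", "-S", "monitor/secret@EI"],            # placeholder credentials
            input="SELECT open_mode FROM v$database;\nEXIT;\n",
            capture_output=True, text=True, timeout=10,
        )
        return "READ WRITE" in result.stdout
    except (subprocess.TimeoutExpired, OSError):
        return False

def watchdog_loop() -> None:
    while True:
        if local_db_is_active_open():
            subprocess.run(["systemctl", "start", "ei-core-server"])   # placeholder unit
        else:
            subprocess.run(["systemctl", "stop", "ei-core-server"])    # placeholder unit
        time.sleep(30)
```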


4.2.3 Affinities

Enterprise Imaging Storage System Groups (SSG’s) are associated to an Affinity, which is the physical location of the storage devices in the Storage System Group. Typical affinities are “Data Center 1”, “Data Center 2”.

4.2.4 EI ASR queuing

ASR is managed by the Enterprise Imaging Information Lifecycle Management (ILM) rules, more specifically by the Storage and Cache deletion rules. These allow copying and archiving studies to particular locations (SSG's), or deleting studies from a cache storage group, based on certain criteria.

The studies are asynchronously "replicated" to the other defined Storage System Group (SSG) by the application.
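A minimal sketch of how such rule criteria could be evaluated is given below; the criteria, thresholds and SSG names are illustrative assumptions, not the actual EI ILM rule engine.

```python
# Illustrative sketch only (not the actual EI ILM rule engine): each storage
# rule has criteria (here: minimum study age) and an action, either copying/
# archiving the study to a target SSG or deleting it from a cache SSG.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class Study:
    study_uid: str
    acquired: datetime

@dataclass
class StorageRule:
    min_age: timedelta
    action: str                       # "archive" or "delete"
    target_ssg: Optional[str] = None  # destination SSG for "archive" rules

    def applies_to(self, study: Study, now: datetime) -> bool:
        return now - study.acquired >= self.min_age

# Hypothetical rules: archive to a Nearline SSG after 1 day,
# delete from the local cache SSG after 90 days.
RULES: List[StorageRule] = [
    StorageRule(timedelta(days=1), "archive", "SSG-Nearline-DC2"),
    StorageRule(timedelta(days=90), "delete"),
]

def due_actions(study: Study, now: datetime) -> List[StorageRule]:
    """Return the rules whose criteria the study currently meets."""
    return [rule for rule in RULES if rule.applies_to(study, now)]
```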

4.2.5 ASR and ILM deletion

Caches in the SSG's are managed and cleaned up by the Cache deletion rules in the cache management ILM rules; each SSG is managed separately. As such, the replicated storage solutions in both data centers do not need to be identical or equal in size: both are managed separately by the EI storage and deletion rules.


5 BCDR Scenarios Overview

5.1 Generic Definitions

Following definitions are used when describing Generic BCDR scenarios:

Active – Passive (A-P): The primary data center is Active, whereas the other data center is a Passive data center and does not contribute to the workload, workflow or servicing of the end users. (Example: the passive site in a VMware setup, protected by VMware SRM.)

Note: in the case of (optional) Oracle Data Guard replication, some (application) services might be actively running in the secondary DC (replicated DB server, Core Servers, etc.). (Supported in EI 8.1.x.)

Active – Standby (A-S): The second data center is running in Standby mode, running the replicated Oracle Data Guard server (and some minimal EI application services), and is monitored by the EI Watchdog process. EI services on the standby servers are already running, and are actively started by the EI Watchdog process when the database in the local data center becomes the active open database. EI services are stopped for the DC which hosts the non-open/non-active database server. (Supported in EI 8.1.x.)

Stretched Cluster: Database and load balancer services are stretched across server rooms and are only active within one DC at a time; the active components are – sometimes transparently – spread over two data centers or computer rooms, usually when both are on the same (low latency) LAN. (Supported in EI 8.1.x.)

Active-Active (Independent Databases): Fully independent and fully synchronized Active/Active clusters. This provides the partition tolerance characteristic via fully independent backends. NOT SUPPORTED.

( refer to the overview table on next page).

5.2 Supported EI Scenarios

The following Enterprise Imaging DR Scenarios are described in this document:

Table 3 - Enterprise Imaging DR Scenario Overview

Scenario 1 – Basic Backup / Recovery (chapter 7): The most basic BCDR design, based on a single EI environment and backup/recovery using RMAN, scripting or off-the-shelf backup agents. This is the minimum delivered solution, required for every EI 8.x implementation.

Scenario 2 – Stretched Cluster "Active-Passive", mainly for customer-managed solutions (chapter 8): This scenario is based on a "single instance" EI solution. All data is replicated to a 2nd symmetrical data center by hardware storage replication (SYNC for DB and VMs, ASYNC for DICOM images on NAS). Requires Layer 2 network connectivity and very low latencies due to the SYNC hardware storage replication.

Scenario 3 – Stretched Cluster "Active-Passive" + ODG (not detailed in a separate chapter of this document): Same as scenario 2, extended with Oracle Data Guard replication (ODG) for better DB protection.

Scenario 5a – Active/Standby + ODG + EI-ASR + EI WD (chapter 9): Based on one single EI cluster environment, Oracle Data Guard replication, Application Storage Replication (ASR) and an Application Watchdog process (as of EI 8.1.x).


5.3 Supported Technologies

Table 4 - Enterprise Imaging - DR supporting components

Sce

na

rio

EI

Ve

rsio

n

Da

ta C

en

ter

Ba

cku

p/R

est

ore

VM

HA

# v

Sp

he

re (1

)

VM

SR

M

OracleSYNC

HW Storage Replication ASYNCHW Storage Replication

(Images)

#C

SJB

oss

clu

ste

rs

EI Application replication/HA

features

Comments

OD

G

FS

FO VM

OSDB

Incomingcache

EIASR

(3)

EI CSWatch

dog

1 all 1 1 - - - - - - - 1 - - Basic backup recovery2 all A-P 2 O - - 1 - - Active-Passive – (FITCO stretched cluster)

5a 8.1.x A-S 2 - O - - - - 1 x Active– Standby ASR EI cluster, multiple writes to different storage groups ( as of EI 8.1 )

- = not applicable = required/applicable O = Optional

X = application uses dual write to 2 (or more) different storage locations; this might mean cross mounting from “the other” data center, which is not always favorable (depending on latencies).

(1) vSphere environments: recommended to have one in each DC, although could be 1 in “metro-cluster” type of setups.(3) For HA/DR; the EI caches are linked to more than 1 affinity, assuming the storage is available in more than 1 location

= ASR = Application Storage Replication or writes to multiple volume groups.


6 Supported BCDR Scenarios - Details

PART II - EI 8.1.x BCDR Scenarios

The next chapters describe the different pre-engineered Enterprise Imaging DR Scenarios and how each of the layers in the managed stack is covered.

Managed Stack: Networking, Application, Data, Database, Middleware, O/S, Virtualization, Servers, Storage.


7 Scenario 1 – Basic Backup / Recovery

7.1 Solution Summary

Key Indicators:

Scenario 1: Basic Backup / Recovery

Use Case: "As the CTO or IT administrator, my hospital only requires basic Business Continuity (BC)

and Disaster Recovery (DR) capabilities. We don't have a lot of money so for now we employ manual BC

practices (e.g. Radiologist must go to the Modality to view the images). Some day we might want to investigate

options so we hope the system is easily upgraded to meet our growing needs."

EI Version: Supported in Enterprise Imaging 8.1.x (all versions).

HA within DC1:

Uptime [%] baseline: ~99,8% (indicative)

Restore:

Recovery Time Objective (RTO) EI Core: (best effort)

Recovery Time Objective (RTO) Ancillary: (best effort)

Recovery Point Objective (RPO): 12-24 hrs. (time of last successful backup)

Recovery Point Objective (RPO) Ancillary: 12-24 hrs. (time of last successful backup)

7.1.1 Assumptions

- The Agfa standard backup solution is VEEAM, and can be provided as a turnkey solution.
- Basic Backup/Recovery is the primary and minimal solution to protect against data loss. It is assumed to be in place for each scenario described in all further sections.
- We assume a local backup Staging Area and a remote Safe Storage Location.
- The customer assumes responsibility for execution of the correct backup process, by using the prescribed backup/restore tools, and for the proper protection of the data in a safe location.
- The customer assumes responsibility for safeguarding the backups.

The concept described below follows the guidelines and principles of the SANS Data Recovery Capability - Critical Security Control 10 (CSC-10). [ reference here ]


Figure 5 - Data Recovery Capability - Critical Security Control 10 overview

System and User backups: CSC 10.1

A distinction is made between User Backup – containing application data that changes daily – and System backups – containing System image data that only changes sporadically. Both types have different schedules, frequencies and retention policies.

System backups are typically made after first deployment and implementation and are repeated further on an as-needed basis, such as after major configuration changes or upgrades. The system backup covers the operating system and application software. System backups are typically triggered manually after each configuration change.

User backups cover for the application data and at minimum need to run once per day. User backups cover the databases and configuration data that changes regularly.

Further distinction is made between full backups and incremental backups. Full backups will back up all data, whether it changed or not.

Incremental backups will only back up data that has changed since the last full backup.
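As a generic illustration of that distinction (not the VEEAM or EI backup tooling; the paths in the usage comments are hypothetical), a file-level selection could look like this:

```python
# Generic illustration only (not the VEEAM or EI backup tooling): a full
# backup selects every file, an incremental backup selects only files
# modified since the last full backup.
from pathlib import Path
from typing import List, Optional

def select_files(root: Path, last_full_backup_ts: Optional[float]) -> List[Path]:
    """Return files to back up; pass None to select everything (full backup)."""
    selected = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if last_full_backup_ts is None or path.stat().st_mtime > last_full_backup_ts:
            selected.append(path)
    return selected

# Usage sketch (hypothetical path): full backup passes None,
# an incremental backup passes the timestamp of the last full backup.
# files = select_files(Path("/data/ei"), None)
# files = select_files(Path("/data/ei"), last_full_backup_ts=1700000000.0)
```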

Staging Area and Safe Storage: CSC 10.4

To delineate roles & responsibilities, EI Backup is considered a two-step approach:

1. EI components back up to a Staging Area: this process is managed and monitored by processes and scripts which are part of the EI solution.

2. The data in the backup staging area is then securely copied (to an off-site location) onto Safe Storage. This process is managed and owned by the hospital (backup) processes, and can be orchestrated by 3rd party backup frameworks. Safe Storage can be versioned according to the retention policies of the hospital.

When recovery is required, the EI processes assume recovery from the Staging Area, which holds the most recent consistent copy of the data from the time of the last successful backup. In case the Staging Area is also affected by the outage, the last consistent backup must be presented for recovery from Safe Storage. This latter step might be the hospital's responsibility.

At all times, recovery is a manual process: the key objectives, Recovery Point Objective (RPO) and Recovery Time Objective (RTO), are restricted by the capabilities of the backup solution.

7.1.2 High level description

Backup and recovery of the EI Database is performed by the tools and RMAN scheduled jobs provided in Enterprise Imaging.

The OS VMs are backed up by the VEEAM backup solution. It is the customer's responsibility to store (or replicate) the backups from the "Staging Area" to a "Safe Storage" location.

Image data is protected by archiving. Image data can be additionally protected by using the Enterprise Imaging Application Storage Replication (ASR) capability to write the same data to multiple storage volumes.

7.1.3 Block Diagram

Figure 6 - Basic backup / Recovery - block diagram


7.2 Components

This section lists each layer of the solution, and how it is protected in the BCDR solution:

7.2.1 Environment: Single Data Center

- DC-1 is the primary Data Center, and is active.
- A secondary DC-2 or computer room is a recommended best practice: storing the backups in a different location, where the backup solution is hosted (refer to the Safe Storage Location; not part of the standard solution).

The backup solution can be rather basic (i.e. a backup solution with an associated tape library) or can be extensive (based on a backup solution combining spinning disks and a tape library). The capability of this solution will eventually define the RPO and RTO values.

Note 1: the backup should reside on a different platform or architecture than the production environment (not on the same disk infrastructure as the production environment).

Note 2: The customer is responsible for executing the backups where needed, and for transferring the backups from the Staging Area to a Safe Storage Location.

7.2.2 Network Topology:

- No specific network topology requirements; can be Layer 2 or Layer 3 (routing).
- The network to the backup solution must be able to handle the backup volume in a reasonable timeframe (backup window).

7.2.3 Oracle Database

The Oracle database is in the first place protected by the rollback logs and RMAN (refer to section 12 - APPENDIX C - Oracle HA/DR).

Oracle Database RMAN scheduled jobs are installed as part of the Enterprise Imaging DB installer and trigger a backup script at pre-defined intervals – every day at 02:15AM;

- An L0 full backup is triggered every Sunday.
- The backup set is validated after each L0 backup.
- L1 differential incremental backups are triggered every day.
- The default retention policy is 6 days.
- Reference: EI – Backup/restore Guide - LLID 41759645
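For illustration, such scheduled jobs could resemble the sketch below. The RMAN command blocks are generic examples of level 0 / level 1 incremental backups; they are not the exact scripts shipped with the EI DB installer, which also handles backup set validation and the 6-day retention policy.

```python
# Illustrative sketch only: generic RMAN level 0 / level 1 incremental backups
# driven from a scheduler. The actual scripts and schedules are installed by
# the EI DB installer; the RMAN command blocks below are generic examples.
import subprocess
from datetime import date

RMAN_LEVEL_0 = "BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG; DELETE NOPROMPT OBSOLETE;"
RMAN_LEVEL_1 = "BACKUP INCREMENTAL LEVEL 1 DATABASE PLUS ARCHIVELOG;"

def run_rman(commands: str) -> None:
    """Feed a command block to RMAN, connected to the local target database."""
    subprocess.run(["rman", "target", "/"],
                   input=f"RUN {{ {commands} }}\nEXIT;\n",
                   text=True, check=True)

def nightly_backup(today: date) -> None:
    # Level 0 (full) backup on Sunday, level 1 differential incremental otherwise.
    run_rman(RMAN_LEVEL_0 if today.weekday() == 6 else RMAN_LEVEL_1)

if __name__ == "__main__":
    nightly_backup(date.today())
```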

7.2.4 Application servers: (CS, WS, …)

- The specific application server configuration is part of the Operating System.
- These settings are backed up as part of the VM backup (refer to the next section).

7.2.5 VMware vSphere environment:

- Backup of the initial deployment is required for recovery. In addition, a backup of the VM images is required at each major change or update.
- VM Backup / Restore is supported in a vSphere licensed environment.


In ESXi free versions, backups can be handled with a workaround via OVF exports; the non-licensed ESXi free environment is probably not best suited to cover for DR scenarios.
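One possible shape of such an OVF-export workaround is sketched below; the host name, credentials, VM names and staging path are hypothetical, and the use of VMware's ovftool is only one way to produce the export.

```python
# Illustrative sketch of an OVF/OVA export workaround (hypothetical host,
# credentials, VM names and paths): export each VM with VMware's ovftool so
# that the image can be stored on the backup Staging Area.
import subprocess

ESXI_HOST = "esxi01.hospital.local"   # hypothetical host
STAGING_DIR = "/backup/staging"       # hypothetical staging area

def export_vm(vm_name: str) -> None:
    """Export a single VM as an OVA archive using ovftool."""
    source = f"vi://root@{ESXI_HOST}/{vm_name}"
    target = f"{STAGING_DIR}/{vm_name}.ova"
    subprocess.run(["ovftool", source, target], check=True)

for vm in ("ei-core-server-1", "ei-web-server-1"):   # hypothetical VM names
    export_vm(vm)
```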

In the Basic Backup/Recovery scenario, there is nothing specifically provisioned for failover of the VM’s to another location or server room.

7.2.6 Storage

In this scenario, the different storage layers – identified per storage Tier - are protected as follows:

OS disks (Tier-1):
  o Block IO (i.e. DAS or SAN) or local disks in the server.
  o Holds the VM datastores for the VM OS drives (C:\ etc.).
  o Covered by the VM backup (refer above).

Database storage (Tier-1):
  o Database on block IO, according to the storage specs.
  o In the first place, protected by the Oracle rollback and redo logfiles.
  o For DR purposes, the database is protected by the daily backup.

Tier-2/2a Caches: CS Incoming Cache and WS Proxy/Accepted cache:
  o The CS Incoming Cache and WS Proxy cache are not specifically protected in DR Scenario 1 - Basic Backup/Recovery.
  o The EI application will store the data to the CS Online or Midline storage locations, or to the CS Hub in case of the CW server.
  o Images are further protected by archiving from the CS Incoming Cache to (1 or more) Nearline archive locations, or to (1 or more) optional Cloud archive(s).

Tier-2/3 CS Image cache on Online and Midline Storage:
  o Online/Midline storage is protected by the application archiving to the Nearline archive location.
  o As of EI 8.1, EI can write to multiple Online/Nearline storage locations (affinity groups): the 2nd location is the DR for the primary storage location.

CW Webcache (on Storage Tier-2/3):
  o The Webcache contains data for viewing on local storage.
  o The original source of the data is located on the Core Servers (CS) and can be reconstructed when required.
  o No extra protection required; optional.

Nearline Archive (Tier-4 Storage):
  o Resides on NAS storage.
  o EI (8.1 onward) can write to 1 or more Nearline Archive locations, or optionally to an additional off-site Cloud Archive.
  o If part of the solution, a second archive location can be the DR solution for the primary archive location.

Tier-4 storage: Cloud Archive:
  o The Cloud archive – if available as part of the solution – is the DR protection of the on-site image caches and Nearline archive(s).

7.3 Fail Over scenario

No DR Failover in this scenario.


7.4 Fail back scenario

No fail back in this scenario: restore the database backups.

- When Image cache storage is lost: the Online storage references need to be purged from the system.
- When Image data is protected by multiple Online/Midline/Nearline storage locations (Affinities), the Image data remains available from at least one or more locations in case of a DR incident, or from the Nearline / Cloud archive locations.

7.5 References / High-level Runbook

References:
- EI – Backup/restore Guide - LLID 41759645
- Bid Support backup components – LLID 55661132
- <placeholder for an image cache purge command?>


8 Scenario 2 – Active-Passive

8.1 Solution Summary

Key Indicators:

Scenario 2 – Active-Passive, based on the storage vendor's SYNC HW storage replication; for integration in customer-owned environments that already have a solid BCDR strategy in place based on storage vendor replication.

EI Version: all EI 8.1.x versions

HA within DC1:

Uptime [%] baseline: ~99,98%

Uptime [%] with ODG/FSFO: ~99,99% (See next scenario)

Failover to DC2:

Recovery Time Objective (RTO) EI Core: 30 min.

(20 min for DB failover + 10 min for 1st CS)1

Recovery Time Objective (RTO) Ancillary: Less than 30 min (1)

Recovery Point Objective (RPO) EI Core: 12-24 hr (time of last successful backup)

Recovery Point Objective (RPO) Ancillary: 12-24 hr (time of last successful backup)

Figure 7 - Source: Segmentation Assessment chart v2 – LLID 65920274 – v27 – 2019-06-14

Use Case:

"As the CTO or IT administrator, my hospital requires more than the most basic Business Continuity (BC) and

Disaster Recovery (DR) capabilities. We don't have a lot of money, but we've invested in Storage infrastructure

and have the ability to replicate storage (block level) from the storage vendor (despite understanding the risks

associated with DB being storage replicated, the additional cost of Oracle EE and Oracle Data Guard configuration

are more than we're able to spend). We know the failover from one DC to another isn't immediate, but we can live

with about an hour or two to recover. Some day we might want to investigate options, so we hope the system is

easily upgraded to meet our growing needs."

1 Dictated by the fail-over process and capability of the storage replication solution. Ask the vendor for the storage solution's RTO and RPO.


8.1.1 Assumptions

For the Agfa-provided FITCO 2b scenario: based on HPE Peer Persistence.
For Software & Services-only projects: EI integrates into an existing hospital IT solution,
o The hospital's IT staff maintains and controls the hardware platform and the BCDR fail-over operations, as part of the hospital's BCDR strategy.
o Intended for customer-managed storage solutions, where the IT department has already built out a DR solution based on storage clusters such as EMC VPLEX, HP Peer Persistence, NetApp MetroCluster configurations, etc.
A VMware virtualization infrastructure is in place; this solution is not suitable for Bare Metal deployments.

8.1.2 High-level description

This scenario depends on virtualization and is not suitable for Bare Metal deployments.

This BCDR scenario is based on Storage Infrastructure replication for sites with low (LAN-like) latency. It allows EI 8.x to hook into specific customer-owned and managed BCDR solutions from the storage vendors. The scenario can be further enhanced with fail-over automation tools such as VMware Site Recovery Manager (SRM).

The EI Active-Passive (single EI instance) scenario can be the right choice for larger Hospital Environments that already have a Dual Data Center BCDR Strategy and operational model in place, based on hardware storage replication solutions and fail-over technologies such as VMware Site Recovery Manager (SRM).

Note that two symmetric storage systems are required to support synchronous hardware storage replication (i.e. same size, same storage product, same technology, etc.).

Note that synchronous hardware storage replication should be further extended with a backup solution for a solid Disaster Recovery solution.

Industry examples of such a solution:
Combination of EMC VPLEX with EMC RecoverPoint
Combination of HP Peer Persistence with HP backup solutions
NetApp MetroCluster combined with the SnapMirror solution
Etc.


8.1.3 Block Diagram

Figure 8 - Scenario 2 – Stretched Cluster - hardware storage replication

8.2 Components

The following components are used in this BCDR solution:

8.2.1 Environment: Dual Symmetrical Data Center

Dual symmetrical data center or server room setup.
Both data centers have an identical server architecture, storage and network setup.
Only the Primary Data Center (DC-1) is active (Active-Passive), but VM's can transparently live in DC-2.

8.2.2 Network Topology:

Layer 2 connectivity is required between the two sites: IP network addresses must be able to live in both DC's (i.e. no Layer 3 / routing!).
Optionally, IP addresses can be set up to change (i.e. Layer 3 connectivity) if VMware SRM scripting in combination with the EI startup processes has been configured to cover the IP address change at the time of the failover.
Latency: LAN-like latency between the DC's, < 1-2 msec as absolute maximum, to support fast Synchronous hardware storage replication for DB volumes: any higher latency will have a considerable impact on the performance of the Database.
At the time of failover, the VM's in DC-1 are stopped (if not already) and need to be restarted in DC-2: they maintain the same IP addresses (Layer 2 connectivity).

Load balancer VIPs forward connections to the servers in the active data center.


8.2.3 Oracle Database

This DR scenario is based on hardware storage replication: the EI database is protected as follows:

The Oracle database volumes require Synchronous hardware storage replication to maintain database consistency.

Note that synchronous hardware storage replication should be further extended with a backup solution to be considered a solid Disaster Recovery solution.

The Oracle database is primarily protected by the rollback logs and the RMAN backup (refer to section "12 - APPENDIX C - Oracle HA/DR"); see the illustrative query below.

Optional: Oracle Data Guard (ODG) and Fast Start Failover (FSFO).
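As a hedged illustration only (a generic Oracle view, not an EI-specific or prescribed command), the time of the last successful RMAN backup – which determines the RPO of this scenario – can be checked from SQL*Plus:

SQL> select start_time, end_time, status, input_type from v$rman_backup_job_details order by start_time desc;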

8.2.4 Application servers: (CS, WS …)

There is no specific dependency on the application: EI is "unaware" of the underlying hardware storage replication.

The IP addresses and hostnames of the EI components (DB, CS, WS, …) servers before and after the failover are in principle identical (replicated VM images, Layer 2 connectivity);

Optionally, IP addresses can change if VMware SRM scripting in combination with the EI startup processes has been configured to cover the IP address change at the time of the failover (Layer 3 connectivity).

The VM images of the servers in DC-1 and DC-2 are maintained identical by SYNC hardware storage replication. Note that a VM OS backup is still required!

8.2.5 VMware vSphere environment:

Separate VM HA Clusters, one for DC-1 (VM-HA-DC1) and one for DC-2 (VM-HA-DC2), so that in case of a defect the VM's fail over within the same DC.
Optional: VMware Site Recovery Manager (SRM) for scripted failover.
Separate vSphere environment in every DC (as per SRM requirement).
VMware SRM will script the manually triggered failover process from DC-1 to DC-2 (SRM "failover" button).
Variation: if both data centers (or computer rooms) are located within the same LAN (L2), both data centers could be considered as one single managed vSphere environment; the VM's can in this case transparently fail over between the data centers (VMware HA Cluster).

8.2.6 Storage

In this scenario, DR is mainly based on replication of the Storage components. The storage in DC-1 and DC-2 needs to be symmetrical to support SYNC replication: i.e. same technology, type, size etc.

In this scenario, the different storage layers – identified per storage Tier - are protected as follows:

OS disks (Tier-1):
o Block IO (i.e. DAS or SAN) or local disks in the server.
o Holds the VM datastores for the VM OS drives (C:\ etc.).


o Synchronous replication from DC-1 to DC-2; this requires symmetrical storage for this Tier in DC-1 and DC-2

o Note that synchronous hardware storage replication should be further extended with a backup solution: a backup of the VM images is still required – either of the initial deployment and/or at each major change (upgrade, change, etc).

Database storage (Tier-1):
o Database on block IO according to storage specs.
o In the first place, protected by Oracle rollback and redo log files.
o For DR purposes, the database is protected by the daily backup.
o In addition, Synchronous replication of the DB files from DC-1 to DC-2.
o Note that synchronous hardware storage replication should be further extended with a backup solution: a backup of the database (i.e. Scenario 1) will always be required!

Tier-2/2a Caches: CS Incoming Cache and WS Proxy/Accepted cache:
o These caches are located on Block IO or NAS.
o In case of Block IO, protected by Synchronous replication from DC-1 to DC-2.
o If NAS, replication will be ASYNC replication (NAS usually does not support SYNC replication).
o Images are protected by further archiving from CS Incoming Cache to (1 or more) Nearline archive locations, or to (1 or more) optional Cloud archive(s).

Tier-2/3 CS Image cache on Online and Midline Storage:
o Online/Midline storage caches are protected by the application archiving to the Nearline archive location.
o As of EI 8.1, EI can write to multiple Online/Nearline storage locations (affinity groups): the 2nd location is the DR for the primary storage location.
o Specifically in this Scenario 2, protected by Synchronous replication from DC-1 to DC-2 (for Block IO), and
o Asynchronous replication from DC-1 to DC-2 for NAS (NAS usually does not support SYNC replication).

CW Webcache (on Storage Tier-2/3):
o Webcache contains data for viewing on local storage.
o The original source of the data is located on the Core Servers (CS) and can be reconstructed when required.
o No extra protection required.

Nearline Archive (Tier-4 Storage):
o Resides on NAS storage.
o EI (8.1 onward) can write to 1 or more Nearline Archive locations, or optionally to an additional off-site Cloud Archive.
o If part of the solution, a second archive location can be the DR solution of the primary archive location.
o Specifically in this Scenario 2 A/P based on hardware storage replication, the archive can be protected by Asynchronous replication from DC-1 to DC-2 for NAS (NAS does not support SYNC replication).

Tier-4 storage: Cloud Archive (optional):
o The Cloud archive – when available as part of the solution – is the DR protection of the on-site Nearline archive and image caches.


8.3 Fail over scenario (high level)

8.3.1 Starting condition:

DC-1 is the active data center: it holds all connected users and modalities.
EI database storage in DC-1 is synchronously replicated to a symmetric storage solution in DC-2 (SYNC hardware storage replication).
EI Image caches are asynchronously replicated to a symmetric storage solution in DC-2 (ASYNC hardware storage replication of the Image caches on NAS); note this is required for all image caches: Incoming cache (T2a), midline cache (T2) and Nearline archive (T4) – with the exception of the cloud archive.

8.3.2 Scenario

1. Incident in DC-1 (major or minor).
2. All users and modalities/departments disconnect from EI or experience application hang-up.
3. Application is down.
4. Shut down all EI VM components in DC-1 – if not already.
5. Make the storage in DC-2 the primary storage (procedure is storage vendor/model dependent).
6. Start up a binary copy of the EI VM components in DC-2 (provisioned by SYNC hardware storage replication).
7. Application is up again.
8. Users and modalities/departments can reconnect to the same IP addresses and hostnames.
9. Transactions that were in transit during the incident might need to be cross-checked and re-executed.
10. Revert the hardware storage replication, now from DC-2 to DC-1: storage in DC-2 is now the primary storage, DC-1 is standby storage.

Steps 3-6 can be automated by VMware SRM, delivered as a professional Service. The failover in this case is triggered by the SRM failover button.
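As an illustrative sanity check after step 7 (a hedged sketch, not part of the official runbook – refer to the EI Backup/restore Guide for the formal procedure), the following generic Oracle queries confirm that the database instance in DC-2 has opened and completed crash recovery:

SQL> select instance_name, status from v$instance;
SQL> select name, open_mode, database_role from v$database;

The expected result is STATUS = OPEN and OPEN_MODE = READ WRITE.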

8.4 Fail back scenario

8.4.1 Starting condition:

DC-2 is the active data center.
Hardware storage replication is broken after the incident.
Note that, as the DC's are symmetrical, production can continue in this situation.

8.4.2 Failback scenario

Fail back from DC-2 to DC-1:

1. Reconstruct the defective DC-1 storage by replicating the setup from the DC-2 storage (in case of corruption and if needed).
2. Wait for the DC-2 to DC-1 back-replication to complete. Note that – for longer periods of replication downtime, or for data corruption – the total storage capacity might need to be replicated back from DC-2 to DC-1. This takes a considerable amount of time.


3. Stop the VM's in DC-2: the VM's (and therefore also the application) are down.
4. When the DC-1 storage is back in sync with DC-2: again revert the hardware storage replication from (DC-2 to DC-1) to (DC-1 to DC-2).
5. Start the VM's in DC-1: the application is up again.


9 Scenario 5a: Active/Standby

9.1 Solution Summary

Key Indicators:

Scenario 5a – Active/Standby with Oracle Data Guard, EI Watch Dog and EI Application Storage Replication (ASR).

Use Case:

"As the CTO or IT administrator, my hospital is willing to pay a little more for better Business Continuity (BC) and

Disaster Recovery (DR) capabilities. While we have invested in Storage infrastructure and have the ability to

replicate storage (block level) from the storage vendor, we don't want to rely on this for the database or the image

cache. We know the failover from one DC to another isn't immediate, but we can live with less than an hour to

recover (ideally less than 1/2hr or even 15min). Some day we might want to investigate options, so we hope the

system is easily upgraded to meet our growing needs."

EI Version: EI 8.1.x and later

HA within DC1

Uptime [%] baseline: ~99,95%

Uptime [%] with ODA: ~99,99%

Failover to DC2

Recovery Time Objective (RTO) EI Core: Less than 30 minutes (2)

(15 min. for ODG failover and 10 min for 1st CS)

Recovery Time Objective (RTO) Ancillary: Less than 30 minutes (failover to DC2 via SRM)

Recovery Point Objective (RPO) EI Core: 5 – 15 min (3)

Recovery Point Objective (RPO) Ancillary: 5 – 15 min (4)

Figure 9 - Source: Segmentation Assessment chart v2 – LLID 65920274 – v27 – 2019-06-14

9.1.1 Assumptions

A/S concept works for EI Core Components only. Always combine with Scenario 2 in case there are Ancillary components in your design.

2 Default is manual failover. Scripted FSFO to expedite fast fail-over.
3 RPO dictated by the application replication (i.e. time to replicate large CT studies).
4 RPO dictated by the application replication (i.e. time to replicate large CT studies).


This solution is based on a combination of Oracle Data Guard for database replication, EI

Application Storage Replication and storage vendor replication.

DC-1 and DC-2 are running active – and different! – EI VM's: the EI VM's are not intended to fail over between DC's. SRM shall not be used for the EI Core VM's.

For the application and storage protocols (CIFS/NFS) to operate correctly and at acceptable performance, latencies between DC’s should not be higher than 15 msec; expected impact of network latencies for CIFS/NFS (to be confirmed):

o NORMAL: < 15 msec in normal operations.
o WARNING: > 15 msec during a period no longer than 1 minute.
o ERROR: > 20 msec during a period no longer than 2 minutes. (5)

9.1.2 High Level Description

This BCDR scenario is based on EI Application Storage Replication (ASR) and is intended for sites with low (LAN-like) or medium latency. This scenario writes the data to multiple storage locations (Storage System Groups) in two different data centers (affinities), thereby avoiding the need for complex Hardware Storage replication solutions. Note: Storage-level replication is still required for the Ancillary components in your design!

The scenario builds further on an Oracle Data Guard solution: for this Scenario 5a – Active-Standby, we can extend ODG with Fast Start Fail-Over (FSFO) to support a manually triggered and scripted failover process.

The Enterprise Imaging Watchdog process will either automatically, or under manual control (as of 8.1.2 SP4), fail over the Core Servers application services from DC-1 to DC-2.

The EI Active-Standby scenario is probably the better choice for small and large Hospital Environments that want a reliable and solid application-based BCDR Strategy for the AGFA HE Enterprise Imaging solution.

5 Source: NetApp Performance advisor document [ link here ]


9.1.3 Block Diagram

Figure 10 - block diagram - Scenario 5a - Active -Standby

9.2 Components

The following components are used in this BCDR solution:

9.2.1 Environment: dual DC

Dual Data Center: one active, one "standby" – but running some active EI components.
DC-1:
o is the primary Data Center;
o this DC is actively participating in the workload, the workflow, and is actively servicing the users.
DC-2:
o DC-2 is the secondary Data Center;
o this DC is a standby data center and is not actively contributing to the workload or servicing the users;
o there are however a couple of components that are "switched on", such as the standby Database Server (for Oracle Data Guard replication) and the EI CS servers running the EI CS WATCH DOG process.

9.2.2 Network Topology:

Supports both Layer 2 vLAN and Layer 3 vLAN (routing) (refer to section 10 – EI JBoss Cluster support for Layer 3 networks).


Higher latencies between the two DC's are supported (refer to the assumptions section above) but impact the throughput and performance (and hence the RPO!) on the SMB/CIFS cross mounts.
Traffic is kept as local as possible, supported by the EI CS WATCH DOG component concept (described below).
CROSS STORAGE traffic for CIFS-SMB: this scenario expects the active CS's to write to the two storage locations: local in DC-1 and remote in DC-2 via CROSS MOUNTING.
As both DC's run active EI services, there will be cross DB traffic (besides ODG replication), although minimal.

9.2.3 Load balancer

LB VIPs are forwarding network traffic to the Application servers in DC-1 and DC-2.
As application services in DC-2 are not running, the DC-2 servers will not get any communication/connections. The CS Health Check pages will provide the status to the Load balancer and indicate that the underlying DC-2 CS IP Addresses are not available for the EI services (i.e. transparent for the end user and connected modalities).
At the time of a failure, the EI CS Watch Dog process will stop the EI services in DC-1 and enable the services in DC-2.

The LB health check will detect the “failure” and the LB VIP will forward the connections now to the respective Core Servers (CS) in DC-2 as they come up.

The Web Servers (CW’s) are active in both Data Centers, and protected by CW LB VIP’s. To reduce cross DC traffic, the active LB VIP can point to the WS in the Data center of preference.

9.2.4 Oracle Database

The Oracle database is primarily protected by the rollback logs and RMAN (refer to "12 - APPENDIX C - Oracle HA/DR").
Oracle Data Guard (ODG) replication from DC-1 to DC-2. The standby database server in DC-2 is active but not open for serving clients (see the illustrative queries below).
Oracle Fast-Start Fail-Over (FSFO) as a failover automation option is Work in Progress.
Optional: Oracle RAC on the Oracle Database Appliance (ODA)
o Offers an HA platform (failover within the same DC).
o Optional: a hybrid failover solution (i.e. ODA in DC-1 to a standard server in DC-2) is supported by Oracle under certain specific conditions – refer to section "12.3 - Oracle RAC".
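As a hedged illustration (generic Oracle Data Guard views, not EI-specific commands), the role of each database and the current replication lag can be inspected as follows; v$dataguard_stats is typically queried on the standby:

SQL> select database_role, open_mode from v$database;
SQL> select name, value from v$dataguard_stats where name in ('transport lag', 'apply lag');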

9.2.5 EI Core Application environment

One CS JBoss cluster spanning the two DC's (refer to section 10 – EI JBoss Cluster support for Layer 3 networks).
The EI CS application components are only active in the primary data center (i.e. the DC that hosts the ACTIVE OPEN database – DC-1).
The EI CS servers in DC-2 are running VM's started under a different hostname/IP address, but they do not handle any user requests until they get activated by the EI CS WATCH DOG process.
Therefore, it is not the intention to fail over the EI Core servers such as CS or WS VM's from DC-1 to DC-2: each CS server in DC-1 and DC-2 is a separate component, with its own hostname and IP address.

The CS Watch Dog process is continuously checking for the active and open database in the local data center: either the DB server in DC-1 or the DB server in DC-2 is the active database. The JBoss Database Check watchdog process will only check for the local database instance: Core Servers in DC-1 will monitor the availability of the DB server in DC-1, and similarly for the CS servers in DC-2. This is to avoid unintended and automatic switch-overs (e.g. in case of patching and restarting the database).

The EI CS WATCH DOG process will start up the EI services on the CS servers in the DC where the database is (or becomes) active.

The CS Watch Dog process will shut down the EI services in the DC that no longer runs the active open database. Note that – as of EI 8.1.2 SP4 – automatic shutdown of the Core Server services by the watchdog process can be switched off, to offer better manual control over the shutdown and failover of servers when planned maintenance is required. Core Server nodes can be assigned either to DC-1 or to DC-2 locations.

As the CS servers (and others) run active EI services for image replication, all CS servers (also in DC-2) will connect or cross-connect to the active open DB in the primary DC.

The Web Servers (CW’s) are active in both Data Centers, and protected by CW LB VIP’s. To reduce cross DC traffic, the active LB VIP can point to the WS in the Data center of preference.

9.2.6 VMware vSphere environment:

This scenario requires separate vCenter / vSphere environments in every DC (separate VM HA clusters per DC!).

Independent vSphere in each DC is recommended: at the time of a fail-over, management of the VM environment in DC-2 should probably not depend on the availability of the VM management environment in DC-1.

Only Ancillary VM's fail over from DC-1 to DC-2 or vice versa, hence two HA clusters or two vSphere environments; for the EI Core VM's there is no need for 3rd-party hardware storage replication. For the Ancillary components SRM will be required.

9.2.7 Storage

In this scenario, DR is mainly based on EI Application Storage Replication (ASR) and ODG.

In this scenario, the different storage layers – identified per storage Tier - are protected as follows:

OS disks (Tier-1):
o Block IO (i.e. DAS or SAN) or local disks in the server.
o Holds the VM datastores for the VM OS drives (C:\ etc.).
o There is no intention to run the VM's of DC-1 in the DC-2 environment (or vice versa): the VM's running in DC-1 are different VM's than in DC-2.
o The OS disks of the VM's (vmdk's/datastores) do not require (sync or async) hardware storage replication.
o Note that the solution should be further extended with a backup solution: a backup of the VM images is still required – either of the initial deployment and/or at each major change (upgrade, change, etc.).

Database storage (Tier-1):
o Database on block IO according to storage specs.
o In the first place, protected by Oracle rollback and redo log files.
o No hardware storage replication is required – Oracle Data Guard handles the (Oracle application-based) database replication.
o The database is also protected by the daily backup.

Tier-2/2a Caches: CS Incoming Cache and WS Proxy/Accepted cache:


o Affinity: only local Incoming cache within the same DC is mounted to the local servers (in that same DC): no cross mounting of the Incoming Cache:

o the Incoming Cache of the active DC-1 is mounted to the Core Servers in DC-1, and populated by the CS server in that DC,

o the Incoming cache in DC-2 is mounted to the CSs in DC-2, but not actively used (under control of the EI CS WATCH DOG process)

o No need for replication of the Incoming caches from DC-1 to DC-2 or vice versa: images will be stored on the local and remote Tier-2 Image cache.

o For the Proxy Incoming/Accepted cache at remote facilities: this is protected by the application storage and archiving rules: studies are registered immediately in the CS and archived to the hub. Images still reside on the modalities; no further protection is required.

Tier-2/3 CS Image cache on Online and Midline Storage:
o Online/Midline storage is in the first place protected by application archiving to the Nearline archive location.
o In this scenario, EI (8.1) writes to multiple Online/Nearline storage locations (affinity groups) as defined in the Storage Rules: the 2nd location is the DR for the primary storage location (EI Application Storage Replication – ASR).
o Local mounting of Storage to CS servers in DC-1 and CROSS mounting of storage in DC-2 to CS servers in DC-1.
o CROSS MOUNTING setup! CS servers in DC-1 are writing to storage in DC-1 and storage in DC-2, controlled by the application queues.
o The size of this storage tier can be asymmetrical between DC-1 and DC-2: i.e. this storage tier in DC-2 can be smaller and a different vendor/technology if desired (budget).

CW Webcache (on Storage Tier-2/3):
o Webcache contains data for viewing on local storage.
o The original source of the data is located on the Core Servers (CS) and can be reconstructed when required.
o No extra protection required.

Nearline Archive (Tier-4 Storage):
o EI writes to 1 or more Nearline Archive locations in each DC – as defined by the storage rules – or optionally to an additional off-site Cloud Archive (Application Storage Replication – ASR).
o The second archive location is the DR solution of the primary archive location.
o The size of this storage tier can be asymmetrical between DC-1 and DC-2: i.e. this storage tier can be smaller in DC-2 if desired (budget) – but, as this is the archive, it needs to grow with the solution.
o Local mounting of Storage to CS servers in DC-1 and CROSS mounting of storage in DC-2 to CS servers in DC-1.
o CROSS MOUNTING setup! CS servers in DC-1 are writing to storage in DC-1 and storage in DC-2, controlled by the application storage rules and queues.

Tier-4 storage: Cloud Archive (optional):
o The Cloud archive – when available as part of the solution – is the DR protection of the on-site Nearline archive and image caches.
o No extra measures are required for protecting the Cloud archive: this is covered by the SLA of the Cloud provider. If failover is required, servers in DC-2 will connect and point to "the same cloud archive solution".
o Only supported for MS Azure.


9.3 Fail over scenario (high level)

9.3.1 Starting condition:

DC-1 is the active data center: it holds all connected users and modalities.
EI database storage in DC-1 is replicated by Oracle Data Guard to the standby DB server in DC-2.
The DB server in DC-1 is the active and open database (primary DB server); the other database (in DC-2) is the replicated ODG DB server and is active but closed.
Image caches are replicated based on Application Storage Replication (ASR).
The EI Watchdog process on the Core Servers in DC-1 has detected that the DB server in DC-1 is the active open database. EI Services on the CS's in DC-1 are active.
EI Services on the Core Servers in DC-2 are inactive; only the Watchdog processes are running.

9.3.2 Scenario

1. Incident in DC-1 (major or minor).
2. All users and modalities/departments experience an application hang-up.
3. Application is down.
4. The EI VM application components in DC-1 might be up or down (undetermined).
5. Refer to the BCDR Runbook Decision Tree (LLID 63112397) to assist in the decision-making phase.
6. DECISION POINT: decision to fail over to DC-2. Note that failover is typically used for unplanned downtime; switchover is used during a planned downtime (e.g. when planned work is done in DC-1).
7. Fail-over: make the replicated DB in DC-2 the active and "open" database. Note that the replicated DB was already active, and is now (manually) set to "open" state for the application (*).
8. The EI CS WATCH DOG service in DC-1 will stop any EI application services (if not already stopped).
9. The EI CS WATCH DOG component in DC-2 will automatically start the EI application services in DC-2 as soon as it detects that the local DB in DC-2 is active and open.
10. LB will detect the Health Checks and the VIP(s) will fail over.
11. Application is up again.
12. Users and modalities/departments can reconnect to the same IP addresses and hostnames.
13. Transactions that were in transit during the incident might need to be cross-checked and re-executed.

(*) This fail-over step is strongly recommended to be a manual process, and it can be scripted with Oracle Fast Start Fail-Over (FSFO). There is no need to fail over the EI Core VM's (WS, CS) from DC-1 to DC-2. However, all other Ancillary components (BI, Rhapsody, etc.) will be restarted using Storage-level replication and Site Recovery Manager (SRM).
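As an illustrative verification after the fail-over step (a hedged sketch; the formal checks are in the BCDR Runbook), the role switch of the DC-2 database can be confirmed with a standard Oracle query:

SQL> select database_role, open_mode, switchover_status from v$database;

After a successful fail-over, DATABASE_ROLE is expected to show PRIMARY and OPEN_MODE READ WRITE.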

9.4 High-level fail back scenario

9.4.1 Starting condition

EI is active and fully functional in DC-2. DC-1 has been repaired after the incident, and is deemed ready to go into production.


9.4.2 Scenario

Same as fail-over scenario; fail-back can happen in a controlled way to minimize downtime.


10 APPENDIX A – EI JBoss Cluster support for Layer 3 networks

Enterprise Imaging uses JBoss clustering for High Availability and failover purposes. Standard JBoss uses either UDP or TCP broadcasts for Cluster detection; the nodes join the cluster with the same cluster name.

This concept only works well for nodes that reside on the same (v)LAN – not across (v)LAN's (i.e. on routed networks / Layer 3 networks).

The Enterprise Imaging installer framework also covers Layer 3 networks and identifies upfront, during installation, on which networks the different EI components reside. As such, EI JBoss clustering is also supported across different networks or (v)LAN's. In this configuration, the EI CS servers will belong to different JBoss Cluster Nodes (JBoss groups), one in DC-1 and another in DC-2 (BCDR Scenario 5A).

For BCDR Scenario 6A – Stretched Cluster - EI CS servers will belong to the same JBoss cluster to maintain data consistency and stateless operation across the Core Servers in both different data centers.

The EI Web Servers are stateless and can belong to one or more JBoss clusters – different from the CS JBoss cluster.


11 APPENDIX B – EI Auxiliary System

References:
AUXiliary Environment - Training PPT – LLID 68399202
AUX Statement of Work / White Paper – reference included in the above PPT.

11.1 Definition:

An Auxiliary Enterprise Imaging system (AUX) is defined as a separate, smaller scale EI system which functions independently from the production environment, on isolated and smaller scale hardware, with separate installation, configuration, administration, maintenance and change management from the main production system.

For some scenarios, an Enterprise Imaging AUXILIARY SYSTEM (EI AUX) can contribute to the business continuity (BC) of Enterprise Imaging.

The EI AUX system is however not contributing to uptimes, Recovery Point Objectives or Recovery Time Objectives of the primary production environment.

11.2 Features and dependencies:

Table 5 – Auxiliary (AUX) System - features and capabilities table

Features, capabilities and limitations:

Standalone: The EI AUX system is a separate system. Besides image and message forwarding from the EI production system, there is no integration with the EI production environment.

Separate installation: The AUX system is a separate EI deployment from the production environment; this also means separate lifecycles – and hence it is not affected by upgrades, migrations or updates on the production system. Note that the AUX system is not a TEST system.

Forwarded Image data: The Image data is forwarded (routed) from the Production environment to the AUX system and is stored on the AUX system for a limited time only.

Forwarded HL7 data: HL7 messaging and data is forwarded from the higher-level HIS/RIS systems to the PRODUCTION system as well as to the AUX system.

Separate configuration: The EI configuration is independent from the production EI environment; there are no dependencies on the EI production settings. This also means that separate setup and maintenance is required. Changes on the production environment need to be applied separately to the AUX system.

Small scale: Usually a smaller scale and low budget environment, at the cost of somewhat limited capacity and performance. Sufficiently sized, however, to service an identified number of users and modalities when going into a business continuity scenario.

Limited Integration: Only limited integration capabilities to the external world. As the system is intended to be used in business continuity scenarios, it does not depend on and tightly integrate into any external systems: the EI AUX system provides only the basics required in a business continuity scenario.

Manual operations: Operation of the AUX system is a manual action; as such, there are no automated processes for fail-over and fail-back. Going from the EI Production environment to the EI AUX system is a manually orchestrated scenario.

Data reconciliation: Data reconciliation (from Production to AUX and vice versa) is a manual activity; the image or reporting data that has been acquired needs to be sent back to the production environment when required. There are no automated failover or failback processes between the Production and AUX system.

Data Consistency: Data Consistency controls are manual activities as both the PRODUCTION and AUX system are totally independent from each other. Activities on the AUX system are not replicated back to the PRODUCTION system.

11.3 EI AUX Relationship to the BCDR scenarios

As the EI AUX system is intended as a standalone system, independent from the main EI Production environment, it does not integrate into the BCDR scenarios described above.

As such, adding an EI AUX system does not contribute to higher uptimes or better RPO/RTO objectives of the primary production system.

The "failover strategy" for use of the AUX system is to fall back to all-manual operations on a small scale environment, for very limited functionality, and for a limited number of users, after a human decision-making process of customer/Agfa (no failover automation), to allow a small user community to continue the basic work. The use of the AUX system is only intended for exceptional conditions where major incidents or servicing operations do not allow the standard BCDR scenarios to be used. By no means is the AUX system part of the BCDR scenarios described above.

Similarly, the fail-back strategy is a manual process, where data (DICOM Images and reports) needs to be manually sent back from the AUX to the PRODUCTION environment; it is the customer's responsibility to cross-check the data consistency between the AUX and PRODUCTION system, decide which is the authoritative data and eventually delete the redundant data.

11.4 Starting condition:

The production EI environment is forwarding the DICOM studies to the EI AUX system.
As such, the EI AUX system has (typically) 3 months of DICOM Image data on-line (sizing parameter).
Older DICOM data is purged from EI AUX by the Storage ILM Purging rules to clean up space for new studies. This is a scheduled event. Studies that are protected are not deleted. When older studies are deleted from the AUX server, these studies are purged as well from the AUX database.

The RIS/HIS/EHR is forwarding HL7 messaging to the production system, and in parallel also to the EI AUX system.

HL7 messaging sent to the AUX system is not generating tasks, activities or worklists: all rules engines are stopped in the EI AUX system. As such, HL7 messages will be piling up.

Activities and status updates in the production system are replicated and applied to the AUX system via IOCM messaging from PRODUCTION to EI AUX.

…?

11.4.1 Incident on EI production

Based on the severity of the incident on the EI production environment, it is decided to evacuate to the EI AUX system for identified departments and users to support the diagnostic and clinical processes.

The standard BCDR scenarios cannot be used to deal with the incident.


In agreement between Agfa and Customer, a decision is taken to make use of the EI AUX system.

11.5 EI AUX “Fail-over” scenario

The time of "failover" is registered and kept for the record for later fail-back.
Identified users are informed to connect to EI AUX.
Identified modalities (or departments) are informed to send studies to the EI AUX system.
The DICOM feed to PRODUCTION is down or must be stopped.
The HL7 feed to PRODUCTION is down or must be stopped.
EI Services on PRODUCTION are stopped.

11.6 EI AUX “Fail-back” scenario

11.6.1 Starting condition

The production EI environment is operational again and in production.
A certain number of studies have been acquired, reported and diagnosed on the EI AUX system.
The patient data residing on the EI AUX system has not been archived yet.
Data needs to be reconciled into the EI Production environment once available.
The AUX system is still in use – for "wet reading".

11.6.2 Scenario for fail-back

This is a manual operation that requires orchestration from the local Admin.
In agreement between Agfa and Customer, a decision is taken to revert from the AUX system to the normal PRODUCTION environment.
Data reconciliation: the studies that have been acquired and processed on the AUX system are selected and manually sent back to the PRODUCTION system, verifying for any inconsistencies between both; the study on the AUX system.
All studies are to be sent back to the PRODUCTION environment – up to the registered time of the fail-over procedure (see "11.5 - EI AUX "Fail-over" scenario", first step).


12 APPENDIX C - Oracle HA/DR

References:
EI 8.1 – Database and Application Disaster Recovery guide – KB link here.
Oracle databases on Linux: Toolkit for the Agfa HealthCare Enterprise Imaging database server (Livelink ID: 55955892)
Oracle databases on Windows: Toolkit for the Agfa HealthCare Enterprise Imaging database server (Livelink ID: 55953388)

12.1 Oracle Recovery

Oracle performs automatic crash recovery at the first database open after a system crash to bring the database back into a consistent state. This recovery is performed automatically by applying the redo logs; no user intervention is required.

Oracle crash recovery is the main feature that allows Enterprise Imaging Scenario 2 to be supported (see also the next section, "12.2 - Dependencies on SYNC Hardware storage replication").

When crash recovery turns out to be unsuccessful (corrupt database media files), Datafile Media Recovery is used to recover from a lost or damaged current datafile or control file. Whenever a change is made to a datafile, the change is first recorded in the online redo logs. Media recovery selectively applies the changes recorded in the online and archived redo logs to the restored datafile to roll it forward.

Datafile Media recovery is a manual process and proceeds through the application of (online or archived) redo data to the datafiles while the database is off-line.

Block media recovery is a technique for restoring and recovering individual data blocks while all database files remain online and available. The interface to block media recovery is provided by RMAN, and it is part of the principal backup and recovery solution for Enterprise Imaging Oracle databases.
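As a hedged illustration only (standard Oracle functionality, not an EI-specific procedure), blocks flagged as corrupt – and therefore candidates for block media recovery via RMAN – can be listed from SQL*Plus:

SQL> select file#, block#, blocks, corruption_type from v$database_block_corruption;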

12.2 Dependencies on SYNC Hardware storage replication

Once written to disk, DICOM image files do not change. At time of failure, Asynchronous replication on EI DICOM Image caches would only impact consistency of the images in process of being written. In an EI environment, Synchronous replication for DICOM Image caches is therefore not a requirement (refer to RPO and RTO times).

Database files however are changing continuously; database consistency can only be guaranteed with synchronous hardware storage replication, or with a combination of (consistent) point-in-time snapshot solutions and replication.


If latency does not allow for synchronous replication without performance impact, other DR techniques – such as Oracle Data Guard – are required for consistent database replication.

12.3 Oracle RAC

Refer to section "3.1.3 - Oracle HA: Oracle RAC on Oracle Database Appliance (ODA)".

12.4 Oracle Data Guard

References:
Oracle 11g Database High Availability best practices: https://docs.oracle.com/cd/E11882_01/server.112/e10803/config_dg.htm
Oracle 12c (Active) Data Guard technical paper: http://www.oracle.com/technetwork/database/availability/active-data-guard-wp-12c-1896127.pdf

The Business Continuity (BC)/Disaster Recovery (DR) solution from Oracle is Oracle Data Guard (ODG). ODG is supported on the Oracle 11 and Oracle 12 database Enterprise Edition (EE) – not on lower editions.

Extracts/figures from the Oracle Data Guard technical paper (referenced above):

A Data Guard configuration includes a production database referred to as the primary database, and a directly connected standby database. Primary and standby databases connect over TCP/IP using Oracle Net Services. There are no restrictions on where the databases are physically located provided they can communicate with each other.

A standby database is initially created from a backup of the primary database.

Data Guard automatically synchronizes the primary database and the standby database by transmitting primary database redo logs (the information used by every Oracle Database to protect transactions) and applying it to the standby database.

Data Guard transport services handle all aspects of transmitting redo from a primary to standby database(s). As the EI application commits transactions at a primary database, redo records are generated and written to a local online log file. Data Guard transport services simultaneously transmit the same redo directly from the primary database log buffer (memory allocated within the system global area) to the standby database(s), where it is written to a standby redo log file.

Network efficiency, IO efficiency and performance:

Data Guard transmits only database redo. This is in stark contrast to storage remote-mirroring which must transmit every changed block in order to maintain real-time synchronization. Oracle tests have shown that storage remote-mirroring transmits up to 7 times more network volume, and 27 times more network I/O operations than Data Guard.


Figure: Reduced network consumption by Data Guard, compared to Storage SYNC mirroring.

Data Guard offers two choices of transport services: synchronous and asynchronous. EI uses the asynchronous transport mode.

Synchronous redo transport requires a primary database to wait for confirmation from the standby that redo has been received and written to disk (a standby redo log file) before commit success is signaled to the application. Synchronous transport provides a guarantee of zero data loss if the primary database suddenly fails. Although there is no physical limit to the distance between primary and standby sites, there is a practical limit to the distance that can be supported. As distance increases, the amount of time that the primary must wait to receive standby acknowledgement also increases, directly impacting application response time and throughput.

Asynchronous redo transport avoids any impact to primary database performance by acknowledging commit success to the application as soon as the local log-file write is complete; it never waits for the standby database to acknowledge receipt. This performance benefit comes with the potential for a small amount of data loss, because there can be no guarantee that at any moment in time all redo for committed transactions has been received by the standby.
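As a hedged illustration (generic Oracle views, not an EI-specific check), the configured protection mode and the state of the redo transport destination can be verified on the primary database; the destination id is an assumption, as the standby is commonly – but not necessarily – configured as LOG_ARCHIVE_DEST_2:

SQL> select protection_mode, protection_level from v$database;
SQL> select dest_id, status, recovery_mode from v$archive_dest_status where dest_id = 2 /* assumed standby destination */;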

Database protection:

Automatic Gap Resolution: In cases where primary and standby databases become disconnected (network failures or standby server failures), and depending upon the protection mode used, the primary database continues to process transactions and accumulate a backlog of redo that cannot be shipped to the standby until a new connection is established (reported as an archive log gap and measured as transport lag). While in this state Data Guard monitors the status of the standby database, detects when connection is re-established, and automatically reconnects and resynchronizes the standby database with the primary.

Redo Apply Services: Redo Apply services run on a physical standby database. Redo Apply reads redo records from a standby redo log file, performs Oracle validation to ensure that the redo is not corrupt, and then applies the redo changes to the standby database. Redo Apply functions independently of redo transport to ensure that the primary database performance and data protection (Recovery Point Objective - RPO) are not affected by apply performance at the standby database. Even in the extreme case where apply services have been stopped, Data Guard transport continues to protect primary data by transmitting redo to the standby, where it is archived for later use when apply is restarted.

Continuous Oracle Data Validation: Data Guard uses Oracle Database processes to continuously validate redo before it is applied to the standby database. Redo is completely isolated from I/O corruptions on the primary because it is shipped directly from the primary log buffer. Data Guard also detects silent corruption caused by lost-writes: it prevents this by performing lost-write validation at the standby database, and it detects lost-write corruption whether it occurs at the primary or at the standby.
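As a hedged, generic Oracle note (an Oracle configuration aspect, not an EI requirement stated in this document), the lost-write detection described above is governed by the DB_LOST_WRITE_PROTECT initialization parameter, which can be inspected from SQL*Plus:

SQL> show parameter db_lost_write_protect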

12.5 Oracle Data Guard on ODA

ODG is supported on a mixed platform (i.e. the standby database is not identical to the primary system) under the following conditions (note: not limited to ODA only!):

The same release of Oracle Database Enterprise Edition must be installed on the primary database and the standby database (same Oracle database release and Patch set!)

Systems are of the same Oracle Platform, as defined below:

Oracle software is certified to run on the server software

Architecture of the primary database server and the standby database server is the same (Platform ID and Platform Name, as per Oracle DB Query below);

SQL> select platform_id, platform_name from v$database;

PLATFORM_ID PLATFORM_NAME
----------- -------------------------------
         13 Linux x86 64-bit

Figure 11 - Oracle Platform identification command log

References: Data Guard Support for Heterogeneous Primary and Physical Standbys in Same Data

Guard Configuration (Oracle - Doc ID 413484.1)


https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=299372658519598&id=413484.1&_adf.ctrl-state=3olw87rb_90 (login required)

Oracle Doc ID Note 413484.1 - discusses mixed-platform support and restrictions for physical standbys

Oracle Doc ID Note 1085687.1 - discusses mixed-platform support and restrictions for logical standbys.

12.6 Oracle Fast Start failover (FSFO):

Oracle Fast Start Fail-Over (FSFO) in Maximum Availability mode is a failover automation feature, part of Oracle Data Guard.

For Scenario 5a – Active-Standby, we can extend ODG with FSFO to support a manually triggered and scripted failover process.

It is strongly advised *not* to use automated FSFO failover in Scenario 5a – Active-Standby – as it might introduce unwanted failovers in case of slower DB response. Therefore, FSFO is to be used as a manually triggered and scripted failover process.

General technology description:

The FSFO Observer is the low-footprint client tool that monitors the state of the primary and standby databases, and it can (must) run on a different hardware platform.
The Observer should not reside in DC-1 or DC-2 but rather in a 3rd location (typically an administrative location) for the best possible monitoring, avoiding dependency on any of the monitored data centers.

FSFO is part of the Oracle Broker framework.
The Oracle FSFO Observer will (can be configured to) trigger the automated failover between two Oracle Data Guard servers when conditions permit.
The Oracle FSFO Observer will trigger an automated fail-over to the secondary database, and reinstate a failed primary database as a standby (i.e. reverting the replication) if that feature is enabled (the default). Note that – to avoid unneeded repeated fail-overs – a manual fail-back policy might be recommended.

Note: in FSFO, fail-over can be configured to be automatic or not. As such, FSFO is always recommended to be part of the Enterprise Imaging Oracle Data Guard configurations. Set it to manual fail-over if you want better control over the failover process.

Conditions for FSFO Failover: the observer will initiate failover to the target standby if and only if ALL of the following are true:
The Observer is running.
The Observer and the standby both lose contact with the primary.
(Note: if the observer loses contact with the primary, but the standby does not, the observer can determine that the primary is still up via the standby.)
The Observer is still in contact with the standby.
Durability constraints are met.
The failover threshold timeout has elapsed.
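As a hedged illustration (generic Oracle Data Guard views, not an EI-specific check), the FSFO state and the presence of the observer can be verified on the primary database:

SQL> select fs_failover_status, fs_failover_current_target, fs_failover_observer_present from v$database;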


References:
Guide to Oracle Data Guard Fast-Start Failover: http://www.oracle.com/technetwork/articles/smiley-fsfo-084973.html
EI 8.1.x Oracle databases on Linux Toolkit – LLID 55955892 – FSFO section


13 APPENDIX D - VMware Site Recovery Manager (SRM)

VMware SRM provides automated orchestration and non-disruptive testing of centralized recovery plans to simplify disaster-recovery management for all virtualized applications. VMware SRM automates a manually triggered (!) fail-over process and restarts the services running in DC-1 on a replicated environment in DC-2.

References:

AGFA IITS vSphere and SRM Design Guidelines – LLID 48718647

SRM Standard Services Implementation SoW – LLID 51526968

SRM Design guidelines – LL workspace – LLID 48792817

AGFA HealthCare Enterprise Imaging - VMware SRM 5.x Design, Setup and Installation - LLID 45946440

SRM data sheet: https://www.vmware.com/files/pdf/products/SRM/VMware_vCenter_Site_Recovery_Manager_5.5.pdf

Technical resources: https://www.vmware.com/products/site-recovery-manager/resources.html

SRM Documentation center: http://pubs.vmware.com/srm-55/index.jsp#com.vmware.srm.install_config.doc/GUID-B3A49FFF-E3B9-45E3-AD35-093D896596A0.html

13.1 BCDR Assumptions and Dependencies

Availability numbers in this document are tentative and are based on the formulas mentioned below. In order to make realistic estimates for uptime, failover time and required recovery time, a number of assumptions have been made; these are documented below as well.

13.2 Dependency diagram and availability calculations

Reference: EI 8.1.x BCDR Matrix – LLID 60698109

A dependency diagram describes visually which components constitute a service.

In a “serial” dependency, the service depends on the correct functioning of all components in the dependency chain. Failure of one component will bring the service down. The overall availability of the service is the product of the availability numbers of each of its components.

With a frequency of 1 incident per timeframe, the recovery or failover time for that component will determine its estimated uptime. The uptime for the overall service is then calculated as follows:


Availability = A1 . A2 . … An %

In a “parallel” dependency, redundant components contribute to the overall service. Failure of a single component does not necessarily interrupt the service permanently. Following the same concept as above, the overall availability of a service based on redundant parallel systems is then calculated as follows:

Availability = 1 - ( 1 - A1 ) . ( 1 - A2 ) . … ( 1 - An ) %

Figure 12 – Dependency Diagram & Availability Statistics calculations - LLID 60698109
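A minimal sketch of these two formulas; the helper names and example figures below are hypothetical, for illustration only.

import math

def serial_availability(availabilities):
    """Serial chain: the service is up only if every component is up."""
    return math.prod(availabilities)

def parallel_availability(availabilities):
    """Redundant set: the service is down only if all redundant components are down."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)

# Example: two redundant components of 99.9% each, in series with a 99.95% component
redundant_pair = parallel_availability([0.999, 0.999])          # 0.999999
overall        = serial_availability([redundant_pair, 0.9995])  # ~0.9995 (99.95%)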


13.3 Frequency of incidents and downtime:

To calculate downtimes, we have to start from an assumption on the number of failures per component per timeframe. We start from the reasonably aggressive assumption that each component (hardware or software) will experience some sort of smaller technical failure once per month.

The failing component will have a certain impact on the overall service: for some failures, a load balancing solution might trigger an automated failover to a standby component in a matter of minutes; other component failures might result in a more complex recovery process which could take considerably longer.

The time required to restart the failing component – or to fail over to a standby component – is the downtime used in the availability calculations. The service is up again when the failing component has been restarted or when the provided service has been resumed by a redundant component in the solution.
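Purely as an illustration, with hypothetical figures: assuming one incident per month and a 5-minute automated failover, the estimated uptime of that component is 1 - 5 / (30 × 24 × 60) = 1 - 5 / 43,200 ≈ 99.99%.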


13.4 What-If scenarios

A breakdown of one (or more) of the components constituting the Service (for example: a server hardware breakdown in the ICIS solution) and its effect on the overall Service (for example: the ICIS View service) is assessed in the “What-If” scenarios.

Each What-If scenario describes, for a failing component in the solution, what the incident is, what the impact is, what the resulting actions are, and how much downtime this causes. The downtime is derived from the time required to restart the broken component and to resume the overall service.

The EI 8.1.x BCDR Matrix – LLID 60698109 – contains a tab with a description of the what-if scenarios and the associated impact on availability.

13.5 High Availability

High Availability is the design and process providing automated fail-over capabilities for (parts of) the solution within the same Data Center, at times of smaller technical incidents.

Under the assumption that each component in the solution will observe 1 (smaller) technical incident per month, this incident will trigger an automated fail-over within the same data center. As a consequence, the typical fail-over and restart times of the failing component – or of the components that cover for this failure – will determine how much downtime is experienced as a result of the failure.

Typical incidents are: server breakdown, network component failure, disk failures, software hang-up or “blue screen”, etc.

The assumed frequency for these HA incidents is 1 per month, per component.

13.6 Disaster recovery

Disaster Recovery is the process of evacuation to a secondary Data Center at times of a disaster incident or major (technical) outage of one or more of the crucial components that constitute the ICIS service.

The assumed frequency for these DR incidents is 1 per year (overall).

Typical incidents in this category: major storage breakdown or corruption, data center power failure, major network failures, human error, etc.

In certain scenarios, the DR process of evacuation to a second data center could be used for major interventions and planned maintenance.


13.6.1 Uptime calculations:

Based on the What-If scenarios and the assumptions on incidents per timeframe, we can derive the overall uptime statistics.

The table below calculates the overall uptime of the EI View service based on the uptimes of the individual components that constitute this service. In this specific configuration, the calculated uptime would be XX.xx%.

Refer to the EI 8.1.x BCDR Matrix above – LLID 60698109 – for details.

The uptime per component is based on the time required to restart or fail over the identified failing component. At an incident frequency of 1 incident per component per month, the resulting uptime of the overall service is then calculated as per the formulas explained in the previous paragraph. In reality, we shall never see such a high frequency of incidents.
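A brief sketch of this calculation; the per-component failover/restart times below are hypothetical placeholders, not figures taken from the BCDR Matrix.

# Sketch of the overall-uptime calculation described above.
# All component downtimes are hypothetical placeholders.
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 1 incident per component per month

component_downtime_minutes = {     # hypothetical restart / failover times
    "load balancer": 2,
    "application server": 10,
    "database (ODG failover)": 15,
    "storage": 5,
}

service_uptime = 1.0
for downtime in component_downtime_minutes.values():
    service_uptime *= 1.0 - downtime / MINUTES_PER_MONTH   # serial dependency chain

print(f"Estimated overall service uptime: {service_uptime:.4%}")   # ~99.93%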

13.6.2 Uptime of EI functional components

The What-If scenarios and uptime calculations described above mainly cover the underlying platform components (hardware, servers, network, storage, Oracle database, Operating System, etc.). The EI software and functional components are also represented in the dependency diagram, with an estimated availability number. This number is derived from the restart time of the software component and is as such included in the total uptime calculation of the solution.

The High Availability features of the platform do not necessarily cover the correct functioning of the EI software. As an example: the load balancer can monitor and detect that the DICOM network port (104 - …) is up and running, which does not necessarily mean that DICOM image ingest is successfully accepted. EI provides a more detailed HTML status page where application functionality can be verified.

To mitigate this aspect, EI is always delivered with AGFA GRIP (Global Remote Incident Prevention System) monitoring agents that monitor the functional aspects of the solution. The dependency diagrams and associated uptime calculations include a “Service and Operations” component that represents this monitoring aspect and covers the response time defined by the service SLA.
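To illustrate the difference between a port-level probe and a functional (application-level) check, a minimal sketch follows; the host name and status-page URL are hypothetical placeholders, and actual functional monitoring is performed by the load balancer and the AGFA GRIP agents.

import socket
import urllib.request

def dicom_port_is_open(host: str, port: int = 104, timeout: float = 3.0) -> bool:
    """Transport-level check only: the port accepts TCP connections,
    which does not guarantee that DICOM image ingest is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def application_is_healthy(status_url: str, timeout: float = 5.0) -> bool:
    """Application-level check: the (hypothetical) EI status page responds OK."""
    try:
        with urllib.request.urlopen(status_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

# Hypothetical usage:
# dicom_port_is_open("ei-core.example.org")
# application_is_healthy("https://ei-core.example.org/status.html")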

13.6.3 Bare Metal deployments:

In a limited number of cases, a bare metal deployment might be preferred or even required. Most of the scenarios are also valid for Bare Metal deployments, with the exception of the following:

- Scenario 2: Active – Passive: this scenario relies on storage replication of the VM images and is obviously not supported for Bare Metal deployments.

- Scenario 3: Active – Passive with ODG: with the exception of the database, this scenario also relies on storage replication of the VM images and cannot be used in a Bare Metal deployment.


9 Revision History

Version history is maintained in Livelink.

Table 6 – EI BCDR Strategy White Paper – LLID 51293216

Version | Date       | Author     | Description
V2      | 2015-08    | DS         | EI refresh of the 2014.1 BCDR WP
V3-5    | 2016-02    | DS         | Intermediate versions, covering first input from various architecture discussions
V6      | 2016-03    | DS         | EI 8.1 – version for first review
V8      | 2016-06    | DS         | EI 8.1 – version for TfS approval
V9      | 2016-07    | DS         | EI 8.1 – approved version for TfS
V10     | 2016-08    | DS         | Corrected corrupt document cross-references – republished
V14     | 2017-07    | DS         | EI 8.1.1 refresh, added EI AUX
V15     | 2017-08-07 | DS         | EI 8.1.1 review updates, and version for approval
V16     | 2019-04    | DS         | Editorial updates + 8.1.2SP4 updates: added option of a manual BCDR scenario in 5A – as per JIRA IEI-71077 – Improvements to CS Clustering. Covering for first FITCO alignment decisions, for immediate implementation: deprecated scenario 6A!
V17     | 2019-04    | AE, LG, DS | Updated with v16 review comments, for publication: removed unsupported BCDR scenarios (4, 6a, 6b…); cleansing of document structure; adjusted to FITCO 2019 standardization workstream
V19     | 2019-06    | AE, LG, DS | EI 8.1.2SP5/8.1.4, final review with Global Client Engagement team – consistency checking; version for publication
