
Migrating an HP Serviceguard for Linux Cluster to Red Hat Cluster Suite in Red Hat Enterprise Linux 5 Advanced Platform

May 2009

Contents

  Executive Summary
  Introduction
    Audience
  Red Hat Cluster Suite and fencing
    Quorum rule and fencing
    Failure detection time
    Use of Qdisk to bolster quorum
      Qdisk recommendations
  Networking consideration
  HP iLO fencing based Familiar Configuration
    Network configuration
    HP iLO fence timing
    HP iLO and third party hardware
    HP iLO fencing reliability
    Conclusions
  SCSI-3 PR fencing based Familiar configuration
    Introduction to SCSI-3 PR fencing
    Software, NMI, and hardware watchdog timer
    SCSI-3 PR with automatic restart
    Persistent Reservation and Multipath environment
    Conclusions
  RHCS resources, services, and failover domain
    Services and resources
    Failover domain
    Resource Agents
  Migrating SGLX Packages
    Serviceguard Packages
    Package types
    Restart, Failover, and Failback
    Auto start capability
    Fast fail option
    Package dependency
    Client Network monitoring
    Applications - startup, shutdown, and monitoring
    Package startup and shutdown/Service hierarchical structure
    Service IP address
    Volume group
    File system
  Migration Procedure
    SGLX cluster to be migrated
    Planning phase
      Additional Requirements
      Migrating SGLX cluster configuration
      Migrating package configuration
      Saving the customer-defined area and external scripts variables
      Backing up application and configuration data
    Migration phase
      Stop Serviceguard cluster and packages
      Installing the RHCS RPM
      Convert the volume groups to CLVM
      Configuring RHCS members, SCSI PR fencing, and qdisk
      Configuring the automatic restart mechanism
      Setting up the failover domains, resource, and service for applications
      Starting the cluster, and services

  Terms
  For More Information
  Appendix
    Sample Watchdog controller service script
    Sample Watchdog callback fence verification script
    Migration planning worksheets
      Cluster configuration worksheet
      SGLX package configuration worksheet

    Executive Summary

This white paper describes a procedure to migrate an HP Serviceguard for Linux (SGLX) cluster to a cluster running Red Hat Cluster Suite (RHCS). Differences between the two clusters are described, including membership, quorum, fencing to prevent data corruption, and application failover control.

A step-by-step process describes how to use configuration information from an existing SGLX cluster to quickly create an RHCS cluster with similar functionality.

Some features that exist in SGLX without a comparable RHCS feature are listed. Many of those differences can be managed by users through custom scripts.

HP has worked with Red Hat on creating this white paper.

    Introduction

Red Hat Cluster Suite provides a number of configuration options to cater to different availability needs and to support various hardware configurations. During planning, the administrator is responsible for selecting the right configuration based on business availability needs and the hardware chosen for the cluster. This white paper will focus on two aspects of the configuration that are meaningful to SGLX users: fencing, the mechanism used to ensure a node that has been removed from the cluster does not cause data corruption, and cluster membership.

Red Hat Cluster Suite provides two major types of fencing mechanisms: power fencing and storage fencing. Within power fencing, the power reset can be achieved via external power fencing devices such as network addressable Power Distribution Units (PDUs) or via integrated management cards such as HP's Integrated Lights-Out (iLO). Similarly, storage fencing provides two different methods for restricting shared storage access: fencing via control of the storage area network (SAN) ports on the switch connections to the server, or via SCSI-3 Persistent Reservation.

Two types of fencing are described in this white paper: power reset fencing based on HP's Integrated Lights-Out (iLO) features, and Persistent Reservation fencing. Power reset fencing has been used with RHCS for many years. Persistent Reservation (PR) fencing is currently supported in RHCS. The two familiar Red Hat Cluster Suite configurations, one based on the HP iLO fencing mechanism and the other based on the SCSI-3 PR fencing mechanism, are evaluated and described in detail. These two RHCS configurations provide a high-availability environment that is familiar, if not identical, to that of SGLX. The one best suited for the environment can be chosen for the migration.

RHCS and SGLX have different membership algorithms. This white paper describes how an RHCS quorum disk (qdisk) can be used to bolster the existing quorum mechanism without introducing asymmetric cluster configurations. Also, an automatic reset mechanism can be used to enhance Red Hat membership functionality. It does this by allowing failed nodes to automatically rejoin the cluster in a way that is similar to an SGLX cluster.

The first part of this white paper defines the two familiar Red Hat Cluster Suite configurations to be used for migration from the SGLX cluster. The first familiar RHCS configuration uses Persistent Reservation fencing with the automatic reset mechanism to prevent data corruption and to allow failed nodes to automatically rejoin the cluster in a way that is similar to an SGLX cluster. The second familiar RHCS configuration uses the HP iLO power fencing method. In both familiar configurations, the qdisk configuration is required to bolster the existing quorum to handle various failure scenarios.

SGLX users are familiar with the concept of packages and how toolkits can be used to simplify the development of package control scripts. Red Hat Cluster Suite uses a similar concept by using resources, services, and failover domains to control application failover. This white paper describes how these RHCS features, taken together, are similar to the SGLX package and toolkit structure.

The second part of this white paper describes the migration process. Using a step-by-step procedure, SGLX cluster configuration information is gathered and an RHCS cluster is created from this information. This is illustrated with a sample application. Separate white papers will be available at a later date for migrating SGLX clusters configured with specific toolkits (for example, Oracle or Apache).

Red Hat Cluster Suite delivered with Red Hat Enterprise Linux Advanced Platform 5.2 is used as the basis for this white paper. The information in this white paper is not expected to change significantly for later versions of Red Hat Enterprise Linux 5.

    Audience

This document is targeted at users of HP SGLX on RHEL5 who wish to migrate to Red Hat Cluster Suite on RHEL5.2 or later.

It is assumed that the reader has an understanding of HP SGLX and Red Hat Cluster Suite. Details in the Red Hat documentation are not necessarily repeated in this white paper. For more information on each solution, see http://www.hp.com/go/sglx and http://www.redhat.com/docs/manuals/csgfs.

In addition to formal documentation from HP and Red Hat, the upstream project for Red Hat Cluster Suite maintains a rich wiki at http://sources.redhat.com/cluster/wiki. However, care should be taken when referring to documentation on the wiki, as the upstream features may not be completely implemented in a formal release.

    Red Hat Cluster Suite and fencing

At the core of any high availability cluster is the concept of membership, an accounting of nodes that are associated with the cluster. Cluster software has algorithms to adjust membership based on various failure scenarios. Typically, the cluster software has the concept of quorum that defines which set of nodes will continue to define the cluster. To protect data, the nodes that do not have quorum are removed from cluster membership. The non-quorate nodes that are removed must be prevented from accessing shared resources. Fencing is the term used to describe this restricted access.

    Quorum rule and fencing

In Red Hat Cluster Suite, the quorum is based on a simple voting majority of the defined nodes in a cluster; to reform successfully, a majority of all possible votes is required. There is no rolling membership. Even if a node was taken down by an operator, its vote is still one of the possible votes. In RHCS, each cluster node is assigned some number of votes, and they contribute to the cluster while they are members. If the cluster has a majority of all possible votes, it has quorum (also called quorate); otherwise it does not.

SGLX bases any membership vote on the last state of the cluster (rolling membership). For example, if a cluster originally had 6 nodes and an administrator takes down one node, then on a failure voting is based on 5 nodes. The difference in membership schemes is most evident in small clusters, when more than one node fails at a time, or in split site (disaster tolerant) clusters. SGLX can also use a Lock LUN or the Quorum Service to break ties.

Serviceguard uses a self-reset method to prevent a failed node from writing to shared storage. The deadman driver is a key component to reset nodes that come out of a hung state. Red Hat Cluster Suite supports various fencing mechanisms.

In both SGLX and RHCS, an unequal-sized partition (i.e., a network failure that creates partitions with different numbers of members) will result in the partition with the majority of votes establishing quorum and forming a new cluster. In RHCS, the failed node(s) (or the partition that lost quorum) is fenced by the quorate partition, i.e., removed from the cluster. In SGLX, all the nodes of the partition that lost quorum will self reset.

In SGLX, a quorum server or Lock LUN can be used to break a tie and allow one partition to establish quorum and form the cluster. Likewise, in RHCS, a quorum disk (qdisk) is used to break a tie and to handle asymmetric configurations.

    Failure detection time

In an RHCS cluster, the totem token (the token parameter in /etc/cluster/cluster.conf) is the failure detection time, the time elapsed from a node failure to the cluster software detecting that the node has failed. In HP Serviceguard version A.11.18 and earlier revisions, NODE_TIMEOUT is the failure detection time, while in Serviceguard version A.11.19 it is the member_timeout. The setting of the totem token parameter is determined by guidelines provided by Red Hat. For example, when a qdisk (quorum disk) is used, the totem token timeout should be at least twice the qdisk membership timeout. Another guideline is that when both cluster communication and client traffic use the same network, the totem token timeout should be set higher to prevent false cluster reformation.
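As an illustration, a minimal sketch of how the token timeout could appear in /etc/cluster/cluster.conf is shown below; the cluster and node names are placeholders, and the value is given in milliseconds.

    <?xml version="1.0"?>
    <cluster name="examplecluster" config_version="1">
      <!-- Failure detection time: totem token timeout in milliseconds
           (20 seconds here, matching the qdisk guideline used later). -->
      <totem token="20000"/>
      <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1"/>
        <clusternode name="node2" nodeid="2" votes="1"/>
      </clusternodes>
    </cluster>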

    Use of Qdisk to bolster quorum

Quorum means having a simple majority: just over half the nodes in the cluster are required to establish quorum using the default of one vote per node. A 4-node cluster needs three active nodes to function, a 6-node cluster needs four nodes, and so on. In the 4-node case, losing only 2 nodes results in loss of quorum.

In an RHCS cluster, to lose quorum, exactly half or more of the nodes need to be down at the same time. Loss of quorum will leave the cluster in a suspended state during which cluster operations such as fencing, GFS recovery, and cluster service startups are disallowed. In such cases, operator intervention is required to manually reset the surviving nodes and restart the cluster. Losing half or more of the nodes in a large cluster is very unlikely compared to a small cluster. For example, in a 10-node cluster, the likelihood of losing 5 nodes (breaking quorum) is remote compared with losing 2 nodes in a 4-node cluster. In SGLX, failure of more than half the cluster members will cause the remaining nodes to lose quorum and self reset.

A time-tested approach to strengthen quorum is to add a spare machine to a cluster so that its vote provides the required majority. In an RHCS cluster, however, adding spare machines to strengthen the quorum count is an expensive mechanism when the same result can be easily achieved using a shared storage mechanism referred to as the quorum disk (qdisk).

A qdisk is a small 10 MB disk partition shared across cluster nodes. The qdisk daemon (qdiskd) on each node periodically evaluates its node's health and writes its state and a timestamp into its assigned portion of the disk. Each qdiskd then looks at the state of the other nodes in the cluster as posted in their areas of the qdisk partition. If a node fails (or loses access to the shared storage), its area in the qdisk will not be updated at the required frequency. This will be detected by the other nodes in the cluster. When in a healthy state, the total quorum count is the sum of the votes of each node and the vote assigned to the qdisk. In the event of node failures, the qdisk contributes its vote, giving the remaining nodes the required majority to form a cluster. To sustain the failure of exactly half the members, the qdisk is assigned a vote of 1, assuming all the members have a vote of 1. To allow cluster operations to continue down to the last node, the qdisk is assigned a vote of one less than the total number of nodes in the cluster; any one node plus the quorum partition is then sufficient to hold quorum, allowing cluster operations to continue. When the cluster heartbeat timeout occurs, with the vote contribution from the qdisk, the surviving nodes gain quorum and fence the failed nodes, forming a cluster. Figure 1 shows how the qdisk, in the event of 2 nodes failing in a 4-node cluster, provides the required vote for quorum.

Figure 1: qdisk in majority node failure. (Figure not reproduced. It depicts a 4-node cluster in which node 1 and node 2 fail: the surviving nodes detect that nodes 1 and 2 are down after the totem timeout, the qdisk detects the failure after interval * TKO, the votes of the two surviving nodes plus the qdisk vote provide the 3-of-5 simple majority required for quorum, nodes 1 and 2 are fenced by cman, and a 2-node cluster is formed.)

In the following quorumd configuration, qdiskd performs its evaluation every 2 seconds and considers a node to be down if it does not see updates from that node for 5 cycles.
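A minimal quorumd stanza in cluster.conf consistent with these values might look like the following sketch; the quorum disk label is a placeholder (a device path could be used instead).

    <quorumd interval="2" tko="5" votes="1" label="rhcs_qdisk"/>
    <!-- interval * tko = 10 seconds: a node whose qdisk area is not
         updated for 10 seconds is declared down by qdiskd. -->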

Red Hat recommends that the CMAN membership timeout (totem token timeout) value be at least two times the qdiskd membership timeout value. Hence, in this example the totem token is set to 20 seconds.

NOTE: The above method describes how a qdisk can be used for detecting and fencing a failed node (one which is down). The same method can be used to detect and fence nodes that have lost access to storage. In an RHCS cluster, nodes that have lost access to storage are not automatically fenced. If a node loses access to shared storage, the services on that node will fail at the time of accessing the storage, resulting in a failover to another node in the cluster. That node will continue to participate in the cluster even though it cannot host any services. If there is an attempt to move a service back to such a failed node before the SAN failure is fixed, the services will fail to start and will be moved to other nodes. The qdisk configuration described above for a failed node (which is down) can also be used to prevent cluster participation by a node which has been disconnected from the SAN.

Until now we have looked at how the qdisk allows cluster members to determine that a failed node (one not responding to the cluster heartbeat) is actually down and not on the other side of a network partition. We will now look at an enhanced capability of the qdisk which allows cluster members to determine their own fitness for cluster participation and volunteer to be removed from the cluster (this is typically used in an equal network partition, which is discussed in more detail later).

One or more heuristics can be added to the configuration; these are run prior to accessing the qdisk. They are fitness checks used by a node to declare itself fit or unfit for cluster participation.

Only nodes scoring over the required minimum score will claim they are fit via the quorum disk. Nodes whose score drops below the minimum will declare themselves unfit for cluster participation via the quorum disk. The qdiskd running on another node will then request the cluster manager to fence the unfit nodes.

The qdisk with heuristics mechanism is commonly used for equal network partitions, both in a 2-node cluster and in clusters of more than 2 nodes. One partition declares itself unfit for cluster participation, while the other remains "fit" and therefore fences the unfit partition. This prevents the fence race which can occur when using per-node power management like HP iLO.

In an evenly split partition in a cluster with more than 2 nodes, neither partition gains quorum, leaving the entire cluster in a suspended state. A qdisk with heuristics is used to break the tie so that one of the partitions proceeds to form the cluster.

NOTE: The same qdisk with heuristics used to break a tie in an equal-sized partition can also detect and remove a failed node.

The other use of a qdisk with heuristics is to handle asymmetric configurations. For example, in a 4-node cluster, the 3 nodes in the majority partition could decide that they are *all* unfit for the cluster, while the 1-node minority partition continues to operate. In SGLX, failure of more than half the cluster members will result in the shutdown of the remaining nodes. The qdisk with heuristics can also be used by members to declare themselves unfit based on external reasons, such as not being able to reach a public router.

The following is a sample heuristics definition used in a 2-node cluster for preventing fence races in the event of a network partition (failure of the cluster communication network). The cluster is set up to use separate networks, one for the client and another for cluster communication. Cluster communication is assumed to use a single NIC which is connected using a switch. In this example, the heuristic is a check on the Ethernet link state using ethtool. Note that this will not work with a cross-over cable, since both nodes would find their links unplugged.

The following is the XML definition for the qdisk with the heuristic.
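The listing below is a minimal sketch consistent with the values discussed (an interval of 2 seconds, a tko of 5, a single heuristic, and a 20-second totem token); the quorum disk label and the private-network interface name eth1 are assumptions.

    <totem token="20000"/>
    <quorumd interval="2" tko="5" votes="1" min_score="1" label="rhcs_qdisk">
      <!-- Heuristic: check the link state of the heartbeat NIC. A node
           that scores below min_score declares itself unfit via the qdisk. -->
      <heuristic program="/usr/local/sbin/checkpvtlink.sh eth1" score="1" interval="2"/>
    </quorumd>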

As recommended by Red Hat, the totem token is set to 20 seconds, which is at least twice the qdiskd membership timeout value (quorumd interval * quorumd tko), that is, 10 seconds in the above XML definition.

The script /usr/local/sbin/checkpvtlink.sh can be implemented as follows:

    #!/bin/sh
    #
    # Check the link state of the interface passed as the first argument.
    # Returns 0 (heuristic passes) only when the link is detected.
    #
    ethtool $1 | grep -q "Link detected.*yes"
    exit $?

NOTE: It is important to understand that the heuristic check must be based on a discernible difference between partitions, so that only one partition gets the required minimum heuristics score while the other fails to get it. Otherwise, both partitions will declare themselves fit and will enter into a fence race which can bring down the cluster. If the heuristic in the above example were based on the public network (such as pinging an upstream router), both nodes would still see everything, the tie-breaker would not work, and the result would be a fence race.

    Qdisk recommendations

Red Hat recommends the use of a qdisk configuration to bolster quorum to handle failures such as half (or more) of the members failing, to provide a tie-breaker in an equal split partition, and to detect SAN failure.

Red Hat Cluster Suite allows creation of a two-node cluster with an exception to the quorum rule (i.e., that a majority of votes is required for quorum): one node is considered enough to establish quorum. In the case of a network partition, each node, which has quorum, will race to fence the other. Note that with per-node power management (i.e., where the power control device is not shared between cluster nodes), as used with HP iLO, there is a very small possibility that the nodes simultaneously fence each other, bringing down the entire cluster. Also, a persistent network problem will result in a fence loop (A fences B, B fences A, and so on), assuming that the fencing device is still accessible. A cluster set up with a qdisk with well-defined heuristics prevents the fence race and fence loop in the event of a network partition in a 2-node cluster. When using the SCSI-3 PR fencing method, the nodes will not simultaneously fence each other. This is due to the atomic nature of the SCSI-3 PR mechanism; once a node is ejected it cannot eject others. However, in the event of a persistent network problem there is still the possibility of a fence loop, which can be prevented using a qdisk configuration.

In the event of an evenly split partition in a cluster with more than 2 nodes, neither partition gains quorum, leaving the entire cluster in a suspended state. In such situations operator intervention is required to manually reset the partitions' nodes and restart the cluster. A qdisk with well-defined heuristics is used to break the tie so that one of the partitions proceeds to form the cluster.

To sustain the failure of half the members, a qdisk configuration assigned a vote of 1 is required, assuming all the members have a vote of 1. In the event of simultaneous failure of half the nodes, the qdisk vote will give the surviving nodes the required majority to form a cluster. A qdisk configuration with heuristics is used to implement in RHCS the SGLX capability to sustain an evenly split partition or failure of half the members. This is an example of how SGLX options that do not have an equivalent match (or are not readily available) in RHCS can be easily implemented programmatically using scripting.

In an RHCS cluster, nodes that have lost access to storage are not fenced. If a node loses access to the shared storage, the services on that node will fail when the storage is accessed, resulting in a failover to another node in the cluster. That node will continue to participate in the cluster even though it cannot host any services. If there is an attempt to move a service back to such a failed node before the SAN failure is fixed, the service will fail to start and will be moved on to another node. A qdisk configuration can be used to prevent cluster participation by a node which has been disconnected from the SAN. In SGLX, the disk monitoring service is used to allow package failover when FC link(s) fail.

    Networking consideration

In SGLX, the heartbeat requirement is to use multiple heartbeat networks or a single HA network (using Linux bonding). Heartbeat and client communication can either be on the same network or on different networks. The recommended configuration is to use a separate HA network for client and heartbeat communication. In a minimum network configuration, the same network can be used for both client and heartbeat communication.

Red Hat recommends the use of a bonded pair of NICs (using Linux channel bonding) for client access and another bonded pair of NICs for cluster heartbeat. Channel bonding provides additional HA protection and reduces the number of cluster reformations. For the greatest network reliability, a network bond should not use multiple connections from a single multi-port NIC.
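A sketch of a bonded heartbeat interface on RHEL5 is shown below; the interface names, IP address, and bonding mode are assumptions to be adapted to the environment.

    # /etc/modprobe.conf
    alias bond1 bonding
    options bond1 mode=1 miimon=100        # active-backup with link monitoring

    # /etc/sysconfig/network-scripts/ifcfg-bond1
    DEVICE=bond1
    IPADDR=192.168.100.11
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth2 (repeat for eth3)
    DEVICE=eth2
    MASTER=bond1
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none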

Note: Cluster communication can be set up to use IPv6 addresses, but this is not supported by Red Hat at this time.

Another possible network configuration is a "minimum" network configuration which uses a bonded NIC pair for the client network and the same bonded NIC pair for heartbeat as well. Before adopting the minimum network configuration, the following limitations have to be evaluated:

1. During increased network traffic it is possible for heartbeats to be missed, which results in unnecessary cluster reformations. The totem token timeout has to be set to a higher value in order to prevent "missed heartbeats". This is also a limitation in Serviceguard when both heartbeat and client communication share the same network.

2. While heartbeat messages are normally encrypted using a shared key, there may be security reasons for keeping the heartbeat messages off an Internet-facing network.

Continue to use, in Red Hat Cluster Suite, the same network configuration that was set up for client and heartbeat communication in Serviceguard.

The last networking consideration to note is that Red Hat Cluster Suite defaults to a site-local multicast network for its heartbeat communication. The multicast address can be changed if needed to meet local requirements.
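If a specific address is required, a sketch of overriding the default in /etc/cluster/cluster.conf might look like the following; the address shown is a placeholder.

    <cman>
      <!-- Override the default multicast address used for cluster heartbeat -->
      <multicast addr="239.192.100.1"/>
    </cman>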

    HP iLO fencing based Familiar Configuration

The nodes that do not have quorum are removed from cluster membership and must be prevented from accessing shared resources. For HP ProLiant and Integrity systems, HP iLO fencing is one such method used in Red Hat Cluster Suite to restrict access to shared resources. HP iLO provides remote management capabilities for HP ProLiant and Integrity servers with no additional installation. System administration tasks can be done remotely regardless of the state of the server's operating system. Among other basic iLO features, the servers come enabled with power-on and power-off capabilities. The HP iLO fencing mechanism supported by Red Hat Cluster Suite is based on these iLO power on/power off capabilities.

In this section, the familiar Red Hat Cluster Suite configuration which is based on HP iLO fencing is evaluated.

    Network configuration

Since HP iLO is primarily set up for managing systems remotely, it is necessary to have HP iLO configured on a routable network. In an SGLX cluster, the HP iLO network can either be set up on the client access network or kept isolated on a different routable network. It cannot be on a non-routable private network.

In Red Hat Cluster Suite, HP iLO can be connected to either the cluster communication network used by the cluster manager (CMAN) or to a different network. The HP iLO should not be connected to the cluster communication network, since the nodes would lose access to the fence devices in the event of a network partition. This leaves the cluster in a suspended state where neither partition can be fenced, requiring operator intervention to manually reset the nodes and start the cluster. By connecting HP iLO to a different network (different from the cluster communication network), the nodes will still have access to the fence devices in the event of a network partition. Note that HP iLO is based on per-node power management, where the device is not shared between cluster nodes. In the event of a network partition in a 2-node cluster, both nodes (each of which has quorum) will race to fence the other and potentially bring down the entire cluster. A qdisk with heuristics is used in a 2-node cluster so that only one node wins quorum, preventing the fence race. This is described in more detail in the section Use of Qdisk to bolster quorum.

The following guidelines need to be followed for HP iLO network configurations in an RHCS cluster (a configuration sketch follows the list):

1) HP iLO can be connected to either the client access network or to a different network, but it should be a routable network.
2) HP iLO should not be on the network that is used for cluster communication by CMAN.
3) The HP iLO of each cluster system must be accessible over the network from every other cluster system.
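A sketch of how HP iLO fencing could be declared in /etc/cluster/cluster.conf is shown below; the node name, iLO hostname, and credentials are placeholders.

    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-ilo"/>
        </method>
      </fence>
    </clusternode>
    <!-- additional clusternode entries omitted -->
    <fencedevices>
      <fencedevice agent="fence_ilo" name="node1-ilo"
                   hostname="node1-ilo.example.com" login="fenceuser" passwd="fencepass"/>
    </fencedevices>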

    HP iLO fence timing

The fence_ilo script requires multiple handshakes with iLO via the Remote Insight Board Command Language (RIBCL) interface to fence a node. Since the RIBCL interface requires a new connection every time the fence_ilo script contacts it, there is a lot of connection buildup and teardown time. As a result, the HP iLO fencing mechanism is slower in the earlier releases of RHEL5 than SCSI-3 PR. A recent update to the fence_ilo script has improved response time.

With the iLO2 ASIC on an x86_64 system, it was found that the original fence_ilo release takes approximately 40-50 seconds to execute, independent of the number of nodes in the cluster.

    When the cluster-infrastructure components (such as DLM and GFS) are notified of a node failure, all

related operations (e.g., GFS recovery, application failover) are suspended until the failed node is fenced successfully. Slow fencing will increase the application failover time.

A slow fence can potentially overlap with the rebooting of a failed node, resulting in a second reset. For example, consider the case of a node failure in which the node resets due to a system panic. If the fencing is slow, the iLO reset will probably be triggered during the last phase of the boot sequence, resulting in a second reboot. However, if the iLO fencing had been quick, the node would have been reset within a few seconds of the node panic. Aside from this extra reset or reboot there is no other impact to the cluster.

    HP iLO and third party hardware

The HP iLO fencing mechanism can be used only on HP x86 and Integrity systems which have the built-in iLO hardware. This fencing mechanism cannot be used for migration of SGLX on third party systems, but there are other power control fencing methods usable with other vendors' servers.

    HP iLO fencing reliability

HP iLO fencing relies on the network to which the iLO device is connected. If that network fails, the fencing operation will fail, leaving the cluster in a suspended state.

The HP iLO fencing script (fence_ilo) in Red Hat is implemented using the RIBCL iLO interface. The fencing logic is implemented as a sequence of status/action commands sent one after the other to the iLO hardware (on the node to be fenced).

Following is an example of the sequence of commands sent by the fence_ilo script to the iLO device:

1) Get power status of the server
2) Power off the server (assume it was on)
3) Get power status of the server
4) Power on the server (assume it was off)
5) Get power status of the server

Such handshake-based implementations are less reliable (even though failures are very infrequent) when compared with the single atomic operation used in the SCSI-3 PR fence method.

To handle such infrequent failures of HP iLO fencing, a backup fence method can be set up for redundancy. If the HP iLO fencing fails, the configured backup method is employed.
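A sketch of a per-node fence section with a backup method is shown below; the device names and the PDU port are assumptions. fenced tries the methods in the order listed and falls back to the next method only if the previous one fails.

    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <!-- Primary method: HP iLO power fencing -->
        <method name="1">
          <device name="node1-ilo"/>
        </method>
        <!-- Backup method: network-addressable PDU outlet feeding node1 -->
        <method name="2">
          <device name="pdu1" port="3"/>
        </method>
      </fence>
    </clusternode>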

    Conclusions

Following are the important conclusions that users must be aware of when choosing HP iLO as the fencing mechanism:

1. HP iLO as a fencing mechanism is less costly and easier to manage than most other methods.

2. Even in those extremely rare cases of system hard hangs, HP iLO fencing will successfully reset a node. The same is not true in SGLX.

3. The HP iLO can be connected to any routable network as long as it is not used by CMAN for cluster communication.

4. A qdisk with heuristics is required to handle failure scenarios such as an equal network partition, failure of half the cluster members, or SAN failure. These were sustained in an SGLX environment. For more information, see the section Qdisk recommendations.

5. The HP iLO fencing mechanism is slow when compared with other methods such as SCSI-3 PR. Slow fencing will increase the application failover time.

6. HP iLO hardware is not present in Serviceguard-supported third party systems. In those cases, other hardware such as Dell's DRAC and IBM's RSA, or external network-addressable PDUs such as those from APC and WTI, might be usable.
7. To handle infrequent failures of HP iLO fencing (such as a switch failure), a backup fence method can be set up for redundancy.

    SCSI-3 PR fencing based Familiar configuration

This section defines the familiar Red Hat Cluster Suite configuration for the migration which is based on the SCSI-3 Persistent Reservation (PR) fencing mechanism.

Though SCSI-3 PR as a fencing method was supported in RHEL5.2 and earlier releases, its use was severely restricted in a multipath environment. SCSI-3 PR fencing is supported with multipath, specifically DM-MPIO, with the release of RHEL5.3. The section Persistent Reservation and Multipath environment describes the shortcomings of using SCSI-3 PR in a multipath environment with RHEL5.2 and earlier releases.

Note: In a shared storage environment, the SCSI-3 PR fencing method is preferred over HP iLO fencing. The SCSI-3 PR based configuration described in this section, especially when used with RHEL 5.3 and DM-MPIO multipath, resembles an SGLX cluster more closely than the other RHCS configuration.

    Introduction to SCSI-3 PR fencing

SCSI-3 PR (Persistent Reservation) allows multiple participating nodes to access a shared disk while at the same time blocking access for other nodes. For this purpose, SCSI-3 PR uses a mechanism of registration and reservation. Each system that wants to participate registers a unique key with the shared disk. Blocking access for a system is as simple as removing (pre-empt and abort) that node's registration from the device. The blocking operation is atomic: once a node is ejected, it has no key registered, so it cannot write to the disk and cannot eject others, avoiding the split-brain condition.
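For illustration, the registration and pre-empt operations can be exercised manually with sg_persist from sg3_utils; the device path and keys below are placeholders, and the reservation type shown is only an assumption about what a fencing setup might use.

    # Register this node's key with a shared device
    sg_persist --out --register --param-sark=0x1 /dev/mapper/mpath0

    # List the keys currently registered with the device
    sg_persist --in --read-keys /dev/mapper/mpath0

    # Eject another node by pre-empting its key (0x2) using our key (0x1)
    sg_persist --out --preempt-abort --param-rk=0x1 --param-sark=0x2 \
               --prout-type=5 /dev/mapper/mpath0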

Note: SCSI-2 also has a concept of reservations, but it is explicitly not supported by Red Hat Cluster Suite.

The RHCS fence_scsi fence agent employs the SCSI-3 PR mechanism described above to provide access to members of the cluster and deny access to members that are removed from the cluster. Red Hat Cluster Suite is able to perform fencing via SCSI-3 Persistent Reservation by simply removing a node's registration key from all devices. When a node failure occurs, the fence_scsi agent removes the failed node's key from all devices, thereby preventing it from being able to write to those devices.

It is important to note that, although the fence_scsi agent prevents failed nodes from being able to write to those devices, it does not reset the node so that it can rejoin the cluster after a reboot. The operator is required to manually reset the node.

Since an SGLX node that is fenced is also reset automatically, it is desirable to get the same functionality in a familiar RHCS cluster. This improves usability, since it avoids manual intervention, which is undesirable in clustered environments. Also, since the membership algorithm in RHCS relies on the defined members, having the node back up is helpful. Subsequent sections elaborate on how a watchdog timer can be integrated as a cluster service to reset nodes that are fenced by SCSI-3 PR.

    Software, NMI, and hardware watchdog timer

System hangs may be caused by the kernel looping, so that other tasks have no opportunity to run. These hangs can be classified into two groups: soft lockups and hard lockups. Soft lockups are transitory lockups that delay the execution and scheduling of other tasks. Soft and hard lockups can be detected and managed using watchdog timers of different types.

The first type is a software-based watchdog, which can only handle soft lockups. Hard lockups leave the system completely unresponsive and can occur when a CPU disables interrupts and gets stuck due to a locking error. Timer interrupts are not served in a hard lockup, so scheduler-based software watchdogs cannot be used for detection. An NMI handler or hardware-based watchdog timer can be used.

The second type is the hardware-based Non-Maskable Interrupt (NMI) watchdog supported by the Linux kernel, which relies on specific server hardware that is usually on the system motherboard. The NMI watchdog hardware will trigger a reboot of the system if it does not detect a steady level of system interrupts occurring.

Lastly, the most reliable among the three types of watchdog timers is the traditional hardware watchdog timer. These devices will force a system shutdown/reboot if their device driver does not regularly reset them. Due to a lack of uniformity among low-level hardware watchdog components, it is difficult to make generalizations describing how to determine if a particular system contains such components. Many low-level hardware watchdog components are not self-identifying.

The software-based watchdog timers are sufficient to reset a node even in the case of a soft system hang. However, in those extremely rare cases of hard hangs, the software-based watchdog timers will not be able to reset the fenced node. It must be remembered that the intention of resetting the node using the watchdog timer is not to fence a node but to ensure that the node can rejoin the cluster without requiring operator intervention. An external fencing mechanism such as SCSI-3 PR fences a node and ensures data integrity. Therefore, failure to reset the node only prevents the node from automatically rejoining the cluster and has no impact on data integrity. In such remote cases, operator intervention would be required to reset the node.

    SCSI-3 PR with automatic restart

This section describes an automatic reset mechanism that resets nodes fenced by the SCSI-3 PR fencing method and thereby allows failed nodes to rejoin the cluster automatically without operator intervention.

    The Linux kernel can reset the system using either a software-only watchdog timer or a hardware-based watchdog timer.

The watchdog daemon opens /dev/watchdog and keeps writing to it often enough to keep the system from resetting. When the daemon stops writing, the system is reset by the watchdog timer.

After the watchdog daemon starts, it puts itself into the background and then tries all the user checks specified in its configuration file. Between any two tests it writes to the kernel device to prevent a reset and then sleeps for a predetermined time period before repeating the logic all over again. If the watchdog daemon fails to write within the configured time, the watchdog timer will cause the node to reset.

Automatic reset is implemented using a software-based watchdog timer and a script that checks whether the node has been fenced by SCSI-3 PR. The script determines if the system should stay up or be reset. Based on the return code, the watchdog daemon decides whether or not to write to the kernel device. The script is configured as the test-binary parameter in /etc/watchdog.conf. The script is invoked periodically by the watchdog daemon to determine if the node has been fenced by SCSI-3 PR. This is done by checking whether the shared disks still contain the node's registration key. A sample script is provided in the Appendix.
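A sketch of the relevant /etc/watchdog.conf entries is shown below; the script path and timing values are placeholders, with the fence verification script corresponding to the sample in the Appendix.

    # /etc/watchdog.conf
    watchdog-device = /dev/watchdog                        # software (softdog) or hardware watchdog
    interval        = 10                                   # seconds between keep-alive writes/checks
    test-binary     = /usr/local/sbin/check_scsi_fence.sh  # non-zero exit => node was fenced => reset
    test-timeout    = 60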

Note: The software-based watchdog timers are sufficient to reset a node in the case of most system hangs. However, in extremely rare cases the software-based watchdog timers will not be able to reset the fenced node. The same is true in SGLX; the deadman driver will reset the system in many cases, but hard hangs are not handled.

The watchdog daemon must be shut down when cluster services are manually halted on a cluster node. This makes sure that nodes manually removed from the cluster are not unnecessarily reset. Every node in the cluster can be configured with a restricted service, called the watchdog timer (WDT) controller service, which is responsible for starting, monitoring, and stopping the watchdog daemon. The WDT controller service ensures that the watchdog daemon starts when the cluster services start on the node, and stops the watchdog daemon when the operator stops the cluster services on the node. The WDT controller service provides a simple, reliable way to manage the watchdog daemon and to distinguish between nodes that are fenced and those that are manually removed from the cluster. A sample implementation of the WDT controller service script is provided in the Appendix and is illustrated in Figure 2.


    Figure 2: RHCS cluster SCSI-3 PR with automatic reset
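A sketch of how the WDT controller service could be declared per node in cluster.conf is shown below, using a restricted single-node failover domain; the domain, service, and script names are assumptions.

    <rm>
      <failoverdomains>
        <!-- One restricted, single-member domain per node pins the WDT
             controller service to that node (repeat for each node). -->
        <failoverdomain name="wdt-node1" restricted="1" ordered="0">
          <failoverdomainnode name="node1" priority="1"/>
        </failoverdomain>
      </failoverdomains>
      <service name="wdt-svc-node1" domain="wdt-node1" autostart="1" recovery="restart">
        <script name="wdt-controller" file="/usr/local/sbin/wdt_controller.sh"/>
      </service>
    </rm>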

    Persistent Reservation and Multipath environment

This section describes the shortcomings of using SCSI-3 PR in a multipath environment with RHEL5.2 and earlier releases. These are fully addressed with the release of RHEL5.3.


The SCSI Primary Commands - 3 (SPC-3) specification (http://www.t10.org/ftp/t10/drafts/spc3/spc3r23.pdf) defines the behavior of SCSI-3 PR. It specifies that for SCSI-3 PR support in a multipath environment (multiple paths from host to disk), the node has to register down each HBA/array controller pair. In RHEL5.2 and earlier releases, the Device Mapper multipath (DM-MPIO) driver does not forward the PR commands to all the physical paths of a LUN, which is required for proper functioning of SCSI-3 PR.

In RHCS, the scsi_reserve init script at startup (the 'start' option) creates registrations on all the discovered devices. scsi_reserve with the 'stop' option attempts to remove (unregister) the node's registration key from all devices that it registered with. scsi_reserve uses the sg_persist command for registering or unregistering the keys. In a multipath environment, these sg_persist commands are first received by the DM driver, which then passes them to the underlying SCSI driver.

The DM-MPIO driver forwards the sg_persist registration (or unregistration) commands on only one of the physical paths of a LUN. The required behavior is that the DM-MPIO driver should pass these commands on all the physical paths of a LUN, or the sg_persist command should find these paths and register on each of them. In an active/passive array, the registration (or unregistration) commands are sent on only the active path and not on any of the standby paths. In an active/active array, these commands are forwarded to a single path that is dynamically selected based on the DM-MPIO policy.

This behavior of DM-MPIO has the following consequence in RHCS if SCSI-3 PR is employed as the fencing method in its current form. In a static path selection based configuration, once a path is selected, it is used for I/O until it fails. The node registration is done on this initially selected path. When this path fails, I/O retry attempts on the remaining paths (which are not registered) will fail. In a dynamic path selection based configuration, a randomly selected path is used for every I/O. This path can be different from the path that was registered. The node registration is done on one randomly selected path. This dynamic selection of a path for every I/O leads to a failover/failback ping-pong between the registered and non-registered paths. This continues until all the non-registered paths are marked as failed, leaving the registered path as the only available path. With the failure of the registered path, the device will no longer be accessible, since there will be no other paths available or registered to allow I/O. When the registered path fails in either configuration, the data will be unavailable to the application. The services may still be running on the node even while the application I/O is failing, making no attempt to relocate to another node in the cluster.

Hence, failure of just the registered path is sufficient to lose access to the device, even though there may be other healthy paths to the device. With the use of SCSI-3 PR fencing in RHEL5.2 or earlier releases, all benefits of multipath are lost.

Note: Not all active-passive arrays will work correctly with SCSI-3 PR in a multipath environment, so they are not supported.

If the use of SCSI-3 Persistent Reservation is desired, here are some guidelines on how to limit the above impact of its use in a multipath environment:

1. Use only a static path selection based configuration with active-active arrays. This avoids the failover/failback ping-pong between registered and unregistered paths. On RHEL5, set the path_grouping_policy parameter in /etc/multipath.conf to failover for static path selection (see the sketch after this list).

2. Use a qdisk configuration to fence a node when the storage is no longer accessible. This results in relocating all cluster services to an alternate cluster member, preventing cluster services from running on a node when data is no longer accessible. In the event of failure of the registered path, all access to the devices will be lost. The qdiskd on that node will no longer be able to write to its area in the qdisk. The qdiskd on the other node will detect this and request CMAN

to fence the node. After being fenced, the node will be reset by the auto-restart mechanism set up earlier. The node will reboot, join the cluster, and possibly resolve the path failure.
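A sketch of the static path selection setting in /etc/multipath.conf is shown below; whether it belongs in the defaults section or a per-device section depends on the array in use.

    defaults {
            # Keep each path in its own priority group so I/O stays on the
            # initially selected (registered) path until it fails.
            path_grouping_policy    failover
    }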

    Conclusions

Following are the important conclusions that users must be aware of when choosing SCSI-3 PR as the fencing mechanism:

1. When compared with HP iLO fencing, SCSI-3 PR fencing is faster, even when the system is loaded with a large number of devices. Hence this approach reduces the application failover time significantly.

2. SCSI-3 PR fencing can be supported on Serviceguard-supported third party systems provided the requirements of SCSI-3 PR fencing are met. Hence SCSI-3 PR fencing enables operators to use a common fencing mechanism for both HP and non-HP systems, whereas if the HP iLO fencing approach is used for HP systems, a different fencing mechanism has to be selected for non-HP systems.

3. In order to use SCSI-3 PR as a fencing method, all shared storage must use LVM2 cluster volumes, which is required for the purpose of device discovery. The fence scripts depend on LVM commands to discover shared devices, which is only possible if the volume groups are cluster aware. For migration, all the volume groups set up in SGLX have to be converted to cluster volume groups before being used in RHCS (see the sketch after this list). In addition, all devices within these volumes must be SPC-3 compliant.

4. All nodes in the cluster must have a consistent view of storage, i.e., all nodes in the cluster must register with the same devices. This is because each node must be able to remove another node's registration key from all the devices that it registered with. Therefore, the node performing the fencing operation must be aware of all devices that other nodes are registered with. To meet this requirement, device names need to be persistent.

5. The SCSI-3 PR fencing mechanism is supported in multipath configurations only from RHEL 5.3 onward.

6. A qdisk configuration is defined so that an RHCS cluster can handle the various failure scenarios that are sustained in an SGLX environment.
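As a sketch of the conversion mentioned in item 3, the shared volume groups created under SGLX can be made cluster-aware roughly as follows on RHEL5; the volume group name is a placeholder.

    # Enable cluster-wide LVM locking (sets locking_type = 3 in /etc/lvm/lvm.conf)
    lvmconf --enable-cluster

    # Start the clustered LVM daemon on every cluster node
    service clvmd start

    # Mark each shared volume group as cluster-aware
    vgchange -c y vg_app01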

    RHCS resources, services, and failover domain

An introduction to the concepts of a cluster resource, a service, and a failover domain is provided in this section. This helps in understanding the package migration concepts discussed in the next section.

Services and resources

An application like an Oracle database is made highly available by configuring an RHCS service. A cluster service is made up of cluster resources, components that can be failed over from one node to another. Examples of cluster resources that are common to many services are an IP address, a volume group, or an ext3 file system. Building a cluster service allows transparent client access to an application in the event of a failover. If a hardware or software failure occurs, the cluster automatically restarts the failed node's cluster services on a functional node and thus ensures that no data is lost and there is little disruption to users.

An RHCS service is analogous to an SGLX package. Application services are grouped together in these packages so that in the event of a failure, they can automatically be transferred to another node within the cluster, allowing services to remain available with minimal interruption.

    Failover domain

A cluster service is associated with a failover domain: a subset of cluster nodes that are eligible to run a particular cluster service. However, each cluster service can run on only one cluster node at a time in order to maintain data integrity. One can specify whether or not the nodes in a failover domain are ordered by preference. A cluster service can be restricted to run only on nodes of its associated failover domain.

By assigning a cluster service to a restricted failover domain, one can limit the nodes that are eligible to run a cluster service in the event of a failover. Also, the nodes in a failover domain can be ordered by preference to ensure that a particular node runs the cluster service (as long as that node is active).

    A failover domain can have the following characteristics:

Unrestricted: The subset of members is preferred, but a cluster service assigned to this domain can run on any available member.

    Restricted: The cluster service is allowed to run only on a subset of members.

Unordered: The member on which the cluster service runs is chosen from the available list of failover domain members with no preference order.

Ordered: The domain member on which the cluster service runs is selected based on preference order. The member at the top of the list (as specified in /etc/cluster/cluster.conf) is the most preferred, followed by the second member, and so on.

Different types of failover domains are possible by combining the ordering and restriction flags. The possible combinations are ordered and restricted, unordered and restricted, ordered and unrestricted, and unordered and unrestricted.
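As an illustration, a failover domain combining these flags might be defined in /etc/cluster/cluster.conf roughly as follows. This is a sketch only; the domain and node names are hypothetical and assume the RHEL 5 cluster.conf schema.

    <rm>
      <failoverdomains>
        <!-- ordered="1" selects nodes by priority; restricted="1" limits the
             service to the listed members; nofailback="0" allows the service
             to move back to a more-preferred node when it rejoins -->
        <failoverdomain name="pkg_domain" ordered="1" restricted="1" nofailback="0">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="2"/>
        </failoverdomain>
      </failoverdomains>
    </rm>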

    For more information on failover domains, see http://sources.redhat.com/cluster/wiki/FailoverDomains.


    Resource Agents

Resource Agents (RA) are scripts or executables which handle operations for a given resource (such as IP address, File system, and LVM).

An RA must be able to perform the following actions on a given resource instance on request by the Resource manager:

1. Start: Brings the resource instance online and makes it available for use.
2. Stop: Stops the resource instance.
3. Monitor: Checks and returns the current status of the resource instance.
4. Meta-data: Returns the resource agent meta-data via stdout.

The resource agent functions (start, stop, monitor, and meta-data) should be OCF compliant in order for the application to work in RHCS cluster environments. For more information on how resource agents work, see the OCF RA API v1.0 at http://sources.redhat.com/cluster/wiki/RGManager.

For a list of supported resource agents, see the Red Hat Cluster Suite administrative guide. RHCS also provides a script RA which can be used to handle resources that are not currently delivered with Red Hat Cluster Suite.


    Migrating SGLX Packages

In earlier sections, the two familiar RHCS cluster configurations to be used for migration of the SGLX cluster have been described. This section describes how an SGLX package can be migrated to an RHCS cluster service.

    Serviceguard Packages

The following table lists the types of packages whose migration is addressed in this white paper.

    Table 1: Migration of SGLX packages to an RHCS cluster

Package Profile: Legacy Package with Customer Defined Script
Description: A Legacy failover package whose control script's customer-defined area has code which invokes another script or includes functions that are not a part of the Serviceguard control script template.

Package Profile: Modular Package with Customer Defined Script
Description: A Modular failover package whose control script's customer-defined area has code which invokes another script or includes functions that are not a part of the Serviceguard control script template.

Package Profile: Legacy Package with Control Script and user-defined variables
Description: A Legacy failover package whose package control script has user-defined environment variables that are not Serviceguard parameters.

Package Profile: Modular Package with Control Script and user-defined variables
Description: A Modular failover package whose package control script has user-defined environment variables that are not Serviceguard parameters.

Separate white papers will be available later to address the migration of various SGLX toolkits such as Oracle and Samba.

Package types

The SGLX failover type package - which runs on one node at a time and, in the case of a failure, switches to an alternate node - is the only type supported in Red Hat Cluster Suite. The SGLX Multi-node package - which runs on multiple nodes at the same time and can be independently started and halted on individual nodes - is not supported in Red Hat Cluster Suite. Similarly, the SGLX System Multi-node package - which runs on all cluster nodes at the same time and cannot be started and halted on individual nodes - is also not supported in Red Hat Cluster Suite.

While Red Hat Cluster Suite does not support System Multi-node or Multi-node type SGLX packages, there is a workaround called a "Multi-Instance" service, described in http://sources.redhat.com/cluster/wiki/MultipleInstanceServices, that allows the same service to be running on multiple nodes. It does not provide the full capability of Multi-node or System Multi-node type packages. Even with this workaround, services running on multiple nodes must be managed on a per-node basis, unlike in SGLX where Multi-node or System Multi-node packages are managed at the cluster level. This means that the service instances on multiple nodes cannot all be started or stopped with a single command as with a Multi-node or System Multi-node type package.


NOTE: The Watchdog timer (WDT) controller service implemented in this white paper is an example of the "Multi-instance service" workaround.

    Restart, Failover, and Failback

In Serviceguard, after a package is successfully started, the package manager process monitors the process ID of the package service. If the service fails and the restart parameter for that service is set to a value greater than 0, then the service is restarted on the same node, without halting the package. However, if the maximum allowed restarts are exceeded, the package is halted on its current node and, depending on the package switching flags and failover policy, is started on another node. In an RHCS cluster, if a resource fails a status check function call, the resource manager stops the affected service. Based on the recovery parameter value, the service is either restarted, relocated, or disabled. If it is set to restart, the service is restarted locally before attempting to relocate it to the best-available online node (based on the failover domain policy). If none of the nodes are available, or if all members fail to start the service, then the service is disabled. If set to relocate, the service will be relocated to another node and no attempt will be made to restart it locally. If it is set to disable, the service is disabled.

It is important to note that in Serviceguard, multiple restarts (up to infinity) of a service are allowed (defined by the service_restart parameter), while in RHCS only a single restart is allowed. Therefore, when migrating SGLX packages, a maximum of one service restart is possible.

In Serviceguard, the package configuration parameter failover_policy is used to select a node whenever the package needs to be relocated. It can be set to either configured_node or min_package_node. The configured_node policy means that Serviceguard will select nodes in priority order from the list of node_name entries. If configured with min_package_node, then Serviceguard will select the node from the list of node_name entries that is running the fewest packages when this package needs to start.

In Red Hat Cluster Suite, the failover domain type is used to select the node to which the service needs to be relocated. In an ordered, restricted failover domain, the highest-priority member - whenever it is online - will always be selected to run the service bound to the domain. For example, if the domain members are nodes {A, B, C}, then a service will always run on member A whenever member A is online and there is a quorum. This also means that if member A has a higher priority than member B, then a service running on member B will migrate to A whenever A transitions from offline to online. If all domain members are offline, the service will not run. The ordered, restricted type of failover domain achieves the same behavior as the configured_node failover policy.

In Serviceguard, the failback_policy parameter is used to determine the action to be taken for a package that is not running on its primary node when the primary node becomes capable of running the package. When set to manual, there will be no attempt to move the package back to its primary node while it is running on an adoptive node. When set to automatic, Serviceguard will attempt to move the package back to its primary node as soon as the primary node is capable of running the package. In Red Hat Cluster Suite, enabling the nofailback policy for ordered type failover domains will prevent automatic failback when a more-preferred node rejoins the cluster.

Currently, Red Hat Cluster Suite does not support a failover domain that is equivalent to the SGLX min_package_node failover policy. For such packages, the failover policy has to be changed to one that is supported by Red Hat Cluster Suite. Table 2 shows the RHCS equivalents for SGLX package configurations.


Table 2: SGLX package configurations and their equivalents in Red Hat Cluster Suite

SGLX package: failover_policy set to configured_node; failback_policy set to automatic; service_restart > 0
Red Hat Cluster Suite service: ordered set to 1; restricted set to 1; nofailback set to 0; recovery set to restart (a maximum of one restart is possible); node priority set in the order in which nodes are listed under the Serviceguard package configuration node_name

SGLX package: failover_policy set to configured_node; failback_policy set to manual; service_restart set to 0
Red Hat Cluster Suite service: ordered set to 1; restricted set to 1; nofailback set to 1; recovery set to relocate (if the service fails, no restart is attempted); node priority set in the order in which nodes are listed under the Serviceguard package configuration node_name

SGLX package: failover_policy set to min_package_node
Red Hat Cluster Suite service: Not supported
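As a hedged sketch of how the first two rows of Table 2 translate into /etc/cluster/cluster.conf (the service and domain names below are hypothetical), the service entry would look roughly like this:

    <!-- First row of Table 2: the service is bound to an ordered, restricted
         domain with nofailback="0" (see the earlier failover domain example)
         and is restarted locally once before relocation -->
    <service name="pkg_svc" domain="pkg_domain" recovery="restart"/>

    <!-- Second row of Table 2: set nofailback="1" on the domain and use
         recovery="relocate" instead, so a failed service is moved to another
         node without a local restart attempt -->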

Note: For Modular packages, the parameter names and their literal values are lower case, while for Legacy packages the default is upper case. Unless otherwise specified, the Modular package parameters and literals referred to in this white paper are valid for Legacy packages as well.

    Auto start capability

In Serviceguard, the auto_run parameter determines if the package should be started when the cluster starts, and whether it should automatically restart the package on a new node in response to a failure. Similarly, in Red Hat Cluster Suite, setting autostart to yes allows the service to be started automatically when the cluster is started, and in the event of a failure the service can be started on an adoptive node. So, the autostart option can be used for the migration of Serviceguard packages configured with the auto_run option.

    Fast fail option

In Serviceguard, setting node_fail_fast_enabled to yes will reset the node if the package fails. In addition, Serviceguard also supports service_fail_fast_enabled, which when set to yes will reset the node if the service fails.

In RHCS, the hardrecovery option, when set to yes, will cause the member to reboot if the service fails to stop on that node.


While it is not a direct equivalent of node_fail_fast_enabled or service_fail_fast_enabled, the hardrecovery option may be used to reset a node whenever the service halt fails. Given that the service halt has failed, it is much safer to simply reboot the node rather than just attempting to relocate the service.
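Both options map onto attributes of the service element in /etc/cluster/cluster.conf. The fragment below is a sketch only; the service and domain names are hypothetical, and the exact attribute spellings (autostart, hardrecovery) should be verified against the rgmanager release in use.

    <!-- autostart="1" corresponds to an SGLX package with auto_run enabled;
         hardrecovery="1" reboots the member if the service fails to stop there -->
    <service name="pkg_svc" domain="pkg_domain" autostart="1"
             hardrecovery="1" recovery="restart"/>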

    Package dependency

In Serviceguard, a package can have a dependency on another package, which is described primarily by the parameters dependency_condition and dependency_location.

The dependency_condition parameter (up or down) describes the package condition necessary for the dependency to be satisfied. If set to up, it means that this package requires the package identified by package_name to be up. Similarly, if set to down, it means that this package requires the package identified by package_name to be down. The dependency_location parameter describes where the condition must be satisfied, which can be same_node, any_node, or different_node.

In Red Hat Cluster Suite, a service can have dependencies on another service. The service dependency option, depend, can be used to specify the name of the service it depends on. This dependency is subject to restrictions based on the depend_mode parameter, which is the service dependency mode. When set to hard, the service can be up only if the service it depends on is also up. This dependency needs to be met both at initial service startup time and while the service is running; the service will stop if the service it depends on is stopped. When depend_mode is set to soft, the dependency needs to be met only at startup time. The service can start only if the service it depends on is up. However, once started, the service will not stop if the service it depends on stops.

Event scripting is a way for administrators to trigger service transitions based on a number of things which occur during cluster operation. In this case, event scripting is used to implement the SGLX dependency features that are not readily available in RHCS. This is an example of how SGLX options that do not have an equivalent match in RHCS can be implemented programmatically using scripting. For more information on event scripting, see http://sources.redhat.com/cluster/wiki/RGManager.

The following are the SGLX dependencies that are not readily supported (via depend and depend_mode) by RHCS. These can be implemented with event scripting:

1) All SGLX configurations where the dependency_condition is set to down.

2) All SGLX configurations where the dependency_location is set to same_node or different_node. Since same_node is not supported, the Serviceguard parameter priority becomes irrelevant.

The only type of SGLX configuration that can be readily supported in Red Hat Cluster Suite is:

A configuration where the dependency_condition is set to up and the dependency_location is set to any_node. In Serviceguard, the check for the dependency condition is made at startup time and required thereafter. Therefore, to get the same behavior in Red Hat Cluster Suite, the depend_mode needs to be set to soft, as shown in the sketch below.
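A sketch of such a configuration in /etc/cluster/cluster.conf follows; the service names are hypothetical and the depend value format (service:<name>) is assumed from the rgmanager documentation.

    <!-- web_svc can start only while db_svc is up; because depend_mode is
         "soft", web_svc keeps running if db_svc later stops -->
    <service name="db_svc" domain="pkg_domain" recovery="restart"/>
    <service name="web_svc" domain="pkg_domain" recovery="restart"
             depend="service:db_svc" depend_mode="soft"/>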


    Client Network monitoring

Serviceguard can be set up to have subnets monitored for packages. These subnets can be defined with the monitored_subnets parameter (in Legacy packages it is the SUBNET parameter) in the package configuration file. If any of the defined subnets goes down, the package will be relocated to another node that is configured for this package and has all the defined subnets.

In RHCS, it is possible to monitor the subnet to which the IP address resource belongs, but not any other subnets. Setting the monitor_link parameter of the IP address resource to yes will cause the status function to fail if the link on the NIC to which the IP address is bound is not present. This will trigger the resource manager to relocate the service to another node in the cluster.

For the migration of an SGLX package configured with monitored_subnets, only the subnet to which the IP address belongs can be monitored in the RHCS cluster. The other subnets that were set up to be monitored in SGLX will not be monitored in RHCS. Note that while SGLX monitoring can be configured this way, it is not common.

    Applications - startup, shutdown, and monitoring

RHCS currently supplies fully supported resource agents for standard applications such as Oracle or Samba. It is possible that existing SGLX customers have implemented a customer-defined area (Legacy script) or an external script (for Modular packages) for applications that are already supported by Red Hat supplied resource agents. In the existing SGLX cluster, applications could have been made highly available using either the customer-defined area of Legacy packages or external scripts for Modular packages. In the case of Modular packages, the external_script parameter identifies the script which is used to start and stop applications. In such cases it is recommended to use the supplied resource agent instead of porting the customer-defined areas or external script into a new Red Hat resource agent.

However, in many cases there may not be an equivalent resource agent for the customer-defined area or the external script. In such cases, the customer-defined area or the external script has to be ported as a Red Hat script resource. The RHCS resource type script allows users to integrate a Linux Standard Base (LSB) compliant script as a cluster service. This allows the application to be started, stopped, and monitored. The porting involves implementing the mandatory actions start, stop, monitor, and meta-data described in the section Resource Agents.

In Legacy packages, the control script's customer-defined area is any code between #START CUSTOMER DEFINED FUNCTIONS and #END CUSTOMER DEFINED FUNCTIONS. The logic under the function customer_defined_run_cmds has to be ported as the start function in the resource script. In the case of Modular packages, the external_script parameter identifies the script which is used to start and stop applications. The start function in the external_script has to be ported as the start function in the resource script.

Similarly, the logic under the function customer_defined_halt_cmds has to be ported as the stop function in the resource script RA. With Modular packages, the stop function in the external_script has to be ported as the stop function in the resource script.

In RHCS, the term service refers to a collection of resources, which is equivalent to an SGLX package. However, the same term in Serviceguard refers to a daemon which is monitored while the package is up. Serviceguard uses the existence of the package service to determine if a package is up or has failed. In other words, the package service provides a mechanism to determine if a package is up or down. The same functionality is achieved in Red Hat via the monitor function, which is implemented by every resource agent. This function is periodically called by the Resource group manager to determine the health of the service.


The monitoring period is defined by the interval parameter specified for the monitor action (default of 30 seconds) in the /usr/share/cluster/script.sh file. Here is the snippet of the /usr/share/cluster/script.sh file that shows the default setting for the monitoring periodicity:
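The fragment below is an approximate rendering of that snippet (from the RA meta-data emitted by script.sh); the exact timeout values may differ between releases.

    <actions>
        <action name="start" timeout="0"/>
        <action name="stop" timeout="0"/>
        <!-- the 30-second interval is the default monitoring period -->
        <action name="status" interval="30s" timeout="0"/>
        <action name="monitor" interval="30s" timeout="0"/>
        <action name="meta-data" timeout="0"/>
    </actions>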

As part of the migration, the SERVICE_CMD script logic has to be ported to the monitor function of the resource script RA.

In the case of Legacy packages, there may be non-Serviceguard variables defined in the Legacy package control script that are used in the customer-defined area. Also, if the customer-defined area is implemented as a script, there may be additional variables defined in that script. Both of these sets of variables need to be ported to the newly implemented script RA. Similarly, for Modular packages, there may be non-Serviceguard variables defined in the package configuration as well as in the external script files. These have to be ported to the newly implemented script RA.

The resource script RA has to comply with the LSB specification, which is described in more detail at http://refspecs.freestandards.org/LSB_2.0.1/LSB-Core/LSB-Core/iniscrptact.html. If the rules are not followed, especially the return codes for the start, stop, and monitor functions, the resource manager can behave inconsistently.

The section Migration Procedure gives an example of how the customer-defined scripts or the external script can be ported to a Red Hat script resource. The Appendix provides a sample listing of a ported resource script.

    Package startup and shutdown/Service hierarchical structure

In RHCS, a service is a collection of cluster resources configured into a single entity that is managed (started, stopped, or relocated) for high availability. A service is represented as a resource tree that specifies each resource, its attributes, and its relationship to other resources in the tree. The relationships can be parent, child, or sibling. Even though the service is seen as a single entity, the hierarchy of the resources determines the order in which each resource within the service is started and stopped.

In the case of a child-parent relationship, the startup and shutdown order is simple: all parents are started before children, and children must all stop cleanly before a parent may be stopped. For a resource to be considered in good health, all its children must be in good health.

For a typed child resource, the type attribute of the child resource defines the start order and the stop order of each resource type. The following shows the start and stop values as they appear in the Service resource agent, service.sh.
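The excerpt below is an approximate rendering of the rgmanager child-type definitions in service.sh; the exact values may vary slightly between releases.

    <special tag="rgmanager">
        <child type="lvm" start="1" stop="9"/>
        <child type="fs" start="2" stop="8"/>
        <child type="clusterfs" start="3" stop="7"/>
        <child type="netfs" start="4" stop="6"/>
        <child type="nfsexport" start="5" stop="5"/>
        <child type="nfsclient" start="6" stop="4"/>
        <child type="ip" start="7" stop="2"/>
        <child type="smb" start="8" stop="3"/>
        <child type="script" start="9" stop="1"/>
    </special>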


For this Service resource, all LVM children are started first, followed by all File System children, followed by all Script children, and so forth. Additional considerations are required for non-typed child resources.

For a non-typed child resource, the starting order and stopping order are not explicitly specified by the Service resource. Instead, the starting order and stopping order are determined according to the order of the child resource in /etc/cluster/cluster.conf. Additionally, non-typed child resources are started after all typed child resources and stopped before any typed child resources. Ordering within a resource type is preserved as it exists in the cluster configuration file, /etc/cluster/cluster.conf.

For Serviceguard package migration, only typed child resources are used, and therefore the resources are started (or stopped) in the same order as in an SGLX package.

Red Hat allows a resource to be set up in the common resources block so that it can be shared among many services. It is important to note that such reuse is only possible when the required attributes can be inherited from parent resources in a service block, so that their use in more than one service does not conflict. Such use is possible and beneficial in an NFS configuration. Its application to other resources, such as a file system resource, is not possible, since using a file system resource in multiple places can result in mounting one file system on two nodes and thereby causing corruption. Similarly, its usage is not applicable to IP address and Volume group resources.

In RHCS, a service is considered to have failed if any of its resources fail. In such an event, the expected course of action is to restart the entire service: all of the resources, the one that has failed and the remaining ones that have not. In certain conditions it may be sufficient to restart only part of the service before attempting a normal recovery. This can be accomplished using the __independent_subtree attribute. The application script resource (used for starting, stopping, and monitoring an application in a clustered environment) can be made an independent subtree so that its failure will only cause the script to be restarted and not other resources such as LVM, IP, and FS. Just restarting the application may sometimes be sufficient to resolve the startup problem, avoiding unnecessary restarts of all resources.
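A minimal sketch (with hypothetical resource names) of a service in /etc/cluster/cluster.conf where only the application script is flagged as an independent subtree:

    <service name="apache_mod" domain="pkg_domain" recovery="restart">
      <!-- LVM, file system, and IP resources are defined here as usual -->
      <!-- a failed status check on the script first restarts the script alone -->
      <script name="apache_app" file="/usr/local/cluster/apache.sh"
              __independent_subtree="1"/>
    </service>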


    Service IP address

One or more relocatable IP addresses on a specified subnet can be defined for a package. Similarly, in Red Hat Cluster Suite, an IP address resource is defined to configure a relocatable IP address. A separate IP resource has to be created for every ip_address entry in the package configuration file.

For the migration of an SGLX package, create a separate IP resource for every ip_address entry in the package configuration.
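For example, a package with two ip_address entries would map onto two IP resources in /etc/cluster/cluster.conf; the addresses below are placeholders, and monitor_link enables the link monitoring described earlier.

    <!-- placed inside the corresponding <service> block -->
    <ip address="192.168.1.100" monitor_link="1"/>
    <ip address="192.168.1.101" monitor_link="1"/>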

NOTE: Red Hat Cluster Suite supports both IPv4 and IPv6 addresses for cluster service usage. For cluster communication, only IPv4 addresses are supported.

    Volume group

The vg parameter in the package configuration file specifies an LVM volume group on which a file system needs to be mounted. Multiple vgs can be defined for a package. Similarly in RHCS, an LVM resource can be configured with a volume group for the application.

For the migration of an SGLX package, create a separate LVM resource for every vg entry in the package configuration.

Note: In RHCS, only single system type volume groups are managed by LVM resource agents. Cluster-aware type volume groups, which are managed by the Cluster LVM daemon (CLVMD), do not require any resource agents. When using a SCSI-3 PR fencing based Familiar configuration, do not create LVM resources for the package vg entries. This is because, for SCSI-3 PR fencing, the single system volume groups in the package configuration file have to be converted to cluster-aware volume groups.
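For an HP iLO fencing based configuration, the LVM resource for the example volume group described later in this paper would look roughly as follows; the resource name is hypothetical, and with SCSI-3 PR fencing this resource is omitted as noted above.

    <!-- one lvm resource per vg entry; not used for cluster-aware (clvmd-managed) volume groups -->
    <lvm name="vgap_lvm" vg_name="vgap" lv_name="lvol01"/>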

    File system

At start time, an SGLX package activates one or more logical volumes (associated with the file systems) and then mounts the file systems. At halt time, the package unmounts the file systems. In the case of Legacy packages, the package control script contains the necessary file system commands to complete the mount, unmount, and fsck operations. For Modular packages, these commands are present in the package configuration file. In the case of RHCS, one or more logical volumes for the file system are activated at service start time and the file system is mounted. At halt time, the service unmounts the file systems. In Red Hat Cluster Suite, the File System resource agent contains the necessary commands and logic.

In Serviceguard, the fs_name parameter, in conjunction with fs_directory, fs_type, fs_mount_opt, fs_umount_opt, and fs_fsck_opt, specifies a file system that is to be mounted by the package. The fs_name parameter (the LV parameter in Legacy packages) specifies the block device file for a logical volume, and the fs_directory parameter is the root of the file system specified by fs_name.

During package startup, a file system check (i.e., an fsck operation) is done for every file system defined for the package before it is mounted. The fs_fsck_opt parameter specifies the fsck options used for the operation. If the mount point is busy and fs_mount_retry_count is greater than zero, the startup script will attempt to kill the user process responsible for the busy mount point and then try to mount the file system again. It will do this the number of times specified by fs_mount_retry_count. If the mount still fails after the number of attempts specified by fs_mount_retry_count, package startup will fail. The fs_mount_opt parameter specifies the options used for the mount operation.


At package shutdown time, each file system specified by fs_name will be unmounted. If the mount point is busy and fs_umount_retry_count (the FS_UMOUNT_COUNT parameter in Legacy packages) is greater than zero, the shutdown script will attempt to kill the user process responsible for the busy mount point and then try to unmount the file system again. It will do this the number of times specified by fs_umount_retry_count. If the unmount still fails after the number of attempts specified by fs_umount_retry_count, package shutdown will fail. The fs_umount_opt parameter specifies the options used for the unmount operation.

In RHCS, to define a file system for an application, a File system resource needs to be defined. The File system resource parameters - mount point, file system type, and device - specify the file system that is mounted as part of the resource start function. If the force_fsck parameter is set, then a file system check will be run on the file system before mounting it. This option is ignored for non-journal file systems such as ext2. During resource shutdown, as part of the stop function, the file system will be unmounted. If the mount point is busy, the force_unmount option forces the file system to unmount by killing all processes using the mount point. The self_fence parameter, if set, reboots the node if unmounting of the file system fails.

Note: The file systems ext2, ext3, and gfs are supported by both RHCS and SGLX.

When compared with the Red Hat File system resource, SGLX provides additional file system capability for a package. Following is a list of SGLX file system capabilities that are not supported by the Red Hat file system resource:

1. concurrent_fsck_operations: The number of concurrent fsck operations allowed on file systems being mounted during package startup.

2. concurrent_mount_and_umount_operations: The number of concurrent mounts and umounts to allow during package startup or shutdown.

    3. fs_mount_retry_count: The number of mount retries for each file system.

    4. fs_umount_retry_count: The number of umount retries for each file system.

When migrating SGLX packages to RHCS, here are some recommendations (a sample resource definition follows the list):

For every fs_name entry in an SGLX package, a corresponding RHCS file system resource needs to be created. Ensure that the resources are defined in /etc/cluster/cluster.conf in the order in which the fs_name entries are defined in the SGLX package configuration file.

Set the force_fsck option for the Red Hat file system resource so that a file system check is done on the file system before mounting it.

Use the force_unmount option in conjunction with self_fence so that if the unmount fails, the node will be reset.
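Putting these recommendations together, a file system resource for one fs_name entry might look roughly like the following sketch; the resource name is hypothetical, while the device, mount point, and file system type are taken from the example package described in the Migration Procedure section.

    <fs name="u01_fs" device="/dev/vgap/lvol01" mountpoint="/u01" fstype="ext2"
        force_fsck="1" force_unmount="1" self_fence="1"/>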


    Migration Procedure

A step-by-step procedure is described with an Apache Web server package as an example.

    SGLX cluster to be migrated

This section describes an existing Serviceguard cluster which is migrated to an RHCS cluster. The setup is a two-node SGLX 11.19 cluster, with x86_64 systems node1 and node2 running the RHEL 5.3 Linux operating system. A bonded pair of NICs (using Linux Channel Bonding) is used for client access and a single NIC connected to a switch is used for cluster communication. A quorum server is used to arbitrate in the event of a network partition.

The cluster is configured with apache_mod, which is a Modular package that provides high availability for an Apache server. An ext2 type file system is created on the logical volume lvol01, which is created in the volume group vgap containing a disk LUN - not a partition, since SCSI-3 PR is used - presented to all nodes in the cluster. Every node in the cluster has a mount point directory /u01. The file system created on lvol01 will be mounted on the node where the apache_mod package is running.

The Apache Web server is installed on all cluster nodes configured to run the Apache package, i.e., both node1 and node2. The DocumentRoot and ServerRoot directories reside on the /u01 directory (which mounts the shared file system) so that the information can be accessed from any cluster member. The Apache configuration file httpd.conf is located at /u01/apache/httpd.conf.

Following is a snippet of the Apache configuration that shows the DocumentRoot, ServerRoot, and Listen directives:

ServerRoot "/u01/apache"

    DocumentRoot "/u01/apache/doc"

    Listen 80

    Following are the package configuration parameters of the apache_mod package:

    package_name Apache_mod

    package_type failover

    node_name no