Oracle Clusterware Node Management and Voting Disks
Node Management in Oracle Clusterware
Markus Michalewicz
Senior Principal Product Manager, Oracle RAC and Oracle RAC One Node
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
Agenda
• Oracle Clusterware 11.2.0.1 Processes
• Node Monitoring Basics
• Node Eviction Basics
• Re-bootless Node Fencing (restart)
• Advanced Node Management
• The Corner Cases
• More Information / Q&A
Oracle Clusterware 11g Rel. 2 Processes
Most are not important for node management – the focus is on these:
• CSSDMONITOR (was: oprocd) – ora.cssdmonitor
• CSSD – ora.cssd
• OHASD
Node Monitoring Basics
Basic Hardware Layout
Oracle Clusterware node management is hardware independent.
[Diagram: cluster nodes running CSSD, connected via the public LAN, the private LAN / interconnect, and the SAN network to the shared Voting Disk]
What does CSSD do?
CSSD monitors and evicts nodes:
• Monitors nodes using 2 communication channels:
– Private Interconnect → Network Heartbeat
– Voting Disk based communication → Disk Heartbeat
• Evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures)
Network Heartbeat
Interconnect basics:
• Each node in the cluster is “pinged” every second
• Nodes must respond within the css_misscount time (defaults to 30 secs.)
– Reducing the css_misscount time is generally not supported
• Network heartbeat failures will lead to node evictions
– CSSD log: [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
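The log line above counts down toward css_misscount. A minimal sketch of that countdown logic (only the 30-second default comes from the slide; the function, thresholds, and message format are illustrative, not Oracle internals):

```python
# Illustrative sketch of a misscount-based network-heartbeat check.
# Only the 30-second css_misscount default comes from the slides;
# everything else here is hypothetical.

CSS_MISSCOUNT = 30.0  # seconds a node may miss network heartbeats

def heartbeat_status(seconds_since_last_beat: float) -> str:
    """Classify a peer node by the age of its last network heartbeat."""
    pct = seconds_since_last_beat / CSS_MISSCOUNT * 100
    if pct >= 100:
        return "evict"  # css_misscount exceeded: node will be removed
    if pct >= 75:
        remaining = CSS_MISSCOUNT - seconds_since_last_beat
        # mirrors the CSSD log message shown above
        return f"75% heartbeat fatal, removal in {remaining:.3f} seconds"
    return "alive"

print(heartbeat_status(23.23))  # 75% heartbeat fatal, removal in 6.770 seconds
```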
Disk Heartbeat
Voting Disk basics – Part 1:
• Each node in the cluster “pings” (reads/writes) the Voting Disk(s) every second
• Nodes must receive a response within the (long / short) diskTimeout time
– I/O errors indicate clear accessibility problems; the timeout is then irrelevant
• Disk heartbeat failures will lead to node evictions
– CSSD log: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
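The disk-heartbeat decision can be sketched the same way. The diskTimeout value below is only a commonly cited default, and the function itself is an assumption for illustration:

```python
# Illustrative sketch of the disk-heartbeat check; not Oracle code.
DISK_TIMEOUT = 200.0  # long diskTimeout; commonly cited default, configurable

def disk_heartbeat_ok(seconds_since_last_write: float, io_error: bool) -> bool:
    """A voting-disk heartbeat fails on an outright I/O error (the
    timeout is then irrelevant) or when the last successful ping is
    older than diskTimeout."""
    if io_error:
        return False  # clear accessibility problem: no need to wait
    return seconds_since_last_write < DISK_TIMEOUT
```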
Voting Disk Structure
Voting Disk basics – Part 2:
• Voting Disks contain dynamic and static data:
– Dynamic data: disk heartbeat logging
– Static data: information about the nodes in the cluster
• With 11.2.0.1, Voting Disks got an “identity”:
– E.g. a Voting Disk serial number: [GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
• Voting Disks must therefore no longer be copied using “dd” or “cp”
[Diagram: Voting Disk layout – node information (static) and disk heartbeat logging (dynamic)]
“Simple Majority Rule”
Voting Disk basics – Part 3:
• Oracle supports redundant Voting Disks for disk failure protection
• The “Simple Majority Rule” applies:
– Each node must “see” a simple majority of the configured Voting Disks at all times in order not to be evicted (i.e. to remain in the cluster):
trunc(n/2 + 1), with n = number of voting disks configured and n >= 1
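The rule above translates directly into code; this small sketch only encodes the trunc(n/2 + 1) formula from the slide (the function names are illustrative):

```python
import math

def required_votes(n_voting_disks: int) -> int:
    """Simple Majority Rule: trunc(n/2 + 1) disks must be accessible."""
    return math.trunc(n_voting_disks / 2 + 1)

def node_stays_in_cluster(n_configured: int, n_accessible: int) -> bool:
    return n_accessible >= required_votes(n_configured)

# 3 configured voting disks: a node must see at least 2 of them.
print(required_votes(3))            # 2
print(node_stays_in_cluster(3, 1))  # False -> the node is evicted
```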
Insertion 1: The “Simple Majority Rule” in extended Oracle clusters
• The same principles apply
• The Voting Disks are just geographically dispersed
• See http://www.oracle.com/goto/rac
– “Using standard NFS to support a third voting file for extended cluster configurations” (PDF)
Insertion 2: Voting Disks in Oracle ASM
The way of storing Voting Disks doesn’t change their use:
• Oracle ASM auto-creates 1/3/5 Voting Files
– Based on External/Normal/High redundancy and on the Failure Groups in the Disk Group
– Per default, there is one failure group per disk
– ASM will enforce the required number of disks
– New failure group type: Quorum Failgroup
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
Located 3 voting disk(s).
Node Eviction Basics
Why are nodes evicted?
To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent the consequences of a split brain:
– Shared data must not be written by independently operating nodes
– The easiest way to prevent this is to forcibly remove a node from the cluster
How are nodes evicted in general?
“STONITH-like” – node eviction basics – Part 1:
• Once it is determined that a node needs to be evicted:
– A “kill request” is sent to the respective node(s)
– Using all (remaining) communication channels
• The node (CSSD) is requested to “kill itself” – hence “STONITH-like”:
– Classic “STONITH” foresees that a remote node kills the node to be evicted
How are nodes evicted? EXAMPLE: heartbeat failure
• The network heartbeat between the nodes has failed
– It is determined which nodes can still talk to each other
– A “kill request” is sent to the node(s) to be evicted, using all remaining communication channels, i.e. the Voting Disk(s)
• The node is requested to “kill itself”; the executer is typically CSSD
How can nodes be evicted? Using IPMI – node eviction basics – Part 2:
• Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
– Intelligent Platform Management Interface (IPMI) drivers are required
• IPMI allows a remote shutdown of nodes using additional hardware
– A Baseboard Management Controller (BMC) per cluster node is required
Insertion: node eviction using IPMI – EXAMPLE: heartbeat failure
• The network heartbeat between the nodes has failed
– It is determined which nodes can still talk to each other
– IPMI is used to remotely shut down the node to be evicted
Which node is evicted? Node eviction basics – Part 3:
• The Voting Disks and the heartbeat communication are used to determine the node
• In a 2-node cluster, the node with the lowest node number should survive
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
Re-bootless Node Fencing (restart)
Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node:
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
– Re-boots affect applications that might run on a node but are not protected by Oracle Clusterware
– Customer requirement: prevent a reboot, just stop the cluster stack – implemented…
Re-bootless Node Fencing (restart) – how it works:
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
– Instead of fast re-booting the node, a graceful shutdown of the stack is attempted
• It starts with a failure – e.g. a network heartbeat or interconnect failure
• Then the I/O-issuing processes are killed; it is made sure that no I/O process remains
– For a RAC DB, mainly the log writer and the database writer are of concern
• Once all I/O-issuing processes are killed, the remaining processes are stopped
– IF the check for a successful kill of the I/O processes fails → reboot
• Once all remaining processes are stopped, the stack stops itself with a “restart flag”
• OHASD will finally attempt to restart the stack after the graceful shutdown
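The sequence above can be condensed into a sketch. The function and its injected callables are purely illustrative; the real logic is internal to Oracle Clusterware:

```python
# Hypothetical sketch of re-bootless node fencing: graceful stack
# shutdown first, classic reboot only as a fallback.

def fence_node(kill_io_processes, stop_remaining_stack, reboot, restart_stack):
    """kill_io_processes must return True only if it is certain that
    no I/O-issuing process (e.g. log writer, database writer) remains."""
    if not kill_io_processes():
        reboot()                  # fallback: classic fencing via reboot
        return "rebooted"
    stop_remaining_stack()        # stop whatever is left of the stack
    restart_stack()               # OHASD restarts the stack ("restart flag")
    return "restarted in place"
```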
Re-bootless Node Fencing (restart) – EXCEPTIONS:
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
– IF the check for a successful kill of the I/O processes fails → reboot
– IF CSSD gets killed during the operation → reboot
– IF cssdmonitor (the oprocd replacement) is not scheduled → reboot
– IF the stack cannot be shut down within “short_disk_timeout” seconds → reboot
Advanced Node Management
Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4:
• Each node in the cluster is “pinged” every second (network heartbeat)
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
[Diagram: a 3-node cluster splits into a 1-node and a 2-node sub-cluster; the 2-node sub-cluster survives]
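The survivor decision can be sketched as follows; the tie-break on the lowest node number comes from the 2-node rule in Part 3, while the function itself is illustrative:

```python
# Illustrative sketch: after a split, pick the sub-cluster that survives.
# Biggest sub-cluster wins; ties go to the lowest node number.

def surviving_subcluster(subclusters):
    """subclusters: lists of node numbers that can still talk to each other."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))

print(surviving_subcluster([[1], [2, 3]]))  # [2, 3] - biggest sub-cluster
print(surviving_subcluster([[1], [2]]))     # [1]    - lowest node number
```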
Redundant Voting Disks – why odd?
Voting Disk basics – Part 5:
• Redundant Voting Disks provide Oracle-managed redundancy
• Assume for a moment that only 2 voting disks were supported…
• Advanced scenarios need to be considered:
– Without the “Simple Majority Rule”, what would we do?
– Even with the “Simple Majority Rule” in place, each node might be able to see only one voting disk, which would lead to an eviction of all nodes
– With 2 disks, the majority is trunc(2/2 + 1) = 2, so losing access to a single disk already costs a node its majority; an even number of disks tolerates no more failures than the next-lower odd number
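The formula from Part 3 makes the point concrete: an even number of voting disks never tolerates more disk failures than the next-lower odd number, which is consistent with Oracle ASM creating 1/3/5 voting files. A quick check:

```python
import math

def majority(n: int) -> int:
    # Simple Majority Rule from Part 3: trunc(n/2 + 1)
    return math.trunc(n / 2 + 1)

# With 2 voting disks a node must see both; losing one disk already
# costs the majority. With 3 disks, one disk may fail.
print(majority(2), majority(3))  # 2 2

# An even count n tolerates exactly as many disk failures as n - 1:
for n in (3, 5, 7):
    assert n - majority(n) == (n + 1) - majority(n + 1)
```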
The Corner Cases
Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…
• Assume a properly configured cluster with 3 voting disks
• What happens if a storage network failure leads to lost remote storage access?
• There will be no node eviction!
• IF storage mirroring is used (for the data files), the respective mirroring solution must handle this case
• Covered in Oracle ASM 11.2.0.2:
– _asm_storagemaysplit = TRUE
– Backported to 11.1.0.7
Case 2: CSSD is stuck
CSSD cannot execute the kill request:
• A node is requested to “kill itself”
• BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
– CSSD failed for some reason
– CSSD is not scheduled within a certain margin
→ CSSDMONITOR (was: oprocd) will take over and execute the request
Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests:
• Cluster members (e.g. Oracle RAC instances) can request Oracle Clusterware to kill a specific member of the cluster
• Oracle Clusterware will then attempt to kill the requested member
• If the requested member kill is unsuccessful, a node eviction escalation can be issued, which leads to the eviction of the node on which the particular member currently resides
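The escalation path can be sketched with injected callables (all names here are illustrative, not Oracle APIs):

```python
# Hypothetical sketch of member-kill escalation: try to kill the
# requested member first, evict its whole node only if that fails.

def member_kill(kill_member, evict_node, node_of_member):
    if kill_member():             # e.g. terminate a RAC instance
        return "member killed"
    evict_node(node_of_member)    # escalation: fence the hosting node
    return "node evicted"
```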
More Information
• My Oracle Support Notes:
– ID 294430.1 – CSS Timeout Computation in Oracle Clusterware
– ID 395878.1 – Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot
• http://www.oracle.com/goto/clusterware
– Oracle Clusterware 11g Release 2 Technical Overview
• http://www.oracle.com/goto/asm
• http://www.oracle.com/goto/rac