Oracle Clusterware Node Management and Voting Disks
Node Management in Oracle Clusterware
Markus Michalewicz
Senior Principal Product Manager, Oracle RAC and Oracle RAC One Node
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
Agenda
• Oracle Clusterware 11.2.0.1 Processes
• Node Monitoring Basics
• Node Eviction Basics
• Re-bootless Node Fencing (restart)
• Advanced Node Management
• The Corner Cases
• More Information / Q&A
Oracle Clusterware 11g Rel. 2 Processes
Most are not important for node management – the focus is on these:
• CSSDMONITOR (was: oprocd) – ora.cssdmonitor
• CSSD – ora.cssd
• OHASD
Node Monitoring Basics
Basic Hardware Layout
Oracle Clusterware node management is hardware independent.
[Diagram: cluster nodes running CSSD, connected via the public LAN, the private LAN / interconnect, and the SAN network to the shared Voting Disk]
What does CSSD do?
CSSD monitors and evicts nodes:
• Monitors nodes using 2 communication channels:
– Private Interconnect → Network Heartbeat
– Voting Disk based communication → Disk Heartbeat
• Evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures)
Network Heartbeat
Interconnect basics:
• Each node in the cluster is “pinged” every second
• Nodes must respond within the css_misscount time (defaults to 30 secs.)
– Reducing the css_misscount time is generally not supported
• Network heartbeat failures will lead to node evictions
– CSSD log: [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
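The log line above counts down toward css_misscount. A minimal sketch of that countdown logic (only the 30-second default comes from the slide; the function, thresholds, and message format are illustrative, not Oracle internals):

```python
# Illustrative sketch of a misscount-based network-heartbeat check.
# Only the 30-second css_misscount default comes from the slides;
# everything else here is hypothetical.

CSS_MISSCOUNT = 30.0  # seconds a node may miss network heartbeats

def heartbeat_status(seconds_since_last_beat: float) -> str:
    """Classify a peer node by the age of its last network heartbeat."""
    pct = seconds_since_last_beat / CSS_MISSCOUNT * 100
    if pct >= 100:
        return "evict"  # css_misscount exceeded: node will be removed
    if pct >= 75:
        remaining = CSS_MISSCOUNT - seconds_since_last_beat
        # mirrors the CSSD log message shown above
        return f"75% heartbeat fatal, removal in {remaining:.3f} seconds"
    return "alive"

print(heartbeat_status(23.23))  # 75% heartbeat fatal, removal in 6.770 seconds
```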
Disk Heartbeat
Voting Disk basics – Part 1:
• Each node in the cluster “pings” (reads/writes) the Voting Disk(s) every second
• Nodes must receive a response within the (long / short) diskTimeout time
– I/O errors indicate clear accessibility problems; the timeout is then irrelevant
• Disk heartbeat failures will lead to node evictions
– CSSD log: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
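The disk-heartbeat decision can be sketched the same way. The diskTimeout value below is only a commonly cited default, and the function itself is an assumption for illustration:

```python
# Illustrative sketch of the disk-heartbeat check; not Oracle code.
DISK_TIMEOUT = 200.0  # long diskTimeout; commonly cited default, configurable

def disk_heartbeat_ok(seconds_since_last_write: float, io_error: bool) -> bool:
    """A voting-disk heartbeat fails on an outright I/O error (the
    timeout is then irrelevant) or when the last successful ping is
    older than diskTimeout."""
    if io_error:
        return False  # clear accessibility problem: no need to wait
    return seconds_since_last_write < DISK_TIMEOUT
```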
Voting Disk Structure
Voting Disk basics – Part 2:
• Voting Disks contain dynamic and static data:
– Dynamic data: disk heartbeat logging
– Static data: information about the nodes in the cluster
• With 11.2.0.1, Voting Disks got an “identity”:
– E.g. a Voting Disk serial number: [GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
• Voting Disks must therefore no longer be copied using “dd” or “cp”
[Diagram: Voting Disk layout – node information (static) and disk heartbeat logging (dynamic)]
“Simple Majority Rule”
Voting Disk basics – Part 3:
• Oracle supports redundant Voting Disks for disk failure protection
• The “Simple Majority Rule” applies:
– Each node must “see” a simple majority of the configured Voting Disks at all times in order not to be evicted (i.e. to remain in the cluster):
trunc(n/2 + 1), with n = number of voting disks configured and n >= 1
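The rule above translates directly into code; this small sketch only encodes the trunc(n/2 + 1) formula from the slide (the function names are illustrative):

```python
import math

def required_votes(n_voting_disks: int) -> int:
    """Simple Majority Rule: trunc(n/2 + 1) disks must be accessible."""
    return math.trunc(n_voting_disks / 2 + 1)

def node_stays_in_cluster(n_configured: int, n_accessible: int) -> bool:
    return n_accessible >= required_votes(n_configured)

# 3 configured voting disks: a node must see at least 2 of them.
print(required_votes(3))            # 2
print(node_stays_in_cluster(3, 1))  # False -> the node is evicted
```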
Insertion 1: The “Simple Majority Rule” in extended Oracle clusters
• The same principles apply
• The Voting Disks are just geographically dispersed
• See http://www.oracle.com/goto/rac
– “Using standard NFS to support a third voting file for extended cluster configurations” (PDF)
Insertion 2: Voting Disks in Oracle ASM
The way of storing Voting Disks doesn’t change their use:
• Oracle ASM auto-creates 1/3/5 Voting Files
– Based on External/Normal/High redundancy and on the Failure Groups in the Disk Group
– Per default, there is one failure group per disk
– ASM will enforce the required number of disks
– New failure group type: Quorum Failgroup
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
Located 3 voting disk(s).
Node Eviction Basics
Why are nodes evicted?
To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent the consequences of a split brain:
– Shared data must not be written by independently operating nodes
– The easiest way to prevent this is to forcibly remove a node from the cluster
How are nodes evicted in general?
“STONITH-like” – node eviction basics – Part 1:
• Once it is determined that a node needs to be evicted:
– A “kill request” is sent to the respective node(s)
– Using all (remaining) communication channels
• The node (CSSD) is requested to “kill itself” – hence “STONITH-like”:
– Classic “STONITH” foresees that a remote node kills the node to be evicted
How are nodes evicted? EXAMPLE: heartbeat failure
• The network heartbeat between the nodes has failed
– It is determined which nodes can still talk to each other
– A “kill request” is sent to the node(s) to be evicted, using all remaining communication channels, i.e. the Voting Disk(s)
• The node is requested to “kill itself”; the executer is typically CSSD
How can nodes be evicted? Using IPMI – node eviction basics – Part 2:
• Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
– Intelligent Platform Management Interface (IPMI) drivers are required
• IPMI allows a remote shutdown of nodes using additional hardware
– A Baseboard Management Controller (BMC) per cluster node is required
Insertion: node eviction using IPMI – EXAMPLE: heartbeat failure
• The network heartbeat between the nodes has failed
– It is determined which nodes can still talk to each other
– IPMI is used to remotely shut down the node to be evicted
Which node is evicted? Node eviction basics – Part 3:
• The Voting Disks and the heartbeat communication are used to determine the node
• In a 2-node cluster, the node with the lowest node number should survive
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
Re-bootless Node Fencing (restart)
Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node:
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
– Re-boots affect applications that might run on a node but are not protected by Oracle Clusterware
– Customer requirement: prevent a reboot, just stop the cluster stack – implemented…
Re-bootless Node Fencing (restart) – how it works:
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
– Instead of fast re-booting the node, a graceful shutdown of the stack is attempted
• It starts with a failure – e.g. a network heartbeat or interconnect failure
• Then the I/O-issuing processes are killed; it is made sure that no I/O process remains
– For a RAC DB, mainly the log writer and the database writer are of concern
• Once all I/O-issuing processes are killed, the remaining processes are stopped
– IF the check for a successful kill of the I/O processes fails → reboot
• Once all remaining processes are stopped, the stack stops itself with a “restart flag”
• OHASD will finally attempt to restart the stack after the graceful shutdown
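The sequence above can be condensed into a sketch. The function and its injected callables are purely illustrative; the real logic is internal to Oracle Clusterware:

```python
# Hypothetical sketch of re-bootless node fencing: graceful stack
# shutdown first, classic reboot only as a fallback.

def fence_node(kill_io_processes, stop_remaining_stack, reboot, restart_stack):
    """kill_io_processes must return True only if it is certain that
    no I/O-issuing process (e.g. log writer, database writer) remains."""
    if not kill_io_processes():
        reboot()                  # fallback: classic fencing via reboot
        return "rebooted"
    stop_remaining_stack()        # stop whatever is left of the stack
    restart_stack()               # OHASD restarts the stack ("restart flag")
    return "restarted in place"
```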
Re-bootless Node Fencing (restart) – EXCEPTIONS:
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless…:
– IF the check for a successful kill of the I/O processes fails → reboot
– IF CSSD gets killed during the operation → reboot
– IF cssdmonitor (the oprocd replacement) is not scheduled → reboot
– IF the stack cannot be shut down within “short_disk_timeout” seconds → reboot
Advanced Node Management
Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4:
• Each node in the cluster is “pinged” every second (network heartbeat)
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
[Diagram: a 3-node cluster splits into a 1-node and a 2-node sub-cluster; the 2-node sub-cluster survives]
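The survivor decision can be sketched as follows; the tie-break on the lowest node number comes from the 2-node rule in Part 3, while the function itself is illustrative:

```python
# Illustrative sketch: after a split, pick the sub-cluster that survives.
# Biggest sub-cluster wins; ties go to the lowest node number.

def surviving_subcluster(subclusters):
    """subclusters: lists of node numbers that can still talk to each other."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))

print(surviving_subcluster([[1], [2, 3]]))  # [2, 3] - biggest sub-cluster
print(surviving_subcluster([[1], [2]]))     # [1]    - lowest node number
```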
Redundant Voting Disks – why odd?
Voting Disk basics – Part 5:
• Redundant Voting Disks provide Oracle-managed redundancy
• Assume for a moment that only 2 voting disks were supported…
• Advanced scenarios need to be considered:
– Without the “Simple Majority Rule”, what would we do?
– Even with the “Simple Majority Rule” in place, each node might be able to see only one voting disk, which would lead to an eviction of all nodes
– With 2 disks, the majority is trunc(2/2 + 1) = 2, so losing access to a single disk already costs a node its majority; an even number of disks tolerates no more failures than the next-lower odd number
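The formula from Part 3 makes the point concrete: an even number of voting disks never tolerates more disk failures than the next-lower odd number, which is consistent with Oracle ASM creating 1/3/5 voting files. A quick check:

```python
import math

def majority(n: int) -> int:
    # Simple Majority Rule from Part 3: trunc(n/2 + 1)
    return math.trunc(n / 2 + 1)

# With 2 voting disks a node must see both; losing one disk already
# costs the majority. With 3 disks, one disk may fail.
print(majority(2), majority(3))  # 2 2

# An even count n tolerates exactly as many disk failures as n - 1:
for n in (3, 5, 7):
    assert n - majority(n) == (n + 1) - majority(n + 1)
```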
The Corner Cases
Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…
• Assume a properly configured cluster with 3 voting disks
• What happens if a storage network failure leads to lost remote storage access?
• There will be no node eviction!
• IF storage mirroring is used (for the data files), the respective mirroring solution must handle this case
• Covered in Oracle ASM 11.2.0.2:
– _asm_storagemaysplit = TRUE
– Backported to 11.1.0.7
Case 2: CSSD is stuck
CSSD cannot execute the kill request:
• A node is requested to “kill itself”
• BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
– CSSD failed for some reason
– CSSD is not scheduled within a certain margin
→ CSSDMONITOR (was: oprocd) will take over and execute the request
Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests:
• Cluster members (e.g. Oracle RAC instances) can request Oracle Clusterware to kill a specific member of the cluster
• Oracle Clusterware will then attempt to kill the requested member
• If the requested member kill is unsuccessful, a node eviction escalation can be issued, which leads to the eviction of the node on which the particular member currently resides
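The escalation path can be sketched with injected callables (all names here are illustrative, not Oracle APIs):

```python
# Hypothetical sketch of member-kill escalation: try to kill the
# requested member first, evict its whole node only if that fails.

def member_kill(kill_member, evict_node, node_of_member):
    if kill_member():             # e.g. terminate a RAC instance
        return "member killed"
    evict_node(node_of_member)    # escalation: fence the hosting node
    return "node evicted"
```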
More Information
• My Oracle Support Notes:
– ID 294430.1 – CSS Timeout Computation in Oracle Clusterware
– ID 395878.1 – Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot
• http://www.oracle.com/goto/clusterware
– Oracle Clusterware 11g Release 2 Technical Overview
• http://www.oracle.com/goto/asm
• http://www.oracle.com/goto/rac