3. RAC Instance Recovery


RAC configuration with points of failure:

The figure below illustrates the various areas of the system (operating system, hardware, and database) that could fail. The failure scenarios in a two-node configuration, as illustrated in that figure, are:

1. Interconnect failure
2. Node failure

3. Instance failure

4. Media failure

5. GSD/GCS failure

Instance recovery is complete when Oracle has performed the following steps:

1. Replaying the online redo log files of the failed instance, called cache recovery.

2. Rolling back all uncommitted transactions of the failed instance, called transaction recovery.
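
The progress of the rollback portion (transaction recovery) can be observed from the data dictionary. The query below is a hedged sketch; it assumes the standard V$FAST_START_TRANSACTIONS view and its usual columns, so verify the column names against your release:

-- Progress of transaction recovery (rollback of uncommitted transactions);
-- column names assumed from the standard view.
SELECT usn, state, undoblocksdone, undoblockstotal
FROM   v$fast_start_transactions;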

How does Oracle know that recovery is required for a given data file? The following markers are used to determine this:

1. Start SCN: When a database checkpoints, an SCN (called the checkpoint SCN) is written to the data file headers.

2. Stop SCN: There is also an SCN value in the control file for every data file, which is called the stop SCN. The stop SCN is set to infinity while the database is open and running.

3. Checkpoint counter: There is another data structure called the checkpoint counter in each data file header and also in the control file for each data file entry. The checkpoint counter increments every time a checkpoint happens on a data file and the start SCN value is updated. When a data file is in hot backup mode, the checkpoint information in the file header is frozen, but the checkpoint counter still gets updated.


If the start SCN of a specific data file does not match the stop SCN value in the control file, then at least a crash recovery is required. This can happen when the database is shut down with the SHUTDOWN ABORT statement or if the instance crashes.

Oracle performs the second check on the data files by checking the checkpoint counters. If the checkpoint counter check fails, then Oracle knows that the data file has been replaced with a backup copy (while the instance was down) and therefore media recovery is required.
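
The markers described above can be inspected from SQL. The query below is a minimal sketch; it assumes the standard V$DATAFILE view (control file information, where LAST_CHANGE# is the stop SCN and is NULL, i.e. "infinity", while the file is open) and the V$DATAFILE_HEADER view (file header information):

-- Compare the file header checkpoint SCN and checkpoint counter
-- with the control file checkpoint SCN and stop SCN.
SELECT f.file#,
       h.checkpoint_change#  AS hdr_start_scn,
       f.checkpoint_change#  AS ctl_ckpt_scn,
       f.last_change#        AS ctl_stop_scn,   -- NULL while the file is open
       h.checkpoint_count    AS hdr_ckpt_count
FROM   v$datafile f, v$datafile_header h
WHERE  f.file# = h.file#
ORDER  BY f.file#;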

Where an Oracle-provided CM is used, the heartbeat mechanism is wrapped with another Oracle-provided process, called the watchdog process, to provide this functionality.

The watchdog process is only present in environments where Oracle has provided the CM layer of the product, for example, Linux.

The heartbeat interval parameter is normally specified in seconds. Setting a very small value could cause performance problems: though the overhead of each run of this process is insignificant, on very busy systems frequent runs could turn out to be expensive. Setting this parameter to an ideal value is important, and is achieved by constant monitoring of the activities on the system and of the amount of overhead this particular process is causing.

The heartbeat timeout interval, like the heartbeat interval parameter, should not be set too low. Unlike the heartbeat interval parameter, the concern here is not performance; rather, too low a timeout has the potential to cause false failure detections, because the cluster might wrongly conclude that a node is failing due to transient delays. Such false detections can occur on busy systems where the node is processing highly CPU-intensive tasks. While the system should reserve a percentage of its resources for these kinds of activities, occasionally, when a system is running at very high CPU utilization, the response to the heartbeat function can be delayed; if the heartbeat timeout is set very low, this could cause the CM to assume that the node is not available when it is actually up and running.

When the database is shut down gracefully, with the SHUTDOWN NORMAL or SHUTDOWN IMMEDIATE command, Oracle performs a checkpoint and copies the start SCN value of each data file to its corresponding stop SCN value in the control file before the actual shutdown of the database.

When the database is started, Oracle performs two checks (among other consistency checks):

1. To see if the start SCN value in every data file header matches with its corresponding stop SCN value in the control file.

2. To see if the checkpoint counter values match.

If both these checks are successful, then Oracle determines that no recovery is required for that data file. These two checks are done for all data files that are online.

1. Instance fails:

This is the first stage in the process, when an instance fails and recovery becomes a necessity.

2. Failure detected:

The detection of a node failure or an instance failure is done by the CM of the clustered operating system.

The CM is able to accomplish this with the help of certain parameters, such as:

1. the heartbeat interval parameter and
2. the heartbeat timeout parameter.

The heartbeat interval parameter invokes a watchdog process that wakes up at a stipulated time interval and checks the existence of the other members in the cluster.

When an instance fails, the watchdog process (the heartbeat validation) does not get a response from that instance within the time stipulated in the heartbeat timeout parameter, and the CM declares that the instance is down. From the first time that the CM fails to get a response from the heartbeat check to the time that the CM declares that the node has failed, repeated checks are done to ensure that the initial message was not a false alarm.
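
As a hedged illustration (this is a DBA-level check, not part of the CM mechanism itself), a surviving node can confirm which instances the cluster still considers available by querying GV$INSTANCE:

-- Instances currently known to the cluster; a failed or evicted
-- instance no longer appears, or no longer shows an OPEN status.
SELECT inst_id, instance_name, host_name, status
FROM   gv$instance
ORDER  BY inst_id;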


3. Cluster reconfiguration: When a failure is detected, cluster reorganization occurs.

During this process, Oracle alters the node's cluster membership status, taking into account the fact that a node has left the cluster.

The GCS and GES provide the CM interfaces to the software and expose the cluster membership map to the Oracle instances when nodes are added to or deleted from the cluster.

The LMON process performs this exposure of the information to the remaining Oracle instances.

LMON performs this task by continually sending messages from the node it runs on and often writing to the shared disk.

When such write activity does not happen for a prolonged period of time, it provides evidence to the surviving nodes that the node is no longer a member of the cluster.

Such a failure causes a change in a node's membership status within the cluster and LMON initiates the recovery actions, which include remastering of GCS and GES resources and instance recovery.

The cluster reconfiguration process, along with other activities performed by Oracle processes, is recorded in the respective background process trace files and in the instance-specific alert log files.

Thread recovery:

A thread is a stream of redo, for example, all redo log files for a given instance. In a single stand-alone configuration there is usually only one thread, although it is possible to specify more, under certain circumstances.

An instance has one thread associated with it, and recovery in this situation would be like that of any stand-alone configuration. What is the difference in a RAC environment? In a RAC environment, multiple threads are usually seen; there is generally one thread per instance, and the thread applicable to a specific instance is defined in the server parameter file (spfile) or init<SID>.ora file.

In a crash recovery, redo is applied one thread at a time, because only one instance at a time can dirty a block in cache; in between block modifications the block is written to disk. Therefore, a block in a current online data file can need redo from at most one thread. This assumption cannot be made in media recovery, as more than one instance may have made changes to a block, so changes must be applied to blocks in ascending SCN order, switching between threads where necessary.

In a RAC environment, where instances can be added to or taken off the cluster dynamically, a thread enable record is written and a new thread of redo is created when an instance is added to the cluster. Similarly, a thread is disabled when an instance is taken offline through a shutdown operation. The shutdown operation places an end of thread (EOT) flag in the log header.
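
As a sketch of how threads are tied to instances, the entries below use the instance names RAC1 and RAC2 from the example configuration and an assumed raw-device path; the THREAD initialization parameter and the ADD LOGFILE THREAD / ENABLE / DISABLE THREAD statements are standard, but names and storage locations should be adapted to the environment:

# init<SID>.ora / spfile entries tying each instance to a redo thread
RAC1.instance_number = 1
RAC1.thread          = 1
RAC2.instance_number = 2
RAC2.thread          = 2

-- Creating and enabling a new redo thread for an added instance
-- (the file specification is a placeholder):
ALTER DATABASE ADD LOGFILE THREAD 3
  GROUP 5 ('/dev/vx/rdsk/oraracdg/redo_t3_g5') SIZE 100M;
ALTER DATABASE ENABLE PUBLIC THREAD 3;

-- Disabling the thread when the instance is permanently removed:
ALTER DATABASE DISABLE THREAD 3;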

The figure below illustrates the thread recovery scenario. In this scenario there are three instances, RAC1, RAC2, and RAC3, that form the RAC configuration. Each instance has a set of redo log files and is assigned thread 1, thread 2, and thread 3, respectively.


As discussed above, if multiple instances fail, or during a crash recovery, the redo of all threads has to be synchronized by SCN during the recovery operation. For example, in the figure above, SCN #1 is applied to the database from thread 2, which belongs to instance RAC2, followed by SCN #2 from thread 3, which belongs to instance RAC3, and SCN #3, also from thread 3, before applying SCN #4 from thread 1, which is assigned to instance RAC1.
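
The per-thread SCN ranges that have to be merged in this way can be seen from the log history. A minimal sketch, assuming the standard V$LOG_HISTORY columns:

-- SCN range covered by each log, per thread; recovery applies redo
-- across threads in ascending SCN order.
SELECT thread#, sequence#, first_change#, next_change#
FROM   v$log_history
ORDER  BY first_change#;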


kjxgfipccb is the callback invoked on completion of a CGS message send. If delivery of a message fails, a log entry is generated. Associated with the log entry are the message buffer pointer, the recovery state object, the message type, and other details.

Enqueue reconfiguration: Enqueue resources are reconfigured among the available instances.

Contents of LMON trace file:

Analyzing the LMON trace:    

In general, the LMON trace file shown below contains the recovery and reconfiguration information on locks, resources, and states of its instance group.

The six substates in the Cluster Group Service (CGS) that are listed in the trace file are:

1. State 0: Waiting for the instance reconfiguration.
2. State 1: Received the instance reconfiguration event.
3. State 2: Agreed on the instance membership.
4. States 3, 4, 5: CGS name service recovery.
5. State 6: GES/GCS (lock/resource) recovery.

Each state is identified as a pair consisting of the incarnation number and the current substate. "Setting state to 7 6" means that the instance is currently at incarnation 7 and substate 6.

*** 2002-11-16 23:48:02.753
kjxgmpoll reconfig bitmap: 1
*** 2002-11-16 23:48:02.753
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 6 0.
*** 2002-11-16 23:48:02.880
Name Service frozen
kjxgmcs: Setting state to 6 1.
kjxgfipccb: msg 0x1038c6a88, mbo 0x1038c6a80, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (3744,204)
kjxgfipccb: msg 0x1038c6938, mbo 0x1038c6930, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (3416,204)
kjxgfipccb: msg 0x1038c67e8, mbo 0x1038c67e0, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (3088,204)
kjxgfipccb: msg 0x1038c6bd8, mbo 0x1038c6bd0, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (2760,204)
kjxgfipccb: msg 0x1038c7118, mbo 0x1038c7110, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (2432,204)
kjxgfipccb: msg 0x1038c6fc8, mbo 0x1038c6fc0, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (2104,204)
kjxgfipccb: msg 0x1038c7268, mbo 0x1038c7260, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (1776,204)
kjxgfipccb: msg 0x1038c6e78, mbo 0x1038c6e70, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (1448,204)
kjxgfipccb: msg 0x1038c6d28, mbo 0x1038c6d20, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (1120,204)
kjxgfipccb: msg 0x1038c7a48, mbo 0x1038c7a40, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (792,204)
kjxgfipccb: msg 0x1038c7508, mbo 0x1038c7500, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (464,204)
kjxgfipccb: msg 0x1038c73b8, mbo 0x1038c73b0, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 0, type 22, tkt (136,204)
*** 2002-11-16 23:48:03.104
Obtained RR update lock for sequence 6, RR seq 6


"Synchronization timeout interval" is the timeout value for GES to signal an abort on its recovery process. Since the recovery process is distributed, at each step each instance waits for the others to complete the corresponding step before moving to the next one. This value has a minimum of 10 minutes and is computed according to the number of resources.

Resource reconfiguration:

This phase of the recovery is important in a RAC environment where the GCS commences recovery and remastering of the block resources, which involves rebuilding lost resource masters on surviving instances.

Remastering of resources is an extensive subject in itself because of the various scenarios under which remastering takes place.

*** 2002-11-16 23:48:04.611
Voting results, upd 1, seq 7, bitmap: 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 7 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 7 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 7 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 7 5.
Name Service normal
Name Service recovery done
*** 2002-11-16 23:48:04.612
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 7 6.
kjfmact: call ksimdic on instance (0)
*** 2002-11-16 23:48:04.613
*** 2002-11-16 23:48:04.614
Reconfiguration started

Synchronization timeout interval: 660 sec


GRD freeze: The first step in the cluster reconfiguration process, before beginning the actual recovery, is for the CM to ensure that the GRD is not disturbed; activity on the GRD is therefore frozen so that no further writes or updates happen to the GRD on the node that is currently performing the recovery. This step is also recorded in the alert logs. Since the GRD is maintained by the GCS and GES processes, all GCS and GES resources and also the write requests are frozen. During this temporary freeze, Oracle takes control of the situation and balances the resources among the available instances.

Sat Nov 16 23:48:04 2002
Reconfiguration started
List of nodes: 1,
Global Resource Directory frozen
one node partition
Communication channels reestablished

Enqueue thaw: After the reconfiguration of resources among the available instances, Oracle makes the enqueue resources available. At this point the process forks to perform two tasks in parallel, resource reconfiguration and pass 1 recovery.

Resource release: Once the remastering of resources is completed, the next step is to complete processing of pending activities. Once this is completed, all resources that were locked during the recovery process are released or the locks are downgraded (converted to a lower level).

List of nodes: 1,
Global Resource Directory frozen
node 1
* kjshashcfg: I'm the only node in the cluster (node 1)
Active Sendback Threshold = 50%
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 2413
35334 GCS shadows traversed, 0 cancelled, 1151 closed
17968 GCS resources traversed, 0 cancelled
20107 GCS resources on freelist, 37877 on array, 37877 allocated
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2002-11-16 23:48:05.412
35334 GCS shadows traversed, 0 replayed, 1151 unopened
Submitted all GCS cache requests
0 write requests issued in 34183 GCS resources
29 PIs marked suspect, 0 flush PI msgs
*** 2002-11-16 23:48:06.007
Reconfiguration complete

Post SMON to start 1st pass IR
*** 2002-11-16 23:52:28.376


kjxgmpoll reconfig bitmap: 0 1
*** 2002-11-16 23:52:28.376
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 7 0.
*** 2002-11-16 23:52:28.474
Name Service frozen
kjxgmcs: Setting state to 7 1.
*** 2002-11-16 23:52:28.881
Obtained RR update lock for sequence 7, RR seq 7
*** 2002-11-16 23:52:28.887
Voting results, upd 1, seq 8, bitmap: 0 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 8 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 8 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 8 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 8 5.
Name Service normal
Name Service recovery done
*** 2002-11-16 23:52:28.896
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 8 6.
*** 2002-11-16 23:52:29.116
*** 2002-11-16 23:52:29.116
Reconfiguration started
Synchronization timeout interval: 660 sec
List of nodes: 0,1,

Pass 1 recovery:

This step of the recovery process is performed in parallel with steps 7 and 8.

SMON merges the redo threads, ordered by SCN, to ensure that changes are applied in the correct order.

SMON also finds block written records (BWRs) in the redo stream and removes entries that are no longer needed for recovery, because they relate to past images (PIs) of blocks that have already been written to disk.

A recovery set is produced that contains only blocks modified by the failed instance, with no subsequent BWR to indicate that the blocks were later written.

Each entry in the recovery list is ordered by first-dirty SCN to specify the order to acquire instance recovery locks.

Reading the log files and identifying the blocks that need to be recovered completes the first pass of the recovery process.


Post SMON to start 1st pass IR
Sat Nov 16 23:48:06 2002
Instance recovery: looking for dead threads
Sat Nov 16 23:48:06 2002
Beginning instance recovery of 1 threads
Sat Nov 16 23:48:06 2002
Started first pass scan
Sat Nov 16 23:48:06 2002
Completed first pass scan
5101 redo blocks read, 490 data blocks need recovery
Sat Nov 16 23:48:07 2002
Started recovery at
Thread 1: logseq 29, block 2, scn 0.115795034
Recovery of Online Redo Log: Thread 1 Group 1 Seq 29 Reading mem 0
Mem# 0 errs 0: /dev/vx/rdsk/oraracdg/partition1G_31
Mem# 1 errs 0: /dev/vx/rdsk/oraracdg/partition1G_21
Sat Nov 16 23:48:08 2002
Completed redo application
Sat Nov 16 23:48:08 2002
Ended recovery at
Thread 1: logseq 29, block 5103, scn 0.115820072
420 data blocks read, 500 data blocks written, 5101 redo blocks read
Ending instance recovery of 1 threads

Block resource claimed for recovery:

Once pass 1 of the recovery process completes and the GCS reconfiguration has completed, the recovery process continues by:

1. Obtaining buffer space for the recovery set, possibly by performing write operations to make room.

2. Claiming resources on the blocks identified during pass 1.

3. Obtaining a source buffer, either from an instance's buffer cache or by a disk read.

During this phase, the recovering SMON process will inform each lock element's master node, for each block in the recovery list, that it will be taking ownership of the block and lock for recovery. Blocks become available as they are recovered. The lock recovery is based on the ownership of the lock element. This depends on one of the various scenarios of lock conditions:

Scenario 1: Let us assume that all instances in the cluster are holding a lock status of NL0; SMON acquires the lock element in XL0 mode, reads the block from disk and applies redo changes, and subsequently writes out the recovery buffer when complete.

Scenario 2: In this situation, the SMON process of the recovering instance has a lock mode of NL0 and the second instance has a lock status of XL0; however, the failed instance has a status similar to the recovery node, i.e., NL0. In this case, no recovery is required because the current copy of the buffer already exists on another instance.

Scenario 3: In this situation, let us assume that the recovering instance has a lock status of NL0; the second instance has a lock status of XG0.

However, the failed instance has a status similar to the recovery node. In this case also, no recovery is required because a current copy of the buffer already exists on another instance. SMON will remove the block entry from the recovery set and the recovery buffer is released. The recovery instance has a lock status of NG1; however, the second instance that originally had a XG0 status now holds NL0 status after writing the block to disk.

Scenario 4: Now, what if the recovering instance has a lock status of NL0 and the second instance has a lock status of NG1? However, the failed instance has a status similar to the recovery node. In this case the consistent read image of the latest PI is obtained, based on SCN. The redo changes are applied and the recovery buffer is written when complete. The recovery instance has a lock element of XG0 and the second instance continues to retain the NG1 status on the block.

Scenario 5: The recovering instance has a lock status of SL0 or XL0 and the other instance has no lock being held. In this case, no recovery is needed because a current copy of the buffer already exists on another instance. SMON will remove the block from the recovery set. The lock status will not change.

Scenario 6: The recovery instance holds a lock status of XG0 and the second instance has a lock status of NG1. SMON initiates the write of the current block. No recovery is performed by the recovery instance. The recovery buffer is released and the PI count is decremented when the block write has completed.

Scenario 7: The recovery instance holds a lock status of NG1 and the second instance holds a lock with status of XG0. In this case, SMON initiates a write of the current block on the second instance. No recovery is performed by the recovery instance. The recovery buffer is released and the PI count is decremented when the block write has completed.

Scenario 8: The recovering instance holds a lock status of NG1, and the second instance holds a lock status of NG0. In this case a consistent read copy of the block is obtained from the highest PI based on SCN. Redo changes are applied and the recovery buffer is written when complete.

GRD unfrozen (reconfiguration complete):

After the necessary resources are obtained, and the recovering instance has all the resources it needs to complete pass 2 with no further intervention, the block cache space in the GRD is unfrozen.

At this stage, the recovery process splits into two parallel phases while certain areas of the system are being made partially available; the second phase of the recovery begins.

1. Partial availability: At this stage of the recovery process the system is partially available for use. The blocks not in recovery can be operated on as before. Blocks being recovered are blocked by the resource held in the recovering instance.

2. Pass 2 recovery: The second phase of recovery continues, taking care of all the blocks identified during pass 1, recovering and writing each block, then releasing recovery resources. During the second phase the redo threads of the failed instances are once again merged by SCN and instead of performing a block level recovery in memory, during this phase the redo is applied to the data files.

3. Block availability: Since the second pass of recovery recovers individual blocks, these blocks are made available for user access as they are recovered a block at a time.

4. Recovery enqueue release: When all the blocks have been recovered and written and the recovery resources released, the system is completely available and the recovery enqueue is released.
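
While recovery is in progress, the instance recovery enqueue mentioned above can be observed from the lock views. A hedged sketch, assuming the 'IR' (instance recovery) enqueue type; verify the type name against your release:

-- Sessions holding or requesting the instance recovery enqueue.
SELECT inst_id, sid, type, lmode, request
FROM   gv$lock
WHERE  type = 'IR';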

During a normal instance recovery operation, there is a potential that one or more of the other instances (including the recovering instance) could also encounter a failure. If this happens, Oracle has to handle the situation appropriately, based on the type of failure:

If recovery fails without the death of the recovering instance, instance recovery is restarted.

If during the process of recovery the recovering process dies, one of the surviving instances will acquire the instance recovery enqueue and start the recovery process.

If during the recovery process, another non-recovering instance fails, SMON will abort the recovery, release the instance recovery (IR) enqueue and reattempt instance recovery.

During the recovery process, if I/O errors are encountered, the related files are taken offline and the recovery is restarted.

If one of the blocks that SMON is trying to recover is corrupted during redo application, Oracle performs online block recovery to clean up the block in order for instance recovery to continue.


Steps of media recovery:

Oracle has to perform several steps during a media recovery, from validating the first data file up to the recovery of the last data file. Determining whether archived logs have to be applied is also carried out during this process. All database operations are sequenced using the database SCN. Similarly, during a recovery operation the SCN plays an even more important role, because data has to be recovered in the order in which it was created.

1. The first step during the media recovery process is to determine the lowest data file header checkpoint SCN of all data files being recovered. This information is stored in every data file header record.

The output below is from a data file header dump and indicates the various markers validated during a media recovery process.

DATA FILE #1:
(name #233) /dev/vx/rdsk/oraracdg/partition1G_3
creation size=0 block size=8192 status=0xe head=233 tail=233 dup=1
tablespace 0, index=1 krfil=1 prev_file=0
unrecoverable scn: 0x0000.00000000 01/01/1988 00:00:00
Checkpoint cnt:139 scn: 0x0000.06ffc050 11/20/2002 20:38:14
Stop scn: 0xffff.ffffffff 11/16/2002 19:01:17
Creation Checkpointed at scn: 0x0000.00000006 08/21/2002 17:04:05
thread:0 rba:(0x0.0.0)
enabled threads: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Offline scn: 0x0000.065acf1d prev_range: 0
Online Checkpointed at scn: 0x0000.065acf1e 10/19/2002 09:43:15
thread:1 rba:(0x1.2.0)
enabled threads: 01000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Hot Backup end marker scn: 0x0000.00000000
aux_file is NOT DEFINED
FILE HEADER:
Software vsn=153092096=0x9200000, Compatibility Vsn=134217728=0x8000000
Db ID=3598885999=0xd682a46f, Db Name='PRODDB'
Activation ID=0=0x0
Control Seq=2182=0x886, File size=115200=0x1c200
File Number=1, Blksiz=8192, File Type=3 DATA
Tablespace #0 - SYSTEM  rel_fn:1
Creation at scn: 0x0000.00000006 08/21/2002 17:04:05
Backup taken at scn: 0x0000.00000000 01/01/1988 00:00:00 thread:0
reset logs count:0x1c5a1a33 scn: 0x0000.065acf1e recovered at 11/16/2002 19:02:50
status:0x4 root dba:0x004000b3 chkpt cnt: 139 ctl cnt:138
begin-hot-backup file size: 0
Checkpointed at scn: 0x0000.06ffc050 11/20/2002 20:38:14
thread:2 rba:(0x15.31aa6.10)
enabled threads: 01100000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Backup Checkpointed at scn: 0x0000.00000000
thread:0 rba:(0x0.0.0)
enabled threads: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
External cache id: 0x0 0x0 0x0 0x0
Absolute fuzzy scn: 0x0000.00000000
Recovery fuzzy scn: 0x0000.00000000 01/01/1988 00:00:00
Terminal Recovery Stamp scn: 0x0000.00000000 01/01/1988 00:00:00
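
A header dump like the one above can typically be produced with the file_hdrs dump event. This event is undocumented and version dependent, so treat the exact syntax as an assumption to verify; the output is written to the session's trace file:

-- Dump all data file headers to the current session's trace file.
ALTER SESSION SET EVENTS 'immediate trace name file_hdrs level 10';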

If a data file's checkpoint is in its offline range, then the offline-end checkpoint is used instead of the data file header checkpoint as its media-recovery-start SCN.

Like the start SCN, Oracle uses the stop SCN on all data files to determine the highest SCN to allow recovery to terminate. This prevents a needless search beyond the SCN that actually needs to be applied.
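
The files that actually need media recovery, together with the SCN from which recovery must start for each of them, can be listed with a simple query. A minimal sketch, assuming the standard V$RECOVER_FILE view:

-- Data files requiring media recovery, the reason, and the SCN/time
-- at which recovery must begin.
SELECT file#, error, change#, time
FROM   v$recover_file;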


During the media recovery process, Oracle automatically opens any enabled thread of redo and if the required redo records are not found in the current set of redo log files, the database administrator is prompted for the archived redo log files.

2. Oracle places an exclusive MR (media recovery) lock on the files undergoing recovery. This prevents two or more processes from starting a media recovery operation simultaneously. The lock is acquired by the session that started the operation and is placed in a shared mode so that no other session can acquire the lock in exclusive mode.

3. The MR fuzzy bit is set to prevent the files from being opened in an inconsistent state.

4. The redo records from the various redo threads are merged so that they are applied in the right order, using ascending SCNs.

5. During the media recovery operation, checkpointing occurs as normal, updating the checkpoint SCN in the data file headers. This helps if there is a failure during the recovery process because it can be restarted from this SCN.

6. This process continues until a stop SCN is encountered for a file, which means that the file was taken offline, or made read-only at this SCN and has no redo beyond this point. With the database open, taking a data file offline produces a finite stop SCN for that data file; if this is not done, there is no way for Oracle to determine when to stop the recovery process for a data file.

7. Similarly, the recovery process continues until the current logs in all threads have been applied. The end of thread (EOT) flag that is part of the redo log header file of the last log guarantees that this has been accomplished.
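
As a hedged illustration of the steps above, a DBA-driven media recovery from SQL*Plus might look like the following; the data file name is taken from the earlier header dump, and the prompting for archived redo log files matches the behaviour described above:

-- Mount the database and recover; Oracle merges redo from all enabled
-- threads and prompts for archived logs as required.
STARTUP MOUNT
SET AUTORECOVERY ON      -- apply the suggested archived logs automatically
RECOVER DATABASE;

-- Or recover a single restored data file:
RECOVER DATAFILE '/dev/vx/rdsk/oraracdg/partition1G_3';

ALTER DATABASE OPEN;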

The following output from a redo log header provides an indication of the EOT marker found in the redo log file:

LOG FILE #6:
(name #242) /dev/vx/rdsk/oracledg/partition1G_100
(name #243) /dev/vx/rdsk/oracledg/partition1G_400
Thread 2 redo log links: forward: 0 backward: 5
siz: 0x190000 seq: 0x00000015 hws: 0x4 bsz: 512 nab:0xffffffff flg: 0x8 dup: 2
Archive links: fwrd: 3 back: 0 Prev scn: 0x0000.06e63c4e
Low scn: 0x0000.06e6e496 11/16/2002 20:32:19
Next scn: 0xffff.ffffffff 01/01/1988 00:00:00
FILE HEADER:
Software vsn=153092096=0x9200000, Compatibility Vsn=153092096=0x9200000
Db ID=3598885999=0xd682a46f, Db Name='PRODDB'
Activation ID=3604082283=0xd6d1ee6b
Control Seq=2181=0x885, File size=1638400=0x190000
File Number=6, Blksiz=512, File Type=2 LOG
descrip:"Thread 0002, Seq# 0000000021, SCN 0x000006e6e496-0xffffffffffff"
thread: 2 nab: 0xffffffff seq: 0x00000015 hws: 0x4 eot: 2 dis: 0
reset logs count: 0x1c5a1a33 scn: 0x0000.065acf1e
Low scn: 0x0000.06e6e496 11/16/2002 20:32:19
Next scn: 0xffff.ffffffff 01/01/1988 00:00:00
Enabled scn: 0x0000.065acfa2 10/19/2002 09:46:06
Thread closed scn: 0x0000.06ffc03e 11/20/2002 20:37:25
Log format vsn: 0x8000000 Disk cksum: 0xb28f Calc cksum: 0xb28f
Terminal Recovery Stamp scn: 0x0000.00000000 01/01/1988 00:00:00
Most recent redo scn: 0x0000.00000000
Largest LWN: 0 blocks
Miscellaneous flags: 0x0
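
The redo log header dump above can typically be generated with the redohdr dump event (again undocumented and version dependent, so treat the syntax as an assumption), and the current state of each thread's online logs can be cross-checked against V$LOG:

-- Dump the online redo log headers to the session's trace file.
ALTER SESSION SET EVENTS 'immediate trace name redohdr level 10';

-- Current online logs per thread, with their starting SCNs.
SELECT thread#, group#, sequence#, status, first_change#
FROM   v$log
ORDER  BY thread#, sequence#;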

GCS and GES failure:

The GCS and GES services, which comprise the LMS and LMD processes and maintain the GRD, provide the communication of requests over the cluster interconnect. These processes are also prone to failures. This could potentially happen when one or more of the processes participating in this configuration fails, or fails to respond within a predefined amount of time. Failures such as these could be the result of the failure of any of the related processes, a memory fault, or some other cause. The LMON on one of the surviving nodes should detect the problem and start the reconfiguration process. While this is occurring, no lock activity can take place, and some users will be forced to wait to obtain the required PCM locks or other resources.

The recovery that occurs as a result of the GCS or GES process dying is termed online block recovery. This is another kind of recovery that is unique to the RAC implementation. Online block recovery occurs when a data buffer becomes corrupt in an instance's cache. Block recovery could also occur if either a foreground process dies while applying changes or if an error is generated during redo application. If the block recovery is to be performed as a result of the foreground process dying, then PMON initiates online block recovery. However, if this is not the case, then the foreground process attempts to make an online recovery of the block.

Under normal circumstances, this involves finding the block's predecessor and applying redo records to this predecessor from the online logs of the local instance. However, under the cache fusion architecture, copies of blocks are available in the cache of other instances and therefore the predecessor is the most recent PI for the buffer that exists in the cache of another instance. If, under certain circumstances, there is no PI for the corrupted buffer, the block image from the disk data is used as the predecessor image before changes from the online redo logs are used.

Instance hang or false failure:

Under very unusual circumstances, probably due to an exception at the Oracle kernel level, an instance could encounter a hang condition, in which case the instance is up and running but no activity against it is possible. Users or processes that access this instance encounter a hung connection and receive no response. In such a situation, the instance is neither down nor available for access. The other, surviving instance may not receive a response from the hung instance; however, it cannot declare that the instance is unavailable, because the required activity of the LMON process on the hung instance, such as writing to the shared disk, still completes. Since the surviving instance does not receive any failure signal, it attempts to shut down the non-responding instance and is unable to, for the reasons stated above. In these situations the only option is to force a hard failure of either the entire node holding the hung instance or the instance itself. In either case human intervention is required.

In the case of forcing a hard failure of the node holding the hung instance, the systems administrator will have to perform a bounce of the node; when the node is back up and alive the instance can be started.

In the case where an instance shutdown is preferred, no graceful shutdown is possible; instead, an operating-system-level intervention, such as killing one of the critical background processes such as SMON, will cause an instance crash.

Recovery in both these scenarios is an instance recovery. All steps discussed in the section on instance failures apply to this type of failure.