Checkpointing in Oracle

Ashok Joshi, William Bridge, Juan Loaiza, Tirthankar Lahiri
500 Oracle Parkway, Redwood Shores, CA 94065
{ajoshi, wbridge, jloaiza, tlahiri}@us.oracle.com

Abstract

Checkpointing is an important mechanism for limiting crash recovery times. This paper describes a new checkpointing algorithm that was implemented in Oracle 8.0. This algorithm efficiently finds buffers which need to be written for checkpointing and easily scales to very large buffer cache sizes: it has been tested with buffer caches as large as six million buffers. Based on this algorithm, we have implemented a new checkpointing mechanism which we refer to as the incremental checkpointing mechanism. Incremental checkpoints are continuous, low-overhead checkpoints that write buffers as a background activity. Incremental checkpointing is able to continuously advance the database checkpoint, i.e., the starting position in the redo log for crash recovery, resulting in dramatic improvements in recovery time while imposing minimal overhead during normal processing. The rate of buffer writes for incremental checkpointing can be controlled by the user to balance checkpoint writing overhead with recovery time requirements. In this paper, we describe the new data structures and algorithms that have been implemented for checkpointing and for incremental checkpointing in Oracle 8.0.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 24th VLDB Conference, New York, USA, 1998

1 Introduction

Most conventional database systems (including Oracle) follow the no-force-at-commit policy for data blocks [Haerder83] because of its significant performance benefits. The use of this policy implies that a page modified in memory may need recovery if there is a system crash. A database checkpoint is critical for ensuring quick crash recovery when the no-force-at-commit policy is employed, since it limits the amount of redo log that needs to be scanned and applied during recovery.

As the amount of memory available to a database increases, it is possible to have database buffer caches as large as several million buffers. A large buffer cache imposes two requirements on checkpointing. First, it requires that the algorithms be scalable with the size of the buffer cache. Second, it requires that the database checkpoint advance frequently to limit recovery time, since infrequent checkpoints and large buffer caches can exacerbate crash-recovery times significantly.

In order to address these issues, the checkpointing algorithm has been completely rewritten in Oracle 8.0. Our objective was to make the algorithm scalable to very large buffer caches, and to facilitate frequent checkpointing for fast crash recovery. Scalability is achieved by organizing all the modified buffers on ordered queues; such queues increase the efficiency of identifying the precise set of buffers that need to be written for checkpoints. Frequent advancement of the database checkpoint is made possible by the introduction of incremental checkpointing.

The rest of this paper is organized as follows. We begin with a brief description of the Oracle-specific requirements and terminology relating to checkpointing and the Oracle processes involved in a checkpoint. The next section contains a description of the checkpoint data structures used by the Oracle 8.0 checkpointing algorithm, followed by a description of the incremental checkpoint mechanism. We conclude with some observations on the benefits of the new algorithm.


Figure 1: Selecting buffers for advancing checkpoint up to a specified RBA. (Diagram labels: the instance's redo log, the RBA to which the checkpoint is requested, and the tail of the log. Arrows indicate the RBA at which each buffer was first modified; subsequent RBAs corresponding to multiple updates are not shown.)
Notes: Buffers b1, b3 and b4 need to be written in order to complete the checkpoint. Buffer b6 need not be written. The remaining buffers are clean.
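The selection rule that Figure 1 illustrates can be sketched in a few lines of Python. This is a minimal illustration, not Oracle's implementation: the `Buffer` class, the `buffers_to_write` function, the RBA values, and the names given to the clean buffers are all assumptions made for the example. Here `low_rba` stands for the RBA at which a buffer was first modified in memory, and `None` marks a clean buffer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Buffer:
    name: str
    low_rba: Optional[int]  # RBA of the first in-memory modification; None if clean


def buffers_to_write(buffers, checkpoint_rba):
    """Return names of dirty buffers first modified at or before checkpoint_rba."""
    dirty = [b for b in buffers if b.low_rba is not None]
    needed = [b for b in dirty if b.low_rba <= checkpoint_rba]
    return [b.name for b in sorted(needed, key=lambda b: b.low_rba)]


# Hypothetical RBA values chosen to mirror Figure 1.
cache = [
    Buffer("b1", 100),
    Buffer("b3", 250),
    Buffer("b4", 180),
    Buffer("b6", 900),        # dirty, but first modified after the checkpoint RBA
    Buffer("clean_x", None),  # placeholder names for clean buffers
    Buffer("clean_y", None),
]

print(buffers_to_write(cache, checkpoint_rba=500))
```

With these values the call selects b1, b4 and b3, in ascending order of first modification; b6 is dirty but was first modified after the checkpoint RBA, so it can be skipped. Note that this sketch scans every buffer in the cache, which is the cost that the ordered queues of Section 3 are designed to avoid.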

2 Oracle Checkpointing Overview

Oracle supports a shared-disk architecture; the shared memory and group of Oracle processes that run on each node in a multi-node shared-disk cluster are collectively known as an instance of Oracle. We briefly describe a conventional (as opposed to an incremental) checkpoint in Oracle. For the purpose of this discussion, the log may be thought of as an ever-growing file containing redo records generated by an instance. An RBA (redo byte address) indicates a position in the redo log. Oracle uses a set of dedicated processes (called the database writers or DBWRs) for writing data blocks to disk. A dedicated process (called the checkpoint process or CKPT) records checkpoint information to the control file, which represents stable storage for maintaining bookkeeping information (such as checkpoint progress) for an Oracle database.

Each instance in Oracle has its own log. An instance checkpoint refers to some RBA (called the checkpoint RBA) within an instance's log such that all in-memory buffers whose redo appears prior to this RBA have been written to disk. Hence, recovery for that instance needs to recover only those data blocks whose redo records occur between the checkpoint RBA and the end of the log.

A checkpoint operation consists of three distinct phases. In the first phase, the process initiating the checkpoint captures the checkpoint RBA. This RBA is most often the current RBA (the RBA of the last change made to a buffer) at the time the request is initiated. In the second phase, the DBWR process writes out all required buffers, i.e., all buffers that have been modified at RBAs less than or equal to the checkpoint RBA. After all required buffers have been written, in the third phase, the CKPT process records the completion of the checkpoint in the control file. Only when the third phase has completed can we assert that the instance checkpoint has advanced to the new RBA. Normal transaction activity is not affected while a checkpoint request is active. Figure 1 provides an example of how buffers are selected to be written for checkpoint operations. [Oracle97] contains further details on the architecture of the buffer manager and the recovery subsystem in the Oracle8 Universal Server.

3 Oracle Checkpointing Data Structures and Algorithm

3.1 Buffer Checkpoint Queues (BCQ)

The most significant enhancement to the checkpoint algorithm is the introduction of new data structures that increase the efficiency of finding the buffers that need to be written for a checkpoint. The primary data structure is the Buffer Checkpoint Queue, which contains modified buffers linked in ascending order of their low RBA. The low RBA is the RBA corresponding to the first (in-memory) modification of the buffer. Each buffer header contains the value of the low RBA associated with the buffer; this value is set when the buffer is first modified. A clean buffer does not have any low RBA in its buffer header and is not linked on the checkpoint queue.

In response to a checkpoint request, buffers are written from the head of the checkpoint queue, i.e., in ascending order of low RBA values. After a buffer is written, it is unlinked from its checkpoint queue. Given a checkpoint RBA, DBWR writes buffers from the head of the queue until the low RBA of the buffer at the head of the checkpoint queue is greater than the checkpoint RBA. At this point, CKPT can record this checkpoint as having completed by updating the checkpoint progress record in the control file (phase 3).

The operations of linking and unlinking a buffer from the checkpoint queue need to be performed in a critical section under a latch (a lightweight mutual exclusion primitive). In order to reduce hotspots on this critical section, it is possible to configure an Oracle instance with multiple latches, each protecting a different checkpoint queue. Each buffer is statically associated with a checkpoint queue. Having multiple checkpoint queues also improves scalability and write throughput, since it is then possible to also configure the instance with multiple DBWR processes which are responsible for writing buffers from different checkpoint queues.

3.2 Active Checkpoint Queue (ACQ)

The ACQ is a single queue that contains active checkpoint requests. Whenever a checkpoint is requested, a new active checkpoint entry describing the request is added to the ACQ. Each such entry on the ACQ contains the RBA up to which buffers need to be written in order to complete the checkpoint represented by the entry. The entry also contains other information specific to the checkpoint request. Note that it is possible to have several checkpoint requests active at the same time. Figure 2 illustrates two BCQs and three active checkpoints.

Figure 2: Illustration of the organization of two BCQs and the ACQ with 3 entries. (Diagram labels: BCQ 1, BCQ 2, the redo log and its tail, and active checkpoints c1, c2, and c3 on the ACQ. A dashed arrow indicates a buffer's low RBA.)
Notes: DBWR always selects buffers in ascending low RBA order. After b4 is written, c1 is completed. After b1 and b5 are written, c2 is completed. After b2 is written, c3 is completed.

A checkpoint may be requested for various reasons. A user can initiate a checkpoint at any time, for instance, by issuing an ALTER SYSTEM CHECKPOINT command. Whenever an instance switches from one log file to another, it starts a checkpoint so that it will be possible to reuse the log file later. In addition, datafiles must be checkpointed before they can be backed up; therefore, backup operations initiate checkpoints. For all these reasons it is possible to have multiple entries on the ACQ.

4 Incremental Checkpoints

The incremental checkpoint technique uses the same data structures that are used by conventional checkpoints. It exploits the fact that the dirty buffers in the cache are linked in low RBA order. If DBWR continually writes buffers from the head of the checkpoint queue, the instance checkpoint (the lowest low RBA of the modified buffers) will keep advancing. Periodically, CKPT can record this lowest low RBA to the control file (using a very lightweight control-file update protocol). This periodically recorded lowest low RBA is the current position of the incremental checkpoint for the instance. Since the incremental checkpoint is performed continuously, the value of the incremental checkpoint RBA will be much closer to the tail of the log than the RBA of a conventional checkpoint, thus limiting the amount of recovery needed.

When incremental checkpointing is enabled, DBWR keeps writing buffers from the checkpoint queues in ascending low RBA order in addition to performing other writing activity. In addition, the CKPT process periodically records the progress of the incremental checkpoint in the control file. By controlling the rate at which buffers are written, we can reduce the overhead of incremental checkpointing. Quite often, writing buffers in ascending low RBA order also performs LRU replacement writes and vice versa. Hence, aging writes and checkpoint writes can complement each other.

Oracle provides tuning parameters which influence the rate at which incremental checkpoint buffers are written. Note that a smaller rate will advance the incremental checkpoint slowly; a higher rate will advance the checkpoint rapidly. It should also be noted that there is not necessarily any correlation between the number of buffers written for checkpoint reasons and the progress of the checkpoint, since the sizes of the redo records for different changes to blocks can vary considerably.

5 Performance benefits

The new checkpoint algorithm improves the performance of checkpoint operations in two ways.

First, the cost of performing a checkpoint is determined only by the number of dirty buffers in the buffer cache with low RBAs at or below the checkpoint RBA, and not by the total number of buffers in the cache. If the instance checkpoint advances quickly (as it does when incremental checkpointing is enabled), the number of such buffers is relatively small.

Second, multiple checkpoints can be serviced by servicing one single checkpoint that subsumes all other checkpoints. A checkpoint is said to subsume another checkpoint if the former checkpoint has a higher RBA than the latter. For instance, in Figure 2, c3 subsumes c1 and c2 in that if DBWR writes buffers in response to checkpoint c3, the c1 and c2 checkpoints are automatically performed along the way. This implies that the cost of servicing a set of concurrent checkpoint requests collapses down to the cost of servicing the checkpoint with the highest RBA, instead of being equal to the sum of the costs of performing each checkpoint individually.
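The interplay between the BCQs, the ACQ, and checkpoint subsumption can be sketched as a small simulation. This is an illustrative model under stated assumptions, not Oracle's code: the function names, the RBA values, the assignment of buffers to queues, and the extra buffer b6 are hypothetical, chosen so that the completion order matches the notes to Figure 2. A checkpoint is treated as complete once the lowest low RBA remaining on any queue exceeds its RBA.

```python
from collections import deque

# Each BCQ holds (buffer_name, low_rba) pairs in ascending low RBA order.
# All RBA values and the buffer-to-queue assignment are hypothetical.
bcqs = [
    deque([("b4", 10), ("b1", 30), ("b2", 60)]),
    deque([("b5", 40), ("b6", 95)]),
]
# Active checkpoint queue: checkpoint name -> checkpoint RBA.
acq = {"c1": 20, "c2": 50, "c3": 80}


def lowest_low_rba(queues):
    """Smallest low RBA still queued, or None when every queue is empty."""
    heads = [q[0][1] for q in queues if q]
    return min(heads) if heads else None


def service_checkpoints(queues, active):
    """Write buffers up to the highest active checkpoint RBA.

    Returns a log of events: ("write", buffer) and ("complete", checkpoint).
    """
    events = []
    pending = dict(sorted(active.items(), key=lambda kv: kv[1]))
    target = max(pending.values())  # the checkpoint that subsumes the rest
    while True:
        low = lowest_low_rba(queues)
        # A checkpoint is done once its RBA is below every remaining low RBA.
        for name in [n for n, rba in pending.items() if low is None or rba < low]:
            events.append(("complete", name))
            del pending[name]
        if low is None or low > target:
            return events
        # Write (and unlink) the head buffer with the smallest low RBA.
        queue = min((q for q in queues if q), key=lambda q: q[0][1])
        name, _ = queue.popleft()
        events.append(("write", name))


for event in service_checkpoints(bcqs, acq):
    print(event)
```

Servicing only the highest request (c3) produces the writes b4, b1, b5, b2, with c1, c2 and c3 completing along the way; b6 stays queued because its low RBA lies beyond every active checkpoint. The total work is four buffer writes, the same as servicing c3 alone.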

Since checkpoints frequently occur during normal processing, both these properties of the new algorithm improve normal runtime performance.

Incremental checkpointing significantly improves recovery performance. The performance of crash recovery is a critical (and often overlooked) component of overall system availability. We have experimented with various workloads, buffer cache sizes ranging from a few hundred to six million buffers, and various settings of the rate at which incremental checkpoint buffers are written. When incremental checkpointing is enabled, we have found dramatic reductions in recovery times. In addition, our measurements indicate that in all but the most demanding of workloads, it is possible to advance the incremental checkpoint at a very high rate without noticeable impact on run-time performance. These preliminary measurements are very encouraging; it is possible to have very fast crash recovery without compromising performance at run-time.

6 Conclusions

Checkpointing modified buffers is a critical aspect of buffer management because it reduces crash recovery times. In this paper, we have briefly described a novel algorithm for checkpointing in version 8.0 of the Oracle Universal Server. This algorithm uses queues of buffers ordered by low RBA to facilitate rapid identification of buffers to be written for checkpoint operations. We have also implemented a new type of checkpointing which we refer to as incremental checkpointing, which continuously advances the database checkpoint RBA as a lightweight background activity. The new algorithm significantly improves the performance of checkpoint operations, and incremental checkpointing greatly improves crash recovery times with negligible impact on normal activity.
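The incremental checkpointing mechanism summarized above can be illustrated with a toy model. It is a sketch under stated assumptions rather than a description of Oracle internals: the `Instance` class, its method names, the single queue, and the choice to report the tail of the log as the checkpoint position when no buffer is dirty are all hypothetical.

```python
from collections import deque


class Instance:
    """Toy model of one instance with a single checkpoint queue."""

    def __init__(self):
        self.next_rba = 0             # tail of the redo log
        self.queue = deque()          # (block, low_rba), ascending low RBA
        self.dirty = set()            # blocks currently linked on the queue
        self.recorded_checkpoint = 0  # value last recorded by CKPT

    def modify(self, block, redo_size):
        """Generate redo for a change; link the block if it was clean."""
        if block not in self.dirty:
            self.dirty.add(block)
            self.queue.append((block, self.next_rba))  # low RBA set once
        self.next_rba += redo_size

    def dbwr_write(self, max_buffers):
        """Write up to max_buffers from the head of the queue."""
        for _ in range(min(max_buffers, len(self.queue))):
            block, _low = self.queue.popleft()
            self.dirty.discard(block)

    def ckpt_record(self):
        """Record the lowest low RBA (or the log tail if nothing is dirty)."""
        self.recorded_checkpoint = (
            self.queue[0][1] if self.queue else self.next_rba
        )
        return self.recorded_checkpoint


inst = Instance()
for block in range(6):
    inst.modify(block, redo_size=100)  # low RBAs 0, 100, ..., 500
inst.dbwr_write(max_buffers=2)         # background write of the two oldest
print(inst.ckpt_record())              # incremental checkpoint position
print(inst.next_rba - inst.recorded_checkpoint)  # redo left to scan on a crash
```

In this run the recorded checkpoint advances to 200 after two writes, leaving 400 bytes of redo to scan. The `max_buffers` argument plays the role of the write-rate setting: a larger value advances the checkpoint faster at the price of more background writes.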

7 References

[Haerder83]: Haerder, T., and Reuter, A., Principles of Transaction-Oriented Database Recovery, ACM Computing Surveys, 15(4), 287-317, 1983.

[Oracle97]: Oracle8 Server Concepts, Volume I and II, Oracle Corporation, Part Numbers A54646-01 and A54644-01, June 1997.
