
ICOM 5016 – Intro. to Database Systems

Lecture 12 – RAID and Buffer Management

Dr. Bienvenido Vélez

Electrical and Computer Engineering Department

Slides by Dr. Manuel Rodríguez

ICOM 5016 Dr. Manuel Rodriguez Martinez

Readings

• Read:
  – Chapter 11


Relational DBMS Architecture

[Figure: layered DBMS architecture. Component labels: Client, Client API, Query Parser, Query Optimizer, Relational Operators, File and Access Methods, Buffer Management, Disk Space Management, DB. The Execution Engine and Concurrency and Recovery components are also shown.]

Performance in DBMS

• The performance of a DBMS depends on:
  – CPU usage
  – I/O usage
  – Network usage
• We shall concentrate on I/O.
• Disk I/O performance can be defined in terms of:
  – Resource usage time: total time spent using the disks
  – Response time: wall-clock time to complete the query
  – Number of I/Os: number of times an I/O operation is performed
• Parallel I/O: bringing data from several disks simultaneously
  – Response time <> Resource usage time
  – In this case, usually Response time << Resource usage time

Disks as performance bottlenecks …

• Microprocessor speed increases 50% per year.
• Disk performance improves more slowly:
  – Access time decreases 10% per year
  – Transfer rate increases 20% per year
• A disk crash results in data loss.
• Solution: disk array
  – Have several disks behave as a single, large, and very fast disk.
    • Parallel I/O
  – Add some redundancy to recover from a failure somewhere in the array

Disk Array Organization

• Several disks are grouped into a single logical unit.

[Figure: disk array organization. Labels: Disk Array, Controller Bus, Controller, System Bus, CPU, Memory.]


Disk Striping

• Disk striping is a mechanism to divide the data in a file into segments that are scattered over the disks of the disk array.
• The minimum size of a segment is 1 bit, in which case each data block must be read from several disks to extract the appropriate bits.
  – The drawback of this approach is the overhead of managing data at the level of bits.
• A better approach is to have a striping unit of 1 disk block.
  – Sequential I/O can run in parallel, since blocks can be fetched in parallel from the disks.


Disk Striping – Block sized

• Disk striping can be used to partition the data in a file into equal-sized segments, each one block in size, that are distributed over the disk array.

[Figure: the disk blocks of a file distributed over the disk array. Labels: Disk Array, Controller Bus, File, Disk Blocks.]


Data Allocation

• Data is partitioned into equal-sized segments
  – Striping unit
• Consecutive segments are stored on different disks of the array
• Typically, a round-robin algorithm is used
• If we have n disks, then block i is stored at disk
  – i mod n
• Example: array of 5 disks, and a 1MB file with a 4KB striping unit
  – Disk 0 gets blocks 0, 5, 10, 15, 20, …
  – Disk 1 gets blocks 1, 6, 11, 16, 21, …
  – Disk 2 gets blocks 2, 7, 12, 17, 22, …
  – Etc.
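The round-robin rule can be sketched in a few lines of C. This is only an illustration: the function names `stripe_disk` and `stripe_offset` are hypothetical, and the per-disk offset (block i lands at position i / n on its disk) is an assumption that the slide does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Disk that holds logical block `block` in an array of `num_disks` disks
 * under round-robin striping: block i goes to disk i mod n. */
size_t stripe_disk(size_t block, size_t num_disks)
{
    assert(num_disks > 0);
    return block % num_disks;
}

/* Position of the block within its disk (0 = first block on that disk).
 * Assumption of this sketch, not stated on the slide. */
size_t stripe_offset(size_t block, size_t num_disks)
{
    assert(num_disks > 0);
    return block / num_disks;
}
```

With 5 disks, `stripe_disk(11, 5)` is 1 and `stripe_disk(22, 5)` is 2, matching the rows for Disk 1 and Disk 2 in the example.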


Benefits of Striping

• With striping we can access data blocks in parallel!
  – Issue a request to the proper disks to get the blocks
• For example, suppose we have a 5-disk array with a 4KB striping unit and 4KB disk blocks. Let F be a 1MB file. If we need to access blocks 0, 11, 22, and 23, then we need to ask:
  – Disk 0 for block 0 at time t0
  – Disk 1 for block 11 at time t0
  – Disk 2 for block 22 at time t0
  – Disk 3 for block 23 at time t0
• All these requests are issued by the DBMS and are serviced concurrently by the disk array!


Single Disk Time Line

[Figure: timeline of elapsed clock time from t0 (read request) to t1 (read request completed). The disk service time covers blocks 0, 11, 22, and 23 one after another.]


Striping Time Line

[Figure: timeline of elapsed clock time from t0 (read request) to t1 (read request completed). The disk service times overlap: parallel I/O.]


Time access estimates

• Access time = seek time + rotational delay + transfer time
• Disk used independently or in an array: IBM Deskstar 14GPX 14.4 GB disk
  – Seek time: 9.1 milliseconds (msec)
  – Rotational delay: 4.15 msec
  – Transfer rate: 13MB/sec
• How does striping compare with a single disk?
• Scenario: striping unit of 1 disk block (4KB), access to blocks 0, 11, 22, and 23. The disk array has 5 disks.
  – Editorial note: looks like an exam problem!

Single Disk Access time

• Total time = sum of the times to read each block
• Time for block 0:
  9.1 msec + 4.15 msec + 4KB/(13MB/1sec) * (1MB/1024KB) * (1000msec/1sec)
  = 9.1 msec + 4.15 msec + 0.3 msec = 13.55 msec
• The same estimate applies to blocks 11, 22, and 23.
• Total time: 4 * 13.55 msec = 54.20 msec


Striping Access Time

• Total time: maximum time to complete any read request.
• Following the same calculation as in the previous slide:
  – Time for block 0: 13.55 msec
  – Time for block 11: 13.55 msec
  – Time for block 22: 13.55 msec
  – Time for block 23: 13.55 msec
• Total time:
  – max{13.55 msec, 13.55 msec, 13.55 msec, 13.55 msec} = 13.55 msec
• In this case, striping gives us 4 times better performance (4 to 1) because of parallel I/O.
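The single-disk and striped estimates can be reproduced with a small C sketch. The function names are hypothetical, the transfer time uses 1MB = 1024KB as in the slide's arithmetic, and the model assumes that each disk services its own requests one after another while the disks work in parallel. Without rounding, the per-block time comes out at roughly 13.5505 msec, which the slides round to 13.55.

```c
#include <assert.h>

/* Time in msec to read one block: seek + rotational delay + transfer. */
double block_access_ms(double seek_ms, double rot_ms,
                       double block_kb, double transfer_mb_per_sec)
{
    assert(transfer_mb_per_sec > 0.0);
    /* KB -> MB (1 MB = 1024 KB), then seconds -> msec */
    double transfer_ms = block_kb / (transfer_mb_per_sec * 1024.0) * 1000.0;
    return seek_ms + rot_ms + transfer_ms;
}

/* Single disk: the reads are serviced one after another. */
double single_disk_ms(int num_blocks, double per_block_ms)
{
    return num_blocks * per_block_ms;
}

/* Striped array: the total is the time taken by the busiest disk. */
double striped_ms(const int *blocks, int num_blocks, int num_disks,
                  double per_block_ms)
{
    enum { MAX_DISKS = 64 };            /* arbitrary limit for this sketch */
    int load[MAX_DISKS] = {0};          /* requests queued on each disk */
    int busiest = 0;

    assert(num_disks > 0 && num_disks <= MAX_DISKS);
    for (int i = 0; i < num_blocks; i++) {
        assert(blocks[i] >= 0);
        int d = blocks[i] % num_disks;  /* round-robin placement */
        if (++load[d] > busiest)
            busiest = load[d];
    }
    return busiest * per_block_ms;
}
```

For blocks 0, 11, 22, and 23 on 5 disks every request lands on a different disk, so the striped time equals one block access. If all requested blocks sat on the same disk (for example 0, 5, and 10), striping would give no speedup.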


The problem with Striping

• Striping has the advantage of speeding up disk access time.
• But the use of a disk array decreases the reliability of the storage system, because more disks mean more possible points of failure.
• Mean-time-to-failure (MTTF)
  – Mean time until the disk fails and loses its data
• MTTF is inversely proportional to the number of components used by the system.
  – The more we have, the more likely they will fall apart!


MTTF in terms of number of disks

• Suppose we have N disks in the array. Then the MTTF of the array is given by:
  MTTF of array = (MTTF of a single disk) / N
• This assumes any disk can fail with equal probability.


MTTF in disk array

• Suppose we have a single disk with an MTTF of 50,000 hrs (5.7 years).
• Then, if we build an array with 50 disks, we have an MTTF for the array of 50,000/50 = 1000 hrs, or about 42 days, because any disk can fail at any given time with equal probability.
  – In practice, disk failures are more common when disks are new (bad disk from factory) or old (wear due to usage).
• Moral of the story: more does not necessarily mean better!
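The arithmetic above is small enough to check in code. The sketch below assumes the simple model used on the slide (array MTTF is the single-disk MTTF divided by the number of disks); the function names are hypothetical.

```c
#include <assert.h>

/* MTTF of an array of identical disks, assuming every disk is equally
 * likely to fail at any time: MTTF of one disk divided by the disk count. */
double array_mttf_hours(double disk_mttf_hours, int num_disks)
{
    assert(num_disks > 0);
    return disk_mttf_hours / num_disks;
}

/* Convenience conversion for reporting. */
double hours_to_days(double hours)
{
    return hours / 24.0;
}
```

For a 50,000 hr disk and 50 disks this gives 1000 hrs, a little under 42 days.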



Increasing MTTF with redundancy

• We can increase the MTTF of a disk array by storing some redundant information in the disk array.
  – This information can be used to recover from a disk failure.
• This information should be carefully selected so it can be used to reconstruct the original data after a failure.
• What to store as redundant information?
  – Full data block
  – Parity bit for a set of bit locations across all the disks
• Where to store it?
  – Check disks: disks in the array used only for this purpose
  – All disks: spread the redundant information over every disk in the array.


Redundancy unit: Data Blocks

• One approach is to keep a backup copy of each data block in the array. This is called mirroring.
• The backup can be on:
  – Another disk, or disk array
  – Tape (very slow …)
• Advantage:
  – Easy to recover from failure: just read the block from the backup.
• Disadvantages:
  – Requires twice the storage space
  – Writes are more expensive
    • Need to write the data block to two different locations each time
    • Snapshot writes are unfeasible (failures happen at any time!)


Redundancy Unit: Parity bits

• Consider an array of N disks, and consider the k-th block on each disk. Each block consists of several kilobytes, and each byte has 8 bits.
• We can store redundant information about the i-th bit position in each data block.
  – Parity bit
• The parity bit indicates whether the number of bits set to 1 in the group of corresponding bit locations of the data blocks is even or odd.
• For example, if bit 1024 has parity 0, then an even number of bits were set to 1 at bit position 1024. Otherwise, its parity must be 1.


Parity bits

• Consider the bytes:
  – b1 = 00010001, b2 = 00111111, b3 = 00000011
• If we take the XOR of these bytes we get:
      00010001
      00111111
      00000011
      --------
      00101101   (this byte has the parity for all bits in b1, b2, b3)
• Notice the following:
  – For bit position 0 (the rightmost bit), the parity is 1, meaning an odd number of bytes have value 1 at bit position 0.
  – For bit position 1, the parity is 0, meaning an even number of bytes have value 1 at bit position 1.
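The XOR computation can be written directly in C. This is a minimal sketch with hypothetical function names; it treats bit position 0 as the rightmost (least significant) bit, which is the reading consistent with the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR of all bytes: bit i of the result is the parity of bit i across them. */
uint8_t parity_byte(const uint8_t *bytes, size_t count)
{
    uint8_t parity = 0;
    for (size_t i = 0; i < count; i++)
        parity ^= bytes[i];
    return parity;
}

/* Parity (0 or 1) at bit position `pos`; position 0 is the rightmost bit. */
int parity_at(uint8_t parity, unsigned pos)
{
    assert(pos < 8);
    return (parity >> pos) & 1;
}
```

For b1 = 0x11, b2 = 0x3F, and b3 = 0x03 the parity byte is 0x2D, which is 00101101 in binary.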


Redundancy as blocks of parity bits

• For each group of corresponding data blocks, compute and store a parity block that holds the parity bits for the corresponding bit locations in the data blocks.

[Figure: disk array on a controller bus. Block 0 of Disk 0, Disk 1, Disk 2, and Disk 3, with the parity stored in block 0 of Check Disk 0.]


Reliability groups in disk array

• Organize the disks in the array into groups called reliability groups.
• Each group has:
  – 1 or more data disks, where the data blocks are stored
  – 0 or more check disks, where the blocks of parity bits are stored
• If a data disk fails, then the check disk(s) for its reliability group can be used to recover the data lost from that disk.
• There is a recovery algorithm that works for any failed disk m in the disk array.
• With parity bits, a group can recover from the failure of at most 1 disk.


Recovery algorithm

• Suppose we have an array of N disks, with M check disks (in this case there is one reliability group).
• Suppose disk p fails. We buy a replacement, and then we can recover the data as follows.
• For each data block k on disk p:
  – Read data block k on every disk r, with r != p
  – Read parity block k from its check disk w
  – For each bit position i in block k of disk p:
    • Count the number of bits set to 1 at bit i in each block coming from a disk other than p. Let this number be j
    • If j is odd and the parity bit is 1, then bit position i is set to 0
    • If j is even and the parity bit is 0, then bit position i is set to 0
    • Else, bit position i is set to 1


Recovery algorithm: Example

• Suppose we have an array of 5 disks, with 1 check disk (in this case there is one reliability group).
• Suppose disk 1 fails. We buy a replacement, and then we can recover the data as follows.
• For each data block k on disk 1:
  – Read data block k on disks 0, 2, 3, 4
  – Read parity block k on check disk 0
  – For each bit position i in block k of disk 1:
    • Count the number of bits set to 1 at bit i in each block k coming from disks 0, 2, 3, 4. Let this number be j
    • If j is odd and the parity bit is 1, then bit position i is set to 0
    • If j is even and the parity bit is 0, then bit position i is set to 0
    • Else, bit position i is set to 1
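The bit-by-bit rule is equivalent to XOR-ing the parity block with all surviving data blocks: the lost bit is 0 exactly when the count j and the parity bit agree. The sketch below uses that shortcut and processes 8 bit positions at a time by XOR-ing whole bytes. The function name `recover_block` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rebuild block k of the failed disk.
 * surviving[d] points to block k of the d-th surviving data disk,
 * parity points to parity block k, and out receives the rebuilt block. */
void recover_block(const uint8_t *const *surviving, size_t num_surviving,
                   const uint8_t *parity, uint8_t *out, size_t block_size)
{
    assert(surviving != NULL && parity != NULL && out != NULL);
    memcpy(out, parity, block_size);          /* start from the parity bits */
    for (size_t d = 0; d < num_surviving; d++)
        for (size_t i = 0; i < block_size; i++)
            out[i] ^= surviving[d][i];        /* each 1 bit flips the result */
}
```

After all surviving blocks are folded in, a result bit is 1 only when the number of surviving 1 bits (j) and the parity bit disagree, which is the "Else" case of the algorithm.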


RAID Organization

• RAID:
  – Originally: Redundant Array of Inexpensive Disks
  – Now: Redundant Array of Independent Disks
• RAID organization combines the ideas of striping, redundancy as parity bits, and reliability groups.
• A RAID system has one or more reliability groups
  – For simplicity we shall assume only one group …
• RAID systems can have various numbers of check disks per reliability group, depending on the RAID level that is chosen for the system.
• Each RAID level represents a different tradeoff between storage requirements, write speed, and recovery complexity.


RAID Analysis

• Suppose we have a disk array with 4 data disks.
• Let’s analyze how many check disks we need to build a RAID with 1 reliability group of 4 data disks plus the check disks.
• Note: Effective space utilization is a measure of the amount of space in the disk array that is used to store data. It is given as a percentage by the formula:
  Effective space utilization = (number of data disks / total number of disks) * 100%
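As a quick check of the utilization figures quoted for the RAID levels, here is a minimal C sketch. It assumes utilization is the number of data disks divided by the total number of disks (data plus check disks), which agrees with the 100% and 50% values given for Levels 0 and 1; the function name is hypothetical.

```c
#include <assert.h>

/* Effective space utilization as a percentage:
 * data disks / (data disks + check disks) * 100. */
double space_utilization_pct(int data_disks, int check_disks)
{
    assert(data_disks > 0 && check_disks >= 0);
    return 100.0 * data_disks / (data_disks + check_disks);
}
```

With 4 data disks, no check disks gives 100%, 4 mirror disks give 50%, and a single check disk gives 80%.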


RAID Level 0

• RAID Level 0: Non-redundant
• Uses data striping to distribute data blocks and increase the maximum disk bandwidth available.
  – Disk bandwidth refers to the aggregate rate of moving data from the disk array to main memory, e.g. 200MB/sec
• Solution with the lowest cost, but with little reliability.
• Write performance is the best, since only 1 block is written in every write operation, and the cost is 1 I/O.
• Read performance is not the best, since a block can only be read from one site.
• Effective space utilization is 100%



• RAID Level 1: Mirrored
• Each data block is duplicated:
  – Original copy + mirror copy
• No striping!
• Most expensive solution, since it requires twice the space of the expected data set size.
• Every write involves two writes (original + copy)
  – They cannot be done simultaneously, to prevent double corruption
  – First write on the data disk, then on the copy at the mirror disk
• Reads are fast, since a block can be fetched from either:
  – Data disk
  – Mirror disk


RAID Level 1 (cont…)

• In RAID Level 1, the data block can be fetched from the disk with the least contention.
• Since we need to pair disks in groups of two (original + copy), the space utilization is 50%, independent of the number of disks.
• RAID Level 1 is only good for small data workloads where the cost of mirroring is not an issue.


RAID Level 0+1

• RAID Level 0+1: Striping and Mirroring
  – Also called RAID Level 10
• Combines mirroring and striping.
• Data is striped over the data disks.
  – Parallel I/O for high throughput (full disk array bandwidth)
• Each data disk is copied into a mirror disk
• Writes require 2 I/Os (original disk + mirror disk)
• Blocks can be read from either the original disk or the mirror disk
  – Better performance, since more parallelism can be achieved.
  – No need to wait for a busy disk, just go to its mirror disk!


RAID Level 1+0 (cont…)

• Space utilization is 50% (half data and half copies)
• RAID Level 1+0 is better than RAID 1 because of striping.
• RAID Level 1+0 is good for workloads with small data sets, where the cost of mirroring is not an issue.
• Also good for workloads with high percentages of writes, since a write is always 2 I/Os to unloaded disks (especially the mirrors).



• RAID Level 2: Error-Correcting Codes
• Uses striping with a 1-bit striping unit.
• Hamming code for redundancy in C check disks.
  – Can indicate which disk failed
  – Makes the number of check disks grow logarithmically with respect to the number of data disks. (???)
• A read is expensive, since to read 1 bit we need to read 1 physical data block, the one storing the bit.
• Therefore, to read 1 logical data block from the array we need to read physical data blocks from every data disk to get all the necessary bits.


RAID Level 2 (Cont…)

• Since we are striping with 1-bit units, if we have an array with m data disks, then reading m consecutive bits will require 1 block from each disk, for a total of m I/Os.
• Therefore, reading 1 logical data block from the RAID will require reading at least m blocks, and the cost will be at least m I/Os.
• Level 2 is good for requests for large contiguous data blocks, since the physical blocks that the system fetches will hold the required data.
• Level 2 is bad for requests for small amounts of data, since the I/Os will be wasted in fetching just a few bits and throwing away the rest.



• Writes are expensive with Level 2 RAID.
• A write operation on N data disks involves:
  – Reading at least N data blocks into memory.
  – Reading C check disks
  – Modifying the N data blocks with the new data.
  – Modifying C check disks to update the Hamming codes
  – Writing N + C blocks to the disk array.
• This is called a read-modify-write cycle.
• Level 2 has better space utilization than Level 1.



• RAID Level 3: Bit-Interleaved Parity
• Uses striping with a 1-bit striping unit.
• Does not use Hamming codes, but simply computes bit parity.
  – The disk controller can tell which disk has failed.
• Only needs 1 check disk to store the parity bits of the data disks in the array.
• A RAID Level 3 system will have N disks, where N-1 are data disks and one is the check disk.



• Reading or writing a logical data block in a RAID Level 3 involves reading at least N-1 data blocks from the array.

• Writing requires a read-modify-write cycle.



• RAID Level 4: Block-Interleaved Parity
• Uses striping with a 1-block striping unit.
  – A logical data block is the same as a physical data block.
• Computes redundancy as parity bits, and has 1 check disk to store the parity bits for all corresponding blocks in the array.
• Reads can be run in parallel
  – Works well for both large and small data requests.
• Writes require a read-modify-write cycle, but only involve:
  – The data disk for the block being modified (target block k)
  – The check disk (parity block for block k)



• The parity block k is updated incrementally to avoid reading all data blocks k from all data disks.
  – Only need to read parity block k and the block k to be modified
  – Parity is computed as follows:
    New parity block = (Old block XOR New block) XOR Old parity block
• In this way, the read-modify-write cycle avoids reading the data block on each disk to compute the parity.
• The read-modify-write cycle only performs 4 I/Os (2 reads and 2 writes, of the target data block and the parity block)
• Space utilization is the same as RAID Level 3.
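The incremental parity update can be sketched in C as follows. The function name `update_parity` is hypothetical, and the blocks are plain in-memory byte arrays standing in for the blocks read from the data disk and the check disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Update a parity block in place for a small write:
 * new parity = (old block XOR new block) XOR old parity. */
void update_parity(uint8_t *parity, const uint8_t *old_block,
                   const uint8_t *new_block, size_t block_size)
{
    assert(parity != NULL && old_block != NULL && new_block != NULL);
    for (size_t i = 0; i < block_size; i++)
        parity[i] ^= (uint8_t)(old_block[i] ^ new_block[i]);
}
```

The result matches what a full recomputation over all data disks would produce, because XOR-ing with the old block cancels its contribution and XOR-ing with the new block adds the new one.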



• In RAID Level 3 and 4, the check disk is only used in writing operations. It does not help with the reads.

• Moreover, the check disk becomes a bottleneck since it must participate in every write operation.



• RAID Level 5: Block-Interleaved Distributed Parity
• Uses striping with a 1-block striping unit.
• Redundancy is stored as blocks of parity bits, but the parity blocks are distributed over all the disks in the array.
  – Every disk is both a data disk and a check disk.
• Best of both worlds:
  – Fast reads
  – Fast writes
• Reads are efficient since they can be run in parallel.



• Writes still involve a read-modify-write cycle
• But the cost of writing the parity blocks is lowered by scattering them over all the disks in the array.
  – Removes the contention at one check disk
• RAID Level 5 is a good general-purpose system
  – Small reads
  – Large reads
  – Intensive writes
• Space utilization is equivalent to Levels 3 and 4, since there is 1 disk worth of parity blocks in the system!



• RAID Level 6: P+Q Redundancy
• RAID Levels 2 through 5 can only recover from 1 disk failure.
• In a large disk array, there is a high probability that two disks might fail simultaneously.
• RAID Level 6 provides recovery from 2 disk failures.
• Uses striping with a 1-block striping unit
• Redundancy is stored as parity bits and Reed-Solomon codes.
  – Requires two check disks for the data disks in the array



• Reads are like in RAID Level 5.
• Writes involve a read-modify-write cycle that involves 4 I/Os:
  – 1 for the data block
  – 1 for the parity block
  – 2 for the Reed-Solomon codes







Buffer Management

• All data pages must be in memory in order to be accessed.
• Buffer Manager
  – Asks the Disk Space Manager for pages from disk and stores them in memory
  – Sends the Disk Space Manager pages to be written to disk.
• Memory is faster than disk
  – Keep as much data as possible in memory
  – If not enough space is available, we need a policy to decide which pages to remove from memory.


Buffer Pool

• Frame
  – Data structure that can hold a data page and control flags
• Buffer pool
  – Array of frames of size N.
• In C:

    #include <stdbool.h>

    #define POOL_SIZE 100
    #define PAGE_SIZE 4096

    typedef struct frame {
        int pin_count;          /* number of transactions using the page */
        bool dirty;             /* set if modified since read from disk */
        char page[PAGE_SIZE];   /* raw bytes of the data page */
    } frame;

    frame buffer_pool[POOL_SIZE];


Buffer Pool

[Figure: buffer pool in RAM, with frames holding disk pages and free frames. Pages come from the DB on disk.]


Operational mode

• All requested data pages must first be placed into the buffer pool.
• pin_count is used to keep track of the number of transactions that are using the page
  – 0 means nobody is using it
• dirty is used as a flag (dirty bit) to indicate that a page has been modified since it was read from disk
  – Need to flush it to disk if the page is to be evicted from the pool
• page is an array of bytes where the actual data is located
  – Need to interpret these bytes as the int, char, Date data types supported by SQL
    • This is very complex and tricky!


Buffer replacement

• If we need to bring a page from disk, we need to find a frame in the buffer pool to hold it
• The buffer pool keeps track of the frames in use
  – List of frames that are free
• If there is a free frame, we use it
  – Remove it from the list of free frames
  – Increment the pin_count
  – Store the data page into the byte array (page field)
• If the buffer pool is full, we need a policy to decide which page will be evicted


Buffer replacement Algorithm

• Upon request of page X do
  – Look for page X in the buffer pool
  – If found, return it
  – Else, determine if there is a free frame Y in the pool
  – If frame Y is found
    • Increment its pin_count
    • Read the page into the frame’s byte array
  – Else, use a replacement policy to find a frame Z to replace
    • Z must have pin_count == 0
    • Increment the pin_count in Z
    • If the dirty bit is set, write the data currently in Z to disk
    • Read the new page into the byte array in Z
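The request steps above can be sketched in C. This is a sketch under stated assumptions, not a definitive implementation: the `page_id` field, the `NO_PAGE` marker, the in-memory `disk` array standing in for the Disk Space Manager, and the functions `pool_init`, `request_page`, and `release_page` are all hypothetical. The replacement policy is simply "first frame with pin_count == 0", and a hit also increments pin_count (the slide only says to return the page) so that every request is balanced by one release.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define POOL_SIZE 3        /* tiny pool so eviction is easy to exercise */
#define PAGE_SIZE 64
#define DISK_PAGES 16
#define NO_PAGE (-1)

typedef struct frame {
    int page_id;           /* page held by the frame, NO_PAGE if free (assumed field) */
    int pin_count;
    bool dirty;
    char page[PAGE_SIZE];
} frame;

static frame buffer_pool[POOL_SIZE];
static char disk[DISK_PAGES][PAGE_SIZE];   /* stand-in for the Disk Space Manager */

static void disk_read(int page_id, char *dst)        { memcpy(dst, disk[page_id], PAGE_SIZE); }
static void disk_write(int page_id, const char *src) { memcpy(disk[page_id], src, PAGE_SIZE); }

void pool_init(void)
{
    memset(buffer_pool, 0, sizeof buffer_pool);
    for (int i = 0; i < POOL_SIZE; i++)
        buffer_pool[i].page_id = NO_PAGE;
}

/* Returns the frame holding page_id, pinned, or NULL if every frame is pinned. */
frame *request_page(int page_id)
{
    frame *target = NULL;
    assert(page_id >= 0 && page_id < DISK_PAGES);

    for (int i = 0; i < POOL_SIZE; i++)             /* 1. already in the pool? */
        if (buffer_pool[i].page_id == page_id) {
            buffer_pool[i].pin_count++;             /* assumption: pin on a hit */
            return &buffer_pool[i];
        }
    for (int i = 0; i < POOL_SIZE && !target; i++)  /* 2. free frame Y? */
        if (buffer_pool[i].page_id == NO_PAGE)
            target = &buffer_pool[i];
    for (int i = 0; i < POOL_SIZE && !target; i++)  /* 3. victim Z, pin_count == 0 */
        if (buffer_pool[i].pin_count == 0)
            target = &buffer_pool[i];
    if (!target)
        return NULL;                                /* caller must wait or abort */

    if (target->page_id != NO_PAGE && target->dirty)
        disk_write(target->page_id, target->page);  /* flush the old page first */
    disk_read(page_id, target->page);
    target->page_id = page_id;
    target->pin_count = 1;
    target->dirty = false;
    return target;
}

/* The caller is done with the page; mark it dirty if it was modified. */
void release_page(frame *f, bool modified)
{
    assert(f != NULL && f->pin_count > 0);
    f->pin_count--;
    if (modified)
        f->dirty = true;
}
```

A caller would pair each `request_page` with a `release_page`. Returning NULL corresponds to the case where no frame has pin_count equal to 0.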


Some issues

• Need to make sure pin_count is 0
  – Nobody is using the frame
• Need to write the data to disk if the dirty bit is set
• This latter approach is called lazy update
  – Write to disk only when you have to!!!
  – Careful: if power fails, you are in trouble.
  – The DBMS needs to periodically flush pages to disk
    • Force write
• If no page is found with pin_count equal to 0, then either:
  – Wait until one is freed
  – Abort the transaction (insufficient resources)


Buffer Replacement policies

• LRU – Least Recently Used
  – Evicts the page that is the least recently used page in the pool.
  – Can be implemented by having a priority queue with the frame numbers. High priority: not used recently
  – The head of the queue is the LRU frame
  – Each time a page is used, it must be removed from its current queue position and put back at the end
    • This queue needs a method erase() that can erase an entry from the middle of the queue
• LRU is the most widely used policy for buffer replacement
  – Most cache managers also use it
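The LRU queue can be sketched with a small array of frame numbers. The names `lru_queue`, `lru_touch`, and `lru_victim` are hypothetical, and the array-based erase is only one possible choice (a linked list would avoid shifting entries). The sketch also ignores pin counts: a real buffer pool would have to skip frames whose pin_count is greater than 0.

```c
#include <assert.h>
#include <string.h>

#define POOL_SIZE 4

/* Queue of frame numbers: order[0] is the least recently used frame,
 * order[count - 1] the most recently used one. */
typedef struct lru_queue {
    int order[POOL_SIZE];
    int count;
} lru_queue;

void lru_init(lru_queue *q)
{
    q->count = 0;
}

/* Remove frame_no from wherever it sits in the queue (the erase() step). */
static void lru_erase(lru_queue *q, int frame_no)
{
    for (int i = 0; i < q->count; i++) {
        if (q->order[i] == frame_no) {
            /* shift the later entries one slot towards the head */
            memmove(&q->order[i], &q->order[i + 1],
                    (size_t)(q->count - i - 1) * sizeof q->order[0]);
            q->count--;
            return;
        }
    }
}

/* Record a use of frame_no: move it to the end of the queue. */
void lru_touch(lru_queue *q, int frame_no)
{
    assert(frame_no >= 0 && frame_no < POOL_SIZE);
    lru_erase(q, frame_no);
    q->order[q->count++] = frame_no;
}

/* Frame to evict: the head of the queue, or -1 if the queue is empty. */
int lru_victim(const lru_queue *q)
{
    return q->count > 0 ? q->order[0] : -1;
}
```

After touching frames 0, 1, and 2 in that order the victim is frame 0. Touching frame 0 again moves it to the end, so the victim becomes frame 1.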


Other policies

• Most Recently Used
  – Evicts the page that was most recently accessed
  – Can be implemented with a priority queue
• FIFO
  – Pages are replaced in strict first-in, first-out order
  – Can be implemented with a FIFO list (a queue in the strict sense)
• Random
  – Pick any page at random for replacement







Disk Space Management

• Disk Space Manager
  – DBMS module in charge of managing the disk space used to store relations
  – Duties
    • Allocate space
    • Write data
    • Read data
    • De-allocate space
• The Disk Space Manager supplies a stream of data pages.
  – A page is the minimal unit of I/O
  – Often the size of a block (sector, several sectors, or more)


Disk Page

• A disk page is simply an array of bytes
• We impose the logic of an array of records!

[Figure: a disk page viewed as records]
  123   Bob  NY  $1200
  2178  Jil  LA  $9202
  8273  Ned  FL  $2902
  723   Al   PR  $300

• Reading a disk page should take one I/O


Disk arrangement option

• Suppose we need to create 10 GB of space to store a database. Each page is 4 KB in size.
  – How do we organize the disk to accomplish this?
• Cooked File
  – Use the file system provided by the OS
  – Create a file “mydb.dat”
  – Write to this file N pages of size 4KB
    • N must be enough to reach the size of 10GB
    • Pages are full of bytes with zeros.
  – Have a hash table somewhere to store the information about this file “mydb.dat”.
  – Now you can start writing pages with actual data.
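A cooked file can be pre-allocated with standard C file I/O. The sketch below is illustrative only: the function names are hypothetical, 1 GB is taken as 2^30 bytes, and the caller is assumed to have opened the file (for example “mydb.dat”) in binary mode.

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Number of pages needed to cover total_bytes, rounding up. */
unsigned long long pages_needed(unsigned long long total_bytes)
{
    return (total_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

/* Byte offset in the file at which page page_no starts. */
unsigned long long page_offset(unsigned long long page_no)
{
    return page_no * PAGE_SIZE;
}

/* Append num_pages zero-filled pages to an open file. Returns 0 on success. */
int write_zero_pages(FILE *fp, unsigned long long num_pages)
{
    static const char zeros[PAGE_SIZE] = {0};   /* one page of zero bytes */

    assert(fp != NULL);
    for (unsigned long long i = 0; i < num_pages; i++)
        if (fwrite(zeros, PAGE_SIZE, 1, fp) != 1)
            return -1;                          /* disk full or I/O error */
    return 0;
}
```

For 10GB and 4KB pages, `pages_needed` gives N = 2,621,440. Once the file exists, page p can be located at byte offset `page_offset(p)`; note that seeking that far with `fseek` may not work on platforms where `long` is 32 bits.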


Raw Disk Partition

• Don’t use the file system provided by the OS
• Instead, create a partition on the disk, but don’t format it with OS formats (e.g. FAT, FAT32, NTFS, Linux)
• Make your own file system on the disk
  – Create a directory of pages
  – Need to implement all operations such as read, write, check, etc.
  – Need to implement your own files…
  – Faster and more efficient than OS files, but more complex.

Alternative DBMS Views of the Storage System

• Single File
  – Each database is stored as a single file
  – Example: SQLite
  – Pro: Easy to store and exchange databases
  – Cons: Concurrent/Transactional access very hard
• File System
  – Each database is stored as multiple files
  – Example: MySQL
  – Pro: Concurrent/Transactional access good, DBMS portable and simpler
  – Cons: DBMS must rely on OS for performance
• Block System (Raw Disk)
  – The DBMS implements its own storage system on a raw disk partition
  – Example: Oracle 10g
  – Pro: RDBMS can achieve maximal performance
  – Cons: RDBMS complex/expensive

Storage Hierarchy Management

“There is No Such Thing as a Big Fast Memory”

[Figure: storage hierarchy. Labels: Intra-Processor, Buffer Manager, Online/Secondary Storage, Offline/Tertiary Storage.]
