CMPT 454: Data Storage and Disk Access

Page 1:

CMPT 454
Data Storage and Disk Access

Page 2:

Data Storage and Disk Access

Memory hierarchy
Hard disks
  Architecture
  Processing requests
  Writing to disk
Hard disk reliability and efficiency
  RAID
Solid State Drives
Buffer management
Data storage

Page 3:

Memory

Page 4:

DBMS and Memory

[Diagram: the DBMS shown in relation to the memory hierarchy: cache, main memory, virtual memory, disk, the file system, and tertiary storage.]

Page 5:

Memory Hierarchy

Primary memory: volatile
  Main memory
  Cache
Secondary memory: non-volatile
  Solid State Drive (SSD)
  Magnetic Disk (Hard Disk Drive, HDD)
Tertiary memory: non-volatile
  CD/DVD
  Tape – sequential access
  Usually used as backup or for long-term storage

(cost and speed both increase toward primary memory)

Page 6:

Main Memory vs Secondary Storage

Speed
  Main memory is much faster than secondary memory
  ▪ 10 – 100 nanoseconds to move data in main memory
  ▪ i.e. 0.00001 to 0.0001 milliseconds
  ▪ 10 milliseconds to read a block from an HDD
  ▪ 0.1 milliseconds to read a block from an SSD
Cost
  Main memory is around 100 times more expensive than secondary memory
  SSDs are more expensive than HDDs

Page 7:

Main Memory vs Secondary Storage

System limitations
  On a 32-bit system only 2^32 bytes can be directly referenced
  ▪ Many databases are larger than that
Volatility
  Data must be maintained between program executions, which requires non-volatile memory
  ▪ Non-volatile storage retains its contents when the device is turned off, or if there is a power failure
  Main memory is volatile; secondary storage is not

Page 8:

Hard Disk Drives

Page 9:

Magnetic Disks

Database data is usually stored on disks
  A database will often be too large to be retained in main memory
  When a query is processed, data will need to be retrieved from storage
Data is stored in disk blocks
  Also referred to simply as blocks or, in relation to the OS, pages
  A contiguous sequence of bytes, and the unit in which data is written to and read from disk
  Block size is typically between 4 and 16 kilobytes

Page 10:

Magnetic Disk Structure

A hard disk consists of a number of platters
  A platter can store data on one or both of its surfaces, and so is referred to as single-sided or double-sided
Surfaces are composed of concentric rings called tracks
  The set of all tracks with the same diameter is called a cylinder
Sectors are arcs of a track
  They are typically 4 kilobytes in size
  Block size is set when the disk is initialized, usually a small multiple of the sector size (hence 4 to 16 kilobytes)

Page 11:

Diagram of a Disk

[Diagram: platters, surfaces, tracks, and a cylinder, annotated with statistics for a Western Digital Caviar Black 1 TB hard drive.]

Page 12:

Disk Heads

Data is transferred to or from a surface by a disk head
  There is one disk head for each surface
  The disk heads are moved as a unit (called a disk head array)
  ▪ Therefore all the heads are in identical positions with respect to their surfaces
To read or write a block, a disk head must be positioned over it
Only one disk head can read or write at a time

Page 13:

Disk Anatomy

[Diagram: the disk head array moves in and out over the platters and their tracks; the disk spins at around 7,200 rpm.]

Page 14:

Disk Controller

Disk drives are controlled by a processor called a disk controller, which
  Controls the actuator that moves the head assembly
  Selects sectors and determines when the disk has rotated to a sector
  Transfers data between the disk and main memory
Some controllers buffer data from tracks in the expectation that the data will be required

Page 15:

Accessing Data in a Disk

The disk constantly spins, at 7,200 rpm*
The head pivots over the desired track
The desired block is read as it passes underneath the head

* Western Digital Caviar Black 1 TB hard drive (again)

Page 16:

Accessing a Block

The disk head is moved in or out to the track
  This seek time is typically 10 milliseconds
  ▪ WD Caviar Black 1 TB: 8.9 ms
Wait until the block rotates under the disk head
  This rotational delay is typically 4 milliseconds
  ▪ WD Caviar Black 1 TB: 4.2 ms
The data on the block is transferred to memory
  This transfer time is the time it takes for the block to completely rotate past the disk head
  ▪ Typically less than 1 millisecond

Page 17:

Transfer Time

The seek time and rotational delay depend on
  Where the disk head is before the request,
  Which track is being requested, and
  How far the disk has to rotate
The transfer time depends on the request size
  The transfer time (in ms) for one block equals (60,000 / disk rpm) / blocks per track
  The transfer time (in ms) for an entire track equals 60,000 / disk rpm
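
As a quick sanity check, here is a minimal sketch of these two formulas in Python; the 7,200 rpm and 200 blocks-per-track figures are assumed example values, not drive specifications from the slides.

```python
# Transfer-time formulas from above.
def track_transfer_ms(rpm: int) -> float:
    """One full revolution: 60,000 ms per minute divided by rpm."""
    return 60_000 / rpm

def block_transfer_ms(rpm: int, blocks_per_track: int) -> float:
    """A block occupies 1/blocks_per_track of a revolution."""
    return track_transfer_ms(rpm) / blocks_per_track

print(track_transfer_ms(7200))       # ~8.33 ms per track
print(block_transfer_ms(7200, 200))  # ~0.04 ms per block
```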

Page 18:

Main Memory versus Disk

Typical access time for a block on a hard disk: 15 milliseconds
Typical access time for a main memory frame: 60 nanoseconds
What's the difference?
  1 millisecond = 1,000,000 nanoseconds
  60 ns = 0.000060 ms
  15 ms / 0.000060 ms = 250,000
Accessing a hard drive is around 250,000 times slower than accessing main memory

Page 19:

Reducing Disk Access Time

Disk latency (access time) has three components
  seek time + rotational delay + transfer time
The overall access time can be shortened by reducing, or even eliminating, seek time and rotational delay
  Related data should be stored in close proximity
Accessing two records in adjacent blocks on a track
  ▪ Seek the desired track, rotate to the first block, and transfer two blocks = 10 + 4 + 2*1 = 16 ms
Accessing two records on different tracks
  ▪ Seek the desired track, rotate to the block, and transfer the block, then repeat = (10 + 4 + 1)*2 = 30 ms

Page 20:

Order of Closeness

What does it mean to say that related data should be stored close to each other?
  The term close refers not to physical proximity but to how the access time is affected
In order of closeness:
  Same block
  Adjacent blocks on the same track
  Same track
  Same cylinder, but different surfaces
  Adjacent cylinders
  …

Page 21:

Which is Closer

[Diagram: block 1, with block 2 on the adjacent track of the same surface and block 3 on the same cylinder but a different surface.]

Is 2 or 3 "closer" to 1?
  2 is in the adjacent track, and is clearly physically closer, but the disk head must be moved to access it
  3 is in the same cylinder, so the disk head does not have to be moved
  Which is why 3 is closer

Page 22:

Fulfilling Disk Requests

A fair algorithm would take a first-come, first-served approach
  Insert requests in a queue and process them in the order in which they are received

[Diagram: requests for cylinders 2,000 (1st), 4,000 (4th), 6,000 (2nd), 10,000 (6th), 14,000 (3rd), and 16,000 (5th), numbered in order of arrival.]

Cylinder | Received | Complete | Moved  | Total
2,000    | 0        | 5        | 2,000  | 2,000
6,000    | 0        | 14       | 4,000  | 6,000
14,000   | 0        | 27       | 8,000  | 14,000
4,000    | 10       | 43       | 10,000 | 24,000
16,000   | 20       | 60       | 12,000 | 36,000
10,000   | 30       | 72       | 6,000  | 42,000

Page 23:

Elevator Algorithm

The elevator algorithm usually performs better than FIFO
  Requests are buffered and the disk head moves in one direction, processing requests as it goes
  The arm then reverses direction

[Diagram: the same requests as on the previous slide.]

Cylinder | Received | Complete | Moved | Total
2,000    | 0        | 5        | 2,000 | 2,000
6,000    | 0        | 14       | 4,000 | 6,000
14,000   | 0        | 27       | 8,000 | 14,000
16,000   | 20       | 35       | 2,000 | 16,000
10,000   | 30       | 46       | 6,000 | 22,000
4,000    | 30       | 58       | 6,000 | 28,000
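
A minimal simulation sketch of the elevator order on these requests; the seek rate and flat access time are assumed values chosen so that the service order (and the 28,000 cylinders of total movement) matches the table, not parameters given in the slides.

```python
# Elevator (SCAN) scheduling sketch over timed requests; seek takes
# 0.001 time units per cylinder plus a flat 5 units per access.
def elevator(arrivals, seek_per_cyl=0.001, access=5.0):
    """arrivals: (arrival_time, cylinder) pairs; returns service order."""
    arrivals = sorted(arrivals)
    t, head, up, i = 0.0, 0, True, 0
    pending, order = [], []
    while i < len(arrivals) or pending:
        while i < len(arrivals) and arrivals[i][0] <= t:
            pending.append(arrivals[i][1]); i += 1   # admit arrivals
        if not pending:
            t = arrivals[i][0]; continue             # idle until next arrival
        ahead = [c for c in pending if (c >= head) == up]
        if not ahead:
            up = not up; continue                    # reverse direction
        nxt = min(ahead, key=lambda c: abs(c - head))
        t += abs(nxt - head) * seek_per_cyl + access
        head = nxt
        pending.remove(nxt); order.append(nxt)
    return order

reqs = [(0, 2000), (0, 6000), (0, 14000), (10, 4000), (20, 16000), (30, 10000)]
print(elevator(reqs))   # [2000, 6000, 14000, 16000, 10000, 4000]
```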

Page 24:

Requests – Discussion

The elevator algorithm gives much better performance than FIFO on average
  And is a relatively fair algorithm
The elevator algorithm is not optimal
  The shortest-seek-first algorithm is closer to optimal but can result in a high variance in response time
  ▪ And may even result in starvation for distant requests
In some cases the elevator algorithm can perform worse than FIFO

Page 25:

Modifying a Record

To modify an existing record (on a disk) the following steps must be taken
  Read the record
  Modify the record in main memory
  Write the modified record back to disk
It is important to remember that the smallest unit of transfer to / from a disk is a block
  A single disk block usually contains many records

Page 26:

Read – Modify – Write Cycle

Read one block into main memory …

[Diagram: a disk block holding "Landis#winner#Phonak#..." among other records is copied into main memory.]

Page 27:

Read – Modify – Write Cycle

Read one block into main memory …
… modify the desired record …

[Diagram: in main memory the record becomes "Landis#disq.#none#..."; the block on disk still holds "Landis#winner#Phonak#...".]

Page 28:

Read – Modify – Write Cycle

Read one block into main memory …
… modify the desired record …
… and write it back.

[Diagram: the modified block is written back to disk, so "Landis#winner#Phonak#..." is replaced by "Landis#disq.#none#...".]
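
A minimal read-modify-write sketch against a plain file, assuming 4 KiB blocks and fixed-length 64-byte records (illustrative sizes; a real DBMS would go through its buffer manager instead).

```python
# Read one block, modify a record in memory, write the block back.
import os

BLOCK, REC = 4096, 64   # assumed illustrative sizes

def modify_record(path: str, block_no: int, slot: int, new_rec: bytes):
    assert len(new_rec) == REC
    fd = os.open(path, os.O_RDWR)
    try:
        buf = bytearray(os.pread(fd, BLOCK, block_no * BLOCK))  # read block
        buf[slot * REC : (slot + 1) * REC] = new_rec            # modify record
        os.pwrite(fd, bytes(buf), block_no * BLOCK)             # write block back
    finally:
        os.close(fd)
```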

Page 29:

Inserting Records

Consider creating a new record
  The user enters the data for the record through some application interface
  The record is created in main memory
  And then written to disk
Does this process require a read-modify-write cycle?
  YES! Because, otherwise, the existing contents of the disk block would be overwritten

Page 30:

Disk Failures

Intermittent failure
  Multiple attempts are required to read or write a sector
Media decay
  One or more bits are permanently corrupted and it is impossible to read the sector
Write failure
  A sector cannot be written to or retrieved
  ▪ Often caused by a power failure during a write
Disk crash
  The entire disk becomes unreadable

Page 31:

Checksums

An intermittent failure may result in incorrect data being read by the disk controller
  Such incorrect data can be detected by a checksum
  Each sector contains additional bits whose values are based on the data bits in the sector
A simple single-bit checksum is to maintain an even parity on the sector
  ▪ If there is an odd number of 1s the parity is odd
  ▪ If there is an even number of 1s the parity is even

Page 32:

Parity Bits

Assume that there are seven data bits and a single checksum bit
  Data bits 0111011 – parity is odd
  ▪ The checksum bit is set to 1 so that the overall parity is even
Using a single checksum bit allows errors of only one bit to be detected reliably
Several checksum bits can be maintained to reduce the chance of failing to notice an error
  e.g. maintain 8 checksum bits, one for each bit position in the data bytes
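
A sketch of the eight-checksum-bit idea: XOR-ing all the data bytes together yields one even-parity bit per bit position. The sector contents below are just an assumed example.

```python
# One even-parity bit per bit position, held in a single parity byte.
from functools import reduce

def parity_byte(data: bytes) -> int:
    """Bit i of the result is 1 iff bit i is set in an odd number of
    bytes; storing it makes every bit position even overall."""
    return reduce(lambda a, b: a ^ b, data, 0)

sector = b"example sector data"
stored = parity_byte(sector)
assert parity_byte(sector) == stored    # an intact sector passes
corrupt = b"examplE sector data"        # 'e' -> 'E' flips a single bit
assert parity_byte(corrupt) != stored   # single-bit errors are detected
```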

Page 33:

Stable Storage

Checksums can detect errors but can't correct them
Stable storage can be implemented on a disk to allow errors to be corrected
  Sectors are paired, with each pair representing a single sector
  Pairs are usually referred to as Left and Right
  ▪ Errors in a sector (L or R) are detected using checksums
Stable storage can cope with media failures and write failures

Page 34:

Stable Storage Policy

For writing, write the value of some sector X into XL
  Check that the value is correct (using checksums)
  If the value is not correct after a given number of attempts then assume that the sector has failed
  ▪ A spare sector should be substituted for XL
  Repeat the process for XR
For reading, XL and XR are read in turn until a correct value is returned
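
A minimal sketch of this policy, with an in-memory dict standing in for the paired sectors and a parity byte as the checksum; the names and the retry count of 3 are illustrative assumptions.

```python
disk = {}   # sector id -> (data, parity byte); a stand-in for real sectors

def parity(data: bytes) -> int:
    p = 0
    for b in data:
        p ^= b
    return p

def write_sector(sid, data): disk[sid] = (data, parity(data))

def read_ok(sid):
    data, p = disk.get(sid, (b"", -1))
    return data if parity(data) == p else None

def stable_write(value: bytes, left="XL", right="XR"):
    for copy in (left, right):           # write L first, verify, then R
        for _ in range(3):               # assumed number of attempts
            write_sector(copy, value)
            if read_ok(copy) is not None:
                break
        # after repeated failures a spare sector would be substituted here

def stable_read(left="XL", right="XR"):
    for copy in (left, right):           # first correct copy wins
        data = read_ok(copy)
        if data is not None:
            return data
    raise IOError("both copies corrupt")

stable_write(b"sector X payload")
assert stable_read() == b"sector X payload"
```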

Page 35:

RAID

Page 36:

Problems with Hard Disks

Hard disks act as bottlenecks for processing
  DB data is stored on disks, and must be fetched into main memory to be processed, and
  Disk access is considerably slower than main memory processing
There are also reliability issues with disks
  Disks contain mechanical components that are more prone to failure than electronic components
One solution is to use multiple disks

Page 37:

Multiple Disks

Single disk
  Multiple platters
  Disk heads are always over the same cylinder
Multiple disks
  Each disk contains multiple platters
  Disks can be read in parallel, and
  Different disks can read from different cylinders
  ▪ e.g. the first disk can access data from cylinder 6,000 while the second disk is accessing data from cylinder 11,000

Page 38:

Improving Efficiency

Using multiple disks to store data improves efficiency, as the disks can be read in parallel
To satisfy a request, the physical disks and disk blocks that the data resides on must be identified
  The data may be on a single disk, or it may be split over multiple disks
The way in which data is distributed over the disks affects the cost of accessing it
  In the same way that related data should be stored close to each other on a single disk

Page 39:

Data Striping

A disk array gives the user the abstraction of a single, large disk
  When an I/O request is issued, the physical disk blocks to be retrieved have to be identified
  How the data is distributed over the disks in the array affects how many disks are involved in an I/O request
Data is divided into partitions called striping units
  The striping unit is usually either a block or a bit
  Striping units are distributed over the disks using a round-robin algorithm

Page 40:

Striping

Notional file – the data is divided into striping units of a given size:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 …

The striping units are distributed across a RAID system in a round-robin fashion:
  disk 1: 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 …
  disk 2: 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 …
  disk 3: 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 …
  disk 4: 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 …

The size of the striping unit has an impact on the behaviour of the system
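
The round-robin distribution is one line of arithmetic: striping unit i goes to disk ((i - 1) mod D) + 1 when units and disks are numbered from 1. A small sketch reproducing the layout above:

```python
# Round-robin striping over a four-disk array.
D = 4
layout = {d: [] for d in range(1, D + 1)}
for unit in range(1, 25):
    layout[(unit - 1) % D + 1].append(unit)

for d in range(1, D + 1):
    print(f"disk {d}:", layout[d])
# disk 1: [1, 5, 9, 13, 17, 21] ... disk 4: [4, 8, 12, 16, 20, 24]
```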

Page 41:

Striping Units – Block Striping

Assume that a file is to be distributed across a four-disk RAID system, using block striping, and that, purely for the sake of illustration, the block size is just one byte!

Notional file – the numbers represent a sequence of individual bits in the file:
  Block 1: 1 2 3 4 5 6 7 8 | Block 2: 9 10 11 12 13 14 15 16 | Block 3: 17 18 19 20 21 22 23 24 | …

Distribute these bits across a 4-disk RAID system using BLOCK striping:
  Disk 1: 1 2 3 4 5 6 7 8 | 33 34 35 36 37 38 39 40 | 65 66 67 68 69 70 71 72 | …
  Disk 2: 9 10 11 12 13 14 15 16 | 41 42 43 44 45 46 47 48 | 73 74 75 76 77 78 79 80 | …
  Disk 3: 17 18 19 20 21 22 23 24 | 49 50 51 52 53 54 55 56 | 81 82 83 84 85 86 87 88 | …
  Disk 4: 25 26 27 28 29 30 31 32 | 57 58 59 60 61 62 63 64 | 89 90 91 92 93 94 95 96 | …

Page 42:

Striping Units – Bit Striping

Here is the same file to be distributed across a four-disk RAID system, this time using bit striping; again remember that, purely for the sake of illustration, the block size is just one byte!

Notional file – the numbers represent a sequence of individual bits in the file:
  Block 1: 1 2 3 4 5 6 7 8 | Block 2: 9 10 11 12 13 14 15 16 | Block 3: 17 18 19 20 21 22 23 24 | …

Distribute these bits across a 4-disk RAID system using BIT striping:
  Disk 1: 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 …
  Disk 2: 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 70 74 78 82 86 90 94 …
  Disk 3: 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 …
  Disk 4: 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84 88 92 96 …

Page 43:

Disk Array Performance

Assume that a disk array consists of D disks
  Data is distributed across the disks using data striping
How does it perform compared to a single disk?
  To answer this question we must specify the kinds of requests that will be made
  ▪ Random read – reading multiple, unrelated records
  ▪ Random write
  ▪ Sequential read – reading a number of records (such as one file or table) stored on more than D blocks
  ▪ Sequential write

Page 44:

The Basic Idea …

Use all D disks to improve efficiency, and distribute data using block striping
Random read performance
  Very good – up to D different records can be read at once
  ▪ Depending on which disks the records reside on
Random write performance – same as read performance
Sequential read performance
  Very good – as related data are distributed over all D disks, performance is D times faster than a single disk
Sequential write performance – same as read performance
But what about reliability …

Page 45:

Reliability

Hard disks contain mechanical components and are less reliable than other, purely electronic, components
Increasing the number of hard disks decreases reliability, reducing the mean-time-to-failure (MTTF)
  ▪ The MTTF of a single hard disk is 50,000 hours, or 5.7 years
In a disk array the overall MTTF decreases because the number of disks is greater
  The MTTF of a 100-disk array is 21 days: (50,000 / 100) / 24
  ▪ This assumes that failures occur independently, and
  ▪ That the failure probability does not change over time
Reliability is improved by storing redundant data

Page 46:

Redundancy

The reliability of a disk array can be improved by storing redundant data
If a disk fails, the redundant data can be used to reconstruct the data lost on the failed disk
  The redundant data can either be stored on a separate check disk, or
  Distributed uniformly over all the disks
Redundant data is typically stored using one of two methods
  Mirroring, where each disk is duplicated
  A parity scheme, where sufficient redundant data is maintained to recreate the data on any one disk
  Other redundancy schemes provide greater reliability

Page 47:

Parity Scheme

For each bit on the data disks there is a parity bit on a check disk
  If the sum of the data disks' bits is even, the parity bit is set to zero
  If the sum of the bits is odd, the parity bit is set to one
The data on any one failed disk can be recreated bit by bit

Four-data-disk system showing individual bit values:
  0 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 1 …
  1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 …
  0 1 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 …
  0 0 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 …
Fifth check disk containing parity data:
  1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1 1 …
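
The scheme in code: the check disk is the bitwise XOR of the data disks, and XOR-ing the survivors with the check disk rebuilds any one failed disk. The bit rows below are the first eight columns of the table above.

```python
def xor_rows(rows):
    """Bitwise XOR across equal-length bit lists (even parity)."""
    out = rows[0][:]
    for row in rows[1:]:
        out = [a ^ b for a, b in zip(out, row)]
    return out

data = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 0, 1],
]
check = xor_rows(data)         # [1, 0, 1, 1, 1, 0, 1, 0], as in the table

# Disk 2 fails: XOR the surviving data disks with the check disk.
rebuilt = xor_rows([data[0], data[2], data[3], check])
assert rebuilt == data[1]
```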

Page 48:

Parity Scheme Read and Write

Reading
  The parity scheme does not affect reading
Writing
  A naïve approach would be to calculate the new value of the parity bit from all the data disks
  A better approach is to compare the old and new values of the disk that is written to
  ▪ And change the value of a parity bit if the corresponding bits have changed
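
The better approach amounts to an XOR identity: a parity bit flips exactly when the corresponding data bit changed, so new parity = old parity XOR old data XOR new data. The bit patterns below are assumed example values.

```python
# Update parity from the changed block alone, without reading the
# other data disks.
def updated_parity(old_parity: int, old_data: int, new_data: int) -> int:
    return old_parity ^ old_data ^ new_data

# With data bits 0b0110 and 0b0101 the parity is 0b0011; rewriting the
# first block to 0b1111 must leave the parity equal to 0b1111 ^ 0b0101.
assert updated_parity(0b0011, 0b0110, 0b1111) == 0b1111 ^ 0b0101
```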

Page 49:

Introducing RAID

A RAID system consists of several disks organized to increase performance and improve reliability
  Performance is improved through data striping
  Reliability is improved through redundancy
RAID stands for Redundant Arrays of Independent Disks
There are several RAID schemes, or levels
  The levels differ in terms of their
  ▪ Read and write performance,
  ▪ Reliability, and
  ▪ Cost

Page 50:

RAID Level 0

All D disks are used to improve efficiency, and data is distributed using block striping
No redundant information is kept
  Read and write performance is very good
  But reliability is poor
Unless data is regularly backed up, a RAID 0 system should only be used when the data is not important
A RAID 0 system is the cheapest of all RAID levels
  As there are no disks used for storing redundant data

Page 51:

Level 1: Mirrored

An identical copy is kept of each disk in the system, hence the term mirroring
Read performance is similar to a single disk
  There is no data striping, but parallel reads of the duplicate disks can be made, which improves random read performance
Write performance is worse than a single disk, as the duplicate disk has to be written to
  Writes to the original and mirror should not be performed simultaneously, in case there is a global system failure
  But write performance is superior to most other RAID levels
Very reliable but costly
  With D data disks, a level 1 RAID system has 2D disks

Page 52:

Level 1+0: Striping + Mirroring

Sometimes referred to as RAID level 10; combines both striping and mirroring
Very good read performance
  Similar to RAID level 0
  2D times the speed of a single disk for sequential reads
  Up to 2D times the speed of a single disk for random reads
  Allows parallel reads of blocks that, conceptually, reside on the same disk
Poor write performance
  Similar to RAID level 1
Very reliable, but the most expensive RAID level

Page 53:

Writing and Redundant Data

Writing data is the Achilles heel of RAID systems
  Data and check disks should not be written to simultaneously
  Parity information may have to be read before check disks can be written to
In many RAID systems writing is less efficient than with a single disk!

Page 54:

Writing Parity Data

Sequential writes, or random writes in a RAID system using bit striping:
  Write to all D data disks, using a read-modify-write cycle
  Calculate the parity information from the written data
  Write to the check disk(s)
  ▪ A read-modify-write cycle is not required
Random writes in a system using block striping:
  Write to the data disk using a read-modify-write cycle
  Read the check disk(s), and calculate the new parity data
  Write to the check disk(s)
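
Putting the block-striping case together: a single random write costs four I/Os (read old data, read old parity, write new data, write new parity). A sketch with a dict of "disks" standing in for the array; the bit values are assumed examples.

```python
# One random write under block striping with a parity disk.
disks = {"d1": {0: 0b0110}, "d2": {0: 0b0101}, "check": {0: 0b0011}}

def random_write(data_disk: str, addr: int, new: int):
    old = disks[data_disk][addr]                  # 1. read old data block
    old_parity = disks["check"][addr]             # 2. read old parity block
    disks[data_disk][addr] = new                  # 3. write the new data
    disks["check"][addr] = old_parity ^ old ^ new # 4. write the new parity

random_write("d1", 0, 0b1111)
assert disks["check"][0] == 0b1111 ^ 0b0101   # parity of the new contents
```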

Page 55:

Striping Units Performance

A RAID system with D disks can read data up to D times faster than a single-disk system
For sequential reads there is no performance difference between bit striping and block striping
Block striping is more efficient for random reads
  With bit striping all D disks have to be read to recreate a single record (and block) of the data file
  With block striping, a complete record is stored on one disk, so only one disk is required to satisfy a single random read
Write performance is similar, except that it is also affected by the parity scheme

Page 56:

Levels 2 and 3

Level 2 does not use the standard parity scheme
  It uses a scheme that also allows the failed disk to be identified, increasing the number of disks required
  However, the failed disk can be detected by the disk controller, so this is unnecessary
  Can tolerate the loss of a single disk
Level 3 is Bit-Interleaved Parity
  The striping unit is a single bit
  Random read and write performance is poor, as all disks have to be accessed for each request
  Can tolerate the loss of a single disk

Page 57:

RAID Level 4

Uses block striping to distribute data over disks
Uses one redundant disk containing parity data
  The ith block on the redundant disk contains parity checks for the ith blocks of all data disks
Good sequential read performance
  D times single-disk speed
Very good random read performance
  Disks can be read independently, up to D times single-disk speed

Page 58:

RAID Level 4: Writing

When data is written, the affected block and the redundant disk must both be written to
To calculate the new value of the redundant disk
  Read the old value of the changed block
  Read the corresponding redundant disk block
  Write the new data block
  Recalculate the block of the redundant disk
To recalculate the redundant data, consider the changes in the bit pattern of the written data block

Page 59:

RAID Level 4: Performance

Cost is moderate
  Only one check disk is required
The system can tolerate the loss of one drive
Write performance is poor for random writes
  Where different data disks are written independently
  For each such write a write to the redundant disk is also required
Performance can be improved by distributing the redundant data across all disks – RAID level 5

Page 60:

Level 5: Block-Interleaved Distributed Parity

The dedicated check disk in RAID level 4 tends to act as a bottleneck for random writes
RAID level 5 does not have a dedicated check disk but distributes the parity data across all disks
  This removes the bottleneck, thus increasing the performance of random writes
  Sequential write performance is similar to level 4
Cost is moderate, with the same effective space utilization as level 4
The system can tolerate the loss of one drive

Page 61:

Multiple Disk Crashes

RAID levels 4 and 5 can only cope with single disk crashes
  Therefore if multiple disks crash at the same time (or before a failed disk can be replaced) data will be lost
RAID level 6 allows systems to deal with multiple disk crashes
  These systems use more sophisticated error-correcting codes
  One of the simpler error-correcting codes is the Hamming code

Page 62:

Hamming Code

Consider a system with seven disks, identified by the numbers 1 to 7
  Four of the disks are data disks, disks 1 to 4
  Three of the disks are redundant disks, disks 5 to 7
Each of the three check disks contains parity data for three of the four data disks
  Disk 5 contains parity data for disks 1, 2 and 3
  Disk 6 contains parity data for disks 1, 2 and 4
  Disk 7 contains parity data for disks 1, 3 and 4

Page 63:

Hamming Code Example

        Data          Redundant Data
Disk:   1  2  3  4    5 (1,2,3)  6 (1,2,4)  7 (1,3,4)
        1  1  1  0    1          0          0
        1  1  0  1    0          1          0
        1  0  1  1    0          0          1

Page 64:

RAID Level 6

Reads are performed as normal
  Only the data disks are used
Writes are performed in a similar way to RAID level 4
  Except that multiple redundant disks may be involved
Cost is high, as more check disks are required

Page 65:

RAID Level 6 Recovery

If one disk fails, use the parity data to restore the failed disk, as in level 4
If two disks fail then both disks can be rebuilt using three of the other disks, e.g.
  If disks 1 and 2 fail
  ▪ Rebuild disk 1 using disks 3, 4 and 7
  ▪ Rebuild disk 2 using disks 1, 3 and 5
  If disks 3 and 5 fail
  ▪ Rebuild disk 3 using disks 1, 4 and 7
  ▪ Rebuild disk 5 using disks 1, 2 and 3
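
A sketch of the two-disk recovery above for one bit row of the 7-disk layout; the group memberships come from the Hamming code slide, and the bit values are the first row of the example table.

```python
# Check disk 5 = XOR of disks 1,2,3; disk 6 = 1,2,4; disk 7 = 1,3,4.
GROUPS = {5: (1, 2, 3), 6: (1, 2, 4), 7: (1, 3, 4)}

def xor(values):
    out = 0
    for v in values:
        out ^= v
    return out

disks = {1: 1, 2: 1, 3: 1, 4: 0}                 # first row of the example
for check, members in GROUPS.items():
    disks[check] = xor(disks[d] for d in members)

# Disks 1 and 2 fail; rebuild exactly as on the slide:
survivors = {d: v for d, v in disks.items() if d not in (1, 2)}
survivors[1] = xor([survivors[3], survivors[4], survivors[7]])  # group (1,3,4)
survivors[2] = xor([survivors[1], survivors[3], survivors[5]])  # group (1,2,3)
assert survivors[1] == disks[1] and survivors[2] == disks[2]
```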

Page 66:

Parity Scheme and Reliability

In real-life RAID systems the disk array is partitioned into reliability groups
  A reliability group consists of a set of data disks and a set of check disks
  The number of check disks depends on the reliability level that is selected
Consider a RAID system with 100 data disks and 10 check disks, i.e. 10 reliability groups
  The MTTF is increased from 21 days to 250 years!

Page 67:

Which RAID Level

Level 0 improves performance at the lowest cost but does not improve reliability
Level 1+0 is better than level 1 and has the best write performance
Levels 2 and 4 are always inferior to 3 and 5
  Level 3 is good for large transfer requests of several contiguous blocks, but bad for many small requests of a single disk block
Level 5 is a good general-purpose solution
Level 6 is appropriate if higher reliability is required
In practice the choice is usually between 0, 1, and 5

Page 68:

RAID Levels Comparison

The table that follows compares RAID levels using RAID level 0 as a baseline
Comparisons of RAID systems vary depending on the metric used and how it is measured
  The three primary metrics are reliability, performance and cost
  These can be measured in I/Os per second, bytes per second, response time and so on
The comparison uses throughput per dollar for systems of equivalent file capacity
  File capacity is the amount of information that can be stored on the system, which excludes redundant data

Page 69:

RAID Levels Comparison

Level  | Random Read | Random Write  | Sequential Read | Sequential Write | Storage Efficiency
RAID 0 | 1           | 1             | 1               | 1                | 1
RAID 1 | 1           | ½             | 1               | ½                | ½
RAID 3 | 1/G         | 1/G           | (G-1)/G         | (G-1)/G          | (G-1)/G
RAID 4 | (G-1)/G     | max(1/G, ¼)   | (G-1)/G         | (G-1)/G          | (G-1)/G
RAID 5 | 1           | max(1/G, ¼)   | 1               | (G-1)/G          | (G-1)/G
RAID 6 | 1           | max(1/G, 1/6) | 1               | (G-2)/G          | (G-2)/G

G refers to the number of disks in a reliability group (both data disks and check disks)
RAID levels 10 and 2 are not shown

Page 70:

Solid State Drives

Page 71:

Solid State Drives

Solid State Drives (SSDs) use NAND flash memory and do not contain moving parts like an HDD
  Accessing an SSD does not require seek time or rotational latency, and SSDs are therefore considerably faster
Flash memory is non-volatile memory that is used by smartphones, mp3 players and thumb (or USB) drives
  The NAND flash architecture is similar to a NAND (negated AND) logic gate, hence the name
  NAND flash is only able to read and write data one page at a time
There are two types of SSD
  Multi-level cell (MLC)
  Single-level cell (SLC)

Page 72:

MLC SSD

MLC cells can store multiple different charge levels
  And therefore more than one bit
  ▪ With four charge levels a cell can store 2 bits
Multiple threshold voltages make reading more complex but allow more data to be stored per cell
MLC SSDs are cheaper than SLC SSDs
  However, write performance is worse
  And their lifetimes are shorter

Page 73:

SLC SSD

SLC cells can only store a single charge level
  They are therefore on or off, and can contain only one bit
SLC drives are less complex
  They are more reliable and have a lower error rate
  They are faster, since it is easier to read or write a single charge value
SLC drives are more expensive
  And are typically used for enterprise rather than home use

Page 74:

SSD Performance

Reads are much faster than HDDs since there are no moving parts
Writes are also faster than HDDs
  However, flash memory must be erased before it is written, and entire blocks must be erased
  ▪ This leads to what is referred to as write amplification
The performance increase is greatest for random reads

Page 75:

Storing Data on a Disk

Page 76:

DBMS Structure

[Diagram: the components of the DBMS (query evaluation, the file and access manager, the transaction and lock manager, the recovery manager, the buffer manager, and the disk space manager) sitting above the database itself.]

Page 77:

Accessing Data

When an SQL command is evaluated, a request may be made for a DB record
Such a request is passed to the buffer manager
If the record is not stored in the (main memory) buffer, the page must be fetched from disk
The disk space manager provides routines for allocating, de-allocating, reading and writing pages

Page 78:

Disk Space Management

The disk space manager (DSM) keeps track of available disk space
  It is the lowest level of the DBMS architecture
  It supports the allocation and de-allocation of disk pages
Pages are abstract units of storage, mapped to disk blocks
  Reading or writing a page is performed in one disk I/O
  Sequences of pages are allocated to a contiguous sequence of blocks to increase access speed
The DSM hides the underlying details of storage
  Allowing higher-level processes to consider the data to be a collection of pages

Page 79:

Tracking Free Blocks

A DB increases and decreases in size over time
  In addition to mapping pages to blocks, the DSM has to record which disk blocks are in use
  As time goes on, gaps in sequences of allocated blocks appear
Free blocks need to be recorded so that they can be allocated in the future, using either
  A linked list, whose head points to the first free block, or
  A bitmap, where each bit corresponds to a single block
  ▪ A bitmap allows for fast identification, and therefore allocation, of contiguous areas of free space
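
A free-block bitmap sketch: one bit per block, with a linear scan for a run of set bits to find contiguous free space. The structure is assumed from the slide; a real DSM would pack the bits and cache allocation hints.

```python
class FreeMap:
    def __init__(self, n_blocks: int):
        self.bits = [1] * n_blocks           # 1 = free, 0 = allocated

    def allocate_run(self, n: int) -> int:
        """Claim n contiguous free blocks; return the first block number."""
        run = 0
        for i, free in enumerate(self.bits):
            run = run + 1 if free else 0
            if run == n:
                start = i - n + 1
                self.bits[start : i + 1] = [0] * n
                return start
        raise RuntimeError(f"no run of {n} contiguous free blocks")

    def release(self, start: int, n: int):
        self.bits[start : start + n] = [1] * n

fm = FreeMap(16)
print(fm.allocate_run(4))    # 0: blocks 0-3 allocated for a page sequence
fm.release(1, 2)             # blocks 1 and 2 become a gap of free space
```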

Page 80:

Using the OS as the DSM

An OS is required to manage space on a disk
  Typically an OS abstracts a file as a sequence of bytes
While it is possible to build a DSM on top of the OS, many DBMSs perform their own disk management
  This makes the DBMS more portable across platforms
  Using the OS may impose technical limitations, such as a maximum file size
  In addition, OS files cannot typically be stored on separate disks, which may be necessary in a DBMS

Page 81:

Record and Page Format

Page 82:

Record Formats

Attributes, or fields, must be organized within records
  Information that is common to all records of a particular type is stored in the system catalog
  ▪ Including the number and type of fields
Records of a single table can vary from each other
  In addition to differences in data (obviously)
  Different records may contain different numbers of fields, or
  Fields of varying length

Page 83:

Examples of Field Types

INTEGER, represented by two or four bytes
FLOAT, represented by four or eight bytes
CHAR(n), fixed-length character strings of n bytes
  Unused characters are occupied by a pad character
  ▪ e.g. if a CHAR(5) stored "elm" it would be stored as elm followed by two pad characters
VARCHAR(n), character strings of varying lengths
  Stored as arrays of n+1 bytes
  ▪ i.e. even though a VARCHAR's contents can vary, n+1 bytes are dedicated to them
  The length of a VARCHAR is stored in the first byte, or
  Its end is specified by a null character

Page 84:

Fixed Length Records

The fields are a fixed length, and the number of fields is fixed
  Fields may then be stored consecutively
  And, given the address of a record, the address of a particular field can be found
  ▪ By referring to the field size in the system catalog
It is common to begin all fields at a multiple of 4 or 8 bytes

Page 85:

Fixed Length Record Format

Consider an employee record: {name CHAR(30), address VARCHAR(255), salary FLOAT}
Fields can be found by looking up the field size in the schema and performing an offset calculation

[Diagram: a header (pointer to schema, length, timestamp) occupying bytes 0 to 12, followed by name at offset 12, address at offset 44, and salary at offset 300, with the record ending at 308.]
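
An offset-calculation sketch for this record, using the byte offsets shown above (12, 44, 300, 308); the header layout, padding, and pad byte are assumptions for illustration.

```python
import struct

# field -> (offset, size): name CHAR(30) padded to 32 bytes at offset 12,
# address VARCHAR(255) in 256 bytes at 44, salary FLOAT (8 bytes) at 300.
OFFSETS = {"name": (12, 32), "address": (44, 256), "salary": (300, 8)}

def read_field(record: bytes, field: str):
    start, size = OFFSETS[field]
    raw = record[start : start + size]
    if field == "salary":
        return struct.unpack("d", raw)[0]
    return raw.rstrip(b"\x00")       # strip assumed pad characters

rec = bytearray(308)                 # 12-byte header + fields = 308 bytes
rec[12:17] = b"Smith"
rec[300:308] = struct.pack("d", 55000.0)
print(read_field(bytes(rec), "name"), read_field(bytes(rec), "salary"))
```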

Page 86:

Variable Length Fields

In the relational model each record contains the same number of fields
  However, fields may be of variable length
If a record contains both fixed and variable length fields, store the fixed length fields first
  The fixed length fields are then easy to locate
To store variable length fields, include additional information in the record header
  The length of the record
  Pointers to the beginning of each variable length field
  A pointer to the end of the record

Page 87:

Variable Length Record

Consider an employee record with name and salary being fixed length and address being variable length

[Diagram: a header (other header information, record length, address pointer) followed by the fixed-length name and salary fields, then the variable-length address field, ending at the end of the record.]

The pointer to the first variable length field may be omitted

Page 88:

Repeating Fields

Records in a relational DB have the same number of fields
  But it is possible to have repeating fields
  For example, a many-to-many relationship in a record that represents an object
  References to other objects will have to be stored
  ▪ The references (or pointers) to other objects suggest that different records will have different lengths
There are three alternatives for recording such data

Page 89:

Storing Repeating Fields

Store the entire record in one block
  Maintain a pointer to the first reference
Store the fixed length portion in one location, and the variable length portion in another
  The header contains a pointer to the variable length portion (the references to other objects), and
  The number of such objects
Store a fixed length record with a fixed number of occurrences of the repeating fields, and
  A pointer to (and count of) any additional occurrences

Page 90:

Fixed vs. Variable Length

There are many advantages to keeping records (and therefore fields) fixed length
  More efficient search
  Lower overhead (the header contains less data)
  Easier to move records around
The main advantage of using variable length fields is that they can save space
  This can result in fewer disk I/O operations

Page 91:

Variable Length Field Issues

Modifying a variable length field in a record may make it larger
  Later fields in the same record have to be moved, and
  Other records may also have to be moved
When a variable length field is modified, the record's size may increase to the extent that it no longer fits on the page
  The record must then be moved to another page, but
  A "forwarding address" has to be maintained on the old page, so that external references to the rid are still valid
A record may grow larger than the page size
  The record must then be broken into sections and connected by pointers

Page 92:

Forwarding Addresses

Forwarding addresses may need to be maintained
  When a record grows too large, or
  When records are maintained in order (clustered)
When maintaining ordered data
  Provide a forwarding address if a record has to be moved to a new page to maintain the ordering
  Delete records by inserting a NULL value, or tombstone pointer, in the header
  The record slot can be re-used when another record is inserted

Page 93:

Insertion

The numbers represent the primary keys of the records

[Diagram: a page holding records with keys 1, 3, and 8; keys 6, 17, and 21 are inserted in turn, with records moved as needed to keep the pages in key order.]

Page 94:

Tombstone Pointer

[Diagram: a page holding records with keys 1, 3, and 8; record 3 is deleted, leaving a tombstone in its slot, and a record with key 5 is then inserted, re-using the slot.]

Page 95:

Other Data Types

There are other data types that require special treatment in terms of record storage
  Pointers and reference variables
  Large objects such as text, images, video, sound, etc.

Page 96:

Records with Pointers

If a record represents an object, the object may contain pointers to, or addresses of, some other object
  Such pointers need to be managed by the DBMS
A data item may have two addresses
  A database address on disk, usually 8 bytes
  A memory address in main memory, usually 4 bytes
When an item is on disk (i.e. in secondary storage) its database address must be used
When an item is in the buffer pool it can be referred to by either its database or its memory address
  It is more efficient to use the memory address

Page 97:

Translation Tables

Database addresses of items in main memory should be translated to their current memory addresses
  To avoid unnecessary disk I/O
It is possible to create a translation table that maps database addresses to memory addresses
  However, when using such a table, addresses may have to be repeatedly translated
  Whenever a pointer of a record in main memory is accessed, the translation table must be used
Pointer swizzling is used to avoid repeated translation table look-ups

Page 98:

Pointer Swizzling

Whenever a block is moved from secondary to main memory, pointers in that block may be swizzled
  i.e. translated from the database address to the memory address
A pointer in main memory consists of
  A bit that indicates whether the pointer is a database or a memory address, and
  The memory address (four bytes) or database address (8 bytes) as appropriate
  ▪ Space is always reserved for the database address
There are several strategies to decide when a pointer should be swizzled

Page 99:

Pointer Swizzling Example

[Diagram: Blocks 1 and 2 on disk; Block 1 is read into memory, after which a pointer into the in-memory block is swizzled, while a pointer into Block 2, still on disk, remains unswizzled.]

Page 100:

Swizzling Strategies

When a new block is brought into main memory, pointers related to that block may be swizzled
  The block may contain pointers to records in the same block or in other blocks, and
  Pointers in records in other blocks, already in main memory, may point to records in the newly copied block
There are four main swizzling strategies
  Automatic swizzling
  Swizzling on demand
  No swizzling – i.e. just use the translation table
  Programmer-controlled swizzling – when access patterns are known

Page 101:

Automatic Swizzling

Enter the address of the block and its records into the translation table
Enter the address of any pointers in the records in the block into the translation table
  If such an address is already in the table, swizzle the pointer, giving it the appropriate memory address
  If the address is not already in the table, copy its block into memory and swizzle the pointer
This ensures that all pointers in the new block are swizzled when the block is loaded, which may save time
  However, it is possible that some of the pointers may never be followed, in which case the time spent swizzling them is wasted

Page 102:

Swizzling on Demand

Enter the address of the block and its records into the translation table
Leave all pointers in the block unswizzled
When an unswizzled pointer is followed, look up the address in the translation table
  If the address is in the table, swizzle the pointer
  If the address is not in the table, copy the appropriate block into main memory, and then swizzle the pointer
Unlike automatic swizzling, this strategy does not result in unnecessary swizzling
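
A sketch of on-demand swizzling using the pointer representation from the earlier slide (a flag bit plus an address); the record structure and the loader stub are illustrative stand-ins, not a real buffer manager.

```python
from dataclasses import dataclass
from typing import Optional

translation = {}                      # database address -> in-memory record

@dataclass
class Pointer:
    db_addr: int                      # space is always reserved for this
    swizzled: bool = False            # the database/memory flag bit
    mem_ref: Optional[dict] = None    # a Python reference stands in for
                                      # the 4-byte memory address

def load_block_containing(db_addr: int) -> dict:
    """Stub loader: a real system reads the block into the buffer pool."""
    rec = {"db_addr": db_addr}
    translation[db_addr] = rec
    return rec

def follow(p: Pointer) -> dict:
    if p.swizzled:
        return p.mem_ref              # fast path: no table lookup
    rec = translation.get(p.db_addr) or load_block_containing(p.db_addr)
    p.swizzled, p.mem_ref = True, rec # swizzle only when first followed
    return rec

ptr = Pointer(db_addr=0xABCD)
assert follow(ptr) is follow(ptr)     # the second call uses the swizzled pointer
```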

Page 103:

Returning Blocks to Disk

When a block is written to disk, its pointers must first be unswizzled
  That is, the pointers to memory addresses must be replaced by the appropriate database addresses
The translation table can be searched (by memory address) to find the database address
  This is potentially time-consuming
  The translation table should therefore be indexed to allow efficient lookup of both memory and database addresses

Page 104:

Pinned Records and Blocks

Pointer swizzling may result in blocks being pinned
  A block is pinned if it cannot safely be written back to disk
A block that is pointed to by a swizzled pointer should be pinned
  Otherwise, the pointer can no longer be followed to the block at the specified memory address
If a block is unpinned, pointers to it must be unswizzled
  The translation table must also include the memory addresses of pointers that refer to an entry
  ▪ As a linked list attached to an entry in the translation table, or
  ▪ As a (pointer to a) linked list in the record's pointer field

Page 105:

Large Object Blocks

How are large data objects stored in records?
  Video clips, sound files, or the text of a book
LOB data types store and manipulate large blocks of unstructured data
  Tables can contain multiple LOB columns
  The maximum size of a LOB is large
  ▪ At least 8 terabytes in Oracle 10g
  LOB data must be processed by application programs
LOB data is stored as either binary or character data
  BLOB – unstructured binary data
  CLOB, NCLOB – character data
  BFILE – unstructured binary data in OS files

Page 106:

LOB Storage

LOBs have to be stored on a sequence of blocks
 Ideally the blocks should be contiguous for efficient retrieval, but
 It is possible to store the LOB on a linked list of blocks
 ▪ Where each block contains a pointer to the next block
If fast retrieval of LOBs is required they can be striped across multiple disks for parallel access
It may be necessary to provide an index to a LOB
 For example, indexing a movie by seconds allows a client to request small portions of the movie

Page 107:

Page Formats

Records are organized on pages
Pages can be thought of as a collection of slots, each of which contains a single record
A record can be identified by its record id (rid)
 ▪ The rid is the {page ID, slot number} pair
Before considering different organizations for managing slots it is important to know whether records are fixed length or variable length
For fixed length records there are two organizations, based on how records are deleted

Page 108:

Packed Page Format

Records are stored consecutively in slots
When a record is deleted, the last record on the page is moved into the vacated slot
Records are found by an offset calculation (see the sketch below)
All empty space is at the bottom of the page
But the rid includes the slot number, so as records are moved, external references to them become invalid

[Diagram: a packed page with records in slots 1 to N stored consecutively, free space below them, and the number of records, N, kept in the page footer]
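A minimal sketch of a packed page, assuming fixed-length records stored in a bytearray and a record count kept in the footer; all names are illustrative:

    # Packed page of fixed-length records (illustrative sketch).

    class PackedPage:
        def __init__(self, page_size, record_size):
            self.data = bytearray(page_size)
            self.record_size = record_size
            self.n = 0                      # record count, kept in the footer

        def offset(self, slot):
            """Slot number -> byte offset: a pure calculation, no lookup."""
            return slot * self.record_size

        def read(self, slot):
            start = self.offset(slot)
            return bytes(self.data[start:start + self.record_size])

        def insert(self, record):
            # record is assumed to be exactly record_size bytes
            self.data[self.offset(self.n):self.offset(self.n + 1)] = record
            self.n += 1
            return self.n - 1               # slot number of the new record

        def delete(self, slot):
            """Move the last record into the vacated slot, keeping the page
            packed. The moved record's slot number, and hence its rid,
            changes, which is what invalidates external references."""
            last = self.n - 1
            if slot != last:
                self.data[self.offset(slot):self.offset(slot + 1)] = self.read(last)
            self.n -= 1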

Page 109:

Unpacked Page Format

The page header contains a bitmap
 Each bit represents a single slot
 A slot's bit is turned off when the slot is empty
New records are inserted in empty slots
A record's slot number, and hence its rid, doesn't change (see the sketch below)

[Diagram: an unpacked page with M slots, some occupied and some empty, and a footer holding the number of slots, M, and a bitmap showing slot occupancy]
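A corresponding sketch of an unpacked page, with the bitmap modelled as a list of booleans; this is illustrative, not any particular DBMS's layout:

    # Unpacked page: occupancy bitmap, stable slot numbers (sketch).

    class UnpackedPage:
        def __init__(self, num_slots):
            self.slots = [None] * num_slots
            self.bitmap = [False] * num_slots   # True = slot occupied

        def insert(self, record):
            for slot, used in enumerate(self.bitmap):
                if not used:                    # first empty slot
                    self.slots[slot] = record
                    self.bitmap[slot] = True
                    return slot                 # slot number never changes
            raise RuntimeError("page full")

        def delete(self, slot):
            self.bitmap[slot] = False           # just turn the bit off
            self.slots[slot] = None

Because deletion only clears a bit, rids stay valid for the records that remain, at the cost of free space being scattered across the page.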

Page 110:

Variable Length Records

With variable length records a page cannot be divided into fixed length slots
 If a new record is larger than a vacated slot it cannot be inserted there
 If it is smaller, space is wasted
 To avoid wasting space, records must be moved so that all the free space is contiguous, without changing their rids
One solution is to maintain a directory of page slots at the end of each page, which contains
 A pointer (an offset value) to each record, and
 The length of each record

Page 111:

Organizing Variable Length Records

Pointers are offsets to records
 Moving a record on the page has no impact on its rid
 Its pointer changes but its slot number does not
A pointer to the start of the free space is required
Records are deleted by setting the offset to -1
New records can be inserted in vacant slots
Pages should be periodically reorganized to remove gaps
The directory "grows" into the free space (see the sketch below)

[Diagram: a page holding three records of lengths 24, 16 and 20 bytes, a pointer to the start of the free space, and a slot directory at the end of the page recording the number of slots, N, and an offset and length for each record]
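A minimal sketch of the slot directory scheme, assuming each directory entry is an (offset, length) pair and that an offset of -1 marks a deleted record; reclaiming the freed bytes is left to the periodic compact step:

    # Slotted page for variable-length records (illustrative sketch).

    class SlottedPage:
        def __init__(self, page_size):
            self.data = bytearray(page_size)
            self.free_ptr = 0           # start of free space
            self.directory = []         # slot number -> [offset, length]

        def insert(self, record):
            # Capacity checks omitted; new bytes always go at free_ptr,
            # and compact() reclaims holes left by deletions.
            offset = self.free_ptr
            self.data[offset:offset + len(record)] = record
            self.free_ptr += len(record)
            for slot, entry in enumerate(self.directory):
                if entry[0] == -1:                  # reuse a vacant slot
                    self.directory[slot] = [offset, len(record)]
                    return slot
            self.directory.append([offset, len(record)])
            return len(self.directory) - 1          # directory grows

        def read(self, slot):
            offset, length = self.directory[slot]
            if offset == -1:
                raise KeyError("record deleted")
            return bytes(self.data[offset:offset + length])

        def delete(self, slot):
            self.directory[slot][0] = -1    # slot number stays reusable

        def compact(self):
            """Periodic reorganization: slide live records together so all
            free space is contiguous; slot numbers (and rids) are unchanged,
            only the offsets in the directory are updated."""
            new_data, pos = bytearray(len(self.data)), 0
            for entry in self.directory:
                if entry[0] != -1:
                    new_data[pos:pos + entry[1]] = self.data[entry[0]:entry[0] + entry[1]]
                    entry[0] = pos
                    pos += entry[1]
            self.data, self.free_ptr = new_data, pos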

Page 112:

Files and Records

A page can be considered as a collection of records
Pages containing related records are organized into collections, or files
 One file usually represents a single table
One file may span several pages
 It is therefore necessary to be able to access all of the pages that make up a file
The basic file structure is a heap file

Page 113:

Heap Files

Heap files are not ordered in any way
 But they do guarantee that all of the records in a file can be retrieved by repeatedly requesting the next record
 Each record in a file has a unique record ID (rid)
 And each page in the file is the same size
Heap files support the following operations:
 Creating and destroying files
 Inserting and deleting records
 Scanning all the records in the file
To support these operations it is necessary to:
 Keep track of the pages in the file
 Keep track of which of those pages contain free space

Page 114:

Heap File Organization 1

Maintain the heap file as a pair of doubly linked lists of pages
 One list for pages with free space, and
 One list for pages that are full
The DBMS can record the first page of each list in a table with one entry per file
If records are of variable length, most pages will end up on the list of pages with free space
 It may be necessary to search several pages on the free space list to find one with enough room

Page 115:

Heap File Organization 2

Maintain the heap file as a directory of pages
 Each directory entry identifies a page (or a sequence of pages) in the heap file
 The entries are kept in data page order, and each records either:
  Whether or not the page is full, or
  The amount of free space on the page
 ▪ If the amount of free space is recorded there is no need to visit a page to determine whether it contains enough space (see the sketch below)
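A minimal sketch of the directory organization, reusing the SlottedPage sketch above and recording the free space per page; for simplicity the bookkeeping ignores directory overhead within each page:

    # Heap file as a directory of pages (illustrative sketch; assumes the
    # SlottedPage class sketched earlier).

    class HeapFile:
        def __init__(self, page_size=4096):
            self.page_size = page_size
            self.pages = []
            self.free_space = []        # directory: free bytes per page

        def insert(self, record):
            # The directory alone tells us which page has room:
            # no page needs to be read just to test for space.
            for page_no, free in enumerate(self.free_space):
                if free >= len(record):
                    slot = self.pages[page_no].insert(record)
                    self.free_space[page_no] -= len(record)
                    return (page_no, slot)          # the record's rid
            page = SlottedPage(self.page_size)
            self.pages.append(page)
            self.free_space.append(self.page_size - len(record))
            return (len(self.pages) - 1, page.insert(record))

        def scan(self):
            """Retrieve every record by repeatedly requesting the next one."""
            for page_no, page in enumerate(self.pages):
                for slot, (offset, _) in enumerate(page.directory):
                    if offset != -1:
                        yield (page_no, slot), page.read(slot)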

Page 116:

Managing Data in Main Memory

Page 117:

The Buffer Manager

The buffer manager is responsible for bringing pages from disk to main memory as required
 Main memory is partitioned into a collection of pages called the buffer pool
 Main memory pages are referred to as frames
 Other processes must tell the buffer manager when a page is no longer required and whether or not it has been modified
A DB may be many times larger than the buffer pool
 Accessing the entire DB (or performing queries that require joins) can easily fill up the buffer pool
 ▪ When the buffer pool is full, the buffer manager must decide which pages to replace by following a replacement policy

Page 118:

Buffer Pool Management

[Diagram: programs request pages from the buffer manager, which manages a buffer pool of frames in main memory; disk pages are read from, and written back to, the database on disk, with free frames available to hold incoming pages]

Page 119:

Page Frames

Buffer pool frames are the same size as disk pages
The buffer manager records two pieces of information for each frame
 dirty bit – on if the page has been modified
 pin-count – the number of times the page has been requested but not released

[Diagram: a data page held in a main memory frame, with its dirty bit and pin count stored alongside]

Page 120:

Requesting (Allocating) a Page

If the page is already in the buffer pool
 Increment the frame's pin-count (called pinning)
Otherwise
 Choose a frame to replace (using the replacement policy)
 ▪ A frame is only chosen for replacement if its pin-count is zero
 ▪ If there is no frame with a pin-count of zero the transaction must either wait or be aborted
 ▪ If the chosen frame is dirty, write it to the disk
 Read the requested page into the replacement frame and set its pin-count to 1
Return the address of the frame

Page 121:

Releasing a Page

When a process releases (de-allocates) a page, its pin-count is decremented, known as unpinning
 The process indicates whether the page has been modified; if so, the buffer manager sets the dirty bit on
Processes for requesting and releasing pages are affected by concurrency and crash recovery policies
 These will be discussed at a later date
A sketch of both operations follows

Page 122:

Buffer Replacement Policies

The policy used to replace frames can affect the efficiency of database operations
 Ideally a frame should not be replaced if it will be needed again in the near future
Least Recently Used (LRU) replacement policy
 Assumes that frames that haven't been used recently are no longer required
 Uses a queue to keep track of frames with a pin-count of zero
 Replaces the frame at the front of the queue
 Requires main memory space for the queue

Page 123:

Clock Replacement

A variant of the LRU policy with less overhead
 Instead of a queue, the policy requires one bit per frame and a single variable, called current
 Assume that the frames are numbered from 0 to B-1
 ▪ Where B is the number of frames
Each frame has an associated referenced bit
 The referenced bit is initially set to off, and is set to on when the frame's pin-count reaches zero
current is initially set to 0, and is used to indicate the next frame to be considered for replacement

Page 124:

Clock Replacement Process

Consider the current frame for replacement
 If its pin-count > 0, increment current
 If its pin-count = 0 and its referenced bit is on
 ▪ Switch referenced to off and increment current
 If its pin-count = 0 and its referenced bit is off
 ▪ Replace the frame
Incrementing current past B-1 wraps it around to 0
Only frames with a pin-count of zero are replaced
 A frame with a pin-count of zero is only replaced after all older candidates have been considered (a sketch follows)
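A minimal sketch of the clock policy as a choose_victim implementation, assuming each frame carries a referenced bit in addition to its pin count:

    # Clock replacement (illustrative sketch): one referenced bit per frame
    # and a single hand, current, that sweeps the frames in a circle.

    def choose_victim(frames, state):
        """frames: list of Frame objects with a referenced attribute;
        state: dict holding the clock hand under key "current".
        Returns a frame with pin_count == 0, or None if all are pinned."""
        B = len(frames)
        for _ in range(2 * B):              # two full sweeps always suffice
            frame = frames[state["current"]]
            if frame.pin_count == 0:
                if frame.referenced:
                    frame.referenced = False    # second chance: clear, move on
                else:
                    return frame                # unpinned, not referenced
            state["current"] = (state["current"] + 1) % B   # advance the hand
        return None                         # every frame pinned: wait or abort

Two sweeps are enough because the first sweep clears the referenced bit of every unpinned frame, so the second sweep must find a victim if one exists.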

Page 125:

Is LRU the Right Policy?

LRU and clock replacement are fair schemes
 They are not always the best strategies for a DB system
 It is common for some DB operations to require repeated sequential scans of data (e.g. Cartesian products, joins)
With LRU such operations may result in sequential flooding
An alternative is the Most Recently Used (MRU) policy
 This prevents sequential flooding but is generally poor
Most systems use some variant of LRU
 Some systems will identify certain operations, and apply MRU for those operations

Page 126:

No Sequential Flooding

Assume that a process requests sequential scans of a file
 The file has nine pages, p1 to p9
 Assume that the buffer pool has ten frames
Read page 1 first, then page 2, … then page 9
All nine pages are now in the buffer, so when the next scan of the file is requested, no further disk access is required!

[Diagram: the nine-page file p1 … p9, with the buffer pool filling frame by frame as each page is read]

Page 127:

Sequential Flooding

Assume that a process requests sequential scans of a file
 This file has eleven pages, p1 to p11
 Assume that the buffer pool still has ten frames
Read pages 1 to 10 first; page 11 is still to be read

[Diagram: the eleven-page file p1 … p11; the buffer pool now holds p1 to p10]

Page 128:

Sequential Flooding (continued)

Using LRU, replace the least recently used frame, which contains p1, with p11

[Diagram: the buffer pool changes from p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 to p11 p2 p3 p4 p5 p6 p7 p8 p9 p10]

Page 129:

Sequential Flooding (continued)

The first scan is complete; start the second scan by reading p1 from the file
 Replace the LRU frame (containing p2) with p1

[Diagram: the buffer pool changes from p11 p2 p3 p4 p5 p6 p7 p8 p9 p10 to p11 p1 p3 p4 p5 p6 p7 p8 p9 p10]

Page 130:

Sequential Flooding (continued)

Continue the scan by reading p2, replacing the LRU frame (containing p3)

[Diagram: the buffer pool changes from p11 p1 p3 p4 p5 p6 p7 p8 p9 p10 to p11 p1 p2 p4 p5 p6 p7 p8 p9 p10]

Page 131:

Sequential Flooding (continued)

The pattern repeats: each page requested is exactly the page that was just evicted
 Each scan of the file requires that every page be read from the disk!
 In this case LRU is the WORST possible replacement policy! (a short simulation follows)

[Diagram: the buffer pool stepping through p11 p1 p2 p3 p5 … p10, then p11 p1 p2 p3 p4 p6 … p10, and so on, until the second scan ends with p10 p11 p2 p3 p4 p5 p6 p7 p8 p9]
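The effect is easy to reproduce; a small, self-contained simulation (illustrative, using an ordered dictionary as the LRU queue) counts the disk reads for two scans:

    # Tiny simulation of sequential flooding: repeatedly scan a file of
    # n_pages pages through an LRU buffer of n_frames frames, counting reads.

    from collections import OrderedDict

    def scans(n_pages, n_frames, n_scans=2):
        pool, reads = OrderedDict(), 0      # OrderedDict as an LRU queue
        for _ in range(n_scans):
            for page in range(1, n_pages + 1):
                if page in pool:
                    pool.move_to_end(page)  # hit: mark most recently used
                else:
                    reads += 1              # miss: read from disk
                    if len(pool) == n_frames:
                        pool.popitem(last=False)    # evict the LRU page
                    pool[page] = True
        return reads

    print(scans(9, 10))    # 9: the second scan needs no disk access
    print(scans(11, 10))   # 22: every request of both scans hits the disk

With one more page than there are frames, the hit rate drops from near-perfect to zero.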

Page 132:

OS Buffer Management

There are similarities between OS virtual memory and DBMS buffer management
 Both have the goal of accessing more data than will fit in main memory
 Both bring pages from disk to main memory as needed and replace unneeded pages
A DBMS requires its own buffer management
 To increase the efficiency of database operations
 To control when a page is written to disk

Page 133:

DBMS Buffer Management

A DBMS can often predict patterns in the way in which pages are referenced
 Most page references are generated by processes such as query processing, with known patterns of page accesses
 Knowledge of these patterns allows for a better choice of pages to replace, and
 Allows prefetching of pages, where page requests can be anticipated and performed before they are issued
A DBMS also requires the ability to force a page to disk
 To ensure that the page is updated on disk
 This is necessary to implement crash recovery protocols, where the order in which pages are written is critical

Page 134:

Prefetching

Some DBMS buffer managers are able to predict page requests
 And fetch pages into the buffer before they are requested
 The pages are then available in the buffer pool as soon as they are requested, and
 If the pages to be prefetched are contiguous, the retrieval will be faster than if they had been retrieved individually
 If the pages are not contiguous, retrieval may still be faster, as access to them can be efficiently scheduled
The disadvantage of prefetching (aka double-buffering) is that it requires extra main memory buffers (see the sketch below)
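A hedged sketch of sequential prefetching built on the buffer manager sketched earlier; read_many is an assumed disk interface that fetches a run of pages in one request, not a real API:

    # Sequential prefetching (illustrative sketch): on a miss during a scan,
    # read the next k pages in one request instead of one page at a time.

    def pin_with_prefetch(buffer_mgr, page_id, k=4):
        if page_id in buffer_mgr.table:
            return buffer_mgr.pin(page_id)          # already resident
        # Anticipate the scan: fetch page_id and the k pages after it.
        # Prefetched pages stay unpinned (pin_count 0) until requested,
        # which is why prefetching costs extra buffer frames.
        for pid, data in buffer_mgr.disk.read_many(range(page_id, page_id + k + 1)):
            if pid not in buffer_mgr.table:
                frame = buffer_mgr.choose_victim()
                if frame.dirty:
                    buffer_mgr.disk.write(frame.page_id, frame.data)
                buffer_mgr.table.pop(frame.page_id, None)
                frame.page_id, frame.data, frame.dirty = pid, data, False
                buffer_mgr.table[pid] = frame
        return buffer_mgr.pin(page_id)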

Page 135:

Performance Strategies

Organizing data by cylinders
 Related data should be stored "close to" each other
Using a RAID system to improve efficiency or reliability
 Multiple disks and striping improve efficiency
 Mirroring or redundancy improves reliability
Scheduling requests using the elevator algorithm
 Reduces disk access time for random reads and writes
 Most effective when there are many requests waiting
Prefetching (or double-buffering) data in large chunks
 Speeds up access when needed blocks can be predicted, but requires more main memory buffers