Mass Storage Structure - Uni Koblenz-Landau
TRANSCRIPT
Mass Storage Structure
Outline
Overview of Mass-Storage Structure
Disk Structure
Disk Attachment
Disk Scheduling
Disk and Swap-Space Management
RAID Structure
Stable-Storage Implementation
Tertiary-Storage Structure
Overview of Mass-Storage Structure
Magnetic Disks
Hard Disks
Magnetic Disks
Disk speed
Transfer rate = data rate between the disk drive and the computer
Seek time = time to move the arm to the target cylinder
Rotational latency = time to rotate the target sector under the head
Positioning time = seek time + rotational latency

Magnetic disks rotate fast (60 to 250 rotations per second)
Transfer rates of several megabytes per second
Seek times and rotational latencies of several milliseconds
The head flies just above each platter (risk of a head crash)
Removable disks
Floppy disks: slow rotational speed; the head lies on the disk
Removable hard disks
Magnetic Disks
Disks are attached by an I/O bus
EIDE, ATA, SATA, USB, FC, SCSI, FireWire

Communication is done by controllers
Host controller on the computer
Disk controller on each disk drive
Memory-mapped I/O or special machine instructions
Built-in caches
Magnetic Tapes
Characteristics
Relatively permanent
Holds large data quantities
Slow access time; random access can take minutes

Usage
Used early on as secondary storage
Backup
Infrequently used data
Transferring data between different systems
SSD compared to other techniques

MLC NAND flash drive (1.0″ to 3.5″): up to 1 TB; from ≈ €0.50 per GB (as of July 2012); S-ATA, P-ATA, mSATA, PCIe; read up to 510 MB/s, write up to 490 MB/s (no RAID); mean access time 0.2 ms read / 0.4 ms write; rewritable 3,000 to 10,000 times (MLC)

CompactFlash card (via ATA adapter): up to 128 GB; from ≈ €0.90 per GB (plus adapter price); S-ATA, P-ATA; read/write up to 100 MB/s (no RAID); mean access time 0.8 ms read / 10 to 35 ms write; rewritable 3,000 to 10,000 times (MLC)

RAM disk (part of main memory): up to 16 GB per module; from ≈ €3.90 per GB; mainly DIMM connector; read/write up to 38,400 MB/s (no RAID); mean access time 0.00002 ms; rewritable "unlimited"

Hard disk (1.0″ to 3.5″): up to 4 TB; from ≈ €0.045 per GB; S-ATA, P-ATA, SCSI, SAS; read/write up to 150 MB/s (no RAID); mean access time from 3.5 ms; rewritable "unlimited"
Source: http://de.wikipedia.org/wiki/Solid-State-Drive#Solid-State-Drives_im_Vergleich
Disk Structure
Disk Structure
The disk is addressed as a linear array of blocks (0…N-1)
Each block stores a fixed amount of data (e.g. 512 bytes)
Computing a mapping from block number to (platter, cylinder, track, sector)?
Some sectors might be defective
The number of sectors per track depends on the track position
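Such a mapping can be sketched under a simplified, fixed geometry (constant sectors per track, no defect management; the function name and default parameters are illustrative — the two bullets above explain why real controllers keep the true mapping to themselves):

```python
def block_to_chs(block, heads=16, sectors_per_track=63):
    """Map a logical block number to a (cylinder, head, sector) triple,
    assuming every track holds the same number of sectors. Real drives
    vary sectors per track and remap defective sectors internally."""
    per_cylinder = heads * sectors_per_track
    cylinder = block // per_cylinder
    head = (block % per_cylinder) // sectors_per_track
    sector = block % sectors_per_track + 1  # sectors traditionally count from 1
    return cylinder, head, sector
```

Block 0 maps to (0, 0, 1), and the mapping only works because the geometry is assumed uniform.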
Disk Structure
Constant linear velocity (CLV) – the farther a track is from the center, the more sectors it holds
• Track A (outer): 8 sectors
• Track B (inner): 4 sectors
Problem: at a constant rotation speed, the tracks would yield different data rates
Solution: increase the rotation speed as the head moves from the outer to the inner tracks
Disk Structure
Constant angular velocity (CAV)
Increase the bit density towards the inner tracks
Keep the rotation speed constant
Disk Attachment
Host-attached Storage
Communication techniques
IDE, ATA: 2 devices per I/O bus
SATA: serial ATA (one disk attached per cable)
SCSI: 16 targets (including the controller card), 8 units per target (different components of a device)
FC (fibre channel)
Variant 1: switched fabric with a 24-bit address space (useful for SANs)
Variant 2: FC-AL (arbitrated loop) with 126 devices

I/O commands are reads and writes of logical data blocks, directed to identified storage units
(Diagram: CPU and devices connected via a bus, with a bus controller on the host side and a device controller on each device.)
Remark: Direct Memory Access
Network-attached Storage
Access via an RPC interface (NFS under UNIX or CIFS under Windows)
Connection over TCP/IP, UDP/IP, or a host-attached protocol like iSCSI
Benefits of sharing, but lower performance
Storage Area Networks
Network-attached storage uses the same network for storage I/O → slows down the overall network communication
Solution: use a separate network dedicated to storage traffic (a storage area network, SAN)
Disk Scheduling
Disk Scheduling
Responsibility of the operating system: use the hardware efficiently; here this entails fast access times and high disk bandwidth
Access time = seek time + rotational latency
Bandwidth = number of bytes transferred / time interval
A process requesting a disk transfer specifies:
input or output, the disk address, the memory address, and the number of sectors
The OS cannot start a transfer immediately if the device is not free → concurrent disk transfers require scheduling
Goals: fairness, improved bandwidth and access time → reduce head movement
First Come First Served (FCFS)
Service requests in the order of arrival
Example: head at cylinder 53, request queue 98, 183, 37, 122, 14, 124, 65, 67
The algorithm is fair but does not optimize head movements (here: 640 cylinders in total)
The number of head movements can be reduced significantly
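The 640-cylinder total can be reproduced with a short sketch (the function name is illustrative; the queue is the classic textbook example with the head starting at cylinder 53):

```python
def fcfs_head_movement(start, requests):
    """Total head movement when servicing requests strictly in arrival order."""
    total, pos = 0, start
    for target in requests:
        total += abs(target - pos)  # cylinders traversed for this request
        pos = target
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs_head_movement(53, queue))  # 640
```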
Shortest Seek Time First (SSTF)
Select the request with the minimal seek time from the current head position
Example: same request queue as before
Total head movement is significantly reduced compared to FCFS (236 vs. 640 cylinders)
Shortest Seek Time First (SSTF)
Starvation problem
Consider a permanent stream of requests R1, R2, … close to cylinder 14
Consider one request S for cylinder 183
S might never be serviced

SSTF is not optimal
E.g. servicing 37, 14 before 65, 67, 98, 122, 124, 183 reduces the total head movement to 208 cylinders
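SSTF is a greedy loop over the pending requests; a sketch (function name illustrative) that yields the 236 cylinders quoted above for the example queue with the head at cylinder 53:

```python
def sstf_head_movement(start, requests):
    """Greedily service the pending request closest to the current head position."""
    pending, pos, total = list(requests), start, 0
    while pending:
        nearest = min(pending, key=lambda r: abs(r - pos))  # minimal seek
        total += abs(nearest - pos)
        pos = nearest
        pending.remove(nearest)
    return total

print(sstf_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 236
```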
SCAN Scheduling
Move the disk arm back and forth across the disk, servicing requests along the way
Example
C-SCAN Scheduling
Property of SCAN: when the head reaches one end of the disk, the requests near the other end have waited the longest
Better solution (C-SCAN): move the head back to the beginning and start scanning in the same direction again
LOOK and C-LOOK Scheduling
Variants of SCAN and C-SCAN: move the head only as far as the last pending request in each direction
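LOOK can be sketched by splitting the queue around the head position and sweeping one side, then the other (function and parameter names illustrative):

```python
def look_head_movement(start, requests, direction="up"):
    """LOOK: sweep towards higher cylinders first (or lower, if direction
    is 'down'), reversing as soon as no request remains in that direction."""
    lower = sorted(r for r in requests if r < start)   # requests below the head
    upper = sorted(r for r in requests if r >= start)  # requests at or above it
    if direction == "up":
        order = upper + lower[::-1]
    else:
        order = lower[::-1] + upper
    total, pos = 0, start
    for target in order:
        total += abs(target - pos)
        pos = target
    return total
```

For the example queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at 53 and an upward sweep, this gives (183 - 53) + (183 - 14) = 299 cylinders.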
Disk and Swap-Space Management
Disk Management: Formatting
Low-level formatting: create sectors on a blank platter
Write a header, data area, and trailer for each sector
Data area size: 256, 512, or 1024 bytes
Header and trailer contain the sector number and an ECC

Error-correcting code (ECC)
On data write, compute the ECC and store it in the trailer
On data read, recompute the ECC and compare it with the stored one
If equal: OK
If not equal: a few erroneous bits can be corrected

Partitioning: organize the disk into one or more groups of cylinders

Logical formatting: write the file system data structures
Disk Management: Boot Block
Bootstrap program
Initializes registers and main memory
Loads the operating system from disk
Code in ROM is the first to be executed on power-up

Problem
The bootstrap program may not fit into ROM
A bootstrap program in ROM cannot be changed

Solution: split the boot program
A bootstrap loader is stored in ROM
The remaining bootstrap program is stored on secondary storage
Disk Management: Bad Blocks
Disks are prone to errors (moving parts, small tolerances)
Sectors might become defective
Sectors might be defective from the very beginning

Simple disk controllers (e.g. IDE)
Bad blocks are marked manually
E.g. FAT stores a special value in its table

Sophisticated disk controllers (e.g. SCSI)
Bad-block recovery by the disk controller
See sector sparing on the next slide
Disk Management: Bad Blocks
Sector sparing
Keep spare sectors at low-level formatting time
The controller can be instructed to replace a defective sector with a spare block

Protocol
The OS reads sector X
The controller detects an error using the ECC and reports it to the OS
The OS instructs the controller to replace X
The controller translates each further request for X to a spare block Y
Whenever the OS is restarted, the controller is initialized with all replacements installed so far; the replacement mapping is memorized in the disk organization data
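The controller-side bookkeeping amounts to a small translation table consulted on every request (a sketch; the class and method names are illustrative):

```python
class SparingController:
    """Minimal model of sector sparing inside a disk controller."""

    def __init__(self, spares):
        self.spares = list(spares)  # spare sectors reserved at low-level formatting
        self.remap = {}             # defective sector -> spare sector

    def mark_defective(self, sector):
        """Replace a defective sector with the next free spare block."""
        self.remap[sector] = self.spares.pop(0)

    def translate(self, sector):
        """Applied to every request before it reaches the platter."""
        return self.remap.get(sector, sector)
```

After `mark_defective(17)`, every further request for sector 17 is silently redirected to a spare, which is exactly why plain sparing can defeat the disk scheduling strategy.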
Disk Management: Bad Blocks
Problem: sector sparing invalidates the disk scheduling strategy

Solution
Provide spare sectors in each cylinder (and a spare cylinder as well)
Replace defective sectors only with spare sectors in the same cylinder

Alternative solution: sector slipping
Keep spare sectors in each cylinder
Move the block sequence one step towards the next free spare block
(Diagram: a defective sector is repaired by shifting the following blocks 18…27 one position each towards the spare sector.)
Disk Management: Swap Space
Swapping and paging require space on secondary storage
Swap space too small → processes may be terminated
Swap space too large → less space for the file system, but does no other harm
It is better to overestimate than to underestimate that space

Examples
Solaris: the amount of virtual memory that exceeds the pageable physical memory
Linux: double the amount of physical memory
Swap Space: Swap Space Location
Swap space in the file system
Normal file routines can be used
Requires navigating the directory and disk allocation structures → inefficient
Improvements: caching block location information, allocating contiguous blocks

Swap space in raw partitions
Fast swap-space manager and no file-system overhead
Adding swap space is much more involved (repartitioning)
RAID Structure
RAID Structure
Attach a set of disks to a computer system
Parallel reads and writes improve the data rate
Redundancy improves reliability
General concept: redundant array of independent (inexpensive) disks (RAID)
RAID: Reliability by Redundancy
Example
Mean time to failure of one disk: 100,000 hours
Mean time to failure of one disk out of 100: 100,000 / 100 = 1,000 hours (41.66 days)

Storing only one copy of the data results in a high, unacceptable failure rate

Redundancy required: even when a disk fails, the data can be reconstructed from the remaining disks
RAID: Reliability by Redundancy
Mirroring
Use two disks; every write goes to both
On failure of one disk, the data is still available
The second disk must not crash during the repair time

Example
Assume independent disk failures
Disk mean time to failure: 100,000 hours; mean time to repair: 10 hours
Mean time to data loss: about 57,000 years

Disk failures are not independent in general
A power failure or another disaster will affect all devices simultaneously
Power failure solutions
Write one copy first, then the next
Non-volatile RAM (NVRAM) caches for writing
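The 57,000-year figure follows from a standard back-of-the-envelope model (assuming independent failures): the pair loses its first disk at twice the single-disk failure rate, and data is lost only if the surviving disk also fails within the repair window. A quick check:

```python
mttf = 100_000   # mean time to failure of one disk, in hours
mttr = 10        # mean time to repair, in hours

# First failure of the pair occurs every mttf/2 hours on average; the
# surviving disk must then fail during the mttr-hour repair window,
# which stretches the expected time by a factor of mttf/mttr.
mttdl_hours = (mttf / 2) * (mttf / mttr)
mttdl_years = mttdl_hours / (24 * 365)
print(round(mttdl_years))  # about 57,000 years
```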
RAID: Parallelism
Mirroring doubles the number of reads that can be serviced per unit time
Improving the transfer rate by data striping
(Diagram: bit-level striping spreads the bits of each byte across disks 1…8; block-level striping spreads consecutive blocks across the disks.)
RAID Levels
RAID level 0: block-level striping, no redundancy
RAID level 1: mirroring
RAID Levels
RAID level 2

The idea of parity bits
Original information: 10010001
Parity bit: b = 1+0+0+1+0+0+0+1 (mod 2) = 1
Code: 100100011
Single-bit error: 100101011 → 1+0+0+1+0+1+0+1+1 = 1 (mod 2) ≠ 0 → error!

Can be extended with more bits to form error-correcting codes
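The parity check above is simply an XOR over all bits; a minimal sketch:

```python
def parity_bit(bits):
    """Even parity: XOR (i.e. sum mod 2) of all data bits."""
    p = 0
    for b in bits:
        p ^= b
    return p

data = [1, 0, 0, 1, 0, 0, 0, 1]      # 10010001
code = data + [parity_bit(data)]     # 100100011
assert parity_bit(code) == 0         # intact code word checks out

code[5] ^= 1                         # flip a single bit: 100101011
assert parity_bit(code) == 1         # parity violated -> error detected
```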
RAID Level 3
A defective disk can be detected by the controller → a single parity bit per position suffices for error correction
RAID Level 3
Compared to RAID level 1
Less storage overhead
Higher transfer rate for reading/writing a single block
However, fewer I/O operations per second
(Diagram: bit-level striping with parity. Disks 1–4 hold data bits, disk 5 the parity; the bits of a failed disk 3 are reconstructed from the remaining data and parity disks.)
RAID Level 4
(Diagram: block-level striping with a dedicated parity disk. Disks 1–4 hold the data blocks of each stripe, disk 5 holds the stripe's parity block; a failed disk 3 is reconstructed as b11+b12+b14+p1 for stripe 1 and b21+b22+b24+p2 for stripe 2.)
RAID Level 4
Slower data transfer rate for a single block, since each block resides on one disk

However, concurrent reads are possible and thus a higher overall I/O rate

A write of a single block requires four disk accesses
Read the old data block and the old parity block, modify, and write both back (updating the parity block)

A new disk can be introduced seamlessly (initialized to zeros)
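The four accesses pay off because the parity can be patched without touching the other data disks: new parity = old parity XOR old data XOR new data. A bytewise sketch (function name illustrative):

```python
def update_parity(old_parity, old_block, new_block):
    """Recompute a stripe's parity from the old parity and the changed
    block only; the unchanged data disks need not be read."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

# Two data blocks and their parity:
b1, b2 = b"\x0f", b"\xf0"
parity = bytes(x ^ y for x, y in zip(b1, b2))  # 0xff
new_b1 = b"\x00"
# Patching the parity agrees with recomputing it from scratch:
assert update_parity(parity, b1, new_b1) == bytes(x ^ y for x, y in zip(new_b1, b2))
```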
RAID Level 5
No dedicated parity disk; spread data and parity among N+1 disks
E.g. the parity for the nth block is stored on disk (n mod 5) + 1
Avoids overuse of the parity disk; the most commonly used parity-based RAID organization
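The rotation of the parity block follows directly from the formula above for five disks, numbered 1…5 (a sketch; the function name is illustrative):

```python
def parity_disk(n, num_disks=5):
    """Disk holding the parity block of stripe n: (n mod num_disks) + 1,
    so consecutive stripes place their parity on consecutive disks."""
    return n % num_disks + 1

# Stripes 0..4 put their parity on disks 1..5, then the pattern repeats:
print([parity_disk(n) for n in range(6)])  # [1, 2, 3, 4, 5, 1]
```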
RAID Level 6
Same as RAID 5 plus extra redundant information
Tolerates multiple disk failures (e.g. Reed-Solomon codes)
Stable Storage Implementation
Stable Storage Implementation
Remember the write-ahead log in log-based recovery (see atomic transactions)

A disk write can have three outcomes
Successful completion
Partial failure
Total failure

An error during writing might leave a block in an inconsistent state
Solution: maintain two physical blocks b1, b2 for each logical one
Write to b1 first
On successful completion, write to b2
Declare the operation complete only after the successful write to b2

Recovery from failure: examine both blocks
b1 = b2 and ECC ok → no further action
bi has a detectable error → copy the other block over bi
b1 ≠ b2 and ECC ok → copy b2 to b1

Problem: synchronous writes are time-consuming → use an NVRAM cache
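The recovery rules can be written down directly (a sketch; `ecc_ok` stands for the per-copy checksum test and is a hypothetical helper):

```python
def recover(b1, b2, ecc_ok):
    """Decide which physical copy of a logical block survives recovery.
    b1 is always written first; b2 only after b1 has completed."""
    if not ecc_ok(b1):
        return b2            # b1 damaged mid-write: copy b2 over b1
    if not ecc_ok(b2):
        return b1            # b2 damaged: copy b1 over b2
    if b1 != b2:
        return b2            # interrupted between the two writes: copy b2 to b1
    return b1                # identical and intact: nothing to do
```

Because the operation is declared complete only after b2 has been written, keeping b2 when both copies are intact but differ restores a consistent pre-write state.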
Tertiary-Storage Structure
Tertiary-Storage Devices
Low cost is the defining characteristic

Removable magnetic disks
Range from 1 MB to 1 GB
Can achieve performance close to a hard disk
Greater risk of damage

Magneto-optical disks
The head is far away from the platter (protected by glass/plastic) → much more robust against head crashes
The magnetic field alone is too broad and weak for direct reading or writing
Tertiary-Storage Devices
Writing: a laser beam heats a spot, which becomes more susceptible to the magnetic field
Reading: the magnetic field influences the polarization of a laser beam (Kerr effect)
Both effects can be used to code a single bit

Optical disks
Special materials that can be altered by laser light
Two states: crystalline (more transparent → the reflected laser beam is brighter) and amorphous
Low-power laser to read
Medium-power laser to erase back to the crystalline state
High-power laser to transform from crystalline to amorphous
Tertiary-Storage Devices
Write once, read many times (WORM)
Aluminum film with holes (old technique)
Organic polymer dye
Both are very durable

Tapes
An economical medium when random access is not required
Single tape drive
Robotic tape changers + tape libraries

Further technologies
Holographic storage: 3D array of pixels (each storing 0 or 1)
Micro-electronic mechanical devices: 10,000 disk heads on a single chip; faster than disks, cheaper than DRAM
Tertiary Storage: OS Support
Removable disks
Provide random access
Store a file system structure
Mounted like a hard disk

Tapes
Sequential access only
Exclusive access only
Writing: data can only be appended after the EOT mark
Reading: commands for relative and absolute block lookup
Bad blocks: simply skipped on write → no formatting in advance
Tertiary Storage: Further Issues
A unique name for a file on removable media requires a unique media name
Media might be used among different machines → compatibility problems (data formats, encodings)

Hierarchical storage management
Tertiary storage: cheap, can store a large amount of data
Extend the storage hierarchy to tertiary storage
Swap out files not used for a longer time, but preserve the file name in the file system
On access: retrieve the file from tertiary storage and continue the opening process once the file is on the hard disk
Summary and References
Summary
Disk drives are typically structured as a large array of blocks; this is what the OS sees, and the OS provides a file system on top of it
Disk attachment variants: direct I/O, network
The OS has to manage disk access (requests for blocks coming from different processes) by disk scheduling: FCFS, SSTF, SCAN, C-SCAN, LOOK, C-LOOK
The OS organizes the disk blocks: low-level formatting (create blocks), logical formatting (create the file system), partitioning (e.g. boot partition, regular partition, swap partition), handling corrupted blocks (and external fragmentation, either directly on block writes or by separate defragmentation)
Swap space typically bypasses the file system (dedicated partition, or reserve and memorize contiguous blocks in the file system)
Disks have a failure rate, which becomes significant when storing large amounts of data on many disks; RAID systems can be used to improve reliability and/or transfer rates
Tertiary storage (all removable storage) can be distinguished by whether it is used as a regular file-system extension (e.g. CD-ROM) or for sequential access (tapes)
In general for mass storage, the important aspects of performance are bandwidth, latency, and reliability
References
Silberschatz, Galvin, Gagne: "Operating System Concepts", Seventh Edition, Wiley, 2005. Chapter 12, "Mass-Storage Structure"