Mass Storage Structure - Uni Koblenz-Landau
TRANSCRIPT
Mass Storage Structure
Outline
Overview of Mass-Storage Structure
Disk Structure
Disk Attachment
Disk Scheduling
Disk and Swap-Space Management
RAID Structure
Stable-Storage Implementation
Tertiary-Storage Structure
Overview of Mass-Storage Structure
Magnetic Disks
Hard Disks
Magnetic Disks
Disk speed
Transfer rate = data rate between the disk drive and the computer
Seek time = time to move the arm to the target cylinder
Rotational latency = time to rotate the target sector under the head
Positioning time = seek time + rotational latency

Magnetic disks rotate fast (60 to 250 rotations per second)
Transfer rates of several megabytes per second
Seek times and rotational latencies of several milliseconds
The head flies just above each platter (risk of a head crash)
Removable disks
Floppy disks: slow rotational speed; the head lies on the disk
Removable hard disks
Magnetic Disks
Disks are attached by an I/O bus
EIDE, ATA, SATA, USB, FC, SCSI, FireWire

Communication is done by controllers
Host controller on the computer
Disk controller on each disk drive
Memory-mapped I/O or special machine instructions
Built-in caches
Magnetic Tapes
Characteristics
Relatively permanent
Holds large data quantities
Slow access time; random access can take minutes

Usage
Used early on as secondary storage
Backup
Infrequently used data
Transferring data between different systems
SSD compared to other techniques

MLC NAND flash drive (1.0″ to 3.5″): up to 1 TB; from ≈ €0.50 per GB (as of July 2012); S-ATA, P-ATA, mSATA, PCIe; read up to 510 MB/s, write up to 490 MB/s (no RAID); mean access time 0.2 ms read / 0.4 ms write; rewritable 3,000 to 10,000 times (MLC)

CompactFlash card (via ATA adapter): up to 128 GB; from ≈ €0.90 per GB (plus adapter price); S-ATA, P-ATA; read/write up to 100 MB/s (no RAID); mean access time 0.8 ms read / 10 to 35 ms write; rewritable 3,000 to 10,000 times (MLC)

RAM disk (part of main memory): up to 16 GB per module; from ≈ €3.90 per GB; mainly DIMM connector; read/write up to 38,400 MB/s (no RAID); mean access time 0.00002 ms; rewritable "unlimited"

Hard disk (1.0″ to 3.5″): up to 4 TB; from ≈ €0.045 per GB; S-ATA, P-ATA, SCSI, SAS; read/write up to 150 MB/s (no RAID); mean access time from 3.5 ms; rewritable "unlimited"
Source: http://de.wikipedia.org/wiki/Solid-State-Drive#Solid-State-Drives_im_Vergleich
Disk Structure
Disk Structure
The disk is addressed as a linear array of blocks (0…N-1)
Each block stores a fixed amount of data (e.g. 512 bytes)
Computing a mapping from block number to (platter, cylinder, track, sector)?
Some sectors might be defective
The number of sectors per track depends on the track position
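Such a mapping can be sketched under a simplified, fixed geometry (constant sectors per track, no defect management; the function name and default parameters are illustrative — the two bullets above explain why real controllers keep the true mapping to themselves):

```python
def block_to_chs(block, heads=16, sectors_per_track=63):
    """Map a logical block number to a (cylinder, head, sector) triple,
    assuming every track holds the same number of sectors. Real drives
    vary sectors per track and remap defective sectors internally."""
    per_cylinder = heads * sectors_per_track
    cylinder = block // per_cylinder
    head = (block % per_cylinder) // sectors_per_track
    sector = block % sectors_per_track + 1  # sectors traditionally count from 1
    return cylinder, head, sector
```

Block 0 maps to (0, 0, 1), and the mapping only works because the geometry is assumed uniform.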
Disk Structure
Constant linear velocity (CLV) – the farther a track is from the center, the more sectors it holds
• Track A (outer): 8 sectors
• Track B (inner): 4 sectors
Problem: at a constant rotation speed, the tracks would yield different data rates
Solution: increase the rotation speed as the head moves from the outer to the inner tracks
Disk Structure
Constant angular velocity (CAV)
Increase the bit density towards the inner tracks
Keep the rotation speed constant
Disk Attachment
Host-attached Storage
Communication techniques
IDE, ATA: 2 devices per I/O bus
SATA: serial ATA (one disk attached per cable)
SCSI: 16 targets (including the controller card), 8 units per target (different components of a device)
FC (fibre channel)
Variant 1: switched fabric with a 24-bit address space (useful for SANs)
Variant 2: FC-AL (arbitrated loop) with 126 devices

I/O commands are reads and writes of logical data blocks, directed to identified storage units
(Diagram: CPU and devices connected via a bus, with a bus controller on the host side and a device controller on each device.)
Remark: Direct Memory Access
Network-attached Storage
Access via an RPC interface (NFS under UNIX or CIFS under Windows)
Connection over TCP/IP, UDP/IP, or a host-attached protocol like iSCSI
Benefits of sharing, but lower performance
Storage Area Networks
Network-attached storage uses the same network for storage I/O → slows down the overall network communication
Solution: use a separate network dedicated to storage traffic (a storage area network, SAN)
Disk Scheduling
Disk Scheduling
Responsibility of the operating system: use the hardware efficiently; here this entails fast access times and high disk bandwidth
Access time = seek time + rotational latency
Bandwidth = number of bytes transferred / time interval
A process requesting a disk transfer specifies:
input or output, the disk address, the memory address, and the number of sectors
The OS cannot start a transfer immediately if the device is not free → concurrent disk transfers require scheduling
Goals: fairness, improved bandwidth and access time → reduce head movement
First Come First Served (FCFS)
Service requests in the order of arrival
Example: head at cylinder 53, request queue 98, 183, 37, 122, 14, 124, 65, 67
The algorithm is fair but does not optimize head movements (here: 640 cylinders in total)
The number of head movements can be reduced significantly
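The 640-cylinder total can be reproduced with a short sketch (the function name is illustrative; the queue is the classic textbook example with the head starting at cylinder 53):

```python
def fcfs_head_movement(start, requests):
    """Total head movement when servicing requests strictly in arrival order."""
    total, pos = 0, start
    for target in requests:
        total += abs(target - pos)  # cylinders traversed for this request
        pos = target
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs_head_movement(53, queue))  # 640
```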
Shortest Seek Time First (SSTF)
Select the request with the minimal seek time from the current head position
Example: same request queue as before
Total head movement is significantly reduced compared to FCFS (236 vs. 640 cylinders)
Shortest Seek Time First (SSTF)
Starvation problem
Consider a permanent stream of requests R1, R2, … close to cylinder 14
Consider one request S for cylinder 183
S might never be serviced

SSTF is not optimal
E.g. servicing 37, 14 before 65, 67, 98, 122, 124, 183 reduces the total head movement to 208 cylinders
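SSTF is a greedy loop over the pending requests; a sketch (function name illustrative) that yields the 236 cylinders quoted above for the example queue with the head at cylinder 53:

```python
def sstf_head_movement(start, requests):
    """Greedily service the pending request closest to the current head position."""
    pending, pos, total = list(requests), start, 0
    while pending:
        nearest = min(pending, key=lambda r: abs(r - pos))  # minimal seek
        total += abs(nearest - pos)
        pos = nearest
        pending.remove(nearest)
    return total

print(sstf_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 236
```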
SCAN Scheduling
Move the disk arm back and forth across the disk, servicing requests along the way
Example
C-SCAN Scheduling
Property of SCAN: when the head reaches one end of the disk, the requests near the other end have waited the longest
Better solution (C-SCAN): move the head back to the beginning and start scanning in the same direction again
LOOK and C-LOOK Scheduling
Variants of SCAN and C-SCAN: move the head only as far as the last pending request in each direction
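LOOK can be sketched by splitting the queue around the head position and sweeping one side, then the other (function and parameter names illustrative):

```python
def look_head_movement(start, requests, direction="up"):
    """LOOK: sweep towards higher cylinders first (or lower, if direction
    is 'down'), reversing as soon as no request remains in that direction."""
    lower = sorted(r for r in requests if r < start)   # requests below the head
    upper = sorted(r for r in requests if r >= start)  # requests at or above it
    if direction == "up":
        order = upper + lower[::-1]
    else:
        order = lower[::-1] + upper
    total, pos = 0, start
    for target in order:
        total += abs(target - pos)
        pos = target
    return total
```

For the example queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at 53 and an upward sweep, this gives (183 - 53) + (183 - 14) = 299 cylinders.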
Disk and Swap-Space Management
Disk Management: Formatting
Low-level formatting: create sectors on a blank platter
Write a header, data area, and trailer for each sector
Data area size: 256, 512, or 1024 bytes
Header and trailer contain the sector number and an ECC

Error-correcting code (ECC)
On data write, compute the ECC and store it in the trailer
On data read, recompute the ECC and compare it with the stored one
If equal: OK
If not equal: a few erroneous bits can be corrected

Partitioning: organize the disk into one or more groups of cylinders

Logical formatting: write the file system data structures
Disk Management: Boot Block
Bootstrap program
Initializes registers and main memory
Loads the operating system from disk
Code in ROM is the first to be executed on power-up

Problem
The bootstrap program may not fit into ROM
A bootstrap program in ROM cannot be changed

Solution: split the boot program
A bootstrap loader is stored in ROM
The remaining bootstrap program is stored on secondary storage
Disk Management: Bad Blocks
Disks are prone to errors (moving parts, small tolerances)
Sectors might become defective
Sectors might be defective from the very beginning

Simple disk controllers (e.g. IDE)
Bad blocks are marked manually
E.g. FAT stores a special value in its table

Sophisticated disk controllers (e.g. SCSI)
Bad-block recovery by the disk controller
See sector sparing on the next slide
Disk Management: Bad Blocks
Sector sparing
Keep spare sectors at low-level formatting time
The controller can be instructed to replace a defective sector with a spare block

Protocol
The OS reads sector X
The controller detects an error using the ECC and reports it to the OS
The OS instructs the controller to replace X
The controller translates each further request for X to a spare block Y
Whenever the OS is restarted, the controller is initialized with all replacements installed so far; the replacement mapping is memorized in the disk organization data
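The controller-side bookkeeping amounts to a small translation table consulted on every request (a sketch; the class and method names are illustrative):

```python
class SparingController:
    """Minimal model of sector sparing inside a disk controller."""

    def __init__(self, spares):
        self.spares = list(spares)  # spare sectors reserved at low-level formatting
        self.remap = {}             # defective sector -> spare sector

    def mark_defective(self, sector):
        """Replace a defective sector with the next free spare block."""
        self.remap[sector] = self.spares.pop(0)

    def translate(self, sector):
        """Applied to every request before it reaches the platter."""
        return self.remap.get(sector, sector)
```

After `mark_defective(17)`, every further request for sector 17 is silently redirected to a spare, which is exactly why plain sparing can defeat the disk scheduling strategy.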
Disk Management: Bad Blocks
Problem: sector sparing invalidates the disk scheduling strategy

Solution
Provide spare sectors in each cylinder (and a spare cylinder as well)
Replace defective sectors only with spare sectors in the same cylinder

Alternative solution: sector slipping
Keep spare sectors in each cylinder
Move the block sequence one step towards the next free spare block
(Diagram: a defective sector is repaired by shifting the following blocks 18…27 one position each towards the spare sector.)
Disk Management: Swap Space
Swapping and paging require space on secondary storage
Swap space too small → processes may be terminated
Swap space too large → less space for the file system, but does no other harm
It is better to overestimate than to underestimate that space

Examples
Solaris: the amount of virtual memory that exceeds the pageable physical memory
Linux: double the amount of physical memory
Swap Space: Swap Space Location
Swap space in the file system
Normal file routines can be used
Requires navigating the directory and disk allocation structures → inefficient
Improvements: caching block location information, allocating contiguous blocks

Swap space in raw partitions
Fast swap-space manager and no file-system overhead
Adding swap space is much more involved (repartitioning)
RAID Structure
RAID Structure
Attach a set of disks to a computer system
Parallel reads and writes improve the data rate
Redundancy improves reliability
General concept: redundant array of independent (inexpensive) disks (RAID)
RAID: Reliability by Redundancy
Example
Mean time to failure of one disk: 100,000 hours
Mean time to failure of one disk out of 100: 100,000 / 100 = 1,000 hours (41.66 days)

Storing only one copy of the data results in a high, unacceptable failure rate

Redundancy required: even when a disk fails, the data can be reconstructed from the remaining disks
RAID: Reliability by Redundancy
Mirroring
Use two disks; every write goes to both
On failure of one disk, the data is still available
The second disk must not crash during the repair time

Example
Assume independent disk failures
Disk mean time to failure: 100,000 hours; mean time to repair: 10 hours
Mean time to data loss: about 57,000 years

Disk failures are not independent in general
A power failure or another disaster will affect all devices simultaneously
Power failure solutions
Write one copy first, then the next
Non-volatile RAM (NVRAM) caches for writing
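The 57,000-year figure follows from a standard back-of-the-envelope model (assuming independent failures): the pair loses its first disk at twice the single-disk failure rate, and data is lost only if the surviving disk also fails within the repair window. A quick check:

```python
mttf = 100_000   # mean time to failure of one disk, in hours
mttr = 10        # mean time to repair, in hours

# First failure of the pair occurs every mttf/2 hours on average; the
# surviving disk must then fail during the mttr-hour repair window,
# which stretches the expected time by a factor of mttf/mttr.
mttdl_hours = (mttf / 2) * (mttf / mttr)
mttdl_years = mttdl_hours / (24 * 365)
print(round(mttdl_years))  # about 57,000 years
```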
RAID: Parallelism
Mirroring doubles the number of reads that can be serviced per unit time
Improving the transfer rate by data striping
(Diagram: bit-level striping spreads the bits of each byte across disks 1…8; block-level striping spreads consecutive blocks across the disks.)
RAID Levels
RAID level 0: block-level striping, no redundancy
RAID level 1: mirroring
RAID Levels
RAID level 2

The idea of parity bits
Original information: 10010001
Parity bit: b = 1+0+0+1+0+0+0+1 (mod 2) = 1
Code: 100100011
Single-bit error: 100101011 → 1+0+0+1+0+1+0+1+1 = 1 (mod 2) ≠ 0 → error!

Can be extended with more bits to form error-correcting codes
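The parity check above is simply an XOR over all bits; a minimal sketch:

```python
def parity_bit(bits):
    """Even parity: XOR (i.e. sum mod 2) of all data bits."""
    p = 0
    for b in bits:
        p ^= b
    return p

data = [1, 0, 0, 1, 0, 0, 0, 1]      # 10010001
code = data + [parity_bit(data)]     # 100100011
assert parity_bit(code) == 0         # intact code word checks out

code[5] ^= 1                         # flip a single bit: 100101011
assert parity_bit(code) == 1         # parity violated -> error detected
```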
RAID Level 3
A defective disk can be detected by the controller → a single parity bit per position suffices for error correction
RAID Level 3
Compared to RAID level 1
Less storage overhead
Higher transfer rate for reading/writing a single block
However, fewer I/O operations per second
(Diagram: bit-level striping with parity. Disks 1–4 hold data bits, disk 5 the parity; the bits of a failed disk 3 are reconstructed from the remaining data and parity disks.)
RAID Level 4
(Diagram: block-level striping with a dedicated parity disk. Disks 1–4 hold the data blocks of each stripe, disk 5 holds the stripe's parity block; a failed disk 3 is reconstructed as b11+b12+b14+p1 for stripe 1 and b21+b22+b24+p2 for stripe 2.)
RAID Level 4
Slower data transfer rate for a single block, since each block resides on one disk

However, concurrent reads are possible and thus a higher overall I/O rate

A write of a single block requires four disk accesses
Read the old data block and the old parity block, modify, and write both back (updating the parity block)

A new disk can be introduced seamlessly (initialized to zeros)
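The four accesses pay off because the parity can be patched without touching the other data disks: new parity = old parity XOR old data XOR new data. A bytewise sketch (function name illustrative):

```python
def update_parity(old_parity, old_block, new_block):
    """Recompute a stripe's parity from the old parity and the changed
    block only; the unchanged data disks need not be read."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

# Two data blocks and their parity:
b1, b2 = b"\x0f", b"\xf0"
parity = bytes(x ^ y for x, y in zip(b1, b2))  # 0xff
new_b1 = b"\x00"
# Patching the parity agrees with recomputing it from scratch:
assert update_parity(parity, b1, new_b1) == bytes(x ^ y for x, y in zip(new_b1, b2))
```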
RAID Level 5
No dedicated parity disk; spread data and parity among N+1 disks
E.g. the parity for the nth block is stored on disk (n mod 5) + 1
Avoids overuse of the parity disk; the most commonly used parity-based RAID organization
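The rotation of the parity block follows directly from the formula above for five disks, numbered 1…5 (a sketch; the function name is illustrative):

```python
def parity_disk(n, num_disks=5):
    """Disk holding the parity block of stripe n: (n mod num_disks) + 1,
    so consecutive stripes place their parity on consecutive disks."""
    return n % num_disks + 1

# Stripes 0..4 put their parity on disks 1..5, then the pattern repeats:
print([parity_disk(n) for n in range(6)])  # [1, 2, 3, 4, 5, 1]
```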
RAID Level 6
Same as RAID 5 plus extra redundant information
Tolerates multiple disk failures (e.g. Reed-Solomon codes)
Stable Storage Implementation
Stable Storage Implementation
Remember the write-ahead log in log-based recovery (see atomic transactions)

A disk write can have three outcomes
Successful completion
Partial failure
Total failure

An error during writing might leave a block in an inconsistent state
Solution: maintain two physical blocks b1, b2 for each logical one
Write to b1 first
On successful completion, write to b2
Declare the operation complete only after the successful write to b2

Recovery from failure: examine both blocks
b1 = b2 and ECC ok → no further action
bi has a detectable error → copy the other block over bi
b1 ≠ b2 and ECC ok → copy b2 to b1

Problem: synchronous writes are time-consuming → use an NVRAM cache
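The recovery rules can be written down directly (a sketch; `ecc_ok` stands for the per-copy checksum test and is a hypothetical helper):

```python
def recover(b1, b2, ecc_ok):
    """Decide which physical copy of a logical block survives recovery.
    b1 is always written first; b2 only after b1 has completed."""
    if not ecc_ok(b1):
        return b2            # b1 damaged mid-write: copy b2 over b1
    if not ecc_ok(b2):
        return b1            # b2 damaged: copy b1 over b2
    if b1 != b2:
        return b2            # interrupted between the two writes: copy b2 to b1
    return b1                # identical and intact: nothing to do
```

Because the operation is declared complete only after b2 has been written, keeping b2 when both copies are intact but differ restores a consistent pre-write state.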
Tertiary-Storage Structure
Tertiary-Storage Devices
Low cost is the defining characteristic

Removable magnetic disks
Range from 1 MB to 1 GB
Can achieve performance close to a hard disk
Greater risk of damage

Magneto-optical disks
The head is far away from the platter (protected by glass/plastic) → much more robust against head crashes
The magnetic field alone is too broad and weak for direct reading or writing
Tertiary-Storage Devices
Writing: a laser beam heats a spot, which becomes more susceptible to the magnetic field
Reading: the magnetic field influences the polarization of a laser beam (Kerr effect)
Both effects can be used to code a single bit

Optical disks
Special materials that can be altered by laser light
Two states: crystalline (more transparent → the reflected laser beam is brighter) and amorphous
Low-power laser to read
Medium-power laser to erase back to the crystalline state
High-power laser to transform from crystalline to amorphous
Tertiary-Storage Devices
Write once, read many times (WORM)
Aluminum film with holes (old technique)
Organic polymer dye
Both are very durable

Tapes
An economical medium when random access is not required
Single tape drive
Robotic tape changers + tape libraries

Further technologies
Holographic storage: 3D array of pixels (each storing 0 or 1)
Micro-electronic mechanical devices: 10,000 disk heads on a single chip; faster than disks, cheaper than DRAM
Tertiary Storage: OS Support
Removable disks
Provide random access
Store a file system structure
Mounted like a hard disk

Tapes
Sequential access only
Exclusive access only
Writing: data can only be appended after the EOT mark
Reading: commands for relative and absolute block lookup
Bad blocks: simply skipped on write → no formatting in advance
Tertiary Storage: Further Issues
A unique name for a file on removable media requires a unique media name
Media might be used among different machines → compatibility problems (data formats, encodings)

Hierarchical storage management
Tertiary storage: cheap, can store a large amount of data
Extend the storage hierarchy to tertiary storage
Swap out files not used for a longer time, but preserve the file name in the file system
On access: retrieve the file from tertiary storage and continue the opening process once the file is on the hard disk
Summary and References
Summary
Disk drives are typically structured as a large array of blocks; this is what the OS sees, and the OS provides a file system on top of it
Disk attachment variants: direct I/O, network
The OS has to manage disk access (requests for blocks coming from different processes) by disk scheduling: FCFS, SSTF, SCAN, C-SCAN, LOOK, C-LOOK
The OS organizes the disk blocks: low-level formatting (create blocks), logical formatting (create the file system), partitioning (e.g. boot partition, regular partition, swap partition), handling corrupted blocks (and external fragmentation, either directly on block writes or by separate defragmentation)
Swap space typically bypasses the file system (dedicated partition, or reserve and memorize contiguous blocks in the file system)
Disks have a failure rate, which becomes significant when storing large amounts of data on many disks; RAID systems can be used to improve reliability and/or transfer rates
Tertiary storage (all removable storage) can be distinguished by whether it is used as a regular file-system extension (e.g. CD-ROM) or for sequential access (tapes)
In general for mass storage, the important aspects of performance are bandwidth, latency, and reliability
References
Silberschatz, Galvin, Gagne: "Operating System Concepts", Seventh Edition, Wiley, 2005. Chapter 12, "Mass-Storage Structure"