Spring 2014 SILICON VALLEY UNIVERSITY CONFIDENTIAL 1 Operating Systems Dr. Jerry Shiao, Silicon Valley University


Page 1:

Operating Systems

Dr. Jerry Shiao, Silicon Valley University

Page 2:

Mass-Storage Structure Secondary (Magnetic Disk) and Tertiary (Tape) Physical

Structure: Lowest Level of File System.

Secondary Storage Devices: Disk Attachment Computer Access: Host-Attached (I/O Ports), Network-Attached

Storage, Storage-Area Network.

Operating System Support: Disk Scheduling (FCFS, SSTF, SCAN, CSCAN Algorithms) Disk Management (Partitions, MBR) Swap-Space Management (File System, Virtual Memory)

Mass Storage in RAID Structure (Redundant Array of Independent Disks): RAID Levels 0–6 (variations of RAID; reliability using redundancy)

Stable-Storage Implementation (no loss). Tertiary Storage Devices (Magnetic Tape)

Performance Issues (Speed, Reliability, Cost)

Copyright @ 2009 John Wiley & Sons Inc.

Page 3:

Mass-Storage Structure

Magnetic Disks: Secondary Storage of Computer Systems. Multiple disk platters rotate at 60 to 200 times per second.

Transfer Rate: rate of data flow between the disk drive and the computer system. Typically several megabytes per second.

Positioning Time: time to move the disk arm to the cylinder (Seek Time) plus time to rotate the desired sector under the disk head (Rotational Latency).

Seek time and rotational latency are typically several milliseconds each.

The disk head flies over the disk surface, separated by microns. A head crash occurs when the disk head contacts the disk.

Removable Disk: Consists of one Disk Platter. Disk Controller Connected to Host Controller with I/O Bus.

I/O Busses: EIDE (Enhanced Integrated Drive Electronics), ATA (Advanced Technology Attachment), SATA (Serial ATA), USB (Universal Serial Bus), FC (Fibre Channel), SCSI (Small Computer Systems Interface).

Page 4:

Mass-Storage Structure

Magnetic Disks: Secondary Storage of Computer Systems


Platter size is related to performance: smaller platters reduce head movement and improve seek times (faster reads and writes).

1) Reduced platter size improves stiffness: more resistant to shock and vibration.

2) Flatter surfaces are easier to manufacture.

3) The spindle spins faster with less-powerful motors.

4) Less power reduces noise and heat.

5) Seek performance increases: less head movement.

Page 5:

Mass-Storage Structure

Magnetic Disks: Secondary Storage of Computer Systems


5.25” PC Hard Disk.

3.5” Desktop PC Hard Disk. Most common.

Compact Flash.

PC Card.

2.5” Laptop Hard Disk.

Page 6:

Mass-Storage Structure

Magnetic Tapes: Early secondary-storage medium. Holds large quantities of data (20 GBytes to 200 GBytes). Limitations:

Access time is slow (about 1000 times slower than hard disk).

Mainly used for backup of file systems and storage of infrequently used data. Tape winds on a spool and moves across the tape head. Built-in compression increases capacity.

Magnetic tapes are categorized by width: 4, 8, and 19 millimeters, and ¼ and ½ inch. LTO-2 (Linear Tape-Open) tape cartridge. SDLT (Super Digital Linear Tape) tape cartridge.

Page 7:

Mass-Storage Structure

Disk Structure: Mapped as a one-dimensional array of logical blocks.

Logical Block: typically 512 bytes. Sector 0: first sector of the first track on the outermost cylinder.

Mapping proceeds from the outermost cylinder to the innermost cylinder.

Disk Address: cylinder number, track number, sector number. Translating a logical block to a physical sector is difficult:

The number of sectors per track is not constant on some drives; farther from the center, a track holds a greater number of sectors.

CLV (Constant Linear Velocity) keeps the density of bits per track uniform: the drive increases rotational speed as the head moves from the outer to the inner tracks. Used in CD-ROM and DVD-ROM.

CAV (Constant Angular Velocity) keeps the rotational speed constant, so the bit density decreases from the inner tracks to the outer tracks. Used in hard disks.
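The cylinder/track/sector address maps to a logical block number with a simple formula when the geometry is fixed. A minimal sketch in Python; the 16-head, 63-sectors-per-track geometry is an illustrative assumption (real drives vary sectors per track across zones, which is exactly why the translation is hard):

```python
def chs_to_lba(cylinder, head, sector, heads=16, sectors_per_track=63):
    # Sectors are traditionally numbered from 1; cylinders and heads from 0.
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)

print(chs_to_lba(0, 0, 1))   # 0: sector 0 is the first sector of the first track
print(chs_to_lba(1, 0, 1))   # 1008: cylinder 1 starts after 16 * 63 sectors
```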

Page 8:

Mass-Storage Structure

Disk Attachment: Computers access disks via Host-Attached Storage (I/O ports) or Network-Attached Storage (remote host).

Host-Attached Storage: Accessed through I/O ports.

IDE (Integrated Drive Electronics) or ATA (Advanced Technology Attachment): supports 2 drives per I/O bus.

SCSI (Small Computer Systems Interface): supports 16 drives or SCSI targets per I/O bus on a host controller card. Each SCSI target addresses up to 8 logical units.

FC (Fibre Channel): 24-bit address switched fabric. Multiple hosts and storage devices can attach to the fabric.

FC-AL (Arbitrated Loop): supports 126 devices (drives and controllers).

Page 9:

Mass-Storage Structure

Disk Attachment: Network-Attached Storage (NAS)

Accessed remotely over an IP network using TCP or UDP. Clients use RPCs (Remote Procedure Calls):

NFS for UNIX, CIFS for Windows.

NAS is usually a RAID array with RPC software. Lower performance than direct-attached storage devices. iSCSI: uses the IP network protocol to carry the SCSI protocol.

Page 10:

Mass-Storage Structure

Disk Attachment: Storage-Area Network (SAN). A private network connecting servers and storage units. Multiple hosts (servers) and multiple storage arrays are attached to the same SAN.

A SAN switch controls access to the storage units. Fibre Channel interconnects the SAN components.

Page 11:

Mass-Storage Structure

Disk Scheduling: The Operating System must provide fast access time and transfer rate to mass storage. Access Time components:

Seek Time: time for the disk arm to move the heads to the cylinder of the sector.

Rotational Latency: time for the disk to rotate to the sector.

Disk Bandwidth: total bytes transferred, divided by the total time between the first request and the completion of the transfer.

Improve access time (seek time and rotational latency) by managing the order of disk I/O requests.

Disk Queue: requests made while the drive or controller is busy. Disk Scheduling: the Operating System algorithm that chooses the next pending request to service.

Page 12:

Mass-Storage Structure

Disk Scheduling: FCFS Scheduling (First-Come, First-Served)

Fair, but does not provide the fastest service. Large swings of the disk head are possible, causing lengthy access times.


Cylinders: seek time is the head movement to a cylinder.

Large jumps between cylinders.
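Total head movement under FCFS can be sketched directly: service the requests strictly in arrival order and sum the seek distances. The request queue and start cylinder below are illustrative textbook-style values, not from these slides:

```python
def fcfs_head_movement(start, requests):
    total, pos = 0, start
    for cyl in requests:          # service strictly in arrival order
        total += abs(cyl - pos)   # seek distance to the next request
        pos = cyl
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs_head_movement(53, queue))   # 640 cylinders of head movement
```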

Page 13:

Mass-Storage Structure

Disk Scheduling: SSTF Scheduling (Shortest Seek Time First)

Services the requests closest to the current head position before moving the head farther away (minimizes seek time).

Similar to Shortest-Job-First scheduling. Starvation is possible for requests furthest from the current head position.


Cylinders: minimizes cylinder access (seek) time.
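SSTF is a greedy choice of the closest pending cylinder. A sketch using the same illustrative queue as above (hypothetical values):

```python
def sstf_head_movement(start, requests):
    pending, pos, total = list(requests), start, 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))  # closest request wins
        total += abs(nxt - pos)
        pos = nxt
        pending.remove(nxt)
    return total

print(sstf_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 236
```

Note the improvement over FCFS on the same queue (236 vs. 640 cylinders), at the cost of possible starvation.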

Page 14:

Mass-Storage Structure

Disk Scheduling: SCAN Algorithm

Elevator Algorithm: the disk arm starts at one end of the disk and moves toward the other end, servicing requests at each cylinder. When the end is reached, head movement is reversed.

Requests collect at the cylinders that were just passed.


Cylinders: access chart similar to SSTF. Minimizes cylinder access (seek) time.
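SCAN's head movement for one pass can be sketched under the assumption that the head starts moving toward cylinder 0 (the common textbook setup); the queue is the same illustrative example:

```python
def scan_head_movement(start, requests, lo=0):
    # Head sweeps down to cylinder `lo`, then reverses and travels up
    # to the farthest request above the starting position.
    farthest_up = max([c for c in requests if c > start], default=start)
    return (start - lo) + (farthest_up - lo)

print(scan_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 53 + 183 = 236
```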

Page 15:

Mass-Storage Structure

Disk Scheduling: C-SCAN Algorithm

Variant of SCAN with a more uniform wait time. The head moves from one end to the other servicing requests. When the other end is reached, it immediately returns to the beginning without servicing requests on the return trip.

Treats the cylinders as a circular list that wraps from the last cylinder to the first.


Cylinders: wraps from the last cylinder to the first cylinder before servicing requests again.
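A C-SCAN sketch, assuming the head moves up first on a 0–199 cylinder disk (illustrative values) and counting the full-speed return sweep in the total (conventions differ on whether to count it):

```python
def cscan_head_movement(start, requests, lo=0, hi=199):
    # Sweep up to `hi`, jump back to `lo` (the return sweep is counted here),
    # then sweep up again to the last request that was below the start.
    below = [c for c in requests if c < start]
    final = (max(below) - lo) if below else 0
    return (hi - start) + (hi - lo) + final

print(cscan_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 146 + 199 + 37 = 382
```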

Page 16:

Mass-Storage Structure

Disk Scheduling: C-LOOK and LOOK Algorithms

Variants of C-SCAN and SCAN. The head moves from one end to the other servicing requests, but only goes as far as the final request in each direction before reversing direction.

SCAN and C-SCAN implementations typically follow this pattern.


Cylinders: wraps as soon as the last request is reached.
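C-LOOK only travels as far as the outermost pending request in each direction before wrapping; a sketch on the same illustrative queue, with the head initially moving upward:

```python
def clook_head_movement(start, requests):
    above = [c for c in requests if c >= start]
    below = [c for c in requests if c < start]
    total = (max(above) - start) if above else 0       # only as far as the last request
    if below:
        total += max(above, default=start) - min(below)  # wrap to the lowest request
        total += max(below) - min(below)                 # finish the second upward sweep
    return total

print(clook_head_movement(53, [98, 183, 37, 122, 14, 124, 65, 67]))  # 130 + 169 + 23 = 322
```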

Page 17:

Mass-Storage Structure

Disk Scheduling: Selection of a Disk-Scheduling Algorithm

SSTF is simple and performs better than FCFS. SCAN and C-SCAN perform better for systems with heavy disk usage. Require a retrieval algorithm.

The File Allocation algorithm is important: contiguous blocks will have minimum head movement, while a linked or indexed file can have blocks scattered on the disk.

Caches for directories and index blocks are required; otherwise frequent directory accesses cause excess head movements.

SSTF or LOOK (C-LOOK) is typically used as the default algorithm.

The disk-scheduling algorithm is written as a separate module, allowing it to be replaced with a different algorithm.

Page 18:

Mass-Storage Structure

Disk Management: A magnetic disk is initially a platter of magnetic recording material. Low-Level Formatting (Physical Formatting):

The disk controller is instructed on how many bytes of data (256, 512, or 1024 bytes) go between the header and trailer of each sector. Fills the disk with sectors.

Header: sector number. Trailer: ECC (Error-Correcting Code), calculated on every write and verified on every read.

The ECC is used to recover the actual value (Soft Error), which is reported to the disk controller.

The Operating System creates partitions (groups of cylinders). Logical Formatting: creation of a file system in the partition, or the partition is left “raw”.

File-system data structures include maps of free and allocated space (a FAT or inodes) and an initial empty directory.

File systems group blocks together into clusters for I/O performance. A raw disk is viewed as a large sequential array of logical blocks (no file-system services: file locking, prefetching, space allocation, file names, or directories).

Page 19:

Mass-Storage Structure

Disk Management: Bootstrap Loader

Stored in ROM at a fixed location: the processor starts executing it on reset. The small bootstrap loader program instructs the disk controller to read the boot blocks into memory and starts executing the bootstrap program.

Boot Block

The bootstrap program at a fixed location on disk (Master Boot Record) initializes the system, from CPU registers to device controllers and the contents of main memory, then loads the Operating System from a non-fixed location on disk and starts it running.


Windows 2000: the Boot Block is in the first sector of the hard disk (Master Boot Record). The MBR contains the table of partitions and identifies which partition is the boot partition (Operating System and Device Drivers).

Page 20:

Mass-Storage Structure

Disk Management: Bad Blocks

Sectors can become defective (moving parts with small tolerances). The IDE disk controller requires bad blocks to be detected manually:

the format() command updates the FAT during file-system initialization; the chkdsk() command runs during runtime.

The SCSI disk controller maintains a list of bad blocks from low-level formatting, and maintains a list of spare sectors to replace bad blocks.

Spare sectors are allocated on each cylinder, and an entire spare cylinder is also allocated (used to prevent the disk-scheduling algorithm from being invalidated when bad sectors are replaced with spare sectors).

Sector Sparing: the SCSI disk controller detects a bad sector via ECC and notifies the Operating System.

During bootup, the Operating System tells the controller to replace the bad sector with a spare sector.

Sector Slipping:

All sectors after the bad sector are moved down one sector until the spare sector is reached. The bad sector is then mapped to the “free” sector after it.
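The slipping remap can be sketched as a table: after a slip, each logical sector from the bad one up to the spare lives one physical sector later. The sector numbers (bad sector 17, spare following sector 202) are illustrative textbook-style values:

```python
def slip_map(bad, spare):
    # After slipping, logical sector n (bad <= n < spare) lives in physical
    # sector n + 1; the sector just before the spare moves into the spare slot.
    return {logical: logical + 1 for logical in range(bad, spare)}

remap = slip_map(17, 203)      # sector 17 goes bad; the spare follows sector 202
print(remap[17], remap[202])   # 18 203
```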

Page 21:

Mass-Storage Structure

Swap-Space Management: Operating Systems merged swapping with virtual-memory paging techniques. In paging systems, only pages of a process are swapped out.

Virtual memory uses disk space as an extension of main memory. Disk access is slower than memory access: how swap space is used and how it is managed is critical to the performance of the Operating System.

Swap space in a raw partition:

A swap-space storage manager is used to allocate and deallocate blocks from the raw partition.

The swap-space partition is fixed. Internal fragmentation can occur, but process life is short. Increasing swap space requires repartitioning.

Swap space in file-system space: a large file within the file system, created and allocated through file-system APIs. Straightforward to implement, but inefficient because it goes through file-system data structures.

Page 22:

Mass-Storage Structure

Swap-Space Management: Linux can place swap space in a swap file on a file system or in a raw swap-space partition.

Swap space is only used for anonymous memory (memory allocated for the stack, heap, and uninitialized data area, and regions of memory shared by other processes).

Text-segment pages are reread from disk when swapped out. The swap area has 4 KByte page slots and swap maps to track the swap space.


Array of counters, each corresponding to a page slot in the swap area. A count greater than one in a swap-map entry indicates that multiple processes share the page.
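The swap map described above can be sketched as a plain counter array; the 8-slot size and the helper name `swap_alloc` are hypothetical, chosen only to illustrate the counting scheme:

```python
swap_map = [0] * 8                 # hypothetical 8-slot swap area, one 4 KB slot each

def swap_alloc():
    slot = swap_map.index(0)       # first free page slot (count == 0)
    swap_map[slot] = 1
    return slot

slot = swap_alloc()
swap_map[slot] += 1                # a second process maps the same swapped page
print(swap_map)                    # [2, 0, 0, 0, 0, 0, 0, 0]
```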

Page 23:

Mass-Storage Structure

RAID Structure: RAID (Redundant Arrays of Independent Disks): redundant information stored on multiple disks for reliability.

Disks attached directly to the I/O bus: the Operating System implements RAID. Intelligent host controller: controls multiple disks and implements RAID at the controller. Storage Array (RAID Array): a standalone unit with a controller; has a cache, controls multiple disks, and attaches to the host with an ATA, SCSI, or FC controller.

The solution to reliability is redundancy: store extra information that is used, in the event of a disk failure, to rebuild the lost information.

Mirrored Volume: a logical disk consists of two physical disks, and every write is carried out on both disks. Simple, but expensive.

Mean Time To Failure (MTTF): mean time to failure of a single disk (100,000 hours).

Mean Time To Repair (MTTR): time to replace a disk and restore its data (10 hours).

Mean Time to Data Loss: MTTF² / (2 × MTTR) = 100,000² / (2 × 10) = 500 × 10⁶ hours. An NVRAM cache protects against power failure.
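The mean-time-to-data-loss arithmetic for a mirrored pair, assuming independent disk failures, worked through with the slide's figures:

```python
mttf = 100_000                 # mean time to failure of one disk, in hours
mttr = 10                      # mean time to repair, in hours

# Data is lost only if the second disk fails while the first is being repaired.
mttdl = mttf ** 2 / (2 * mttr)
print(mttdl)                   # 500,000,000 hours, roughly 57,000 years
```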

Page 24:

Mass-Storage Structure

RAID Structure: Performance via Parallelism.

Mirroring: read requests can be sent to either disk, doubling the read rate per logical disk. Multiple disks:

Bit-Level Data Striping: splitting the bits of each byte across multiple disks; bit n of each byte goes to disk n. Use a multiple of 8 disks, or a number that divides 8 (i.e. 4 disks or 2 disks). Every disk participates in every access (read or write). With an array of 8 disks, I/O access is 8 times as fast.

Block-Level Striping: splitting blocks across multiple disks. With m disks, block n goes to disk (n mod m) + 1. Array of 4 disks: Block 0 goes to disk (0 mod 4) + 1 = disk 1; Block 1 goes to disk (1 mod 4) + 1 = disk 2.

Striping parallelism achieves: increased throughput by load balancing, and reduced response time for large accesses.
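The block-to-disk mapping above is one line of code; a sketch using the slide's 1-based disk numbering:

```python
def stripe_disk(block, disks=4):
    # Slide's 1-based numbering: block n of m disks goes to disk (n mod m) + 1.
    return (block % disks) + 1

print([stripe_disk(n) for n in range(6)])   # [1, 2, 3, 4, 1, 2]
```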

Page 25:

Mass-Storage Structure

RAID Structure: RAID Levels

Schemes combining mirroring (reliability) and striping (performance).

RAID 0: Non-Redundant Striping. Block striping for performance; no redundancy.

RAID 1: Mirrored Disks. Disk mirroring for redundancy.

RAID 2: Memory-Style Error-Correcting Codes. Bit-level striping with ECC bits stored on additional disks.

RAID 3: Bit-Interleaved Parity. Bit-level striping for performance, with a single dedicated parity disk. Relies on the disk controller detecting read/write errors in each sector via the sector's error-correcting code. Uses dedicated parity hardware and an NVRAM cache to store blocks during parity computation.

Page 26:

Mass-Storage Structure

RAID Structure:

RAID 4: Block-Interleaved Parity. Block striping for performance; a parity block restores a failed block.

RAID 5: Block-Interleaved Distributed Parity. Block striping for performance, with the parity blocks distributed across all disks in the RAID. The most common parity RAID system.

RAID 6: P + Q Redundancy. Similar to RAID 5, but extra redundant information guards against multiple disk failures.

RAID 0 + 1: Combination of RAID 0 (performance) and RAID 1 (reliability). The disks are striped and then the stripe is mirrored. Limitation: if one disk fails, the entire stripe is unavailable.

RAID 1 + 0: Combination of RAID 1 and RAID 0. Disks are mirrored in pairs and then the mirrored pairs are striped.
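The parity-based recovery behind RAID 4/5/6 is a bytewise XOR: the parity block is the XOR of the data blocks, so any single lost block equals the XOR of the survivors. A minimal sketch with tiny two-byte "blocks" (illustrative data):

```python
def xor_blocks(blocks):
    # Bytewise XOR of equal-length blocks; XOR of the surviving blocks
    # plus parity reconstructs any single missing block.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"\x01\x02", b"\x0f\x00", b"\xf0\xff"
parity = xor_blocks([d0, d1, d2])
print(xor_blocks([d0, d2, parity]) == d1)   # True: d1 rebuilt after a "disk failure"
```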

Page 27:

Mass-Storage Structure

RAID Structure


C: Copy of the data on disk.

P: Error Correcting Bits.

Page 28:

Mass-Storage Structure

RAID Structure

www.thegeekstuff.com

A, B, C, D, E, F: Blocks

P1, P2, P3: Parity

Page 29:

Mass-Storage Structure

RAID Structure


RAID 0 + 1: striping is done before the mirror. If a disk fails, its entire stripe's mirror set cannot be used: the array degrades to RAID 0.

RAID 1 + 0: disks are mirrored in pairs and then the mirrors are striped together. If a disk fails, only that disk loses its portion of the stripe; the other disks retain their stripe portions.

Page 30:

Mass-Storage Structure: RAID Structure

Variations in RAID implementations:

RAID implemented within the kernel or system-software layer. The storage hardware provides minimum features. Typically RAID 0, RAID 1, or RAID 0 + 1.

RAID implemented in Host Bus-Adapter (HBA) hardware. Restrictive: only disks connected to the HBA are in the RAID.

RAID implemented in the hardware of the storage array. The storage array creates multiple RAID sets; the Operating System only implements the file system.

RAID implemented in the SAN interconnect layer, between hosts and storage. Accepts commands from the servers and manages access to storage.

Replication between storage arrays: automatic duplication of writes between separate sites for redundancy and disaster recovery.

Hot Spare: an unused disk configured as a replacement in case of disk failure. Used to rebuild a mirrored pair so the RAID level is reestablished automatically.

Page 31:

Mass-Storage Structure

RAID Structure: Selecting a RAID Level

Rebuild performance: important in high-performance or interactive database systems. Easiest with RAID 1: data is copied from the other disk of the pair. Other RAID levels require accessing all disks in the array (RAID 5 rebuild is slow).

RAID 0: used in high-performance applications where data loss is not critical. RAID 1: popular for applications that require high reliability and fast recovery. RAID 0 + 1 and RAID 1 + 0: used where both performance and reliability are important (small databases). RAID 5: large volumes of data. RAID 6: not supported by many RAID implementations. Criteria:

How many disks in the RAID set? More disks give higher data-transfer rates, but cost more.

How many bits are protected by each parity bit?

Page 32:

Mass-Storage Structure

RAID Structure: RAID Problem: RAID protects against physical media errors, but not against data corruption or other hardware and software errors.

Solaris ZFS File System: checksums protect all data and metadata.

Internal checksums are kept with the pointer to the data block: the inode has a checksum of each data block, and the directory entry has a checksum for the inode. This provides a high level of consistency, error detection, and error correction.


Checksum protects data and metadata.

Page 33:

Mass-Storage Structure

RAID Structure: RAID Problem: lack of flexibility; file systems cannot grow or shrink dynamically.

ZFS combines file-system management and volume management. A RAID set contains pools of storage: a pool contains one or more file systems.

The entire pool's free space is available to all file systems within the pool.

No artificial limit on storage, and no need to relocate file systems between volumes or to resize volumes. Quotas can be configured on file-system growth.

Page 34:

Mass-Storage Structure

Stable-Storage Implementation: Certain applications (e.g. a Write-Ahead Log) require the concept of Stable Storage.

Information is never lost: replicate the information on multiple storage devices (disks) with independent failure modes.

Coordinate the writing of updates to ensure that stable data is recoverable after any failure during data transfer, or even during recovery.

A detection and recovery procedure restores the data block. The procedure must maintain two physical blocks for each logical block.

A write operation is complete only after both physical blocks are written.

Usually two copies are enough for stable storage, but an arbitrary number of copies could be kept.

Nonvolatile NVRAM cache memory (battery powered) can receive the write before it is written to disk.

Page 35:

Mass-Storage Structure

Tertiary Storage Devices: Low-cost, removable media:

Removable Disks: floppy disks, removable magnetic disks. CD-ROM and DVD-ROM: Write-Once, Read-Many (WORM) disks with a thin aluminum film for recording.

Magneto-Optic Disks record data on a platter coated with magnetic material and use laser light to record. Optical disks use special materials altered by laser light instead of magnetism (read-only, write-once, and rewritable CDs and DVDs).

Magnetic Tapes: for applications not requiring fast random access. Used to hold backup copies of disk data, or large volumes of data used in research and record storage.

Tape Libraries: Stacker (library holds a few tapes), Silo (library holds thousands of tapes).

Solid-State Disks (SSD): nonvolatile SSDs have the same characteristics as hard disks, but no moving parts (no seek time or rotational latency), and are faster than hard drives.

Page 36:

Mass-Storage Structure

Tertiary Storage Devices: Operating System support: abstractions for removable media.

Raw Device: an array of data blocks. File System: file-system structures for the storage media; the Operating System must queue and schedule the requests.

Tapes: a raw storage medium.

An application opens the whole tape drive as a raw device (no file-system APIs or services) and must decide how to organize the array of blocks.

The tape is reserved for the exclusive use of that application (another application would not know how to interpret the tape). Block size is variable and is determined when the block is written. The locate() operation finds a specific block number.

Cannot locate into the empty space beyond the written area. The last block written carries an end-of-tape (EOT) mark. Tape drives are append-only devices: updating a block in the middle of the tape effectively erases everything beyond that block. read_position() returns the current logical block number at the tape head.

Page 37:

Mass-Storage Structure

Tertiary Storage Devices: File Naming

Operating Systems mostly leave the naming of removable media unresolved and depend on applications and users to determine how to access and interpret the data.

UNIX has a mount table to identify the location of the media.

Hierarchical Storage Management (HSM): Extends the storage hierarchy beyond primary memory and secondary storage (magnetic disk) to incorporate tertiary storage.

Tertiary storage is implemented as a collection of tapes or removable disks and extends the file system.

Small and frequently used files remain on magnetic disk, while large and inactive files are archived to tertiary storage (tape drive).

HSM is used in supercomputing centers and large companies that have large volumes of data.

Page 38:

Mass-Storage Structure

Tertiary Storage Devices: Performance Issues: the tertiary-storage performance aspects are speed, reliability, and cost.

Speed

The two aspects of speed in tertiary storage are bandwidth and latency.

Sustained Bandwidth: average data rate during a transfer, i.e. the number of bytes divided by the transfer time.

Effective Bandwidth: average over the total I/O time, including seek(), locate(), and any cartridge-switching time in a tape or disk library (jukebox).

The bandwidth quoted for a drive is the sustained bandwidth. Disk: a few megabytes to more than 60 MBytes per second (affected by rotational speed and the ATA/SCSI controller). Tape: a few megabytes to more than 30 MBytes per second.
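The two bandwidth definitions above differ only in the denominator; a sketch with hypothetical numbers (a 600 MB transfer and a 100 s locate/switch overhead, chosen for illustration):

```python
def sustained_bw(nbytes, transfer_s):
    return nbytes / transfer_s                 # average rate during the transfer itself

def effective_bw(nbytes, transfer_s, overhead_s):
    return nbytes / (transfer_s + overhead_s)  # overhead: seek/locate/cartridge switch

nbytes = 600 * 10**6                           # hypothetical 600 MB transfer
print(sustained_bw(nbytes, 20) / 1e6)          # 30.0 MB/s at the drive
print(effective_bw(nbytes, 20, 100) / 1e6)     # 5.0 MB/s once a 100 s locate is included
```

This is why a jukebox's effective bandwidth can be a small fraction of its drives' sustained bandwidth.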

Page 39:

Mass-Storage Structure

Tertiary Storage Devices: Speed

Access Latency: the amount of time needed to locate data. Disk: two-dimensional; move the arm to the selected cylinder and wait for the rotational latency (< 5 milliseconds).

Tape: three-dimensional; most of the data is buried below layers of tape wound on a reel. Getting the selected block to the tape head takes tens or hundreds of seconds (> 1000 times slower than disk).

Disk Jukebox: the drive stops spinning, a robotic arm switches the disk cartridge, the new cartridge spins up (several seconds), plus the disk access latency. Average latency of tens of seconds.

Tape Jukebox: tape rewinding (< 4 minutes), the robotic arm switching the tape cartridge (1 or 2 minutes), drive calibration to the tape (many seconds), plus the tape access latency. Average latency of hundreds of seconds.

A jukebox or removable library is best devoted to storage of infrequently used data: it can only support a relatively small number of I/O requests per hour.

Page 40:

Mass-Storage Structure

Tertiary Storage Devices: Reliability

Fixed hard disks are more reliable than removable magnetic disks. Removable disks are exposed to dust, changes in temperature and humidity, and mechanical abuse (shock, bending).

Head fault: a head crash on a hard disk destroys the platter and its data.

Optical disks are very reliable: the layer storing the bits is protected by a transparent plastic or glass layer.

Magnetic tape reliability varies with the tape drive: inexpensive drives wear out a tape after a few dozen uses, while expensive drives allow a tape to be used millions of times. Head fault: a head crash leaves the data cartridge unharmed.

In order of reliability: optical disks, fixed-disk drives, removable-disk drives, removable-tape drives.


Mass-Storage Structure

Tertiary Storage Devices: Cost

Main memory is more expensive than disk storage by a factor of 100. The cost of storage has fallen dramatically, with the price of disk storage dropping the most, relative to the prices of DRAM and tape.

The cost per megabyte of a disk drive is approaching the cost of a tape cartridge without the tape drive. Small and medium-size tape libraries have a higher storage cost than disk systems of equivalent capacity.

The cost of a tape is a small fraction of the price of the tape drive, so the overall cost of tape storage becomes lower as more tapes are purchased per tape drive.

Tertiary storage is becoming obsolete: it is no longer an order of magnitude less expensive than magnetic disk. Tape storage is limited to backups of disk drives and to archival storage in tape libraries that greatly exceed the practical storage capacity of large disk farms.
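The amortization argument above (tape gets cheaper per megabyte as more cartridges share one drive) can be sketched with a small calculation. All prices and capacities below are invented for illustration, not real product figures.

```python
# Sketch: amortized cost per MB of a tape library. The drive is a fixed cost
# shared across all cartridges; each cartridge adds cheap capacity.
def tape_cost_per_mb(drive_cost, cartridge_cost, cartridge_mb, n_cartridges):
    total_cost = drive_cost + n_cartridges * cartridge_cost
    total_mb = n_cartridges * cartridge_mb
    return total_cost / total_mb

# Assumed: $3000 drive, $40 cartridges of 400 GB (400_000 MB) each.
few  = tape_cost_per_mb(3000.0, 40.0, 400_000, n_cartridges=5)
many = tape_cost_per_mb(3000.0, 40.0, 400_000, n_cartridges=100)
```

With few cartridges the drive dominates the cost per MB; with many cartridges the cost per MB approaches the cartridge's own cost per MB, which is why large tape libraries fare better than small ones.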


Mass-Storage Structure

Tertiary Storage Devices Cost: Price per Mbyte of DRAM from 1981 to 2004


Four price crashes ( 1981, 1984, 1989, 1996 ), each caused by excess production.

In 1987 and 1993, shortages caused price increases.

As SIMM density increases, the cost per MB decreases.


Mass-Storage Structure

Tertiary Storage Devices Cost: Price per Mbyte of Magnetic Hard Disk from 1981 to 2004


Price decline has been steady: from 1981 to 2004, the price dropped by more than 4 orders of magnitude ( from $100 / MB to $0.001 / MB ).

In 2004, DRAM was $0.80 / MB and disk was $0.001 / MB, about an 800-to-1 difference.


Mass-Storage Structure

Tertiary Storage Devices Cost: Price per Mbyte of Magnetic Tape Drive from 1981 to 2004


Tape drive prices fell steadily until 1997. Since 1997, tape drive prices have not plummeted as rapidly as disk drive prices.


Mass-Storage Structure Summary

Disk drives are the major secondary-storage I/O devices, structured as large one-dimensional arrays of logical 512-byte disk blocks. Disks attach to a computer system through I/O ports or through network connections.

Disk-scheduling algorithms ( SSTF, C-SCAN, LOOK, C-LOOK ) improve the effective bandwidth, the average response time, and the variance in response time.

The Operating System manages the disk blocks: it formats the sectors on the raw hardware, partitions the disk, creates the Boot Blocks, and creates the File System. Swap space uses either a raw disk partition or a file within the File System.

Reliability via redundancy using a Redundant Array of Independent Disks ( RAID ), with different RAID levels.

Tertiary storage uses removable disk and tape drives. The Operating System supports removable disks with the File System interface and tapes with a specialized Device Driver.
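One of the scheduling algorithms named in the summary, C-LOOK, is simple enough to sketch: service all pending requests at or above the current head position in ascending cylinder order, then wrap around to the lowest pending request. The head position and request queue below are made-up illustrative values.

```python
# Sketch of C-LOOK ordering: sweep upward from the head, then wrap to the
# lowest outstanding request and sweep upward again.
def c_look(head, requests):
    upward = sorted(r for r in requests if r >= head)   # serviced on this sweep
    wrapped = sorted(r for r in requests if r < head)   # serviced after the wrap
    return upward + wrapped

order = c_look(53, [98, 183, 37, 122, 14, 124, 65, 67])
# → [65, 67, 98, 122, 124, 183, 14, 37]
```

Compared with FCFS, this ordering avoids the long back-and-forth arm movements, which is how these algorithms reduce total seek time and improve effective bandwidth.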
