FFS, LFS, and RAID
Andy Wang
COP 5611
Advanced Operating Systems
UNIX Fast File System
Designed to improve performance of UNIX file I/O
Two major areas of performance improvement
Bigger block sizes
Better on-disk layout for files
Block Size Improvement
A 4x larger block size quadrupled the amount of data retrieved per disk fetch
But could lead to fragmentation problems
So fragments introduced
Small files stored in fragments
Fragments addressable
But not independently fetchable
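A rough sketch of the fragment arithmetic, assuming the classic 4 KB blocks divided into 1 KB fragments (these sizes are illustrative, not fixed by FFS):

    # Sketch of block/fragment accounting (illustrative sizes: 4 KB
    # blocks divided into four 1 KB fragments, as in classic FFS).
    BLOCK_SIZE = 4096
    FRAG_SIZE = 1024

    def layout(file_size):
        """Return (full_blocks, tail_fragments) for file_size bytes.

        The body of the file fills whole blocks; the partial last block
        is stored in fragments, which are addressable but fetched with
        their enclosing block.
        """
        full_blocks = file_size // BLOCK_SIZE
        tail = file_size % BLOCK_SIZE
        return full_blocks, -(-tail // FRAG_SIZE)   # ceiling division

    print(layout(10000))   # (2, 2): two full blocks plus two fragments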
Disk Layout Improvements
Aimed toward avoiding disk seeks
Bad if finding related files takes many seeks
Very bad if finding all the blocks of a single file requires seeks
Spatial locality: keep related things close together on disk
Cylinder Groups
A cylinder group: a set of consecutive disk cylinders in the FFS
Files in the same directory stored in the same cylinder group
Within a cylinder group, tries to keep things contiguous
But must not let a cylinder group fill up
Locations for New Directories
Put new directory in relatively empty cylinder group
What is “empty”?
Many free i_nodes
Few directories already there
The Importance of Free Space
FFS must not run too close to capacity
No room for new files
Layout policies ineffective when too few free blocks
Typically, FFS needs 10% of the total blocks free to perform well
Performance of FFS
4x to 15x the bandwidth of old UNIX file system
Depending on size of disk blocks
Performance on original file system
Limited by CPU speed
Due to memory-to-memory buffer copies
FFS Not the Ultimate Solution
Based on technology of the early 80s
And file usage patterns of those times
In modern systems, FFS achieves only ~5% of raw disk bandwidth
The Log-Structured File System
Large caches can catch almost all reads
But most writes have to go to disk
So FS performance can be limited by writes
So, produce a FS that writes quickly
Like an append-only log
Basic LFS Architecture
Buffer writes, send them sequentially to disk
Data blocks
Attributes
Directories
And almost everything else
Converts small sync writes to large async writes
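A minimal sketch of this idea, with all names hypothetical: small synchronous writes accumulate in a memory buffer and are flushed as one large sequential append.

    # Minimal sketch of LFS-style write buffering (names hypothetical).
    SEGMENT_SIZE = 512 * 1024  # assumed segment size

    class LogBuffer:
        def __init__(self, disk):
            self.disk = disk       # any object with an append(bytes) method
            self.pending = []      # buffered data blocks, i_nodes, directories
            self.pending_bytes = 0

        def write(self, record: bytes):
            self.pending.append(record)
            self.pending_bytes += len(record)
            if self.pending_bytes >= SEGMENT_SIZE:
                self.flush()

        def flush(self):
            # one large asynchronous append replaces many small writes
            self.disk.append(b"".join(self.pending))
            self.pending, self.pending_bytes = [], 0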
A Simple Log Disk Structure
[Diagram: the log as a sequence of blocks from many files (File A block 7, File Z block 1, File M block 202, File A block 3, File F block 1, File A block 7, File L block 26, File L block 25), with the head of the log at the end]
Key Issues in Log-Based Architecture
1. Retrieving information from the log
No matter how well you cache, sooner or later you have to read
2. Managing free space on the disk
You need contiguous space to write - in the long run, how do you get more?
Finding Data in the Log
Give me block 25 of file L
Or, give me block 1 of file F
[Diagram: the same log as above; block 25 of file L and block 1 of file F must be located somewhere within it]
Retrieving Information From the Log
Must avoid sequential scans of disk to read files
Solution - store index structures in the log
The index is essentially the most recent version of the i_node
Finding Data in the Log
How do you find all blocks of file Foo?
[Diagram: blocks of file Foo scattered through the log: block 1, block 2, block 3, plus a superseded old copy of block 1]
Finding Data in the Log with an I_node
[Diagram: the same blocks of Foo, now reached via an i_node in the log that points at the current version of each block]
How Do You Find a File’s I_node?
You could search sequentially
LFS optimizes by writing i_node maps to the log
The i_node map points to the most recent version of each i_node
The i_node map covering all of a file system’s i_nodes spans multiple blocks
How Do You Find the I_node?
The I_node Map
[Diagram: the i_node map points from each i_node number to the log address of its most recent version]
How Do You Find I_node Maps?
Use a fixed region on disk that always points to the most recent i_node map blocks
But cache i_node maps in main memory
Small enough that few disk accesses are required to find i_node maps
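A sketch of the resulting read path, assuming the i_node map is already cached in memory as a simple dictionary (all objects and fields here are hypothetical simplifications):

    # Sketch of the LFS read path: fixed region -> i_node map -> i_node
    # -> data block.
    def read_block(disk, inode_map, inum, block_no):
        """Find block block_no of the file whose i_node number is inum.

        inode_map: dict from i_node number to the disk address of the
        most recent version of that i_node (cached in main memory,
        loaded via the fixed region on disk).
        """
        inode = disk.read(inode_map[inum])            # most recent i_node
        return disk.read(inode.block_ptrs[block_no])  # current data block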
Finding I_node Maps
[Diagram: the fixed region points at the new i_node maps in the log; an old i_node map remains behind as dead data]
Reclaiming Space in the Log
Eventually, the log reaches the end of the disk partition
So LFS must reclaim disk space holding superseded data, such as overwritten blocks
Space can be reclaimed in background or when needed
Goal is to maintain large free extents on disk
Example of Need for Reuse
[Diagram: the head of the log has reached the end of the disk, but there is new data to be logged]
Major Alternatives for Reusing Log
Threading
+ Fast
- Fragmentation
- Slower reads
[Diagram: the head of the log threads new data into the free gaps left in the old log]
Major Alternatives for Reusing Log
Copying
+ Simple
+ Avoids fragmentation
- Expensive
[Diagram: live data is copied forward to make a large free region for new data to be logged]
LFS Space Reclamation Strategy
Combination of copying and threading
Copy to free large fixed-size segments
Thread free segments together
Try to collect long-lived data permanently into segments
A Threaded, Segmented Log
[Diagram: a log divided into fixed-size segments, with free segments threaded together and the head of the log in the current segment]
Cleaning a Segment
1. Read several segments into memory
2. Identify the live blocks
3. Write live data back (hopefully) into a smaller number of segments
Identifying Live Blocks
Hard to track down live blocks of all files
Instead, each segment maintains a segment summary block
Identifying what is in each block
Cross-check blocks with the owning i_node’s block pointers
Written at end of log write, for low overhead
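A sketch of the liveness check, under assumed in-memory structures: the summary maps each block’s disk address to its owning (i_node number, block number), and a block is live only if that i_node still points at this address.

    # Sketch of liveness checking during cleaning (structures assumed).
    def live_blocks(segment, inode_map, disk):
        live = []
        for addr, (inum, block_no) in segment.summary.items():
            inode = disk.read(inode_map[inum])
            if inode.block_ptrs.get(block_no) == addr:
                live.append(addr)   # still current: copy forward
            # otherwise the block was overwritten or deleted: dead space
        return live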
Segment Cleaning Policies
What are some important questions?
When do you clean segments?
How many segments to clean?
Which segments to clean?
How to group blocks in their new segments?
When to Clean
Periodically
Continuously
During off-hours
When disk is nearly full
On-demand
LFS uses a threshold system
How Many Segments to Clean
The more cleaned at once, the better the reorganization of the disk
But the higher the cost of cleaning
LFS cleans a few tens of segments at a time
Until the disk drops back below the threshold value
Empirically, LFS is not very sensitive to this factor
Which Segments to Clean?
Cleaning segments with lots of dead data gives great benefit
Some segments are hot, some segments are cold
But “cold” free space is more valuable than “hot” free space
Since cold blocks tend to stay cold
Cost-Benefit Analysis
u = utilization (fraction of a segment’s data still live)
A = age of the data in the segment
Benefit to cost = (1 - u) * A / (1 + u)
(1 - u) is the free space gained; (1 + u) is the cost of reading the whole segment and writing back its live fraction; age estimates how long the freed space will stay free
Clean cold segments with some free space, hot segments with a lot of free space (policy sketched below)
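A sketch of segment selection under this formula (segment objects and their fields are assumed):

    # Sketch of cost-benefit segment selection, using the formula above.
    def cleaning_order(segments):
        def score(seg):
            u, age = seg.utilization, seg.age   # assumed fields
            return (1 - u) * age / (1 + u)      # free space * age / cost
        # clean the highest benefit-to-cost segments first
        return sorted(segments, key=score, reverse=True)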
What to Put Where?
Given a set of live blocks and some cleaned segments, which goes where?
Order blocks by age
Write them to segments oldest first
Goal is very cold, highly utilized segments
Goal of LFS Cleaning
[Figure: two histograms of number of segments vs. segment fullness (empty to 100% full); the goal is a bimodal distribution, with most segments either nearly empty or nearly full]
Performance of LFS
On modified Andrew benchmark, 20% faster than FFS
LFS can create and delete 8 times as many files per second as FFS
LFS can read 1 ½ times as many small files
LFS slower than FFS at sequential reads of randomly written files
Logical Locality vs. Temporal Locality
Logical locality (spatial locality): Normal file systems keep a file’s data blocks close together
Temporal locality: LFS keeps data written at the same time close together
When temporal locality = logical locality
Systems perform the same
Major Innovations of LFS
Abstraction: everything is a log
Temporal locality
Use of caching to shape disk access patterns
Cache most reads
Optimized writes
Separating full and empty segments
Where Did LFS Look For Performance Improvements?
Minimized disk access
Only write when segments fill up
Increased size of data transfers
Write whole segments at a time
Improved locality
Assuming temporal locality, a file’s blocks are all adjacent on disk
And temporally related files are nearby
Parallel Disk Access and RAID
One disk can only deliver data at its maximum rate
So to get more data faster, get it from multiple disks simultaneously
Saving on rotational latency and seek time
Utilizing Disk Access Parallelism
Some parallelism available just from having several disks
But not much
Instead of satisfying each access from one disk, use multiple disks for each access
Store part of each data block on several disks
Disk Parallelism Example
[Diagram: the file system spreads open/read/write requests across several disks in parallel]
Data Striping
Transparently distributing data over multiple disks
Benefits
Increases disk parallelism
Faster response for big requests
Major parameters (address arithmetic sketched below)
Number of disks
Size of the data interleaving unit
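A sketch of round-robin striping arithmetic; the disk count and unit size here are illustrative values of the two parameters above:

    # Sketch of round-robin striping arithmetic (illustrative parameters).
    N_DISKS = 4
    UNIT = 64 * 1024   # size of the data interleaving unit, in bytes

    def locate(offset):
        """Map a byte offset in the logical volume to (disk, disk_offset)."""
        unit = offset // UNIT            # which striping unit holds it
        disk = unit % N_DISKS            # units rotate across the disks
        disk_offset = (unit // N_DISKS) * UNIT + offset % UNIT
        return disk, disk_offset

    print(locate(300 * 1024))            # unit 4 lands back on disk 0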
Fine- vs. Coarse-Grained Data Interleaving
Fine-grained data interleaving
+ High data rate for all requests
- But only one request per disk array at a time
- Lots of time spent positioning
Coarse-grained data interleaving
+ Large requests access many disks
+ Many small requests handled at once
- Small I/O requests access only a few disks
Reliability of Disk Arrays
Without disk arrays, failure of one disk among N loses 1/Nth of the data
With disk arrays (fine grained across all N disks), failure of one disk loses all data
An array of N disks is roughly 1/Nth as reliable as one disk
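As a worked example, under the usual simplifying assumption of independent disk failures with exponential lifetimes: MTTF(array) = MTTF(disk) / N, so 100 disks rated at 200,000 hours each give an array MTTF of 200,000 / 100 = 2,000 hours, i.e., under three months.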
Adding Reliability to Disk Arrays
Buy more reliable disks
Build redundancy into the disk array
Multiple levels of disk array redundancy possible
Most redundant array organizations can prevent any data loss from a single disk failure
Basic Reliability Mechanisms
Duplicate data
Parity for error detection
Error Correcting Code for detection and correction
Parity Methods
Can use parity to detect multiple errors
But typically used to detect a single error
If hardware errors are self-identifying, parity can also correct errors
When data is written, parity must be written, too
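A sketch of the mechanism: parity is the bitwise XOR of the corresponding data blocks, so if one block is lost to a self-identifying failure, the XOR of the survivors and the parity reconstructs it.

    # Sketch of XOR parity: compute on write, reconstruct on failure.
    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]   # one block per disk
    parity = xor_blocks(data)            # written whenever data is written

    # Disk 1 fails (self-identifying), losing data[1]; XOR of the
    # surviving blocks and the parity reconstructs it.
    recovered = xor_blocks([data[0], data[2], parity])
    assert recovered == data[1]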
Error-Correcting Code
Based mostly on Hamming codes
Not only detects an error, but identifies which bit is wrong
RAID Architectures
Redundant Arrays of Independent Disks
Basic architectures for organizing disks into arrays
Assuming independent control of each disk
Standard classification scheme divides architectures into levels
Non-Redundant Disk Arrays (RAID Level 0)
No redundancy at all
So, what we just talked about
Any failure causes data loss
Non-Redundant Disk Array Diagram (RAID Level 0)
[Diagram: requests striped across the disks of the array, with no redundancy]
Mirrored Disks (RAID Level 1)
Each disk has a second disk that mirrors its contents
Writes go to both disks
No data striping
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient
Mirrored Disk Diagram (RAID Level 1)
[Diagram: each write goes to both disks of a mirrored pair; reads can be served by either disk]
Memory-Style ECC (RAID Level 2)
Some disks in the array are used to hold ECC
E.g., 4 data disks require 3 ECC disks
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
Memory-Style ECC Diagram (RAID Level 2)
[Diagram: data disks plus dedicated disks holding Hamming-code check bits]
Bit-Interleaved Parity (RAID Level 3)
Each disk stores one bit of each data block
One disk in array stores parity for other disks
+ More efficient than Levels 1 and 2
- Parity disk doesn’t add bandwidth
Bit-Interleaved RAID Diagram (Level 3)
[Diagram: each data block spread one bit per disk, with a single dedicated parity disk]
Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved, but data is interleaved in blocks of arbitrary size
The block size is called the striping unit
Small read requests use 1 disk
+ More efficient data access than level 3
+ Satisfies many small requests at once
- Parity disk can be a bottleneck
- Small writes require 4 I/Os
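A sketch of why a small write costs exactly 4 I/Os: the new parity can be computed from the old data and old parity alone (new parity = old parity XOR old data XOR new data), so the rest of the stripe is never read. The disk objects here are hypothetical.

    # Sketch of the 4-I/O small write (disk objects hypothetical).
    def small_write(data_disk, parity_disk, block_no, new_data):
        old_data = data_disk.read(block_no)        # I/O 1: read old data
        old_parity = parity_disk.read(block_no)    # I/O 2: read old parity
        new_parity = bytes(p ^ o ^ n for p, o, n
                           in zip(old_parity, old_data, new_data))
        data_disk.write(block_no, new_data)        # I/O 3: write new data
        parity_disk.write(block_no, new_parity)    # I/O 4: write new parity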
Block-Interleaved Parity Diagram (RAID Level 4)
[Diagram: data interleaved in blocks across the disks, with one dedicated parity disk]
Block-Interleaved Distributed-Parity (RAID Level 5)
Spread the parity out over all disks
+ No parity disk bottleneck
+ All disks contribute read bandwidth
– Requires 4 I/Os for small writes
Block-Interleaved Distributed-Parity Diagram (RAID Level 5)
[Diagram: data and parity blocks rotated across all disks, so no single disk holds all the parity]
Other RAID Configurations
RAID 6
Can survive two disk failures
RAID 10 (RAID 1+0)
Data striped across mirrored pairs
RAID 01 (RAID 0+1)
Mirroring two RAID 0 arrays
RAID 15, RAID 51
Where Did RAID Look For Performance Improvements?
Parallel use of disks
Improve overall delivered bandwidth by getting data from multiple disks
Biggest problem is small write performance
But we know how to deal with small writes . . .
Bonus
Given N disks in RAID 1/10/01/15/51, what is the expected number of disk failures before data loss? (1/2 critique)
Given 1-TB disks and probability p for a bit to fail silently, what is the probability of irrecoverable data loss for RAID 1/5/6/10/01/15/51 after a single disk failure? (1/2 critique)